Exadata Auto Mgmt Process Dropping Disks

I was recently working on an Exadata X2-2 machine that had several disks missing from ASM. Upon further inspection, the physical disks, celldisks, and griddisks all looked healthy, but every attempt to add the disks back into ASM failed.

I’ll show you the steps provided by Oracle support to get the disks added back into the diskgroups in ASM.

Here, you can see the status of the disks on the storage cells and within ASM.
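
The detail listings below came from CellCLI on the cell; commands along these lines produce them (shown as representative, not my exact session):

CellCLI> list celldisk where name=CD_03_exaucel02 detail
CellCLI> list griddisk where celldisk=CD_03_exaucel02 detail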

-- celldisk

name: CD_03_exaucel02
comment:
creationTime: 2011-01-05T16:42:36-06:00
deviceName: /dev/sdd
devicePartition: /dev/sdd
diskType: HardDisk
errorCount: 4
freeSpace: 0
id: 0000012d-5858-9649-0000-000000000000
interleaving: none
lun: 0_3
physicalDisk: L2HTLY
raidLevel: 0
size: 1861.703125G
status: normal

-- griddisk

name: DATA_CD_03_exaucel02
asmDiskgroupName: DATA
asmDiskName: DATA_CD_03_EXAUCEL02
asmFailGroupName: EXAUCEL02
availableTo:
cachingPolicy: default
cellDisk: CD_03_exaucel02
comment:
creationTime: 2011-01-05T16:45:47-06:00
diskType: HardDisk
errorCount: 4
id: 0000012d-585b-811e-0000-000000000000
offset: 32M
size: 1562G
status: active

name: RECO_CD_03_exaucel02
asmDiskgroupName: RECO
asmDiskName: RECO_CD_03_EXAUCEL02
asmFailGroupName: EXAUCEL02
availableTo:
cachingPolicy: default
cellDisk: CD_03_exaucel02
comment:
creationTime: 2011-01-11T10:44:37-06:00
diskType: HardDisk
errorCount: 0
id: 90c29a38-f9d3-405e-8909-d79dcdf5a909
offset: 1562.046875G
size: 299.65625G
status: active


From V$ASM_DISK
SQL:ASM> @asm_info

GROUP_NUMBER FAILGROUP                      PATH                                     MOUNT_STATUS STATE
------------ ------------------------------ ---------------------------------------- ------------ --------
           0 EXAUCEL02                      o/192.168.10.6/DATA_CD_03_exaucel02      CLOSED       NORMAL
           0 EXAUCEL02                      o/192.168.10.6/RECO_CD_03_exaucel02      CLOSED       NORMAL
           0 EXAUCEL07                      o/192.168.10.11/DATA_CD_08_exaucel07     IGNORED      NORMAL
           0 EXAUCEL07                      o/192.168.10.11/RECO_CD_08_exaucel07     IGNORED      NORMAL
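
The @asm_info script is just a local helper; a minimal equivalent against V$ASM_DISK would be something like:

SQL> select group_number, failgroup, path, mount_status, state
     from v$asm_disk
     order by group_number, path;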

The disks had been in this state for some time, and the relevant log files had already aged off. Since the disks themselves seemed healthy, Oracle support and I proceeded on that assumption and attempted to add the disks back into their diskgroups. The following MOS note covers this scenario:

After replacing disk on Exadata storage, v$asm_disk shows CLOSED/IGNORED as mount_status [ID 1347155.1]

Unfortunately, we had to go through several attempts at getting the disks back into ASM. Our attempts included:

ATTEMPT 1
Run the commands to add the disks back to their diskgroups. I did not see any errors; however, the mount_status only changed from IGNORED to CLOSED.
SQL> alter diskgroup RECO add disk 'o/192.168.10.6/RECO_CD_03_exaucel02' force;
SQL> alter diskgroup DATA add disk 'o/192.168.10.6/DATA_CD_03_exaucel02' force;
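
After each attempt, the disk state can be re-checked from ASM with a simple V$ASM_DISK query; HEADER_STATUS is worth including, since FORCE is typically only needed while the header still shows the disk as a MEMBER:

SQL> select path, header_status, mount_status, state
     from v$asm_disk
     where path like '%CD_03_exaucel02';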

ATTEMPT 2
Next, we tried adding two of the disks to +DATA together, since they need to be added in pairs to preserve the disk partnerships.
alter diskgroup data
add failgroup EXAUCEL04 disk 'o/192.168.10.8/DATA_CD_10_exaucel04' force
add failgroup EXAUCEL07 disk 'o/192.168.10.11/DATA_CD_08_exaucel07' force
rebalance power 11;
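
Since Exadata failgroups are expected to stay symmetric, a quick sanity check is to compare disk counts per failgroup (group_number 1 assumed for DATA, as it was here):

SQL> select failgroup, count(*)
     from v$asm_disk
     where group_number = 1
     group by failgroup
     order by failgroup;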

ATTEMPT 3
Next, we tried the following to clear the cache of the Exadata Auto Mgmt process.
On all DB nodes, identify the PIDs of the xdmg and xdwk processes and kill them; then add the disks back into ASM.
ps -ef | grep xdmg
ps -ef | grep xdwk
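For illustration, the kill step that follows is just this (the PIDs are placeholders for whatever ps returns):

kill 12345 12346    # placeholder PIDs for the xdmg and xdwk processes
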
Since the xd* processes are non-fatal background processes, killing them does not bring down the ASM instance; they are automatically respawned. Once the xd* processes were back up, we added the disks again.

-- ------------------------------------------------------------
-- Always the same result:
-- the disk was always dropped by the Exadata Auto Mgmt process.
-- The following is from the ASM alert log.
-- ------------------------------------------------------------

...
Starting background process XDWK
Sun May 05 02:27:30 2013
XDWK started with pid=40, OS id=28500
SQL> /* Exadata Auto Mgmt: Proactive DROP ASM Disk */
alter diskgroup DATA drop
disk DATA_CD_03_exaucel02 force
...

The next logical step was to drop and re-create the celldisk and griddisks through CellCLI on the cell.
To do this, you first need to gather the names and sizes of the disks, as follows.
Please note that some of the output has been truncated for brevity.

As in most cases, there is a MOS note that has the steps and their explanation:
Steps to manually create cell/grid disks on Exadata V2 if auto-create fails during disk replacement [ID 1281395.1]

[root@exaucel02 ~]# cellcli
CellCLI: Release 11.2.3.2.1 - Production on Wed May 29 13:12:33 CDT 2013

Copyright (c) 2007, 2012, Oracle.  All rights reserved.
Cell Efficiency Ratio: 1,955

CellCLI> list physicaldisk
	 28:0     	 L2KDEH    	 normal
	 28:1     	 L5KQS3    	 normal
	 28:2     	 L2KD6X    	 normal
	 28:3     	 L2HTLY    	 normal
	 28:4     	 L2HTB8    	 normal
	 28:5     	 L2KJAB    	 normal
	 28:6     	 L2KJ98    	 normal
	 28:7     	 L2KD54    	 normal
	 28:8     	 L2KD6Z    	 normal
	 28:9     	 L37G5R    	 normal
	 28:10    	 L2KD6V    	 normal
	 28:11    	 L2J1LM    	 normal
...

CellCLI> list lun
	 0_0 	 0_0 	 normal
	 0_1 	 0_1 	 normal
	 0_2 	 0_2 	 normal
	 0_3 	 0_3 	 normal
	 0_4 	 0_4 	 normal
	 0_5 	 0_5 	 normal
	 0_6 	 0_6 	 normal
	 0_7 	 0_7 	 normal
	 0_8 	 0_8 	 normal
	 0_9 	 0_9 	 normal
	 0_10	 0_10	 normal
	 0_11	 0_11	 normal
...

CellCLI> list celldisk
	 CD_00_exaucel02	 normal
	 CD_01_exaucel02	 normal
	 CD_02_exaucel02	 normal
	 CD_03_exaucel02	 normal
	 CD_04_exaucel02	 normal
	 CD_05_exaucel02	 normal
	 CD_06_exaucel02	 normal
	 CD_07_exaucel02	 normal
	 CD_08_exaucel02	 normal
	 CD_09_exaucel02	 normal
	 CD_10_exaucel02	 normal
	 CD_11_exaucel02	 normal
...

CellCLI> list griddisk
	 DATA_CD_00_exaucel02	 active
	 DATA_CD_01_exaucel02	 active
	 DATA_CD_02_exaucel02	 active
	 DATA_CD_03_exaucel02	 active
	 DATA_CD_04_exaucel02	 active
	 DATA_CD_05_exaucel02	 active
	 DATA_CD_06_exaucel02	 active
	 DATA_CD_07_exaucel02	 active
	 DATA_CD_08_exaucel02	 active
	 DATA_CD_09_exaucel02	 active
	 DATA_CD_10_exaucel02	 active
	 DATA_CD_11_exaucel02	 active
	 RECO_CD_00_exaucel02	 active
	 RECO_CD_01_exaucel02	 active
	 RECO_CD_02_exaucel02	 active
	 RECO_CD_03_exaucel02	 active
	 RECO_CD_04_exaucel02	 active
	 RECO_CD_05_exaucel02	 active
	 RECO_CD_06_exaucel02	 active
	 RECO_CD_07_exaucel02	 active
	 RECO_CD_08_exaucel02	 active
	 RECO_CD_09_exaucel02	 active
	 RECO_CD_10_exaucel02	 active
	 RECO_CD_11_exaucel02	 active

CellCLI> list physicaldisk where name=28:3 detail
	 name:              	 28:3
	 deviceId:          	 24
	 diskType:          	 HardDisk
	 enclosureDeviceId: 	 28
	 errMediaCount:     	 0
	 errOtherCount:     	 0
	 foreignState:      	 false
	 luns:              	 0_3
	 makeModel:         	 "SEAGATE ST32000SSSUN2.0T"
	 physicalFirmware:  	 061A
	 physicalInsertTime:	 2010-12-21T01:04:07-06:00
	 physicalInterface: 	 sas
	 physicalSerial:    	 L2HTLY
	 physicalSize:      	 1862.6559999994934G
	 slotNumber:        	 3
	 status:            	 normal

CellCLI>  list griddisk where celldisk=CD_03_exaucel02 attributes name,size,offset
	 DATA_CD_03_exaucel02	 1562G     	 32M
	 RECO_CD_03_exaucel02	 299.65625G	 1562.046875G

Using the names and sizes above, I dropped and re-created the celldisk and griddisks, then added the griddisks back into their respective diskgroups.

CellCLI> drop   celldisk CD_03_exaucel02 force

CellCLI> create celldisk CD_03_exaucel02 lun=0_3

CellCLI> create griddisk DATA_CD_03_exaucel02 celldisk=CD_03_exaucel02,size=1562G

CellCLI> create griddisk RECO_CD_03_exaucel02 celldisk=CD_03_exaucel02,size=299.65625G

CellCLI> list griddisk where celldisk=CD_03_exaucel02 attributes name,size,offset
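
If the recreation matched the original layout, this should report the same values captured earlier: DATA_CD_03_exaucel02 at offset 32M and RECO_CD_03_exaucel02 at offset 1562.046875G, since the offsets follow from the creation order.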

SQL> alter diskgroup DATA add disk 'o/192.168.10.6/DATA_CD_03_exaucel02' ;
SQL> alter diskgroup RECO add disk 'o/192.168.10.6/RECO_CD_03_exaucel02' ;
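
Note that FORCE was no longer needed here, presumably because recreating the griddisks cleared the old ASM disk headers and the disks presented as clean candidates.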

The rebalance operation was now running and could be seen in the gv$asm_operation view in the ASM instances. When it completed, the disks were back in ASM.
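
A quick way to watch the progress is a query along these lines:

SQL> select inst_id, operation, state, power, sofar, est_work, est_minutes
     from gv$asm_operation;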

From V$ASM_DISK
SQL:ASM> @asm_info

GROUP_NUMBER FAILGROUP                      PATH                                     MOUNT_STATUS STATE
------------ ------------------------------ ---------------------------------------- ------------ --------
           1 EXAUCEL02                      o/192.168.10.6/DATA_CD_03_exaucel02      CACHED       NORMAL
           2 EXAUCEL02                      o/192.168.10.6/RECO_CD_03_exaucel02      CACHED       NORMAL

The same steps were followed to add the other disk (CD_08 on exaucel07) back into ASM.
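
For the record, the equivalent sequence for that disk would look like this; the LUN number and griddisk sizes are inferred from the naming and layout pattern above, so verify them with "list lun" and "list griddisk" on the cell before running anything:

CellCLI> drop   celldisk CD_08_exaucel07 force

CellCLI> create celldisk CD_08_exaucel07 lun=0_8

CellCLI> create griddisk DATA_CD_08_exaucel07 celldisk=CD_08_exaucel07,size=1562G

CellCLI> create griddisk RECO_CD_08_exaucel07 celldisk=CD_08_exaucel07,size=299.65625G

SQL> alter diskgroup DATA add disk 'o/192.168.10.11/DATA_CD_08_exaucel07';
SQL> alter diskgroup RECO add disk 'o/192.168.10.11/RECO_CD_08_exaucel07';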
