Unsolved
4 Posts
1
Clariion CX4 - Failed Drive, how can i tell if Hotspare has kicked in
Hi
I am new to the storage world, so please excuse me if I'm asking simple questions.
I have a failed drive on a CX4 and received an email alert, so a call has been raised with EMC to swap it out.
How can I tell if a hot spare has now taken its place, so that my RAID group is still protected?
Also, I see there is an option in Unisphere to replace a disk. Do I need to use this when the new disk arrives, or can I just pull the failed drive and swap in the new one? Currently the slot with the failed drive is showing as Removed, so the drive is dead.
Lastly, do I need to do anything in Unisphere to activate the new drive, or is it all automatic?
Thanks
Nick
Anonymous
5 Practitioner
274.2K Posts
0
November 1st, 2013 06:00
Hello. In Unisphere, you will see the failed drive show as "Transitioning" or "Rebuilding" (I'm not sure of the exact verbiage with the newest code). Wait until it states only "Removed" with no other indicator. Once that's the case, the hot spare has taken over and joined the RAID group.
When you get the new drive, just pull the old one out (amber light) and push the new one in. Unisphere will then automatically show it rebuilding. Once that completes, the slot will have no "T" or any other indicator, and the new drive will be back in the RAID group just as you were prior to the failure.
My terminology may be slightly off due to the new Unisphere wording, but the key point is: make sure the drive says ONLY "Removed", with no other activity or indication that it is still copying to the hot spare, before you pull it and push the new one in.
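If you prefer a non-GUI check, the same state can be read from the Navisphere CLI. This is only a sketch: the host alias `spa` and disk ID `0_0_5` are placeholders, and the sample reply is embedded in a variable so the parsing is visible without an array at hand (on a live system you would capture it with something like `navicli -h spa getdisk 0_0_5 -state`).

```shell
# Embedded sample of a `getdisk -state` reply (illustration only; normally
# captured from navicli against the SP).
output="Bus 0 Enclosure 0 Disk 5
State: Removed"

# Extract the State field and only declare the slot safe to swap when it
# reads exactly "Removed" with no other activity.
state=$(printf '%s\n' "$output" | awk -F': ' '/^State:/ {print $2}')
if [ "$state" = "Removed" ]; then
  echo "Slot reports Removed only: hot spare has taken over, safe to swap"
else
  echo "Slot state is '$state': wait before pulling the drive"
fi
```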
kelleg
4.5K Posts
0
November 5th, 2013 14:00
Was your question answered correctly? If so, please remember to mark your question Answered when you get the correct answer and award points to the person providing the answer. This helps others searching for a similar issue.
glen
ThatsLUNny
12 Posts
1
November 5th, 2013 15:00
You can run navicli -h spa getdisk 0_0_5 (if 0_0_5 is your hot spare) and it should tell you whether it is rebuilding or actively standing in for the failed disk.
zhouzengchao
2 Intern
1.4K Posts
0
November 6th, 2013 19:00
There is a KB article for your reference: emc250611. I've copied the main steps here:
To check whether a hot spare is actively replacing a failed disk from Navisphere Manager:
If the hot spare is replacing the failed disk, its status will be displayed as Active.
Alternatively, select the disk under the hot spare, right-click, and select Properties. If the hot spare is invoked, the current state will show Engaged, and under "Hot Spare Replacing" the status will be displayed as Active.
For a command-line check, you can issue getdisk -hs. For example, my 1_0_8 is down; to check whether a hot spare was invoked:
getdisk -hs
Bus 1 Enclosure 0 Disk 6
Hot Spare: 24567: YES
Hot Spare Replacing: 1_0_8
Bus 1 Enclosure 0 Disk 7
Hot Spare: NO
Bus 1 Enclosure 0 Disk 8
State: Removed
As you can see, the removed drive 1_0_8 has been replaced by hot spare 1_0_6.
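If you have many disks, the `getdisk -hs` listing can be scanned programmatically for engaged spares. This is a hedged sketch: the sample output from the example above is embedded in a variable, and in practice you would pipe `navicli -h spa getdisk -hs` into the awk filter instead.

```shell
# Embedded sample of a `getdisk -hs` reply (illustration only).
output="Bus 1 Enclosure 0 Disk 6
Hot Spare: 24567: YES
Hot Spare Replacing: 1_0_8
Bus 1 Enclosure 0 Disk 7
Hot Spare: NO
Bus 1 Enclosure 0 Disk 8
State: Removed"

# Remember the most recent Bus/Enclosure/Disk header and print a line
# whenever a "Hot Spare Replacing" entry follows it.
printf '%s\n' "$output" | awk '
/^Bus/                   { disk = $0 }
/^Hot Spare Replacing:/  { print disk " is standing in for " $NF }'
```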
bayya1
15 Posts
0
September 8th, 2015 03:00
Hi Nick,
As with other features, the new VNXe is now at parity with the VNX for RAID and resiliency. The first enhancement is Permanent Sparing. Traditionally, when a drive in a RAID set failed, the array would grab a designated hot spare and use it to rebuild the RAID set; when you replaced the failed drive, the array would copy the data from the hot spare to the new drive and then return the hot spare to its spare role. That's no longer the case: the array now keeps using the hot spare drive permanently. Big deal? I don't think so. Just be aware.
How hot spares are specified has also changed, and by changed I mean gone away. You no longer designate drives as hot spares: any unbound drive can act as one. The array is smart about which drive it chooses (capacity, rotational speed, bus, etc.), so it won't pick an odd drive on a different bus unless it has to.
MCx also has a timeout for RAID rebuilds. If a drive goes offline, fails, or you pull it out for some reason, the array now waits 5 minutes before invoking a spare and rebuilding the set. It does this to make sure you didn't do something by accident and that you're not just moving drives around.
You can now pull a drive from one slot and put it in another, and the array will detect it and bring it back online without triggering a rebuild, as long as you do it within 5 minutes. You can also shut down the array and re-cable the back-end buses, and it will still know which drives belong where. Let's be clear, though: don't do this without planning. You're still moving drives and changing things, so do it for a purpose. Also, you can't move drives or whole RAID groups between arrays, even between MCx arrays; it only works within the same array. Use caution.
MCx also does parallel rebuilds on RAID 6 if you lose two drives. FLARE would rebuild the set for one drive, then rebuild it again for the second. MCx is more intelligent: if two drives fail, it rebuilds both at once.
Thanks,
Reddy......
yogad
78 Posts
0
May 23rd, 2018 14:00
Steve, I see it differently on an NS960. We had a drive failure on 3_0_12 and it has been replaced; the drive status is Enabled, but I don't see the hot spare going back to an inactive state. It has been in that state for a few hours now.
Bus 3 Enclosure 0 Disk 12
Hot Spare: 47: NO
Bus 3 Enclosure 1 Disk 13
Hot Spare: 24571: YES
Hot Spare Replacing: 3_0_12
/nas/sbin/navicli -h spa getdisk 3_0_12 -state -rb
Bus 3 Enclosure 0 Disk 12
State: Enabled
Prct Rebuilt: 47: 100
kelleg
4.5K Posts
0
May 24th, 2018 07:00
First, the last message from Steve was in 2013; I'm not sure he would still be watching this thread.
Second, the rebuild for the replacement disk is finished. When a disk fails and is replaced by the hot spare, the data that was on the failed disk is rebuilt from the remaining disks in the RAID group that owned it (rebuilt from the RAID parity). Once that's complete, and you remove the failed disk and insert the replacement, the process that moves the data from the hot spare to the new disk is called "equalize"; this is basically a copy of the data from the hot spare to the new disk. Both of these processes take time. Depending on the type of disk that failed, it could take hours to days: the slower the disk being replaced, the longer the rebuild/equalize will take. The slowest are the high-capacity ATA disks; the fastest are the SSDs.
/nas/sbin/navicli -h spa getdisk 3_0_12 -state -rb
Bus 3 Enclosure 0 Disk 12
State: Enabled
Prct Rebuilt: 47: 100 <----------
I'd check it again in a couple of hours to see whether the equalize has started/finished.
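One way to watch both phases is to poll the "Prct Rebuilt" field. The snippet below is only a sketch: the sample reply is embedded in a variable, and on a live array you would substitute the command used above (/nas/sbin/navicli -h spa getdisk 3_0_12 -state -rb).

```shell
# Embedded sample of a `getdisk -state -rb` reply (illustration only;
# normally captured from navicli).
output="Bus 3 Enclosure 0 Disk 12
State: Enabled
Prct Rebuilt: 47: 100"

# Pull out the percentage (last field of the "Prct Rebuilt" line).
pct=$(printf '%s\n' "$output" | awk '/^Prct Rebuilt:/ {print $NF}')
if [ "$pct" -eq 100 ]; then
  echo "Rebuild/equalize phase complete (100%)"
else
  echo "Still copying: ${pct}% rebuilt"
fi
```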
glen
yogad
78 Posts
0
May 24th, 2018 08:00
Thanks, Glen. Agreed that I should have created a new discussion.
The failed disk is a 600GB FC drive and has been replaced with another disk of the same specifications. I just checked its status and there are no changes. Is there a way to kick hot spare 3_1_13 back to an inactive state?
/nas/sbin/navicli -h spa getdisk 3_0_12 -state -rb -rds -wrts -write
Bus 3 Enclosure 0 Disk 12
State: Enabled
Prct Rebuilt: 47: 100
Read Requests: 326972
Write Requests: 1134749
Number of Writes: 1134749
/nas/sbin/navicli -h spa getdisk 3_1_13 -type -hr -hs -state -rb
Bus 3 Enclosure 1 Disk 13
Type: 24571: Hot Spare
Hard Read Errors: 0
Hot Spare: 24571: YES
Hot Spare Replacing: 3_0_12
State: Enabled
Prct Rebuilt: 24571: 100
kelleg
4.5K Posts
0
May 25th, 2018 07:00
Try running the same commands from the SPB side to see if the other SP reports the same information.
If it's the same, then try restarting the Management Service for SPA and SPB, one SP at a time. Using two browsers, log into each SP's setup page (IP_Address/setup). Restart SPA, wait for it to come back up, then restart SPB and wait for it to come up, then check the status of the hot spare.
glen
yogad
78 Posts
0
May 30th, 2018 07:00
Restarting the management service updated the disk status. Thanks, Glen.
Is there a way for me to find out which RAID group a failed disk was part of? All it shows now is Removed, and the hot spare replacing it shows its own hot spare RAID group, not the RG that 1_0_3 belongs to. This is another failed drive, by the way.
/nas/sbin/navicli -h spa getdisk 1_0_3
Bus 1 Enclosure 0 Disk 3
State: Removed
kelleg
4.5K Posts
0
May 30th, 2018 09:00
You can try getdisk with the -rg switch (raid group), or try the getrg -all command; that should list all the RAID groups and the disks in each. When you have a hot spare, it sits in its own RG. When a disk fails, the hot spare changes its ID to that of the failed disk and adopts that RAID group's ID as well.
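If getdisk -rg reports nothing once the disk shows Removed, one workaround is to scan the full getrg -all listing for the disk's Bus/Enclosure/Disk line. This is a hedged sketch: the listing below is an abbreviated stand-in, and on the array you would pipe `navicli -h spa getrg -all` into the awk filter instead.

```shell
# Disk we are looking for, in the wording getrg uses for disk lists.
disk="Bus 1 Enclosure 0 Disk 3"

# Abbreviated sample of `getrg -all` output (illustration only).
output="RaidGroup ID: 11
List of disks: Bus 1 Enclosure 0 Disk 0
Bus 1 Enclosure 0 Disk 3
RaidGroup ID: 201
List of disks: Bus 0 Enclosure 0 Disk 14"

# Track the current RaidGroup ID and report it when the disk line appears.
# Note: a plain substring match, so e.g. "Disk 3" would also hit "Disk 30";
# good enough for a quick manual check.
printf '%s\n' "$output" | awk -v d="$disk" '
/^RaidGroup ID:/ { rg = $3 }
index($0, d)     { print d " is in RaidGroup " rg }'
```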
glen
yogad
78 Posts
0
May 30th, 2018 10:00
That doesn't seem to be the case, but I was able to find which RG the failed disk belonged to by checking each RG, since there were only a small number of RGs:
/nas/sbin/navicli -h spa getdisk 1_0_3 -rg
Bus 1 Enclosure 0 Disk 3
State: Removed
/nas/sbin/navicli -h spa getrg 11
RaidGroup ID: 11
RaidGroup Type: r5
RaidGroup State: Explicit_Remove
Valid_luns
List of disks: Bus 1 Enclosure 0 Disk 0
Bus 1 Enclosure 0 Disk 1
Bus 1 Enclosure 0 Disk 2
Bus 1 Enclosure 0 Disk 3
Bus 1 Enclosure 0 Disk 4
List of luns: 110 111
Max Number of disks: 16
Max Number of luns: 256
Raw Capacity (Blocks): 5628494360
Logical Capacity (Blocks): 4502795392
Free Capacity (Blocks,non-contiguous): 0
Free contiguous group of unbound segments: 0
Defrag/Expand priority: Medium
Percent defragmented: 100
Percent expanded: 100
Disk expanding onto: N/A
Lun Expansion enabled: NO
Legal RAID types: r5
/nas/sbin/navicli -h spa getdisk 0_0_14 -rg
Bus 0 Enclosure 0 Disk 14
Raid Group ID: 201
/nas/sbin/navicli -h spa getdisk 0_0_14 -rg -hs
Bus 0 Enclosure 0 Disk 14
Raid Group ID: 201
Hot Spare: 24574: YES
Hot Spare Replacing: 1_0_3