We have a Clariion CX3-10 system with three RAID groups.
The first two each contain 5 disks and are configured as RAID-5. The 11th disk is configured in a separate RAID group as "Hot Spare".
Last week one disk failed and the hot spare was initiated.
From the NaviSphere point of view:
- There's still a fault on the array being reported
- But the hot spare claims to be "replacing bus 0 enclosure 0 disk 2"
Browsing through the Navisphere interface I don't see any actions it is prompting me to do.
At this point I -assumed- the hot spare was actively acting as a replacement for disk 2, ensuring that RAID-5 redundancy was still intact.
But now the interesting part -- We use software that reports on disk usage (IOs, cache hit, response time, etc)
This shows that as soon as the failure was recognized, both the failing disk and the hot spare started generating a heavy I/O load, indicating that the system was proactively copying all the data to the hot spare.
But as soon as this copy was done, the hot spare became dead-silent once again.
It is reported in the proper array (along with disk 0,1,3,4), but has not been used since. No IOs at all.
My questions are therefore:
Once a disk fails, and the hot-spare is initiated, should I (a) do anything else to restore full redundancy, and (b) do anything to allow the system to start using the hot-spare for actual disk IO?
Any tips or thoughts on this issue would be appreciated.
The hot spare will stand in for the faulted drive, and I/O will go to the hot spare instead of the failed drive. You don't have to do anything for this, but the failed drive needs to be replaced ASAP so the spare can return to a ready state. Just check the fault status (right-click on the system serial number and select 'Faults') to verify system health.
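If you have NaviSecCLI installed, the same checks can be done from the command line. A hedged sketch follows; the SP address is a placeholder, and the exact switches can differ between FLARE releases, so verify against your CLI reference:

```shell
# List currently reported faults on the array (SP address is a placeholder)
naviseccli -h <SP_IP> faults -list

# Show the state of each disk, including the hot spare's rebuild status
naviseccli -h <SP_IP> getdisk -state
```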
Ans a) No user action is required if the HS disk rebuilds successfully, thereby restoring the RAID-level redundancy.
Ans b) No user action is required; the storage system will take care of this once the HS disk has rebuilt.
Note: Make sure to replace the failed disk ASAP, so that the copy-back happens to the new disk and the HS disk goes back to acting as a hot spare.
Many thanks for the amazingly quick replies.
As I understand from the three replies, all of you say that the Hot Spare should be covering for the faulted disk in every way as expected.
This was my assumption too.
There are two reasons why I decided to challenge this assumption:
1. Write cache was automatically disabled on the system and has since been left disabled.
In fact, the write cache was disabled not at the moment the proactive copy to the hot spare started, but when the copy finished.
At this same moment the Navisphere event viewer shows a "Unit shutdown" of the hot-spare disk.
Performance degraded significantly after this.
No further automatic actions were taken by the unit to re-enable write cache.
Would write-cache remain disabled until I actually completely replace the faulty disk, even though a hot-spare is active?
2. Disk performance analysis reports no IOs on the hot-spare, except for during the proactive copy.
The software with which we read the disk performance is based on the SMI-S data of the storage system.
I am leaving open the possibility that this software misinterprets the provided data, so I will try to see what "Navi Analyzer" shows for the disk.
Unfortunately I have no knowledge of Navi Analyzer to know if I have access to it, or have it installed anywhere.
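As a cross-check on what our reporting software derives from the provider, here is a minimal sketch of querying per-element I/O statistics directly with the open-source pywbem library. The host, credentials, and the `root/emc` namespace are illustrative assumptions, and `CIM_BlockStorageStatisticalData` comes from the SMI-S Block Server Performance subprofile, which a given provider version may or may not expose:

```python
# Sketch: read per-element I/O counters from an EMC SMI-S provider.
# The host, port, credentials, and namespace below are placeholders,
# not values from this thread.
import pywbem

conn = pywbem.WBEMConnection(
    "https://smi-s-host:5989",          # hypothetical provider address
    ("admin", "password"),              # hypothetical credentials
    default_namespace="root/emc",       # common namespace for EMC providers
)

# If the provider implements the Block Server Performance subprofile,
# each instance carries cumulative ReadIOs/WriteIOs counters; sampling
# them twice and diffing shows whether a disk is actually taking I/O.
for stats in conn.EnumerateInstances("CIM_BlockStorageStatisticalData"):
    print(stats["ElementName"], stats["ReadIOs"], stats["WriteIOs"])
```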
Hopefully I can get back to you about this soon.
Note: The faulty disk was this morning removed, inspected, and re-inserted. Upon re-insert the system happily claimed everything is fine, and it is now equalizing the data back to disk 2. So "unfortunately" I cannot go back to the faulty state as I started with this morning and show exact running details.
Once again, thanks for the quick replies already.
>Is the failed disk one of the vault disks with which you have created the RG?
Yes, the faulty disk was one of the disks with the sticker on it that said "Caution: Array Software on drives 0-4. Removing or relocating them Will Make the Array Unusable."
Does that cause the write-cache to stay disabled until further action is taken?
If a vault disk on which a data LUN is bound fails, the HS disk will kick in and the write cache will be disabled. Even after everything has been taken care of and the RG is back to its usual disk members, the cache will remain disabled; the user needs to re-enable the cache manually.
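For that manual step, a sketch using NaviSecCLI (the SP address is a placeholder; I believe `getcache`/`setcache -wc 1` are the right switches, but confirm against your CLI guide before running):

```shell
# Inspect the current cache configuration and state
naviseccli -h <SP_IP> getcache

# Re-enable write cache once the RAID group is back to normal
naviseccli -h <SP_IP> setcache -wc 1
```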
- We see some issues with one Standby Power Supply, but this does not seem recent.
- The system claims in its alerts that write cache is still disabled, but we can see in the status of the storage system, and of each and every LUN, that it is actually enabled. I am not sure how to clear this alert.
With regards to the original problem -- the lack of I/Os:
We think it may be a bug in the SMI-S provider that we're running; we also run quite an old version of the EMC SMI-S provider.
When the system started the equalize process back to the replaced disk, it actively read from the hot spare, which would only make sense if it had in fact been writing new data to that disk all along.
Unfortunately there's no way for us right now to see what happened during the fault-period on the hot-spare.
Navisphere Analyzer wasn't running at the time of the fault; I did start it later yesterday on the system (I think).
This gave me some .naz files, but unfortunately I have no idea what to do with those.
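.naz files are Navisphere Analyzer archives; they can be opened in the Analyzer GUI, or, if memory serves, dumped to CSV from the CLI. A sketch follows; the `archivedump` switches are from memory, so verify them against your Analyzer documentation:

```shell
# Convert an Analyzer archive (.naz) to CSV for inspection in a spreadsheet
# (SP address and file names are placeholders)
naviseccli -h <SP_IP> analyzer -archivedump -data archive.naz -out archive.csv
```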