

December 12th, 2012 04:00

CX3 hot spare replacing failed drive, but no IOs reported

We have a Clariion CX3-10 system with three RAID groups.

The first two each contain 5 disks and are configured as RAID-5. The 11th disk is configured in a separate RAID group as "Hot Spare".

Last week one disk failed and the hot spare was initiated.

From the Navisphere point of view:

- There's still a fault on the array being reported

- But the hot spare claims to be "replacing bus 0 enclosure 0 disk 2"

Browsing through the Navisphere interface I don't see any actions it is prompting me to do.

At this point I -assumed- the hot spare was actively acting as a replacement for disk 2, assuring that RAID-5 redundancy was still intact.

But now the interesting part -- We use software that reports on disk usage (IOs, cache hit, response time, etc)

This shows that as soon as the failure was recognized, both the failing disk and the hot spare started a huge load of IO, indicating that the system was pro-actively copying all the data to the hot spare.

But as soon as this copy was done, the hot spare became dead-silent once again.

It is reported in the proper array (along with disk 0,1,3,4), but has not been used since. No IOs at all.

My questions are therefore:

Once a disk fails, and the hot-spare is initiated, should I (a) do anything else to restore full redundancy, and (b) do anything to allow the system to start using the hot-spare for actual disk IO?

Any tips or thoughts on this issue would be appreciated.

1 Rookie • 20.4K Posts

December 12th, 2012 04:00

You should see I/O to the HS disk. Are you using Navi Analyzer to look at the performance numbers?

812 Posts

December 12th, 2012 04:00

The hot spare will stand in for the faulted drive, and I/O will go to the hot spare instead of the failed drive. You don't have to do anything for this, but the failed drive needs to be replaced ASAP so that the spare returns to the 'ready' state. Check the fault status (right-click the system serial number and select 'Faults') to verify the system health.

December 12th, 2012 04:00

Hi

Ans a) No user action is required if the HS disk rebuilds successfully, since that restores the RAID-level redundancy.

Ans b) No user action is required; the storage system will take care of this once the HS disk has rebuilt.

Note: Make sure to replace the failed disk ASAP so that the copy-back to the new disk happens and the HS disk goes back to acting as a hot spare.
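
The sequence described in the replies so far (failure, rebuild onto the spare, replacement, copy-back) can be sketched as a small state model. This is only an illustration of the behaviour as described in this thread, not EMC code; the state names, event names and helper functions are all made up for the example.

```python
from enum import Enum


class SpareState(Enum):
    READY = "ready"              # idle, waiting for a failure
    REBUILDING = "rebuilding"    # data being rebuilt onto the spare
    STANDING_IN = "standing in"  # serving I/O for the failed disk
    EQUALIZING = "equalizing"    # copying back to the replacement disk


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (SpareState.READY, "disk_failed"): SpareState.REBUILDING,
    (SpareState.REBUILDING, "rebuild_done"): SpareState.STANDING_IN,
    (SpareState.STANDING_IN, "disk_replaced"): SpareState.EQUALIZING,
    (SpareState.EQUALIZING, "equalize_done"): SpareState.READY,
}


def next_state(state, event):
    """Return the next spare state, or raise if the event does not apply."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not valid in state {state.name}") from None


def run(events, state=SpareState.READY):
    """Apply a list of events in order and return every state visited."""
    visited = [state]
    for event in events:
        state = next_state(state, event)
        visited.append(state)
    return visited
```

Calling `run(["disk_failed", "rebuild_done", "disk_replaced", "equalize_done"])` walks the spare through one full cycle and ends in `READY` again. The only step that needs a person is the `disk_replaced` event.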

Thanks

Anirudh

December 12th, 2012 05:00

Hi

Is the failed disk one of the vault disks with which you have created the RG?

Thanks

Anirudh

5 Posts

December 12th, 2012 05:00

>Is the failed disk one of the vault disks with which you have created the RG?

Yes, the faulty disk was one of the disks with the sticker on it that said "Caution: Array Software on drives 0-4. Removing or relocating them Will Make the Array Unusable."

Does that cause the write-cache to stay disabled until further action is taken?

5 Posts

December 12th, 2012 05:00

Many thanks for the amazingly quick replies.

As I understand from the three replies, all of you say that the Hot Spare should be covering for the faulted disk in every way as expected.

This was my assumption too.

There are two reasons why I decided to challenge this assumption:

1. Write cache was automatically disabled on the system and has since been left disabled.

In fact, this write cache was disabled not at the moment of the proactive copy to the hot-spare, but when the copy to the hot-spare was finished.

At this same moment the Navisphere event viewer shows a "Unit shutdown" of the hot-spare disk.

Performance degraded significantly after this.

No further automatic actions were taken by the unit to re-enable write cache.

Would write-cache remain disabled until I actually completely replace the faulty disk, even though a hot-spare is active?

2. Disk performance analysis reports no IOs on the hot-spare, except for during the proactive copy.

The software with which we read the disk performance is based on the SMI-S data of the storage system.

I am leaving open the possibility that this software misinterprets the provided data, so I will try to see what "Navi Analyzer" shows for the disk.

Unfortunately I don't know Navi Analyzer well enough to tell whether I have access to it or have it installed anywhere.

Hopefully I can get back to you about this soon.

Note: The faulty disk was removed, inspected, and re-inserted this morning. Upon re-insertion the system happily claimed everything is fine, and it is now equalizing the data back to disk 2. So "unfortunately" I cannot go back to the faulty state I started with this morning and show the exact running details.

Once again, thanks for the quick replies already.

December 12th, 2012 05:00

Hi

If a vault disk on which a data LUN is bound fails, the HS disk will kick in and write cache will be disabled. Once all actions have been taken care of and the RG is back with its usual disk members, the cache will still remain disabled; the user needs to enable it manually.
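
The rule stated in this reply can be written down as a small decision helper. This is a sketch of the behaviour exactly as described here, not something taken from EMC documentation; `ArrayStatus` and `write_cache_action` are hypothetical names, and the two flags would have to be filled in by hand from what Navisphere shows.

```python
from dataclasses import dataclass


@dataclass
class ArrayStatus:
    raid_group_restored: bool   # RG is back on its usual disk members
    write_cache_enabled: bool   # current write cache setting


def write_cache_action(status):
    """Return the follow-up action implied by the rule in this reply."""
    if status.write_cache_enabled:
        return "none"
    if not status.raid_group_restored:
        return "wait for the RAID group to be restored"
    # Per the reply, the cache does not come back on by itself.
    return "enable write cache manually"
```

For example, `write_cache_action(ArrayStatus(raid_group_restored=True, write_cache_enabled=False))` returns `"enable write cache manually"`.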

Thanks

Anirudh

812 Posts

December 12th, 2012 09:00

Check for any other faults in the system also.

5 Posts

December 13th, 2012 04:00

Other faults:

- We see some issues with one Standby Power Supply, but this does not seem recent.

- The system claims in its alerts that write cache is still disabled, but we can see in the status of the storage system and of each and every LUN that it is actually enabled. Not sure how to clear this alert.

With regards to the original problem -- the lack of I/Os:

We think it may be a bug in the SMI-S provider that we're running. We are also running quite an old version of the EMC SMI-S provider.

When the system started the equalize process back to the normal disk, it did actively read from the hot spare, which would probably only make sense if it actually had been writing new data to this disk all along.

Unfortunately there's no way for us right now to see what happened during the fault-period on the hot-spare.
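
For next time, a cross-check along these lines could flag the same symptom from exported per-disk counters: in a RAID-5 group I/O is spread over all members, so a member that reports zero I/O while its peers are busy more likely points to a reporting gap than to a truly idle disk. This is a minimal sketch; the disk labels, the sample numbers and the `silent_members` helper are made up for illustration and do not come from the SMI-S provider.

```python
def silent_members(io_counts, threshold=0):
    """Return RAID group members at or below threshold while a peer is busy.

    io_counts maps a disk label to its I/O count over the same interval.
    """
    busy = [disk for disk, count in io_counts.items() if count > threshold]
    if not busy:
        # Whole group idle: nothing suspicious about a silent member.
        return []
    return sorted(disk for disk, count in io_counts.items() if count <= threshold)


# Hypothetical sample: four original members plus the spare standing in.
sample = {
    "0_0_0": 1200,
    "0_0_1": 1180,
    "0_0_3": 1210,
    "0_0_4": 1195,
    "hot_spare": 0,
}
suspicious = silent_members(sample)
```

With the sample above, `suspicious` contains only the hot spare. A group where every disk reports zero is treated as idle and returns an empty list.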

Navisphere Analyzer wasn't running at the time of the fault; I did start it on the system later yesterday (I think).

This gave me some .naz files, but unfortunately I have no idea what to do with those.

December 13th, 2012 05:00

Hi

These .naz files are a collection of performance logs. If you have the Navisphere Analyzer enabler installed, you can generate the same collection of performance logs in a different file format (.nar). If you want these .naz files analysed, I would suggest involving EMC support: raise an SR for performance analysis and upload the .naz files.

Thanks

Anirudh

5 Posts

December 14th, 2012 05:00

The system is back to normal and all alerts have been cleared.

The performance analysis will have to wait for another time, as the hot spare is sitting nicely idle again, as it should.

We will look into the underlying SMI-S provider from EMC to see if an upgrade of this will help us in the future.

Many thanks for all replies and assistance.

M Leppink

1.4K Posts

December 15th, 2012 09:00

If you have received the correct answer, please remember to mark your question as Answered.

February 24th, 2015 09:00

I was searching Google for a solution to the exact same problem and found this thread. Can you help me with it? I'm trying to find out whether it is OK to replace the drive under the sticker. The message on the sticker worries me. The status of the disk is Removed. As I understand it, the array copied the content to an HS disk and disabled the write cache. I do have the new disk, but I'm not sure if I can remove a drive that holds the array software.

224 Posts

February 24th, 2015 09:00

Is the drive that failed part of your vault disk set?

Even if it is a vault drive that failed, you have to replace it. If the copy to the hot spare is complete, go ahead and replace the drive.

After replacing it, put that sticker on the new drive so no one pulls the vault drive.

Regards,

Sheron

224 Posts

February 24th, 2015 10:00

When a drive fails, the array initiates a copy to the HS if there is an HS configured.

Did that happen, and did the copy to the hot spare complete? (From your post I see 'When the vault drive failed I saw the message on the HS drive'.)
Are you trying to copy to the hot spare again, and is that why you get that alert?
Is it just one drive that has failed in the array?

If the answer to all of the above questions is "yes", then you can go ahead with the replacement and wait until the data has moved from the HS to the new drive.

As per EMC's best practices, we are not supposed to create any user LUNs on the vault.

Vault drives are mirrored (depending on the Clariion model, the vault is triple-mirrored). This is done to protect FLARE in case something happens to the vault drives.

Regards,

Sheron
