June 20th, 2016 14:00

Multiple disk failures - VD offline

We had multiple simultaneous disk failures. I don't think all of these disks have actually failed; I suspect a bug in the PERC 6 firmware or the HDD firmware, and I would like to know about it, because I've lost the whole RAID volume.

Here you can find the whole log, starting from a SAS_DISCOVERY event.

http://pastebin.com/KwAFw0wz

If this is a known bug, I would like to know the exact version in which it is fixed, as I have other servers to check that use a PERC 6i (with a different firmware version) and the same HDDs, but with different HDD firmware.

I think this is a CRITICAL bug/issue. It is almost impossible for 4 drives to fail within a couple of minutes or hours for no apparent reason. SMART doesn't report any issues.

5 Practitioner • 274.2K Posts

June 20th, 2016 16:00

Hello.

 

"I don't think all of these disks have actually failed; I suspect a bug in the PERC 6 firmware or the HDD firmware, and I would like to know about it, because I've lost the whole RAID volume."

At the moment there are no known bugs on the PERC 6/i or on the Seagate and Hitachi HDDs. The past bugs and the fixes released for them were unrelated to multiple drive failures on RAID arrays. The firmware on the PERC 6/i, the hard drives and probably the backplane is way out of date. What is the model of your server?

The logs do not provide any information as to why the drives are failing. However, take this line in the log as an example: " 06/18/16  4:25:45: EVT#71304-06/18/16  4:25:45: 101=Rebuild failed on PD 03(e0x20/s3) due to source drive error". It is highly likely that there was a communication failure between the drives and the controller due to out-of-date firmware, making it difficult for the drives to perform parity checks, correct medium errors and rebuild, thereby corrupting your array. That is why it is necessary to run consistency checks on your array, to correct and recover bad blocks in the array data. Make sure that you have a backup of your data before applying the following updates.

PERC 6/i

Hitachi Drive

Seagate Drive

For the drives, you may also use this Nautilus tool: http://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=WMKVJ

On a system this far out of date, there is no guarantee that these updates will apply successfully, and they may render the system inoperable.
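To see which physical disks the controller log complains about, and in what order, it can help to tally the events per slot. Below is a minimal Python sketch, not a Dell tool. It assumes that every event line follows the same layout as the single line quoted above (`EVT#<sequence>-<date> <time>: <code>=<message>`, with the disk written as `PD 03(e0x20/s3)`); other lines in the full log may look different and are simply skipped. The names `parse_event` and `events_per_slot` are made up for this example.

```python
import re
from collections import Counter

# Layout assumed from the one log line quoted in this thread.
EVENT_RE = re.compile(
    r"EVT#(?P<seq>\d+)-(?P<date>\d{2}/\d{2}/\d{2})\s+"
    r"(?P<time>\d{1,2}:\d{2}:\d{2}):\s*(?P<code>\d+)=(?P<message>.*)"
)
# Physical disk reference, e.g. "PD 03(e0x20/s3)": id, enclosure, slot.
PD_RE = re.compile(
    r"PD\s+(?P<pd>[0-9a-fA-F]{2})\(e(?P<enc>0x[0-9a-fA-F]+)/s(?P<slot>\d+)\)"
)


def parse_event(line):
    """Return a dict for one event line, or None if the line does not match."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    event = match.groupdict()
    event["seq"] = int(event["seq"])
    event["code"] = int(event["code"])
    pd = PD_RE.search(event["message"])
    # slot stays None for events that do not mention a physical disk
    event["slot"] = int(pd.group("slot")) if pd else None
    return event


def events_per_slot(lines):
    """Count how many parsed events mention each disk slot."""
    counts = Counter()
    for line in lines:
        event = parse_event(line)
        if event is not None and event["slot"] is not None:
            counts[event["slot"]] += 1
    return counts


if __name__ == "__main__":
    sample = (
        " 06/18/16  4:25:45: EVT#71304-06/18/16  4:25:45: "
        "101=Rebuild failed on PD 03(e0x20/s3) due to source drive error"
    )
    print(parse_event(sample))
    print(events_per_slot([sample]))
```

Feeding the saved log file line by line into `events_per_slot` gives a quick view of whether the errors cluster on one slot or are spread across all drives. That is only a hint for further checks, not a root cause.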

93 Posts

June 20th, 2016 22:00

The server is a PE2950.

A consistency check completed properly a couple of hours before these multiple failures.

93 Posts

June 21st, 2016 01:00

Besides understanding why I had multiple PD failures at the same time, a couple of hours after the last Patrol Read and consistency check, I would like to update an R710 with the same PERC 6i.

What do you suggest: upgrading everything through the Lifecycle Controller, or applying only the RAID firmware and HDD firmware that you provided?

5 Practitioner • 274.2K Posts

June 21st, 2016 09:00

Unfortunately, we are not able to perform root cause analysis for multiple PD failures. Yes, it is best to update the firmware through the Lifecycle Controller (LCC). However, it is recommended that you apply the NIC, controller and chipset drivers before applying firmware updates through the LCC. Often the drive updates do not apply this way, in which case you will have to update the drives separately using the same links provided earlier.

93 Posts

June 21st, 2016 09:00

I've updated the HDD firmware and the PERC 6 firmware to the latest versions on a similar server.

I hope this doesn't happen again, or I'll be in very big trouble if Dell is unable to debug major issues like this with its hardware.
