Unsolved
This post is more than 5 years old
Multiple disk failures - VD offline
We had multiple simultaneous disk failures. I don't think all of these disks have actually failed; I suspect a bug in the PERC 6 firmware or the HDD firmware, and I would like to know about it, because I've lost the whole RAID volume.
Here you can find the whole log, starting from a SAS_DISCOVERY event.
If this is a known bug, I would like to know the exact version in which it is fixed, as I have other servers with a PERC 6i (different firmware version) and the same HDDs, but with different firmware, that I need to check.
I think this is a CRITICAL bug/issue. It is almost impossible for 4 drives to fail within a couple of minutes or hours for no apparent reason. SMART doesn't report any issues.
Anonymous
5 Practitioner
274.2K Posts
0
June 20th, 2016 16:00
Hello.
At the moment there are no known bugs affecting the PERC 6/i or the Seagate and Hitachi HDDs. The past bugs and fixes released were unrelated to multiple drive failures on RAID arrays. However, the firmware on the PERC 6/i, the hard drives and probably the backplane is way out of date. What is the model of your server?
The logs do not provide any information as to why the drives are failing. However, take this line in the log as an example: "06/18/16 4:25:45: EVT#71304-06/18/16 4:25:45: 101=Rebuild failed on PD 03(e0x20/s3) due to source drive error". It is highly likely that there was a communication failure between the drives and the controller due to out-of-date firmware, making it difficult for the drives to perform parity checks, correct medium errors and rebuild, thereby corrupting your array. That is why it is necessary to run consistency checks on your array: they correct and recover bad blocks in the array data.
Ensure that you have a backup of your data before applying the following updates.
PERC 6/i
Hitachi Drive
Seagate Drive
For the drives, you may also use this Nautilus tool: http://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=WMKVJ
For a system this far out of date, there is no guarantee that these updates will apply successfully, and they may render the system inoperable.
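For anyone going through a long controller log like this one, a small script can help show whether the errors cluster on one slot or are spread across the enclosure. The following is only a minimal sketch: the regular expressions are inferred from the single log line quoted above, so other event types in a real PERC log may not match, and the function names `parse_event` and `events_by_slot` are hypothetical helpers, not part of any Dell tool.

```python
import re
from collections import defaultdict

# Layout inferred from the one log line quoted in this thread (assumption):
#   <date> <time>: EVT#<seq>-<date> <time>: <code>=<message>
EVENT_RE = re.compile(
    r"EVT#(?P<seq>\d+)-(?P<date>\d{2}/\d{2}/\d{2})\s+"
    r"(?P<time>\d{1,2}:\d{2}:\d{2}):\s*"
    r"(?P<code>\d+)=(?P<message>.*)"
)

# Physical disk reference as it appears in the quoted line: "PD 03(e0x20/s3)"
PD_RE = re.compile(
    r"PD\s+(?P<pd>[0-9a-fA-F]{2})\(e(?P<enclosure>0x[0-9a-fA-F]+)/s(?P<slot>\d+)\)"
)


def parse_event(line):
    """Return a dict describing one event line, or None if it does not match."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    event = match.groupdict()
    event["seq"] = int(event["seq"])
    event["code"] = int(event["code"])
    event["message"] = event["message"].strip()
    # Not every event mentions a physical disk, so these may stay None.
    pd = PD_RE.search(event["message"])
    event["pd"] = pd.group("pd") if pd else None
    event["enclosure"] = pd.group("enclosure") if pd else None
    event["slot"] = int(pd.group("slot")) if pd else None
    return event


def events_by_slot(lines):
    """Group parsed events by drive slot, skipping lines without a PD reference."""
    grouped = defaultdict(list)
    for line in lines:
        event = parse_event(line)
        if event is not None and event["slot"] is not None:
            grouped[event["slot"]].append(event)
    return dict(grouped)


sample = (
    "06/18/16 4:25:45: EVT#71304-06/18/16 4:25:45: "
    "101=Rebuild failed on PD 03(e0x20/s3) due to source drive error"
)
event = parse_event(sample)
print(event["seq"], event["code"], event["slot"], event["message"])
```

Feeding the whole exported log through `events_by_slot` and sorting each slot's events by `seq` gives a per-drive timeline, which makes it easier to see whether several drives reported errors within the same few seconds.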
ale123
93 Posts
0
June 20th, 2016 22:00
The server is a PE2950
A consistency check completed successfully a couple of hours before these multiple failures.
ale123
93 Posts
0
June 21st, 2016 01:00
Besides understanding why I had multiple PD failures at the same time, a couple of hours after the last Patrol Read and consistency check, I would also like to update an R710 with the same PERC 6i.
What do you suggest: upgrading everything through the Lifecycle Controller, or applying only the RAID firmware and HDD firmware that you provided?
Anonymous
5 Practitioner
274.2K Posts
0
June 21st, 2016 09:00
Unfortunately, we are not able to perform root cause analysis for multiple PD failures. Yes, it is best to update the firmware through the Lifecycle Controller. However, it is recommended that you apply the NIC, controller and chipset drivers before applying firmware updates through the LCC. Often the drive updates do not apply this way, in which case you will have to update them separately using the same links provided.
ale123
93 Posts
0
June 21st, 2016 09:00
I've updated the HDD firmware and PERC 6 firmware to the latest versions on a similar server.
I hope this doesn't happen again, or I'll be in very big trouble if Dell is unable to debug major issues like this with their hardware.