RAID 50 with errors

Question

Hi,

I have a PowerEdge R720xd with a PERC H710 Mini controller that is hosting an 8-disk RAID 50 array. Two drives are marked as 'predictive failure', and I believe that both of them are in the same sub-array. One of the affected drives has been attempting a rebuild for several weeks, and I suspect that it may be propagating errors onto the second drive. I don't think I can replace both of the affected drives simultaneously, so I wonder whether it is better to replace them one at a time, or perhaps to install two new drives as hot spares and let the controller manage the rebuild. Additional details are provided below.

Several weeks ago messages of the form

megaraid_sas 0000:03:00.0: 5500219 (725958049s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 01(e0x20/s1) at c7d84a0

started to appear in the logs. I took the system down and ran 'Hardware Diagnostics' from the Lifecycle Controller. Both disks 00:01:01 and 00:01:03 (PD 01 and 03 in the logs) showed Error Code 2000-0151 and incorrect status = 5D. The RAID setup showed disk 00:01:01 as Foreign,Online and disk 00:01:03 as Foreign,Rebuild Pending. This is strange, since the error messages in the logs seemed to indicate that disk 00:01:01 was under rebuild, and no errors were reported on disk 00:01:03. I imported the foreign configuration during boot and was able to get the array back online. The errors regarding 'medium error during recovery on PD 01' continued, but no errors were reported on disk 00:01:03. The system froze a few days later, and once I managed to reboot it I started to see 'Puncturing bad block on PD 01' and 'Puncturing bad block on PD 03' errors in the logs. I then used OpenManage to stop the rebuild on disk 1, and later the controller took the entire array offline. This is a poorly managed system with no backup. It is running Ubuntu 20.04.
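
To see which physical disks the errors are attributed to, the messages quoted above can be tallied with a short script. This is a minimal Python sketch that only recognises the two message fragments quoted in this post; the function name `count_events` is hypothetical, and real controller logs may contain other variants that this pattern misses.

```python
import re
from collections import Counter

# Pattern built only from the message fragments quoted in this thread;
# it is an assumption that other log lines follow the same wording.
EVENT_RE = re.compile(
    r"(?P<event>Unrecoverable medium error|Puncturing bad block)"
    r".*?\bPD (?P<pd>[0-9A-Fa-f]+)"
)


def count_events(lines):
    """Return a Counter keyed by (event text, physical disk id)."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[(match.group("event"), match.group("pd"))] += 1
    return counts


# Sample input: the log line and the two fragments quoted in the post.
sample = [
    "megaraid_sas 0000:03:00.0: 5500219 (725958049s/0x0002/FATAL) - "
    "Unrecoverable medium error during recovery on PD 01(e0x20/s1) at c7d84a0",
    "Puncturing bad block on PD 01",
    "Puncturing bad block on PD 03",
]

for (event, pd), n in sorted(count_events(sample).items()):
    print(f"PD {pd}: {event} x{n}")
```

On a live system the same function could be fed the kernel log line by line, for example from saved `dmesg` output.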

DELL-Joey C · Answer

Hi @boulderlund,

Yes, install two new drives as hot spares so that the RAID controller handles the rebuild. However, judging from your diagnostics results, even after the rebuild has finished the punctured bad blocks may still lead to multiple drive failures. I would suggest taking a full data backup and rebuilding the whole array from scratch. With multiple disk failures and array punctures, recovery is hard. Punctured bad blocks can (hopefully) be recovered by running the array's Check Consistency option, but sometimes this does not work because the array may contain double faults.

Puncture ref: https://dell.to/3Y7pegQ

boulderlund · Answer

Hi Joey,

Thank you for your suggestion - I will attempt the two hot spare approach. I mostly understand the consequences of punctures and the need to rebuild the array from backup when this happens. I am a bit unsure of the best sequence of operations, however. Should I follow these steps?

1. Add two hot spares and let the controller rebuild the array onto them.

2. Run array consistency check.

3. Make a backup.

4. Delete the array and re-initialize a new one.

5. Restore from backup.

One other detail that I forgot to mention in my original post is that I also saw messages in the logs of the form

VD bad block table on VD 00/1 is full; unable to log block 168a83e0

accompanying the array puncture messages. I assume that the block table full and puncture messages are related and do not indicate further trouble?

DELL-Joey C · Answer

Hi @boulderlund,

Oh. If you are getting messages that the bad block table is full, it is very risky to replace the predictive-failure drives. In any case, it is critical to have a full data backup by now. I would:

1. Make a backup.

2. Delete the array and re-initialize a new one.

3. Restore from backup.

But after reading your first post, it looks like you have two problem disks in the same span. This is risky.

RE: The bad block table is used for remapping bad disk blocks. This table fills as bad disk blocks are remapped. When the table is full, bad disk blocks can no longer be remapped and disk errors can no longer be corrected. At this point, data loss can occur.
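
The concern about two disks in one span can be illustrated with a short sketch. It assumes the 8 disks form two RAID 5 spans of four, with slots 0-3 in the first span and slots 4-7 in the second. That slot-to-span mapping is an assumption, not something confirmed in this thread, so check it against the controller's own view of the virtual disk. The function name is hypothetical, and the model covers whole-disk failures only, not punctured blocks.

```python
def raid50_survives(failed_slots, disks_per_span=4, spans=2):
    """Return True if a RAID 50 array tolerates the given failed slots.

    RAID 50 stripes data across several RAID 5 spans, and each RAID 5
    span can lose at most one disk. The slot-to-span mapping used here
    (consecutive slots per span) is an assumed layout.
    """
    failures_per_span = [0] * spans
    for slot in failed_slots:
        if not 0 <= slot < disks_per_span * spans:
            raise ValueError(f"slot {slot} is outside the array")
        failures_per_span[slot // disks_per_span] += 1
    return all(count <= 1 for count in failures_per_span)


# One failed disk in each span: the array stays up (degraded).
print(raid50_survives({1, 5}))

# Disks 1 and 3 under the assumed layout: both in the first span.
print(raid50_survives({1, 3}))
```

Under the assumed layout the first call prints `True` and the second prints `False`, which is why losing both disk 1 and disk 3 would take the whole virtual disk down.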

DELL-Chris H · Answer

Boulderlund,

The issue now is that anything we do may corrupt the data as a whole. That is why Joey suggested getting a backup; if there isn't a backup of the data, I would consider a third-party data recovery company to retrieve the remaining data. Now, to answer your question: you could try importing the foreign drive and see whether that corrects anything. Importing the foreign configuration tells the controller to take the virtual disk configuration data from the drives and to replace the config data on the controller with it. You normally do the import when the VD isn't bootable, whereas you would clear the foreign configuration if the VD was booting normally.
Again, if there is a puncture, anything you do risks destroying the remaining data, so I would still advise you to look at data recovery or, if a backup is present, to delete and reconfigure the array and then restore from that backup. The reason is that this gives you a clean, stable platform again, instead of trying to get the current array to work, which may not be stable.

Let me know if this helps.

boulderlund · Answer

Joey,

Thanks for another useful reply. I expect that I might lose some data, but I really don't want to lose all of it! It seems likely that the system freeze that I described in my first post occurred when the bad block table first filled. I did not see log entries to this effect, but when I rebooted the machine I started to see both the puncture and bad block table full errors. After a short while I used OpenManage to cancel the rebuild on disk 1, which took it offline. At this point the controller started to rebuild disk 3. I mounted the array read-only and attempted to copy data. This proved to be unreliable, as some reads took a very long time and others failed with errors. After several hours the logs contained a message that disk 3 had failed, and the entire array went offline. Now if I use OpenManage to query the array state, all disks are listed as non-critical, disk 1 is offline, disk 3 is foreign, and the rest are online. Is there any way to bring the array back up and not have it attempt to do any rebuilding? If not, if I insert two hot spares, will the controller focus its rebuild efforts on the spares?

boulderlund · Answer

Chris,

Thank you for the reply. Unfortunately I do not have a backup. I believe the disks still spin up and data can be read from them (I could easily verify this). Thus, if I could get the array back up without the controller attempting to rebuild anything, I would hope that I could read most of the data to form a backup. If there is no way to do this, then it seems possible to move the affected drives to a different computer and use a tool like ddrescue to clone the readable portion of the data to a new drive. It may be a stretch, but if these cloned drives were inserted back into the array, would the controller be happy?
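
For reference, the cloning idea above is commonly done with GNU ddrescue in two passes: a fast pass that skips the difficult areas, then a retry pass over whatever is left, both sharing one mapfile so the second pass resumes where the first stopped. The Python sketch below only builds and prints the two command lines; it does not run them. The device paths and mapfile name are placeholders, and the helper name `ddrescue_passes` is hypothetical. It does not address whether the controller would accept the cloned drives.

```python
import shlex


def ddrescue_passes(source, target, mapfile, retries=3):
    """Build the command lines for a two-pass GNU ddrescue clone.

    Pass 1 uses -n (no scraping) to copy the easy areas quickly.
    Pass 2 uses -d (direct disc access) and -r<N> (retry passes) to go
    back over the bad areas recorded in the shared mapfile.
    -f is required by ddrescue when the output is a block device.
    """
    first = ["ddrescue", "-f", "-n", source, target, mapfile]
    second = ["ddrescue", "-d", "-f", f"-r{retries}", source, target, mapfile]
    return [first, second]


# Placeholder paths: replace with the real failing disk and the new disk.
for argv in ddrescue_passes("/dev/sdX", "/dev/sdY", "pd01.map"):
    print(shlex.join(argv))
```

Double-check the source and target paths before running the printed commands by hand, since swapping them would overwrite the failing disk.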

Praveen.Singh · Answer

My personal suggestion: if you don't have a backup copy, contact a data recovery provider. They will be able to pull the data from the drives for sure, but the charges may be high and the disks will be of no use after that.

Then get new drives, create a new array, and carry on using your system.
