RMS PTA
1 Copper

H200 RAID 1 Predicted Failures

Hi all,

I seem to be having a strange problem with a RAID 1 array on a Dell R410 server. I urgently need some quick replies, if at all possible, and would be very grateful for any help.

After the server blue-screened one Sunday evening I started investigating and found that, according to the Dell OpenManage console, Drive 0 had a 'Predicted Failure'. The LED for Drive 0 on the front panel of the server was also flashing orange, presumably to indicate this exact problem. When I checked my backup logs I could also see 3 or 4 reported CRC errors.
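In case it helps anyone, the same flag can also be checked from the command line, assuming the OpenManage Server Administrator CLI tools are installed and the H200 happens to be controller 0 (the IDs may differ on your system):

    omreport storage controller
    omreport storage pdisk controller=0

The first command lists the controllers and their IDs; the second lists each physical disk, including a 'Failure Predicted' field.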

I phoned Dell support and after having me run a DSET Report and an Online Diagnostic test of the Drives they arranged for a replacement Drive. We replaced the bad Drive, waited for the RAID to rebuild, and afterwards found to our amazement that Drive 1 was now reporting a 'Predicted Failure'.

I phoned Dell again, and they said it could simply be coincidence. That did not seem unreasonable, so we arranged for Drive 1 to be replaced as well. As you might have guessed, we replaced Drive 1, waited for the RAID to rebuild, and behold, now Drive 0 is again reporting a 'Predicted Failure'.

So obviously something is wrong here. We thought the controller might be incorrectly reporting the drives as faulty, so we performed a BIOS update on the server, as well as firmware updates on the iDRAC 6 and on the H200 controller itself. This did not make a difference.

Naturally I phoned Dell again, and they have now arranged for the RAID Controller itself to be replaced, as well as at least one Drive (but possibly both, if need be).

My questions are:
(1) What could be causing this?
(2) Is the latest solution offered by Dell the best and most logical next step in resolving this problem, or could there be something else that they have not considered replacing yet?
(3) If the problem is being caused by bad sectors, the bad sectors can't jump / migrate from one drive to another, can they?
I guess the worst that can happen is that whatever data resided on those bad sectors is now unreadable and unrecoverable, even from my backups, since the bad data has now resynced to both drives? Or are there other things to consider?

1 Solution

Accepted Solutions
theflash1932
6 Indium

Re: H200 RAID 1 Predicted Failures

Having never troubleshot an H200, I can't say with certainty what you would find in a controller log - or if there even is one (the old SAS 6/iR did not have one) - but you most likely have a corrupt/damaged array.  It may have started with a bad disk - or even two - but if the array has a missing/damaged portion on one drive and the controller cannot read that same location from the other drive in order to fix it, it reports a problem.  There are two types of predicted failures: physical (the drive reports it has a problem, and it is only a matter of time before it is asked to do something it can no longer do - such drives will fail diagnostics) and logical (array data between members is unavailable/unreconcilable, and it is only a matter of time before a location is requested that the controller cannot provide accurate information for, causing a read failure on that disk - such drives will pass diagnostics, if otherwise healthy).

Unfortunately, there is no "fix" for this other than to delete the array and recreate it.  Running periodic Consistency Checks can help mitigate the fallout from these types of errors.
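If the OMSA command-line tools are installed, something along these lines should start a manual Consistency Check on the mirror - the controller and vdisk IDs here are assumptions, so confirm them with omreport first, and bear in mind the H200 may not support everything a full PERC does:

    omreport storage vdisk controller=0
    omconfig storage vdisk action=checkconsistency controller=0 vdisk=0

The first command shows the virtual disks and their IDs/state; the second kicks off the check, and re-running the first lets you watch its progress.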

14 Replies

RMS PTA
1 Copper

Re: H200 RAID 1 Predicted Failures

Well, Dell Support had me run the Online Diagnostic Utility on the Drives, and in each case they failed the Disk Self Test. So it seems hardware related, yes? But if that's the case, then why are we now already on the 3rd Predicted Failure? Can a faulty RAID controller actually cause Faults on new Hard Drives?

theflash1932
6 Indium

Re: H200 RAID 1 Predicted Failures

The 32-bit Diagnostics (bootable) are much more reliable in this situation.

If the drives are faulty, then yes, a faulty drive - or drives - could have caused the original damage.  A faulty RAID controller can also damage the array if the data is not getting written or cared for properly, but that is much less likely.  Regardless of the cause, no matter how many times you pass the data back and forth between disks in a rebuild, the array will always have a hole in it that the controller knows is a problem.

RMS PTA
1 Copper

Re: H200 RAID 1 Predicted Failures

So what you're saying is that, regardless of whether it was a hardware or a logical failure that caused the original problem, the best solution is to replace both drives at the same time, recreate the RAID array, and restore the server from a backup?

theflash1932
6 Indium

Re: H200 RAID 1 Predicted Failures

Yes, and because of the inconvenience this usually represents, Dell will often replace parts first to be sure that that course of action is actually required.  With RAID controllers where you can actually review the controller log, it is usually evident what the problem is and whether hardware should be replaced or whether it is a logical error; since this controller probably doesn't keep a useful log, the only option is to make sure you have sound hardware first.

RMS PTA
1 Copper

Re: H200 RAID 1 Predicted Failures

Does replacing the Controller result in a loss of data?

theflash1932
6 Indium

Re: H200 RAID 1 Predicted Failures

No.  When you replace the controller, the new controller will import the configuration saved on the drive(s) and use it.
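If you want to double-check after the swap, the imported virtual disk and its state should be visible in OMSA, or from the command line with something like this (controller ID 0 is an assumption - list yours with 'omreport storage controller'):

    omreport storage vdisk controller=0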

RMS PTA
1 Copper

Re: H200 RAID 1 Predicted Failures

Well, I suppose that's good news in a certain sense.

Thanks very much for all your answers so far. I do appreciate it.

But obviously if we need to re-create the RAID Array, like you suggested, then that will result in loss of data?

Also, how can I check if the H200 actually does have its own logs?

theflash1932
6 Indium

Re: H200 RAID 1 Predicted Failures

Yes, recreating the array will cause data loss, so we are talking about backup/delete/create/restore.
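For reference, the delete/create part can be done from the OMSA command line as well as from the controller BIOS. This is only a rough sketch - the controller, vdisk and pdisk IDs below are placeholders, so list the real ones with omreport first, and make absolutely sure the backup is good before deleting anything:

    omreport storage pdisk controller=0
    omconfig storage vdisk action=deletevdisk controller=0 vdisk=0
    omconfig storage controller action=createvdisk controller=0 raid=r1 size=max pdisk=0:0:0,0:0:1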

If it had a log, it should have been pulled as part of your DSET Report (DSET\log\RAID Controllers\), or you can pull the latest version of the controller log in OMSA (Storage, PERC, Information/Configuration, Export Log from the dropdown menu).
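The command-line equivalent of that export, if you prefer it, is roughly this (controller ID 0 assumed again; on an H200 it may simply report that no log is available):

    omconfig storage controller action=exportlog controller=0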
