
March 21st, 2012 12:00

H200 RAID 1 Predicted Failures

Hi all,

I'm having a strange problem with a RAID 1 array on a Dell R410 server, and I'd be very grateful for some quick replies if at all possible.

After the server Blue-Screened one Sunday evening I started investigating and found that, according to the Dell OpenManage console, Drive 0 had a 'Predicted Failure'. The LED for Drive 0 on the front panel of the server was also flashing orange, presumably to indicate this exact problem. Also, when I checked my Backup Logs I could see 3 or 4 reported CRC errors.

I phoned Dell support, and after having me run a DSET Report and an Online Diagnostic test of the Drives, they arranged for a replacement Drive. We replaced the bad Drive, waited for the RAID to rebuild, and afterwards found to our amazement that Drive 1 was now reporting a 'Predicted Failure'.

I phoned Dell again, and they said that it could simply be coincidence. I thought that was not an unreasonable thing to say, so we arranged for Drive 1 to be replaced as well. As you might have guessed, we replaced Drive 1, waited for the RAID to rebuild, and behold, Drive 0 is now again reporting a 'Predicted Failure'.

So obviously we now know something is wrong here. We thought that maybe the controller was incorrectly reporting the Drives as faulty, so we performed a BIOS update on the server, as well as Firmware updates on the iDRAC 6 and on the H200 controller itself. But this did not make a difference.

Naturally I phoned Dell again, and they have now arranged for the RAID Controller itself to be replaced, as well as at least one Drive (but possibly both, if need be).

My questions are:
(1) What could be causing this?
(2) Is the latest solution offered by Dell the best and most logical next step in resolving this problem, or could there be something else that they have not considered replacing yet?
(3) If the problem is being caused by Bad Sectors, the Bad Sectors can't jump / migrate from one Drive to another, can they? I guess the worst that can happen is that whatever data resides on those Bad Sectors is now unreadable and unrecoverable, even from my Backups (since it has now resynced to both Drives)? Or are there other things to consider?

7 Technologist • 16.3K Posts

March 21st, 2012 13:00

Having never troubleshot an H200 I can't say with certainty what you would find in a controller log - or whether there even is one (the old SAS 6/iR did not have one) - but you most likely have a corrupt/damaged array.  It may have started with a bad disk - or even two - but if the array has a missing/damaged portion on one drive and the controller cannot read that same location from the other drive in order to fix it, it reports a problem.  There are two types of predicted failures: physical (the drive reports it has a problem and it is only a matter of time before it is asked to do something it can no longer do - such drives will fail diagnostics), and logical (array data between members is unavailable/unreconcilable and it is only a matter of time before a location is requested that the controller cannot provide accurate information for, causing a read failure on that disk - such drives will pass diagnostics, if healthy).

Unfortunately, there is no "fix" for this other than to delete the array and recreate it.  Running periodic Consistency Checks can help mitigate the fallout from these types of errors.
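
To make the "logical" type of predicted failure concrete, here is a minimal, purely illustrative Python sketch (a toy model, not anything the controller actually runs) of a RAID 1 rebuild over a punctured array: when the surviving disk cannot be read at some location, the freshly rebuilt mirror inherits the same hole, so swapping drives and rebuilding never makes the problem go away.

```python
# Toy model of a RAID 1 rebuild over a punctured array.
# Purely illustrative - not controller firmware or any Dell API.

UNREADABLE = None  # marks a block the controller cannot read

def rebuild(source):
    """Copy each block from the surviving disk onto a fresh replacement.
    A location the source cannot read stays unreadable on the new disk
    as well - the rebuild cannot invent the missing data."""
    return [UNREADABLE if block is UNREADABLE else block for block in source]

def predicted_failure(disk):
    """'Logical' predicted failure: the array contains locations the
    controller knows it cannot return accurate data for."""
    return any(block is UNREADABLE for block in disk)

# Drive 0 has one unreadable block; Drive 1 is its mirror.
drive0 = ["data"] * 8
drive0[3] = UNREADABLE

drive1 = rebuild(drive0)   # replace Drive 1, rebuild: the hole is inherited
drive0 = rebuild(drive1)   # replace Drive 0, rebuild: the hole comes back

print(predicted_failure(drive0), predicted_failure(drive1))  # True True
```

The replacement drives in this toy example could each pass diagnostics perfectly; the hole lives in the array itself, which is why deleting and recreating the array (and restoring from backup) is the only real fix.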

7 Posts

March 21st, 2012 13:00

So what you're saying is that, regardless of whether a hardware or logical failure caused the original problem, the best solution is to replace both drives at the same time, recreate the RAID array, and restore the server from a backup?

7 Posts

March 21st, 2012 13:00

Well, Dell Support had me run the Online Diagnostic Utility on the Drives, and in each case they failed the Disk Self Test. So it seems hardware related, yes? But if that's the case, then why are we now already on the 3rd Predicted Failure? Can a faulty RAID controller actually cause Faults on new Hard Drives?

7 Technologist • 16.3K Posts

March 21st, 2012 13:00

Yes, and because of the inconvenience this usually represents, Dell will often replace parts to be sure hardware is ruled out.  With RAID controllers where you can actually review the controller log, it is usually evident what the problem is and whether hardware should be replaced or whether it is a logical error; since this controller probably doesn't have a log with that information, the only option is to ensure you have sound hardware first.

7 Technologist • 16.3K Posts

March 21st, 2012 13:00

The 32-bit Diagnostics (bootable) are much more reliable in this situation.

If the drives are faulty, then yes, a faulty drive - or drives - could have caused the original damage.  A faulty RAID controller can cause damage to the arrays if the data is not getting written or handled properly, but that is much less likely.  Regardless of the cause, no matter how many times you pass the data back and forth between disks in a rebuild, the array will always have a hole in it that the controller knows is a problem.

7 Posts

March 21st, 2012 23:00

Does replacing the Controller result in a loss of data?

7 Technologist • 16.3K Posts

March 21st, 2012 23:00

No.  When you replace the controller, the new controller will import the configuration saved on the drive(s).

7 Posts

March 22nd, 2012 00:00

Well, I suppose that's good news in a certain sense.

Thanks very much for all your answers so far. I do appreciate it.

But obviously if we need to re-create the RAID Array, like you suggested, then that will result in loss of data?

Also, how can I check if the H200 actually does have its own logs?

7 Technologist • 16.3K Posts

March 22nd, 2012 09:00

Yes, recreating the array will cause data loss, so we are talking about backup/delete/create/restore.

If it had a log, it should have been pulled as part of your DSET Report (DSET\log\RAID Controllers\), or you can pull the latest version of the controller log in OMSA (Storage, PERC, Information/Configuration, Export Log from the dropdown menu).
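
If you still have the extracted DSET report, a few lines of Python can confirm whether that folder was captured. This is only a convenience sketch: the extract root is an assumption you would point at wherever you unpacked the report, and the DSET\log\RAID Controllers\ path is the one mentioned above.

```python
# Check an extracted DSET report for RAID controller log files.
# The extract root below is a placeholder - adjust it to your own path.
from pathlib import Path

dset_root = Path(r"C:\DSET_extract")              # assumed extract location
log_dir = dset_root / "log" / "RAID Controllers"  # folder DSET uses for controller logs

if not log_dir.is_dir():
    print("No 'RAID Controllers' folder - this controller likely keeps no log.")
else:
    logs = sorted(p.name for p in log_dir.iterdir())
    print("Controller log files:", logs or "folder is empty")
```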

7 Posts

March 23rd, 2012 04:00

I checked, but no gold - No logs for the RAID Controller, unfortunately.

But it's not as though actual bad sectors can be migrated via synchronisation from one drive in the RAID array to the other, right? Yet if one drive develops bad blocks and renders some files corrupt, it can synchronise those corrupt files over to the other drive in the array; is that right? Which is perhaps how the problem got started?

7 Technologist • 16.3K Posts

March 23rd, 2012 09:00

"But it's not like actual bad sectors can be migrated via synchronisation from one drive in the RAID array to the other drive, though, right?"

Right.  The controller will not copy sectors it knows to be bad from one disk to another during a rebuild, but as you said: 1) it may not know the data in that sector is bad and copy it anyway, or 2) if it can't read the sector to copy it, it may not copy it at all, leaving a blank spot on the new drive that now matches the unreadable spot on the original - many controllers will abort the rebuild in this situation, or when there are a significant number of such sectors.  When the controller does patrol reads and consistency checks, if it finds bad sectors or corrupt data on one drive, it tries to correct them with the data on the other disk.  If it can't, that's when the controller knows there is a problem and alerts you.
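
For what it's worth, the patrol read / consistency check behaviour described above can be sketched in a few lines of Python. Again, this is only a toy model under simplified assumptions, not real firmware: a block readable on one mirror repairs the other, and a block readable on neither is exactly the kind of punctured location the controller alerts on.

```python
# Toy model of a RAID 1 consistency check / patrol read.
# Purely illustrative - not real controller behaviour or any Dell tool.

UNREADABLE = None

def consistency_check(disk_a, disk_b):
    """Walk both mirrors. Repair a block readable on only one side;
    report blocks neither side can supply (punctured locations)."""
    punctures = []
    for i, (a, b) in enumerate(zip(disk_a, disk_b)):
        if a is UNREADABLE and b is not UNREADABLE:
            disk_a[i] = b            # fix side A from side B
        elif b is UNREADABLE and a is not UNREADABLE:
            disk_b[i] = a            # fix side B from side A
        elif a is UNREADABLE and b is UNREADABLE:
            punctures.append(i)      # nothing left to repair from - alert
    return punctures

disk0 = ["data"] * 6
disk1 = ["data"] * 6
disk0[2] = UNREADABLE                 # repairable from disk1
disk0[4] = disk1[4] = UNREADABLE      # unreadable on both mirrors

print(consistency_check(disk0, disk1))  # -> [4]: this is where it alerts
```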

 

7 Posts

March 26th, 2012 02:00

For whatever reason, even though Dell said they would supply 2 new drives and a new controller, the technician showed up with only 1 drive plus the controller. So we made the best of a bad situation by replacing the parts he brought, deleting the RAID Array, recreating it, and restoring the server from a backup. So far everything seems to be going fine - no additional predicted failures have been reported.

So is this kind of problem unique to RAID 1? Does it not defeat the purpose somewhat of having a RAID config when one drive failure can cripple the whole thing like this?

7 Technologist • 16.3K Posts

March 26th, 2012 09:00

No, it is not unique to RAID 1, although it is more common with RAID 1 ... it can also affect a RAID 5 or a RAID 10, etc.  The higher-end controllers (PERC 5/6, H7x0) are able to run Consistency Checks and Patrol Reads to pro-actively detect and repair problems like this ... well worth the extra $300-400 in my opinion.

6 Posts

March 19th, 2016 07:00

Thanks to all who participated here. I have this identical problem with PERC H200 RAID1 and have been chasing my tail for months now: new disk, rebuild mirror and boom! "predicted failure." It's out of warranty so I've been on my own and this kind of help is invaluable. I found this reference that gives some great insight into "Double Faults and Punctured Arrays."

www.dell.com/.../EN
