September 6th, 2005 05:00
RAID 1 Controller FAILED a good drive, did not detect SMART errors
Our PowerEdge 400SC serves as our Windows 2000 Server domain controller. It appears this system's RAID controller (CERC ATA100 4-channel) might mark a "good" drive as "FAILED" while trying to keep a very bad drive in service.
A month ago, the RAID controller failed out one of the two drives. Dell promptly replaced it. I suspect that BOTH drives were starting to fail, but the RAID card was happy after the new drive was installed and rebuilt.
Last week, the new drive failed! I thought this was rather odd; Dell sent another new drive, and I spent the last 14 HOURS in this process...
- Removed the drive which the RAID controller indicated was failed (via the controller BIOS, prepare drive for removal, etc).
- Attempted to back up the single drive; the backup process crashed the server, just as it had for the past several days.
- Installed the new drive from Dell.
- Attempted to rebuild the new drive from the RAID BIOS.
- Rebuild failed immediately.
- Removed the new drive from Dell, leaving only the single now-suspect drive.
- Booted the server from a BartPE diagnostics CD into a secondary Windows environment to run CHKDSK.
- CHKDSK indicated extensive errors on the drive; repeated runs of CHKDSK fixed a few errors, but did not run to completion. Uh oh...
- Ghost read-only partition check indicated sector read errors. Double uh oh!
- Removed the now-known-bad drive, installed it in a standalone system, and ran Spinrite for low-level recovery.
- Spinrite stated "SMART diagnostics indicates this drive is about to fail! Do not run intensive recovery..."; ran recovery at the "lightest" setting. (There were 1000+ sector reallocation events before Spinrite even started. That's very, very bad.)
- During the Spinrite run, I installed the drive the RAID controller had originally failed and attempted to rebuild it onto the new drive from Dell, just to make certain the RAID controller was functional. This rebuild worked; however, it lacked 4 days of data because the server could not be backed up while the bad drive was "in charge". But it proved I now have two good drives.
- After Spinrite recovery (14 ECC corrections, 2 dynastat), I made a successful Ghost image of the OS partition. *whew!*
- Installed the repaired drive into the server.
- Offlined the second drive.
- Performed a successful backup of the Spinrite-repaired drive.
- Rebuilt the new Dell drive from the Spinrite-repaired drive.
- Offlined and removed the Spinrite-repaired drive; this will be returned to Dell. Quickly!
- Installed the remaining "known good" drive in the server.
- Attempted a rebuild of the remaining good drive, but set the rebuild rate too high and crashed the O.S. after 45 minutes (lesson learned the hard way!). Don't set rebuild rate to 100% and expect Windows to keep running. *stupid me*
- At reboot, CHKDSK found and FINALLY repaired the USN journal (one long-standing error which was never quite fixed on the bad drive).
- Restarted the rebuild of the remaining good drive, setting the rebuild rate to a more sensible 25%.
As of this moment, I estimate the rebuild will be complete by mid-morning; the server is up, running, and is notably faster now that the heavily damaged drive is out of the RAID array.
SO... anyone else seen a RAID controller fail the "wrong" drive in a RAID 1 mirror?
Do CERC controllers properly report SMART errors? Sure doesn't look like it.
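(Footnote for anyone wanting to run the same sanity check: once a suspect drive is out of the array and attached directly to a standalone machine, the same SMART data Spinrite was complaining about can be read with smartmontools. A rough sketch, not a polished tool -- it assumes smartctl is installed and on the PATH, and the /dev/hda device name is only a placeholder you would change for wherever the drive shows up on your system:

import subprocess

DEVICE = "/dev/hda"  # placeholder device name -- change to match your setup

def smart_attributes(device):
    # Dump the SMART attribute table using smartmontools' "smartctl -A".
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return result.stdout

def reallocated_sector_count(device):
    # The raw value is the last column of the Reallocated_Sector_Ct row.
    for line in smart_attributes(device).splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(line.split()[-1])
    return None  # attribute not reported by this drive

count = reallocated_sector_count(DEVICE)
print("Reallocated sectors:", count)
if count:
    print("Non-zero reallocation count -- treat this drive as suspect.")

Anything remotely near the 1000+ reallocations Spinrite reported means the drive is on its way out, whatever the controller thinks.)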



warwizard55
September 6th, 2005 19:00
TDThompson,
Good job recovering your data!
Yes, I've seen many instances of a RAID controller failing the wrong drive. If I were troubleshooting the system, the one thing I'd insist on doing is going into the controller BIOS, selecting Objects, then Physical Drive, selecting each drive in turn, and pressing F2 to get the drive info. This screen reports media errors and other errors on the drive.
One thing to note: if the replacement drive rebuilt, then at that time there were no media errors on the original drive, for if there had been, you would have hit a double-fault condition and the rebuild would have failed at the first such error. There could, however, have been plenty of other errors, which the user's guide says could be cabling or interface problems, or mechanical/electronic faults in the drive.
The user's guide says the controller supports SMART errors, and the driver should report them into the system event log. For more intrusive alerting you need to have a management utility installed.
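If you want to see whether Windows itself is receiving any SMART failure prediction for a disk, the disk class driver publishes it through WMI in the root\wmi namespace. A minimal sketch using the third-party Python wmi package (that tool choice is mine, any WMI client would do); note that disks sitting behind a hardware RAID controller like the CERC may not appear here at all, since the controller presents only the logical array to the OS:

import wmi  # third-party package (pip install wmi), wraps pywin32

def failure_predictions():
    # The disk class driver exposes SMART failure prediction in root\wmi.
    conn = wmi.WMI(namespace="root\\wmi")
    for disk in conn.MSStorageDriver_FailurePredictStatus():
        yield disk.InstanceName, bool(disk.PredictFailure)

for name, failing in failure_predictions():
    print(name, "-> FAILURE PREDICTED" if failing else "-> OK")

If nothing shows up for the array members, the only places left to look are the controller BIOS drive-info screen mentioned above or the management utility's own logs.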
warwizard DCSP