Unsolved

TP

1460

November 29th, 2016 15:00

PERC 6/i Raid in PowerEdge 2900 fails rebuild

I have a now-degraded RAID 5 array in a PowerEdge 2900 with a PERC 6/i.  It originally had 5 SATA drives in bays 0, 1, 2, 3, and 7.  Bay 2 showed a flashing amber light, which I understand indicates a failed drive.

I acquired a replacement drive and hot-swapped it into the failed bay (as I believe I did once before, when the server was under warranty), but it didn't seem to take.  I didn't have OpenManage installed, so I ended up restarting in order to use the BIOS utility to declare the drive a spare.  It then appeared to begin rebuilding but never finished.  I tried again and hit the same issue: the controller took the drive offline.  The drive carrier light cycles green > off > amber > off, which, from what I can find, means "Drive being spun down by user request or other non-failure condition".  The utilities also list the drive as offline at this point.

I assumed the replacement drive was a dud, so I acquired a different replacement drive, but I get the same issue.  Both fail consistently after 9% of the rebuild, so I'm skeptical that the second is coincidentally a dud too, unless 10% is some transition point in the process and not simply 10% of the data copied.  I find it particularly frustrating because I get no feedback about what it is doing.  It just changes from rebuilding to taking the drive offline.

I have tried rebuilding from both OpenManage (installed after the initial failure) and the BIOS utility.  Both go from rebuilding to taking the drive offline.  I have also tried rebuilding in a different bay.  The firmware and driver on the PERC 6/i are current. The Dell Online Diagnostics tests report that the battery requires replacement and nothing else.

This is outside of my knowledge so it's possible I'm doing or have done something wrong (in addition to restarting which I couldn't figure out a way around).  I'm rather lost at this point.  Is there something else I should be checking?  Should I buy a third replacement drive to try?  If it actually thinks the drive is bad, isn't a "predicted failure" error more normal?  Could it be related to the lack of battery and using write-through over write-back?

5 Practitioner

 • 

274.2K Posts

November 30th, 2016 09:00

Hello.

A hard drive in a predictive failure state must be taken offline before it is replaced. Is this the procedure you followed? However, there may also be other issues related to your data.  Please reply with a DSET report to the e-mail I am going to send you so we can analyze it further.

5 Practitioner

 • 

274.2K Posts

December 12th, 2016 09:00

Review of the controller logs indicates that you have a punctured array and that your data is corrupted. This explains the persistent failure of replacement physical drives on slot 1:0:6. The array can only be restored to an optimal state by deleting and recreating it, so first ensure that you have a validated backup of your data. What you can do now is run offline diagnostics on the drives, then replace any failed or failing drives. Don’t forget to clear the existing hardware logs and follow the steps below:

  • Verify a file-level backup. Do not use an image-based backup.
  • From the RAID BIOS utility or from within OMSA, delete the entire affected Virtual Disk 0. This requires a full OS re-installation.
  • Replace any failed or predictive-failure drives.
  • Create a temporary dissimilar RAID 6 array.
  • Fast initialize this temporary array.
  • Delete the temporary array.
  • Create your actual RAID 5 array.
  • Perform a slow (full) initialization of the array to get rid of the bad logical blocks.
  • Re-install your OS and restore your data from backup.
  • This sounds tedious, but it is the best recommendation to prevent similar issues from recurring.
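
For readers wondering why a bad block on one of the remaining drives can stop a RAID 5 rebuild at the same point every time, here is a toy Python model of XOR parity reconstruction. It is only a sketch of the general RAID 5 principle, not the PERC 6/i firmware's actual behavior; the function names and the two-byte blocks are made up for illustration.

```python
from functools import reduce
from typing import Optional, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(surviving: Sequence[Optional[bytes]]) -> bytes:
    """Reconstruct the missing member's block for a single stripe.

    `surviving` holds the blocks read from every other member (data and
    parity). None marks a block that could not be read (hypothetical model
    of a media error on a surviving drive).
    """
    if any(block is None for block in surviving):
        raise IOError("unreadable block on a surviving drive; stripe cannot be rebuilt")
    return xor_blocks(surviving)


# Toy stripe on a 5-member RAID 5: four data blocks plus one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]
parity = xor_blocks(data)

# Member 2 has failed. With every other block readable, it can be recomputed.
survivors = [data[0], data[1], data[3], parity]
recovered = rebuild_block(survivors)
print(recovered == data[2])

# If a surviving member also has an unreadable block in this stripe,
# there is not enough information left to recompute the missing block.
try:
    rebuild_block([data[0], None, data[3], parity])
except IOError as err:
    print("rebuild stopped:", err)
```

In this model the unreadable block sits at a fixed position on a surviving drive, so every replacement drive hits the same stripe at the same percentage, no matter how healthy the new drive is.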

To avert a similar situation in the future, it is recommended that you run consistency checks on all VDs at least once a month during off-hours. See link: http://www.dell.com/support/article/us/en/04/SLN156918/?c=us&
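
If OMSA is installed, a consistency check can also be started from its command line, which makes a monthly schedule easier to automate. The Python sketch below builds the `omconfig storage vdisk action=checkconsistency` command; the controller and virtual disk IDs of 0 are assumptions (confirm yours with `omreport storage vdisk`), and the helper names are hypothetical.

```python
import subprocess
from typing import List


def consistency_check_command(controller: int, vdisk: int) -> List[str]:
    """Build the OMSA CLI command that starts a consistency check on one VD."""
    return [
        "omconfig", "storage", "vdisk",
        "action=checkconsistency",
        f"controller={controller}",
        f"vdisk={vdisk}",
    ]


def run_consistency_check(controller: int = 0, vdisk: int = 0,
                          dry_run: bool = True) -> List[str]:
    """Print the command (dry run) or execute it via the OMSA CLI.

    Controller 0 / vdisk 0 are assumed defaults, not values taken from
    the logs discussed in this thread.
    """
    cmd = consistency_check_command(controller, vdisk)
    if dry_run:
        print(" ".join(cmd))
    else:
        # Requires OMSA to be installed and omconfig to be on the PATH.
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: only shows what would be executed.
run_consistency_check(controller=0, vdisk=0, dry_run=True)
```

A script like this could then be called with `dry_run=False` from the operating system's scheduler (cron or Task Scheduler) once a month during off-hours.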
