July 6th, 2011 16:00

Drive problems with SAS5/iR controller

Hi all, I'm currently experiencing an issue that I'm hoping someone can shed some light on.

Originally, I had two identical Seagate 1TB SAS drives in a RAID1 configuration in an SC1435 with a SAS5/iR controller.  The server runs VMware ESXi and was functioning well until about 2 months ago, when I received an alert that the RAID container was in DEGRADED status.

During a maintenance window, I took down the host and inspected the RAID controller BIOS to figure out the problem.  It reported that drive0 was 0% synched and that drive1 was the primary and in a predictive fail state.  We ordered a new drive to replace drive1, which arrived about 2 weeks ago.

During another maintenance window, we took down the server, and went into the RAID BIOS and forced a resynch.  This took approximately 9 hours, and at the end of it drive0 was the primary drive, in synch, and not in predictive fail.  We then swapped out drive1 for the new drive, and waited for it to synch with drive0.  This happened in about the same amount of time as the last resynch, so we put the machine back into service and called it a day.

Now, after 1 week with the new drive we're in exactly the same situation:

drive0: 0% synched, not in predictive fail
drive1: primary, predictive fail

We tried forcing a resynch, but after 1 hour we barely reached 1% completion.

Given the generally easy lives these drives have led and the advertised duty cycle for this model, I find it really hard to believe that we've had 2 bad-apple disks in a row.

Next, we powered on another server, a PE1950, which has a similar SAS RAID card in it.  We pulled the drives from the SC1435, put them in the 1950, and went into the RAID BIOS.  The RAID BIOS recognized the container and said that the array was currently in an inactive/degraded state, listing the same information on the drives as the SC1435:

drive0 0% synched
drive1 primary predictive fail

Now, I can understand the 0% synch status and the need for a resynch, but how is the primary still in predictive fail?  Is this sourced from SMART data?  Do I need to reset the SMART reporting in order to see if it still happens on this controller? 

Any ideas for paths forward are welcome and appreciated.

 

Thanks!
