Blejzer
1 Copper

PowerEdge R720 RAID controller PERC H710P Mini: two drives in predictive failure, one failed after hot swap during rebuild

OK, so here is the chronology of the events that led to several sleepless nights, packed with stress and tension, numerous walks down to the server room, discussions, planning, tearing up plans, planning again, and so on.

 

Here is the basic system description:

System: Dell PowerEdge R720

RAID controller: PERC H710P Mini

8 drives in RAID 5, slots 0-7 (1.818 TB SATA HDD, WD2000FYYX)

(RAID 5 because we had to sacrifice safety for space… I know!)

Symptoms:

  • For the last three or four weekends we had noticed XenCenter warnings: XenCenter would get disconnected from the machine (even though it was installed on a Windows VM on the Dell itself), and then, just like that, it would stay connected again sometime Saturday night. Our thought was that it was a backup issue, as the weekly backup started at approximately the same time on Saturday.
  • We realized that on the Friday just before this started, we had created one testing VM (we’ll call it LastOne). Nothing special about it, though.
  • Then users started calling us last Saturday morning about slow connection to our CMS.
  • We investigated logs on all production VMs and could not find anything suspicious, apart from the fact that everything was really slow.
  • Also, the daily backup had reported success on Friday.

Can’t remember if we noticed any other symptoms, but after rebooting most of our production VMs, which did not produce any positive results, we noticed that there were some Xen updates available, so we decided, while we thought about how to proceed, to reboot Xen. And that is when this whole thing started:

We started the shutdown procedure, turning off one VM at a time. It was working fine until we sent the shutdown command to LastOne. After 15 minutes it was still not going down, so we tried a forced shutdown, which did not work either.

At this point we became pretty suspicious, because we had dealt with attacks in the past, so we decided to go down to the server room, get on the machine, and check it out. It turned out to be a smart move, as we realized that two out of the eight drives were blinking green / amber – predictive failure (call them Disk1 and Disk6).

Even right at the machine we still could not shut down the VM! We kept getting messages that Xen could not shut the VM down normally and was attempting a forced shutdown, and then it would try again, so it was stuck in a loop. We did the one thing we could do there: a hard shutdown of the server…

We rebooted it into Ctrl+R (the RAID controller configuration utility), took Disk1 offline and swapped it with the one spare we had (let’s call it OnlySpare). Looking at the RAID controller, we saw that it started the rebuild automatically, so we were happy. Once it moved from 0% to 1% we left and went upstairs. When we tried to check on it, we realized that Xen was in read-only mode (or so it was saying). We went back downstairs, and what we saw was an image from out of ***:

OnlySpare was blinking green / amber / off, 3 s each (drive being spun down by user request or other non-failure condition), and Disk6 was flashing amber (drive has failed).

And there was nothing else we could do. It was dead. We killed it. Or at least we felt like we did!

Well, we decided to try everything we could to recover it, so we came up with a plan.

Current situation:

Disk1 – predictive failure

Disk6 – failed

OnlySpare – good spare

(Oh, and just a side note: if you try to find a 2 TB drive in Bosnia and Herzegovina during the weekend, especially an enterprise-grade one, you will fail miserably.)

 

So we came up with a plan:

Plan 1:

Took OnlySpare and Disk6 out, tested Disk6, and since it was still readable, used dc3dd to image it onto OnlySpare. It took a LOONG time though.
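
(The exact dc3dd command isn’t recorded here; as a rough illustration of the idea – a block-for-block copy that keeps going past unreadable sectors – here is a minimal Python sketch. The device paths and block size are placeholders, not values from this incident.)

```python
# Minimal sketch of block-level imaging that pads unreadable spans,
# conceptually what dc3dd/ddrescue-style imaging does.
import os

SRC = "/dev/sdX"      # failing source drive (placeholder path)
DST = "/dev/sdY"      # destination spare or image file (placeholder path)
BLOCK = 1024 * 1024   # copy granularity: 1 MiB per read

def image_disk(src=SRC, dst=DST, block=BLOCK):
    """Copy src to dst block by block, zero-filling blocks that fail to read."""
    unreadable = 0
    with open(src, "rb", buffering=0) as fin, open(dst, "wb", buffering=0) as fout:
        size = fin.seek(0, os.SEEK_END)   # total size of the source device
        offset = 0
        while offset < size:
            want = min(block, size - offset)
            try:
                fin.seek(offset)
                data = fin.read(want)
                if not data:               # unexpected EOF: stop copying
                    break
            except OSError:                # media error: pad the span, move on
                data = b"\x00" * want
                unreadable += 1
            fout.seek(offset)
            fout.write(data)
            offset += len(data)
    return unreadable
```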

We used all of that time to research online, cry, bang our heads, pull our hair, kick the server, and curse Dell, WD, and Dell and WD together, etc.

Once it completed, we took it downstairs, put the original Disk1 back in its slot, and put OnlySpare, now carrying the image of Disk6, into Disk6’s slot.

The RAID controller did not recognize it. It saw it as a new drive and could not do anything with it.

We have since been told that this is not doable, as RAID controllers also keep track of information from the disk firmware itself, but at least we had the data preserved, or so we thought.

Now we started panicking. Stopped for a moment and came up with another plan.

Current situation:
Disk1 – predictive failure
Disk6 – failed
OnlySpare – dd image of disk6 (useless)

Plan 2:

Since Disk1 was not totally dead, we decided to put it back in, take Disk6 out, replace it with OnlySpare, and try to rebuild OnlySpare as a brand new drive in Disk6’s slot. The thought was that the controller would recognize the RAID and rebuild OnlySpare as Disk6.

Once the disks were in, the controller actually recognized the RAID (we did an 'import foreign configuration'), and it automatically started to rebuild, but only seconds later Disk1 reported that it had failed!!!

First thought was, why me???

So I guess this plan did not work.

Current situation:
Disk1 – failed
Disk6 – failed
OnlySpare – dd image of disk6 (useless)

Plan 3 (already desperate):

Put Disk1 in Disk6’s slot and OnlySpare in Disk1’s slot…

I know, I know, failed.

After we ran out of possible combinations, logical and not so logical, we thought about trying one of the more radical suggestions we found online, but also the one that reportedly showed the most success.

Plan n:

Take the electronics (controller boards) from Disk1 and Disk6 and swap them. We had been told it might work if it is the electronics that failed and the drives are IDENTICAL. Well, ours were.

While looking for a screwdriver to unscrew the boards from the drives, we put the original drives back into their slots (Disk1 and Disk6), imported the foreign configuration, and…

*** Xen started to boot normally, like nothing happened!

Of course, the drives started reporting predictive failure almost immediately, but by then we were already upstairs on our machines making backups of all of the VM configurations…

 

The rest of the story is a happy ending, but one thing was still bugging me:

How did this whole thing happen?

Two drives with predictive failure, and as soon as one is replaced, the other one fails?

 

So I did a ‘little’ research on this and found the following:

The RAID controller checks the drives for physical issues, and once it starts finding bad sectors on a drive it registers them and writes the information to its log. The data that was in a bad sector probably does not get rewritten somewhere else, since in a RAID it already exists on one of the other drives – this is my presumption, as it would explain what follows. Once the number of bad sectors on one physical drive reaches a certain threshold, S.M.A.R.T. on the RAID controller issues a predictive failure warning. In most cases that is fine, since the data missing from the bad sectors is readable somewhere else; you replace the one drive, it gets rebuilt, and that is it.
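
(As a toy illustration of that bookkeeping – not actual PERC firmware logic, and the threshold value is made up – a controller could track read errors per drive and flag predictive failure once a count crosses some limit:)

```python
# Toy sketch of predictive-failure bookkeeping: log each bad sector per
# drive and raise a flag once an assumed threshold is reached.

PREDICTIVE_FAILURE_THRESHOLD = 10   # assumed value, purely illustrative

class DriveHealth:
    def __init__(self, slot):
        self.slot = slot
        self.bad_sectors = set()      # LBAs that produced read errors
        self.predictive_failure = False

    def record_bad_sector(self, lba):
        """Log a bad sector and re-evaluate the drive's status."""
        self.bad_sectors.add(lba)
        if len(self.bad_sectors) >= PREDICTIVE_FAILURE_THRESHOLD:
            self.predictive_failure = True

drive = DriveHealth(slot=1)
for lba in range(10):                 # ten distinct bad sectors
    drive.record_bad_sector(lba)
print(drive.predictive_failure)       # True: warning issued, drive still readable
```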

In our case, it seems to me, we had bad sectors on two disks whose data existed only on the other drive with predictive failure. So once we removed one of them, the controller could not rebuild, because part of the data it needed to read was sitting in bad sectors on the other drive. That drive then became unreadable and was marked failed. This would explain why each drive showed as failed once the other one was removed, and it would also explain why everything booted normally once we returned to the original setup after all of the things we tried.
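
(To make that theory concrete, here is a small sketch with made-up numbers of why a RAID 5 rebuild gets punctured when a surviving member has unreadable blocks: rebuilding the replaced drive needs a good read of every stripe on every remaining member.)

```python
# Toy model (invented numbers, not our actual layout) of a RAID 5 rebuild.
# Reconstructing the replaced member requires reading each stripe on every
# surviving member; one unreadable block on any survivor punctures that
# stripe and can fail the whole rebuild.

N_STRIPES = 1000

# Stripe indexes that are unreadable on each surviving drive (illustrative).
bad_blocks = {
    "Disk0": set(),
    "Disk1": {17, 250, 614},   # the other predictive-failure drive
    "Disk2": set(), "Disk3": set(), "Disk4": set(),
    "Disk5": set(), "Disk7": set(),
}

def rebuild_replaced_disk6():
    """Return the stripes that cannot be reconstructed from the survivors."""
    return [
        stripe for stripe in range(N_STRIPES)
        if any(stripe in bad for bad in bad_blocks.values())
    ]

print("Stripes that cannot be rebuilt:", rebuild_replaced_disk6())
# With Disk1 also degraded, the controller hits these stripes while
# rebuilding Disk6's replacement and marks the operation (and Disk1) failed.
```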

I guess the odds of something like this happening are like winning the lottery in the USA, UK, EU, Argentina, and Russia at the same time! Or at least it feels that way.
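
(For what it’s worth, a back-of-the-envelope calculation suggests the odds are shorter than they feel. A rough sketch, assuming an unrecoverable-read-error rate of 1 in 10^15 bits – a typical enterprise SATA spec-sheet figure, not something measured here:)

```python
# Rough estimate of hitting at least one unrecoverable read error (URE)
# while rebuilding a RAID 5 of eight ~2 TB drives. The URE rate is an
# assumed spec-sheet figure, not data from this incident.
import math

URE_PER_BIT = 1e-15          # assumed enterprise-class URE rate
DRIVE_BYTES = 2e12           # ~2 TB per member
SURVIVORS = 7                # members that must be read end to end

bits_read = SURVIVORS * DRIVE_BYTES * 8
p_hit = 1 - math.exp(-bits_read * URE_PER_BIT)   # Poisson approximation
print(f"Chance of at least one URE during rebuild: {p_hit:.1%}")
# ~11% at 1e-15; closer to ~67% at a consumer-grade 1e-14 rate.
```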

So, as a lesson learned, we moved to RAID 6, told management to deal with it, and are now monitoring the physical drives much more closely…

Hope someone finds this useful…

Accepted Solutions
Moderator

RE: PowerEdge R720 RAID controller PERC H710P Mini: two drives in predictive failure, one failed after hot swap during rebuild

Hi,

Here is some information regarding what happens when there are bad blocks on multiple drives. The more drives you use and the larger the drives are, the more likely this is to occur. How often the disks are accessed also plays a part. The odds are not as long as you think: for an array of your size it is somewhere between 1/50 and 1/250. Switching to RAID 6 like you did moves it to around 1/500 to 1/1000. Having fully tested backups is the best way to prevent a data loss scenario.

Thanks,
Josh Craig
Dell EMC Enterprise Support Services
Get support on Twitter @DellCaresPRO