Odd RAID5 behavior--three failed drives at once... or not?

December 5th, 2013 20:00

I'm puzzled... I've spent several hours researching this behavior and haven't come up with anything that makes sense.

OK, we've got a PowerEdge 1800 server with an LSI Logic 1020/1030 RAID card that won't boot up. Upon arriving at the office, a piercing beeping could be heard coming from the server closet. After popping off the front panel, we can see drive 0 is green, but drives 1, 2, and 3 are showing two solid red lights. As much as I find it hard to believe that three hard drives failed simultaneously, that would be the logical conclusion, right? After logging into the RAID configuration tool and seeing drive 0 is online and drives 1, 2, and 3 have failed, that would seem to confirm it.

However, here's the odd thing.

I swap drive 0 and drive 1 to see if the failed status moves with the drives, and am surprised to see all the bays showing single green lights. I check the RAID configuration again, and suddenly drive 0 (the old drive 1) is online, drive 1 (the old drive 0) has a status of REBUILD, and drives 2 and 3 (which I have not moved) also show as online.

I reseated the RAID card, the backplane, and all connectors, and this behavior is absolutely consistent.

In the original order, drive 0 is online and drives 1, 2, and 3 are failed; swap drives 0 and 1, and suddenly drives 0, 2, and 3 are online and drive 1 shows a status of REBUILD.

A visual inspection of the RAID card and backplane shows nothing obvious.

I am tempted to try to rebuild the RAID with the drives out of order, but that seems "ill-advised."


This is a system that is soon to be retired, but there is archival data on it we would like to recover. Unfortunately, there has not been a recent backup.

Any suggestions on how you might proceed if you were in a similar situation?

Moderator • 6.2K Posts

December 6th, 2013 11:00

Hello

Any suggestions on how you might proceed if you were in a similar situation?

The metadata tags that describe the array are written to each drive. When the controller comes online, it checks the tags from the first drive that was last online (the first drive, chronologically, that was online). These tags contain timestamps of the last update. If the times on the other drives do not match up, those drives are put into a failed state.

In your situation, this is what I expect occurred. One drive went into a failed state, and then another drive failed, causing the array to go offline. The drive that failed first was likely in slot 0; its timestamps did not match the other drives', so they remained in a failed state. When you swapped 0 and 1, the controller brought the drive now in slot 0 online, saw that its tags matched those of the drives in slots 2 and 3, and brought them online too, but left slot 1 offline because it didn't match.
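To illustrate that timestamp logic, here is a toy sketch in Python. This is not actual LSI firmware, and the field names are invented; it only shows why the same four drives classify differently depending on which slot the controller reads first:

```
from dataclasses import dataclass

@dataclass
class DriveMeta:
    label: str
    last_update: int  # hypothetical timestamp from the last config update

def classify(drives):
    """Use the first drive's tags as the reference; any drive whose
    timestamp disagrees with it is marked failed."""
    reference = drives[0].last_update
    return {d.label: "ONLINE" if d.last_update == reference else "FAILED"
            for d in drives}

stale = DriveMeta("stale drive", 100)  # failed first; its metadata stopped updating
fresh = [DriveMeta(f"drive {i}", 200) for i in (1, 2, 3)]

# Stale drive in slot 0: its old timestamp becomes the reference,
# so the three healthy drives all look "failed".
print(classify([stale] + fresh))

# Swap slots 0 and 1: a healthy drive is read first, drives 2 and 3
# match it, and only the stale drive is flagged (and gets rebuilt).
print(classify([fresh[0], stale, fresh[1], fresh[2]]))
```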

Randomly shuffling drives is not the best approach to recovering a RAID array. I'm glad it worked for you, but in the future it is best to pull a controller log to find out what happened. Then you can force drives online to attempt to bring the array back up.

Thanks

16 Posts

December 6th, 2013 12:00

Thanks for the feedback.

I only moved 0 and 1 to test if the drive would be physically detected--if a bad drive does not work in one port, it will not work in another.

I haven't actually attempted to rebuild the array. I wasn't going to rebuild the array with the drives swapped. I left it in its current unrebuilt state to do a little research.

My thinking was that the drive originally in slot 0 has failed or is failing as well, so rebuilding onto it would be pointless or would fail.

I think I will have more luck if I replace drive 0 and rebuild the array from the data and parity on drives 1, 2, and 3.
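If I understand RAID 5 right, each block on the replaced drive can be regenerated as the XOR of the corresponding blocks on the surviving drives. Here's a toy sketch of that arithmetic in Python, just to check my own reasoning (obviously not what the controller literally runs):

```
# Toy illustration of RAID 5 reconstruction -- not controller code.
# Every block on the replaced drive equals the XOR of the blocks at
# the same offset on the surviving members (data and parity alike).

def rebuild_block(surviving):
    """Regenerate one missing block from the same-offset blocks
    on the surviving drives."""
    out = bytes(len(surviving[0]))
    for block in surviving:
        out = bytes(a ^ b for a, b in zip(out, block))
    return out

d1, d2, d3 = b"\x11\x22", b"\x33\x44", b"\x55\x66"  # made-up stripe data
d0 = rebuild_block([d1, d2, d3])                     # what the new drive gets

# In a healthy stripe, XOR across all members is zero.
assert bytes(a ^ b ^ c ^ d for a, b, c, d in zip(d0, d1, d2, d3)) == b"\x00\x00"
```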

What do you think?

Thanks

 

Moderator • 6.2K Posts

December 6th, 2013 12:00

I haven't actually attempted to rebuild the array. I wasn't going to rebuild the array with the drives swapped. I left it in its current unrebuilt state to do a little research.

This doesn't make any sense to me because of your previous post:

swap drives 0 and 1, and suddenly drives 0, 2, and 3 are online and drive 1 shows a status of REBUILD.

It is my understanding that when you moved the drives around, the array came online and started rebuilding the drive that you moved from slot 0 to slot 1. I'm guessing that you stopped the rebuild? If that is correct, then there is nothing to do. As soon as that rebuild started, it rewrote the metadata on the drive in slot 1 and started rewriting the data on the drive. The array should be functional in a degraded state right now, with 3 of the 4 drives in the RAID 5 online. Have you attempted to boot the system?

At this point the drive in slot 1 cannot be forced back into the array; the only option for that drive is to rebuild it. The question is whether the array is functional and the data is intact on the remaining drives. The only way to know for sure is to attempt to boot. The problem is that as soon as you start reading and writing to the drives, the controller will start matching them up, so if there is a blank drive in the array it will wipe all of the other drives to match. It won't do this until you try to boot.

Since a rebuild already started, the drives began a write procedure and started syncing up, so I don't see a reason not to go ahead and try to boot. If it works, great; if not, disconnect the drives from the controller ASAP and contact a data recovery company if the data is important.

Moderator • 6.2K Posts

December 6th, 2013 13:00

those are double red lights on the bottom three drives...but for some reason, drive 0 is showing as OK

That is a picture from earlier, before you moved the drives, right? According to the picture you just took of the PERC BIOS, 0, 2, and 3 should be solid green and 1 should be blinking green and amber.

BTW, the top light is amber in those pictures. It is not two red lights.

Thanks

16 Posts

December 6th, 2013 13:00

those are double red lights on the bottom three drives...but for some reason, drive 0 is showing as OK

Moderator • 6.2K Posts

December 6th, 2013 13:00

Drive 1 is rebuilding right now. The drives will sync when performing a rebuild since that is a read/write operation on the array. You can't cause any more damage by attempting to boot. I would suggest going ahead and trying to boot into the operating system.

16 Posts

December 6th, 2013 13:00

I'm sorry I have not made myself clear.

I entered the RAID configuration screen before allowing the drives to boot. This is what I saw:

[screenshot of the RAID configuration utility]

16 Posts

December 6th, 2013 13:00

Before swapping the drives, upon entering the RAID configuration utility, 0 was online, but 1, 2, and 3 were marked as OFFLINE and FAILED.

I have not yet allowed the drives to automatically rebuild or attempted to force a rebuild. Thus far, I have kept the system from trying to boot.

In the original state (with 0, 1, 2, and 3 in their original positions), the system simply would not boot, with 1, 2, and 3 showing failed.

16 Posts

December 6th, 2013 14:00

Yes, that picture of the front of the case was taken before swapping the drives and entering the RAID configuration utility.

At that point, within the RAID configuration utility, drive 0 was listed as ONLINE, while drives 1, 2, and 3 were listed as FAILED and OFFLINE.

Yes, amber... I'm tired....

16 Posts

December 6th, 2013 14:00

Interesting... but it only appears online with drives 0 and 1 swapped.

With the drives in their original sequence (0,1,2,3 instead of 1,0,2,3), it is offline.

It doesn't make sense that swapping 0 and 1 should result in it being online, but degraded. That is what is driving me nuts.

However, if I hadn't swapped 0 and 1, I would have assumed 0 was good and 1, 2, and 3 were bad.

BTW, thank you very much for hashing this out. I've only dealt with two other failed RAIDs, and in both cases it was fairly straightforward... the drive either was detected or it was not. Here... puzzling.

16 Posts

December 6th, 2013 14:00

My inclination is to place 1, 2, and 3 back where they belong, put in a new 0, and attempt to rebuild the array... I need to get my hands on a new drive first, though.

Moderator • 6.2K Posts

December 6th, 2013 14:00

The array should be online in a degraded state. I would suggest booting up to see if you can access the data. The array has already synced, so attempting to boot will not change anything.

16 Posts

December 19th, 2013 15:00

So, I've replaced the drive and rebuilt the RAID (it took about 5 hours), and the RAID utility is reporting "rebuild completed with errors."
Now what?

I ran the consistency check on the logical drives. Logical Drive 1 reports: ERRORS.
Logical Drive 2 is still running (hasn't moved off 0 percent for about 15 minutes).
Thoughts?

Moderator • 6.2K Posts

December 21st, 2013 09:00

So, I've replaced the drive and rebuilt the RAID (it took about 5 hours), and the RAID utility is reporting "rebuild completed with errors."
Now what?

I ran the consistency check on the logical drives. Logical Drive 1 reports: ERRORS.
Logical Drive 2 is still running (hasn't moved off 0 percent for about 15 minutes).
Thoughts?

You are likely going to need to delete and recreate the array. The array data appears to be corrupted, and if the array is corrupt, a rebuild cannot complete. I suspect this because the controller told you the array is corrupt when it reported uncorrectable errors on LD 1 after the consistency check, and because the rebuild stops at a certain point.

A consistency check verifies that all of the logical blocks are intact. If any are corrupt, it will try to repair them when redundant data is available. If redundant data is not available, you will get an error like the one you received.
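As a rough illustration of that logic, here is a toy Python sketch assuming simple XOR parity (the firmware's actual checks are more involved): a healthy RAID 5 stripe XORs to zero, a mismatch with all members readable can be repaired by recomputing parity, and a mismatch with an unreadable member is uncorrectable.

```
# Toy model of a per-stripe consistency check under XOR parity.
# None represents an unreadable block on one of the members.

def check_stripe(blocks):
    acc = 0
    for block in blocks:
        if block is None:           # a member could not be read:
            return "UNCORRECTABLE"  # no redundancy left to repair with
        acc ^= block
    # XOR of the data blocks plus parity is zero on a consistent stripe
    return "OK" if acc == 0 else "REPAIRABLE (recompute parity)"

data = [0x11, 0x22, 0x44]
parity = 0x11 ^ 0x22 ^ 0x44

print(check_stripe(data + [parity]))             # OK
print(check_stripe(data + [0x00]))               # REPAIRABLE (recompute parity)
print(check_stripe([0x11, None, 0x44, parity]))  # UNCORRECTABLE
```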

If the data is important enough to spend money on recovery, contact a data recovery company. If not, then delete and recreate the array. Everything on the drives will be lost if you delete and recreate the array.

Thanks

16 Posts

December 21st, 2013 12:00

Thanks, Daniel.

The consistency check on the second logical drive completed without errors.

The first logical drive was small and was simply the boot drive; the second, which completed the check without errors, holds the data and application storage.

I think I am going to side-load another copy of Windows Server onto a standalone SATA drive, boot into that, and see if I can retrieve the data from the second logical drive, which may still be intact; I just cannot boot into it.

I was going to try to boot into the degraded RAID, but I don't think that has any chance of succeeding.

 
