bgibbs
1 Copper

Raid 5 failure, looking for advice/opinions

I'm running a PowerEdge 1900 with Windows Server 2003.  It has four 73GB SAS disks in a RAID 5 configuration (PERC 5/i).  One drive failed... I could hear it clicking... so I downloaded the monitoring software, and it showed the drive as failed.  It was drive 2.  In trying to determine what type of disk was installed, I pulled out the bay a bit too far and, while the machine was powered up, the cable came partially loose from drive 0.  I didn't notice it until I saw the reboot process running on the monitor.  I opened the case, reseated the cable, and tried to reboot again, but it would not boot and went to the "PXE-E53 No Boot Filename Received" message.

After trying several more reseats and reboots, I got into the RAID configuration and the physical drives showed as follows: 0=Missing, 1=Online, 2=Failed, 3=Online.  It also showed a "foreign configuration" message.  Based on some of the posts in this forum, I cleared the foreign configuration.  Now the drives show as: 0 and 2 missing, 1 and 3 online.  I have tried replacing the missing/failed drives with new drives, just to see if it would affect the boot process, but it does not.

I'm really just trying to determine if there's any way to salvage the RAID array and the system without having to start from scratch.  I really don't want to tackle reloading Windows and reconfiguring the RAID, etc.  I've worked with a lot of PCs, but have never configured a server of this magnitude.

Anyone have any thoughts or advice? 

Thanks!

theflash1932
5 Iridium

RE: Raid 5 failure, looking for advice/opinions

"It also showed a "foreign configuration" message.  Based on some of the posts in this forum, I cleared the foreign configuration."

There are two things you can do with a foreign config:  clear and import.  Whatever thread you referenced for clearing the foreign config was for a dissimilar situation.  You wanted to IMPORT your foreign configuration in this situation.

Right now, you have only one option:  put your original disks in and do what is called a "retag".  You will basically configure RAID WITHOUT initializing the array.  This will assign each disk's position within the array.  In CTRL-R, you will need to delete the existing offline/failed RAID 5, then configure RAID 5, making sure the Advanced/Initialize boxes are not checked.  The settings (stripe size, size, etc.) MUST be identical to what they were before - assuming whoever set it up used the default settings, you can use the default settings.  Once you do the retag, force disk 2 offline as soon as possible.  If the original disk 2 is not recognized, you can put in another "new" disk, just be SURE to force it offline as soon as the retag is completed.
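
To see why the stripe size and disk order have to match the original setup, here is a minimal Python sketch of how a RAID 5 might map a logical byte offset to a physical disk.  The `locate` function, the left-symmetric parity rotation, and the 64 KiB / 128 KiB element sizes are all assumptions made for illustration; they are not the PERC 5/i's documented layout or defaults.

```python
def locate(logical_offset: int, n_disks: int, stripe_size: int) -> tuple[int, int]:
    """Map a logical byte offset to (disk index, byte offset on that disk).

    Hypothetical left-symmetric RAID 5 layout: parity starts on the last
    disk and moves back one disk per row; data follows the parity disk.
    """
    data_per_row = n_disks - 1                      # one element per row holds parity
    element = logical_offset // stripe_size         # which data element overall
    row = element // data_per_row                   # which stripe row
    parity_disk = (n_disks - 1) - (row % n_disks)   # parity rotates each row
    disk = (parity_disk + 1 + element % data_per_row) % n_disks
    disk_offset = row * stripe_size + logical_offset % stripe_size
    return disk, disk_offset


KIB = 1024
offset = 300 * KIB

# Same logical offset, same four disks, two different stripe element sizes.
with_64k = locate(offset, n_disks=4, stripe_size=64 * KIB)
with_128k = locate(offset, n_disks=4, stripe_size=128 * KIB)

# The two settings point at different disks and different positions, so a
# retag with the wrong stripe size would read the wrong blocks.
mismatch = with_64k != with_128k
```

The data on the disks does not move during a retag; only the controller's idea of where each block lives changes.  If the new settings differ from the originals, the same bytes are still there but get looked up in the wrong place.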

If you are in the US and your system shipped before 11/2009, you can call Dell to have them walk you through this procedure over the phone free of charge.

bgibbs
1 Copper

RE: Raid 5 failure, looking for advice/opinions

Thanks for your reply.  Guess I should have checked more thoroughly about the foreign configuration option in reference to my situation.

Two more questions.... do you think I've got a shot at reviving this system?  Most of the Googling I've done seemed to say that a Raid 5 system couldn't be brought back if just 2 of the 4 disks were functional.

And, since I am in the US and this system was installed in 2008, if I call Dell, what information do I have to give them to receive the free tech support?

theflash1932
5 Iridium

RE: Raid 5 failure, looking for advice/opinions

"do you think I've got a shot at reviving this system?  Most of the Googling I've done seemed to say that a Raid 5 system couldn't be brought back if just 2 of the 4 disks were functional."

I might recommend asking someone (other than self-searching) the next time, as EVERY situation is different.

Yes, it is true that a RAID 5 with two "offline" disks is considered dead - the data is inaccessible - and if both disks were physically dead and completely non-functional, then there is NO way, other than professional data recovery, to get it back.  BUT, we aren't talking about physically or completely failed hard disks.  Any time the controller detects a configuration on a disk that does not match the configuration (including timestamp) it holds in its memory, that disk's configuration is flagged as foreign.  This flag does not affect the data on the disk, just the headers that determine its place in the/a RAID array.  When you unplugged disk 0, it was then out of sync with the rest of the array (even if by mere milliseconds), and thus marked foreign.  (This is why 'cabled' disks are not usually treated as hot-swappable, and why the system should be powered down before unplugging them.)  Had you imported the config, it would have simply forced the disk's signature/timestamp to match the rest of the array and everything would have been online (except disk 2).  Clearing the foreign config told the controller to wipe the configuration from disk 0, so now there is nothing telling the controller where it fits within any array and it is simply ready to be used however else you like.  Retagging the array will simply write the configuration header information to each of the disks, leaving the data on the disks intact.
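
As a rough illustration of the difference between importing and clearing, here is a toy Python model of the behavior described above.  The `DiskConfig` and `Disk` classes, the field names, and the three functions are hypothetical names invented for this sketch; they do not reflect the controller's real on-disk metadata format or firmware logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskConfig:
    array_id: str    # which array the disk believes it belongs to
    position: int    # the disk's place within that array
    timestamp: int   # last time the config was in step with the controller


@dataclass
class Disk:
    slot: int
    config: Optional[DiskConfig]  # header metadata only
    data: bytes                   # user data, untouched by every function below


def is_foreign(disk: Disk, array_id: str, timestamp: int) -> bool:
    """A disk is foreign if it carries a config that differs from the controller's."""
    if disk.config is None:
        return False  # no config at all: just an unconfigured disk
    return disk.config.array_id != array_id or disk.config.timestamp != timestamp


def import_foreign(disk: Disk, timestamp: int) -> None:
    """Import: bring the disk's header back in step; its array position survives."""
    if disk.config is not None:
        disk.config.timestamp = timestamp


def clear_foreign(disk: Disk) -> None:
    """Clear: wipe the header, so the controller no longer knows where the disk fits."""
    disk.config = None


# Disk 0 missed an update while its cable was loose (made-up timestamps 100 vs 101).
imported = Disk(slot=0, config=DiskConfig("vd0", 0, 100), data=b"payload")
cleared = Disk(slot=0, config=DiskConfig("vd0", 0, 100), data=b"payload")

was_foreign = is_foreign(imported, "vd0", 101)
import_foreign(imported, 101)   # the path that would have kept the array together
clear_foreign(cleared)          # the path that was actually taken
```

In both cases `data` is unchanged, which is why a retag can still work: only the header that said "this is position 0 of the array" was lost, and the retag writes it back.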

"since I am in the US and this system was installed in 2008, if I call Dell, what information do I have to give them to receive the free tech support?"

Call (800) 456-3355 and give them your Service Tag (serial number).  Assuming it was sold directly to your company/organization, give them the name of your organization when asked for the name of the company to which it is registered.

bgibbs
1 Copper

RE: Raid 5 failure, looking for advice/opinions

Many thanks for your time and your replies.  If possible, I'd like to get a clearer view of the steps I'm going to have to perform now.  If I have to resort to calling Dell and trying to find some guidance there, I'd rather be armed with as much info about this "retagging process" as I can be.  I've read through your two replies and understand most of what you have laid out, but wanted to get a few specifics, if possible. 

First, I'm running the PERC Integrated BIOS Configuration Utility 1.04-019A.  I have the original disks back in the system now, with 0 and 2 showing Missing and 1 and 3 showing Ready.  I have "VD Mgmt", "PD Mgmt", and "Ctrl Mgmt" showing as menu choices.  In the tree view, I have "Controller 0", under that, "Disk Group 0", under which the main branches are, "Virtual Disks", "Physical Disks", "Space Allocation", and "Hot Spares".

When you say that I'll need to delete the existing "failed RAID 5", am I deleting it through this screen and, if so, which of these levels am I actually deleting? 

I'm assuming the default values were used when the RAID was configured... no reason for them to do otherwise in our case... so I intend to stick with those values.  Again, do I use this screen to reconfigure the RAID, and if so, under which of the functions will I find that option?

Once I get through the retagging process, is there a way to tell whether the default values actually caused everything to line back up properly, or if that is not the case, will it be obvious?  In other words, if it turns out that the folks who configured it originally used some other values, what will I see?

Finally, if the retagging is successful, and disk 2 is not recognized and I force the disk 2 offline, when I put in a new disk, what procedure causes it to be rebuilt?

Sorry to be so picky about the details, but I want to make as sure as possible that I don't make any more missteps and that what I'm being told by any support people matches what you are telling me, since you seem to be seriously expert concerning these things.  Again, my thanks for your time and efforts on my behalf.

theflash1932
5 Iridium

RE: Raid 5 failure, looking for advice/opinions

"When you say that I'll need to delete the existing "failed RAID 5", am I deleting it through this screen and, if so, which of these levels am I actually deleting? "

You should only have one VD, and that is what you need to delete.

"do I use this screen to reconfigure the RAID, and if so, under which of the functions will I find that option?"

RAID is configured on the VD MGMT screen.  After deleting the VD, highlight the controller, hit F2, Create/New, check the boxes for all four disks, select RAID 5, name it if you like, leave all other settings as default (make sure Initialize is NOT checked), then OK.  Then you need to force disk ID 2 offline, then see if you can boot your OS (keeping in mind it may have been damaged and may need repair beyond the VD repair process).  If you are able to boot (or see an OS message about why it can't boot), then it "may" have worked.  Only after booting the OS, making sure your data is accessible and intact, running a CHKDSK /R, rebuilding your failed drive, and running a Consistency Check will you know for sure that it worked.

"if the retagging is successful, and disk 2 is not recognized and I force the disk 2 offline, when I put in a new disk, what procedure causes it to be rebuilt?"

Worry about the retag, then worry about rebuilding disk 2.  However, you MUST force disk 2 offline as soon as possible after the retag.  Its data is stale and will corrupt the rest of the array if left in an online state.  If disk 2 is not recognized during the process of creating the RAID 5, then you can use a blank disk, but it is even more important that you force it offline as soon as possible.
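
To make the "stale data" point concrete, here is a small Python sketch of one RAID 5 stripe using XOR parity.  The block contents and variable names are made up for illustration, and the sketch assumes that some write landed on this stripe after disk 2 dropped out of the array.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks (how RAID 5 parity is computed)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a four-disk RAID 5: three data blocks plus one parity block.
d0, d1 = b"AAAA", b"BBBB"
stale_d2 = b"CCCC"      # what disk 2 physically held when it failed
current_d2 = b"DDDD"    # what the array holds after a write made while degraded

# While degraded, the controller cannot write to disk 2, so it folds the new
# value into parity instead.  Parity now describes current_d2, not stale_d2.
parity = xor_blocks(d0, d1, current_d2)

# Disk 2 forced offline: its block is rebuilt from the three good members.
read_if_offline = xor_blocks(d0, d1, parity)

# Disk 2 left online after the retag: the controller trusts what is on it.
read_if_online = stale_d2

offline_is_correct = read_if_offline == current_d2
online_is_correct = read_if_online == current_d2
```

With disk 2 offline, the missing block is recomputed from the other three members and comes back correct.  With disk 2 online, the old contents are returned as if they were current, and any later parity update or rebuild that uses them spreads the error to the good disks.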

There is no guarantee that your data will come out of this alive, but the outcome is usually good.

Most Dell support agents should be very familiar with this process.

bgibbs
1 Copper

RE: Raid 5 failure, looking for advice/opinions

Happy ending!

Printed all of your information and made the call to Dell.  Worked through to a support tech who, although expressing misgivings about our chances of success, basically followed your script.  Bottom line is, it appears as though it's going to work!

I've booted the server up, checked some of our applications, and am currently running a new disk (slot 2) through a rebuild process.  I think we may be back from the brink of oblivion!  I'll be doing quite a bit more testing of the data and have a lot of work ahead of me to get everyone back on this server, but I'm one happy boy!

I'm still amazed that Dell provides free support for these "older" servers.  Must be more to the story, but while it doesn't make sense, it sure came in handy for me. 

Hope I won't need to get back in here, but if and when I encounter any more problems with these servers, I'll ask a lot of questions before attempting any repairs.

Thanks again for all of your help.

theflash1932
5 Iridium

RE: Raid 5 failure, looking for advice/opinions

"although expressing misgivings about our chances of success"

They are trained to set "proper" expectations - don't get the customer's hopes up and let success be a welcome surprise.

"I'm still amazed that Dell provides free support for these "older" servers.  Must be more to the story, but while it doesn't make sense, it sure came in handy for me."

This practice was an awesome way that Dell stood out from their competition, being more concerned with helping customers (non-technical end users and support staff alike) work through any of their issues to protect company data, uptime, jobs, and reputation, than with whatever meager savings might exist in pushing unsupported systems out the door.  In 2009, they made a decision to no longer provide "free" support for servers shipped after a certain date (10/29/2009).  Dell stated that this would bring their service and costs in line with their competitors, but I believe it to be a huge mistake, damaging the reputation they had carried for years as the best "enterprise" support in the business.  Of course, not everyone knows that this service is no longer included with Dell servers, and many still believe they get top-notch support, but those who have been in the business no doubt see it as a bad sign, whether we used the support or not (I don't, but it is extremely valuable to those who are not familiar with Dell hardware/software).

Good luck!
