January 24th, 2013 07:00

Dell Poweredge 2950 HDD Fault; need advice on hot swapping the drive

This server houses our small business database, so I don’t want to do anything that might corrupt the data on it.  The other IT staff (a programmer) backs up the files to external hard drives.  I’m new in the IT field, so I could use some advice.  I did not set up the server, and of course the server is a few months out of warranty at this point.

The LED display panel on the front of the server shows this error – “E1810 HDD ## Fault”.  I have no way to manage this server, as it was set up as all virtual machines in VMWare ESXi 4.  Right now the programmer doesn’t want to install Dell OpenManage, because the server would first need updated firmware, etc.  When I look at the physical server itself, I can clearly see the hard drive with issues because there are amber lights blinking instead of green.  This is the most detailed information I can get from VMWare console: 

Alert! >>Storage  - Drive 5 in enclosure 32 on controller 0 Fw:HS09 – UNCONFIGURED BAD  Warning!  RAID 5 Logical Volume 0 on controller 0, Drives (0e32,1e32,2e32,3e32,4e32,?) DEGRADED Warning!

Disk Drive Bay 1 Drive 5: Drive Fault – Assert    Alert!

Disk Drive Bay 1 Drive 4: In Critical Array – Assert   Alert!

Disk Drive Bay 1 Drive 3: In Critical Array – Assert   Alert!

Disk Drive Bay 1 Drive 2: In Critical Array – Assert   Alert!

Disk Drive Bay 1 Drive 1: In Critical Array – Assert   Alert!

Disk Drive Bay 1 Drive 0: In Critical Array – Assert   Alert!

 Since I’m clueless as to what I’m dealing with here, can anyone answer the following questions for me?

 I assume this is a RAID 5 configuration, correct? 


I obtained this info from the system configuration page on dell.com:

               1)Backplane, Dell2950, 3.5X6SAS                                             

               2)ControllerDRAC5, PE, Mid-Life Kicker

               3)FG0271Card, Backplane, Key, TOE, 2 PORT Enterprise Systems Group

               4)MC3601Assembly, Cable SASX4-PERC5X4-X6BKPLN

               5)Controller PERC6IINT, Serial Attached ScsiShort Lead

 1)      What exactly is a backplane? I could look this up in Wikipedia, but I can probably get a better working explanation from a true network person!  I assume the SAS stands for serial attached SCSI drives, since that’s what the hard drive description is (SCSI).

2)      Is DRAC5 the brand of the controller?  If not, what does DRAC5 mean?   Mid-Life Kicker??

3)      TOE?

4)      Not sure what this is saying…

5)      What is a PERC6IINT controller, and why does line 4 show PERC5?

 

I have been reading the Dell troubleshooting documentation and it has some terms I don’t know how to address.  Based on the above information, this server has an SAS RAID controller, correct?  How do I know if I have an SAS RAID controller daughter card (that term is in the troubleshooting steps)?

 Finally, we purchased a Dell replacement hard drive.  It’s hot swappable, and we were just going to exchange it with the failed drive.  Will I need to configure anything on the server or drive, or is it as simple as just swapping out the bad drive with the new one?  Will it rebuild without any intervention on my part?  If there seem to be problems, it looks like I will at least be able to monitor something by going into the RAID BIOS, correct? 

 With my lack of server knowledge & experience, what would you suggest?  Should we just go for it, and try hot swapping the drives after hours?  Or do you think we should “bite the bullet” on this one, and opt for paying Dell a service call to do this?  I may be overly cautious, but I don’t want to be responsible for taking down a critical server when I’m still new at this.   Any input would be most appreciated!  

7 Technologist

 • 

16.3K Posts

January 24th, 2013 08:00

"I assume this is a RAID 5 configuration, correct?"

Strike 1:  Assume.  Never assume anything :)  With 6 physical disks on a PERC 6, you could have a RAID 5, 6, or 10 (6 disks could also be a RAID 0, but given one disk has already failed and the data is still accessible, it could not be at this point).  What you do with the failed disk isn't affected by this fact.
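To put numbers on the RAID levels mentioned above, here is a rough sketch of usable capacity and guaranteed fault tolerance for each level. The function name `raid_summary` and the 300 GB disk size are made up for illustration; the arithmetic is the standard textbook rule for each level, not anything read from the PERC.

```python
def raid_summary(level, disks, disk_size_gb):
    """Return (usable_gb, failures_always_survived) for a RAID level.

    Hypothetical helper for illustration only; sizes are assumed values.
    """
    if level == 0:
        # Striping only: full capacity, no redundancy.
        return disks * disk_size_gb, 0
    if level == 5:
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        # One disk's worth of capacity holds parity.
        return (disks - 1) * disk_size_gb, 1
    if level == 6:
        if disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        # Two disks' worth of capacity hold parity.
        return (disks - 2) * disk_size_gb, 2
    if level == 10:
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, 4 or more")
        # Mirrored pairs: half the capacity. One failure is always survivable;
        # more only if the failures land in different mirror pairs.
        return (disks // 2) * disk_size_gb, 1
    raise ValueError("unsupported RAID level")


for level in (0, 5, 6, 10):
    usable, tolerated = raid_summary(level, disks=6, disk_size_gb=300)
    print(f"RAID {level}: {usable} GB usable, survives {tolerated} failure(s)")
```

With one disk already failed and the data still readable, only the levels that tolerate at least one failure remain possible, which is why RAID 0 is ruled out.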

1. The backplane is the board where all the drives connect.  A system with a backplane does not have individual drive cables connecting the drives to the controller - the controller has a cable(s) that goes from the controller card to the backplane, and all the drives connect to power and data through the backplane board.  (SAS does stand for Serial-Attached SCSI, but do not confuse SAS with SCSI - the ONLY similarity is the internal commands supported on the drives - they are not otherwise compatible at all.)

2. DRAC stands for Dell Remote Access Controller.  This allows FULL remote access to the server (console access, so BIOS screens included), plus some other amenities for managing the server remotely.  This has nothing to do with your current problem, so I won't go into more detail (feel free to ask about this).  Mid-life kicker is an update to a device "mid-life" ... so it is a substantial upgrade to the hardware but still carries the same name.  The 2950 has three revisions (mid-life kickers) ... I, II, III.  They are all 2950's but have significantly different capabilities (TPM, RAM capacity, FSB, processor support, etc.).

3. TOE stands for TCP Offload Engine ... network performance feature, but again, irrelevant to your current issue.

4. This is just the parts description for your backplane CABLE - it indicates the 6-disk backplane option (there are 4, 6, and 8-disk options).  The backplane was originally manufactured for the PERC 5 - the PERC 6 came out later but is perfectly compatible.

5. Your RAID controller is the PERC 6/i.  The PERC 6 comes in two versions:  an Adapter and an Integrated version - the Integrated versions go in systems (like the 2950) that have a dedicated slot that is optimized for that particular controller.

Yes, your system has a SAS RAID controller (sometimes the Integrated version is called a "daughtercard").  The ONLY supported cards for the 2950 are SAS controllers ... SAS controllers can take SAS OR SATA drives.

It may rebuild automatically, but there is a chance it will not.  If not, then you need to assign the disk as a Hot-Spare to begin the rebuild.  If the drive shows as "foreign", the foreign config must first be cleared before you can assign it as a hot-spare.  This is all done in the CTRL-R BIOS utility for the controller (or from OMSA, which IS an option for ESXi setups as well).
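The order of operations described above can be written down as a small checklist function. This is only a sketch of the decision logic: the state labels `"foreign"` and `"ready"` and the function name `rebuild_steps` are made up for illustration, and the actual work is done by hand in the CTRL-R utility or OMSA, not by any script.

```python
from typing import List


def rebuild_steps(drive_state: str, rebuild_started: bool) -> List[str]:
    """Return the manual steps to take after inserting the replacement disk.

    drive_state is a hypothetical label for what the controller reports
    for the new disk: "foreign" or "ready".
    """
    if rebuild_started:
        # The controller started rebuilding on its own; nothing to configure.
        return ["monitor rebuild progress"]

    steps = []
    if drive_state == "foreign":
        # A foreign config must be cleared before the disk can be a hot spare.
        steps.append("clear foreign configuration")
    steps.append("assign disk as hot spare")
    steps.append("monitor rebuild progress")
    return steps


print(rebuild_steps("foreign", rebuild_started=False))
print(rebuild_steps("ready", rebuild_started=True))
```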

Changing the drive is the most basic and simple of all server maintenance.  Whatever you do, do NOT turn off the machine to swap the drive:  If it is hot-swap, swap it hot.

7 Posts

January 24th, 2013 09:00

1) "RAID 5 Logical Volume 0"... so yes, RAID5 :)

The rest has been answered by flash :-)

D.

4 Posts

October 27th, 2014 12:00

I believe that the hot swap drives were designed to minimize downtime so they can be done anytime. Personally, I'm cautious so I do all my server maintenance outside of "normal" network hours - just in case there would happen to be a problem with the drive re-building or such so I'm not put on the spot with nervous employees and management looking over my shoulder!  I've never had a problem occur yet during any swap, though.  

7 Technologist

 • 

16.3K Posts

December 18th, 2014 08:00

If the failing disk is showing Predictive Failure, then force it offline and let the hot-spare rebuild in its place, then replace the failing disk.

Never power down to replace a hot-swappable disk.

7 Technologist

 • 

16.3K Posts

January 24th, 2013 09:00

Good call ... I skimmed over that :)

4 Posts

January 28th, 2013 08:00

Okay, thank you so much for taking the time to explain it all, theflash1932!  Thanks for confirming the RAID5 configuration, damirc!  Appreciate it!  We're going to try to swap out drives after hours this Friday.  Wish me luck!  

2 Posts

April 18th, 2014 12:00

Hello Flash, I ran into this post because I was getting the same drive errors as Texican. I previously swapped one drive that was degraded on a raid 5 array also on a 2950, but I turned off the server... I know, rookie mistake. I realized this was a big no no after reading your excellent reply to this post.

Question, what can happen after swapping a hot swappable drive with the server turned off? Because before swapping out the drive, I only had one degraded drive. After I did this, I pretty much got the exact same errors as Texican ("Drive Fault - Assert" and "Critical Array - Assert"). I'm guessing I pretty much answered myself, but I would like to hear your input on it. Thanks!

7 Technologist

 • 

16.3K Posts

April 18th, 2014 16:00

Every time the controller boots up, it loads the configuration, including disk ID's and RAID metadata, then checks each disk for a matching configuration.  If its configuration does not match the disk configuration(s), then it has to decide what to do and which configuration to use.  While it usually is smart enough to tell which configuration it should use, there are a number of things that can throw it for a loop.  When the controller is unsure of which is the correct disk(s) and/or configuration(s), then things get dangerous.  Keeping firmware up to date is one of the best ways to ensure the controller knows how to properly determine the correct configuration(s) to load.

When you insert a disk "hot", the controller has already loaded its configuration and will ignore any subsequent configuration that it sees while it is up and running.  It will recognize the configuration and mark it as foreign (or recognize no configuration and mark it as ready), but it will NOT attempt to load a configuration from a disk inserted "hot", thus avoiding the problem described above.  While technically all SAS and SATA disks are "hot-swappable" by spec, when the disks are accessible from the outside of the system, it indicates that the disks' hot-swap-ability is recognized by the controller in how it handles its configurations.

Consider it a highly-recommended "best-practice". Like just yanking a flash drive without exiting ... it probably won't corrupt the drive or data, but it can.
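The difference between the two cases can be illustrated with a toy model. This is a heavily simplified sketch and not how the PERC firmware is actually implemented: the class names, the `config_id` field, and the returned strings are all made up, and treating blank disks as harmless at boot is an assumption of the model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Disk:
    slot: int
    config_id: Optional[str]  # RAID metadata found on the disk; None if blank


class ToyController:
    """Toy stand-in for a RAID controller; all behavior here is assumed."""

    def __init__(self, stored_config_id: str):
        self.config_id = stored_config_id
        self.booted = False

    def boot(self, disks: List[Disk]) -> str:
        # At boot, the stored config is compared against every disk's metadata.
        # Any disk carrying a different config forces the controller to choose.
        self.booted = True
        conflicts = [d.slot for d in disks
                     if d.config_id is not None and d.config_id != self.config_id]
        if conflicts:
            return "must decide which configuration to use"
        return "configuration loaded"

    def hot_insert(self, disk: Disk) -> str:
        # While running, the loaded config is never replaced by what is on
        # a newly inserted disk; the disk is only labelled.
        if not self.booted:
            raise RuntimeError("hot insert only applies to a running controller")
        return "ready" if disk.config_id is None else "foreign"


ctrl = ToyController("cfg-A")
print(ctrl.boot([Disk(0, "cfg-A"), Disk(1, "cfg-B")]))  # conflict found at boot
print(ctrl.hot_insert(Disk(5, "cfg-B")))                # labelled, not loaded
```

In the model, the same disk that creates a conflict at boot is just labelled foreign when inserted hot, which is the point of swapping with the server running.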

2 Posts

April 19th, 2014 21:00

Excellent information. Thank you so much. Now I have more insight on how to troubleshoot my problem. Thanks again!

2 Posts

October 27th, 2014 10:00

Another question dealing with hot swaps.. Is the swap OK to do while server is in use, or is after hours a better time?

7 Technologist

 • 

16.3K Posts

October 27th, 2014 20:00

Anytime a disk fails from an array (virtual disk), whether intentional or unplanned, the performance will be degraded, so if the server is heavily used and you have the ability to choose when to do it, then you should do it during off hours when the effect will be minimal.
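For a RAID 5 array like the one in this thread, part of the reason for the slowdown is that anything stored on the missing disk has to be recomputed from all the surviving disks. A minimal sketch of that XOR parity idea, using made-up 4-byte blocks and ignoring how a real controller rotates parity across stripes:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across a 6-disk RAID 5: five data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE"]
parity = xor_blocks(data)

# Suppose the disk holding data[2] fails. Reading that block now means
# reading every surviving block in the stripe and XOR-ing them together.
survivors = data[:2] + data[3:] + [parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[2]
print("rebuilt block matches the lost one:", rebuilt == data[2])
```

A rebuild onto the replacement disk does the same reconstruction for every stripe, which adds load on top of normal traffic until it finishes.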

2 Posts

October 28th, 2014 05:00

Thank you Texian and Flash for the info.. performed a hot swap right at the end of business hours on one of our clients... waited around and just waited for the standard IT explosion which is usually the case with my luck.  But had all green lights on disks, and server is up and running.  Appreciate all the info!... (what did IT people do before google by the way)?

2 Posts

December 16th, 2014 09:00

I have a PE2950 with PERC 6/i Integrated RAID controller configured for RAID5 with four drives and one drive assigned as a Global Hot Spare. One of four physical disks is showing a problem with status: Non-Critical. Windows is reporting the following events: Event 2346: Error occurred: Error on PD 01(e0x20/s1) (Error f0).:  Physical Disk 0:0:1 Controller 0, Connector 0 and Event  2405: Command timeout on physical disk:  Physical Disk 0:0:1 Controller 0, Connector 0

Although I noticed that the firmware version (FS04) is different from the remaining three disks (FS03), my hunch is the disk is going bad and I want to replace it.

A few noob questions:

What’s the best course of action here? Take the failing disk offline and let the Hot Spare kick in automatically or pull the drive out while server’s powered on and let the system replace it with the Hot Spare?

Also, I’m not clear whether I should set Hot Spare Protection Policy beforehand.

What is the difference between Connector 1 (RAID) where four disks, RAID5 members, are listed and Connector 1 (RAID) where the Hot Spare disk is assigned?

2 Posts

December 18th, 2014 07:00

Can someone respond to my inquiry above? TIA

5 Posts

November 23rd, 2016 22:00

I am new to the field of information technology and I am interested in this topic. Can you recommend specialized literature?