Solved!


January 22nd, 2014 08:00

Degraded RAID 1 Array: PowerEdge 2800

Hello to everyone. I hope someone can shed some light on this fairly unusual situation.

I have a PowerEdge 2800 which is due to be replaced before May this year; it's a great server and has done a great job over the last 7 years.

OK, on to the problem: Server Administrator was predicting a failure on drive 0:0.

In the server we have:

0:0 Maxtor 36GB 15k
0:1 Fujitsu 73GB 15k
1:2 Something ..
1:3 Something ..

So, I popped 0:0 and the server was working fine with a degraded array, and after 15 seconds I inserted it back into the PE2800. After approximately 45 minutes the RAID 1 array was rebuilt, but it still said predicted failure. So I thought I'd best order some new drives - I received them today: 2 x 146GB 15k Seagate Cheetahs.

OK, so I popped the 0:0 drive and inserted one of the 146GB Cheetahs, and the server crashed - it would not boot at all and was giving me an error saying the logical drive had failed.

So, in the PERC4i BIOS, I forced drive 0:1 online, and the server is now back up and running albeit with a degraded array.

So this is where I'm at; what choices do I have?

Insert the old drive with the predicted failure, leave it alone, or try the 146GB Cheetah again. I don't really fancy doing any of them, but I will order another 73GB Fujitsu in the meantime.

What I want to know is why the server crashed after inserting the Cheetah drive. I have swapped hot-pluggable hard drives many a time in many Dell servers and never experienced anything like this.

Currently running degraded, so any replies would be massively appreciated.

Thanks a lot.

7 Technologist • 16.3K Posts

January 22nd, 2014 15:00

"So, I'm now back to square one, where I have a RAID1 system with one predicted disk failure.  What would you do from here on in?"

Another thing that many people don't know is that a predicted failure (pf) drive should be forced offline before replacing it; otherwise the pf flag can be assigned to the new disk (until an actual 'rescan' is performed), which can result in some odd behavior (the controller is more sensitive to potential failure of pf drives).

"Common sense says I cannot mix-match 10k/15k rpm drives!"

You CAN mix/match speeds and/or sizes (there is nothing wrong with it 'technically'), but in real-world scenarios, you would never want to mix speeds (depending on the configuration, it would degrade the speed of the entire array).  One thing you CAN'T do is mix U320 and U160 SCSI drives on a backplane/controller.

"What would you do from here on in?"

I guess where to go from here depends on what you feel most comfortable doing.  There is a chance that the pf drive will operate just fine until you are done with the server (or until you can get another replacement, if you feel another replacement would be better than the ones you currently have), although it could cause additional issues if its status degrades.  There is a chance that it could happen again if you try to replace it, but I believe that is unlikely.  If it were me (and it's important to remember that it's not :)), I would force offline disk 0:0 and replace it "hot" with the other 146GB drive.  Another thing you could do, which would probably feel less risky, is to simply insert the 146GB drive into an open slot, assign it as a hot spare, then force offline disk 0:0, which will cause a rebuild onto the hot spare to start automatically.
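
If you'd rather script that second option than click around in OMSA, the same steps can be run from the OS with the omconfig/omreport CLI.  This is only a rough sketch: it assumes the CLI is installed, that the PERC shows up as controller 0, and that the disks are 0:0 (the pf member) and 0:4 (the open slot) - check the first omreport call for the real IDs before running the rest.

import subprocess

CONTROLLER = "0"
PF_DISK = "0:0"     # the predicted-failure member of the RAID 1
SPARE_DISK = "0:4"  # the freshly inserted 146GB drive in an open slot

def om(*args):
    """Run an OMSA CLI command and echo its output."""
    print(">", " ".join(args))
    result = subprocess.run(list(args), capture_output=True, text=True)
    print(result.stdout or result.stderr)
    return result

# 1. Confirm how the controller currently sees each disk (and the real IDs).
om("omreport", "storage", "pdisk", "controller=" + CONTROLLER)

# 2. Dedicate the new drive as a global hot spare.
om("omconfig", "storage", "pdisk", "action=assignglobalhotspare",
   "controller=" + CONTROLLER, "pdisk=" + SPARE_DISK, "assign=yes")

# 3. Force the predicted-failure member offline; with the mirror degraded,
#    the controller should start rebuilding onto the hot spare on its own.
om("omconfig", "storage", "pdisk", "action=offline",
   "controller=" + CONTROLLER, "pdisk=" + PF_DISK)

# 4. Watch the rebuild progress from the virtual disk view.
om("omreport", "storage", "vdisk", "controller=" + CONTROLLER)

Step 3 is the same "force offline" you would do in Storage Management; once the controller sees the mirror degraded, the rebuild onto the hot spare should kick off on its own.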

 

7 Technologist • 16.3K Posts

January 22nd, 2014 08:00

Important:  Did you replace disk 0 "hot", or did you power down the server to replace the drive?

7 Technologist • 16.3K Posts

January 22nd, 2014 08:00

You didn't say so specifically, and although people often throw around the term "hot-swap", they don't always realize what it really means, especially in the context of how the RAID controller uses it to manage the arrays.  Don't be offended that I asked - it is a VERY common mistake, made even by experienced IT staff.  You did it right ... just making sure :)

1. It could have been a fluke.  Firmware, static electricity, poor Moon/Venus alignment, etc.  You could try it again to confirm the issue.

2. It could be something about the 146GB drive that the controller doesn't like.  Is this a Dell-certified drive?  Do you have another system you can test the drive with?  It could be bad (shorts or other power issues can result in exactly what you saw).  You could try the other 146GB drive.

3. REALLY old PERC or BIOS firmware.  Support for 146GB and 300GB drives was added early on in the 28x0 lifecycle, as were a number of fixes for the controller - recovery, performance, etc.  What are your PERC and BIOS firmware versions? (One way to pull these from the OS is sketched after this list.)

4. It could be a slot issue.  The original drive may not be bad - it may simply be flagged as bad because of a fault with the slot.  You could put the 146GB drive in another slot, then assign it as a hot spare to begin the rebuild.

Obviously, these are educated guesses at possible causes - what you experienced is NOT normal, so educated guesses as to what could cause such an issue are all we have to go on.
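
By the way, for item 3 (and a quick sanity check on item 4) you don't have to reboot into the PERC BIOS - the firmware, driver, and per-disk details can be pulled from the OS with OMSA's reporting CLI.  A minimal sketch, assuming omreport is on the PATH and the PERC is controller 0; the exact field labels vary a little between OMSA versions:

import subprocess

def report(*args):
    """Return the text output of an omreport query."""
    out = subprocess.run(["omreport"] + list(args),
                         capture_output=True, text=True)
    return out.stdout

# Controller model plus firmware and driver versions (item 3).
print(report("storage", "controller"))

# Per-disk state, vendor, and bus info (item 4) -- look for anything flagged
# as predicted failure or left in an odd state after the swap.
print(report("storage", "pdisk", "controller=0"))

# System BIOS version, to compare against the latest PE2800 release.
print(report("chassis", "bios"))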

 

12 Posts

January 22nd, 2014 08:00

Hot. As I always do.

7 Technologist • 16.3K Posts

January 22nd, 2014 09:00

0. No offence taken :)

1. It could have been a fluke. Firmware, static electricity, poor Moon/Venus alignment, etc. You could try it again to confirm the issue.

- That was my first thought.

2. Yes, I'm sure it's certified - I got it from a certified partner for out-of-warranty Dell spares, and yes, I have another 2800 I can test it on before I do anything else; that was my second thought.

3. BIOS firmware A07, PERC 4e/Di firmware 5B2D, driver version 6.46.2.32, Storport driver version 5.2.3790.3959 - and it says the minimum required Storport driver version is 5.2.3790.4173?

4. Could be. This is a mail server for 100 employees - downtime is ultra critical - that's why I'm half tempted to leave it well alone until we can get a new server sorted, but realistically I don't want to be leaving a degraded array. A rock and a hard place comes to mind.

I totally appreciate your immediate replies by the way, thank you.

3. You should look into updating the storport driver once this is resolved:
http://www.dell.com/support/drivers/us/en/04/DriverDetails/Product/poweredge-2800?driverId=GDCG3&osCode=WNET&fileId=2731107941&languageCode=en&categoryId=SR

4. Understood.  You could try plugging in the drive after hours.  If it is a power issue with the drive, it will likely do the same thing when you plug it in.  If not, it will simply sit there until you do something with it.  You may not have issues with it over the next few months, but as a mail server it will get a lot of activity, and every second you are degraded, you run the risk of permanent data loss from read/write errors ... rock/hard place ... I get it :)
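
On the "every second you are degraded" point: if nobody is watching the server constantly, even a crude watchdog that polls OMSA and shouts when the mirror is not "Ok" buys a little peace of mind until the replacement box arrives.  A rough sketch only - it assumes the OMSA CLI and controller 0, and alert() here is just a placeholder for whatever notification you already use (OMSA's own alert actions are another option):

import subprocess
import time

CHECK_EVERY = 300  # seconds between checks

def vdisk_report(controller="0"):
    """Return omreport's virtual disk summary for the given controller."""
    result = subprocess.run(
        ["omreport", "storage", "vdisk", "controller=" + controller],
        capture_output=True, text=True)
    return result.stdout

def alert(message):
    # Placeholder: swap in mail, pager, or event log - whatever you already use.
    print("ALERT:", message)

while True:
    # omreport prints a "State : ..." line for each virtual disk; anything
    # other than Ok (Degraded, Failed, Rebuilding) deserves a look.
    for line in vdisk_report().splitlines():
        if line.strip().startswith("State") and "Ok" not in line:
            alert(line.strip())
    time.sleep(CHECK_EVERY)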

12 Posts

January 22nd, 2014 09:00

0. No offence taken :)

1. That was my first thought.

2. Yes, I'm sure it's certified - I got it from a certified partner for out-of-warranty Dell spares, and yes, I have another 2800 I can test it on before I do anything else; that was my second thought.

3. BIOS firmware A07, PERC 4e/Di firmware 5B2D, driver version 6.46.2.32, Storport driver version 5.2.3790.3959 - and it says the minimum required Storport driver version is 5.2.3790.4173?

4. Could be. This is a mail server for 100 employees - downtime is ultra critical - that's why I'm half tempted to leave it well alone until we can get a new server sorted, but realistically I don't want to be leaving a degraded array. A rock and a hard place comes to mind.

I totally appreciate your immediate replies by the way, thank you.

7 Technologist • 16.3K Posts

January 22nd, 2014 09:00

I only responded to 3 and 4 :)

12 Posts

January 22nd, 2014 09:00

Not sure what happened to 1 and 2 from your reply, but thanks anyway.

Just to add some further info:

I've reinserted the "predicted failure" drive back into the PE2800 and it says it has rebuilt the array successfully ... but even though Virtual Disk 0 has a green check mark, when you click Virtual Disk 0 it does not list 0:0, only 0:1, yet the logs say that it has rebuilt successfully?

Does that now mean I have to rescan via the OpenManage Server Administrator console?


7 Technologist • 16.3K Posts

January 22nd, 2014 11:00

It may simply need to be "refreshed".  You can try a rescan though, if refreshing doesn't work ... there is no harm in it.  What version of OMSA do you have?  Have you tried closing OMSA altogether and reopening?  You should be able to see the drive if it has rebuilt.
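
For what it's worth, the rescan can also be kicked off from the command line if the GUI stays stale.  A small sketch, assuming the OMSA CLI and the same controller 0 as before (the rescan task applies to SCSI controllers like the PERC 4e/Di):

import subprocess

# The CLI equivalent of the rescan task in Storage Management.
subprocess.run(["omconfig", "storage", "controller",
                "action=rescan", "controller=0"], check=True)

# Then re-list the members of Virtual Disk 0 -- after the rescan, both 0:0
# and 0:1 should show up under it.
subprocess.run(["omreport", "storage", "pdisk",
                "controller=0", "vdisk=0"], check=True)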

12 Posts

January 22nd, 2014 13:00

I've tried refreshing the page and deleting the history/cache etc., but it still doesn't show - (pause 2 minutes) The rescan feature worked! Hurrah.

So, I'm now back to square one, where I have a RAID1 system with one predicted disk failure.

What would you do from here on in?

Furthermore, I can't test those drives I purchased, as my other PE2800 has all 8 slots taken by 10k rpm drives, and my other server has 15k drives. Common sense says I cannot mix-match 10k/15k rpm drives!

So, do I just go out and purchase another 73GB drive? Any thoughts would be very welcome.

Edit: Dell OpenManage Server Administrator Version 6.2.0

Thank you.

 

12 Posts

January 22nd, 2014 15:00

Please forgive my lack of response (I'm in bed on my iPhone as it's nearly midnight here).

Firstly, I wasn't aware that one should force offline a PF disk, but that does make sense.

Just going back to what you said last about inserting the drive into one of the open slots (which I do have in this PE2800) - what you're saying is that the PE will automatically rebuild the array onto any hot spare that's available in the system once it detects a degraded array?

If I'm wrong, please tell me I am.

Also, can I assume then that if I make a disk 'offline', it's essentially a 'good copy' of the said data as of the time it was taken offline?

I'm just thinking that if something were to go horribly wrong, it'd be nice to have a disk that's a copy of the OS that I can just pop back in and force online, even if it was degraded/PF.

I should be honest at this point and say that I've not had any experience with hot spares and what their function is.

Very, very early start for me tomorrow - a downed mail server would be catastrophic, as you can probably imagine.

Thanks again for your replies!

7 Technologist • 16.3K Posts

January 22nd, 2014 15:00

"Just going back to what you said last about inserting the drive in to one of the open slots which I do have in this PE2800, what you're saying is, the PE will automatically rebuild the array to any hotspare that's available in the system once it detects a degraded array?"

That's right.

"Also can I assume then, if I make a disk 'offline' essentially that's a 'good copy' of the said data at that time before it was taken offline?"

I wouldn't make that assumption, especially about a disk with a predicted failure, but it could be used in its previous state to get the system back online.  If you ever do this though, you won't "force online" the drive ... you will need to boot to CTRL-M, Configure, View/Add and choose Disk View and save on exit to essentially "import" the configuration from the disk.  Only if it is the only disk available in the RAID 1 and currently showing FAILED should "force online" be used.

Hot spares are simply drives that are dedicated to automatically take over and rebuild should a disk fail.  They are very helpful for systems that are not actively monitored, or that sit somewhere an audible or visual indication of a failed drive won't be heard or seen.  Oftentimes, people will go for days, weeks, or even months with a failed drive, then lose everything when the second drive dies, because they had no idea that the first had failed.  A hot spare helps mitigate data loss - or downtime - in those situations by simply rebuilding when one fails.  It is also the ONLY way to rebuild a drive in some situations (when a drive does not show as 'failed' - you cannot 'rebuild' a 'ready' drive, only a 'failed' drive).

Get some sleep :)

12 Posts

January 23rd, 2014 01:00

Of course, logically the term "Hot Spare" means it's a hot spare, there for when one drive fails. Duh!

I can share the experience of having servers that are not actively monitored, apart from logging in via OMSA now and again, and in some cases, like you said, I've had drives that had failed for days or even a few weeks, but not months.

So, I think I'm at the point where I'm going to load the other 146GB drive I purchased (as I bought two) into a spare slot and see what happens. I can assign it as a global hot spare through OMSA, and then I have the choice of either waiting for the PF drive to fail or manually taking it offline.

In this scenario, would you say the best practice is to pull the PF drive whilst "hot", or go through the OMSA to take it offline? (not even sure if it can be done this way)

Thanks.

Edit:

I should add that the server can support 146GB SCSI drives as the drives in 1:2 and 1:3 are both 146GB 15k drives - Hitachis.

The ones I've purchased are Seagate Cheetah 15k.5 drives, model number ST3146855, which of course are U320.

The theory is that these drives *should* work, but after my initial experience with the first one, no wonder I'm extremely hesitant!

7 Technologist • 16.3K Posts

January 23rd, 2014 08:00

"In this scenario, would you say the best practice is to pull the PF drive whilst "hot", or go through the OMSA to take it offline?"

It can be done either way, but best practice with a pf drive is to force it offline from OMSA before removing it.

"I should add that the server can support 146GB SCSI drives as the drives in 1:2 and 1:3 are both 146GB"

Good to know, since you just referred to them as "something" in  your first post :)
