A few things: Array Manager and Server Administrator are not supported on XP, so you will not have a way to manage the array from within the OS. The PERC 4/SC does have an alarm, however, that will go off when a drive goes offline. It is enabled by default and it is very, very loud; if it is on, you'll know when there is a problem. To make sure it is enabled, you can go into the PERC BIOS: Objects, Adapter, Alarm Control.
If you have a hotspare, then upon a drive going offline the rebuild will start on the hotspare, and the controller alarm will not stop until the rebuild on the hotspare is complete.
As far as "gotchas", just because a drive goes offline, doesn't mean it is always bad. Never blindly replace a drive; make sure you know why the drive went offline first.
To replace a drive: if the drives are 80-pin on a backplane, just remove the failed drive and put in the new one; you can do this with the OS running if you'd like, as long as the RAID controller is initialized. If they are 68-pin, then you will need to bring the system down to replace the drive, then boot to the RAID controller and start a rebuild on the replaced drive.
I tried to install the Array Manager tonight and found out it will not install on XP, as you said.
Sounds like the alarm notification will be adequate for my purposes.
My hardware uses the 68-pin devices. Because it is only a workstation, there would be no problem with bringing it down to do a rebuild.
At the point where I do the rebuild, would the new (replacement) drive automatically become the new hot spare, or would I have to tell the controller to make it a hot spare? Or is there some procedure for doing the rebuild on the new drive, after which the old hot spare returns to being a hot spare once the new drive takes its place in the array?
Perhaps it would be best for me to practice this procedure before I put any valuable data on the machine. If I pull the power connector from a drive while the system is up, would this be an accurate simulation of a drive failure?
To perform the rebuild, would I have to boot from the System Management CD or could I do this using Ctrl-M during the boot sequence?
On the LSI website they have a "GAM TT" (Global Array Manager Transition Tool) available for download, and there is a version for Windows XP. Do you know anything about this tool? Perhaps it is an acceptable substitute for the Dell Array Manager.
THANKS again,
-Marty
Message Edited by martinmarty on 02-27-2007 12:15 AM
Message Edited by martinmarty on 02-27-2007 12:18 AM
After a rebuild of a failed drive using a hot spare, your PERC BIOS View/Add Configuration display will show the following (for example, if ID2 failed and ID3 was the hotspare):
ID0 Online A0-0
ID1 Online A0-1
ID2 Ready
ID3 Online A0-2
Your system event logs should have messages from your PERC driver about the drive failure and rebuild events.
And lastly, if you have status LEDs on your drives, the hot spare will show activity, and one of the drives (ID2 in my example) will have no activity.
Go to Objects, Physical Drives and check the F2 details on each drive, looking for a reason that ID2 went offline (media errors on ID2, other errors on any drive). If the failure is obvious and the repair action is to replace ID2, you then replace the drive, go into Objects, Physical Drives, select ID2, and select Set Hotspare. Your View/Add Configuration will now be:
ID0 Online A0-0
ID1 Online A0-1
ID2 Hotsp
ID3 Online A0-2
If you want to get the drives back to the original configuration, then you would force ID3 offline, allowing ID2 to rebuild. After the rebuild is complete, you go back and make ID3 the hotspare. (Not strictly needed, but it is a good idea unless you plan to keep track of the changed configuration after each drive failure.)
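For anyone trying to follow the state changes, here is a toy Python sketch of the sequence above. Everything in it (function names, state strings) is illustrative only; the real controller manages these transitions in firmware.

```python
# Toy model of the PERC drive-state sequence described above.
# States mirror the View/Add Configuration display: "Online A0-n", "Ready", "Hotsp".

def fail_drive(drives, failed_id, hotspare_id):
    """A drive fails (or is forced offline); the hot spare rebuilds into its slot."""
    slot = drives[failed_id]          # e.g. "Online A0-2"
    drives[failed_id] = "Ready"       # the offlined drive drops out of the array
    drives[hotspare_id] = slot        # the hot spare rebuilds into the failed slot

def set_hotspare(drives, drive_id):
    drives[drive_id] = "Hotsp"

# Start: ID3 is the hot spare, as in the example above.
drives = {0: "Online A0-0", 1: "Online A0-1", 2: "Online A0-2", 3: "Hotsp"}

fail_drive(drives, failed_id=2, hotspare_id=3)   # ID2 fails, ID3 rebuilds
# now: ID2 "Ready", ID3 "Online A0-2"

set_hotspare(drives, 2)                          # replace ID2, mark it hot spare
# now: ID2 "Hotsp"

# To return to the original layout: force ID3 offline so ID2 rebuilds,
# then make ID3 the hot spare again.
fail_drive(drives, failed_id=3, hotspare_id=2)
set_hotspare(drives, 3)
print(drives)
# {0: 'Online A0-0', 1: 'Online A0-1', 2: 'Online A0-2', 3: 'Hotsp'}
```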
The software on their site MIGHT work, but the F/W and the drivers for the PERC are not the same as what that S/W was written for. I've not tested it.
Dell-GaryS
Message Edited by DELL-GaryS on 02-27-2007 11:37 AM
Wow - cancel that firmware update :smileysurprised:
Now that I look again, my current firmware says version 3.28 not 1.28 so trying to flash it to 1L47 scares me! The controller's BIOS version says 1.05 - maybe that's the number referred to in 1L47.
In any case, the numbers I'm getting from the card don't seem to jibe with the numbers I'm getting from the LSI website. I don't want to screw up my firmware since the system is working at the moment.
Maybe there is something different between a PERC 4/SC and a LSI MegaRAID SCSI 320-1.
Is a PERC 4/SC actually a "special" version of an LSI 320-1, i.e., some special firmware just for Dell? Maybe I should be looking for a firmware update on the Dell site instead of the LSI site.
Thanks for your detailed answer. That is what I needed to know.
You were right. The software from the LSI website did not work for me. The GAM software did not recognize my disk controller. I'm going to try a firmware update just in case, plus it looks like there were really some big bugs fixed along the way. Mine is v1.28 and the current is 1.47. Hope I don't make a big paperweight out of this thing. :smileywink:
OK. I found the PERC 4/SC firmware v352B update on the Dell site and installed that and the system came up working.
After the update, the GAM TT software from the LSI website can now recognize my controller card and display the drive info, status, etc.
Only thing is,
BE CAREFUL WHAT YOU WISH FOR BECAUSE YOU JUST MIGHT GET IT...
Now that I have software reporting on the RAID status, it is logging errors on one of my drives! It says "Hardware impending failure: data error rate too high" on ID2. The device must be reporting enough errors to suggest that a failure is coming. It showed a hard error count of 5, which I reset just to see if the count goes back up over time.
I'm going to let the PC run tonight and see how many errors are logged. So far it seems to be generating the same error on the same device approximately every 4-5 minutes.
I may get to practice the drive replacement procedure sooner than I had anticipated.
There is error monitoring F/W on the drives (this is called SMART) that will warn you of a predicted failure. If you have had 5 media errors on ID2, and it's warning you about a predicted failure, your next step is to replace the drive. There is no way to silence the predictive failure outside of the drive maker's depot.
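The logic of that predictive warning can be sketched in a few lines of Python. The threshold value and the per-drive counts below are purely illustrative, not real Fujitsu or PERC figures; the actual decision is made by the drive's own SMART firmware.

```python
# Hedged sketch of SMART-style predictive failure reporting, as described above.
# The threshold (5) is illustrative; real drives use vendor-set values.

THRESHOLD = 5  # media errors before the drive flags a predicted failure

def drives_to_replace(media_errors):
    """Return the IDs whose media-error count has crossed the warning threshold."""
    return [drive_id for drive_id, count in sorted(media_errors.items())
            if count >= THRESHOLD]

# ID2 has logged 5 media errors, matching the situation in this thread.
print(drives_to_replace({0: 0, 1: 1, 2: 5, 3: 0}))  # -> [2]
```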
Ok, procedure... first, look at the F2 info on the other drives; if you have any media errors on them, you'll need to run a consistency check to remap the errors before failing ID2. Failing to remap the errors will cause your rebuild to either fail or to complete with errors (files are damaged in this case), in which case you're looking at a reinstall and a restore from backup. FYI, with the F/W update you should get the "completed with errors" case, which will allow you to repair the damaged files.
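To illustrate what a consistency check actually verifies (assuming the array here is RAID 5, which the three Online A0-n drives plus a hot spare suggest): the parity block of each stripe must equal the XOR of its data blocks. This is only a toy model; the real check also remaps unreadable sectors, which isn't shown here.

```python
# Toy RAID-5 stripe consistency check: parity must equal the XOR of the
# data blocks. Blocks are assumed to be equal-length byte strings.
from functools import reduce

def stripe_consistent(data_blocks, parity):
    computed = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), data_blocks)
    return computed == parity

d0, d1 = b"\x0f\xf0", b"\xaa\x55"
good_parity = bytes(x ^ y for x, y in zip(d0, d1))   # XOR of the data blocks
print(stripe_consistent([d0, d1], good_parity))       # True
print(stripe_consistent([d0, d1], b"\x00\x00"))       # False: inconsistent stripe
```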
Once you have determined that your other drives are good, or you've remapped the errors, back up your data. Next, turn off the hotspare and force ID2 offline, remove the ID2 drive, and replace it with a compatible drive. Set the new ID2 to be the hot spare, allow it to rebuild, then set ID3 back to being the hotspare.
Turn on patrol read (this does a nearly continuous consistency check in the background, finding and remapping any new media errors). If you had media errors on the other drives, you may want to consider replacing them.
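A rough sketch of what patrol read does in the background: walk every sector, attempt a read, and remap the ones that fail. The "disk" below is just a Python list with `None` standing in for an unreadable sector; the real remapping happens in drive firmware.

```python
# Toy model of a patrol-read pass over one drive.
# None marks a sector that returns a media error when read.

def patrol_read(disk, spare_pool):
    """Scan all sectors; remap unreadable ones to spares. Returns the remap count."""
    remapped = 0
    for lba, sector in enumerate(disk):
        if sector is None:                 # read failed: media error
            disk[lba] = spare_pool.pop()   # remap to a spare sector
            remapped += 1
    return remapped

disk = [b"ok", None, b"ok", None]
spares = [b"spare1", b"spare2"]
print(patrol_read(disk, spares))  # -> 2
```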
A few years back, Dell started making our drive F/W force the drives to exact sizes (73 GB, for example, would truly be 73 GB, not 73.03 GB). If you got the replacement drive on the open market, then it would not have the Dell F/W on it, hence the size difference. You're perfectly fine with this setup; no need to rebuild it.
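Some back-of-the-envelope arithmetic on the size mismatch, under the assumption that the BIOS reports binary megabytes (1 MB = 1,048,576 bytes); I haven't confirmed exactly how the PERC computes its figure, so treat this as a sanity check only.

```python
# Sanity-check the 70010 MB vs 69880 MB difference reported in this thread,
# assuming binary megabytes (an assumption, not confirmed PERC behavior).

MB = 1_048_576                      # bytes per binary megabyte

native = 70_010                     # MB reported by the open-market replacement
clipped = 69_880                    # MB reported by the Dell-firmware drives

diff_bytes = (native - clipped) * MB
print(native - clipped)             # 130 (MB difference between the drives)
print(round(diff_bytes / 1e6))      # 136 (million bytes "lost" to the clipping)
```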
AUTO is a good setting for your patrol read.
I'm glad to assist, my regular job gave me some spare time today, it usually keeps me busy, hence the sometimes 2 week gap between a flurry of posts from me. :-)
I followed your procedure, which was better than what I would have done on my own because yours ends up with the hot spare still being ID3. I probably would have just removed the power plug from ID2 and then let the ID3 hot spare take over and then replaced ID2 and designated it as hot spare.
After forcing ID2 offline, the ID2 drive immediately went to Failed status and then the controller would not let me make it a hot spare because it was in failed status, so I just initiated a manual rebuild and waited for it to finish. This took a little over 1.5 hours for a 73.5GB drive. I just stayed in the Ctrl-M screen while the rebuild proceeded.
I was very pleased to see that after it finished XP came up running just like nothing ever happened and the drive is no longer logging errors. None of the drives are logging errors.
The only thing that seems a little unusual is that the "rebuilt" drive came out to 70010 MB while the others are 69880 MB. They are the exact same model drives, Fujitsu MAP3735NP. I wonder if this is because the RAID array was originally created by the v3.28 firmware but the rebuild was performed using my updated 3.52B firmware. Do you think this is true? For the reliability of the system, do you suppose I should start over and rebuild the whole array under the 3.52B firmware? (I am setting this machine up from scratch so there is no valuable data on it yet, it would just mean a repeat of the whole XP and software installation).
My Patrol Read is set to "Auto". Is this the right setting?
I appreciate all the help and advice you have given me.
Thanks again,
-Marty
Message Edited by martinmarty on 02-28-2007 03:32 PM
I found a function in the GAM TT software to display the Patrol Read status. It indicates that 2 iterations of the Patrol Read have occurred since system startup and that it is 28% through a third iteration. It has gone up a couple percent as I've been typing these replies, so I guess the "Auto" setting is OK because Patrol Reads are occurring. I leave my PCs on all the time for the most part, so it should have plenty of time to patrol.
Thanks,
-Marty
p.s. None of my currently installed drives are showing any media errors so I think I am "good to go", but I think I better order another spare or two while they are still available.
Message Edited by martinmarty on 02-28-2007 03:48 PM
Thanks for the info and reassurance on the variation in drive sizes. Yes, I did get the drive on the open market so it probably did not have the Dell firmware.
Looks like my hot spare has a bad spot. The Patrol Read is kicking out errors on that drive, ID3, near the end of every Patrol Read cycle. It reports a series of two "Read retries exhausted" messages followed by "Unable to recover medium error during patrol read".
Does this drive need to be replaced or can I run the consistency check to remap the bad spots?
(I'm guessing that the Patrol Read process would have remapped the bad spots if it was possible to do so)
I'm also betting that I should immediately un-designate this drive as a hot spare. If one of the other drives goes bad, I don't want the system to move the data onto a malfunctioning hot spare. Correct?
That would be a yes. It appears to me that hot spares tend to have a shorter life than data drives; perhaps they are "on" all the time or something. What does the F2 info report on ID3 in the controller BIOS? (Objects, Physical Drives)
Well... at the moment it says zero errors, but I think that is because I mistakenly reset the error count via the GAM software after I made the drive not be a hot spare anymore.
When Patrol Read originally alerted me to the problem, there were 32 hard errors shown for that drive and I believe that number would have increased with each patrol cycle. I could hear the drive clicking away doing its retries when it got to the bad spots, and see the errors logged simultaneously.
That's weird about the hot spares having a shorter life. Maybe when my replacement arrives I'll just leave it uninstalled until it's needed. On the other hand, it's kind of a sense of security to know that the Patrol Read has checked it out so it is ready if needed. I've ordered some used drives from eBay so it would be nice to know if they work. Maybe I'll install them and let Patrol Read exercise them for a couple of days and if they're good, I'll unhook them until needed.
Decisions, decisions... :smileywink:
Thanks,
-Marty
Message Edited by martinmarty on 03-01-2007 02:29 PM