June 17th, 2016 04:00

Battery and Disk failures

Hello,

We have a PowerEdge R510 server with a PERC H700 RAID controller installed. So far we have configured 2 virtual disks. The first virtual disk consists of 4 HDDs x 2 TB each, configured as RAID 5.

The second virtual disk has 4 HDDs x 3 TB each.

We have noticed that 2 hard disks (2 TB each) from the first virtual disk blink orange. We ran the ESET tool and the SMART tool (the server runs CentOS), and the report shows that the hard disks are in predicted failure. The report also shows that the controller battery is in a failed state.

What are the proper steps to follow to replace the battery and the drives?

Best regards

Maria


Moderator


6.2K Posts

June 17th, 2016 10:00

Hello

I would start by making sure that all important data is backed up.

You are experiencing several issues at once. I think someone should review your controller log to determine what is going on. There are a lot of possibilities. Rather than making a lot of guesses it would be easiest to just look at a TTY/controller log and see what is going on.

Please do not copy/paste a log into the forum. Upload it somewhere or use one of the paste.it/pastebin sites and provide a link. It will be easy to check the status of the drives and battery from the log.

Thanks

10 Posts

June 20th, 2016 03:00

Dear Daniel

Many thanks for your quick reply.

I installed and ran MegaCli and got the TTY log.

You can find it at the following link: https://owncloud.cs.ucy.ac.cy/public.php?service=files&t=16aff642961fa11e487bc9f08b7a1786

If you need anything else please do not hesitate to email me

Regards

Maria

Moderator


6.2K Posts

June 22nd, 2016 11:00

> I installed and ran MegaCli and got the TTY log.

I receive an error that the link is no longer available or I don't have access to view it.

Thanks

10 Posts

June 23rd, 2016 03:00

Apologies for the wrong link.

Please try this one:

https://owncloud.cs.ucy.ac.cy/public.php?service=files&t=9b970f3ef94ab18e96c37fa5712e64c8

Thank you

Maria

Moderator


6.2K Posts

June 23rd, 2016 10:00

The battery has an absolute state of charge of 14%. A 100% absolute charge means that the battery is capable of 72 hours of operation at full charge (100% relative charge). When the capacity drops below 33% (24 hours) it is considered failed, because the PERC battery is supposed to be capable of 24 hours of operation in the event of power loss. The battery has been in a failed state for quite some time, so it is unlikely that it is related to the predictive failure hard drives.
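The arithmetic behind those figures can be checked with a short sketch. It assumes only what the post states: 100% absolute charge corresponds to 72 hours of operation, and the battery counts as failed below 24 hours. The function and constant names are illustrative and do not come from any Dell tool.

```python
FULL_CAPACITY_HOURS = 72  # operation time at 100% absolute charge, per the post
REQUIRED_HOURS = 24       # below this the battery is considered failed, per the post


def retention_hours(absolute_charge_pct):
    """Estimated hours of operation at full relative charge."""
    return FULL_CAPACITY_HOURS * absolute_charge_pct / 100


def is_failed(absolute_charge_pct):
    """True when the estimated operation time is below the 24-hour requirement."""
    return retention_hours(absolute_charge_pct) < REQUIRED_HOURS


print(round(retention_hours(14), 2))  # 10.08
print(is_failed(14))                  # True
```

At 14% the estimate is roughly 10 hours, well under the 24-hour requirement, which matches the failed state reported for this battery.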

The log is very short, so I am unable to see when, or under what circumstances, PD 1 and PD 2 went into predictive failure. Given that the system has been experiencing errors with the battery for quite some time, I suspect the drives have been having errors for some time as well. The log does not go far enough back for me to see when the block errors started on either drive.

If any data on the array is important I recommend a backup before doing anything. The first thing I would do is replace the battery. If you have to wait on a battery and have drives available to replace then you can go ahead and replace the drives while you wait on the battery.

RAID 5 can tolerate only a single drive failure. If two drives go offline then the array will fail, so you will have to replace the drives one at a time. Pick one of the drives and replace it. After that drive has completed the rebuild, replace the second drive and let it rebuild. Replacement procedure for the drives:

  1. Offline the drive you are replacing
  2. Physically remove the drive you are replacing
  3. Insert the replacement drive into the server
  4. Wait for the rebuild to complete

You should perform the replacement "hot", while the server is on, so that the controller can detect the drive removal and insertion. If you perform these steps with the server off then there will likely be additional steps required. Also, it is very important that the rebuild complete on the first drive before replacing the second drive. If you start replacing the second drive before the first drive is fully online the array will fail.
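The ordering constraint described above can be sketched as a small planning helper. This is only an illustration of the sequence: it does not talk to the controller, and the state string "Optimal" and the `rebuild_in_progress` flag are assumptions about what you would read from the controller BIOS or OMSA.

```python
def replacement_plan(drives):
    """Return the ordered steps for replacing drives one at a time."""
    steps = []
    for drive in drives:
        steps += [
            f"offline {drive}",
            f"remove {drive}",
            f"insert replacement for {drive}",
            f"wait for rebuild of {drive} to complete",
        ]
    return steps


def can_start_next(vd_state, rebuild_in_progress):
    """Allow the next replacement only once the array is healthy again.

    vd_state is the virtual disk state as reported by your management
    tool (assumed here to be the string "Optimal" when healthy).
    """
    return vd_state == "Optimal" and not rebuild_in_progress


for step in replacement_plan(["PD 1", "PD 2"]):
    print(step)
```

The point of `can_start_next` is that the second drive is never touched while the first rebuild is still running, since a RAID 5 array cannot survive two missing members.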

After you complete all of these steps to bring the array back to optimal I would take additional steps to monitor the server. When an array operates in the state that this array is in for long periods of time the possibility of corruption increases. It is possible that the array may have been corrupted due to all of the bad blocks across the drives. Since the array is a logical layer the corruption can be transferred to the replacement drives. Since I can't determine from the current log if that has taken place I recommend that you monitor the system more than normal for a couple of months. If the array is corrupted then more drives will go predictive failure.

  • Pull weekly controller logs and save them for your records
  • Increase the frequency of critical data backups

I would perform the above steps for at least two months. If you receive no further drive failures, predictive failures, or several bad block errors during that time I would go back to normal operation. Bad blocks occur all the time on drives, but if you see a lot of them in the log that is cause for concern.
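A weekly log pull can be scripted so that each export gets a dated file name. The sketch below only builds and prints the command; it assumes the MegaCli utility mentioned earlier in the thread, an install path of /opt/MegaRAID/MegaCli/MegaCli64 (adjust to your system), and a file naming scheme invented for this example.

```python
import datetime

# Assumed install path for MegaCli; change it to match your system.
MEGACLI = "/opt/MegaRAID/MegaCli/MegaCli64"


def log_filename(day):
    """Dated file name for one export (naming scheme is illustrative)."""
    return f"perc-events-{day.isoformat()}.log"


def export_command(day):
    """Command line that writes the controller event log to a dated file."""
    return [MEGACLI, "-AdpEventLog", "-GetEvents", "-f", log_filename(day), "-aALL"]


# Print the command for today's export; run it manually or from cron.
print(" ".join(export_command(datetime.date.today())))
```

Keeping one file per week makes it easy to compare exports and to see whether bad block messages are increasing over the monitoring period.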

Thanks

10 Posts

June 24th, 2016 04:00

Daniel,

I would like to thank you for your explanatory answer. 

I am not sure if the log file I post today can help you. I  create the EventLog since commissioning of the controller. You can find it at the link below

https://owncloud.cs.ucy.ac.cy/public.php?service=files&t=cab229612ba5c654a9b6056f78fee630

In addition I would like to ask you some questions:

1. Which tool should I use to take the drives offline, rebuild, etc., as you described above? Should I do this through MegaCli or through OMSA?

2. We haven't installed OMSA so far (the server runs CentOS 6.5). Can you tell me which version to install? I recently tried to install it on CentOS 7.2 and it failed.

3. We haven't updated the system drivers and firmware for a long time. Should we update the server before or after the hard disk replacement? I recently used the bootable USB method described here and it was amazing!! https://www.dell.com/support/article/us/en/19/SLN296511?docLang=EN#collapse700

4. Which tool do you suggest for server and disk monitoring, as you mentioned above? iDRAC has not been set up on the server. Should we set it up, or can we monitor through OMSA? We run Nagios on several machines; do you think Nagios is enough?

5. We are in the process of setting up 3 more Dell servers (R520, R220). Which tools should I install to monitor them as well?

Thank you

Maria

Moderator


6.2K Posts

June 24th, 2016 11:00

> I am not sure whether the log file I am posting today will help you. I exported the event log covering the period since the controller was commissioned. You can find it at the link below:
>
> https://owncloud.cs.ucy.ac.cy/public.php?service=files&t=cab229612ba5c654a9b6056f78fee630

The text file I downloaded was not formatted. It was just a large block of text, and it is extremely difficult to read an unformatted log file. I searched through the file and did not see any uncorrectable or badlba messages. The log shows that PD 1 went into predictive failure in September 2014. I did not review the log to see if that is the same disk or if it was replaced at some point. Array corruption would show in a log as the same logical block address being uncorrectable or bad on more than one disk in the array. I did not see that in the log. I would still follow the instructions I posted in my first message.
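The corruption check described above, the same logical block address reported bad on more than one disk, can be sketched as follows. The input is a list of (disk, LBA) pairs that you would extract from the log yourself; the sample values are made up for illustration and are not taken from the posted log.

```python
from collections import defaultdict


def shared_bad_lbas(events):
    """Return LBAs that were reported bad on more than one disk.

    events is an iterable of (disk_name, lba) pairs.
    """
    disks_by_lba = defaultdict(set)
    for disk, lba in events:
        disks_by_lba[lba].add(disk)
    return {lba: sorted(disks) for lba, disks in disks_by_lba.items() if len(disks) > 1}


# Made-up sample: LBA 1000 appears on two disks, LBA 2000 on only one.
sample = [("PD 1", 1000), ("PD 2", 1000), ("PD 1", 2000), ("PD 1", 1000)]
print(shared_bad_lbas(sample))
```

An empty result corresponds to what was found in this log: no address shared across disks, so no sign of array-level corruption.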

> 1. Which tool should I use to take the drives offline, rebuild, etc., as you described above? Should I do this through MegaCli or through OMSA?

You can do it through the controller BIOS or OMSA.

> 2. We haven't installed OMSA so far (the server runs CentOS 6.5). Can you tell me which version to install? I recently tried to install it on CentOS 7.2 and it failed.

CentOS is not a supported operating system. If you go to the download page for the server and select the equivalent version of RHEL as the operating system that version would likely work.

> 3. We haven't updated the system drivers and firmware for a long time. Should we update the server before or after the hard disk replacement? I recently used the bootable USB method described here and it was amazing!! https://www.dell.com/support/article/us/en/19/SLN296511?docLang=EN#collapse700

It is best to perform updates prior to attempting recoveries or rebuilds. Firmware updates on enterprise equipment almost always correct issues and improve functionality. Rarely is a firmware or driver update provided that simply adds or removes a feature without providing some type of fix or improvement. Being at the latest firmware will put the system in the best position to perform the tasks you are requesting of it.

> 4. Which tool do you suggest for server and disk monitoring, as you mentioned above? iDRAC has not been set up on the server. Should we set it up, or can we monitor through OMSA? We run Nagios on several machines; do you think Nagios is enough?

We have a lot of management and monitoring tools. I would suggest researching our OpenManage product line. OMSA is the best tool for monitoring a single server. OpenManage Essentials is our tool for monitoring many systems from a central location.

> 5. We are in the process of setting up 3 more Dell servers (R520, R220). Which tools should I install to monitor them as well?

You can use the same tools on those.

Thanks

10 Posts

July 5th, 2016 05:00

Hello Daniel,

We have just updated the BIOS/firmware of the server using the BootableISO_2016-06-07_16-51-40. I noticed that the BIOS and RAID firmware, among others, have been upgraded, but after two reboots of the server I get the message "System Services Required Updated" during the Dell splash screen.

1. What is this message? Did I omit something?

2. The machine has no iDRAC installed (not even the Express version). There is only the BMC Configuration Utility. Do you think that setting up the BMC will help me monitor the server's hardware?

Thank you

Maria

Moderator


6.2K Posts

July 6th, 2016 11:00

> 1. What is this message? Did I omit something?

The message indicates that the USC/LCC is not loading properly. Even without an iDRAC installed I think the USC is present. I think an iDRAC is required to update to the LCC. This should be the latest USC/LCC version that you can install with just a BMC:

http://www.dell.com/support/home/us/en/04/Drivers/DriversDetails?driverId=JR7CG

I would run that update from the operating system to see if it corrects the system services loading error.

> 2. The machine has no iDRAC installed (not even the Express version). There is only the BMC Configuration Utility. Do you think that setting up the BMC will help me monitor the server's hardware?

The BMC allows for better hardware management and monitoring. I would suggest using one of our OpenManage applications like OpenManage Server Administrator to monitor the system via the BMC. The BMC also works with industry standard protocols like IPMI, so you can use 3rd party monitoring and management software with it. 3rd party utilities will require more setup and configuration than our tools for them to function properly.
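As an example of the extra setup a 3rd party tool needs, a Nagios-style check has to turn hardware states into the standard plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below shows only that mapping. The component names and state strings are hypothetical, and collecting the states from OMSA or IPMI is left out.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Hypothetical state names; map whatever your data source reports.
SEVERITY = {"ok": OK, "predictive_failure": WARNING, "failed": CRITICAL}

# Order in which results win when several components disagree.
PRIORITY = [CRITICAL, UNKNOWN, WARNING, OK]


def check(components):
    """Return (return_code, message) for a dict of component -> state."""
    if not components:
        return UNKNOWN, "UNKNOWN - no component data"
    codes = {name: SEVERITY.get(state, UNKNOWN) for name, state in components.items()}
    worst = next(code for code in PRIORITY if code in codes.values())
    problems = [f"{name}={components[name]}" for name in sorted(codes) if codes[name] != OK]
    detail = ", ".join(problems) if problems else "all components ok"
    return worst, f"{LABELS[worst]} - {detail}"


code, message = check({"battery": "failed", "PD 1": "predictive_failure", "PD 0": "ok"})
print(code, message)
```

A real plugin would print the message and exit with the return code so that Nagios can raise the matching alert.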

Thanks
