
July 23rd, 2018 07:00

PERC H700 2 RAID 5 drives predictive failure

Hello,

A couple of days ago I rebooted my PowerEdge R710. After the reboot, one of the hard drives in my 3-drive RAID 5 array was blinking amber. Dell OpenManage showed the drive in a predictive failure state. Today (2 days later) I logged into OpenManage again, and now a 2nd hard drive is shown in a predictive failure state. I just ordered 2 new Dell-branded hard drives, but now I'm a little uneasy and think there will be more to this than just replacing the drives.

I also noticed that the PERC H700 firmware is out of date.

Can you please give me some guidance on how to resolve my issue?

 

Thank you in advance.

Moderator • 6.2K Posts

July 23rd, 2018 12:00

Here is a similar post; it also contains links to another post on the subject of replacing predictive failure drives.

https://www.dell.com/community/PowerEdge-Hardware-General/PowerEdge-R620-drive-predicted-failure/td-p/6111997

Moderator • 6.2K Posts

July 23rd, 2018 11:00

Hello,

The first step would be to make sure any important data is backed up.

The first troubleshooting step with a PERC is to get a controller/TTY log. Every time you restart the system or do anything on the controller, new lines of information are added to the log. The log only stores the last 10,000 lines, so each restart that adds new entries also pushes the oldest entries out of the log.
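If you prefer the command line over the OMSA web interface, something like the following should export the TTY log. This assumes OMSA is installed and the H700 is controller 0; confirm the ID first with omreport storage controller.

    omconfig storage controller action=exportlog controller=0

On Linux the exported file typically lands in /var/log as lsi_<date>.log.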

After you get a log, I would suggest updating all firmware and then running a consistency check. You will want to run the consistency check before replacing the drives. If there are any issues with the virtual disk, you should get uncorrectable errors from the consistency check. You may need to review the controller log to see the consistency check errors.
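For reference, the consistency check can also be started and monitored from the OMSA CLI. A sketch, assuming controller 0 and virtual disk 0 (confirm the IDs with omreport storage vdisk controller=0):

    omconfig storage vdisk action=checkconsistency controller=0 vdisk=0
    omreport storage vdisk controller=0

The second command should show the virtual disk state along with the progress of the check while it runs.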

http://www.dell.com/storagecontrollermanuals/

Thanks

11 Posts

July 23rd, 2018 12:00

Daniel,

I read through your posts. They are very helpful, thank you. My last questions for now are:

1) I'm assuming I should replace one disk at a time: after I replace the first failing disk and the rebuild is complete, then replace the other failing drive?

2) Since 2 of my 3 RAID 5 drives are in a predictive failure state, I am afraid that the 3rd drive will also enter a predictive failure state before I can swap out the drives. Have you seen this happen before?

11 Posts

July 23rd, 2018 12:00

Daniel, thank you.

I just exported a controller log, and I also took a snapshot of the controller info from OMSA, attached below. So the next steps will be: back up, update firmware, then run a consistency check. If the consistency check comes back OK, what steps should I follow to replace the failing drives?
Perc H700 info.PNG

11 Posts

July 24th, 2018 11:00

Daniel,

I updated the PERC controller firmware last night, restarted, then ran a consistency check on the virtual disk. The consistency check came back OK.

Per the other posts you linked, I will follow those instructions for the hard drive replacements. I'm assuming I should replace one failing drive at a time: after the first drive is replaced and the rebuild of the array is complete, replace the second drive, let that rebuild complete, run a consistency check again, then run hard drive diagnostics?

Does that sound good, or am I looking at a potential punctured array given my 2 drives in predicted failure? What I can tell you is that I searched the controller logs for "punctur" and no errors related to a punctured array came up.

Thanks

Moderator • 6.2K Posts

July 26th, 2018 09:00

Yes, you can only replace one drive at a time. If you attempt to replace another drive before the rebuild is complete, the virtual disk will fail.

It is not necessary to run a consistency check in between or after replacing the disks.
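If you want to keep an eye on the rebuild, the physical disk state and progress are visible in OMSA, or from the CLI with something like the following (assuming controller 0):

    omreport storage pdisk controller=0

Wait for the replaced disk to go from Rebuilding back to Online before swapping the next drive.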


@itechroz wrote:
I ran a search of "punctur" from the controller logs and no errors related to punctured array came up in my search.

Some of our controllers will announce a puncture if they find one. The best way to find a puncture is to locate the reported bad blocks or bad LBAs. Make a list of the logical block addresses and search for each one. If you find the same LBA marked bad on more than one disk then the array is punctured.
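If the list of LBAs gets long, a small script can do the cross-referencing for you. This is only a sketch: the disk IDs and LBA values below are placeholders that you would fill in by hand from the controller log.

    from collections import defaultdict

    # Bad LBAs collected manually from the TTY log; the IDs and values here are placeholders.
    bad_lbas = {
        "0:0:0": [0x12AB34, 0x12AB35],
        "0:0:1": [0x12AB34],
        "0:0:2": [],
    }

    # Map each LBA to the set of disks that report it as bad.
    seen = defaultdict(set)
    for disk, lbas in bad_lbas.items():
        for lba in lbas:
            seen[lba].add(disk)

    # The same LBA marked bad on more than one disk points to a puncture.
    for lba, disks in sorted(seen.items()):
        if len(disks) > 1:
            print("LBA 0x%X is bad on disks %s -> possible puncture" % (lba, ", ".join(sorted(disks))))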

Scenarios that typically result in a large number of predictive failure drives are punctured arrays, incompatible hardware, and faulty hardware in the RAID chain.

11 Posts

July 26th, 2018 15:00

Daniel,


Thanks for your expertise. Last night I replaced the failing hard drives and then ran online diagnostics on the drives. Everything is reporting healthy now, but I will follow your advice regarding bad blocks: I will go through the logs to determine whether any exist and whether they exist on multiple drives. I hope that there are none!

Thanks again for your assistance.
