itechroz
1 Nickel

PERC H700 predictive failure continues with new drive


Hi,

The system is a PowerEdge R710 with a PERC H700 integrated controller. The virtual drive in question is RAID 5 with 3 drives. I have been monitoring the controller logs weekly and running consistency checks for the past 3 months with 0 issues. Suddenly yesterday, OMSA started showing 1 drive in predictive failure. I went through the logs and saw a number of unexpected sense entries, many of them stating "corrected medium error", but no unrecoverable errors.

I went onsite, took the drive offline, and inserted a new Dell-branded drive. The rebuild completed, but the new drive also reported predictive failure. The logs showed many more unexpected sense entries during the rebuild. I then ran a consistency check, which again produced many unexpected sense entries. The drive was still in predictive failure, so I replaced it again with another new drive. This time the drive did not show predictive failure after the rebuild. Just to be sure, I ran another consistency check, which put the same drive back into predictive failure, again with a number of unexpected sense entries, many of them stating "corrected medium error". I doubt that both drives are bad, and I am unsure how to proceed.

The firmware for the controller, the drives, and the BIOS is up to date, but iDRAC6 and the Lifecycle Controller are out of date. iDRAC6 is at 1.92.00 (build 5) and the Lifecycle Controller is at 1.4.0.445.

Please let me know your thoughts. If you recommend updating iDRAC6 and the Lifecycle Controller, please let me know where to find the updates and the steps for updating. I am a little confused about where to locate these updates on the support page. Also, can I update straight to the latest version, or does it have to be done in steps?

Thank you in advance!

10 Replies
Moderator

Re: PERC H700 predictive failure continues with new drive


Hello

I suspect that the virtual disk is punctured. Predictive failure status is a SMART issue that is reported by the firmware on the disk. When a virtual disk has a puncture, the rebuild copies the bad virtual disk information onto the replacement disk. This bad information can cause false bad blocks, which can lead to a predictive failure status.

You will need to review the controller log to look for a puncture. If there is a puncture then you should see the same bad block listed on two or more disks. I think the H700 is the first controller that will report a puncture. You can search the log for the word puncture. The log only stores the last 10k lines, so if you have been replacing and rebuilding disks the information may be gone.
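The two checks described above (search for the word "puncture", and look for the same bad block on two or more disks) can be scripted against an exported log. This is only a sketch under stated assumptions: it assumes the log contains "Raw Sense for PD n:" lines shaped like the ones quoted later in this thread, that those bytes are fixed-format SCSI sense data, and that "same bad block" means the same reported LBA on different physical disks. The function name `scan_log` is made up for illustration.

```python
import re
from collections import defaultdict

# Assumed line shape, copied from the "Raw Sense for PD 3: f0 00 03 ..." lines in this thread.
RAW_SENSE_RE = re.compile(r"Raw Sense for PD (?P<pd>\d+): (?P<raw>[0-9a-fA-F ]+)")


def scan_log(lines):
    """Return (lines mentioning 'puncture', {bad LBA: [PD ids]} for LBAs seen on 2+ PDs)."""
    puncture_lines = []
    pds_by_lba = defaultdict(set)
    for line in lines:
        if "puncture" in line.lower():
            puncture_lines.append(line.rstrip())
        match = RAW_SENSE_RE.search(line)
        if not match:
            continue
        try:
            sense = bytes.fromhex(match.group("raw"))
        except ValueError:
            continue  # truncated or malformed hex dump
        # Fixed-format sense: bit 7 of byte 0 marks the information field
        # (bytes 3..6) as valid; for a read error it holds the failing LBA.
        if len(sense) >= 7 and sense[0] & 0x80:
            lba = int.from_bytes(sense[3:7], "big")
            pds_by_lba[lba].add(int(match.group("pd")))
    shared = {lba: sorted(pds) for lba, pds in pds_by_lba.items() if len(pds) > 1}
    return puncture_lines, shared

# Usage (the file name is a placeholder):
#   with open("controller.log", errors="replace") as log:
#       punctures, shared = scan_log(log)
```

An empty `shared` result does not rule out a puncture, since the log only keeps the most recent entries.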

If the virtual disk is punctured, you will need to delete the virtual disk, create and initialize an unlike virtual disk, and then create the desired virtual disk. This process is data destructive, so you will need to back up any important information first.

http://www.dell.com/storagecontrollermanuals/

Thanks

Daniel Mysinger
Dell EMC, Enterprise Engineer

Get support on Twitter @DellCaresPRO

itechroz
1 Nickel

Re: PERC H700 predictive failure continues with new drive


Thanks for your response Daniel.

I saved copies of all the logs before exporting, and just did a search for "puncture" on all the log files, which turned up 0 results. All the unexpected sense errors are on PD 03, which is the drive I replaced twice.

I do agree that there is something wrong with the virtual disk.  Do you have any other suggestions before I re-create virtual disk?

Moderator

Re: PERC H700 predictive failure continues with new drive


If you would like to upload the controller logs to a text sharing site and provide links, I or someone else in the community could review them. I think it would just be verifying the puncture, though.

Daniel Mysinger
Dell EMC, Enterprise Engineer

itechroz
1 Nickel

Re: PERC H700 predictive failure continues with new drive


Thank you. After opening the links below, you will have to click "new version" in the bottom-left corner to view the text.

http://pasted.co/d1dc28d9

http://pasted.co/8698e7eb

http://pasted.co/deac49b5

I couldn't upload all the logs because I think they are too large.  But these are the first 3 from yesterday.

Thanks again.

Moderator

Re: PERC H700 predictive failure continues with new drive


That site is blocked by our proxy. I had someone on a different network download them and send them to me. They said that the pages are blank. I would suggest uploading again to pastebin.

Daniel Mysinger
Dell EMC, Enterprise Engineer

itechroz
1 Nickel

Re: PERC H700 predictive failure continues with new drive


Ok, no problem.  Below are 2 logs from yesterday.  My other logs are too big, but I can sign up for an account if you need more info.   Thanks

https://pastebin.com/htkUVcJM

https://pastebin.com/DuXuEewk

Moderator

Re: PERC H700 predictive failure continues with new drive


I don't see a puncture in those logs. It looks like they go back to 11/30/2018. During the rebuild the controller is encountering a lot of bad blocks on the virtual disk. Here is a snippet of what is occurring several times during the rebuild.

02/03/19 14:49:52: MedErr is for: cmdId=9a, ld=1, src=1, cmd=1, lba=247eb8, cnt=8, rmwOp=0
02/03/19 14:49:52: ErrLBAOffset (1) LBA(123f38) BadLba=123f39
02/03/19 14:49:53: EVT#23597-02/03/19 14:49:53: 113=Unexpected sense: PD 03(e0x20/s3) Path 5000c5003bdfcb99, CDB: 28 00 00 12 3f 3a 00 00 06 00, Sense: 3/11/00
02/03/19 14:49:53: Raw Sense for PD 3: f0 00 03 00 12 3f 3b 0a 00 00 00 00 11 00 81 80 00 97
02/03/19 14:49:53: DEV_REC:Medium Error DevId[3] devHandle f RDM=806cba00 retires=0
02/03/19 14:49:53: MedErr is for: cmdId=9a, ld=1, src=1, cmd=1, lba=247eb8, cnt=8, rmwOp=0
02/03/19 14:49:53: ErrLBAOffset (1) LBA(123f3a) BadLba=123f3b

I'm decent at reading logs, but I don't know what everything in the log means. We do not have detailed documentation on the logs.

I think 247eb8 is the LBA address on the physical disk. In the ErrLBAOffset, the first LBA is the good or expected LBA. The second LBA is the returned value that is in error. The (1) is the offset value or difference between the two.

The sense key / ASC / ASCQ (key code qualifier) is 3/11/00: medium error, unrecovered read error.
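The CDB and raw sense bytes in the quoted event can also be decoded by hand. The sketch below assumes the standard SCSI READ(10) CDB layout and fixed-format sense data. The helper names and the small lookup tables are illustrative only and are not part of any Dell tool.

```python
# Small excerpts of the SCSI sense key and ASC/ASCQ tables (not complete).
SENSE_KEYS = {0x1: "RECOVERED ERROR", 0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR"}
ASC_ASCQ = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x11, 0x01): "READ RETRIES EXHAUSTED",
}


def decode_read10(cdb_hex):
    """Decode a READ(10) CDB: opcode 0x28, LBA in bytes 2..5, length in bytes 7..8."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


def decode_fixed_sense(sense_hex):
    """Decode fixed-format sense data (response code 0x70 or 0x71)."""
    sense = bytes.fromhex(sense_hex)
    if len(sense) < 14 or (sense[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key, asc, ascq = sense[2] & 0x0F, sense[12], sense[13]
    return {
        "key": SENSE_KEYS.get(key, hex(key)),
        "asc_ascq": ASC_ASCQ.get((asc, ascq), f"{asc:02x}/{ascq:02x}"),
        # The information field (bytes 3..6) is only meaningful when the valid bit is set.
        "info_lba": int.from_bytes(sense[3:7], "big") if sense[0] & 0x80 else None,
    }


if __name__ == "__main__":
    print(decode_read10("28 00 00 12 3f 3a 00 00 06 00"))
    print(decode_fixed_sense("f0 00 03 00 12 3f 3b 0a 00 00 00 00 11 00 81 80 00 97"))
```

For the quoted event this gives a READ(10) of 6 blocks starting at LBA 0x123f3a sent to PD 03, with sense key 3h (medium error), ASC/ASCQ 11h/00h, and an information field of 0x123f3b. That matches the BadLba value in the log line that follows it. In the SCSI ASC/ASCQ assignments, 11h/00h is "unrecovered read error", while 11h/01h is the "read retries exhausted" variant.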

What I think is happening is that information is being copied to the drive during the rebuild. During the verification process, the controller reads physical disk logical block address 247eb8. It expects virtual disk LBA 123f38 to be written in that block, but 123f39 is there instead. After the read retries are exhausted, it moves on to the next logical block. It then tries to copy another virtual disk LBA to the same physical disk LBA and encounters the same error.

I can't say with certainty whether or not the virtual disk or physical disks are at fault since I'm not seeing the same bad blocks on more than one disk. I didn't check all of the blocks though. I would run diagnostics on the disks. If the disks are shelf spares that have been sitting for a long time or were received in the same shipment then they may be faulty.

Daniel Mysinger
Dell EMC, Enterprise Engineer


itechroz
1 Nickel

Re: PERC H700 predictive failure continues with new drive


Daniel, thank you very much for taking the time to examine the logs. The drives were purchased together in the same shipment 4 months ago. I ran the Dell Online Diagnostics disk self-test (DST) on the drives. PD 03 failed the DST but passed the quick test. The other 2 drives in the array passed the DST.

I ordered more Dell drives.  I'll try replacing drive again in a few days and let you know results.

Here's my plan:

1)  Run consistency check

2)  Force drive offline via OMSA

3)  Insert new disk and verify rebuild results

4)  Run consistency check
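The plan above can be written down as OMSA command-line calls. This is a dry-run sketch that only builds and prints the command lines and executes nothing. The controller ID 0, virtual disk ID 1, and physical disk ID 0:0:3 are placeholders (assumptions), so check the output of `omreport storage controller` and `omreport storage pdisk controller=<id>` for the real IDs first.

```python
def replacement_plan(controller="0", vdisk="1", pdisk="0:0:3"):
    """Return the four plan steps as (description, argv) pairs. Nothing is executed."""
    check = ["omconfig", "storage", "vdisk", "action=checkconsistency",
             f"controller={controller}", f"vdisk={vdisk}"]
    return [
        ("1) Run consistency check", check),
        ("2) Force drive offline",
         ["omconfig", "storage", "pdisk", "action=offline",
          f"controller={controller}", f"pdisk={pdisk}"]),
        # Physically swap the drive, then poll this report until the rebuild is done.
        ("3) Insert new disk and verify rebuild",
         ["omreport", "storage", "pdisk", f"controller={controller}"]),
        ("4) Run consistency check", check),
    ]


if __name__ == "__main__":
    for description, argv in replacement_plan():
        print(f"{description}: {' '.join(argv)}")
```

Printing the commands first makes it easy to review the IDs before anything is run against the controller.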

itechroz
1 Nickel

Re: PERC H700 predictive failure continues with new drive


Thanks for your help, Daniel. I replaced PD 03 for a third time, which resolved the issue. Although I am happy to see the controller not logging errors and PD 03 in a normal state, I'm still not confident the virtual disk is healthy.
