The system is a PowerEdge R710 with a PERC H700 integrated controller. The virtual drive in question is RAID 5 with 3 drives. I have been monitoring the controller logs weekly and running a consistency check for the past 3 months with 0 issues. Suddenly, yesterday OMSA started showing 1 drive as predictive failure. I went through the logs and see a bunch of unexpected sense entries, many of them stating "corrected medium error", but I do not see any unrecoverable errors.

I went onsite, offlined the drive, then inserted a new Dell-branded drive. The rebuild completed, but the new drive is also reporting predictive failure. I went through the logs again and noticed many more unexpected sense entries during the rebuild process. I then ran a consistency check, which again produced many unexpected sense entries. The drive was still in a predicted-failure state, so I replaced it again with another new drive. This time, after the rebuild, the drive was not showing predictive failure. Just to be sure, I ran another consistency check, which put the same drive into a predictive failure state again. There are again a bunch of unexpected sense entries, many of them stating "corrected medium error". I doubt that both drives are bad, and I am unsure how to proceed.
The firmware for the controller, drives, and BIOS is all up to date, but iDRAC6 and Lifecycle Controller are out of date. iDRAC6 is at 1.92.00 (build 5) and Lifecycle Controller is at 126.96.36.1995.
Please let me know your thoughts. If you recommend updating iDRAC6 and Lifecycle Controller, please point me to where to find the updates and the steps for updating; I am having trouble locating them on the support page. Also, can I update straight to the latest version, or does it have to be done in steps?
Thank you in advance!
I suspect that the virtual disk is punctured. Predictive failure status is a SMART issue reported by the firmware on the disk. When the virtual disk has a puncture, the rebuild copies the bad virtual disk data onto the replacement disk. That bad data can cause false bad blocks, which can in turn lead to predictive failure.
You will need to review the controller log to look for a puncture. If there is a puncture then you should see the same bad block listed on two or more disks. I think the H700 is the first controller that will report a puncture. You can search the log for the word puncture. The log only stores the last 10k lines, so if you have been replacing and rebuilding disks the information may be gone.
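If it helps to automate that search, here is a minimal sketch of the "same bad block on two or more disks" check. It is my own helper, not a Dell tool; the `PD n(...)` and `BadLba=` patterns are taken from the unexpected-sense entries quoted later in this thread, and the code assumes the disk identity appears before the `BadLba` entry for the same event, so adjust the patterns and ordering to match your exported log.

```python
import re
from collections import defaultdict

# Hypothetical puncture check: scan exported controller-log text for "BadLba="
# entries and report any bad LBA that is attributed to two or more physical
# disks. Line patterns are modeled on the snippets in this thread.
def shared_bad_lbas(log_text):
    seen = defaultdict(set)          # bad LBA -> set of PD slot numbers
    current_pd = None
    for line in log_text.splitlines():
        pd = re.search(r"PD (\d+)\(", line)
        if pd:
            current_pd = int(pd.group(1))
        bad = re.search(r"BadLba=([0-9a-fA-F]+)", line)
        if bad and current_pd is not None:
            seen[int(bad.group(1), 16)].add(current_pd)
    # A puncture candidate is the same bad block reported on more than one disk
    return {lba: pds for lba, pds in seen.items() if len(pds) >= 2}

sample = """\
113=Unexpected sense: PD 03(e0x20/s3), Sense: 3/11/00
ErrLBAOffset (1) LBA(123f38) BadLba=123f39
113=Unexpected sense: PD 04(e0x20/s4), Sense: 3/11/00
ErrLBAOffset (1) LBA(123f38) BadLba=123f39
"""
print(shared_bad_lbas(sample))       # same block seen on PD 3 and PD 4
```

If the dictionary comes back empty, as a plain-text search for "puncture" also would here, that is consistent with no puncture being recorded in the retained portion of the log.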
If the virtual disk is punctured, you will need to delete the virtual disk, create and fully initialize an unlike virtual disk (one with a different configuration), delete it, and then create the desired virtual disk. This process is data destructive, so you will need to back up any important information first.
Thanks for your response Daniel.
I saved copies of all the logs before exporting, and just did a search for "puncture" on all the log files, which turned up 0 results. All the unexpected sense errors are on PD 03, which is the drive I replaced twice.
I do agree that there is something wrong with the virtual disk. Do you have any other suggestions before I re-create the virtual disk?
If you would like to upload the controller logs to a text-sharing site and provide links, then I or someone else in the community can review them. I think it would just confirm the puncture issue, though.
Thank you. After opening the links below, you will have to click "new version" in the bottom-left corner to view the text.
I couldn't upload all the logs because I think they are too large, but these are the first 3 from yesterday.
That site is blocked by our proxy. I had someone on a different network download them and send them to me. They said that the pages are blank. I would suggest uploading again to pastebin.
I don't see a puncture in those logs. It looks like they go back to 11/30/2018. During the rebuild, the controller is encountering a lot of bad blocks on the virtual disk. Here is a snippet of what occurs several times during the rebuild.
02/03/19 14:49:52: MedErr is for: cmdId=9a, ld=1, src=1, cmd=1, lba=247eb8, cnt=8, rmwOp=0
02/03/19 14:49:52: ErrLBAOffset (1) LBA(123f38) BadLba=123f39
02/03/19 14:49:53: EVT#23597-02/03/19 14:49:53: 113=Unexpected sense: PD 03(e0x20/s3) Path 5000c5003bdfcb99, CDB: 28 00 00 12 3f 3a 00 00 06 00, Sense: 3/11/00
02/03/19 14:49:53: Raw Sense for PD 3: f0 00 03 00 12 3f 3b 0a 00 00 00 00 11 00 81 80 00 97
02/03/19 14:49:53: DEV_REC:Medium Error DevId devHandle f RDM=806cba00 retires=0
02/03/19 14:49:53: MedErr is for: cmdId=9a, ld=1, src=1, cmd=1, lba=247eb8, cnt=8, rmwOp=0
02/03/19 14:49:53: ErrLBAOffset (1) LBA(123f3a) BadLba=123f3b
I'm decent at reading logs, but I don't know what everything in the log means. We do not have detailed documentation on the logs.
I think 247eb8 is the LBA address on the physical disk. In the ErrLBAOffset, the first LBA is the good or expected LBA. The second LBA is the returned value that is in error. The (1) is the offset value or difference between the two.
The sense key / key code qualifier is 3/11/00: medium error, read retries exhausted.
What I think is happening is that data is being copied to the drive during the rebuild. During the verification pass, it reads physical disk LBA 247eb8. It expects virtual disk LBA 123f38 to be written in that block, but 123f39 is there instead. After the read retries are exhausted, it moves on to the next logical block. It then tries to copy another virtual disk LBA to the same physical disk LBA and encounters the same error.
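To make that reading of the fields concrete, here is a quick check against the ErrLBAOffset line from the snippet. It only assumes my interpretation above is correct, i.e. that the parenthesized number is the difference between the expected and returned LBAs:

```python
import re

# Parse one ErrLBAOffset line from the snippet above and verify that the
# parenthesized offset equals (returned LBA - expected LBA).
line = "02/03/19 14:49:52: ErrLBAOffset (1) LBA(123f38) BadLba=123f39"
m = re.search(r"ErrLBAOffset \((\d+)\) LBA\(([0-9a-f]+)\) BadLba=([0-9a-f]+)", line)
offset = int(m.group(1))
expected = int(m.group(2), 16)       # LBA the controller expected to find
returned = int(m.group(3), 16)       # LBA actually read from the block
print(f"expected={expected:#x} returned={returned:#x} offset={offset}")
assert returned - expected == offset  # holds for both entries in the snippet
```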
I can't say with certainty whether or not the virtual disk or physical disks are at fault since I'm not seeing the same bad blocks on more than one disk. I didn't check all of the blocks though. I would run diagnostics on the disks. If the disks are shelf spares that have been sitting for a long time or were received in the same shipment then they may be faulty.
Daniel, thank you very much for taking the time to examine the logs. The drives were purchased together in the same shipment 4 months ago. I ran the Dell Online Diagnostics disk self-test on the drives. PD 03 failed the DST but passed the quick test. The other 2 drives in the array passed the DST.
I ordered more Dell drives. I'll try replacing drive again in a few days and let you know results.
Here's my plan:
1) Run consistency check
2) Force drive offline via OMSA
3) Insert new disk and verify rebuild results
4) Run consistency check
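The plan above can be sketched with the OMSA CLI. This is a dry-run sketch only: controller 0, vdisk 0, and pdisk 0:0:3 are assumed IDs (confirm yours with `omreport storage controller` and `omreport storage pdisk controller=0` first), and each command is echoed rather than executed so the sequence can be reviewed before running it for real.

```shell
#!/bin/sh
# Dry-run sketch of the replacement plan using the OMSA CLI (omconfig/omreport).
# Controller 0, vdisk 0, and pdisk 0:0:3 are assumed IDs -- verify them first.
# Each command is echoed, not executed; drop the "run" wrapper to go live.
run() { echo "would run: $*"; }

run omconfig storage vdisk action=checkconsistency controller=0 vdisk=0   # 1) consistency check
run omconfig storage pdisk action=offline controller=0 pdisk=0:0:3        # 2) force drive offline
run omreport storage pdisk controller=0 pdisk=0:0:3                       # 3) check state after rebuild
run omconfig storage vdisk action=checkconsistency controller=0 vdisk=0   # 4) re-run consistency check
```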
Thanks for your help, Daniel. I replaced PD 03 for a third time, which resolved the issue. Although I am happy to see the controller no longer logging errors and PD 03 in a normal state, I'm still not confident the virtual disk is healthy.