I'm pretty upset by a fatal issue and would like to hear your opinion. I have 4 R720 servers running ESXi, and around 2 weeks ago I noticed that a Windows VM (BlueIris CCTV) had a few corrupted recordings which blocked the auto archiving. I deleted the affected files and did not suspect a hardware problem, especially because my server showed "all green" and did not throw any alarms (I also receive any iDRAC alarms or warnings via email so I can react as soon as possible, e.g. by replacing a faulty disk).
On Sunday I decided to swap the server's USB boot key for a better one, so I cleanly shut down all VMs and the host for the maintenance.
After rebooting the server I found that the Windows VM and 4 other (Linux) VMs would not boot. Windows was even "unrepairable". Even worse, I have 14 days of backups of the VMs and all of them are broken as well (the corrupted files went into the backups too).
I migrated the unaffected VMs to other datastores / hosts, bought 4 new HDDs yesterday and replaced all disks, because the H710 still showed every drive as healthy and I could not tell which one was bad. The last patrol read had finished on Saturday without issues.
After migrating the VMs back to the "new" datastore (I still need to set up 2 VMs from scratch), I took the removed HDDs to my workstation one by one for examination.
The second one showed horrible SMART data:
More than 1000 (!) reassigned sectors and 9 uncorrectable errors!
It's insane that the H710 did not ring all the alarm bells with these SMART values! Instead, it let the disk keep running for days and days. If I had gotten an alarm in time, I would have been able to replace the disk ... and would still have had usable backups if the rebuild had failed.
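A check that does not rely on the controller's own verdict could be sketched roughly like this. It is only a minimal sketch: it assumes the disk is a SATA drive whose attribute table has been read with `smartctl -A` (smartmontools) on a workstation, and the watched attribute names, the "anything above zero is a warning" limits and the sample values are illustrative assumptions, not output from the drive described above.

```python
# Hypothetical limits: any raw value above the limit counts as a warning.
WATCHED = {
    "Reallocated_Sector_Ct": 0,
    "Current_Pending_Sector": 0,
    "Offline_Uncorrectable": 0,
    "Reported_Uncorrect": 0,
}


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` style ATA output."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header and any other lines are skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():
                values[fields[1]] = int(raw)
    return values


def find_warnings(values, limits=WATCHED):
    """List every watched attribute whose raw value exceeds its limit."""
    return [
        f"{name} = {values[name]} (limit {limit})"
        for name, limit in limits.items()
        if values.get(name, 0) > limit
    ]


# Made-up sample in the usual attribute-table layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       1024
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9000
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       9
"""

if __name__ == "__main__":
    for warning in find_warnings(parse_smart_attributes(SAMPLE)):
        print("WARNING:", warning)
```

The parsing is kept separate from the thresholds so the limits can be tuned per drive model. How the attribute text is collected from disks behind a PERC is left open here, since that depends on the host OS and on whether smartmontools can reach the physical disks at all.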
I don't think the problem is the person sitting in front of the server (I'm not aware of an "Ignore SMART" switch) ... or is it?
Thanks for your opinions.
I have not seen a controller do that, but the first thing I would check is whether the server is up to date on BIOS, iDRAC, H710, and drive firmware (if available). If the drives have timeout issues, I can see that interfering with errors being reported, which is why I suggest starting there.
Let me know what you see, as well as the current versions.
I was also surprised. I've been running Dells in my homelab for 8 years, and the PERC H700 (previous R710) and H710 (my R720s) always did a great job of noticing predictive disk errors, even with only a few reassigned blocks. This is the first time it has completely failed. But I hope you agree that >1000 reassigned sectors and 9 uncorrectable errors in SMART is serious.
BIOS: 2.9.0
H710 Firmware: 21.3.5-0002, Driver 7.719.02.00
iDRAC: 22.214.171.124 Build 15
No timeout issues noticed with the drives. As mentioned, the patrol read finished a day before I noticed the problem, and all drives were still "green".
Hi, is your system in warranty? If so, I would say it's best to have our Tech Support take a look at the TTY logs. Wish you a good one.
Social Media and Communities Professional
Dell Technologies | Enterprise Support Services
then you could check the log yourself to find the exact failure.