A Drive May Require Replacement Due to I/O Errors or if Software-Defined Storage Marks the Drive as Failed or Unusable
Summary: Users may request that a drive be replaced because of I/O errors or because a Software-Defined Storage (SDS) solution has marked the drive as "failed" or "unusable."
Instructions
There are several SDS solutions, such as Ceph (Linux), vSAN (VMware), and Nutanix. In each of them, several identically configured servers are joined over a network to create a storage cluster. The servers are configured with a Host Bus Adapter (HBA) instead of a PERC, so the drives are presented to the operating system "as-is." The operating system manages all the drives in each server directly, without any intervention from the HBA.
In the scenario described here, the drive is listed as "Healthy" in Dell monitoring tools (like iDRAC and OMSA) and in ePSA Offline Diagnostics. smartctl data for the drive may or may not show uncorrected read and write errors. SMART tests (short, long, and extended) pass, and the drive is listed as "Healthy."
Software-Defined Storage (SDS) solutions shift all storage-related controls from hardware to software and use a Host Bus Adapter (HBA) to provide physical connectivity to the drives.
In a hardware RAID configuration, the RAID controller (PERC) performs several proactive maintenance activities on the drives, including patrol read and consistency checks on virtual disks. Because SDS solutions use an HBA instead of a PERC, the software now performs all of those proactive maintenance activities.
Users may report that the SDS has marked a drive as "failed" or "unusable," or that it lists I/O errors on a drive. At the same time, Dell monitoring tools like iDRAC and OMSA report the drive as healthy and operational.
Tools from the smartmontools package, such as smartctl, may list some errors on one or more of the indicated drives while still reporting the overall drive health as "HEALTHY" or "OK."
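The gap between per-drive error counters and the overall health verdict can be inspected side by side. The following is a minimal sketch, assuming smartmontools 7.0 or later (which provides `--json` output) and a SAS drive whose report contains a `scsi_error_counter_log` section. The helper names `summarize_smart` and `read_smart` and the sample report are illustrative only, and the fields present vary by drive type and firmware.

```python
import json
import subprocess


def summarize_smart(report: dict) -> dict:
    """Extract the overall verdict and uncorrected error counts from a parsed
    `smartctl --json` report. Fields that are absent come back as None."""
    error_log = report.get("scsi_error_counter_log", {})
    return {
        "device": report.get("device", {}).get("name"),
        "overall_passed": report.get("smart_status", {}).get("passed"),
        "uncorrected_read": error_log.get("read", {}).get("total_uncorrected_errors"),
        "uncorrected_write": error_log.get("write", {}).get("total_uncorrected_errors"),
    }


def read_smart(device: str) -> dict:
    """Run smartctl against a device (requires root privileges).
    smartctl uses its exit status as a bitmask, so a nonzero code is not
    treated as a failure here."""
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)


# Hypothetical report fragment for illustration only, not real drive data.
sample_report = {
    "device": {"name": "/dev/sdc"},
    "smart_status": {"passed": True},
    "scsi_error_counter_log": {
        "read": {"total_uncorrected_errors": 3},
        "write": {"total_uncorrected_errors": 0},
    },
}

print(summarize_smart(sample_report))
```

With the sample above, the summary shows a passing overall status alongside a nonzero uncorrected read count, which is the combination this article describes. On a live server, `summarize_smart(read_smart("/dev/sdX"))` would produce the same summary for a real drive.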
This discrepancy is due to the following factors:
- iDRAC shows the health status that the component itself reports. If the drive firmware reports that the drive is healthy, iDRAC reflects the same. If the drive firmware reports a Predictive Failure condition, iDRAC reflects the same.
- All drives can tolerate some bad blocks or uncorrectable errors and continue to operate without any functional impact. The threshold for bad blocks is programmed into the drive firmware by the drive manufacturer and is not a standard number or percentage.
- Drives remain operational until the total number of bad blocks or uncorrectable errors on the drive breaches the predictive failure or failure threshold.
- An offset address on the drive is marked as a bad block and the data are relocated ONLY if a WRITE operation fails at that specific address. The drive firmware does not consider READ errors for marking sectors as bad blocks.
- I/O errors logged at the operating system level might not be reflected in the Lifecycle logs.
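The factors listed above can be illustrated with a toy model. This is a simplified sketch and not actual drive firmware logic: the `ToyDrive` class, its threshold values, and the choice to treat a count equal to the threshold as a breach are all assumptions made for illustration. It shows how the operating system can count many I/O errors while the drive still reports itself as healthy.

```python
from dataclasses import dataclass


@dataclass
class ToyDrive:
    """Toy model of a drive whose firmware tracks bad blocks against
    vendor-defined thresholds (the values used here are hypothetical)."""
    predictive_threshold: int
    failure_threshold: int
    bad_blocks: int = 0      # what the drive firmware counts
    os_io_errors: int = 0    # what the operating system counts

    def record_failed_io(self, operation: str) -> None:
        # The operating system logs every failed I/O.
        self.os_io_errors += 1
        # Per the article, only a failed WRITE marks the address as a bad block.
        if operation == "write":
            self.bad_blocks += 1

    def reported_health(self) -> str:
        # The status that a tool like iDRAC would simply reflect.
        if self.bad_blocks >= self.failure_threshold:
            return "Failed"
        if self.bad_blocks >= self.predictive_threshold:
            return "Predictive Failure"
        return "Healthy"


drive = ToyDrive(predictive_threshold=100, failure_threshold=200)
for op in ["read"] * 40 + ["write"] * 5:
    drive.record_failed_io(op)

print(drive.os_io_errors, drive.bad_blocks, drive.reported_health())
# prints: 45 5 Healthy
```

In this model the operating system has seen 45 errors, but only the 5 write failures count toward the firmware threshold, so the reported status remains "Healthy."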
In such a scenario, the drives are functional and well within their operational parameters. They neither qualify for nor require a hardware replacement. The recommended plan of action is to perform the necessary maintenance activities from the software layer to resolve the issue.
Capture a complete operating system log bundle or reports from each affected server. Then engage a Dell Engineer (if covered by a warranty agreement) or the operating system vendor to review the logs and advise on the next corrective steps.
The Dell Engineer or operating system vendor determines the following details:
- The total number of I/O errors recorded by the operating system kernel (if any).
- The device or devices against which the errors are logged.
- The type of corruption (if any): file level or metadata level.
- Whether the storage service crashed and, if so, why.
- The corrective actions available in the SDS to resolve such errors.
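As a rough illustration of the first two items, the sketch below tallies kernel I/O errors per device from log text. It assumes a Linux host (for example, a Ceph node) and kernel messages of the form `I/O error, dev <name>, sector <n>`. The exact message text varies between kernel versions, and the function name `count_io_errors` and the sample lines are hypothetical. It does not replace a full log review by the vendor.

```python
import re
from collections import Counter

# Matches block-layer messages such as:
#   "blk_update_request: I/O error, dev sdc, sector 104857600"
# The wording differs between kernel versions, so treat this as a starting point.
IO_ERROR = re.compile(r"I/O error, dev (?P<dev>[\w-]+), sector (?P<sector>\d+)")


def count_io_errors(lines):
    """Return a Counter mapping device name to the number of I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


# Hypothetical log excerpt for illustration only.
sample_lines = [
    "[12345.678901] blk_update_request: I/O error, dev sdc, sector 104857600",
    "[12346.000001] I/O error, dev sdc, sector 104857608 op 0x0:(READ) flags 0x0",
    "[12350.123456] blk_update_request: I/O error, dev sdf, sector 2048",
    "[12351.000000] EXT4-fs (sda1): mounted filesystem with ordered data mode",
]

for device, total in count_io_errors(sample_lines).most_common():
    print(device, total)
```

On a live system, the same function could be fed the lines of a saved kernel log from the collected log bundle.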