A Drive May Require Replacement Due to I/O Errors or if Software-Defined Storage Marks the Drive as Failed or Unusable

Summary: Users may request a drive replacement because of I/O errors or because a Software-Defined Storage (SDS) solution has marked the drive as "failed" or "unusable."

Instructions

There are several Software-Defined Storage (SDS) solutions, such as Ceph (Linux), vSAN (VMware), and Nutanix. In each, several identically configured servers are joined over a network to create a storage cluster. The servers are configured with a Host Bus Adapter (HBA) instead of a PERC, so that the drives are presented to the operating system "as-is." The operating system manages all the drives in each server directly, without any intervention from the HBA.

In the scenario this article covers, the drive is listed as "Healthy" in Dell monitoring tools (like iDRAC and OMSA) and in ePSA Offline Diagnostics. SMARTCTL data for the drive may or may not show uncorrected read and write errors. SMART tests (short, long, and extended) pass, and the drive is listed as "Healthy."

SDS solutions shift all storage-related control from hardware to software and use the HBA only to provide physical connectivity to the drives.

In a hardware RAID configuration, the RAID controller (PERC) performs several proactive maintenance activities on the drives, including patrol reads and consistency checks on virtual disks. Because SDS solutions use an HBA instead of a PERC, the software must perform all of those proactive maintenance activities.

Users may report that the SDS has marked a drive as "failed" or "unusable," or that it lists I/O errors on a drive, while Dell monitoring tools like iDRAC and OMSA report the same drive as healthy and operational.

Tools like "SMARTMON" or "SMARTCTL" may list some errors on one or more of the indicated drives, yet still report the overall drive health as "HEALTHY" or "OK."
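
The gap between an overall health verdict and the individual error counters is easier to see when both are read from the same report. The following is a minimal sketch, assuming the text output of smartctl has been saved to a string. The sample text is illustrative (modeled on the SCSI/SAS report layout, not captured from a real drive), and the function name summarize_smart_report is hypothetical. It looks for the two health lines that smartctl prints, "SMART Health Status:" for SCSI/SAS drives and "SMART overall-health self-assessment test result:" for ATA drives, and reads the last column of the SCSI error counter rows as the uncorrected error count.

```python
import re

# Illustrative sample modeled on smartctl's SCSI/SAS text layout; not real drive data.
SAMPLE_REPORT = """\
SMART Health Status: OK

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      12345.678           3
write:         0        0         0         0          0       2345.678           0
"""

HEALTH_PATTERNS = (
    re.compile(r"^SMART Health Status:\s*(.+)$"),
    re.compile(r"^SMART overall-health self-assessment test result:\s*(.+)$"),
)
COUNTER_ROW = re.compile(r"^(read|write|verify):\s+(.+)$")


def summarize_smart_report(text):
    """Return the overall health string and uncorrected error counts per operation."""
    health = None
    uncorrected = {}
    for raw in text.splitlines():
        line = raw.strip()
        for pattern in HEALTH_PATTERNS:
            match = pattern.match(line)
            if match:
                health = match.group(1).strip()
        row = COUNTER_ROW.match(line)
        if row:
            # The last column of each counter row is "Total uncorrected errors".
            uncorrected[row.group(1)] = int(row.group(2).split()[-1])
    return {"health": health, "uncorrected": uncorrected}


summary = summarize_smart_report(SAMPLE_REPORT)
print(summary)
```

In this sample the drive reports uncorrected read errors while its overall health remains "OK," which is the situation described above. The ATA attribute table has a different layout and is not parsed by this sketch.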

 

This discrepancy is due to the following factors:

  • iDRAC shows the health status that the component itself reports. If the drive firmware reports that the drive is healthy, iDRAC reflects the same. If the drive firmware reports a Predictive Failure condition, iDRAC reflects that as well.
  • All drives can tolerate some bad blocks or uncorrectable errors and continue to operate without any functional impact. The threshold for bad blocks is programmed into the drive firmware by the drive manufacturer and is not a standard number or percentage.
  • Drives remain operational until the total number of bad blocks or uncorrectable errors on the drive breaches the predictive failure or failure threshold.
  • An offset address on the drive is marked as a bad block and the data are relocated ONLY if a WRITE operation fails at that specific address. The drive firmware does not consider READ errors for marking sectors as bad blocks.
  • I/O errors logged at the operating system level might not be reflected in the Lifecycle logs.
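
The threshold behavior described in the list above can be sketched as a small model. This is only an illustration under stated assumptions: the threshold values are hypothetical (real values are set in firmware by the drive manufacturer and are not published as a standard), the names DriveThresholds and firmware_health are invented for this example, and treating a count equal to the threshold as a breach is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriveThresholds:
    """Hypothetical thresholds; real values are vendor-specific and live in firmware."""
    predictive_failure: int
    failure: int


def firmware_health(bad_blocks, thresholds):
    """Model the status a drive would report for a given bad block count."""
    # Assumption: reaching a threshold counts as breaching it.
    if bad_blocks >= thresholds.failure:
        return "Failed"
    if bad_blocks >= thresholds.predictive_failure:
        return "Predictive Failure"
    # Below both thresholds the drive keeps reporting healthy,
    # even though some bad blocks or uncorrectable errors exist.
    return "Healthy"


example = DriveThresholds(predictive_failure=100, failure=500)
for count in (0, 12, 100, 500):
    print(count, firmware_health(count, example))
```

A drive with 12 bad blocks in this model still reports "Healthy," so iDRAC would show it as healthy even if the SDS layer has already logged I/O errors against it.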

In such a scenario, the drives are functional and well within their operational parameters. They neither qualify for nor require a hardware replacement. The recommended plan of action is to perform the necessary maintenance activities from the software layer to resolve the issue.

Capture a complete operating system log bundle or report from each affected server. Then engage a Dell Engineer (if covered by a warranty agreement) or the operating system vendor for a log review, as they must advise on the next corrective steps.

 

The Dell Engineer or operating system vendor determines the following details:

  • Total I/O errors recorded by the operating system kernel (if any).
  • Which devices (one or more) the errors are logged against.
  • Type of corruption: file or metadata level (if any).
  • Did the storage service crash? If yes, why?
  • Which corrective actions are available in the SDS to resolve such errors.
Note: The points listed above for the Dell Engineer or operating system vendor are not an all-encompassing list. Their investigation may draw on several other references or data points.
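
The first two points in the list above (total I/O errors and the devices they are logged against) can be approximated on a Linux-based SDS node with a short script. This is a minimal sketch, assuming a Linux kernel log in which block-layer failures appear as "I/O error, dev <name>, sector <number>"; the sample log lines are illustrative rather than taken from a real system, and the function name count_io_errors is hypothetical. Other operating systems and SDS products log these events differently.

```python
import re
from collections import Counter

# Matches Linux block-layer messages such as:
#   "blk_update_request: I/O error, dev sdc, sector 2048 op 0x0:(READ) ..."
IO_ERROR = re.compile(r"I/O error, dev ([^,\s]+), sector (\d+)")

# Illustrative sample only; not captured from a real system.
SAMPLE_LOG = """\
[ 1021.334211] blk_update_request: I/O error, dev sdc, sector 7814037160 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 1021.402917] blk_update_request: I/O error, dev sdc, sector 7814037168 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 1533.010455] blk_update_request: I/O error, dev sdf, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[ 1534.000120] an unrelated kernel message
"""


def count_io_errors(log_text):
    """Count block-layer I/O error lines per device name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = count_io_errors(SAMPLE_LOG)
print("total I/O errors:", sum(counts.values()))
for device, total in counts.most_common():
    print(device, total)
```

A count like this only summarizes what the kernel logged. It does not replace the log review by the Dell Engineer or operating system vendor, who also assess corruption type and storage service behavior.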

 

Affected Products

XC Core Systems, XC Series Appliances, Dell EMC Microsoft Storage Spaces Direct Ready Nodes, Dell EMC vSAN Ready Nodes, PowerEdge SDS 100 (Storage System)
Article Properties
Article Number: 000219050
Article Type: How To
Last Modified: 17 Jun 2025
Version: 4