VxRail: vSAN Object Inaccessible, Disk Failure, Excessive I/O Latency, Disk Overall Health Red
Summary: Do not remove disks during a vSAN resync, as doing so can result in data loss.
Symptoms
This article is applicable for both VxRail 7.x and VxRail 8.x versions.
The vSAN health check reports a disk failure, or vmware-vsan-health-summary-result.log shows the physdiskoverall health test as red or yellow.
VxRail-Virtual-SAN-Cluster-xxxxxxxxx
Overall Health : red
Group physicaldisks health : red
Test physdiskoverall health : red
DisksWithIssues:
Host Disk OverallOperationHealth Metadata Operational InCmmds/Vsi OperationalState Recommendation Uuid
(Host-10, LocalToshibaDisk(Naa.50000xxxxxxxxxx), Red, Green, Red, Yes/Yes, ImpendingPermanentDiskFailure,EvacuationFailedDueToInaccessibleObjects, PleaseReferTo'Data'HealthCheckAndResolveTheInaccessibleObjects
vsandevicemonitord.log reports:
INFO vsandevicemonitord WARNING - WRITE Average Latency on VSAN device naa.50000xxxxxxxx has exceeded threshold value 2000000 us 2 times.
INFO vsandevicemonitord Tier 2 (naa.50000xxxxxxxx) as unhealthy
Cause
The Dying Disk Handling (DDH) feature of vSAN diagnoses disk or disk group health by detecting either excessive I/O latency for a vSAN disk or maximum log congestion that vSAN determines to be due to log leak issues in a vSAN disk group over an extended period. Unhealthy disks or disk groups are marked as such and are no longer used for new data placement.
When DDH detects that a disk has exceeded the I/O latency threshold during the monitoring interval, vSAN generates a VMkernel Observation (VOB) and logs a message to the vsandevicemonitord.log file in the /var/run/log directory. The log entry below is an example for a disk that must be replaced once the required data evacuation is complete and the disk is in an evacuated state:
WARNING - WRITE Average Latency on VSAN device <NAA disk name> has exceeded threshold value <IO latency threshold for disk> us <# of intervals with excessive IO latency> times.
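As an illustration only, the latency warning format shown above could be extracted from a copy of vsandevicemonitord.log with a small script. This is a minimal sketch that assumes the line matches the documented format exactly; the function name `parse_latency_warning` and the regular expression are hypothetical and are not part of any VMware or VxRail tool.

```python
import re

# Pattern derived from the message format in this article.
# Exact spacing in real logs is an assumption.
LATENCY_RE = re.compile(
    r"WARNING - (?P<op>\w+) Average Latency on VSAN device "
    r"(?P<device>\S+) has exceeded threshold value "
    r"(?P<threshold_us>\d+) us (?P<intervals>\d+) times"
)

def parse_latency_warning(line):
    """Return the fields of a DDH latency warning, or None if the line does not match."""
    match = LATENCY_RE.search(line)
    if match is None:
        return None
    return {
        "op": match.group("op"),                      # for example WRITE
        "device": match.group("device"),              # NAA disk name
        "threshold_us": int(match.group("threshold_us")),
        "intervals": int(match.group("intervals")),   # intervals with excessive latency
    }
```

Applied to the sample line from the Symptoms section, the sketch would report the device naa.50000xxxxxxxx, a threshold of 2000000 us, and 2 intervals.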
When DDH detects that a caching tier has excessive log congestion during the monitoring interval, vSAN generates a VOB and logs a message to the vsandevicemonitord.log file. Excessive log congestion messages are in this format:
WARNING - Maximum log congestion on VSAN device <NAA disk name> <current intervals with excessive log congestion>/<intervals required to be unhealthy>
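The congestion message can be read the same way. The sketch below assumes both counters are plain integers separated by a slash, as the format above suggests; the function name `parse_congestion_warning` and the `reached` flag are hypothetical helpers, and treating "current is at least required" as the unhealthy condition is an interpretation of the format, not documented behavior.

```python
import re

# Pattern derived from the message format in this article.
# That both counters are plain integers is an assumption.
CONGESTION_RE = re.compile(
    r"WARNING - Maximum log congestion on VSAN device "
    r"(?P<device>\S+) (?P<current>\d+)/(?P<required>\d+)"
)

def parse_congestion_warning(line):
    """Return the fields of a DDH log congestion warning, or None if the line does not match."""
    match = CONGESTION_RE.search(line)
    if match is None:
        return None
    current = int(match.group("current"))
    required = int(match.group("required"))
    return {
        "device": match.group("device"),
        "current": current,              # intervals with excessive log congestion so far
        "required": required,            # intervals required to be unhealthy
        "reached": current >= required,  # assumed meaning of the two counters
    }
```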
In both of these situations, vSAN triggers the evacuation of some or all data from the affected disk or disk groups. The overall disks health section in the vSAN health monitoring UI reports an operational state for the affected disk or disk groups, along with a recommendation for the user. The recommendations after the evacuation is complete differ depending on whether vSAN detected excessive I/O latencies or excessive log congestion.
Resolution
See VMware article 326878, Dying Disk Handling (DDH) in vSAN
Do not remove or replace a disk while a vSAN resync is ongoing and the disk is in either of the following states. Doing so may cause data loss.
Impending permanent disk failure, data evacuation failed due to insufficient resources (Health state - Red)
Or
Impending permanent disk failure, data evacuation failed due to inaccessible objects (Health state - Red)
Do not remove or replace a disk while an object is inaccessible.
An inaccessible object is one for which all copies are missing, so removing or replacing a disk in this state may cause data loss.
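When reviewing many hosts, the two operational states named above can be flagged with a simple text check. The sketch below is hypothetical: the function name `is_blocking_state` is illustrative, and the run-together form of the "insufficient resources" state is assumed by analogy with the "inaccessible objects" string shown in the Symptoms section. A result of False only means that neither of these two states was recognized; it does not mean the disk is safe to remove.

```python
import re

# Markers for the two states named in this article, with spaces and punctuation removed.
# The "insufficient resources" log spelling is an assumption by analogy.
IMPENDING = "impendingpermanentdiskfailure"
EVACUATION_FAILURES = (
    "evacuationfailedduetoinaccessibleobjects",
    "evacuationfailedduetoinsufficientresources",
)

def is_blocking_state(operational_state):
    """Return True if the text matches one of the two states in which a disk must not be removed."""
    normalized = re.sub(r"[^a-z0-9]", "", operational_state.lower())
    if IMPENDING not in normalized:
        return False
    return any(marker in normalized for marker in EVACUATION_FAILURES)
```

The check accepts both the spaced wording used in the health UI and the run-together wording found in the health summary log, because both are normalized before comparison.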
Workaround:
- Engage VMware
- If excessive I/O latency caused the capacity disk to be marked unhealthy, recover the disk by remounting it. Remounting the disk does not change the vSAN UUID of the disk.
esxcli vsan storage diskgroup unmount -u <disk group UUID>
esxcli vsan storage diskgroup mount -u <disk group UUID>