Dell Unity/VNX: Random temporary loss of connection and/or performance degradation on ESXi hosts from version 5.5 u2 and later (User Correctable)
Summary: Heavily loaded arrays, networks, or fabrics may slow ATS commands enough that the array returns a miscompare check condition on an ATS command, which ESXi does not expect. Because of this ATS miscompare on a VMFS heartbeat slot, the ESXi host attempts to regain control of the device by issuing a SCSI device reset on the LUN holding the VMFS. All active I/O on this LUN is aborted and the SCSI device is reset. A temporary loss of connectivity shows up in the VMkernel logs. ...
Symptoms
SCENARIO:
- Host upgrade to ESXi 5.5 update 2 or ESXi 6.0
- One or more ESXi hosts lose connection to the VMFS datastore for a short period of time. Any VMs on the datastore may crash or have I/O errors.
- Because of an Atomic Test and Set (ATS) miscompare on a VMFS heartbeat slot, the ESXi host attempts to regain control of the device by issuing a SCSI device reset on the LUN holding the VMFS.
- All active I/O on this LUN will be aborted and the SCSI device will be reset.
- A temporary loss in connectivity shows up in the VMkernel logs.
An ATS miscompare can occur with both NMP and PowerPath.
Error messages indicating an ATS miscompare similar to this appear in /var/log/vmkernel.log:
2015-11-20T22:12:47.194Z cpu13:33467)ScsiDeviceIO: 2645: Cmd(0x439dd0d7c400) 0x89, CmdSN 0x2f3dd6 from world 3937473 to dev "naa.50002ac0049412fa" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
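To see which devices are reporting the miscompare, the vmkernel log can be tallied by device. The following is a minimal sketch, not an official Dell or VMware tool: the function name ats_miscompares_by_device is hypothetical, and it assumes the log lines follow the format of the example above (device ID in quotes after "to dev", sense data 0xe 0x1d 0x0).

```shell
# Hypothetical helper: count ATS miscompare entries per device.
# Assumes ScsiDeviceIO lines formatted like the example above.
ats_miscompares_by_device() {
    # Keep only lines carrying the miscompare sense data (SK 0xe, ASC 0x1d, ASCQ 0x0),
    # extract the quoted device ID, then count occurrences per device.
    grep -h 'Valid sense data: 0xe 0x1d 0x0' "$@" |
        sed -n 's/.* to dev "\([^"]*\)".*/\1/p' |
        sort | uniq -c | sort -rn
}
```

Usage: ats_miscompares_by_device /var/log/vmkernel.log
Each output line shows a count followed by the device ID, with the most affected device first.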
Other issues that can occur:
- Hosts disconnecting from vSphere vCenter
- Virtual machines hanging on I/O operations
Cause
This issue has been seen on arrays, networks, or fabrics that are overloaded enough that hosts are canceling I/O requests.
Several array vendors (Dell included) have reported issues with the ATS heartbeat feature, which was introduced in ESXi 5.5u2.
NOTE: Broadcom (VMware) KB 326437 (external link) lists the affected versions only as VMware ESXi 5.5.x and VMware ESXi 6.0.x, without listing the specific versions. This KB therefore assumes that all ESXi hosts running 5.5u2 or later, and all ESXi 6.0 versions, are affected.
A host indicates its liveness by periodically performing I/O to its heartbeat on a given volume. Therefore, if no activity is seen on the host's heartbeat slot for a period of time, then we can conclude that the host has lost connectivity to the volume.
ATS heartbeat I/O has a very low time-out value, which can lead to host disconnects and application outages, resulting in loss of connection to disks and/or performance degradation on the hosts.
The host then registers the miscompare on the heartbeat slot and aborts all active I/O on the LUN as it issues the reset. All pending I/O on this LUN fails with host status 8 (H:0x8, SCSI reset).
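To correlate the aborted I/O with the time of each reset, failures reported with H:0x8 can be counted per minute. This is a rough sketch under stated assumptions: the function name resets_per_minute is hypothetical, and it assumes the aborted commands are logged in the same ScsiDeviceIO format as the example in the Symptoms section, with the timestamp at the start of the line and the text "failed H:0x8".

```shell
# Hypothetical helper: count commands that failed with host status H:0x8, per minute.
# Assumes each log line starts with a timestamp such as 2015-11-20T22:12:47.194Z.
resets_per_minute() {
    # The first 16 characters of the timestamp give date, hour, and minute.
    grep -h 'failed H:0x8 ' "$@" | cut -c1-16 | sort | uniq -c
}
```

Usage: resets_per_minute /var/log/vmkernel.log
Minutes with a burst of H:0x8 failures are likely candidates for a device reset and can be compared against the time of the ATS miscompare entries.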
Resolution
If this condition is observed, the recommended temporary workaround is to disable the VAAI ATS heartbeat mechanism. See Broadcom (VMware) KB 326437 (external link) for more information. Disabling the ATS heartbeat mechanism reverts the host to legacy mode. Once the load has been addressed, re-enable the ATS heartbeat mechanism.
Contact VMware for confirmation of the issue or provide an ESXi emcgrab with vmsupport for confirmation. Disabling the VAAI ATS heartbeat functionality on the ESXi host is ONLY recommended for affected customers, and only until the load problems can be addressed.
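The workaround above could be applied from the ESXi shell along the following lines. This is a sketch, not the authoritative procedure: the wrapper name set_ats_heartbeat is hypothetical, and the advanced option /VMFS3/UseATSForHBOnVMFS5 (which applies to VMFS5 datastores) is an assumption taken from VMware's published guidance on this issue. Confirm the option name and scope against Broadcom (VMware) KB 326437 before changing any host.

```shell
# Hypothetical wrapper: 0 = disable ATS heartbeat (legacy mode), 1 = re-enable it.
# The option path is an assumption; verify it against the Broadcom (VMware) KB.
set_ats_heartbeat() {
    case "$1" in
        0|1) ;;
        *) echo "usage: set_ats_heartbeat 0|1" >&2; return 2 ;;
    esac
    # esxcli only exists on an ESXi host; refuse to continue anywhere else.
    if ! command -v esxcli >/dev/null 2>&1; then
        echo "esxcli not found; run this in an ESXi shell" >&2
        return 127
    fi
    # Apply the setting, then list it so the new value can be checked.
    esxcli system settings advanced set -i "$1" -o /VMFS3/UseATSForHBOnVMFS5 &&
        esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5
}
```

Usage: set_ats_heartbeat 0 to apply the workaround, and set_ats_heartbeat 1 once the load problem has been addressed. The setting is per host, so it would need to be applied on each affected ESXi host.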
Additional Information
The Unity logs can be used to identify this type of abort (Sense Key = 0e, ASC = 1d, ASCQ = 00).
The log location in the extracted logs is:
To check the logs, extract all the c4_safe_ktrace.log* logs in the location above, and then look for "SK = 0x0e, ASC/Q = 0x1d00".
Example on a Linux or similar system:
grep -i "SK = 0x0e, ASC/Q = 0x1d00" spa/EMC/C4Core/log/c4_safe_ktrace.* | wc -l
15744 <<<< count of aborts on SPA in this example.
If the ktrace logs have not been extracted (they are still compressed), use zgrep instead:
zgrep -i "SK = 0x0e, ASC/Q = 0x1d00" spa/EMC/C4Core/log/c4_safe_ktrace.* | wc -l
15744 <<<< count of aborts on SPA in this example.
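The total count does not show when the aborts occurred. A per-file breakdown can help narrow down the time window, since each ktrace log covers a different period. The following is a minimal sketch: the function name count_aborts_per_file is hypothetical, and it relies on gzip -cdf passing uncompressed files through unchanged, so the same loop handles both extracted and compressed logs.

```shell
# Search string taken from the article: Sense Key 0e, ASC 1d, ASCQ 00.
PATTERN='SK = 0x0e, ASC/Q = 0x1d00'

# Hypothetical helper: print "<count><TAB><file>" for each log file given.
count_aborts_per_file() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        # gzip -cdf decompresses .gz files and copies plain files as-is.
        # grep -c exits non-zero on zero matches, hence the "|| true".
        n=$(gzip -cdf -- "$f" | grep -ci -- "$PATTERN" || true)
        printf '%s\t%s\n' "$n" "$f"
    done
}
```

Usage: count_aborts_per_file spa/EMC/C4Core/log/c4_safe_ktrace.* | sort -rn
The files with the highest counts appear first.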