
Article Number: 000056125

Dell VxRail: vSAN Health reports occasionally "Hosts with communication issues" messages

Summary: ESXi hosts within a VxRail vSAN cluster may experience temporary connection issues, and as a result vSAN health may report "Hosts with communication issues" error messages.


Article Content

Symptoms

From time to time, hosts report connection issues. The hosts remain connected, but the vSAN Skyline Health check periodically flags random hosts with communication issues. If the vSAN health test is rerun, the issue clears but returns after a couple of minutes.

Versions affected:
This issue manifests on VxRail versions 4.5.x and 4.7.x.

Log analysis summary:

The vSAN health alarm is generated and displayed in vCenter:

2019-08-14T12:56:01.422Z INFO vsan-mgmt[EventMonitor] [VsanEventUtil::_generateVcEvent opID=noOpId] Generate VC event for managed object NC1V01 with testName=Hosts with connectivity issues, testId=com.vmware.vsan.health.test.hostconnectivity, preStatus=green, curStatus=red

The vmware-vsan-health-summary-result.log file shows the vSAN health host connectivity issue:

2019-08-14T12:56:01.355Z INFO vsan-mgmt[EventMonitor] [VsanHealthSummaryLogUtil::PrintHealthResult opID=noOpId] Cluster NB1X01  Overall Health : red
   Group network health : red
      Test hostdisconnected health : green
      Test hostconnectivity health : red
         HostsWithCommunicationIssues: Host
                                       (Host-234),
      Test clusterpartition health : green
      Test vsanvmknic health : green
      Test smallping health : green
      Test largeping health : green
      Test vmotionpingsmall health : green
      Test vmotionpinglarge health : green
      Test hostlatencycheck health : green
         NetworkLatencyCheckResults: FromHost  ToHost  NetworkLatency(Ms)  NetworkLatencyCheckResult
                                     (Host-227, Host-236, 0.18, Green), (Host-227, Host-234, 0.23, Green), (Host-227, Host-238, 0.16, Green), (Host-227, Host-232, 0.12, Green), (Host-234, Host-232, 0.27, Green),
                                     (Host-234, Host-238, 0.31, Green), (Host-234, Host-236, 0.29, Green), (Host-234, Host-227, 0.26, Green), (Host-236, Host-227, 0.1, Green), (Host-236, Host-234, 0.12, Green),
                                     (Host-236, Host-238, 0.1, Green), (Host-236, Host-232, 0.1, Green), (Host-232, Host-236, 0.1, Green), (Host-232, Host-238, 0.1, Green), (Host-232, Host-234, 0.12, Green),
                                     (Host-232, Host-227, 0.11, Green), (Host-238, Host-232, 0.15, Green), (Host-238, Host-236, 0.11, Green), (Host-238, Host-234, 0.23, Green), (Host-238, Host-227, 0.12, Green),
   Group cloudhealth health : yellow
      Test vsancloudhealthceipexception health : yellow
   Group vum health : yellow
      Test vumconfig health : yellow
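
When the summary log is long, the failing tests can be pulled out with a small filter. The following is a minimal sketch, assuming the log keeps the "Test <name> health : <status>" layout shown in the excerpt above. The function name failed_health_tests is hypothetical and is not part of any VMware or Dell tool.

```shell
#!/bin/sh
# failed_health_tests (hypothetical helper): read the text of
# vmware-vsan-health-summary-result.log on stdin and print the name
# of every test whose status is red.
# Assumes lines of the form "Test <name> health : <status>".
failed_health_tests() {
    awk '$1 == "Test" && $3 == "health" && $NF == "red" { print $2 }'
}

# Example usage against a collected copy of the log:
#   failed_health_tests < vmware-vsan-health-summary-result.log
```

For the excerpt above, the only red test is hostconnectivity, while the ping and latency tests stay green. That matches the symptom of a query timeout rather than a network fault.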

vmware-vsan-health-service.log:

2019-08-14T12:55:54.403Z ERROR vsan-mgmt[Thread-590807] [VsanPyVmomiProfiler::InvokeMethod opID=noOpId] Timed out for host nc1v02ps12.corp.ukrail.net in invoke-method:vsanSystem:QueryHostStatus
2019-08-14T12:55:54.403Z INFO vsan-mgmt[Thread-590807] [VsanPyVmomiProfiler::logProfile opID=noOpId]   invoke-method:vsanSystem:QueryHostStatus: 8.44s:nc1v02ps12.corp.ukrail.net
2019-08-14T12:55:54.403Z ERROR vsan-mgmt[Thread-590807] [VsanClusterHealthSystemImpl::PerHostQueryNetworkHealth opID=noOpId] Exception in host nc1v02ps12.corp.ukrail.net:
Traceback (most recent call last):
  File "C:\Program Files\VMware\vCenter Server\vsan-health\pyMoVsan\VsanClusterHealthSystemImpl.py", line 1004, in PerHostQueryNetworkHealth
    SetHostClusterUuid(host, hostInfos[host], fetchHostStatus=True)
  File "C:\Program Files\VMware\vCenter Server\vsan-health\pyMoVsan\VsanClusterHealthSystemImpl.py", line 784, in SetHostClusterUuid
    status = vs.QueryHostStatus()
..
..
..
    return self._sslobj.read(len, buffer)
  File "C:\Program Files\VMware\vCenter Server\python\lib\ssl.py", line 583, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
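
To see which hosts hit the query timeout and when, the "Timed out for host" entries can be extracted from the health service log. This is a sketch only, assuming the message wording shown in the excerpt above. The helper name timed_out_hosts is hypothetical.

```shell
#!/bin/sh
# timed_out_hosts (hypothetical helper): read vmware-vsan-health-service.log
# text on stdin and print "<timestamp> <hostname>" for every
# "Timed out for host <hostname>" entry.
timed_out_hosts() {
    awk '/Timed out for host/ {
        for (i = 2; i < NF; i++)
            if ($i == "host" && $(i - 1) == "for") { print $1, $(i + 1); break }
    }'
}

# Example usage: count timeouts per host in a collected copy of the log.
#   timed_out_hosts < vmware-vsan-health-service.log | awk '{ print $2 }' | sort | uniq -c
```

Comparing these timestamps with the rescan entries in hostd.log on the same host can help confirm whether the timeouts line up with rescans.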

Cause

By default, PTAgent performs a SCSI device and bus rescan every 3 minutes. This query looks for new disks or other hardware devices attached to the server, and it has been extended to also check other block devices such as iSCSI. The ESXi storage stack performs its own device and bus rescan every 5 minutes by default, looking for the same changes. A device and bus rescan is an expensive operation and can block parts of the SCSI bus until it completes, which has the knock-on effect of increased latency. If many storage operations are already in flight, they may need to finish before the rescan can proceed.

We have identified that PTAgent and ESXi sometimes run their rescans at essentially the same time. This can delay responses while the rescans complete, which occasionally triggers a vSAN health alarm. The alarm is not raised because the health test found a real fault; the test is marked as failed because the vSAN health query timed out.
Overall, the issue is one of timing. vSAN health has a short timeout for query responses and has no retry or other verification mechanism to confirm a fault. Simultaneous rescans from PTAgent and ESXi (along with other queued I/O) may cause a delay long enough to trigger the vSAN health timeout.
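
The timing overlap can be illustrated with simple arithmetic. The sketch below is an idealized model, not a description of how either scheduler is implemented. It assumes both timers fire on whole-minute boundaries, with the first timer starting at minute 0 and the second starting at a given offset. The function name rescan_overlaps and its parameters are hypothetical.

```shell
#!/bin/sh
# rescan_overlaps (hypothetical, idealized model):
#   $1 = interval of timer A in minutes (fires at 0, a, 2a, ...)
#   $2 = interval of timer B in minutes (fires at offset, offset+b, ...)
#   $3 = start offset of timer B in minutes
#   $4 = length of the observation window in minutes
# Prints every minute in the window at which both timers fire together.
rescan_overlaps() {
    a=$1; b=$2; offset=$3; window=$4
    t=0
    while [ "$t" -lt "$window" ]; do
        if [ $((t % a)) -eq 0 ] && [ "$t" -ge "$offset" ] \
            && [ $(((t - offset) % b)) -eq 0 ]; then
            echo "$t"
        fi
        t=$((t + 1))
    done
}

# 3-minute and 5-minute timers, both starting at minute 0, over one hour:
#   rescan_overlaps 3 5 0 60    # prints 0, 15, 30, 45 (one per line)
# Same timers with the second one starting a minute later:
#   rescan_overlaps 3 5 1 60    # prints 6, 21, 36, 51 (one per line)
```

Because 3 and 5 share no common factor, in this model the two schedules coincide once every 15 minutes whatever the whole-minute offset. This is consistent with an alarm that clears on retest and then reappears.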

Resolution

The workaround is to disable the PTAgent rescan and leave the default ESXi storage rescan in place. This effectively uses the same rescan interval that VMware uses by default with vSAN. There is no risk to data or I/O operations with this change. It does mean that rescans occur less frequently, but adding or removing disks is not a common occurrence. If a disk is added through hot plug, the HBA has special logic to inform the operating system (ESXi) that there has been a disk change. In other cases disks are added or removed while the server is powered off, and a rescan is part of the boot sequence. There are some cases where parallel rescans may be desirable, such as replication failover or new disks added from an iSCSI, FC, or FCoE array. However, failover mechanisms such as SRM have logic to handle this through additional rescans, or use features of those disk types (like RSCN in FC). None of those scenarios should apply in this case, and even when they are in play ESXi handles them well.

Workaround:

NOTE: the following behavior is properly implemented on PTAgent 1.9.2 and above.

Check the VxRail release notes for the PTAgent version included in the current release.

1) Check whether rescans are still being triggered:

[root@vs218:~] grep -w "Dispatch rescan" /var/run/log/hostd.log |tail -10
2019-10-17T12:16:06.080Z info hostd[2106293] [Originator@6876 sub=Solo.VmwareCLI opID=esxcli-0a-ae0b user=root] Dispatch rescan
2019-10-17T12:16:07.231Z info hostd[2106293] [Originator@6876 sub=Solo.VmwareCLI opID=esxcli-0a-ae0b user=root] Dispatch rescan done
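
The grep above matches both the "Dispatch rescan" and the "Dispatch rescan done" entries. To list only the start time of each rescan, so that the spacing between rescans is easier to read, a filter such as the following can be used. It is a sketch that assumes the hostd.log line format shown above. The helper name rescan_starts is hypothetical.

```shell
#!/bin/sh
# rescan_starts (hypothetical helper): read hostd.log text on stdin and
# print the timestamp of each line that ends in "Dispatch rescan",
# skipping the matching "Dispatch rescan done" lines.
rescan_starts() {
    awk '$(NF - 1) == "Dispatch" && $NF == "rescan" { print $1 }'
}

# Example usage on the host:
#   rescan_starts < /var/run/log/hostd.log | tail -10
```

Start times spaced roughly 3 minutes apart would be consistent with the default PTAgent interval described in the Cause section.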


2) Put the ESXi host into maintenance mode.

3) Disable the rescan by applying these commands:

       # /opt/dell/DellPTAgent/tools/pta_cfg set in_band_device_scan_enabled=false
       # /opt/dell/DellPTAgent/tools/pta_cfg set in_band_device_poll_interval_minutes=0

4) Confirm that it is disabled:

       # /opt/dell/DellPTAgent/tools/pta_cfg list |grep "in_band_device"
           in_band_device_poll_interval_minutes => 0
           in_band_device_scan_enabled          => False
       # grep -A4 in_band_device_scan_enabled /scratch/dell/config/PTAgent.config
           "in_band_device_scan_enabled": {
               "value": false,
               "defaultValue": true,
               "description": "On ESXi platforms, controls if PT-agent should force adapter scans periodically (controlled by in_band_device_poll_interval_minutes) before probing storage devices."
           },
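
When many nodes need checking, the confirmation in step 4 can be turned into a pass/fail test. The sketch below only parses the "name => value" output format shown above and does not call pta_cfg itself. The function name ptagent_scan_disabled is hypothetical.

```shell
#!/bin/sh
# ptagent_scan_disabled (hypothetical helper): read the output of
# "pta_cfg list" on stdin and succeed (exit status 0) only if the poll
# interval is 0 AND the in-band scan is reported as False.
# Assumes the "name => value" layout shown in step 4.
ptagent_scan_disabled() {
    awk '
        $1 == "in_band_device_poll_interval_minutes" && $2 == "=>" && $3 == "0" { interval_ok = 1 }
        $1 == "in_band_device_scan_enabled" && $2 == "=>" && tolower($3) == "false" { scan_ok = 1 }
        END { exit !(interval_ok && scan_ok) }
    '
}

# Example usage on the host:
#   /opt/dell/DellPTAgent/tools/pta_cfg list | ptagent_scan_disabled \
#       && echo "PTAgent rescan disabled" || echo "PTAgent rescan still enabled"
```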

5) Restart the PTAgent service on the node with:

       # /etc/init.d/DellPTAgent restart

6) Exit maintenance mode.

7) Repeat the same steps for all nodes in the cluster.
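
For clusters with many nodes, it can help to generate the per-node command sequence up front and review it before running anything. The sketch below only prints text and executes nothing. The function name ptagent_disable_plan and the host names in the usage line are hypothetical. The maintenance mode steps are left as comments because this article does not prescribe a specific command for them.

```shell
#!/bin/sh
# ptagent_disable_plan (hypothetical helper): for each host name given as an
# argument, print the commands from steps 3 to 5 in order, with the
# maintenance mode steps (2 and 6) shown as reminder comments.
# This function only prints; it does not change any host.
ptagent_disable_plan() {
    for host in "$@"; do
        echo "# --- $host ---"
        echo "# step 2: put $host into maintenance mode"
        echo "/opt/dell/DellPTAgent/tools/pta_cfg set in_band_device_scan_enabled=false"
        echo "/opt/dell/DellPTAgent/tools/pta_cfg set in_band_device_poll_interval_minutes=0"
        echo "/opt/dell/DellPTAgent/tools/pta_cfg list | grep in_band_device"
        echo "/etc/init.d/DellPTAgent restart"
        echo "# step 6: take $host out of maintenance mode"
    done
}

# Example usage (host names are placeholders):
#   ptagent_disable_plan node01 node02 node03
```

Working through one node at a time, and exiting maintenance mode before moving to the next node, follows the order of steps 2 to 6 above.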

Additional Information

Turning off the PTAgent rescan raises no storage capability or functionality concerns, because ESXi already performs its own rescan at regular intervals.
Even if the in-band device scan is disabled, PTAgent still scans at start-up. If the symptom persists after disabling the scan, investigate why PTAgent is being restarted repeatedly.

Article Properties

Affected Product

VxRail Appliance Family

Product

VxRail Appliance Family, VxRail Appliance Series, VxRail E Series Nodes, VxRail E460, VxRail E560, VxRail E560F, VxRail P470, VxRail P570, VxRail P570F, VxRail S570, VxRail Software
