Article Number: 000056125
We can see vSAN health alarm being generated displayed in vCenter:
2019-08-14T12:56:01.422Z INFO vsan-mgmt[EventMonitor] [VsanEventUtil::_generateVcEvent opID=noOpId] Generate VC event for managed object NC1V01 with testName=Hosts with connectivity issues, testId=com.vmware.vsan.health.test.hostconnectivity, preStatus=green, curStatus=redFrom vmware-vsan-health-summary-result.log we can see vSAN health host connect issues:
2019-08-14T12:56:01.355Z INFO vsan-mgmt[EventMonitor] [VsanHealthSummaryLogUtil::PrintHealthResult opID=noOpId] Cluster NB1X01 Overall Health : red Group network health : red Test hostdisconnected health : green Test hostconnectivity health : red HostsWithCommunicationIssues: Host (Host-234), Test clusterpartition health : green Test vsanvmknic health : green Test smallping health : green Test largeping health : green Test vmotionpingsmall health : green Test vmotionpinglarge health : green Test hostlatencycheck health : green NetworkLatencyCheckResults: FromHost ToHost NetworkLatency(Ms) NetworkLatencyCheckResult (Host-227, Host-236, 0.18, Green), (Host-227, Host-234, 0.23, Green), (Host-227, Host-238, 0.16, Green), (Host-227, Host-232, 0.12, Green), (Host-234, Host-232, 0.27, Green), (Host-234, Host-238, 0.31, Green), (Host-234, Host-236, 0.29, Green), (Host-234, Host-227, 0.26, Green), (Host-236, Host-227, 0.1, Green), (Host-236, Host-234, 0.12, Green), (Host-236, Host-238, 0.1, Green), (Host-236, Host-232, 0.1, Green), (Host-232, Host-236, 0.1, Green), (Host-232, Host-238, 0.1, Green), (Host-232, Host-234, 0.12, Green), (Host-232, Host-227, 0.11, Green), (Host-238, Host-232, 0.15, Green), (Host-238, Host-236, 0.11, Green), (Host-238, Host-234, 0.23, Green), (Host-238, Host-227, 0.12, Green), Group cloudhealth health : yellow Test vsancloudhealthceipexception health : yellow Group vum health : yellow Test vumconfig health : yellow
vmware-vsan-health-service.log:
2019-08-14T12:55:54.403Z ERROR vsan-mgmt[Thread-590807] [VsanPyVmomiProfiler::InvokeMethod opID=noOpId] Timed out for host nc1v02ps12.corp.ukrail.net in invoke-method:vsanSystem:Query HostStatus 2019-08-14T12:55:54.403Z INFO vsan-mgmt[Thread-590807] [VsanPyVmomiProfiler::logProfile opID=noOpId] invoke-method:vsanSystem:QueryHostStatus: 8.44s:nc1v02ps12.corp.ukrail.net 2019-08-14T12:55:54.403Z ERROR vsan-mgmt[Thread-590807] [VsanClusterHealthSystemImpl::PerHostQueryNetworkHealth opID=noOpId] Exception in host nc1v02ps12.corp.ukrail.net: Traceback (most recent call last): File "C:\Program Files\VMware\vCenter Server\vsan-health\pyMoVsan\VsanClusterHealthSystemImpl.py", line 1004, in PerHostQueryNetworkHealth SetHostClusterUuid(host, hostInfos[host], fetchHostStatus=True) File "C:\Program Files\VMware\vCenter Server\vsan-health\pyMoVsan\VsanClusterHealthSystemImpl.py", line 784, in SetHostClusterUuid status = vs.QueryHostStatus() .. .. .. return self._sslobj.read(len, buffer) File "C:\Program Files\VMware\vCenter Server\python\lib\ssl.py", line 583, in read v = self._sslobj.read(len, buffer) socket.timeout: The read operation timed out
By default PTAgent is set to perform a SCSI device and bus rescan every 3 minutes. This type of query is to look for new disks or other hardware devices that are attached to the server. It has been extended to also check other block devices such as iSCSI. The ESXi storage stack also performs its own device and bus rescan every 5 minutes by default also looking for the same. A device and bus rescan is an expensive operation and it can result in certain parts of the SCSI bus being blocked waiting for the operation to complete. This can have a knock on impact of increased latency waiting for the operation to complete. If there are a lot of storage operations already in flight, it may need to let them finish before it can go to the rescan.
We have identified that there are times that PTAgent and ESXi have rescans running at essentially the same time. This can result in a delay in a response while the rescans are completing which is occasionally triggering a vSAN health alarm. vSAN health is not triggering an alarm for a failed test, but the test it is running is marked as failed as the vSAN health query timed out.
Overall the issue is one of timing. vSAN health has a short timeout for queries to respond, and has no retry or other verification mechanism to confirm a fault. The rescan from PTAgent and ESXi running simultaneously (along with other queued I/O) may result in a delay long enough that it triggers the vSAN health timeout.
1) Check if rescans are indeed still triggering:
[root@vs218:~] grep -w "Dispatch rescan" /var/run/log/hostd.log |tail -10 2019-10-17T12:16:06.080Z info hostd[2106293] [Originator@6876 sub=Solo.VmwareCLI opID=esxcli-0a-ae0b user=root] Dispatch rescan 2019-10-17T12:16:07.231Z info hostd[2106293] [Originator@6876 sub=Solo.VmwareCLI opID=esxcli-0a-ae0b user=root] Dispatch rescan done
2) Put the ESXi host into maintenance mode. 3) Disable the rescan by applying these commands:
# /opt/dell/DellPTAgent/tools/pta_cfg set in_band_device_scan_enabled=false # /opt/dell/DellPTAgent/tools/pta_cfg set in_band_device_poll_interval_minutes=0
4) Confirm it is disabled with
# /opt/dell/DellPTAgent/tools/pta_cfg list |grep "in_band_device" in_band_device_poll_interval_minutes => 0 in_band_device_scan_enabled => False # grep -A4 in_band_device_scan_enabled /scratch/dell/config/PTAgent.config "in_band_device_scan_enabled": { "value": false, "defaultValue": true, "description": "On ESXi platforms, controls if PT-agent should force adapter scans periodically (controlled by in_band_device_poll_interval_minutes) before probing storage devices." },
5) Restart the PTAgent service on the node with:
# /etc/init.d/DellPTAgent restart
6) Exit maintenance mode.
7) Repeat same steps for all nodes in the cluster.
VxRail Appliance Family
VxRail Appliance Family, VxRail Appliance Series, VxRail E Series Nodes, VxRail E460, VxRail E560, VxRail E560F, VxRail P470, VxRail P570, VxRail P570F, VxRail S570, VxRail Software
17 Jun 2023
6
Solution