
Dell EMC VxRail: ESXi host shows 'not responding' or 'disconnected' in vCenter (Customer Correctable)

Summary: This article describes what to check or do about hosts that are not manageable by the vCenter when they are expected to be. In vCenter, these hosts show in the navigator with (disconnected) or (not responding) next to the hostname. Steps in this article may also be relevant to other issues with host responsiveness, even if the host does not necessarily look disconnected from vCenter. ...


Symptoms

A host that was previously showing up in a cluster in vCenter becomes marked as disconnected or not responding. Management and monitoring of the host are no longer available in the vCenter web client. Any VMs that were running on that host have either been moved to other available hosts (by HA) or also show as unavailable or disconnected.

A host shows 'disconnected' or 'not responding' in the vCenter web client. The host can still show as running, and Virtual Machines (VMs) may still be running on it. However, the host is not manageable from vCenter and may not respond properly in other ways. The VMs running on the host cannot be migrated to other hosts in the cluster using vMotion.

Cause

This situation is due to certain services on the host that are not running properly. While there can be other reasons (see the Additional Information section below for links to additional articles), unresponsive host services can be caused by the host running out of available resources (usually memory). An ESXi host shuts down resource-intensive services such as hostd rather than take memory away from running VMs that need it.

The vCenter cannot communicate with the host due to the lack of these services and general difficulty with a lagging host that does not have enough available resources to function properly. The result is a host that fails to respond to vCenter and is slow and difficult to access or manage in other ways. In these cases, the host's storage is still fully functional within the VSAN datastore and its VMs usually continue running properly for a time. Rebooting the host to get back memory that was not properly reclaimed by the host after use (a memory leak) is frequently the only way to recover the host's manageability and reconnect it to vCenter.

See the Detailed Description under Additional Information below for more information.

Resolution

Users:
If, after following the troubleshooting steps below, the host is still not connecting to the vCenter properly, or if you have questions or need assistance, contact Dell EMC Technical Support or your Authorized Service Representative to open a Service Request.

Initial checks: 
  1. Check if VMs known to be on the host (these are likely also showing as disconnected in the vCenter) are still running.
  2. Try to open a vSphere web client to the host (https://<host's management IP>), a PuTTY session, or the Direct Console User Interface ("DCUI"; open a console through IDRAC on Dell platforms or BMC on Quanta). Inability to connect to an online host (VMs are running) by PuTTY or the vSphere client, severe lag in the DCUI (or inability to reach it at all), and inability to restart services or run other commands from the command line all indicate a lack of available host resources, as described in the Detailed Description.
  3. In the DCUI, press Alt + F11 to see if the host monitor indicates an issue such as the hostd service having stopped.
  4. Check to see if you can ping the host. If not, it may be unresponsive with a purple diagnostic screen ("PSOD"), powered down (turn it on if possible), or have a networking issue. A quick reachability sketch follows this list.
  5. If the disconnected host does not seem to have an issue with services (pings fine, web client to host works, DCUI looks fine, and so forth), you can try right-clicking the host in the vCenter's 'Navigator' under 'Hosts and Clusters' and selecting the options for connecting the host.
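As a quick illustration of these initial checks, the commands below can be run from an administrator workstation (the IP address is a placeholder for the host's management IP, and SSH must be enabled on the host for the second command to work):
  • Basic reachability: ping <host's management IP>
  • SSH reachability: ssh root@<host's management IP> (a connection that hangs or is refused on an otherwise running host suggests resource starvation or disabled SSH)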
If you can run commands on the host:
  1. Check the status of 'hostd' and 'vpxa': /etc/init.d/hostd status && /etc/init.d/vpxa status
  2. Try restarting 'hostd' and 'vpxa': /etc/init.d/hostd restart && /etc/init.d/vpxa restart
  3. If the commands worked, but the host is still not connected, try restarting all management services:
  • Check for LACP (Do not restart all services if LACP is being used as this can cause other issues): # localcli network vswitch dvs vmware lacp config get   
  • If no LACP, restart all management services on the host with one of these commands:
    • With vSphere 6.5 and later: # services.sh restart & tail -f /var/log/jumpstart-stdout.log
Tailing the log does not stop when services are all restarted. Use Ctrl + C once done (about 5-10 minutes).
    • With vSphere versions before 6.5 (VxRail code below 4.5.x), run: services.sh restart
  4. Check if the host is reconnected after all services are restarted. (A consolidated command sketch follows this list.)
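The sequence below is a minimal consolidated sketch of steps 1 through 4, run from the host's shell (it assumes the host accepts commands and that LACP is not in use):
  • # /etc/init.d/hostd status && /etc/init.d/vpxa status (check both agents)
  • # /etc/init.d/hostd restart && /etc/init.d/vpxa restart (restart only the two agents)
  • # localcli network vswitch dvs vmware lacp config get (confirm LACP is not configured before a full restart)
  • # services.sh restart & tail -f /var/log/jumpstart-stdout.log (vSphere 6.5 and later; press Ctrl + C when the restart completes)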
If steps 3 and 4 work, the host should automatically reconnect to the vCenter. You may have to refresh the vCenter web page or manually connect the host again. You should then place the host in Maintenance Mode (Ensure Accessibility), migrating any running VMs off the host. Before rebooting, follow these steps to collect the logs that are required if further analysis of the cause is wanted:
  1. Create hostd dump from memory by running the command on the host: vmkbacktrace -n hostd -c -w
  2. Check that it is there with the output of this command: ls -alrth /var/core/hostd*
Return looks like: -rwx------    1 root     root       32.8M Aug 15 05:10 /var/core/hostd-worker-zdump.001
  3. Connect to the host with WinSCP, FileZilla, or a similar client, and download the file. (A command-line scp alternative is sketched after this list.)
  4. Reboot the host, ensure it is connecting to the vCenter and looks healthy, and turn VMs on or migrate VMs back to the host as wanted.
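If a graphical SCP client is not available, the dump file can also be copied from an administrator workstation using scp (a minimal sketch, assuming SSH is enabled on the host; the file name matches the ls output above):
  • scp root@<host's management IP>:/var/core/hostd-worker-zdump.001 . (copies the dump to the current local directory)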
If the host is turned on and pingable, but you cannot SSH to it:
  • Try to connect to the host through IDRAC (BMC on Quanta) or a KVM to run commands through a console shell. If that works when a PuTTY (or other SSH) connection is not possible, try to restore connectivity following the same steps above.
  • Free up some memory by shutting down some of the host's VMs. Doing so can sometimes be enough to allow the steps above to work, without having to power off every VM on the host. If you attempt this, High Availability (HA) may automatically restart those VMs on other connected hosts in the cluster.
  • To check memory state, run "esxtop" on the host, then press 'm' to check memory usage.
  • Also, see VMware article https://kb.vmware.com/s/article/1006160 for information about manually registering VMs on other hosts. This allows you to minimize the time that VMs are shut down before or during a reboot. (A registration sketch follows this list.)
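As a sketch of that manual registration on a healthy host (the datastore and VM paths are placeholders; see the VMware article above for the authoritative procedure):
  • Register the VM: # vim-cmd solo/registervm /vmfs/volumes/<datastore>/<VM name>/<VM name>.vmx (returns the VM's inventory ID)
  • Power it on: # vim-cmd vmsvc/power.on <vmid> (use the ID returned by the previous command)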
Steps for rebooting a disconnected host to restore connectivity to vCenter:
  1. Check what VMs are running on the host through vCenter. This information may be stale since the host is not connected to vCenter, but you can usually identify the VMs on the host by seeing which ones also show as disconnected in the cluster's Related Objects > Virtual Machines/VMs tab.
  2. Remote connect or SSH to the virtual machines directly and shut them down.
You can try doing the least important ones first to see if you can restore services and reconnect. If needed, you can register VMs on another host and power them up (if prompted, select 'I moved it') right after shutting them down on the disconnected host.
  • Windows: Use RDP or other access software to bring up the VM and shut it down.
  • RecoverPoint VMs: Log in through PuTTY as boxmgmt and follow prompts to shut down.
  • Secure Remote Services: Log in through PuTTY as admin and run the poweroff command or the shutdown now command.
  • Linux: The command 'shutdown now' shuts the VM down, whereas 'shutdown -r now' reboots it.
Alternative VM shutdown method, if you have a command line available on the host (through PuTTY or the DCUI shell) but cannot access the VMs directly: see KB article https://kb.vmware.com/kb/1014165 and the short worked sequence after the commands below.
  • Command to see if a VM is running on a node and get the World ID: # localcli vm process list
  • Command to shut a VM down: # localcli vm process kill -t soft -w <worldID>
Using 'soft', as above, is the most graceful shutdown. If that does not work, use 'hard' instead to perform an immediate shutdown. The option 'force' should be used as a last resort.
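For example, a minimal sequence to find and gracefully stop a single VM from the host's shell might look like this (the World ID shown is a placeholder taken from the list output):
  • # localcli vm process list (note the 'World ID' value listed under the VM's name)
  • # localcli vm process kill -t soft -w 1234567 (replace 1234567 with the World ID from the previous command)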
  3. In vCenter, ensure that the rest of the cluster is healthy and that there is not a VSAN resync or any other situation that prevents you from safely removing the host from the VSAN temporarily for a reboot, even if the host were to not come back up right away.
  4. Place the host into Maintenance Mode, if possible. Use the vSphere web client, if available. If not, open the shell from the DCUI with Alt + F1, log in as root, and use the command line to place the host in Maintenance Mode. Ensure that any VMs on the host are powered down, then place the host in Maintenance Mode from the command line using one of the following options (a verification sketch follows the options):
  • Ensure Accessibility: esxcli system maintenanceMode set --enable true -m ensureObjectAccessibility
  • No data migration: esxcli system maintenanceMode set --enable true -m noAction
  • Full Data Migration: esxcli system maintenanceMode set --enable true -m evacuateAllData
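For example, after choosing one of the options above, the state can be confirmed from the same shell (a minimal sketch using the Ensure Accessibility option):
  • # esxcli system maintenanceMode set --enable true -m ensureObjectAccessibility
  • # esxcli system maintenanceMode get (should report 'Enabled' once the operation completes)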
Before you reboot, check again whether you can get a console, as this is your last chance to collect any logs that are cleared by a reboot.
  5. Reboot the host either through IDRAC or BMC controls, or on the command line using: reboot
  6. Once the host is up, it should be back in the vCenter. If it is not, attempt to connect it to the cluster manually. If it still does not reconnect, you may also want to reboot the vCenter. For vCenter appliances that do not have an embedded PSC, reboot the PSC first and then the VCSA.
  7. Gather a vCenter support bundle (including host logs) and prepare a timeline of related events if an analysis of the cause is requested. See KB article https://support.emc.com/kb/333684 for information about how to gather logs in a VxRail environment. If a Service Request is opened with VxRail Support, hardware logs from the host and a VxRail Manager log bundle are needed as well. (A host-side log bundle sketch follows this list.)
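If vCenter-driven log collection is not possible, a host-side support bundle can also be generated directly from the ESXi shell (a minimal sketch; the exact output path and file name vary by build):
  • # vm-support (writes a compressed support bundle, typically under /var/tmp)
  • # ls -alh /var/tmp/esx-*.tgz (confirm the bundle is present and note its name for download)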

Additional Information

Detailed Description (explains why host reboot is typically needed):

Most commonly, a host that is disconnected from the vCenter has a hostd service that is unavailable (or has been stopped by the host). This service is required for the host to be manageable by the vCenter. It is memory intensive, so the host stops it when it is suffering from a lack of available memory. Some other services act in a similar way. Restarting services (specific ones or all management services) on a disconnected host should allow the host to reconnect to the vCenter again. However, the reason the service is not running is often insufficient resources (CPU, memory) available to the host. If this is caused by something like a "hostd memory leak," a reboot is required to get the host running properly again.

Memory leaks happen when something uses system memory and that memory cannot be properly reclaimed afterwards. The 'hostd' service is frequently involved but does not have to be. ESXi stops that service to try to keep the host from becoming unresponsive. Host management services (not just hostd) can also stop responding and be unable to restart due to a lack of available resources. Once hostd (also vpxa and some others) stops running, a host is no longer connected to the vCenter. Sometimes it is possible to restart the services and reconnect the host to vCenter, especially if enough resources can be freed up first.

Rebooting the host resolves these connection issues automatically, as unreclaimed memory becomes available again and all services are reinitiated during boot. Even if a host can be reconnected by getting hostd, vpxa, and so forth running again, it is best to then migrate running VMs off the host (which you can do now that the host is connected to vCenter), put it in Maintenance Mode, and reboot to ensure any locked-up resources become available again. This helps avoid the host returning to the unresponsive state in the short term. Beyond that, it is important to keep up with ESXi patches and upgrades that address underlying causes of memory leaks and other issues that can cause services to stop responding (or otherwise lead to disconnected hosts).
     
The verification and troubleshooting steps in the Resolution section can be followed to attempt to resolve these situations in the best way possible. While the process attempts less impactful actions first, most cases do end up requiring a host reboot. This is often inconvenient or not an option for the customer at the time. A host can often run fine for a while even while disconnected from vCenter, and the VMs on the host and the VSAN participation of its drives usually continue. Rebooting the host as soon as possible is best, but it is fine to wait for a maintenance window or off-production hours if needed. A host that has run out of memory will likely eventually crash, and restarting its VMs is not likely to work at that point. Unless there are issues elsewhere in the cluster, though, fault tolerance (HA, VSAN protection, the ability to bring up VMs on other hosts, and so forth) should still be in place.
 
Although rebooting the host is the main way to fix nonresponsive or disconnected hosts, the logs that are required to analyze the cause of these issues are usually not available after a reboot. If you can get the host to respond enough to generate and collect the necessary logs, any further analysis is more likely to identify an underlying cause. If these logs cannot be collected because the host was unresponsive, or a reboot has already been done or is chosen for a faster resolution (collecting the logs takes a while and frequently may not be worth it), there are still some logs that might help narrow down a cause or help determine the best recommendations going forward. At a minimum, collect the vCenter log bundle with the faulted host's logs and one healthy host's logs to compare against.

Other articles related to nonresponsive/disconnected (in vCenter) hosts:
  • Sometimes, the smart agent can cause a host to show 'not responding' in vCenter. See VMware article https://kb.vmware.com/kb/2145106 and KB article https://support.emc.com/kb/502016.

Affected Products

VxRail Appliance Family

Products

Pivotal Ready Architecture, VxRail 460 and 470 Nodes, VxRail Appliance Family, VxRail Appliance Series, VxRail G410, VxRail G Series Nodes, VxRail E Series Nodes, VxRail E460, VxRail E560, VxRail E560F, VxRail G560, VxRail Gen2 Hardware, VxRail P470 , VxRail P570, VxRail P570F, VxRail S Series Nodes, VxRail S470, VxRail S570, VxRail Software, VxRail V Series Nodes, VxRail V470, VxRail V570, VxRail V570F ...