PowerFlex ESXi hosts become unresponsive during NFC backup or replication operations
Summary: This article describes a cause of the hostd service on ESXi becoming non-responsive, which in turn makes the server unresponsive to management operations from vCenter. This happens when NFC operations such as backup or replication jobs are performed on disks with an IO filter attached. ...
Symptoms
ESXi hosts become unresponsive, with hostd failing or no longer responding, when NFC operations such as backup or replication jobs are performed on disks with an IO filter attached. Messages in /var/run/log/hostd.log show IoTracker reporting hostd threads that take increasingly long to complete:
2022-06-23T09:59:46.880Z warning hostd[3044912] [Originator@6876 sub=IoTracker] In thread 3049831, preadv("/vmfs/devices/vdfm/1322229d-vdfm") took over 19214 sec.
2022-06-23T09:59:56.883Z warning hostd[3045356] [Originator@6876 sub=IoTracker] In thread 3049832, preadv("/vmfs/devices/vdfm/1322229d-vdfm") took over 19224 sec.
2022-06-23T09:59:56.883Z warning hostd[3045356] [Originator@6876 sub=IoTracker] In thread 3049833, preadv("/vmfs/devices/vdfm/1322229d-vdfm") took over 19224 sec.
2022-06-23T09:59:56.883Z warning hostd[3045356] [Originator@6876 sub=IoTracker] In thread 3049831, preadv("/vmfs/devices/vdfm/1322229d-vdfm") took over 19224 sec.
The host can eventually be returned to a responding state by rebooting it, after shutting down or moving any active VMs on it.
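To check a host for this symptom, the IoTracker warnings shown above can be picked out of a copy of hostd.log with a short script. The following is a minimal sketch in Python, not a supported tool: the function names and the 60-second default threshold are illustrative assumptions, and the pattern is modelled only on the sample lines in this article.

```python
import re

# Pattern modelled on the sample IoTracker warnings in this article.
IOTRACKER_RE = re.compile(
    r'^(?P<ts>\S+)\s+warning\s+hostd\[\d+\].*sub=IoTracker\]\s+'
    r'In thread (?P<thread>\d+), (?P<call>\w+)\("(?P<path>[^"]+)"\) '
    r'took over (?P<secs>\d+) sec\.'
)


def find_slow_io(lines, threshold_sec=60):
    """Return (timestamp, thread, path, seconds) for each IoTracker
    warning whose reported duration is at least threshold_sec.
    The 60-second default is an arbitrary illustrative value."""
    hits = []
    for line in lines:
        match = IOTRACKER_RE.search(line)
        if match and int(match.group("secs")) >= threshold_sec:
            hits.append((
                match.group("ts"),
                int(match.group("thread")),
                match.group("path"),
                int(match.group("secs")),
            ))
    return hits


def scan_file(path="/var/run/log/hostd.log", threshold_sec=60):
    """Scan a hostd log file (or a copy of one) and return the matches."""
    with open(path, errors="replace") as handle:
        return find_slow_io(handle, threshold_sec)
```

Calling scan_file() on the log, or on a copy taken from a support bundle, returns one tuple per matching warning. Durations that keep growing for the same thread and device path, as in the sample above, match the symptom described in this article.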
Cause
This issue occurs when the hostd worker thread limit is exhausted during specific NFC operations.
Resolution
Workaround:
A workaround is provided in the VMware article referenced at the end of this article. After applying it, monitor backup and replication operations for any impact on their normal behavior.
Users can work around this issue by limiting NFC's ability to overwhelm hostd. Set the maximum number of asynchronous NFC threads to 2 in the hostd configuration.
1. Export the hostd configuration settings from ConfigStore to a JSON file using the following command:
configstorecli config current get -c esx -g services -k hostd -outfile tmp.json
2. Edit the tmp.json file:
vi tmp.json
3. Add the "max_async_threads": 2 line to the "nfcsvc" section as shown below and save the file (the other options may differ in your environment). Because this is JSON, make sure the line before the new entry ends with a comma.
"nfcsvc": {
   "log_level": "INFO",
   "max_memory": 100663296,
   "max_stream_memory": 35651584,
   "max_async_threads": 2
}
4. Run the following command to apply the file to the ConfigStore database:
configstorecli config current set -c esx -g services -k hostd -infile tmp.json
5. Run the following command to restart the hostd service:
/etc/init.d/hostd restart
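As an alternative to editing tmp.json by hand in steps 2 and 3, the change can be scripted. The following is a minimal sketch in Python, assuming a Python 3 interpreter is available where the file is edited. The helper names are hypothetical, and because this article does not state where "nfcsvc" sits within the exported file, the sketch searches the whole document for the first section with that name.

```python
import json


def find_section(node, name):
    """Depth-first search for the first JSON object stored under key `name`."""
    if isinstance(node, dict):
        if isinstance(node.get(name), dict):
            return node[name]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_section(child, name)
        if found is not None:
            return found
    return None


def set_nfc_async_threads(config, limit=2):
    """Set max_async_threads in the "nfcsvc" section, keeping other options."""
    section = find_section(config, "nfcsvc")
    if section is None:
        raise KeyError('no "nfcsvc" section found in the exported configuration')
    section["max_async_threads"] = limit
    return config


def patch_file(path="tmp.json", limit=2):
    """Load the exported hostd settings, apply the limit, and write them back."""
    with open(path) as handle:
        config = json.load(handle)
    set_nfc_async_threads(config, limit)
    with open(path, "w") as handle:
        json.dump(config, handle, indent=3)
        handle.write("\n")
```

Calling patch_file("tmp.json") between steps 1 and 4 would replace the manual edit. Keep a copy of the exported file before modifying it, and review the result before applying it with configstorecli.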
This workaround is based on VMware KB article 89650: https://kb.vmware.com/s/article/89650