PowerProtect Data Manager: In the PPDM UI, the status of the search cluster shows that a particular search node is in a failed state
Summary: The search node becomes unresponsive, and indexing jobs remain in a queued state because they cannot run on failed nodes. This can happen with a Search node running release 19.16 or earlier. ...
Symptoms
On the search node that is in a failed state, go to /var/log and review the messages log. You see an entry similar to:
2024-07-08T10:00:12.049322-04:00 search_node_name kernel: [518834.025665][ C1] watchdog: BUG: soft lockup - CPU#1 stuck for 235970s! [nfsd:2692]
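To confirm that the node hit this condition, you can filter the messages log for lockup entries. A minimal sketch, using a sample line in place of the live log (the hostname and PID below are illustrative):

```shell
# Count soft-lockup entries; a sample line stands in for /var/log/messages.
log_line='2024-07-08T10:00:12 search_node kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 235970s! [nfsd:2692]'
printf '%s\n' "$log_line" | grep -c 'soft lockup'
# On the affected node: grep -c 'soft lockup' /var/log/messages
```

grep -c prints the number of matching lines, so any output greater than zero indicates the node logged a soft lockup.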
Affected versions: 19.16 and below
Investigated by Dell Engineering in PPDMESC-6808
Cause
The NFS daemon on the Search Cluster hits an OS-level "soft lockup." For more details about soft lockups, see:
https://www.suse.com/support/kb/doc/?id=000018705
Resolution
Workaround:
Log in to the PowerProtect Data Manager server and run the following command:
source /opt/emc/vmdirect/unit/vmdirect.env && /opt/emc/vmdirect/bin/infranodemgmt get -secret
This supplies the admin and root credentials for the search nodes. Open an SSH session as the admin user to the search node on which nfsd was unresponsive, switch to the root user, and run the following commands:
echo 20 > /proc/sys/kernel/watchdog_thresh
This command sets the watchdog threshold to 20. (The kernel reports a soft lockup when a CPU is stuck for longer than twice watchdog_thresh seconds, so this doubles the detection window compared with the default of 10.) However, this configuration change does not persist across a restart of the server. Make the following change so that it persists across server restarts:
echo "kernel.watchdog_thresh=20" > /etc/sysctl.d/99-watchdog_thresh.conf
sysctl -p /etc/sysctl.d/99-watchdog_thresh.conf
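The persistence step above can be sketched end to end. This sketch writes the drop-in to a temporary directory so it runs without root; on the search node the file belongs in /etc/sysctl.d/:

```shell
# Write the sysctl drop-in (temp dir here; /etc/sysctl.d/ on the search node).
dropin_dir=$(mktemp -d)
echo "kernel.watchdog_thresh=20" > "$dropin_dir/99-watchdog_thresh.conf"

# Show the resulting file. On the node, apply it with:
#   sysctl -p /etc/sysctl.d/99-watchdog_thresh.conf
# and confirm the running value with:
#   sysctl -n kernel.watchdog_thresh
cat "$dropin_dir/99-watchdog_thresh.conf"
```

Files in /etc/sysctl.d/ are read at boot, which is what makes the threshold change survive a restart.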
Permanent Fix: Upgrade to PowerProtect Data Manager 19.16 P2, or to the 19.17 and later releases.