PowerScale: rpcbind Fails During Update if Using a Custom sysctl Value
Summary: "rpcbind" fails to start correctly during upgrades to specific OneFS versions if a custom value is set for "kern.ipc.somaxconn".
Symptoms
After upgrading to one of the following OneFS versions:
- 9.7.1.3
- 9.10.0.0
Client access is interrupted across all protocols, and running isi auth commands on the cluster returns the following error:
p970-1# isi auth users list
Unable to connect to authentication daemon. Please wait until authentication daemon has restarted and retry.
Messages in /var/log/messages indicate a failure to connect to the Remote Procedure Call (RPC) server:
2024-11-25T14:59:51.084340+00:00 <1.3> p970-1(id1) isi_celog_capture[4169]: drive_d_connect: Failed to connect to RPC server at 127.0.0.1 (errno=Invalid argument, rpc clnt_stat=15); retrying 2 of 3.
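To check how often this message appears, the log can be searched for its fixed text. The following is a minimal sketch for a POSIX-style shell; the function name count_rpc_failures is hypothetical and not part of OneFS, and the search string is taken from the sample message above.

```shell
# Hypothetical helper: count log lines that contain the RPC connection
# failure text shown in the sample message.
# Usage: count_rpc_failures /var/log/messages
count_rpc_failures() {
    log_file="$1"
    # grep -c prints the number of matching lines (0 when none match).
    # "|| true" keeps a non-zero grep exit status from aborting a caller
    # that runs with "set -e".
    grep -c "Failed to connect to RPC server" "$log_file" 2>/dev/null || true
}
```

A count greater than zero, together with the isi auth error above, is consistent with this issue but does not by itself confirm it.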
Cause
A defect in the logic that evaluates this setting in the two impacted OneFS versions prevents rpcbind from starting correctly. The defect is addressed in all other versions.
Resolution
This issue can be avoided by removing the custom value before upgrading to an impacted OneFS version. If the cluster is already impacted, follow the recovery steps under After Impact.
Before Upgrade
Check for a custom value using the script below:
for file in /etc/mcp/templates/sysctl.conf /etc/mcp/override/sysctl.conf /etc/local/sysctl.conf; do
    grep "somaxconn" "$file" 2>/dev/null
done
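To record only the numeric value and the file it came from, the same check can be sketched as a small helper. This is a minimal sketch for a POSIX-style shell, assuming entries use the form kern.ipc.somaxconn=512; the function name print_somaxconn_values is hypothetical and not part of OneFS.

```shell
# Hypothetical helper: print "<file>: <value>" for each given file that
# sets kern.ipc.somaxconn. Assumes entries look like
# kern.ipc.somaxconn=512 (optional spaces around "=").
print_somaxconn_values() {
    for file in "$@"; do
        # Skip files that do not exist or cannot be read.
        [ -r "$file" ] || continue
        value=$(sed -n 's/^kern\.ipc\.somaxconn[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file")
        if [ -n "$value" ]; then
            printf '%s: %s\n' "$file" "$value"
        fi
    done
}
```

For example, print_somaxconn_values /etc/mcp/templates/sysctl.conf /etc/mcp/override/sysctl.conf /etc/local/sysctl.conf prints one line for each file that sets the value and nothing if none do.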
If there is output, write down the value (512 is common), then use the following script to remove the entry:
for file in /etc/mcp/templates/sysctl.conf /etc/mcp/override/sysctl.conf /etc/local/sysctl.conf; do
    sed -i bak "s/^kern.ipc.somaxconn.*//g" "$file" 2>/dev/null
done
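Before starting the upgrade, it may be worth confirming that no entry remains in any of the three files. The following is a minimal sketch for a POSIX-style shell; the function name somaxconn_removed is hypothetical and not part of OneFS.

```shell
# Hypothetical helper: return 0 only if none of the given files still
# contains a line starting with kern.ipc.somaxconn. Missing or
# unreadable files are treated as having no entry.
somaxconn_removed() {
    for file in "$@"; do
        if grep -q '^kern\.ipc\.somaxconn' "$file" 2>/dev/null; then
            echo "entry still present in $file"
            return 1
        fi
    done
    return 0
}
```

Calling somaxconn_removed with the three sysctl.conf paths and checking its exit status shows whether the removal took effect on the node where it is run.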
The upgrade can now be performed safely. After the upgrade, restore the setting to the value noted earlier with the following command, replacing $val with that value:
isi_sysctl_cluster kern.ipc.somaxconn=$val
Then reboot the nodes manually, one at a time, using whichever process is preferred.
After Impact
A reboot of the impacted nodes is required.
shutdown -r now