NetWorker: Server on Red Hat cluster service fails to start "unable to initialize multisubnet state"
Summary: A NetWorker server deployed on a Red Hat pacemaker (pcs) High Availability cluster fails to start. The NetWorker server daemon.raw shows "Unable to create the connection with 'nsrexec' to host 'localhost' with address '127.0.0.1' at port number 7937" and "unable to initialize multisubnet state" ...
Symptoms
- The NetWorker server pacemaker (pcs) resource nws is in a stopped state:
pcs status
Resource Group: NW_group:
* fs (ocf::heartbeat:Filesystem): Started NWrhelNode2.amer.lan
* ip (ocf::heartbeat:IPaddr): Started NWrhelNode2.amer.lan
* nws (ocf::EMC_NetWorker:Server): Stopped
- Attempts to start the nws resource using pcs resource enable nws or pcs resource debug-start nws fail.
- The NetWorker server's /nsr_share/nsr/logs/daemon.raw shows failures to connect to the nsrexec port on the local node. This occurs regardless of which node a service startup is attempted on:
172089 05/07/2024 10:23:52 AM nsrexecd RPC error Unable to create the connection with 'nsrexec' to host 'localhost6' with address '::1' at port number 7937.
173677 05/07/2024 10:23:52 AM nsrexecd RPC critical Check whether the client services are running on the host '::1'.
173680 05/07/2024 10:23:52 AM nsrexecd RPC error RPC client handle: Connection refused.
172089 05/07/2024 10:23:52 AM nsrexecd RPC error Unable to create the connection with 'nsrexec' to host 'localhost' with address '127.0.0.1' at port number 7937.
173677 05/07/2024 10:23:52 AM nsrexecd RPC critical Check whether the client services are running on the host '127.0.0.1'.
...
200170 05/07/24 10:24:18 nsrexecd RPC severe unable to initialize multisubnet state: RPC send operation failed; peer = ::1:[7937], errno = Connection reset by peer
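To pull only these signatures out of a long log, a filter along the following lines may help. This is a minimal sketch, assuming the log has already been rendered to text (for example with nsr_render_log). The function name nw_multisubnet_errors is hypothetical, and the patterns are copied from the messages shown above.

```shell
# Hypothetical helper: reads rendered NetWorker log text on stdin and prints
# only the lines that match the two failure signatures shown in this article.
nw_multisubnet_errors() {
    grep -E "Unable to create the connection with 'nsrexec'|unable to initialize multisubnet state"
}

# Example usage (assumes the shared log path used in this article):
#   nsr_render_log /nsr_share/nsr/logs/daemon.raw | nw_multisubnet_errors
```

The function exits non-zero when no matching line is found, which can be used to test whether a node shows this symptom at all.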
Cause
The pcs ip resource is misconfigured. The pacemaker ip resource setting conflicts with the address resolution of the NetWorker server resource.
root@NWrhelNode2:~# lcmap
type: NSR_CLU_TYPE;
clu_type: NSR_LC_TYPE;
interface version: 1.0;
type: NSR_CLU_VIRTHOST;
hostname: 192.168.25.30;
local: TRUE;
owned paths: /nsr_share;
clu_nodes: NWrhelNode1.amer.lan NWrhelNode2.amer.lan;
root@NWrhelNode2:~# pcs resource config ip | grep ip=
ip=192.168.25.30
root@NWrhelNode2:~# nslookup NWrhelClus
Server: 192.168.25.2
Address: 192.168.25.2#53
Name: NWrhelClus.amer.lan
Address: 192.168.25.20
root@NWrhelNode2:~# nslookup 192.168.25.30
30.25.168.192.in-addr.arpa name = NWrhelNode2.amer.lan
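In this example, the pcs ip resource is set to 192.168.25.30, which reverse-resolves to the physical node NWrhelNode2.amer.lan, while the cluster name NWrhelClus resolves to 192.168.25.20. The clustered NetWorker server resource therefore tries to start with the physical node's IP address instead of the logical IP assigned to the hostname.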
Note: The lcmap command may return the logical cluster hostname instead of the IP address.
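The comparison described in this section can be sketched as a small check. This is an illustration only: the helper names lcmap_hostname and check_virtual_ip are hypothetical, and the parsing assumes lcmap prints the hostname attribute in the "hostname: value;" form shown in the output above.

```shell
# Hypothetical helper: extract the "hostname" attribute from lcmap output (stdin).
lcmap_hostname() {
    sed -n 's/^[[:space:]]*hostname:[[:space:]]*\([^;]*\);.*$/\1/p' | head -n 1
}

# Hypothetical helper: compare the value lcmap reports with the expected
# logical (cluster) IP. Prints the result and returns 1 on a mismatch.
check_virtual_ip() {
    lcmap_value=$1
    expected_ip=$2
    if [ "$lcmap_value" = "$expected_ip" ]; then
        echo "match"
    else
        echo "mismatch: lcmap reports $lcmap_value, expected $expected_ip"
        return 1
    fi
}

# Example usage with the values from this article:
#   check_virtual_ip "$(lcmap | lcmap_hostname)" 192.168.25.20
```

If lcmap reports a hostname rather than an IP address, the string comparison reports a mismatch; resolve the hostname first and compare the resulting address instead.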
Resolution
1. Check the /etc/hosts file on each node in the cluster and confirm that all IP addresses and hostnames (if set) are correct.
2. Confirm the correct IP address used by the shared NetWorker server resource in the Domain Name System (DNS):
A. Get the NetWorker server name from the resource database, on the node where the shared /nsr directory is mounted:
nsradmin -d /nsr_share/nsr/res/nsrdb
show name
print type: nsr
Example:
root@NWrhelNode2:~# nsradmin -d /nsr_share/nsr/res/nsrdb
NetWorker administration program.
Use the "help" command for help, "visual" for full-screen mode.
nsradmin> show name
nsradmin> print type: nsr
name: NWrhelClus.amer.lan;
nsradmin> quit
root@NWrhelNode2:~#
Note: In this example, the shared directory is /nsr_share. Confirm the shared folder path by running the command pcs resource config | grep directory.
B. Check the hostname value that lcmap reports, and what it resolves to:
lcmap | grep hostname
nslookup hostname_value
Example:
root@NWrhelNode2:~# lcmap | grep hostname
hostname: 192.168.25.30;
root@NWrhelNode2:~# nslookup 192.168.25.30
30.25.168.192.in-addr.arpa name = NWrhelNode2.amer.lan
Note: If lcmap returns the logical cluster hostname instead of the IP address, ensure that the hostname resolves in DNS and that the IP address is correctly set in the pcs ip resource, as described in the steps below.
C. Confirm the correct IP address used by the shared resource:
nslookup nsr_name
root@NWrhelNode2:~# nslookup NWrhelClus
Server: 192.168.25.2
Address: 192.168.25.2#53
Name: NWrhelClus.amer.lan
Address: 192.168.25.20
root@NWrhelNode2:~#
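The checks in step 2 can be combined into a single comparison of the DNS answer against the value configured in the pcs ip resource. The following is a minimal sketch, not a supported tool: the helper names nslookup_answer and pcs_ip_value are hypothetical, and the parsing assumes output in the form shown in the examples above.

```shell
# Hypothetical helper: print the first answer address from nslookup output
# (stdin), skipping the "Address:" line that belongs to the DNS server itself.
nslookup_answer() {
    awk '/^Name:/ { seen = 1; next }
         seen && /^Address:/ { print $2; exit }'
}

# Hypothetical helper: print the value of the ip= attribute found in
# "pcs resource config ip" output (stdin).
pcs_ip_value() {
    awk '{ for (i = 1; i <= NF; i++)
               if ($i ~ /^ip=/) { sub(/^ip=/, "", $i); print $i; exit } }'
}

# Example usage with the names from this article:
#   dns_ip=$(nslookup NWrhelClus | nslookup_answer)
#   pcs_ip=$(pcs resource config ip | pcs_ip_value)
#   [ "$dns_ip" = "$pcs_ip" ] || echo "pcs ip ($pcs_ip) differs from DNS ($dns_ip)"
```

With the values in this article, the comparison would report that 192.168.25.30 differs from 192.168.25.20, which is the misconfiguration that step 3 corrects.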
3. Update the pcs ip resource with the IP address that the NetWorker server name resolves to:
pcs resource update ip ip=IP_ADDRESS
root@NWrhelNode2:~# pcs resource update ip ip=192.168.25.20
root@NWrhelNode2:~# lcmap | grep hostname
hostname: 192.168.25.20;
4. If the local node's client service is not running, start it:
A. Check whether the nsrexecd process is running:
ps -ef | grep nsrexecd
B. If not running, start it:
/usr/sbin/nsrexecd
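Steps A and B can be scripted so that the daemon is started only when it is absent. This is a minimal sketch under stated assumptions: the helper name start_if_stopped is hypothetical, and it uses pgrep -x in place of ps -ef | grep so that the search does not match its own grep process.

```shell
# Hypothetical helper: run the start command only if no process with the
# given exact name is running. Arguments: process name, then the start command.
start_if_stopped() {
    proc_name=$1
    shift
    if pgrep -x "$proc_name" >/dev/null 2>&1; then
        echo "$proc_name is already running"
    else
        echo "starting $proc_name"
        "$@"
    fi
}

# Example usage on a cluster node:
#   start_if_stopped nsrexecd /usr/sbin/nsrexecd
```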
5. Start the clustered NetWorker server resource:
pcs resource debug-start nws
Note: Using debug-start helps when the nws_start process takes longer than allowed by the nws start timeout settings. If the nws resource shows as (disabled), enable and start it with pcs resource enable nws.
6. Confirm NetWorker services have started:
pcs resource
root@NWrhelNode2:~# pcs resource
* Resource Group: NW_group:
* fs (ocf::heartbeat:Filesystem): Started NWrhelNode2.amer.lan
* ip (ocf::heartbeat:IPaddr): Started NWrhelNode2.amer.lan
* nws (ocf::EMC_NetWorker:Server): Started NWrhelNode2.amer.lan
root@NWrhelNode2:~#
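For scripted verification, the state of a single resource can be read from this output. The following is a minimal sketch; the helper name pcs_resource_state is hypothetical, and the parsing assumes resource lines in the "name (agent): State node" form shown above.

```shell
# Hypothetical helper: read "pcs resource" or "pcs status" output on stdin and
# print the state word (for example Started or Stopped) of the named resource.
pcs_resource_state() {
    awk -v res="$1" '
        {
            for (i = 1; i < NF; i++) {
                # Resource lines look like:
                #   * nws (ocf::EMC_NetWorker:Server): Started <node>
                if ($i == res && $(i + 1) ~ /^\(.*\):$/) { print $(i + 2); exit }
            }
        }'
}

# Example usage:
#   [ "$(pcs resource | pcs_resource_state nws)" = "Started" ] && echo "nws is up"
```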