NetWorker: Troubleshooting Guide for Red Hat Cluster Service Issues
Summary: This article provides an overview of how to approach NetWorker service startup issues for NetWorker servers deployed on Red Hat pacemaker (pcs) clusters. It is intended for NetWorker backup administrators and NetWorker support personnel troubleshooting these issues.
Instructions
NetWorker servers can be deployed in a cluster failover configuration on Red Hat nodes using pacemaker (pcs) services. NetWorker is installed on multiple nodes. The server databases reside on shared storage, which is passed between nodes based on which node is active in the pacemaker configuration. The NetWorker server uses a shared cluster name and IP address, ensuring consistent naming and addressing regardless of the hosting node. See the NetWorker Cluster Integration Guide, available on the Dell Support product page, for details on how to set up NetWorker in a cluster.
Cluster Topology:
This article uses an example cluster with the following configuration:
NetWorker Cluster Topology

| Hostname | IP Address | Function |
|---|---|---|
| lnx-node1.amer.lan | 192.168.9.108 | Physical Node 1 |
| lnx-node2.amer.lan | 192.168.9.109 | Physical Node 2 |
| lnx-nwcluster.amer.lan | 192.168.9.110 | Logical Name used by NetWorker |
Each node's file system manages the NetWorker /nsr directory using symbolic links.
Active Node:
/nsr points to the shared storage location:
root@lnx-node1:~# ls -l / | grep nsr
lrwxrwxrwx. 1 root root 14 Oct 5 10:49 nsr -> /nsr_share/nsr
drwxr-xr-x. 11 root root 116 Aug 31 17:20 nsr.NetWorker.local
drwxr-xr-x. 3 root root 17 Aug 31 17:23 nsr_share
Passive Node:
/nsr points to /nsr.NetWorker.local:
root@lnx-node2:~# ls -l / | grep nsr
lrwxrwxrwx. 1 root root 20 Oct 3 17:08 nsr -> /nsr.NetWorker.local
drwxr-xr-x. 11 root root 116 Aug 31 17:19 nsr.NetWorker.local
drwxr-xr-x. 2 root root 6 Aug 31 17:18 nsr_share
When a node is in a passive state, the nsrexecd (NetWorker client) software runs using /nsr.NetWorker.local. Each physical node has its own client resource using the physical node's Domain Name System (DNS) resolvable name and IP address. The NetWorker server only runs using the shared storage (/nsr_share) and uses the shared IP address and hostname; the server can be active on only one node at a time.
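A node's role can be inferred from where its /nsr symlink points. A minimal sketch, assuming the symlink layout shown above (the nsr_role helper name is hypothetical; the symlink path is passed as an argument so the check is easy to test):

```shell
# Hypothetical helper: report a node's role from the target of its
# /nsr symbolic link.
nsr_role() {
    case "$(readlink "$1")" in
        */nsr_share/*)         echo "active"  ;;  # /nsr -> /nsr_share/nsr
        */nsr.NetWorker.local) echo "passive" ;;  # /nsr -> /nsr.NetWorker.local
        *)                     echo "unknown" ;;
    esac
}

# On a node, run: nsr_role /nsr
```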
The following pacemaker (pcs) commands are used to get an overview of the pacemaker configuration and status:
- Cluster status:
pcs status
root@lnx-node1:~# pcs status
Cluster name: rhelclus
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-10-05 10:59:19 -04:00)
Cluster Summary:
* Stack: corosync
* Current DC: lnx-node1.amer.lan (version 2.1.5-9.3.el8_8-a3f44794f94) - partition with quorum
* Last updated: Thu Oct 5 10:59:20 2023
* Last change: Thu Oct 5 10:59:13 2023 by root via cibadmin on lnx-node1.amer.lan
* 2 nodes configured
* 3 resource instances configured
Node List:
* Online: [ lnx-node1.amer.lan lnx-node2.amer.lan ]
Full List of Resources:
* Resource Group: NW_group:
* fs (ocf::heartbeat:Filesystem): Started lnx-node1.amer.lan
* ip (ocf::heartbeat:IPaddr): Started lnx-node1.amer.lan
* nws (ocf::EMC_NetWorker:Server): Started lnx-node1.amer.lan
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
The output shows the cluster resource file system (fs), cluster resource IP address (ip), and the NetWorker services (nws). The resource names used here are the defaults used in the NetWorker Cluster Integration Guide; however, it is possible that different names are used. If so, make note of the resource names and substitute them as needed when following the instructions in this article.
- Pacemaker resource configuration:
pcs resource config
Example:
root@lnx-node1:~# pcs resource config
Group: NW_group
Resource: fs (class=ocf provider=heartbeat type=Filesystem)
Attributes: fs-instance_attributes
device=/dev/sdb1
directory=/nsr_share
fstype=xfs
Operations:
monitor: fs-monitor-interval-20 interval=20 timeout=300
start: fs-start-interval-0s interval=0s timeout=60s
stop: fs-stop-interval-0s interval=0s timeout=60s
Resource: ip (class=ocf provider=heartbeat type=IPaddr)
Attributes: ip-instance_attributes
cidr_netmask=24
ip=192.1xx.9.1x0
nic=ens192
Operations:
monitor: ip-monitor-interval-15 interval=15 timeout=120
start: ip-start-interval-0s interval=0s timeout=20s
stop: ip-stop-interval-0s interval=0s timeout=20s
Resource: nws (class=ocf provider=EMC_NetWorker type=Server)
Meta Attributes: nws-meta_attributes
is-managed=true
Operations:
meta-data: nws-meta-data-interval-0 interval=0 timeout=10
migrate_from: nws-migrate_from-interval-0 interval=0 timeout=120
migrate_to: nws-migrate_to-interval-0 interval=0 timeout=60
monitor: nws-monitor-interval-100 interval=100 timeout=1200
start: nws-start-interval-0 interval=0 timeout=600
stop: nws-stop-interval-0 interval=0 timeout=600
validate-all: nws-validate-all-interval-0 interval=0 timeout=10
The above command details each pcs resource's configuration. Important things to note during the initial overview:
- FS resource "device=": The block device that backs the shared storage on the node file system. This device must be the same on each node. This is discussed later in this KB.
- FS resource "directory=": The directory on which the shared NetWorker storage is mounted; it is the mountpoint for the "device=" field. This is discussed later in this KB.
- IP resource "ip=": This is the IP address which is associated with the logical (shared) hostname used by the NetWorker server. This IP address is hosted on the active node.
- Pacemaker visibility of the shared address and storage:
lcmap
Example:
root@lnx-node1:~# lcmap
type: NSR_CLU_TYPE;
clu_type: NSR_LC_TYPE;
interface version: 1.0;

type: NSR_CLU_VIRTHOST;
hostname: 192.168.9.110;
local: TRUE;
owned paths: /nsr_share;
clu_nodes: lnx-node1.amer.lan lnx-node2.amer.lan;
The hostname field should match the pcs resource config "ip=" field. The owned paths should match the pcs resource config "directory=" field. In some instances, when a startup issue is observed, the lcmap command does not return the hostname, local, or owned paths fields; this is indicative of an issue.
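When comparing these fields between nodes, the relevant attributes can be pulled out of saved pcs resource config output for a quick diff. A rough sketch, assuming the output has been saved to a file on each node (the pcs_attrs helper name is hypothetical):

```shell
# Hypothetical helper: extract the device=, directory=, and ip=
# attributes from a saved "pcs resource config" output file so the
# values can be compared between nodes.
pcs_attrs() {
    grep -oE '(device|directory|ip)=[^[:space:]]+' "$1" | sort -u
}

# Usage sketch (run on each node, then diff the results):
#   pcs resource config > /tmp/node1.cfg
#   pcs_attrs /tmp/node1.cfg
```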
Initial Diagnosis:
If NetWorker services fail to start, check the pcs resource status to see which resource is failing:
pcs status
root@lnx-node1:~# pcs status
...
Node List:
* Online: [ lnx-node1.amer.lan lnx-node2.amer.lan ]
Full List of Resources:
* Resource Group: NW_group:
* fs (ocf::heartbeat:Filesystem): Started lnx-node1.amer.lan
* ip (ocf::heartbeat:IPaddr): Started lnx-node1.amer.lan
* nws (ocf::EMC_NetWorker:Server): Started lnx-node1.amer.lan
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
If a failure is observed, a general failure error is returned and the failed resources show as FAILED.
- FS (Filesystem): If the Filesystem is in a failed state, see below section on Filesystem Failures.
- IP (IPaddr): If the IPaddr is in a failed state, see below section on IPaddr Failures.
- NWS (Server): If the NetWorker server is in a failed state, perform the following:
- Review the NetWorker server's daemon.raw for any failure messages which appear during startup. The server's daemon.raw (/nsr_share/nsr/daemon.raw) is located in the shared storage path. The physical node's client daemon.raw is in /nsr.NetWorker.local/logs/daemon.raw. See Dell article NetWorker: How to use nsr_render_log.
- If default logging is not sufficient, enable debug as follows:
- Attempt to restart the "Server" resource:
pcs resource cleanup nws
- Use the dbgcommand utility to enable debug on the nsrd process (set Debug=0 to turn debug logging off again when finished):
dbgcommand -n nsrd Debug=#
- Review the daemon.raw for any additional messages which may point to the issue.
- Review the /var/log/pcsd/pcsd.log for any errors.
- Review the /var/log/pacemaker/pacemaker.log for any errors.
- Review the /var/log/messages file for any errors.
In the pcsd, pacemaker, and messages logs, look for messages which were logged at the same timestamps as the NetWorker service startup attempts. Review for any errors or failures which coincide with the service startup failure.
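The log reviews above can be scripted roughly as follows. This is a sketch only (the scan_logs helper name is hypothetical); narrow the grep pattern to the timestamps of the startup attempt as needed:

```shell
# Hypothetical helper: print recent error/failure lines from each log
# passed to it; unreadable or missing logs are skipped silently.
scan_logs() {
    for log in "$@"; do
        [ -r "$log" ] || continue
        echo "=== $log ==="
        grep -iE 'error|fail' "$log" | tail -n 20
    done
}

# Example:
#   scan_logs /var/log/pcsd/pcsd.log /var/log/pacemaker/pacemaker.log /var/log/messages
```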
Filesystem Failures:
- Review the pacemaker resources:
pcs resource
- Review the pacemaker resource configuration for the Filesystem resource:
pcs resource config fs
root@lnx-node1:~# pcs resource
* Resource Group: NW_group:
* fs (ocf::heartbeat:Filesystem): Started lnx-node1.amer.lan
* ip (ocf::heartbeat:IPaddr): Started lnx-node1.amer.lan
* nws (ocf::EMC_NetWorker:Server): Started lnx-node1.amer.lan
root@lnx-node1:~# pcs resource config fs
Resource: fs (class=ocf provider=heartbeat type=Filesystem)
Attributes: fs-instance_attributes
device=/dev/sdb1
directory=/nsr_share
fstype=xfs
Operations:
monitor: fs-monitor-interval-20 interval=20 timeout=300
start: fs-start-interval-0s interval=0s timeout=60s
stop: fs-stop-interval-0s interval=0s timeout=60s
- Confirm whether the device is mounted on the node's file system:
df -h
Example:
root@lnx-node1:~# df -h | grep /nsr_share
/dev/sdb1        94G  1.5G   92G   2% /nsr_share
- Confirm that the mountpoint is configured correctly, associating the device with the path:
lsblk
Example:
root@lnx-node1:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 40G 0 disk
├─sda1 8:1 0 600M 0 part /boot/efi
├─sda2 8:2 0 1G 0 part /boot
└─sda3 8:3 0 38.4G 0 part
├─rhel-root 253:0 0 34.4G 0 lvm /
└─rhel-swap 253:1 0 4G 0 lvm [SWAP]
sdb 8:16 0 100G 0 disk
└─sdb1 8:17 0 93.1G 0 part /nsr_share
sr0 11:0 1 1024M 0 rom
- Confirm that the file system used by the device is correct:
blkid
root@lnx-node1:~# blkid
/dev/mapper/rhel-root: UUID="7cf2f957-18d8-45b8-bf8f-6361aadc3517" BLOCK_SIZE="512" TYPE="xfs"
/dev/sda3: UUID="QpZ2hK-OuE2-igN0-Ryba-EwMN-uxq1-LE48hD" TYPE="LVM2_member" PARTUUID="1193db91-4b63-4b33-a4d4-03a22317e064"
/dev/sda1: UUID="F243-AD41" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="6c81bd63-0249-4bdf-afdb-cdde72034162"
/dev/sda2: UUID="7677ad6b-8191-4a45-8a8a-16cf7d00d72c" BLOCK_SIZE="512" TYPE="xfs" PARTUUID="57481b7a-83ec-4cd8-bf2d-bca09ac27040"
/dev/sdb1: UUID="600bca60-dd5d-4162-bf77-0537daa3b1e5" BLOCK_SIZE="512" TYPE="xfs" PARTLABEL="networker" PARTUUID="769aaac2-764b-431d-be21-3b5753d6a5d3"
/dev/mapper/rhel-swap: UUID="537962b6-07d4-4a40-9687-deab2e488936" TYPE="swap"
If no cause is found, review /var/log/pcsd/pcsd.log, /var/log/pacemaker/pacemaker.log, and /var/log/messages for errors which coincide with the failure.
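One cause of Filesystem resource failures is an fstype= in the resource configuration that does not match the filesystem actually on the device. A small sketch for pulling the TYPE value out of a blkid line so it can be compared with the resource configuration (the blk_type helper name is hypothetical):

```shell
# Hypothetical helper: extract the TYPE="..." value from a blkid
# output line for comparison with the fs resource's fstype= setting.
blk_type() {
    sed -n 's/.*TYPE="\([^"]*\)".*/\1/p'
}

# Example: blkid /dev/sdb1 | blk_type
# The printed type should match fstype= from "pcs resource config fs".
```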
IPaddr Failures:
- Review the pacemaker resources:
pcs resource
- Review the pacemaker resource configuration for the IPaddr resource:
pcs resource config ip
root@lnx-node1:~# pcs resource
* Resource Group: NW_group:
* fs (ocf::heartbeat:Filesystem): Started lnx-node1.amer.lan
* ip (ocf::heartbeat:IPaddr): Started lnx-node1.amer.lan
* nws (ocf::EMC_NetWorker:Server): Started lnx-node1.amer.lan
root@lnx-node1:~# pcs resource config ip
Resource: ip (class=ocf provider=heartbeat type=IPaddr)
Attributes: ip-instance_attributes
cidr_netmask=24
ip=192.1xx.9.1x0
nic=ens192
Operations:
monitor: ip-monitor-interval-15 interval=15 timeout=120
start: ip-start-interval-0s interval=0s timeout=20s
stop: ip-stop-interval-0s interval=0s timeout=20s
- Confirm that the NIC is available on the system:
ifconfig -a
root@lnx-node1:~# ifconfig -a
ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.1xx.9.1x8 netmask 255.255.255.0 broadcast 192.1xx.9.255
inet6 fe80::250:56ff:fea5:48e1 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a5:48:e1 txqueuelen 1000 (Ethernet)
RX packets 953865 bytes 349705527 (333.5 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1190983 bytes 179749786 (171.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 129798 bytes 13274289 (12.6 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 129798 bytes 13274289 (12.6 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
The IP address shown with ifconfig is the physical node's address; however, the clustered IP is reachable through this NIC when the node is active. Ensure that both nodes are configured to use the same NIC name.
- Does the IP address resolve to the correct (logical) hostname used by the NetWorker server?
nslookup ip
nslookup logical_name_FQDN
nslookup logical_name_short
root@lnx-node1:~# nslookup 192.1xx.9.1x0
110.9.1xx.1x2.in-addr.arpa    name = lnx-nwcluster.amer.lan.

root@lnx-node1:~# nslookup lnx-nwcluster.amer.lan.
Server:   192.1xx.9.1x0
Address:  192.1xx.9.100#53
Name:     lnx-nwcluster.amer.lan
Address:  192.1xx.9.1x0

root@lnx-node1:~# nslookup lnx-nwcluster
Server:   192.1xx.9.1x0
Address:  192.1xx.9.100#53
Name:     lnx-nwcluster.amer.lan
Address:  192.1xx.9.1x0
It is also recommended to perform the same steps against the physical node's IP address, FQDN, and shortname. See Dell article NetWorker: Name Resolution Troubleshooting Best Practices.
- Can you reach the cluster IP address using ping?
ping -c 4 ip
root@lnx-node1:~# ping -c 4 192.1xx.9.1x0
PING 192.1xx.9.1x0 (192.1xx.9.1x0) 56(84) bytes of data.
64 bytes from 192.1xx.9.1x0: icmp_seq=1 ttl=64 time=0.051 ms
64 bytes from 192.1xx.9.1x0: icmp_seq=2 ttl=64 time=0.043 ms
64 bytes from 192.1xx.9.1x0: icmp_seq=3 ttl=64 time=0.033 ms
64 bytes from 192.1xx.9.1x0: icmp_seq=4 ttl=64 time=0.034 ms

--- 192.1xx.9.1x0 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3108ms
rtt min/avg/max/mdev = 0.033/0.040/0.051/0.008 ms
If no cause is found, review /var/log/pcsd/pcsd.log, /var/log/pacemaker/pacemaker.log, and /var/log/messages for errors which coincide with the failure.
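If the IPaddr resource starts on one node but not the other, a common cause is a NIC name mismatch between nodes. A minimal sketch for checking that the NIC named in the resource's "nic=" field exists on the local node (the nic_exists helper name is hypothetical; it assumes the iproute2 ip command is available):

```shell
# Hypothetical helper: confirm the NIC from the ip resource's "nic="
# field exists on this node; IPaddr cannot start on a missing NIC.
nic_exists() {
    if ip link show "$1" >/dev/null 2>&1; then
        echo "present"
    else
        echo "missing"
    fi
}

# Example: nic_exists ens192   (run on both nodes)
```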
Other PCS Commands:
| Operation | Command |
|---|---|
| Pacemaker or pcs version | pcs --version |
| Pacemaker overview | pcs status |
| Pacemaker resource overview | pcs resource |
| Determine path ownership in a cluster | lcmap |
| Enable (start) resource | pcs resource enable resource_name |
| Start pcs resource with debug | pcs resource debug-start resource_name |
| Review pcs resource configuration settings | pcs resource config resource_name |
| Disable (stop) resource | pcs resource disable resource_name |
| Restart failed resource | pcs resource cleanup resource_name |
| Stop pacemaker on node | pcs cluster stop node_name |
| Start pacemaker on node | pcs cluster start node_name |
| Put the node in standby | pcs node standby node_name |
| Bring the node out of standby | pcs node unstandby node_name |
Important Logs and Files:
| Path | Purpose | Supplemental Commands |
|---|---|---|
| /var/log/messages | Contains global system messages regarding system resources and services. | |
| /var/log/pacemaker/pacemaker.log | Default pacemaker information logging for pacemaker resources and functions. | N/A |
| /var/log/pcsd/pcsd.log | Default pacemaker service/daemon (pcsd) log. | N/A |
| /var/log/cluster/corosync.log | Default pacemaker node communication log. | N/A |
| /usr/sbin/nw_hae.log | NetWorker (nws) resource start log as defined in /usr/lib/ocf/resource.d/EMC_NetWorker/Server. | N/A |
| /usr/lib/ocf/resource.d/EMC_NetWorker/Server | NetWorker pacemaker resource script; defines the operations performed/managed by pcs. | N/A |