NetWorker: A NetWorker server deployed on a Red Hat Enterprise Linux Pacemaker failover cluster has no method of tuning the monitoring interval.
Summary: Intermittent NetWorker outages are observed in large cluster environments due to brief monitoring interruptions. By default, the monitor function retries three times. There is no "retry count" parameter available. This KB describes potential workarounds and gives details of a Request For Enhancement (RFE) opened for NetWorker. ...
Symptoms
- The NetWorker server is installed on a Red Hat Pacemaker (pcs) failover cluster.
- There are intermittent outages in NetWorker due to brief interruptions to the Pacemaker (pcs) resource for the NetWorker server (default is nws).
Cause
The cause of cluster outages can vary. This section defines what is used to perform NetWorker cluster monitor functions.
By default the Pacemaker resource for NetWorker has a "monitor" operation. The operation has "interval" and "timeout" settings which are configured by the cluster administrator during initial NetWorker cluster configuration.
root@NWrhelNodeA:~# pcs resource
  * Resource Group: NW_group:
    * fs (ocf::heartbeat:Filesystem): Started NWrhelNodeA.emclab.local
    * ip (ocf::heartbeat:IPaddr): Started NWrhelNodeA.emclab.local
    * nws (ocf::EMC_NetWorker:Server): Started NWrhelNodeA.emclab.local
root@NWrhelNodeA:~# pcs resource config nws
 Resource: nws (class=ocf provider=EMC_NetWorker type=Server)
  Meta Attrs: is-managed=true
  Operations: meta-data interval=0 timeout=10 (nws-meta-data-interval-0)
              migrate_from interval=0 timeout=120 (nws-migrate_from-interval-0)
              migrate_to interval=0 timeout=60 (nws-migrate_to-interval-0)
              start interval=0 timeout=300 (nws-start-interval-0)
              stop interval=0s timeout=300 (nws-stop-interval-0s)
              validate-all interval=0 timeout=10 (nws-validate-all-interval-0)
              monitor interval=120s timeout=300 (nws-monitor-interval-120s)
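To check the monitor settings on a node without reading the whole listing, the values can be filtered out of the command output. The following is a minimal sketch, assuming the output layout shown above (operation name followed by key=value pairs on one line); other pcs versions may print operations differently. The monitor_settings function is a hypothetical helper, not part of pcs or NetWorker.

```shell
# monitor_settings: hypothetical helper (not part of pcs or NetWorker).
# Reads `pcs resource config <resource>` output on stdin and prints the
# interval and timeout of the monitor operation.
# Assumes the layout shown above: "monitor interval=... timeout=... (id)".
monitor_settings() {
    awk '
        {
            # Look for the operation name "monitor" among the fields.
            for (i = 1; i <= NF; i++) {
                if ($i != "monitor") continue
                # Collect the key=value pairs that follow it on the same line.
                for (j = i + 1; j <= NF; j++) {
                    if ($j ~ /^interval=/) interval = substr($j, 10)
                    if ($j ~ /^timeout=/)  timeout  = substr($j, 9)
                }
            }
        }
        END {
            # Print nothing if no monitor operation was found.
            if (interval != "") print "interval=" interval, "timeout=" timeout
        }
    '
}
```

On a cluster node it could be used as: pcs resource config nws | monitor_settings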
NetWorker is configured to use Open Cluster Framework (OCF). The monitoring function is defined in /usr/lib/ocf/resource.d/EMC_NetWorker/Server:
NWServer_monitor() {
    local count
    # exit immediately if configuration is not valid
    NWServer_validate_all || exit $?
    quick_monitor
    if [ $? -eq 0 ]; then
        count=0
        while [ $count -lt 3 ]; do
            echo "q" | nsradmin -s ${NSR_SERVERHOST} -i - > /dev/null 2>&1
            if [ $? -eq 0 ]; then
                return $OCF_SUCCESS
            else
                count=`expr ${count} + 1`
                sleep 1
            fi
        done
    else
        return $OCF_NOT_RUNNING
    fi
    return $OCF_NOT_RUNNING
}
NOTE: A monitor failure should not occur under normal circumstances and typically indicates an unrecoverable error. However, some large environments may observe intermittent issues where nsradmin fails during the Pacemaker monitor test, even if only temporarily, and that failure results in Pacemaker taking a full outage.
Resolution
The cluster administrator should investigate all cluster outage issues. The cluster logs can be reviewed for any details on interruptions:
/var/log/pcsd/pcsd.log
/var/log/pacemaker/pacemaker.log
/var/log/messages
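As a starting point for the log review, the cluster logs can be filtered for lines that mention both the resource name and the monitor operation. This is a rough sketch only: find_resource_events is a hypothetical helper, and the simple text match is an assumption about what is useful to look for, not a documented log format.

```shell
# find_resource_events RESOURCE FILE...
# Hypothetical helper: prints lines that contain the resource name and the
# word "monitor" (any case), each prefixed with the file it came from.
find_resource_events() {
    local resource="$1"
    shift
    local f
    for f in "$@"; do
        # Skip logs that are absent or unreadable on this node.
        [ -r "$f" ] || continue
        awk -v r="$resource" '
            index($0, r) && tolower($0) ~ /monitor/ { print FILENAME ": " $0 }
        ' "$f"
    done
    return 0
}
```

For example: find_resource_events nws /var/log/pacemaker/pacemaker.log /var/log/messages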
NetWorker server logs can also be reviewed. The NetWorker server's daemon.log is located on the shared disk (for example, /nsr_share).
/nsr_share/nsr/logs/daemon.log
If real-time rendering is not enabled, the .raw log can be rendered into a .log file with the following command:
nsr_render_log /nsr_share/nsr/logs/daemon.raw > /nsr_share/nsr/logs/daemon_`date -I`.log
The cluster administrator can increase the monitor interval and timeout values for the NetWorker server pcs resource. See Red Hat Pacemaker documentation for directions on changing the timeout values as Pacemaker commands may change across Pacemaker versions.
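As an illustration, on pcs versions that accept the "pcs resource update <resource> op <operation> <options>" form, the change could look like the sketch below. The update_monitor_op function and the APPLY switch are hypothetical and exist only so the command can be previewed before it is run. The values 300s and 600s in the usage line are placeholders, not recommendations. Confirm the syntax against the Red Hat documentation for the installed Pacemaker version.

```shell
# update_monitor_op RESOURCE INTERVAL TIMEOUT
# Hypothetical helper: prints the pcs command that would change the monitor
# operation. Set APPLY=1 (an assumption of this sketch) to execute it.
update_monitor_op() {
    local resource="$1"
    local interval="$2"
    local timeout="$3"
    if [ -z "$resource" ] || [ -z "$interval" ] || [ -z "$timeout" ]; then
        echo "usage: update_monitor_op RESOURCE INTERVAL TIMEOUT" >&2
        return 2
    fi
    # Build the command as positional parameters so it can be printed or run.
    set -- pcs resource update "$resource" op monitor \
        "interval=$interval" "timeout=$timeout"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}
```

Running update_monitor_op nws 300s 600s prints the command for review; APPLY=1 update_monitor_op nws 300s 600s runs it. The result can then be checked with pcs resource config nws.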
By default, the NetWorker monitor function (NWServer_monitor) attempts the nsradmin check up to three times before reporting the resource as not running. In some instances, this may not be sufficient. A Request For Enhancement (RFE), NW-I-2171, has been opened against NetWorker. The intention of the RFE is to introduce a user-tunable "retry count" variable for NWServer_monitor. If the NetWorker administrator defines a new retry count, the monitor function would exhaust the defined number of retries before causing a failover. To track this RFE, contact your Dell Site Account Manager or sales representative and reference RFE number NW-I-2171.
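To make the idea of a tunable retry count concrete, the sketch below generalizes the three-attempt loop in NWServer_monitor into a function that takes the attempt count and delay as arguments. It is not the implementation proposed by the RFE and is not part of NetWorker: retry_check, nsradmin_probe, and NW_MONITOR_RETRIES are hypothetical names used only for illustration.

```shell
# retry_check COUNT DELAY COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to COUNT times, sleeping DELAY seconds
# after each failed attempt. Returns 0 on the first success, 1 otherwise.
retry_check() {
    local max="$1"
    local delay="$2"
    local attempt=0
    shift 2
    while [ "$attempt" -lt "$max" ]; do
        # Discard the command output; only the exit status matters.
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical use inside a monitor function (names are assumptions):
#   nsradmin_probe() { echo "q" | nsradmin -s "${NSR_SERVERHOST}" -i -; }
#   retry_check "${NW_MONITOR_RETRIES:-3}" 1 nsradmin_probe || return $OCF_NOT_RUNNING
```

With a count of 3 and a delay of 1 this behaves like the existing loop. A larger count lengthens the worst-case run time of the monitor, so the monitor operation timeout in Pacemaker would need to stay above that duration.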
Additional Information
The NWServer_monitor function in /usr/lib/ocf/resource.d/EMC_NetWorker/Server can be modified by the cluster administrator to include additional functions; however, this scripting is outside of NetWorker support. Any changes to these scripts are removed during a NetWorker server upgrade.