NetWorker: A NetWorker server deployed on a Red Hat Enterprise Linux Pacemaker failover cluster has no method of tuning the monitoring interval.

Summary: Intermittent NetWorker outages are observed in large cluster environments due to brief monitoring interruptions. By default, the monitor function retries three times, and there is no "retry count" parameter available. This KB describes potential workarounds and gives the details of a Request For Enhancement (RFE) opened against NetWorker. ...


Symptoms

  • The NetWorker server is installed on a Red Hat Pacemaker (pcs) failover cluster.
  • There are intermittent NetWorker outages due to brief interruptions to the Pacemaker (pcs) resource for the NetWorker server (the default resource name is nws).

Cause

The cause of cluster outages can vary. This section defines what is used to perform NetWorker cluster monitor functions. 

By default, the Pacemaker resource for NetWorker has a "monitor" operation. The operation has "interval" and "timeout" settings, which the cluster administrator configures during the initial NetWorker cluster configuration.

root@NWrhelNodeA:~# pcs resource
  * Resource Group: NW_group:
    * fs        (ocf::heartbeat:Filesystem):     Started NWrhelNodeA.emclab.local
    * ip        (ocf::heartbeat:IPaddr):         Started NWrhelNodeA.emclab.local
    * nws       (ocf::EMC_NetWorker:Server):     Started NWrhelNodeA.emclab.local
root@NWrhelNodeA:~# pcs resource config nws
 Resource: nws (class=ocf provider=EMC_NetWorker type=Server)
  Meta Attrs: is-managed=true
  Operations: meta-data interval=0 timeout=10 (nws-meta-data-interval-0)
              migrate_from interval=0 timeout=120 (nws-migrate_from-interval-0)
              migrate_to interval=0 timeout=60 (nws-migrate_to-interval-0)
              start interval=0 timeout=300 (nws-start-interval-0)
              stop interval=0s timeout=300 (nws-stop-interval-0s)
              validate-all interval=0 timeout=10 (nws-validate-all-interval-0)
              monitor interval=120s timeout=300 (nws-monitor-interval-120s)

NetWorker is configured to use Open Cluster Framework (OCF). The monitoring function is defined in /usr/lib/ocf/resource.d/EMC_NetWorker/Server:

NWServer_monitor() {
        local count

        # exit immediately if configuration is not valid
        NWServer_validate_all || exit $?

        quick_monitor
        if [ $? -eq 0 ]; then
                count=0
                while [ $count -lt 3 ]; do
                        echo "q" | nsradmin -s ${NSR_SERVERHOST} -i - > /dev/null 2>&1
                        if [ $? -eq 0 ]; then
                                return $OCF_SUCCESS
                        else
                                count=`expr ${count} + 1`
                                sleep 1
                        fi
                done
        else
                return $OCF_NOT_RUNNING
        fi

        return $OCF_NOT_RUNNING
}

NOTE: A monitor failure should not occur under normal circumstances and usually indicates an unrecoverable error. However, some large environments may see intermittent cases where the nsradmin check in the monitor function fails temporarily. If all three attempts fail, the monitor returns OCF_NOT_RUNNING, Pacemaker treats the resource as failed, and the result is a full outage.

Resolution

The cluster administrator should investigate all cluster outage issues. The cluster logs can be reviewed for any details on interruptions:

  • /var/log/pcsd/pcsd.log
  • /var/log/pacemaker/pacemaker.log
  • /var/log/messages
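As a starting point for reviewing the logs listed above, the following is a minimal sketch that searches them for lines mentioning the NetWorker resource. The RESOURCE variable and the scan_logs helper are illustrative names, not part of NetWorker or Pacemaker, and "nws" is only the default resource name. A plain string match can return unrelated lines, so treat the output as a first pass only.

```shell
#!/bin/sh
# Sketch: search the cluster logs for lines that mention the NetWorker
# resource. RESOURCE and scan_logs are hypothetical names for illustration.
RESOURCE="${RESOURCE:-nws}"

scan_logs() {
    # $1 = search string, remaining arguments = log files to search
    pattern="$1"
    shift
    for log in "$@"; do
        if [ -r "$log" ]; then
            # -H prefixes each match with the file name, -F matches a fixed string
            grep -HF -- "$pattern" "$log"
        else
            echo "skipping unreadable log: $log" >&2
        fi
    done
    return 0
}

scan_logs "$RESOURCE" \
    /var/log/pcsd/pcsd.log \
    /var/log/pacemaker/pacemaker.log \
    /var/log/messages
```

Run the script as root on each cluster node, because each node keeps its own copy of these logs.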

NetWorker server logs can also be reviewed. The NetWorker server's daemon.log is located on the shared disk (for example, /nsr_share).

  • /nsr_share/nsr/logs/daemon.log

If real-time rendering is not enabled, the .raw log can be rendered into a .log file with the following command:

nsr_render_log /nsr_share/nsr/logs/daemon.raw > /nsr_share/nsr/logs/daemon_`date -I`.log

The cluster administrator can increase the monitor interval and timeout values for the NetWorker server pcs resource. See the Red Hat Pacemaker documentation for directions on changing these values, as Pacemaker commands may change across Pacemaker versions.
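For illustration only, the following sketch shows one common form of the command on recent pcs releases. The syntax should be confirmed against the Red Hat documentation for the installed version. The interval and timeout values are placeholders, not recommendations, and the DRY_RUN switch and update_monitor_op helper are hypothetical names added so that the command can be inspected before it is run.

```shell
#!/bin/sh
# Sketch: raise the monitor interval and timeout on the NetWorker resource.
# The values are placeholders; DRY_RUN and update_monitor_op are hypothetical.
RESOURCE="${RESOURCE:-nws}"
INTERVAL="${INTERVAL:-180s}"
TIMEOUT="${TIMEOUT:-600s}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = run it

update_monitor_op() {
    # Build the command as positional parameters so each argument stays intact.
    set -- pcs resource update "$RESOURCE" op monitor \
        "interval=$INTERVAL" "timeout=$TIMEOUT"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

update_monitor_op
```

With the defaults, the script only prints the command it would run. After applying a change with DRY_RUN=0, the new values can be checked with pcs resource config nws.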

By default, the monitor function retries the nsradmin check three times. In some instances, this may not be sufficient. A Request For Enhancement (RFE), NW-I-2171, has been opened against NetWorker. The intention of the RFE is to introduce a user-tunable "retry count" variable for NWServer_monitor. If the NetWorker administrator defines a new retry count, the monitor function waits until the defined number of retries is complete before reporting a failure that causes a failover. To track this RFE, contact your Dell Site Account Manager or sales representative with the RFE number NW-I-2171.

Additional Information

The NWServer_monitor function in /usr/lib/ocf/resource.d/EMC_NetWorker/Server can be modified by the cluster administrator to include additional functions; however, this scripting is outside of NetWorker support. Any changes to this script are removed during a NetWorker server upgrade.
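As an example of what such a modification might look like, the following is a minimal sketch of a retry loop with a tunable attempt count. It is not the shipped monitor code and not the implementation proposed in the RFE. RETRY_COUNT, RETRY_DELAY, and retry_check are hypothetical names. The same support and upgrade caveats apply.

```shell
#!/bin/sh
# Sketch of a retry loop with a tunable attempt count. This is NOT the
# shipped NWServer_monitor; RETRY_COUNT, RETRY_DELAY, and retry_check are
# hypothetical names used for illustration.
RETRY_COUNT="${RETRY_COUNT:-3}"   # attempts before giving up
RETRY_DELAY="${RETRY_DELAY:-1}"   # seconds to sleep between attempts

retry_check() {
    # Run the given command up to RETRY_COUNT times.
    # Returns 0 on the first success, 1 if every attempt fails.
    attempt=0
    while [ "$attempt" -lt "$RETRY_COUNT" ]; do
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        attempt=$((attempt + 1))
        # No need to sleep after the final failed attempt.
        if [ "$attempt" -lt "$RETRY_COUNT" ]; then
            sleep "$RETRY_DELAY"
        fi
    done
    return 1
}
```

Inside a monitor function, the command passed to retry_check would be the same nsradmin probe that the shipped script uses. The total time spent retrying has to stay below the timeout configured for the monitor operation, otherwise Pacemaker times out the operation before the loop finishes.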

Affected Products

NetWorker

Products

NetWorker Family, NetWorker Series
Article Properties
Article Number: 000216735
Article Type: Solution
Last Modified: 28 Mar 2025
Version:  5