NMDA: DB2 backups fail randomly every night with Error 3
Summary: NMDA DB2 backups failed last night with Error 3. The problem was resolved after creating a new device, distributing backups across two storage nodes, and setting DB2 retry and timeout parameters. ...
Symptoms
NMDA DB2 backup fails with Error 3
DB2 backup fails with error 'lgto_auth for `nsrmmd' failed: busy'
No networking or firewall issues were found.
Thousands of the following messages appear in /nsr/logs/daemon.raw on the storage node:
"5004-nfs lookup failed (nfs: No such file or directory)"
"invalid save stream"
"Cannot stat active file"
"unable to collect deduplication statistics"
"was aborted and removed from volume"
Errors in nmda-messages.log and libnsrdb2.log with debug=9:
153929 2/9/2021 10:34:50 PM 4 7 987 1 18153790 0 (client) (pid18153790) NSR severe The backup session could not start: busy.
93412 2/9/2021 10:34:50 PM 3 5 0 1 18153790 0 (client) (pid18153790) NSR error Could not perform the action 2. The status was changed to 3.
153929 1612842069 4 7 987 1 19136950 0 (client) (pid19136950) NSR severe 39 The backup session could not start: %s. 1 49 8 0 4 busy
93412 1612842069 3 5 0 1 19136950 0 (client) (pid19136950) NSR error 62 Could not perform the action %d. The status was changed to %d. 2 1 1 2 1 1 3
(pid = 18809144) (02/09/21 21:40:00.338942) nsrdb2sv_log_program_args: /usr/bin/nsrdasv -LL -T db2 -s (NW server) -g (group) -a *policy action jobid=2297950 -a *policy name=(policy) -a *policy workflow name=(workflow) -a *policy action name=(action) -y Tue Feb 23 23:59:59 GMT-0600 2021 -w Tue Feb 23 23:59:59 GMT-0600 2021 -m (client) -a *policy action jobid restart=Yes -b (pool) -t 1612810625 -o ....
(pid = 18809144) (02/09/21 21:40:00.624767) Backing up the (DB) database.
(pid = 18809144) (02/09/21 21:40:00.624939) set_db2_version: Exiting set_db2_version(): Return code: 10050000
(pid = 18809144) (02/09/21 21:49:08.731480) DbBackup: Exiting with error:
Unable to backup DB2MDME database due to backup request failure, SQLCODE : -2025, SQL2025N An I/O error occurred. Error code: "3". Media on which this error occurred: "VENDOR".
.
(pid = 18809144) (02/09/21 21:49:08.731631) libdb2sv_main: ERROR: DbBackup() failed.
(pid = 18809144) (02/09/21 21:49:08.731685) Unable to backup DB2MDME database due to backup request failure, SQLCODE : -2025, SQL2025N An I/O error occurred. Error code: "3". Media on which this error occurred: "VENDOR".
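The unrendered nmda-messages.log records above carry 1612842069 where the rendered records show a formatted date. A minimal Python sketch for converting such a value, assuming that field is a Unix epoch timestamp in seconds (the function name `epoch_to_text` is illustrative and not part of NetWorker):

```python
from datetime import datetime, timedelta, timezone

def epoch_to_text(epoch, utc_offset_hours=0):
    """Format a Unix epoch (seconds) as 'YYYY-MM-DD HH:MM:SS' at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch, tz).strftime("%Y-%m-%d %H:%M:%S")

# Assumption: the second field of the raw record is an epoch timestamp.
print(epoch_to_text(1612842069))      # 2021-02-09 03:41:09 (UTC)
print(epoch_to_text(1612842069, -6))  # 2021-02-08 21:41:09 (GMT-0600, as in the nsrdasv arguments)
```

A fixed offset is used here so the result does not depend on the time zone of the machine running the script.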
The critical error is the nsrmmd busy error below:
02/09/21 21:32:46 (pid 18153790): 02/09/21 21:32:46.797073 lgto_auth for `nsrd' succeeded
02/09/21 21:32:46 (pid 18153790): 02/09/21 21:32:46.855631 lgto_parms for `nsrmmd' succeeded
02/09/21 21:32:46 (pid 18153790): 02/09/21 21:32:46.855705 got `store index entries' value of `Yes'
02/09/21 21:32:46 (pid 18153790): 02/09/21 21:32:46.855803 Saving in pool 'IDC-DB2'.
02/09/21 21:32:46 (pid 18153790): 02/09/21 21:32:46.855822 server enabled for immediate mode
02/09/21 21:32:46 (pid 18153790): 02/09/21 21:32:46.882267 lgto_auth for `nsrmmd' failed: busy
02/09/21 21:32:46 (pid 18153790): 02/09/21 21:32:46.882349 Unable to acquire the user credentials for direct save nsrmmd authentication: busy.
02/09/21 21:32:46 (pid 18153790): 02/09/21 21:32:46.882439 The error TYPE is 0, SEVERITY is 0, NUMBER is -13, errnum is -13, errstr is 'busy'.
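To see how often the busy failure occurs across a night of debug output, the lgto_auth entries can be filtered with a short script. This is a sketch only: it assumes one log entry per line in the format shown above, and the names `AUTH_RE` and `auth_failures` are illustrative.

```python
import re

# Matches entries such as:
#   02/09/21 21:32:46.882267 lgto_auth for `nsrmmd' failed: busy
AUTH_RE = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"lgto_auth for `(\w+)' (succeeded|failed: (.+))"
)

def auth_failures(lines):
    """Return (timestamp, daemon, reason) for every failed lgto_auth entry."""
    failures = []
    for line in lines:
        match = AUTH_RE.search(line)
        if match and match.group(3) != "succeeded":
            failures.append((match.group(1), match.group(2), match.group(4).strip()))
    return failures

sample = [
    "02/09/21 21:32:46 (pid 18153790): 02/09/21 21:32:46.797073 lgto_auth for `nsrd' succeeded",
    "02/09/21 21:32:46 (pid 18153790): 02/09/21 21:32:46.882267 lgto_auth for `nsrmmd' failed: busy",
]
print(auth_failures(sample))
```

Passing an open file handle instead of `sample` processes a whole log; grouping the returned timestamps by hour would show whether the failures cluster around particular backup start times.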
Cause
Resolution
The problem was resolved after making the changes below. No single root cause was identified, but creating a new device and setting the parameters below helped most:
1. Added one new device to the storage node.
2. Distributed backups evenly across the storage nodes (target session).
3. Changed backup start times.
4. Added these parameters in NMDA DB2 Application information:
NSR_MAX_START_RETRIES=50
NSR_FXBUSY_RETRIES=10
NSR_MMDB_RETRY_TIME=10
5. Increased Inactivity timeout to 300, Retries=2, Retry delay=10 in the backup action's properties.
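Steps 2 and 3 amount to spreading load across storage nodes and across time. The sketch below is a planning aid only, not a NetWorker feature: it assigns clients to storage nodes round-robin and staggers their start times. The client names, node names, first start time, and 15-minute gap are all hypothetical, and the resulting plan would still have to be applied manually in the NetWorker configuration.

```python
from datetime import datetime, timedelta
from itertools import cycle

def plan_backups(clients, storage_nodes, first_start, gap_minutes):
    """Assign clients to storage nodes round-robin with staggered start times."""
    start = datetime.strptime(first_start, "%H:%M")
    plan = []
    for i, (client, node) in enumerate(zip(clients, cycle(storage_nodes))):
        when = start + timedelta(minutes=i * gap_minutes)
        plan.append((client, node, when.strftime("%H:%M")))
    return plan

# Hypothetical inputs, for illustration only.
plan = plan_backups(["db2-a", "db2-b", "db2-c", "db2-d"], ["sn1", "sn2"], "21:30", 15)
for row in plan:
    print(row)
```

With two storage nodes, consecutive clients alternate between them, so no node receives two new sessions in a row and no two backups start at the same minute.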