NetWorker: orphan save sets (SSID) on CloudBoost MagFS after Magfs SDK returned: CONNECTION_DISCONNECTED
Summary: Orphan files may remain in the CloudBoost file system, leading to overuse of cloud storage space.
Symptoms
NetWorker backups are configured to go to a CloudBoost device. The NetWorker Storage Node performing the write operation observes a CONNECTION_DISCONNECTED error:
The error appears in the save process on the storage node, and the storage node's nsrmmd daemon reports the following:
MM/DD/YYYY HH:mm:SS nsrmmd SYSTEM critical Unable to write to a file: CONNECTION_DISCONNECTED
MM/DD/YYYY HH:mm:SS nsrmmd SYSTEM error Cannot write to base/NW_DEVICE_NAME/##/##/LONG_SSID - errno No error.
MM/DD/YYYY HH:mm:SS nsrmmd SYSTEM critical Unable to write buffer to disk for ssid=SSID: Failed opening file / directory base/NW_DEVICE_NAME/##/##/LONG_SSID: Magfs SDK returned: CONNECTION_DISCONNECTED.
MM/DD/YYYY HH:mm:SS nsrmmd NSR warning Unable to close or sync file for ssid=SSID: Failed to get fd for ssid=SSID to sync/close file
MM/DD/YYYY HH:mm:SS nsrmmd SYSTEM critical Unable to remove SSID file 'base/NW_DEVICE_NAME/##/##/LONG_SSID' on device 'rd=storagenode:base/NW_DEVICE_NAME': Unable to stat save set file 'base/NW_DEVICE_NAME/##/##/LONG_SSID': Unable to retrieve the file statistics: CONNECTION_DISCONNECTED
MM/DD/YYYY HH:mm:SS nsrmmd NSR error MM/DD/YY HH:mm:SS nsrmmd #5: save set \\NAS_FILER\CIFS_SHARE$\~snapshot\backup.0 for client nw_client_NAS was aborted and removed from volume VOLUME_NAME
MM/DD/YYYY HH:mm:SS nsrsnmd SYSTEM notice nw_cbcl_disconnect: Mount handle is NULL.
Similar errors are found in the daemon.raw log of the NetWorker server and of the storage node:
- Linux:
/nsr/logs/daemon.raw
- Windows (Default):
C:\Program Files\EMC NetWorker\nsr\logs\daemon.raw
To render the .raw log, see: NetWorker: How to use nsr_render_log to render .raw log files
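As a rough triage aid, the rendered log text can be scanned for the save set files named in CONNECTION_DISCONNECTED messages. The following is a minimal Python sketch, not a NetWorker utility. It assumes the log has already been rendered to plain text, that each message is on one line, and that long SSIDs appear as six lowercase hexadecimal groups of eight characters, as in the listings in this article. The function name find_disconnected_save_sets is hypothetical.

```python
import re

# Assumption: a long SSID is six hyphen-separated groups of 8 lowercase hex digits,
# for example 9f3884ab-00000006-f424e0b2-5d24e0b2-080d1600-644823be.
LONG_SSID_RE = re.compile(r"\b[0-9a-f]{8}(?:-[0-9a-f]{8}){5}\b")


def find_disconnected_save_sets(rendered_log_text):
    """Return sorted, de-duplicated long SSIDs found on lines that
    mention CONNECTION_DISCONNECTED in already-rendered log text."""
    found = set()
    for line in rendered_log_text.splitlines():
        # Only consider lines that report the disconnect error.
        if "CONNECTION_DISCONNECTED" not in line:
            continue
        found.update(LONG_SSID_RE.findall(line))
    return sorted(found)
```

The returned list is only a starting point for investigation; lines that report the error without naming a save set file path contribute nothing to it.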
All backups writing to the same CloudBoost appliance are aborted simultaneously and then linger as zombie sessions until they reach the timeout.
As evidence, the CloudBoost MagFS listing below shows multiple save set (SSID) files whose last write occurred at exactly the same time:
/mnt/magfs/base/NW_CB_DEVICE//active$ ls -la
total 3586415541
drwx--x--- 1 root root 0 Jul 11 15:54 .
d--------- 1 root root 0 Jul 6 15:56 ..
-rwx------ 1 root root 204710061008 Jul 7 23:04 13481ae1-00000006-b420c59c-5d20c59c-014d1600-644823be
-rwx------ 1 root root 428284665149 Jul 7 23:04 14d46d72-00000006-0220c577-5d20c577-00ff1600-644823be
-rwx------ 1 root root 247950379831 Jul 7 23:04 173699d8-00000006-ff20c578-5d20c578-01021600-644823be
-rwx------ 1 root root 839859093330 Jul 7 23:04 298cc682-00000006-7920c5e9-5d20c5e9-01881600-644823be
-rwx------ 1 root root 91465187328 Jul 7 23:04 2bc5effa-00000006-ce219f97-5d219f97-03331600-644823be
-rwx------ 1 root root 132714594304 Jul 7 23:04 58a680e5-00000006-0320c577-5d20c577-00fe1600-644823be
-rwx------ 1 root root 140331190669 Jul 7 23:04 5b44744f-00000006-fe20c578-5d20c578-01031600-644823be
-rwx------ 1 root root 0 Jul 9 15:21 60207da3-00000006-5b24b0f6-5d24b0f6-07a61600-644823be
-rwx------ 1 root root 0 Jul 10 05:21 7b8b6b2c-00000006-8d2575c6-5d2575c6-09741600-644823be
-rwx------ 1 root root 437006037198 Jul 7 23:04 89b8a19e-00000006-0020c578-5d20c578-01011600-644823be
-rwx------ 1 root root 0 Jul 10 01:58 8d658a35-00000006-96254645-5d254645-096b1600-644823be
-rwx------ 1 root root 41388343296 Jul 7 23:04 9514b7dc-00000006-cd219f97-5d219f97-03341600-644823be
-rwx------ 1 root root 167433480664 Jul 7 23:04 9aaa5eda-00000006-d4217c62-5d217c62-032d1600-644823be
-rwx------ 1 root root 114686069939 Jul 7 23:04 c212db4c-00000006-0520c4af-5d20c4af-00fc1600-644823be
-rwx------ 1 root root 826660414175 Jul 9 13:30 d37d7328-00000006-0420c577-5d20c577-00fd1600-644823be
-rwx------ 1 root root 0 Jul 9 15:43 f46f6d04-00000006-2624b62e-5d24b62e-07db1600-644823be
On the NetWorker server, the SSID has been deleted from the media database, as shown below:
[root@nsr]# mminfo -avot -q ssid=9f3884ab-00000006-f424e0b2-5d24e0b2-080d1600-644823be
6095:mminfo: no matches found for the query
However, the file corresponding to the failed backup still exists in the MagFS folder on the CloudBoost (CB) appliance:
maginatics@CB_appliance:/mnt/magfs$ ls -lia base/NW_CB_DEVICE//64/47/9f3884ab-00000006-f424e0b2-5d24e0b2-080d1600-644823be
71065 -rwx------ 1 root root 378010589327 Jul 11 11:03 base/NW_CB_DEVICE//64/47/9f3884ab-00000006-f424e0b2-5d24e0b2-080d1600-644823be
Cause
When the CONNECTION_DISCONNECTED error occurs, the save set (SSID) files of the aborted backups are not deleted on the CB appliance, and the corresponding chunks or objects at the cloud provider are not deleted either.
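To estimate how much space such orphans occupy, the MagFS listing can be cross-checked against the SSIDs that the media database still knows. The Python sketch below is illustrative only and reports candidates without deleting anything. It assumes listing text in the ls -la or ls -lia format shown in the Symptoms section, and it assumes the caller has already collected the set of long SSIDs present in the media database (for example, by running the mminfo query shown above per file and treating "no matches found for the query" as absent). The names parse_ls_listing and find_orphans are hypothetical.

```python
import re

# Assumption: save set files are named by their long SSID,
# six hyphen-separated groups of 8 lowercase hex digits.
LONG_SSID_RE = re.compile(r"^[0-9a-f]{8}(?:-[0-9a-f]{8}){5}$")


def parse_ls_listing(ls_output):
    """Parse ls -l / ls -la / ls -lia style text into {long_ssid: size_in_bytes}."""
    files = {}
    for line in ls_output.splitlines():
        fields = line.split()
        # ls -i prepends an inode number; drop it if present.
        if fields and fields[0].isdigit():
            fields = fields[1:]
        # Keep regular files only: mode, links, owner, group, size, month, day, time, name.
        if len(fields) < 9 or not fields[0].startswith("-"):
            continue
        name = fields[8].rsplit("/", 1)[-1]  # strip any leading directory path
        if LONG_SSID_RE.match(name):
            files[name] = int(fields[4])
    return files


def find_orphans(files_on_magfs, ssids_in_media_db):
    """Return (orphans, total_bytes) for files whose SSID is not in the media database."""
    orphans = {
        name: size
        for name, size in files_on_magfs.items()
        if name not in ssids_in_media_db
    }
    return orphans, sum(orphans.values())
```

A file listed as an orphan here is only a candidate: a save set that is still being written may not match the media database query yet, so the result should be reviewed before drawing conclusions.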
Resolution
These orphan files are removed during the next NetWorker server (nsrd) startup. To restart the NetWorker server process, run the following commands:
- Linux:
systemctl restart networker
- Windows (PowerShell):
net stop nsrd ; net start nsrd
If the CONNECTION_DISCONNECTED error is observed frequently, refer to the following KB: NetWorker: backups to CloudBoost are aborted by CONNECTION_DISCONNECTED error