NetWorker: Recovery fails with "NDMP Service Error: Cannot identify format"
Summary: Restores fail as Network Data Management Protocol (NDMP) header is corrupted, causing I/O errors and file numbers that break positioning, triggering "NDMP Service Error: Cannot identify format." ...
Symptoms
The symptoms are related to two main issues:
- The backup administrator is not paying good attention to consistently recurring NMDP header I/O errors, this NetWorker warning carries a VNX error and should be treated seriously
- The two products (NetWorker and VNX) consider the NMDP header corruption as a trivial error. The special case of backup restarts the corruption reaches a point that cannot be automatically resolved when trying to read the data
The required files for recovery span multiple backups (Full + Differential + Incremental) and span multiple volumes.
The NDMP saveset recovery and File by File (browsable) recovery fail with closely related errors
Browsable recovery failure log:
42744:nsrndmp_recover: Tape server paused: reached the end of file
42897:nsrndmp_recover: Opened the tape device : c208t0l0
42870:nsrndmp_recover: Continuing recover from the next volume
42619:nsrndmp_recover: NDMP Service Error: Cannot identify format. <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
42617:nsrndmp_recover: NDMP Service Log: server_archive: emctar vol 1, 0 files, 61440 bytes read, 0 bytes written
42738:nsrndmp_recover: Data server halted: Error during the restore.
42856:nsrndmp_recover: NDMP data server has an internal error.
42871:nsrndmp_recover: Error during File NDMP Extraction.
42840:nsrndmp_recover: NDMP recover failed.
42880:nsrndmp_recover: Error during NDMP recover
16279:winworkr: NDMP retrieval: child failed with status of 1
42897:nsrndmp_recover: Opened the tape device : c208t0l5
Mover listen address is NULL(NDMP_ADDR_LOCAL)
The saveset recovery failure happens with the Full backup only, other level backups could be recovered using saveset recovery.
root@NW-server# nsrndmp_recover -s NW-server -c NAS-host -S 1011119015 -v off -m "NAS-host::/root_vdm_1/users/recover_emc" "/root_vdm_1/users/VDI_Users/v009/project/Plan04.2012"
42879:nsrndmp_recover: Peforming recover with no file mark dependency..
05/31/16 17:25:50.504040 NDMP Service Debug: The process id for NDMP service is 0xc4bd90b0
42787:nsrndmp_recover: Performing recover from NDMP type of device
05/31/16 17:26:01.269862 NDMP Service Debug: The process id for NDMP service is 0xc4bd90b0
May 31 17:26:02 NW-server root: [ID 702911 local0.alert] NetWorker media: (waiting) waiting for LTO Ultrium-5 tape VN0116L5 on NAS-host
95555:nsrmmd: ndmp tape mtio failed, I/O error
42850:nsrndmp_recover: Failed to load volume 1111782299 fnum 3 <<<<<< cannot load File Number 3
42855:nsrndmp_recover: Failed to load the tape.
42871:nsrndmp_recover: Error during File NDMP Extraction.
42840:nsrndmp_recover: NDMP recover failed.
42880:nsrndmp_recover: Error during NDMP recover
root@NW-server #
The most important symptom is that the scanner is not able to go through the required backup volumes, it stops prematurely before reading any data!
root@NW-server # scanner -vvv -p "rd=NAS-host:c208t0l5 (NDMP)"
8909:scanner: using 'rd=NAS-host:c208t0l5 (NDMP)' as the device name
05/31/16 17:29:08.919423 NDMP Service Debug: The process id for NDMP service is 0xda018d60
9040:scanner: Opened c208t0l5 for read
8968:scanner: Reading the label...
8969:scanner: Reading the label done
8936:scanner: scanning LTO Ultrium-5 tape VN0116L5 on rd=NAS-host:c208t0l5 (NDMP)
96367:scanner: volume id 1111782299 record size 262144 bytes
created 5/24/16 21:00:35 expires 5/24/18 21:00:35
8973:scanner: setting position from fn 0, rn 0 to fn 2, rn 0
05/31/16 17:29:09.073276 NDMP Service Debug: The process id for NDMP service is 0xda018d60
9040:scanner: Opened c208t0l5 for read
05/31/16 17:29:09.201476 NDMP Service Debug: The process id for NDMP service is 0xda018d60
9040:scanner: Opened c208t0l5 for read
8761:scanner: done with LTO Ultrium-5 tape VN0116L5 <<<<<<<< No Errors , no mention of any file after File number 2 !!
The usage of the NW debug flag file: /nsr/debu/ndmp_auto_pos did not help with the Full backup saveset or browsable recovery.
Cause
Malformed NDMP volume format caused by NDMP header corruption generating erroneous File numbers sometimes (like automated backup restart):
The inspection of NW logs shows that every backup is coupled with an I/O error when writing the NDMP header. This is a consistent problem that occurs with every backup, this is a pointer to a consistent corruption.
70896 05/01/16 12:18:05 0 0 2 1 23177 0 NW-server nsrd NSR info NAS-host:/root_vdm_1/userdata saving to pool 'NDMPPool' (VN0050L5)
42597 05/01/16 12:18:05 2 0 0 1 23990 0 NW-server nsrmmd NSR warning ndmp header: I/O error <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
71193 05/01/16 12:18:05 0 0 0 1 23177 0 NW-server nsrd NSR info NDMP Save Notice: NAS-host:/root_vdm_1/users NDMP save running on 'NW-server'
The inspection of the NAS logs shows that the Buffer size of NDMP devices is set to a value that is less that the device block size configured in NW. The default NetWorker configuration sets the device block size for LTO-5 devices to 256 K, but the NDMP device buffer size (NDMP.bufsz) is 128 K. This part is related only to the NDMP header as the data block size is configured on the NAS by the PAX configuration parameters (default 60k).
2016-05-25 17:30:37: NDMP: 3: Thread ndmp533 DartTapeInterface: tape read error: tape read block size (262144 bytes) > 131072 bytes (NDMP.bufsz (128 KB) x 1024)
2016-05-25 17:30:37: NDMP: 3: Thread ndmp533 set parameter NDMP.bufsz (in KB) to 256 or larger to resolve this error
2016-05-25 17:30:49: NDMP: 3: Thread ndmp535 DartTapeInterface: tape read error: tape read block size (262144 bytes) > 131072 bytes (NDMP.bufsz (128 KB) x 1024)
2016-05-25 17:30:49: NDMP: 3: Thread ndmp535 set parameter NDMP.bufsz (in KB) to 256 or larger to resolve this error
2016-05-25 17:30:50: NDMP: 3: Thread ndmp536 DartTapeInterface: tape read error: tape read block size (262144 bytes) > 131072 bytes (NDMP.bufsz (128 KB) x 1024)
2016-05-25 17:30:50: NDMP: 3: Thread ndmp536 set parameter NDMP.bufsz (in KB) to 256 or larger to resolve this error
The VNX log shows a related message at backup time:
1466097048: NDMP: 3: Session 236 (thread ndmp236) TAPE_WRITE, size(262144) > ndmpMaxBufSize (131072) Set NDMP.bufsz >= 262144 and reboot
This is not considered a backup problem because the data stream has not started, but it causes corruption of the NMDP header (see: NDMP: 3: Session 384 (thread ndmp384) TAPE_WRITE, size(262144) > ndmpMaxBufSize (131072)Set NDMP.bufsz >= 262144 and reboot), which resolves this problem by setting the NDMP devices buffer size to 256k. However this correction does not resolve the problem of the backups that have already been written.
The above NMDP header corruption may exist without causing the recovery problem, but in cases of a backup restart, the file number registered is wrong in the NW media database. Which voids the Automatic positioning of the Tape and thus the NAS responds that the existing data at that position cannot be recognized as valid NDMP data stream (NDMP Service Error: Cannot identify format)
So here there are two types of corruption:
- NDMP header IO errors caused by the inconsistent block size setting between NetWorker and VNX, this type of corruption is bypassed by the Auto positioning feature, and it can exist for years without anyone noticing. As long as the right file numbers are registered with each savesets.
- When a backup is restarted, multiple NDMP headers and footers are written sequentially, and the NW server will miss the right file number where the next backup starts, which voids auto positioning the use of the NDMP disabled auto positioning debug (
/nsr/debug/ndmp_auto_pos) is useless in this case because the Tape management interface (MTIO) will stop reading once it encounters a double file mark (caused by C1 the NDMP header corruption)
Resolution
Reference this KB when contacting Dell Support.