Persistent cloning failure during weekly full backups of UNIX clients

Question

I'm sure I'm leaving out critical details, so please feel free to ask for any additional information needed to solve this puzzle.

Basically, I'm at my wits' end. I've been tracking this problem for a few months, but have had a very difficult time sizing it up. I've delayed calling EMC support, as I'm confident they're going to want me to update NetWorker to 7.3/7.4 and I'm not ready to do that just yet.

The clones fail almost exclusively on my UNIX clients (two physical, two cluster nodes). The UNIX savegroups also happen to be the largest, spanning several LTO1 media. The Windows clients have no problems. The savesets being backed up are HP-UX OS filesystems and Oracle 10g filesystems.

I've seen this issue across all four LTO drives in my tape library. It fails with brand new tapes as well as old ones. I've replaced drives and tapes, to no avail. Stranger still, I can successfully run a make-up savegroup later in the day and back up and clone the same savesets that failed earlier.

I've posted an example of the daemon.log entries that I see after a failure.

Background:

Fact: NetWorker Server 7.2.2 Jumbo
Fact: HP-UX 11.11
Fact: MC/Serviceguard Cluster package
Fact: Quantum ATLP2000 Tape Library (4 × IBM Ultrium1 drives, 198 slots)

Symptom:

02/17/08 07:43:15 nsrd: media notice: Volume "ABX304L1" on device "/dev/rmt/c25t0d1BESTnb": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled.

02/17/08 07:44:11 nsrd: media info: can not read record 1 of file 2 on LTO Ultrium tape ABX304L1
02/17/08 07:44:11 nsrmmd #7: Read operation failed and aborted.
02/17/08 07:44:11 nsrd: cloning session:1 of 3 save set(s) reading from ABX304L1 182 GB of 219 GB
02/17/08 07:44:11 nsrmmd #7: Read operation failed and aborted.
02/17/08 07:44:11 nsrmmd #7: Bad mark_suspect ssid 0
02/17/08 07:44:11 nsrmmd #7: Read operation failed and aborted.
02/17/08 07:44:11 nsrmmd #7: cannot find ssid 0 to update
02/17/08 07:44:11 nsrd: cloning session:3 of 3 save set(s) done reading 218 GB
02/17/08 07:44:11 nsrmmd #8: MM_CLONEEND w/active saves
02/17/08 07:44:11 nsrd: media info: cannot find ssid 0 to update
02/17/08 07:44:12 nsrd: tanya:cloning session done saving to pool 'Default Clone' (ABX221L1)
02/17/08 07:44:12 nsrd: cloning session:save sets done reading 218 GB
02/17/08 07:44:12 nsrd: media event cleared: LTO Ultrium tape ABX305L1 not used
02/17/08 07:44:12 ansrd: ansrd_clone FAILED: errnum is 15005 and errstr is can not read record 1 of file 2 on LTO Ultrium tape ABX304L1
02/17/08 07:44:12 ansrd: failed to execute MODE_CLONE
02/17/08 07:44:12 savegrp: command 'nsrclone -s nsrhost -b Default Clone -S -f - ' exited with return code 1.
02/17/08 07:44:12 savegrp: Automatic cloning of saveset index:eb92dd6c-00000004-47543d11-475736c1-02720000-8f459dd8 for client nsrhost during savegroup operation has failed!
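
Since the failure shows up across drives and tapes, it may help to tally which volume and device pairs appear in the "Cannot decode block" notices over time. What follows is only a minimal sketch in Python, assuming every notice has the same layout as the one in the excerpt above; the function name `tally_decode_errors` and the `sample` list are hypothetical, and the script is not part of NetWorker.

```python
import re
from collections import Counter

# Matches the "Cannot decode block" media notice as it appears in the
# excerpt above. Assumption: other daemon.log entries use the same layout.
NOTICE = re.compile(
    r'^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) nsrd: media notice: '
    r'Volume "(?P<volume>[^"]+)" on device "(?P<device>[^"]+)": '
    r'Cannot decode block'
)


def tally_decode_errors(lines):
    """Count 'Cannot decode block' notices per (volume, device) pair."""
    counts = Counter()
    for line in lines:
        match = NOTICE.match(line.strip())
        if match:
            counts[(match.group("volume"), match.group("device"))] += 1
    return counts


# Two lines copied from the excerpt above; only the first is a media notice.
sample = [
    '02/17/08 07:43:15 nsrd: media notice: Volume "ABX304L1" on device '
    '"/dev/rmt/c25t0d1BESTnb": Cannot decode block. Verify the device '
    'configuration. Tape positioning by record is disabled.',
    '02/17/08 07:44:11 nsrmmd #7: Read operation failed and aborted.',
]

for (volume, device), count in tally_decode_errors(sample).items():
    print(volume, device, count)
```

To run it against a real log, pass an open file object instead of `sample`. If the counts cluster on one device or one volume, that points at hardware; if they spread evenly, something like scheduling or contention becomes a more likely suspect.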

Any suggestions will be greatly appreciated.

mllegato · Answer

Looks like a read error to me. I'm having similar "Cannot decode block" errors every now and then, but on older hardware (DLT), and have never been able to figure out the cause. Quite often I am able to read on a second attempt.
Can you read this very saveset (recover it)?
Can you clone it manually?

Regards

dbagley1 · Answer

I've not tried to recover or clone the saveset manually after a failure, as the error seems to indicate an issue with the original. After starting the failed savesets again a second (or third) time, I eventually will get a successful backup + clone.

I had a bit of success with last week's full after separating the two savegroups that were giving me problems. They used to run nearly concurrently. Hopefully, the success will hold for this week as well.

Thanks for the response. I wish you luck with your errors also.

Dave
