Since upgrading NetWorker from 7.6.5 to 8.1.1.4, the simple file-level LAN backup of a large Windows file server intermittently fails when a tape fills and the next spanning (continuation) tape is started

July 8th, 2014 02:00

We upgraded the NetWorker backup software from 7.6.5 to 8.1.1.4 a couple of months ago. The server is Linux x64 (Red Hat); the client in question is Windows Server 2008 R2 (x64). The client is a central file server with some 7 TB of data to back up across four file systems (drives), totaling roughly 7.5 - 8 million files. The backup is very simple: file-level, using the NetWorker client, over the LAN (1 Gb Ethernet backup LAN), to either a Data Domain VTL or physical tape. By the way, the client is a VMware VM without any "tricks" (no Windows cluster, no VMware FT), though it is part of a VMware DRS cluster.

Since we upgraded this client (from 7.6.5 to 8.1.1.4) we have had regular problems during full backups. The full backups run weekly (over the weekend), and backing up all four drives takes more than 24 hours because of the many small files and slow disks, but this speed and backup window are otherwise acceptable.

The problem is this: when a tape (or VTL) volume fills during the backup and a new empty or recyclable volume is readily available, it sometimes happens (say, 1 out of 10 such spanning events) that writing begins on the next volume but very soon afterwards (within a minute) something fails, and the savestream and the saveset fail with this error message (in the savegroup completion report):

* prodapp009:B:\ 76677:save: RPC send operation failed; errno = Unknown error

* prodapp009:B:\ 74209:save: Quit signal received.

* prodapp009:B:\ 99123:save: Handling an abort while processing Windows backup.

In the daemon.raw, we have this or similar in such cases:

71193 06/01/2014 06:08:51 PM 0 0 0 2096645888 22066 0 prodapp071 nsrd NSR info Media Notice: LTO Ultrium-3 tape A00093 on /dev/tape/by-id/scsi-1IBM_ULTRIUM-TD3_9330100010-nst is full

71193 06/01/2014 06:08:52 PM 0 0 0 2096645888 22066 0 prodapp071 nsrd NSR info Media Info: verification of volume "A00093", volid 2274023053 succeeded.

71193 06/01/2014 06:08:52 PM 0 0 0 2096645888 22066 0 prodapp071 nsrd NSR info Media Notice: LTO Ultrium-3 tape A00093 used 419 GB of 409 GB capacity

71193 06/01/2014 06:08:52 PM 0 0 0 2096645888 22066 0 prodapp071 nsrd NSR info Write completion Notice: Writing to volume A00093 completed

0 06/01/2014 06:08:52 PM 1 5 0 2096645888 22066 0 prodapp071 nsrd NSR notice 06/01/14 18:08:52.823520 Operation 220 started : Load volume `A00103', volume id `3076747626'..

71191 06/01/2014 06:08:52 PM 0 0 0 2096645888 22066 0 prodapp071 nsrd NSR info Media Waiting event: Waiting for 1 writable volume(s) to backup pool 'DDPoolFileserver' tape(s) or disk(s) on prodapp071

38752 06/01/2014 06:08:55 PM 2 0 0 2096645888 22066 0 prodapp071 nsrd NSR warning /dev/tape/by-id/scsi-1IBM_ULTRIUM-TD3_9330100007-nst ejected A00033

38752 06/01/2014 06:08:55 PM 1 5 0 2096645888 22066 0 prodapp071 nsrd NSR notice /dev/tape/by-id/scsi-1IBM_ULTRIUM-TD3_9330100007-nst Eject operation in progress

67986 06/01/2014 06:09:01 PM 2 0 0 122476288 22701 0 prodapp071 nsrmmgd NSR warning Unloading volume `A00033' from device `/dev/tape/by-id/scsi-1IBM_ULTRIUM-TD3_9330100007-nst' to slot 53.

67985 06/01/2014 06:09:08 PM 2 0 0 122476288 22701 0 prodapp071 nsrmmgd NSR warning Loading volume `A00103' from slot `104' into device `/dev/tape/by-id/scsi-1IBM_ULTRIUM-TD3_9330100007-nst'.

38752 06/01/2014 06:09:15 PM 1 5 0 2096645888 22066 0 prodapp071 nsrd NSR notice /dev/tape/by-id/scsi-1IBM_ULTRIUM-TD3_9330100007-nst Verify label operation in progress

38752 06/01/2014 06:09:16 PM 1 5 0 2096645888 22066 0 prodapp071 nsrd NSR notice /dev/tape/by-id/scsi-1IBM_ULTRIUM-TD3_9330100007-nst label operation in progress

71193 06/01/2014 06:09:17 PM 0 0 0 2096645888 22066 0 prodapp071 nsrd NSR info Deleted media Notice: Deleted volume: volid=3076747626, volname=~1401638956~A00103, pool=DDPoolFileserver, location=?

12361 06/01/2014 06:09:21 PM 1 5 0 2096645888 22066 0 prodapp071 nsrd NSR notice [Jukebox `DD_L180', operation # 220]. Finished with status: succeeded

71192 06/01/2014 06:09:33 PM 0 0 0 2096645888 22066 0 prodapp071 nsrd NSR info Media event cleared: Waiting for 1 writable volume(s) to backup pool 'DDPoolFileserver' tape(s) or disk(s) on prodapp071

70896 06/01/2014 06:09:33 PM 0 0 2 2096645888 22066 0 prodapp071 nsrd NSR info prodapp009:B:\ saving to pool 'DDPoolFileserver' (A00103) 912 GB

71659 06/01/2014 06:09:40 PM 0 0 2 2096645888 22066 0 prodapp071 nsrd NSR info prodapp009:B:\ done saving to pool 'DDPoolFileserver' (A00103) 912 GB

84078 06/01/2014 06:09:42 PM 2 5 0 858326784 15379 0 prodapp071 savegrp NSR warning command 'save -s prodapp071-b -g Fileserver -LL -m prodapp009 -l full -q -W 78 -N "B:\\" "B:\\"' for client prodapp009 exited with return code 1

This happens sometimes, but not every time a new volume is started. (The DD VTL emulates LTO-3, so spanning happens roughly every 400 GB... One complete set of full backups of all four drives takes some 15-18 volumes, and almost every week we have a problem with one saveset or another.) It can happen on any of the four drives being backed up. It happens when backing up to the VTL (the normal case), but also to physical tape (for archive purposes). Some Windows client-side tuning has already been done (network parameters, keepalive etc.), as has some tuning of NetWorker resources (e.g. inactivity / nsrmmd / storage node timeouts), but so far there is no fix and no clue as to what is actually happening. We are not aware of a similar issue with any other NetWorker client in this environment (e.g. another Windows server with a large amount of data or many files), but this Windows file server is quite unique in its data volume.
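
In case anyone wants to reproduce the correlation on their own server: the excerpts above come from daemon.raw, and something along these lines should pull out the relevant window (this assumes the default /nsr/logs location on the Linux server; the volume and client strings are just the ones from this particular incident):

    nsr_render_log /nsr/logs/daemon.raw | grep -E 'A00103|prodapp009:B'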

Any ideas or experience?

Thanks in advance.

Geza

2 Intern • 14.3K Posts

July 8th, 2014 04:00

Hi Geza,

I realize you have done many tests and that everything points to the source you mentioned, but still, let me ask a few things:

a) Did you start to see this when the client was updated, or when both the client and the server (and storage nodes) were? If it was just the client, it may be possible to test a few other client versions to see whether the same issue appears.

b) A quit signal is something I usually see with slower clients (WAN) running AV. Changing the AV policy to avoid read scans, or excluding save from scanning, helps (and in the specific case of Symantec we noticed that its own self-protection would kill save when it went over the AV signature files). I'm not saying this is your problem, but different NW versions may have different behavior patterns which may or may not look suspicious to AV, so if your server is running AV it would be a good thing to check. As in your case, I have seen this happen on more or less random occasions and never really understood why.

c) You already mentioned keepalive, but usually this is required on both sides of the communication channel, not just on the client - I doubt this is your issue, but it is worth exploring.

d) Do you see the same thing if you kick off the backup from the client side? (I realize this will be a pain to test due to the size and the fact that it does not happen every time, but if this test is confirmed to be successful it usually indicates that the control session was being killed. In your case I suspect it is the data session that gets killed, not the control one, but to complete the picture it would be nice to have this tested too.)

e) Do you use TCP offloading at the software level?

26 Posts

July 8th, 2014 05:00

Hi,

Thanks for your quick response.

Q a) Did you start to see this when the client was updated, or when both the client and the server (and storage nodes) were? If it was just the client, it may be possible to test a few other client versions to see whether the same issue appears.

A: We believe it started when the client was updated. The server was updated at the end of March or so, the client around the beginning of May. We are also thinking about either downgrading the client (back to the "original" 7.6.5, which worked properly with the 7.6.5 server for quite some time and even worked quite well for a month with the 8.1.1 server), or updating the client to the latest fixes (8.1.1.6 is already available). Based on the compatibility info we have from the docs (Install Guide?), a newer server should be OK with an earlier client, at least with these specific versions (8.1.1 and 7.6.5).

Q b) A quit signal is something I usually see with slower clients (WAN) running AV. Changing the AV policy to avoid read scans, or excluding save from scanning, helps (and in the specific case of Symantec we noticed that its own self-protection would kill save when it went over the AV signature files). I'm not saying this is your problem, but different NW versions may have different behavior patterns which may or may not look suspicious to AV, so if your server is running AV it would be a good thing to check. As in your case, I have seen this happen on more or less random occasions and never really understood why.

A: Well, yes, as this is a typical Windows file server it has server-side AV installed (McAfee McShield, v8.8?). Your idea regarding AV and its settings could help; an EMC support engineer also recommended it (i.e. some AV exclude rules/directories, no scanning on backup file opens/reads). We will follow up on this hint - unfortunately it will take some extra time, as the security officer who has permission to change the AV settings is on holiday right now.

The connectivity is LAN, relatively low-latency 1 Gb, so communication should not be an issue.

Q c) You already mentioned keepalive, but usually this is required on both sides of the communication channel, not just on the client - I doubt this is your issue, but it is worth exploring.

A: That may make sense - network parameter tuning (e.g. keepalive) was only done on the client side.
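
If we end up tuning the server side as well, my understanding is that on Linux it comes down to the usual TCP keepalive sysctls, roughly like this (the values below are only placeholders to illustrate the knobs, not a recommendation for our environment):

    # example entries for /etc/sysctl.conf on the backup server, applied with sysctl -p
    net.ipv4.tcp_keepalive_time = 600
    net.ipv4.tcp_keepalive_intvl = 60
    net.ipv4.tcp_keepalive_probes = 9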

Q d) Do you see the same thing if you kick off the backup from the client side? (I realize this will be a pain to test due to the size and the fact that it does not happen every time, but if this test is confirmed to be successful it usually indicates that the control session was being killed. In your case I suspect it is the data session that gets killed, not the control one, but to complete the picture it would be nice to have this tested too.)

A: A-ha. I think I understand what you mean, but let me confirm: you mean that a client-initiated backup (e.g. the save command, or NetWorker User GUI / Backup) uses only the data stream, and the control stream (which is also there when doing a server-initiated backup, i.e. running a group) is in that case not part of the equation... Did I understand your idea correctly? We have not tried client-initiated backups so far to test this issue. Even with server-initiated backups - and we have run an awful lot of test save groups - it is pretty uncertain whether we will see the error or not, and it is complicated and time-consuming to test. Our impression is that when running only one saveset at a time, e.g. as a test, we usually do not see the problem, even when more than 1.5 TB and some 2 million files are backed up, 3-5 volumes are involved, and volume-full/volume-spanning happens during the backup. When running several of those savesets at the same time at level=full (which is how the "normal" weekend backups look), we almost always get the problem: some 16-18 volumes are consumed, more than 10 volume-full events occur, and at least one of those events generates the error. We have even seen this: multiple savestreams going to the same physical tape drive (multiplexed), tape full, the next tape loaded (and probably (re-)labelled), and all but one of the savestreams continue to back up data, with the remaining one failing at that point in time.
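
If we do get around to testing it, I assume the client-initiated equivalent would be roughly the same save command that savegrp spawns, just run on prodapp009 itself without the group - something like the line below (the pool and level simply mirror our setup; please correct me if the syntax is off):

    save -s prodapp071-b -b DDPoolFileserver -l full B:\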

Q e) Do you use TCP offloading at the software level?

A: No offloading, not to my knowledge.
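
To double-check on the 2008 R2 client, I assume it would be something like the following from an elevated command prompt (the second command only if we ever decide to rule chimney offload out completely):

    netsh int tcp show global
    netsh int tcp set global chimney=disabled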

Regards,

Geza

2 Intern • 14.3K Posts

July 8th, 2014 05:00

  • Yes, your client version should be OK, though I would rather use a NW 8.x client with a NW 8.x server. You could also try 8.0.3.x, for example.
  • Yes, you are correct - a client-initiated backup won't have an idle control session while the data stream is active. In the past (and the present, I guess) that used to be an issue, and there were various workarounds like enabling verbose on the group or setting NSR_KEEPALIVE_WAIT in the startup scripts (now /nsr/nsrrc in NW8, but on Windows still via environment variables - see the small sketch at the end of this post). From what you have said so far, it does not look like an issue with the control session, but rather the data session.
  • I'm wondering whether the fact that this happens after a tape change is a clue or misleading. It could be either, I guess, but it is more likely a pattern if it happens in all your tests/runs. Based on your logs - does this happen after:
    • tape filled, next tape requested, next tape relabeled, tape started to be used, kaboom, or
    • tape filled, next tape requested, next tape loaded, tape started to be used, kaboom, or
    • does it not matter?

The difference in the third point is whether the next tape has to be relabeled first, or an already relabeled/new tape is simply loaded. Normally this should not matter at all, but it seems we are dealing with a rather specific issue, so a specific condition may apply too. If - and that's a big if - it turned out that this happens only after a tape is relabeled first, then the next question would be how long the job had to wait for the tape change before it could continue (the waiting time might be crucial at the TCP level when dealing with idle sessions).
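
Regarding the NSR_KEEPALIVE_WAIT workaround from the second point, a rough sketch of what I mean (the value is only a placeholder - check the documentation for the unit and a sensible setting for your environment):

    # /nsr/nsrrc on the NW8 server / storage node
    NSR_KEEPALIVE_WAIT=30
    export NSR_KEEPALIVE_WAIT

    rem on a Windows client it goes in as a system environment variable instead, e.g.:
    setx NSR_KEEPALIVE_WAIT 30 /M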

26 Posts

July 8th, 2014 06:00

Hi,

  • Q: ...I would rather use a NW 8.x client with a NW 8.x server. You could also try 8.0.3.x, for example.
    • A: We were thinking about "downgrading" to 7.6.5, as that was the version in use (both on server and client) before going to 8.1.1.
  • Q: does this happen after:
    • tape filled, next tape requested, next tape relabeled, tape started to be used, kaboom, or
    • tape filled, next tape requested, next tape loaded, tape started to be used, kaboom, or
    • does it not matter?

The difference in the third point is whether the next tape has to be relabeled first, or an already relabeled/new tape is simply loaded. Normally this should not matter at all, but it seems we are dealing with a rather specific issue, so a specific condition may apply too. If - and that's a big if - it turned out that this happens only after a tape is relabeled first, then the next question would be how long the job had to wait for the tape change before it could continue (the waiting time might be crucial at the TCP level when dealing with idle sessions).

A: This mostly happens with a plain load, no relabel needed. (That is because the backup admin keeps a sufficient number of empty, labelled virtual tape volumes readily available.) But we have also seen the same error a few times when NetWorker automatically relabelled an expired volume. Timing is not necessarily an issue (I would say it probably is not): in the case of the VTL there are 16 virtual tape drives, one or two of which are always immediately available to continue the backup, and virtual tape volumes are also available right away. Furthermore, loading and even relabelling a virtual tape is much faster than with physical tape - we usually see less than a minute between volume full and the backup continuing on the next volume, even with labelling included. And in most cases, within a minute of the volume-full event (so only a few seconds after writing continues), the savestream has already failed...

2 Intern • 14.3K Posts

July 8th, 2014 06:00

OK, then it is probably fair to say that the timing between the stop and restart of an ongoing session should not be the issue (well, one way to test that would be continuous tests to a pure disk device, but I understand that is not your setup, so this cannot be tested).

26 Posts

July 8th, 2014 09:00

Hi,

In the meantime the backup admin reminded me that some time ago, when the current backup server was installed, some network tuning had already been done - the typical keepalive settings (time, interval, probes). I will research this a bit more to see whether there are any further network tuning recommendations for the Linux x64 backup server.

I think we will pursue the antivirus lead - I double-checked, and the AV log has some entries referring to save.exe!!!! As the security officer is out of office these days, it will take a few days to make progress. I will also try to correlate suspicious AV log entries with the backup failures.
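
For the correlation I have something simple in mind, e.g. on the client (the log path below is just where I believe our VSE 8.8 on-access log lives - treat it as a placeholder and adjust it to your installation), and then matching the timestamps against the failure times in the savegroup completion reports:

    findstr /i /c:"save.exe" "C:\ProgramData\McAfee\DesktopProtection\OnAccessScanLog.txt"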

I will let you know the results in a few days' time...

Thanks for your comments.

Regards,

Geza
