
D2


January 7th, 2014 12:00

Backups fail: Server Busy (NW 8.1)

About a month ago we upgraded from 8.0 to 8.1, and since then we have had random problems with backups failing to the NW server. All other storage nodes are fine. The jobs eventually time out and just give me the following:

SQL02.domain.com:index 98519:save: Unable to setup direct save with server backup01.domain.com: busy.


Looks like the indexes are timing out. We have a dedicated pool for them with media available. Any ideas? Support got them working once by disabling and re-enabling the storage node for the NW server under Devices. The nsrmmd showed as UNKNOWN; afterwards it changed to 8.1.0.3.Build.219. Periodically it changes back to UNKNOWN. Now nothing is working.

14.3K Posts

January 17th, 2017 13:00

Well, the first thing I notice is direct save, which may indicate that the client cannot communicate with the DD directly, for example, and has to proxy through a storage node; that's what direct save is about. In that case, horsepower on the server is less important. Second, sometimes (and this depends entirely on the specific NW version/patch) I have noticed that my server client parallelism (or storage node client parallelism) gets reset. As it turns out, this limits the number of streams the storage node role accepts. Not saying this is your issue, but worth checking. You might also check what error you get once you clear the direct save check box on the clients that do not get this error. If you are using DD (which makes sense with direct save), check the session limits of your box and DDOS so you don't push it beyond what is possible.

25 Posts

January 18th, 2017 03:00

You could try disabling client direct and setting the "Client retries" attribute in the group properties to 0 (meaning unlimited retries); hopefully it would kick-start after a finite number of retries!

146 Posts

January 18th, 2017 07:00

Hrvoje/Karthik, I have played with the client direct setting many times. If I disable client direct, the backups will not even attempt to run; they simply sit in a queued state for 8 hours or more, until I kill the jobs. I checked the node and storage parallelism, and they do not seem to have changed. The failure messages are always the same (see below), and fairly generic. I've researched them, and I do see many accounts of others with these issues, but no solutions that have helped.

suppressed 6502 bytes of output.

53084:(pid 3112):Processing model failed, the item will be skipped.

98519:(pid 3112): Unable to setup direct save with server pdc00nwka802w.ohlogistics.com: retry needed.

98519:(pid 3112): Unable to setup direct save with server pdc00nwka802w.ohlogistics.com: retry needed.

98519:(pid 3112): Unable to setup direct save with server pdc00nwka802w.ohlogistics.com: retry needed.

98519:(pid 3112): Unable to setup direct save with server pdc00nwka802w.ohlogistics.com: retry needed.

98519:(pid 3112): Unable to setup direct save with server pdc00nwka802w.ohlogistics.com: retry needed.

38008:(pid 3112):Internal system error, please see nsr\applogs\xbsa.messages on the client system for reason.

29085:(pid 3112):Microsoft SQL Server Provider error:

38006:(pid 3112):Write on "Legato#dff3d290-6290-426e-bc9d-4d88b171d4c3" failed: 995(The I/O operation has been aborted because of either a thread exit or an application request.).

38006:(pid 3112):A nonrecoverable I/O error occurred on file "Legato#dff3d290-6290-426e-bc9d-4d88b171d4c3:" 995(The I/O operation has been aborted because of either a thread exit or an application request.)..

38006:(pid 3112):BACKUP DATABASE is terminating abnormally..

98519:(pid 3112): Unable to setup direct save with server pdc00nwka802w.ohlogistics.com: retry needed.

98519:(pid 3112): Unable to setup direct save with server pdc00nwka802w.ohlogistics.com: retry needed.

98519:(pid 3112): Unable to setup direct save with server pdc00nwka802w.ohlogistics.com: retry needed.

98519:(pid 3112): Unable to setup direct save with server pdc00nwka802w.ohlogistics.com: retry needed.

98519:(pid 3112): Unable to setup direct save with server pdc00nwka802w.ohlogistics.com: retry needed.

53084:(pid 3112):Processing msdb failed, the item will be skipped.
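The repeated 98519 lines above can be tallied per server to see whether the failures cluster on one host. A throwaway sketch in Python; the regex wording is copied from the excerpt above and may vary between NW versions, and the sample data is just ten copies of the quoted line:

```python
import re
from collections import Counter

# Matches the NetWorker 98519 message quoted in the log excerpt above;
# the exact wording may differ between NetWorker versions.
RETRY = re.compile(r"Unable to setup direct save with server (\S+): retry needed")

def count_retries(log_lines):
    """Tally 'retry needed' direct-save failures per server name."""
    counts = Counter()
    for line in log_lines:
        m = RETRY.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Ten copies of a line shaped like the excerpt above
sample = ["98519:(pid 3112): Unable to setup direct save with server "
          "pdc00nwka802w.ohlogistics.com: retry needed."] * 10
```

If one server dominates the tally across all the failing groups, that points at the server side rather than the individual clients.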

14 Posts

January 18th, 2017 10:00

Have you seen messages like this in your daemon.raw?

nsrd GSS warning Cannot use DD device err: Too many save streams (xxx) on DDR since that would cause the device to exceed the maximum DDR write stream counts (write stream count)..
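If a message of that shape does appear (daemon.raw is a raw log, so it is typically run through nsr_render_log first), the two counts embedded in it tell you how far over the limit the DDR was. A rough sketch, assuming the rendered message carries numeric counts where the quote above shows placeholders:

```python
import re

# Pattern for the DDR stream warning quoted above; the wording is taken
# from this thread and assumes numeric values in place of the "(xxx)"
# and "(write stream count)" placeholders.
DDR_WARNING = re.compile(
    r"Too many save streams \((\d+)\) on DDR.*"
    r"maximum DDR write stream counts \((\d+)\)"
)

def parse_ddr_warning(line):
    """Return (active_streams, max_streams) if the line is the DDR
    stream warning, else None."""
    m = DDR_WARNING.search(line)
    if m:
        return int(m.group(1)), int(m.group(2))
    return None

# Hypothetical rendered line; 149 matches the 'total max write streams'
# value reported later in this thread, 155 is an invented overage.
sample = ("nsrd GSS warning Cannot use DD device err: Too many save "
          "streams (155) on DDR since that would cause the device to "
          "exceed the maximum DDR write stream counts (149)..")
```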

14.3K Posts

January 18th, 2017 11:00

Do you see the session established and nothing coming through? It sounds like traffic is blocked (routing? firewall? network?)

146 Posts

January 19th, 2017 07:00

Armando, no. I searched the daemon.raw file on the NetWorker server, and nothing matched that. I was kind of hoping it would, to at least have something further to track down. BUT, it does seem to me that it has something to do with being too busy somewhere, or some hung process. On the other hand, once these failures start to happen, they will not resolve on their own, even during quieter periods. I always have to restart services on the NetWorker server to get the SQL backups to run cleanly again.

14 Posts

January 19th, 2017 08:00

I frequently see this behavior. I later thought about it, and maybe the reason you don't see that message in your daemon.raw is that your NW server version may not log it; I see it on 9.1. I found that the message relates to the "total max write streams" configured for the Data Domain that primary backups go to. That attribute is in the 'nsr data domain' resource. In my case I was going above 149:

                        type: NSR Data Domain;
                        name: xxxxxxxxxxxxx;
                     comment: ;
                       hosts: xxxxxxxxxxxxx;
                    username: sysadmin;
                    password: *******;
                       model: DD640;
                  OS version: Data Domain OS 5.7.3.0-548132;
                      serial: ;
                 FC hostname: ;
             management host: ;
             management user: ;
         management password: ;
      management certificate: ;
             management port: ;
                  cloud unit: ;
       SNMP community string: networker;
                storage node: nsrserverhost;
               used capacity: 18 TB;
          available capacity: 12 TB;
       used logical capacity: 261 TB;
              total capacity: 31 TB;
          inuse read streams: 0;
    inuse read write streams: 0;
   inuse replication streams: 0;
      inuse filecopy streams: 0;
           max write streams: 90;
            max read streams: 30;
max replication source streams: 60;
max replication destination streams: 90;
           total max streams: 479;
     total max write streams: 149;
                 export path: ;
            last update time: "Thu Jan 19 10:59:21 2017";
                    hostname: onextobck1.onex.com;
               administrator:
          ONC program number: 390109;
          ONC version number: 2;
               ONC transport: [TCP]   UDP ;

146 Posts

January 19th, 2017 12:00

Hrvoje, are you thinking that I might be exceeding my 'max streams'? I know we have a fairly large backup environment, and our SQL transaction log backups spawn a lot of running jobs.

146 Posts

January 19th, 2017 12:00

Armando, here is what that returns for me.

nsradmin> print type: nsr data domain
                        type: NSR Data Domain;
                        name: ***********;
                     comment: ;
                       hosts: **********;
                    username: *********;
                    password: *******;
                       model: DD7200;
                  OS version: Data Domain OS 5.4.4.3-460654;
                      serial:
                 FC hostname: ;
       SNMP community string: public;
                storage node:
               used capacity: 200 TB;
          available capacity: 48 TB;
       used logical capacity: 10 PB;
              total capacity: 249 TB;
          inuse read streams: 3;
    inuse read write streams: 6;
   inuse replication streams: 0;
      inuse filecopy streams: 3;
           max write streams: 540;
            max read streams: 150;
max replication source streams: 270;
max replication destination streams: 540;
           total max streams: 1932;
     total max write streams: 942;
            last update time: "Thu Jan 19 06:47:15 2017";
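Both posters are comparing 'inuse' against 'max' stream counters by eye; the same check can be scripted against saved nsradmin output. A rough sketch in Python, assuming the plain `attribute: value;` layout shown above (attribute names and sample values are copied from the DD7200 dump in this thread):

```python
def parse_nsr_resource(text):
    """Parse 'attribute: value;' lines from nsradmin output into a dict."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip().rstrip(";")
        if ":" in line:
            key, _, value = line.partition(":")
            attrs[key.strip()] = value.strip()
    return attrs

def write_stream_headroom(attrs):
    """Headroom before 'max write streams' is hit (inuse counts mix
    read/write sessions, so treat this as approximate)."""
    inuse = int(attrs.get("inuse read write streams", "0"))
    limit = int(attrs.get("max write streams", "0"))
    return limit - inuse

# Values taken from the DD7200 resource printed above
sample = """\
           max write streams: 540;
    inuse read write streams: 6;
           total max streams: 1932;
"""
```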

14.3K Posts

January 19th, 2017 12:00

I believe it is not total max streams but max streams which is limiting you, as that comes from DDOS itself and the unit you run (though I'm not sure what total max streams is then); it's the same value you see in recent NW 8.x.

146 Posts

January 19th, 2017 13:00

Actually, I'm on 8.2.3.8, but I've had this issue since I was on 8.2.1.4... I hijacked an older thread. :-)

No, here's exactly how the issue goes, and it doesn't matter whether it is a SQL tlog backup or the daily full SQL backup. I just see it mostly with the tlog backups because they run most frequently.

1) I have alerts that I created for failed savegroups, and I start getting alerts emailed to me that look like this.

NetWorker savegroup failure: (alert) SQL_TLOG_BACKUPS_3HR completed, Total 14 client(s), 0 Clients disabled, 0 Hostname(s) Unresolved, 10 Failed, 4 Succeeded, 0 CPR Failed, 0 CPR Succeeded, Cloning succeeded...........

2) I log into the NMC, and I see the failed savegroup. The email alert also shows me which clients failed in the savegroup, as do the job details.

3) I restart the group and monitor it. The group fails again, with the same clients. Also, the initial failure does not necessarily start during heavy backup times. I have also noted the failed clients, logged into EACH one of them, restarted NetWorker services, and restarted the failed backup job. The job fails again, with the same clients, so I don't think it's an issue with the clients.

4) I start getting other email alerts for other failed SQL tlog savegroups. I restart those jobs, and they fail again, with same failed clients.

5) Savegroup failure alerts continue to come through.

6) Our SQL DBA sends me a nastygram, because his email box is blowing up with failed SQL backup alerts.

7) I bounce all NetWorker services on the NetWorker server.

8) ALL SQL backups now run quickly and flawlessly for anywhere from 2 days to 1 week, when the whole process starts again, with no warning and no specific day or time.


Now, what I have NOT checked when this overall issue starts is whether it's always the same clients that fail initially. But it should not matter, since the SQL backups do run flawlessly on the days between these failures.

14.3K Posts

January 19th, 2017 13:00

Ah wait, I have a bit different picture now... so, are you saying that there is some NSR group with LOG backups of some SQL clients and instances, and for some reason it fails? And if you restart it (which should normally, depending on config, restart only the failed clients), you see the same clients failing, but if you start the whole group again, it doesn't fail. If so, it sounds like a NW issue to me (almost like a corrupted cache or something used for restarts). And this might be a server side issue (I never ran 8.1 as I jumped straight to 8.2 and never saw this, but I also have TLOG groups running every 1h, so even if one failed I would just leave it, as the next hour it would be ok).

Or, when you say full restart, are you saying that you need to run a DB backup, and only after that will the TLOG backup run again?

14.3K Posts

January 19th, 2017 13:00

I don't. I doubt you have 540 sessions running to the DD at the same time. And even if you did, it fails for you even when everything is idle, right? So there is no impact from other sessions...

146 Posts

January 19th, 2017 13:00

That's right. Once those SQL tlog backups start failing, only a full service restart will make them run. Even during a relatively slow backup period, if I manually restart one of the failed SQL backup groups, it will fail again. And when it fails again, it IS the same clients that fail.

14.3K Posts

January 19th, 2017 13:00

From that description, I would say this is a server side issue for sure. And it sounds as if there is some corruption in a cache. If you see more groups failing over time (so not just the same ones, but more of them kicking in with the same symptoms), it sounds like something is jamming the backup server itself. I can't say if this is the application itself (I doubt it, as on the version you are on that would be fairly well known by now) or something on the OS level. As time is involved, it sounds like some resources are not freed up, and as more jobs are processed over time you hit some queue which gets filled. On the other hand, I suspect this is on the application side, as things get better after you bounce the application and not the whole server. Not sure what OS you run (I didn't check if that was mentioned), but I would check memory usage and also the number of open connections, to make sure those at least close and there are no ghost connections left over. Is your NMM the same version as the server? I ask since in SP3 they made mdb changes, and there was some optimization on the client side which you would only see if the same combo is used. I also recall that the release notes for a couple of modules mentioned that servers could jam... let me see if I can find that one.
