January 7th, 2014 12:00

Backups fail: Server Busy (NW 8.1)

About a month ago we upgraded from 8.0 to 8.1 and since then we have random problems with backups failing to the NW server. All other storage nodes are fine. The jobs eventually time out and just give me the following:

SQL02.domain.com:index 98519:save: Unable to setup direct save with server backup01.domain.com: busy.


It looks like the indexes are timing out. We have a dedicated pool for them with media available. Any ideas? Support got them working once by disabling and re-enabling the storage node for the NW server under Devices. The nsrmmd showed as UNKNOWN. Afterwards it changed to 8.1.0.3.Build.219. Periodically it changes back to UNKNOWN. Now nothing is working.

2 Intern • 146 Posts

January 19th, 2017 14:00

Thank you, Hrvoje. I have thought the exact same thing: that some process is backlogging somewhere, and is either causing this issue over time or being caused by it.


I just went back, opened an Excel spreadsheet, and set up conditional formatting for duplicate values, so I could compare the last few times this issue occurred.


This happened on Jan 15, Jan 17, and Jan 18 (I can't see back further in the GUI). I see something interesting: on Jan 15 and Jan 17 the exact same clients failed. On Jan 18 most of them match up, with a few exceptions.


So there does seem to be a correlation, but that doesn't explain why they start failing in the first place, or why they all start failing at the same time.
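
The same comparison can be done outside Excel. Below is a minimal Python sketch, assuming the failed client names for each day have been exported from the GUI by hand. The function name `overlap_report` and the client names are placeholders for illustration, not the real failure lists.

```python
from itertools import combinations


def overlap_report(failures_by_day):
    """For each pair of days, list clients that failed on both or on only one."""
    report = {}
    for day_a, day_b in combinations(sorted(failures_by_day), 2):
        a = set(failures_by_day[day_a])
        b = set(failures_by_day[day_b])
        report[(day_a, day_b)] = {
            "common": sorted(a & b),       # failed on both days
            "only_first": sorted(a - b),   # failed only on day_a
            "only_second": sorted(b - a),  # failed only on day_b
        }
    return report


# Placeholder client names (assumption): replace with the exported lists.
failures = {
    "2017-01-15": ["client-a", "client-b", "client-c"],
    "2017-01-17": ["client-a", "client-b", "client-c"],
    "2017-01-18": ["client-a", "client-b", "client-d"],
}

for days, diff in overlap_report(failures).items():
    print(days, diff)
```

Two days with identical failures show empty `only_first` and `only_second` lists, which makes the "few exceptions" on the third day easy to pick out.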


The OS on my NetWorker server, NMC, and all nodes is Windows Server 2012 R2. The NW server is a VM with 48 GB of RAM and two dual-core CPUs. All are running NW 8.2.3.8.




6 Operator • 14.4K Posts • 56.2K Points

January 19th, 2017 14:00

ESC 25733, but apparently that was fixed in SP3, which you are running.

6 Operator • 14.4K Posts • 56.2K Points

January 19th, 2017 15:00

Spec-wise, I believe it should be enough, but this depends entirely on load (by default I use 72 GB on Linux, even though on Linux I would need less memory than on Windows). All my servers have also been VMs for years now, with no issues there. I can imagine that when you see it for all clients, that is when the server really gets lost, so everything is affected. Before that point, I wonder if there is some spike when it happens. I can imagine that there might be a condition which causes this and which the server usually gets out of. But when it doesn't, all bets are off and everything breaks from then on.

If, and that is a big if, we assume the issue might be resources in NW (like many clients starting at the same time), you can try the following (you will know better what applies, since you know your config):

- make sure no two groups start at the same time (if needed, separate them by a minute if you have a group jungle) - I know that in the past simultaneous group starts had an impact on NW, and a separation as small as 1 minute improved things

- see if you can put fewer clients in each group. Perhaps even add more separate groups with fewer clients, in the hope of spreading the load more evenly

The above is just to water down a high number of simultaneous requests, in case that is in any way related to whatever is going on there. Also, try to use NMC less. I know this may sound strange, but my experience is that NMC pulling data can sometimes give NW a hard time, especially when using an NMC server that has latency between itself and the NW server. In my case, I noticed that some of my co-workers like to leave NMC open, or use one from the "wrong" country, and that has an effect on server performance (and sometimes things do break).
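
The suggestion to separate group starts by a minute can be sketched as follows. This is only an illustration in Python, assuming you have written down each group's start time by hand. The function `stagger_starts` and the group names are hypothetical, and it does not read or change anything in NetWorker. It only proposes new start times that you would then set yourself.

```python
from datetime import datetime, timedelta


def stagger_starts(groups):
    """Move colliding groups to the next free minute.

    groups: dict of group name -> "HH:MM" start time (zero-padded, 24h).
    Returns a new dict in which no two groups share a start minute.
    """
    one_minute = timedelta(minutes=1)
    taken = set()
    result = {}
    # Sort by start time, then name, so the result is deterministic.
    for name, start in sorted(groups.items(), key=lambda kv: (kv[1], kv[0])):
        slot = datetime.strptime(start, "%H:%M")
        while slot.time() in taken:
            slot += one_minute  # push later until the minute is free
        taken.add(slot.time())
        result[name] = slot.strftime("%H:%M")
    return result


# Hypothetical group names and times, for illustration only.
current = {"SQL": "20:00", "Files": "20:00", "Exchange": "20:00", "VMs": "21:30"}
print(stagger_starts(current))
```

Groups that already have a minute to themselves keep their original start time. Only the colliding ones are pushed later.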

Another thing we haven't touched on yet is whether this happens (or starts to happen) during the nsrim check of the media db and indices. Cleanup of devices can sometimes also kill the server response-wise, and I know that in some past versions those cleanup sessions were eating into server parallelism (there are also a few KBs about it, but there was never a clear statement on how this is designed to work). Still, this check may sometimes eat up resources on the box and create problems. Normally it runs every 23 or 24 hours (depending on the NSR version), based on the timestamp of the file nsrim.prv in the mm folder (I believe it is called the same on Windows). Once an NSR group finishes, NW checks that timestamp and, if more than 23 (or 24) hours have passed, it runs the nsrim check (and updates the timestamp). Normally, during the check you will see increased CPU on nsrmmdbd, but the real hit starts during the cleanup of devices (at least if you use disk devices; with tape devices you won't really see that).
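
To line up failures with the nsrim check, it may help to look at how old the nsrim.prv timestamp is around the time things break. Below is a minimal Python sketch of the timestamp rule described above. The helper names are hypothetical, the path to nsrim.prv has to be supplied by you (it is not assumed here), and the 23-hour default should be changed to 24 if that is what your version uses. This only reads the file's modification time and is not how NetWorker itself implements the check.

```python
import os
import sys
import time


def hours_since_modified(path, now=None):
    """Hours elapsed since the file at `path` was last modified."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 3600.0


def nsrim_check_due(path, threshold_hours=23, now=None):
    """True if more than `threshold_hours` have passed since the timestamp."""
    return hours_since_modified(path, now) > threshold_hours


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py <path to nsrim.prv in the mm folder>
    prv = sys.argv[1]
    print(f"{hours_since_modified(prv):.1f} hours since last nsrim check")
    print("next group completion should trigger nsrim:", nsrim_check_due(prv))
```

If the failures consistently begin shortly after the timestamp is updated, that would point towards the nsrim check or the device cleanup that follows it.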
