June 14th, 2010 12:00

Time out errors during long savegroup backups

I've got a NetWorker 7.5.2 installation on a RHEL 5 server that seems to have connection problems with long-running savegroups. For the same set of machines (15 clients), on nights when only incrementals run, the backups complete in about 30 minutes and there aren't any problems. However, on full nights, when the backup can take three to four hours, I get a few of these error messages:

* backupclient1:/scratch 1 retry attempted
* backupclient1:/scratch 39077:save: error, Failed to connect to storage node on backupserver1: save failed to authenticate with nsrmmd using GSS Legato: Timed out
* backupclient1:/scratch
* backupclient1:/scratch 5777:save: Cannot open save session with backupserver1
     * :  Failed with error(s)

Also, here's an excerpt from the iptables configuration that's relevant to backup:

-A INPUT -m state --state NEW -m tcp -p tcp --dport 7937:9936  -j ACCEPT

I'm a bit lost and would appreciate any help.

Thanks in advance.

2 Intern • 14.3K Posts

June 14th, 2010 15:00

Oh, I didn't ask which version of NW you run. I ask because in 7.6.x there are a few more settings in group properties, such as soft and hard kill. If set, those could also produce results like the ones you're seeing.

2 Intern • 14.3K Posts

June 14th, 2010 15:00

On the server you will find a log called daemon.log... actually daemon.raw. Run nsr_render_log daemon.raw > daemon.analysis and then load the output file into an editor. Search it for the sessions that failed during the full backup. You should be able to see when each session started and when it broke. If the interval is the same every time (give or take a few seconds), then you are probably hitting some timeout in the firewall/switch/OS. If the interval is different each time and there is no apparent pattern, then most probably it is not the firewall (well, there is still a chance that it is, but we will come to that later, once we know the above).
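
For example, something along these lines (default log location assumed; the exact wording in daemon.raw may differ a bit from the savegroup summary, so adjust the patterns):

cd /nsr/logs
nsr_render_log daemon.raw > /tmp/daemon.analysis
# timestamps of the failures
grep -iE "timed out|cannot open save session" /tmp/daemon.analysis
# start/stop lines for the saveset that failed
grep -i "backupclient1:/scratch" /tmp/daemon.analysis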

9 Posts

June 14th, 2010 15:00

I'm not sure how to check exactly when the failures occur. The only timestamps I can get out of the messages log file are when the savegroup starts and when it sends a result message to the logger with all the failures.

I'm checking the server's config for drop actions now.

2 Intern • 14.3K Posts

June 14th, 2010 15:00

Do you happen to see, in the firewall logs, a drop action for your connection towards the server/storage node? Or, perhaps a better question: do these failures happen after exactly the same amount of time each time?

2 Intern • 14.3K Posts

June 14th, 2010 15:00

OK, now the trick question: does it fail 30 minutes after the group starts, or 30 minutes after the saveset starts to write?

Let me explain. If it fails 30 minutes after the group starts, while the saveset is still queued, it is possible that, due to the parallelism settings, these sessions are queued and get hit by the group inactivity timeout. In that case, go to the group properties and change the default inactivity timeout from 30 to 0 (meaning no inactivity timeout is used).

If your session is killed after 30 minutes while it is running (i.e. writing to tape or disk), it means something is killing your live session. I'm not aware of any setting (including the firewall) that would kill an active session after 30 minutes, at least not by default. This is why I suspect the group's inactivity timeout is the problem here. It is just a guess, but perhaps the right starting point to explore.
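
If you prefer the command line to the console, a minimal nsradmin sketch would be something like this (the group name "Default" is just an assumption, and the attribute names are from memory, so check with print first):

nsradmin -s backupserver1
(then at the nsradmin prompt)
. type: NSR group; name: Default
print
update inactivity timeout: 0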

9 Posts

June 14th, 2010 15:00

Thanks for the information on nsr_render_log. You were right about the timeouts. They each occur exactly 30 minutes (+/- 9 seconds) after they begin.

However, which saveset fails is different each time.

2 Intern • 14.3K Posts

June 14th, 2010 16:00

Actually, there is nothing wrong with not having one. But if we assume a backup group would not run for more than 12 hours, that would be 60x12, so 720. What you put there really depends on many factors. If this is a default installation, it is possible that some default settings (like server parallelism) are limiting the number of concurrent sessions, keeping your sessions queued when they could be running. This again depends on the version and the type of base enabler. I think, at least as a test, you should set the group inactivity timeout to 0 and verify that the full backup runs successfully. If not, we continue digging.

9 Posts

June 14th, 2010 16:00

The default parallelism setting for the server was 32; perhaps it's too high?

I agree with your test; we'll try to run one tonight under conditions similar to those that made it fail in the past, and see if we can get it to complete by disabling the inactivity timeout.

I will update this thread accordingly.

Thank you for your help!

9 Posts

June 14th, 2010 16:00

The start time is from when the group begins, not from when the write begins.  I actually never see the savesets that end up failing begin to write.  I have to agree with you that the inactivity timeout sounds like the problem here.

Is there a suggested value for the inactivity timeout?  Not to have one at all seems a little scary.

It also seems a bit weird that, for a fairly innocuous setup of 15 clients, an out-of-the-box installation would fail trying to back them up. I haven't modified any parallelism or inactivity timeout values from the default installation.

2 Intern • 14.3K Posts

June 14th, 2010 16:00

32 is OK. I'm not sure about your setup, but if you have a single group with 15 clients, each with saveset All, that would give you 15x4 sessions (at least; I'm sure the real value is higher), which is already 60. So that means 32 sessions would start to run while 28 would be queued. In that case, after 30 minutes those queued sessions would be seen as idle and would be killed by the server itself (server = backup application). The easiest way, without much tuning, would be to create several groups of about 5 clients each, running an hour or so apart, which would spread the load. There are endless combinations here and there is no magic formula, as each setup is usually tweaked in the end to match the environment it runs in.
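
If you go that route, a rough nsradmin sketch would look something like this (the group and client names here are just placeholders, the attribute names are from memory, and update replaces the client's whole group list, so verify with print first):

nsradmin -s backupserver1
(then at the nsradmin prompt)
create type: NSR group; name: Fulls-B; autostart: Enabled; start time: "22:00"
. type: NSR client; name: backupclient6
update group: Fulls-B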

9 Posts

June 15th, 2010 09:00

Even with the inactivity timeout set to 0, I still had savesets fail at the 30-minute mark.

2 Intern • 14.3K Posts

June 15th, 2010 11:00

This might indicate the problem is some TCP timeout (OS/switch/firewall), if none of the NW settings is set to cause this.

Please try the following test: set the savegroup parallelism to 4 and start a full backup. I wonder what you will see then...
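
If it still breaks at the 30-minute mark with that, it is worth looking at what the OS and connection tracking do with idle TCP connections. On RHEL 5 something like this shows the relevant kernel settings (a rough sketch; exact parameter names vary between kernels):

sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
sysctl -a 2>/dev/null | grep -i conntrack | grep -i timeout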

1 Rookie • 92 Posts

June 15th, 2010 12:00

We went in a slightly different direction, and it seems to work for us. We have quite a few sessions that can run for a long time during full backup runs, and instead of setting the inactivity timeout to 0, we set it to 1,000. Give that a try and see if it does anything.

9 Posts

June 17th, 2010 09:00

Setting parallelism to 4 allowed the backups to complete without throwing any errors. This is with the inactivity timeout set to 30.

In our setup, we are backing up to a staging disk connected to the backup server. The device is set to 4 target sessions; could this be why?

Thanks again!

445 Posts

June 17th, 2010 10:00

Alphonis,

Target sessions on a device is what we call a soft limit: NetWorker will allocate that many sessions to the device before it attempts to use another device. If all devices are in use (mounted with this pool or others), it will start to round-robin the remaining sessions across the available devices, so you could end up with 20 sessions on your staging disk, not just 4. This all depends on the hard-limit values: client, savegroup, and server parallelism. Server parallelism will not allow any more sessions to be started once it is reached, so, as Hrvoje said previously, it may be the limiter for your environment, with lots of groups, or clients within a group, starting at the same time or close to each other.
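
If you want to see the effective limits on your side, something like this in nsradmin prints the relevant resources (look at "target sessions" on the device and "parallelism" on the server resource; the server name is just your example hostname):

nsradmin -s backupserver1
(then at the nsradmin prompt)
print type: NSR device
print type: NSR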

HTH

Regards,

Bill Mason
