October 7th, 2014 06:00

Something is closing the backup session after 4-5 minutes.

Hello!

First of all, short info about our Networker environment:

Server: Running Windows Server 2012 R2 with EMC Networker 8.1.1.3

Failing Client: Running Windows Server 2008 R2 with EMC Networker 8.1.1.3

For a couple of months we have had a client that just won't complete its backups. The client starts the backup but fails a couple of minutes later, and if I run Networker User on the client it complains about RPC errors.

We had a case open with EMC for a couple of weeks, but it was closed since we thought we had found that the problem was the network communication between the server and client. Once I had informed our network technicians, we started to change the network configuration and even tried an any <> any rule between the server and client, without success.

Below is a list of things that we have already tried together with EMC:

  • Removed the client resources on the client and the server.
  • Fully reinstalled EMC Networker on the client.
  • Tried different Networker versions on the client.
  • Changed the hosts file on both server and client.
  • Rpcinfo between the servers looks good.
  • Nslookup between the servers looks good.
  • Tried different save operations and backup commands, for example "VSS:*=off".
  • Networker health check tools run fine without any problems.

Since the failing client is a production machine, I can't fully test different settings on it, so I have created a virtual machine on which I'm testing different backup options. Since backups of the test machine work on other networks (VLANs), I still assume this is a problem with the specific network that the failing client is located on, but the question is what?

Yesterday I tried to find a common denominator for this RPC error and I think I have found something interesting. I created 6 different files with the same file name but different file sizes. Here is a table with the results:

Filename   Size   Backup status
Test.txt   10GB   OK after 02:45 min
Test.txt   15GB   OK after 03:10 min
Test.txt   18GB   OK after 03:53 min
Test.txt   20GB   Fail after 04:44 min
Test.txt   30GB   Fail after 04:45 min
Test.txt   40GB   Fail after 04:43 min

The save set on the test client was set to C:\test.txt and I was running the command savegrp -l full -c TESTCLIENT -g GROUP.

As you can see, all backups that run longer than roughly 4 minutes fail, which in this test means every file of 20GB or larger. The error message I receive after 4 minutes is: Exited with exit code: Unknown, completion severity: INFORMATION(10), completion status: unexpectedly exited(2).

Does anyone have a clue how we can solve or troubleshoot this?

All the help and tips we can get are very much appreciated!
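A quick way to read the table above is to compare the failure times with the file sizes. The sketch below only uses the numbers from the table; the function names are made up for illustration. If the failures land within a few seconds of each other while the size doubles, that hints at a time-based cutoff rather than a size limit.

```python
# Test results copied from the table above: (size in GB, "mm:ss", succeeded?)
RESULTS = [
    (10, "02:45", True),
    (15, "03:10", True),
    (18, "03:53", True),
    (20, "04:44", False),
    (30, "04:45", False),
    (40, "04:43", False),
]


def to_seconds(mmss):
    """Convert an "mm:ss" string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def summarize(results):
    """Return the longest successful run and the spread of the failure times."""
    ok = [to_seconds(t) for _, t, success in results if success]
    failed = [to_seconds(t) for _, t, success in results if not success]
    return {
        "longest_ok": max(ok),
        "earliest_fail": min(failed),
        "latest_fail": max(failed),
        "fail_spread": max(failed) - min(failed),
    }


summary = summarize(RESULTS)
print(summary)
```

All three failures fall inside a 2 second window even though the file size goes from 20GB to 40GB, which is what one would expect from a timer somewhere on the path and not from the amount of data.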

// Anthon Hassel

4 Operator • 1.3K Posts

October 7th, 2014 07:00

Were the results for Test.txt produced using bigasm? If so, then it's definitely a network-related issue. Did you have the network guys check the timeout value set on the firewall? You might need to set a keepalive parameter to keep the connection between the client and server active across the firewall. Get the timeout information from the network admin and then we will tell you what to do next.
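To illustrate what a keepalive does in general, here is a minimal sketch using plain TCP sockets in Python. It is not how NetWorker itself is configured (that setting is not shown in this thread); the function name and the default timer values are assumptions chosen for the example. The idea is simply that the idle timer must be shorter than the timeout on the firewall, so the device keeps seeing traffic on an otherwise quiet connection.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive for sock and tune the timers where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The timer options below are platform specific, so set them only if present.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


# Example: a socket that would send its first probe after 60 idle seconds.
demo = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = demo.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
demo.close()
print(keepalive_on)
```

With a 5 minute timeout on the network device, an idle time of 60 seconds would leave a comfortable margin; the exact values are a judgment call.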

2.4K Posts

October 7th, 2014 08:00

Try to isolate the issue.

I personally would avoid running automatic backups from the very beginning. The error should also appear running manual backups. On the client run "save -s server ". I expect the issue to appear here already.

To investigate this further, I suggest that you run in debug mode, for example "save -D5 -s server > outfile.txt 2>&1". Then examine outfile.txt.

35 Posts

October 7th, 2014 22:00

Hello Crazyrov!

I haven't used bigasm in these tests; actually, I had never heard of it before. I'm trying it out now, and these are the steps I followed:

1. Created a folder called bigasm on C:\

2. Created an empty file with the name 1 in C:\bigasm

3. Created nsr.dir with the following content:

bigasm -S30GB: *

4. Started the save by running: save -vvv -s BACKUPSERVER.DOMAIN.NET c:\bigasm

Did I do it the right way? I assume I have done everything right, since I got a familiar result from the save. Please see the output after bigasm generated a backup of 30GB:

90099:save: Unable to close save session: RPC send operation failed; errno = Unknown error

80319:save: Aborting session channel connection (1) to 172.20.4.10; why = Unknown error

101702:save: Unable to send indication to jobd: Unknown error

90099:save: Unable to close save session: RPC send operation failed; errno = Unknown error

94693:save: The backup of save set 'C:\bigasm' failed.

7167:save: save completion time: 2014-10-08 07:45:17

100129:save: Unable to update job with id '691897' with name value 'C:\bigasm':

Stale asynchronous RPC handle

90005:save: Unable to end job: Stale asynchronous RPC handle

172.20.4.10 = Backupserver IP.

This time the generated backup took a lot longer than 5 minutes.

I will talk to the network admins later today and give you an answer, but I think the firewall timeout is set to 30 minutes.

// Anthon

2.4K Posts

October 7th, 2014 23:00

bigasm is just a tool to create a large amount of backup data on the fly without loading your disk. I guess it has been around from the very beginning. However, as you obviously have prepared a source file large enough, it would not matter where the data came from.

I strongly suggest that you rerun the backup in debug mode. Most likely the information will guide you or your colleagues to the problematic area. What do you have to lose?

35 Posts

October 7th, 2014 23:00

Thank you for the information Bingo!

Tried to run bigasm with a 5GB file now, almost the same error:

save: C:\bigasm  5000 MB 00:02:06      4 files

94694:save: The backup of save set 'C:\bigasm' succeeded.

7167:save: save completion time: 2014-10-08 08:03:08

80319:save: Aborting session channel connection (1) to 172.20.4.10; why = Unknown error

100129:save: Unable to update job with id '691902' with name value 'C:\bigasm':

Unknown error

90005:save: Unable to end job: Stale asynchronous RPC handle

This time it seems that the save set succeeded.

I'm running a save command in debug mode now on a 20GB test file. I'll let you know what I find.

// Anthon

4 Operator • 1.3K Posts

October 8th, 2014 00:00

It's strange that even though the backup completes successfully you still get error messages at the end. Do you use multiple NICs on your client machine?

Which ports do you currently have open between your backup server and the client machine? On both the backup server and the client, check the port ranges configured in NetWorker by running the command "nsrports".

Maybe we will find a clearer error in the debug run that you are currently doing.

35 Posts

October 8th, 2014 00:00

OK. I tried a backup on the test client in debug mode and found this in the log, which seems to make some sense:

10/08/14 08:36:07.404736 RPC Authentication: error in LookupAccountSid: No mapping between account names and security IDs was done. (Win32 error 0x534)

10/08/14 08:48:06.879938 selbackuptest.DOMAIN.NET:C:\test.txt size 20 GB, 3 file(s), took  11 min  33 sec

10/08/14 08:48:07.192136 rpc/lib/c_tcp.c:1345 Failed to write to socket 1308: Unknown error

90099:save: Unable to close save session: RPC send operation failed; errno = Unknown error

80319:save: Aborting session channel connection (1) to 172.20.4.10; why = Unknown error

10/08/14 08:48:07.192136 Unknown error101702:save: Unable to send indication to jobd: Unknown error

90099:save: Unable to close save session: RPC send operation failed; errno = Unknown error

10/08/14 08:48:07.207746 WINDOWS ROLES AND FEATURES: sow_free_save called

10/08/14 08:48:07.207746 VSS USER DATA: sow_free_save called

10/08/14 08:48:07.207746 VSS OTHER: sow_free_save called

94693:save: The backup of save set 'C:\test.txt' failed.

7167:save: save completion time: 2014-10-08 08:48:07

10/08/14 08:48:07.207746 msgssn (id 1): Can't send message because the session channel is closed.

10/08/14 08:48:07.207746 Stale asynchronous RPC handle100129:save: Unable to update job with id '691917' with name value 'C:\test.txt': Stale asynchronous RPC handle

10/08/14 08:48:07.238966 lgto_auth for `nsrexec' succeeded

10/08/14 08:48:07.254575 msgssn (id 1): Can't send message because the session channel is closed.

10/08/14 08:48:07.254575 Stale asynchronous RPC handle90005:save: Unable to end job: Stale asynchronous RPC handle

10/08/14 08:48:08.597027 VSS .. Gathering writer status after BackupComplete...

10/08/14 08:48:08.643857 VSS .. Checking status of 10 writers.

10/08/14 08:48:08.659466 VSS .. cancelsnap called -- backup state is 4 -- SnapDone -- after TakeTheSnapshot succeeded

10/08/14 08:48:08.659466 VSS .. destructing NSRSnapper..

10/08/14 08:48:08.690686 win32_post_save(): Called

10/08/14 08:48:08.690686 Clearing ssn session 0x0000000001EBF590 (fd 684) from [0], ssn_max_pollfd 0
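A debug log like this gets long quickly, so it can help to pull out only the lines that mention the RPC or session channel failures. The following is a rough sketch in Python; the search strings are taken from the messages quoted in this thread, and the function name and sample list are made up for illustration.

```python
import re

# Substrings taken from the error lines quoted in this thread.
PATTERNS = re.compile(
    r"RPC send operation failed|Failed to write to socket|"
    r"Stale asynchronous RPC handle|session channel"
)


def rpc_error_lines(lines):
    """Return only the lines that mention an RPC or session channel failure."""
    return [line.rstrip("\n") for line in lines if PATTERNS.search(line)]


# A few lines copied from the log above, used here as sample input.
SAMPLE = [
    "10/08/14 08:48:06.879938 selbackuptest.DOMAIN.NET:C:\\test.txt size 20 GB, 3 file(s), took  11 min  33 sec",
    "10/08/14 08:48:07.192136 rpc/lib/c_tcp.c:1345 Failed to write to socket 1308: Unknown error",
    "90099:save: Unable to close save session: RPC send operation failed; errno = Unknown error",
    "7167:save: save completion time: 2014-10-08 08:48:07",
    "90005:save: Unable to end job: Stale asynchronous RPC handle",
]

for line in rpc_error_lines(SAMPLE):
    print(line)
```

The same function accepts an open file, so the full debug output could be filtered by passing the file object for the redirected log instead of SAMPLE.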

I can upload the whole file if you want.

Maybe I should have mentioned this earlier, but we also have a cluster on the same network that works without any problems in Networker. The only difference is that this cluster is running Windows Server 2012 R2. So I decided to copy the TCP settings from one of the cluster machines to the test machine, and for a second I thought the problem was solved, since a bigasm test with 5GB went fine without any RPC error. One minute later I tried with a 30GB file and it failed; after that, a 2GB file also failed. This makes no sense.

2.4K Posts

October 8th, 2014 01:00

The result supports the opinion you already had: it is obvious that something is wrong with the RPC connection.

At this point it makes sense to look at the OS and the network driver and their respective updates.

4 Operator • 1.3K Posts

October 8th, 2014 01:00

If you have antivirus software on this system, can you disable it and test the backups again?

35 Posts

October 8th, 2014 22:00

OK, we finally managed to find the problem. After I had tried everything you suggested, except the part about network drivers and updates, I got the information from the network admin that there was a load balancer in front of the failing client. After some research we found a timeout setting that was set to 5 minutes. We changed it to 10 minutes and the client failed after 10 minutes instead. We will now change it to 45 minutes and hopefully the client backup will succeed.
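For anyone who wants to check a path for this kind of timeout without running a full backup, here is a minimal sketch of an idle probe in Python. It assumes there is some simple echo service listening on the far side (that service, the host and the port are hypothetical and not part of NetWorker). The probe connects, stays silent for a chosen time, and then checks whether a round trip still works.

```python
import socket
import time


def survives_idle(host, port, idle_seconds, payload=b"ping"):
    """Connect, stay silent for idle_seconds, then try a round trip.

    Returns True if the peer still answers after the idle period,
    False if the connection was dropped in the meantime.
    Assumes the peer echoes back whatever it receives.
    """
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(payload)
        if sock.recv(len(payload)) == b"":
            return False  # peer closed before the idle period even started
        time.sleep(idle_seconds)
        try:
            sock.sendall(payload)
            return sock.recv(len(payload)) != b""
        except OSError:
            # Covers resets and timeouts caused by a device dropping the flow.
            return False
```

Calling it with an idle time just below and just above the configured timeout (for example 280 and 320 seconds against a 5 minute setting) should show the connection surviving the first and failing the second, if an idle timeout on the path is really the cause.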


Thank you crazyrov and bingo for the help and fast answers, very much appreciated!

// Anthon
