Something is closing the backup session after 4-5 minutes.
Hello!
First of all, short info about our Networker environment:
Server: Running Windows Server 2012 R2 with EMC Networker 8.1.1.3
Failing Client: Running Windows Server 2008 R2 with EMC Networker 8.1.1.3
For a couple of months we have had a client that just won't complete its backups. The client starts the backup but fails a couple of minutes later, and if I run Networker User on the client it complains about RPC errors.
We had a case open with EMC for a couple of weeks, but it was closed since we thought we had found that the problem was the network communication between the server and client. Once I had informed our network technicians, we started to change the network configuration and even tried an any <> any rule between the server and client, without success.
Below is a list of things that we have already tried together with EMC:
- Removed the client resources on the client and the server.
- Fully reinstalled EMC Networker on the client.
- Tried different Networker versions on the client.
- Changed the host file on both server and client.
- Rpcinfo between the servers is looking good.
- Nslookup between the servers is looking good.
- Tried different save operations and backup commands, for example "VSS:*=off"
- The Networker health check tools run fine without any problems.
Since the failing client is a production machine, I can't freely test different settings on it, so I have created a virtual machine on which I am testing different backup options. Since backups from the test machine work on other networks (VLANs), I still assume this is a problem with the specific network that the failing client is located on, but the question is what?
Yesterday I tried to find a common denominator for this RPC error, and I think I have found something interesting. I created 6 different files with the same file name but different file sizes. Here is a table with the results:
Filename | Size | Backup status |
Test.txt | 10GB | OK after 02:45 min |
Test.txt | 15GB | OK after 03:10 min |
Test.txt | 18GB | OK after 03:53 min |
Test.txt | 20GB | Fail after 04:44 min |
Test.txt | 30GB | Fail after 04:45 min |
Test.txt | 40GB | Fail after 04:43 min |
The save set on the test client was set to C:\test.txt and I was running the command savegrp -l full -c TESTCLIENT -g GROUP.
As you can see, every backup that runs longer than roughly 4 minutes fails, or put differently, every backup where the file reaches 20GB or more. The error message I receive after 4 minutes is: Exited with exit code: Unknown, completion severity: INFORMATION(10), completion status: unexpectedly exited(2).
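To make the pattern in the table explicit, here is a small Python sketch that converts the times to seconds and compares the successful and failed runs. The data is copied from the table; the helper names (`to_seconds`, `summarize`) are only illustrative. If file size were the cause, the failure times would be expected to grow with the size. Instead they all land within a couple of seconds of each other, which looks more like a fixed timeout somewhere on the path.

```python
# Results copied from the table above: (size in GB, status, elapsed "MM:SS").
RESULTS = [
    (10, "OK", "02:45"),
    (15, "OK", "03:10"),
    (18, "OK", "03:53"),
    (20, "Fail", "04:44"),
    (30, "Fail", "04:45"),
    (40, "Fail", "04:43"),
]


def to_seconds(mmss):
    """Convert an "MM:SS" string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def summarize(results):
    """Compare the longest successful run with the failed runs."""
    ok = [to_seconds(t) for _, status, t in results if status == "OK"]
    failed = [to_seconds(t) for _, status, t in results if status == "Fail"]
    return {
        "longest_ok": max(ok),                      # longest run that still succeeded
        "earliest_fail": min(failed),               # earliest point at which a run died
        "fail_spread": max(failed) - min(failed),   # how tightly the failures cluster
    }


summary = summarize(RESULTS)
print(summary)
# → {'longest_ok': 233, 'earliest_fail': 283, 'fail_spread': 2}
```

The failures differ by only 2 seconds while the file size doubles from 20GB to 40GB, so the cutoff appears to depend on elapsed time rather than on the amount of data.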
Does anyone have a clue how we can solve or troubleshoot this?
All the help and tips we can get is very much appreciated!
// Anthon Hassel
crazyrov
4 Operator
1.3K Posts
1
October 7th, 2014 07:00
Are the results for Test.txt done using bigasm? If so, then it's definitely a network-related issue. Did you have the network guys check the timeout value set on the firewall? You might need to set a keep-alive parameter to keep the connection between the client and server active across the firewall. Get the timeout information from the network admin and then we will tell you what to do next.
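For readers unfamiliar with the keep-alive idea, here is a minimal Python sketch of what it means at the socket level: the operating system sends small probes on an otherwise idle TCP connection, so a firewall in between keeps seeing traffic. This is not how NetWorker itself is configured (that is done through its own settings and the OS, which are not shown here); the function name `enable_keepalive` and the 120 s / 30 s values are assumptions chosen only for illustration.

```python
import socket


def enable_keepalive(sock, idle_s=120, interval_s=30):
    """Turn on TCP keep-alive probes for a socket.

    idle_s and interval_s are illustrative values; in practice they should
    be well below the idle timeout of any firewall on the path.
    """
    # Portable switch: ask the OS to send keep-alive probes at all.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    if hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: (on/off, idle time in ms, probe interval in ms).
        sock.ioctl(socket.SIO_KEEPALIVE_VALS,
                   (1, idle_s * 1000, interval_s * 1000))
    else:
        # Linux and similar: per-socket timers in seconds, where available.
        if hasattr(socket, "TCP_KEEPIDLE"):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        if hasattr(socket, "TCP_KEEPINTVL"):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)

    # Report whether the option is now active on the socket.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(enable_keepalive(s))
    s.close()
```

The point of the sketch is only the relationship between the two numbers: the keep-alive idle time has to be shorter than the firewall's idle timeout, otherwise the connection is dropped before the first probe is sent.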
bingo.1
2.4K Posts
1
October 7th, 2014 08:00
Try to isolate the issue.
I personally would avoid running automatic backups from the very beginning. The error should also appear running manual backups. On the client run "save -s server ". I expect the issue to appear here already.
To investigate this deeper, let me suggest that you run debug mode, for example "save -D5 -s server > outfile.txt 2>&1". Then investigate outfile.txt .
AnthonH
35 Posts
0
October 7th, 2014 22:00
Hello Crazyrov!
I haven't used bigasm in these tests; actually I had never heard of it before. I'm trying it out now, and these are the steps I took:
1. Created a folder called bigasm on C:\
2. Created an empty file with the name 1 in C:\bigasm
3. Created nsr.dir with the following content:
bigasm -S30GB: *
4. Started the save by running: save -vvv -s BACKUPSERVER.DOMAIN.NET c:\bigasm
Did I do it the right way? I assume I did everything right, since I got a familiar result from the save. Here is the output after bigasm generated a 30GB backup:
90099:save: Unable to close save session: RPC send operation failed; errno = Unknown error
80319:save: Aborting session channel connection (1) to 172.20.4.10; why = Unknown error
101702:save: Unable to send indication to jobd: Unknown error
90099:save: Unable to close save session: RPC send operation failed; errno = Unknown error
94693:save: The backup of save set 'C:\bigasm' failed.
7167:save: save completion time: 2014-10-08 07:45:17
100129:save: Unable to update job with id '691897' with name value 'C:\bigasm':
Stale asynchronous RPC handle
90005:save: Unable to end job: Stale asynchronous RPC handle
172.20.4.10 = Backupserver IP.
This time the generated backup took a lot longer than 5 minutes.
I will talk to the network admins later today and give you an answer, but I think the firewall timeout is set to 30 minutes.
// Anthon
bingo.1
2.4K Posts
0
October 7th, 2014 23:00
bigasm is just a tool to create a large amount of backup data on the fly without loading your disk. I guess it has been around from the very beginning. However, as you obviously have prepared a source file large enough, it would not matter where the data came from.
I strongly suggest that you rerun the backup in debug mode. Most likely the information will guide you/your colleagues to the problematic area. What do you have to lose?
AnthonH
35 Posts
0
October 7th, 2014 23:00
Thank you for the information Bingo!
Tried to run bigasm with a 5GB file now, almost the same error:
save: C:\bigasm 5000 MB 00:02:06 4 files
94694:save: The backup of save set 'C:\bigasm' succeeded.
7167:save: save completion time: 2014-10-08 08:03:08
80319:save: Aborting session channel connection (1) to 172.20.4.10; why = Unknown error
100129:save: Unable to update job with id '691902' with name value 'C:\bigasm':
Unknown error
90005:save: Unable to end job: Stale asynchronous RPC handle
This time it seems that the save set succeeded.
I'm running a save command in debug mode now on a 20GB test file. I'll let you know what I find.
// Anthon
crazyrov
4 Operator
1.3K Posts
0
October 8th, 2014 00:00
It's strange that even though the backup completes successfully you still have error messages at the end. Do you use multiple NICs on your client machine?
What ports do you currently have open between your backup server and the client machine? On both the backup server and the client, check the port ranges configured in NetWorker by running the command "nsrports".
Maybe we will find a clearer error in the debug run that you currently have in progress.
AnthonH
35 Posts
0
October 8th, 2014 00:00
Ok. I tried a backup on the test client in debug mode and I found this in the log, which seems to make some sense:
10/08/14 08:36:07.404736 RPC Authentication: error in LookupAccountSid: No mapping between account names and security IDs was done. (Win32 error 0x534)
10/08/14 08:48:06.879938 selbackuptest.DOMAIN.NET:C:\test.txt size 20 GB, 3 file(s), took 11 min 33 sec
10/08/14 08:48:07.192136 rpc/lib/c_tcp.c:1345 Failed to write to socket 1308: Unknown error
90099:save: Unable to close save session: RPC send operation failed; errno = Unknown error
80319:save: Aborting session channel connection (1) to 172.20.4.10; why = Unknown error
10/08/14 08:48:07.192136 Unknown error101702:save: Unable to send indication to jobd: Unknown error
90099:save: Unable to close save session: RPC send operation failed; errno = Unknown error
10/08/14 08:48:07.207746 WINDOWS ROLES AND FEATURES: sow_free_save called
10/08/14 08:48:07.207746 VSS USER DATA: sow_free_save called
10/08/14 08:48:07.207746 VSS OTHER: sow_free_save called
94693:save: The backup of save set 'C:\test.txt' failed.
7167:save: save completion time: 2014-10-08 08:48:07
10/08/14 08:48:07.207746 msgssn (id 1): Can't send message because the session channel is closed.
10/08/14 08:48:07.207746 Stale asynchronous RPC handle100129:save: Unable to update job with id '691917' with name value 'C:\test.txt': Stale asynchronous RPC handle
10/08/14 08:48:07.238966 lgto_auth for `nsrexec' succeeded
10/08/14 08:48:07.254575 msgssn (id 1): Can't send message because the session channel is closed.
10/08/14 08:48:07.254575 Stale asynchronous RPC handle90005:save: Unable to end job: Stale asynchronous RPC handle
10/08/14 08:48:08.597027 VSS .. Gathering writer status after BackupComplete...
10/08/14 08:48:08.643857 VSS .. Checking status of 10 writers.
10/08/14 08:48:08.659466 VSS .. cancelsnap called -- backup state is 4 -- SnapDone -- after TakeTheSnapshot succeeded
10/08/14 08:48:08.659466 VSS .. destructing NSRSnapper..
10/08/14 08:48:08.690686 win32_post_save(): Called
10/08/14 08:48:08.690686 Clearing ssn session 0x0000000001EBF590 (fd 684) from [0], ssn_max_pollfd 0
I can upload the whole file if you want.
Maybe I should have mentioned this earlier, but we also have a cluster on the same network that works without any problems in Networker. The only difference is that this cluster is running Windows Server 2012 R2. So I decided to copy the TCP settings from one of the cluster machines to the test machine, and for a second I thought the problem was solved, since a bigasm test with 5GB went fine without any RPC error. One minute later I tried with a 30GB file and it failed; after that I tried a 2GB file and it also failed... This makes no sense.
bingo.1
2.4K Posts
0
October 8th, 2014 01:00
The result supports the opinion you already had - it is obvious that something is wrong with the RPC connection.
At this point it makes sense to look at the OS and the network driver and their respective updates.
crazyrov
4 Operator
1.3K Posts
0
October 8th, 2014 01:00
If you have an antivirus on this system, can you disable it and test the backups again?
AnthonH
35 Posts
0
October 8th, 2014 22:00
Ok, we finally managed to find the problem. After I had tried everything you suggested, except the part about network drivers and updates, I got the information from the network admin that there was a load balancer in front of the failing client. After some research we found a timeout setting that was set to 5 minutes. We changed it to 10 minutes and the client failed after 10 minutes instead. We will now change it to 45 minutes and hopefully the client backup will succeed.
Thank you crazyrov and bingo for the help and fast answers, very much appreciated!
// Anthon
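The reasoning behind the 45 minute choice can be sketched in a few lines of Python. This is a deliberately rough model, assuming the load balancer drops a session once it has been open (or idle) longer than the configured timeout; the thread does not say which of the two applies. The function name `predicted_outcome` is made up for the illustration, and the backup duration is the 11 min 33 sec reported in the debug run of the 20GB file.

```python
def predicted_outcome(backup_seconds, timeout_seconds):
    """Rough model: the session survives only if it ends before the timeout."""
    return "OK" if backup_seconds < timeout_seconds else "Fail"


# Duration of the 20GB debug run from the thread: 11 min 33 sec.
BACKUP_SECONDS = 11 * 60 + 33

# The three load balancer timeouts mentioned in the thread, in minutes.
for minutes in (5, 10, 45):
    print(minutes, predicted_outcome(BACKUP_SECONDS, minutes * 60))
# → 5 Fail
# → 10 Fail
# → 45 OK
```

Under this model the timeout has to exceed the longest backup the client will ever run, so a larger save set could hit the same wall again at 45 minutes unless the timeout is raised further or the connection is kept active with keep-alives.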