rayan-chazbek

Networker 18.1 large file system failure

Dear all, I am new to the community. I am having trouble backing up a file system of about 2 TB. The backup starts normally and data is transferred, but after about an hour and a half the backup fails, even though it continues to send data. Thank you in advance.

bdos1
Re: Networker 18.1 large file system failure

Welcome to the community.

May I suggest you attach the log from the backup so we can check the reason for the failure?
Backup logs are located here: ../nsr/logs/policy_name/workflow_name/action_name
(Be sure to redact any confidential information from the backup logs if posting on the forums.)

Large backups commonly run into inactivity timeout or TCP keepalive issues; however, we will need to check the logs to understand exactly what is happening.

rayan-chazbek
Re: Networker 18.1 large file system failure

1/17/2019 8:25:01 AM Step (1 of 5): nsrjobd has made a request to start this savegrp with PID-9080.
1/17/2019 8:25:01 AM Action backup traditional 'backup' has initialized as 'backup action job' with job id 162669
1/17/2019 8:25:06 AM Step (2 of 5): Querying the group or policy for the configured group of clients with the savegrp PID-9080.
1/17/2019 8:25:06 AM override match on 1/17/2019 for level full
1/17/2019 8:25:06 AM ********:D:\ requested level=full
1/17/2019 8:25:06 AM Step (3 of 5): The group or policy information has been successfully returned.
1/17/2019 8:25:06 AM Action backup traditional will run up to 100 jobs in parallel
1/17/2019 8:25:06 AM Step (4 of 5): Creating a savefs job for all the configured clients.
1/17/2019 8:25:06 AM Creating a 'savefs' job on the host '********'.
1/17/2019 8:25:06 AM Policy 'Windows', workflow 'FileSystem', action 'backup', group 'Windows-FileSystem'.
1/17/2019 8:25:06 AM Starting action backup traditional 'backup', which has 1 clients.
1/17/2019 8:25:06 AM Starting a session on the host '********' to execute the job '********:savefs', which scans the file system to determine the files for backup.
1/17/2019 8:25:06 AM The job '********:savefs' has been started on the client '********'.
1/17/2019 8:25:06 AM ********:savefs started
1/17/2019 8:25:06 AM savefs -s prd-networker.sgbl.com.lb -c ******** -g Windows-FileSystem -p -o "\"PSS:streams_per_ss=12,*;\"" -l full -R -v -F "D:\\"
1/17/2019 8:25:07 AM Group Windows-FileSystem waiting for 1 jobs (0 awaiting restart) to complete.
1/17/2019 8:25:12 AM ********:D:\
1/17/2019 8:25:12 AM level=full, vers=pools, p=12
1/17/2019 8:25:12 AM The job '********:savefs' on the host '********' has been completed.
1/17/2019 8:25:12 AM ********:savefs succeeded.
1/17/2019 8:25:12 AM ********:savefs The job has successfully scanned the file system on the host '********'. The main save will now be started.
1/17/2019 8:25:12 AM Windows-FileSystem:********:savefs See the file 'D:\Networker\nsr\logs\policy\Windows\FileSystem\backup_162669_logs\162670.log' for command output.
1/17/2019 8:25:12 AM Step (5 of 5): Creating a pseudo_saveset job for all the configured clients.
1/17/2019 8:25:12 AM Creating a save job for the save set 'pseudo_saveset' on the host '********'.
1/17/2019 8:25:12 AM Parallel save streams per save set option value '-M#4' is being applied to save set 'pseudo_saveset'
1/17/2019 8:25:12 AM Constructing the save command for the save set 'pseudo_saveset' on the host '********': save -LL -s prd-networker.sgbl.com.lb -g Windows/FileSystem/backup/Windows-FileSystem -a "*policy action jobid=162669" -a "*policy name=Windows" -a "*policy workflow name=FileSystem" -a "*policy action name=backup" -y "Sun Feb 17 23:59:59 GMT+0200 2019" -w "Sun Feb 17 23:59:59 GMT+0200 2019" -m ******** -M #4 -a "device interface=data domain" -a "Data Domain interface=IP" -b PRDFilesystemDD -o "\"PSS:streams_per_ss=12,*;RENAMED_DIRECTORIES:index_lookup=on;REQUESTED_LEVEL:level=full;\"" -l full -q -W 78 -N pseudo_saveset "D:\\".
1/17/2019 8:25:12 AM Executing a 'pseudo_saveset' job on the host '********'. This job is an anchor save set for the workflow, and will be completed at the end of the client's backup.
1/17/2019 8:25:12 AM ********:pseudo_saveset started
1/17/2019 8:25:12 AM save -LL -s prd-networker.sgbl.com.lb -g Windows/FileSystem/backup/Windows-FileSystem -a "*policy action jobid=162669" -a "*policy name=Windows" -a "*policy workflow name=FileSystem" -a "*policy action name=backup" -y "Sun Feb 17 23:59:59 GMT+0200 2019" -w "Sun Feb 17 23:59:59 GMT+0200 2019" -m ******** -M #4 -a "device interface=data domain" -a "Data Domain interface=IP" -b PRDFilesystemDD -o "\"PSS:streams_per_ss=12,*;RENAMED_DIRECTORIES:index_lookup=on;REQUESTED_LEVEL:level=full;\"" -l ful
1/17/2019 8:25:38 AM Group Windows-FileSystem waiting for 2 jobs (0 awaiting restart) to complete.
1/17/2019 9:45:47 AM Unable to determine status for job 162672, ********:D:\: Log file 'D:\Networker\nsr\logs\policy\Windows\FileSystem\backup_162669_logs\162672.log' not found.
1/17/2019 9:45:47 AM ********:D:\ unexpectedly exited.
1/17/2019 9:45:47 AM The save job for the save set 'D:\' on the host '********' has been completed.
1/17/2019 9:45:48 AM Group Windows-FileSystem waiting for 1 jobs (0 awaiting restart) to complete.
1/17/2019 11:08:17 AM The save job for the save set 'pseudo_saveset' on the host 'xxxx' has been completed.
1/17/2019 11:08:21 AM ********:pseudo_saveset succeeded.
1/17/2019 11:08:21 AM ********:pseudo_saveset Save has closed the session on the host '********'.
1/17/2019 11:08:21 AM Windows-FileSystem:********:pseudo_saveset See the file 'D:\Networker\nsr\logs\policy\Windows\FileSystem\backup_162669_logs\162671.log' for command output.
1/17/2019 11:08:26 AM Action backup traditional 'backup' with job id 162669 is exiting with status 'failed', exit code 1

bdos1
Re: Networker 18.1 large file system failure

Okay, so your logs will be in D:\Networker\nsr\logs\policy\Windows\FileSystem\backup_162669_logs\. Check there for the actual backup log, which records the failure. The workflow log you have shared doesn't tell us much.
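The workflow log timestamps do at least bound how long the save job ran before it died. A minimal Python sketch of that arithmetic, using the two timestamps copied from the workflow log above (the save start at 8:25:12 AM and the "unexpectedly exited" line at 9:45:47 AM); the variable names are purely illustrative:

```python
from datetime import datetime

# Timestamp format used by the workflow log, e.g. "1/17/2019 8:25:12 AM".
LOG_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Both values are copied from the workflow log posted above.
save_started = datetime.strptime("1/17/2019 8:25:12 AM", LOG_TIME_FORMAT)
save_exited = datetime.strptime("1/17/2019 9:45:47 AM", LOG_TIME_FORMAT)

# Elapsed time between the save starting and its unexpected exit.
elapsed = save_exited - save_started
elapsed_minutes = elapsed.total_seconds() / 60
print(f"Save job ran for {elapsed} (about {elapsed_minutes:.0f} minutes) before exiting")
```

If the failure recurs after a similar elapsed time on every run, that would point toward something enforcing a timeout, but the save log is still needed to confirm the cause.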
rayan-chazbek
Re: Networker 18.1 large file system failure

181407:save: Step (1 of 5) for PID-7472: Save has been started on the client '**********'.
174412:save: Step (2 of 5) for PID-7472: Running the backup on the client '**********' for the save set 'pseudo_saveset'.
174424:save: Step (3 of 5) for PID-7472: Creating the snapshot for the selected save sets.
174426:save: Step (4 of 5) for PID-7472: Determining whether the backup level is either full or incremental.
125800:save: The 'PSS:streams_per_ss=' option contains an invalid value 12. The entire option has been ignored.
138851:save: '**********:D:\' is being reset/promoted to level 'Full' as no backup is found.
174430:save: Backup level for **********:D:\ = full.
125800:save: The 'PSS:streams_per_ss=' option contains an invalid value 12. The entire option has been ignored.
181407:save: Step (1 of 5) for PID-5876: Save has been started on the client '**********'.
181407:save: Step (1 of 5) for PID-8252: Save has been started on the client '**********'.
181407:save: Step (1 of 5) for PID-9468: Save has been started on the client '**********'.
174412:save: Step (2 of 5) for PID-5876: Running the backup on the client '**********' for the save set 'D:\'.
174412:save: Step (2 of 5) for PID-8252: Running the backup on the client '**********' for the save set 'D:\'.
174412:save: Step (2 of 5) for PID-9468: Running the backup on the client '**********' for the save set 'D:\'.
180569:save: Identified a save for the backup with PID-9468 on the client '**********'. Updating the total number of steps from 5 to 7.
174920:save: Step (3 of 7) for PID-9468: Contacting the NetWorker server through the nsrd process to obtain a handle to the target media device through the nsrmmd process for the save set 'D:\'.
174908:save:Saving the backup data in the pool 'PRDFilesystemDD'.
175019:save:Received the media management binding information on the host 'prd-networker'.
174910:save:Connected to the nsrmmd process on the host 'prd-networker'.
175295:save: Successfully connected to the Data Domain device.
129292:save:Successfully established Client direct save session for save-set ID '675291109' (**********:D:\) with Data Domain volume 'PRDFilesystemDD.001'.
174922:save: Step (4 of 7) for PID-9468: Successfully connected to the target media device through the nsrmmd process on the host 'prd-networker' for the save set 'D:\'.
174422:save: Step (5 of 7) for PID-9468: Reading the save sets and writing to the target device.
180569:save: Identified a save for the backup with PID-8252 on the client '**********'. Updating the total number of steps from 5 to 7.
174920:save: Step (3 of 7) for PID-8252: Contacting the NetWorker server through the nsrd process to obtain a handle to the target media device through the nsrmmd process for the save set 'D:\'.
174908:save:Saving the backup data in the pool 'PRDFilesystemDD'.
175019:save:Received the media management binding information on the host 'prd-networker'.
174910:save:Connected to the nsrmmd process on the host 'prd-networker'.
175295:save: Successfully connected to the Data Domain device.
129292:save:Successfully established Client direct save session for save-set ID '658513895' (**********:D:\) with Data Domain volume 'PRDFilesystemDD.001'.
174922:save: Step (4 of 7) for PID-8252: Successfully connected to the target media device through the nsrmmd process on the host 'prd-networker' for the save set 'D:\'.
174422:save: Step (5 of 7) for PID-8252: Reading the save sets and writing to the target device.
180569:save: Identified a save for the backup with PID-5876 on the client '**********'. Updating the total number of steps from 5 to 7.
174920:save: Step (3 of 7) for PID-5876: Contacting the NetWorker server through the nsrd process to obtain a handle to the target media device through the nsrmmd process for the save set 'D:\'.
174908:save:Saving the backup data in the pool 'PRDFilesystemDD'.
175019:save:Received the media management binding information on the host 'prd-networker'.
174910:save:Connected to the nsrmmd process on the host 'prd-networker'.
175295:save: Successfully connected to the Data Domain device.
129292:save:Successfully established Client direct save session for save-set ID '641736682' (**********:D:\) with Data Domain volume 'PRDFilesystemDD.001'.
174922:save: Step (4 of 7) for PID-5876: Successfully connected to the target media device through the nsrmmd process on the host 'prd-networker' for the save set 'D:\'.
174422:save: Step (5 of 7) for PID-5876: Reading the save sets and writing to the target device.
181407:save: Step (1 of 5) for PID-11620: Save has been started on the client '**********'.
174412:save: Step (2 of 5) for PID-11620: Running the backup on the client '**********' for the save set 'D:\'.
180569:save: Identified a save for the backup with PID-11620 on the client '**********'. Updating the total number of steps from 5 to 7.
174920:save: Step (3 of 7) for PID-11620: Contacting the NetWorker server through the nsrd process to obtain a handle to the target media device through the nsrmmd process for the save set 'D:\'.
174908:save:Saving the backup data in the pool 'PRDFilesystemDD'.
175019:save:Received the media management binding information on the host 'prd-networker'.
174910:save:Connected to the nsrmmd process on the host 'prd-networker'.
175295:save: Successfully connected to the Data Domain device.
129292:save:Successfully established Client direct save session for save-set ID '624959470' (**********:D:\) with Data Domain volume 'PRDFilesystemDD.001'.
174922:save: Step (4 of 7) for PID-11620: Successfully connected to the target media device through the nsrmmd process on the host 'prd-networker' for the save set 'D:\'.
174422:save: Step (5 of 7) for PID-11620: Reading the save sets and writing to the target device.
181407:save: Step (1 of 5) for PID-10204: Save has been started on the client '**********'.
174412:save: Step (2 of 5) for PID-10204: Running the backup on the client '**********' for the save set 'null'.
parallel save streams partial completed savetime=1547706339
parallel save streams partial completed savetime=1547706343
parallel save streams partial completed savetime=1547706350
parallel save streams partial completed savetime=1547706346
parallel save streams summary **********: D:\ level=full, 1691 GB 02:42:40 90646 files
parallel save streams summary savetime=1547706350
80319:save: Aborting session channel connection (33) to 10.0.51.1; why = An existing connection was forcibly closed by the remote host.
An existing connection was forcibly closed by the remote host.
01/17/19 11:08:17.144573 mbs_sd_thread_proc (tid1): job_worker_end() returned error Stale asynchronous RPC handle
174416:save: Step (5 of 5) for PID-7472: Backup has succeeded. Save is exiting. See the savegrp log to track the closure steps of the backup.
94694:save: The backup of save set 'pseudo_saveset' succeeded.
94694:save: The backup of save set 'pseudo_saveset' succeeded.
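In a save log this long, the connection abort is easy to miss. The following is a minimal Python sketch for pulling such lines out of a log, not a NetWorker tool; the function name `find_connection_aborts` and the regular expression are assumptions based only on the message format visible in the log above.

```python
import re

# Pattern for the abort message as it appears in the save log above.
# This format is inferred from that single message, not from NetWorker docs.
ABORT_PATTERN = re.compile(
    r"Aborting session channel connection \((?P<channel>\d+)\) "
    r"to (?P<peer>\d{1,3}(?:\.\d{1,3}){3}); why = (?P<reason>[^.]+\.)"
)


def find_connection_aborts(log_text):
    """Return a (channel, peer, reason) tuple for each abort message in log_text."""
    return [
        (int(m.group("channel")), m.group("peer"), m.group("reason"))
        for m in ABORT_PATTERN.finditer(log_text)
    ]


# Sample input: the abort message copied from the save log above.
sample = (
    "80319:save: Aborting session channel connection (33) to 10.0.51.1; "
    "why = An existing connection was forcibly closed by the remote host."
)
aborts = find_connection_aborts(sample)
print(aborts)
```

The same function could be run over the contents of each file in the backup_162669_logs directory to see which save streams were cut off.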

bdos1
Re: Networker 18.1 large file system failure

Well, here's your problem at least...

Aborting session channel connection (33) to 10.0.51.1; why = An existing connection was forcibly closed by the remote host.
An existing connection was forcibly closed by the remote host.

This can mean a few things; most likely a firewall is forcing the connection closed.
Here are some things you can do to help:

(1) Set Inactivity Timeout in the action properties to 0 (ensures no timeout).
(2) Implement TCP keepalive tuning on the affected client, storage node, and Data Domain.
(3) If using PSS backups, put the following option into "Save Operations": PSS:timeout_mins=0

Note: if you already have other options in the Save Operations field, they need to be separated with a semicolon (;) with no spaces in between.
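To illustrate the separator rule in the note, here is a minimal Python sketch that joins Save Operations options with semicolons and no spaces. The helper name `join_save_operations` is hypothetical, and the existing option used in the example is taken from the save command in the workflow log earlier in the thread; this only builds the string and is not part of NetWorker.

```python
def join_save_operations(options):
    """Join Save Operations options with ';' and no surrounding spaces."""
    # Trim whitespace and stray leading/trailing semicolons from each option.
    cleaned = [opt.strip().strip(";").strip() for opt in options]
    cleaned = [opt for opt in cleaned if opt]
    for opt in cleaned:
        # Each entry must be a single option; reject embedded separators.
        if ";" in opt:
            raise ValueError(f"option must not contain ';': {opt!r}")
    return ";".join(cleaned)


# An option already present in the save command from the workflow log.
existing = "RENAMED_DIRECTORIES:index_lookup=on"
combined = join_save_operations([existing, " PSS:timeout_mins=0 "])
print(combined)
```

The resulting string is what would be pasted into the Save Operations field.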

rayan-chazbek
Re: Networker 18.1 large file system failure

Thank you, my friend, for your help. It is solved; it was a firewall issue. After about an hour and a half, the firewall was closing the session between the NetWorker server and the client. Now, after moving the session initiation and the backup session to an isolated VLAN, the backup succeeds and the connection is fine.