I'm new to the forum, and new-ish to Networker. Just wondering if anyone else has run into an issue where they are trying to back up a 19 TB save set (using Networker 188.8.131.52, build 1356) and it seems to stall out around the 15 TB mark and just... start over? Seems odd to me that a 19 TB save set would spool 45+ TB to tape over a weekend and never actually complete.
Is this a known issue? I've searched the forums and can't really find anything to direct me to a resolution.
I've used NetWorker at a previous company, where I was consistently troubleshooting and remediating errors on a daily basis. EMC support actually lined up a dedicated resource to help with configurations. There are so many settings that it takes time to learn what they all do. Reach out to support and take notes as you're working through issues. As for your question:
19 TB is a pretty good-sized backup to run all at once; is there an opportunity to split it up into smaller chunks? My largest backup was around 7 TB. I wish I had more helpful information.
Thanks for the feedback! Unfortunately it's the backing storage on a single CIFS file share, and splitting it out is not (currently) an option for me. 19 TB is a beast of a saveset for sure, but it almost seems to me like there's some sort of bug here. I will definitely seek support from EMC if I find that nobody else has seen this.
AFAIK, there is no limit with respect to the size of a save set. However, there is one major timing issue:
The group is supposed to be started once a day at the same time. This may cause the issue after 24 hrs.
The 'inactivity timeout' should be adjusted as follows:
- Switch on "View > Diagnostic Mode"
- Open the savegroup's Properties - Advanced tab
- Set the 'Inactivity timeout' to zero
Make sure that the client is the only member of this group.
Start the group.
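For the record, the same change can also be sketched from the command line with nsradmin; the server and group names below are placeholders, and the exact attribute spelling should be double-checked in diagnostic mode on your own NetWorker version:

```
nsradmin -s my_nsr_server
nsradmin> . type: NSR group; name: MyBigGroup
nsradmin> update inactivity timeout: 0
nsradmin> quit
```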
Without knowing the circumstances of what happens when it seems to stall out around the 15 TB mark, it would be difficult to say what the cause is.
The first thing I would look into is whether it fails in the same way every time. I.e., is the elapsed time consistent? The amount of data backed up? Maybe it fails at exactly the same "spot" in the saveset being backed up? Does it ever complete? Did it ever have a successful backup? Is there anything in the NetWorker or O/S logs on the server, client, and storage node around the time of failure that might give you a clue?
To test walking through the saveset, on the client, you can use the following:
save -s (NetWorker server) -b (NetWorker pool) -v -n (saveset name) >list.txt 2>&1
This save command will process the 19 TB saveset, with the exception that "-n" will cause save not to actually send any data. This results in the save command just walking through the saveset. If successful, it tells us that there are likely no issues with reading the data, and it can also tell us how long it takes NetWorker just to read the data.
I would also look at networking, and how the client, server, storage node, and storage devices are connected to each other.
What type of backup device are you writing to? Have you observed what the write speed is when this is backing up? Is this speed consistent or fluctuating?
As Rideout421 mentioned, is it possible to break down this single 19 TB saveset into multiple chunks? This can be done automatically by utilizing the feature called Parallel Save Stream, or PSS. Another way to make this saveset smaller could be to use the compressasm directive to compress the data on the client side before sending the data, but this will use CPU cycles on the client side.
Setting the savegroup 'inactivity timeout' to zero just tells the NetWorker server to wait indefinitely for the client to send data. This would be helpful if the client is slow sending the data to be processed; however, if the data is not being received at all, it will just hang indefinitely, or at least until the connection ports time out and close.
Hope this helps...
Changed the inactivity timeout to 0 and the backup stalled out at about 3 TB. Log files were uneventful, however I did have to reboot the client (Solaris 11) to get the "save" process to exit. (Yes, kill -9 <pid of save> did nothing)
Hi Wallace, thank you for your reply! The test command says the full backup of the save set completed successfully:
save: /blah/blah 1721168 records 560 MB header 20 TB data
save: /blah/blah 20 TB estimated
94694:save: The backup of save set '/blah/blah' succeeded.
Since the client is just a file server, I've reconfigured the client directive to enable compressasm, turned off directory-level checkpoints and enabled parallel save streams.
Backup is going to LTO-5 tape, and was almost always spooling at about 150 MB/sec, which one would expect. Client and server are both on 10 Gbit network with jumbo frames enabled.
After the first 10 minutes of the backup, and since the save command does not appear to be capable of multithreading, the save process consumes 100% of one CPU on the client and the throughput doesn't exceed about 25 MB/sec. Not exactly optimal for a 19 TB save set.
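Some quick shell arithmetic (using the round numbers from this thread) shows why 25 MB/sec is hopeless at this scale:

```shell
# rough wall-clock estimate for a 19 TB saveset at two sustained transfer rates
tb=19
for rate in 150 25; do
    mb=$(( tb * 1024 * 1024 ))      # saveset size in MB
    hours=$(( mb / rate / 3600 ))   # seconds -> whole hours
    echo "${rate} MB/sec -> roughly ${hours} hours"
done
```

At 150 MB/sec the job fits in a weekend (about a day and a half); at 25 MB/sec it needs on the order of nine days, which is consistent with never seeing it complete.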
Retrying again without compressasm enabled, but PSS enabled and checkpoints off.
Also, for the record, I've never actually seen this backup complete successfully.
Normally, if save won't die after kill -9, it means storage subsystem I/O is blocked, which may indicate corruption or a bad block. In that case it is better to start save from the client with debug level 5; you then see the file names as it walks the file system and can see exactly where it hangs.
I also had the feeling that it was an I/O issue. Since the backup is already running, I've started an ugly little one-liner from the shell on the client (Solaris 11):
while true; do
    pfiles <pid_of_save> | grep <directory_name> | tail -1
    sleep 2
done
If/when the save task stalls out, I should be able to tell where the last file was read. Then when the next save process starts, I will watch that one to see where the last file is read. If they're the same, I will definitely need to have a chat with my storage team.
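If each run's loop output is captured to a log file, comparing the last lines is a one-liner. A runnable sketch (the log names and file paths here are made up; in practice the logs would come from the monitoring loop above):

```shell
# stand-in logs: one line per pfiles sample, last line = last file read before the stall
printf '/share/a.dat\n/share/stall.dat\n' > run1.log
printf '/share/b.dat\n/share/stall.dat\n' > run2.log

last1=$(tail -1 run1.log)   # last file seen before run 1 stalled
last2=$(tail -1 run2.log)   # last file seen before run 2 stalled
if [ "$last1" = "$last2" ]; then
    echo "same file both times - talk to the storage team"
else
    echo "different stall points - a single bad file is less likely"
fi
```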
A 19 TB CIFS share on a fileserver? What is the nature of your files - a mix of sizes and types?
If so, I would not be surprised if the data rate drops, especially when you back up small(er) files.
Along with the tape drive's repositioning cycles, the overall throughput can become even worse.
You may measure your read data rate by using "uasm" (see the Performance Optimization Planning Guide for details) to get an idea of how fast you can read the data from disk ... but do not forget that a tape drive has to reposition if the data cannot be delivered fast enough.
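As a rough check, an LTO drive can only slow down so far before it starts stop/start repositioning ("shoe-shining"). The lower bound below is an assumption on my part (often quoted around 45-50 MB/sec native for LTO-5 - verify against your drive's specs), and the measured rate is the 25 MB/sec figure from earlier in this thread:

```shell
# warn if a measured read rate cannot keep an LTO-5 drive streaming
measured=25      # MB/sec, e.g. derived from timing the uasm read test
min_stream=47    # assumed LTO-5 lower speed-matching bound (check your drive's specs)
if [ "$measured" -lt "$min_stream" ]; then
    echo "${measured} MB/sec will make the drive stop and reposition (shoe-shining)"
else
    echo "${measured} MB/sec should keep the drive streaming"
fi
```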
Honestly, you should rethink your setup and consider disk backup. The best option would be a backup to a Data Domain via DD Boost.