
April 17th, 2017 11:00

NetWorker save set size limit?

Hi Everyone!

I'm new to the forum, and new-ish to NetWorker. Just wondering if anyone else has run into an issue where they are trying to back up a 19 TB save set (using NetWorker 8.2.4.3, build 1356) and it seems to stall out around the 15 TB mark and just... start over? Seems odd to me that a 19 TB save set would spool 45+ TB to tape over a weekend and never actually complete.

Is this a known issue? I've searched the forums and can't really find anything to direct me to a resolution.

Thanks,

-Tim

31 Posts

April 17th, 2017 12:00

I've used NetWorker at a previous company, where I was troubleshooting and remediating errors on a daily basis. EMC support actually lined up a dedicated resource to help with configurations. There are so many settings that it takes time to learn what they all do. Reach out to support and take notes as you're working through issues. As for your question:

19 TB is a pretty good-sized backup to run all at once. Is there an opportunity to split it up into smaller chunks? My largest backup was around 7 TB. I wish I had more helpful information.

April 17th, 2017 14:00

Thanks for the feedback! Unfortunately it's the backing storage on a single CIFS file share, and splitting it out is not (currently) an option for me. 19 TB is a beast of a saveset for sure, but it almost seems to me like there's some sort of bug here. I will definitely seek support from EMC if I find that nobody else has seen this.

2.4K Posts

April 17th, 2017 15:00

AFAIK, there is no limit with respect to the size of a save set. However, there is one major timing issue:

The group is supposed to be started once a day at the same time. This may cause the issue after 24 hours.

The 'inactivity timeout' should be adjusted as follows:

  - Switch on "View > Diagnostic Mode"

  - Open the savegroup's Properties - Advanced tab

  - Set the 'Inactivity timeout' to zero

Make sure that the client is the only member of this group.

Start the group.
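If you prefer the command line, the same hidden attribute can be changed with nsradmin. This is only a rough sketch; the server and group names are placeholders, and it assumes the attribute carries the same name as the diagnostic-mode label:

    # run against the NetWorker server (server/group names are placeholders)
    nsradmin -s nsrserv01
    . type: NSR group; name: BigCIFS
    update inactivity timeout: 0
    print
    quit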

263 Posts

April 17th, 2017 17:00

Without knowing the circumstances of what happens when it seems to stall out around the 15 TB mark, it would be difficult to say what the cause is.

The first thing I would look into is whether it fails in the same way every time. Is the elapsed time consistent? The amount of data backed up? Maybe it fails at exactly the same "spot" in the saveset being backed up? Does it ever complete? Did it ever have a successful backup? Is there anything in the NetWorker or OS logs on the server, client, and storage node around the time of failure that might give you a clue?

To test walking through the saveset, on the client, you can use the following:

  save -s (NetWorker server) -b (NetWorker pool) -v -n (saveset name) >list.txt 2>&1

This save command will process the 19 TB saveset, except that "-n" will cause save not to actually send any data. This results in the save command just walking through the saveset. If it succeeds, it tells us that there are not likely any issues with reading the data, and it also tells us how long it takes NetWorker just to read through the data.
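For example, wrapped in time so you can see how long the walk alone takes (server, pool, and path names here are placeholders):

    # hypothetical names; -n walks the save set without sending any data
    time save -s nsrserv01 -b Default -v -n /export/bigshare > /tmp/walktest.log 2>&1
    tail -5 /tmp/walktest.log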

I would also look at networking, and how the client, server, storage node, and storage devices are connected to each other.

What type of backup device are you writing to?  Have you observed what the write speed is when this is backing up?  Is this speed consistent or fluctuating?

As Rideout421 mentioned, is it possible to break down this single 19 TB saveset into multiple chunks? This can be done automatically by using the feature called Parallel Save Streams, or PSS. Another way to make this saveset smaller could be to use the compressasm directive to compress the data on the client side before sending it, but this will use CPU cycles on the client.
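For reference, client-side compression is normally switched on through a directive. A minimal sketch, assuming the data lives under /export/bigshare (a placeholder path) and that the directive is applied via a client-side .nsr file or a server-side directive resource:

    # directive contents (path is a placeholder); the leading '+' makes the ASM recursive
    << /export/bigshare >>
    +compressasm: *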

Setting the savegroup 'inactivity timeout' to zero just tells the NetWorker server to wait indefinitely for the client to send data. This would be helpful if the client is slow sending the data to be processed; however, if the data is not being received at all, it will just hang indefinitely, or at least until the connection ports time out and close.


Hope this helps...



April 18th, 2017 10:00

Changed the inactivity timeout to 0 and the backup stalled out at about 3 TB. Log files were uneventful; however, I did have to reboot the client (Solaris 11) to get the "save" process to exit. (Yes, kill -9 did nothing.)

April 18th, 2017 12:00

Hi Wallace, thank you for your reply! The test command says the full backup of the save set completed successfully:

     32477:save:

     save: /blah/blah 1721168 records 560 MB header 20 TB data

     save: /blah/blah 20 TB estimated

     94694:save: The backup of save set '/blah/blah' succeeded.

Since the client is just a file server, I've reconfigured the client directive to enable compressasm, turned off directory-level checkpoints and enabled parallel save streams.

Backup is going to LTO-5 tape, and was almost always spooling at about 150 MB/sec, which is what one would expect. Client and server are both on a 10 Gbit network with jumbo frames enabled.

After the first 10 minutes of the backup running, and since the save command does not appear to be capable of multithreading, the save process consumes 100% of one CPU on the client and the throughput doesn't exceed about 25 MB/sec. Not exactly optimal for a 19 TB save set.

Retrying again without compressasm enabled, but PSS enabled and checkpoints off.

Also, for the record, I've never actually seen this backup complete successfully.

14.3K Posts

April 19th, 2017 06:00

Normally, if save won't die after kill -9, it means you have storage subsystem I/O blocked, which may indicate corruption or a bad block. In that case it is better to start save from the client with debug level 5; you get to see file names as it walks the file system, so you can see exactly where it hangs.
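A rough sketch of what that might look like from the client, assuming the -D flag for the debug level (server, pool, and path names are placeholders):

    # capture verbose debug output to a file (names are placeholders)
    save -D5 -s nsrserv01 -b Default -v /export/bigshare > /tmp/save_debug.log 2>&1 &
    tail -f /tmp/save_debug.log    # watch where the file-system walk stops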

April 19th, 2017 09:00

I also had the feeling that it was an I/O issue.  Since the backup is already running, I've started an ugly little one-liner from the shell on the client (Solaris 11):

while true; do
        # show the last file descriptor the running save process has open
        # (the pgrep lookup is reconstructed; the original pfiles/grep arguments were not shown)
        pfiles $(pgrep -x save | head -1) | tail -1
        sleep 2
        clear
done

If/when the save task stalls out, I should be able to tell where the last file was read. Then when the next save process starts, I will watch that one to see where the last file is read. If they're the same, I will definitely need to have a chat with my storage team.

April 19th, 2017 10:00

The client that is being backed up is exporting SAN-backed storage as CIFS.  I'm backing up the SAN-backed storage. I'd love to do NDMP, but I'm not sure if the backing storage supports it.

Either way, the backup to LTO-5 tape is averaging 120 megabytes per second over the course of a 22.25-hour period, so I'm not sure that speed is really the issue here.

2.4K Posts

April 19th, 2017 10:00

A 19 TB CIFS share on a fileserver? What is the nature of your files? A mix of large and small?

If so, I would not be surprised if the data rate drops, especially when you back up small(er) files.

Along with the tape's repositioning cycles, the overall throughput can become even worse.

You can measure your read data rate by using "uasm" (see the Performance Optimization Planning Guide for details) to get an idea of how fast you can read the data from disk... but do not forget that a tape drive has to reposition if the data cannot be delivered fast enough.
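One quick way to do that, with the share path below as a placeholder (uasm ships with the NetWorker client), is to time a raw read and discard the output:

    # reads the directory tree the way save would, without writing to any device
    time uasm -s /export/bigshare > /dev/null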

Honestly, you should rethink your setup and consider disk backup. The best option would be a backup to a Data Domain via DD Boost.

263 Posts

April 19th, 2017 13:00

Please download and review the following document.

     NetWorker 8.2 SP1 Performance Optimization Planning Guide

     https://support.emc.com/docu57697_NetWorker_8.2_SP1_Performance_Optimization_Planning_Guide.pdf?language=en_US

The save process is not multithreaded. To force backups to use multiple streams, you can break down the saveset into smaller saveset chunks, or use parallel save streams. However, if you still only see one stream, it could be because the client is unable to create multiple streams, or the server is not accepting them. Please look at pages 28-32, "Parallel save stream considerations". Make sure that the client parallelism is set >1 to allow multi-streaming (default=4), and that the server parallelism is big enough too.
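If you want to double-check both settings without the GUI, an nsradmin session along these lines should work (server and client names are placeholders, and the exact attribute name for PSS can vary slightly between releases):

    nsradmin -s nsrserv01
    . type: NSR client; name: fileserver01
    show parallelism; parallel save streams per save set
    print
    quit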

Regarding jumbo frames (Network optimization, pages 59-68): "It is recommended to use jumbo frames in environments capable of handling them. If both the source and target computers, and all equipment in the data path, are capable of handling jumbo frames, increase the MTU to 9 KB (default=1500 bytes)." If you are unsure, it is better to disable jumbo frames.
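On the Solaris 11 client you can confirm the MTU actually in effect with dladm (the interface name net0 is a placeholder):

    # shows the current, default, and possible MTU values for the link
    dladm show-linkprop -p mtu net0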

Also look at the document for information on how to tune your networking environment.

The LTO-5 writing at 150 MB/sec is a pretty good rate.

I had mentioned that using compressasm would use up CPU cycles, so save consuming 100% of a CPU is not surprising. However, if there were other processes needing the CPU, this number would be shared, so I am not worried about the 100% unless other applications are suffering a performance hit.

I would still try to use PSS. That is your best bet to force multistream backups for this saveset.  This would also allow more than one tape drive to be used, which would also reduce the backup time.

April 19th, 2017 13:00

Hi timbococoagrahams,

two additional comments:

  1. Is there a firewall between the NetWorker platform and, e.g., the backup device and/or the NetWorker client? In the past we had an issue where the storage node was saving a huge file; during the backup there was no traffic between the client and the NetWorker server, and the firewall terminated the communication between them.
  2. Maybe the "Performance Optimization Planning Guide" could provide additional hints for investigation, e.g. the bigasm directive (a rough usage sketch follows below):

       << / >>
       bigasm -S1GB: testfile

     This allows you to back up a save set /testfile with a size of 1 GB (I hope the syntax is correct). The client creates a stream of bytes in memory and saves them to the target device, which eliminates disk access. You could specify larger or several files to reach the 19 TB size.
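A rough sketch of how this could be driven, assuming save's -f option for pointing at a directives file (file names are placeholders, and the bigasm syntax should be verified against the guide):

    # hypothetical: generate a 1 GB in-memory stream and push it through the normal save path
    touch /tmp/testfile
    printf '<< /tmp >>\nbigasm -S1GB: testfile\n' > /tmp/bigasm.dir
    save -s nsrserv01 -b Default -f /tmp/bigasm.dir /tmp/testfile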

Regards

Michael

April 24th, 2017 09:00

Hi Everyone!  Thank you all for your feedback. I'm reluctant to just "give up", but I'm chalking this up to backing storage/SAN issues. I cannot for the life of me find any other reason why this backup would fail.

FWIW - the save process from the last attempt at a full backup is still waiting on disk I/O, five days later.
