/nsr disk 100% busy during NDMP backups?

Question

We have a huge Celerra environment that we're backing up via NDMP: 6 active datamovers, 4 streams apiece, with 4 drives zoned to each datamover. Since we moved from our old Sun box onto our Linux server, the disk supporting the /nsr filesystem seems to be always 100% busy. Writes get cached in memory, we eventually run out of memory, and the load averages on the box climb above 20.

Server is RHEL 5.5, Networker 7.5

Server is 8x2Ghz cores, 48GB memory

Disk is fibre-attached to a Clariion: a single 2 TB LUN, a meta striped across 4 SATA RAID groups.

We've been through a couple rounds with support on this and can't get any real answers as to why we're having this issue. We had the same backend disk config on the Sun side with no issues.

I'm going to rack a DAE of 300 GB FC disks, make a single RAID 1/0 RAID group out of them, and migrate the LUN onto it to see if that has an impact. Any other suggestions are welcome, though. Should I break the /nsr subdirectories (such as /nsr/tmp) out onto their own LUNs?

Kurt

KTelep · Answer

It seems like /nsr/tmp is getting the most punishment.

wlee · Answer

Sorry, I need more information. /nsr has several subfolders, and it would be key to know whether a specific folder is responsible. There is a default working folder; if it is at its maximum, that may cause this situation.

MikeToo1 · Answer

Do you know the size of your JOBSDB (/nsr/res/jobsdb)? The default configuration is 7 days and 40 MB. Just checking whether this is a source of your issue. How much data is in /nsr/res/jobsdb? JOBSDB tuning guide: http://solutions.emc.com/emcsolutionview.asp?id=esg93654. Another question: are you using DPA, and which DPA services run locally?

yzabary · Answer

We had the same problems here with NDMP backups crashing (that was with NetApp, but I guess Celerra is the same). You really need a very big /nsr file system. Previously, we had ~5 GB of available space on /nsr, and I once noticed that we ran out of space. Then I increased it to a 300 GB filesystem (currently over 150 GB available) and all of the NDMP crash problems are gone.

I suggest that after you increase the size of this file system, you set up a script that monitors the space usage on it DURING the backups, so that you can understand your exact requirements.

Also, it is very important to make sure that you get decent performance on this filesystem (it has many critical directories, such as jobsdb and index, which really need the performance). It would be a big mistake to place it on a disk/LUN that also hosts AFTD devices. I once had bootstrap backups running for a few hours until I realized that this was because /nsr/res/jobsdb was on the same filesystem as an AFTD.
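A monitoring script along those lines could be as small as the sketch below. It is only an illustration, not anything shipped with NetWorker: the function names, the sampling interval, and the choice of Python 3 (which a stock RHEL 5 box may not have) are all assumptions.

```python
import os
import time


def usage_snapshot(path):
    """Return (used_bytes, free_bytes) for the filesystem that holds path."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize  # space available to non-root users
    return total - free, free


def monitor(path, interval_s, samples, sleep=time.sleep):
    """Print one usage line per sample and return the peak used bytes seen."""
    peak = 0
    for i in range(samples):
        used, free = usage_snapshot(path)
        peak = max(peak, used)
        print("%s used=%.1f GiB free=%.1f GiB"
              % (time.strftime("%H:%M:%S"), used / 2.0 ** 30, free / 2.0 ** 30))
        if i < samples - 1:
            sleep(interval_s)
    return peak

# Hypothetical usage: sample /nsr once a minute for 8 hours of backup window.
# peak = monitor("/nsr", 60, 8 * 60)
```

Comparing the peak against the usage before the backup window gives a rough idea of how much working space the NDMP jobs need.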

wlee · Answer

When you back up a filer, like a Celerra, using NDMP, NetWorker can keep track of what files were backed up by updating the client file index.

With traditional, non-NDMP backups, the client file index is updated directly while the file is backed up. However, with NDMP backups, this is not possible because of the way NDMP backups work. Instead, NDMP backups create a temporary folder called /nsr/tmp/FileIndex..., and it is here that the file and directory names that are backed up are cataloged. Once the NDMP backups are completed, NetWorker runs a program to commit this information into the client file index, and then DELETES this temp folder /nsr/tmp/FileIndex...

So, the size of this temp folder is directly related to the number of files that were backed up during the NDMP backup.

In addition, if the NDMP DATA backups completed BUT for some reason NetWorker was not able to commit the index information, then this temp folder can be left behind in the /nsr/tmp directory. At this point, there is no way for a user to tell NetWorker to re-process the index information, and as a result these FileIndex temp folders just end up taking up space. If NetWorker is shut down, you can delete these FileIndex subdirectories.

(Please contact tech support for confirmation on this. I haven't used NDMP in a few years now, but I believe the behaviour is still the same.)
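To see whether leftover FileIndex folders are eating space, a read-only report like the sketch below may help. It deletes nothing. The `FileIndex` prefix comes from the description above; the function names and the age threshold are assumptions, and a directory's mtime only changes when entries are added to or removed from it, so treat the age as a rough hint rather than proof that a folder is abandoned.

```python
import os
import time


def dir_size(path):
    """Sum the sizes of all files below path, skipping files that vanish."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file removed while we were walking
    return total


def stale_fileindex_dirs(tmp_dir, min_age_s, now=None):
    """List (path, size_bytes, age_s) for FileIndex* dirs older than min_age_s."""
    now = time.time() if now is None else now
    found = []
    for entry in sorted(os.listdir(tmp_dir)):
        full = os.path.join(tmp_dir, entry)
        if entry.startswith("FileIndex") and os.path.isdir(full):
            age = now - os.stat(full).st_mtime
            if age >= min_age_s:
                found.append((full, dir_size(full), age))
    return found

# Hypothetical usage: report folders untouched for more than two days.
# for path, size, age in stale_fileindex_dirs("/nsr/tmp", 2 * 86400):
#     print("%s  %.1f MiB  %.1f days" % (path, size / 2.0 ** 20, age / 86400.0))
```

Anything the report flags should still be checked against running groups (and ideally with support) before it is removed with NetWorker shut down.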

In any case, when NetWorker is backing up using NDMP, it can definitely perform a lot of I/O against the /nsr/tmp directory. Because of this, you may want to rethink what the file system sits on, preferably something that favors heavy write operations, such as RAID-0 or RAID-3. If the file system uses journaling, you may want to disable it to improve write performance. Contact your OS support for assistance with this.

yzabary · Answer

As a side note, it is simply amazing that the NetWorker startup/shutdown script still doesn't clean /nsr/tmp and /nsr/res/jobsdb. Doing so would help on the occasions when files left in these directories cause a failure.

wlee · Answer

Why would you want the startup or shutdown scripts to delete /nsr/tmp and /nsr/res/jobsdb??

If NetWorker wasn't shut down cleanly while a savegroup was running, then during startup NetWorker would realize this, because the jobsdb would still show the group as running, and it would perform the necessary cleanup.

Similarly, by deleting /nsr/tmp and /nsr/res/jobsdb, you lose any chance of restarting any groups.

yzabary · Answer

The reason is that the information left in these two directories can confuse NetWorker. I never had much success with NetWorker's group restart feature. In my opinion, trying to figure out what happened before the restart is unreliable. On the other hand, I would expect NetWorker to be able to check the indexes against the schedules and re-schedule the missed backups (that is usually an issue with full backups). Therefore, my opinion is that the files kept in /nsr/tmp should be deleted between NetWorker restarts, and that the group restart feature should be implemented in a more reliable and consistent way.

ble1 · Answer

Nothing stops you from adding those delete actions to your own startup script. By design, though, NetWorker will never do this itself, and as explained before, there is a good reason for that.

wlee · Answer

There are other EMC applications that use /nsr/tmp. By deleting this directory, you can affect their functionality.

For example, I have been told that EMC Data Protection Advisor gathers data from /nsr/tmp for reporting purposes.
