2.4K Posts

August 12th, 2013 11:00

Comparing the headline with your issues ...

  - Long before the index disk becomes full, you should see a message like 'file system for client ... is getting full'.

    It appears as soon as your largest CFI can no longer be copied to the same disk, which usually happens long before

    the disk itself is actually full. Just do not ignore the message.

  - Yes, there is now one "nsrindexd ADD" sub-process for each active backup stream.

    Each of them, of course, consumes RAM of its own.

  - Increasing parallelism does not necessarily improve performance. It also depends on a lot of other factors which you

    have to observe and adjust as well. At the very least, more streams need more RAM.

  - The retention time just defines how long the info shall remain in the db - I usually set it to 2 weeks (1 week is the usual

    backup cycle). The reason is that I can then easily compare the last and the previous backup (very easy with NW 8).

  - More frequent purges of a db's obsolete data are better, as each purge task finishes earlier.

    In NW 8, the purge of the jobs db runs once every hour.
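The first point above can be turned into a small check. This is a hypothetical helper, not a NetWorker tool; the layout assumed here (one index directory per client under /nsr/index) is common but should be verified on your own system.

```shell
# Hypothetical sketch: warn when the largest client file index (CFI)
# no longer fits into the free space of the index file system - which
# is roughly the point where NetWorker starts to report that the
# file system for a client is getting full.
check_index_space() {
    index_dir=$1    # e.g. /nsr/index (assumed layout: one dir per client)

    # size in KB of the largest per-client index directory
    largest=$(du -sk "$index_dir"/* 2>/dev/null | sort -n | tail -1 | awk '{print $1}')

    # free KB on the file system holding the indexes
    free=$(df -Pk "$index_dir" | awk 'NR==2 {print $4}')

    if [ "${largest:-0}" -gt "${free:-0}" ]; then
        echo "WARNING: file system for client indexes is getting full"
    else
        echo "OK: largest CFI (${largest:-0} KB) still fits in ${free:-0} KB free"
    fi
}

# Example (path is an assumption, adjust to your layout):
# check_index_space /nsr/index
```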

Meanwhile we are running NW 8.0.1, but as far as I remember, we had no issues with NW 7.6.4.

Before you do anything else, monitor your RAM on server and storage nodes and ensure you have plenty.
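A minimal way to watch what those per-stream sub-processes cost in RAM - assuming they show up as nsrindexd in the process table, which you should confirm on your host:

```shell
# Rough sketch: sum the resident memory (RSS) of all processes with a
# given name, so you can see how much RAM the per-stream nsrindexd
# sub-processes consume in total while a group is running.
sum_rss_kb() {
    # $1 = exact process name, e.g. nsrindexd
    ps -eo rss=,comm= | awk -v name="$1" '$2 == name { total += $1 } END { print total + 0 }'
}

# Example:
# sum_rss_kb nsrindexd   # total resident KB of all nsrindexd processes
```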

38 Posts

August 12th, 2013 18:00

Thanks Bingo for the detailed reply...

However, frequent purging of the jobs db generates a high number of IOPS, doesn't it?

I think there is a tech note in the EMC community which mentions that in NW 8 the jobsdb-purge-related IOPS have decreased by about 79% compared to previous versions - that explains how you can afford to run that purge every hour.

Also, I would like to highlight that the said nsrindexd ADD processes were topping out.

Therefore, I increased server parallelism, but ironically more and more nsrindexd sessions were being established with no corresponding save streams to account for them. Any clue on this?

Besides, there seems to be one group of about 15 clients that stalls the whole system. I have never observed it first hand, but according to the discussions I had, it stalls all the other backups and we need to stop that group so that the other groups can run...

Interestingly, this is the first group that kicks off our backup window...

2.4K Posts

August 12th, 2013 21:00

To give you some numbers: our system creates about 4.8k save sets every day, so each hourly purge deletes the data of about 200 save jobs - with no impact on our backups. (Again, this is NW 8.0.1.)
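The arithmetic behind those numbers: with hourly purges, each run only has to clear one hour's worth of job records.

```shell
# 4.8k save sets per day, purged every hour
save_sets_per_day=4800
purges_per_day=24
echo $((save_sets_per_day / purges_per_day))   # prints 200
```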

"Also, I would like to highlight that the said nsrindexd ADD processes were topping out.

Therefore, I increased server parallelism, but ironically more and more nsrindexd sessions were being established with no corresponding save streams to account for them. Any clue on this?"

What do you expect? Allowing more streams will usually result in more processes. However, I cannot tell why you might have 'orphans'. Once the system is idle, they should all be gone.

With respect to the group: there is the "savegroup parallelism" attribute, which allows you to control how many streams this group can open. This ensures that a group which starts later still finds some streams left.

If this does not help, move clients (one by one) to another group to find out which one is responsible.
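For reference, group attributes like this can typically be inspected and changed with nsradmin. The exact attribute name ("parallelism" on the NSR group resource), the group name "FirstGroup", and the value below are assumptions for illustration - check them against your NetWorker release before changing anything:

```
# nsradmin -s backup_server
nsradmin> . type: NSR group; name: FirstGroup
nsradmin> show parallelism
nsradmin> update parallelism: 8
```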

4 Operator • 14.4K Posts

August 26th, 2013 12:00

I have 40k ssids created per day (so at least that many records in the jobs db, though I suspect more). I would say most of our performance issues were addressed by adding faster disk for /nsr and keeping the jobsdb retention at 8 hours (this is NW 7.6.5.x, but the same applies to earlier versions). The impact of fast disk is simply incredible (I started to use a fast tier from VMAX, while before I used regular FC on DMX).
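A quick-and-dirty way to feel that disk difference yourself - an illustration, not a proper benchmark: small synchronous writes are close to what the index and jobs databases do. This assumes GNU dd (the oflag=dsync flag); the /nsr path in the example is likewise an assumption.

```shell
# Write 256 x 4 KB blocks synchronously into a directory and let dd
# report the achieved rate; a slow /nsr disk shows up immediately here.
sync_write_test() {
    dir=$1
    f="$dir/.latency_probe.$$"
    dd if=/dev/zero of="$f" bs=4k count=256 oflag=dsync
    rm -f "$f"
}

# Example (path is an assumption):
# sync_write_test /nsr
```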
