crazyrov - 4 Operator - 1.3K Posts - September 2nd, 2015 02:00
I am not sure if the jobsdb retention is affecting the "skip run"; I have seen this before, but I don't remember what I did to get rid of the issue. What version of NetWorker are you currently on?
oldhercules - 1 Rookie - 116 Posts - September 2nd, 2015 04:00
I have different versions: 8.0.4.3, 8.0.2.6, 8.1.2.6, 8.2.1.6; the central NMC is 8.1.3.2.
The savegroup logs (under /nsr/logs/sg) are also removed based on this retention value. I really don't understand why a database record retention is tied to logfile retention.
As a workaround, we copy the savegroup logs to another folder with rsync. Are you doing the same?
ble1 - 4 Operator - 14.4K Posts - September 3rd, 2015 04:00
This record is there just to provide a quick overview of the short-term past. It was never designed for longer periods - for those you can refer to the log file, or try to use NMC (I never did; early on it was not really up to the task, but many years have passed since). So whenever I have an audit, I just extract data from the daemon logs, which I keep as long as the longest retention I have. For DB backups I also refer to application-specific logs (yearly audits normally go one year back, so we have no issue with log rotation on the application nodes).
As for the skip part, this is logical. NetWorker kicks off the group using the group schedule, and the group schedule (or level) always overrides the client's. Having nothing on the group means it applies the schedule from the client. If that schedule is skip, then the client is skipped, the task for it is successful, and therefore the overall group status is successful too. If this is a special client in a special group, you can set the schedule at group level - that way NW won't even try to start the group and the old status will remain.
As for monitoring, I use a custom-made script which I call via notifications; if something fails (including "still running" messages) it is sent to the team's central mailbox for follow-up (the same can be texted to your phone if you have such infra). We have used this for years and so far never had any issues (with 8.2.x a false positive is possible when a group with cloning upon ssid completion had nothing to back up; this is rare, a patch exists, and the fix will be part of 8.2.1.8).
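A minimal sketch of such a notification action script (everything here is an assumption: that the event text arrives on the script's stdin, and the mailbox address and failure keywords are placeholders to adapt):

```shell
#!/bin/sh
# Sketch of a notification action (assumptions: the event text is
# piped in on stdin; mailbox and keywords are placeholders).
TEAM_MAILBOX="backup-team@example.com"
MSG=$(cat)    # capture the notification text
# Alert on failures and on groups that appear to hang
if printf '%s\n' "$MSG" | grep -Eqi 'fail|abort|still running'; then
    printf '%s\n' "$MSG" | mailx -s "NetWorker alert" "$TEAM_MAILBOX"
fi
```

Hooked up to a notification resource, this keeps the mail trail around for as long as your mailbox retention allows, independent of the jobsdb.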
oldhercules - 1 Rookie - 116 Posts - September 3rd, 2015 13:00
The logs under /nsr/logs/sg/[savegroup] are much more verbose than what you get in daemon.log. For me this is important for VM backups.
Our vmware farm is quite big (at least to me) and we have had many issues with VADP backups (e.g. uncontrollable NBD limitations, CBT issues, etc.), so being able to process these detailed save logs helps a lot.
On the other issue: don't you find it dangerous that the status of a failed savegroup turns green in NMC?
Here are the screenshots of a test savegroup which was disabled after the failed run.
The first was taken after the backup, the second after the jobsdb entry for that job expired:
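Processing those detailed save logs lends itself to scripting. A rough sketch (the directory layout and the trouble keywords are assumptions based on this thread; adjust them to what your logs actually contain):

```shell
#!/bin/sh
# Sketch: pull VADP-style trouble markers out of the per-savegroup
# logs (path and keywords are assumptions, adjust to your logs).
SG_LOGDIR="${1:-/nsr/logs/sg}"
grep -Rli -E 'NBD|CBT|snapshot' "$SG_LOGDIR" 2>/dev/null |
while read -r logfile; do
    printf '=== %s ===\n' "$logfile"
    # show the last few matching lines per log
    grep -Ei 'NBD|CBT|snapshot' "$logfile" | tail -n 5
done
```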
ble1 - 4 Operator - 14.4K Posts - September 3rd, 2015 15:00
I'm not bothered since I have backups running every day, but I can understand your point - though only if it applies to a schedule set on group level. Otherwise it works as designed. I agree the design could be better; for example:
- The last status of the group should be taken from the NSR_group resource. Currently that is not the case, because NSR_group holds configuration parameters, and the result of a run is not a configuration parameter. This is why I doubt it will ever be changed (unless some major redesign is made).
- Entries in /nsr/logs/sg should be subject to retention AND a number of last runs. For example, a setting of 72h and 4 runs would keep data for 72 hours or the 4 last runs, whichever is more paranoid (daily runs would give an effective retention of 96 hours, hourly backups 72 hours, and monthly runs would keep data for 4 months).
I'm not sure what to say about your example, as I've never found myself in that situation - I'm allergic to red group status, so if something fails I fix it and restart it. If I don't care, I just leave it and it goes green on the next run. All my groups run daily, so I never saw red turn green, but again it is logical from the workflow's point of view: group status is derived from the statuses of the jobs belonging to the group, and once those expire they are gone. It's similar to stopping a bunch of running groups, stopping NW, removing the jobsdb and starting NW - it will show everything green.
Now, if the data in /nsr/tmp/sg/* shows more than any other log available on the server, then I understand why you want to preserve it in case of an issue - but for the majority 72h is still more than enough. I was bothered once in the past on 7.x, where I had retention set to 6 hours, because back then the database used for the jobsdb was quite different and far worse in performance than nowadays - but even then, if backups failed, I would just manually restart from the CLI whatever had individually failed; my failure rate is low, so it didn't bother me much. Yes, I saw it then and it was a bit of a pain: if I was busy with something else, I couldn't just restart the group but had to take some extra manual steps. Today this runs on SQLite, I think, so it has gotten better; 72h is what I have adopted since, and it works just fine. I can't see why I would need records for longer.
I don't protect VMs using VADP (I still use agents, as that works just fine and we have no performance issues with it) nor any other fancy acronym which seems to change with each new VMware version (yet) - but I assume if I did, and wanted detailed data in case of failure, I would want it in a fixed log. So if the entries in /nsr/logs/sg/ are the only place you can find it, I agree it is wrong for them to be scraped away just because of the retention on the jobsdb. However, I suspect these are just parsed into NW and most likely also live somewhere on the proxy or the vmware side (similarly, for SQL, /nsr/tmp/sg/ contains a whole or cut-off version of a log that actually lives on the client). Perhaps you could check with EMC (or other folks here) where this same data can be obtained apart from the job details in the jobsdb.
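A retention policy of "keep at least the last N runs, and anything newer than X hours - whichever rule keeps more" could be approximated today with a small cleanup script run from cron. This is only a sketch; the directory and the numbers are placeholders, and filenames with spaces are not handled:

```shell
#!/bin/sh
# Sketch: keep a run log if it is newer than $HOURS *or* among the
# $KEEP_RUNS most recent files, i.e. whichever rule keeps more.
# DIR is a placeholder path.
DIR="${1:-/nsr/logs/sg/mygroup}"
HOURS=72
KEEP_RUNS=4
# Everything beyond the $KEEP_RUNS newest files is a deletion
# candidate; of those, delete only files older than $HOURS.
ls -1t "$DIR" 2>/dev/null | tail -n +$((KEEP_RUNS + 1)) |
while read -r f; do
    find "$DIR/$f" -maxdepth 0 -type f -mmin +$((HOURS * 60)) \
         -exec rm -f {} \;
done
```

With daily runs this effectively keeps 4 days of logs; with hourly runs it keeps 72 hours.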
oldhercules - 1 Rookie - 116 Posts - September 3rd, 2015 22:00
I see your points about how NMC works and why, but I still think it's a wrong design: it is _The Management Console_, it should tell me the truth all the time. Yes, there are many ways to get information about the backups, generate reports and so on, but with a GUI where you can also take actions, it's much simpler and more efficient to manage the system.
I think I'll try the BRM appliance again to see if it works better with the current NW versions.
I also always try to fix all failed backups. This error came up when a VADP backup failed because the VM had been left in the 'vmtools installation in progress' state for a longer time. When I got word from the VM guys that it was OK again, I wanted to re-run the failed savegroup - and was surprised that there were no failed VM savegroups anymore...
(Btw, it seems we will also leave the VADP backups soon; vSphere VDP is the same as EBR/VBA (except that it is not flooding the vCenter SQL & logs), and we are testing it these days.
I don't see any benefit in the EBR + NW integration - those backups live their own life, so if it works completely without NW, I'm fine with it.)
ble1 - 4 Operator - 14.4K Posts - September 4th, 2015 03:00
I run BRM as well, but I only use it for DD space monitoring. I think it is not so reliable for NW reporting, but that could be me. I do plan to test DPA very soon; in theory, DPA should provide a rather nice overview. I used DPA in the past (when it was still called EBA), but in those times it was heavy on NW with its queries, and the then-new jobsdb didn't like it either. I have seen that the latest version should address the performance impact of the queries with NW 8.2.x, and since I'm moving the server to that version, I plan to test it against it.
avmaint - 1 Rookie - 115 Posts - November 30th, 2015 22:00
Somehow the SS* files are not being created in the /nsr/tmp/sg directory.
In the NMC properties it is set to 72 hours, which means the logs should be retained for 6 days, but no log files are seen.
bingo.1 - 2.4K Posts - November 30th, 2015 23:00
The managed entries are located in /nsr/logs/sg/.
And just to correct a wrong expectation: 72 hours = 3 days.
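To check whether anything under that directory actually falls inside the 72 h (3 day) window, something like this would do (a sketch; the path comes from the post above, adjust as needed):

```shell
# List files under the managed directory modified within the last
# 3 days (72 h); prints nothing if the window is empty.
find /nsr/logs/sg -type f -mtime -3 -print 2>/dev/null | head -n 20
```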
bingo.1 - 2.4K Posts - December 1st, 2015 05:00
I set the retention period to one month (720 hrs).
Yes, the db becomes quite huge (about 750 MB in our case), but it is so convenient to be able to easily compare a group or a certain save set with an earlier entry.
Does this have an impact on the backup server? IMHO, not at all. NW only works with the current 'set'; a potential impact could come when it deletes the expired entries, but those are only about 3,100 files/day.
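As a back-of-the-envelope check on those figures (720 h of retention and ~3,100 purged records per day against a ~750 MB jobsdb):

```shell
# Rough arithmetic only - the inputs are the numbers quoted above.
records=$((3100 * 30))                   # ~30 days of records retained
kb_per_record=$((750 * 1024 / records))  # integer division
echo "records=$records kb_per_record=$kb_per_record"
```

That works out to roughly 93,000 records at about 8 KB each, which fits the "no real impact" observation.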
lalexis - 2 Intern - 253 Posts - December 1st, 2015 05:00
I too use notifications and send emails to a central mailbox.
I also have scripts for bootstrap reports.
I think using notifications to send email is the easiest way to keep the data for as long as you could want.
ble1 - 4 Operator - 14.4K Posts - December 1st, 2015 07:00
Depending on how it is deleted, it might impact CPU... for example, I'm not sure how smart NW is when going through the records, and I haven't checked how this works in current versions. How long does your typical full purge take?
bingo.1 - 2.4K Posts - December 2nd, 2015 06:00
I'll probably disappoint you - the maximum time it needs is 3 s.
But have a look at yesterday's sessions yourself:
82327 01.12.2015 03:47:26 1 9 0 10972 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 03:47:29 1 9 0 10972 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 3 sec. Records purged: 1118
82327 01.12.2015 04:47:29 1 9 0 12728 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 04:47:31 1 9 0 12728 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 942
82327 01.12.2015 05:47:34 1 9 0 14988 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 05:47:37 1 9 0 14988 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 3 sec. Records purged: 1020
82327 01.12.2015 06:47:36 1 9 0 12892 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 06:47:38 1 9 0 12892 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 733
82327 01.12.2015 07:47:40 1 9 0 12972 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 07:47:42 1 9 0 12972 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 675
82327 01.12.2015 08:47:45 1 9 0 6948 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 08:47:46 1 9 0 6948 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 332
82327 01.12.2015 09:47:46 1 9 0 13896 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 09:47:47 1 9 0 13896 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 247
82327 01.12.2015 10:47:48 1 9 0 12004 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 10:47:49 1 9 0 12004 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 283
82327 01.12.2015 11:47:50 1 9 0 7560 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 11:47:52 1 9 0 7560 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 180
82327 01.12.2015 12:47:52 1 9 0 13368 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 12:47:53 1 9 0 13368 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 618
82327 01.12.2015 13:47:53 1 9 0 6340 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 13:47:53 1 9 0 6340 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 190
82327 01.12.2015 14:47:57 1 9 0 9672 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 14:47:58 1 9 0 9672 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 882
82327 01.12.2015 15:47:58 1 9 0 14268 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 15:47:59 1 9 0 14268 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 194
82327 01.12.2015 16:48:02 1 9 0 10688 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 16:48:03 1 9 0 10688 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 237
82327 01.12.2015 17:48:04 1 9 0 7124 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 17:48:04 1 9 0 7124 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 171
82327 01.12.2015 18:48:09 1 9 0 4340 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 18:48:11 1 9 0 4340 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 553
82327 01.12.2015 19:48:14 1 9 0 5240 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 19:48:16 1 9 0 5240 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 1122
82327 01.12.2015 20:48:16 1 9 0 4460 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 20:48:18 1 9 0 4460 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 925
82327 01.12.2015 21:48:21 1 9 0 9328 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 21:48:23 1 9 0 9328 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 792
82327 01.12.2015 22:48:26 1 9 0 11968 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 22:48:27 1 9 0 11968 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 284
82327 01.12.2015 23:48:30 1 9 0 2628 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 23:48:32 1 9 0 2628 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 691
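Those purge messages can be summarized straight from the daemon log with awk; the field positions below match the nsrjobd lines above, and the log path is the usual default (adjust if yours differs):

```shell
# Sum up purge durations and record counts from nsrjobd's
# "Completed full database purge" messages in the daemon log.
awk '/Completed full database purge/ {
        # "... purge in 0 min 3 sec. Records purged: 1118"
        for (i = 2; i <= NF; i++) {
            if ($i == "min")     secs += $(i-1) * 60 + $(i+1)
            if ($i == "purged:") recs += $(i+1)
        }
        n++
     }
     END { printf "purges=%d records=%d total_secs=%d\n", n, recs, secs }
    ' /nsr/logs/daemon.log
```

Run against the excerpt above, it would show that a full day of hourly purges costs well under a minute of nsrjobd time in total.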
ble1 - 4 Operator - 14.4K Posts - December 2nd, 2015 13:00
Actually, I'm happy to see that now (most likely a change in the DB model helped there too).