crazyrov - 4 Operator - 1.3K Posts - September 2nd, 2015 02:00
I am not sure if the jobsdb retention is affecting the "skip run"; I have seen this before, but I don't remember what I did to get rid of the issue. What version of NetWorker are you currently on?
oldhercules - 1 Rookie - 116 Posts - September 2nd, 2015 04:00
I have different versions: 8.0.4.3, 8.0.2.6, 8.1.2.6, 8.2.1.6; the central NMC is 8.1.3.2.
The savegroup logs (under /nsr/logs/sg) are also removed based on this retention value. I really don't understand why a database record retention is tied to logfile retention.
As a workaround, we copy the savegroup logs to another folder with rsync. Are you doing the same?
ble1 - 4 Operator - 14.4K Posts - September 3rd, 2015 04:00
This record is there just to provide a quick overview of the short-term past. It was never designed for longer periods - for those you can refer to the log file, or try to use NMC (I never did; early on it was not really up to the task, but many years have passed since). So whenever I have an audit, I just extract data from the daemon logs, which I keep as long as the longest retention I have. For DB backups I also refer to application-specific logs (yearly audits normally go one year back, so we have no issue with log rotation on the application nodes).
As for the skip part, this is logical. NetWorker kicks off the group using the group schedule, and the group schedule (or level) always overrides the client's. Having nothing on the group means it applies the schedule from the client. If that schedule is skip, then the client is skipped, the task for it is successful, and therefore the overall group status is successful too. If this is a special client in a special group, you can set the schedule at group level - that way NW won't even try to start the group and the old status will remain.
As for monitoring, I use a custom-made script which I call via notifications; if something fails (including "still running" messages) it is sent to the team's central mailbox for follow-up (the same can be texted to your phone if you have such infra). We have used this for years and so far never had any issues (with 8.2.x a false positive is possible when a group with cloning upon ssid completion had nothing to back up; this is rare, a patch exists, and the fix will be part of 8.2.1.8).
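A minimal sketch of such a notification action script (everything here is an assumption: that the event text arrives on the script's stdin, and the mailbox address and failure keywords are placeholders to adapt):

```shell
#!/bin/sh
# Sketch of a notification action (assumptions: the event text is
# piped in on stdin; mailbox and keywords are placeholders).
TEAM_MAILBOX="backup-team@example.com"
MSG=$(cat)    # capture the notification text
# Alert on failures and on groups that appear to hang
if printf '%s\n' "$MSG" | grep -Eqi 'fail|abort|still running'; then
    printf '%s\n' "$MSG" | mailx -s "NetWorker alert" "$TEAM_MAILBOX"
fi
```

Hooked up to a notification resource, this keeps the mail trail around for as long as your mailbox retention allows, independent of the jobsdb.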
oldhercules - 1 Rookie - 116 Posts - September 3rd, 2015 13:00
The logs under /nsr/logs/sg/[savegroup] are much more verbose than what you get in daemon.log. For me this is important for VM backups.
Our vmware farm is quite big (at least to me) and we have had many issues with VADP backups (e.g. uncontrollable NBD limitations, CBT issues, etc.), so being able to process these detailed save logs helps a lot.
On the other issue: don't you find it dangerous that the status of a failed savegroup turns green in NMC?
Here are the screenshots of a test savegroup which was disabled after the failed run.
The first was taken after the backup, the second after the jobsdb entry for that job expired:
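Processing those detailed save logs lends itself to scripting. A rough sketch (the directory layout and the trouble keywords are assumptions based on this thread; adjust them to what your logs actually contain):

```shell
#!/bin/sh
# Sketch: pull VADP-style trouble markers out of the per-savegroup
# logs (path and keywords are assumptions, adjust to your logs).
SG_LOGDIR="${1:-/nsr/logs/sg}"
grep -Rli -E 'NBD|CBT|snapshot' "$SG_LOGDIR" 2>/dev/null |
while read -r logfile; do
    printf '=== %s ===\n' "$logfile"
    # show the last few matching lines per log
    grep -Ei 'NBD|CBT|snapshot' "$logfile" | tail -n 5
done
```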
ble1 - 4 Operator - 14.4K Posts - September 3rd, 2015 15:00
I'm not bothered since I have backups running every day, but I can understand your point - though only if it applies to a schedule set on group level. Otherwise it works as designed. I agree the design could be better; for example:
- The last status of the group should be taken from the NSR_group resource. Currently that is not the case, because NSR_group holds configuration parameters, and the result of a run is not a configuration parameter. This is why I doubt it will ever be changed (unless some major redesign is made).
- Entries in /nsr/logs/sg should be subject to retention AND a number of last runs. For example, a setting of 72h and 4 runs would keep data for 72 hours or the 4 last runs, whichever is more paranoid (daily runs would give an effective retention of 96 hours, hourly backups 72 hours, and monthly runs would keep data for 4 months).
I'm not sure what to say about your example, as I've never found myself in that situation - I'm allergic to red group status, so if something fails I fix it and restart it. If I don't care, I just leave it and it goes green on the next run. All my groups run daily, so I never saw red turn green, but again it is logical from the workflow's point of view: group status is derived from the statuses of the jobs belonging to the group, and once those expire they are gone. It's similar to stopping a bunch of running groups, stopping NW, removing the jobsdb and starting NW - it will show everything green.
Now, if the data in /nsr/tmp/sg/* shows more than any other log available on the server, then I understand why you want to preserve it in case of an issue - but for the majority 72h is still more than enough. I was bothered once in the past on 7.x, where I had retention set to 6 hours, because back then the database used for the jobsdb was quite different and far worse in performance than nowadays - but even then, if backups failed, I would just manually restart from the CLI whatever had individually failed; my failure rate is low, so it didn't bother me much. Yes, I saw it then and it was a bit of a pain: if I was busy with something else, I couldn't just restart the group but had to take some extra manual steps. Today this runs on SQLite, I think, so it has gotten better; 72h is what I have adopted since, and it works just fine. I can't see why I would need records for longer.
I don't protect VMs using VADP (I still use agents, as that works just fine and we have no performance issues with it) nor any other fancy acronym which seems to change with each new VMware version (yet) - but I assume if I did, and wanted detailed data in case of failure, I would want it in a fixed log. So if the entries in /nsr/logs/sg/ are the only place you can find it, I agree it is wrong for them to be scraped away just because of the retention on the jobsdb. However, I suspect these are just parsed into NW and most likely also live somewhere on the proxy or the vmware side (similarly, for SQL, /nsr/tmp/sg/ contains a whole or cut-off version of a log that actually lives on the client). Perhaps you could check with EMC (or other folks here) where this same data can be obtained apart from the job details in the jobsdb.
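A retention policy of "keep at least the last N runs, and anything newer than X hours - whichever rule keeps more" could be approximated today with a small cleanup script run from cron. This is only a sketch; the directory and the numbers are placeholders, and filenames with spaces are not handled:

```shell
#!/bin/sh
# Sketch: keep a run log if it is newer than $HOURS *or* among the
# $KEEP_RUNS most recent files, i.e. whichever rule keeps more.
# DIR is a placeholder path.
DIR="${1:-/nsr/logs/sg/mygroup}"
HOURS=72
KEEP_RUNS=4
# Everything beyond the $KEEP_RUNS newest files is a deletion
# candidate; of those, delete only files older than $HOURS.
ls -1t "$DIR" 2>/dev/null | tail -n +$((KEEP_RUNS + 1)) |
while read -r f; do
    find "$DIR/$f" -maxdepth 0 -type f -mmin +$((HOURS * 60)) \
         -exec rm -f {} \;
done
```

With daily runs this effectively keeps 4 days of logs; with hourly runs it keeps 72 hours.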
oldhercules - 1 Rookie - 116 Posts - September 3rd, 2015 22:00
I see your points about how NMC works and why, but I still think it's a wrong design: it is _The Management Console_, it should tell me the truth all the time. Yes, there are many ways to get information about the backups, generate reports and so on, but with a GUI where you can also take actions, it's much simpler and more efficient to manage the system.
I think I'll try the BRM appliance again to see if it works better with the current NW versions.
I also always try to fix all failed backups. This error came up when a VADP backup failed because the VM had been left in the 'vmtools installation in progress' state for a longer time. When I got word from the VM guys that it was OK again, I wanted to re-run the failed savegroup - and was surprised that there were no failed VM savegroups anymore...
(Btw, it seems we will also leave the VADP backups soon; vSphere VDP is the same as EBR/VBA (except that it is not flooding the vCenter SQL & logs), and we are testing it these days.
I don't see any benefit in the EBR + NW integration - those backups live their own life, so if it works completely without NW, I'm fine with it.)
ble1 - 4 Operator - 14.4K Posts - September 4th, 2015 03:00
I run BRM as well, but I only use it for DD space monitoring. I think it is not so reliable for NW reporting, but that could be me. I do plan to test DPA very soon; in theory, DPA should provide a rather nice overview. I used DPA in the past (when it was still called EBA), but in those times it was heavy on NW with its queries, and the then-new jobsdb didn't like it either. I have seen that the latest version should address the performance impact of the queries with NW 8.2.x, and since I'm moving the server to that version, I plan to test it against it.
avmaint - 1 Rookie - 115 Posts - November 30th, 2015 22:00
Somehow the SS* files are not being created in the /nsr/tmp/sg directory.
In the NMC properties it is set to 72 hours, which means the logs should be retained for 6 days, but no log files are seen.
bingo.1 - 2.4K Posts - November 30th, 2015 23:00
The managed entries are located in /nsr/logs/sg/.
And just to correct a wrong expectation: 72 hours = 3 days.
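To check whether anything under that directory actually falls inside the 72 h (3 day) window, something like this would do (a sketch; the path comes from the post above, adjust as needed):

```shell
# List files under the managed directory modified within the last
# 3 days (72 h); prints nothing if the window is empty.
find /nsr/logs/sg -type f -mtime -3 -print 2>/dev/null | head -n 20
```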
bingo.1 - 2.4K Posts - December 1st, 2015 05:00
I set the retention period to one month (720 hrs).
Yes, the db becomes quite huge (about 750 MB in our case), but it is so convenient to be able to easily compare a group or a certain save set with an earlier entry.
Does this have an impact on the backup server? IMHO, not at all. NW only works with the current 'set'; a potential impact could come when it deletes the expired entries, but those are only about 3,100 files/day.
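As a back-of-the-envelope check on those figures (720 h of retention and ~3,100 purged records per day against a ~750 MB jobsdb):

```shell
# Rough arithmetic only - the inputs are the numbers quoted above.
records=$((3100 * 30))                   # ~30 days of records retained
kb_per_record=$((750 * 1024 / records))  # integer division
echo "records=$records kb_per_record=$kb_per_record"
```

That works out to roughly 93,000 records at about 8 KB each, which fits the "no real impact" observation.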
lalexis - 2 Intern - 253 Posts - December 1st, 2015 05:00
I too use notifications and send emails to a central mailbox.
I also have scripts for bootstrap reports.
I think using notifications to send email is the easiest way to keep the data for as long as you could want.
ble1 - 4 Operator - 14.4K Posts - December 1st, 2015 07:00
Depending on how it is deleted, it might impact CPU... for example, I'm not sure how smart NW is when going through the records, and I haven't checked how this works in current versions. How long does your typical full purge take?
bingo.1 - 2.4K Posts - December 2nd, 2015 06:00
I'll probably disappoint you - the maximum time it needs is 3 s.
But have a look at yesterday's sessions yourself:
82327 01.12.2015 03:47:26 1 9 0 10972 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 03:47:29 1 9 0 10972 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 3 sec. Records purged: 1118
82327 01.12.2015 04:47:29 1 9 0 12728 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 04:47:31 1 9 0 12728 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 942
82327 01.12.2015 05:47:34 1 9 0 14988 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 05:47:37 1 9 0 14988 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 3 sec. Records purged: 1020
82327 01.12.2015 06:47:36 1 9 0 12892 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 06:47:38 1 9 0 12892 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 733
82327 01.12.2015 07:47:40 1 9 0 12972 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 07:47:42 1 9 0 12972 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 675
82327 01.12.2015 08:47:45 1 9 0 6948 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 08:47:46 1 9 0 6948 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 332
82327 01.12.2015 09:47:46 1 9 0 13896 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 09:47:47 1 9 0 13896 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 247
82327 01.12.2015 10:47:48 1 9 0 12004 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 10:47:49 1 9 0 12004 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 283
82327 01.12.2015 11:47:50 1 9 0 7560 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 11:47:52 1 9 0 7560 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 180
82327 01.12.2015 12:47:52 1 9 0 13368 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 12:47:53 1 9 0 13368 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 618
82327 01.12.2015 13:47:53 1 9 0 6340 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 13:47:53 1 9 0 6340 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 190
82327 01.12.2015 14:47:57 1 9 0 9672 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 14:47:58 1 9 0 9672 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 882
82327 01.12.2015 15:47:58 1 9 0 14268 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 15:47:59 1 9 0 14268 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 194
82327 01.12.2015 16:48:02 1 9 0 10688 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 16:48:03 1 9 0 10688 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 237
82327 01.12.2015 17:48:04 1 9 0 7124 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 17:48:04 1 9 0 7124 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 171
82327 01.12.2015 18:48:09 1 9 0 4340 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 18:48:11 1 9 0 4340 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 553
82327 01.12.2015 19:48:14 1 9 0 5240 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 19:48:16 1 9 0 5240 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 1122
82327 01.12.2015 20:48:16 1 9 0 4460 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 20:48:18 1 9 0 4460 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 925
82327 01.12.2015 21:48:21 1 9 0 9328 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 21:48:23 1 9 0 9328 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 792
82327 01.12.2015 22:48:26 1 9 0 11968 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 22:48:27 1 9 0 11968 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 284
82327 01.12.2015 23:48:30 1 9 0 2628 14428 0 nsrjobd JOBS notice Starting full purge of jobs database
93514 01.12.2015 23:48:32 1 9 0 2628 14428 0 nsrjobd JOBS notice Completed full database purge in 0 min 2 sec. Records purged: 691
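Those purge messages can be summarized straight from the daemon log with awk; the field positions below match the nsrjobd lines above, and the log path is the usual default (adjust if yours differs):

```shell
# Sum up purge durations and record counts from nsrjobd's
# "Completed full database purge" messages in the daemon log.
awk '/Completed full database purge/ {
        # "... purge in 0 min 3 sec. Records purged: 1118"
        for (i = 2; i <= NF; i++) {
            if ($i == "min")     secs += $(i-1) * 60 + $(i+1)
            if ($i == "purged:") recs += $(i+1)
        }
        n++
     }
     END { printf "purges=%d records=%d total_secs=%d\n", n, recs, secs }
    ' /nsr/logs/daemon.log
```

Run against the excerpt above, it would show that a full day of hourly purges costs well under a minute of nsrjobd time in total.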
ble1 - 4 Operator - 14.4K Posts - December 2nd, 2015 13:00
Actually, I'm happy to see that now (most likely a change in the DB model helped there too).