I went to NW8 sometime in September last year. Since then I have wanted to write a blog post with all the details I have had to fight with and the steps I took, but I kept postponing it due to some ongoing issues. One of the things I noticed was that SQL clients would have their nsrexecd die - well, it would stop responding to RPC calls. I had the following situation (always using VDI):
NW7 server + NW7 clients + NMSQL 5.2.2 -> OK
NW8 server + NW7 clients + NMSQL 5.2.2 -> NOK
NW8 server + NW8 clients + NMSQL 5.2.2 -> NOK
NW8 server + NW8.1 clients + NMM 3.0 -> NOK
The frequency of these freezes is irregular and I never really had a chance to spot any specific pattern, so I just learned to live with the fact that I have to restart services on the client, after which all is back to normal for at least the next 24 hours or so (OK, sometimes less).
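Since the freeze shows up from the outside as nsrexecd simply not answering RPC calls, a quick probe is the easiest way to tell a frozen client from a healthy one. A minimal sketch, assuming the standard nsrexec RPC program number (390113) and a hypothetical client name - adjust both for your environment:

```shell
#!/bin/sh
# Hedged sketch: ask a client's nsrexecd whether it still answers RPC.
# 390113 is the RPC program number nsrexecd normally registers; the
# client name below is a placeholder, not from the original post.

report() {   # $1 = client name, $2 = rpcinfo exit status
    if [ "$2" -eq 0 ]; then
        echo "$1: nsrexecd responding"
    else
        echo "$1: nsrexecd NOT responding - time to restart client services"
    fi
}

CLIENT=${1:-sqlclient01}                  # hypothetical client name
rpcinfo -t "$CLIENT" 390113 >/dev/null 2>&1
report "$CLIENT" $?
```

Run it from the backup server against each SQL client; a client that alternates between responding and not responding matches the behavior described above.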
In parallel to that, I observed that some special jobs of mine, running over savepnpc every 15 minutes, started to die too with a hard time limit (most of the time due to savepnpc not completing the post part, but not always). Increasing the interval to 20 minutes helped a bit, but then it was back to its usual nonsense. The real danger came when, sporadically, a tmp file would not be removed from /nsr/tmp due to whatever was causing this. As this looked like a random pattern too (with all its possible outcomes within the failure), I concluded that most likely the same thing was causing it. And since it all started with a simple upgrade of the server and storage nodes, I narrowed my suspicion down to the backup server as the major issue here. As things looked worse over time, it sounded like a potential memory leak somewhere. And indeed, it looks as if nsrjobd is to blame.
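A leftover savepnpc state file in /nsr/tmp is what blocks the next cycle, so it is worth flagging leftovers before the next run trips over them. A minimal sketch - the '*.tmp' pattern and the directory are assumptions here, so check what your savepnpc version actually writes:

```shell
#!/bin/sh
# Hedged sketch: warn about leftover savepnpc state files in /nsr/tmp.
# savepnpc keeps a per-group tmp file while it runs; if one survives the
# run, the next cycle can refuse to start. Pattern/path are assumptions.

warn_leftovers() {   # reads file names on stdin, one per line
    while read f; do
        [ -n "$f" ] && echo "WARNING: leftover savepnpc file: $f"
    done
}

find "${1:-/nsr/tmp}" -name '*.tmp' -type f 2>/dev/null | warn_leftovers
```

Scheduled between backup cycles, this at least turns the silent failure into a visible warning you can act on before the hard time limit hits.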
Over the course of the past several months I have been monitoring this, and all I can say is that something is really broken either in nsrjobd in general or on the platform where I run the backup server (HP-UX ia64 11.31). Since I haven't seen complaints like these here before, I will guess that this might be isolated to the ia64 platform for now. Looking at my current run, the backup server has been up since November 22nd (this is when 184.108.40.206 was applied, which seems to be worse than 220.127.116.11 in that respect). My current nsrjobd memory usage is 2.5G, and nsrjobd rarely goes light on CPU nowadays (restart the server and it comes back to its senses for a while). I thought NW8 was supposed to fix nsrjobd, but in my case it is driving me even more crazy.
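The monitoring itself can be as simple as sampling nsrjobd's virtual size periodically and appending it to a log, so that growth over days is obvious at a glance. A sketch, assuming a hypothetical log path; the UNIX95 variable enables the XPG4 `-o` format option on HP-UX ps, and plain `ps -e -o vsz=,comm=` behaves the same way on Linux:

```shell
#!/bin/sh
# Hedged sketch: sample nsrjobd's VSZ and append it to a log. Run it
# from cron every hour or so; a steadily rising number is the leak
# signature. Log path below is a placeholder.

vsz_of() {   # $1 = process name; reads "VSZ COMM" lines on stdin
    awk -v p="$1" '$2 == p { print $1 }'
}

LOG=${LOG:-/nsr/tmp/nsrjobd_mem.log}     # hypothetical log location
sample=`UNIX95= ps -e -o vsz=,comm= 2>/dev/null | vsz_of nsrjobd`
[ -n "$sample" ] && echo "`date` nsrjobd vsz=${sample}K" >> "$LOG"
```

One line per sample is enough; plotting or even just eyeballing the log tells you whether memory is released after the daily backup load or only ever grows.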
If I check how the log purge runs, I see nothing unusual - NW8 really does rock there. I keep records only for 12 hours, so the number of records should be fine (it was fine for NW7, so I would expect NW8 to be very happy there too). However, if I check the logs, I see all sorts of crazy messages. In random order and without any specific meaning, here are a few:
nsrjobd SYSTEM critical User root does not have the privileges required to start the requested job. The required privilege is: Operate NetWorker. The job requested to be started is: savefs.
nsrjobd SYSTEM critical User root does not have the privileges required to start the requested job. The required privilege is: Operate NetWorker. The job requested to be started is: nsrsqlsv.
nsrjobd SYSTEM critical User root does not have the privileges required to start the requested job. The required privilege is: Operate NetWorker. The job requested to be started is: nsrdasv
Now, the above shows savefs, NMM and NMDA affected, but I have seen it for save and NMSAP too. To add to that, I have seen that when I start a save from the client or restart a group from the CLI, you get messages like:
Cannot create session with nsrjobd: Failed to register with server <SERVER>: Timed out
Retrying 4 more time(s)
Eventually, this may fail one or two of the retries, but in the end it always runs successfully. If you use the GUI, you won't see it, as this is masked (you may find an entry in the log indicating an SSNCHNL error for the group, for example).
One thing I also noticed is that the privilege message (Operate NetWorker) comes from the physical node in the logs, while the timeout shows the virtual name. Not sure whether nsrjobd does not like clusters or this is irrelevant.
I do have a ticket with support, who opened this with EMC, and I do have an NW id for it (which at this stage is the only indication that it has been pushed to an engineer), but no progress at all.
My question is - has anyone who went to NW8 had this issue, and has anyone observed nsrjobd being a cookie monster when it comes to memory? I have no doubt that my issue is related to nsrjobd, as I have excluded pretty much everything else in all the tests done so far, but I wonder how others have been dealing with this and what experience they might have. Looking forward to seeing your memory usage.
Some notes: I have run pretty much the same setup since 2006 (with a HW refresh in 2009). We have very good experience with the HP-UX platform, and the system is tweaked and tuned as it should be (I also checked the NW8 hints for performance optimization, and we were there a long time ago). The privilege message is bogus - let me assure you we have a permission list in place and it hasn't changed for the last 8 years - you get the same even with *@*. Finally, even though this is NW8, we run our setup with nsrauth disabled, so any crazy thing which may occur by using nsrauth is not the case here.
I've indeed seen some cases with different NW 8.x versions where nsrjobd behaves funny, but in most cases I've observed high CPU utilization rather than memory.
I've seen some old escalations (7.2 and 7.3) where that exact message comes up, and engineering at first pointed to some parallelism settings.
Maybe you would like to set debug level 7 for nsrjobd and monitor for a little while?
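For what it's worth, raising the debug level does not have to mean a restart. A sketch, assuming your build ships the dbgcommand utility (which takes a pid and a Debug=N level); if it does not, falling back to restarting the daemons with -D7 is the alternative:

```shell
#!/bin/sh
# Hedged sketch: bump nsrjobd to debug level 7 on a live server,
# assuming dbgcommand is available in this build. Output then lands
# in the daemon log for later inspection.

pid_of() {   # $1 = process name; reads "PID COMM" lines on stdin
    awk -v p="$1" '$2 == p { print $1; exit }'
}

PID=`UNIX95= ps -e -o pid=,comm= 2>/dev/null | pid_of nsrjobd`
if [ -n "$PID" ]; then
    dbgcommand -p "$PID" Debug=7
    echo "nsrjobd ($PID) now at debug level 7 - watch the daemon log"
else
    echo "nsrjobd not running"
fi
```

Remember to set the level back to 0 the same way afterwards - level 7 is chatty and the log grows quickly.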
I'm talking, for instance, about LGTsc06725, but there are a couple of other escalations related to the same issue, which was resolved by code changes. Maybe you are facing a regression?
Maybe it's worth mentioning these escalations to EMC support:
LGTsc06725 and LGTpa96222
I've been told that a couple of the issues we have seen should be addressed by 18.104.22.168, but that does not appear to be the case (at least not for HPUX). Actually, I was told about one issue (not sure about the bug id, but it was related to one of the nsrjobd messages which I didn't list above) which has a patch (or was patched), but only on Solaris. So there might be more things going on, but they all lead to nsrjobd. I also see high CPU usage, but I only see it persist after memory usage has increased. At this point, I am not saying the memory leak is the only problem, but it is certainly the most striking one. It is strange to see old escalations mentioned and still present in the most recent code, but I will pass this information along.
Just reading your initial post again, I noticed the issue with the tmp files not being removed when savepnpc is being run, and I see this is already fixed in NW150216, included in 22.214.171.124.
About the "potential regression" you are right, it's strange to find those errors in completely new code in nsrjobd/jobsdb. However, we could be facing the same issue. Besides, if I understood correctly, this is happening only on IA servers - did I understand correctly?
The issues I referred to earlier are not directly related to nsrjobd (the code changes that solved those errors) but to the savegrp and save binaries, so that's why I was saying it may be worth mentioning that to support - once they escalate to engineering, they will have some further references, and the more information the better.
In this case the tmp non-removal happens due to a failure in nsrjobd communication, from what I can see. I'm aware of the fix in 126.96.36.199, but we run 188.8.131.52 (184.108.40.206 most likely as of this Saturday - perhaps 220.127.116.11 will be out by then).
Yes, this happens on the Itanium platform (HP-UX). I will keep this thread posted with progress.
Just a small follow-up. I've been running 18.104.22.168. Prior to that, I could see nsrjobd growing to 1.5G of memory, and by the very end it was pretty much having problems staying alive (I could catch instances where rpcinfo for the process would not respond for a period of time and then it would come back to life - this would mostly affect savepnpc and VDI based backups). After that, still on 22.214.171.124, I stopped NW, removed the jobs db and started the same version again. It worked fine. A couple of days later I went to 126.96.36.199. From observations so far, the memory leak might still be there, as in a 1 week period memory usage doubled (90M to 182M), but the jury is still out. If I see it growing further in a week's time, I will know what to focus on.
Maybe you will find it childish to ask, but I have been unable to understand what memory leak means here and how you identify it.
I am using NetWorker 188.8.131.52. Every time I restarted the nsr services, my VADP proxies (which are also storage nodes) started to show notifications like "nsrsnmd started, ... trying to restart nsrsnmd, ... another instance of nsrsnmd is already running ..., stopping nsrsnmd ..., starting it ..." and this would continue endlessly. After long follow-ups, support told me it was a memory leak issue that has been fixed in later versions. But I did not understand the memory leak then either, nor how they identified it. In the present context I was reminded of that scenario.
Well, I'm not sure why yours would be a memory leak - it looks more like a race condition causing a failure to identify nsrsnmd. It doesn't matter as long as it gets fixed. In my case, I know how much memory a typical process will use for its activities. And, most importantly, once a process is done it should release memory. Here that doesn't happen. For example, on Friday I had 182M and now it's 200M... and it will continue to grow, and then around 1.5G it will crash big time (well, I will kill it, as it will make operations rather painful).
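In other words, "identifying" a leak is just comparing samples over time: a process that should be idle but whose memory only ever grows is the suspect. A minimal sketch of that comparison, using the numbers from my own box as the example input:

```shell
#!/bin/sh
# Hedged sketch: compare two memory samples (KB) of the same process.
# Growth on an otherwise idle daemon, repeated sample after sample,
# is what a memory leak looks like from the outside.

trend() {   # $1 = earlier sample in KB, $2 = later sample in KB
    if [ "$2" -gt "$1" ]; then
        echo "grew by $(( $2 - $1 ))K - leak suspect, keep sampling"
    else
        echo "stable or shrank - probably not a leak"
    fi
}

trend 186368 204800    # 182M on Friday vs 200M today
```

One pair of samples proves nothing; it is the monotonic growth across many samples, with no load to explain it, that makes the call.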
I have rather the same situation with NW 184.108.40.206 on HP-UX IA (as the NW server), but I had issues with 8.0 on HP-UX IA from the start - from the upgrade (nsrexecd cores and so on).
The whole NW server becomes unresponsive, there are lots of time-out issues on backup, and cloning shows messages like:
- nsrtask JOBS error 92 [%s] Query for results of job %s failed (%s),
- Stale asynchronous RPC handle
- nsrtask ssnchnl critical 32
And nsrjobd has these values:
77 ? 27161 root 152 20 1400M 1377M run 4156:33 14.23 14.20 nsrjobd
IMHO the NW 8 server is still not fixed and not working as it should on the HP-UX IA platform. I am planning on doing the upgrade to 220.127.116.11, but I am not sure it will fix anything.
I do not have CCR enabled (yet), so I haven't seen the nsrtask part - more fun to come, I guess. As for "Stale asynchronous RPC handle", I have seen it with NMDA 1.2 (Oracle) if the library release was prior to build 348. With build 348 I was able to get rid of that issue. NMDA 1.5 seems to be fine too.