
January 13th, 2014 15:00

nsrjobd memory leak in NW8

I went to NW8 sometime in September last year. Since then I have wanted to write a blog post with all the details of what I have had to fight with and the steps I have taken, but I kept postponing it due to some ongoing issues. One of the things I noticed was that SQL clients would have their nsrexecd die - well, it would stop responding to RPC calls. I had the following situation (always using VDI):

NW7 server + NW7 clients + NMSQL 5.2.2 -> OK

NW8 server + NW7 clients + NMSQL 5.2.2 -> NOK

NW8 server + NW8 clients + NMSQL 5.2.2 -> NOK

NW8 server + NW8.1 clients + NMM 3.0 -> NOK

The frequency of these freezes is irregular and I never really had a chance to spot any specific pattern, so I just learned to live with the fact that I have to restart the services on the client, after which all is back to normal for at least the next 24 hours or so (OK, sometimes less).

In parallel to that, I observed that some special jobs of mine, running over savepnpc every 15 minutes, started to die too with a hard time limit (most of the time due to savepnpc not completing the post part, but not always). Increasing the interval to 20 minutes helped a bit, but then it was back to its usual nonsense. The real danger came when, sporadically, the tmp file would not be removed from /nsr/tmp due to whatever was causing this. As this looked like a random pattern too (with all its possible failure outcomes), I concluded that most likely the same thing was causing it. And since it all started with a simple upgrade of the server and storage nodes, I narrowed my suspicion down to the backup server as the major issue here. As things looked worse over time, it sounded like a potential memory leak somewhere. And indeed, it looks as if nsrjobd is to blame.

Over the course of the past several months I have been monitoring this, and all I can say is that either something is really broken in nsrjobd in general, or on the platform where I run the backup server (HPUX ia64 11.31). Since I haven't seen complaints like these here before, I will guess that this might be isolated to the ia64 platform for now. Looking at my current run, the backup server has been running since November 22nd (this is when 8.0.2.5 was applied, which seems to be worse than 8.0.2.3 in that respect). My current nsrjobd memory usage is 2.5G, and nsrjobd rarely goes easy on the CPU nowadays (restart the server and it comes back to its senses for a while). I thought NW8 was supposed to fix nsrjobd, but in my case it is driving me even crazier.

If I check how the log purge runs, I see nothing unusual - NW8 really does rock there. I keep records for only 12 hours, so the number of records should be OK (it was OK for NW7, so I would expect NW8 to be very happy there too; see the quick nsradmin check after the messages below). However, if I check the logs, I see all sorts of crazy messages. In some random order and without any specific meaning, here are a few:


nsrjobd SYSTEM critical User root does not have the privileges required to start the requested job. The required privilege is: Operate NetWorker. The job requested to be started is: savefs.

nsrjobd SYSTEM critical User root does not have the privileges required to start the requested job. The required privilege is: Operate NetWorker. The job requested to be started is: nsrsqlsv.

nsrjobd SYSTEM critical User root does not have the privileges required to start the requested job. The required privilege is: Operate NetWorker. The job requested to be started is: nsrdasv

etc...
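For anyone who wants to check their own jobs database retention, the setting lives in the NSR server resource. From memory (so verify the exact attribute name with a plain print of the resource first), something like this should show it, where "backupserver" stands for your server name:

nsradmin -s backupserver
nsradmin> . type: NSR
nsradmin> show jobsdb retention in hours
nsradmin> print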


Now, the above shows savefs, NMM and NMDA affected, but I have seen it for save and NMSAP too. To add to that, when I start a save from the client or restart a group from the CLI, I get to see messages like:

Cannot create session with nsrjobd: Failed to register with server : Timed out

Retrying 4 more time(s)


Eventually, this may fail for one or two retries, but at the end it always runs successfully. If you use the GUI, you won't see it, as this is masked (you may find an entry in the log indicating an SSNCHNL error for the group, for example).
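For anyone digging for those masked entries: since NW8 the daemon log is written in raw format, so something like the following should render it readable (path as on a standard UNIX install - adjust as needed):

nsr_render_log /nsr/logs/daemon.raw | grep -i ssnchnl

and then look around the matching lines for the group in question.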


One thing I also noticed is that the privilege message (Operate NetWorker) comes from the physical node in the logs, while the timeout shows the virtual name. Not sure whether nsrjobd does not like clusters or whether this is irrelevant.


I do have a ticket with support, who opened this with EMC, and I do have an NW id for it (which at this stage is the only indication that it has been pushed to engineering), but there has been no progress at all.


My question is: did anyone who went to NW8 have this issue, and did anyone observe nsrjobd being a cookie monster when it comes to memory? I have no doubt that my issue is related to nsrjobd, as I have excluded pretty much everything else in all the tests done so far, but I wonder how others have been dealing with this and what experience they might have. Looking forward to seeing your memory usage.


Some notes: I have been running pretty much the same setup since 2006 (with a HW refresh in 2009). We have very good experience with the HPUX platform, and the system is tweaked and tuned as it should be (I also checked the NW8 hints for performance optimization, and we were there a long time ago). The privilege message is bogus - let me assure you, we have the permission list in place and it hasn't been changed for the last 8 years - you get the same even with *@*. Finally, even though this is NW8, we run our setup with nsrauth disabled, so any crazy thing which may occur by using nsrauth is not the case here.

14.3K Posts

January 14th, 2014 00:00

Hi Carlos,

I've been told that a couple of issues we have seen should be addressed by 8.0.1.5, but that does not appear to be the case (at least not for HPUX). Actually, I was told about one issue (not sure about the bug id, but it was related to one of the nsrjobd messages which I didn't list above) which has a patch (or was patched), but only on Solaris. So there might be more things going on, but they all lead to nsrjobd. I do also see high CPU usage, but I only see it persist after memory usage has increased. At this point, I am not saying the memory leak is the only problem, but it is certainly the most striking one. It is strange to see old escalations mentioned and still present in the most recent code, but I will pass this information along.

1.7K Posts

January 14th, 2014 00:00

Hi Hrvoje,

I have indeed seen some cases with different NW 8.x versions where nsrjobd behaves funny, but in most cases I've observed high CPU utilization rather than high memory usage.

I've seen some old escalations (7.2 and 7.3) where that exact message comes up, and engineering at first pointed to some parallelism settings.

Maybe you would like to set debug level 7 for nsrjobd and monitor a little bit?
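Something along these lines should do it on the running daemon - a sketch from memory, so please verify the dbgcommand location and syntax on your platform first:

dbgcommand -p <nsrjobd pid> Debug=7

and Debug=0 switches it back off once you have gathered enough.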

I'm talking, for instance, about LGTsc06725, but there are a couple of other escalations related to the same issue, which was resolved by code changes. Maybe you are facing a regression?

Maybe it's worth mentioning these escalations to EMC support:

LGTsc06725 and LGTpa96222

Thank you,

Carlos

1.7K Posts

January 14th, 2014 03:00

Hi Hrvoje,

Just reading your initial post again, I noticed the issues with the tmp files not being removed when savepnpc is being run, and I see this was already fixed in NW150216, included in 8.0.1.6.

About the "potential regression" you are right, it's strange to find those errors in a completely new code in nsrjobd/jobsdb, however we could be facing the same issue besides, if I understood correctly, this is happening only in IA servers, did I understand correctly?

The issues I referred to earlier are not directly related to nsrjobd (the code changes that solved those errors) but to the savegrp and save binaries. That's why I was saying it may be worth mentioning that to support, so that once they escalate to engineering they have some further references - the more information, the better.

Thank you,

Carlos

14.3K Posts

January 14th, 2014 04:00

Hi Carlos,

In this case the tmp file non-removal happens due to a failure in nsrjobd communication, from what I can see. I'm aware of the fix in 8.0.1.6, but we run 8.0.2.5 (most likely 8.0.2.6 as of this Saturday - perhaps 8.0.3.1 if it is out by then).

Yes, this happens on the Itanium platform (HPUX). I will keep this thread posted with progress.

14.3K Posts

January 25th, 2014 06:00

Just to give a small follow-up. I've been running 8.0.3.1. Prior to that I could see nsrjobd growing to 1.5G of memory, and by the very end it was pretty much having problems staying alive (I could catch instances where an rpcinfo probe of the process would not respond for a period of time and then it would come back to life - this mostly affected savepnpc and VDI based backups; see the sketch below). After that, still on 8.0.2.5, I stopped NW, removed the jobs db and started the same version again. It worked fine. A couple of days later I went to 8.0.3.1. From observations so far, the memory leak might still be there, as in a 1 week period memory usage doubled (90M to 182M), but the jury is still out. If I see it grow further in another week's time, I will know what to focus on.
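For the record, what I mean by the rpcinfo probe is roughly this: list the registered RPC programs on the host, then ping the one belonging to the daemon:

rpcinfo -p backupserver
rpcinfo -t backupserver <program number from the list above>

When the daemon is wedged, the second call hangs or times out instead of answering.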

142 Posts

January 26th, 2014 20:00

Dear Hrvoje/Carlos,

Maybe you will find it childish to ask, but I have been unable to understand what a memory leak means here and how you identify it.

I am using NetWorker 8.0.1.1. Every time I restarted the nsr services, my VADP proxies (which are also storage nodes) started showing notifications like "nsrsnmd started, ... trying to restart nsrsnmd, ... another instance of nsrsnmd is already running ..., stopping nsrsnmd ..., starting it ..." and this would continue endlessly. After long follow-ups, support told me it was a memory leak issue that had been fixed in later versions. But I did not understand the memory leak then either, nor how they identified it. In the present context, I was reminded of that scenario.

Regards

tech88kur

14.3K Posts

January 27th, 2014 01:00

Well, I'm not sure why yours would be a memory leak - it looks more like a race condition causing a failure to identify nsrsnmd. It doesn't matter as long as it gets fixed. In my case, I know how much memory a typical process will use for its activities. And, most importantly, once a process is done it should release that memory. Here that doesn't happen. For example, on Friday I had 182M and now it's 200M... and it will continue to grow, and then around 1.5G it will crash big time (well, I will kill it, as it will make operations rather painful).
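If you want to see it for yourself, a trivial loop like this is enough to catch the trend (just a sketch - on HP-UX the ps -o options need UNIX95 set; on Linux a plain ps -e -o works):

while true
do
    date >> /nsr/tmp/nsrjobd_mem.log
    UNIX95= ps -e -o pid= -o vsz= -o comm= | grep nsrjobd >> /nsr/tmp/nsrjobd_mem.log
    sleep 3600
done

On a healthy daemon the memory column settles down after the daily peak; with a leak it only ever goes up.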

12 Posts

February 1st, 2014 00:00

Hi,

I have much the same situation with NW 8.0.2.5 on HPUX IA (as the NW server), but I have had issues with 8.0 on HPUX IA from the start - from the upgrade onwards (nsrexecd cores and so on).

The whole NW server becomes unresponsive, there are lots of time-out issues on backups, and cloning shows messages like:

- nsrtask JOBS error 92 [%s] Query for results of job %s failed (%s),

- Stale asynchronous RPC handle

- nsrtask ssnchnl critical 32

And nsrjobd shows these values:

77 ? 27161 root 152 20 1400M 1377M run 4156:33 14.23 14.20 nsrjobd


IMHO the NW 8 server is still not fixed and not working as it should on the HPUX IA platform. I am planning on upgrading to 8.0.3.1, but I am not sure it will fix anything.

BR,

Bojan

14.3K Posts

February 3rd, 2014 12:00

Bojan Sumljak wrote:

And nsrjob has this values:

77 ? 27161 root 152 20 1400M 1377M run 4156:33 14.23 14.20 nsrjobd

Bojan

Do you believe in euthanasia? It doesn't matter if you don't - just kill it. Simply find the right time and stop the server. Then delete /nsr/res/jobsdb and start the backup server again. This will buy you some time.
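For completeness, the whole dance is just the following (startup script path as on a standard HP-UX install - adjust for your platform; I would keep a copy of the old db rather than delete it outright):

/sbin/init.d/networker stop
mv /nsr/res/jobsdb /nsr/res/jobsdb.old
/sbin/init.d/networker start

NetWorker recreates an empty jobs database on startup; the old copy can be thrown away once all looks good.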

14.3K Posts

February 3rd, 2014 12:00

I do not have CCR enabled (yet), so I didn't see the nsrtask part - more fun to come, I guess. As for "Stale asynchronous RPC handle", I have seen it with NMDA 1.2 (Oracle) if the library release was prior to build 348. With build 348 I was able to get rid of that issue. NMDA 1.5 seems to be fine too.

12 Posts

February 5th, 2014 09:00

I've killed it and removed the jobsdb; after that CCR works, and so far I haven't had any time-outs.

I am monitoring nsrjobd to see how much time I have before it gets to 1.5GB again...

Br,

Bojan

14.3K Posts

February 5th, 2014 14:00

In my case it was between 40 and 60 days - I am not sure right now. The last time I did a stop/start was the 28th of last month, and now it's at 213/177M (size/resident).

14.3K Posts

February 24th, 2014 14:00

Let me continue with my monologue. Why a monologue? Because I have had a ticket open for almost 3 months with no feedback (except the initial give-us-everything-you-got request to run nsrget). To be fair, the ticket is opened via an EMC partner, but they assure me that they get no feedback from EMC engineering. With that in mind, I can assume the following:

a) EMC does not like the partner

b) This issue is bigger than I thought initially

c) both

I'm not going to guess, as that doesn't help, but I will say that it IS bigger than I originally thought, as I can confirm that I see the same thing on an AIX based backup server. And I suspect Linux too; I recently updated those to 8.0.3.1, but at different times, and top -p shows that those started first have higher memory usage (all those servers are small and share a pretty similar load - the difference should not impact memory usage). I also compared this against one remaining NW7 server on AIX; its memory usage is 130M, as opposed to NW8, which grabbed 1.46G while running since January 15th. It is fair to say that NW8.0.x sucks - sucking memory via nsrjobd.

240 Posts

February 24th, 2014 17:00

Hrvoje,

I would like to apologize for the delay in this issue being resolved, and I would like to see if I can assist in getting this problem pushed to resolution.

In order to do that I will need some information emailed to me. Please send me the following information:

The EMC service request number, if you have one

Your partner company name

Your primary contact at the partner company

I have a fair amount of information from this post.  If I need more, I will let you know.

I can tell you that there is a technical note open for this issue and it is being investigated by NetWorker engineering.  I would like to make sure you are included in this process, if possible.

Again, my apologies for this issue not being taken care of for you.

Mark

mark.bellows@emc.com

14.3K Posts

February 25th, 2014 02:00

Hi Mark,

I emailed the data to you. I was told this morning that, for some reason, the ticket was still at EMC as sev3 (while it should be sev2).

Cheers, H
