2 Intern • 14.3K Posts

January 13th, 2014 15:00

nsrjobd memory leak in NW8

I moved to NW8 sometime in September last year. Since then I have wanted to write a blog post with all the details I have had to fight with and the steps I took, but I kept postponing it due to some ongoing issues. One of the things I noticed was that SQL clients would have their nsrexecd die - well, it would stop responding to RPC calls. I had the following situation (always using VDI):

NW7 server + NW7 clients + NMSQL 5.2.2 -> OK

NW8 server + NW7 clients + NMSQL 5.2.2 -> NOK

NW8 server + NW8 clients + NMSQL 5.2.2 -> NOK

NW8 server + NW8.1 clients + NMM 3.0 -> NOK

The frequency of these freezes is irregular and I never really had a chance to spot any specific pattern, so I just learned to live with the fact that I have to restart services on the client, after which all is back to normal for at least the next 24 hours or so (ok, sometimes less).

In parallel to that, I have observed that some special jobs that I have, running via savepnpc every 15 minutes, would start to die too by hitting the hard time limit (most of the time due to savepnpc not completing the post part, but not always). Increasing the interval to 20 minutes helped a bit, but then it was back to its usual nonsense. The real danger came when, sporadically, the tmp file would not be removed from /nsr/tmp, whatever was causing that. As this looked like a random pattern too (with all its possible outcomes within the failure), I concluded that most likely the same thing is causing it. And since it all started with a simple upgrade of the server and storage nodes, I narrowed my suspicion down to the backup server as the major issue here. As things started to look worse with time, it looked like a potential memory leak somewhere. And indeed, it looks as if nsrjobd is to blame.

Over the course of the past several months I have been monitoring this, and all I can say is that either there is something really broken in nsrjobd in general or on the platform where I run the backup server (HPUX ia64 11.31). Since I haven't seen complaints like these here before, I will guess that this might be isolated to the ia64 platform for now. Looking at my current run, the backup server has been running since November 22nd (this is when 8.0.2.5 was applied, which seems to be worse than 8.0.2.3 in that respect). My current nsrjobd memory usage is 2.5G. And nsrjobd rarely likes to go light on CPU nowadays (restart the server and it comes back to its senses for a while). I thought NW8 was supposed to fix nsrjobd, but in my case it is driving me more crazy.
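
For anyone who wants to put a number on this kind of growth, here is a minimal sketch in Python. It assumes you already collect periodic readings of the daemon's memory usage by whatever means your platform offers; the function name `growth_per_day` and the sample format (Unix timestamp, megabytes) are my own assumptions for illustration, not anything NetWorker provides. It fits a least-squares line through the readings and reports the slope in MB per day.

```python
from typing import Sequence, Tuple


def growth_per_day(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of memory (MB) against time, scaled to MB per day.

    samples: (unix_timestamp_seconds, memory_mb) pairs, in any order.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one point in time")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t * 86400.0  # slope is per second; 86400 seconds per day


# Hypothetical readings, one per day; not real measurements.
DAY = 86400
readings = [(i * DAY, 300.0 + 50.0 * i) for i in range(5)]
print(f"{growth_per_day(readings):.1f} MB/day")
```

A slope that stays clearly positive across restarts of the backup load, and drops back only after a daemon restart, is what you would expect from a leak rather than from normal caching.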

If I check how the log purge runs, I see nothing unusual - NW8 really does rock there. I keep records for only 12 hours, so the number of records should be ok (it was ok for NW7, so I would expect NW8 to be very happy there too). However, if I check the logs, I see all sorts of crazy messages. In random order and without any specific meaning, here are a few:


nsrjobd SYSTEM critical User root does not have the privileges required to start the requested job. The required privilege is: Operate NetWorker. The job requested to be started is: savefs.

nsrjobd SYSTEM critical User root does not have the privileges required to start the requested job. The required privilege is: Operate NetWorker. The job requested to be started is: nsrsqlsv.

nsrjobd SYSTEM critical User root does not have the privileges required to start the requested job. The required privilege is: Operate NetWorker. The job requested to be started is: nsrdasv

etc...
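
To see which job types are hit most often, messages like the ones above can be tallied per job. This is only a sketch: it assumes the messages are already available as plain text lines, the regular expression is built purely from the wording quoted here, and `count_privilege_errors` is a hypothetical helper name. Real log lines may carry extra fields (timestamps, host names) in front, which the pattern tolerates.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern derived from the message text quoted in this post (an assumption
# about the general shape; adjust it if your rendered log differs).
PRIV_RE = re.compile(
    r"nsrjobd .*does not have the privileges required to start the requested job\."
    r".*The job requested to be started is: (?P<job>[\w.-]+)"
)


def count_privilege_errors(lines: Iterable[str]) -> Counter:
    """Count privilege-error messages per requested job name."""
    counts: Counter = Counter()
    for line in lines:
        match = PRIV_RE.search(line)
        if match:
            # Some messages end with a period after the job name, some do not.
            counts[match.group("job").rstrip(".")] += 1
    return counts


sample = [
    "nsrjobd SYSTEM critical User root does not have the privileges required "
    "to start the requested job. The required privilege is: Operate NetWorker. "
    "The job requested to be started is: savefs.",
    "nsrjobd SYSTEM critical User root does not have the privileges required "
    "to start the requested job. The required privilege is: Operate NetWorker. "
    "The job requested to be started is: nsrdasv",
]
for job, hits in count_privilege_errors(sample).most_common():
    print(job, hits)
```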


Now, the above shows savefs, NMM and NMDA affected, but I have seen it for save and NMSAP too. In addition, I have seen that sometimes, when I start a save from the client or restart a group from the CLI, you get messages like:

Cannot create session with nsrjobd: Failed to register with server : Timed out

Retrying 4 more time(s)


This may fail on one or two retries, but in the end it always runs successfully. If you use the GUI, you won't see it, as this is masked (you may find an entry in the log indicating an SSNCHNL error for the group, for example).


One thing I also noticed is that the privilege message (Operate NetWorker) comes from the physical node in the logs, while the timeout shows the virtual name. Not sure whether nsrjobd does not like clusters or this is irrelevant.


I do have a ticket with support, who opened this with EMC, and I do have an NW id for it (which at this stage is only an indication that it has been pushed to an engineer), but there has been no progress at all.


My question is - has anyone who went to NW8 had this issue, and has anyone observed nsrjobd being a cookie monster when it comes to memory? I have no doubt that my issue is related to nsrjobd, as I have excluded pretty much everything else in all the tests done so far, but I wonder how others have been dealing with this and what experience they might have. Looking forward to seeing your memory usage.


Some notes: I have run pretty much the same setup since 2006 (with a HW refresh in 2009). We have very good experience with the HPUX platform, and the system is tweaked and tuned as it should be (I also checked the NW8 hints for performance optimization and we were there a long time ago). The privilege message is bogus - let me assure you we have a permission list in place and it hasn't been changed for the last 8 years - you get the same even with *@*. Finally, even though this is NW8, we run our setup with nsrauth disabled, so any crazy thing which may occur by using nsrauth is not the case here.

2 Intern • 14.3K Posts

September 29th, 2014 06:00

8.0.1.5 is a bit behind with respect to the current patch level (for the same code tree that would be 8.0.3.7). As for the error itself, if this is again on the NMDA side, I know for sure this is fixed in NMDA 1.5, as I do not get an issue with it. However, if you get this with a file system backup, then chances are that you really have an error somewhere, like a hanging NFS mount on the client (the easiest way to check is via the bdf command).
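
If you want to script that check rather than sit in front of a frozen terminal, a rough sketch in Python follows. The idea is simply to run a command with a time limit and treat "did not finish" as a sign of a hang. `command_hangs` is a hypothetical helper name, the 30 second default is an arbitrary assumption, and calling it with `["bdf"]` assumes bdf is on the PATH of the client.

```python
import subprocess
import sys
from typing import Sequence


def command_hangs(cmd: Sequence[str], timeout_s: float = 30.0) -> bool:
    """Return True if cmd has not finished within timeout_s seconds."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout_s)
        return False
    except subprocess.TimeoutExpired:
        # Request termination but do not wait for it: a process blocked on a
        # dead NFS mount may stay unkillable until the mount recovers.
        proc.kill()
        return True


# Portable demonstration with a child that just sleeps; on the client you
# would call something like command_hangs(["bdf"]) instead.
sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
print("hang detected:", command_hangs(sleeper, timeout_s=1.0))
```

A timeout only tells you that something did not return in time; which mount point is stuck still has to be worked out by hand.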

2 Intern • 14.3K Posts

September 29th, 2014 06:00

Actually, the latest patch level for 8.0 is 8.0.4.1.

2 Posts

September 29th, 2014 06:00

Dear Hrvoje,


We have NW Server 8.0.1.5 Build 169 on HP UX 11 with NMDA 1.6 on HP UX or SUSE clients, and we see the same issue on both DB backups and filesystem backups:


:save: Error ending job: Stale asynchronous RPC handle


Greetings, Andreas Anastassiou

2 Posts

September 29th, 2014 07:00

Thanks for the quick answers. As I heard from the customer, we will go directly to 8.1 soon.

Best regards

Andreas Anastassiou

March 16th, 2015 02:00

Hi Hrvoje,

We are currently using NWS 8.1.1.9 on Windows 2008 R2 SP1, and for a couple of weeks now we have been seeing some of the issues you describe above. After a few days (5-10) NetWorker hangs. We get the following line in daemon.raw:

We have opened a call with EMC (SR#69845352), but apart from nsr peer information, EMC support has not found any solution.

Did you get any answer from EMC about this jobdb memory leak?

Thanks in advance for your help. We are feeling very uncomfortable with an unstable backup system.

Cheers

Greg

2 Intern • 14.3K Posts

March 16th, 2015 02:00

What it says there is that the daemon no longer responded to RPC queries. Obviously, 8.1.1.9 is no longer the latest patch level, so you may wish to check the latest patch level first. With 8.1.x (but a different OS) I have seen that nsrim actions may cause things to hang as well.

March 16th, 2015 02:00

Hi,

many thanks for your answer.

We will probably have a look at the latest patch, but I'm not sure it will solve anything.

Despite the fact that we have scheduled the nsrim process at a specific time, we are still seeing NetWorker crash without any nsrim running.

Cheers

Greg

2 Intern • 14.3K Posts

March 16th, 2015 10:00

OK, then you can rule out nsrim for sure. Due to some issues I had with 8.1.x and big UNIX boxes, I decided to speed up my plan to deploy multiple small VM backup servers on Linux (RHEL), and while this runs on a fairly fresh 8.1SP1, I haven't seen any issues yet, so I will most likely continue with this and try to skip 8.1.x altogether.

March 17th, 2015 05:00

Hi,

When you speak about 8.1SP1, did you mean 8.1.2?

Because we have this issue with version 8.1.1.9, which is the last SP1 version, as far as I know.

Thanks

Greg

2 Intern • 14.3K Posts

March 17th, 2015 05:00

Typo... was referring to 8.2SP1.

2 Intern • 14.3K Posts

March 17th, 2015 06:00

Gregoire_Perrenoud wrote:

Because we have this issue with version 8.1.1.9, which is the last SP1 version, as far as I know.

The last patch for the 8.1.x tree is within the SP2 realm.
