October 14th, 2010 05:00

nsrd taking 45 mins to start

Hi guys,

We have an interesting problem with a NetWorker 7.6.0.8 installation on a Red Hat EL4 x86 host.

When we start the NetWorker services, we notice that two nsrexecd PIDs exist, and nsrd does not come online for another 45 minutes.

If I monitor the nsr folder, there is zero activity in the logs directory (or any other dir) after we start the script. We're logged in as root of course. Once nsrd starts, there are no problems with regular NetWorker operations. The logs do not indicate any errors or any problems of any sort after everything starts.

We've taken a look at the OS logs but nothing is occurring at the time of starting NetWorker.

I have not tried renaming res, mm or cfi yet as a troubleshooting step, but I have renamed tmp.

Any thoughts?

Thanks!

Justin

263 Posts

October 17th, 2010 20:00

Thanks for the feedback. When you mentioned the servers file, it reminded me of another service request I had worked on with a different customer. Same problem, same root cause, and same solution!

I suspect that the bogus hostname would have shown up in the debug output. It is not surprising that it is the root cause. nsrexecd is responsible for allowing or refusing connections from other NetWorker servers, so it makes sense that it would read the servers file, if it exists, and then try to resolve the hostnames.

Regardless...  problem solved...  Q.E.D. 

244 Posts

October 14th, 2010 09:00

Hi,

Please have a look at the process activity with the tool strace/trace. Maybe you will find out where it is stuck. Please also have a look at the network configuration. I had a similar problem where NetWorker startup took a long time because of a misconfigured /etc/hosts.

9 Posts

October 14th, 2010 15:00

Hello,

Thanks for the reply! I tried trace/strace, but it was not installed on the host at the time.

I will definitely look into the network configuration as you suggested. I did notice they had teamed NICs. I will also request to have strace installed.

Thanks again

Justin

9 Posts

October 14th, 2010 23:00

Great, thank you

244 Posts

October 14th, 2010 23:00

Well, strace will tell you on which system call it is waiting/hanging.
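
For anyone following along, here is a minimal sketch of how the slow calls could be picked out of a trace. It assumes strace was run with -T (time spent in each call) and -o (write to a file), for example `strace -f -tt -T -o /tmp/nsrexecd.strace -p <pid of nsrexecd>`. The output path and the helper name slow_calls are made up for illustration.

```shell
# Hypothetical helper: print strace -T lines whose duration is >= MIN seconds.
# Usage: slow_calls TRACE_FILE MIN_SECONDS
slow_calls() {
    awk -v min="$2" '
        # strace -T appends the call duration as <seconds> at the end of the line
        match($0, /<[0-9]+\.[0-9]+>$/) {
            secs = substr($0, RSTART + 1, RLENGTH - 2)
            if (secs + 0 >= min + 0) print
        }' "$1"
}
```

For example, `slow_calls /tmp/nsrexecd.strace 5` lists every call that blocked for five seconds or more. A call that is still hanging when strace is stopped has no duration and will not be listed, so the last line of the file is worth checking as well.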

263 Posts

October 15th, 2010 04:00

If nsrd is not starting at all until 45 mins after nsrexecd starts, then the problem is with nsrexecd.

The startup script starts nsrexecd and then nsrd. nsrd will not be started until nsrexecd has completely started.

In addition, there is normally only one nsrexecd running. If there are two, this is further evidence that nsrexecd did not finish its startup phase quickly. The likely reason for this is a network configuration issue.
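
A quick way to check for the duplicate nsrexecd, sketched under the assumption that ps on the host accepts -e and -o comm= (procps on Linux does). The helper name count_procs is hypothetical.

```shell
# Hypothetical helper: count running processes whose command name is exactly NAME.
# Usage: count_procs NAME
count_procs() {
    ps -e -o comm= | awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}
```

`count_procs nsrexecd` prints the number of matching processes. A value of 2 while nsrd is still missing matches the symptom described in this thread.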

To debug this, stop all NetWorker processes, then:

script /tmp/nsrexecd.txt

nsrexecd -D9

(wait)

exit

Review the output file and look at which hostnames and IP addresses it is referencing.
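
The exact layout of the -D9 output is not shown in this thread, so the following sketch makes no assumption about it beyond it being text. It only pulls out anything that looks like an IPv4 address from the captured file, which can make an unexpected address easier to spot. The helper name list_ipv4 is hypothetical.

```shell
# Hypothetical helper: list the unique IPv4-looking strings found in a text file.
# Note: version strings such as 7.6.0.8 also match this pattern.
# Usage: list_ipv4 FILE
list_ipv4() {
    grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' "$1" | sort -u
}
```

For example, `list_ipv4 /tmp/nsrexecd.txt` prints one address per line. Any address that does not belong to the NetWorker server, the local host, or a known client is a candidate for further checking.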

9 Posts

October 17th, 2010 16:00

Thank you Wallace, this is very helpful. I wasn't sure at what point nsrd would be expected to start, i.e. after nsrexecd has completed starting.

Thanks!!

Justin

263 Posts

October 17th, 2010 18:00

I worked with a customer who seemed to have the same symptoms. While watching the nsrexecd -D9 output, the customer and I noticed a reference to 127.0.0.2! We were not sure where this came from, but nsrexecd was definitely trying to resolve it. It was not in the local hosts file.

It turned out that the Linux server had defined 127.0.0.2 in its network configuration.  Once this was removed, nsrexecd started without any unusual delay.

I am not saying that this is what you will see, but the debug information should help.  If you cannot see anything obvious, then open a support ticket for assistance.

Let us know too...  Good luck!

9 Posts

October 17th, 2010 19:00

Hi Wallace,

Thanks for the feedback. I put nsrexecd into D9 mode and couldn't see anything unusual after 5 mins or so of waiting. We did some more poking around and found the /nsr/res/servers config file had a bogus hostname at the top of the list before the NetWorker server name. We removed the bogus host so the nsr host was at the top of the list and everything now starts perfectly.

So it appears it was timing out on a non-existent host. Interesting problem!
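
A minimal sketch of how the servers file could be checked for names that do not resolve. It assumes one hostname per line in the file and that getent is available on the host. check_servers_file and resolve_host are hypothetical helper names, not NetWorker tools.

```shell
# Hypothetical resolver wrapper: succeeds if the name resolves through the
# system's normal lookup order (hosts file, DNS, ...).
resolve_host() {
    getent hosts "$1"
}

# Hypothetical helper: report each entry of a servers file as OK or FAIL.
# Returns non-zero if at least one entry does not resolve.
# Usage: check_servers_file FILE
check_servers_file() {
    rc=0
    while read -r host rest || [ -n "$host" ]; do
        # skip blank lines
        [ -n "$host" ] || continue
        if resolve_host "$host" >/dev/null 2>&1; then
            echo "OK $host"
        else
            echo "FAIL $host"
            rc=1
        fi
    done < "$1"
    return $rc
}
```

`check_servers_file /nsr/res/servers` prints one line per entry. Note that getent itself can take a while on a name that times out in DNS, which is a hint in its own right.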

Thanks for your assistance!

Justin

14.3K Posts

October 25th, 2010 01:00

I have the same issue and have an EMC case open too. But I do not use 127.0.0.2, for sure.

What I have found is the following (using 7.5.2.4 on HPUX and 7.6.0.8 on Solaris, two different environments). I will focus on the environment I babysit (the HPUX one).

I did an update from 7.4.5.8 to 7.5.2.4 and noticed that nsrexecd hangs at startup and waits forever. Well, not forever, but until I hit Ctrl+C. Truss shows that nsrexecd is trying to establish communication with nsrd on the server (actually on the hosts listed in /nsr/res/servers), and since the server is not up at that moment it just hangs until, I guess, it hits some sort of RPC timeout. Removing /nsr/res/servers (or making it empty), or making sure nsrd is already running, addresses the issue. I saw the same thing a few days before my upgrade in a separate, Solaris-based datazone: running nsrexecd on a storage node fails (well, runs for 45 minutes or so) if nsrd on the server is not running.

Yesterday I upgraded to 7.5.3.3 and I can confirm that the issue is not present in that version. I am not sure about 7.6.0.latest or 7.6SP1. It is an issue with the NetWorker code for sure.
