Start a Conversation

Unsolved

July 24th, 2014 00:00

NetWorker update from 7.6.4 to 8.1.1 failed. OS IBM AIX


We operate a NetWorker 7.6.4 server on an IBM AIX cluster.
The installation runs on two cluster nodes, XTEST60 and XTEST61, and both have been updated to NetWorker 8.1.1.2.
After the successful update, the failover node XTEST61 fails to start the NetWorker daemons. All cluster resources currently reside on XTEST61.
The error message reads:

73248 1405510561 5 5 0 1 16056448 0 xtest61p.swd-ag.de nsrd NSR critical 148 Can't start nsrd because %s (/nsr) is local, and NetWorker is configured as a cluster server. Use cluster manager to check NetWorker service status. 1 23 8 /nsr/res
nsr_log: ENTRY

A check of all config files, compared against our current productive NetWorker cluster, showed that they are correct.
We carried out the update from 7.6.4 to 8.1.1.2 again after the error; the result was the same failure.
We intend to update our productive clusters in the coming days and are asking for support here.

With best regards,
AIX team, Stadtwerke Duesseldorf AG.

4 Operator

1.3K Posts

July 24th, 2014 02:00

Have you run the cluster configuration script yet, /usr/sbin/networker.cluster? If not, please run it.

4 Posts

July 24th, 2014 02:00

Of course, we have configured the cluster, and also the gst.

The mystery to me is: why did it work on XTEST60, the primary cluster node, but not on the standby node?

14.3K Posts

July 24th, 2014 06:00

I do not have experience with updates from NW7 on AIX nodes, as my whole AIX landscape went from TSM straight to NW8. I do have two backup servers running on AIX clusters with no issues (and a number of application clients in clusters as well). What I would try: uninstall everything, remove the cluster files created by NetWorker, and make sure the symlink is in the correct state. Once you install again, networker.cluster should be run; make sure that nw_hacmp.lc on both nodes is correct (e.g. pointing to the correct IP address and shared storage). Then start the daemons via /etc/rc.nsr; only the client should start (on the passive node). Then use clRGmove to test failover between the nodes. I suspect the NetWorker cluster config file was not updated on the passive node, and this is why the symlink phase failed, or it was not even aware of the cluster config.
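Checking the symlink state mentioned above can be scripted; this is a minimal sketch, assuming /nsr should point at a shared-storage directory such as /networker/nsr (paths taken from this thread, not from product documentation):

```shell
# Sketch: verify that the /nsr symlink points at shared storage before
# nsrd is started on a node. Paths are examples from this thread.
# Note: readlink(1) may need replacing with "ls -l" parsing on older AIX.
check_nsr_link() {
    nsr_path="$1"      # normally /nsr
    shared_dir="$2"    # e.g. /networker/nsr on the shared disk
    if [ -L "$nsr_path" ] && [ "$(readlink "$nsr_path")" = "$shared_dir" ]; then
        echo "ok: $nsr_path -> $shared_dir"
        return 0
    fi
    echo "bad: $nsr_path does not point at $shared_dir" >&2
    return 1
}

# e.g. on the passive node (client-only), /nsr should point at the local area:
#   check_nsr_link /nsr /nsr.NetWorker.local
```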

4 Posts

July 25th, 2014 01:00

Everything fresh: both nodes newly installed with NW 7.6.4, then updated to 8.1.1.2. The installation completed without errors and the cluster was configured.

Same as before: NW runs fine on the primary node but does NOT start on the standby node.

I'm frustrated.

4 Posts

July 25th, 2014 02:00

What do you mean by cluster config file? /usr/bin/nw_hacmp.lc?

If so, it has been running well, and it is the same on both cluster nodes.

What is your environment? AIX 6.1 TL09 1412 with virtualized Ethernet adapters?

PowerHA SystemMirror 6.1 SP12?

NW 8.1.1?

4 Posts

July 25th, 2014 02:00

After failover, nsrd dies with the error:

nsrd NSR critical 148 Can't start nsrd because %s (/nsr) is local, and NetWorker is configured as a cluster server. Use cluster manager to check NetWorker service status

There is no way to find out why nsrd dies; debug mode does not give more information than this primitive error.

The /nsr link is correct and points to the shared disk.

Everything looks normal, just like on the primary node, which starts without any problem.

14.3K Posts

July 25th, 2014 02:00

I'm more frustrated with the level of information you give.  I assume what doesn't work is the failover when you start the server, right? Running the client only does work? Now, can you confirm, as already asked, that nw_hacmp.lc is correct and has the correct values? If so, do a failover and provide the logs. Also capture the output when you call the start script, to get an idea of what it does and where it breaks.

14.3K Posts

July 25th, 2014 02:00

I have both PowerHA 6.1 and 7.1 on LPARs. It is not 8.1.1 but 8.0.3.5; given the error, that should not matter at all.

Yes, the NW cluster config is nw_hacmp.lc, and that file gets nuked each time you do an update, so it is essential to refresh it after upgrades with valid information (two lines giving the server IP and the shared disk location). Please capture what the stop and start scripts are doing during a failover to get more information. Before (and after) these actions, take note of the /nsr symlinks (and make sure that /nsr/run on the passive node has the correct information if only the client is running).
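For reference, the two lines described above would look something like this; this is a hypothetical sketch only (the variable names are illustrative, not taken from the actual script), with the values used in this thread:

```
# /usr/bin/nw_hacmp.lc -- values written/refreshed by networker.cluster
# (variable names here are placeholders, not from the real script)
NSR_SERVICE_ID=xtest60          # virtual/service hostname of the NW server
NSR_SHARED_DISK_DIR=/networker  # mount point of the shared /nsr area
```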

4 Posts

July 25th, 2014 02:00

nw_hacmp.lc has the right environment, as mentioned. The nw_hacmp.lc start/stop script only sets the environment and starts NetWorker via /opt/nsr/admin/nsr_envexec, reading a file called lcmap.

These files are all the same on both cluster nodes.

What the cluster does is bring the resource group online. An RG is an app server within a volume group plus a service IP label.

Nothing more. That is the duty of the cluster, and it does it well. The VG is online and the service IP is also reachable.

After the RG is online, the cluster exits with code 0, as long as the app scripts return the same exit code.

14.3K Posts

July 25th, 2014 02:00

As I said before, and check the error message you posted: it seems that when nsrd was tried on the local host, it fails because /nsr is linked against the local folder and not the shared one. This is why it is essential to enable monitoring of each step in the script, to see where it breaks, and to watch what happens on the file system at that time. Can you do that and post it here?
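The step-by-step monitoring asked for here can be done by running the start script under `sh -x`, which echoes every command as it executes; a minimal sketch, where the script path and trace file name are assumptions from this thread:

```shell
# Sketch: trace every command a start script executes. "sh -x" prints
# each command to stderr, which is redirected into a trace file.
trace_start() {
    script="$1"   # e.g. /usr/bin/nw_hacmp.lc
    trace="$2"    # e.g. /tmp/nw_start.trace
    sh -x "$script" start 2>"$trace"
    rc=$?
    echo "exit=$rc trace=$trace"
    return $rc
}

# e.g.: trace_start /usr/bin/nw_hacmp.lc /tmp/nw_start.trace
```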

4 Posts

July 25th, 2014 04:00

Of course. Here is the /nsr link before starting NetWorker:

/nsr@ -> /nsr.NetWorker.local/

The service IP is online and pingable.

nw_hacmp.lc start

The link after starting:

nsr@ -> /networker/nsr/

73248 1406285620 5 5 0 1 14483626 0 xtest61p.swd-ag.de nsrd NSR critical 148 Can't start nsrd because %s (/nsr) is local, and NetWorker is configured as a cluster server. Use cluster manager to check NetWorker service status. 1 23 8 /nsr/res

######lcmapfile output######

+ FILESYSTEMS=

+ GET_FS_FROM_ODM=FALSE

+ + odmget -qgroup = nwt60_rg AND name = SERVICE_LABEL HACMPresource

+ nawk { FS="\"" ; print $2 }

+ grep value

IP_LABELS=xtest60

+ [ Xxtest60 = X ]

+ + odmget -qgroup = nwt60_rg AND name = FILESYSTEM HACMPresource

+ grep value

+ sort

+ nawk { FS="\"" ; print $2 }

FILESYSTEMS=ALL

+ expr ALL : [Aa][Ll][Ll]

+ [ 3 -eq 3 ]

+ GET_FS_FROM_ODM=TRUE

+ FILESYSTEMS=

+ + odmget -qgroup = nwt60_rg AND name = VOLUME_GROUP HACMPresource

+ grep value

+ nawk { FS="\"" ; print $2 }

+ sort

VOLUMES=test60nwvg

+ [ TRUE = FALSE -a TRUE = FALSE ]

+ get_other_volumes_from_odm

+ [ X /networker = X ] # shared disk is correct.

+ [ local = global ]

+ [ TRUE = TRUE ]

+ echo type: NSR_CLU_VIRTHOST;

type: NSR_CLU_VIRTHOST;

+ echo hostname: xtest60; #right hostname

hostname: xtest60;

+ echo owned paths: \c

owned paths: + first=0

+ [ 0 -eq 0 ]

+ echo /networker\c

/networker+ first=1

+ echo ;\n

###########################

The env mode=local (I don't know why).

That's the problem. I know that local is wrong, but it is difficult to find out why.

active_processes= 13762788      -  0:00 nsrexecd

+ [ -z  13762788      -  0:00 nsrexecd ]

+ echo Starting NetWorker service xtest60 on xtest61p

Starting NetWorker service xtest60 on xtest61p

+ [ ! -z xtest60 ]

+ start_nsrd -k xtest60

+ /opt/nsr/admin/nsr_envexec -u /nsr/nsrrc -s /opt/nsr/admin/networkerrc /usr/bin/nsrd -k xtest60

+ nsr_log NetWorker service failed to start

LCMAP.FILE

----------------------

+ + odmget HACMPgroup

+ grep group

+ awk -F" { print $2 }

+ sort

RESOURCES=nwt60_rg

nwt61_rg

+ GET_FS_FROM_ODM=FALSE

+ GET_LV_FROM_ODM=TRUE

+ [ 0 -eq 1 ]

+ mode=local

14.3K Posts

July 25th, 2014 05:00

I do not see it in your log, but I assume nsrexecd is freshly started when you try this (I mean, it is started from nw_hacmp.lc?).

I just checked lcmap on my side and I also get mode=local at first; it then switches to node later on:

# Start here...

if [ $# -eq 1 ]; then

        mode=$1

else

        mode=local

fi

+ [ 0 -eq 1 ]

+ mode=local

if [ "${mode}" = "node" ] ; then

        print_header

        exit 0

fi

+ [ local = node ]

print_header

+ print_header

type: NSR_CLU_TYPE;

clu_type: NSR_LC_TYPE;

interface version: 1.0;

I think this is just related to the kind of query being run, as seen in the header:

# usage: lcmap local|global|node (no args is the same as "lcmap local")

#

#              local  -> prints out path ownership info for local node

#              global -> prints out path ownership info for all nodes

#              node   -> prints out header (nodename) info for local node
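The argument handling quoted from the script explains the mode=local line in the trace: called with no arguments, lcmap defaults to local. A minimal re-creation of just that logic (not the real lcmap script):

```shell
# Re-creation of the quoted argument handling from lcmap: with no
# argument the mode defaults to "local", which is expected behaviour,
# not an error in itself.
lcmap_mode() {
    if [ $# -eq 1 ]; then
        mode="$1"
    else
        mode=local
    fi
    echo "$mode"
}
```

So "mode=local" in the trace is simply what a no-argument invocation produces, and is not itself the failure.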

This has to be something silly, but right now I also do not see where the problem is this way. Just have someone WebEx into your environment and it should be fairly easy to locate, I guess. From everything so far, it looks to me that the linking is done, but it fails somewhere around nsrexecd being initiated, which should then run the nsrd part.
