4 Operator


14.4K Posts

May 4th, 2006 01:00

The first thing to do is to put that client into a separate group. Have that group start at the same time and write to the same pool as the original one. That way you won't affect the other clients, since the newly created group will contain only this client.
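The group setup can also be scripted instead of done through the GUI; a minimal sketch using nsradmin, where the group and client names are made up for illustration:

```shell
# Hedged sketch: create the one-client test group non-interactively.
# "debug-test" and "problem-client" are illustrative names; adjust
# the group's start time and pool afterwards to match the original.
nsradmin -i - <<'EOF'
create type: NSR group; name: debug-test
. type: NSR client; name: problem-client
update group: debug-test
EOF
```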

Then, when it happens again, copy the contents of the ..\nsr\tmp directory from the backup server to a safe location. Check whether you get RPC responses in both directions, server-to-client and client-to-server. Then try to stop/kill the hanging backup. Then try a client-initiated backup in debug mode and capture the output. With luck it will hang again. Then send all of those outputs to support.
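The capture step can be sketched as a small script. This is a sketch only, assuming the server-side directory is /nsr/tmp; the destination path is illustrative:

```shell
#!/bin/sh
# Sketch: snapshot the server's nsr/tmp directory to a timestamped
# safe location BEFORE killing the hung backup, so nothing in it is
# lost when the daemons clean up.
snapshot_tmp() {
    src="$1"                                 # e.g. /nsr/tmp
    dest="$2/nsr-tmp.$(date +%Y%m%d%H%M%S)"  # timestamped copy
    mkdir -p "$dest" &&
    cp -Rp "$src/." "$dest/" &&
    echo "$dest"                             # report where the copy went
}

# Example (paths are assumptions):
#   snapshot_tmp /nsr/tmp /var/crash
```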

8 Posts

May 4th, 2006 09:00

Unfortunately I hit Post Message twice last night and have two threads going with this subject. As I posted on the other one, the thread below seems similar to my own issue:

http://forums.emc.com/forums/thread.jspa?threadID=32123&tstart=160

No resolution was listed for this post, though.

8 Posts

May 4th, 2006 09:00

Hrvoje,

Thanks for the reply. Here's where I'm at. I'm testing the client (which is still hung) in its own savegroup. Here are the results of the rpcinfo command on both client and server:

Server:

root@tanya:/nsr/tmp.save# rpcinfo
program version netid address service owner
100000 4 ticots tanya.rpc rpcbind superuser
100000 3 ticots tanya.rpc rpcbind superuser
100000 4 ticotsord tanya.rpc rpcbind superuser
100000 3 ticotsord tanya.rpc rpcbind superuser
100000 4 ticlts tanya.rpc rpcbind superuser
100000 3 ticlts tanya.rpc rpcbind superuser
100000 4 tcp 0.0.0.0.0.111 rpcbind superuser
100000 3 tcp 0.0.0.0.0.111 rpcbind superuser
100000 2 tcp 0.0.0.0.0.111 rpcbind superuser
100000 4 udp 0.0.0.0.0.111 rpcbind superuser
100000 3 udp 0.0.0.0.0.111 rpcbind superuser
100000 2 udp 0.0.0.0.0.111 rpcbind superuser
100024 1 tcp 0.0.0.0.192.0 status superuser
100024 1 udp 0.0.0.0.192.7 status superuser
100021 1 tcp 0.0.0.0.192.1 nlockmgr superuser
100021 1 udp 0.0.0.0.192.8 nlockmgr superuser
100021 3 tcp 0.0.0.0.192.2 nlockmgr superuser
100021 3 udp 0.0.0.0.192.9 nlockmgr superuser
100021 4 tcp 0.0.0.0.192.3 nlockmgr superuser
100021 4 udp 0.0.0.0.192.10 nlockmgr superuser
100020 1 udp 0.0.0.0.15.205 llockmgr superuser
100020 1 tcp 0.0.0.0.15.205 llockmgr superuser
100021 2 tcp 0.0.0.0.192.4 nlockmgr superuser
100068 2 udp 0.0.0.0.192.55 cmsd superuser
100068 3 udp 0.0.0.0.192.55 cmsd superuser
100068 4 udp 0.0.0.0.192.55 cmsd superuser
100068 5 udp 0.0.0.0.192.55 cmsd superuser
805306352 1 tcp 0.0.0.0.2.224 - superuser
100005 1 udp 0.0.0.0.192.99 mountd superuser
100005 3 udp 0.0.0.0.192.99 mountd superuser
100005 1 tcp 0.0.0.0.192.212 mountd superuser
100005 3 tcp 0.0.0.0.192.212 mountd superuser
100003 2 udp 0.0.0.0.8.1 nfs superuser
100003 3 udp 0.0.0.0.8.1 nfs superuser
100003 2 tcp 0.0.0.0.8.1 nfs superuser
100003 3 tcp 0.0.0.0.8.1 nfs superuser
100083 1 tcp 0.0.0.0.0.0 ttdbserver superuser
390113 1 tcp 0.0.0.0.31.1 nsrexec unknown
390103 2 tcp 0.0.0.0.31.19 nsrd unknown
390109 2 tcp 0.0.0.0.31.19 nsrstat unknown
390110 1 tcp 0.0.0.0.31.19 nsrjb unknown
390120 1 tcp 0.0.0.0.31.19 - unknown
390103 2 udp 0.0.0.0.31.60 nsrd unknown
390109 2 udp 0.0.0.0.31.60 nsrstat unknown
390110 1 udp 0.0.0.0.31.60 nsrjb unknown
390120 1 udp 0.0.0.0.31.60 - unknown
390105 5 tcp 0.0.0.0.31.23 nsrindexd unknown
390105 6 tcp 0.0.0.0.31.23 nsrindexd unknown
390107 5 tcp 0.0.0.0.31.53 nsrmmdbd unknown
390107 6 tcp 0.0.0.0.31.53 nsrmmdbd unknown
390104 105 tcp 0.0.0.0.31.46 nsrmmd unknown
390104 205 tcp 0.0.0.0.31.62 nsrmmd unknown
390104 405 tcp 0.0.0.0.31.52 nsrmmd unknown
390104 605 tcp 0.0.0.0.31.31 nsrmmd unknown
1342177279 4 tcp 0.0.0.0.195.176 - superuser
1342177279 1 tcp 0.0.0.0.195.176 - superuser
1342177279 3 tcp 0.0.0.0.195.176 - superuser
1342177279 2 tcp 0.0.0.0.195.176 - superuser
1342177280 4 tcp 0.0.0.0.237.68 - superuser
1342177280 1 tcp 0.0.0.0.237.68 - superuser
1342177280 3 tcp 0.0.0.0.237.68 - superuser
1342177280 2 tcp 0.0.0.0.237.68 - superuser

Client:

Z:\>rpcinfo -p localhost
program vers proto port
100000 2 tcp 7938
100000 2 udp 7938
390113 1 tcp 7937

I am currently attempting to use the dbgcommand utility (provided to us by Legato support) to debug the currently running savegroup.

root 17875 17802 0 11:55:26 ? 0:00 /opt/networker/bin/nsrexec -c erecdevweb-c2 -a -- erecdevweb-c
root 17873 17802 0 11:55:26 ? 0:00 /opt/networker/bin/nsrexec -c erecdevweb-c2 -a -- erecdevweb-c
root 17996 1559 1 11:55:43 pts/0 0:00 grep dev
root 17874 17802 0 11:55:26 ? 0:00 /opt/networker/bin/nsrexec -c erecdevweb-c2 -a -- erecdevweb-c
root 17872 17802 0 11:55:26 ? 0:00 /opt/networker/bin/nsrexec -c erecdevweb-c2 -a -- erecdevweb-c

root@tanya:/tmp# ./dbgcommand.dat -p 17872 Debug=9
root@tanya:/tmp# ./dbgcommand.dat -p 17873 Debug=9
root@tanya:/tmp# ./dbgcommand.dat -p 17874 Debug=9
root@tanya:/tmp# ./dbgcommand.dat -p 17875 Debug=9



We've had no success getting any additional info into the daemon logs using this program. If you have any other means of "debugging" the savegroup on the client or server, please let me know. Thanks again!

4 Operator


14.4K Posts

May 4th, 2006 13:00

Hi Dave,

You should test RPC from each side against the other. So, from the client run "rpcinfo -p server" and from the server run "rpcinfo -p client". This will probably work. Then test nsrexecd on both sides:
- from client do: echo print | nsradmin -p 390113 -i - -s server_name
- from server do: echo print | nsradmin -p 390113 -i - -s client_name

I expect this will work too, but we'd better check it out.
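Put together, the two-way check looks like this; run it once on each machine with PEER set to the other host (the variable name is just a placeholder):

```shell
# Sketch: run on the server with PEER=<client>, then on the client
# with PEER=<server>. Both commands should answer promptly; a hang
# or timeout here points at the transport, not at savegrp.
PEER=other_host_name

rpcinfo -p "$PEER"                               # OS portmapper (port 111)
echo print | nsradmin -p 390113 -i - -s "$PEER"  # nsrexecd (program 390113)
```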

I would suggest putting nsrd into debug mode - this will put all the NW processes into debug as well. I don't think the problem is with the savegrp binary, but rather something else.
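Two common ways to do that are sketched below. This assumes the dbgcommand utility mentioned earlier also works against nsrd, and that your nsrd accepts a -D startup level; the exact procedure varies by NetWorker release, so verify it with support first:

```shell
# Option 1 (assumption): raise the debug level on the running nsrd
# with the same dbgcommand utility used earlier. The PID lookup is
# illustrative only.
NSRD_PID=$(ps -ef | awk '/[n]srd/ {print $2; exit}')
./dbgcommand.dat -p "$NSRD_PID" Debug=9

# Option 2 (assumption): restart the server daemons with a startup
# debug level; per the advice above, the child NW processes pick up
# the debug setting as well.
nsr_shutdown
nsrd -D9
```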

I have never before seen ticots (loopback transport providers) registered with the portmapper - no idea whether that could mess things up or not - I guess not (I really have no experience with it). My guess is it is not relevant (otherwise the other clients would have the same problem). Given that only one client is giving you trouble, here is a set of perhaps silly questions, but you never know...
- is that client specific in anything (e.g. does it run some particular application)?
- do you have the latest patch level on it?
- has the client been rebooted since the trouble started - do you see a longer period of things going well after a reboot, or is there no connection?

For savegrp itself, use truss/(s)trace just to see what it gives you before any debug steps. Do you see it looping on something, or sleeping and doing nothing? When you go to the client while things are bad, do you see activity from save.exe? Does it use CPU at a higher rate than usual? Is there anything in the event logs? While the save.exe started from the server is hanging, are you able to start a save from the client?
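For example, on the Solaris server it might look like this (the output path and PID lookup are illustrative; on a Linux box you would use strace -f -p instead):

```shell
# Sketch: attach truss to the hung savegrp and log its system calls;
# -f follows child processes, -o writes the trace to a file you can
# hand to support.
SAVEGRP_PID=$(ps -ef | awk '/[s]avegrp/ {print $2; exit}')
truss -f -o /tmp/savegrp.truss -p "$SAVEGRP_PID"
# Repeating syscalls suggest it is looping; sitting in poll()/read()/
# pause() with no new output suggests it is sleeping on something.
```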

When you stop the savegrp on the server, do you see save.exe going away on the client? If not, I would suspect a possible issue with the file system on that client (I've seen that a few times).