CarlosRojas
1.7K Posts
0
February 4th, 2013 07:00
Hi Greg,
Check on the storage nodes whether there are any files under:
/nsr/cores/nsrmmd
Run ls -ltr and, if any files are there, check the creation dates of the cores.
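A quick sketch of that check as a reusable snippet (the function name is just for illustration):

```shell
# list_cores DIR: list any core files under DIR, oldest first.
# The mtime of each core tells you roughly when the daemon died.
list_cores() {
    files=$(find "$1" -type f 2>/dev/null)
    if [ -n "$files" ]; then
        # simple names only under /nsr/cores, so plain word splitting is fine here
        ls -ltr $files
    fi
}

# On each storage node, something like:
#   list_cores /nsr/cores/nsrmmd
```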
This can happen for several reasons and I've seen it before; however, the behavior you are describing is not what I would expect from an nsrmmd core dump.
Thank you.
Carlos.
ble1
4 Operator
•
14.4K Posts
0
February 4th, 2013 07:00
Let's put it this way: from time to time, due to bad code or an external factor, a daemon will die and dump data which may help engineers troubleshoot what happened. NetWorker keeps those dumps in /nsr/cores. For example:
# du -sk /nsr/cores/*
8 /nsr/cores/dvdetect
8 /nsr/cores/nsrexecd
2405496 /nsr/cores/nsrmmd
The example above is obviously from a machine which has such issues. When the condition I described happens, you will see messages in daemon.raw like:
"Auth. session limit greater than %d for mmd# %d on device %s. 3 1 2 10 1 1 1 21 78 rd= : /_AF_readonly"
In that case, resetting the device from the server will help and life goes on. I have a script running from cron which sends me a notification whenever a core dump happens. Of course, this might not be your case at all.
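Such a cron check could look something like this minimal sketch (the stamp-file approach and the paths are my own assumptions, not the actual script):

```shell
# check_new_cores DIR STAMP: print core files that appeared since the last run.
# A stamp file records when we last looked; anything newer gets reported.
check_new_cores() {
    dir=$1
    stamp=$2
    if [ -f "$stamp" ]; then
        new=$(find "$dir" -type f -newer "$stamp" 2>/dev/null)
    else
        new=$(find "$dir" -type f 2>/dev/null)   # first run: report everything
    fi
    touch "$stamp"
    if [ -n "$new" ]; then
        printf 'New core dumps found:\n%s\n' "$new"
    fi
}

# From cron, e.g. every 30 minutes, mailing any output to yourself:
#   */30 * * * * /usr/local/bin/check_new_cores.sh 2>&1 | mailx -s "core dump" you@example.com
```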
Castromotorbox
2 Intern
•
217 Posts
0
February 5th, 2013 00:00
Hi, thanks for your answer.
Unfortunately, we didn't find any /nsr/cores folder on our SN, or even on our NetWorker server.
Still waiting for EMC support...
We found in daemon.raw of storage node the following message:
error: Lost connection to media database: RPC send operation failed; errno = An existing connection was forcibly closed by the remote host.
Could it have any link?
Thanks
Greg
ble1
4 Operator
•
14.4K Posts
0
February 5th, 2013 05:00
Only if you find the time matches when the device was unmounted. When you check the last activity on that device, before you noticed something was wrong, was it around the time you saw the lost connection message?
Castromotorbox
2 Intern
•
217 Posts
0
February 5th, 2013 07:00
Well,
I found that the first device was dismounted at 19:35, and then we got this error message:
19:35:28 error: Lost connection to media database: RPC send operation failed; errno = An existing connection was forcibly closed by the remote host.
Just before that, I see:
19:35:05 nsrmmd #280, with PID 2924, at HOST xxxx
19:35:03 error: Lost connection to media database: RPC send operation failed; errno = An existing connection was forcibly closed by the remote host.
19:35:02 version: major: 2, minor: 4, patch: 1, engineering: 0, build: 289644
19:04:53 Unable to remove ssid 2850050624 in volume for rd=XXX:DATAGIA_10_0013_C:
can't fetch save set 2850050624
This device was not affected by the dismount error at this time.
Does it help?
I'm afraid it doesn't :-(
Thanks
ble1
4 Operator
•
14.4K Posts
0
February 5th, 2013 12:00
Out of curiosity, which NW version do you run? Because, for example, ddboost version I use is:
libDDBoost version: major: 2, minor: 4, patch: 2, engineering: 2, build: 347703
I believe both SP4 and SP5 use the same version, but yours seems to be lower. Perhaps updating NW would be something to consider too.
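If it helps, the version strings can be pulled straight out of the daemon log. A rough sketch (nsr_render_log is the standard NetWorker tool for rendering .raw logs; the log path is assumed to be the default):

```shell
# grep_versions LOGFILE: show the version lines NetWorker logged at mmd startup,
# like "version: major: 2, minor: 4, patch: 1, engineering: 0, build: 289644"
grep_versions() {
    grep -i "version: major" "$1"
}

# Render the raw log first, then search it:
#   nsr_render_log /nsr/logs/daemon.raw > /tmp/daemon.log
#   grep_versions /tmp/daemon.log
```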
Castromotorbox
2 Intern
•
217 Posts
0
February 7th, 2013 00:00
Hi, our NetWorker version is 7.6.4.1, build 1049.
We hit the same problem again yesterday.
It happened as we created some 50 new devices on the Data Domain.
As we have two different datazones (two NetWorker servers) on the same Data Domain (two different file systems), we are now sure that the Data Domain is the problem,
because creating new devices on NetWorker and file system 1 dismounted devices on NetWorker and file system 2.
Any Idea?
Thanks
ble1
4 Operator
•
14.4K Posts
0
February 7th, 2013 11:00
At one location we also have 2 datazones against one DD and we have never had these issues (not aftd, but DD dev). Try 7.6 SP5 instead.
Tomo3
9 Posts
0
February 7th, 2013 13:00
How many devices do you have in total?
As you might already know, each DD box (depending on the model) has a limit on the total number of open write streams.
The DD670 is limited to 90 write streams (the DD880, for example, is limited to 180).
So if you have more than 90 write streams open at a time, something usually fails due to that limit.
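A back-of-the-envelope check along those lines (the sessions-per-device figure is an assumption; check the "target sessions" setting on your devices):

```shell
# streams_needed DEVICES SESSIONS: worst-case concurrent write streams
# if every device runs at its target session count simultaneously
streams_needed() {
    echo $(( $1 * $2 ))
}

# Illustrative numbers: 50 new devices at 4 target sessions each,
# against a DD670's 90-stream write limit:
needed=$(streams_needed 50 4)
cap=90
if [ "$needed" -gt "$cap" ]; then
    echo "potential overload: $needed streams vs cap of $cap"
fi
```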
coganb
736 Posts
0
February 8th, 2013 00:00
Hi Greg,
Do you have a firewall in there somewhere that could be cutting the idle connections? If so, implementing a tcp keepalive should fix that. You'll find instructions in the Configuring TCP Networks and Network Firewalls for EMC NetWorker Technical Note p. 29.
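For reference, on a Linux storage node this usually means dropping the keepalive interval below the firewall's idle timeout with sysctls like these (the values here are illustrative only; the technical note gives the recommended settings):

```shell
# Send the first keepalive probe after 10 minutes of idle time,
# well under a typical 15-30 minute firewall idle cutoff.
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=5

# To persist across reboots, put the same keys in /etc/sysctl.conf.
```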
-Bobby
Castromotorbox
2 Intern
•
217 Posts
0
February 8th, 2013 02:00
Hi,
@ tomo:
We are now using approximately 150 devices on one storage unit and 60 on the second one,
but they are never used at the same time.
@Bobby: no firewall at all between the NW server and the Data Domain.
@Hrvoje: we will probably update our system.
Thanks all for your help.
Greg
Thierry101
2 Intern
•
326 Posts
0
February 14th, 2013 13:00
Hi Hrvoje,
is there an equivalent of /nsr/cores for Windows? Seeing this on a DD670 (5.1.1) with NetWorker 7.6.3.7:
Auth. session limit greater than 10 for mmd
Thanks
Thierry101
2 Intern
•
326 Posts
0
February 14th, 2013 14:00
I have checked the app/sys logs but found nothing.
We are using aftd...
ble1
4 Operator
•
14.4K Posts
0
February 14th, 2013 14:00
I believe Dr. Watson data is kept elsewhere, but I'm not sure - check the MS site for the exact location depending on your Windows version. I believe in a separate thread one user who also uses aftd mentioned he gets this error only when multiple reads are attempted.
ble1
4 Operator
•
14.4K Posts
0
February 14th, 2013 14:00
On Windows there are no core dumps, but I assume the equivalent would be a Dr. Watson event. I have seen this issue so far only with DD dev devices (not aftd, it seems, but perhaps I'm wrong), with both SP4 and SP5. With SP5 and the libDDBoost.so from NW144972, the frequency of these events has been drastically reduced (interestingly enough, SP4 and the same patch don't go that well together) - probably due to the nsrmmd patch for core dumps during cloning.