Unsolved

17 Posts

1426

April 13th, 2007 05:00

hanging nsrmmd on Linux storage node

Hi everybody,

In our environment we have a serious problem with hanging nsrmmd processes on our Linux storage nodes.

Our Networker server runs W2K3 with Networker 7.3.2 jumbo patch; our Linux storage nodes run Redhat 3.0 and 4.0, also with Networker 7.3.2 jumbo patch.

Sometimes a bad tape causes a read/write (I/O) error on our Linux nodes, and we have to unload the tape via sjimm, because unloading via Networker won't work. The nsrmmd on the storage node then keeps hanging, and the drive cannot be used again.

We tried restarting the drive, the library, and the Networker processes on the storage node, and we tried killing nsrmmd with "kill -9", but nothing stops the process. The only thing that works is rebooting the storage node.

Any ideas what I can do to avoid rebooting the Linux node?

Thank You,
Christian

6 Operator • 14.4K Posts • 56.2K Points

April 13th, 2007 06:00

If you can't kill it with -9, then it is an I/O issue, which you can't fix from NW. When a process is in that state, the only thing you can do is reboot. What kernel do you use?

194 Posts

April 13th, 2007 06:00

First I would run 'ps aux | grep nsrmmd' to see what state it's in. Probably an uninterruptible state. Next, use a utility to get more information about the process. For the life of me I can't remember what it's called. Maybe go to a Linux forum and ask for it.
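As a rough sketch of that first check: the snippet below filters 'ps aux' output for nsrmmd processes whose STAT column starts with "D" (uninterruptible sleep). The function name, the sample rows, and the /usr/sbin path are made up for illustration; only the standard 'ps aux' column layout is assumed.

```python
def find_uninterruptible(ps_output, name="nsrmmd"):
    """Return (pid, stat, command) for matching processes in state 'D'.

    Expects text in the usual `ps aux` layout:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    hits = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)        # COMMAND may contain spaces
        if len(fields) < 11:
            continue                         # ignore short or blank lines
        pid, stat, command = fields[1], fields[7], fields[10]
        if name in command and stat.startswith("D"):
            hits.append((int(pid), stat, command))
    return hits


# Invented sample output, only to show the shape of the data.
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      2101  0.0  0.1  10000  2000 ?        Ss   09:00   0:01 /usr/sbin/nsrexecd
root      2345  0.0  0.2  20000  4000 ?        D    09:05   0:00 /usr/sbin/nsrmmd
root      2346  0.1  0.2  20000  4000 ?        S    09:05   0:02 /usr/sbin/nsrmmd
"""

stuck = find_uninterruptible(SAMPLE)
```

On a live storage node you would pass in the real output, for example from subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout. A process that stays in "D" across several checks is waiting inside the kernel, which would explain why "kill -9" has no effect.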

Vic

163 Posts

April 13th, 2007 09:00

Hi,

Are you sure that Networker uses the right drives when accessing the tapes? Just today I had an issue with a Windows storage node that sounds quite similar to yours. I suspect it is a persistent binding issue with the storage node. I have not had enough time to investigate this behaviour more closely yet. I also think we enabled persistent binding on our Windows storage node, but the behaviour is completely different...

Could you perhaps check the following: if you have two drives on your storage node, run an inventory on two tapes that are already in Networker's index, one tape on each drive. If it is a persistent binding problem, you'll probably get a message that says something like

'Found tape XXX1, expected tape XXX2, updating Networker database'

best regards.

17 Posts

April 18th, 2007 06:00

Hi everybody,
thank you for your suggestions.

I'm not too familiar with Linux, so sorry if my answers are not very exact.

To your questions:
- Our kernel release is 2.4.21-47.0.1.ELsmp, but from what I have seen so far, I think it could happen with another version, too.
- If it really is an I/O issue with a process in an uninterruptible state, and neither NW nor the operating system can handle it without a reboot, our only option would be to keep nsrmmd from getting into this state. But how?
- As far as I can see, NW uses the right drives. We check this with sjimm and mt, and everything seems to be right.

Finally, one question: on your Linux storage nodes, do you use /dev/nst* for your remote devices? And does anybody use the "scsidev" utility?

Yours,
Christian

6 Operator • 14.4K Posts • 56.2K Points

April 18th, 2007 07:00

- If it is really an I/O-Issue and a process in an
uninterruptable state and neither NW nor operating
system can handle the issue without rebooting our
only way would be to avoid nsrmmd to get in this
state. But how?

Actually, what happens to nsrmmd is a consequence, so the cause should be looked for somewhere else. Usually that is the SAN, but with Linux it can be the kernel too. Finding the cause is not easy and might involve expensive solutions such as SCSI/SAN sniffers.
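A cheap first hint, before any sniffer, might be to look at where in the kernel the stuck process is waiting. The sketch below assumes the kernel exposes /proc/<pid>/wchan, which I know from 2.6-era kernels; I am not sure the 2.4.21 kernel mentioned in this thread provides it. The function name and the proc_root parameter are my own, for illustration only.

```python
import os


def read_wchan(pid, proc_root="/proc"):
    """Return the kernel symbol a process is sleeping in, or None.

    Reads <proc_root>/<pid>/wchan. Returns None when the file is
    missing or unreadable, or when it holds "0" or nothing
    (meaning the process is not currently blocked).
    """
    path = os.path.join(proc_root, str(pid), "wchan")
    try:
        with open(path) as fh:
            symbol = fh.read().strip()
    except OSError:
        return None
    if symbol in ("", "0"):
        return None
    return symbol
```

Calling read_wchan(pid) for the hanging nsrmmd and seeing a symbol from the SCSI or tape driver layer would point toward the driver or the path to the drive rather than NW itself. It does not replace a proper trace; it only narrows down where to look.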

At least, one question: With your linux storage
nodes, do you use /dev/nst* for your remote devices?
And does anybody use the "scsidev"-utility?

So far we use nst, and there has been no need for scsidev (I believe due to persistent binding and/or udev usage).
