NetWorker: Troubleshooting Tape Library Problems in NetWorker
Summary: This article helps NetWorker support staff and NetWorker backup administrators approach and troubleshoot tape library-related issues.
Instructions
Before investigating tape library issues, it is important to consider the following:
- Tape DRIVES read and write data and labels to media and provide all media functions, but cannot move the tape cartridges themselves
- Tape LIBRARIES move tape cartridges from element to element (drives, slots, and import and export ports), but do not read or write any data
If NetWorker operations fail because of the inability to move tape cartridges, there are several possible general causes:
- Hardware or firmware issue with library robot or internals
- Connectivity issues from NetWorker host to library robotics over transport
- OS, driver, or compatibility issue between NetWorker host and library
- NetWorker configuration problem pertaining to tape library type, state, and addressing
Work through these steps in sequence to determine the nature of the problem and, where possible, solve it. If this document does not resolve the issue, its tests still narrow the problem down and help a specialist continue the work.
1. Environmental information
From the NetWorker Server and affected Storage Nodes:
- Hostname, OS type, and version.
- NetWorker version and build number.
- Output of the 'inquire' command showing tapes and libraries.
- Zipped copy of the current nsrdb (to preserve current jukebox information for rollback, if needed):
- Linux: /nsr/res/nsrdb
- Windows (Default): C:\Program Files\EMC NetWorker\nsr\res\nsrdb
- Storage Node, NetWorker name for library and list of affected nodes, devices and volumes
- Commonalities of the problem (specific volumes, specific drives, specific nodes, and so forth)
- nsrget -o:d on affected server and nodes.
Avoid -o:d on any host whose tapes are busy writing. You can check this from the NetWorker Management Console (NMC) under Monitoring -> Devices.
The following article provides information about getting and using NSRGET: NetWorker: How to Use the NSRGet NetWorker Data Collection Tool
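The rollback copy of nsrdb mentioned above can be taken in several ways; the following is a minimal Python sketch using only the standard library. The function name `archive_nsrdb`, the `label` argument, and the archive naming are illustrative assumptions, not part of NetWorker.

```python
import shutil
from pathlib import Path


def archive_nsrdb(nsrdb_dir: str, dest_dir: str, label: str) -> Path:
    """Zip the resource database directory and return the archive path.

    nsrdb_dir is the directory to preserve (for example /nsr/res/nsrdb).
    dest_dir and label are hypothetical choices made by the operator.
    """
    src = Path(nsrdb_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"resource database not found: {src}")
    # make_archive appends ".zip" to the base name it is given.
    base = Path(dest_dir) / f"nsrdb_{label}"
    archive = shutil.make_archive(
        str(base), "zip", root_dir=str(src.parent), base_dir=src.name
    )
    return Path(archive)
```

For example, `archive_nsrdb("/nsr/res/nsrdb", "/tmp", "before_change")` would produce `/tmp/nsrdb_before_change.zip` with the `nsrdb` directory at the top level of the archive.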
2. Test readiness of library
- Check to see if the Storage Node owner of the affected library is enabled and ready:
- In the NMC, go to Devices -> Storage Nodes.
- Ensure View -> Diagnostic Mode is enabled.
- Check the Enabled and Ready columns in the pane on the right.
- If a Storage Node that is expected to be Enabled is not Enabled, right-click the Storage Node and click Enable/Disable to Enable it.
- If the Storage Node fails to become Ready in a minute or two, you must follow up separately; the library is not responsive because its Storage Node is inaccessible.
- Check to see if the affected Library is enabled and ready:
- In the NMC, go to Devices -> Libraries.
- Ensure View -> Diagnostic Mode is enabled.
- Check the Enabled and Ready columns in the pane on the right. If the Library shows a Ready state, proceed to section [4].
- If it is not enabled, you can right-click the library instance on the left and select 'Enable/Disable' to reenable it.
- Once it is enabled, wait a minute or two, click again on the Libraries container, and see if a green tick appears in the 'Ready' column.
- If the library does not become ready, right-click the library instance on the left, select Properties, and on the General tab, ensure that the Control Port value matches the scsidev@#.#.# address you see from the inquire command.
- If the Control Port does not match, set Enabled to No and click OK; then reenter Properties, and change the Control Port to match the address discovered by inquire. After updating the Control Port, change Enabled back to Yes, and click OK again to reenable. Allow a minute or two to see if the library becomes Ready.
- Finally, if correcting the Control Port value does not allow the library to become Ready, open the library's Properties a final time; under the Advanced tab, set Debug Trace Level to 5; then disable and reenable the library again to capture two minutes of the startup sequence in the daemon log.
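The Control Port comparison above is easy to get wrong by eye when a host reports many devices. The sketch below shows one way to pull autochanger addresses out of saved inquire output and compare them with the configured value. It assumes only what this article states: that library lines contain the word 'Autochanger' and an address of the form scsidev@#.#.#. The `SAMPLE` lines are made up for illustration and are not real inquire output.

```python
import re
from typing import List

# Address form referenced in the article: scsidev@ plus three dot-separated numbers.
ADDRESS = re.compile(r"scsidev@(\d+\.\d+\.\d+)")

# Hypothetical lines for illustration only; real inquire output will differ.
SAMPLE = """\
scsidev@0.0.0:EXAMPLE  Drive-A   0001|Tape, /dev/nst0
scsidev@0.1.0:EXAMPLE  Robot-X   0001|Autochanger (Jukebox), /dev/sg2
"""


def autochanger_addresses(inquire_output: str) -> List[str]:
    """Return scsidev addresses found on lines that mention 'Autochanger'."""
    found = []
    for line in inquire_output.splitlines():
        if "Autochanger" not in line:
            continue
        match = ADDRESS.search(line)
        if match:
            found.append("scsidev@" + match.group(1))
    return found


def control_port_matches(control_port: str, inquire_output: str) -> bool:
    """True if the configured Control Port is one of the reported autochangers."""
    return control_port.strip() in autochanger_addresses(inquire_output)
```

With the sample text, `control_port_matches("scsidev@0.1.0", SAMPLE)` is true, while the tape drive address `scsidev@0.0.0` is rejected.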
When reenabling the library in section [2.2], confirm that nsrlcpd starts on the intended storage node and that it does not stop or restart on its own. If the Process ID (PID) keeps changing, the process is probably being stopped or killed by the software, or is dumping core. Also be alert for name resolution issues between server and storage node, which can prevent startup. The name the server resolves for the node should match both the node's own nsrladb name and the server's name for the node.
- See Troubleshooting Tape Library Readiness Problems in NetWorker for advanced troubleshooting information about library readiness problems.
- See NetWorker Troubleshooting Guide: Process Crashes and Core Dumps if you see or suspect that the Node's nsrexecd, nsrsnmd, or nsrlcpd are core dumping.
For a detailed overview of NetWorker per-host processes, see: NetWorker Processes and Ports
Messages regarding these services are logged in the host's daemon.raw:
- Linux: /nsr/logs/daemon.raw
- Windows (Default): C:\Program Files\EMC NetWorker\nsr\logs\daemon.raw
- NetWorker: How to use nsr_render_log to render .raw log files
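The PID check on nsrlcpd described earlier in this section can be done by hand, but sampling it at intervals makes restarts easier to spot. The following is a rough sketch for Linux hosts that assumes `pgrep` is available; the function names, sample count, and interval are arbitrary illustrative choices.

```python
import subprocess
import time
from typing import List, Set


def current_pids(process_name: str) -> Set[int]:
    """Return PIDs of processes whose name matches exactly (Linux, via pgrep -x)."""
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        capture_output=True, text=True, check=False,
    )
    return {int(token) for token in result.stdout.split()}


def restart_events(samples: List[Set[int]]) -> int:
    """Count transitions between consecutive samples where the PID set changed."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def watch(process_name: str, count: int = 6, interval_s: float = 20.0) -> List[Set[int]]:
    """Collect `count` PID samples, sleeping `interval_s` seconds between them."""
    samples = []
    for i in range(count):
        samples.append(current_pids(process_name))
        if i < count - 1:
            time.sleep(interval_s)
    return samples
```

Calling `restart_events(watch("nsrlcpd"))` right after reenabling the library gives a nonzero count if the process disappeared or came back under a new PID during the sampling window, which is the pattern the article associates with the process being killed or dumping core.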
3. Determine responsiveness of the library:
If the library does not become ready, and you cannot determine a cause, ensure the library itself can be contacted:
- Check the inquire output of the Node from [1.3] and ensure that the library appears in the output as 'Autochanger', and note the SCSI #.#.# address.
- If the library does not appear in the inquire output, check to ensure that the OS can detect it. Solaris hosts do not report the library if it is configured and enabled in NetWorker. For assistance, see Troubleshooting Tape Library Detection Problems in NetWorker.
- Ensure that the library is responsive to basic library commands. Using the SCSI address in [3.1], run: 'sjisn #.#.#'. For more information about library test commands, check Troubleshooting Tape Library Access Problems in NetWorker.
- If the SJI commands fail, consider the possibility of transport or hardware problems: See Troubleshooting Tape Library Hardware Problems in NetWorker for assistance.
- If the OS discovers the library and it responds correctly to SJI commands, but NetWorker still fails to discover it, try using the jbconfig command and selecting option 2; if this does not work, try option 4, and manually supply the library address and configure as a standard library (option 56). See Troubleshooting Tape Library Configuration Problems in NetWorker for more details.
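The checks in this section form a short decision sequence. The sketch below encodes one reading of it so that results can be recorded consistently; the class and field names are hypothetical, and the returned text only paraphrases the steps above.

```python
from dataclasses import dataclass


@dataclass
class LibraryChecks:
    listed_in_inquire: bool     # library appears as 'Autochanger' in inquire output
    sji_commands_succeed: bool  # sjisn against the library address succeeded
    networker_discovers: bool   # NetWorker itself discovers the library


def next_step(checks: LibraryChecks) -> str:
    """Map the three observations to the follow-up this section suggests."""
    if not checks.listed_in_inquire:
        return "Confirm the OS can detect the library (detection problems)."
    if not checks.sji_commands_succeed:
        return "Suspect transport or hardware (hardware problems)."
    if not checks.networker_discovers:
        return "Reconfigure with jbconfig (configuration problems)."
    return "Library is reachable; continue with section 4."
```

The order matters: each later check is only meaningful once the earlier one has passed, so the function returns at the first failing observation.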
4. Test tape movement and volume health
If the library is ready and appears to be responsive, but is having problems loading volumes, there are many different possible causes.
- Empty the entire library if at all possible; if you can stop other operations, right-click, and Reset the library from the Devices -> Libraries tab.
- Attempt to load a single tape cartridge into a single device, where both are thought to be affected by load problems; unload after each attempt as needed.
- Compare with the same volume in different drives, and different volumes in the drive believed to be affected; note the errors, and patterns, if any.
- If the volume load reliably fails, irrespective of the device, try the following label check:
- Load the volume without mounting; if the tape cartridge moves without error, you have verified the arm is mechanically functional.
- Run nsrmm -pv -f networker_device; if it responds with a verified label, you have verified the media is also valid and healthy.
- In the properties of the Library, with Diagnostic Mode enabled, go to the Timers tab, and set Load Sleep to 60 before clicking OK.
- Unload the volume, then attempt to reload the volume; if it now succeeds, the issue was likely a timing issue (you may experiment with lower Sleep values until it starts to fail again).
- If the nsrmm command failed, further testing is required. Disable the drive in question in NMC by right-clicking and selecting Enable/Disable.
- Run the scanner command on the device:
- For the NetWorker server's local storage node, run: scanner -nizv local_device
- For a "remote" NetWorker Storage Node, run: scanner -s server -nizv local_device
- Break after ~20 lines and check the label read messaging; success is indicated by '8936:scanner: scanning media_type tape volume_name on device_name'.
- If scanner returns the message 'unexpected file number, wanted 2, got higher_number', data loss has occurred, most probably due to a SCSI reset; check Troubleshooting Overwritten Labels and SCSI Resets in NetWorker.
- If scanner returns the message '8945:scanner: Read: -1 bytes', determine if the volume can be read on other nodes or drives, and determine the trend of the problem. If you find some volumes can be read on some nodes, but not others, and the device is LTO-4 or above, consider drive decryption failures: LTO Hardware Encryption and NetWorker.
For more advanced media verification information, see: Troubleshooting Media Mounting Problems in NetWorker.
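When scanner output is captured to a file for several volumes, the three messages called out in this section can be matched mechanically. The sketch below is built only from the message templates quoted in this article; the category names are invented for illustration, and the assumption that higher_number prints as plain digits is a guess.

```python
import re

# Patterns derived from the message templates quoted in this article.
LABEL_OK = re.compile(r"8936:scanner: scanning .+ tape .+ on .+")
# Assumption: the "got" value is printed as plain digits.
FILE_NUMBER = re.compile(r"unexpected file number, wanted 2, got \d+")
READ_ERROR = re.compile(r"8945:scanner: Read: -1 bytes")


def classify_scanner_output(text: str) -> str:
    """Return a coarse category for captured scanner output.

    Failure messages are checked first, so a capture that contains both
    a successful label line and an error is reported as the error.
    """
    if FILE_NUMBER.search(text):
        return "data_loss_suspected"
    if READ_ERROR.search(text):
        return "read_failure"
    if LABEL_OK.search(text):
        return "label_ok"
    return "unknown"
```

A result of `data_loss_suspected` points to the SCSI reset article, `read_failure` to the cross-node and cross-drive comparison (including the encryption check for LTO-4 and above), and `unknown` means the capture should be read by hand.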
5. Test for drive ordering problems
If load and mount commands succeed but label reads or simple mounts fail, the issue may involve incorrect drive ordering.
- Empty the entire library if at all possible; if you can stop other operations, right-click, and Reset the library from the Devices -> Libraries tab.
- Attempt to load a single tape cartridge into a single device, where both are thought to be affected by load problems; unload after each attempt as needed.
- Compare with the same volume in different drives, and different volumes in the drive believed to be affected; note the errors, and patterns, if any.
- If a load reliably fails, try the following label check:
For more advanced assistance with library load issues, see: Troubleshooting Tape Library Load Problems in NetWorker.
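The cross-testing step above (same volume in different drives, different volumes in the same drive) produces a small grid of results, and the pattern in that grid is what matters. The following sketch is one way to summarize it. The tuple layout and the two-attempt threshold are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (volume, drive, succeeded): one recorded load or label-read attempt.
Attempt = Tuple[str, str, bool]


def suspects(attempts: Iterable[Attempt]) -> Dict[str, List[str]]:
    """List volumes that failed in every drive tried, and drives that
    failed with every volume tried. An item needs at least two distinct
    partners before it is flagged, so a single failure proves nothing.
    """
    by_volume: Dict[str, Dict[str, bool]] = defaultdict(dict)
    by_drive: Dict[str, Dict[str, bool]] = defaultdict(dict)
    for volume, drive, ok in attempts:
        by_volume[volume][drive] = ok
        by_drive[drive][volume] = ok

    def all_failed(results: Dict[str, bool]) -> bool:
        return len(results) >= 2 and not any(results.values())

    return {
        "volumes": sorted(v for v, r in by_volume.items() if all_failed(r)),
        "drives": sorted(d for d, r in by_drive.items() if all_failed(r)),
    }
```

If a volume fails everywhere, the media is the likely problem; if a drive fails with every volume, attention shifts to that drive or to how it is mapped in the library configuration.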
If all these tests have failed and you have made no further progress, document your results for each step in this article and engage NetWorker support. Clear details are essential for expediting solutions and for limiting "repeated steps".