NetWorker: Troubleshooting Tape Library Load Problems in NetWorker
Summary: This article is intended to assist support staff and administrators in troubleshooting library load problems at the library or application level, and in determining whether the problem is logical or physical and whether it lies with the robot, drive, or tape cartridge. ...
Symptoms
- Sporadic or consistent errors in loading tape cartridges in the library
- Unable to perform backups or recoveries from library media
- Library is detectable, confirmed functional and Ready
- Unable to perform load or label operations
- Tapes being marked 'unlabeled'
- Possible ASC/ASCQ/SCSI SENSE errors or messages in system or application logs
- Sporadic or consistent errors performing specific or random library operations
Cause
If library configuration worked previously and suddenly encounters an issue, consider possible changes that may be impeding detection and configuration:
- Robot, switch or adapter firmware, driver or configuration change
- Addition, replacement or removal of drives, tape cartridges or other library components
- Change of NetWorker software version, Operating System patches
- Any hardware event such as power loss or reboot of any component in the data path
- Discrepancies between NetWorker configuration and library (for example, tape cartridges moved outside of NetWorker's control)
If the library has never worked, confirm that the hardware is supported in the NetWorker Hardware Compatibility Guide (requires Dell support account sign-in). Remember that a library can be partially functional; discovery alone does not guarantee usability or supportability.
Resolution
To troubleshoot library load problems, first consider the last known changes, then break the load process down into its basic steps and test each step individually.
The required data is collected by NSRGet when run with the -o:d switch. See NetWorker: How to Use the NSRGet NetWorker Data Collection Tool.
The items which are not collected are restricted to those operations which might be considered dangerous if attempted manually.
Library Load: Communications
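- Again, ensure that the library is responsive and ready before proceeding. If not, see Troubleshooting Tape Library Detection Problems in NetWorker and Troubleshooting Tape Library Access Problems in NetWorker.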
Library Load: Physical Operation
- Check that library operations are physically possible at a basic level. Test only when the library is not otherwise active, and return tape cartridges to their original locations afterward. First, list the library's current contents:
sjirdtag <changer address>
Then move tape cartridges between elements and back again:
sjimm <changer address> <drive|slot|inlt|mt> <element_number> <drive|slot|inlt|mt> <element_number>
- There are some situations where errors may be expected; for example, a library for which Auto-Eject is not enabled at the library level returns an error when attempting to move a tape cartridge from a drive to any other element (the tape cartridge must first be ejected separately with an mt -f <device_handle> offline command before it can be moved out of the drive).
- If errors, particularly SCSI ASC/ASCQ code errors, are returned sporadically or consistently when attempting robot operations, consider escalation to the library vendor for review.
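The physical test described in this section can be sketched as a small shell function. This is a minimal sketch, not a supported tool: the helper names run and round_trip are hypothetical, and the changer address, slot, drive, and device handle in the usage line are placeholder values. By default it only prints the commands (DRY_RUN=1) so the sequence can be reviewed before anything is moved.

```shell
# Sketch: move one cartridge slot -> drive -> original slot.
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# round_trip <changer address> <slot number> <drive number> <device handle>
round_trip() {
    changer=$1; slot=$2; drive=$3; handle=$4
    run sjirdtag "$changer" || return 1                           # contents before the test
    run sjimm "$changer" slot "$slot" drive "$drive" || return 1  # slot -> drive
    run mt -f "$handle" offline || return 1                       # eject first (needed when Auto-Eject is off)
    run sjimm "$changer" drive "$drive" slot "$slot" || return 1  # drive -> original slot
    run sjirdtag "$changer"                                       # contents after the test
}
```

For example, round_trip scsidev@0.1.0 3 1 /dev/nst0 prints the five commands in order. Stopping at the first failing command keeps the failing step and its error output easy to identify.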
Library Load: Logical Operation
Once we have established that physical operations are error free (at least superficially), we can attempt to trace the problem within NetWorker.
- Determine the library's layout and ensure its readiness, comparing the NSR Jukebox state information against the robot's tape cartridge information:
nsrjb [-j <library_name>] -C
sjirdtag <changer address>
- Attempt to load an affected tape into an affected drive in high verbosity:
nsrjb [-j <library_name>] -lvvvvv -f <device_handle> -S <slot_number>
If the library loads repeatedly without issues, the load problem may result from specific situational factors rather than a persistent fault. All efforts should be made to isolate the condition that leads to the load failure, and debugging the condition should follow (see below).
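To help isolate a sporadic failure, the verbose load can be repeated in a loop. The following is a minimal sketch under stated assumptions: the helper names run and repeat_load are hypothetical, the unload step between attempts (nsrjb -u) is an addition not shown in this article, and the library name, device handle, and slot in the usage line are placeholders. It prints the commands by default (DRY_RUN=1).

```shell
# Sketch: repeat the verbose load test to catch an intermittent failure.
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# repeat_load <library name> <device handle> <slot number> <iterations>
repeat_load() {
    library=$1; handle=$2; slot=$3; count=$4
    i=1
    while [ "$i" -le "$count" ]; do
        echo "# attempt $i of $count"
        if ! run nsrjb -j "$library" -lvvvvv -f "$handle" -S "$slot"; then
            echo "# load failed on attempt $i" >&2
            return 1
        fi
        # Assumption: unload the drive so the next attempt starts from the slot.
        run nsrjb -j "$library" -u -f "$handle" || return 1
        i=$((i + 1))
    done
}
```

For example, repeat_load lib1 /dev/nst0 3 20 would attempt twenty load and unload cycles and stop at the first failure, leaving the verbose output of the failing attempt at the end of the log.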
- If regular load operations fail, particularly if the volumes are marked as 'unlabeled', then the label read has failed during the load attempt (causing the mount to fail). Attempt to reload the same tape into the same drive in high verbosity, without mounting:
nsrjb [-j <library_name>] -lnvvvvv -f <device_handle> -S <slot_number>
- Perform a standalone label verification, to test to see if the label read failure was transient, or is consistent:
nsrmm -pvvvvv -f <device_handle>
- If the label is read successfully, then the problem may be that label reads are being attempted before the tape device is ready after the tape has been physically loaded. In this case, try setting the following variable in the System Environment or startup script:
MAX_LOAD_RETRIES=10
If the load operation still seems to fail during a compound load/mount (label read) operation after setting the variable, go to the Debugging section.
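On UNIX, a variable set in a startup script is only inherited by the daemons if it is exported. A minimal sketch of the fragment follows; where it goes in the startup script depends on the platform, and it is an assumption here that the NetWorker services must be restarted before they pick up the value.

```shell
# Fragment for the NetWorker startup script (UNIX).
# Assumption: the daemons read this only at start, so restart them afterward.
MAX_LOAD_RETRIES=10
export MAX_LOAD_RETRIES
```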
Library Load: Debugging
If all else fails, collect the appropriate data to assist debugging the problem before consulting subject matter experts (SMEs):
- Before reproducing the issue in NetWorker, change the debug trace level to 5 in the NSR Jukebox resource
- Also use dbgcommand to increase the debug level of the running nsrd and nsrmmgd processes to 5:
dbgcommand -n PROCESS_NAME Debug=5
- To disable:
dbgcommand -n PROCESS_NAME Debug=0
- NetWorker: Debug information levels
- Consider running truss/tusc/strace, pstack, or gcore/gencore on the appropriate nsrlcpd process prior to and during the problem event
- Set the debug variables in the System environment (Windows) or the startup script (UNIX) in order to get richer debugging data:
SJI_DEBUG=9 LUS_DEBUG=9 CDI_DEBUG=9 SCSI_DEBUG=9 JBDEBUG=9
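The debug settings in this section can be applied and removed with a short UNIX shell sketch. The helper names run, set_debug, and export_debug_vars are hypothetical; the dbgcommand invocations and variable names are the ones listed in this section. It prints the dbgcommand lines by default (DRY_RUN=1), and it is an assumption that the exported variables only affect daemons started afterward from the same environment.

```shell
# Sketch: raise and lower debug levels around a reproduction of the problem.
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_debug <level>: apply the level to the running nsrd and nsrmmgd processes.
set_debug() {
    level=$1
    for proc in nsrd nsrmmgd; do
        run dbgcommand -n "$proc" "Debug=$level" || return 1
    done
}

# export_debug_vars: export the debug variables for daemons started from this shell.
export_debug_vars() {
    for v in SJI_DEBUG LUS_DEBUG CDI_DEBUG SCSI_DEBUG JBDEBUG; do
        export "$v=9"
    done
}
```

A typical sequence would be set_debug 5, reproduce the load failure, then set_debug 0, so that the verbose logging covers only the problem event.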
If none of the above suggestions help and the evidence collected from the debug output suggests an internal library anomaly, engage your library vendor's support, as per Troubleshooting Tape Library Detection Problems in NetWorker and Troubleshooting Tape Library Access Problems in NetWorker. Otherwise, ensure that the debug output is escalated within NetWorker Support to pursue the possibility of a code defect.