Welcome to this EMC Support Community Ask the Expert conversation. This discussion will focus on day-to-day NetWorker operations from the perspective of an EMC customer.
Dan Gauld (an EMC Partner) is in residency working at Mainland Information Systems. Dan has focused on backup/recovery and data storage for the last 8 years, and has performed numerous assessments, deployments and upgrades for NetBackup, Backup Exec, NetWorker, Avamar and Commvault infrastructures. Dan writes at backupbuddha.ca about backup and recovery for both the enterprise and the home consumer. You can follow Dan on Twitter @backupbuddha.
This discussion begins on July 29 and concludes on August 9. Get ready by following this page to receive updates in your activity stream or through email.
Hey everybody! I want to thank Mark and EMC for the invite to share some knowledge here. Being asked to participate in a forum where I am the expert is a little intimidating, especially given the subject matter. NetWorker has a long history. There are probably some fellow grey beards out there who may have first encountered NetWorker when it was a product of a company called Legato in the late 80’s. Back then, data centers were growing, as were network bandwidth capabilities. Prior to this, if you wanted to protect a server, you connected a tape drive and wrote and cron’d some scripts. Legato identified that this was quickly becoming an administrative nightmare. The introduction of a centralized backup server, drive sharing and job scheduling solved some problems.
Back then, nobody wanted to deal with backups, and the same is true today. Luckily for us, that vacuum created another niche in the IT industry: the backup recovery specialist. One of the great things about this role is that BRS touches all parts of an IT infrastructure. A backup recovery expert needs to have some rudimentary knowledge of databases, virtual and physical systems as well as specific OS knowledge of said systems. Never mind storage, which may include SAN connectivity and protection options specific to NDMP. It goes on….
I started in storage and data protection in 1996 when I joined the storage team for a growing Oil and Gas company here in Calgary. After a few years taking care of their backup infrastructure I landed at Mainland Information Systems again in residency, but quickly moved to professional services. I now find myself back in residency, helping a large client with an ever changing dynamic infrastructure trying to stay ahead of their growing data protection challenges. When I first landed here a couple of years ago, I was new to NetWorker. I quickly realized that although a lot of the concepts and ins and outs of the product were shared with other products I had supported in the past, this creature required some special care and feeding. Not to sound dramatic, but if this beast doesn't get enough attention it will bite you.
Creating documentation specific to new learnings is a great way to solidify knowledge, but I, like most of you, really dislike creating and maintaining documentation. So I started backupbuddha.ca as a way of documenting my new NetWorker specific knowledge gained in the day to day operations. I want to share here in this forum the little bit of what I’ve learned about the inner workings of NetWorker for those who may be new to the product. What do you, the NetWorker administrator, need to do on a daily basis to ensure the infrastructure is healthy? Feel free to ask any questions, but EMC support is your best bet if you have an issue. I open tickets with EMC ALL THE TIME. Their support engineers are the best in the business. I take and share those learnings on backupbuddha.ca to ensure I retain some of that knowledge in my back pocket.
In addition to EMC support, I would be remiss if I did not mention my acquaintance Preston de Guise. If you have ever had to perform a Google search on anything NetWorker related, you’ve probably stumbled across his website nsrd.info. His site is as valuable as support.emc.com and ECN to the NetWorker administrator. As well, his book Enterprise Systems Backup and Recovery: A Corporate Insurance Policy is a great resource for backup administrators of all products and managers of backup portfolios alike. I very much hope to someday buy him more than a few pints at EMC World in Vegas to thank him for his invaluable contributions to the NetWorker community.
Anyway, let’s get started. The first post I’m planning on is specific day to day operations of NetWorker. You just stumbled into the office in the morning. Coffee in hand, you need to check out NetWorker and make sure all is well. What should you look at? What should you look for? Exciting? Not really, but stay tuned. I’ll try to make this as entertaining as possible.
So here is my day so far. I woke up and attempted to determine if that was a good or bad thing. I decided I was indifferent and then wondered why I was in pain. Oh yeah! Leg day at the gym yesterday! Then my morning routine begins. This usually entails dragging my wife out of bed. She is not a morning person, AT ALL.
Today was a little easier as she was starting her new job! Cat fed and watered, then the Dan gets fed and watered. I dress in my typical uniform of jeans and a black v neck t-shirt. I should not be dressing so casually when I work on site with a client. I usually apologize when I run into my client or manager and say I was pulling cable in the data centre and didn't want to get my nice clothes dirty. I pull a lot of cable. Today, I actually am at the data centre. It's nice here. We have a small office to work from where it’s quiet and I can hide from the world.
Now I need coffee. I'll be right back.....
OK, coffee in hand! I may or may not also have two chocolate glazed donuts. I will neither confirm nor deny.
So first, let's log in to the NMC console. Did it work? Excellent! NetWorker is up and running! Small victories, my friends. Embrace these.
Before launching the app, I like to take a look at the events.
Here we see any outstanding alerts. These may be failed backups or other warnings. Today is looking like a good day. A few failures, nothing we can’t handle.
Let’s launch the console and go to the monitoring tab.
I know what you're thinking. Really, this guy is going to show us the NetWorker monitoring tab? Stay with me, it will get better.
The key thing I want to impart here is what to look for. You want to look and see what backups failed last night. If you have a healthy environment, most of your backup failures can be chalked up to what I call "an act of god". We live in an imperfect world. People sometimes spill milk. Maybe you burnt the toast, and sometimes backups fail. For most backup failures you will not go down a rabbit hole to find the root cause. If you can, re-run it. Does it work? Great! Go on with your life. This may sound obvious to some of you, but I have actually had debates with co-workers who thought every backup failure should be investigated to the nth degree. Ok, this happened once, and the guy was kind of a tool.
The point is, in a healthy environment, you are going to have backup failures. In my environment most of the failures are related to the server team shutting down or decommissioning a server AND NOT TELLING ME! It's not like I get mad about it or anything, so yeah. I investigate when a client or a specific client component fails repeatedly. We will get into some basic client backup troubleshooting later.
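If you want to separate the one-off "acts of god" from the clients that keep failing, a small tally does the job. This is only a rough sketch: the (night, client) pairs are an assumed input format and the sample data is made up, so how you pull failures out of your own reports is up to you.

```python
from collections import Counter


def repeat_offenders(failures, threshold=3):
    """Return clients that failed on at least `threshold` distinct nights.

    `failures` is an iterable of (night, client) pairs. This input format
    is an assumption for illustration, not something NetWorker emits.
    """
    # De-duplicate so several failed save sets on one night count once.
    nights = {(night, client) for night, client in failures}
    counts = Counter(client for _night, client in nights)
    return sorted(client for client, n in counts.items() if n >= threshold)


# Hypothetical sample data, not real NetWorker output.
sample = [
    ("Mon", "fileserver01"),
    ("Tue", "fileserver01"),
    ("Tue", "fileserver01"),
    ("Wed", "fileserver01"),
    ("Tue", "sqlprod02"),
]

print(repeat_offenders(sample))
```

Anything that shows up in that list is worth a closer look; anything that does not probably just needs a re-run.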
Also look at the alerts and log section. These areas are good for some immediate feedback on the current health of the environment. There are some more in-depth checks we will get into later.
So I have noted the backup failures. I see I have some jobs running? I like to check to see if the jobs are indeed running or are they hung up?
This morning, the running jobs are for some of the filers that are backing up over NDMP.
Let's look at one of the jobs.
What I want to do is see if the job is running and actually has tapes mounted.
Note the volume names in the job details.
Then let’s go over to devices.
What I'm looking for is some correlation with the job details of the previous screen. Are the volumes loaded? More importantly, are the drives writing and at what speed?
Part deux will be coming up later today, but I have some work to do now and more coffee to drink.
I’m back! So let’s finish off our daily checks. Next up is checking devices. Given everything appears OK, I’m not anticipating an issue, but I like to keep an eye on these things.
Most larger environments will have a mix of disk and probably tape. Tape!? Yup it’s still around. Don’t pretend it doesn't have a place. It does and will continue to for some time. Check out the devices and look for any that may be down or in service mode. If there are any down you will need to go to your library admin console for further troubleshooting.
The same goes for any disk storage units. I have a couple of DD990’s here. I like to check to make sure the disks are mounted and good to go.
Also, if you have a very busy environment, you may want to have a look at any backups that run during the day. With tape, there are a lot of moving parts. Not just the robot and the drives themselves. There is also the NetWorker database that tracks and allocates media. Not to mention the relationship to the devices from the OS hosting NetWorker. In short, there are a lot of points of failure for your devices. So, sometimes I will watch the library portion of the GUI. Tomorrow we will look a little at the native NetWorker alerts that can be configured to help you keep on top of some of these.
Hey Guys, it has been a busy week. The next topic I wanted to touch on was around some of the other items you should keep an eye on that may indicate issues. The daemon.log captures all NetWorker operations and associated alerts, warnings and errors. I have seen issues that resulted in data that was unrecoverable, where it was not obvious that there was an issue, despite all the daily checks previously mentioned being completed. It's a good idea to go through the daemon.log and grep out any warnings or errors. All warnings and errors should be actioned. Some of the messages may, and most likely will, seem esoteric. EMC support should be engaged to help understand the error and ramifications that may result.
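A plain grep works fine for this, but if you would rather have line numbers to go with the matches, here is a minimal sketch. It assumes the log has already been rendered to text and that the messages you care about contain the words "warning", "error" or "critical"; those keywords and the sample lines are my assumptions, so adjust them to what your own log actually prints.

```python
import re

# Assumed keywords; adjust to whatever your rendered log actually uses.
PATTERN = re.compile(r"warning|error|critical", re.IGNORECASE)


def find_problems(lines):
    """Yield (line_number, text) for lines mentioning a warning or error."""
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            yield number, line.rstrip("\n")


def scan_log(path="/nsr/logs/daemon.log"):
    """Scan a rendered log file and return the matching lines."""
    with open(path, errors="replace") as handle:
        return list(find_problems(handle))


# Hypothetical log lines for illustration only.
sample_lines = [
    "nsrd: server started\n",
    "nsrmmd: Warning: device busy\n",
    "savegrp: error: client unreachable\n",
]

for number, text in find_problems(sample_lines):
    print(number, text)
```

Calling scan_log() with no arguments reads /nsr/logs/daemon.log, the rendered log path used in this post.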
The daemon.raw file can be configured for real-time rendering. Let's give props to Preston.
This should be done from nsradmin:
nsradmin> update runtime rendered log: /nsr/logs/daemon.log
Otherwise, you can use the nsr_render_log utility.
Some useful switches...
-R hostname: renders log from remote host
-Y severity: outputs messages that match this variable
-F devicename: outputs only messages related to a specific device
These are just a few. Now a couple of warnings. The daemon file can get chatty and especially verbose.
You can read about a fun day I had with this here.
In short, it's important to ensure the max file size is somewhat restricted. Otherwise, your issue may be compounded by the disk space filling up on your server.
Another area to keep an eye on is /nsr/cores. Not sure if Windows has an equivalent.
Inside there are directories specific to the NetWorker daemons and processes. I should really write a script to check these for new cores and email me when a core dump for a specific process occurs. Also, I like to keep an eye on syslog (and/or Event Viewer) for anything unusual.
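The core-checking script could start out as something like this. It is only a sketch: it treats any file under /nsr/cores modified in the last 24 hours as a "new core", which is my assumption about what counts as new, and it leaves out the email part since that depends on how mail is set up on your server.

```python
import os
import time


def new_core_files(root="/nsr/cores", max_age_hours=24, now=None):
    """Return paths under `root` modified within the last `max_age_hours`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    recent = []
    # Walk every per-daemon subdirectory under the cores directory.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) >= cutoff:
                    recent.append(path)
            except OSError:
                continue  # file vanished between listing and stat
    return sorted(recent)


if __name__ == "__main__":
    for path in new_core_files():
        print("new core:", path)
```

Run from cron once a day, the printed list is what you would hand to whatever sends your alerts.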
Hey Guys, and by Guys I mean the two people who are actually reading this.
Wow, today...wow. I was honestly contemplating faking an appendicitis attack or something to get out of work today. Being the professional I am, I sucked it up and put on some pants to go to work. I really hate wearing pants. Don't you?
Sorry for the lack of content posting going on here. It was a long weekend up here and we need to take advantage of those. The weather was great; I went to a BBQ and ate far too much food. I then fell asleep on an empty couch I found, and then my wife woke me up and took me home. Mr. Excitement? You know it.
So today completely defeated me. I have surrendered to my deck to continue our adventures in NetWorker.
I wanted to discuss two important commands.
nsrim and nsrck.
Exciting right? You're welcome.
Let’s talk about nsrim. "nsrim is used to manage the networker online file and media indexes." A more complete description of the command from the NetWorker Command Reference Guide reads...
"...nsrim will automatically invoke nsrck -L 3 after updating the save set’s browse
and retention times in the media database to remove client file indexes that have
exceeded the retention policy. If a problem is detected, a more thorough check will be
automatically performed on the client file index in question.
If you believe an index may be corrupt, you can manually run a higher level..."
nsrim is run at the completion of the savegrp command if it has not been run in the last 24 hours. It identifies this by checking the timestamp of /nsr/mm/nsrim.prv. nsrim will update the save set browse and retention times in the media database, then invoke nsrck to remove any entries in the client file indexes that have expired. This process is especially important to ensure tape and disk volumes are recycled. A tape or disk volume will become eligible to be reused when all of its savesets are identified as recyclable.
If a consistency problem is encountered, a more thorough check is automatically performed on the client file index in question. It is best to run nsrim outside the backup window, since it can impact backup performance, which is why some administrators prefer to cron this.
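The 24-hour check described above boils down to comparing a file timestamp against the clock. Here is a small sketch of that idea, handy if you cron nsrim yourself and want to know whether it is overdue. This is not NetWorker's actual code: using the file's modification time and treating a missing stamp file as "due" are both my assumptions.

```python
import os
import time

DAY_SECONDS = 24 * 3600


def nsrim_is_due(stamp_path="/nsr/mm/nsrim.prv", now=None):
    """Return True if the stamp file is missing or older than 24 hours."""
    now = time.time() if now is None else now
    try:
        last_run = os.path.getmtime(stamp_path)
    except OSError:
        # Assumption: no stamp file means nsrim has not run yet.
        return True
    return (now - last_run) > DAY_SECONDS


if __name__ == "__main__":
    print("nsrim due:", nsrim_is_due())
```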
Have you ever seen this error during a routine saveset restore? warning: ASDF type 0x65 version 0 not recognized