I was wondering if anyone has had a problem similar to the one I'm having. I'm attempting to troubleshoot a VSS save set, but every time I kick one off, it runs for several hours with no activity. I simply want to stop the job and tweak the parameters to start again. However, when I select Monitoring, right-click the group, and select Stop, nothing happens other than a window asking whether I really want to stop the job. The job continues to run. Stopping services, renaming the temp directory, and rebooting the server is the only way to get the job to stop. Has anyone else experienced this problem, and if so, do you have any steps to fix it? Thanks.
Windows 2008 fully patched.
Kill the save process on the client. However, be aware that a kill issued from the server has to wait for an ACK from the client; if that ACK doesn't arrive, the job will continue to run until the process on the client side is gone.
This answer is unacceptable. I can NEVER kill these processes through the GUI. I’m not looking for a workaround to stop these processes; I’m looking for the Management Console to work the way it was designed. If I right-click on a group and there is a “stop” option, I expect it to stop the group without my having to go to the client machine. Thanks for your understanding, and I look forward to resolving this with you.
I have never had this experience with VSS myself, but it is easy to understand how it may happen. It is like RMAN: sometimes when you kill it, and even when there is no rman process left on the client, the session still lives and keeps its connection open, so you need to kill it from inside (Oracle). Or, for example, if you have stuck I/O against a disk, there is no way to kill that in any nice manner. The same applies to VSS. VSS is a Microsoft framework, and if the VSS API doesn't provide an efficient way to stop things in your case, there is not much you can do.
But since you are trying to troubleshoot a VSS hang, you may wish to explore another approach. Check the event log first to see if you get any VSS-related messages. Then check the current state with vssadmin, and if you are full of enthusiasm you may wish to dig deeper with vshadow. You can also place VSS into debug mode and get some info from there. Perhaps before anything like that, you may wish to check whether there is any VSS patch that applies to your box (the NMM config checker is good at giving a list of those, but if you search across the forum you will find a few that are a must too).
I believe being unable to stop the VSS process is just an indication and consequence of the VSS problem you have. Of course I could be wrong, but most of the time it is VSS that is to blame.
I think the problem here is that even if you attempt to stop the actual savegrp.exe, there will still be some VSS operations running on the client.
The group will not stop if savegrp.exe doesn't receive acknowledgment that those other processes have been cancelled (including nsrsnap_vss_save.exe in the case of NMM and Microsoft Windows Shadow Copy, and vss.exe).
I appreciate the frustration with this behavior, but to me it makes sense, as savegrp.exe needs to know that all the other processes have stopped before it can proceed with stopping the group itself.
Unfortunately the only way, as far as I know, is to manually kill all the processes on the client first (save.exe or nsrsnap_vss_save.exe) and then manually kill savegrp.exe.
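The kill order described above (client-side save processes first, then savegrp.exe) could be written down roughly as follows. This is only a minimal sketch, not an EMC-supported procedure: it builds the Windows taskkill command lines so they can be reviewed before anyone runs them. The helper name build_kill_plan is hypothetical; the process names are the ones mentioned in this thread.

```python
# Sketch only: builds taskkill command lines in the order discussed in this
# thread. Nothing is executed. build_kill_plan is a hypothetical helper name.

CLIENT_PROCESSES = ["save.exe", "nsrsnap_vss_save.exe"]  # killed on the client
SERVER_PROCESSES = ["savegrp.exe"]                        # killed on the NetWorker server


def build_kill_plan():
    """Return (where, command) pairs: client processes first, savegrp.exe last."""
    plan = []
    for name in CLIENT_PROCESSES:
        plan.append(("client", ["taskkill", "/F", "/IM", name]))
    for name in SERVER_PROCESSES:
        plan.append(("server", ["taskkill", "/F", "/IM", name]))
    return plan


if __name__ == "__main__":
    for where, cmd in build_kill_plan():
        print(f"[{where}] {' '.join(cmd)}")
```

Printing the plan rather than running it is deliberate: a forced kill is a last resort, and each command should be run on the machine named in the first column.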
My advice though is to clean out tmp as well to avoid any session hanging on the NetWorker server, and also the jobsdb.
Carlos is on the right track with this.
NetWorker does not control VSS - it simply asks VSS for the files that are open. Microsoft VSS is problematic at times, and I have often seen VSS fail to respond properly while NetWorker (or, for that matter, any other backup software I have worked with) just waits for the ACK and then for the notification that the files are ready to be backed up.
Having dealt with VSS issues since Windows 2000, I have compiled a list of MS KB articles. Here is what I have for Windows 2008 (note that some may not apply to your situation, but others may find them useful):
Knowledge base articles - known Windows 2008 VSS issues:
VSS snapshot creation may fail after a LUN resynchronization on a computer that is running Windows 7 or Windows Server 2008 R2 - November 25, 2009
Backup fails with VSS Event ID 12292 and 11 on Windows Server 2008 and Windows Server 2008 R2 - January 20, 2010
No VSS writers are listed when you run the “vssadmin list writers” command in Windows Server 2008 R2
Windows 2008 R2 64-bit backup failed
System State backup using Windows Server Backup fails with error: System writer is not found in the backup
A VSS hardware snapshot database keeps growing with duplicated ...
(959476) - ... VSS) requestor instances with VSS hardware provider to delete snapshots in Windows Server 2008 ... Important Windows Vista and Windows Server 2008 hotfixes ...
A snapshot may become corrupted when the Volume Shadow Copy ...
(975688) - Fixes a problem in Windows 7 and in Windows Server 2008 R2 in which a snapshot may become corrupted when the VSS snapshots providers takes more than 10 ...
A virtual machine online backup fails in Windows Server 2008 R2 ...
This issue occurs because the Hyper-V Volume Shadow Copy Service (Hyper-V VSS ... Important Windows 7 hotfixes and Windows Server 2008 R2 hotfixes are included in ...
You cannot safely remove volumes after you perform a VSS backup ...
(2487341) - ... files after you perform a Volume Shadow Copy Service (VSS) backup operation in Windows Server 2008 SP2. ... Windows Vista hotfixes and Windows Server 2008 hotfixes ...
Also, please note that not all of these fixes come with Windows Update; some are applied on a case-by-case basis.
Now, from the client side, one thing you can do is to check on the status of VSS from the command line using vssadmin.
Here are a couple of MS KB articles to get you started:
Manage Volume Shadow Copy Service from the Vssadmin Command-Line
How to enable the Volume Shadow Copy service's debug tracing features in Microsoft Windows Server 2003 and Windows 2008
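As a rough illustration of the vssadmin check, the sketch below reads the text printed by "vssadmin list writers" and flags any writer whose state is not Stable. The sample text is made up to show the general layout of that output (Writer name / State / Last error lines) and is not taken from the poster's system. The helper names parse_writers and unstable_writers are hypothetical.

```python
import re

# Made-up sample in the general layout printed by "vssadmin list writers".
SAMPLE = """\
Writer name: 'Example Writer A'
   State: [1] Stable
   Last error: No error

Writer name: 'Example Writer B'
   State: [9] Failed
   Last error: Timed out
"""


def parse_writers(text):
    """Collect name, state and last error for each writer block in the text."""
    writers = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = re.match(r"Writer name: '(.*)'$", line)
        if match:
            current = {"name": match.group(1), "state": None, "last_error": None}
            writers.append(current)
        elif current is not None and line.startswith("State:"):
            current["state"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("Last error:"):
            current["last_error"] = line.split(":", 1)[1].strip()
    return writers


def unstable_writers(writers):
    """Return writers whose state does not end in 'Stable' (or is missing)."""
    return [w for w in writers if not (w["state"] or "").endswith("Stable")]


if __name__ == "__main__":
    for writer in unstable_writers(parse_writers(SAMPLE)):
        print(writer["name"], "-", writer["state"], "-", writer["last_error"])
```

On a real client, the text would come from running "vssadmin list writers" in an elevated command prompt and saving or piping its output. A writer stuck in a non-Stable state while the group hangs is a reasonable place to start looking.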
Oh - one other thing you can do on both the server and the client (I would start with the client to see if it is showing any errors) is to render the daemon.raw file.
To do this, make a copy of the /nsr/logs/daemon.raw file - I typically rename the copy to be daemon_YYYYmonthnameDD.raw.
From the command line, navigate to the directory the copied daemon.raw file is in, then run:
nsr_render_log daemon_YYYYmonthnameDD.raw >daemon_YYYYmonthnameDD.log
e.g. nsr_render_log daemon_2012Jan27.raw >daemon_2012Jan27.log
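For anyone who renders these logs regularly, the naming convention above can be generated rather than typed. This is a small sketch under that assumption. It only builds the file names and the command line as text, in the same shape as the example above, and does not run nsr_render_log. The helper names are hypothetical.

```python
import datetime

# Fixed month abbreviations so the result does not depend on the system locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def stamped_names(day):
    """Return (raw_name, log_name) following daemon_YYYYmonthnameDD."""
    stamp = f"{day.year}{MONTHS[day.month - 1]}{day.day:02d}"
    return f"daemon_{stamp}.raw", f"daemon_{stamp}.log"


def render_command(day):
    """Build the nsr_render_log command line for the copy made on that day."""
    raw_name, log_name = stamped_names(day)
    return f"nsr_render_log {raw_name} >{log_name}"


if __name__ == "__main__":
    print(render_command(datetime.date.today()))
```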
You can then open the file through Windows Explorer with Notepad or WordPad and read the contents. It may take a little time to get to the portion you need, but knowing when the backup started will help.
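If the rendered log is large, a short filter can narrow it down before opening it in an editor. The sketch below keeps only lines containing any of a few keywords, with their line numbers. The keywords (for example "vss", or the date string exactly as it appears in your own rendered log) are assumptions the reader must supply, because the layout of the rendered log is not described in this thread. The demo lines are placeholders, not real NetWorker output.

```python
def matching_lines(lines, keywords):
    """Return (line_number, text) for lines containing any keyword, case-insensitively."""
    wanted = [keyword.lower() for keyword in keywords]
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in wanted):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Placeholder lines only; a real run would iterate over the rendered file.
    demo = [
        "placeholder line about something unrelated\n",
        "placeholder line mentioning VSS\n",
        "another unrelated placeholder line\n",
    ]
    for number, text in matching_lines(demo, ["vss"]):
        print(f"{number}: {text}")
```

For a real file, pass an open file object instead of the demo list, for example open("daemon_2012Jan27.log", errors="replace"), since the function only needs something that yields lines.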
I hope this helps you some in figuring out what is wrong with the system backups.
Carlos and Mark - thanks for the feedback on this. I agree 100% that there is a VSS problem and will troubleshoot accordingly. It was my hope that if there is a "workaround", EMC could build that "workaround" into their NMC. I'm kind of old school: if there is an option to stop something, either that option should actually stop it, or the application should have intelligence written in so that it can monitor the process and not display a "stop" option. Just my two cents to EMC, but I do greatly appreciate the feedback on this. I'll dig in with the troubleshooting and see what I can do to fix the problem. Thanks again.
First, thank you for taking the time to respond, and secondly, my apologies for not following up sooner. For every problem I’m addressing with NetWorker, two more come up. The implementation of this “solution” has just been a nightmare for our organization, so it’s been a little tough to keep up.
Back to the problem at hand: I tried what you suggested and I was on board with it, but while the group is running (it contains one client) there is no nsrsnap or savegrp running on the client machine.
If you have any ideas about this, I would like to hear them. Otherwise, I have an SR open with EMC and I can update you with our findings. In the meantime, I’m going to head back to EMC’s documentation and see if I can find the sheet on which process kicks off which other process, so I can see where it’s breaking down.
I still have a sneaking suspicion it’s a VSS process having problems snapshotting some network attached storage on our Dell EqualLogic.
I remember some issues with the Equallogic arrays. Did you find any event generated by Volsnap (commonly event ID 25) or similar?
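One possible way to look for those Volsnap events from the command line is wevtutil, which ships with Windows Server 2008. The sketch below only builds the query command so it can be reviewed first. The provider name "volsnap" and the choice of the System log are assumptions to verify against an actual event on your machine, and the helper name is hypothetical.

```python
def volsnap_query_command(event_id=25, provider="volsnap", max_events=20):
    """Build a wevtutil command listing the newest matching System log events.

    The provider name is an assumption; check the source shown on a real event.
    """
    xpath = f"*[System[Provider[@Name='{provider}'] and (EventID={event_id})]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",        # XPath filter on provider and event ID
        f"/c:{max_events}",   # limit the number of events returned
        "/rd:true",           # newest events first
        "/f:text",            # plain-text output
    ]


if __name__ == "__main__":
    print(" ".join(volsnap_query_command()))
```

When typing the printed command by hand in cmd.exe, wrap the /q: argument in double quotes so that the brackets and spaces survive.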
Would you mind sharing the error message for the next failure (if any), especially daemon.raw from the client and the /nsr/tmp/sg/group_name files for that particular group?
Also, please check the application and system event logs for the duration of the backup and let us know if there are any events, warnings, or alerts.