We're running Avamar 7.0.2-47 and backing up our vCenter (v5.1) environment nightly. Every few days (or sometimes weeks), some of the VMs fail with the messages "virtualMachine.vmx will not permit a restore or backup" and "createSnapshot: snapshot creation failed."
To temporarily fix this, we must first release the VM disks from the proxy server and then perform snapshot consolidation on each VM that failed. After this is done, all VMs back up successfully for some time but inevitably fail again with the same problem.
I've seen some past threads regarding this issue, but they resolved it by updating or re-deploying the proxy. In our case, I've already re-deployed the proxy and I'm still seeing the issue.
Does anyone know the root cause of this problem and how to go about resolving it? Any insight is much appreciated.
The root cause of this issue is residual snapshots left behind on the VMs by failed or hung backups.
Some basic points need to be understood in this environment:
1. How many VM backups start at the same time?
2. How many proxies are available?
3. Are there any known issues causing failures on the VMs?
As you have correctly observed, this issue has been discussed many times in this forum and in our Knowledge Base as well. Here's a very popular KB article for fixing it: KB 124704
There are some internal tools used to remove the snapshots too.
However, coming back to your point about the root cause: it depends on the type of failure that left the residual snapshots behind, and that needs to be investigated separately, case by case.
Please contact EMC Technical Support if this issue keeps occurring intermittently and is causing too much trouble.
Thanks for the feedback, everyone.
Ahmad Arif - to answer your questions:
Before I open an SR with EMC for further investigation, can you let me know how many VMs can be handled by a single proxy? At what threshold are additional proxies suggested? I'm not opposed to deploying an additional proxy or splitting the VM backups into different groups, if the issue might be caused by timeouts or queued VMs.
1 - with a number that small, I would not think that is an issue.
And the number of proxies really depends on your backup window.
as you are on 7.0.2-47 your proxies with one proxy in vCenter should show up as
One proxy in vCenter can do 8 VMs at a time and can show up as 8 proxies in Avamar.
If you make your window, there is no need for more.
But as you only have 24 proxies, I would not think you would need to make more than one job.
You might need to open a ticket so they can look at the logging to find out what your cause is.
I used to have this a LOT, now very rarely.
One of the things I was told was not to start more than 100 VM backups at a time,
meaning in a group.
So I now have 3 groups.
I keep each one to 100 VMs or fewer.
You don't have 100 VMs being backed up simultaneously, do you? A certain number are running and the rest are sitting queued up? I have one group with 600 VMs; by default only 50 backups run at the same time and the rest wait in line.
I did not say running, I said start.
They told me there was a difference.
No group was to have more than 100 in it.
So for you it would be 6 groups.
And they would not have to be one hour apart; I did that to make it easy for me to see.
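For anyone who wants to script the split, here is a minimal sketch of the "100 or fewer per group" rule. The VM names and the split_into_groups helper are made up for illustration; this only produces lists of names and does not touch Avamar or vCenter in any way.

```python
def split_into_groups(vm_names, max_per_group=100):
    """Split VM names into consecutive groups of at most max_per_group each."""
    if max_per_group < 1:
        raise ValueError("max_per_group must be at least 1")
    return [
        vm_names[i:i + max_per_group]
        for i in range(0, len(vm_names), max_per_group)
    ]


# Hypothetical inventory of 600 VMs, matching the group size mentioned above.
vms = [f"vm{n:03d}" for n in range(1, 601)]
groups = split_into_groups(vms)
print(len(groups), [len(g) for g in groups])
```

With 600 names and a cap of 100, this gives the 6 groups mentioned above. How the groups are then scheduled is a separate decision.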
The EMC tech said not to start/queue more than 100 at a time.
I was just following instructions (and as I only have 3 proxies x 8, I only have 24 running at once).
And the reason I only have 24 is that my VMs land on Data Domain,
and I have to consider the 90 streams that the DD can handle.
So that has to include my VM backups, my VM replications, and any database backups that are going on via CIFS or NFS.
I have to stay below that 90 count on the DD to make sure jobs don't die.
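The stream budgeting above is simple arithmetic, roughly sketched below. The 8 streams per proxy and the 90-stream limit are the figures quoted in this thread; the count of 40 "other" streams (replications, CIFS/NFS database backups) is a made-up number for illustration, so plug in your own.

```python
STREAMS_PER_PROXY = 8   # per-proxy concurrency quoted in this thread
DD_STREAM_LIMIT = 90    # stream limit quoted in this thread for this DD


def vm_backup_streams(proxies, streams_per_proxy=STREAMS_PER_PROXY):
    """Streams used when every proxy is running at full concurrency."""
    return proxies * streams_per_proxy


def stream_headroom(proxies, other_streams, limit=DD_STREAM_LIMIT):
    """Streams left on the DD after VM backups and all other workloads."""
    return limit - vm_backup_streams(proxies) - other_streams


def max_proxies(other_streams, limit=DD_STREAM_LIMIT,
                streams_per_proxy=STREAMS_PER_PROXY):
    """Largest proxy count that still fits under the stream limit."""
    return max(0, (limit - other_streams) // streams_per_proxy)


# 3 proxies as in the post; 40 other streams is a hypothetical figure.
print(vm_backup_streams(3))
print(stream_headroom(3, other_streams=40))
print(max_proxies(other_streams=40))
```

A negative headroom would mean the combined workloads could exceed the limit if they all run at once.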
My VMs are going to DD as well. What kind of DD do you have? Only 90 streams?
I would love to hear from EMC engineers about the distinction between starting X number of VMs versus queuing X number of VMs.
That DD is an 860, and I was told it can only do 90 streams.
I don't know if it was an Avamar issue or a vCenter issue... like, don't kill the vCenter by trying to queue up a bunch of servers at once.
Maybe when you queue the job, it has to talk to vCenter to verify it is there? That could put a load on vCenter, maybe?
I just know I had issues, and that was 2 of the things they stressed to me.
Different aspects could lead to different root causes:
1. If you are running Avamar 7.0, the maximum number of proxy sessions back to the Avamar MCS is 48. If you have a single proxy with 8 streams, that will be 8 MCS sessions if all streams are in use at once. In Avamar 7.1 (I think; perhaps someone can confirm here), that changes to reduce the load on the MCS: each proxy establishes only a single connection to the MCS, so a single proxy with 8 streams creates one MCS connection instead of 8.
So you have a limitation here that could cause it.
2. The maximum number of streams your Data Domain model supports, as discussed above.
3. If your vCenter version is 5.1 Update 1, there is a known issue where the connection between the third-party backup product and vCenter drops, which leaves the snapshot unremoved from the VMs. The KB article is here: VMware KB: VMware VirtualCenter Server service fails randomly when using 3rd party backup solution
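To put numbers on point 1, here is a rough sketch of the session arithmetic. The 48-session limit and the one-session-per-stream behavior for 7.0 are as stated above; the one-session-per-proxy mode reflects the unconfirmed 7.1 behavior and should be treated as an assumption. The function name is made up for illustration.

```python
MCS_SESSION_LIMIT = 48  # Avamar 7.0 figure stated in point 1


def mcs_sessions(proxies, streams_per_proxy=8, per_stream=True):
    """Worst-case MCS sessions with every stream on every proxy busy.

    per_stream=True models 7.0 (one session per stream);
    per_stream=False models the unconfirmed 7.1 behavior (one per proxy).
    """
    return proxies * streams_per_proxy if per_stream else proxies


for proxies in (3, 6, 7):
    used = mcs_sessions(proxies)
    status = "over limit" if used > MCS_SESSION_LIMIT else "ok"
    print(proxies, used, status)
```

Under these assumptions, 6 proxies at 8 streams each is the most that fits within 48 sessions on 7.0.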
Hope this helps.