As our VMware environment grows, we're running into more and more issues with Avamar. OK, OK, not really Avamar issues as such, more of a design thing. We have one vCenter managing about 50 ESXi hosts and over 700 VMs. Our backup window is between 11pm and 6am. We're running about 5 Avamar proxies.
Over the last few months we have noticed more and more "consolidation failed" errors in the morning after the backup has finished. Sometimes these are easy to clean up, sometimes a Storage vMotion is required, and sometimes a VM needs to be shut down first. Our SLA window for solving these issues is getting smaller and smaller because customers are demanding more uptime in general, which makes shutting down a VM to clean up a failed consolidation a pain and gets us more complaints from customers.
When we talked to VMware Support, they told us that Avamar is often too fast, or demands too much from the host, when it comes to creating and removing snapshots. The issue is not the concurrent backups themselves but snapshots being created and/or removed at the exact same time. On some hosts I see Avamar kicking in and creating snapshots on 7 VMs within one minute.
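To put a number on the "7 VMs within one minute" observation, a small script can look for snapshot bursts per host. This is only a sketch: it assumes you have already pulled snapshot-creation events out of vCenter as (host, timestamp) pairs, and the function name, the one-minute window and the threshold of 5 are all made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def snapshot_bursts(events, window=timedelta(minutes=1), threshold=5):
    """Return {host: peak} for hosts with `threshold` or more snapshots in any `window`.

    `events` is an iterable of (host, timestamp) pairs. That shape is an
    assumption for this sketch, not something Avamar or vCenter exports as-is.
    """
    by_host = defaultdict(list)
    for host, ts in events:
        by_host[host].append(ts)

    bursts = {}
    for host, stamps in by_host.items():
        stamps.sort()
        start = 0
        peak = 0
        for end, ts in enumerate(stamps):
            # Move the left edge forward until the span fits inside the window.
            while ts - stamps[start] > window:
                start += 1
            peak = max(peak, end - start + 1)
        if peak >= threshold:
            bursts[host] = peak
    return bursts
```

Hosts that show up in the result are the ones where snapshot creation is bunching up, which is where VMware Support suggested the consolidation trouble starts.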
A few questions:
- What are your experiences with "failed consolidations"? How often do you see them in your environment?
- How do you design for this number of VMs to be backed up? Do you try to put as much load on the proxies as possible, or do you put a proxy in each cluster? Is there a maximum number of total VMs (not concurrent) or ESXi hosts that should be covered by one proxy?
- Are there ways / settings to force a proxy to take only one snapshot at a time?
Any input is appreciated.
I have 3 vCenter environments being backed up to one Avamar grid (Data Domain as the target). I have deployed 5 proxy servers per vCenter; since each proxy has 8 "streams", that gives me 40 possible concurrent sessions per cluster. We are backing up around 900 VMs daily and used to get at least 2-3 VMs that could not be backed up, either due to error 10052 (could not snapshot VM) or error 10056 (too many existing snapshots). Error 10056 is the one where we constantly have to consolidate snapshots; we had to create a script that would do it every day before we start our backups. Last week we upgraded to 7.0.2 (from 7.0.1) and I have not seen a single occurrence of this issue. The jury is still out, but before the upgrade I had never gone more than 2-3 days without at least one occurrence of 10056 that required consolidation.
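For anyone wanting to build a similar pre-backup consolidation job, here is a rough sketch of the idea. It is not the script mentioned above. It assumes the VM objects look like pyVmomi `vim.VirtualMachine` objects, i.e. they expose `name`, `runtime.consolidationNeeded` and `ConsolidateVMDisks_Task()`; connecting to vCenter and collecting the VM list is left out, and the helper names are made up. Whether "consolidation needed" lines up exactly with Avamar error 10056 in your environment is also an assumption.

```python
def vms_needing_consolidation(vms):
    """Return the VMs whose runtime info reports that consolidation is needed."""
    return [vm for vm in vms if getattr(vm.runtime, "consolidationNeeded", False)]


def consolidate_all(vms, dry_run=True):
    """Start a consolidation task on each flagged VM and return their names.

    With dry_run=True nothing is started; the names are only reported.
    The tasks are started but not waited on, so schedule this well ahead
    of the backup window.
    """
    handled = []
    for vm in vms_needing_consolidation(vms):
        if not dry_run:
            # Assumed to be pyVmomi's vim.VirtualMachine.ConsolidateVMDisks_Task()
            vm.ConsolidateVMDisks_Task()
        handled.append(vm.name)
    return handled
```

Running it with `dry_run=True` first gives a list of affected VMs without touching anything, which is handy for checking how many VMs are flagged each day.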
I am currently on 7.0.2 (and no, I don't think the proxy can be on a higher version than the grid).
When I have seen this happen on previous versions (and rarely now), it has been because the proxy needed to be rebooted.
If you look at the failures, do you see a common proxy that is having the issue?
In the Activity window, find your failures and then look at the column that says which proxy ran the job.
Every time I have had this error, it was always failing on one proxy.
Reboot that proxy and it goes away, until the next time they do some Change Control on the vCenter or VMs and don't tell me.
It seems that when they do that, they 'confuse' a proxy and it needs to be rebooted.
Now this is just me, an end user, relaying what I have experienced...
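If there are too many failures to eyeball, the "is it always the same proxy?" check can be done with a few lines of Python. This is a sketch that assumes the activity list has been saved as CSV text; the column names `Status` and `Proxy` are placeholders, so adjust them to whatever your export actually contains.

```python
import csv
import io
from collections import Counter


def failures_by_proxy(csv_text, status_col="Status", proxy_col="Proxy"):
    """Count failed activities per proxy, most affected proxy first.

    The column names are assumptions for this sketch, not the real
    headers of an Avamar activity export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(
        row[proxy_col] for row in reader if "fail" in row[status_col].lower()
    )
    return counts.most_common()
```

If one proxy accounts for nearly all of the failures, that is the one to reboot first.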
I have had this issue before; your probable root cause is on the vCenter. The RPC connection between your backup tool (in this case Avamar) and the vCenter drops, so backup jobs may fail and snapshots are not removed from the virtual machines. The best way to verify this is to open a case with VMware and ask them to check the vpxd log on the vCenter and confirm whether connections are being dropped. Our problem lay in the vCenter version; since it was upgraded to the suggested version, I haven't had any timeouts or orphaned "hidden snapshots".
Take a look at the VMware kb article:
On a side note...
When I first started having this issue, support had me break up the start of my VMs.
They said not to start more than 100 at a time.
So I created more group policies, and I try to keep them to about 100 VMs each. (Of course, retiring servers can make a group shrink, so I check every now and then to see which group is the smallest and add new VMs to it.)
I have about 600 VMs from one vCenter, but going to 2 grids.
I start the groups about 1 hour apart. (That was my choice, not their suggestion. I think I could have started them closer together, but this works and they all finish way before my window closes.)
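The grouping described above is easy to plan on paper before creating the policies. The sketch below only does the arithmetic: it splits a list of VM names into groups of at most 100 and gives each group a start time one hour after the previous one. It does not talk to Avamar, and the function names and defaults are made up for illustration.

```python
from datetime import datetime, timedelta


def plan_groups(vm_names, first_start, group_size=100, gap=timedelta(hours=1)):
    """Split VMs into groups of at most `group_size` with staggered start times.

    Returns a list of (start_time, [vm names]) tuples in start order.
    """
    groups = []
    for i in range(0, len(vm_names), group_size):
        start = first_start + gap * (i // group_size)
        groups.append((start, vm_names[i:i + group_size]))
    return groups


def smallest_group(groups):
    """Return the group with the fewest VMs, i.e. where a new VM should go."""
    return min(groups, key=lambda group: len(group[1]))
```

For example, `plan_groups(names, datetime(2014, 1, 1, 23, 0))` with 250 names gives three groups starting at 23:00, 00:00 and 01:00, and `smallest_group` picks the last one as the place to add new VMs.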