Thanks, guys, for the input. We have an Avamar Gen4S / DD2500. For weeks it was flawless using a vCenter Appliance 22.214.171.12400 Build Number 3634791. Just in the past few days we've had 70+ failures a night, as mentioned in this thread ("nothing has changed" as far as we know). We back up about 500+ VMs nightly, with the main backup group having 400+ members, and all the groups were kicking off at the same time. So as of today I have split the 400+ group into three groups of about 135 VMs each and staggered the start times two hours apart. I'll see how things go tonight.
Glad this was able to help you!
Be sure to reply back with any changes or experiences that might help the next person experiencing this.
If you see significant numbers of sessions failing with snapshot-creation errors, L2 may refer to something like a "transactional limit" or "session state limit". There is a tunable parameter to increase this, and VMware may suggest raising it, restarting vCenter, and re-running your backups. While this reduced our failures by only a handful, it was very informative: it showed us we needed more CPU on our vCenter instance.
Let us know if you learn anything that helps your situation!
The only real change we were advised to make was increasing the ephemeral port range because of concerns about port exhaustion, but this made no difference.
Splitting up my backup jobs and staggering the start times had no effect on the number of failures. I've escalated the case within EMC and we are digging into it further.
Just today, they have found a correlation in the vpxd logs, the sso logs and the Avamar logs and have deemed it to be a VMware issue. VMware is now escalating it on their side.
Ok. After a couple of WebEx’s with EMC here is the latest.
When looking at a failed backup's VM log, there are the usual errors and exceptions in red, and those are what we are generally drawn to. However, right above one of those entries is something one of the EMC guys found:
2016-10-19 10:03:26 avvcbimage Warning <16041>: VDDK:VixDiskLibVim: Login failure. Callback error 3014 at 2439.
2016-10-19 10:03:26 avvcbimage Info <16041>: VDDK:VixDiskLibVim: Failed to find the VM. Error 3014 at 2511.
2016-10-19 10:03:26 avvcbimage Info <16041>: VDDK:VixDiskLibVim: VixDiskLibVim_FreeNfcTicket: Free NFC ticket.
2016-10-19 10:03:26 avvcbimage Info <16041>: VDDK:VixDiskLib: Error occurred when obtaining NFC ticket for: servername/serverdisk.vmdk. Error 3014 (Insufficient permissions in the host operating system) (fault InvalidLogin, type VmodlVimFaultInvalidLogin, reason: (none given), translated to 3014) at 4526.
2016-10-19 10:03:26 avvcbimage Info <16041>: VDDK:VixDiskLib: VixDiskLib_OpenEx: Cannot open disk servername/serverdisk.vmdk. Error 3014 (Insufficient permissions in the host operating system) (fault InvalidLogin, type VmodlVimFaultInvalidLogin, reason: (none given), translated to 3014) at 4680.
2016-10-19 10:03:26 avvcbimage Info <16041>: VDDK:VixDiskLib: VixDiskLib_Open: Cannot open disk servername/serverdisk.vmdk. Error 3014 (Insufficient permissions in the host operating system) at 4718.
2016-10-19 10:03:26 avvcbimage Error <0000>: Failed to connect to virtual disk servername/serverdisk.vmdk (3014) (3014) Insufficient permissions in the host operating system
2016-10-19 10:03:26 avvcbimage Info <19644>: Connecting virtual disk servername/serverdisk.vmdk
Avamar can't log in to vCenter for some reason, and the backup fails. This error is present in all of the failed backups.
So we started to dig into the vpxd.log on the vCenter (appliance in our case) and grepped for “failed” and the AD service account we use for Avamar.
We found over 100 entries like this:
2016-10-18T23:18:50.794-07:00 error vpxd[7FA110D49700] Failed to authenticate user <domainname\username> (actual domain and user names removed to make the security guy happy)
Some of these entries had a time stamp that exactly matched the Avamar client logs for the "VDDK:VixDiskLibVim: Login failure".
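For anyone who wants to automate that cross-check, the matching step can be sketched as a small Java helper. This is hypothetical glue code, not anything EMC or VMware ship: the timestamp layouts are taken from the excerpts above, reading the log files is left to the caller, and the tolerance window is a guess.

```java
import java.time.*;
import java.time.format.DateTimeFormatter;
import java.util.*;
import java.util.stream.*;

public class LogCorrelate {
    // Timestamp layouts as they appear in the excerpts above.
    static final DateTimeFormatter VPXD =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");
    static final DateTimeFormatter AVAMAR =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // vpxd.log lines start with a zoned timestamp, e.g. 2016-10-18T23:18:50.794-07:00
    static Instant vpxdTime(String line) {
        return OffsetDateTime.parse(line.substring(0, 29), VPXD).toInstant();
    }

    // Avamar client-log timestamps carry no zone, so the caller supplies one.
    static Instant avamarTime(String line, ZoneOffset tz) {
        return LocalDateTime.parse(line.substring(0, 19), AVAMAR).toInstant(tz);
    }

    // Return the Avamar "Login failure" lines that have a vpxd
    // authentication failure within toleranceSec seconds.
    static List<String> correlate(List<String> vpxdLines, List<String> avamarLines,
                                  ZoneOffset avamarTz, long toleranceSec) {
        List<Instant> authFailures = vpxdLines.stream()
                .filter(l -> l.contains("Failed to authenticate user"))
                .map(LogCorrelate::vpxdTime)
                .collect(Collectors.toList());
        return avamarLines.stream()
                .filter(l -> l.contains("VixDiskLibVim: Login failure"))
                .filter(l -> {
                    Instant t = avamarTime(l, avamarTz);
                    return authFailures.stream().anyMatch(a ->
                            Math.abs(Duration.between(a, t).getSeconds()) <= toleranceSec);
                })
                .collect(Collectors.toList());
    }
}
```

A tolerance of a second or two absorbs the sub-second difference between the two clocks; widen it if the hosts aren't tightly time-synced.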
When this was found, the EMC engineer asked whether we were backing up our vCenter appliance with the other production VMs. Yes, we are. The next question: are you also backing up your domain controllers with the production VMs? Yes.
He stated that because Avamar takes quiesced snapshots, there may be a few seconds during which either the vCenter server or the AD DCs are not accepting logins. This would explain why in the morning we can rerun any of the failed backups and they complete without issue: no other snapshots are running, the SAN is quiet, and everything is responding well.
For tonight’s backup window we removed the vCenter appliance and all of the AD DCs from the backups. We’ll see what happens overnight.
I’ll update you all tomorrow.
Did you ever find a solution with VMware? We have had a couple of cases open with VMware and they haven't found a solution yet.
Yes. It was quite simple. Instead of using an AD user account for the Avamar/vCenter integration, I changed the account Avamar was using to a vCenter local account. That fixed the issue. Basically, our AD DCs could not keep up with the number of AD login requests Avamar was sending during our nightly backup window. Changed it and we haven't missed a backup since.
Here's what they had to say.
IDM maintains a TenantCache, which stores information about the tenant so that IDM can avoid repeated queries to vmdir. The key to this cache is the tenant name as supplied by the login request. In this case, we were logging into "abcvsphere.local". However, when the customer created this tenant, they provided the name "ABCvsphere.local", so "ABCvsphere.local" was stored in vmdir as the tenant name. When writing to the cache, the tenant name stored in vmdir is used as the key, in this case "ABCvsphere.local". I'll walk through an example to illustrate what happened:
- In many cases this will result in a POST to /sts/STSService/abcvsphere.local. Notice the tenant name is "abcvsphere.local".
- IDM tries to query the TenantCache with key "abcvsphere.local". The cache is empty, this is expected since it is the first login.
- IDM queries vmdir several times to get the tenant information. Most LDAP queries are case-insensitive, so lookups for tenant "abcvsphere.local" will return the data for "ABCvsphere.local".
- Now that IDM has the tenant information, it will be cached with key "ABCvsphere.local". The key to the cache is the tenant name as stored in vmdir.
- Subsequent logins will result in IDM querying the cache with key "abcvsphere.local". By default, Java String comparisons are case sensitive, so the cache will appear empty, and IDM will repeat the same process of querying vmdir.
- Since every authentication request results in a cache miss, many extra LDAP connections are created for each request. Eventually, so many LDAP connections are created that the machine exhausts the ephemeral port space. This in turn results in some authentication requests failing since IDM cannot establish a connection to vmdir.
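The walkthrough above boils down to an ordinary case-sensitive map being keyed by a case-insensitive name. This is a minimal sketch of that pattern, not IDM's actual code; the class name, the "tenant-info" placeholder, and the vmdirQueries counter are illustrative only.

```java
import java.util.*;

public class TenantCacheSketch {
    private final Map<String, Object> cache = new HashMap<>();
    int vmdirQueries = 0; // counts simulated LDAP round trips to vmdir

    Object lookup(String tenantFromLoginRequest) {
        // Java String keys compare case-sensitively, so a request for
        // "abcvsphere.local" never matches the cached key "ABCvsphere.local".
        Object info = cache.get(tenantFromLoginRequest);
        if (info != null) return info;

        vmdirQueries++; // cache miss: go ask vmdir over LDAP
        // The LDAP lookup is case-insensitive, so vmdir answers anyway and
        // hands back the tenant name in the casing it stores.
        String nameAsStoredInVmdir = "ABCvsphere.local";
        info = "tenant-info";
        cache.put(nameAsStoredInVmdir, info); // cached under vmdir's casing
        return info;
    }
}
```

Every login with "abcvsphere.local" therefore misses the cache and opens fresh LDAP connections, which is the ephemeral-port exhaustion described above. Normalizing the key on both put and get, e.g. with key.toLowerCase(Locale.ROOT), turns the second lookup into a cache hit.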
The fix for the port exhaustion is supposed to ship in v6.0u3, due out in January.