How to handle VM cache

Question

Hello.

We have two VMs that take too long to back up, and we suspect the cache files are the cause. When the VMs were powered off, each backup took 2 minutes. Now that they're powered on, each takes 4-5 hours.

We back these machines up as VM images rather than through the client, so the cache files aren't in the usual client var path.

Where are the cache files located? I'd like to delete them so that the next backup creates new ones.

Do you have any further suggestions for troubleshooting the issue?

Thank you.

ionthegeek · Answer

What leads you to suspect the cache files are the cause of the problem? I cannot think of any circumstance under which the cache files could be slowing down a VMware Image backup.

Is the backup running to Avamar storage or Data Domain storage? What is the change rate of these VMs? Do the backups consistently run faster if the virtual machines are powered off or does the first backup after the machine has been powered off run slowly as well?

JoelFC · Answer

Hi Ian.

Thank you for the questions.

I suspect the cache files because, given how long the backup takes, it might be doing a full backup every time instead of using the hash cache. My idea was to delete the files, make sure the first backup succeeds, and then compare how long the first backup takes with the subsequent ones.

The backup is running to Avamar storage. These VMs are newly deployed and aren't in production yet, so they should have a low change rate. They had been powered off for a while and were turned on a few days ago. While powered off, they always took just a few minutes to back up, but since they were powered on, each backup has taken about 5 hours, from day one. The backups actually failed for the first few days because the backup window was only 4 hours and had to be extended.

There is one relevant detail: these machines are off-site, and communication with them is slower than with on-site VMs. We are aware that other off-site machines take longer to back up as well. For example, a DC with 40 GB takes between 2 hours and 2 hours 30 minutes, but that's acceptable because it changes a lot every day. These VMs have thin-provisioned drives of 100 GB and, as I said, aren't changing much from day to day.

If I delete the cache files, I could at least check the duration of the first backup and confirm whether it is doing a full backup every day.
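
As a rough sanity check on the figures quoted above, the sizes and durations can be turned into average rates. This is only a sketch: it assumes decimal gigabytes and assumes the full stated size is processed in each run, which the thread does not confirm (the drives are thin-provisioned, and deduplication affects how much data is actually sent). The function name is made up for illustration.

```python
def effective_rate_mb_s(size_gb: float, hours: float) -> float:
    """Average rate in MB/s needed to process size_gb (decimal GB) in the given hours."""
    return size_gb * 1000 / (hours * 3600)


# Figures quoted in the post: a 40 GB DC in 2 to 2.5 hours, 100 GB drives in about 5 hours.
dc_fast = effective_rate_mb_s(40, 2.0)
dc_slow = effective_rate_mb_s(40, 2.5)
vm_full = effective_rate_mb_s(100, 5.0)

print(f"DC, 2 h run:    {dc_fast:.2f} MB/s")
print(f"DC, 2.5 h run:  {dc_slow:.2f} MB/s")
print(f"VM, 5 h run:    {vm_full:.2f} MB/s (only if all 100 GB were processed)")
```

The rates are of the same order, which fits the idea that the slow runs handle far more data than a low change rate would suggest. It is a consistency check only, not a diagnosis.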

ionthegeek · Answer

VMware Image backups don't make heavy use of the cache. In fact, in earlier releases, the cache was intentionally disabled for VMware Image backups because it provides very little benefit. The only reason the cache is enabled in newer releases is to prevent the proxy clients from hammering the server with millions of identical requests for the all-zero hash (we see that hash a lot during VM Image backups since empty disk space chunks up into identical all-zero chunks).
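
To illustrate the point about the all-zero hash, here is a minimal sketch. It assumes fixed-size chunks and SHA-1 purely for demonstration; the thread does not describe how Avamar actually chunks or hashes data, and the chunk size and function name are hypothetical.

```python
import hashlib

CHUNK_SIZE = 24 * 1024  # hypothetical fixed chunk size, for illustration only


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return the SHA-1 hex digest of each."""
    return [
        hashlib.sha1(data[offset:offset + chunk_size]).hexdigest()
        for offset in range(0, len(data), chunk_size)
    ]


# Simulate 1000 chunks' worth of empty (all-zero) disk space.
empty_space = bytes(CHUNK_SIZE * 1000)
hashes = chunk_hashes(empty_space)

total_lookups = len(hashes)
distinct_hashes = len(set(hashes))
print(f"{total_lookups} chunks produced {distinct_hashes} distinct hash")
```

Every chunk of empty space yields the same digest, so without a local cache each one would turn into an identical lookup against the server.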

It's much more likely that the performance issue is being caused by changed block tracking (CBT) being disabled (either at the VMware layer or in the dataset), by SCSI hotadd not being used, by high round-trip latency across the WAN, etc.

I can't explain why the backup was fast when the VM was powered off but I can say with a fair degree of confidence that it's not a cache issue that slowed it down. However, if you're really keen on deleting the cache files, you can find them on the proxy in the /usr/local/avamarclient/var directory.

I'd recommend that you collect the logs for these backups and open a service request so support can help you troubleshoot the issue further.

There is a tool called proxycp that my colleague Amol wrote that is useful for gathering logs from VM Image proxies. You can find proxycp at ftp://avamar_ftp:anonymous (at) ftp.emc.com/software/scripts/proxycp.jar and run it from the command prompt on the Avamar server like this: java -jar proxycp.jar

I'd recommend starting with --help

ionthegeek · Answer

The logs will show whether the backup used hotadd or nbd.
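
One simple way to look for this is to search the collected log text for the transport mode names. The sketch below assumes only that the mode name (hotadd, nbd, or nbdssl) appears somewhere in a log line. The exact wording of the proxy logs is not shown in this thread, so the sample text is a placeholder and not real log output.

```python
import re

# Assumption: the log names the transport mode somewhere in a line.
TRANSPORT_PATTERN = re.compile(r"\b(hotadd|nbdssl|nbd)\b", re.IGNORECASE)


def find_transport_lines(log_text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, mode, line) for each line that names a transport mode."""
    matches = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        found = TRANSPORT_PATTERN.search(line)
        if found:
            matches.append((number, found.group(1).lower(), line.strip()))
    return matches


# Placeholder text only; real proxy log lines will look different.
sample_log = """\
placeholder line one
placeholder line that mentions hotadd
placeholder line three
"""

results = find_transport_lines(sample_log)
for number, mode, line in results:
    print(f"line {number}: {mode}: {line}")
```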

JoelFC · Answer

Hi Ian.

Thank you for the help.

Changed block tracking is enabled both in the Avamar dataset and in the VMware settings, on the VM as well as on the virtual disks. I'm not sure about SCSI hotadd, as I haven't figured out how to check it. As far as I have investigated, the configuration seems fine for hotadd to work, but how do I check whether it is enabled and actually being used? Latency is probably contributing to the problem, but there must be something else, because there's no reason for these VMs to go from a 2-minute to a 5-hour backup just because they went from powered off to powered on.

After confirming the SCSI hotadd question, and following up on any other ideas that may come up, I'll open a case.

JoelFC · Answer

I've looked through the logs, and they clearly state that the backup connects to the disks with hotadd. That's it, then: I'm opening a support case. Thank you for helping with the troubleshooting.

ionthegeek · Answer

Happy to help! Please keep us updated if you can. I'm interested to learn what's causing the issue.

JoelFC · Answer

Hi Ian.

I opened a ticket with EMC, and after a lot of testing, log analysis, corrective actions, etc., we couldn't reach a clear conclusion.

The solution was to disable CBT and keep it that way, as the affected VMs take about 30 minutes to back up with CBT disabled.

CBT seems to be detecting a lot of changes in the VM blocks, and processing them takes a very long time. This would be normal in certain scenarios (there is a known issue with CBT), but it is not expected for these VMs, which are not very large and are mostly idle because they are not yet in production.

Thank you.
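
For anyone who wants to put a number on "a lot of changes", here is a small sketch. It assumes the changed areas are available as a list of non-overlapping (offset, length) extents in bytes. The extents below are hypothetical values chosen for illustration, not data from this case, and the function name is made up.

```python
def changed_fraction(extents: list[tuple[int, int]], disk_bytes: int) -> float:
    """Fraction of the disk covered by non-overlapping (offset, length) extents."""
    changed = sum(length for _offset, length in extents)
    return changed / disk_bytes


GIB = 1024 ** 3
disk_bytes = 100 * GIB

# Hypothetical changed extents: 2 GiB at the start of the disk, 3 GiB at offset 10 GiB.
extents = [(0, 2 * GIB), (10 * GIB, 3 * GIB)]

fraction = changed_fraction(extents, disk_bytes)
print(f"Changed since last backup: {fraction:.1%} of the disk")
```

A mostly idle VM would be expected to show a small fraction here; a value close to the whole disk on every run would match the behaviour described above.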
