I can't find any evidence as to why, but the Avamar client agent created a new f_cache2.dat file in the next backup job run after cancelling a "hung" backup - and of course, it did this on a file server with millions of files that already experiences performance issues with Avamar backups, regardless of whether they are seeds.
Is there any condition where the Avamar agent file cache could be "corrupted" or otherwise "messed up" that would not cause a running backup job to fail outright? Anything that occurs "right at the end" of a backup session after all the files have been scanned?
The first thing that comes to mind is that the f_cache2.dat file wasn't closed properly when the "hung" backup was cancelled - could that be a possibility? If so, as mentioned before, what part(s) of the "backup wrapup" could have had issues to cause that to happen?
All comments/feedback appreciated - thanks.
The only time I've seen something like this is when the client filesystem filled to capacity while the cache was flushing. Is it possible this is what happened on the client? That could explain both symptoms.
In my understanding, f_cache2.dat is created if you are sending backups to a Data Domain (DD).
The size of f_cache2.dat will be significantly larger than f_cache.dat. f_cache2.dat contains multiple pages used to update file attributes; each page is around 6.7 MB, and the number of pages increases with the number of new files.
You should see log lines similar to these at the end of a backup:
2018-02-05 00:19:09 avtar Info <18895>: Cache update complete /usr/local/avamar/var/f_cache2.dat (696 pages in all backups in cache)
2018-02-05 00:19:09 avtar Info <5069>: - Writing cache file "/usr/local/avamar/var/p_cache.dat"
2018-02-05 00:19:13 avtar Info <5546>: Cache update complete /usr/local/avamar/var/p_cache.dat (1536.0 MiB of 2047 MiB max)
In the above example, the size of f_cache2.dat will be about 6.7 MB x 696 pages = 4663 MB.
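That estimate can be scripted against the avtar log line. This is a minimal sketch, assuming the ~6.7 MB page size quoted above (it is a figure from this thread, not an official Avamar constant), and the helper name is hypothetical:

```python
import re

PAGE_SIZE_MB = 6.7  # approximate f_cache2.dat page size, per the post above

def estimate_f_cache2_mb(log_line: str) -> float:
    """Parse '(N pages ...)' from an avtar cache-update log line and
    estimate the f_cache2.dat size in MB."""
    match = re.search(r"\((\d+) pages", log_line)
    if not match:
        raise ValueError("no page count found in log line")
    return int(match.group(1)) * PAGE_SIZE_MB

line = ('2018-02-05 00:19:09 avtar Info <18895>: Cache update complete '
        '/usr/local/avamar/var/f_cache2.dat (696 pages in all backups in cache)')
print(round(estimate_f_cache2_mb(line)))  # 696 pages * 6.7 MB -> 4663
```

Handy for a quick sanity check on how large the cache file on a big client should be, given the page count the log reports.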
The backups in question have always been going to a Data Domain, so there was no moving involved - unless there was some kind of "disconnect" in terms of knowing that there were previous backups to the Data Domain.
I've had cache files rebuild automatically due to some kind of error/incompatibility in the cache file after I pushed an Avamar agent upgrade to a server (e.g. 7.1.102-21 upgraded to 7.3.0-233). Nowadays I know to delete the file if I ever upgrade the agent. But you'd see very clearly in the log that Avamar was rebuilding the cache, and if I recall correctly it would throw a "completed with exceptions" error. Is that so for you?
In my case, there was no upgrade involved. Also, in each case, the backup was in enough of a hung/stuck/stalled state that it would not complete on its own and had to be cancelled.
And I say "in each case" now because the issue has occurred again, albeit under slightly different circumstances. The main symptoms were similar enough, and the key point in both cases is that on the next client backup, the file cache appears to be recreated from scratch, as if it were the first backup the client had ever performed.
Given where in the overall process the session hangs, I don't believe the cache files get properly closed when the session is cancelled, so on the next backup the agent sees that and simply creates new ones. It's not quite the same behavior as when volume capacity issues prevent the cache files from growing any larger, but in both cases you end up with open cache files in a "what the heck do I do now" state.
I am likely going to log an SR for this, because it is occurring on a file server with 25 million files and 20TB of data - so having failures that result in multiple "seeding" backups is not something the customer is going to put up with. Also, the most recent failure on this client ties in with issues on the associated Avamar node around the same time, so I need to determine whether there is any relationship between those issues and verify that everything that might be going on is resolved.
FWIW, I'll try to follow up and post the sections of the respective logs where the processes have "hung" so as to provide better perspective on the issue, as well as any Support feedback I get going forward.
It should not be necessary to delete the cache file after an upgrade unless there are exceptional circumstances.
Replicating a 20TB backup in an Avamar / DD configuration is likely going to be problematic. The customer will probably need to split up this dataset either way.
Ian, we saw the problem too many times not to conclude that it had something to do with the upgrade. It might have been the significance of the changes between the 7.1 and 7.3 agents that caused the failures (in which case it won't happen again if we do smaller upgrades), but either way, deleting the cache file as a precaution isn't disruptive enough in our environment for us to stop doing it.
I know we did have some issues with the paging cache in older releases, but those should be resolved now. I just don't want people to see your post and think they need to delete the cache files every time. Since 90+% of the performance gain from the Avamar client comes from the file cache, that could have a devastating impact on backup performance in a large environment.