February 9th, 2018 11:00

Avamar client agent created a new f_cache2.dat file after cancelling "hung" backup

I haven't found any evidence as to why, but the Avamar client agent created a new f_cache2.dat file in the next backup job that ran after I cancelled a "hung" backup - and of course, it did this on a file server with millions of files that already has performance issues with Avamar backups, whether they are seeding backups or not.

Is there any condition where the Avamar agent file cache could be "corrupted" or otherwise "messed up" in a way that would not cause a running backup job to fail outright? Is there anything that occurs "right at the end" of a backup session, after all the files have been scanned?

The first thing that comes to mind is that the f_cache2.dat file wasn't closed properly when the "hung" backup was cancelled - could that be a possibility? If so, as mentioned above, which part(s) of the "backup wrap-up" could have had issues that would cause that to happen?


All comments/feedback appreciated - thanks.

2K Posts

February 9th, 2018 11:00

The only time I've seen something like this is when the client filesystem filled to capacity while the cache was flushing. Is it possible this is what happened on the client? That could explain both symptoms.

8 Posts

February 11th, 2018 23:00

Hello,

As I understand it, f_cache2.dat is created when backups are being sent to a Data Domain (DD).

The size of f_cache2.dat will be significantly larger than f_cache.dat. f_cache2.dat is made up of multiple pages used to track file attributes; one page is around 6.7 MB, and the number of pages grows with the number of new files.

You should see log entries similar to the following at the end of a backup:

2018-02-05 00:19:09 avtar Info <18895>: Cache update complete /usr/local/avamar/var/f_cache2.dat (696 pages in all backups in cache)

2018-02-05 00:19:09 avtar Info <5069>: - Writing cache file "/usr/local/avamar/var/p_cache.dat"

2018-02-05 00:19:13 avtar Info <5546>: Cache update complete /usr/local/avamar/var/p_cache.dat (1536.0 MiB of 2047 MiB max)

In the above example, the size of f_cache2.dat will be roughly 6.7 MB x 696 pages = ~4663 MB.
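If it helps to sanity-check that estimate, here is a rough Python sketch (my own, not an Avamar utility) that pulls the page count out of a "Cache update complete" line like the one above and multiplies it by the ~6.7 MB page size. Both the parsing pattern and the per-page size are assumptions based on this thread, not official values.

import re

# Rough illustration only: estimate the f_cache2.dat size from the page count
# reported in an avtar "Cache update complete" log line. The ~6.7 MB page size
# is the approximation mentioned above, not an official constant.
PAGE_SIZE_MB = 6.7

def estimate_f_cache2_size_mb(log_line):
    match = re.search(r"Cache update complete .*f_cache2\.dat \((\d+) pages", log_line)
    return int(match.group(1)) * PAGE_SIZE_MB if match else None

line = ("2018-02-05 00:19:09 avtar Info <18895>: Cache update complete "
        "/usr/local/avamar/var/f_cache2.dat (696 pages in all backups in cache)")
print(estimate_f_cache2_size_mb(line))  # ~4663.2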

132 Posts

February 12th, 2018 16:00

The backups in question have always gone to a Data Domain, so there was no moving involved - unless there was some kind of "disconnect" where the agent lost track of the fact that previous backups had gone to the Data Domain.

336 Posts

February 13th, 2018 05:00

I've had cache files rebuild automatically due to some kind of error/incompatibility in the cache file after I pushed an Avamar agent upgrade to a server (e.g. 7.1.102-21 upgraded to 7.3.0-233). Nowadays I know to delete the file whenever I upgrade the agent. But you'd see very clearly in the log that Avamar was rebuilding the cache, and if I recall correctly it would throw a "completed with exceptions" error. Is that what you're seeing?

132 Posts

February 13th, 2018 07:00

In my case, there was no upgrade involved. Also, in each case, the backup was in enough of a hung/stuck/stalled state that it would not complete on its own and had to be cancelled.

And I am saying "in each case" now because the issue has occurred again, albeit under slightly different circumstances - but the main symptoms were similar enough, and the primary aspect in both cases is that on the next client backup, the file cache appears to be recreated "from scratch", as if it were the first backup the client had ever performed.

In my mind, given where in the overall process the session hangs, I don't believe the cache files get properly closed when the session is cancelled, and so on the next backup the agent sees that and just creates new ones. It's not quite the same behavior as when volume capacity issues keep the cache files from growing any larger, but in both cases you end up with open cache files that are in a "what the heck do I do now" state.
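To illustrate the theory (and only the theory - this is a hypothetical sketch, not Avamar's actual cache-handling code), the behavior would look something like this, where any cache file that wasn't flagged as cleanly closed simply gets discarded and rebuilt:

from pathlib import Path

# Hypothetical sketch of the theory above, NOT Avamar's real logic: if the
# previous session was killed before the cache was cleanly closed, a defensive
# agent would distrust the file, throw it away, and start over - which looks
# exactly like a first-ever ("seeding") backup. The clean-close marker and
# helper names here are invented for illustration.

def load_or_rebuild_cache(cache_path: Path):
    if not cache_path.exists():
        return {}                              # genuinely the first backup
    if not closed_cleanly(cache_path):         # hypothetical validity check
        cache_path.unlink()                    # discard the suspect cache ...
        return {}                              # ... and rebuild from scratch
    return parse_cache(cache_path)             # normal case: reuse the cache

def closed_cleanly(cache_path: Path) -> bool:
    # Hypothetical marker: pretend a trailer byte is written only on a clean shutdown.
    data = cache_path.read_bytes()
    return bool(data) and data[-1:] == b"\x01"

def parse_cache(cache_path: Path):
    # Placeholder for actually reading cache entries.
    return {"source": str(cache_path)}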

I am likely going to log an SR for this, because it is occurring on a file server with 25 million files and 20TB of data - so having failures that result in multiple "seeding" backups is not something the customer is going to put up with. Also, the most recent failure on this client ties in with issues on the associated Avamar node around the same time, so I need to determine whether there is any relationship between those issues and verify that everything that might be going on is resolved.

FWIW, I'll try to follow up and post the sections of the respective logs where the processes have "hung" so as to provide better perspective on the issue, as well as any Support feedback I get going forward.

2K Posts

February 13th, 2018 07:00

I know we did have some issues with the paging cache in older releases but these should be resolved now. I just don't want people to see your post and think they need to delete the cache files every time. Since 90+% of the performance gain from the Avamar client comes from the file cache, that could have a devastating impact on backup performance in a large environment.

2K Posts

February 13th, 2018 07:00

It should not be necessary to delete the cache file after an upgrade unless there are exceptional circumstances.

336 Posts

February 13th, 2018 07:00

Ian, we saw the problem too many times not to conclude that it had something to do with the upgrade. It might have been the scale of the changes between the 7.1 and 7.3 agents (in which case it won't happen again if we do smaller upgrades) that caused the failures, but either way, deleting the cache file as a precaution has little enough impact in our environment that we'll keep doing it.

2K Posts

February 13th, 2018 07:00

Replicating a 20TB backup in an Avamar / DD configuration is likely going to be problematic. The customer will probably need to split up this dataset either way.

132 Posts

February 13th, 2018 08:00

"Replicating a 20TB backup in an Avamar / DD configuration is likely going to be problematic. The customer will probably need to split up this dataset either way."


Tell me about it.


It would be helpful if this kind of information were also in the "Best Practices" document or elsewhere in the documentation, as opposed to what the Support rep I talked to last autumn offered when I asked how to gauge replication behavior: "guess and check".

132 Posts

February 13th, 2018 08:00

So, this is from the most recent incident, but it follows basically the same pattern as the first occurrence, where we "lost" an f_cache2.dat file that was pushing 4,000 pages in size (this one, not as much):

2018-02-11 00:52:25 avtar Info <8688>: Status 2018-02-11 00:52:25, 8,699,379 files, 581,503 folders, 9,029 GB (8,472,275 files, 0 bytes, 0.00% new) 624MB 106% CPU  (1 open files) M:\Media\THD Sales and Marketing\Creative Files\Showroom Graphics\US\W and PD\Andersen\3-Saddle Graphics\Saddle Graphics - old\Andersen-New-Retail-SaddleGraphics Folder\Andersen-Retail-Saddles-400Series Folder\Andersen-Retail-Saddles-400Series-wBC-FINAL Folder\Links\install1.tif

2018-02-11 01:01:24 avtar Info <5548>: 2018-02-11 01:01:24 Central Standard Time: Automatic cache flush of 'V:\Program Files\avs\var\f_cache2.dat'

2018-02-11 01:01:24 avtar Info <5069>: - Writing cache file "V:\Program Files\avs\var\f_cache2.dat"

2018-02-11 01:01:24 avtar Info <18895>: Cache update complete V:\Program Files\avs\var\f_cache2.dat (169 pages in all backups in cache)

2018-02-11 01:07:36 avtar Info <8688>: Status 2018-02-11 01:07:36, 8,788,149 files, 593,883 folders, 9,242 GB (8,479,273 files, 0 bytes, 0.00% new) 659MB  98% CPU  S:\Vaulted Data\400 Series Platform\2009\Restricted\L.  Product Design\Inventor Files\CSMT Hardware\Casement & Awning Hinges\

2018-02-11 01:11:10 avtar Info <0000>: - avamar-1510761673/container.1.cdsf: 8.893 TB (93.776%) was scanned and written in 18,813.057 seconds (495.7 MBPS), 604.3 GB (6.224%) was synthesized on the DD system in 30.723 seconds (19.67 GBPS)

2018-02-11 01:11:10 avtar Info <0000>: - avamar-1510761673/container.1.cdsf: 0 bytes (0.928%) new data backed up with network load of 138.7 GB (1.429%)

2018-02-11 01:22:46 avtar Info <8688>: Status 2018-02-11 01:22:46, 8,855,006 files, 599,645 folders, 9,707 GB (8,479,273 files, 90.13 GB, 0.88% new) 632MB  18% CPU  S:\WeilandVids\Weiland 8 15 17\Weiland Videos 9.19.18\

2018-02-11 01:37:55 avtar Info <8688>: Status 2018-02-11 01:37:55, 8,855,006 files, 599,645 folders, 9,707 GB (8,479,273 files, 90.13 GB, 0.88% new) 632MB   0% CPU  S:\WeilandVids\Weiland 8 15 17\Weiland Videos 9.19.18\

2018-02-11 01:53:04 avtar Info <8688>: Status 2018-02-11 01:53:04, 8,855,006 files, 599,645 folders, 9,707 GB (8,479,273 files, 90.13 GB, 0.88% new) 632MB   0% CPU  S:\WeilandVids\Weiland 8 15 17\Weiland Videos 9.19.18\

2018-02-11 02:08:14 avtar Info <8688>: Status 2018-02-11 02:08:14, 8,855,006 files, 599,645 folders, 9,707 GB (8,479,273 files, 90.13 GB, 0.88% new) 632MB   0% CPU  S:\WeilandVids\Weiland 8 15 17\Weiland Videos 9.19.18\

... continuing with the same message (for the last directory on the last drive in the dataset) for hours afterwards ...

2018-02-12 05:09:50 avtar Info <8688>: Status 2018-02-12 05:09:50, 8,855,006 files, 599,645 folders, 9,707 GB (8,479,273 files, 90.13 GB, 0.88% new) 634MB   0% CPU  S:\WeilandVids\Weiland 8 15 17\Weiland Videos 9.19.18\

2018-02-12 05:24:58 avtar Info <8688>: Status 2018-02-12 05:24:58, 8,855,006 files, 599,645 folders, 9,707 GB (8,479,273 files, 90.13 GB, 0.88% new) 634MB   0% CPU  S:\WeilandVids\Weiland 8 15 17\Weiland Videos 9.19.18\

... and then later on, for this incident only, relative to what else was going on ...

2018-02-12 05:40:07 avtar Info <8688>: Status 2018-02-12 05:40:07, 8,855,006 files, 599,645 folders, 9,707 GB (8,479,273 files, 90.13 GB, 0.88% new) 634MB   0% CPU  S:\WeilandVids\Weiland 8 15 17\Weiland Videos 9.19.18\

... repeating itself for 3+ hours until I finally cancelled the job and it ended like so:

2018-02-12 09:05:49 avtar Error <8609>: Connection killed from GSAN.

2018-02-12 09:05:49 avtar Info <9772>: Starting graceful (staged) termination, KILL event received (wrap-up stage)

2018-02-12 09:10:20 avtar Info <0000>: Entering the 'final' phase of termination, timer trigger

2018-02-12 09:12:07 avtar Info <8688>: Status 2018-02-12 09:12:07, 8,855,006 files, 599,645 folders, 9,707 GB (8,479,273 files, 90.13 GB, 0.88% new) 634MB   0% CPU  S:\WeilandVids\Weiland 8 15 17\Weiland Videos 9.19.18\

2018-02-12 09:12:07 avtar Warning <6568>: Internal Warning: Very large number of entries on TODO queue (114094)

2018-02-12 09:16:04 avtar FATAL <12477>: WATCHDOG TIMEOUT (cancellation exceeded 600 secs): Exiting with code=536870928

Conspicuous in their absence are any messages that indicate additional writing or closing of the cache files.
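For what it's worth, the stalled pattern above is easy to spot mechanically: the <8688> Status lines keep repeating with identical counters. Below is a rough Python sketch (my own helper, not an Avamar utility) that flags runs of consecutive Status lines whose progress snapshot never changes; the regex and threshold are assumptions based on the log format shown above.

import re
import sys

# Rough, hypothetical helper: flag runs of consecutive avtar <8688> Status
# lines whose progress snapshot (everything after the embedded timestamp) is
# byte-for-byte identical - the "stuck on one directory" pattern shown above.
STATUS_RE = re.compile(r"avtar Info <8688>: Status \S+ \S+, (.*)$")

def find_stalled_runs(log_path, threshold=5):
    last, run, stalls = None, 0, []
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = STATUS_RE.search(line)
            if not m:
                continue
            snapshot = m.group(1).strip()
            if snapshot == last:
                run += 1
            else:
                if run >= threshold:
                    stalls.append((run, last))
                last, run = snapshot, 1
    if run >= threshold:
        stalls.append((run, last))
    return stalls

if __name__ == "__main__":
    for count, snapshot in find_stalled_runs(sys.argv[1]):
        print(f"{count} identical Status lines: {snapshot}")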

2K Posts

February 15th, 2018 11:00

That was definitely an unclean shutdown of avtar.

If you don't already have an SR open for this issue, I would recommend opening one. They will need debug logs but may also ask for some x flags to diagnose why we're seeing the "Very large number of entries on TODO queue" message. This message normally appears if we're accumulating data in the processing queue because the queue isn't emptying (i.e. we're not able to send the data to the backup storage for some reason).

Something to keep in mind though, is that the filename listed in the status messages is the last file that was processed successfully, not the file that avtar is currently working on.

15 Posts

November 17th, 2019 14:00

I realize this is an old topic, but after searching the web for a while, it's the closest answer I've gotten so far (having yet to open an SR on the issue).

The long and the short of it is, we have an IDPA (integrated Avamar and Data Domain), and our Data Domain filesystem (DDFS) crashed/died and restarted in the midst of a bunch of backups. The next time the client backups ran for some of the large filesystem jobs, they were glacially slow. It appears their f_cache2.dat files were reset or deleted and are being recreated from scratch. One server with millions of files and a ~50GB cache file is now ~25% of the way through its job and has only a ~1GB cache file so far.

Very frustrating.
