61 Posts

March 10th, 2014 05:00

Optimize for large file count

Good day,

  I have a server with about 2.5 million small files on it that is having trouble backing up. These files are static in nature and do not change once they are written, but new ones are added to the drive daily.

I remember from Avamar training there is some optimization around the fcache & pcache that can be done to allocate more memory for improved performance at the expense of taking more resources on the host.

For those who have had these large file count problems, would you recommend this course of action?

Is there anything else anyone could recommend to try and speed up this process?

Thank you kindly,

Steven

2K Posts

March 13th, 2014 11:00

Version 7 already includes the big directory optimizations. The caches don't seem to be an issue here either (the logs show plenty of room available in the file cache).

Client resources are the limiting factor for the vast majority of client backups so it might be worthwhile to crack out perfmon and see what's consuming system resources during the backup. Generally it's the client I/O performance that is the bottleneck but it would also be worth checking resources like available memory and CPU utilization. The CPU utilization in the log snips you've shown seems a little low to me. There may be some other process on the system using a lot of CPU and Avamar runs at Below Normal priority on Windows so it will yield the CPU to most other processes.
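If you'd rather script it than watch perfmon interactively, here is a rough sampler you could adapt. This is only a sketch: it assumes the third-party psutil package (pip install psutil), and avtar.exe is the usual name of the Avamar client process on Windows, so adjust if yours differs.

```python
# Rough resource sampler to run during the backup window.
# Assumes the third-party "psutil" package; "avtar.exe" is the usual
# Avamar client process name on Windows -- adjust if yours differs.
import time
import psutil

def sample(interval_s=15, duration_s=600):
    end = time.time() + duration_s
    while time.time() < end:
        total_cpu = psutil.cpu_percent(interval=1)   # system-wide CPU %
        mem = psutil.virtual_memory()                # system-wide memory
        avtar_cpu = 0.0
        for p in psutil.process_iter(["name"]):
            try:
                if (p.info["name"] or "").lower() == "avtar.exe":
                    avtar_cpu += p.cpu_percent(interval=0.1)
            except psutil.NoSuchProcess:
                pass                                 # process exited mid-scan
        print(f"total CPU {total_cpu:5.1f}%  avtar CPU {avtar_cpu:5.1f}%  "
              f"mem used {mem.percent:5.1f}%")
        time.sleep(interval_s)

if __name__ == "__main__":
    sample()
```

If total CPU is high while avtar's share stays low, some other process is likely starving the backup.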

Typical performance for incremental backups on Windows is 100GB/hr and 1 million files/hr so this is definitely going slower than expected. If you're not able to identify what might be causing the problem, I'd recommend opening an SR and support can help you nail down what is slowing down the backup.

355 Posts

March 14th, 2014 04:00

Hello,

Is there any other similar server with the same number of files that backs up well on the same LAN?

Could this be an issue with network utilization? (I'm not sure about this.)

Regards,

Pawan

61 Posts

March 14th, 2014 11:00

I don't believe it's a network issue; vCenter Operations Manager shows they are only using 57K of bandwidth, and we are load balancing our infrastructure across 2 x 10Gb adapters.

It is a curious problem for sure, I'm continuing to investigate and will try and post an update when I find out more information.

Thanks for all the tips,

Steven

61 Posts

April 4th, 2014 06:00

Hello Druehl,

  We found the cause of our trouble just a few days ago, and it turned out to be a storage bottleneck. The systems with these very large file counts were on our SATA pool; after all, they are just large file servers, so they didn't need high disk performance (or so we thought).

Thinking about the problem logically, in our case it made sense. Every one of these millions of files that needs to be checked for changes must be opened: it has to be read from disk into memory, the SHA-1 hash has to be run on the contents of the file, and the result checked against the f_cache.
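To illustrate the kind of per-file work involved, here is a minimal sketch of that open-read-hash-lookup loop. It is only an illustration of the general idea, not Avamar's actual f_cache implementation; the point is that each file costs at least one disk seek even when nothing has changed.

```python
# Minimal sketch of the per-file work an incremental file scan implies:
# open each file, read it from disk, SHA-1 the contents, check a cache.
# Illustration only -- not Avamar's actual f_cache logic.
import hashlib
import os

def scan(root, file_cache):
    hits = misses = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha1()
            with open(path, "rb") as f:            # one open + read per file;
                for block in iter(lambda: f.read(1 << 20), b""):
                    digest.update(block)           # seek time dominates for
            if digest.hexdigest() in file_cache:   # millions of small files
                hits += 1                          # hit: nothing to send
            else:
                misses += 1                        # miss: file would be chunked
    return hits, misses
```

With 2.5 million small files, those seeks add up fast on a SATA pool, which is why moving to SAS made such a difference for us.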

When we migrated these virtual machines from our SATA pool to our SAS pool the backup time was reduced by more than 50%. Do you have the ability to vMotion these servers to a higher performance storage tier?

I hope this helps,

Steven

223 Posts

April 4th, 2014 06:00

Hello,

did you solve your problem?

I have the same issue. A server with about 6 million files now takes 25 hours for 4.5 million files. Avamar is version 7.0.1-61, same client version.

CPU and memory usage are low.

The caches are not full:

2014-04-04 14:39:30 avtar Info <5546>: Cache update complete D:\Programme\avs\var\f_cache.dat (352.0MB of 382MB max)

2014-04-04 14:43:45 avtar Info <8688>: Status 2014-04-04 14:43:45, 4,606,932 files, 680,863 folders, 518.9 GB (2,254,637 files, 68.58 GB, 13.22% new) 687MB  10% CPU

2014-04-04 14:58:45 avtar Info <8688>: Status 2014-04-04 14:58:45, 4,642,185 files, 683,322 folders, 519.7 GB (2,289,890 files, 68.74 GB, 13.23% new) 686MB   7% CPU

As I read in the other postings, expanding the cache would not help here. So what can I do?

One other question: in all the Avamar 7 documentation I read that version 7 uses another cache method (paging) and that there is no more need to tune the cache, and also that the cache files are renamed to f_cache2.dat and p_cache2.dat.

I am running version 7, but the files are still f_cache.dat and p_cache.dat.

2K Posts

April 4th, 2014 08:00

druehl, is the backup wrapping up at the end (i.e. Timed Out-End) or is it failing? If the backup is failing at the end (for example, if the server is running Avamar 6.1 and the backup is killed by the start of garbage collection), no partial backup will be created and the cached information from the previous backup run will not be usable for the next backup. If the backup is being killed by GC, shortening the schedule for the backup so it wraps up before the start of GC will allow the new cache information to be used for the next backup.

Just to confirm, are you seeing any "booted" messages about the cache? These messages appear if cache information is discarded.

Based on the % new bytes, it looks like a large number of files are still being chunked. That normally means the client has not yet completed an initial backup. Is this the case? Typically once slow-running clients have completed an initial backup, their subsequent backups run more quickly, especially if the change rate is low.

If a client times out before it is able to back up all the data in the dataset, it will create a partial that is normally preserved for 7 days to serve as a base for subsequent backups. One approach I've used with customers who had similar issues in the past was to cut down the dataset to a size that allowed the backup to complete within the backup window, then gradually add the data back in until the whole client is protected. The advantage of this approach is that it creates complete, restorable backups instead of partials so the 7 day partial retention doesn't apply.

61 Posts

April 4th, 2014 08:00

Our backups were completing, just not in the expected time frame.

223 Posts

April 7th, 2014 00:00

Hi Ian,

You are right, this was the first backup run, and it took 43 hours to complete:

Backed-up 583.9 GB in 2592.26 minutes: 14 GB/hour (151,962 files/hour)


The second run took about 9 hours, which is faster but still too long, I think:

Backed-up 583.3 GB in 527.62 minutes: 66 GB/hour (747,784 files/hour)


Only 0.01% of the bytes were new:

Backup #4 timestamp 2014-04-07 04:47:37, 6,575,710 files, 999,280 folders, 583.3 GB (3,670 files, 46.01 MB, 0.01% new)


I think it has to be the disk I/O on the server, so today we are installing a new physical server with up-to-date hardware, and then over the next few days we will see whether the backup is faster.

61 Posts

April 7th, 2014 13:00

Unfortunately, we are using the local client. We charge back based on the space used on the virtual machine, and when you use the appliances you can't get the used space.

Example: 25GB of space used on a 50GB VM.

If you use the appliance, it will show up as 50GB of space in the database, not 25GB. If you use the client, it will give you an accurate count. From what I understand, this isn't an Avamar limitation but a limitation of the VMware storage APIs.

6 Posts

April 7th, 2014 13:00

Just out of curiosity, were you doing only file-level backup with the client loaded locally, or VMDK with file-level blocks enabled? If you were using the deployable appliance attached to the cluster and LUN, this issue should be dramatically mitigated, as it would read only at the VMDK level, not from the OS using the disk I/O of the VM and its parent SATA pool. Reading virtual machines in this manner should reduce backup times to the 1-1.5 hour range instead of 3-5 hours for a single server.

223 Posts

April 8th, 2014 03:00

Hi,

This is only a filesystem backup on physical servers. Each server has a RAID 6 array of eight 10K SAS disks.

But it is a Windows Server 2003 32-bit system; perhaps it can't handle the caching correctly?

2K Posts

April 8th, 2014 07:00

Backup #4 timestamp 2014-04-07 04:47:37, 6,575,710 files, 999,280 folders, 583.3 GB (3,670 files, 46.01 MB, 0.01% new)

The numbers on the left show the total number of files and folders processed, and the size of the data being backed up. The numbers in the brackets on the right show the number of files with file cache misses and the amount of data that was actually sent to the Avamar server. The cache is working very well here.
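As a rough illustration of that layout, here is a hypothetical parser based only on the status lines quoted in this thread (the field order is inferred from those examples, nothing more):

```python
# Hypothetical parser for the avtar status line quoted above.
# The field layout is inferred only from the examples in this thread.
import re

LINE = ("Backup #4 timestamp 2014-04-07 04:47:37, 6,575,710 files, "
        "999,280 folders, 583.3 GB (3,670 files, 46.01 MB, 0.01% new)")

m = re.search(
    r"([\d,]+) files, ([\d,]+) folders, ([\d.]+ \w+) "
    r"\(([\d,]+) files, ([\d.]+ \w+), ([\d.]+)% new\)",
    LINE,
)
total_files, folders, size, miss_files, new_bytes, pct_new = m.groups()
print(f"scanned {total_files} files ({size}); "
      f"{miss_files} cache misses sent {new_bytes} ({pct_new}% new)")
```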

I think it has to be the disk I/O on the server, so today we are installing a new physical server with up-to-date hardware, and then over the next few days we will see whether the backup is faster.

I agree, this is likely being caused by the I/O performance -- I/O performance is the bottleneck for most Avamar backups. On Windows clients, typical performance for incremental backups is about 1 million files per hour and 100 GB/hr so the performance on this system is below average.

223 Posts

April 17th, 2014 04:00

Hello,

We have new hardware now, with a Windows Server 2008 R2 64-bit OS.

But we still have the performance problem.

I have the following throughput after the first full backup:

Backup #14 timestamp 2014-04-17 10:28:12, 6,467,710 files, 996,147 folders, 547.0 GB (34,093 files, 1.295 GB, 0.24% new)

Backed-up 547.0 GB in 868.19 minutes: 38 GB/hour (446,980 files/hour)

Cache update complete D:\Programme\avs\var\f_cache.dat (352.0MB of 382MB max)
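As a quick sanity check, the throughput figures are at least consistent with each other:

```python
# Quick arithmetic check of the throughput figures in the log above.
minutes = 868.19
gb = 547.0
files = 6_467_710
print(gb / (minutes / 60))     # ~37.8 GB/hour, matching the log's 38
print(files / (minutes / 60))  # ~446,980 files/hour, as reported
```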

Is there anything I can do? To me it looks like there is no cache problem, so it would not help to make changes to f_cache?

I am not sure whether it would help to split the datasets; why should it run faster then?

I have set this client to the "allow overtime always" option, so the backup can finish every day. But we need 14 hours.

2K Posts

April 17th, 2014 06:00

38GB/hour is lower than I would expect, especially for freshly updated hardware.

The cache is in good shape. I don't believe there is any problem there.

Is there some process starving avtar of resources during the backup? CPU, memory, etc.? Perfmon can help you determine this.

There are some other tests you can do to separate the disk performance from the network performance. There is a test called a "degenerate" test that reads data from the disk and processes it as normal but discards the results instead of sending them across the network. If the degenerate test shows poor performance, that would eliminate the network as the possible bottleneck and point towards disk I/O, CPU starvation, etc. The second test is called a "randchunk" test and it generates random chunks of data in memory and sends them to the server. If the randchunk test shows poor performance, that generally points to a network bottleneck.
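As a crude stand-in for the disk half of that comparison, you can approximate the read-and-discard idea yourself. To be clear, this sketch is not the Avamar degenerate test, just the same principle: read everything, send nothing, and see what the disks alone can sustain.

```python
# Crude read-and-discard throughput check: reads every file under a path
# and throws the data away, taking the network out of the picture.
# This only approximates the idea of the "degenerate" test described
# above; it is not the Avamar tool itself.
import os
import time

def read_discard(root, block_size=1 << 20):
    total_bytes = 0
    start = time.time()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                with open(os.path.join(dirpath, name), "rb") as f:
                    while chunk := f.read(block_size):
                        total_bytes += len(chunk)   # read and discard
            except OSError:
                pass                                # skip locked files
    hours = (time.time() - start) / 3600
    print(f"read {total_bytes / 1e9:.1f} GB "
          f"at {total_bytes / 1e9 / hours:.1f} GB/hour")

# Example: read_discard(r"D:\data")
```

If this shows similarly poor throughput, the disks are the bottleneck regardless of what the network can do.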

If you've never run these tests before, I'd recommend getting in touch with support and they can walk you through them.

I do not recommend splitting the dataset unless absolutely necessary. Splitting the dataset can actually hurt performance if it places the underlying disks into contention. If there are multiple independent volumes, it's a viable option but I would explore other avenues first. Parallelism is the nuclear option because of the management overhead and host requirements involved.
