March 9th, 2012 07:00

Ask the Expert: Avamar Server management best practices

Troubleshooting Avamar Server management & implementing best practices with ianderson

 

Welcome to the EMC Support Community Ask the Expert conversation. This is an opportunity to learn how to troubleshoot Avamar Server management.

 

 

Ian Anderson started with EMC in 2009 as a member of the NetWorker support team, moving into Avamar support that same year. Ian was privileged to join the Avamar Engineering team in 2011 as a specialist in Avamar Server training, troubleshooting and repair. He has travelled to a number of EMC centers, delivering training and mentorship on the Avamar product to the global Avamar team. Ian is an EMC Proven Professional and a Microsoft Certified Professional. Ian is on Twitter - @ionthegeek.

 

This event is now concluded.

A summary document of the main points of conversation in this discussion can be viewed here:

 

Summary - Ask the Expert: Avamar Server management best practices (2012-04-06)

 

 

 

666 Posts

March 20th, 2012 01:00

This discussion is now open for questions to Ian. We look forward to the conversation.

Regards,

Mark

2K Posts

March 20th, 2012 12:00

To get us started, I thought I would talk a little bit about "Avamar Zen".

Steady State

Avamar operates best in "steady state" -- that is, the amount of data being removed from the system by garbage collection each day is as much or more than the amount of data being ingested during backups each day.

Capacity Issues

One of the most common issues the Avamar support team sees generally starts with a panicked call -- "My Avamar is full! I need to get backups tonight!"

There are a variety of reasons this can happen. Support will help you, the customer, to get the system back into a state where backups can run. From there the case will go down one of two paths.

The Sad Path

The first path is a path of terrible suffering and pain.

If the current capacity issue is addressed (usually by removing data and running garbage collection) without making changes to the data ingest rate or the data expiry rate, the system will run for some time before it inevitably becomes full again. You will find yourself back on the phone with support in a few days. Or a few weeks. Or a few months.

There may be checkpoint overhead issues. There may be garbage collection failures. There will almost certainly be backup failures.

This will happen over... and over... and over again. You will be frustrated. Support will be frustrated. After a while, management on both sides will be frustrated.

The Happy Path

The second path leads to Avamar nirvana.

I'll copy the Wikipedia definition of Nirvana so you'll have it in front of you:

[Nirvana] ... refers to release from a state of suffering after ... period of committed spiritual practice.

When the data ingest rate and the data removal rate on a system are in balance and the capacity is monitored regularly, maintenance will run as scheduled, backups will run as scheduled, and the capacity will gradually stabilize.

Tools

There are a number of tools that can help you understand and manage your Avamar server's capacity. Here is a brief overview of the most customer-friendly ones:

Enterprise Manager

Inside the enterprise manager, there are graphs showing capacity history and forecast. This is a good way to review the capacity history at a glance. The graphs themselves are fairly self-explanatory so I won't spend a lot of time on them.

DPN Summary

There is a report built into the Avamar software called "Activities - DPN Summary". This report will tell you on a backup-by-backup basis how much new data is being sent to the server by a client.

To generate the report on your own Avamar system:

  1. Open the Avamar Administrator GUI and log into the grid
  2. Select Tools => Manage Reports...
  3. Scroll down until you find "Activities - DPN Summary" and select it
  4. Click the Run button
  5. Select the appropriate date range (be aware that very large ranges could cause the GUI to stop responding for some time) and then click Retrieve
  6. Click the Export button and you can save the report as a file in comma separated values (CSV) format so it can be imported into spreadsheet software for easier analysis

The columns that are likely to be of most interest will be columns I (as in India) through M (as in Mike). These columns are, respectively:

I - ModReduced - Bytes saved by using compression

J - ModNotSent - Bytes present on the Avamar server but not in the client caches

K - ModSent - New bytes added to the server by the backup

L - TotalBytes - The total size of the data being protected (whether or not we had to send it)

M - PcntCommon - The percentage of data for the backup that is already on the grid (higher is better)

Column K in particular is useful for measuring capacity and capacity growth. Using the ModSent information for each backup still present on the grid and the size of the initial backup for the client, you can do a rough "back of the envelope" calculation of how much space that client is consuming on the grid.

One quick caveat - the DPN Summary is a report, not a status, which means it includes information for backups that may have already expired from the grid.
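Once the report is exported to CSV, the "back of the envelope" math can be scripted. Here's a minimal sketch; the column names (`Client`, `ModSent`) are assumptions based on the columns described above, so check them against the headers in your actual export:

```python
import csv
from collections import defaultdict

def modsent_by_client(csv_path):
    """Sum the ModSent (new bytes added) per client from an exported
    DPN Summary CSV, as a rough estimate of per-client space usage."""
    totals = defaultdict(int)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Column names are assumed; adjust to match your export.
            totals[row["Client"]] += int(row["ModSent"])
    return dict(totals)
```

Remember the caveat above: rows for backups that have already expired should be excluded before summing, or the estimate will run high.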

capacity.sh

Using capacity.sh requires you to log into the utility node of the grid using SSH. If you don't know how to do this already, this option is probably not for you.

The capacity.sh script is shipped as part of the Avamar base install. It's a shell script that analyzes the ingest data and garbage collect data for a system and produces an ASCII report showing the daily ingest for the last 14 days (by default - use --days=n to specify a number of days), the daily garbage collect performance and the net change.

The report will also show the highest change rate clients on the system (in other words, the clients using the most storage after de-dupe).

To run it, log into the utility node as the admin user and type "capacity.sh" at the prompt.

You'll get back output that looks something like this:

admin@testgrid01:~/ija/>: ./capacity.sh

Date          New Data #BU       Removed #GC    Net Change
----------  ---------- -----  ---------- -----  ----------
2012-03-06     4888 mb 6           -1 mb 4         4887 mb
2012-03-07     1232 mb 9            0 mb           1232 mb
2012-03-08    63902 mb 9           -2 mb 4        63900 mb
2012-03-12     1158 mb 4            0 mb           1158 mb
2012-03-13      497 mb 7           -1 mb 1          496 mb
2012-03-14     1661 mb 8           -1 mb 1         1660 mb
2012-03-15     4772 mb 10          -1 mb 1         4771 mb
2012-03-16      781 mb 8         -268 mb 1          513 mb
2012-03-17      701 mb 9            0 mb 1          701 mb
2012-03-18      369 mb 7            0 mb 1          369 mb
2012-03-19      503 mb 9            0 mb 1          503 mb
2012-03-20     1630 mb 7            0 mb           1630 mb
----------  ---------- -----  ---------- -----  ----------
Average        6841 mb            -22 mb           6818 mb

Top 5 Capacity Clients        Added  % of Total   ChgRate
----------------------  ------------  ---------- ---------
  client1                   68405 mb       83.3%    3.022%
  client2                    4571 mb        5.6%    1.851%
  client3                    3844 mb        4.7%    2.592%
  client4                    3062 mb        3.7%    1.914%
  client5                    1738 mb        2.1%    0.128%
Total for all clients       82100 mb      100.0%    0.016%

From the output, it's very easy to see which direction capacity is moving. On this particular grid, we are adding much more data than we are removing. I would be worried if I didn't know that the system is a testing grid that is less than 5% full.

If capacity utilization is increasing day over day even after data has started expiring from the grid, the system will sooner or later fill up no matter how many storage nodes are added. The capacity.sh script is a very good way to show this trend.
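The projection behind that warning is simple arithmetic. As a rough illustration (this is not part of capacity.sh itself), you can estimate days-to-full from the average daily ingest and removal figures in the report:

```python
def days_until_full(free_bytes, daily_ingest, daily_removed):
    """Rough projection of days until the grid fills, given average
    daily ingest and garbage-collection removal (same units as free_bytes).
    Returns None if the system is in steady state (net change <= 0)."""
    net = daily_ingest - daily_removed
    if net <= 0:
        return None  # removal keeps up with ingest; capacity is stable
    return free_bytes / net

# Example using the averages above (units in MB):
# 2 TB free, ingesting 6841 MB/day, removing 22 MB/day
remaining = days_until_full(2 * 1024 * 1024, 6841, 22)
```

This is only a straight-line estimate; real ingest is bursty (note the 63902 mb day in the report), so treat the result as a trend indicator, not a deadline.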

Long-Term Capacity Management

If your Avamar system is not in steady state even after all your clients begin expiring backups, there are really only two long term options:

  1. Back up less
  2. Expire more

For the first option, there are different approaches you could take. If there is spare capacity available on another grid, clients can be moved. If there are high change rate clients consuming large amounts of your capacity, it might be better to move those clients off to Data Domain. If there are non-critical clients, they could be backed up less frequently (or not at all). There may be items such as temp files that should be excluded from the datasets to avoid backing up high change, low value data from each client.

For the second option, it's a good idea to periodically review retention practices. Do you need all of the data that has been backed up? Is it still valuable? One other important consideration when deleting items from the Avamar is that de-duplication is a double-edged sword. Be sure to take a look at the DPN Summary report when deleting backups. No matter what the GUI says about the size of the backup being deleted, deleting individual backups will only reclaim roughly the amount of space listed in the "ModSent" column of the DPN Summary. The overall size of the backup might be 500GB but if there are only 2MB of unique data, you will only regain 2MB of space.
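That last point is worth making concrete. A hypothetical sketch, assuming you've pulled `TotalBytes` and `ModSent` per backup from the DPN Summary report:

```python
def estimated_reclaim(backups):
    """Estimate space actually reclaimed by deleting a set of backups.

    Each backup is a dict with 'total_bytes' (the size the GUI shows)
    and 'mod_sent' (unique bytes, from the DPN Summary report).
    Only the unique bytes come back when a backup is deleted."""
    return sum(b["mod_sent"] for b in backups)

# A 500 GB backup with only 2 MB of unique data reclaims roughly 2 MB:
backups = [{"total_bytes": 500 * 2**30, "mod_sent": 2 * 2**20}]
```

In practice deleting the *last* backup referencing a chunk is what frees it, so even this is an approximation; but it is far closer to reality than the GUI's backup size.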

Those are the basics of capacity management on Avamar. I look forward to your questions!

223 Posts

March 27th, 2012 07:00

Hello, is there any way to get a kind of dedup ratio for the whole Avamar grid? For example, Data Domain systems have this solved: there is a dedup ratio right on the start page. This is what many customers ask for. They don't like to calculate the ratio from the DPN Summary output.

2K Posts

March 27th, 2012 11:00

Unfortunately there's no quick and easy way to get the dedup ratio for a whole grid. I've filed a Request For Enhancement (RFE) to request that a future version of the Enterprise Manager (EM) report the global dedup ratio for each grid.

In the meantime, it's possible to set up ODBC on a Windows system so that it can query the SQL Views in the Avamar Administrator Server database. The DPN Summary report is based on information found in the v_dpnsummary view, so with an ODBC connection configured you could use database software, a spreadsheet, a reporting package, etc. to run an automatic calculation or report based on the information in v_dpnsummary. It's not the most straightforward option but it would work.
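With the v_dpnsummary data in hand (via ODBC, or the exported CSV report), a grid-wide ratio could be approximated as total protected bytes divided by new bytes actually stored. This is an illustrative calculation, not an official Avamar metric, and the field names below are assumptions:

```python
def dedup_ratio(rows):
    """Approximate a grid-wide dedup ratio from DPN Summary rows.

    Each row needs 'total_bytes' (data protected by the backup) and
    'mod_sent' (new bytes actually written to the grid)."""
    protected = sum(r["total_bytes"] for r in rows)
    stored = sum(r["mod_sent"] for r in rows)
    if stored == 0:
        return float("inf")  # every byte deduplicated away
    return protected / stored
```

As with any DPN Summary math, rows for expired backups will skew the result, so filter to backups still on the grid first.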

March 28th, 2012 07:00

Ian, a two-part question for you:

1:

when we activate a client by right-clicking on the avamar client icon in the toolbar and saying “Manage, Activate”, the Avamar server adds a suffix like ".our-org.com".

So the client name will be something like "websrv01.our-org.com".

We have many avamar servers, and they are not set up identically, so today, we have many different suffixes: "our-org.com", "ent.our-org.com", no-suffix, etc.

On the other hand, when we activate a node using the MCCLI command line, we can give it any name we want, including no suffix at all.  No-suffix seems to work fine, backups work fine, Avamar has the full DNS name in its database somewhere - all is good.

But we'd like to get our client names consistent, if we can.

Is there any reason NOT to just use (for example) "websrv01" instead of "websrv01.our-org.com"?

I would guess that if there WERE a problem, the client would fail to activate.

Right?

2:

We can rename a client using, for example, "mccli client edit --name=websrv01.our-org.com --new-name=websrv01".

And if we do, do its old backups get orphaned? So would we have to rename them, too?

Thanks for your advice here.

2K Posts

March 28th, 2012 09:00

For the first question, we recommend using the fully qualified name instead of the short name because by default, Avamar will not allow two clients with the same name to be activated to the same grid. This could be an issue if you have, for example, Windows Domain Controllers called "dc1.west.our-org.com" and "dc1.east.our-org.com". One of these clients would fail to activate because of the name conflict.

The answer to the second question requires a bit of background first.

When explaining how client accounts work, I always like to use a bucket as a metaphor for a client account. Inside the bucket, you will find all the backups for that client. Engraved into the handle of the bucket is a unique identifier called the Client ID or CID. There is also a label on the bucket (the hostname) but that's only used for two things:

  1. We use the hostname as a human-friendly identifier because humans aren't very good at remembering 20-byte hexadecimal strings (funny, that).
  2. We compare the hostname and CID whenever an activated client checks in. If the hostname and CID do not match what we have recorded, we will not issue backup or restore workorders to that client. This is to prevent a rogue client from stealing the hostname of an activated client and impersonating it.

So using the bucket analogy, changing the label on the bucket (the hostname) will not affect the contents of that bucket (the backups).

I ran some quick testing of this and as long as the client name you use as "new-name" is the short name or any of the fully qualified names associated with the client, you shouldn't have any issues. You do have to be careful with this, however. If you make a typo in the new-name, the Avamar Backup Agent will not be able to process workorders because the mapping between the hostname and CID on the client will not match the mapping between the hostname and CID on the Avamar server.

I hope this helps!

2K Posts

March 28th, 2012 11:00

Validating a backup is essentially doing a restore and discarding the results. It will prove that the backup is consistent (in other words, the backup is on the server and all the bits that were backed up can be restored) but it doesn't tell us anything about the content of that backup. For example, on Windows clients, if VSS is not functioning correctly, any open files on the client will not be backed up. This backup will be valid (since it is consistent and restorable) but if you were trying to restore one of those open files, you would not be able to do so.

When reviewing the activities, keep in mind that each plugin has a specific purpose. If the filesystem backup succeeds but the VSS backup fails, it will not be possible to perform a bare metal recovery of this system but any data backed up as part of the file system backup will be available for restore. Similarly, if your DB2 backups succeed but your filesystem backups fail, you will be able to recover the databases but not the file system. Naturally, this applies to any plug-in backups.

There are some reports built into the Avamar Administrator that you may find useful for determining if a client is fully protected or not. In particular, take a look at the "Activities - Exceptions", "Activities - Failed" and "Client - No Activities" reports. The two "Activities" reports are pretty self-explanatory (they report on any clients that complete with exceptions or fail). The "Client - No Activities" report will give you information on any client that is not running backups.

It is also possible to create your own reports using the GUI, though the options for customization are somewhat limited.

You can configure the system to send built-in or custom reports to you by e-mail on a daily basis. If you find that these reports are not sufficient for your needs, you could use the ODBC connectivity I mentioned above to "roll your own" or you could use a reporting package such as Data Protection Advisor to generate more in-depth reports "out of the box".

March 28th, 2012 11:00

How do we ensure that we have a good backup with Avamar?

We can try using "mccli backup validate..."; but that runs a long time, and I don't see where the results go.

We can look at the "Last Successful Backup Date" from "mccli client show...", but that apparently reports success even if the backup had exceptions.

We can look at activities, and see if they Completed, Completed With Exceptions, etc.  But do we have to look for a success on each plugin, to be sure?  For example, Windows filesystem and Windows VSS have to both succeed, right?

Just looking for a straightforward way to say "Yes, we're OK on server xxx".

Thanks.

March 30th, 2012 06:00

Ian, thanks for your knowledgeable and well-stated responses.

Here's a new question:

I’m trying to back up my laptop. I get thousands of errors like this:

      2012-03-30 08:37:20 avtar Error <5137>: Unable to open "C:\temp.txt" (code 5: Access is denied).

I can open these files myself.

The Backup Agent service is running as “Local System Account”.

Why can’t it open these files (thousands of them - possibly all files on the machine)?  Any ideas?

Also, the GUI reports Success even though thousands, possibly all, files failed.  That doesn't seem right.  Is there a flag I can set to prevent bogus "Success" results?

2K Posts

March 30th, 2012 08:00

My pleasure!

I would recommend reviewing the NTFS permissions for these files. It's possible somebody has deleted the SYSTEM account from the ACL or removed the account's read permissions for these files. By default, the SYSTEM account is granted Full Control for all files on a system and Microsoft recommends that this not be changed.

I've also seen this message if the VSS snapshot fails for some reason (for example if the snapshot is removed out from under the running backup). See if there are any SnapVol errors in the Windows event logs.

Are the backups being marked "Completed" or "Completed w/exceptions"?

March 30th, 2012 08:00

For C:/Temp.txt, SYSTEM has Full control.

I see no Snapvol events in Application, Security, or System event logs.

Avamar Client,  "Backup...", History says "Completed Successfully", no indication of exceptions.

Avamar Client, Manage, View console  says "Completed (45055 errors)";

Avamar administrator, Log on to Server, Activities, says "Completed w exceptions".

So I still don't see why these files are not readable. 

Is a Windows service involved? Can I bypass it?  I tried cutting and pasting the "avtar.exe ..." command line from the ..avs/var/clientlogs/... into a command shell to try and run the backup as myself, not through the service.  But I got:

C:\Program Files\avs\bin>avtar --sysdir="C:\Program Files\avs\etc" --bindir="C:\Program Files\avs\bin" --vardir="C:\Program Files\avs\var" --ctlcallport=1706 --ctlinterface="3001-Windows-Windows-Test -1333110692426" --logfile="C:\Program Files\avs\var\clientlogs\lm-cmdlinetest.log" --sessionattr=dtlt=true

avtar Info <5241>: Logging to C:\Program Files\avs\var\clientlogs\lm-cmdlinetest.log
avtar Info <5551>: Command Line: avtar --sysdir="C:\Program Files\avs\etc" --bindir="C:\Program Files\avs\bin" --vardir="C:\Program Files\avs\var" --ctlcallport=1706 --ctlinterface="3001-Windows-Windows-Test -1333110692426" --logfile="C:\Program Files\avs\var\clientlogs\lm-cmdlinetest.log" --sessionattr=dtlt=true
avtar FATAL <10790>: Unable to connect to 127.0.0.1:1706 with proprietary encryption
avtar Info <9901>: Cancel Request being processed (setting code from 0 to 536870920)

Appreciate any advice.  Guess I could open a ticket with EMC...

2K Posts

March 30th, 2012 09:00

It is possible to run avtar manually but you can't copy and paste the command line from the log because avtar commands that are run as part of scheduled backups use the "CTL" interface to communicate with their caller to retrieve the information required to run the backup (targets, options, etc.) and return status messages. For file system backups the caller will be the "Avamar Backup Agent" or "Backup Agent" service (internally we call it "avagent" which is the name of the binary). Since avagent didn't start this avtar process, it won't be listening for replies and you will receive the FATAL you've pasted above.

I think it would be best to open a service request for this issue. Support can do an in-depth analysis of the logs or work with you live via WebEx and you're likely to get a faster resolution this way.

If you speak with L2 support, they can show you how to run a "degenerate" test that will process the filesystem but discard the results instead of sending them to the server. Such a test is normally used to isolate performance bottlenecks (it measures how fast avtar can read the filesystem since it doesn't have to wait for replies to come back from the server) but it would also be useful for this type of troubleshooting since it would allow you to keep your tests local to the client.

March 30th, 2012 12:00

We have one server with 25 million files, scattered through directories six levels deep.

We'd like to throw it at our test Avamar grid; any tuning I should look at on the client (or server) side before we set it up for its first backup?

March 30th, 2012 12:00

We plan to migrate a few thousand servers to several Avamar grids.  But we don't want to throw too many first-time on-demand backups at an Avamar grid all at once.

How many is too many?  Or is there any reason not to queue up on-demand backups for 1,000 different servers and wait for them to complete?

(Obviously we don't want to throw more data at the grid than it can hold after being de-duped.   We think we have that part figured out.)

Another way of saying it:

We can monitor various things:

  total client GB that will be ingested,

  number of sessions in use on the Avamar server,

  disk IO,

  CPU utilization,

etc.  Is there a metric on one of these that we should be careful not to exceed? 

2K Posts

April 2nd, 2012 07:00

lmorris99 wrote:

We have one server with 25 million files, scattered through directories six levels deep.

We'd like to throw it at our test Avamar grid; any tuning I should look at on the client (or server) side before we set it up for its first backup?

The most important thing to do on a client with so many files is to make sure that the file cache is sized appropriately. The file cache is responsible for the vast majority (>90%) of the performance of the Avamar client. If there's a file cache miss, the client has to go and thrash your disk for a while chunking up a file that may already be on the server.

So how to tune the file cache size?

The file cache starts at 22MB in size and doubles in size each time it grows. Each file on a client will use 44 bytes of space in the file cache (two SHA-1 hashes consuming 20 bytes each and 4 bytes of metadata). For 25 million files, the client will generate just over 1GB of cache data.

Doubling from 22MB, we get a minimum required cache size of:

22MB => 44MB => 88MB => 176MB => 352MB => 704MB => 1408MB
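The sizing arithmetic above is easy to script. A small sketch using the figures Ian gives (44 bytes per file, doubling from 22 MB):

```python
def required_file_cache_mb(num_files, bytes_per_file=44, start_mb=22):
    """Smallest cache size (in MB) reachable by doubling from 22 MB
    that can hold num_files * 44 bytes of cache entries.

    Each file costs 44 bytes: two 20-byte SHA-1 hashes plus 4 bytes
    of metadata."""
    needed_mb = num_files * bytes_per_file / 2**20
    size = start_mb
    while size < needed_mb:
        size *= 2
    return size

# 25 million files -> ~1049 MB of entries -> a 1408 MB cache
```

For 25 million files this lands on 1408 MB, matching the doubling chain above.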

The naive approach would be to set the filecachemax in the dataset to 1500. However, unless you have an awful lot of memory, you probably don't want to do that since the file cache must stay loaded in memory for the entire run of the backup.

Fortunately there is a feature called "cache prefixing" that can be used to set up a unique pair of cache files for a specific dataset. Since there are so many files, you will likely want to work with support to set up cache prefixing for this client and break the dataset up into more manageable pieces.

One quick word of warning -- as the saying goes, if you have a hammer, everything starts to look like a nail. Cache prefixing is the right tool for this job because of the large dataset but it shouldn't be the first thing you reach for whenever there is client performance tuning to be done.

On to the initial backup.

If you plan to have this client run overtime during its initial backup, you will have to make sure that there is enough free capacity on the server to allow garbage collection to be skipped for a few days while the initial backup completes.

If there is not enough free space on the server, the client will have to be allowed to time out each day and create partials. Make sure the backup schedule associated with the client is configured to end no later than the start of the blackout window. If a running backup is killed by garbage collection, no partial will be created.

You will probably want to start with a small dataset (one that will complete within a few days) and gradually increase the size of the dataset (or add more datasets if using cache prefixing) to get more new data written to the server each day. The reason for this is that partial backups will only be retained on the server for 7 days. Unless a backup completes successfully within 7 days of the first partial, any progress made by the backup will be lost when the first partial expires.

After the initial backup completes, typical filesystem backup performance for an Avamar client is about 1 million files per hour. You will likely have to do some tuning to get this client to complete on a regular basis, even doing incrementals. The speed of an incremental Avamar backup is generally limited by the disk performance of the client itself but it's important to run some performance testing to isolate the bottleneck before taking corrective action. If we're being limited by the network performance, obviously we don't want to try to tweak disk performance first.

The support team L2s from the client teams have a good deal of experience with performance tuning and can work with you to run some testing. The tests that are normally run are:

  • An iperf test to measure raw network throughput between client and server
  • A "randchunk" test, which generates a set of random chunks and sends them to the grid in order to test network backup performance
  • A "degenerate" test which, as I mentioned previously, processes the filesystem and discards the results in order to measure disk I/O performance
  • OS performance monitoring to ensure we are not being bottlenecked by system resource availability (CPU cycles, memory, etc.)

Edit -- 2013-08-06: The behaviour for partial backups changed in Avamar 6.1. More information in the following forums thread:

Re: Garbage collection does not reclaim expected amount of space
