
Avamar: Capacity Management Concepts and Training

Summary: This article covers Avamar User Capacity and operating system capacity management. It is intended for Avamar system administrators and those who monitor the health of an Avamar installation and require a...


Article Content


Symptoms

This article applies to Avamar versions 5.x and later.

For Capacity Management issues which relate to Data Domain devices, see the chapter "Reclaiming storage on a full Data Domain system" in the Avamar and Data Domain System Integration Guide.

Objectives of this article:

  • Summarize the types of data which are stored in the /data* partitions.
  • Introduce the concept of "operating system Capacity" and contrast this with the concept of "User Capacity" (sometimes called "GSAN Capacity.")
  • Explain why Avamar should not be run close to the User Capacity limit.
  • List the factors which contribute to checkpoint overhead.
  • Describe how to monitor data partition utilization.
  • Describe the symptoms experienced if operating system capacity gets out of control.
  • List typical causes of the MSG_ERR_DISKFULL message.
  • Outline the recovery methods used where high operating system capacity is impacting normal system operation.
  • Describe the symptoms experienced if User Capacity exceeds the User capacity limit.
  • Discuss how to recover from a high User Capacity situation.

The KB article assumes that the reader is familiar with the "Managing Capacity" section of the Avamar Operational Best Practices Guide.

Common issues which contribute to, or are symptoms of, excessively high "operating system Capacity" are:

  • Checkpoint validation (HFS Check) is failing.
  • Garbage collection fails to run and reports an MSG_ERR_DISKFULL error.
  • Checkpoint creation failures.

Common symptoms which are closely associated with too high "User Capacity" are:

  • Backups are failing.
  • Failure of incoming replication jobs. 
  • The Administrator interface shows the system in 'Admin' mode during the backup window.

Cause

See Resolution section.

Resolution

How is data stored on the Avamar grid?

Avamar capacity management concerns the data which is stored in the /data* partitions of all Avamar data nodes. This data consists of:

  • Deduplicated backup data
  • RAIN parity data
  • Checkpoint overhead data. 

Both RAIN parity and checkpoint data are layers of redundancy available to Avamar in addition to RAID and Replication. 

Free space in the data partitions is also required in order for maintenance tasks such as garbage collection and asynchronous stripe crunching to run correctly.


Below is a graphical representation of physical storage space available within the data partitions on the Avamar storage nodes.

[Figure: physical storage space available within the data partitions on the Avamar storage nodes]


How is data stored in the data partitions?

In the above diagram, we see a simple representation of how the space is used in the data partitions.

The 100% value on the left is defined to be the total amount of physical space available to the operating system in the data partitions.

If any of the data partitions consume more than 85% of the total space, garbage collection is unable to run.

The 100% User capacity marker (read-only limit) indicates that up to 65% of the total space in the data partition is available for storage of deduplicated data. The space below this 100% User Capacity marker is equivalent to the Server Utilization value which is visible in the Administrator UI. If the amount of deduplicated data that is stored on any data partition on any node reaches 65%, then the Avamar system becomes read-only and refuses further backup data.

We can now understand that, from the Avamar Administrator UI, the user has visibility of space that is consumed by backups but they do not have visibility of space that is consumed in the operating system data partitions.


Why an Avamar system should not be run close to the "User capacity" limit.

The relationship between high "User Capacity" and checkpoint overhead is such that, as a system becomes increasingly full, even small increases in backup data can cause large increases in checkpoint overhead. A full discussion of why this is the case is beyond the scope of this article. The important thing to remember is:

  • The closer an Avamar system is to 100% User capacity, the less operating system capacity is available for checkpoint overhead.

On a full system, as we can see in the diagram above, checkpoint overhead is limited to 20% of the total operating system space in the data partitions.  
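The arithmetic behind these figures can be sketched as follows. This is a simplified linear model that uses only the 65% and 85% values quoted in this article. It is not an Avamar utility, and the function name checkpoint_headroom is an illustrative assumption.

```shell
# Simplified model built from the figures in this article (not an Avamar tool):
#   - 100% User Capacity corresponds to 65% of the data partition space
#   - garbage collection cannot run above 85% of the data partition space
# Given a server utilization percentage, print the share of partition space
# left for checkpoint overhead before the 85% limit is reached.
checkpoint_headroom() {
    awk -v util="$1" 'BEGIN { printf "%.1f\n", 85 - util * 0.65 }'
}

checkpoint_headroom 60    # 60% server utilization, prints 46.0
checkpoint_headroom 100   # system at the read-only limit, prints 20.0
```

Under this model, a system at 60% server utilization has more than twice the room for checkpoint overhead that a full system has, which is why running close to the limit leaves little margin.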

For an Avamar system to run reliably at high levels of "User capacity," it must meet the following criteria:

  • The system must have a low rate of daily changed data (no higher than 1%.)
  • Capacity must be in a steady-state (as described in the Avamar Operational Best Practices Guide.)
  • Maintenance tasks should be completing successfully every day.

If any of these criteria is no longer met, checkpoint overhead can be expected to rise gradually or spike suddenly and cause serious operational issues.


Factors which contribute to checkpoint overhead:

The following factors can cause the checkpoint overhead to increase. 

  • Asynchronous crunching of stripes (enabled by default.)
  • The number of checkpoints stored on the system.
  • Checkpoint validation not completing successfully every day.
  • How empty stripes are when they are reused by the Avamar server (becomes more severe with higher server utilization.)
  • The daily backup change rate.

A system administrator has a certain degree of control over these factors. Configuration of asynchronous crunching is for support only, but Administrators may remove excess checkpoints, investigate checkpoint failures, and influence server utilization and daily data change rate.


How to monitor data partition utilization

The correct way to monitor utilization of the operating system data partition is to use the following Avamar command from the Avamar Utility Node.

For Example:

admin@utilitynode:~/>: avmaint nodelist | grep fs-percent
        fs-percent-full="7.8"
        fs-percent-full="6.3"
        fs-percent-full="6.4"
        fs-percent-full="6.4"
        fs-percent-full="7.6"
        fs-percent-full="6.2"
        fs-percent-full="6.1"
        fs-percent-full="6.6"
        fs-percent-full="7.8"
        fs-percent-full="6.4"
        fs-percent-full="6.5"
        fs-percent-full="6.8"


This output gives you a true reading of the operating system capacity utilization. On a grid where data nodes use a file pool, the Linux "df" command is not meaningful because the stripes are preallocated in the file pool, and many of the stripes might not be in use.
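Because the 85% limit applies to any single data partition, the highest fs-percent-full value is the one that matters. The following is a minimal sketch of how that value could be extracted and compared with the limit. It assumes the attribute appears as fs-percent-full="N" exactly as in the output above. The function names max_fs_percent and gc_limit_check are hypothetical helpers, not Avamar commands.

```shell
# Read `avmaint nodelist` output on stdin and print the highest
# fs-percent-full value found (prints nothing if none is present).
max_fs_percent() {
    awk '
        match($0, /fs-percent-full="[0-9.]+"/) {
            # Skip the 17-character attribute prefix and the closing quote.
            value = substr($0, RSTART + 17, RLENGTH - 18) + 0
            if (!seen || value > max) max = value
            seen = 1
        }
        END { if (seen) print max }
    '
}

# Compare a percentage with the 85% garbage collection limit described
# in this article and print a one-word verdict.
gc_limit_check() {
    awk -v pct="$1" 'BEGIN { print (pct > 85 ? "over-limit" : "ok") }'
}
```

On a utility node, this could be used as: avmaint nodelist | max_fs_percent, with the result then passed to gc_limit_check.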


What happens if Operating System capacity usage gets out of control?

From a user's point of view, the first indication that data partition utilization is out of control occurs when it rises above 85%. 

Garbage collection is no longer able to run and fails with an MSG_ERR_DISKFULL error message. 

  • Here is where misunderstandings often occur. 

The user often interprets the MSG_ERR_DISKFULL message to mean that the system no longer has space for backups. 

This interpretation is not correct. The user usually checks the server utilization value in the Avamar Administrator UI and finds the value to be acceptable, for example 60%.

The user may attempt to delete backups from the Avamar UI's Backup management interface. Even if the User capacity level were high, the deletion of backups would not alleviate the situation since garbage collection is unable to run and remove expired chunks of data from the system. 


Remember: If a system is experiencing both high operating system Capacity and high User Capacity, focus on resolving the high operating system Capacity issue first.

In cases of high operating system capacity utilization, the system may run short of space to create checkpoints. 


What causes the MSG_ERR_DISKFULL message?

The most typical cause is too-high checkpoint overhead. Typical causes of high checkpoint overhead could be:

Checkpoint validation (HFScheck) has failed repeatedly.

  • HFScheck failure has many possible root causes (abrupt cancellation, software failure, and so forth).

The system is running too full and has a high daily data change rate.

  • The system needs more data nodes to handle the data change rate and store the data.
  • The system is configured to back up more data or clients than it was sized for.

Too many checkpoints are being stored (Avamar stores two checkpoints by default, one of which has been validated).

  • Excess checkpoints have been created by the system administrator.
  • Maintenance was recently carried out, but the default checkpoint retentions were not reinstated.

See the following article to help resolve a MSG_ERR_DISKFULL situation.


Actions to investigate and help alleviate high operating system capacity.

  1. Find out when the last successful HFScheck completed.

To perform this, use either the Avamar Administrator or the command line on the Avamar Utility Node.

In the Avamar Administrator, go to Server > Checkpoint Management tab.

Check the most recent date and time that is listed in the Checkpoint Validation column. This should have occurred within the last 24 hours.

Or, using the Avamar Utility Node command line:   

Run the command 'cplist'.

Below is an example of the output. The most recent validated checkpoint that is listed here is dated January 14, 11:14. We can identify it by the flag directly after the 'valid' marker. Depending on the types of HFSchecks set on the system, the flag could be 'rol' or 'hfs'. Here we have a 'rol' (rolling HFScheck). 

admin@utilitynode:~/>: cplist
cp.20110114111419 Fri Jan 14 11:14:19 2011   valid rol ---  nodes   3/3 stripes   1131
cp.20110114194457 Fri Jan 14 19:44:57 2011   valid --- ---  nodes   3/3 stripes   1131
 

If the results show that the latest validated checkpoint is older than 24 hours, find out why.

This could either be because the HFScheck did not run or because it failed.
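The check for the most recent validated checkpoint can be sketched as a small filter over the cplist output. The field positions are assumed from the sample output above (checkpoint tag first, 'valid' in the seventh field, validation flag in the eighth) and may differ on other versions. The function name latest_validated_cp is a hypothetical helper.

```shell
# Read `cplist` output on stdin and print the tag of the newest checkpoint
# that is marked valid and carries a validation flag (rol or hfs).
# Field positions are assumed from the sample output in this article:
#   $1 = checkpoint tag, $7 = "valid", $8 = validation flag.
latest_validated_cp() {
    awk '$7 == "valid" && ($8 == "rol" || $8 == "hfs") {
             # Tags embed a YYYYMMDDHHMMSS timestamp, so string order is time order.
             if ($1 > newest) newest = $1
         }
         END { if (newest != "") print newest }'
}
```

On a utility node, this could be used as: cplist | latest_validated_cp. The timestamp in the printed tag can then be compared with the current time to see whether it is older than 24 hours.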

  2. Confirm whether HFScheck ran or failed.

On the Avamar Utility Node, run 'status.dpn' and find the line which contains "Last hfscheck."

For Example:    

Last hfscheck: finished Sat Jan 15, 11:07:17 2011 after 06m 41s >> checked 528 of 528 stripes (OK)

Make a note of when it finished and what the status was (in the line above the status is shown as 'OK').
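For scripted monitoring, the status could be pulled out of that line as sketched below. The pattern assumes the line layout shown in the example above, with the result in parentheses at the end of the line. The function name last_hfscheck_result is a hypothetical helper.

```shell
# Read `status.dpn` output on stdin and print the result shown in
# parentheses at the end of the "Last hfscheck" line, for example OK.
# The line layout is assumed from the sample in this article.
last_hfscheck_result() {
    sed -n 's/^.*Last hfscheck:.*(\([^()]*\))[[:space:]]*$/\1/p'
}
```

On a utility node, this could be used as: status.dpn | last_hfscheck_result. Any value other than OK, or no output at all, would be a reason to investigate further.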

Note: The 'sched.sh' script can also be used to identify when an HFScheck last ran and whether it was successful.

 If HFScheck jobs have been failing, this should be investigated immediately.

 If HFScheck has not run lately, confirm whether maintenance tasks are enabled. From the command-line interface on the Avamar Utility Node, enter 'dpnctl status':

admin@utilitynode:~/>: dpnctl status
Identity added: /home/admin/.ssh/dpnid (/home/admin/.ssh/dpnid)
dpnctl: INFO: gsan status: ready
dpnctl: INFO: MCS status: up.
dpnctl: INFO: EMS status: up.
dpnctl: INFO: Backup scheduler status: up.
dpnctl: INFO: dtlt status: up.
dpnctl: INFO: Maintenance windows scheduler status: enabled.
dpnctl: INFO: Maintenance cron jobs status: enabled.
dpnctl: INFO: Unattended startup status: disabled.
 

If the maintenance windows scheduler is 'disabled', it can be enabled with the command:

dpnctl start maint
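The scheduler check can also be scripted. The sketch below assumes the 'Maintenance windows scheduler status' line has the wording shown in the dpnctl status output above. The function name maint_scheduler_state is a hypothetical helper.

```shell
# Read `dpnctl status` output on stdin and print the state reported for
# the maintenance windows scheduler (for example "enabled" or "disabled").
# The line wording is assumed from the sample output in this article.
maint_scheduler_state() {
    sed -n 's/^.*Maintenance windows scheduler status: \([A-Za-z]*\).*$/\1/p'
}
```

On a utility node, this could be used as: dpnctl status 2>&1 | maint_scheduler_state. The redirect is a precaution in case the status lines are written to standard error.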

Once HFScheck completes successfully and the oldest checkpoint is 'rolled off' the system, operating system capacity should reduce considerably.

If operating system capacity is still too high and garbage collection continues to fail with the MSG_ERR_DISKFULL message, then EMC Support's assistance may be required. 

Otherwise, if operating system capacity is low enough to allow garbage collection, then work on lowering "User Capacity" and bring the "server utilization" figure down.

 

Actions to alleviate high User Capacity

Unlike operating system Capacity, User Capacity levels are more easily and directly influenced by the Avamar system administrator.

a. Ensure that garbage collection is running every day and that it does not get interrupted by backups.

This is the most crucial point as even an adequately sized system quickly experiences high User Capacity if garbage collection does not run regularly or reliably.

As shown earlier, confirm that the maintenance window is enabled and use the capacity.sh and sched.sh scripts to verify that garbage collection is running, that it is removing data, and that no backup and replication jobs are running when the blackout window commences.

From Avamar v5 SP1, garbage collection is given priority over backups if the system is starting to get full. When garbage collection starts, if the system is fuller than the warning value (~77% server utilization), any backups which are running are gracefully canceled.


From Avamar v7, garbage collection can run concurrently with backups.

b. Stop adding new clients to the grid.

Once an Avamar system is approaching capacity, we should immediately stop adding new clients to prevent the situation from worsening.

If you have another Avamar grid which is running at a lower level of server utilization, consider adding new clients to that grid instead of the server which is becoming full.

c. Learn which clients are consuming the most storage space.

To address a capacity issue, we should identify which clients are responsible for adding the most data to the Avamar system. 

The Enterprise Manager interface provides reports which return this information. See the Avamar Administrator guide "Server and Client Average Daily Change Rates" section for instructions on how to access these reports.


The capacity.sh script (run from the Avamar Utility Node command line) can also be used to identify which clients have the highest change rate.

Only registered Dell Customers can access the content on the following link, using Dell.com/support.   
See KB 60149: How to use capacity.sh script to understand daily data changes on an Avamar system for more detail on how to use the capacity.sh script.

It is often found that the 'hungriest' clients are those which back up SQL databases or email servers, so pay particular attention to these.

d. Reassess retention policies.

After identifying high change rate clients, reassess retention policies to see if any can be lowered in order to reduce storage requirements to an acceptable level.

Note: It is recommended that retention policies be set to at least 14 days.

If the system is old enough to have started expiring the longest retained backups, then after reducing retention policies, we would expect to see an increase in the amount of data removed each day by garbage collection. Monitor this trend with capacity.sh.

If the Avamar system is not yet old enough to have started expiring backups, then the retention policies may need altering so that the oldest backups now start to expire.

If it is not possible to reduce retention policies due to regulatory requirements, consider expanding the Avamar system or migrating clients to another, less heavily used Avamar system.

e. Migrate clients to an alternative Avamar system.

If another Avamar system is available, consider migrating large or high change rate clients from the more heavily used system to the less heavily used one by using the Avamar Client Manager interface.

Note: 

  • The new Avamar server requires sufficient storage for the Avamar clients you want to migrate.
  • Keep clients with similar types of data on the same Avamar system to take advantage of deduplication efficiencies.
  • This strategy is best used where the Avamar systems are on the same local area network.

f. Delete old backups.

If the User capacity level is severe (>90%), it may be necessary to expire old backups, either through the Backup Management interface or with the modify-snapups tool.

See KB 58216: Avamar Capacity Management: How to delete or expire backups in bulk with the "modify-snapups" tool  

Deleting backups does not immediately lower the server utilization level. Instead, it allows garbage collection to start removing the data the next time it runs. Deleting old backups is a short-term workaround, because the deleted backups are replaced by new ones over the coming days. If backups are deleted, it is essential to also tune retention policies.

g. Increase the Blackout Window.

Now that we have deleted backups and reduced retention policies we can let garbage collection start to free up space on the system.

To give garbage collection enough time to remove data, the blackout window should be increased from its default length. How long to set the blackout window depends on various factors, but consider increasing it for at least the first few days.

h. Monitor data change using capacity.sh.

After backups have been deleted and retention policies changed, closely monitor the amount of data change on the system using the capacity.sh script. The "Removed" value should start to increase and the "Net Change" value should become negative. Eventually, as the excess data is cleared off the system, the "Removed" value starts to return to more normal levels. Once this occurs, you may gradually reduce the length of the blackout window. Continue to monitor the "Removed" value.

If the "Net Change" value does not become negative, check the garbage collection log to see how long garbage collection is running and how much work it is achieving within the blackout window.
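The relationship between these values can be illustrated with a short calculation, assuming that net change is simply the data added by backups minus the data removed by garbage collection. The input layout used here is hypothetical and is not the capacity.sh output format. The function name net_change is an illustrative assumption.

```shell
# Input on stdin: one line per day with three whitespace-separated fields,
#   <date> <GB added by backups> <GB removed by garbage collection>
# This layout is hypothetical and is NOT the capacity.sh output format.
# Prints each day's net change (added minus removed) and a final total.
net_change() {
    awk '{
             net = $2 - $3
             total += net
             printf "%s %+.1f\n", $1, net
         }
         END { printf "total %+.1f\n", total }'
}
```

A negative total over several days indicates that garbage collection is removing more data than backups are adding, which is the trend expected after backups have been deleted and retention policies reduced.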

See KB 60149: How to use capacity.sh script to understand daily data changes on an Avamar system

i. Expand the Avamar system.

High utilization on the Avamar system is often due to natural and expected data growth. To continue production backups, more space must be made available.

How this can be done depends on the type of Avamar system.

  •  Single node systems and Avamar Virtual Edition (AVE) systems

These cannot be expanded. Commission a second, larger Avamar system, and request EMC Professional Services to perform a system migration from the smaller to the larger system. Professional Services can be engaged through the EMC Account manager.

The new system may be a single node, AVE, or multinode system, provided that it offers more storage space than the source.

  • Multinode systems

These systems can be expanded up to 16 data nodes. Contact the EMC account manager for details. 
Node additions are not performed by regular support channels, so an SR should not be opened to request this work.

  • Integrate Data Domain

Integrating a Data Domain system as a back-end storage device is a useful way to expand the capacity available to clients which back up to Avamar. Discuss options with your EMC Account Manager.

Additional Information

Useful Tools

  • status.dpn
  • capacity.sh
  • Avalanche
  • DPN Summary Report
  • replcnt.sh
  • Avamar Client Manager

Best Practices:

  • Try to prevent the Avamar Server utilization (User Capacity) value rising higher than 80%.
  • Lower User Capacity provides resilience against unexpected changes in the amount of data added and can protect against the system becoming unusable if unexpected failures or short-term issues with maintenance tasks occur.
  • An Avamar system running above 80% User Capacity requires more diligent monitoring by the system administrator to ensure that maintenance tasks complete successfully and that the system does not become read-only.
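
The utilization thresholds mentioned in this article can be collected into a simple classifier, sketched below. The 80%, 90%, and 100% values come from the text above. The function name utilization_guidance and the wording of the messages are illustrative assumptions, not output of any Avamar tool.

```shell
# Map a server utilization (User Capacity) percentage to the guidance
# given in this article. Thresholds: 80% (best-practice ceiling),
# 90% (described as severe), 100% (read-only limit).
utilization_guidance() {
    awk -v pct="$1" 'BEGIN {
        if (pct >= 100)     print "read-only: system refuses further backup data"
        else if (pct > 90)  print "severe: expiring old backups may be required"
        else if (pct > 80)  print "elevated: monitor maintenance tasks closely"
        else                print "within best-practice range"
    }'
}
```

The input would be the Server Utilization value read from the Administrator UI, for example: utilization_guidance 85.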

Article Properties

Affected Product: Avamar
Last Published Date: 04 Aug 2021
Version: 5
Article Type: Solution