Avamar GSAN or User Capacity Resolution Path
Summary: This Resolution Path article is for handling and troubleshooting GSAN Capacity (aka User Capacity) issues on Avamar.
Symptoms
For initial concepts and a general understanding of Avamar capacity, see Avamar: Capacity General Training - Resolution Path
It is often easiest to think of GSAN Capacity as the space and utilization used for client backups. Before troubleshooting, the following concepts should be understood:
- Basic understanding of deduplication
- Basic understanding of checkpoints, checkpoint validation (hfscheck), and Garbage Collection, and the importance of each
- The difference between GSAN (or User) Capacity and OS Capacity
- Change Rate
- Steady State
Symptoms of GSAN Capacity issues can include:
- Backup or Replication failures when the grid access state has changed to "admin mode"
- A client backup job failing with a message similar to: "avtar Info <5314>: Command failed (1 error, exit code 10028: Operation failed because target server is full)"
- The automatic disabling of the backup scheduler (until acknowledged and cleared by an administrator)
- 80% - Capacity Warning
- 95% - Health Check Limit is reached (this can sometimes disable the backup scheduler, at least until manually acknowledged)
- 100% - Server Read-Only Limit is reached (the grid goes into admin mode)
Cause
GSAN capacity "deduplicates" backup data, meaning when certain bytes or chunks of data are similar, it is only required to store that chunk once. Any data can be "deduplicated" against any other data from the same or different clients backed up on the Avamar grid. As these chunks of data are small, it can find many duplicates and save a lot of capacity not having to repeatedly back it up.
- Because of deduplication, Avamar need only save and store the changes and differences between each client backup job. As new backups (or incoming replication) run, new data can be added, increasing the Avamar capacity or utilization value.
- After a certain amount of time, backups expire based on their configured retention and expiration, and are no longer available on the Avamar grid to restore.
- When the Avamar maintenance job called Garbage Collection (GC) runs, it finds all the unique portions or chunks of data that are no longer needed due to these expired backups. GC verifies that no other current backups share that same data (because of deduplication) and then removes or frees up those chunks that are no longer required, reducing the Avamar server capacity or utilization.
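As a simple illustration of the deduplication concept (standard shell tools only, not an Avamar command): identical data produces an identical hash, so the chunk only needs to be stored once, while changed data produces a new hash and must be stored as new data.
echo "unchanged chunk of data" | sha256sum   # digest A
echo "unchanged chunk of data" | sha256sum   # same digest A, so this chunk would be stored only once
echo "modified chunk of data" | sha256sum    # different digest, so this would be new data to store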
When the amount of daily incoming data added is about the same as the amount of daily data being cleaned, this is called "Steady State". This is the goal of every Avamar grid installed.
Before a new Avamar grid is set up and configured, general preinstallation sizing calculations are made to determine the capacity required to store the backup data. These calculations are based on the customer's retention requirements and how much data is to be backed up, as well as estimates of how much of that data deduplicates on average, and so forth.
High GSAN Capacity can therefore be caused by:
- Garbage Collection not consistently running
- Garbage Collection performance being slow, or GC not running long enough
- Deduplication estimates made prior to the Avamar grid installation not being accurate enough
- Data other than what was calculated prior to the Avamar grid installation being backed up to this Avamar server
- Other reasons
Resolution
Validate that each troubleshooting step below is true for the environment:
Do not skip any steps.
Step 1. Data Collection:
Ensure that there are no other non-capacity issues with the Avamar grid. If there are, they could require attention PRIOR to troubleshooting capacity.
This includes hardware errors, data integrity issues (including offline nodes), offline stripes, checkpoint validation failures or failing maintenance jobs. If any of these are an issue, capacity troubleshooting must be stopped and other issues addressed. Once other issues have been resolved, the capacity can be revisited.
A health check should be run (see Avamar: How to run the proactive_check.pl health check script on an Avamar Server), but at minimum the status.dpn command can give a quick overview and verification of most of those same issues. See Avamar: How to understand the output generated by the status.dpn command
See the following article for additional information: Avamar: How to Apply the "Avamar troubleshooting hierarchy" Approach Correctly.
If assistance is required to address any non-capacity issues, create a Service Request with the Dell Technologies Avamar Support team.
Step 2. Capacity Information Collection:
Refer to the following article for all the information required to troubleshoot Avamar Capacity issues: Avamar: How to gather the information to troubleshoot capacity issues
At the very least, the status.dpn command or the values within the Avamar Administration UI show the GSAN capacity.
The values reported by the status.dpn command and the UI differ slightly by intended design.
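For example, run the following as the admin user on the utility node (the output columns are explained in the status.dpn article referenced in Step 1):
status.dpn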
Step 3. Check if the OS Capacity is full:
The following command helps show the current OS Capacity value for each disk partition. If any of the values have reached or exceeded 85%, as in the second sample output, the OS Capacity is considered high:
avmaint nodelist | egrep 'nodetag|fs-percent-full'
Sample outputs:
Sample 1 (normal OS Capacity):
nodetag="0.2"
fs-percent-full="56.6"
fs-percent-full="54.7"
fs-percent-full="54.4"
fs-percent-full="54.6"
fs-percent-full="54.7"
fs-percent-full="54.7"
nodetag="0.1"
fs-percent-full="56.2"
fs-percent-full="54.6"
fs-percent-full="54.6"
fs-percent-full="54.8"
fs-percent-full="54.8"
fs-percent-full="54.5"
nodetag="0.0"
fs-percent-full="56.2"
fs-percent-full="54.7"
fs-percent-full="54.8"
fs-percent-full="54.7"
fs-percent-full="54.6"
fs-percent-full="54.6"
nodetag="0.2"
fs-percent-full="94.5"
fs-percent-full="94.4"
fs-percent-full="94.2"
fs-percent-full="94.1"
fs-percent-full="94.0"
fs-percent-full="94.0"
nodetag="0.1"
fs-percent-full="94.5"
fs-percent-full="94.3"
fs-percent-full="94.1"
fs-percent-full="93.6"
fs-percent-full="94.0"
fs-percent-full="93.9"
nodetag="0.0"
fs-percent-full="94.4"
fs-percent-full="94.4"
fs-percent-full="94.0"
fs-percent-full="94.1"
fs-percent-full="92.7"
fs-percent-full="92.5"
High OS Capacity must be addressed before the GSAN Capacity because Garbage Collection cannot run if the OS capacity exceeds 89%. This is discussed in more detail, and troubleshooting steps are provided, in: Avamar: Operating System (OS) Capacity (Resolution Path)
Only if the OS capacity is below 85% should the GSAN Capacity troubleshooting continue.
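As a convenience, the same command from above can be filtered to list only partitions at or above the 85% threshold (a minimal sketch using standard awk; adjust the threshold if required):
avmaint nodelist | egrep 'nodetag|fs-percent-full' | awk -F'"' '/nodetag/ {node=$2} /fs-percent-full/ && $2+0 >= 85 {print "node " node ": " $2 "% full"}'
If this prints nothing, all partitions are below 85% and GSAN Capacity troubleshooting can continue.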
Step 4. Non-Capacity issues that can sometimes be misunderstood as Capacity:
It is possible that client backups fail not for "Capacity" reasons but instead for "Quota" reasons. Quota failures can sometimes be misunderstood as Capacity failures.
This situation can be confirmed when the status.dpn command, or some of the other collected output, shows that the capacity is actually low.
It is also possible that client backups failed or did not run due to other non-GSAN Capacity reasons. The collected information should confirm this, and it can also be seen in the Avamar Administration UI.
If the GSAN Capacity is not high, refer to the articles for the other capacity types instead.
If the GSAN Capacity is high, and other capacities such as the Metadata Capacity and the Data Domain (DD) Capacity are also high, they can be addressed in any order, unlike OS Capacity, which must always be addressed first.
Step 5. Stripe Balance and OS Disk Balance:
"Stripes" on Avamar are the container files that backup data is stored within on the data nodes (except for a single-node Avamar grid).
The expectation is that stripes are "balanced" or evenly distributed across the different disks and nodes within the grid; however, they can sometimes become unbalanced.
By design on Avamar, the largest node or disk partition is the limiting factor when it comes to Avamar Capacity.
This is intentional so that none of the disks or nodes create more stripes than they can handle (or are allowed to); having balanced stripes is therefore important to Capacity.
For example, when adding additional data nodes for Avamar grid expansion, balancing must be run to evenly distribute stripes to the new nodes to decrease the overall Avamar Capacity percentage.
Another type of balance requiring understanding is OS disk balance. This applies only to data partitions on the same node, not to partitions across multiple nodes.
If one partition on a data node is much more or less full than another partition on the SAME node, a limit called "freespaceunbalance" can be exceeded. While this generally relates to the OS Capacity and not the GSAN Capacity, it can be reported as a GSAN Capacity issue.
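If an imbalance between partitions on the same node is suspected, the spread of the fs-percent-full values can be summarized per node from the same output used in Step 3 (a minimal sketch assuming that output format; the actual freespaceunbalance limit is evaluated by the GSAN itself):
avmaint nodelist | egrep 'nodetag|fs-percent-full' | awk -F'"' '
  /nodetag/ { node=$2 }
  /fs-percent-full/ { v=$2+0; if (!(node in min) || v<min[node]) min[node]=v; if (v>max[node]) max[node]=v }
  END { for (n in min) printf "node %s: min %.1f%%, max %.1f%%, spread %.1f%%\n", n, min[n], max[n], max[n]-min[n] }'
A large spread between partitions on the same node suggests an OS disk balance issue on that node.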
Step 6. Check if the Garbage Collection is completing:
Run the following command to get information about garbage collection:
dumpmaintlogs --types=gc --days=30 | grep "4201\|4202"
Ideally, the output will show that GC has completed for the last 30 days:
2025/10/07-12:00:35.24911 {0.1} <4201> completed garbage collection
2025/10/08-12:00:34.61185 {0.1} <4201> completed garbage collection
2025/10/09-12:00:35.14874 {0.1} <4201> completed garbage collection
2025/10/10-12:00:34.67986 {0.1} <4201> completed garbage collection
2025/10/11-12:00:34.73284 {0.1} <4201> completed garbage collection
2025/10/12-12:00:33.23205 {0.1} <4201> completed garbage collection
2025/10/13-12:00:33.41448 {0.1} <4201> completed garbage collection
2025/10/14-12:00:35.70726 {0.1} <4201> completed garbage collection
2025/10/15-12:00:35.08316 {0.1} <4201> completed garbage collection
2025/10/16-12:00:34.82681 {0.1} <4201> completed garbage collection
2025/10/17-12:00:35.29262 {0.1} <4201> completed garbage collection
2025/10/18-12:00:35.24618 {0.1} <4201> completed garbage collection
2025/10/19-12:00:34.56531 {0.1} <4201> completed garbage collection
2025/10/20-19:06:45.15574 {0.1} <4201> completed garbage collection
2025/10/21-12:00:34.21062 {0.1} <4201> completed garbage collection
2025/10/22-12:00:35.29770 {0.1} <4201> completed garbage collection
2025/10/23-12:00:36.13041 {0.1} <4201> completed garbage collection
2025/10/24-12:00:35.52502 {0.1} <4201> completed garbage collection
2025/10/25-12:00:35.93730 {0.1} <4201> completed garbage collection
2025/10/26-12:00:35.55037 {0.1} <4201> completed garbage collection
2025/10/27-12:00:36.12049 {0.1} <4201> completed garbage collection
2025/10/28-12:00:35.75633 {0.1} <4201> completed garbage collection
2025/10/29-12:00:34.85499 {0.1} <4201> completed garbage collection
2025/10/30-12:00:34.96325 {0.2} <4201> completed garbage collection
2025/10/31-12:00:35.39840 {0.0} <4201> completed garbage collection
2025/11/01-12:00:35.11248 {0.0} <4201> completed garbage collection
2025/11/02-13:00:34.39202 {0.0} <4201> completed garbage collection
2025/11/03-13:00:34.70587 {0.0} <4201> completed garbage collection
2025/11/04-13:00:34.18799 {0.0} <4201> completed garbage collection
2025/11/05-13:00:34.44950 {0.0} <4201> completed garbage collection
GC failure messages can include, but are not limited to, the following:
2025/11/04-13:00:01.62234 {0.1} <4202> failed garbage collection with error MSG_ERR_DDR_ERROR
2025/11/01-12:35:06.62868 {0.2} <4202> failed garbage collection with error MSG_ERR_BACKUPSINPROGRESS
2025/10/13-12:20:07.35498 {0.7} <4202> failed garbage collection with error MSG_ERR_TRYAGAINLATER
2025/10/27-12:07:44.35485 {0.0} <4202> failed garbage collection with error MSG_ERR_DISKFULL
2025/11/02-13:16:39.72027 {0.1} <4202> failed garbage collection with error MSG_ERR_MISC
2025/11/02-13:16:39.72027 {0.1} <4202> failed garbage collection with error MSG_ERR_TIMEOUT
2025/11/02-13:16:39.72027 {0.1} <4202> failed garbage collection with error MSG_ERR_GARBAGECOLLECT
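As a quick summary, the number of completed and failed GC runs over the same 30-day period can be counted (a simple sketch using grep -c):
dumpmaintlogs --types=gc --days=30 | grep -c "completed garbage collection"
dumpmaintlogs --types=gc --days=30 | grep -c "failed garbage collection"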
If GC has been failing, address this first using the following article as a reference: Avamar: Troubleshooting Garbage Collection (GC) Failures (Resolution Path)
(If any issues have already been resolved, go to the next step.)
Step 7. Is GC running long enough?
a. Run the following command to check the maximum time allowed for GC:
dumpmaintlogs --types=gc --days=30 | grep gcflags
Sample output:
2025/10/07-12:00:20.05509 {0.1} <gcflags gccount="0" gcmincount="0" kill="0" limitadjust="5" maxpass="0" maxtime="14400" refcheck="true" throttlelevel="0" usehistory="false" orphansfirst="false"/>
2025/10/08-12:00:20.09141 {0.1} <gcflags gccount="0" gcmincount="0" kill="0" limitadjust="5" maxpass="0" maxtime="14400" refcheck="true" throttlelevel="0" usehistory="false" orphansfirst="false"/>
2025/10/09-12:00:20.42307 {0.1} <gcflags gccount="0" gcmincount="0" kill="0" limitadjust="5" maxpass="0" maxtime="14400" refcheck="true" throttlelevel="0" usehistory="false" orphansfirst="false"/>
2025/10/10-12:00:20.47775 {0.1} <gcflags gccount="0" gcmincount="0" kill="0" limitadjust="5" maxpass="0" maxtime="14400" refcheck="true" throttlelevel="0" usehistory="false" orphansfirst="false"/>
...
2025/11/02-13:00:19.76100 {0.0} <gcflags gccount="0" gcmincount="0" kill="0" limitadjust="5" maxpass="0" maxtime="14400" refcheck="true" throttlelevel="0" usehistory="false" orphansfirst="false"/>
2025/11/03-13:00:19.92093 {0.0} <gcflags gccount="0" gcmincount="0" kill="0" limitadjust="5" maxpass="0" maxtime="14400" refcheck="true" throttlelevel="0" usehistory="false" orphansfirst="false"/>
2025/11/04-13:00:19.42781 {0.0} <gcflags gccount="0" gcmincount="0" kill="0" limitadjust="5" maxpass="0" maxtime="14400" refcheck="true" throttlelevel="0" usehistory="false" orphansfirst="false"/>
2025/11/05-13:00:19.74984 {0.0} <gcflags gccount="0" gcmincount="0" kill="0" limitadjust="5" maxpass="0" maxtime="14400" refcheck="true" throttlelevel="0" usehistory="false" orphansfirst="false"/>
Take note of the maxtime value, which in this example is 14400 (seconds).
(A value of 0 means unlimited)
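To confirm which maxtime values have been in effect over the period, the same log can be summarized (a minimal sketch using grep and uniq):
dumpmaintlogs --types=gc --days=30 | grep -o 'maxtime="[0-9]*"' | sort | uniq -c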
b. Run the following command to determine how long GC runs and how many "passes" complete:
(Passes relate to the layers of the stored backup data. Think of the GSAN data like the layers of an onion: the outer layers must be peeled back or removed before the inner layers can be reached. Each pass works through a layer of the GSAN stored data.)
dumpmaintlogs --types=gc --days=30 | grep passes | cut -d ' ' -f1,14-20
Sample output:
2025/10/07-12:00:35.24463 passes="24" start-time="1758283220" elapsed-time="250" end-time="1758283235"/>
2025/10/08-12:00:34.60779 passes="3" start-time="1758369620" elapsed-time="70" end-time="1758369627"/>
2025/10/09-12:00:35.14232 megabytes-recovered="1" passes="4" start-time="1758456020" elapsed-time="85" end-time="1758456028"/>
2025/10/10-12:00:34.67590 passes="3" start-time="1758542420" elapsed-time="72" end-time="1758542427"/>
...
2025/11/02-13:00:34.38348 megabytes-recovered="2" passes="18" start-time="1762088419" elapsed-time="89" end-time="1762088427"/>
2025/11/03-13:00:34.69743 passes="18" start-time="1762174819" elapsed-time="9" end-time="1762174828"/>
2025/11/04-13:00:34.17943 megabytes-recovered="8" passes="22" start-time="1762261219" elapsed-time="134" end-time="1762261228"/>
2025/11/05-13:00:34.44187 megabytes-recovered="2" passes="16" start-time="1762347619" elapsed-time="119" end-time="1762347628"/>
Take note of the number of passes and the elapsed-time (seconds).
c. Assuming that the maxtime is nonzero, calculate 2/3 of maxtime, and compare it to the elapsed time.
(In the example above 2/3 of 14400 is 9600, and all elapsed-time outputs are well below this figure.)
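This comparison can also be scripted against the same output (a minimal sketch; 9600 is 2/3 of the 14400-second maxtime from the example above and should be replaced with 2/3 of the actual maxtime value):
dumpmaintlogs --types=gc --days=30 | grep passes | awk -F'elapsed-time="' '{ split($1,t," "); split($2,a,"\""); e=a[1]+0; print t[1], "elapsed=" e "s", (e < 9600 ? "under 2/3 of maxtime" : "review") }'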
- If the elapsed-time is less than 2/3 of maxtime, it is likely that GC finished early because there was nothing left to collect and it is caught up.
- If the number of passes is high (14 or more), it is likely that GC is removing sufficient amounts of data.
Note: If no data expired and there is nothing to clean, the number of passes is expected to be low, so it is best to understand the entire situation and environment as well. Do not assume that a low number of passes means there is a problem.
Various issues can cause GC to run slowly or not scan everything. These can include GC not having had enough time to run on previous days, incorrect configuration, errors, and more.
If there are concerns about the maxtime, or number of passes, create a Service Request with the Dell Technologies Avamar Support Team to investigate further.
Step 8. If it is suspected that GC did not remove enough data, or the expected data:
If it is confirmed that GC is running long enough, it is possible that data is not being collected for reasons outside of Garbage Collection's control. The following documented reasons should generally be checked:
a. Verify that backups are configured to expire eventually or regularly. If backups are not expiring frequently, GC does not have much work to do.
b. Use this article to find the "Top Change Rate" Clients: Avamar: How to manage capacity with the capacity.sh script. (Review both the "% OF TOTAL" and "CHGRATE".)
c. Check for skipped hashes per Avamar: Avamar Garbage Collection reports "skipped-hashes" that cannot be cleaned up. If these are occurring but rare, this is normal and this can be skipped.
d. There is a flag or option which forces the Avamar server to keep the last and most recent backup from every client. This is used for safety purposes so that a client does not have every backup accidentally expired. However, this can cause other issues when it comes to data cleanup and Garbage Collection. The Dell Technologies Avamar Support team can confirm if this is enabled.
e. If backups were recently switched from GSAN to DD backend or there was an accidental GSAN backup, but the GSAN Capacity does not decrease, create a Service Request with the Dell Technologies Avamar Support Team to investigate further.
Step 9. The Avamar grid is undersized for the amount of current or expected data to be added:
Once all other solutions and possible causes of high capacity have been reviewed, and this is not a configuration issue or an issue with accidental data, the grid may simply be undersized.
This means data may require deletion, or options may need to be explored such as migrating certain clients to other Avamar grids, adding data nodes, and so forth.
Step 10. Acknowledge any capacity events and resume the backup scheduler if required:
a. Once capacity issues are addressed, acknowledge all capacity-related events in the Avamar Admin UI.
b. Resume the backup scheduler:
dpnctl start sched
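The overall service status, including the backup scheduler, can then be confirmed with:
dpnctl status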
For any other Avamar Capacity questions, training, troubleshooting, and more, see: Avamar: Capacity Troubleshooting, Issues, and Questions - All Capacity (Resolution Path)
Additional Information
Support is sometimes asked to run a manual Garbage Collection, that is, to run GC outside of the scheduled automatic maintenance times. This is generally discouraged:
- This action in itself can "mask" and hide the real issues, only for them to reappear a few days or weeks later, making the manual job wasted time.
- Additionally, the manual GC might not run as efficiently because it is running out of schedule.
- On occasion, it can make other issues worse. For more information, see: Avamar: About the use of manual Garbage Collection
Support may also be asked to change or raise the capacity limits; however, this does not reduce the GSAN Capacity at all.
- This change or action is generally not performed and should not be considered by default. An Avamar L2 engineer or Subject Matter Expert (SME) must approve this change.
- Unfortunately, such actions can often cause permanent damage to an Avamar grid in ways that can only be resolved by adding additional storage nodes or redeployment.
Understand that when neither of the actions listed above is performed, it is because the support team wants to resolve the Capacity issues in the most beneficial way.