Avamar: Maintenance tasks fail with MSG_ERR_DISKFULL due to the Operating System capacity of one or more data partitions exceeding 89 percent
Summary: The Operating System (OS) capacity exceeds recommended limits, causing maintenance tasks to fail.
Symptoms
The status.dpn output shows one or more maintenance activities reporting MSG_ERR_DISKFULL:
status.dpn
Tue Jan 14 18:26:56 PST 2025 [AVE01] Wed Jan 15 02:26:56 2025 UTC (Initialized Tue Feb 1 05:02:45 2022 UTC)
Node IP Address Version State Runlevel Srvr+Root+User Dis Suspend Load UsedMB Errlen %Full Percent Full and Stripe Status by Disk
0.0 192.168.255.1 19.4.0-124 ONLINE fullaccess mhpu+0hpu+0hpu 1 false 0.15 55075 3800263240 64.1% 64%(onl:1778) 64%(onl:1780) 64%(onl:1777)
Srvr+Root+User Modes = migrate + hfswriteable + persistwriteable + useraccntwriteable
System ID: 1643691765@00:01:02:03:04:05
All reported states=(ONLINE), runlevels=(fullaccess), modes=(mhpu+0hpu+0hpu)
System-Status: ok
Access-Status: full
Checkpoint failed with result MSG_ERR_DISKFULL : cp.20250114220752 started Tue Jan 14 14:08:12 2025 ended Tue Jan 14 14:08:12 2025, completed 0 of 5335 stripes
Last GC: finished Tue Jan 14 14:02:10 2025 after 00m 22s >> recovered 36.24 KB (MSG_ERR_DISKFULL)
Last hfscheck: finished Tue Jan 14 14:06:55 2025 after 04m 18s >> checked 1250 of 1250 stripes (OK)
Maintenance windows scheduler capacity profile is active.
The maintenance window is currently running.
Next backup window start time: Tue Jan 14 21:00:00 2025 PST
The Avamar grid shows high utilization:
mccli server show-prop
0,23000,CLI command completed successfully.
Attribute Value
-------------------------------------------- ----------------------------
State Suspended
Active sessions 0
Total capacity 562.4 GB
Capacity used 562.3 GB
Server utilization 98.9%
Bytes protected (client pre-comp size) 560 GB
Bytes protected quota (client pre-comp size) Not configured
License expiration Never
Time since Server initialization 1078 days 21h:32m
Last checkpoint 2025-01-14 14:07:52 PST
Last validated checkpoint 2025-01-14 14:02:12 PST
System Name AVE01
System ID 1643691765@00:01:02:03:04:05
HFSAddr ave01
HFSPort 27000
IP address 192.168.255.1
Number of nodes 1
Nodes Online 1
Nodes Offline 0
Nodes Read-only 0
Nodes Timed-out 0
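The "Server utilization" value above reflects GSAN (user) capacity, which corresponds to the 100% User Capacity line in the diagram under Additional Information and is separate from the OS data-partition capacity checked next. A minimal sketch to view both figures together, reusing the commands from this article:
mccli server show-prop | grep "Server utilization"
avmaint nodelist | egrep 'nodetag|fs-percent-full'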
One or more of the Avamar data partitions show high utilization:
avmaint nodelist | egrep 'nodetag|fs-percent-full'
Example output from a single node:
nodetag="0.0"
fs-percent-full="96.9"
fs-percent-full="96.9"
fs-percent-full="96.9"
Example output from a multi-node grid (this output helps identify which node has high OS capacity):
nodetag="0.2"
fs-percent-full="96.7"
fs-percent-full="96.9"
fs-percent-full="96.9"
nodetag="0.1"
fs-percent-full="96.9"
fs-percent-full="96.4"
fs-percent-full="96.8"
nodetag="0.0"
fs-percent-full="96.3"
fs-percent-full="96.8"
fs-percent-full="96.7"
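The same partitions can be cross-checked at the operating-system level. This is a minimal sketch, assuming the standard /dataNN mount points used by the Avamar data partitions; on a multi-node grid, run it on each data node (for example, with the mapall utility):
df -h | grep "/data"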
Cause
The likely cause is one of the following:
1. A sudden, large change in the data being backed up.
2. A daily change rate that is too high.
The capacity.sh script is helpful for tracking the change rate on a grid. For information about how to use the capacity.sh script, see Knowledge Base article 000060149: Avamar: How to manage capacity with the capacity.sh script.
Example (note the increase in daily new data and in the net change rate beginning 2024-12-11):
capacity.sh
DATE AVAMAR NEW #BU DDR NEW #BU SCANNED REMOVED MINS PASS AVAMAR NET CHG RATE
========== ============= ==== ============= ==== ============= ============= ==== ==== ============= ==========
2024-12-04 1770185 mb 367 36590255 mb 4414 427354917 mb -1155011 mb 179 36 615174 mb 8.98%
2024-12-05 1799386 mb 366 35834788 mb 4384 424229450 mb -967906 mb 158 36 831480 mb 8.87%
2024-12-06 1641614 mb 366 36339601 mb 4387 422918309 mb -715952 mb 95 36 925662 mb 8.98%
2024-12-07 1482274 mb 368 36021600 mb 4382 422096834 mb -1369565 mb 182 35 112708 mb 8.89%
2024-12-08 1476971 mb 376 35466632 mb 4379 418749502 mb -882663 mb 120 36 594307 mb 8.82%
2024-12-09 2338688 mb 377 36564862 mb 4408 426949173 mb -521711 mb 102 36 1816976 mb 9.11%
2024-12-10 1830728 mb 482 36776445 mb 4303 423650873 mb -369845 mb 80 36 1460882 mb 9.11%
2024-12-11 10323736 mb 478 33010286 mb 4416 435953105 mb -1016271 mb 159 34 9307465 mb 9.94%
2024-12-12 8773933 mb 473 32431241 mb 4399 442013401 mb -167120 mb 64 35 8606813 mb 9.32%
2024-12-13 8834627 mb 485 31265504 mb 4378 434459112 mb -186507 mb 60 35 8648119 mb 9.23%
2024-12-14 8605313 mb 479 31150950 mb 4391 434117515 mb -32753 mb 41 35 8572559 mb 9.16%
2024-12-15 10727441 mb 478 32164212 mb 4393 435520200 mb -58643 mb 53 36 10668797 mb 9.85%
2024-12-16 10133770 mb 477 31557436 mb 4396 432462001 mb -55780 mb 43 36 10077989 mb 9.64%
2024-12-17 9941271 mb 477 30824614 mb 4419 434292081 mb -68284 mb 53 35 9872986 mb 9.39%
2024-12-18 10147447 mb 416 24608011 mb 3237 319673822 mb -577890 mb 124 35 9569557 mb 10.87%
================================================================================================================
14 DAY AVG 5988492 mb 431 33373763 mb 4312 422296020 mb -543060 mb 101 35 5445432 mb 9.32%
30 DAY AVG 3622366 mb 403 36648167 mb 4353 427001356 mb -1326697 mb 150 34 2295669 mb 9.43%
60 DAY AVG 3047161 mb 392 34199043 mb 4323 417800256 mb -1489983 mb 159 34 1557178 mb 8.91%
Resolution
If the capacity.sh output confirms a sudden, large increase in backed-up data or a sustained high daily change rate, stop here and create a Swarm with the Avamar SCR team as soon as possible. Otherwise, continue with the following steps:
1. Verify that the checkpoint and hfscheck retentions are set to the default values of 2 and 1:
avmaint config --ava | grep "cpmostrecent\|cphfschecked"
cpmostrecent="2"
cphfschecked="1"
If these values are incorrect, run the following command:
avmaint config --ava cpmostrecent=2 cphfschecked=1
Example output if the previous values were 5 and 3 (the command returns the previous configuration):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gsanconfig
cpmostrecent="5"
cphfschecked="3"/>
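Because the command returns the previous configuration, re-run the query from the start of this step to confirm that the new values are in place:
avmaint config --ava | grep "cpmostrecent\|cphfschecked"
The output should now show cpmostrecent="2" and cphfschecked="1".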
2. Disable async crunching so that the OS capacity does not continue to grow during the troubleshooting process:
avmaint config --ava asynccrunching=false
Example output (the command returns the previous value):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gsanconfig asynccrunching="true"/>
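Async crunching remains disabled after this step. Once the capacity issue is resolved, it can be re-enabled with the same command:
avmaint config --ava asynccrunching=true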
3. Check the status of the OS data partition capacities:
avmaint nodelist | egrep 'nodetag|fs-percent-full'
Example output from a single node:
nodetag="0.0"
fs-percent-full="96.9"
fs-percent-full="96.9"
fs-percent-full="96.9"
Example output from a multi-node grid:
nodetag="0.2"
fs-percent-full="96.7"
fs-percent-full="96.9"
fs-percent-full="96.9"
nodetag="0.1"
fs-percent-full="96.9"
fs-percent-full="96.4"
fs-percent-full="96.8"
nodetag="0.0"
fs-percent-full="96.3"
fs-percent-full="96.8"
fs-percent-full="96.7"
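To pull out the single highest fs-percent-full value, which determines the branch to follow in step 4, a quick one-liner sketch (assuming the attribute appears on its own line, as in the example output above):
avmaint nodelist | grep fs-percent-full | sed 's/.*="\([0-9.]*\)".*/\1/' | sort -n | tail -1
For the single-node example above, this returns 96.9.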
4. Based on the highest fs-percent-full value:
- Above 98%:
Open a Service Request with the Dell Technologies Avamar Support team referencing this knowledge article.
- Above 96%, but below 98%:
If the checkpoint retentions were already set to the default values, open a Service Request with the Dell Technologies Avamar Support team referencing this knowledge article.
If the retentions were reduced in step 1, either run maintenance manually or monitor the grid until the next maintenance cycle has completed (see the sketch after this list).
If the capacity issues remain, open a Service Request with the Dell Technologies Avamar Support team referencing this knowledge article.
- Above 89%, but below 96%:
Checkpoints will complete successfully, and the OS capacity should drop during the next maintenance cycle.
Either run maintenance manually or monitor the grid until the next maintenance cycle has completed (see the sketch after this list).
If the capacity does not drop below 89%, open a Service Request with the Dell Technologies Avamar Support team referencing this knowledge article.
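One way to monitor the grid until the next maintenance cycle has completed is to re-check the maintenance results and the data partition usage, reusing the commands shown earlier in this article:
status.dpn | grep "Checkpoint\|Last GC\|Last hfscheck"
avmaint nodelist | egrep 'nodetag|fs-percent-full'
If the maintenance activities complete without MSG_ERR_DISKFULL and fs-percent-full drops below 89, the condition is resolved.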
Additional Information
For further information about Avamar OS capacity issues, see the article Avamar: Capacity Management Concepts and Training.
Avamar maintenance activities require a certain amount of free OS space in order to run, as illustrated in the diagram below.
With the default settings, if the OS capacity is
- Greater than 89%: Garbage Collection fails
- Greater than 96%: Checkpoints fail
100% "---------------------" <-- 100% Data partition capacity
" CP cannot run >96% "
" "
" GC cannot run >89% "
89% "---------------------"
" Reserved for "
" checkpoint "
" overhead "
" "
65% "---------------------" <-- 100% User Capacity
" Commonality " Can be monitored
" factored data " from the Admin
" & RAIN parity " UI.
" data "
" "
" "
" "
" "
" "
" "
" "
0% "---------------------"