Avamar: Maintenance tasks fail with MSG_ERR_DISKFULL due to the Operating System capacity of one or more data partitions exceeding 89 percent
Summary: The Operating System (OS) capacity exceeds recommended limits, causing maintenance tasks to fail.
Symptoms
The status.dpn output shows one or more maintenance activities reporting MSG_ERR_DISKFULL:
status.dpn
Tue Jan 14 18:26:56 PST 2025 [AVE01] Wed Jan 15 02:26:56 2025 UTC (Initialized Tue Feb 1 05:02:45 2022 UTC)
Node IP Address Version State Runlevel Srvr+Root+User Dis Suspend Load UsedMB Errlen %Full Percent Full and Stripe Status by Disk
0.0 192.168.255.1 19.4.0-124 ONLINE fullaccess mhpu+0hpu+0hpu 1 false 0.15 55075 3800263240 64.1% 64%(onl:1778) 64%(onl:1780) 64%(onl:1777)
Srvr+Root+User Modes = migrate + hfswriteable + persistwriteable + useraccntwriteable
System ID: 1643691765@00:01:02:03:04:05
All reported states=(ONLINE), runlevels=(fullaccess), modes=(mhpu+0hpu+0hpu)
System-Status: ok
Access-Status: full
Checkpoint failed with result MSG_ERR_DISKFULL : cp.20250114220752 started Tue Jan 14 14:08:12 2025 ended Tue Jan 14 14:08:12 2025, completed 0 of 5335 stripes
Last GC: finished Tue Jan 14 14:02:10 2025 after 00m 22s >> recovered 36.24 KB (MSG_ERR_DISKFULL)
Last hfscheck: finished Tue Jan 14 14:06:55 2025 after 04m 18s >> checked 1250 of 1250 stripes (OK)
Maintenance windows scheduler capacity profile is active.
The maintenance window is currently running.
Next backup window start time: Tue Jan 14 21:00:00 2025 PST
The Avamar grid shows high utilization:
mccli server show-prop
0,23000,CLI command completed successfully.
Attribute Value
-------------------------------------------- ----------------------------
State Suspended
Active sessions 0
Total capacity 562.4 GB
Capacity used 562.3 GB
Server utilization 98.9%
Bytes protected (client pre-comp size) 560 GB
Bytes protected quota (client pre-comp size) Not configured
License expiration Never
Time since Server initialization 1078 days 21h:32m
Last checkpoint 2025-01-14 14:07:52 PST
Last validated checkpoint 2025-01-14 14:02:12 PST
System Name AVE01
System ID 1643691765@00:01:02:03:04:05
HFSAddr ave01
HFSPort 27000
IP address 192.168.255.1
Number of nodes 1
Nodes Online 1
Nodes Offline 0
Nodes Read-only 0
Nodes Timed-out 0
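The "Server utilization" value above reflects GSAN (user) capacity, which corresponds to the 100% User Capacity line in the diagram under Additional Information and is separate from the OS data-partition capacity checked next. A minimal sketch to view both figures together, reusing the commands from this article:
mccli server show-prop | grep "Server utilization"
avmaint nodelist | egrep 'nodetag|fs-percent-full'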
One or more of the Avamar data partitions show high utilization:
avmaint nodelist | egrep 'nodetag|fs-percent-full'
Example output from a single node:
nodetag="0.0"
fs-percent-full="96.9"
fs-percent-full="96.9"
fs-percent-full="96.9"
Example output from a multi-node grid (this output helps identify which node has high OS capacity):
nodetag="0.2"
fs-percent-full="96.7"
fs-percent-full="96.9"
fs-percent-full="96.9"
nodetag="0.1"
fs-percent-full="96.9"
fs-percent-full="96.4"
fs-percent-full="96.8"
nodetag="0.0"
fs-percent-full="96.3"
fs-percent-full="96.8"
fs-percent-full="96.7"
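The same partitions can be cross-checked at the operating-system level. This is a minimal sketch, assuming the standard /dataNN mount points used by the Avamar data partitions; on a multi-node grid, run it on each data node (for example, with the mapall utility):
df -h | grep "/data"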
Cause
The likely cause is one of the following:
1. A sudden, large change in the data being backed up.
2. A daily change rate that is too high.
The capacity.sh script is helpful for tracking the change rate on a grid. For information about how to use the capacity.sh script, see Knowledge Base article 000060149: Avamar: How to manage capacity with the capacity.sh script.
Example (note the increase in daily new data and in the net change rate beginning 2024-12-11):
capacity.sh
DATE AVAMAR NEW #BU DDR NEW #BU SCANNED REMOVED MINS PASS AVAMAR NET CHG RATE
========== ============= ==== ============= ==== ============= ============= ==== ==== ============= ==========
2024-12-04 1770185 mb 367 36590255 mb 4414 427354917 mb -1155011 mb 179 36 615174 mb 8.98%
2024-12-05 1799386 mb 366 35834788 mb 4384 424229450 mb -967906 mb 158 36 831480 mb 8.87%
2024-12-06 1641614 mb 366 36339601 mb 4387 422918309 mb -715952 mb 95 36 925662 mb 8.98%
2024-12-07 1482274 mb 368 36021600 mb 4382 422096834 mb -1369565 mb 182 35 112708 mb 8.89%
2024-12-08 1476971 mb 376 35466632 mb 4379 418749502 mb -882663 mb 120 36 594307 mb 8.82%
2024-12-09 2338688 mb 377 36564862 mb 4408 426949173 mb -521711 mb 102 36 1816976 mb 9.11%
2024-12-10 1830728 mb 482 36776445 mb 4303 423650873 mb -369845 mb 80 36 1460882 mb 9.11%
2024-12-11 10323736 mb 478 33010286 mb 4416 435953105 mb -1016271 mb 159 34 9307465 mb 9.94%
2024-12-12 8773933 mb 473 32431241 mb 4399 442013401 mb -167120 mb 64 35 8606813 mb 9.32%
2024-12-13 8834627 mb 485 31265504 mb 4378 434459112 mb -186507 mb 60 35 8648119 mb 9.23%
2024-12-14 8605313 mb 479 31150950 mb 4391 434117515 mb -32753 mb 41 35 8572559 mb 9.16%
2024-12-15 10727441 mb 478 32164212 mb 4393 435520200 mb -58643 mb 53 36 10668797 mb 9.85%
2024-12-16 10133770 mb 477 31557436 mb 4396 432462001 mb -55780 mb 43 36 10077989 mb 9.64%
2024-12-17 9941271 mb 477 30824614 mb 4419 434292081 mb -68284 mb 53 35 9872986 mb 9.39%
2024-12-18 10147447 mb 416 24608011 mb 3237 319673822 mb -577890 mb 124 35 9569557 mb 10.87%
================================================================================================================
14 DAY AVG 5988492 mb 431 33373763 mb 4312 422296020 mb -543060 mb 101 35 5445432 mb 9.32%
30 DAY AVG 3622366 mb 403 36648167 mb 4353 427001356 mb -1326697 mb 150 34 2295669 mb 9.43%
60 DAY AVG 3047161 mb 392 34199043 mb 4323 417800256 mb -1489983 mb 159 34 1557178 mb 8.91%
Resolution
If the capacity.sh output confirms a sudden, large increase in backed-up data or a sustained high daily change rate, stop here and create a Swarm with the Avamar SCR team as soon as possible. Otherwise, continue with the following steps:
1. Verify that the checkpoint and hfscheck retentions are set to the default values of 2 and 1:
avmaint config --ava | grep "cpmostrecent\|cphfschecked"
cpmostrecent="2"
cphfschecked="1"
If these values are incorrect, run the following command:
avmaint config --ava cpmostrecent=2 cphfschecked=1
Example output if the previous values were 5 and 3 (the command returns the previous configuration):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gsanconfig
cpmostrecent="5"
cphfschecked="3"/>
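Because the command returns the previous configuration, re-run the query from the start of this step to confirm that the new values are in place:
avmaint config --ava | grep "cpmostrecent\|cphfschecked"
The output should now show cpmostrecent="2" and cphfschecked="1".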
2. Disable async crunching so that the OS capacity does not continue to grow during the troubleshooting process:
avmaint config --ava asynccrunching=false
Example output (the command returns the previous value):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gsanconfig asynccrunching="true"/>
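Async crunching remains disabled after this step. Once the capacity issue is resolved, it can be re-enabled with the same command:
avmaint config --ava asynccrunching=true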
3. Check the status of the OS data partition capacities:
avmaint nodelist | egrep 'nodetag|fs-percent-full'
Example output from a single node:
nodetag="0.0"
fs-percent-full="96.9"
fs-percent-full="96.9"
fs-percent-full="96.9"
Example output from a multi-node grid:
nodetag="0.2"
fs-percent-full="96.7"
fs-percent-full="96.9"
fs-percent-full="96.9"
nodetag="0.1"
fs-percent-full="96.9"
fs-percent-full="96.4"
fs-percent-full="96.8"
nodetag="0.0"
fs-percent-full="96.3"
fs-percent-full="96.8"
fs-percent-full="96.7"
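To pull out the single highest fs-percent-full value, which determines the branch to follow in step 4, a quick one-liner sketch (assuming the attribute appears on its own line, as in the example output above):
avmaint nodelist | grep fs-percent-full | sed 's/.*="\([0-9.]*\)".*/\1/' | sort -n | tail -1
For the single-node example above, this returns 96.9.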
4. Based on the highest fs-percent-full value:
- Above 98%:
Open a Service Request with the Dell Technologies Avamar Support team referencing this knowledge article.
- Above 96%, but below 98%:
If the checkpoint retentions were already set to the default values, open a Service Request with the Dell Technologies Avamar Support team referencing this knowledge article.
If the retentions were reduced in step 1, either run maintenance manually or monitor the grid until the next maintenance cycle has completed (see the sketch after this list).
If the capacity issues remain, open a Service Request with the Dell Technologies Avamar Support team referencing this knowledge article.
- Above 89%, but below 96%:
Checkpoints will complete successfully, and the OS capacity should drop during the next maintenance cycle.
Either run maintenance manually or monitor the grid until the next maintenance cycle has completed (see the sketch after this list).
If the capacity does not drop below 89%, open a Service Request with the Dell Technologies Avamar Support team referencing this knowledge article.
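One way to monitor the grid until the next maintenance cycle has completed is to re-check the maintenance results and the data partition usage, reusing the commands shown earlier in this article:
status.dpn | grep "Checkpoint\|Last GC\|Last hfscheck"
avmaint nodelist | egrep 'nodetag|fs-percent-full'
If the maintenance activities complete without MSG_ERR_DISKFULL and fs-percent-full drops below 89, the condition is resolved.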
Additional Information
For further information about Avamar OS capacity issues, see the article Avamar: Capacity Management Concepts and Training.
Avamar maintenance activities require a certain amount of free OS space in order to run, as illustrated in the diagram below.
With the default settings, if the OS capacity is
- Greater than 89%: Garbage Collection fails
- Greater than 96%: Checkpoints fail
100% "---------------------" <-- 100% Data partition capacity
" CP cannot run >96% "
" "
" GC cannot run >89% "
89% "---------------------"
" Reserved for "
" checkpoint "
" overhead "
" "
65% "---------------------" <-- 100% User Capacity
" Commonality " Can be monitored
" factored data " from the Admin
" & RAIN parity " UI.
" data "
" "
" "
" "
" "
" "
" "
" "
0% "---------------------"