Avamar: HFScheck is running for longer than expected
Summary: HFScheck runs for longer than expected because the Avamar checkpoint was not created on all storage nodes due to a suspended disk.
Symptoms
The "status.dpn" output shows that the checkpoint validation (aka hfscheck) is running for a long time.
The "avmaint hfscheckstatus" that is running as reduced:
avmaint hfscheckstatus
<hfscheckstatus
nodes-queried="13"
nodes-replied="13"
nodes-total="12"
checkpoint="cp.20190512172440"
status="hfscheck"
phase="datasweep"
type="reduced"
checks="rolling+metadata:10:2"
elapsed-time="153057"
start-time="1557682651"
end-time="0"
check-start-time="1557683842"
check-end-time="0"
generation-time="1557835708"
stripes-checking="274039"
stripes-completed="240254"
offline-stripes="0"
minutes-to-completion="1190"
percent-complete="68.01">
<hfscheckerrors/>
</hfscheckstatus>
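The fields that matter can be pulled out of this output without reading the full XML. A minimal sketch, assuming the attribute-per-line layout shown above; a saved copy of the sample stands in for the live command here, so on a real grid replace the here-string with the output of avmaint hfscheckstatus:

```shell
# Sketch: extract the hfscheck type and progress from hfscheckstatus output.
# The sample text below stands in for:  avmaint hfscheckstatus
status='checkpoint="cp.20190512172440"
status="hfscheck"
type="reduced"
percent-complete="68.01">'

summary=$(printf '%s\n' "$status" | awk -F'"' '
  /type=/             { t = $2 }   # hfscheck type (full, rolling, reduced)
  /percent-complete=/ { p = $2 }   # progress percentage
  END { printf "type=%s percent-complete=%s", t, p }')

echo "$summary"
# A "reduced" type means the check is running against an incomplete checkpoint.
```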
One of the nodes is missing the hfscheck error log (this identifies which node has the issue).
With the keys loaded (see Avamar: How to Log in to an Avamar Server and Load Various Keys if required), run the following command:
mapall --noerror 'tail -4 /data01/hfscheck/err.log'
In this sample output, node 0.10 does not have the hfscheck/err.log:
Using /usr/local/avamar/var/probe.xml
(0.0) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.2 'tail -4 /data01/hfscheck/err.log'
2019/05/12-13:01:56.40927 {P0.0} [gsan] <1306> sysconfig info: Valid NICs=8 NICs up=3
2019/05/12-13:01:56.40930 {P0.0} [gsan] <1306> sysconfig info: valid NIC eth0 [speedMb=100, duplex=FULL] is not at maximum speed [speedMb=1000, duplex=FULL]
2019/05/12-13:01:56.48368 {P0.0} [gsan] <1291> FIPS mode enabled
2019/05/12-13:02:02.76275 {0.0} [nodebeat:116] <0016> node 0.1 was offline, changing
(0.1) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.4 'tail -4 /data01/hfscheck/err.log'
2019/05/12-13:01:56.42559 {P0.1} [gsan] <1306> sysconfig info: Valid NICs=8 NICs up=3
2019/05/12-13:01:56.42568 {P0.1} [gsan] <1306> sysconfig info: All NICs are at maximum speed [speedMb=1000, duplex=FULL]
2019/05/12-13:01:56.49923 {P0.1} [gsan] <1291> FIPS mode enabled
2019/05/12-13:02:02.78169 {0.1} [nodebeat:116] <0016> node 0.0 was offline, changing
...
(0.10) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.12 'tail -4 /data01/hfscheck/err.log'
tail: cannot open /data01/hfscheck/err.log: No such file or directory
(0.11) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.13 'tail -4 /data01/hfscheck/err.log'
2019/05/12-13:01:56.41444 {P0.2} [gsan] <1306> sysconfig info: All NICs are at maximum speed [speedMb=1000, duplex=FULL]
2019/05/12-13:01:56.48873 {P0.2} [gsan] <1291> FIPS mode enabled
2019/05/12-13:02:01.74107 {0.11} [nodebeat:107] <0016> node 0.1 was offline, changing
2019/05/12-13:02:01.76281 {0.11} [nodebeat:107] <0016> node 0.0 was offline, changing
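The affected node can also be picked out of the mapall output directly, rather than reading every node's tail. A sketch, using a shortened, partly hypothetical fragment of the output above (node numbers and IPs are illustrative); on a real grid the input would be a saved copy of the mapall run:

```shell
# Sketch: find which node is missing err.log from saved mapall output.
# On a grid:  mapall --noerror 'tail -4 /data01/hfscheck/err.log' > /tmp/errlog.out
cat > /tmp/errlog.out <<'EOF'
(0.9) ssh -q -x admin@192.168.255.11 'tail -4 /data01/hfscheck/err.log'
2019/05/12-13:01:56 {P0.9} [gsan] <1291> FIPS mode enabled
(0.10) ssh -q -x admin@192.168.255.12 'tail -4 /data01/hfscheck/err.log'
tail: cannot open /data01/hfscheck/err.log: No such file or directory
EOF

# The "(0.x)" tag on the line before each error names the affected node.
missing=$(grep -B1 'No such file or directory' /tmp/errlog.out |
          grep -o '^(0\.[0-9]*)')
echo "node(s) missing err.log: $missing"
```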
Warning (WARN) messages about suspended disks are reported in the GSAN log (/data01/cur/gsan.log*).
For example:
2019/05/12-17:27:13.29974 {0.A} [manage:3858] WARN: <1084> changing disk 2 on node 0.A to suspended state
The checkpoint was not created on the affected storage node, and the checkpoint is reduced:
(From the GSAN logs)
2019/05/12-17:27:13.29974 {0.A} [manage:3858] WARN: <1084> changing disk 2 on node 0.A to suspended state
2019/05/12-17:28:27.28818 {0.A} [manage:3148] WARN: <1040> cannot create checkpoint cp.20190512172440 on node 0.A because a disk is suspended.
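Both warnings carry fixed event codes (<1084> for the disk suspension, <1040> for the skipped checkpoint), so they can be searched for in one pass. A sketch; sample lines, including one unrelated info line to show it is filtered out, stand in for the real logs:

```shell
# Sketch: search GSAN logs for the suspended-disk warnings by event code.
# On a grid:  grep -h -E '<1084>|<1040>' /data01/cur/gsan.log*
cat > /tmp/gsan.log <<'EOF'
2019/05/12-17:27:13.29974 {0.A} [manage:3858] WARN: <1084> changing disk 2 on node 0.A to suspended state
2019/05/12-17:27:45.00000 {0.A} [gsan] <1306> sysconfig info: Valid NICs=8 NICs up=3
2019/05/12-17:28:27.28818 {0.A} [manage:3148] WARN: <1040> cannot create checkpoint cp.20190512172440 on node 0.A because a disk is suspended.
EOF

hits=$(grep -c -E '<1084>|<1040>' /tmp/gsan.log)
grep -E '<1084>|<1040>' /tmp/gsan.log
```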
The checkpoint is reduced:
cplist --lscp
In this sample output, the checkpoint is on 12 of the 13 nodes:
cp.20190512172440 Sun May 12 13:24:40 2019 valid --- --- nodes 12/13 stripes 342457
Cause
The suspended disk on the storage node caused the checkpoint to be incomplete (12/13 nodes); as a result, the checkpoint validation (hfscheck) runs as reduced and takes longer to complete.
Resolution
1. Verify that all disks on the affected storage node are online before proceeding.
2. Verify that all stripes are online per Avamar: Suspended Partitions, Stripes, and Hfscheck Failures on Avamar.
3. Terminate the running hfscheck:
avmaint hfscheckstop --ava
4. Put the grid into a controlled state (see Avamar: How to Set the Avamar Server into a Known Controlled State).
5. Create a checkpoint:
avmaint checkpoint --ava --wait
6. Verify that the checkpoint was created successfully by using one of the following commands:
status.dpn | grep "Last checkpoint"
Last checkpoint: cp.20190514140904 finished Tue May 14 07:09:26 2019 after 00m 22s (OK)
-- Or --
avmaint cpstatus
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cpstatus
generation-time="1777352670"
tag="cp.20190514140904"
status="completed"
stripes-completed="393"
stripes-total="393"
start-time="1777298944"
end-time="1777298966"
result="OK"
refcount="13"/>
7. Verify that the checkpoint is not reduced:
cplist --lscp
In this sample output, the checkpoint is on all 13 nodes:
cp.20190514140904 Tue May 14 14:09:04 2019 valid --- --- nodes 13/13 stripes 342457
8. Run an hfscheck on the newly created checkpoint:
avmaint hfscheck --ava
(The prompt only returns once the initial phase of the hfscheck has completed.)
9. Monitor the hfscheck to completion:
watch -n 60 avmaint hfscheckstatus
10. Acknowledge any data integrity alerts:
mccli event clear-data-integrity-alerts --reset-code=AVAMARDATAOK
11. Return the grid to production state (See Avamar: How to Set the Avamar Server into a Known Controlled State).
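The checkpoint verification in steps 6 and 7 can be scripted as one small check. A sketch, assuming the cplist line format shown above and that the newest checkpoint is listed last; a sample line stands in for the live output:

```shell
# Sketch: verify the newest checkpoint is present on every node.
# On a grid, feed the line from:  cplist --lscp | tail -1
line='cp.20190512172440 Tue May 14 13:24:40 2019 valid --- --- nodes 13/13 stripes 342457'

verdict=$(printf '%s\n' "$line" | awk '
  {
    # Locate the "nodes" keyword, then split the following "X/Y" count.
    for (i = 1; i <= NF; i++)
      if ($i == "nodes") { split($(i + 1), n, "/"); break }
    if (n[1] == n[2])
      printf "checkpoint %s is complete (%s/%s nodes)", $1, n[1], n[2]
    else
      printf "checkpoint %s is REDUCED (%s/%s nodes)", $1, n[1], n[2]
  }')
echo "$verdict"
```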