Avamar: Checkpoint failed with result "MSG_ERR_BADTIMESYNC."
Riepilogo: Checkpoints fail with the result "MSG_ERR_BADTIMESYNC"
Sintomi
Checkpoint failed with result "MSG_ERR_BADTIMESYNC"
The "avmaint cpstatus" command shows the following error:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cpstatus
generation-time="1663935384"
tag="cp.20220923121551"
status="error"
stripes-completed="0"
stripes-total="0"
start-time="1663935351"
end-time="1663935351"
result="MSG_ERR_BADTIMESYNC"
refcount="1"/>
The "mapall --parallel 'date'" command shows that a node is out of sync:
(To run the mapall command, the keys must be loaded per Avamar: How to Log in to an Avamar Server and Load Various Keys)
admin@utility:~/>: mapall --parallel 'date'
Using /usr/local/avamar/var/probe.xml
(0.0) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.2 'date'
(0.1) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.3 'date'
(0.2) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.4 'date'
(0.3) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.5 'date'
(0.4) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.6 'date'
(0.7) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.9 'date'
(0.6) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.8 'date'
(0.5) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.7 'date'
Fri Sep 23 13:05:21 UTC 2022
Fri Sep 23 13:05:21 UTC 2022
Fri Sep 23 13:07:17 UTC 2022 <---- out of sync node
Fri Sep 23 13:05:20 UTC 2022
Fri Sep 23 13:05:22 UTC 2022
Fri Sep 23 13:05:20 UTC 2022
Fri Sep 23 13:05:22 UTC 2022
Fri Sep 23 13:05:21 UTC 2022
Verifying the network time protocol (NTP) with the "ntpq -pn" command shows the message "Connection refused" on the suspect node:
(Output edited to show only the affected node)
admin@utility:~/>: mapall --noerror '/usr/sbin/ntpq -pn'
Using /usr/local/avamar/var/probe.xml
...
(0.3) ssh -q -x -o GSSAPIAuthentication=no admin@192.168.255.5 '/usr/sbin/ntpq -pn'
/usr/sbin/ntpq: read: Connection refused
...
When verifying the status directly as root on the affected node, the "Network Time Protocol Daemon" (NTPD) shows "Active: activating (auto-restart) (Result: resources)":
root@node03:~/>: systemctl status ntpd
● ntpd.service - NTP Server Daemon
Loaded: loaded (/usr/lib/systemd/system/ntpd.service; enabled; vendor preset: disabled)
Drop-In: /run/systemd/generator/ntpd.service.d
└─50-insserv.conf-$time.conf
Active: activating (auto-restart) (Result: resources) since Fri 2022-09-23 13:22:35 UTC; 1min 58s ago
The status should reflect Active: active (running):
● ntpd.service - NTP Server Daemon
Loaded: loaded (/usr/lib/systemd/system/ntpd.service; enabled; vendor preset: disabled)
Drop-In: /run/systemd/generator/ntpd.service.d
└─50-insserv.conf-$time.conf
Active: active (running) since Fri 2022-09-23 14:04:37 UTC; 26s ago
Attempting to start NTPD fails:
root@node03:~/#: systemctl start ntpd.service
Job for ntpd.service failed because a configured resource limit was exceeded. See "systemctl status ntpd.service" and "journalctl -xe" for details.
The output of the "journalctl -xe" command reports "No space left on the device" messages.
The "df" command shows that /var is at 100%:
df -kh
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 8.0K 16G 1% /dev
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 16G 50M 16G 1% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/sda5 9.8G 2.4G 7.0G 26% /
/dev/sdg1 183G 8.3G 165G 5% /ssd01
/dev/sda1 979M 50M 878M 6% /boot
/dev/sdd1 1.9T 236G 1.6T 13% /data04
/dev/sdc1 1.9T 240G 1.6T 13% /data03
/dev/sde1 1.9T 236G 1.6T 13% /data05
/dev/sdf1 1.9T 238G 1.6T 13% /data06
/dev/sdb1 1.9T 238G 1.6T 13% /data02
/dev/sda7 2.0G 2.0G 0 100% /var <------- 100% Use
/dev/sda3 1.8T 267G 1.6T 15% /data01Causa
NTPD relies on /var/lib/ntp/drift/ntp.drift which contains the latest estimate of clock frequency error.
If /var is at 100% full, NTPD cannot update or create the ntp.drift file and NTP does not function correctly.
Risoluzione
1. On the affected node, investigate and resolve 100% usage of /var.
2. Once corrected:
a. Restart NTPD:
root@node03:~/#: systemctl restart ntpd
b. Check the status of ntpd:
root@node03:~/#: systemctl status ntpd
Results similar to the following should be seen:
● ntpd.service - NTP Server Daemon
Loaded: loaded (/usr/lib/systemd/system/ntpd.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2022-09-27 21:21:42 UTC; 37s ago
Docs: man:ntpd(1)
Process: 29442 ExecStart=/usr/sbin/start-ntpd start (code=exited, status=0/SUCCESS)
Main PID: 29463 (ntpd)
Tasks: 2
CGroup: /system.slice/ntpd.service
├─29463 /usr/sbin/ntpd -p /var/run/ntp/ntpd.pid -g -u ntp:ntp -c /etc/ntp.conf
└─29464 ntpd: asynchronous dns resolver
Sep 27 21:21:42 node03 ntpd[29463]: Listen normally on 3 bond0 10.n.n.52:123
Sep 27 21:21:42 node03 ntpd[29463]: Listen normally on 4 bond1 192.168.255.22:123
Sep 27 21:21:42 node03 ntpd[29463]: Listen normally on 5 lo [::1]:123
Sep 27 21:21:42 node03 ntpd[29463]: Listen normally on 6 bond0 [fe80::260:16ff:feaa:2a10%11]:123
Sep 27 21:21:42 node03 ntpd[29463]: Listen normally on 7 bond1 [fe80::260:16ff:fea9:b182%12]:123
Sep 27 21:21:42 node03 ntpd[29463]: Listening on routing socket on fd #24 for interface updates
Sep 27 21:21:42 node03 start-ntpd[29442]: Starting network time protocol daemon (NTPD)
Sep 27 21:21:42 node03 systemd[1]: Started NTP Server Daemon.
c. Verify NTP with ntpq:
root@node03:~/#: /usr/sbin/ntpq -pn
Results similar to the following should be seen:
remote refid st t when poll reach delay offset jitter
==============================================================================
*10.241.216.209 10.233.131.242 2 u 966 1024 377 0.558 1.559 0.600
+192.168.255.21 10.241.216.209 3 u 401 1024 377 0.152 0.521 0.420
3. Confirm the resolution by running a manual checkpoint from the Avamar Utility Node.