ECS: xDoctor: RAP081: Symptom Code: 2048: All NTP servers are NOT suitable for synchronization
Summary: xDoctor detected a Network Time Protocol (NTP) daemon issue.
Symptoms
All nodes in an ECS rack should have the NTP daemon running, and the configured NTP servers should be capable of synchronizing time. If not, this may lead to problems with frontend data ingestion.
| Symptom | Message |
|---|---|
| NTP_NOT_SUITABLE_ERROR | Message = All NTP servers are NOT suitable for synchronization. |
Cause
The above symptom is reported as a WARNING for the first 24 hours.
If it persists beyond 24 hours, the severity is raised to ERROR and a RAP081 is reported.
Resolution
This means that each node listed in the 'Extra' field cannot synchronize with the NTP server.
Verification:
1. Get the list of NTP Servers on each of the listed nodes:
Command:
# getrackinfo -r | grep NTP
Example:
admin@node1:~> getrackinfo -r | grep NTP
NTPServer = xxx.xxx.xxx.xxx
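The lookup in step 1 can also be scripted so that both IP addresses and FQDNs are captured. The following is a minimal sketch, not part of the ECS tooling: `list_ntp_servers` is a hypothetical helper name, and it assumes the `NTPServer = <value>` layout shown in the example above. Support for several comma- or space-separated values on one line is an assumption.

```shell
# Hypothetical helper: print every value found on "NTPServer = <value>" lines
# read from stdin, one value per output line.
list_ntp_servers() {
    awk -F'=' '/NTPServer/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)      # trim blanks around the value
        n = split($2, parts, /[ ,]+/)        # assumption: values may be comma/space separated
        for (i = 1; i <= n; i++)
            if (parts[i] != "") print parts[i]
    }'
}
```

Possible usage on a node: `getrackinfo -r | list_ntp_servers`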
2. For each NTP Server listed in step 1, test if it is capable of synchronizing time.
Command:
# sudo ntpdate -p 2 -d <NTP IP Address / NTP FQDN>
Or
# sudo ntpdate -p 2 -d `getrackinfo -r | grep NTP |grep -oP "(?:[0-9]{1,3}\.){3}[0-9]{1,3}"`
Example (capable of synchronizing time):
admin@node1:~> sudo ntpdate -p 2 -d xxx.xxx.xxx.xxx
22 Feb 13:47:48 ntpdate[110901]: ntpdate 4.2.8p11@1.3728-o Thu Jun 14 09:26:52 UTC 2018 (1)
Looking for host <NTP IP Address> and service ntp
<NTP IP Address> reversed to <NTP hostname>
host found : <NTP hostname>
transmit(<NTP IP Address>)
receive(<NTP IP Address>)
transmit(<NTP IP Address>)
receive(<NTP IP Address>)
server <NTP IP Address>, port 123
stratum 2, precision -24, leap 00, trust 000
refid [<NTP IP Address>], delay 0.02615, dispersion 0.00003
transmitted 2, in filter 2
reference time: e01a7b0d.af9e6616 Fri, Feb 22 2019 13:43:41.686
originate timestamp: e01a7c06.748e0c65 Fri, Feb 22 2019 13:47:50.455
transmit timestamp: e01a7c06.7478b000 Fri, Feb 22 2019 13:47:50.454
filter delay: 0.02635 0.02615 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000
filter offset: 0.000043 -0.00002 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000
delay 0.02615, dispersion 0.00003
offset -0.000022
22 Feb 13:47:50 ntpdate[110901]: adjust time server <NTP IP address> offset -0.000022 sec
Example (not capable of synchronizing time):
admin@node1:~> sudo ntpdate -p 2 -d xxx.xxx.xxx.xxx
22 Feb 13:47:48 ntpdate[110901]: ntpdate 4.2.8p11@1.3728-o Thu Jun 14 09:26:52 UTC 2018 (1)
Looking for host <NTP IP Address> and service ntp
<NTP IP Address> reversed to <NTP hostname>
host found : <NTP hostname>
transmit(<NTP IP Address>)
transmit(<NTP IP Address>)
transmit(<NTP IP Address>)
server <NTP IP Address>, port 123
stratum 2, precision -24, leap 00, trust 000
refid [<NTP IP Address>], delay 0.02615, dispersion 0.00003
transmitted 2, in filter 2
reference time: e01a7b0d.af9e6616 Fri, Feb 22 2019 13:43:41.686
originate timestamp: e01a7c06.748e0c65 Fri, Feb 22 2019 13:47:50.455
transmit timestamp: e01a7c06.7478b000 Fri, Feb 22 2019 13:47:50.454
filter delay: 0.02635 0.02615 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000
filter offset: 0.000043 -0.00002 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000
delay 0.02615, dispersion 0.00003
offset -0.000022
22 Feb 13:47:50 ntpdate[112232]: no server suitable for synchronization found
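The two example outputs in step 2 differ mainly in the final line and in whether `receive(...)` lines follow the `transmit(...)` lines. A minimal sketch for classifying saved `ntpdate -d` output is shown below. `classify_ntpdate` is a hypothetical helper name, and the matched strings are taken from the examples above (plus `step time server`, which ntpdate typically prints instead of `adjust time server` for larger offsets).

```shell
# Hypothetical helper: classify `ntpdate -d` debug output read from stdin.
# Prints a verdict plus the number of transmit/receive lines seen.
classify_ntpdate() {
    awk '
        /^transmit\(/ { tx++ }
        /^receive\(/  { rx++ }
        /no server suitable for synchronization found/ { bad = 1 }
        /(adjust|step) time server/                     { ok = 1 }
        END {
            if (bad)      verdict = "not-suitable"
            else if (ok)  verdict = "suitable"
            else          verdict = "unknown"
            printf "%s transmit=%d receive=%d\n", verdict, tx, rx
        }'
}
```

Possible usage: `sudo ntpdate -p 2 -d <NTP IP Address / NTP FQDN> 2>&1 | classify_ntpdate`. A result with `receive=0` suggests that replies are not reaching the node.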
3. Add the NTP server FQDN to the NTPServer entry shown in the getrackinfo -r output.
Command:
# sudo setrackinfo -a NTPServer <NTP FQDN>
4. Check for network separation and static routes, as NTP sent from the management interface over Policy-Based Routing could cause the problem.
Command:
# getrackinfo -n;getrackinfo -t
Example:
admin@node1:~>getrackinfo -n;getrackinfo -t
Named networks
==============
Node ID Network Ip Address Netmask Gateway VLAN Interface
Static route list
=================
Node ID Network Netmask Gateway Interface
5. Confirm that the NTP servers are listening in the customer's environment. A firewall blocking the NTP port (UDP 123) is a common cause.
Command:
# sudo ntpq -c as
Example (reach = no and condition = reject show an NTP server that is not reachable, likely blocked by an ACL):
admin@node1:~> sudo ntpq -c as
ind assid status conf reach auth condition last_event cnt
===========================================================
1 56633 8011 yes no none reject mobilize 1
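The association table from step 5 can be filtered for entries that are unreachable or rejected. The sketch below is illustrative only: `unreachable_assocs` is a hypothetical helper name, and it assumes the nine-column layout shown in the example (reach in column 5, condition in column 7).

```shell
# Hypothetical helper: read `ntpq -c as` output from stdin and print the
# associations whose reach column is not "yes" or whose condition is "reject".
unreachable_assocs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 9 && ($5 != "yes" || $7 == "reject") {
        printf "assid=%s reach=%s condition=%s\n", $2, $5, $7
    }'
}
```

Possible usage: `sudo ntpq -c as | unreachable_assocs`. No output means every listed association is reachable and not rejected.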
6. Check for time drift between the nodes by comparing their epoch timestamps.
Command:
# viprexec "date +%s" 2>&1 | grep -E "^[0-9]{10}"
Example:
admin@node1:~>viprexec "date +%s" 2>&1 | grep -E "^[0-9]{10}"
1554470147
1554470111
1554470096
1554470142
1554470144
1554470109
1554470124
1554470140
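To quantify the drift in step 6, the difference between the largest and smallest timestamp can be computed. This is a minimal sketch: `epoch_spread` is a hypothetical helper name, and it assumes one epoch value per line on stdin. Part of the spread may come from the commands not running on every node at the same instant, so treat the result as an upper bound.

```shell
# Hypothetical helper: print (max - min) of the epoch timestamps on stdin,
# ignoring any line that is not a bare integer.
epoch_spread() {
    awk 'NF == 1 && $1 ~ /^[0-9]+$/ {
        if (n == 0 || $1 < min) min = $1
        if (n == 0 || $1 > max) max = $1
        n++
    }
    END {
        if (n > 0) print max - min
        else       print "no-data"
    }'
}
```

For the eight timestamps in the example above, the spread is 1554470147 - 1554470096 = 51 seconds.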
7. Check for the ntpd service status and then restart the service. (Even if the status is up and running, proceed with the restart.)
Note: Restarting ntpd.service has no service impact.
Command:
# viprexec systemctl status ntpd.service | grep Active:
Example:
admin@node1:~> viprexec systemctl status ntpd.service | grep Active:
Active: active (running) since Tue 2019-08-06 02:49:06 UTC; 1 day 18h ago
Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
Active: active (running) since Wed 2019-08-07 20:13:27 UTC; 58min ago
Active: active (running) since Tue 2019-08-06 02:49:06 UTC; 1 day 18h ago
Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
Command:
# viprexec systemctl restart ntpd.service
Example:
admin@node1:~> viprexec systemctl restart ntpd.service
Output from host : 192.168.219.8
Output from host : 192.168.219.7
Output from host : 192.168.219.6
Output from host : 192.168.219.4
Output from host : 192.168.219.3
Output from host : 192.168.219.2
Output from host : 192.168.219.5
Output from host : 192.168.219.1
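After the restart in step 7, the status command can be run again and its output summarized. The sketch below is an illustration, not ECS tooling: `count_ntpd_states` is a hypothetical helper name, and it assumes the `Active:` line format shown in the status example.

```shell
# Hypothetical helper: count how many "Active:" lines on stdin report
# "active (running)" versus the total number of "Active:" lines.
count_ntpd_states() {
    awk '/Active:/ {
            total++
            if ($0 ~ /Active: active \(running\)/) up++
         }
         END { printf "running=%d total=%d\n", up, total }'
}
```

Possible usage: `viprexec systemctl status ntpd.service | count_ntpd_states`. On a healthy rack both numbers should match the node count.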
8. Verify that the md5sum of the /etc/ntp.conf file is the same on all nodes.
Command:
# viprexec "sudo md5sum /etc/ntp.conf"
Example:
admin@node1:~> viprexec "sudo md5sum /etc/ntp.conf"
Output from host : 192.168.219.2
741f0abb12ac82a21f150004bd407334 /etc/ntp.conf
Output from host : 192.168.219.5
741f0abb12ac82a21f150004bd407334 /etc/ntp.conf
Output from host : 192.168.219.4
741f0abb12ac82a21f150004bd407334 /etc/ntp.conf
Output from host : 192.168.219.1
7da6eb8009abc18ed1875f1f15ade72a /etc/ntp.conf
Output from host : 192.168.219.3
741f0abb12ac82a21f150004bd407334 /etc/ntp.conf
Output from host : 192.168.219.8
741f0abb12ac82a21f150004bd407334 /etc/ntp.conf
Output from host : 192.168.219.6
741f0abb12ac82a21f150004bd407334 /etc/ntp.conf
Output from host : 192.168.219.7
741f0abb12ac82a21f150004bd407334 /etc/ntp.conf
Note: This may be due to the rack having both public and management interfaces, with all nodes configured to send traffic out of the public interface per the last configuration provided. On older versions of ECS, PBR can get stuck in a state where one node is valid and the rest of the nodes appear to be behind a firewall.
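The checksum comparison in step 8 can be automated by reporting the nodes whose checksum differs from the most common one. This is a minimal sketch under stated assumptions: `ntp_conf_outliers` is a hypothetical helper name, the input is expected in the `Output from host : <IP>` / `<md5> /etc/ntp.conf` layout shown in the example, and a tie between checksums is resolved arbitrarily.

```shell
# Hypothetical helper: read viprexec md5sum output from stdin and print
# "<host> <md5>" for every host that does not match the majority checksum.
ntp_conf_outliers() {
    awk '
        /^Output from host/ { host = $NF; next }       # remember the current host
        $2 == "/etc/ntp.conf" && host != "" {
            sum[host] = $1; count[$1]++; host = ""
        }
        END {
            for (s in count) if (count[s] > best) { best = count[s]; major = s }
            for (h in sum)   if (sum[h] != major) print h, sum[h]
        }'
}
```

Possible usage: `viprexec "sudo md5sum /etc/ntp.conf" | ntp_conf_outliers`. For the example output above, this would report only 192.168.219.1.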
9. Add port 123 to ns_mgmt (as listed in the getrackinfo -r output), then check whether NTP has started transmitting and receiving.
Command:
# sudo setrackinfo -a ns_mgmt 123
Example:
admin@node1:~>sudo setrackinfo -a ns_mgmt 123
If the error still persists, move port 123 back to the public interface and check synchronization again.
Command:
# sudo setrackinfo -d ns_mgmt 123
Example:
admin@node1:~> sudo setrackinfo -d ns_mgmt 123
Check the status of the NTP synchronization after performing each of the above steps.
Resolution:
If synchronization still fails after the steps above, the configured server is either not an NTP server or is not functioning as expected. The customer's network team must be engaged to resolve the NTP issue.
Additional Information
For symptom 'NTP daemon not running' (NTPD_NOT_RUNNING), see knowledge article:
ECS: xDoctor: RAP081: Symptom Code: 2048: NTP daemon not running
For symptom 'All NTP servers adjust an offset higher than the error threshold' (NTP_ERROR_OFFSET_ERROR), see knowledge article:
ECS: xDoctor: RAP081: Symptom Code: 2048: All NTP servers adjust an offset higher than the error threshold
For symptom 'System time difference above ERROR Threshold', see knowledge article:
ECS: xDoctor: RAP081: Symptom Code: 2048: System time difference above ERROR threshold