VxRail: How to Troubleshoot NTP in a VxRail Cluster
Summary: How to troubleshoot Network Time Protocol (NTP) issues.
Instructions
/etc/ntp.conf directly. To configure NTP on the hosts see: https://knowledge.broadcom.com/external/article/313808
Use ntpq to check synchronization status from VxRail Manager:
vrm:~ # ntpq -c assoc

ind assid status conf reach auth condition last_event cnt
===========================================================
  1  3898  961a  yes   yes  none  sys.peer   sys_peer   1
Note: When NTP is working correctly, the output shows reach=yes and condition=sys.peer.
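The check in the note above can be scripted. The following is a minimal Python sketch, not a VxRail tool: the function names parse_assoc and is_synchronized are hypothetical, and the sample text is the assoc output from this article. It assumes the column layout shown above.

```python
def parse_assoc(output: str) -> list[dict[str, str]]:
    """Parse the table printed by `ntpq -c assoc` into one dict per association."""
    rows = []
    header = None
    for line in output.splitlines():
        fields = line.split()
        if not fields or set(line.strip()) == {"="}:
            continue  # skip blank lines and the ===== separator
        if fields[0] == "ind":
            header = fields  # column names: ind, assid, status, ...
        elif header and fields[0].isdigit():
            rows.append(dict(zip(header, fields)))
    return rows


def is_synchronized(rows: list[dict[str, str]]) -> bool:
    """True if at least one association is reachable and selected as sys.peer."""
    return any(
        r.get("reach") == "yes" and r.get("condition") == "sys.peer" for r in rows
    )


sample = """\
ind assid status conf reach auth condition last_event cnt
===========================================================
  1  3898  961a  yes   yes  none  sys.peer   sys_peer   1
"""
print(is_synchronized(parse_assoc(sample)))
```

To use it against a live system, pass the captured output of ntpq -c assoc to parse_assoc instead of the sample string.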
ntpq> rv 3898
associd=3898 status=961a conf, reach, sel_sys.peer, 1 event, sys_peer,
srcadr=10.XX.1XX.1X0, srcport=123, dstadr=10.XX.1XX.1X1, dstport=123,
leap=00, stratum=12, precision=-6, rootdelay=31.250, rootdisp=64.575,
refid=10.62.68.236,
reftime=e0d00ab8.2af01902 Wed, Jul 10 2019 6:56:56.167,
rec=e0d00c5e.d78d706e Wed, Jul 10 2019 7:03:58.842, reach=377,
If reach is not yes and condition is not sys.peer, time synchronization has a problem. Check the local time against the NTP server time. If the local time differs from the NTP server time by more than 1000 seconds, ntpd will not set the clock, and the time must be set manually.
The following output shows an abnormal synchronization status:
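As a worked example of the 1000-second rule, the sketch below compares two timestamps. The function name needs_manual_set and the sample times are illustrative assumptions; only the 1000-second limit comes from this article.

```python
from datetime import datetime, timezone

PANIC_THRESHOLD_S = 1000  # limit stated in this article


def needs_manual_set(local: datetime, server: datetime,
                     threshold_s: float = PANIC_THRESHOLD_S) -> bool:
    """True if the clocks differ by more than the threshold, in either direction."""
    return abs((local - server).total_seconds()) > threshold_s


# Illustrative values: the local clock is 23 min 58 s (1438 s) behind the server.
server_time = datetime(2019, 7, 10, 7, 3, 58, tzinfo=timezone.utc)
local_time = datetime(2019, 7, 10, 6, 40, 0, tzinfo=timezone.utc)
print(needs_manual_set(local_time, server_time))  # True: 1438 s > 1000 s
```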
vrm:~ # ntpq -c assoc

ind assid status conf reach auth condition last_event cnt
===========================================================
  1 58280  8011  yes    no  none    reject   mobilize   1
reach=no means that the NTP server is not responding to requests or that the network is unavailable. Troubleshoot the network and the NTP server.
Scenario 1: Network issue:
Use ping to check whether the NTP server is reachable, and follow the network troubleshooting procedure. Once a network issue is confirmed, ask the user to engage their network team and to confirm when the issue is fixed.
Scenario 2: Wrong NTP IP or Service issue:
If the NTP server responds to ping, the user may have entered the wrong NTP IP address, or the NTP service may have an issue. Confirm with the user that the NTP IP address is correct, or use another NTP server if the user has one, and ask the user to engage their admin team to check the NTP service. A server reboot sometimes fixes the issue, so try that if it is acceptable to the user.
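A server can answer ping while its NTP service does not answer on UDP port 123. One way to tell the two apart is to send a basic SNTP client request. The following Python sketch is an illustration only and is not part of VxRail Manager. The function names are hypothetical, and it assumes UDP port 123 is allowed between the machine running the script and the NTP server.

```python
import socket
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request() -> bytes:
    """48-byte client request: LI=0, VN=3, Mode=3 (client) gives first byte 0x1b."""
    return b"\x1b" + 47 * b"\0"


def parse_response(packet: bytes) -> dict:
    """Extract a few header fields and the transmit timestamp from a reply."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    first, stratum = packet[0], packet[1]
    secs, frac = struct.unpack("!II", packet[40:48])  # transmit timestamp
    return {
        "leap": first >> 6,
        "mode": first & 0x7,  # 4 means the reply came from a server
        "stratum": stratum,
        "unix_time": secs - NTP_EPOCH_OFFSET + frac / 2**32,
    }


def query(server: str, timeout: float = 2.0) -> dict:
    """Send one request; raises a timeout error if the service does not answer."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_request(), (server, 123))
        packet, _ = sock.recvfrom(512)
    return parse_response(packet)
```

Calling query with the configured NTP IP address either returns the parsed reply or times out. A timeout while ping succeeds points to the NTP service or to a firewall rather than to basic connectivity.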
Scenario 3: Windows NTP server:
The Windows Time service does not implement the full NTP feature set. If the user uses a Windows Server as the NTP server, the rootdisp may be higher than 1000. In that case, configure the Windows NTP server to synchronize with a reliable external NTP server.
If reach=yes but condition=reject, use ntpq with the assoc and rv commands to check the flash code, dispersion, and rootdisp.
vrm:~ # ntpq -c assoc

ind assid status conf reach auth condition last_event cnt
===========================================================
  1  3898  9014  yes   yes  none    reject  reachable   1
Note: The assoc command shows the assid, which is needed for the rv command later.
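The status column in the assoc output is a hexadecimal word that packs the same conf, reach, condition, and last_event information. The Python sketch below decodes it using the peer status layout in the ntpd event and status word documentation. The function name decode_peer_status is hypothetical, and only a subset of event codes is listed; unknown codes are returned as hexadecimal.

```python
# Selection codes, index = 3-bit select field of the peer status word.
SELECT = ["reject", "falsetick", "excess", "outlier",
          "candidate", "backup", "sys.peer", "pps.peer"]

# Subset of peer event codes (low 4 bits); others are returned as hex.
EVENTS = {0x1: "mobilize", 0x2: "demobilize", 0x3: "unreachable",
          0x4: "reachable", 0xA: "sys_peer"}


def decode_peer_status(word: str) -> dict:
    """Decode the 16-bit peer status word printed in the `status` column."""
    value = int(word, 16)
    high = value >> 8  # status flags (top 5 bits) and select field (low 3 bits)
    return {
        "conf": bool(high & 0x80),
        "reach": bool(high & 0x10),
        "condition": SELECT[high & 0x07],
        "event_count": (value >> 4) & 0xF,
        "last_event": EVENTS.get(value & 0xF, hex(value & 0xF)),
    }


for status in ("961a", "8011", "9014"):  # the three status words in this article
    print(status, decode_peer_status(status))
```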
Use the rv command to get the flash code, dispersion, and rootdisp.
Run ntpq to enter the ntpq shell, then run rv followed by the assid to display the detailed information.
ntpq
ntpq> rv 3898
associd=3898 status=9014 conf, reach, sel_reject, 1 event, reachable,
srcadr=10.XX.1XX.1X0, srcport=123, dstadr=10.XX.1XX.1X1, dstport=123,
leap=00, stratum=12, precision=-6, rootdelay=31.250, rootdisp=1814.209,
refid=10.XX.XX.2X6,
reftime=e0cff348.12fb407d Wed, Jul 10 2019 5:16:56.074,
rec=e0cff42b.60680b73 Wed, Jul 10 2019 5:20:43.376, reach=377,
unreach=0, hmode=3, pmode=4, hpoll=6, ppoll=6, headway=50,
flash=400 peer_dist, keyid=0, offset=-2536.264, delay=0.354,
dispersion=16.515, jitter=4.414, xleave=0.038,
filtdelay= 0.35 0.29 0.32 0.26 0.28 3.22 0.28 0.35,
filtoffset= -2536.2 -2538.2 -2529.4 -2536.2 -2541.6 -2530.0 -2532.5 -2538.1,
filtdisp= 15.63 16.63 17.59 18.55 19.53 20.53 21.52 22.50

flash=400 peer_dist    # reason for the rejection
dispersion=16.515      # error/variance between this NTP server and the client
rootdisp=1814.209      # total error/variance from the root NTP server to the client
flash=400 peer_dist indicates that the distance to the root NTP server exceeds the allowed threshold, so the server is considered unfit for synchronization.
For more information about flash codes, see the following link:
https://www.eecis.udel.edu/~mills/ntp/html/decode.html#flash
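The flash value is a hexadecimal bit mask, so more than one test can fail at once. The following Python sketch decodes it using the bit names from the flash table at the link above. The function name decode_flash is hypothetical, and the table should be checked against the documentation for the ntpd version in use.

```python
# Bit value and short name for each flash test, per the ntpd decode documentation.
FLASH_BITS = [
    (0x0001, "pkt_dup"),       # duplicate packet
    (0x0002, "pkt_bogus"),     # bogus packet
    (0x0004, "pkt_unsync"),    # protocol unsynchronized
    (0x0008, "pkt_denied"),    # access denied
    (0x0010, "pkt_auth"),      # bad authentication
    (0x0020, "pkt_stratum"),   # bad synch or stratum
    (0x0040, "pkt_header"),    # bad header
    (0x0080, "pkt_autokey"),   # bad autokey
    (0x0100, "pkt_crypto"),    # bad crypto
    (0x0200, "peer_stratum"),  # peer bad synch or stratum
    (0x0400, "peer_dist"),     # peer distance exceeded
    (0x0800, "peer_loop"),     # peer synchronization loop
    (0x1000, "peer_unreach"),  # peer unreachable
]


def decode_flash(code: str) -> list[str]:
    """Return the names of all tests flagged in a hexadecimal flash value."""
    value = int(code, 16)
    return [name for bit, name in FLASH_BITS if value & bit]


print(decode_flash("400"))  # ['peer_dist'], matching the rv output in this article
```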
Generally, an NTP server with a dispersion higher than 1000 is considered unfit. If a Windows NTP server is configured to synchronize time with itself, or its parameters are not configured correctly, the rootdisp is higher than 1000, and the NTP configuration on the Windows server must be corrected.
Refer to the following Microsoft KB article to configure the Windows time server.
https://support.microsoft.com/en-us/help/816042/how-to-configure-an-authoritative-time-server-in-windows-server
Note: Change MaxPosPhaseCorrection, MaxNegPhaseCorrection, and SpecialPollInterval to 300 seconds.
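After the Windows time server is corrected, rerun rv and confirm that rootdisp has dropped. The Python sketch below extracts the scalar variables from captured rv output and applies the 1000 limit used in this article. The function names parse_rv and rootdisp_too_high are hypothetical, and the parser keeps only the first token after each equals sign, so list values such as filtdelay are truncated.

```python
import re


def parse_rv(output: str) -> dict[str, str]:
    """Map each name=value pair in `rv` output to the first token of its value."""
    return dict(re.findall(r"(\w+)=\s*([^,\s]+)", output))


def rootdisp_too_high(variables: dict[str, str], limit: float = 1000.0) -> bool:
    """True if rootdisp, as printed by ntpq, exceeds the limit from this article."""
    return float(variables["rootdisp"]) > limit


# Excerpt of the rejected association shown earlier in this article.
sample = """\
leap=00, stratum=12, precision=-6, rootdelay=31.250, rootdisp=1814.209,
flash=400 peer_dist, keyid=0, offset=-2536.264, delay=0.354,
dispersion=16.515, jitter=4.414, xleave=0.038,
"""
variables = parse_rv(sample)
print(variables["flash"], variables["rootdisp"], rootdisp_too_high(variables))
```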
Scenario 4: Unstable network between NTP server and external NTP server:
Follow the network troubleshooting procedure to check the network. Use ping to check whether there is high latency.
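To compare latency across several ping runs, the round-trip summary line can be parsed. The Python sketch below assumes the summary format printed by Linux ping (rtt min/avg/max/mdev = ... ms). The function names and the 100 ms limit are illustrative assumptions, not values from this article, and the sample text is made up for the example.

```python
import re
from typing import Optional


def parse_ping_rtt(output: str) -> Optional[dict[str, float]]:
    """Return min/avg/max/mdev in ms from a ping summary, or None if absent."""
    match = re.search(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)\s*ms", output)
    if match is None:
        return None
    names = ("min", "avg", "max", "mdev")
    return dict(zip(names, (float(v) for v in match.groups())))


def is_high_latency(stats: dict[str, float], limit_ms: float = 100.0) -> bool:
    """True if the average round-trip time exceeds the (hypothetical) limit."""
    return stats["avg"] > limit_ms


# Made-up sample of the last two lines of Linux ping output.
sample = """\
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.262/0.293/0.354/0.033 ms
"""
stats = parse_ping_rtt(sample)
print(stats, is_high_latency(stats))
```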