Community Manager

July 5, 2020, 23:00

ECS: xDoctor: RAP081: SymptomCode: 2048: "NTP daemon not running" or "All servers not suitable for synchronization found" (000530725)

​ ​

​ ​

​ ​

Knowledge base article: 000530725

Primary product: Elastic Cloud Storage

Products: ECS Appliance Hardware Series, Elastic Cloud Storage

Version: 13

Article type: Break Fix

Target audience: Level 10 = Public

Last published: Thursday, April 9, 2020, 18:40:39 GMT

Summary:

xDoctor detected a problem with the NTP daemon.

Issue:

All nodes in an ECS rack should run the NTP daemon, and the configured NTP servers should be able to synchronize time.
If they cannot, front-end data ingest may be affected.

The following symptoms report RAP081 (ERROR):

Symptom | Message
1. NTPD_NOT_RUNNING | Message = NTP daemon not running; Extra = [List of nodes]
2. NTP_NOT_SUITABLE_ERROR | Message = All NTP servers are NOT suitable for synchronization; Extra = [List of nodes]
3. NTP_ERROR_OFFSET_ERROR | Message = All NTP servers adjust offset higher than error threshold; Extra = [List of nodes]
4. System time difference above ERROR Threshold | Message = System time difference above ERROR Threshold; Extra = [List of nodes]

The symptoms above stay at WARNING status if they do not recur within 24 hours.
If a symptom is still present after 24 hours, its severity is raised to ERROR and RAP081 is reported.

​ ​
​ ​

​ ​

​ ​
​ ​

Resolution:

1. NTPD_NOT_RUNNING

This means that ntpd is not running on each node listed in the "Extra" field. Confirm whether the NTP service is running.

Verification:

1. Confirm whether the NTP service is running.

Command:
# sudo service ntpd status
Example:

​ ​

​admin@ecsnode1:~> sudo service ntpd status​

​ ​

​* ntpd.service - NTP Server Daemon​

​ ​

​ Loaded: loaded (/usr/lib/systemd/system/ntpd.service; enabled; vendor preset: disabled)​

​ ​

​ Drop-In: /run/systemd/generator/ntpd.service.d​

​ ​

​ `-50-insserv.conf-$time.conf​

​ ​

​ ​​ Active: inactive (dead) since Wed 2019-08-07 20:00:00 UTC; 3s ago​

​ ​

​ Docs: man:ntpd(1)​

​ ​

​ Main PID: 63810 (code=exited, status=0/SUCCESS)​

​ ​

​ ​

​ ​

​Aug 07 19:25:49 ecsnode1.gslabs.lab.emc.com sntp[63803]: 2019-08-07 19:25:49.504908 (+0000) -0.00017 +/- 0.051426 10.73.242.40 s2 no-leap​

​ ​

​Aug 07 19:25:49 ecsnode1.gslabs.lab.emc.com start-ntpd[63780]: Time synchronized with 10.73.242.40​

​ ​

​Aug 07 19:25:50 ecsnode1.gslabs.lab.emc.com ntpd[63809]: ntpd 4.2.8p12@1.3728-o Wed Oct 17 16:05:35 UTC 2018 (1): Starting​

​ ​

​Aug 07 19:25:50 ecsnode1.gslabs.lab.emc.com ntpd[63809]: Command line: /usr/sbin/ntpd -p /var/run/ntp/ntpd.pid -x -g -u ntp:ntp -c /etc/ntp.conf​

​ ​

​Aug 07 19:25:50 ecsnode1.gslabs.lab.emc.com ntpd[63810]: proto: precision = 0.089 usec (-23)​

​ ​

​Aug 07 19:25:50 ecsnode1.gslabs.lab.emc.com ntpd[63810]: switching logging to file /var/log/ntp​

​ ​

​Aug 07 19:25:50 ecsnode1.gslabs.lab.emc.com start-ntpd[63780]: Starting network time protocol daemon (NTPD)​

​ ​

​Aug 07 19:25:50 ecsnode1.gslabs.lab.emc.com systemd[1]: Started NTP Server Daemon.​

​ ​

​Aug 07 20:00:00 ecsnode1.gslabs.lab.emc.com systemd[1]: Stopping NTP Server Daemon...​

​ ​

​Aug 07 20:00:00 ecsnode1.gslabs.lab.emc.com systemd[1]: Stopped NTP Server Daemon.​

​ ​

2. Confirm whether the ntpd PID is present or missing:

Command:
# ps ax | grep ntpd | grep -v grep
Example:

​ ​

​admin@node1:~> ps ax | grep ntpd | grep -v grep​

​ ​

​admin@node1:~>​

​ ​
​ ​

Resolution:

1. If ntpd is not actively running, (re)start the service:

Command:
# sudo service ntpd restart
Example:

​ ​

​admin@node1:~> sudo service ntpd restart​

​ ​

​admin@node1:~>​

​ ​

2. Confirm that the service is running and that a PID exists on the system.

Command:
# sudo service ntpd status;ps ax | grep ntpd | grep -v grep
Example:

​ ​

​admin@node1:~> sudo service ntpd status;ps ax | grep ntpd | grep -v grep​

​ ​

​* ntpd.service - NTP Server Daemon​

​ ​

​ Loaded: loaded (/usr/lib/systemd/system/ntpd.service; enabled; vendor preset: disabled)​

​ ​

​ Drop-In: /run/systemd/generator/ntpd.service.d​

​ ​

​ `-50-insserv.conf-$time.conf​

​ ​

​ ​​Active: active (running) since Wed 2019-08-07 20:13:27 UTC; 3min 25s ago​

​ ​

​ Docs: man:ntpd(1)​

​ ​

​ Process: 913 ExecStart=/usr/sbin/start-ntpd start (code=exited, status=0/SUCCESS)​

​ ​

​ Main PID: 944 (ntpd)​

​ ​

​ Tasks: 2 (limit: 512)​

​ ​

​ Memory: 820.0K​

​ ​

​ CPU: 588ms​

​ ​

​ CGroup: /system.slice/ntpd.service​

​ ​

​ |-944 /usr/sbin/ntpd -p /var/run/ntp/ntpd.pid -x -g -u ntp:ntp -c /etc/ntp.conf​

​ ​

​ `-945 ntpd: asynchronous dns resolver​

​ ​

​ ​

​ ​

​Aug 07 20:13:26 ecsnode1.gslabs.lab.emc.com systemd[1]: Starting NTP Server Daemon...​

​ ​

​Aug 07 20:13:26 ecsnode1.gslabs.lab.emc.com sntp[937]: sntp 4.2.8p12@1.3728-o Wed Oct 17 16:05:30 UTC 2018 (1)​

​ ​

​Aug 07 20:13:26 ecsnode1.gslabs.lab.emc.com sntp[937]: 2019-08-07 20:13:26.567273 (+0000) +0.00003 +/- 0.048796 10.73.242.40 s2 no-leap​

​ ​

​Aug 07 20:13:26 ecsnode1.gslabs.lab.emc.com start-ntpd[913]: Time synchronized with 10.73.242.40​

​ ​

​Aug 07 20:13:27 ecsnode1.gslabs.lab.emc.com ntpd[943]: ntpd 4.2.8p12@1.3728-o Wed Oct 17 16:05:35 UTC 2018 (1): Starting​

​ ​

​Aug 07 20:13:27 ecsnode1.gslabs.lab.emc.com ntpd[943]: Command line: /usr/sbin/ntpd -p /var/run/ntp/ntpd.pid -x -g -u ntp:ntp -c /etc/ntp.conf​

​ ​

​Aug 07 20:13:27 ecsnode1.gslabs.lab.emc.com ntpd[944]: proto: precision = 0.074 usec (-24)​

​ ​

​Aug 07 20:13:27 ecsnode1.gslabs.lab.emc.com ntpd[944]: switching logging to file /var/log/ntp​

​ ​

​Aug 07 20:13:27 ecsnode1.gslabs.lab.emc.com start-ntpd[913]: Starting network time protocol daemon (NTPD)​

​ ​

​Aug 07 20:13:27 ecsnode1.gslabs.lab.emc.com systemd[1]: Started NTP Server Daemon.​

​ ​

​ 944 ? Ss 0:00 /usr/sbin/ntpd -p /var/run/ntp/ntpd.pid -x -g -u ntp:ntp -c /etc/ntp.conf​

​ ​

​ 945 ? S 0:00 ntpd: asynchronous dns resolver​
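
The check-then-restart flow above can be sketched as a small shell helper. This is a minimal sketch, not part of the original article: the function names `ntpd_active_state` and `ensure_ntpd_running` are hypothetical, and the parser assumes the systemd-style `Active:` line shown in the example output.

```shell
# Hypothetical helper: read `service ntpd status` output on stdin and print
# the state word from the "Active:" line (for example "active" or "inactive").
# Prints "unknown" when no such line is present.
ntpd_active_state() {
  awk '$1 == "Active:" { print $2; found = 1; exit }
       END { if (!found) print "unknown" }'
}

# Hypothetical wrapper: restart ntpd only when it is not reported as active,
# then list any ntpd processes. It is only defined here, not invoked, because
# it needs sudo and a node that actually runs ntpd.
ensure_ntpd_running() {
  state="$(sudo service ntpd status 2>/dev/null | ntpd_active_state)"
  if [ "$state" != "active" ]; then
    sudo service ntpd restart
  fi
  # The [n] bracket keeps grep from matching its own process entry.
  ps ax | grep '[n]tpd'
}
```

On a node, `ensure_ntpd_running` would be called once per affected host; the parser alone can be fed saved status output for offline review.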

​ ​
​ ​

​ ​

​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​
​ ​

2. NTP_NOT_SUITABLE_ERROR

This means that each node listed in the "Extra" field cannot synchronize with the NTP servers.

Verification:

1. Get the list of NTP servers on each listed node:

Command:
# getrackinfo -r | grep NTP
Example:

​ ​

​admin@node1:~> getrackinfo -r | grep NTP​

​ ​

​ NTPServer = xxx.xxx.xxx.xxx​

​ ​

2. For each NTPServer listed in step 1, test whether it can synchronize time.

Command:
# sudo ntpdate -p 2 -d 
or
# sudo ntpdate -p 2 -d `getrackinfo -r | grep NTP |grep -oP "(?:[0-9]{1,3}\.){3}[0-9]{1,3}"`
Example: (able to synchronize time)

​ ​

​admin@node1:~> sudo ntpdate -p 2 -d xxx.xxx.xxx.xxx​

​ ​

​22 Feb 13:47:48 ntpdate[110901]: ntpdate 4.2.8p11@1.3728-o Thu Jun 14 09:26:52 UTC 2018 (1)​

​ ​

​Looking for host and service ntp ​

​ ​

​ reversed to ​

​ ​

​host found : ​

​ ​

​transmit( ) ​

​ ​

​receive( ) ​

​ ​

​transmit( ) ​

​ ​

​receive( ) ​

​ ​

​server , port 123 ​

​ ​

​stratum 2, precision -24, leap 00, trust 000​

​ ​

​refid [ ], delay 0.02615, dispersion 0.00003 ​

​ ​

​transmitted 2, in filter 2​

​ ​

​reference time: e01a7b0d.af9e6616 Fri, Feb 22 2019 13:43:41.686​

​ ​

​originate timestamp: e01a7c06.748e0c65 Fri, Feb 22 2019 13:47:50.455​

​ ​

​transmit timestamp: e01a7c06.7478b000 Fri, Feb 22 2019 13:47:50.454​

​ ​

​filter delay: 0.02635 0.02615 0.00000 0.00000​

​ ​

​ 0.00000 0.00000 0.00000 0.00000​

​ ​

​filter offset: 0.000043 -0.00002 0.000000 0.000000​

​ ​

​ 0.000000 0.000000 0.000000 0.000000​

​ ​

​delay 0.02615, dispersion 0.00003​

​ ​

​offset -0.000022​

​ ​

​ ​

​ ​

​22 Feb 13:47:50 ntpdate[110901]: ​​adjust time server offset -0.000022 sec ​

​ ​

​ ​

​ ​

Example: (output when time cannot be synchronized)

​ ​

​admin@node1:~> sudo ntpdate -p 2 -d xxx.xxx.xxx.xxx​

​ ​

​22 Feb 13:47:48 ntpdate[110901]: ntpdate 4.2.8p11@1.3728-o Thu Jun 14 09:26:52 UTC 2018 (1)​

​ ​

​Looking for host and service ntp ​

​ ​

​ reversed to ​

​ ​

​host found : ​

​ ​

​transmit( ) ​

​ ​

​transmit( ) ​

​ ​

​transmit( ) ​

​ ​

​server , port 123 ​

​ ​

​stratum 2, precision -24, leap 00, trust 000​

​ ​

​refid [ ], delay 0.02615, dispersion 0.00003 ​

​ ​

​transmitted 2, in filter 2​

​ ​

​reference time: e01a7b0d.af9e6616 Fri, Feb 22 2019 13:43:41.686​

​ ​

​originate timestamp: e01a7c06.748e0c65 Fri, Feb 22 2019 13:47:50.455​

​ ​

​transmit timestamp: e01a7c06.7478b000 Fri, Feb 22 2019 13:47:50.454​

​ ​

​filter delay: 0.02635 0.02615 0.00000 0.00000​

​ ​

​ 0.00000 0.00000 0.00000 0.00000​

​ ​

​filter offset: 0.000043 -0.00002 0.000000 0.000000​

​ ​

​ 0.000000 0.000000 0.000000 0.000000​

​ ​

​delay 0.02615, dispersion 0.00003​

​ ​

​offset -0.000022​

​ ​

​ ​

​ ​

​22 Feb 13:47:50 ntpdate[112232]: ​​no server suitable for synchronization found​
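
The two outcomes shown above differ only in the final ntpdate line, so the verdict can be read mechanically. The following is a minimal sketch under that assumption; `ntpdate_verdict` is a hypothetical name, and the parser relies only on the two message texts that appear in the examples.

```shell
# Hypothetical helper: read `ntpdate -p 2 -d <server>` output on stdin.
# Prints "synced <offset>" when an "adjust/step time server ... offset" line
# is present, "unsuitable" when ntpdate reports no suitable server,
# and "unknown" otherwise.
ntpdate_verdict() {
  awk '
    /no server suitable for synchronization found/ { verdict = "unsuitable" }
    /(adjust|step) time server.* offset/ {
      # The value follows the literal word "offset".
      for (i = 1; i < NF; i++) if ($i == "offset") verdict = "synced " $(i + 1)
    }
    END { print (verdict == "" ? "unknown" : verdict) }
  '
}

# Possible use on a node (not executed here):
#   sudo ntpdate -p 2 -d xxx.xxx.xxx.xxx 2>&1 | ntpdate_verdict
```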

​ ​

3. Add the FQDN to the NTP section of the getrackinfo -r output.

Command:
# sudo setrackinfo -a NTPServer < NTP FQDN >
Example:

​ ​

​admin@node1:~> sudo setrackinfo -a NTPServer xxx.xxx.xxx.xxx​

​ ​

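4. Check whether the customer uses network separation and static routes. NTP traffic is sent out of the management interface through policy-based routing, which may be the cause of the problem.

Command:
# getrackinfo -n;getrackinfo -t
Example: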

​ ​

​admin@node1:~>getrackinfo -n;getrackinfo -t​

​ ​

​Named networks​

​ ​

​==============​

​ ​

​Node ID Network Ip Address Netmask Gateway VLAN Interface​

​ ​

​Static route list​

​ ​

​=================​

​ ​

​Node ID Network Netmask Gateway Interface​

​ ​

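5. Confirm whether the NTP servers are listening in the customer's environment; a firewall often blocks the port.

Command:
# sudo ntpq -c as
Example: (below, one NTP server is unreachable and another may be blocked by an ACL)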

​ ​

​admin@node1:~> sudo ntpq -c as​

​ ​

​ind assid status conf reach auth condition last_event cnt​

​ ​

​===========================================================​

​ ​

​ ​​ 1 56633 8011 yes no none reject mobilize 1​

​ ​

6. Check whether there is date drift in NTP.

Command:
# viprexec "date +%s" 2>&1 | grep "^15"
Example:

​ ​

​admin@node1:~>viprexec "date +%s" 2>&1 | grep "^15"​

​ ​

​1554470147​

​ ​

​1554470111​

​ ​

​1554470096​

​ ​

​1554470142​

​ ​

​1554470144​

​ ​

​1554470109​

​ ​

​1554470124​

​ ​

​1554470140​

​ ​

​admin@ecsnode1:~>​
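
Comparing the epoch values by eye is error-prone, so the spread between the earliest and latest node clock can be computed instead. This is a minimal sketch; `epoch_spread` is a hypothetical name. It accepts any all-digit line rather than filtering on the `^15` prefix, because that prefix only matches epoch values before September 2020.

```shell
# Hypothetical helper: read one epoch-seconds value per line on stdin and
# print the spread (largest minus smallest) in seconds.
# Lines that are not purely digits, such as host banners, are ignored.
epoch_spread() {
  awk '/^[0-9]+$/ {
         v = $1 + 0
         if (n == 0 || v < min) min = v
         if (n == 0 || v > max) max = v
         n++
       }
       END { if (n > 0) print max - min }'
}

# Possible use on a node (not executed here):
#   viprexec "date +%s" 2>&1 | epoch_spread
```

For the eight values in the example above the result is 51 seconds (1554470147 minus 1554470096).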

​ ​

7. Check the ntpd service status, then restart the service. (Restart it even if the status shows it is up and running.)

Note: ntpd.service is a non-impacting service.
Command:
# viprexec systemctl status ntpd.service | grep Active:
Example:

​ ​

​admin@node1:~> viprexec systemctl status ntpd.service | grep Active:​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:06 UTC; 1 day 18h ago​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago​

​ ​

​ Active: active (running) since Wed 2019-08-07 20:13:27 UTC; 58min ago​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:06 UTC; 1 day 18h ago​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago​

​ ​

​ ​

​ ​

Command:
# viprexec systemctl restart ntpd.service
Example:

​ ​

​admin@node1:~> viprexec systemctl restart ntpd.service​

​ ​

​Output from host : 192.168.219.8​

​ ​

​Output from host : 192.168.219.7​

​ ​

​Output from host : 192.168.219.6​

​ ​

​Output from host : 192.168.219.4​

​ ​

​Output from host : 192.168.219.3​

​ ​

​Output from host : 192.168.219.2​

​ ​

​Output from host : 192.168.219.5​

​ ​

​Output from host : 192.168.219.1​

​ ​

8. Verify the md5sum of the ntp.conf file on all nodes.

Command:
# viprexec "sudo md5sum /etc/ntp.conf"
Example:

​ ​

​admin@node1:~> viprexec "sudo md5sum /etc/ntp.conf"​

​ ​

​ ​

​ ​

​Output from host : 192.168.219.2​

​ ​

​741f0abb12ac82a21f150004bd407334 /etc/ntp.conf​

​ ​

​ ​

​ ​

​Output from host : 192.168.219.5​

​ ​

​741f0abb12ac82a21f150004bd407334 /etc/ntp.conf​

​ ​

​ ​

​ ​

​Output from host : 192.168.219.4​

​ ​

​741f0abb12ac82a21f150004bd407334 /etc/ntp.conf​

​ ​

​ ​

​ ​

​Output from host : 192.168.219.1​

​ ​

​7da6eb8009abc18ed1875f1f15ade72a /etc/ntp.conf​

​ ​

​ ​

​ ​

​Output from host : 192.168.219.3​

​ ​

​741f0abb12ac82a21f150004bd407334 /etc/ntp.conf​

​ ​

​ ​

​ ​

​Output from host : 192.168.219.8​

​ ​

​741f0abb12ac82a21f150004bd407334 /etc/ntp.conf​

​ ​

​ ​

​ ​

​Output from host : 192.168.219.6​

​ ​

​741f0abb12ac82a21f150004bd407334 /etc/ntp.conf​

​ ​

​ ​

​ ​

​Output from host : 192.168.219.7​

​ ​

​741f0abb12ac82a21f150004bd407334 /etc/ntp.conf​

​ ​


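Note: A mismatch can occur when both public and management interfaces exist and the nodes were all configured as non-public according to the last configuration supplied. On earlier ECS versions, if one node works while the remaining nodes appear to be behind a firewall, PBR may be stuck.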
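
With many nodes, the odd checksum is easier to find mechanically. The following is a minimal sketch that assumes the `Output from host : <ip>` banner format shown in the example; `md5_outliers` is a hypothetical name.

```shell
# Hypothetical helper: read `viprexec "sudo md5sum /etc/ntp.conf"` output on
# stdin and print every host whose checksum differs from the most common one.
md5_outliers() {
  awk '
    /^Output from host/ { host = $NF; next }
    $2 == "/etc/ntp.conf" { sum[host] = $1; count[$1]++ }
    END {
      # Pick the checksum reported by the largest number of hosts.
      for (s in count) if (count[s] > best) { best = count[s]; common = s }
      for (h in sum) if (sum[h] != common) print h, sum[h]
    }
  '
}

# Possible use on a node (not executed here):
#   viprexec "sudo md5sum /etc/ntp.conf" | md5_outliers
```

Applied to the example output above, this would report 192.168.219.1, the only node whose ntp.conf checksum differs.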

​ ​

9. Add 123 to ns_mgmt in the getrackinfo -r output, then check whether NTP has started transmitting and receiving.

Command:
# sudo setrackinfo -a ns_mgmt 123
Example:

​ ​

​admin@node1:~>sudo setrackinfo -a ns_mgmt 123​

​ ​


If the error persists, move port 123 back to the public interface and check synchronization again.

Command:
# sudo setrackinfo -d ns_mgmt 123
Example:

​ ​

​admin@node1:~> sudo setrackinfo -d ns_mgmt 123​

​ ​


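After each of the steps above, check the NTP synchronization status.

Resolution:
This indicates that the configured server is not an NTP server or is not working as expected. Engage the customer's network team to resolve the NTP issue.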

​ ​
​ ​

​ ​

​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​
​ ​

3. NTP_ERROR_OFFSET_ERROR

This means that the offset between the NTP server and the ECS nodes is above the ERROR threshold (10 seconds).

Verification:

1. Get the list of NTP servers on each listed node:

Command:
# getrackinfo -r | grep NTP
Example:

​ ​

​admin@ecsnode1:~> getrackinfo -r | grep NTP​

​ ​

​ NTPServer = xxx.xxx.xxx.xxx​

​ ​
​ ​

Resolution:

For each NTPServer listed in the verification step, test whether it can synchronize time.

Command:
# sudo ntpdate -p 2 -d 
Example:

​ ​

​admin@node1:~> sudo ntpdate -p 2 -d xxx.xxx.xxx.xxx​

​ ​

​22 Feb 13:47:48 ntpdate[110901]: ntpdate 4.2.8p11@1.3728-o Thu Jun 14 09:26:52 UTC 2018 (1)​

​ ​

​Looking for host and service ntp ​

​ ​

​ reversed to ​

​ ​

​host found : ​

​ ​

​transmit( ) ​

​ ​

​receive( ) ​

​ ​

​transmit( ) ​

​ ​

​receive( ) ​

​ ​

​server , port 123 ​

​ ​

​stratum 2, precision -24, leap 00, trust 000​

​ ​

​refid [ ], delay 0.02615, dispersion 0.00003 ​

​ ​

​transmitted 2, in filter 2​

​ ​

​reference time: e01a7b0d.af9e6616 Fri, Feb 22 2019 13:43:41.686​

​ ​

​originate timestamp: e01a7c06.748e0c65 Fri, Feb 22 2019 13:47:50.455​

​ ​

​transmit timestamp: e01a7c06.7478b000 Fri, Feb 22 2019 13:47:50.454​

​ ​

​filter delay: 0.02635 0.02615 0.00000 0.00000​

​ ​

​ 0.00000 0.00000 0.00000 0.00000​

​ ​

​filter offset: 0.000043 -0.00002 0.000000 0.000000​

​ ​

​ 0.000000 0.000000 0.000000 0.000000​

​ ​

​delay 0.02615, dispersion 0.00003​

​ ​

​offset -0.000022​

​ ​

​ ​

​ ​

​22 Feb 13:47:50 ntpdate[110901]: adjust time server offset -0.000022 sec ​

​ ​

​If the offset is greater than 10 seconds there is a problem.​

​ ​

​22 Feb 13:47:50 ntpdate[110901]: adjust time server offset -23.000242 sec ​
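
The 10-second comparison can be scripted so that the sign of the offset is not overlooked. This is a minimal sketch; `ntpdate_offset` and `offset_exceeds` are hypothetical names, and the default limit of 10 is the ERROR threshold quoted in this article.

```shell
# Hypothetical helper: print the offset in seconds from the last
# "... time server ... offset <value> sec" line read on stdin.
ntpdate_offset() {
  awk '/time server.* offset/ {
         for (i = 1; i < NF; i++) if ($i == "offset") v = $(i + 1)
       }
       END { if (v != "") print v }'
}

# Hypothetical helper: exit status 0 when the absolute value of $1 exceeds
# the limit in $2 (default 10 seconds), non-zero otherwise.
offset_exceeds() {
  awk -v off="$1" -v lim="${2:-10}" \
    'BEGIN { if (off < 0) off = -off; exit !(off > lim) }'
}

# Possible use on a node (not executed here):
#   off="$(sudo ntpdate -p 2 -d xxx.xxx.xxx.xxx 2>&1 | ntpdate_offset)"
#   offset_exceeds "$off" && echo "offset $off s is above the ERROR threshold"
```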

​ ​
​ ​

​ ​

​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​
​ ​

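4. System time difference above ERROR Threshold

A time difference exists between nodes because of the NTP drift file, which the ntpd service updates hourly on each node.

This can happen if a network issue occurred earlier and a node created an incorrect drift file after rejoining the network, forcing a time difference between nodes.

When a node rejoins the network after an issue, it may temporarily create a drift file to match the NTP time on the NTP server.
This should be temporary, but if the NTPD service cannot remove the file, ECS Support may need to delete the drift file and restart the NTPD service to return it to a good state.

Verification:

Check whether all NTP servers respond to ping.

1. Confirm whether compliance is enabled.

Command:
# domulti 'cat /opt/emc/caspian/fabric/agent/conf/agent_customize.conf | grep compliance_enabled'
Example: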

​ ​

​admin@node1:~> domulti 'cat /opt/emc/caspian/fabric/agent/conf/agent_customize.conf | grep compliance_enabled'​

​ ​

​ ​

​ ​

​192.168.219.1​

​ ​

​========================================​

​ ​

​compliance_enabled = true​

​ ​

​ ​

​ ​

​192.168.219.2​

​ ​

​========================================​

​ ​

​compliance_enabled = true​

​ ​

​ ​

​ ​

​192.168.219.3​

​ ​

​========================================​

​ ​

​compliance_enabled = true​

​ ​

​ ​

​ ​

​192.168.219.4​

​ ​

​========================================​

​ ​

​compliance_enabled = true​

​ ​

​admin@ecs-n1:~>​

​ ​

2. Check whether the cluster is compliant.

Command:
# viprexec "/opt/emc/caspian/fabric/cli/bin/fcli lifecycle cluster.compliance"
Example:

​ ​

​admin@node1:~> viprexec "/opt/emc/caspian/fabric/cli/bin/fcli lifecycle cluster.compliance"​

​ ​

​ ​

​ ​

​Output from host : 192.168.219.4​

​ ​

​{ ​

​ ​

​ "compliance": "​​NON_COMPLIANT​​",​

​ ​

​ "status": "OK",​

​ ​

​ "etag": 22527​

​ ​

​}​

​ ​

​ ​

​ ​

​Output from host : 192.168.219.1​

​ ​

​{ ​

​ ​

​ "compliance": "​​NON_COMPLIANT​​",​

​ ​

​ "status": "OK",​

​ ​

​ "etag": 22527​

​ ​

​}​

​ ​

​ ​

​ ​

​Output from host : 192.168.219.3​

​ ​

​{ ​

​ ​

​ "compliance": "​​NON_COMPLIANT​​",​

​ ​

​ "status": "OK",​

​ ​

​ "etag": 22527​

​ ​

​}​

​ ​

​ ​

​ ​

​Output from host : 192.168.219.2​

​ ​

​{ ​

​ ​

​ "compliance": "​​NON_COMPLIANT​​",​

​ ​

​ "status": "OK",​

​ ​

​ "etag": 22527​

​ ​

​}​

​ ​

​admin@ecs-n1:~​

​ ​

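Because this feature is enabled by default, the expected output after a 3.3 upgrade is COMPLIANT. If you see NON_COMPLIANT, investigate the cause.

3. Run the compliance check script on each individual node to determine whether any node (even a single one) is non-compliant, which causes the cluster check to report a non-compliant status.

Run the compliance script on all nodes. Nodes that report "NTP peers out of sync" may be affected by the NTP drift file issue.

If a node prints "Checking compliance..." with no failure output, the check passed and no issue was found.

Command:
# domulti /opt/emc/caspian/fabric/agent/conf/compliance_check.sh
Example: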

​ ​

​admin@node1:~> domulti /opt/emc/caspian/fabric/agent/conf/compliance_check.sh​

​ ​

​ ​

​ ​

​192.168.219.1​

​ ​

​========================================​

​ ​

​Checking compliance...​

​ ​

​ ​​ NTP peers out of sync​

​ ​

​ ​

​ ​

​192.168.219.2​

​ ​

​========================================​

​ ​

​Checking compliance...​

​ ​

​ ​

​ ​

​ ​

​ ​

​192.168.219.3​

​ ​

​========================================​

​ ​

​Checking compliance...​

​ ​

​ ​​ NTP peers out of sync​

​ ​

​ ​

​ ​

​192.168.219.4​

​ ​

​========================================​

​ ​

​Checking compliance...​

​ ​

​ ​​ NTP peers out of sync​

​ ​

​admin@ecs-n1:~>​

​ ​


If the output "NTP peers out of sync" appears, continue with the "NTP peers out of sync" section below.
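
On a larger rack, the affected nodes can be pulled out of the domulti output rather than read off by eye. This is a minimal sketch that assumes the layout shown in the example (an IP address on its own line, followed by that node's messages); `out_of_sync_nodes` is a hypothetical name.

```shell
# Hypothetical helper: read `domulti .../compliance_check.sh` output on stdin
# and print the IP of each node that reports "NTP peers out of sync".
out_of_sync_nodes() {
  awk '
    # A line holding only a dotted-quad address starts a new node section.
    NF == 1 && $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { host = $1; next }
    /NTP peers out of sync/ { print host }
  '
}

# Possible use on a node (not executed here):
#   domulti /opt/emc/caspian/fabric/agent/conf/compliance_check.sh | out_of_sync_nodes
```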

​ ​
​ ​

Resolution:
NTP peers out of sync.

1. Check whether the NTP offset exceeds 10 (+/-), which triggers the compliance alert.

Command:
# viprexec -i "ntpq -nc peers"
Example: (Note: each node in this example has three NTP servers.)

​ ​

​admin@node1:~> viprexec -i "ntpq -nc peers"​

​ ​

​ ​

​ ​

​Output from host : 169.254.1.1 ​

​ ​

​remote refid st t when poll reach delay offset jitter​

​ ​

​==============================================================================​

​ ​

​*10.xxx.xxx.16 .GPSs. 1 u 31 64 377 0.103 ​​-367.66​​ 44.909​

​ ​

​+10.xxx.xxx.33 .GPSs. 1 u 32 64 377 0.097 ​​-368.68​​ 44.341​

​ ​

​+10.xxx.xxx.35 .GPSs. 1 u 16 64 377 0.107 ​​-338.96​​ 69.736​

​ ​

​ ​

​ ​

​Output from host : 169.254.1.2​

​ ​

​remote refid st t when poll reach delay offset jitter​

​ ​

​==============================================================================​

​ ​

​+10.xxx.xxx.16 .GPSs. 1 u 26 64 377 0.089 8.566 0.746​

​ ​

​*10.xxx.xxx.33 .GPSs. 1 u 26 64 377 0.100 8.585 0.739​

​ ​

​+10.xxx.xxx.35 .GPSs. 1 u 23 64 377 0.104 8.888 0.592​

​ ​

​ ​

​ ​

​Output from host : 169.254.1.3​

​ ​

​remote refid st t when poll reach delay offset jitter​

​ ​

​==============================================================================​

​ ​

​*10.xxx.xxx.16 .GPSs. 1 u 31 64 377 0.101 ​​-354.40​​ 52.444​

​ ​

​+10.xxx.xxx.33 .GPSs. 1 u 29 64 377 0.101 ​​-338.84​​ 63.750​

​ ​

​+10.xxx.xxx.35 .GPSs. 1 u 39 64 377 0.106 ​​-387.28​​ 44.286​

​ ​

​ ​

​ ​

​ ​

​ ​

​Output from host : 169.254.1.4​

​ ​

​remote refid st t when poll reach delay offset jitter​

​ ​

​==============================================================================​

​ ​

​*10.xxx.xxx.16 .GPSs. 1 u 26 64 377 0.084 ​​72.675​​ 9.200​

​ ​

​+10.xxx.xxx.33 .GPSs. 1 u 37 64 377 0.107 ​​65.047​​ 14.913​

​ ​

​+10.xxx.xxx.35 .GPSs. 1 u 33 64 377 0.103 ​​87.374​​ 13.435​

​ ​

​ ​

​ ​

​Output from host : 169.254.1.5​

​ ​

​remote refid st t when poll reach delay offset jitter​

​ ​

​==============================================================================​

​ ​

​*10.xxx.xxx.16 .GPSs. 1 u 27 64 377 0.094 ​​352.741​​ 54.056​

​ ​

​+10.xxx.xxx.33 .GPSs. 1 u 26 64 377 0.103 ​​413.893​​ 43.770​

​ ​

​+10.xxx.xxx.35 .GPSs. 1 u 33 64 377 0.101 ​​334.493​​ 69.059​

​ ​

​ ​

​ ​

​Output from host : 169.254.1.6​

​ ​

​remote refid st t when poll reach delay offset jitter​

​ ​

​==============================================================================​

​ ​

​+10.xxx.xxx.16 .GPSs. 1 u 27 64 377 0.101 ​​-428.51​​ 54.955​

​ ​

​+10.xxx.xxx.33 .GPSs. 1 u 26 64 377 0.097 ​​-326.21​​ 91.208​

​ ​

​*10.xxx.xxx.35 .GPSs. 1 u 32 64 377 0.098 ​​-349.00​​ 70.110​

​ ​

​ ​

​ ​


If the ntpd service is restarted, the offset reported by viprexec -i "ntpq -nc peers" stays below 10 for a while and then climbs back above 100.
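
The peer tables can be filtered so that only offending rows remain. The following is a minimal sketch; `peers_over_limit` is a hypothetical name, and the default limit of 10 is the value quoted in step 1. ntpq reports the offset column in milliseconds. The parser assumes the standard ten-column peers layout and the `Output from host : <ip>` banner shown above.

```shell
# Hypothetical helper: read `viprexec -i "ntpq -nc peers"` output on stdin and
# print "host peer offset" for every peer whose absolute offset exceeds the
# limit given as $1 (default 10).
peers_over_limit() {
  awk -v lim="${1:-10}" '
    /^Output from host/ { host = $NF; next }
    # Peer rows have ten columns with a numeric offset in column 9;
    # the header row is skipped because its ninth column is the word "offset".
    NF == 10 && $9 ~ /^-?[0-9.]+$/ {
      off = $9 + 0
      if (off < 0) off = -off
      if (off > lim) print host, $1, $9
    }
  '
}

# Possible use on a node (not executed here):
#   viprexec -i "ntpq -nc peers" | peers_over_limit
```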

​ ​

2. This may be because the node's ntp.drift file reapplies an incorrect offset after the ntpd service restarts.

Command:
# viprexec cat /var/lib/ntp/drift/ntp.drift
Example:

​ ​

​admin@node1:~> viprexec cat /var/lib/ntp/drift/ntp.drift​

​ ​

​ ​

​ ​

​Output from host : 169.254.1.1​

​ ​

​500.000​

​ ​

​ ​

​ ​

​Output from host : 169.254.1.2​

​ ​

​-14.212​

​ ​

​ ​

​ ​

​Output from host : 169.254.1.3​

​ ​

​500.000​

​ ​

​ ​

​ ​

​Output from host : 169.254.1.4​

​ ​

​-102.474​

​ ​

​ ​

​ ​

​Output from host : 169.254.1.5​

​ ​

​-500.000​

​ ​

​ ​

​ ​

​Output from host : 169.254.1.6​

​ ​

​500.000​

​ ​


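An NTP drift file with an offset of this size may be generated automatically because of a temporary network issue.
When the node re-establishes its connection to the NTP service, it finds that it has drifted from the correct time and generates the file to correct itself.
After a while, the drift file is no longer needed and can be deleted.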
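
Nodes with a suspicious drift file can be listed automatically. This is a minimal sketch; `drift_saturated` is a hypothetical name. It flags values at or beyond +/-500, on the assumption that such a value means the frequency correction has reached the 500 PPM ceiling that ntpd applies, as seen on several nodes in the example above.

```shell
# Hypothetical helper: read `viprexec cat /var/lib/ntp/drift/ntp.drift` output
# on stdin and print "host value" for each node whose drift value has an
# absolute magnitude at or above the limit given as $1 (default 500).
drift_saturated() {
  awk -v lim="${1:-500}" '
    /^Output from host/ { host = $NF; next }
    NF == 1 && $1 ~ /^-?[0-9]+(\.[0-9]+)?$/ {
      v = $1 + 0
      if (v < 0) v = -v
      if (v >= lim) print host, $1
    }
  '
}

# Possible use on a node (not executed here):
#   viprexec cat /var/lib/ntp/drift/ntp.drift | drift_saturated
```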


Therefore, do the following:

1. Stop the ntpd service.

2. Delete the ntp.drift file.

3. Start the ntpd service again.

Note: ntpd.service is a non-impacting service.

Command:
# viprexec systemctl stop ntpd
# viprexec cat /var/lib/ntp/drift/ntp.drift
# viprexec 'rm -f /var/lib/ntp/drift/ntp.drift'
# viprexec 'ntpd -gq'
# viprexec systemctl start ntpd
# viprexec 'ntpq -p'

​ ​


Rerun the compliance check script: viprexec -i "/opt/emc/caspian/fabric/agent/conf/compliance_check.sh"

If the NTP drift file is zero, continue by checking whether there is date drift in NTP, then restart the ntpd service.

Command:
# viprexec "date +%s" 2>&1 | grep "^15"
Example:

​ ​

​admin@node1:~> viprexec "date +%s" 2>&1 | grep "^15"​

​ ​

​1554470147​

​ ​

​1554470111​

​ ​

​1554470096​

​ ​

​1554470142​

​ ​

​1554470144​

​ ​

​1554470109​

​ ​

​1554470124​

​ ​

​1554470140​

​ ​

​admin@ecsnode1:~>​

​ ​

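A difference between nodes indicates NTP drift, so the ntpd service must be restarted.
Check the ntpd service status, then restart the service. (Restart it even if the status shows it is up and running.)
Note: ntpd.service is a non-impacting service.

Command:
# viprexec systemctl status ntpd.service | grep Active:
Example: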

​ ​

​admin@node1:~> viprexec systemctl status ntpd.service | grep Active:​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:06 UTC; 1 day 18h ago​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago​

​ ​

​ Active: active (running) since Wed 2019-08-07 20:13:27 UTC; 58min ago​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:06 UTC; 1 day 18h ago​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago​

​ ​

​ Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago​

​ ​

​ ​

​ ​

Command:
# viprexec systemctl restart ntpd.service
Example:

​ ​

​admin@node1:~> viprexec systemctl restart ntpd.service​

​ ​

​Output from host : 192.168.219.8​

​ ​

​Output from host : 192.168.219.7​

​ ​

​Output from host : 192.168.219.6​

​ ​

​Output from host : 192.168.219.4​

​ ​

​Output from host : 192.168.219.3​

​ ​

​Output from host : 192.168.219.2​

​ ​

​Output from host : 192.168.219.5​

​ ​

​Output from host : 192.168.219.1​

​ ​

If no swarm has been opened, the NTP drift should now be resolved:
Command:
# viprexec "date +%s" 2>&1 | grep "^15"
Example:

​ ​

​admin@node1:~> viprexec "date +%s" 2>&1 | grep "^15"​

​ ​

​1585746672​

​ ​

​1585746672​

​ ​

​1585746672​

​ ​

​1585746672​

​ ​

​1585746672​

​ ​

​1585746672​

​ ​

​1585746672​

​ ​

​1585746672​

​ ​

​admin@ecsnode1:~>​

​ ​


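If the issue persists or does not match the cases above, send an inquiry email with the relevant information.

-------------------------------------------------------------------------------------

Also note that after the issue is resolved, compliance_check.sh may pass while the Service Console health check still fails and fcli lifecycle cluster.compliance still returns NON_COMPLIANT. In that case Service Console is still reporting non-compliance even though the issue has been fixed.

Rerun the Service Console health check and fcli lifecycle cluster.compliance later; it may take some time for the system to update the compliance record.

Related knowledge base article: "SymptomCode: 2048: Node time difference due to NTP drift file, causing: System time difference above ERROR Threshold"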

​ ​
​ ​
​ ​

​ ​

​ ​
​ ​

Notes:

If none of the above works, engage the customer's network team to resolve the NTP issue.

​ ​
​ ​

​ ​

​ ​
​ ​

EMC internal information:

High-value content

Component/Subcomponent:

xDoctor

#IWork4Dell
