ECS: xDoctor: RAP081: Symptom Code: 2048: System time difference above ERROR threshold
Summary: xDoctor detected a Network Time Protocol (NTP) daemon issue.
Symptoms
All nodes in an ECS rack should have the NTP daemon running, and the configured NTP servers should be able to synchronize time. If they are not, front-end data ingestion may be affected.
| Symptom | Message |
|---|---|
| System time difference above ERROR Threshold | System time difference above ERROR Threshold |
Cause
The above symptom is reported as a WARNING for the first 24 hours. If it persists beyond 24 hours, the severity is raised to ERROR and a RAP081 is reported.
Resolution
The node time difference is caused by the NTP drift file, which the ntpd service on each node updates every hour. The issue typically follows a network problem on a node: when the node rejoins the network, it may generate an incorrect drift file, which then enforces a time difference between the nodes. The drift file should only be needed temporarily, to bring the node back in line with the NTP server. If ntpd cannot discard the bad value on its own, the drift files can be deleted and the service restarted to restore correct timekeeping.
Verification:
Check that all configured NTP servers respond to ping.
1. Confirm whether compliance is enabled.
Command:
# domulti 'cat /opt/emc/caspian/fabric/agent/conf/agent_customize.conf | grep compliance_enabled'
admin@node1:~> domulti 'cat /opt/emc/caspian/fabric/agent/conf/agent_customize.conf | grep compliance_enabled'
192.168.219.1
========================================
compliance_enabled = true
192.168.219.2
========================================
compliance_enabled = true
192.168.219.3
========================================
compliance_enabled = true
192.168.219.4
========================================
compliance_enabled = true
2. Check whether the cluster is compliant.
# viprexec -i "/opt/emc/caspian/fabric/cli/bin/fcli lifecycle cluster.compliance"
admin@node1:~> viprexec -i "/opt/emc/caspian/fabric/cli/bin/fcli lifecycle cluster.compliance"
Output from host : 192.168.219.1
{
"compliance": "NON_COMPLIANT",
"status": "OK",
"etag": 22527
}
Output from host : 192.168.219.2
{
"compliance": "NON_COMPLIANT",
"status": "OK",
"etag": 22527
}
Output from host : 192.168.219.3
{
"compliance": "NON_COMPLIANT",
"status": "OK",
"etag": 22527
}
Output from host : 192.168.219.4
{
"compliance": "NON_COMPLIANT",
"status": "OK",
"etag": 22527
}
The expected output is COMPLIANT. If the output is NON_COMPLIANT, investigate the cause.
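When reviewing the output from many nodes, a quick tally helps. The helper below is a minimal sketch (standard grep, not part of the ECS tooling) that counts NON_COMPLIANT entries in captured output of the cluster.compliance command; the sample input is illustrative.

```shell
# Hypothetical helper: count NON_COMPLIANT entries in captured
# cluster.compliance output read from stdin.
count_non_compliant() {
    grep -c '"compliance": "NON_COMPLIANT"'
}

# Example with abbreviated sample output from two nodes:
count_non_compliant <<'EOF'
Output from host : 192.168.219.1
{ "compliance": "NON_COMPLIANT", "status": "OK" }
Output from host : 192.168.219.2
{ "compliance": "COMPLIANT", "status": "OK" }
EOF
# prints 1
```

A result of 0 means every node reported COMPLIANT.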
3. Run the compliance check script on each node to identify any non-compliant nodes, which can cause the cluster-level check to report non-compliance.
Run the compliance script on all nodes. Nodes that report "NTP peers out of sync" may have the NTP drift file issue. A node whose output is only "Checking compliance..." with no failure message has passed the check.
# domulti /opt/emc/caspian/fabric/agent/conf/compliance_check.sh
admin@node1:~> domulti /opt/emc/caspian/fabric/agent/conf/compliance_check.sh
192.168.219.1
========================================
Checking compliance...
NTP peers out of sync
192.168.219.2
========================================
Checking compliance...
192.168.219.3
========================================
Checking compliance...
NTP peers out of sync
192.168.219.4
========================================
Checking compliance...
NTP peers out of sync
If any node reports "NTP peers out of sync," continue with the Resolution steps below.
Resolution:
1. Check whether any NTP offset exceeds +/-10, which can trigger the compliance alert.
# viprexec -i "ntpq -nc peers"
admin@node1:~> viprexec -i "ntpq -nc peers"
Output from host : 169.254.1.1
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.xxx.xxx.16   .GPSs.           1 u   31   64  377    0.103  -367.66  44.909
+10.xxx.xxx.33   .GPSs.           1 u   32   64  377    0.097  -368.68  44.341
+10.xxx.xxx.35   .GPSs.           1 u   16   64  377    0.107  -338.96  69.736
Output from host : 169.254.1.2
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+10.xxx.xxx.16   .GPSs.           1 u   26   64  377    0.089    8.566   0.746
*10.xxx.xxx.33   .GPSs.           1 u   26   64  377    0.100    8.585   0.739
+10.xxx.xxx.35   .GPSs.           1 u   23   64  377    0.104    8.888   0.592
Output from host : 169.254.1.3
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.xxx.xxx.16   .GPSs.           1 u   31   64  377    0.101  -354.40  52.444
+10.xxx.xxx.33   .GPSs.           1 u   29   64  377    0.101  -338.84  63.750
+10.xxx.xxx.35   .GPSs.           1 u   39   64  377    0.106  -387.28  44.286
Output from host : 169.254.1.4
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.xxx.xxx.16   .GPSs.           1 u   26   64  377    0.084   72.675   9.200
+10.xxx.xxx.33   .GPSs.           1 u   37   64  377    0.107   65.047  14.913
+10.xxx.xxx.35   .GPSs.           1 u   33   64  377    0.103   87.374  13.435
Output from host : 169.254.1.5
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.xxx.xxx.16   .GPSs.           1 u   27   64  377    0.094  352.741  54.056
+10.xxx.xxx.33   .GPSs.           1 u   26   64  377    0.103  413.893  43.770
+10.xxx.xxx.35   .GPSs.           1 u   33   64  377    0.101  334.493  69.059
Output from host : 169.254.1.6
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+10.xxx.xxx.16   .GPSs.           1 u   27   64  377    0.101  -428.51  54.955
+10.xxx.xxx.33   .GPSs.           1 u   26   64  377    0.097  -326.21  91.208
*10.xxx.xxx.35   .GPSs.           1 u   32   64  377    0.098  -349.00  70.110
If the ntpd service is restarted, viprexec -i "ntpq -nc peers" shows offsets under 10 for a few moments, but they then climb back above 100.
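Scanning the peers output by eye across many nodes is error-prone. The sketch below (standard awk, not ECS tooling) prints any peer line whose absolute offset (the ninth column, in milliseconds) exceeds 10; the sample input is abbreviated and illustrative.

```shell
# Hedged sketch: flag peers in saved `ntpq -nc peers` output whose
# absolute offset (9th column, ms) exceeds 10. Peer lines begin with a
# tally code such as *, +, #, o, or -.
flag_large_offsets() {
    awk '$1 ~ /^[*+#o-]/ { off = $9; if (off < 0) off = -off; if (off > 10) print $1, $9 }'
}

flag_large_offsets <<'EOF'
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.0.0.16       .GPSs.          1 u   31   64  377    0.103  -367.66  44.909
+10.0.0.33       .GPSs.          1 u   26   64  377    0.089    8.566   0.746
EOF
# prints: *10.0.0.16 -367.66
```

Empty output would mean every peer offset is within the +/-10 threshold.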
2. This can happen when a node's ntp.drift file reapplies an incorrect frequency offset after the ntpd service restarts:
# viprexec -i "cat /var/lib/ntp/drift/ntp.drift"
admin@node1:~> viprexec -i "cat /var/lib/ntp/drift/ntp.drift"
Output from host : 169.254.1.1
500.000
Output from host : 169.254.1.2
-14.212
Output from host : 169.254.1.3
500.000
Output from host : 169.254.1.4
-102.474
Output from host : 169.254.1.5
-500.000
Output from host : 169.254.1.6
500.000
A drift file with an offset of this size can be generated automatically after a temporary network issue: when a node reestablishes its connection to the NTP service and finds itself off the correct time, it generates the file to correct itself. After a short while the drift file is no longer needed and can be removed. Therefore, carry out the following steps.
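Values of exactly +/-500, as seen above, sit at ntpd's frequency clamp (the drift file records the clock frequency correction in parts per million, bounded at +/-500 ppm), which is a common sign of a stale or corrupt drift file. A minimal sketch (standard awk, not ECS tooling) to flag such values read from stdin:

```shell
# Sketch: flag drift values pegged at ntpd's +/-500 ppm frequency clamp.
# Input is one drift value per line, as in the viprexec output above.
drift_pegged() {
    awk '/^[-0-9]/ { v = $1; if (v < 0) v = -v; if (v >= 500) print "pegged:", $1 }'
}

printf '%s\n' 500.000 -14.212 -500.000 | drift_pegged
# prints:
# pegged: 500.000
# pegged: -500.000
```

Any node flagged here is a candidate for the drift file removal procedure below.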
Note: ntpd.service is a non-impact service.
Commands:
# viprexec -i "systemctl stop ntpd"
# viprexec -i "cat /var/lib/ntp/drift/ntp.drift"
# viprexec -i "rm -f /var/lib/ntp/drift/ntp.drift"
# viprexec -i "ntpd -gq"
# viprexec -i "systemctl start ntpd"
# viprexec -i "ntpq -p"
Re-run the compliance check script:
# viprexec -i "/opt/emc/caspian/fabric/agent/conf/compliance_check.sh"
If the NTP drift files are already zero, check whether the node dates themselves have drifted, and restart the ntpd service.
# viprexec "date +%s" 2>&1 | grep "^15"
admin@node1:~> viprexec "date +%s" 2>&1 | grep "^15"
1554470147
1554470111
1554470096
1554470142
1554470144
1554470109
1554470124
1554470140
A difference between the node timestamps indicates NTP drift, and an ntpd service restart is required. Check the ntpd service status, then restart the service. (Even if the status is active and running, proceed with the restart.) Note: ntpd.service is a non-impacting service.
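The drift between nodes can be quantified from the timestamps above. The sketch below (standard sort/awk, not ECS tooling) computes the max-min spread in seconds; note that `date +%s` has one-second granularity, so a spread of a second or two can occur even on a healthy cluster.

```shell
# Sketch: compute the max-min spread (seconds) across node epoch
# timestamps read from stdin, one per line.
spread() {
    sort -n | awk 'NR == 1 { min = $1 } { max = $1 } END { print max - min }'
}

printf '%s\n' 1554470147 1554470111 1554470096 1554470142 | spread
# prints 51
```

A spread of tens of seconds, as in the captured output above, confirms that a restart is needed.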
# viprexec systemctl status ntpd.service | grep Active:
admin@node1:~> viprexec systemctl status ntpd.service | grep Active:
Active: active (running) since Tue 2019-08-06 02:49:06 UTC; 1 day 18h ago
Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
Active: active (running) since Wed 2019-08-07 20:13:27 UTC; 58min ago
Active: active (running) since Tue 2019-08-06 02:49:06 UTC; 1 day 18h ago
Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
# viprexec -i "systemctl restart ntpd.service"
admin@node1:~> viprexec systemctl restart ntpd.service
Output from host : 192.168.219.1
Output from host : 192.168.219.2
Output from host : 192.168.219.3
Output from host : 192.168.219.4
Output from host : 192.168.219.5
Output from host : 192.168.219.6
Output from host : 192.168.219.7
Output from host : 192.168.219.8
NTP drift should be resolved:
# viprexec -i "date +%s" 2>&1 | grep "^15"
admin@node1:~> viprexec -i "date +%s" 2>&1 | grep "^15"
1585746672
1585746672
1585746672
1585746672
1585746672
1585746672
1585746672
1585746672
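As a final check, the timestamps should all be identical (or within a second of each other, given the granularity of `date +%s`). A minimal sketch (standard tools, not ECS tooling) that verifies this from stdin:

```shell
# Sketch: report whether all node timestamps read from stdin are identical.
all_in_sync() {
    if [ "$(sort -u | wc -l)" -eq 1 ]; then echo "in sync"; else echo "drift remains"; fi
}

printf '%s\n' 1585746672 1585746672 1585746672 | all_in_sync
# prints: in sync
```

If drift remains after the drift file removal and restart, proceed to the escalation guidance below.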
If the issue persists or does not match the above, contact ECS technical support.
Additional Information
If the above resolution does not work, the customer's network team must be engaged to resolve the NTP issue.
For symptom 'NTP daemon not running' (NTPD_NOT_RUNNING), see knowledge article:
ECS: xDoctor: RAP081: Symptom Code: 2048: NTP daemon not running
For symptom 'All NTP servers are NOT suitable for synchronization' (NTP_NOT_SUITABLE_ERROR), see knowledge article:
ECS: xDoctor: RAP081: Symptom Code: 2048: All NTP servers are NOT suitable for synchronization.
For symptom 'All NTP servers adjust an offset higher than the error threshold' (NTP_ERROR_OFFSET_ERROR), see knowledge article:
ECS: xDoctor: RAP081: Symptom Code: 2048: All NTP servers adjust an offset higher than the error threshold.