ECS: xDoctor: RAP081: Symptom Code: 2048: System time difference above ERROR threshold

Summary: xDoctor detected a Network Time Protocol (NTP) daemon issue.

Symptoms

All nodes in an ECS rack should have the NTP daemon running, and the configured NTP servers should be capable of synchronizing time. If not, this may lead to problems with frontend data ingestion.

Symptom: System time difference above ERROR Threshold

Message:
Message = System time difference above ERROR Threshold
Extra = [List of nodes]

Cause

The above symptom is reported as a WARNING while it has been present for less than 24 hours.
If it persists beyond 24 hours, the severity is raised to ERROR and RAP081 is reported.

Resolution

The node time difference is caused by the NTP drift file, which the ntpd service on each node updates every hour. The issue can occur after a node has had a network problem: when the node rejoins the network, ntpd may write an incorrect drift file, which enforces a time difference between the nodes.

When a node rejoins the network after an issue, it may temporarily create a drift file to match the time on the NTP server. This should be temporary, but if ntpd cannot remove the file, delete the drift files and restart the service to restore synchronization.

Verification:
Check whether all NTP servers respond to ping.
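
The reachability check can be scripted. The following is a minimal sketch, not a supported tool: the function name check_ntp_reachable and the PING_CMD override are hypothetical, the server addresses are supplied by the caller, and the default flags (-c 1 -W 2) assume the Linux iputils ping.

```shell
# Hypothetical helper: report which NTP servers answer a single ping.
# Usage: check_ntp_reachable <server> [<server> ...]
# PING_CMD can override the probe command (assumption: Linux iputils ping flags).
check_ntp_reachable() (
    ping_cmd="${PING_CMD:-ping -c 1 -W 2}"
    failed=0
    for server in "$@"; do
        # $ping_cmd is intentionally unquoted so that its flags split into words.
        if $ping_cmd "$server" >/dev/null 2>&1; then
            echo "OK $server"
        else
            echo "UNREACHABLE $server"
            failed=1
        fi
    done
    # Nonzero status if at least one server did not answer.
    exit "$failed"
)
```

Called with the configured NTP server addresses as arguments, it prints one line per server and returns a nonzero status if any server did not answer.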
1. Confirm whether compliance is enabled.

Command:

# domulti 'cat /opt/emc/caspian/fabric/agent/conf/agent_customize.conf | grep compliance_enabled'
Example:
admin@node1:~> domulti 'cat /opt/emc/caspian/fabric/agent/conf/agent_customize.conf | grep compliance_enabled'

192.168.219.1
========================================
compliance_enabled = true

192.168.219.2
========================================
compliance_enabled = true

192.168.219.3
========================================
compliance_enabled = true

192.168.219.4
========================================
compliance_enabled = true

2. Check the ECS to determine if the cluster is compliant or not. 

Command: 
# viprexec -i "/opt/emc/caspian/fabric/cli/bin/fcli lifecycle cluster.compliance"
Example:
admin@node1:~> viprexec -i "/opt/emc/caspian/fabric/cli/bin/fcli lifecycle cluster.compliance"

Output from host : 192.168.219.1
{
  "compliance": "NON_COMPLIANT",
  "status": "OK",
  "etag": 22527
}

Output from host : 192.168.219.2
{
  "compliance": "NON_COMPLIANT",
  "status": "OK",
  "etag": 22527
}

Output from host : 192.168.219.3
{
  "compliance": "NON_COMPLIANT",
  "status": "OK",
  "etag": 22527
}

Output from host : 192.168.219.4
{
  "compliance": "NON_COMPLIANT",
  "status": "OK",
  "etag": 22527
}

The expected output is COMPLIANT. If we see NON_COMPLIANT, then we must investigate why.
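
On larger clusters it can help to reduce this output to the list of hosts that are not compliant. The sketch below assumes the output layout shown in the example (an "Output from host :" line followed by a JSON block); the function name non_compliant_hosts is hypothetical.

```shell
# Hypothetical filter: read the viprexec/fcli output on stdin and print
# each host whose "compliance" value is anything other than "COMPLIANT".
non_compliant_hosts() {
    awk '
        # Remember the host that the following JSON block belongs to.
        /^Output from host/ { host = $NF; next }
        # "NON_COMPLIANT" does not contain the quoted string "COMPLIANT".
        /"compliance"/ && $0 !~ /"COMPLIANT"/ { print host }
    '
}
```

A possible invocation is: viprexec -i "/opt/emc/caspian/fabric/cli/bin/fcli lifecycle cluster.compliance" | non_compliant_hosts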

3. Run the compliance check script on each node to identify any non-compliant nodes, which may cause the ECS check to show non-compliance.

Run the compliance script on all nodes. Nodes that report "NTP peers out of sync" may have the NTP drift file issue. An output of "Checking compliance..." with no failure line means that the check passed on that node.

Command:
# domulti /opt/emc/caspian/fabric/agent/conf/compliance_check.sh
Example:
admin@node1:~> domulti /opt/emc/caspian/fabric/agent/conf/compliance_check.sh
 
192.168.219.1
========================================
Checking compliance...
    NTP peers out of sync
 
192.168.219.2
========================================
Checking compliance...
   
 
192.168.219.3
========================================
Checking compliance...
    NTP peers out of sync
 
192.168.219.4
========================================
Checking compliance...
    NTP peers out of sync
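
The affected nodes can be extracted from this output with a small filter. This is a sketch that assumes the domulti layout shown above (a line containing only the node IP address, then the check output); the function name out_of_sync_nodes is hypothetical.

```shell
# Hypothetical filter: read domulti compliance_check.sh output on stdin
# and print the IP address of every node that reports out-of-sync peers.
out_of_sync_nodes() {
    awk '
        # A line holding only an IPv4 address starts a new node section.
        /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+[ \t]*$/ { node = $1; next }
        /NTP peers out of sync/ { print node }
    '
}
```

A possible invocation is: domulti /opt/emc/caspian/fabric/agent/conf/compliance_check.sh | out_of_sync_nodes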

If any node reports "NTP peers out of sync," continue with the Resolution steps below.

Resolution:
1. Check whether any NTP offset exceeds +/-10 (ntpq reports offsets in milliseconds), which can cause the compliance alert.

Command:
# viprexec -i "ntpq -nc peers"
Example: (Note: This example has three NTP servers per node.)
admin@node1:~> viprexec -i "ntpq -nc peers"

Output from host : 169.254.1.1  
remote refid st t when poll reach delay offset jitter
==============================================================================
*10.xxx.xxx.16 .GPSs. 1 u 31 64 377 0.103 -367.66 44.909
+10.xxx.xxx.33 .GPSs. 1 u 32 64 377 0.097 -368.68 44.341
+10.xxx.xxx.35 .GPSs. 1 u 16 64 377 0.107 -338.96 69.736

Output from host : 169.254.1.2 
remote refid st t when poll reach delay offset jitter
==============================================================================
+10.xxx.xxx.16 .GPSs. 1 u 26 64 377 0.089 8.566 0.746
*10.xxx.xxx.33 .GPSs. 1 u 26 64 377 0.100 8.585 0.739
+10.xxx.xxx.35 .GPSs. 1 u 23 64 377 0.104 8.888 0.592

Output from host : 169.254.1.3 
remote refid st t when poll reach delay offset jitter
==============================================================================
*10.xxx.xxx.16 .GPSs. 1 u 31 64 377 0.101 -354.40 52.444
+10.xxx.xxx.33 .GPSs. 1 u 29 64 377 0.101 -338.84 63.750
+10.xxx.xxx.35 .GPSs. 1 u 39 64 377 0.106 -387.28 44.286


Output from host : 169.254.1.4 
remote refid st t when poll reach delay offset jitter
==============================================================================
*10.xxx.xxx.16 .GPSs. 1 u 26 64 377 0.084 72.675 9.200
+10.xxx.xxx.33 .GPSs. 1 u 37 64 377 0.107 65.047 14.913
+10.xxx.xxx.35 .GPSs. 1 u 33 64 377 0.103 87.374 13.435

Output from host : 169.254.1.5 
remote refid st t when poll reach delay offset jitter
==============================================================================
*10.xxx.xxx.16 .GPSs. 1 u 27 64 377 0.094 352.741 54.056
+10.xxx.xxx.33 .GPSs. 1 u 26 64 377 0.103 413.893 43.770
+10.xxx.xxx.35 .GPSs. 1 u 33 64 377 0.101 334.493 69.059

Output from host : 169.254.1.6 
remote refid st t when poll reach delay offset jitter
==============================================================================
+10.xxx.xxx.16 .GPSs. 1 u 27 64 377 0.101 -428.51 54.955
+10.xxx.xxx.33 .GPSs. 1 u 26 64 377 0.097 -326.21 91.208
*10.xxx.xxx.35 .GPSs. 1 u 32 64 377 0.098 -349.00 70.110
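
Scanning the offset column by eye is error-prone on many nodes. The sketch below reports each host whose largest absolute offset exceeds a limit (default 10, matching the threshold in this step). It assumes the 10-column peer layout shown above, with the offset in column 9; the function name hosts_over_offset is hypothetical.

```shell
# Hypothetical filter: read viprexec "ntpq -nc peers" output on stdin and
# print "<host> <largest absolute offset>" for hosts above the limit.
# Usage: ... | hosts_over_offset [limit]
hosts_over_offset() {
    awk -v limit="${1:-10}" '
        function report() {
            if (host != "" && worst > limit + 0) printf "%s %.3f\n", host, worst
        }
        /^Output from host/ { report(); host = $NF; worst = 0; next }
        # Peer lines have 10 fields with a numeric offset in field 9;
        # the header line is skipped because its field 9 is the word "offset".
        NF == 10 && $9 ~ /^[-+]?[0-9]+(\.[0-9]+)?$/ {
            a = $9 + 0
            if (a < 0) a = -a
            if (a > worst) worst = a
        }
        END { report() }
    '
}
```

A possible invocation is: viprexec -i "ntpq -nc peers" | hosts_over_offset 10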

After the ntpd service is restarted, viprexec -i "ntpq -nc peers" shows an offset under 10 for a few moments, and then the offset increases back to over 100.

2. Check the ntp.drift file on each node. A drift file that reapplies an incorrect offset after the ntpd service restarts may cause this issue:

Command:
# viprexec -i "cat /var/lib/ntp/drift/ntp.drift"
Example:
admin@node1:~> viprexec -i "cat /var/lib/ntp/drift/ntp.drift"

Output from host : 169.254.1.1 
500.000

Output from host : 169.254.1.2 
-14.212

Output from host : 169.254.1.3 
500.000

Output from host : 169.254.1.4 
-102.474

Output from host : 169.254.1.5 
-500.000

Output from host : 169.254.1.6 
500.000
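
The drift file holds the clock frequency correction in parts per million, and ntpd limits that correction to +/-500, so values of 500.000 or -500.000 are at the limit. The sketch below lists hosts at or beyond a limit. Treating only such values as suspect is an assumption of this example, not a rule stated by the article, and the function name saturated_drift_hosts is hypothetical.

```shell
# Hypothetical filter: read viprexec "cat .../ntp.drift" output on stdin and
# print "<host> <value>" where the absolute drift value reaches the limit.
# Usage: ... | saturated_drift_hosts [limit]   (default limit: 500, an assumption)
saturated_drift_hosts() {
    awk -v limit="${1:-500}" '
        /^Output from host/ { host = $NF; next }
        # A drift file is a single numeric value on its own line.
        NF == 1 && $1 ~ /^[-+]?[0-9]+(\.[0-9]+)?$/ {
            a = $1 + 0
            if (a < 0) a = -a
            if (a >= limit + 0) print host, $1
        }
    '
}
```

A possible invocation is: viprexec -i "cat /var/lib/ntp/drift/ntp.drift" | saturated_drift_hosts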

A drift file with values of this size may be generated automatically after a temporary network issue. When a node reestablishes its connection to the NTP service and finds that its clock is off, it generates the file to correct itself. After a few moments the drift file is no longer needed and may be removed. Therefore, carry out the following steps:

1. Stop the ntpd service.
2. Remove the ntp.drift file.
3. Start the ntpd service again.

Note: ntpd.service is a non-impact service.


Commands:
# viprexec -i "systemctl stop ntpd"
# viprexec -i "cat /var/lib/ntp/drift/ntp.drift"
# viprexec -i "rm -f /var/lib/ntp/drift/ntp.drift"
# viprexec -i "ntpd -gq"
# viprexec -i "systemctl start ntpd"
# viprexec -i "ntpq -p"

Re-run compliance check script: viprexec -i "/opt/emc/caspian/fabric/agent/conf/compliance_check.sh"
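
The command sequence above can be wrapped in a function so that it is always run in the same order. This is a sketch only: the function name reset_ntp_drift and its --execute flag are hypothetical, and by default it performs a dry run that prints the commands without running them.

```shell
# Hypothetical wrapper around the command sequence listed above.
# Default is a dry run that only prints the commands; pass --execute to run them.
reset_ntp_drift() (
    mode="${1:-}"
    for cmd in \
        "systemctl stop ntpd" \
        "cat /var/lib/ntp/drift/ntp.drift" \
        "rm -f /var/lib/ntp/drift/ntp.drift" \
        "ntpd -gq" \
        "systemctl start ntpd" \
        "ntpq -p"
    do
        if [ "$mode" = "--execute" ]; then
            # Keep going on failure so that ntpd is not left stopped.
            viprexec -i "$cmd" || echo "WARNING: failed: $cmd" >&2
        else
            printf 'viprexec -i "%s"\n' "$cmd"
        fi
    done
)
```

Running reset_ntp_drift with no arguments shows the six commands for review; reset_ntp_drift --execute runs them through viprexec.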

If the NTP drift files contain zero, check for date drift between the nodes, and restart the ntpd service if any is found.

Command:
# viprexec "date +%s" 2>&1 | grep -E "^[0-9]{10}"
Example:
admin@node1:~> viprexec "date +%s" 2>&1 | grep -E "^[0-9]{10}"
1554470147
1554470111
1554470096
1554470142
1554470144
1554470109
1554470124
1554470140
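
The size of the difference can be computed instead of compared by eye. The sketch below prints the gap in seconds between the earliest and latest timestamp; the function name epoch_spread is hypothetical. Because the nodes may not run the command at exactly the same instant, a gap of a second or two is not necessarily drift (this tolerance is an assumption, not a documented threshold).

```shell
# Hypothetical filter: read epoch-second timestamps on stdin (one per line,
# other lines are ignored) and print max - min in seconds.
epoch_spread() {
    awk '
        $1 ~ /^[0-9]+$/ {
            t = $1 + 0
            if (n == 0 || t < min) min = t
            if (n == 0 || t > max) max = t
            n++
        }
        END { if (n > 0) print max - min }
    '
}
```

For the eight timestamps in the example above, the result is 51 seconds. A possible invocation is: viprexec "date +%s" 2>&1 | epoch_spread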

A difference between the nodes indicates NTP drift, and an ntpd service restart is required. Check the ntpd service status and then restart the service. (Even if the status is up and running, proceed with the restart.) Note: ntpd.service is a non-impact service.

Command:
# viprexec systemctl status ntpd.service | grep Active:
Example:
admin@node1:~> viprexec systemctl status ntpd.service | grep Active:
   Active: active (running) since Tue 2019-08-06 02:49:06 UTC; 1 day 18h ago
   Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
   Active: active (running) since Wed 2019-08-07 20:13:27 UTC; 58min ago
   Active: active (running) since Tue 2019-08-06 02:49:06 UTC; 1 day 18h ago
   Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
   Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
   Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
   Active: active (running) since Tue 2019-08-06 02:49:07 UTC; 1 day 18h ago
Command:
# viprexec -i "systemctl restart ntpd.service"
Example:
admin@node1:~> viprexec systemctl restart ntpd.service
Output from host : 192.168.219.1
Output from host : 192.168.219.2
Output from host : 192.168.219.3
Output from host : 192.168.219.4
Output from host : 192.168.219.5
Output from host : 192.168.219.6
Output from host : 192.168.219.7
Output from host : 192.168.219.8

The NTP drift should now be resolved. Verify that all nodes report the same time:

Command:
# viprexec -i "date +%s" 2>&1 | grep -E "^[0-9]{10}"
Example:
admin@node1:~> viprexec -i "date +%s" 2>&1 | grep -E "^[0-9]{10}"
1585746672
1585746672
1585746672
1585746672
1585746672
1585746672
1585746672
1585746672

If the issue persists or does not match the symptoms above, contact ECS technical support.

Additional Information

If the above resolution does not work, the customer's network team must be engaged to resolve the NTP issue.

For symptom 'NTP daemon not running' (NTPD_NOT_RUNNING), see knowledge article:
ECS: xDoctor: RAP081: Symptom Code: 2048: NTP daemon not running

For symptom 'All NTP servers are NOT suitable for synchronization' (NTP_NOT_SUITABLE_ERROR), see knowledge article:
ECS: xDoctor: RAP081: Symptom Code: 2048: All NTP servers are NOT suitable for synchronization.

For symptom 'All NTP servers adjust an offset higher than the error threshold' (NTP_ERROR_OFFSET_ERROR), see knowledge article:
ECS: xDoctor: RAP081: Symptom Code: 2048: All NTP servers adjust an offset higher than the error threshold.

Affected Products

ECS

Products

ECS Appliance, ECS Appliance Gen 1, ECS Appliance Gen 2, ECS Appliance Gen 3, ECS Software

Article Properties

Article Number: 000230636
Article Type: Solution
Last Modified: 03 Oct 2024
Version: 2