Article Number: 535482


Data Domain: cloud unit (or the filesystem for DDVE ATOS) goes down due to time gap exceeding 15 minutes between the DD / DDVE and cloud provider

Summary: This KB describes the reason why, and the fix for, a situation in which a cloud unit (or the filesystem in a DDVE ATOS) may go down if the time difference between the DD and the cloud provider exceeds 15 minutes.

Primary Product: Data Domain

Products: Data Domain

Last Published: 01 April 2020

Article Type: Break Fix

Publication Status: Online

Version: 4


Article Content

Issue


Both access to a cloud unit and communication with the underlying (virtual) storage in DDVE ATOS (Active Tier on Object Store) use S3, an HTTP-based protocol originally created for storage access in Amazon Web Services (AWS) and since widely adopted across the industry as a de-facto standard.

For authenticated S3 requests in particular (all storage communication with a cloud provider in DDOS is authenticated), the protocol requires by design that the date and time on the client (the DD or DDVE ATOS in this case) match, or be as close as possible to, the time at the cloud storage provider. If the time difference is too large, authenticated requests sent by the DD / DDVE ATOS to the cloud provider fail (the authentication token is considered expired), the cloud provider returns "403 Forbidden" errors, and storage becomes unavailable.

On a Data Domain with a cloud unit configured, this results in the cloud unit becoming unavailable and showing as disconnected:
 
# cloud unit list
Name         Profile              Status
----------   ------------------   ------
cloudunit1  cloudunit1_profile    Disconnected
----------   ------------------   ------

For a DDVE ATOS, in which the active tier storage lives in the remote cloud provider, this results in active tier storage becoming unavailable and the filesystem being disabled:
# alerts show current
Id      Post Time                  Severity   Class     Object        Message
-----   ------------------------   --------   -------   -----------   --------------------------------------------------
m0-19   Fri Mar 27 07:05:04 2020   CRITICAL   Storage   Tier=Active   EVT-STORAGE-00020: The Active tier is unavailable.
-----   ------------------------   --------   -------   -----------   --------------------------------------------------


In either case, the messages in the filesystem logs (ddfs.info) are similar. For a cloud unit, errors look similar to the following:
 
07/04 09:34:38.408 (tid 0x1234567bc2fe0): ERROR: CAL cl_curl_write_cb:1665 - Request failed with httpcode:403
07/04 09:34:38.408 (tid 0x1234567bc2fe0): ERROR: CAL cl_request_convert_httpcode_to_err:1539 - HTTP operation returned code:403, request error:The difference between the request time and the server's time is too large. [5075], uri:https://ecs001.abc/1234567890ebe872-1234567123456-d0//m2/cp_nameval_1, date:Thu, 04 Jul 2019 01:34:38 GMT, bytes_sent:540672, bytes:rcvd:0
07/04 09:34:38.408 (tid 0x123456731e910): INFO: CAL cal_cloudunit_set_unavail:1339 - Marking cloud unit:cloudunit1 as UNAVAILABLE
07/04 09:34:38.408 (tid 0x123456731e910): INFO: CAL cal_event_post:3006 - Enqueueing CAL event: 32 for cloud unit uuid: 1234567890ebe872-1234567123456
07/04 09:34:38.408 (tid 0x123456742f0): INFO: cp1: receiving DDR_EVENT_CAL_UNIT_UNAVAIL event
07/04 09:34:39.458 (tid 0x123456731e910): INFO: Event posted: m0-55 (21000037:553648183): EVT-CLOUD-00001: Unable to access provider for cloud unit cloudunit1.EVT-OBJ::CloudUnit=cloudunit1 EVT-INFO::Cause=The difference between the request time and the server's time is too large.
07/04 09:34:39.458 (tid 0x12345980): cp_nameval_mirror_write: failed to write to cpm_mirror_file for CP 123456789315a645:1234567896895e5653 for copy 1 errstr (Missing storage device)

Whereas for a DDVE ATOS, they look similar to the following:
03/27 05:22:43.024 (tid 0x7f8a6eba98e0): ERROR: CAL cl_curl_write_cb:1665 - Request failed with httpcode:403
03/27 05:22:43.024 (tid 0x7f8a6eba98e0): ERROR: CAL cl_request_convert_httpcode_to_err:1539 - HTTP operation returned code:403, request error:The difference between the request time and the current time is too large. [5075], uri:https://endpoint.amazonaws.com/bucket-name-s3//d1/d94078b7/000000000bf4f0a8/0000000000000000, date:Fri, 27 Mar 2020 04:22:43 GMT, bytes_sent:0, bytes:rcvd:0
03/27 05:22:43.024 (tid 0x7f8a72c29f50): ERROR: CAL cal_libcloud_iodone_cb:2227 - CAL I/O operation:CAL_READ_OP for object:/d1/d94078b7/000000000bf4f0a8/0000000000000000 returned error:The difference between the request time and the current time is too large.
03/27 05:22:43.028 (tid 0x7f8607aaccb0): ERROR: MSG-SL-00006: Cloud object-store is unavailable. err:The difference between the request time and the current time is too large..
03/27 05:22:43.028 (tid 0x7f8607aaccb0): Exiting...
Cause
The cloud unit becomes disconnected, or the filesystem in a DDVE ATOS shuts down, because the time gap between the cloud provider and the Data Domain exceeds 15 minutes.

The S3 protocol requires a valid timestamp (via either the HTTP Date header or the x-amz-date alternative) on authenticated requests, and the client timestamp included with an authenticated request must be within 15 minutes of the S3 system time when the request is received.
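The skew check described above can be sketched as follows. This is an illustrative model of the rule only, not DD OS or AWS code; the function name is hypothetical:

```python
from datetime import datetime, timedelta, timezone

# S3 rejects authenticated requests whose timestamp is more than
# 15 minutes away from the server's clock (RequestTimeTooSkewed / 403).
MAX_SKEW = timedelta(minutes=15)

def request_within_skew(client_time: datetime, server_time: datetime) -> bool:
    """Return True if the client's request timestamp falls within the
    15-minute window that S3 allows around the server's clock."""
    return abs(server_time - client_time) <= MAX_SKEW

# A clock 10 minutes behind is accepted; 20 minutes behind is rejected with 403.
server = datetime(2020, 3, 27, 5, 22, 43, tzinfo=timezone.utc)
print(request_within_skew(server - timedelta(minutes=10), server))  # True
print(request_within_skew(server - timedelta(minutes=20), server))  # False
```

Note that the check is symmetric: a client clock running 20 minutes ahead of the provider fails in exactly the same way as one running 20 minutes behind.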

This applies to all Data Domain supported cloud providers (AWS, ECS, IBM, Google Cloud, Virtustream, etc.) as well as to any DDVE ATOS configuration.
Resolution
Make sure the DD / DDVE and cloud provider clocks are in sync. The best way to keep them in sync over time is to have both the DD / DDVE and the cloud provider use NTP.

If the cloud provider is ECS, using NTP on ECS is a mandatory requirement. Refer to the following document for details:
https://www.dellemc.com/resources/en-us/asset/white-papers/products/storage-1/h14071-ecs-architectural-guide-wp.pdf

If the cloud provider is a public one (e.g. AWS, Azure), NTP is already in place on the provider side, so the time difference must come from the DD / DDVE side not being set correctly and kept in sync.

Before configuring the DD / DDVE to use NTP, make sure the system time is as close as possible to the real time (if the difference exceeds 1000 seconds, NTP will fail on startup). Start by displaying the current DD / DDVE time from the CLI:

# date

If the date / time is not within 1000 seconds of the real time, set the correct value from the CLI:
 
# system set date <MMDDhhmm[[CC]YY]>
#### Example for setting the DD / DDVE date and time to "April 01 2020 11:23 AM"
# system set date 040111232020
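The MMDDhhmm[[CC]YY] argument format can be illustrated with a short parsing sketch. This helper is hypothetical, for illustration only (it is not part of DD OS), and it assumes a two-digit year means 20xx and that an omitted year falls back to a supplied default:

```python
from datetime import datetime

def parse_set_date_arg(arg: str, default_year: int = 2020) -> datetime:
    """Parse the MMDDhhmm[[CC]YY] argument format used by `system set date`.
    Illustrative only -- not part of DD OS."""
    # First eight digits: two each for month, day, hour, minute.
    month, day, hour, minute = (int(arg[i:i + 2]) for i in range(0, 8, 2))
    rest = arg[8:]
    if len(rest) == 4:            # CCYY: full four-digit year
        year = int(rest)
    elif len(rest) == 2:          # YY: two-digit year, assume 20xx
        year = 2000 + int(rest)
    else:                         # year omitted
        year = default_year
    return datetime(year, month, day, hour, minute)

# The example from the article: "April 01 2020 11:23 AM"
print(parse_set_date_arg("040111232020"))  # 2020-04-01 11:23:00
```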

Next, if not already configured, point the DD / DDVE at one or more NTP time servers of your choice for time synchronization (repeat the command once per NTP server to add):
 
# ntp add timeserver NTP_SERVER_IP_OR_HOSTNAME

Finally, enable the NTP service from the DD / DDVE command line:

# ntp enable

Within a few minutes, the disconnected cloud unit should become "active" again. On a DDVE ATOS, the filesystem that went down can then be enabled again:
 
# filesys enable
Notes

If using a private ECS cloud, check the current time on all ECS nodes with the following commands:
 
#### Generate the host list:
# getclusterinfo -a /home/admin/MACHINES
#### Get time from all nodes:
# viprexec "date"
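Once each node has reported its clock, what matters is the largest gap between any two readings. A minimal sketch of that comparison, assuming the `viprexec "date"` output has already been parsed into datetimes (the function name and sample readings are hypothetical):

```python
from datetime import datetime

def max_drift_seconds(node_times):
    """Given the clock readings collected from each node, return the
    largest gap in seconds between the fastest and slowest clock."""
    times = sorted(node_times)
    return (times[-1] - times[0]).total_seconds()

# Hypothetical readings from three ECS nodes, one of them 20 minutes ahead:
readings = [
    datetime(2020, 3, 27, 5, 22, 43),
    datetime(2020, 3, 27, 5, 22, 45),
    datetime(2020, 3, 27, 5, 42, 43),
]
print(max_drift_seconds(readings))  # 1200.0 -> beyond the 15-minute S3 limit
```

A drift of 1200 seconds (20 minutes) between nodes means at least one node is outside the 15-minute window and needs its NTP configuration corrected.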
 


Article Properties

First Published

Tue July 09 2019 22:40:04 GMT
