Data Domain: ECS Cloud Tier disconnected due to internal error or full

Summary: The Data Domain file system can report that a cloud tier profile was disconnected for many different reasons. When the cloud unit enters the UNAVAILABLE state, running data-movement, recall, or cloud cleaning processes are interrupted or terminated. One of these reasons is receiving an internal 500 server error from ECS. ...

Symptoms

The Data Domain file system can report that a cloud tier profile was disconnected for many different reasons. When this happens, the cloud unit enters the UNAVAILABLE state, and any running data-movement, recall, or cloud cleaning processes are interrupted or terminated.

One such reason is an internal error returned by ECS, reported as "HTTP operation returned code:500".

When that happens, an alert similar to the following is raised:
Time:           Sun Mar  1 00:08:44 2020
Alert Id:       m0-3761
Event Id:       EVT-CLOUD-00001
Event Message:  Unable to access provider for cloud unit XXXX-XXXX-XXX.
Object:         CloudUnit=XXXX-XXXX-XXXX
Additional Information: Cause=We encountered an internal error. Please try again.

Because of this error, a cloud tier cleaning process that was in progress may be aborted:
Cloud Cleaning Status
---------------------
Cloud tier cleaning started on cloud unit "XXXX-XXXX-XXXX" at 2020/02/12 15:56:11 and was aborted at 2020/02/14 03:12:56.
Cloud tier cleaning was aborted because cloud is unavailable
Background deletion completed.

The ddfs.info log file may show error messages similar to the following, which explain why the cloud tier unit was disconnected:
 
02/02 10:35:58.067 (tid 0xbf47f70): ERROR: CAL cl_request_convert_httpcode_to_err:1539 - HTTP operation returned code:500, request error:We encountered an internal error. Please try again. [5009]
....
02/02 10:35:58.067 (tid 0x7f6ab4714fd0): INFO: CAL cal_cloudunit_set_unavail:1339 - Marking cloud unit:XXXX-XXXX-XXXX-XXXX as UNAVAILABLE
....
02/02 10:35:58.067 (tid 0x7f77e22e2370): Fmig: fmig2_process_cal_event: XXXX-XXXX-XXXX-XXXX (path=cloud1/cp1): receiving DDR_EVENT_CAL_UNIT_UNAVAIL event
....
02/02 10:35:59.229 (tid 0x7f6ab4714fd0): INFO: Event posted: m0-1361 (21000551:553649489): EVT-CLOUD-00001: Unable to access provider for cloud unit XXXX-XXXX-XXXX-XXXX.EVT-OBJ::CloudUnit=XXXX-XXXX-XXXX-XXXX EVT-INFO::Cause=We encountered an internal error. Please try again.
...
02/02 10:35:58.067 (tid 0x7f6ab4714fd0): INFO: CAL cal_cloudunit_set_unavail:1339 - Marking cloud unit:XXXX-XXXX-XXXX as UNAVAILABLE
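
When a ddfs.info file is large, it can help to scan it for these two message types rather than read it by eye. The following is a minimal sketch, not a supported tool: the function name `summarize_ddfs_log` is hypothetical, and the patterns are modelled only on the sample lines shown above, so other releases may format these messages differently.

```python
import re

# Patterns are assumptions derived from the sample log lines in this article.
HTTP_ERR = re.compile(r"HTTP operation returned code:(\d{3})")
UNAVAIL = re.compile(r"Marking cloud unit:(\S+) as UNAVAILABLE")


def summarize_ddfs_log(lines):
    """Count HTTP return codes and collect cloud units marked UNAVAILABLE.

    `lines` is any iterable of log lines, for example an open file handle.
    """
    http_codes = {}
    unavailable_units = set()
    for line in lines:
        match = HTTP_ERR.search(line)
        if match:
            code = int(match.group(1))
            http_codes[code] = http_codes.get(code, 0) + 1
        match = UNAVAIL.search(line)
        if match:
            unavailable_units.add(match.group(1))
    return http_codes, unavailable_units
```

A non-zero count for code 500 together with a unit in the UNAVAILABLE set matches the pattern described in this article.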

A 500 Internal Server Error is a generic HTTP status code indicating that something went wrong on the server.
5xx status codes are returned when the server encounters an unexpected condition that prevents it from fulfilling a request from the client (in this case, the Data Domain system).
Because this response is generic, further investigation is needed to find the underlying reason.

In the latest Auto-Support, you can check the state of the cloud requests by reviewing the cloud error stats for your cloud unit's data bucket:

Cloud error stats for bucket:<name of the bucket>-d0
 	Number of Retries                     : 9180
 	...
 	Number of http 400 errors             : 0
 	Number of http 403 errors             : 0
     ...
 	Number of http 416 errors             : 0
 	Number of http 429 errors             : 0
 	Number of http 500 errors             : 10195 ----------->lots of 500 errors 
 	Number of http 503 errors             : 0
     ...
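
The counters above can also be extracted programmatically, for example to compare two Auto-Support reports. The sketch below assumes the stats section has already been copied out of the Auto-Support as text and that its lines follow the "Number of http NNN errors : count" layout shown above; the function names are hypothetical.

```python
import re

# Layout assumed from the sample output in this article.
STAT_LINE = re.compile(r"Number of http (\d{3}) errors\s*:\s*(\d+)")


def http_error_counts(stats_text):
    """Map each HTTP status code in the stats section to its error count."""
    counts = {}
    for line in stats_text.splitlines():
        match = STAT_LINE.search(line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts


def server_side_errors(counts):
    """Keep only 5xx codes that have a non-zero count."""
    return {code: n for code, n in counts.items() if 500 <= code <= 599 and n > 0}
```

For the sample output above, `server_side_errors(http_error_counts(text))` would leave only the 500 counter, which points at the server side (ECS) rather than at a client-side problem such as an authorization failure.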

Cause

Reason 1:
One of the main reasons the Data Domain system receives this error is that ECS is completely full and therefore refuses incoming requests from the Data Domain system. To resolve this, free some space on ECS to restore the Data Domain connection to the ECS cloud.

Reason 2:
ECS is busy and cannot fulfill all incoming requests.

There are also other reasons that may cause this error. 

Resolution

Solution (ECS is full):
If the ECS cloud is full, the Data Domain system keeps receiving disconnection errors until space is freed on ECS. Once free space is available, the Data Domain system can restore communication with the cloud tier.

Solution (ECS is busy):
If the ECS cloud is busy and you receive 500 internal server errors, avoid running garbage collection, data-movement, and recall processes simultaneously. Schedule them to run at different times so that ECS has the capacity to handle all incoming requests. Contact ECS support for help pinpointing any further issues.
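
Before changing any schedules, it can be useful to check on paper whether the planned windows for these processes overlap. The following is only an illustrative sketch: the function name and the job names are hypothetical, the windows are entered by hand, and nothing here queries or configures a Data Domain system.

```python
from datetime import datetime
from itertools import combinations


def overlapping_jobs(windows):
    """Return pairs of job names whose planned time windows overlap.

    `windows` maps a job name to a (start, end) pair of datetimes.
    """
    clashes = []
    for (name_a, (start_a, end_a)), (name_b, (start_b, end_b)) in combinations(
        sorted(windows.items()), 2
    ):
        # Two windows overlap when each one starts before the other ends.
        if start_a < end_b and start_b < end_a:
            clashes.append((name_a, name_b))
    return clashes


# Hypothetical planned windows, entered manually for illustration.
planned = {
    "cleaning": (datetime(2020, 3, 1, 0), datetime(2020, 3, 1, 6)),
    "data-movement": (datetime(2020, 3, 1, 4), datetime(2020, 3, 1, 8)),
    "recall": (datetime(2020, 3, 1, 8), datetime(2020, 3, 1, 9)),
}
print(overlapping_jobs(planned))
```

Any pair reported by the function is a candidate for rescheduling so that the two processes do not send requests to ECS at the same time.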

Additional Information

If the problem is not resolved, open a new support case with both ECS and Data Domain support so that they can triage the issue together.
Collect and upload a new support bundle when opening the Data Domain support case.

Affected Products

Data Domain

Products

Data Domain, Data Domain Deduplication Storage Systems
Article Properties
Article Number: 000081881
Article Type: Solution
Last Modified: 11 Dec 2023
Version:  4