rbacalzo

Best practice for retries based on Atmos server error code?


Is it a valid assumption that all Atmos error codes specified in the Atmos Programmer's Guide that have the phrase "Please try again" are retryable errors that have a chance of succeeding if the request is re-submitted a short time later?

If so, what is a good value for "a short time later"?  I've been using a wait of 3 seconds in my application before resubmitting the same request with a new HTTP header.  Then, depending on the operation (e.g. create/retrieve/delete), I allow between 1 and 5 retries before failing the operation at the application level.  Do these seem like reasonable defaults?

During my stress testing, the most common Atmos error code I get during creates (besides Socket timeout errors) is:

1025 The server encountered an internal error. Please try again.

However, I also get this type of error, which sometimes succeeds when re-submitted:

1034 Unable to read the contents of the HTTP body.

Note that this latter case doesn't have the "Please try again" message, so I don't know whether I should be relying on this phrase to distinguish which errors I should automatically retry.  Or maybe it's just that I really shouldn't be retrying this particular error code after all ...


Accepted Solutions

Re: Best practice for retries based on Atmos server error code?


Generally, we recommend retrying all HTTP 5xx errors and IO exceptions up to 3 times.  These errors are usually temporary and attributable to system load or network congestion.  However, if you see them frequently, or at times when the system is not under heavy load, check the logs on the nodes or open an SR to assess the overall health of your cloud (include as much detail about the errors as possible).

To address the delay window between retries, you could implement a backoff algorithm of your choice or keep using a static delay as you have been.  Either way, I would make the delay configurable so you can tune it.  That is what our Java SDK does: it uses a configurable static delay that defaults to zero (no wait).  The one exception is error 1040 (server busy), for which it adds an extra 300 ms, since with the default there would otherwise be no wait at all.

Java SDK RetryFilter
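For anyone not using the Java SDK, the policy described above can be sketched roughly as follows. This is only an illustration in Python, not the SDK's actual code: `AtmosError`, `is_retryable`, `call_with_retries` and all parameter names are made up for the example, and it assumes your client surfaces both the HTTP status and the Atmos error code on failure.

```python
import time


class AtmosError(Exception):
    """Hypothetical exception carrying the HTTP status and Atmos error code."""

    def __init__(self, http_status, atmos_code, message=""):
        super().__init__(f"HTTP {http_status}, Atmos code {atmos_code}: {message}")
        self.http_status = http_status
        self.atmos_code = atmos_code


SERVER_BUSY = 1040  # the "server busy" code mentioned above


def is_retryable(exc):
    """Retry HTTP 5xx responses and I/O failures; fail fast on anything else."""
    if isinstance(exc, AtmosError):
        return 500 <= exc.http_status <= 599
    # OSError covers socket timeouts and connection errors in Python 3.
    return isinstance(exc, OSError)


def call_with_retries(operation, max_retries=3, delay_s=0.0,
                      busy_extra_s=0.3, sleep=time.sleep):
    """Run operation(), retrying retryable failures up to max_retries times.

    delay_s is the configurable static delay (assumed default: no wait);
    busy_extra_s is added on top of it only for the server-busy code.
    """
    retries = 0
    while True:
        try:
            return operation()
        except Exception as exc:
            if retries >= max_retries or not is_retryable(exc):
                raise  # out of retries, or not a transient error
            retries += 1
            wait = delay_s
            if isinstance(exc, AtmosError) and exc.atmos_code == SERVER_BUSY:
                wait += busy_extra_s
            if wait > 0:
                sleep(wait)
```

With the values from the question, a create could be wrapped as `call_with_retries(do_create, max_retries=5, delay_s=3.0)`, where `do_create` stands in for whatever function issues the request. Note that the decision here is driven by the HTTP status rather than by the "Please try again" wording, so an error that comes back with a non-5xx status is not retried unless you extend `is_retryable`.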
