ECS: Erasure Coding rebuild with node outage on a four node ECS cluster

Summary: Starting with ECS version 3.4, an Erasure Coding rebuild is not automatically initiated if there is a node outage on a four-node ECS cluster.

Symptoms

In ECS versions prior to 3.4, if only three healthy nodes remained, the ECS initiated restoring customer data to three mirrored copies, one per node. This design decision was made to maximize data protection for customer data. The process is also known as EC (Erasure Coding) retiring.

EC is an algorithm that reduces storage space while protecting data against disk or node failures.

When data is erasure coded, the physical space required on an ECS is approximately 1.33x the size of the customer data (12 data segments + 4 parity segments). However, if the ECS begins EC retiring, the physical space required for customer data increases from 1.33x to 3x.
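The 1.33x and 3x figures above follow from simple arithmetic, which the following minimal Python sketch works through. The function names (ec_overhead, physical_tb) and the 100 TB example are illustrative assumptions, not part of any ECS tool or API.

```python
from fractions import Fraction


def ec_overhead(data_segments: int, parity_segments: int) -> Fraction:
    """Physical-to-logical ratio for an erasure coding scheme.

    Hypothetical helper: total segments stored divided by data segments.
    """
    return Fraction(data_segments + parity_segments, data_segments)


# 12 data + 4 parity segments, as described in the article: 16/12 = 4/3.
EC_OVERHEAD = ec_overhead(12, 4)
# Three full mirrored copies of the data.
MIRROR_OVERHEAD = Fraction(3)


def physical_tb(user_tb, overhead: Fraction) -> float:
    """Physical space (TB) needed to store user_tb of customer data."""
    return float(user_tb * overhead)


if __name__ == "__main__":
    user_tb = 100  # illustrative amount of customer data
    print(f"{physical_tb(user_tb, EC_OVERHEAD):.2f}")      # 133.33
    print(f"{physical_tb(user_tb, MIRROR_OVERHEAD):.2f}")  # 300.00
```

In this example, 100 TB of customer data occupies about 133 TB when erasure coded and 300 TB when held as three mirrored copies.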

This may cause the ECS to reach the 90% capacity utilization threshold, at which point the ECS goes into read-only mode, resulting in data unavailability.

Cause

During EC retiring, the drive space consumed by customer data grows from the usual 1.33x footprint to 3x. Even on a moderately used ECS, there may not be enough free space to unpack the erasure coded customer data and create three mirrored copies of it. The process can fill the ECS to 90% capacity before the EC rebuild completes, which defeats the goal of maximizing data protection. The ECS may then go into read-only mode, which may result in data unavailability.
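As a rough illustration of why even moderate utilization is risky, the sketch below projects capacity utilization after a full EC retire. It is a simplified model under stated assumptions: all existing data is erasure coded at 1.33x, usable capacity is unchanged (it ignores capacity lost with the failed node, so it is optimistic), and the names projected_utilization and would_cross_threshold are hypothetical.

```python
EC_OVERHEAD = 16 / 12        # 12 data + 4 parity segments (about 1.33x)
MIRROR_OVERHEAD = 3.0        # three mirrored copies
READ_ONLY_THRESHOLD = 0.90   # 90% capacity utilization threshold


def projected_utilization(current_utilization: float) -> float:
    """Estimate utilization if all erasure coded data became 3x mirrored.

    Assumption: every byte currently stored is erasure coded, so the
    footprint grows by MIRROR_OVERHEAD / EC_OVERHEAD (about 2.25x).
    """
    return current_utilization * (MIRROR_OVERHEAD / EC_OVERHEAD)


def would_cross_threshold(current_utilization: float,
                          threshold: float = READ_ONLY_THRESHOLD) -> bool:
    """True if a full EC retire would reach the read-only threshold."""
    return projected_utilization(current_utilization) >= threshold


if __name__ == "__main__":
    for used in (0.30, 0.50):
        print(f"{used:.0%} used -> crosses threshold: "
              f"{would_cross_threshold(used)}")
```

Under these assumptions the footprint grows by a factor of about 2.25, so a cluster that is half full would need more space than it has, while one at 30% would land near 68%.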

Resolution

To enhance data protection and avoid exceeding the 90% capacity threshold, the default behavior for ECS clusters with only three healthy nodes was changed.

In 3.4, the design was changed so that ECS no longer implements EC retiring automatically when only three nodes are healthy and online. The system runs in a degraded state and may encounter performance issues, but it is likely to avoid data unavailability (DU). New writes continue to be written as three mirrored copies and are erasure coded once four or more nodes are online and available to write to.
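The behavior change can be summarized as a small decision rule. The sketch below only restates the article's description in Python; it is not ECS code, and the function names (parse_version, auto_ec_retire, can_erasure_code) are hypothetical. It models only the four-node case with exactly three healthy nodes, since the article does not describe other cases.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '3.4' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def auto_ec_retire(version: str, healthy_nodes: int) -> bool:
    """Whether EC retiring starts automatically, per the article.

    Before 3.4: yes, when only three healthy nodes remain.
    From 3.4 on: no, the cluster stays erasure coded and runs degraded.
    """
    return healthy_nodes == 3 and parse_version(version) < (3, 4)


def can_erasure_code(healthy_nodes: int) -> bool:
    """New writes stay as three mirrored copies until 4+ nodes are online."""
    return healthy_nodes >= 4


if __name__ == "__main__":
    print(auto_ec_retire("3.3", 3))  # True
    print(auto_ec_retire("3.4", 3))  # False
    print(can_erasure_code(3))       # False
```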

Any additional drive failures may cause isolated DUs and may slightly increase exposure to potential data loss (DL), although data loss remains unlikely.

Also consider expanding the ECS to five or more nodes. Doing so decreases exposure to performance degradation, DU, and DL during a node failure. For more details on ECS architecture, see the ECS Administration Guide.

Additional Information

EC is a data protection method that breaks data chunks into multiple fragments and distributes the fragments across nodes. Erasure coding (EC) reduces storage overhead and ensures data durability and resilience against disk and node failures. For more information about EC, see the ECS Administration Guide.

Affected Products

ECS Appliance

Products

ECS Appliance, ECS Appliance Gen 1, ECS Appliance Gen 2, ECS Appliance Gen 3, ECS Appliance Hardware Gen3 EX300, ECS Appliance Hardware Gen3 EX3000, ECS Appliance Hardware Gen1 U-Series, ECS Appliance Hardware Gen1 C-Series, ECS Appliance Hardware Gen2 D-Series, ECS Appliance Hardware Gen2 U-Series, ECS Appliance Hardware Gen3 EX500, ECS Appliance Hardware Series, ECS Appliance Software with Encryption, ECS Appliance Software without Encryption, Elastic Cloud Storage ...

Article Properties
Article Number: 000050615
Article Type: Solution
Last Modified: 26 Sept 2025
Version: 5