XtremIO: DPG has two simultaneous SSD failures causing Performance Impact
Summary: This article explains how XtremIO handles two simultaneous solid-state drive (SSD) failures in a Data Protection Group (DPG) and how this can impact performance. It also offers recommendations to help reduce the performance impact while the DPG is rebuilding. ...
Symptoms
A performance impact is seen when a second SSD fails before the DPG completes a rebuild from a previous SSD failure.
Cause
On an XtremIO, if one SSD fails, the DPG rebuilds on the remaining healthy SSDs to maintain data protection. The DPG rebuild process is different for single SSD failures than it is for two simultaneous SSD failures.
When only one SSD fails:
- The system enters a single degraded state and has only one parity block left to recover the data.
- The DPG rebuild reconstructs the data from the failed SSD onto the remaining healthy SSDs to restore double-parity protection.
- While the DPG is rebuilding from a single SSD failure, the rebuild runs in the background and incoming I/O is still prioritized.
When two SSDs fail simultaneously in the same DPG:
- The system enters a double degraded state, meaning no parity protection remains.
- The risk of data loss increases dramatically so the system prioritizes recovery. This uses more CPUs in order to rebuild the DPG faster.
- XtremIO prioritizes rebuild operations over user I/O, consuming the CPU and memory resources to avoid data loss.
- This type of rebuild requires massive metadata updates. It must calculate parity for every affected stripe and perform data integrity checks to help ensure that there is no corruption.
Alerts that may be raised while the DPG is rebuilding:
| Alert Name | Symptom Code | Description |
|---|---|---|
| rebuild_0_to_20_done | XTR0800211 | DPG rebuild has started. |
| rebuild_20_to_40_done | XTR0800221 | DPG rebuild is in progress. More than 20 percent of the rebuild has been completed. |
| rebuild_40_to_60_done | XTR0800231 | DPG rebuild is in progress. More than 40 percent of the rebuild has been completed. |
| rebuild_60_to_80_done | XTR0800241 | DPG rebuild is in progress. More than 60 percent of the rebuild has been completed. |
| rebuild_99_done | XTR0800251 | DPG rebuild is in progress. More than 80 percent of the rebuild has been completed. |
| rg_state_integrate | XTR0800904 | The DPG is performing SSD Integration. |
| ssd_assigning_to_rg | XTR0900106 | SSD is being assigned to the DPG. |
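For monitoring scripts, the rebuild alerts can be reduced to a rough lower bound on rebuild progress. The following is a minimal sketch, not an XtremIO tool: the symptom codes and thresholds are transcribed from the alert table, while the names `REBUILD_ALERT_FLOOR` and `rebuild_progress_floor` and the example list of observed codes are hypothetical.

```python
# Symptom codes and "more than N percent" thresholds taken from the alert table.
# The dictionary and function names are illustrative assumptions.
REBUILD_ALERT_FLOOR = {
    "XTR0800211": 0,   # rebuild_0_to_20_done: rebuild has started
    "XTR0800221": 20,  # rebuild_20_to_40_done
    "XTR0800231": 40,  # rebuild_40_to_60_done
    "XTR0800241": 60,  # rebuild_60_to_80_done
    "XTR0800251": 80,  # rebuild_99_done
}


def rebuild_progress_floor(symptom_codes):
    """Return the highest progress threshold among the observed codes, or None."""
    floors = [REBUILD_ALERT_FLOOR[c] for c in symptom_codes if c in REBUILD_ALERT_FLOOR]
    return max(floors) if floors else None


# Hypothetical list of symptom codes observed during one rebuild.
observed = ["XTR0900106", "XTR0800211", "XTR0800221", "XTR0800231"]
print(rebuild_progress_floor(observed))  # 40
```

Codes that are not rebuild progress alerts, such as XTR0800904 and XTR0900106, are ignored by this lookup.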
The CLI command show-data-protection-groups can also be run to check the progress of a DPG rebuild:
    xmcli (tech)> show-data-protection-groups
    Name      Index  Cluster-Name  Index  State            Num-Of-SSDs  Useful-SSD-Space  User-Space  User-Space-In-Use  Rebuild-Progress  Preparation-Progress  Proactive-Metadata-Loading  Rebuild-Prevention  Brick-Name  Index
    X1-DPG-1  1      LAB-XIO001    1      normal           28           97.809T           85.690T     65.344T            0                 0                     False                       none                X1          1
    X2-DPG-1  3      LAB-XIO001    1      double_degraded  26           97.809T           83.690T     65.359T            0                 54                    False                       assigning_disk      X2          2
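If the command output is captured to a text file, a short script can flag any DPG that is not in the normal state. The sketch below is illustrative only and makes several assumptions: the prompt line has been removed, the columns are separated by whitespace, no value contains a space, and the layout matches the sample above (real output may differ between XMS versions). The function names `parse_dpg_table` and `degraded_dpgs` are hypothetical.

```python
def parse_dpg_table(text):
    """Parse whitespace-separated table output into a list of dicts."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # The header repeats "Index"; suffix repeats so every key is unique
    # (Index, Index_2, Index_3).
    counts = {}
    keys = []
    for name in header:
        counts[name] = counts.get(name, 0) + 1
        keys.append(name if counts[name] == 1 else f"{name}_{counts[name]}")
    # Skip rows whose field count does not match the header.
    return [dict(zip(keys, row)) for row in rows if len(row) == len(keys)]


def degraded_dpgs(records):
    """Return the records whose State column is anything other than 'normal'."""
    return [r for r in records if r["State"] != "normal"]


# Sample copied from the output above, without the prompt line.
SAMPLE = """\
Name Index Cluster-Name Index State Num-Of-SSDs Useful-SSD-Space User-Space User-Space-In-Use Rebuild-Progress Preparation-Progress Proactive-Metadata-Loading Rebuild-Prevention Brick-Name Index
X1-DPG-1 1 LAB-XIO001 1 normal 28 97.809T 85.690T 65.344T 0 0 False none X1 1
X2-DPG-1 3 LAB-XIO001 1 double_degraded 26 97.809T 83.690T 65.359T 0 54 False assigning_disk X2 2
"""

records = parse_dpg_table(SAMPLE)
for r in degraded_dpgs(records):
    print(r["Name"], r["State"], r["Preparation-Progress"])
```

With the sample above, only X2-DPG-1 is reported, along with its state and Preparation-Progress value.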
Resolution
When two SSDs within the same DPG fail simultaneously, performance may be impacted by the resources consumed to rebuild parity. There is no way to stop the DPG rebuild, nor is there a command to accelerate it. However, if the customer is experiencing significant performance degradation due to double degraded protection mode (rg_double_degrade), the following actions may help reduce the impact:
- Pause any backups, replication, or intense I/O jobs until the DPG rebuild completes.
- If possible, fail over the most active hosts to other storage until the DPG rebuild completes.
- If using VMs, power off or migrate the VMs, or put the hosts in maintenance mode.
- If possible, use QoS or host-side throttling to reduce load on the array during rebuild.
Once the DPG rebuild is complete, any related performance impact should resolve.