XtremIO: SSD failure performance impact on XtremIO Array
Summary: SSD failure performance impact on XtremIO Array
Symptoms
Single or multiple SSD failures in an XtremIO Data Protection Group (DPG) may affect the performance of the XtremIO array. Understanding why requires a look at the DPG operations and DPG states:
Main DPG Operations:
DPG Rebuild:
- When: When an SSD fails
- Why: Restore double parity protection
DPG Integration:
- When: When a technician replaces a failed SSD with a brand-new drive
- Why: Adds a new SSD into the DPG
DPG States:
- Healthy: Double parity protection
- Single Degraded: Single parity protection
- Double Degraded: No parity protection
- Failed: Data Loss
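The state list above can be read as a simple mapping from the number of outstanding SSD failures to a protection level. The following is a minimal sketch of that reading, not XtremIO code: the names `DpgState` and `dpg_state` are hypothetical, and it assumes that each failed SSD whose rebuild has not yet completed consumes one parity level.

```python
from enum import Enum


class DpgState(Enum):
    # Values are the protection levels listed in the article.
    HEALTHY = "Double parity protection"
    SINGLE_DEGRADED = "Single parity protection"
    DOUBLE_DEGRADED = "No parity protection"
    FAILED = "Data loss"


def dpg_state(unrebuilt_failures: int) -> DpgState:
    """Map the number of failed SSDs not yet rebuilt to a DPG state.

    Assumption (not from the article): every outstanding failure
    consumes exactly one of the two parity levels.
    """
    if unrebuilt_failures < 0:
        raise ValueError("failure count cannot be negative")
    order = [DpgState.HEALTHY, DpgState.SINGLE_DEGRADED, DpgState.DOUBLE_DEGRADED]
    if unrebuilt_failures < len(order):
        return order[unrebuilt_failures]
    # A further failure with no parity left means data loss.
    return DpgState.FAILED
```

Under this model, a completed rebuild lowers the outstanding failure count and moves the DPG back toward Healthy, which matches the stated purpose of the rebuild (restoring double parity protection).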
XtremIO is Content-Aware Storage (CAS). As a result, all I/O operations to the DPG are statistically random, which lets the array deliver the same performance whether the user's workload is random or sequential. Another benefit is that if an SSD fails, the cluster is not required to return a page to its original location.
Arrays that are not CAS depend on sequential logical data also being stored sequentially on physical media; if data is not returned to its original location, sequential I/O performance is lost.
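To illustrate why content-based placement looks statistically random, the sketch below derives a target from a hash of the page content rather than from its logical address. This is an illustration of the general idea only and is not XtremIO's actual placement logic: the functions `target_ssd` and `placement_histogram`, the use of SHA-1, and the modulo mapping are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def target_ssd(page: bytes, num_ssds: int) -> int:
    """Pick a target index from the page content (hypothetical mapping)."""
    digest = hashlib.sha1(page).digest()
    # Use the first 8 bytes of the fingerprint as an integer.
    return int.from_bytes(digest[:8], "big") % num_ssds


def placement_histogram(pages, num_ssds: int) -> Counter:
    """Count how many pages land on each target index."""
    return Counter(target_ssd(page, num_ssds) for page in pages)


if __name__ == "__main__":
    # Logically sequential blocks with distinct content; 25 is an arbitrary example.
    sequential = [f"block-{i}".encode() for i in range(10_000)]
    for ssd, count in sorted(placement_histogram(sequential, 25).items()):
        print(ssd, count)
```

Because the target depends only on content, logically adjacent blocks scatter across targets, and a recovered page has no "home" location that it must be returned to.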
DPG Rebuild explained:
When an SSD is removed or fails, SYM automatically issues a DPG rebuild. The rebuild requires two operations:
Phase 1: Recover all the lost data and write it elsewhere:
The lost pages (data + parity) are recovered to the DPG (new write flow)
The PLBM/HMD tables are updated.
Phase 2: Update the parity information of all stripes:
Moving data/parity pages requires updating all parities (across all stripes)
Both operations require updating all stripes. To save time and reduce writes, both are performed in a single iteration.
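The single-iteration idea can be sketched as one loop that both recovers the lost page and refreshes the parity of each stripe. The example below is a deliberately simplified model and not the XtremIO implementation: it uses a single XOR parity per stripe instead of double parity, does not model where parity physically resides or the PLBM/HMD table updates, and the names `xor_all` and `rebuild` are hypothetical.

```python
from functools import reduce
from operator import xor


def xor_all(pages):
    """XOR a sequence of integer pages together (0 for an empty sequence)."""
    return reduce(xor, pages, 0)


def rebuild(stripes, failed_ssd):
    """Recover pages lost on failed_ssd and refresh parity in one pass.

    Each stripe is a dict: {"data": {ssd_id: page}, "parity": int}.
    Returns the recovered pages; a real system would rewrite them
    elsewhere through the normal write flow (not modelled here).
    """
    recovered = []
    for stripe in stripes:  # single iteration over all stripes
        survivors = {s: p for s, p in stripe["data"].items() if s != failed_ssd}
        if failed_ssd in stripe["data"]:
            # Phase 1: lost page = parity XOR all surviving data pages.
            recovered.append(stripe["parity"] ^ xor_all(survivors.values()))
        # Phase 2: the stripe now spans fewer SSDs, so recompute its parity.
        stripe["data"] = survivors
        stripe["parity"] = xor_all(survivors.values())
    return recovered
```

Doing both phases inside the same loop means each stripe is read and written once per rebuild rather than once per phase, which is the saving the article describes.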
DPG Integration explained: Adding a new SSD to the DPG differs from a rebuild in two ways:
- It requires manual intervention by a technician (placing a new SSD in the DAE slot)
- It has little or no urgency (as opposed to a rebuild)
Once requested, the DPG integration process balances the parity blocks. Only parity blocks are recovered to the original SSD (to achieve an even parity distribution). This is done by assigning, adding, and integrating the new SSD.
Cause
Single or multiple SSD failures in an XtremIO DPG.
Resolution
Based on the above, a DPG rebuild or integration causes some increase in cluster resource utilization, but usually no noticeable drop in performance or increase in latency. During a double DPG rebuild, however, the cluster devotes nearly all of its resources to rebuilding the failed SSDs as quickly as possible in order to ensure data integrity and avoid data loss. This is expected by design, and performance should return to normal once all operations complete.