PowerFlex: Client IO Errors When Replication Is Utilized

Summary: Client servers are experiencing I/O errors against PowerFlex-backed devices. The overall backend (MDMs and SDSs) appears to be healthy. PowerFlex replication is active, and RPO errors are affecting one or more RCGs. ...

Symptoms

Client servers report I/O errors on PowerFlex-backed devices while the backend appears healthy:

  • No degraded or failed capacity
  • No SDSs were decoupled, and no SDS devices reported errors
  • No disconnected MDMs
  • The replication feature is being utilized

 

One or more alerts in the UI report the following errors:

Minor - Remote Consistency Group RPO Exceeded
Major - The RCG consistent image is too large to be consumed by the destination in one piece.

 

MDM events may report the following:

2024-06-11 15:55:56.592000:0001566:RPL_PD_CAP_UTILIZATION_MINOR     WARNING  Protection Domain ID <pd_id> Replication journal capacity is at MINOR utilization level 
...
2024-06-11 16:20:12.848000:0001567:RPL_PD_CAP_UTILIZATION_MAJOR     ERROR    Protection Domain ID <pd_id> Replication journal capacity is at MAJOR utilization level 
...
2024-06-11 17:19:57.272000:0001584:RPL_PD_CAP_UTILIZATION_CRITICAL  CRITICAL Protection Domain ID <pd_id> Replication journal capacity is at VERY_HIGH utilization level 
...
2024-06-11 17:52:26.352000:0001585:RPL_PD_CAP_UTILIZATION_CRITICAL  CRITICAL Protection Domain ID <pd_id> Replication journal capacity is at CRITICAL utilization level
...
2024-06-11 16:25:14.381000:0001576:RPL_CG_MOVED_TO_SLIM_MODE        INFO     Replication Consistency Group ID <rcg_id> entered slim mode
...
2024-06-11 18:27:29.738000:0001586:SDR_CRITICAL_CAP_CHANGE          ERROR    SDR ID <sdr_id>) handling user data changed discarded old user data and stopped to accumulate new user data due critical capacity
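
The event names listed above can be searched for in exported MDM event text. The following is a minimal sketch, assuming the events are available as plain text on standard input; the function name filter_journal_events and the file name events.txt are hypothetical and not part of PowerFlex.

```shell
# Keep only the replication-journal events quoted in this article.
# Input: MDM event text on stdin, one event per line.
# How the event text is obtained (file, export, CLI output) is an
# assumption left to the caller.
filter_journal_events() {
    grep -E 'RPL_PD_CAP_UTILIZATION_(MINOR|MAJOR|CRITICAL)|RPL_CG_MOVED_TO_SLIM_MODE|SDR_CRITICAL_CAP_CHANGE'
}

# Example (events.txt is a hypothetical export of MDM events):
# filter_journal_events < events.txt
```

A progression from MINOR through MAJOR to CRITICAL, followed by SDR_CRITICAL_CAP_CHANGE, matches the pattern shown in the sample events.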

 

Impact

Clients are unable to access volumes that are being replicated.

Cause

In rare cases, a software defect causes the MDM and the SDR component to disagree on the internal counters related to journal capacity, on whether the initial copy for the RCG completed successfully, or both.

Because of this discrepancy, the MDM may fail to release (unallocate) journal capacity when the SDR's capacity is full, when the initial copy for the RCG is finishing, or both. This can lead to I/O errors on client servers that use devices backed by PowerFlex.

Resolution

This procedure is a one-time remediation step rather than a permanent fix. Users may encounter this issue again under similar conditions.

The procedure requires a rolling restart of all SDR components and the MDM ownership switch on the Source system.

Note: It is recommended that all RCGs on the Source system be terminated until the issue is resolved.

 

1. Restart SDR components on the Target site

a. Identify all the SDRs and validate that they are in a healthy state before continuing to step b.:

scli --query_all_sdr

b. Place the SDR in maintenance mode:

scli --enter_sdr_maintenance_mode --sdr_name <name>

c. Validate that the SDR is in maintenance mode by running the command in step a.

d. Restart the SDR component:

pkill sdr

e. Take the SDR out of maintenance mode:

scli --exit_sdr_maintenance_mode --sdr_name <name>

f. Repeat steps a. through e. on each SDR.

 

2. Restart SDR components on the Source site

a. Identify all the SDRs and validate that they are in a healthy state before continuing to step b.:

scli --query_all_sdr

b. Place the SDR in maintenance mode:

scli --enter_sdr_maintenance_mode --sdr_name <name>

c. Validate that the SDR is in maintenance mode by running the command in step a.

d. Restart the SDR component:

pkill sdr

e. Take the SDR out of maintenance mode:

scli --exit_sdr_maintenance_mode --sdr_name <name>

f. Repeat steps a. through e. on each SDR.

g. Once all SDRs (Target and Source systems) are restarted and are in a healthy state, proceed to step #3.
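
The per-SDR sequence in steps 1 and 2 is the same on both sites, so it can help to print it as a checklist for each SDR before starting. The sketch below only echoes the commands from this article and executes nothing; the function name sdr_restart_plan and the SDR names sdr-a and sdr-b are hypothetical. It assumes that pkill sdr is run on the host where the SDR in question is running.

```shell
# Print, in order, the commands from steps a. through e. for one SDR.
# Nothing is executed: the output is a checklist for the operator.
# Assumption: "pkill sdr" is run on the host of the named SDR.
sdr_restart_plan() {
    sdr_name=$1
    if [ -z "$sdr_name" ]; then
        echo "usage: sdr_restart_plan <sdr_name>" >&2
        return 1
    fi
    echo "scli --query_all_sdr   # confirm all SDRs are healthy"
    echo "scli --enter_sdr_maintenance_mode --sdr_name $sdr_name"
    echo "scli --query_all_sdr   # confirm $sdr_name is in maintenance mode"
    echo "pkill sdr   # run on the host of $sdr_name"
    echo "scli --exit_sdr_maintenance_mode --sdr_name $sdr_name"
}

# Example with hypothetical SDR names; replace them with the names
# reported by "scli --query_all_sdr".
for name in sdr-a sdr-b; do
    sdr_restart_plan "$name"
    echo
done
```

Handling one SDR at a time keeps the restart rolling: each SDR leaves maintenance mode before the next one enters it.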

 

3. Switch the MDM ownership on the Target site

Switch the MDM ownership. Commands may vary based on PowerFlex version:

#3.x
scli --switch_mdm_ownership --new_master_mdm_name <name>

#4.x
scli --switch_mdm_ownership --new_primary_mdm_name <name>

Note: The ownership can be transferred back to the original MDM server if desired.
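
Because the flag name differs between 3.x and 4.x, a small helper can reduce the chance of typing the wrong one. This is a minimal sketch that prints the command without running it; the function name switch_mdm_command and the MDM name mdm-2 are hypothetical, and the version string is supplied by the operator rather than detected.

```shell
# Print the ownership-switch command for a given PowerFlex version.
# The version and MDM name are supplied by the caller; this function
# does not query the system for either.
switch_mdm_command() {
    version=$1
    mdm_name=$2
    case "$version" in
        3.*) echo "scli --switch_mdm_ownership --new_master_mdm_name $mdm_name" ;;
        4.*) echo "scli --switch_mdm_ownership --new_primary_mdm_name $mdm_name" ;;
        *)   echo "unsupported version: $version" >&2; return 1 ;;
    esac
}

# Example: mdm-2 is a hypothetical MDM name.
switch_mdm_command 4.5 mdm-2
```

The same helper applies to both the Target and the Source site, since the two sites use identical commands.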

 

4. Switch the MDM ownership on the Source site

a. Switch the MDM ownership. Commands may vary based on PowerFlex version:

#3.x
scli --switch_mdm_ownership --new_master_mdm_name <name>

#4.x
scli --switch_mdm_ownership --new_primary_mdm_name <name>

Note: The ownership can be transferred back to the original MDM server if desired.

b. Validate that I/O errors are no longer reported on the client servers. If a filesystem on a client has gone read-only, that client server may require a reboot. If the I/O errors continue after applying the above steps, engage Dell Support for assistance.

Additional Information

Impacted Versions

PowerFlex Core 3.x

PowerFlex Core 4.x

Fixed In Version

PowerFlex Core 4.5.5.2 HF1 - requires an Engineering evaluation of the Source and Target systems. Upload an SCR collection to the SR.

PowerFlex Core 4.5.6

Affected Products

PowerFlex rack, PowerFlex Appliance, PowerFlex Software

Article Properties
Article Number: 000227849
Article Type: Solution
Last Modified: 25 Jun 2026
Version: 12