PowerFlex: Client IO Errors When Replication Is Being Used

Summary: Clients/servers are experiencing IO errors against PowerFlex-backed devices. The overall backend (MDM/SDSs) appears to be healthy. PowerFlex replication is in use and there are RPO errors against one or more of the RCGs. ...


Symptoms

The PowerFlex backend appears healthy:

  • No degraded or failed capacity
  • No SDSs decoupled and no SDS devices reporting errors
  • No disconnected MDMs

The replication feature is in use, and one or more alerts in the UI report the following errors:

Major - The RCG consistent image is too large to be consumed by the destination in one piece.

Minor - Remote Consistency Group RPO Exceeded

MDM event logs may report the following:
2024-06-11 15:55:56.592000:0001566:RPL_PD_CAP_UTILIZATION_MINOR     WARNING  Protection Domain ID <pd_id> Replication journal capacity is at MINOR utilization level
2024-06-11 16:20:12.848000:0001567:RPL_PD_CAP_UTILIZATION_MAJOR     ERROR    Protection Domain ID <pd_id> Replication journal capacity is at MAJOR utilization level
2024-06-11 17:19:57.272000:0001584:RPL_PD_CAP_UTILIZATION_CRITICAL  CRITICAL Protection Domain ID <pd_id> Replication journal capacity is at VERY_HIGH utilization level
2024-06-11 17:52:26.352000:0001585:RPL_PD_CAP_UTILIZATION_CRITICAL  CRITICAL Protection Domain ID <pd_id> Replication journal capacity is at CRITICAL utilization level
...
2024-06-11 16:25:14.381000:0001576:RPL_CG_MOVED_TO_SLIM_MODE        INFO     Replication Consistency Group ID <rcg_id> entered slim mode
2024-06-11 18:27:29.738000:0001586:SDR_CRITICAL_CAP_CHANGE          ERROR    SDR ID <sdr_id>) handling user data changed discarded old user data and stopped to accumulate new user data due critical capacity
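To check whether a system shows the same pattern, the event names above can be searched for in an exported copy of the MDM event log. The following is a minimal sketch, assuming the events have already been saved to a text file; the function name filter_rpl_events and the file argument are illustrative and not part of PowerFlex.

```shell
#!/bin/sh
# Print only the replication-journal events named in this article.
# Usage: filter_rpl_events <event_log_file>
# The log file is supplied by the caller; no default location is assumed.
filter_rpl_events() {
    grep -E 'RPL_PD_CAP_UTILIZATION_(MINOR|MAJOR|CRITICAL)|RPL_CG_MOVED_TO_SLIM_MODE|SDR_CRITICAL_CAP_CHANGE' "$1"
}
```

The events to look for are journal utilization climbing from MINOR to CRITICAL, followed by SDR_CRITICAL_CAP_CHANGE.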
 

Impact 

Clients are unable to access volumes that are intercepted for replication.


Cause

A rare software defect may cause the MDM and the SDR component to disagree on the internal counters that track journal capacity. As a result, the MDM may leave additional journal capacity unallocated even though the SDR's capacity is full, which can lead to IO errors on clients/servers using devices backed by PowerFlex.

Resolution

A rolling restart of all SDR components is required, and the MDM ownership on the source system must be switched to resolve the issue.

Restart SDR components:

1) Identify all the SDRs:

scli --query_all_sdr

2) Enter maintenance mode on the SDR:

scli --enter_sdr_maintenance_mode --sdr_name <name>

3) Validate that the SDR is fully in maintenance mode by running the command in step 1.

4) Restart the SDR component.

pkill sdr

5) Repeat steps 2 through 4 for each SDR on the source site, one SDR at a time.
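The per-SDR sequence above can be written out as a checklist before starting. The sketch below only prints the commands from steps 2 through 4 for each SDR name passed to it; it runs nothing. The function name plan_sdr_restart is hypothetical, and the SDR names are assumed to come from the output of step 1.

```shell
#!/bin/sh
# Print the per-SDR command sequence from the steps above; nothing is executed.
# Usage: plan_sdr_restart <sdr_name> [<sdr_name> ...]
plan_sdr_restart() {
    for sdr in "$@"; do
        echo "# SDR: $sdr"
        echo "scli --enter_sdr_maintenance_mode --sdr_name $sdr"
        echo "scli --query_all_sdr    # confirm $sdr is fully in maintenance mode"
        echo "pkill sdr               # run on the node hosting $sdr"
    done
}
```

For example, `plan_sdr_restart sdr1 sdr2` prints two blocks of three commands each. Printing instead of executing keeps the operator in control of each step, which suits a rolling restart where each SDR is verified before moving to the next.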

Once all SDRs are restarted, switch the MDM ownership:

#3.x
scli --switch_mdm_ownership --new_master_mdm_name <name>
 
#4.x
scli --switch_mdm_ownership --new_primary_mdm_name <name>

*If desired, ownership can be transferred back to the original MDM afterward.
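Because the flag name differs between 3.x and 4.x, a small helper can pick the right form. This is a sketch only: the function name mdm_switch_cmd is hypothetical, the version string is supplied by the operator, and the command is printed for review instead of being run.

```shell
#!/bin/sh
# Print the ownership-switch command matching the PowerFlex major version.
# Usage: mdm_switch_cmd <version> <target_mdm_name>
mdm_switch_cmd() {
    case "$1" in
        3.*) echo "scli --switch_mdm_ownership --new_master_mdm_name $2" ;;
        4.*) echo "scli --switch_mdm_ownership --new_primary_mdm_name $2" ;;
        *)   echo "unsupported version: $1" >&2; return 1 ;;
    esac
}
```

Any version string that does not start with 3. or 4. is rejected with a non-zero return code, since this article only documents those two forms.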

Validate that IO errors are no longer reported on the clients/servers. If a client's file system has gone read-only, that client/server may require a reboot.

Until the cause is determined, it is also recommended to terminate all RCGs on the source system.

If the IO errors continue after applying the above steps, engage PowerFlex Engineering.

Additional Information

Impacted Versions

PowerFlex 3.x

PowerFlex 4.x

Fixed In Version

PowerFlex 4.5.3
PowerFlex 4.5.4 - upgrade to 4.5.4 HF1
PowerFlex 4.5.5 - no fix available.
PowerFlex 4.5.6 and higher
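
The version list above can be expressed as a simple lookup. The sketch below is one reading of that list, under two assumptions that this article does not state outright: releases earlier than 4.5.3 (including all 3.x) are treated as impacted, and any release numerically above 4.5.6 counts as "4.5.6 and higher". The function name fix_status is hypothetical, and the input is expected to be a plain numeric major.minor.patch string.

```shell
#!/bin/sh
# Classify a PowerFlex version against the "Fixed In Version" list above.
# Usage: fix_status <major.minor.patch>   e.g. fix_status 4.5.4
fix_status() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the version string on dots
    IFS=$old_ifs
    major=${1:-0} minor=${2:-0} patch=${3:-0}
    if [ "$major" -eq 4 ] && [ "$minor" -eq 5 ]; then
        case "$patch" in
            3) echo "fixed" ;;
            4) echo "upgrade to 4.5.4 HF1" ;;
            5) echo "no fix available" ;;
            *) if [ "$patch" -ge 6 ]; then echo "fixed"; else echo "impacted"; fi ;;
        esac
    elif [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -gt 5 ]; }; then
        echo "fixed"      # assumption: anything above 4.5.x is "4.5.6 and higher"
    else
        echo "impacted"   # assumption: everything below 4.5.3 is impacted
    fi
}
```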

Affected Products

PowerFlex rack, PowerFlex Appliance, PowerFlex Software
Article Properties
Article Number: 000227849
Article Type: Solution
Last Modified: 27 Feb 2026
Version:  8