PowerFlex: Excessive Data Role-Switching Causes IO Latency and Errors
Summary: This article explains how excessive data role-switching causes I/O latency and errors.
Symptoms
Under certain cluster state transitions, the MDM role-balance logic can produce rapid, repeated primary/secondary role switches across many combs (internal data structures that track which SDS nodes store each piece of volume data). Each role switch invalidates client-side (SDC) comb maps and forces I/O retries. When enough combs are affected simultaneously, the cumulative retry overhead causes I/O latency spikes and I/O errors on SDC hosts. Depending on the host environment, this can result in application I/O timeouts, VMs entering read-only state, or file system unavailability.
This behavior has been observed under multiple trigger scenarios and is not limited to any single operational procedure.
Common Indicators
- MDM_DATA_DEGRADED event followed by sustained I/O latency lasting 1-15+ minutes
- SDC hosts report I/O errors and/or I/O timeouts during the degraded window
- VMware (ESXi): VMFS heartbeat timeouts, SCSI hardware errors (sense data: 0x4 0x0 0x0), VMs entering read-only state, potential HA failover
- Linux: I/O errors in system logs (/var/log/messages, dmesg), applications may experience I/O timeouts or file system remounting read-only (see the host-side check after this list)
- MDM event logs show the system in DEGRADED state longer than expected for a single SDS loss
- The system usually self-recovers to a NORMAL state without manual intervention
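On a Linux SDC host, an illustrative way to confirm the host-side symptoms is to search the system log and kernel ring buffer for I/O errors during the degraded window; the exact message text varies by kernel and distribution, and log paths may differ in your environment.

# Linux SDC host: look for block-layer I/O errors and read-only remounts (message text varies by distribution)
grep -iE "I/O error|read-only" /var/log/messages | tail -n 50
dmesg -T | grep -i "I/O error" | tail -n 50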
Scenario 1: Ungraceful SDS Loss (No Maintenance Mode)
When it can happen:
This is a rare scenario. For the rapid, repeated role-switch events to occur during an ungraceful SDS loss, several specific conditions must be present simultaneously:
- Large-scale environment - Significant number of SDS nodes and volumes
- Heavy production I/O load - Significant I/O activity at the moment the SDS fails
- Rebuild workload exceeds processing capacity - The number of metadata rows requiring rebuild exceeds the MDM balancer's per-cycle limit of 1,024 rows
Each rebalance cycle can process up to 1,024 metadata rows. When more rows than that must be rebuilt, the balancer cannot finish the current plan before generating the next one.
What happens:
- The SDS abruptly decouples from the MDM (event SDS_DECOUPLED)
- All SDCs that were connected to that SDS lose their connections → SDC disconnect events
- The MDM marks the cluster DEGRADED (event MDM_DATA_DEGRADED)
- Because the number of rows to rebuild is over 1,024, the MDM balancer cannot finish the current rebalance plan
- The balancer starts a new plan while the previous plan is still running, producing rapid, repeated role-switch events
- Client SDCs see continuous I/O failures (IO_FAULT_NOT_PRI, SCSI sense 0x4). After retries are exhausted, the host OS reports I/O errors, timeouts, or a read-only file system

MDM trace evidence:
When the rebalance workload exceeds the 1,024-row limit, the MDM trace shows the threshold being crossed:
2026/03/28 22:43:53.246702 MED:7f1f984aedb0:balanceExec_HandleDegradedRows:00343: BALANCER: Storage Pool: 1193844800000000 - 1024 rows processed out of 1098 degraded rows. 0 allocation failures. 0 cumulative allocation failures.
This indicates that 1,098 rows required rebuilding, but only 1,024 could be processed in the current cycle. The remaining 74 rows trigger a new rebalance plan before the previous plan completes, starting the feedback loop.
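To check whether this limit is being hit on your system, the MDM traces can be searched for balancer cycles that processed exactly 1,024 rows. The trace path below is a typical default on a Linux MDM and may differ by version and deployment.

# Primary MDM: find balancer cycles that hit the 1,024-row per-cycle cap (trace path may vary by version)
grep -h "balanceExec_HandleDegradedRows" /opt/emc/scaleio/mdm/logs/trc.* | grep "1024 rows processed"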
Chain of events:
- MDM events: SDS_DECOUPLED — SDS formally declared dead
- MDM events: MDM_DATA_DEGRADED — Cluster enters DEGRADED state
- SDS traces: Flood of IO_FAULT_NOT_PRI — SDS received I/O for a comb it is no longer primary for
- ESXi vmkernel: SCSI sense data: 0x4 0x0 0x0 — Hardware error
- MDM events: MULTIPLE_SDC_CONNECTIVITY_CHANGES — Mass SDC connectivity storm
- MDM events: SDC_DISCONNECTED_FROM_SDS_IP — SDCs losing contact with the failed SDS

Example MDM event sequence:
SDC_DISCONNECTED_FROM_SDS_IP  SDC disconnected from SDS <name>
SDS_DECOUPLED                 SDS <name> decoupled
MDM_DATA_DEGRADED             The system is now in DEGRADED state
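Where the MDM event history is available, for example through the showevents.py utility shipped with the MDM on many PowerFlex/ScaleIO releases (path and availability vary by version), the sequence above can be confirmed and the impact window measured:

# MDM node: extract the events that bracket the role-switch storm (utility path is version-dependent)
/opt/emc/scaleio/mdm/bin/showevents.py | grep -E "SDC_DISCONNECTED_FROM_SDS_IP|SDS_DECOUPLED|MDM_DATA_DEGRADED|MDM_DATA_NORMAL"
# The time between MDM_DATA_DEGRADED and MDM_DATA_NORMAL is the customer-visible impact window.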
Scenario 2: SDS Power-Off During PMM Entry
When it can happen:
This is a rare scenario that requires two simultaneous events:
- An SDS is entering Protected Maintenance Mode (PMM)
- The SDS fails or is powered off before the PMM transition completes
What happens:
- The MDM receives the PMM entry command and records it as succeeded
- The SDS unexpectedly decouples while the PMM entry is still in progress
- The MDM marks the cluster DEGRADED
- The role balancer enters a sustained role-switch loop throughout the PMM enter phase
- Non-PMM data rows are repeatedly role-switched across the entire storage pool
- The storm persists until the SDS rejoins the cluster and completes the maintenance mode transition
Chain of events:
- MDM events: CLI_COMMAND_SUCCEEDED — enter_protected_maintenance_mode command succeeded
- MDM events: SDS_DECOUPLED — SDS decoupled before maintenance mode started
- MDM events: MDM_DATA_DEGRADED — Cluster enters DEGRADED state
- SDS traces: Repeated role-switch operations across non-PMM rows
Example MDM event sequence:
CLI_COMMAND_SUCCEEDED  Command enter_protected_maintenance_mode succeeded
SDS_DECOUPLED          SDS <name> decoupled
MDM_DATA_DEGRADED      The system is now in DEGRADED state
When the SDS rejoins and PMM completes:
SDS_MAINTENANCE_MODE_STARTED  SDS maintenance mode started
MDM_DATA_NORMAL               The system is now in NORMAL state
Scenario 3: SDS in Instant Maintenance Mode (IMM)
When it can happen:
An SDS enters or exits Instant Maintenance Mode (IMM). This scenario occurs when a single SDS is in maintenance mode and the system cannot decide which SDS should handle I/O for specific data.
What happens:
- The system repeatedly changes which SDS is responsible for serving the same data
- These constant changes mean that applications do not know where to send their I/O requests
- I/O is sent to the wrong SDS, causing retries and delays
- Applications experience latency or timeouts while trying to access the affected data
Impact:
- Customer impact: Applications report latency and timeouts while the SDS is in IMM
- Duration: Continues while the SDS is in IMM state
- Recovery: Automatic - resolves when the SDS exits IMM
Chain of events:
- SDS traces: Repeated role-switch operations on the same data
- SDS traces: Primary and secondary role switches on identical data
Scenario 4: SDS Exit from Protected Maintenance Mode (PMM)
When it can happen:
An SDS exits Protected Maintenance Mode (PMM). This scenario occurs during every PMM exit - it is not a rare event, but the severity depends on how long the maintenance mode operation lasted.
What happens:
- As the SDS exits PMM, the role balancer must reassign data segments to include the returning SDS
- The rebalance process affects the entire storage pool, not just data on the returning SDS
- Role switches occur across many data segments during the reintegration
- Applications may experience brief I/O errors or latency as the role assignments stabilize
Impact:
- Customer impact: For short maintenance windows (less than 5 seconds), the impact is barely noticeable. For extended maintenance with active I/O, thousands of role switches can occur, causing sustained I/O stalls
- Duration: Continues during the reintegration phase until the rebalance completes
- Recovery: Automatic
Chain of events:
- MDM events: Role-switch operations across the storage pool during exit
- SDS traces: Repeated role-switch operations during reintegration
Example MDM event sequence:
SDS_MAINTENANCE_MODE_EXIT_STARTED    SDS maintenance mode exit started
SDS_MAINTENANCE_MODE_EXIT_COMPLETED  SDS maintenance mode exit completed
Log Outputs:
MDM Event Logs: The MDM event log shows the cluster-level sequence. The key indicators are role-switch operations during the maintenance mode exit.
SDS Trace Logs: On SDS nodes, trace logs show repeated role-switch operations during reintegration:
raidComb_SetPriTgtGenNum: combId <id> combGenNum: cur <gen> new <gen>
contCmd_SetCombState: CombId <id> devId <id> PRI->SEC Switch roles
contCmd_SetCombState: CombId <id> devId <id> SEC->PRI Switch roles
A high volume of Switch roles entries in a short time window (thousands or more within seconds) is the definitive SDS-side indicator of this issue.
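A simple way to quantify the storm on an SDS node is to bucket the Switch roles trace entries per second. This sketch assumes the SDS traces carry the same date/time prefix as the MDM trace example above and use the default trace location; adjust paths for your deployment.

# SDS node: count "Switch roles" trace entries per second; thousands per second indicates the storm
grep -h "Switch roles" /opt/emc/scaleio/sds/logs/trc.* \
  | awk '{ print $1, substr($2, 1, 8) }' \
  | sort | uniq -c | sort -rn | head -n 20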
SDC/Host Logs: VMware (ESXi) SDC I/O retries showing the comb, target SDS, and fault code:
vmkernel log:
PowerFlex mapVolIO_Do_CK:1496 :Mit: <addr>. Retrying IO Type WRITE. Failed comb: <id>. SDS_ID <id>. Comb Gen <gen>. Head Gen <gen>.
PowerFlex mapVolIO_Do_CK:1510 :Mit: <addr>. Vol ID <id>. Last fault Status IO_FAULT_NOT_PRI(12). Retry count (1)
If retries are exhausted, SCSI errors are returned:
sense data: 0x4 0x0 0x0 -- SCSI Hardware Error
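On an ESXi host, the same pattern can be pulled directly from the vmkernel log; the search strings match the examples above, though exact formatting differs between SDC driver versions.

# ESXi host: show PowerFlex SDC retries and hardware-error sense codes during the window
grep "IO_FAULT_NOT_PRI" /var/log/vmkernel.log | tail -n 20
grep "sense data: 0x4 0x0 0x0" /var/log/vmkernel.log | tail -n 20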
Diagnostic tip: If you see I/O errors on multiple SDS nodes (not just the node that had an issue), this may indicate a role-switch storm rather than normal degraded-state behavior. If I/O errors are isolated to a single SDS, this is expected degraded-state behavior.
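As an illustrative check, the IO_FAULT_NOT_PRI volume can be compared across several SDS nodes; the node names and trace path below are placeholders and assume SSH access to the nodes.

# Count IO_FAULT_NOT_PRI trace entries per SDS node; high counts on many nodes point to a role-switch storm
for sds in sds-node-01 sds-node-02 sds-node-03; do
    printf '%s: ' "$sds"
    ssh "$sds" 'cat /opt/emc/scaleio/sds/logs/trc.* 2>/dev/null | grep -c "IO_FAULT_NOT_PRI"'
done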
Scenario 5: Maintenance Mode Phase Transitions
When it can happen:
During the transition when an SDS enters or exits maintenance mode (IMM or PMM) - at the moment the state changes from normal to MM, or from MM back to normal.
What happens:
- The role balancer redistributes data responsibilities to accommodate the change
- Brief bursts of role switches occur as the system settles into the new arrangement
- Applications may experience short latency spikes during the transition
Impact:
- Customer impact: Brief latency spikes lasting seconds to a few minutes. Usually below application timeout thresholds
- Duration: Lasts seconds to a few minutes, then settles
- Recovery: Automatic
Chain of events:
- SDS traces: Brief role-switch operations during phase transitions
Cause
A software defect in the MDM role-balance logic causes a feedback loop when the cluster transitions state due to an SDS loss or maintenance mode operation.
Under certain conditions, the MDM repeatedly reassigns which SDS nodes are responsible for serving I/O to affected combs. Each reassignment invalidates the SDC's cached view of where the data is located, forcing I/O retries. When many combs are affected simultaneously, the volume of reassignments outpaces the SDCs' ability to update, resulting in sustained I/O errors across multiple hosts.
The storm is typically self-limiting. It resolves once the cluster stabilizes, but the duration depends on the size of the protection domain and I/O load at the time of the event.
Resolution
This issue is addressed in PowerFlex Core version 4.5.6. Upgrade to this version once available. Contact Dell Support for release timeline information.
For planned maintenance operations:
- Do not power-cycle or reboot an SDS until the MDM logs SDS_MAINTENANCE_MODE_STARTED. Verify that the SDS has fully entered maintenance mode before proceeding with physical maintenance (see the example check after this list).
- Monitor for latency spikes when entering or exiting maintenance mode.
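A hedged example of such a pre-check, assuming the showevents.py utility and default paths used in the examples above; command output and field names vary by version.

# Before physical maintenance: confirm the MDM has logged maintenance-mode entry for the SDS
/opt/emc/scaleio/mdm/bin/showevents.py | grep "SDS_MAINTENANCE_MODE_STARTED" | tail -n 1
# Optionally confirm the SDS object state from scli (output fields vary by version)
scli --query_sds --sds_name <sds_name>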
For unplanned SDS outages:
- The storm is self-limiting and typically resolves within minutes as the cluster stabilizes. If the issue is observed, collect get_info logs from all SDS nodes in the protection domain and from all Manager MDMs as soon as possible after the event, and contact Dell Support.
In rare cases where the issue does not self-resolve, temporarily disabling and reenabling rebuild can allow the MDM to stabilize:
scli --set_rebuild_mode --protection_domain_name <pd_name> --storage_pool_name <sp_name> --disable_rebuild
# Wait 5-10 seconds, then enable rebuild:
scli --set_rebuild_mode --protection_domain_name <pd_name> --storage_pool_name <sp_name> --enable_rebuild
After reenabling rebuild, verify that the cluster returns to a NORMAL state in the scli --query_all command output.
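For example, a minimal check of the relevant lines (output wording varies by PowerFlex version):

# Confirm rebuild/rebalance activity resumes and the degraded counters drain
scli --query_all | grep -iE "rebuild|rebalance|degraded"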