PowerFlex: Excessive Data Role-Switching Causes IO Latency and Errors
Summary: This article explains how excessive data role-switching causes I/O latency and errors.
Symptoms
Under certain cluster state transitions, the MDM role-balance logic can produce rapid, repeated primary/secondary role switches across many combs (internal data structures that track which SDS nodes store each piece of volume data). Each role switch invalidates client-side (SDC) comb maps and forces I/O retries. When enough combs are affected simultaneously, the cumulative retry overhead causes I/O latency spikes and I/O errors on SDC hosts. Depending on the host environment, this can result in application I/O timeouts, VMs entering read-only state, or file system unavailability.
This behavior has been observed under multiple trigger scenarios and is not limited to any single operational procedure.
Common Indicators
- MDM_DATA_DEGRADED event followed by sustained I/O latency lasting 1-15+ minutes
- SDC hosts report I/O errors and/or I/O timeouts during the degraded window
- VMware (ESXi): VMFS heartbeat timeouts, SCSI hardware errors (sense data: 0x4 0x0 0x0), VMs entering read-only state, potential HA failover
- Linux: I/O errors in system logs (/var/log/messages, dmesg), applications may experience I/O timeouts or file system remounting read-only (see the host-side check after this list)
- MDM event logs show the system in DEGRADED state longer than expected for a single SDS loss
- The system usually self-recovers to a NORMAL state without manual intervention
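On a Linux SDC host, an illustrative way to confirm the host-side symptoms is to search the system log and kernel ring buffer for I/O errors during the degraded window; the exact message text varies by kernel and distribution, and log paths may differ in your environment.

# Linux SDC host: look for block-layer I/O errors and read-only remounts (message text varies by distribution)
grep -iE "I/O error|read-only" /var/log/messages | tail -n 50
dmesg -T | grep -i "I/O error" | tail -n 50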
Scenario 1: Ungraceful SDS Loss (No Maintenance Mode)
When it can happen:
This is a rare scenario. For the rapid, repeated role-switch events to occur during an ungraceful SDS loss, several specific conditions must be present simultaneously:
- Large-scale environment - Significant number of SDS nodes and volumes
- Heavy production I/O load - Significant I/O activity at the moment the SDS fails
- Rebuild workload exceeds processing capacity - The number of metadata rows requiring rebuild exceeds the MDM balancer's per-cycle limit of 1,024 rows
Each rebalance cycle can process up to 1,024 metadata rows. When more rows than that must be rebuilt, the balancer cannot finish the current plan before generating the next one.
What happens:
- The SDS abruptly decouples from the MDM (event SDS_DECOUPLED)
- All SDCs that were connected to that SDS lose their connections → SDC disconnect events
- The MDM marks the cluster DEGRADED (event MDM_DATA_DEGRADED)
- Because the number of rows to rebuild is over 1,024, the MDM balancer cannot finish the current rebalance plan
- The balancer starts a new plan while the previous plan is still running, producing rapid, repeated role-switch events
- Client SDCs see continuous I/O failures (IO_FAULT_NOT_PRI, SCSI sense 0x4). After retries are exhausted, the host OS reports I/O errors, timeouts, or a read-only file system

MDM trace evidence:
When the rebalance workload exceeds the 1,024-row limit, the MDM trace shows the threshold being crossed:
2026/03/28 22:43:53.246702 MED:7f1f984aedb0:balanceExec_HandleDegradedRows:00343: BALANCER: Storage Pool: 1193844800000000 - 1024 rows processed out of 1098 degraded rows. 0 allocation failures. 0 cumulative allocation failures.
This indicates that 1,098 rows required rebuilding, but only 1,024 could be processed in the current cycle. The remaining 74 rows trigger a new rebalance plan before the previous plan completes, starting the feedback loop.
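To check whether this limit is being hit on your system, the MDM traces can be searched for balancer cycles that processed exactly 1,024 rows. The trace path below is a typical default on a Linux MDM and may differ by version and deployment.

# Primary MDM: find balancer cycles that hit the 1,024-row per-cycle cap (trace path may vary by version)
grep -h "balanceExec_HandleDegradedRows" /opt/emc/scaleio/mdm/logs/trc.* | grep "1024 rows processed"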
Chain of events:
- MDM events: SDS_DECOUPLED — SDS formally declared dead
- MDM events: MDM_DATA_DEGRADED — Cluster enters DEGRADED state
- SDS traces: Flood of IO_FAULT_NOT_PRI — SDS received I/O for a comb it is no longer primary for
- ESXi vmkernel: SCSI sense data: 0x4 0x0 0x0 — Hardware error
- MDM events: MULTIPLE_SDC_CONNECTIVITY_CHANGES — Mass SDC connectivity storm
- MDM events: SDC_DISCONNECTED_FROM_SDS_IP — SDCs losing contact with the failed SDS

Example MDM event sequence:
SDC_DISCONNECTED_FROM_SDS_IP  SDC disconnected from SDS <name>
SDS_DECOUPLED                 SDS <name> decoupled
MDM_DATA_DEGRADED             The system is now in DEGRADED state
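Where the MDM event history is available, for example through the showevents.py utility shipped with the MDM on many PowerFlex/ScaleIO releases (path and availability vary by version), the sequence above can be confirmed and the impact window measured:

# MDM node: extract the events that bracket the role-switch storm (utility path is version-dependent)
/opt/emc/scaleio/mdm/bin/showevents.py | grep -E "SDC_DISCONNECTED_FROM_SDS_IP|SDS_DECOUPLED|MDM_DATA_DEGRADED|MDM_DATA_NORMAL"
# The time between MDM_DATA_DEGRADED and MDM_DATA_NORMAL is the customer-visible impact window.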
Scenario 2: SDS Power-Off During PMM Entry
When it can happen:
This is a rare scenario that requires two simultaneous events:
- An SDS is entering Protected Maintenance Mode (PMM)
- The SDS fails or is powered off before the PMM transition completes
What happens:
- The MDM receives the PMM entry command and records it as succeeded
- The SDS unexpectedly decouples while the PMM entry is still in progress
- The MDM marks the cluster DEGRADED
- The role balancer enters a sustained role-switch loop throughout the PMM enter phase
- Non-PMM data rows are repeatedly role-switched across the entire storage pool
- The storm persists until the SDS rejoins the cluster and completes the maintenance mode transition
Chain of events:
- MDM events: CLI_COMMAND_SUCCEEDED — enter_protected_maintenance_mode command succeeded
- MDM events: SDS_DECOUPLED — SDS decoupled before maintenance mode started
- MDM events: MDM_DATA_DEGRADED — Cluster enters DEGRADED state
- SDS traces: Repeated role-switch operations across non-PMM rows
Example MDM event sequence:
CLI_COMMAND_SUCCEEDED  Command enter_protected_maintenance_mode succeeded
SDS_DECOUPLED          SDS <name> decoupled
MDM_DATA_DEGRADED      The system is now in DEGRADED state
When the SDS rejoins and PMM completes:
SDS_MAINTENANCE_MODE_STARTED  SDS maintenance mode started
MDM_DATA_NORMAL               The system is now in NORMAL state
Scenario 3: SDS in Instant Maintenance Mode (IMM)
When it can happen:
An SDS enters or exits Instant Maintenance Mode (IMM). This scenario occurs when a single SDS is in maintenance mode and the system cannot decide which SDS should handle I/O for specific data.
What happens:
- The system repeatedly changes which SDS is responsible for serving the same data
- These constant changes mean that applications do not know where to send their I/O requests
- I/O is sent to the wrong SDS, causing retries and delays
- Applications experience latency or timeouts while trying to access the affected data
Impact:
- Customer impact: Applications report latency and timeouts while the SDS is in IMM
- Duration: Continues while the SDS is in IMM state
- Recovery: Automatic - resolves when the SDS exits IMM
Chain of events:
- SDS traces: Repeated role-switch operations on the same data
- SDS traces: Primary and secondary role switches on identical data
Scenario 4: SDS Exit from Protected Maintenance Mode (PMM)
When it can happen:
An SDS exits Protected Maintenance Mode (PMM). This scenario occurs during every PMM exit - it is not a rare event, but the severity depends on how long the maintenance mode operation lasted.
What happens:
- As the SDS exits PMM, the role balancer must reassign data segments to include the returning SDS
- The rebalance process affects the entire storage pool, not just data on the returning SDS
- Role switches occur across many data segments during the reintegration
- Applications may experience brief I/O errors or latency as the role assignments stabilize
Impact:
- Customer impact: For short maintenance windows (less than 5 seconds), the impact is barely noticeable. For extended maintenance with active I/O, thousands of role switches can occur, causing sustained I/O stalls
- Duration: Continues during the reintegration phase until the rebalance completes
- Recovery: Automatic
Chain of events:
- MDM events: Role-switch operations across the storage pool during exit
- SDS traces: Repeated role-switch operations during reintegration
Example MDM event sequence:
SDS_MAINTENANCE_MODE_EXIT_STARTED    SDS maintenance mode exit started
SDS_MAINTENANCE_MODE_EXIT_COMPLETED  SDS maintenance mode exit completed
Log Outputs:
MDM Event Logs: The MDM event log shows the cluster-level sequence. The key indicators are role-switch operations during the maintenance mode exit.
SDS Trace Logs: On SDS nodes, trace logs show repeated role-switch operations during reintegration:
raidComb_SetPriTgtGenNum: combId <id> combGenNum: cur <gen> new <gen>
contCmd_SetCombState: CombId <id> devId <id> PRI->SEC Switch roles
contCmd_SetCombState: CombId <id> devId <id> SEC->PRI Switch roles
A high volume of Switch roles entries in a short time window (thousands or more within seconds) is the definitive SDS-side indicator of this issue.
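A simple way to quantify the storm on an SDS node is to bucket the Switch roles trace entries per second. This sketch assumes the SDS traces carry the same date/time prefix as the MDM trace example above and use the default trace location; adjust paths for your deployment.

# SDS node: count "Switch roles" trace entries per second; thousands per second indicates the storm
grep -h "Switch roles" /opt/emc/scaleio/sds/logs/trc.* \
  | awk '{ print $1, substr($2, 1, 8) }' \
  | sort | uniq -c | sort -rn | head -n 20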
SDC/Host Logs: VMware (ESXi) SDC I/O retries showing the comb, target SDS, and fault code:
vmkernel log:
PowerFlex mapVolIO_Do_CK:1496 :Mit: <addr>. Retrying IO Type WRITE. Failed comb: <id>. SDS_ID <id>. Comb Gen <gen>. Head Gen <gen>.
PowerFlex mapVolIO_Do_CK:1510 :Mit: <addr>. Vol ID <id>. Last fault Status IO_FAULT_NOT_PRI(12). Retry count (1)
If retries are exhausted, SCSI errors are returned:
sense data: 0x4 0x0 0x0 -- SCSI Hardware Error
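On an ESXi host, the same pattern can be pulled directly from the vmkernel log; the search strings match the examples above, though exact formatting differs between SDC driver versions.

# ESXi host: show PowerFlex SDC retries and hardware-error sense codes during the window
grep "IO_FAULT_NOT_PRI" /var/log/vmkernel.log | tail -n 20
grep "sense data: 0x4 0x0 0x0" /var/log/vmkernel.log | tail -n 20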
Diagnostic tip: If you see I/O errors on multiple SDS nodes (not just the node that had an issue), this may indicate a role-switch storm rather than normal degraded-state behavior. If I/O errors are isolated to a single SDS, this is expected degraded-state behavior.
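As an illustrative check, the IO_FAULT_NOT_PRI volume can be compared across several SDS nodes; the node names and trace path below are placeholders and assume SSH access to the nodes.

# Count IO_FAULT_NOT_PRI trace entries per SDS node; high counts on many nodes point to a role-switch storm
for sds in sds-node-01 sds-node-02 sds-node-03; do
    printf '%s: ' "$sds"
    ssh "$sds" 'cat /opt/emc/scaleio/sds/logs/trc.* 2>/dev/null | grep -c "IO_FAULT_NOT_PRI"'
done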
Scenario 5: Maintenance Mode Phase Transitions
When it can happen:
During the transition when an SDS enters or exits maintenance mode (IMM or PMM) - at the moment the state changes from normal to MM, or from MM back to normal.
What happens:
- The role balancer redistributes data responsibilities to accommodate the change
- Brief bursts of role switches occur as the system settles into the new arrangement
- Applications may experience short latency spikes during the transition
Impact:
- Customer impact: Brief latency spikes lasting seconds to a few minutes. Usually below application timeout thresholds
- Duration: Lasts seconds to a few minutes, then settles
- Recovery: Automatic
Chain of events:
- SDS traces: Brief role-switch operations during phase transitions
Cause
A software defect in the MDM role-balance logic causes a feedback loop when the cluster transitions state due to an SDS loss or maintenance mode operation.
Under certain conditions, the MDM repeatedly reassigns which SDS nodes are responsible for serving I/O to affected combs. Each reassignment invalidates the SDC's cached view of where the data is located, forcing I/O retries. When many combs are affected simultaneously, the volume of reassignments outpaces the SDCs' ability to update, resulting in sustained I/O errors across multiple hosts.
The storm is typically self-limiting. It resolves once the cluster stabilizes, but the duration depends on the size of the protection domain and I/O load at the time of the event.
Resolution
This issue is addressed in PowerFlex Core version 4.5.6. Upgrade to this version once available. Contact Dell Support for release timeline information.
For planned maintenance operations:
- Do not power-cycle or reboot an SDS until the MDM logs SDS_MAINTENANCE_MODE_STARTED. Verify that the SDS has fully entered maintenance mode before proceeding with physical maintenance (see the example check after this list).
- Monitor for latency spikes when entering or exiting maintenance mode.
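A hedged example of such a pre-check, assuming the showevents.py utility and default paths used in the examples above; command output and field names vary by version.

# Before physical maintenance: confirm the MDM has logged maintenance-mode entry for the SDS
/opt/emc/scaleio/mdm/bin/showevents.py | grep "SDS_MAINTENANCE_MODE_STARTED" | tail -n 1
# Optionally confirm the SDS object state from scli (output fields vary by version)
scli --query_sds --sds_name <sds_name>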
For unplanned SDS outages:
- The storm is self-limiting and typically resolves within minutes as the cluster stabilizes. If the issue is observed, collect get_info logs from all SDS nodes in the protection domain and from all Manager MDMs as soon as possible after the event, and contact Dell Support.
In rare cases where the issue does not self-resolve, temporarily disabling and reenabling rebuild can allow the MDM to stabilize:
scli --set_rebuild_mode --protection_domain_name <pd_name> --storage_pool_name <sp_name> --disable_rebuild
# Wait 5-10 seconds, then enable rebuild:
scli --set_rebuild_mode --protection_domain_name <pd_name> --storage_pool_name <sp_name> --enable_rebuild
After reenabling rebuild, verify that the cluster returns to a NORMAL state in the scli --query_all command output.
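For example, a minimal check of the relevant lines (output wording varies by PowerFlex version):

# Confirm rebuild/rebalance activity resumes and the degraded counters drain
scli --query_all | grep -iE "rebuild|rebalance|degraded"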