PowerFlex: Excessive Data Role-switching Causes IO Latency and Errors
Summary: Under certain cluster state transitions, the MDM role-balance logic can produce rapid, repeated primary/secondary role switches across a large number of combs (internal data structures that track which SDS nodes store each piece of volume data). Each role switch invalidates client-side (SDC) comb maps and forces IO retries. When enough combs are affected simultaneously, the cumulative retry overhead causes IO latency spikes and IO errors on SDC hosts. Depending on the host environment, this can result in application IO timeouts, VMs entering read-only state, or filesystem unavailability. This behavior has been observed under multiple trigger scenarios and is not limited to any single operational procedure.
Symptoms
Common Indicators
- MDM_DATA_DEGRADED event followed by sustained IO latency lasting 1-15+ minutes
- SDC hosts report IO errors and/or IO timeouts during the degraded window
- VMware (ESXi): VMFS heartbeat timeouts, SCSI hardware errors (sense data: 0x4 0x0 0x0), VMs entering read-only state, potential HA failover
- Linux: IO errors in system logs (/var/log/messages, dmesg), applications may experience IO timeouts or filesystems remounting read-only
- MDM event logs show the system in DEGRADED state longer than expected for a single SDS loss
- System eventually self-recovers to NORMAL state without manual intervention (in most cases)
Note: As of this writing, this issue has only been reported in environments with VMware (ESXi) and Linux SDC hosts. There are no known reports of this behavior impacting Windows SDC hosts, though the underlying defect is in MDM core logic and is not OS-specific.
Scenario 1: Ungraceful SDS Loss (No Maintenance Mode)
An SDS is decoupled unexpectedly. SDCs disconnect from the affected SDS, the cluster enters DEGRADED, and the role-balance storm begins immediately. Example MDM event sequence:
SDC_DISCONNECTED_FROM_SDS_IP SDC disconnected from SDS <name>
SDS_DECOUPLED SDS <name> decoupled
MDM_DATA_DEGRADED The system is now in DEGRADED state
IO errors begin within seconds of the decouple event across multiple surviving SDS nodes, not just the SDS that was lost.
Scenario 2: SDS Power-Off During PMM Entry
An SDS is powered off before it finishes entering Protected Maintenance Mode. The MDM records the PMM request followed by an unexpected decouple before SDS_MAINTENANCE_MODE_STARTED is logged. Example MDM event sequence:
CLI_COMMAND_SUCCEEDED Command enter_protected_maintenance_mode succeeded
SDS_DECOUPLED SDS <name> decoupled
MDM_DATA_DEGRADED The system is now in DEGRADED state
The role-balance storm persists until the SDS re-joins and completes the maintenance mode transition.
Scenario 3: SDS in Instant Maintenance Mode (IMM)
An SDS enters or exits IMM. During the transition, IO latency spikes are observed. The MDM does not report DEGRADED state in this scenario, but SDC hosts experience IO retries and latency until the IMM transition completes.
Scenario 4: SDS Exit from Maintenance Mode
An SDS exits IMM or PMM. During re-integration, IO errors may occur briefly as the role-balance logic reassigns combs to the returning SDS.
Log Outputs:
MDM Event Logs: The MDM event log shows the cluster-level sequence. The key indicators are an SDS decouple followed by DEGRADED state, with the system remaining in DEGRADED longer than expected for a single SDS loss:
SDC_DISCONNECTED_FROM_SDS_IP WARNING SDC <name> disconnected from the IP address <ip> of SDS <name>
MULTIPLE_SDC_CONNECTIVITY_CHANGES INFO Multiple SDC connectivity changes occurred
SDS_DECOUPLED ERROR SDS: <name> (ID: <id>) decoupled
MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state
When the storm resolves and the system stabilizes:
MDM_DATA_NORMAL INFO The system is now in NORMAL state
SDS Trace Logs:
On the surviving SDS nodes (not the SDS that was lost), trace logs will show repeated IO fault responses for combs that should be stable. These indicate the role-balance storm is actively flipping primary/secondary assignments:
raidComb_SetPriTgtGenNum: combId <id> combGenNum: cur <gen> new <gen>
contCmd_SetCombState: CombId <id> devId <id> PRI->SEC Switch roles
contCmd_SetCombState: CombId <id> devId <id> SEC->PRI Switch roles
IO faults seen on surviving SDS nodes during the storm:
IO_FAULT_NOT_PRI -- SDS received IO for a comb it is no longer primary for
IO_FAULT_WRONG_COMB_GEN -- SDC's cached comb generation is stale
IO_HARD_ERROR -- SDS could not complete the IO (partner SDS unreachable)
A high volume of Switch roles entries in a short time window (thousands or more within seconds) is the definitive SDS-side indicator of this issue.
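One way to confirm this SDS-side indicator is to count "Switch roles" entries in a trace log window. The sketch below runs against an illustrative three-line excerpt (comb and device IDs are placeholders); on a real system, point the same grep at the actual SDS trace log path:

```shell
# Count role-switch entries in an SDS trace excerpt (sample data, placeholder IDs).
cat > /tmp/sds_trace_sample.log <<'EOF'
contCmd_SetCombState: CombId 0x1a devId 0x3 PRI->SEC Switch roles
contCmd_SetCombState: CombId 0x1a devId 0x3 SEC->PRI Switch roles
contCmd_SetCombState: CombId 0x2b devId 0x4 PRI->SEC Switch roles
EOF
# Total role switches in the window (thousands within seconds indicates the storm):
grep -c 'Switch roles' /tmp/sds_trace_sample.log
# Per-comb breakdown, busiest combs first (field 3 is the comb ID):
grep 'Switch roles' /tmp/sds_trace_sample.log | awk '{print $3}' | sort | uniq -c | sort -rn
```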
SDC / Host Logs:
VMware (ESXi) SDC IO retries showing the comb, target SDS, and fault code:
vmkernel log
PowerFlex mapVolIO_Do_CK:1496 :Mit: <addr>. Retrying IO Type WRITE. Failed comb: <id>. SDS_ID <id>. Comb Gen <gen>. Head Gen <gen>.
PowerFlex mapVolIO_Do_CK:1510 :Mit: <addr>. Vol ID <id>. Last fault Status IO_FAULT_NOT_PRI(12). Retry count (1)
If retries exhaust, SCSI errors are returned:
sense data: 0x4 0x0 0x0 -- SCSI Hardware Error
VMFS heartbeat timeouts on affected datastores:
HBX: 3089: '<datastore>': HB at offset <offset> - Waiting for timed out HB
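To gauge retry volume on an ESXi host, the vmkernel entries above can be tallied per fault code. A hedged sketch against an illustrative excerpt; the addresses, volume IDs, and the numeric code shown for IO_FAULT_WRONG_COMB_GEN are placeholders, not values from a real system:

```shell
# Tally PowerFlex IO retries per fault code in a vmkernel log excerpt (sample data).
cat > /tmp/vmkernel_sample.log <<'EOF'
PowerFlex mapVolIO_Do_CK:1510 :Mit: 0x1. Vol ID 7. Last fault Status IO_FAULT_NOT_PRI(12). Retry count (1)
PowerFlex mapVolIO_Do_CK:1510 :Mit: 0x2. Vol ID 7. Last fault Status IO_FAULT_NOT_PRI(12). Retry count (2)
PowerFlex mapVolIO_Do_CK:1510 :Mit: 0x3. Vol ID 9. Last fault Status IO_FAULT_WRONG_COMB_GEN(99). Retry count (1)
EOF
# Retries per fault code; a surge of IO_FAULT_NOT_PRI points at the role-balance storm:
grep -o 'IO_FAULT_[A-Z_]*' /tmp/vmkernel_sample.log | sort | uniq -c
```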
Linux SDC hosts -- /var/log/messages or dmesg. IO errors surfaced through the SCSI layer or filesystem:
sd <device>: [scini] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
blk_update_request: I/O error, dev <device>, sector <sector>
EXT4-fs error (device <device>): ext4_journal_check_start: Detected aborted journal
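A quick way to sweep a Linux SDC host for these signatures is a single grep over the kernel log. The sample input below is illustrative (device names and sectors are placeholders); on a live host, run the same pattern against dmesg output or /var/log/messages:

```shell
# Sweep for the three Linux-side IO-error signatures in one pass (sample data).
cat > /tmp/messages_sample <<'EOF'
kernel: sd 2:0:0:1: [scini] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: blk_update_request: I/O error, dev sdb, sector 204800
kernel: EXT4-fs error (device sdb1): ext4_journal_check_start: Detected aborted journal
EOF
grep -E 'blk_update_request|EXT4-fs error|DRIVER_SENSE' /tmp/messages_sample
```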
Key point: The distinguishing characteristic is IO errors appearing on multiple SDS nodes, not just the SDS that was lost. If IO errors are isolated to the failed SDS, the issue is expected degraded-state behavior, not this defect.
Impact
During the role-balance storm, IO latency spikes cause temporary IO stalls on affected volumes. The duration and severity of impact depends on cluster size, IO load, and the number of combs affected.
Observed impact has included:
- IO stalls lasting from approximately 15 seconds to 15+ minutes
- VMs entering read-only state
- VMFS heartbeat timeouts on ESXi hosts, potentially triggering HA failover or VM power-state alerts
- Application IO timeouts on Linux SDC hosts
- Large-scale environments may experience widespread impact across hundreds or thousands of VMs
No data loss has been observed in any reported occurrence. The system self-recovers once the role-balance storm subsides and the cluster returns to NORMAL state. The severity of impact scales with the size of the protection domain. Environments with a large number of SDS nodes, volumes, and active IO at the time of the event will experience greater impact.
Cause
A software defect in the MDM role-balance logic causes a feedback loop when the cluster transitions state due to an SDS loss or maintenance mode operation. Under certain conditions, the MDM repeatedly reassigns which SDS nodes are responsible for serving IO to affected combs. Each reassignment invalidates the SDC's cached view of where data is located, forcing IO retries. When a large number of combs are affected simultaneously, the volume of reassignments outpaces the SDCs' ability to update, resulting in sustained IO errors across multiple hosts. The storm is typically self-limiting. It resolves once the cluster stabilizes, but the duration depends on the size of the protection domain and IO load at the time of the event.
Resolution
This issue is fixed in PowerFlex Core version 4.5.6. Upgrade to this version once available. Contact Dell Support for release timeline information.
- For planned maintenance operations: …
- For unplanned SDS outages: the storm is self-limiting and typically resolves within minutes as the cluster stabilizes. If the issue is observed, collect …
- In rare cases where the issue does not self-resolve, temporarily disabling and re-enabling rebuild can allow the MDM to stabilize: disable rebuild, wait 5-10 seconds, then re-enable rebuild.
Important: The cluster has reduced redundancy while rebuild is disabled. Only disable rebuild long enough for the system to stabilize, then re-enable immediately. It is recommended to perform this action with Dell Support guidance.
Note: Rebuild is managed at the Storage Pool level. If the affected SDS has devices in multiple Storage Pools, apply this action to each affected Storage Pool. Storage Pools that do not contain devices from the affected SDS are not impacted. The Protection Domain, Storage Pool, and SDS-to-device mapping can be identified from the scli --query_all command output.
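The rebuild toggle described above can be staged in advance. The sketch below is a dry run that only writes the planned commands to a file for review; nothing is executed against the cluster. The scli flag names (--disable_rebuild, --enable_rebuild, --protection_domain_name, --storage_pool_name) and the PD1/SP1 names are assumptions for illustration: verify the exact syntax for your PowerFlex version and perform the action with Dell Support guidance.

```shell
# Dry run: stage the rebuild disable/enable pair for review before running anything.
# PD1 / SP1 are placeholder names; identify the real ones from scli --query_all output.
PD=PD1
SP=SP1
{
  echo "scli --disable_rebuild --protection_domain_name $PD --storage_pool_name $SP"
  echo "sleep 10   # wait 5-10 seconds for the MDM to stabilize"
  echo "scli --enable_rebuild --protection_domain_name $PD --storage_pool_name $SP"
} > /tmp/rebuild_toggle_plan.txt
cat /tmp/rebuild_toggle_plan.txt   # review, then run each line deliberately
```

Re-enable rebuild immediately once the cluster stabilizes; redundancy is reduced the entire time rebuild is disabled.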
Additional Info
Impacted Versions: PowerFlex Core - 4.5.x and lower
Fixed In Version: PowerFlex Core - 4.5.6 and higher