PowerFlex 3.X: Slow Writes To OS Disk Can Cause Multiple MDM Issues.

Summary: Slow writes to operating system disk can cause multiple MDM issues.

Symptoms

Any number of scenarios can present as a result of a slow operating system disk on an MDM.

In ScaleIO 3.0, the MDM mechanism was made more robust so that it better handles severely slow OS disks (latency of 10 seconds or more).


When the MDMs are running on OS disks that take too long to write, the following symptoms may be seen:

  • Putting an SDS into maintenance causes the Master MDM to disconnect.

  • A rebuild event causes the Master MDM and possibly also the Slave MDMs to disconnect.

  • MDM switchover fails: Slave MDMs cannot take over Master MDM responsibilities, so no MDM is Master.

  • Output of "scli --query_cluster" occasionally shows Slave MDMs as not synchronized.

  • SDCs experience write IO errors.

In all scenarios, "Harden took too long" is seen in MDM trc logs:

08/12 03:36:42.336327 0x7f64207f4eb0:replFile_WriteUnlocked:00667: WARNING: Harden took too long: 1360 ms
08/12 03:36:44.811987 0x7f6420668eb0:replFile_WriteUnlocked:00667: WARNING: Harden took too long: 1840 ms
08/12 03:36:46.463661 0x7f642072eeb0:replFile_WriteUnlocked:00667: WARNING: Harden took too long: 2210 ms
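
To gauge how often and how badly the threshold is exceeded, the warnings can be counted and the worst duration extracted. The following is a minimal sketch, assuming the trc lines keep the format shown above; the function name harden_summary is hypothetical and not part of PowerFlex.

```shell
# Summarize "Harden took too long" warnings from MDM trc text on stdin.
# Prints one line: count=<number of warnings> max_ms=<largest duration>
harden_summary() {
  awk '
    /Harden took too long:/ {
      # The duration is the field that follows "long:".
      for (i = 1; i < NF; i++) {
        if ($i == "long:") { ms = $(i + 1) + 0; break }
      }
      count++
      if (ms > max) max = ms
    }
    END { printf "count=%d max_ms=%d\n", count, max }
  '
}
```

Feed it the contents of an MDM trc log, for example `harden_summary < "$TRC_FILE"`, where `TRC_FILE` is a placeholder for the trc log location on your MDM. For the three sample lines above it prints `count=3 max_ms=2210`.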

Impact

When MDM repository writes exceed the harden threshold, the MDM is not synchronized.
As a result, the MDM cluster is not synchronized and MDM processes restart.
If MDMs restart quickly or repeatedly enough, a complete data unavailability scenario can occur because no Master MDM is available, as in the case of an MDM cluster going down after repeated failovers.

Cause

When the Master MDM must change the state of data blocks, it writes these state changes to the MDM repository file and then syncs them to the Slave MDMs. When those writes are complete, the MDM notifies the SDSes that the changes are finalized, and they can serve write IOs to the SDCs from the primary copy only (until the rebuild is completed).

If it takes longer than 500 milliseconds (half a second) for the Master MDM to write the changes to the local repository, "Harden took too long" messages appear in the MDM trc logs. The MDM then cannot respond quickly enough to the SDSes' requests, which may cause IO errors on the SDCs. The MDM stays in this state until writes to the repository complete in less than 500 milliseconds, or until 10 seconds have passed, at which point an MDM ownership switch occurs within the cluster.

Resolution

The solution is to resolve the OS disk latency issue.

The latency can be due to:

  • RAID rebuilds (14G Ready Nodes have BOSS cards with two M.2 SATA drives in RAID 1)

  • Disk wear/age

  • Improper sizing or selection of OS disks (HDD, slow or cheap SSD, and so on; usually only in software-only configurations)

  • Bugs in OS disk controller/disk firmware

  • Disk failure/predictive failure state 

The most common cause, however, is extraneous IO load on the OS disk.

In any case, monitoring/profiling the OS disk's performance is necessary.

Disk latency can be monitored with sar or iostat.

iostat is the easiest and most widely available tool.

Run:

iostat -xtN 1

Then observe the await times, which are reported in milliseconds.
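
Watching the await columns by eye is tedious over long runs, so the output can be filtered for devices above a limit. The following is a minimal sketch: the function name slow_devices is hypothetical, and the 500 ms default is an assumption borrowed from the harden threshold described under Cause. The column is located by name from the header line because the iostat column layout differs between sysstat versions (w_await where present, otherwise await).

```shell
# Flag devices whose (write) await exceeds a limit in milliseconds.
# Reads `iostat -x` output on stdin; the limit is the first argument (default 500).
slow_devices() {
  awk -v limit="${1:-500}" '
    $1 ~ /^Device/ {
      # Locate the latency column by name; prefer w_await over await.
      col = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "w_await") { col = i; name = $i }
        else if ($i == "await" && col == 0) { col = i; name = $i }
      }
      next
    }
    # Data rows: enough fields to reach the column, and value above the limit.
    col > 0 && NF >= col && $col + 0 > limit {
      printf "%s %s=%s ms\n", $1, name, $col
    }
  '
}
```

A bounded run such as `iostat -xtN 1 60 | slow_devices 500` samples once per second for a minute and prints one line per device and interval that exceeds the limit. Note that the first iostat report shows averages since boot rather than the current interval.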

All versions are impacted.

Affected Products

Converged Infrastructure

Products

Converged Infrastructure, Software, Storage, PowerFlex Software
Article Properties
Article Number: 000201707
Article Type: Solution
Last Modified: 19 Nov 2025
Version:  4