PowerFlex 3.X: Slow Writes To OS Disk Can Cause Multiple MDM Issues.
Summary: Slow writes to operating system disk can cause multiple MDM issues.
Symptoms
A slow operating system disk on an MDM can present in a number of ways.
In ScaleIO 3.0, the MDM mechanism was made more robust so that it better handles severely slow OS disks (10+ second latency).
When the MDMs are running on OS disks that take too long to write, the following symptoms may be seen:
- Putting an SDS into maintenance causes the Master MDM to disconnect.
- A rebuild event causes the Master MDM, and possibly also the Slave MDMs, to disconnect.
- MDM switchover does not work: Slave MDMs cannot take over Master MDM responsibilities, so no MDM is master.
- Output of "scli --query_cluster" occasionally shows Slave MDMs as not synchronized.
- SDCs report write IO errors.
In all scenarios, "Harden took too long" is seen in MDM trc logs:
08/12 03:36:42.336327 0x7f64207f4eb0:replFile_WriteUnlocked:00667: WARNING: Harden took too long: 1360 ms
08/12 03:36:44.811987 0x7f6420668eb0:replFile_WriteUnlocked:00667: WARNING: Harden took too long: 1840 ms
08/12 03:36:46.463661 0x7f642072eeb0:replFile_WriteUnlocked:00667: WARNING: Harden took too long: 2210 ms
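When a trc log is long, it can help to pull the harden warnings out programmatically. The following is a minimal Python sketch, assuming the warnings always follow the layout of the sample lines above. The names `HARDEN_RE`, `harden_events`, and `summarize` are illustrative and not part of any PowerFlex tooling.

```python
import re

# Layout assumed from the sample trc lines in this article:
# "MM/DD HH:MM:SS.ffffff <thread:function:line:> WARNING: Harden took too long: N ms"
HARDEN_RE = re.compile(
    r"(\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+\S+\s+"
    r"WARNING: Harden took too long: (\d+) ms"
)

# The three sample lines quoted in this article.
SAMPLE = (
    "08/12 03:36:42.336327 0x7f64207f4eb0:replFile_WriteUnlocked:00667: "
    "WARNING: Harden took too long: 1360 ms\n"
    "08/12 03:36:44.811987 0x7f6420668eb0:replFile_WriteUnlocked:00667: "
    "WARNING: Harden took too long: 1840 ms\n"
    "08/12 03:36:46.463661 0x7f642072eeb0:replFile_WriteUnlocked:00667: "
    "WARNING: Harden took too long: 2210 ms\n"
)


def harden_events(text):
    """Return a list of (timestamp, milliseconds) for each harden warning."""
    return [(m.group(1), int(m.group(2))) for m in HARDEN_RE.finditer(text)]


def summarize(events):
    """Return the count, maximum, and mean harden time for a list of events."""
    durations = [ms for _, ms in events]
    if not durations:
        return {"count": 0, "max_ms": None, "mean_ms": None}
    return {
        "count": len(durations),
        "max_ms": max(durations),
        "mean_ms": round(sum(durations) / len(durations), 1),
    }
```

Calling `summarize(harden_events(log_text))` on the contents of an MDM trc file gives a quick view of how often and how badly the harden threshold is being exceeded.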
Impact
When MDM repository writes exceed the harden threshold, the MDM is no longer synchronized.
As a result, the MDM cluster is not synchronized and MDM processes restart.
If MDMs restart quickly or repeatedly enough, a complete data unavailability scenario (no Master MDM available) can occur, as in an MDM cluster down after repeated failovers.
Cause
When the Master MDM must change the state of data blocks, it writes these state changes to the MDM repository file and then syncs them to the Slave MDMs. When those writes are complete, the MDM notifies the SDSes that the changes are finalized and that they can serve write IOs to the SDCs from the primary copy only (until the rebuild is completed).
If the Master MDM takes longer than 500 milliseconds (half a second) to write the changes to the local repository, "Harden took too long" messages appear in the MDM trc logs. The MDM then cannot respond quickly enough to the SDSes' requests, which may cause IO errors on the SDCs. The MDM stays in this state until it can write to the repository in less than 500 milliseconds, or until 10 seconds have passed, at which point an MDM switch ownership occurs within the cluster.
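The timing rules above can be sketched as a small state check. This is only a simplified model of the behavior described in this article, not the MDM's actual implementation. The function name, the sample format, and the treatment of a write of exactly 500 ms as recovered are assumptions made for illustration.

```python
HARDEN_THRESHOLD_MS = 500   # from this article: slower writes log "Harden took too long"
SWITCH_AFTER_S = 10.0       # from this article: switch ownership after 10 seconds


def slow_state_outcome(samples):
    """Classify a chronological list of (time_s, write_ms) repository writes.

    Returns one of:
      "healthy"          - no write exceeded the harden threshold
      "recovered"        - slow writes occurred but a later write was fast again
      "slow"             - still slow at the end, for less than 10 seconds
      "switch-ownership" - slow writes persisted for 10 seconds or more
    """
    slow_since = None   # time of the first slow write in the current slow run
    seen_slow = False
    for time_s, write_ms in samples:
        if write_ms > HARDEN_THRESHOLD_MS:
            if slow_since is None:
                slow_since = time_s
            seen_slow = True
            if time_s - slow_since >= SWITCH_AFTER_S:
                return "switch-ownership"
        else:
            slow_since = None   # a fast write ends the slow run
    if slow_since is not None:
        return "slow"
    return "recovered" if seen_slow else "healthy"
```

For example, `slow_state_outcome([(0, 1360), (2, 1840), (4, 300)])` models a brief slowdown that clears once a write completes in under 500 ms.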
Resolution
The solution is to resolve the OS disk latency issue.
The latency can be due to:
- RAID rebuilds (14G Ready Nodes have BOSS cards with two M.2 SATA drives in RAID 1)
- Disk wear or age
- Improper sizing or selection of OS disks (HDD, slow or cheap SSD, and so on; usually only in software-only configurations)
- Bugs in OS disk controller or disk firmware
- Disk failure or predictive-failure state
The most common cause, however, is extraneous IO load on the OS disk.
In any case, the OS disk's performance must be monitored and profiled.
Disk latency can be monitored with sar or iostat.
iostat is the easiest to use and the most universally available.
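Run:
iostat -xtN 1
The options request extended statistics (-x), a timestamp on each report (-t), and device-mapper names (-N), refreshed every second.
Observe the await times, reported in milliseconds. Depending on the sysstat version, the columns are named await, r_await, and w_await.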
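To watch for latency spikes without reading every report by eye, the iostat output can be filtered. The following is a minimal Python sketch, assuming the extended report starts each device table with a header line beginning with "Device" and names its latency columns with an "await" suffix. The function name `slow_devices` and the default threshold are illustrative assumptions. The 500 ms default is borrowed from the harden threshold, and because iostat reports an average per interval, a lower value may be more useful in practice.

```python
def slow_devices(lines, threshold_ms=500.0):
    """Yield (device, column, value_ms) for await-style columns above threshold_ms.

    `lines` is any iterable of iostat -x output lines, for example sys.stdin.
    """
    await_cols = {}  # column index -> column name, rebuilt at each header line
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].rstrip(":") == "Device":
            # Header line: remember which columns hold await values.
            await_cols = {
                i: name for i, name in enumerate(fields) if name.endswith("await")
            }
            continue
        if not await_cols or len(fields) <= max(await_cols):
            continue  # timestamp line, CPU lines, or text before the first header
        for i, name in await_cols.items():
            try:
                value = float(fields[i])
            except ValueError:
                break  # not a device row
            if value > threshold_ms:
                yield fields[0], name, value
```

One way to use it is to pipe `iostat -xtN 1` into a script that loops over `slow_devices(sys.stdin)` and prints each hit.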
All versions are impacted.