PowerFlex 3.x: MDM Panics in rpl_transmit_mgr.c
Summary: The Metadata Manager (MDM) process continuously panics due to replication.
Symptoms
In this case, the replication source site was running 3.x and the destination site was running 4.x. However, the issue may affect any 3.x system.
No changes have been made on the storage side.
The MDM process continuously panics with the following stack trace:
2024/11/24 05:51:06.186359 Panic in file /data/build/workspace/ScaleIO-Common-Job/src/mdm/replication/consistency_engine/rpl_transmit_mgr.c, line 833, function rplTransmitManager_ProcessRequestsForTimelinesRFD, PID 19477.
Panic Expression ALWAYS_ASSERT .
/opt/emc/scaleio/mdm/bin/mdm-3.6.400.107(mosDbg_PanicPrepare+0x13a) [0xabf1ba]
/opt/emc/scaleio/mdm/bin/mdm-3.6.400.107(rplTransmitManager_ProcessRequestsForTimelinesRFD+0x1f0) [0x880da0]
/opt/emc/scaleio/mdm/bin/mdm-3.6.400.107(consistencyEngine_AnalyzeTimelines+0x7b) [0x7f2ebb]
/opt/emc/scaleio/mdm/bin/mdm-3.6.400.107(consistencyEngine_AnalayzerUmtIteration+0x3c) [0x60d96c]
/opt/emc/scaleio/mdm/bin/mdm-3.6.400.107(consistencyEngine_AnalayzerUmtRoutine+0x33) [0x60da43]
/opt/emc/scaleio/mdm/bin/mdm-3.6.400.107(mosUmt_StartFunc+0x7a) [0x69a9fa]
/lib64/libc.so.6(+0x48190) [0x7ff82e834190]
/opt/emc/scaleio/mdm/bin/mdm-3.6.400.107(mosUmt_Init+0x129) [0x8f5e89]
[(nil)]
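To check whether panics in a collected MDM trace match this signature, the following minimal sketch may help. It is not part of any PowerFlex tooling: the helper name find_known_panics and the sample variable are hypothetical, and it assumes panic entries follow the same layout as the trace above (timestamp, source file, line number, function name).

```python
import re

# Signature taken from the stack trace in this article. The names below are
# hypothetical helpers for illustration, not PowerFlex utilities.
PANIC_RE = re.compile(
    r"Panic in file \S*rpl_transmit_mgr\.c, line (\d+), "
    r"function (rplTransmitManager_ProcessRequestsForTimelinesRFD)"
)

def find_known_panics(log_text):
    """Return (timestamp, source_line) pairs for each matching panic entry."""
    hits = []
    for line in log_text.splitlines():
        m = PANIC_RE.search(line)
        if m:
            # Assumes the entry starts with "YYYY/MM/DD HH:MM:SS.ffffff".
            timestamp = " ".join(line.split()[:2])
            hits.append((timestamp, int(m.group(1))))
    return hits

# Sample entry copied from the trace in this article.
sample = (
    "2024/11/24 05:51:06.186359 Panic in file "
    "/data/build/workspace/ScaleIO-Common-Job/src/mdm/replication/"
    "consistency_engine/rpl_transmit_mgr.c, line 833, function "
    "rplTransmitManager_ProcessRequestsForTimelinesRFD, PID 19477."
)
print(find_known_panics(sample))
```

Repeated matches with closely spaced timestamps would be consistent with the continuous panic described here. The source line number may differ between 3.x builds, so the sketch matches on the file and function names only.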
Impact:
The MDM cluster is down, which results in data unavailability (DU).
Cause
The issue was identified as a software defect in version 3.x. Because of this defect, the data transmitted during replication exceeded the enforced limit of 200 GiB. The MDMs could not keep up with the resulting volume of requests, became unstable, and ultimately panicked.
In this specific case, the large volume of transmitted data was caused by a Windows SDC trim command. However, any large data transmission could trigger the issue.
Resolution
This software issue is resolved in PowerFlex 4.5.x and later. To permanently resolve the issue, upgrade to 4.5.x or later before resuming replication:
- Stop SDRs on all nodes. This temporarily resolves the panic.
- Pause or stop all Replication Consistency Groups (RCGs) and replication pairs.
- Upgrade the system to the latest 4.5.x version or later.
- Resume replication after the upgrade is complete.
Impacted Versions:
PowerFlex 3.x
Fixed In Version:
PowerFlex 4.5