PowerFlex MDM Cluster Down After Repeated Failovers
Summary: The MDM cluster repeatedly loses synchronization and eventually becomes unavailable, staying down until a user intervenes.
Symptoms
Issue Description
The MDM cluster repeatedly loses synchronization and eventually becomes unavailable, staying down until a user intervenes.
Scenario
The MDM process restarts repeatedly in quick succession, and after a number of restarts systemd refuses to start it again.
If no synchronized secondary MDM is available to take the primary role, the system is left with no primary MDM.
Symptoms
In the MDM events log, there are many MDM cluster connection losses:
2020-12-03 17:40:36.068 REMOTE_SYSLOG_MODULE_INITIALIZED INFO Initialized the remote syslog module
2020-12-03 17:40:36.068 MDM_MANAGER_START INFO MDM started with the role of Manager
2020-12-03 17:40:36.251 MDM_CLUSTER_CONNECTED INFO The MDM, MDM2 (ID 0f8dcd34388a8c01), connected after 0ms
2020-12-03 17:40:36.251 MDM_CLUSTER_CONNECTED INFO The MDM, MDM3 (ID 551526045129a502), connected after 0ms
2020-12-03 17:40:36.251 MDM_CLUSTER_CONNECTED INFO The MDM, TB2 (ID 7d5c0c020abdea04), connected after 0ms
2020-12-03 17:40:36.251 MDM_CLUSTER_CONNECTED INFO The MDM, TB1 (ID 4207e70f08980503), connected after 0ms
2020-12-03 17:40:37.486 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM3 (ID 551526045129a502), has lost connection to the cluster.
2020-12-03 17:40:37.785 MDM_CLUSTER_CONNECTED INFO The MDM, MDM3 (ID 551526045129a502), connected after 310ms
2020-12-03 17:40:38.755 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM3 (ID 551526045129a502), has lost connection to the cluster.
2020-12-03 17:40:39.060 MDM_CLUSTER_CONNECTED INFO The MDM, MDM3 (ID 551526045129a502), connected after 310ms
2020-12-03 17:40:40.032 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM3 (ID 551526045129a502), has lost connection to the cluster.
2020-12-03 17:40:40.337 MDM_CLUSTER_CONNECTED INFO The MDM, MDM3 (ID 551526045129a502), connected after 310ms
2020-12-03 17:40:41.364 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM3 (ID 551526045129a502), has lost connection to the cluster.
2020-12-03 17:40:41.673 MDM_CLUSTER_CONNECTED INFO The MDM, MDM3 (ID 551526045129a502), connected after 310ms
2020-12-03 17:40:42.602 MDM_CLUSTER_NOT_RESPOND WARNING The MDM, MDM1 (ID 30deabdb5ddf2a00), is not responding
2020-12-03 17:40:42.676 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM3 (ID 551526045129a502), has lost connection to the cluster.
2020-12-03 17:40:43.091 MDM_CLUSTER_CONNECTED INFO The MDM, MDM3 (ID 551526045129a502), connected after 410ms
2020-12-03 17:40:44.073 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM3 (ID 551526045129a502), has lost connection to the cluster.
2020-12-03 17:40:45.557 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM2 (ID 0f8dcd34388a8c01), has lost connection to the cluster.
2020-12-03 17:40:45.858 MDM_CLUSTER_CONNECTED INFO The MDM, MDM2 (ID 0f8dcd34388a8c01), connected after 310ms
2020-12-03 17:40:46.967 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM2 (ID 0f8dcd34388a8c01), has lost connection to the cluster.
2020-12-03 17:40:47.268 MDM_CLUSTER_CONNECTED INFO The MDM, MDM2 (ID 0f8dcd34388a8c01), connected after 300ms
2020-12-03 17:40:48.413 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM2 (ID 0f8dcd34388a8c01), has lost connection to the cluster.
2020-12-03 17:40:48.625 MDM_CLUSTER_NOT_RESPOND WARNING The MDM, MDM3 (ID 551526045129a502), is not responding
2020-12-03 17:40:48.811 MDM_CLUSTER_CONNECTED INFO The MDM, MDM2 (ID 0f8dcd34388a8c01), connected after 400ms
2020-12-03 17:40:49.866 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM2 (ID 0f8dcd34388a8c01), has lost connection to the cluster.
2020-12-03 17:40:50.160 MDM_CLUSTER_CONNECTED INFO The MDM, MDM2 (ID 0f8dcd34388a8c01), connected after 300ms
2020-12-03 17:40:51.208 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM2 (ID 0f8dcd34388a8c01), has lost connection to the cluster.
2020-12-03 17:40:51.520 MDM_CLUSTER_CONNECTED INFO The MDM, MDM2 (ID 0f8dcd34388a8c01), connected after 310ms
2020-12-03 17:40:52.603 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM2 (ID 0f8dcd34388a8c01), has lost connection to the cluster.
2020-12-03 17:40:53.407 MDM_CLUSTER_BECOMING_MASTER WARNING This MDM, MDM1 (ID 30deabdb5ddf2a00), took control of the cluster and is now the Master MDM.
2020-12-03 17:40:53.619 MDM_CLUSTER_NODE_DEGRADED ERROR MDM cluster node is now DEGRADED and is in offline node MDM3 (ID 551526045129a502); IPs: [10.180.88.5], Port: 9011 .
2020-12-03 17:40:53.619 MDM_CLUSTER_NODE_DEGRADED ERROR MDM cluster node is now DEGRADED and is in offline node MDM2 (ID 0f8dcd34388a8c01); IPs: [10.180.88.69], Port: 9011 .
2020-12-03 17:40:54.622 REMOTE_SYSLOG_MODULE_INITIALIZED INFO Initialized the remote syslog module
2020-12-03 17:40:54.622 MDM_MANAGER_START INFO MDM started with the role of Manager
2020-12-03 17:40:54.743 MDM_CLUSTER_CONNECTED INFO The MDM, TB2 (ID 7d5c0c020abdea04), connected after 0ms
2020-12-03 17:40:54.842 MDM_CLUSTER_CONNECTED INFO The MDM, TB1 (ID 4207e70f08980503), connected after 0ms
2020-12-03 17:40:54.943 MDM_CLUSTER_BECOMING_MASTER WARNING This MDM, MDM1 (ID 30deabdb5ddf2a00), took control of the cluster and is now the Master MDM.
2020-12-03 17:40:55.144 MDM_CLUSTER_NODE_DEGRADED ERROR MDM cluster node is now DEGRADED and is in offline node MDM3 (ID 551526045129a502); IPs: [10.180.88.5], Port: 9011 .
2020-12-03 17:40:55.145 MDM_CLUSTER_NODE_DEGRADED ERROR MDM cluster node is now DEGRADED and is in offline node MDM2 (ID 0f8dcd34388a8c01); IPs: [10.180.88.69], Port: 9011 .
2020-12-03 17:40:56.140 REMOTE_SYSLOG_MODULE_INITIALIZED INFO Initialized the remote syslog module
2020-12-03 17:40:56.140 MDM_MANAGER_START INFO MDM started with the role of Manager
2020-12-03 17:40:56.229 MDM_CLUSTER_CONNECTED INFO The MDM, TB2 (ID 7d5c0c020abdea04), connected after 0ms
2020-12-03 17:40:56.327 MDM_CLUSTER_CONNECTED INFO The MDM, TB1 (ID 4207e70f08980503), connected after 0ms
2020-12-03 17:40:56.428 MDM_CLUSTER_BECOMING_MASTER WARNING This MDM, MDM1 (ID 30deabdb5ddf2a00), took control of the cluster and is now the Master MDM.
2020-12-03 17:40:56.629 MDM_CLUSTER_NODE_DEGRADED ERROR MDM cluster node is now DEGRADED and is in offline node MDM3 (ID 551526045129a502); IPs: [10.180.88.5], Port: 9011 .
2020-12-03 17:40:56.629 MDM_CLUSTER_NODE_DEGRADED ERROR MDM cluster node is now DEGRADED and is in offline node MDM2 (ID 0f8dcd34388a8c01); IPs: [10.180.88.69], Port: 9011 .
2020-12-03 17:40:57.660 REMOTE_SYSLOG_MODULE_INITIALIZED INFO Initialized the remote syslog module
2020-12-03 17:40:57.660 MDM_MANAGER_START INFO MDM started with the role of Manager
2020-12-03 17:40:57.768 MDM_CLUSTER_CONNECTED INFO The MDM, TB2 (ID 7d5c0c020abdea04), connected after 0ms
2020-12-03 17:40:57.869 MDM_CLUSTER_CONNECTED INFO The MDM, TB1 (ID 4207e70f08980503), connected after 0ms
2020-12-03 17:40:57.970 MDM_CLUSTER_BECOMING_MASTER WARNING This MDM, MDM1 (ID 30deabdb5ddf2a00), took control of the cluster and is now the Master MDM.
2020-12-03 17:40:58.171 MDM_CLUSTER_NODE_DEGRADED ERROR MDM cluster node is now DEGRADED and is in offline node MDM3 (ID 551526045129a502); IPs: [10.180.88.5], Port: 9011 .
2020-12-03 17:40:58.172 MDM_CLUSTER_NODE_DEGRADED ERROR MDM cluster node is now DEGRADED and is in offline node MDM2 (ID 0f8dcd34388a8c01); IPs: [10.180.88.69], Port: 9011 .
2020-12-03 17:40:59.143 REMOTE_SYSLOG_MODULE_INITIALIZED INFO Initialized the remote syslog module
2020-12-03 17:40:59.144 MDM_MANAGER_START INFO MDM started with the role of Manager
2020-12-03 17:40:59.245 MDM_CLUSTER_CONNECTED INFO The MDM, TB2 (ID 7d5c0c020abdea04), connected after 0ms
2020-12-03 17:40:59.353 MDM_CLUSTER_CONNECTED INFO The MDM, TB1 (ID 4207e70f08980503), connected after 0ms
2020-12-03 17:40:59.454 MDM_CLUSTER_BECOMING_MASTER WARNING This MDM, MDM1 (ID 30deabdb5ddf2a00), took control of the cluster and is now the Master MDM.
2020-12-03 17:40:59.655 MDM_CLUSTER_NODE_DEGRADED ERROR MDM cluster node is now DEGRADED and is in offline node MDM3 (ID 551526045129a502); IPs: [10.180.88.5], Port: 9011 .
2020-12-03 17:40:59.655 MDM_CLUSTER_NODE_DEGRADED ERROR MDM cluster node is now DEGRADED and is in offline node MDM2 (ID 0f8dcd34388a8c01); IPs: [10.180.88.69], Port: 9011 .
2020-12-03 17:41:00.630 REMOTE_SYSLOG_MODULE_INITIALIZED INFO Initialized the remote syslog module
2020-12-03 17:41:00.630 MDM_MANAGER_START INFO MDM started with the role of Manager
2020-12-03 17:41:00.722 MDM_CLUSTER_CONNECTED INFO The MDM, TB2 (ID 7d5c0c020abdea04), connected after 0ms
2020-12-03 17:41:00.818 MDM_CLUSTER_CONNECTED INFO The MDM, TB1 (ID 4207e70f08980503), connected after 0ms
2020-12-03 17:41:00.919 MDM_CLUSTER_BECOMING_MASTER WARNING This MDM, MDM1 (ID 30deabdb5ddf2a00), took control of the cluster and is now the Master MDM.
2020-12-03 17:41:01.120 MDM_CLUSTER_NODE_DEGRADED ERROR MDM cluster node is now DEGRADED and is in offline node MDM3 (ID 551526045129a502); IPs: [10.180.88.5], Port: 9011 .
2020-12-03 17:41:01.121 MDM_CLUSTER_NODE_DEGRADED ERROR MDM cluster node is now DEGRADED and is in offline node MDM2 (ID 0f8dcd34388a8c01); IPs: [10.180.88.69], Port: 9011 .
2020-12-03 20:37:38.973 REMOTE_SYSLOG_MODULE_INITIALIZED INFO Initialized the remote syslog module
2020-12-03 20:37:38.973 MDM_MANAGER_START INFO MDM started with the role of Manager
These events show that the MDM process is restarting many times in a row. This can be caused by severe connectivity problems or by the OS disk responding slowly ("Harden took too long"), as described in the article Slow writes to OS disk can cause multiple MDM issues.
In journalctl or /var/log/messages, "Started scaleio mdm" appears immediately after the service stops whenever systemd successfully restarts the mdm service. When systemd does not schedule another restart, "repeated too quickly" and "Failed to start" appear instead (as at 17:41:01 below):
Dec 3 17:40:54 RHEL7-1 systemd: mdm.service: main process exited, code=exited, status=255/n/a
Dec 3 17:40:54 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec 3 17:40:54 RHEL7-1 systemd: mdm.service failed.
Dec 3 17:40:54 RHEL7-1 systemd: mdm.service has no holdoff time, scheduling restart.
Dec 3 17:40:54 RHEL7-1 systemd: Stopped scaleio mdm.
Dec 3 17:40:54 RHEL7-1 systemd: Started scaleio mdm.
Dec 3 17:40:55 RHEL7-1 systemd: mdm.service: main process exited, code=exited, status=255/n/a
Dec 3 17:40:55 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec 3 17:40:55 RHEL7-1 systemd: mdm.service failed.
Dec 3 17:40:55 RHEL7-1 systemd: mdm.service has no holdoff time, scheduling restart.
Dec 3 17:40:55 RHEL7-1 systemd: Stopped scaleio mdm.
Dec 3 17:40:55 RHEL7-1 systemd: Started scaleio mdm.
Dec 3 17:40:57 RHEL7-1 systemd: mdm.service: main process exited, code=exited, status=255/n/a
Dec 3 17:40:57 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec 3 17:40:57 RHEL7-1 systemd: mdm.service failed.
Dec 3 17:40:57 RHEL7-1 systemd: mdm.service has no holdoff time, scheduling restart.
Dec 3 17:40:57 RHEL7-1 systemd: Stopped scaleio mdm.
Dec 3 17:40:57 RHEL7-1 systemd: Started scaleio mdm.
Dec 3 17:40:58 RHEL7-1 systemd: mdm.service: main process exited, code=exited, status=255/n/a
Dec 3 17:40:58 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec 3 17:40:58 RHEL7-1 systemd: mdm.service failed.
Dec 3 17:40:58 RHEL7-1 systemd: mdm.service has no holdoff time, scheduling restart.
Dec 3 17:40:58 RHEL7-1 systemd: Stopped scaleio mdm.
Dec 3 17:40:58 RHEL7-1 systemd: Started scaleio mdm.
Dec 3 17:41:00 RHEL7-1 systemd: mdm.service: main process exited, code=exited, status=255/n/a
Dec 3 17:41:00 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec 3 17:41:00 RHEL7-1 systemd: mdm.service failed.
Dec 3 17:41:00 RHEL7-1 systemd: mdm.service has no holdoff time, scheduling restart.
Dec 3 17:41:00 RHEL7-1 systemd: Stopped scaleio mdm.
Dec 3 17:41:00 RHEL7-1 systemd: Started scaleio mdm.
Dec 3 17:41:01 RHEL7-1 systemd: mdm.service: main process exited, code=exited, status=255/n/a
Dec 3 17:41:01 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec 3 17:41:01 RHEL7-1 systemd: mdm.service failed.
Dec 3 17:41:01 RHEL7-1 systemd: mdm.service has no holdoff time, scheduling restart.
Dec 3 17:41:01 RHEL7-1 systemd: Stopped scaleio mdm.
Dec 3 17:41:01 RHEL7-1 systemd: start request repeated too quickly for mdm.service
Dec 3 17:41:01 RHEL7-1 systemd: Failed to start scaleio mdm.
Dec 3 17:41:01 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec 3 17:41:01 RHEL7-1 systemd: mdm.service failed.
In short, the following lines mark the point at which systemd stopped bringing the MDM service back up:
systemd: start request repeated too quickly for mdm.service
systemd: Failed to start scaleio mdm.
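The search for these marker lines can be scripted. The following is a minimal sketch, not a supported tool: it assumes the classic syslog line layout shown in the excerpt above (month, day, time, host, "systemd:"), and the function name summarize_mdm_restarts is hypothetical. It reports when systemd restarted the MDM service successfully and when it gave up.

```python
import re

# Message fragments copied from the syslog excerpt in this article.
STARTED_MARKER = "Started scaleio mdm."
GIVE_UP_MARKER = "start request repeated too quickly for mdm.service"

# Assumed layout: "Dec 3 17:41:01 RHEL7-1 systemd: <message>".
# An optional "[pid]" after "systemd" is tolerated, as printed by journalctl.
SYSLOG_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+systemd(\[\d+\])?:\s+(?P<msg>.*)$"
)


def summarize_mdm_restarts(lines):
    """Return (started, gave_up): timestamps of successful restarts and of
    the points where systemd refused to start mdm.service again."""
    started, gave_up = [], []
    for line in lines:
        match = SYSLOG_LINE.match(line.strip())
        if not match:
            continue  # not a systemd syslog line in the assumed layout
        msg = match.group("msg")
        if STARTED_MARKER in msg:
            started.append(match.group("stamp"))
        elif GIVE_UP_MARKER in msg:
            gave_up.append(match.group("stamp"))
    return started, gave_up


if __name__ == "__main__":
    sample = [
        "Dec 3 17:41:00 RHEL7-1 systemd: Started scaleio mdm.",
        "Dec 3 17:41:01 RHEL7-1 systemd: Stopped scaleio mdm.",
        "Dec 3 17:41:01 RHEL7-1 systemd: start request repeated too quickly for mdm.service",
        "Dec 3 17:41:01 RHEL7-1 systemd: Failed to start scaleio mdm.",
    ]
    started, gave_up = summarize_mdm_restarts(sample)
    print("successful restarts:", started)
    print("systemd gave up at:", gave_up)
```

To run it against a real log, pass the lines of /var/log/messages (or saved journalctl output) to summarize_mdm_restarts. A non-empty second list indicates that the service was left down.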
Impact
Data Unavailable.
Cause
When an MDM yields the primary role, it restarts its own process by design (a planned crash).
Systemd in RHEL/CentOS 6.x & 7.x enforces a limit on how many times a service may be started within a given time frame.
If the MDM service restarts often enough to exceed that limit, systemd refuses further start requests and the service stays down.
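The effect of such a threshold can be illustrated with a small model. This is a simplified sketch of a fixed-window start limit, not systemd's actual implementation. The values of 5 starts per 10 seconds are the stock systemd defaults and are an assumption here, because the mdm.service unit file may set its own limits. The class name StartRateLimit is hypothetical. The start times are taken from the syslog excerpt in this article, expressed as seconds after 17:40:54.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StartRateLimit:
    """Simplified fixed-window model of a service start limit."""

    interval_s: float = 10.0  # assumed default window length, in seconds
    burst: int = 5            # assumed default number of starts per window
    _window_start: Optional[float] = field(default=None, init=False)
    _count: int = field(default=0, init=False)

    def allow_start(self, now_s: float) -> bool:
        # Open a new window on the first request or after the old one expires.
        if self._window_start is None or now_s - self._window_start > self.interval_s:
            self._window_start = now_s
            self._count = 1
            return True
        # Inside the window: allow starts until the burst is used up.
        if self._count < self.burst:
            self._count += 1
            return True
        return False


if __name__ == "__main__":
    # Start requests at 17:40:54, :55, :57, :58, 17:41:00 and 17:41:01.
    attempts = [0, 1, 3, 4, 6, 7]
    limiter = StartRateLimit()
    print([limiter.allow_start(t) for t in attempts])
    # prints [True, True, True, True, True, False]
```

Under these assumptions the sixth start request falls inside the same 10-second window as the previous five and is refused, which matches the "start request repeated too quickly" message logged at 17:41:01.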
Resolution
Workaround
- Stabilize the MDM cluster.
- If this is not possible, engage SIO L3 for assistance and cite this article.
Impacted Versions
RHEL/CentOS 6.x & 7.x
Fixed In Version
TBD