PowerFlex MDM Cluster Down After Repeated Failovers

Summary: The MDM cluster repeatedly loses synchronization and eventually becomes unavailable, staying down until an administrator intervenes.


Symptoms

Issue Description
The MDM cluster repeatedly loses synchronization and eventually becomes unavailable, staying down until an administrator intervenes.

Scenario
The MDM process restarts too many times in quick succession, and systemd eventually refuses to start it again.
If no synchronized secondary MDM can take over the primary role, the system is left with no primary MDM.

Symptoms
In the MDM events log, there are many MDM cluster connection losses:

2020-12-03 17:40:36.068 REMOTE_SYSLOG_MODULE_INITIALIZED INFO     	 Initialized the remote syslog module 
2020-12-03 17:40:36.068 MDM_MANAGER_START         INFO     	 MDM started with the role of Manager 
2020-12-03 17:40:36.251 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, MDM2 (ID 0f8dcd34388a8c01), connected after 0ms 
2020-12-03 17:40:36.251 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, MDM3 (ID 551526045129a502), connected after 0ms 
2020-12-03 17:40:36.251 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, TB2 (ID 7d5c0c020abdea04), connected after 0ms 
2020-12-03 17:40:36.251 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, TB1 (ID 4207e70f08980503), connected after 0ms 
2020-12-03 17:40:37.486 MDM_CLUSTER_LOST_CONNECTION WARNING  	 The MDM, MDM3 (ID 551526045129a502), has lost connection to the cluster. 
2020-12-03 17:40:37.785 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, MDM3 (ID 551526045129a502), connected after 310ms 
2020-12-03 17:40:38.755 MDM_CLUSTER_LOST_CONNECTION WARNING  	 The MDM, MDM3 (ID 551526045129a502), has lost connection to the cluster. 
2020-12-03 17:40:39.060 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, MDM3 (ID 551526045129a502), connected after 310ms 
2020-12-03 17:40:40.032 MDM_CLUSTER_LOST_CONNECTION WARNING  	 The MDM, MDM3 (ID 551526045129a502), has lost connection to the cluster. 
2020-12-03 17:40:40.337 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, MDM3 (ID 551526045129a502), connected after 310ms 
2020-12-03 17:40:41.364 MDM_CLUSTER_LOST_CONNECTION WARNING  	 The MDM, MDM3 (ID 551526045129a502), has lost connection to the cluster. 
2020-12-03 17:40:41.673 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, MDM3 (ID 551526045129a502), connected after 310ms 
2020-12-03 17:40:42.602 MDM_CLUSTER_NOT_RESPOND   WARNING  	 The MDM, MDM1 (ID 30deabdb5ddf2a00), is not responding 
2020-12-03 17:40:42.676 MDM_CLUSTER_LOST_CONNECTION WARNING  	 The MDM, MDM3 (ID 551526045129a502), has lost connection to the cluster. 
2020-12-03 17:40:43.091 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, MDM3 (ID 551526045129a502), connected after 410ms 
2020-12-03 17:40:44.073 MDM_CLUSTER_LOST_CONNECTION WARNING  	 The MDM, MDM3 (ID 551526045129a502), has lost connection to the cluster. 
2020-12-03 17:40:45.557 MDM_CLUSTER_LOST_CONNECTION WARNING  	 The MDM, MDM2 (ID 0f8dcd34388a8c01), has lost connection to the cluster. 
2020-12-03 17:40:45.858 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, MDM2 (ID 0f8dcd34388a8c01), connected after 310ms 
2020-12-03 17:40:46.967 MDM_CLUSTER_LOST_CONNECTION WARNING  	 The MDM, MDM2 (ID 0f8dcd34388a8c01), has lost connection to the cluster. 
2020-12-03 17:40:47.268 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, MDM2 (ID 0f8dcd34388a8c01), connected after 300ms 
2020-12-03 17:40:48.413 MDM_CLUSTER_LOST_CONNECTION WARNING  	 The MDM, MDM2 (ID 0f8dcd34388a8c01), has lost connection to the cluster. 
2020-12-03 17:40:48.625 MDM_CLUSTER_NOT_RESPOND   WARNING  	 The MDM, MDM3 (ID 551526045129a502), is not responding 
2020-12-03 17:40:48.811 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, MDM2 (ID 0f8dcd34388a8c01), connected after 400ms 
2020-12-03 17:40:49.866 MDM_CLUSTER_LOST_CONNECTION WARNING  	 The MDM, MDM2 (ID 0f8dcd34388a8c01), has lost connection to the cluster. 
2020-12-03 17:40:50.160 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, MDM2 (ID 0f8dcd34388a8c01), connected after 300ms 
2020-12-03 17:40:51.208 MDM_CLUSTER_LOST_CONNECTION WARNING  	 The MDM, MDM2 (ID 0f8dcd34388a8c01), has lost connection to the cluster. 
2020-12-03 17:40:51.520 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, MDM2 (ID 0f8dcd34388a8c01), connected after 310ms 
2020-12-03 17:40:52.603 MDM_CLUSTER_LOST_CONNECTION WARNING  	 The MDM, MDM2 (ID 0f8dcd34388a8c01), has lost connection to the cluster. 
2020-12-03 17:40:53.407 MDM_CLUSTER_BECOMING_MASTER WARNING  	 This MDM, MDM1 (ID 30deabdb5ddf2a00), took control of the cluster and is now the Master MDM. 
2020-12-03 17:40:53.619 MDM_CLUSTER_NODE_DEGRADED ERROR    	 MDM cluster node is now DEGRADED and is in offline node MDM3 (ID 551526045129a502); IPs: [10.180.88.5], Port: 9011 . 
2020-12-03 17:40:53.619 MDM_CLUSTER_NODE_DEGRADED ERROR    	 MDM cluster node is now DEGRADED and is in offline node MDM2 (ID 0f8dcd34388a8c01); IPs: [10.180.88.69], Port: 9011 . 
2020-12-03 17:40:54.622 REMOTE_SYSLOG_MODULE_INITIALIZED INFO     	 Initialized the remote syslog module 
2020-12-03 17:40:54.622 MDM_MANAGER_START         INFO     	 MDM started with the role of Manager 
2020-12-03 17:40:54.743 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, TB2 (ID 7d5c0c020abdea04), connected after 0ms 
2020-12-03 17:40:54.842 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, TB1 (ID 4207e70f08980503), connected after 0ms 
2020-12-03 17:40:54.943 MDM_CLUSTER_BECOMING_MASTER WARNING  	 This MDM, MDM1 (ID 30deabdb5ddf2a00), took control of the cluster and is now the Master MDM. 
2020-12-03 17:40:55.144 MDM_CLUSTER_NODE_DEGRADED ERROR    	 MDM cluster node is now DEGRADED and is in offline node MDM3 (ID 551526045129a502); IPs: [10.180.88.5], Port: 9011 . 
2020-12-03 17:40:55.145 MDM_CLUSTER_NODE_DEGRADED ERROR    	 MDM cluster node is now DEGRADED and is in offline node MDM2 (ID 0f8dcd34388a8c01); IPs: [10.180.88.69], Port: 9011 . 
2020-12-03 17:40:56.140 REMOTE_SYSLOG_MODULE_INITIALIZED INFO     	 Initialized the remote syslog module 
2020-12-03 17:40:56.140 MDM_MANAGER_START         INFO     	 MDM started with the role of Manager 
2020-12-03 17:40:56.229 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, TB2 (ID 7d5c0c020abdea04), connected after 0ms 
2020-12-03 17:40:56.327 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, TB1 (ID 4207e70f08980503), connected after 0ms 
2020-12-03 17:40:56.428 MDM_CLUSTER_BECOMING_MASTER WARNING  	 This MDM, MDM1 (ID 30deabdb5ddf2a00), took control of the cluster and is now the Master MDM. 
2020-12-03 17:40:56.629 MDM_CLUSTER_NODE_DEGRADED ERROR    	 MDM cluster node is now DEGRADED and is in offline node MDM3 (ID 551526045129a502); IPs: [10.180.88.5], Port: 9011 . 
2020-12-03 17:40:56.629 MDM_CLUSTER_NODE_DEGRADED ERROR    	 MDM cluster node is now DEGRADED and is in offline node MDM2 (ID 0f8dcd34388a8c01); IPs: [10.180.88.69], Port: 9011 . 
2020-12-03 17:40:57.660 REMOTE_SYSLOG_MODULE_INITIALIZED INFO     	 Initialized the remote syslog module 
2020-12-03 17:40:57.660 MDM_MANAGER_START         INFO     	 MDM started with the role of Manager 
2020-12-03 17:40:57.768 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, TB2 (ID 7d5c0c020abdea04), connected after 0ms 
2020-12-03 17:40:57.869 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, TB1 (ID 4207e70f08980503), connected after 0ms 
2020-12-03 17:40:57.970 MDM_CLUSTER_BECOMING_MASTER WARNING  	 This MDM, MDM1 (ID 30deabdb5ddf2a00), took control of the cluster and is now the Master MDM. 
2020-12-03 17:40:58.171 MDM_CLUSTER_NODE_DEGRADED ERROR    	 MDM cluster node is now DEGRADED and is in offline node MDM3 (ID 551526045129a502); IPs: [10.180.88.5], Port: 9011 . 
2020-12-03 17:40:58.172 MDM_CLUSTER_NODE_DEGRADED ERROR    	 MDM cluster node is now DEGRADED and is in offline node MDM2 (ID 0f8dcd34388a8c01); IPs: [10.180.88.69], Port: 9011 . 
2020-12-03 17:40:59.143 REMOTE_SYSLOG_MODULE_INITIALIZED INFO     	 Initialized the remote syslog module 
2020-12-03 17:40:59.144 MDM_MANAGER_START         INFO     	 MDM started with the role of Manager 
2020-12-03 17:40:59.245 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, TB2 (ID 7d5c0c020abdea04), connected after 0ms 
2020-12-03 17:40:59.353 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, TB1 (ID 4207e70f08980503), connected after 0ms 
2020-12-03 17:40:59.454 MDM_CLUSTER_BECOMING_MASTER WARNING  	 This MDM, MDM1 (ID 30deabdb5ddf2a00), took control of the cluster and is now the Master MDM. 
2020-12-03 17:40:59.655 MDM_CLUSTER_NODE_DEGRADED ERROR    	 MDM cluster node is now DEGRADED and is in offline node MDM3 (ID 551526045129a502); IPs: [10.180.88.5], Port: 9011 . 
2020-12-03 17:40:59.655 MDM_CLUSTER_NODE_DEGRADED ERROR    	 MDM cluster node is now DEGRADED and is in offline node MDM2 (ID 0f8dcd34388a8c01); IPs: [10.180.88.69], Port: 9011 . 
2020-12-03 17:41:00.630 REMOTE_SYSLOG_MODULE_INITIALIZED INFO     	 Initialized the remote syslog module 
2020-12-03 17:41:00.630 MDM_MANAGER_START         INFO     	 MDM started with the role of Manager 
2020-12-03 17:41:00.722 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, TB2 (ID 7d5c0c020abdea04), connected after 0ms 
2020-12-03 17:41:00.818 MDM_CLUSTER_CONNECTED     INFO     	 The MDM, TB1 (ID 4207e70f08980503), connected after 0ms 
2020-12-03 17:41:00.919 MDM_CLUSTER_BECOMING_MASTER WARNING  	 This MDM, MDM1 (ID 30deabdb5ddf2a00), took control of the cluster and is now the Master MDM. 
2020-12-03 17:41:01.120 MDM_CLUSTER_NODE_DEGRADED ERROR    	 MDM cluster node is now DEGRADED and is in offline node MDM3 (ID 551526045129a502); IPs: [10.180.88.5], Port: 9011 . 
2020-12-03 17:41:01.121 MDM_CLUSTER_NODE_DEGRADED ERROR    	 MDM cluster node is now DEGRADED and is in offline node MDM2 (ID 0f8dcd34388a8c01); IPs: [10.180.88.69], Port: 9011 . 


2020-12-03 20:37:38.973 REMOTE_SYSLOG_MODULE_INITIALIZED INFO     	 Initialized the remote syslog module 
2020-12-03 20:37:38.973 MDM_MANAGER_START         INFO     	 MDM started with the role of Manager 
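
To gauge how severe the pattern is, the events in an excerpt like the one above can be counted. The following is a minimal Python sketch, not a Dell tool: the function name summarize_events and the regular expressions are assumptions based only on the line layout shown in this article (timestamp, event name, severity, message).

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt in this article:
# <date> <time.ms> <EVENT_NAME> <SEVERITY> <message>
EVENT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(\S+)\s+(\S+)\s+(.*)$"
)
# Extracts the MDM name from messages such as "The MDM, MDM3 (ID ...), has lost ..."
MDM_NAME_RE = re.compile(r"The MDM, (\S+) \(ID ")


def summarize_events(lines):
    """Return (starts, losses): the number of MDM_MANAGER_START events
    and a Counter of MDM_CLUSTER_LOST_CONNECTION events per MDM name."""
    starts = 0
    losses = Counter()
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # skip blank or unparsable lines
        _, event, _, message = match.groups()
        if event == "MDM_MANAGER_START":
            starts += 1
        elif event == "MDM_CLUSTER_LOST_CONNECTION":
            name = MDM_NAME_RE.search(message)
            if name:
                losses[name.group(1)] += 1
    return starts, losses
```

Passing the lines of a saved events log to summarize_events gives a quick view of how many times the MDM process started and which MDMs dropped off the cluster most often.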

The repeated MDM_MANAGER_START events show that the MDM process is restarting many times in a row. This can be caused by severe connectivity problems or by the OS disk responding slowly ("Harden took too long"), as described in the article "Slow writes to OS disk can cause multiple MDM issues."

In journalctl or /var/log/messages, "Started scaleio mdm" appears immediately after the service stops whenever systemd successfully restarts the mdm service. When systemd does not schedule another restart, "repeated too quickly" and "Failed to start" appear instead (as at 17:41:01 below):

Dec  3 17:40:54 RHEL7-1 systemd: mdm.service: main process exited, code=exited, status=255/n/a
Dec  3 17:40:54 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec  3 17:40:54 RHEL7-1 systemd: mdm.service failed.
Dec  3 17:40:54 RHEL7-1 systemd: mdm.service has no holdoff time, scheduling restart.
Dec  3 17:40:54 RHEL7-1 systemd: Stopped scaleio mdm.
Dec  3 17:40:54 RHEL7-1 systemd: Started scaleio mdm.

Dec  3 17:40:55 RHEL7-1 systemd: mdm.service: main process exited, code=exited, status=255/n/a
Dec  3 17:40:55 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec  3 17:40:55 RHEL7-1 systemd: mdm.service failed.
Dec  3 17:40:55 RHEL7-1 systemd: mdm.service has no holdoff time, scheduling restart.
Dec  3 17:40:55 RHEL7-1 systemd: Stopped scaleio mdm.
Dec  3 17:40:55 RHEL7-1 systemd: Started scaleio mdm.

Dec  3 17:40:57 RHEL7-1 systemd: mdm.service: main process exited, code=exited, status=255/n/a
Dec  3 17:40:57 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec  3 17:40:57 RHEL7-1 systemd: mdm.service failed.
Dec  3 17:40:57 RHEL7-1 systemd: mdm.service has no holdoff time, scheduling restart.
Dec  3 17:40:57 RHEL7-1 systemd: Stopped scaleio mdm.
Dec  3 17:40:57 RHEL7-1 systemd: Started scaleio mdm.


Dec  3 17:40:58 RHEL7-1 systemd: mdm.service: main process exited, code=exited, status=255/n/a
Dec  3 17:40:58 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec  3 17:40:58 RHEL7-1 systemd: mdm.service failed.
Dec  3 17:40:58 RHEL7-1 systemd: mdm.service has no holdoff time, scheduling restart.
Dec  3 17:40:58 RHEL7-1 systemd: Stopped scaleio mdm.
Dec  3 17:40:58 RHEL7-1 systemd: Started scaleio mdm.


Dec  3 17:41:00 RHEL7-1 systemd: mdm.service: main process exited, code=exited, status=255/n/a
Dec  3 17:41:00 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec  3 17:41:00 RHEL7-1 systemd: mdm.service failed.
Dec  3 17:41:00 RHEL7-1 systemd: mdm.service has no holdoff time, scheduling restart.
Dec  3 17:41:00 RHEL7-1 systemd: Stopped scaleio mdm.
Dec  3 17:41:00 RHEL7-1 systemd: Started scaleio mdm.


Dec  3 17:41:01 RHEL7-1 systemd: mdm.service: main process exited, code=exited, status=255/n/a
Dec  3 17:41:01 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec  3 17:41:01 RHEL7-1 systemd: mdm.service failed.
Dec  3 17:41:01 RHEL7-1 systemd: mdm.service has no holdoff time, scheduling restart.
Dec  3 17:41:01 RHEL7-1 systemd: Stopped scaleio mdm.
Dec  3 17:41:01 RHEL7-1 systemd: start request repeated too quickly for mdm.service
Dec  3 17:41:01 RHEL7-1 systemd: Failed to start scaleio mdm.
Dec  3 17:41:01 RHEL7-1 systemd: Unit mdm.service entered failed state.
Dec  3 17:41:01 RHEL7-1 systemd: mdm.service failed.

In short, the following lines mark the point at which systemd stopped bringing the MDM service back up:

systemd: start request repeated too quickly for mdm.service
systemd: Failed to start scaleio mdm.
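
Searching for these lines can be scripted. The sketch below is a hedged Python example: the two marker strings are copied from the log excerpt in this article, while the function name find_give_up is hypothetical.

```python
# Marker strings copied from the systemd messages shown in this article.
STARTED_MARKER = "Started scaleio mdm."
GIVE_UP_MARKER = "start request repeated too quickly for mdm.service"


def find_give_up(lines):
    """Scan syslog-style lines in order.

    Returns (restarts, give_up_line), where restarts is the number of
    successful "Started scaleio mdm." messages seen before systemd gave up,
    and give_up_line is the first line containing the give-up marker
    (None if systemd never refused a start in the given lines)."""
    restarts = 0
    for line in lines:
        if STARTED_MARKER in line:
            restarts += 1
        elif GIVE_UP_MARKER in line:
            return restarts, line.rstrip("\n")
    return restarts, None
```

The function accepts any iterable of lines, for example an open file handle on /var/log/messages or saved journalctl output.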

Impact
Data Unavailable.

 

Cause

When an MDM yields the primary role, it deliberately restarts its own process (a planned crash).
systemd on RHEL/CentOS 6.x and 7.x limits the number of times a service can be started within a given time frame.
If the MDM process exits and restarts often enough within that time frame, systemd refuses further start requests and the service stays down.
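
The effect of such a limit can be illustrated with a simplified counting-window model in Python. This is a sketch under stated assumptions, not the systemd implementation: the function name first_refused_start is hypothetical, and the burst of 5 starts per 10-second interval reflects systemd's documented defaults (the StartLimitBurst= and StartLimitInterval= settings). This article does not state which values mdm.service actually uses.

```python
from datetime import datetime, timedelta


def first_refused_start(start_times, burst=5, interval=timedelta(seconds=10)):
    """Return the first start attempt that the limiter would refuse, or None.

    Simplified model: a window opens at the first start attempt, at most
    `burst` starts are allowed inside it, and a new window opens once an
    attempt arrives more than `interval` after the window began.
    burst and interval are ASSUMED values, not read from mdm.service."""
    window_begin = None
    count = 0
    for attempt in start_times:
        if window_begin is None or attempt - window_begin > interval:
            window_begin, count = attempt, 0  # open a fresh window
        if count >= burst:
            return attempt  # limit reached: this start is refused
        count += 1
    return None


# Start attempts taken from the systemd excerpt in this article
# (17:40:54, :55, :57, :58, 17:41:00, 17:41:01).
base = datetime(2020, 12, 3, 17, 40, 54)
attempts = [base + timedelta(seconds=s) for s in (0, 1, 3, 4, 6, 7)]
refused = first_refused_start(attempts)
```

Under these assumed values, the sixth attempt falls inside the same 10-second window as the first five, so refused is the 17:41:01 attempt, which matches the point where "start request repeated too quickly" appears in the excerpt.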

 

Resolution

Workaround

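  1. Stabilize the MDM cluster by resolving the underlying cause of the restarts (for example, the connectivity or OS disk problems described above).
  2. If this is not possible, engage SIO L3 for assistance and cite this article.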

Impacted Versions
RHEL/CentOS 6.x & 7.x

Fixed In Version
TBD

 

Affected Products

PowerFlex rack, ScaleIO
Article Properties
Article Number: 000205794
Article Type: Solution
Last Modified: 29 Dhu al-Qi'dah 1447 AH
Version:  5