PowerFlex Inability To Elect The Primary MDM Causing DU
Summary: In specific cases, an MDM cluster running in DEGRADED mode while experiencing network disconnections may be unable to elect a Primary MDM, effectively causing a Data Unavailability (DU) condition until the next MDM-MDM disconnection or manual intervention.
Symptoms
Scenario
5-node MDM cluster with a single node not participating in the cluster - the node was manually unconfigured from the cluster and never replaced by a new MDM.
Symptoms
Unable to log in to SCLI or the UI on any of the MDM hosts.
SDCs are reporting IO errors, yet no DATA_FAILED event was reported in the MDM events.
MDM cluster is experiencing network issues causing disconnections between the cluster members.
MDM cluster not fully functional - in this example, "MDM3" was manually unconfigured, did not participate in the cluster, and was never replaced:
Cluster:
Name: SCALEIO, ID: 2e89c0596415569b, Mode: 5_node, State: Degraded, Active: 4/5, Replicas: 2/3
Virtual IP Addresses: 10.106.136.31, 10.106.136.95
Master MDM:
Name: MDM1, ID: 0x73a3e7f71453ec05
IP Addresses: 10.106.136.71, 10.106.136.7, Management IP Addresses: 10.106.202.159, Port: 9011, Virtual IP interfaces: eth1, eth2
Version: 3.0.1200
Actor ID: 0x11df55433c3b4c05, Voter ID: 0x00eb43f04a811905
Certificate Info:
Subject: /GN=MDM/CN=ScaleIO-10-106-202-159/L=Hopkinton/ST=Massachusetts/C=US/O=EMC/OU=ASD
Issuer: /GN=MDM/CN=ScaleIO-10-106-202-159/L=Hopkinton/ST=Massachusetts/C=US/O=EMC/OU=ASD
Valid From: May 28 20:40:36 2021 GMT
Valid To: May 27 21:40:36 2031 GMT
Thumbprint: 1A:EA:CE:BA:C7:A8:D2:3C:87:5D:FD:D6:1C:85:6B:82:5B:18:2D:19
Slave MDMs:
Name: MDM3, ID: 0x6eed06ed18096200
IP Addresses: 10.106.136.91, 10.106.136.27, Management IP Addresses: 10.106.202.179, Port: 9011, Virtual IP interfaces: eth1, eth2
Status: Disconnected, Version: N/A
Actor ID: 0x7ba70699061a9900, Voter ID: 0x4ee9d7976362f000, Replication State: Degraded
(...)
Constant MDM disconnections:
31651 2021-08-24 09:52:41.951 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, TB3 (ID 2e11f17503736304), has lost connection to the cluster.
31653 2021-08-24 09:55:48.900 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, TB3 (ID 2e11f17503736304), has lost connection to the cluster.
31671 2021-08-24 11:32:41.096 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, TB3 (ID 2e11f17503736304), has lost connection to the cluster.
31673 2021-08-24 11:41:14.056 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, TB2 (ID 59a25f8d34a62803), has lost connection to the cluster.
31681 2021-08-24 11:51:48.323 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31685 2021-08-24 12:01:47.822 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31699 2021-08-24 12:26:14.300 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, TB2 (ID 59a25f8d34a62803), has lost connection to the cluster.
31720 2021-08-24 13:21:13.967 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, TB2 (ID 59a25f8d34a62803), has lost connection to the cluster.
31722 2021-08-24 15:16:48.814 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31724 2021-08-24 15:55:48.604 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, TB3 (ID 2e11f17503736304), has lost connection to the cluster.
31725 2021-08-24 15:55:48.605 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, TB2 (ID 59a25f8d34a62803), has lost connection to the cluster.
31726 2021-08-24 15:55:48.616 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31730 2021-08-24 15:55:53.032 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31732 2021-08-24 15:56:48.161 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31734 2021-08-24 16:40:48.521 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31735 2021-08-24 16:40:48.611 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, TB2 (ID 59a25f8d34a62803), has lost connection to the cluster.
31736 2021-08-24 16:40:48.611 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, TB3 (ID 2e11f17503736304), has lost connection to the cluster.
31740 2021-08-24 16:55:48.820 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, TB3 (ID 2e11f17503736304), has lost connection to the cluster.
31742 2021-08-24 17:16:13.856 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, TB2 (ID 59a25f8d34a62803), has lost connection to the cluster.
31744 2021-08-24 17:51:47.737 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
No Master MDM - all SDSs stop serving I/O (SDS trc.x):
24/08 15:16:52.656567 0x7fa5d72e5db8:kalive_IsBlocked:00570: Keep-Alive (KA) is blocked: TRUE
24/08 15:16:52.667527 0x7fa5d2567db8:kalive_ShouldSendKeepAlive:00345: KA aborted because SDS is blocked
24/08 15:16:52.789810 0x7fa5d66ebdb8:raidComb_WaitForRemoteResponse:07474: (combId=7e33000d0318) write to tgtId=122276c50000000c failed (rc=IO_FAULT_BLOCKED)
Impact
DU - unable to access the data stored on PowerFlex storage.
Cause
Constant disconnections in the unsynchronized 5-node (effectively 4-node) cluster caused a tie in the vote for the Master MDM: with an even number of active voters, two candidates can each receive half of the votes, so neither reaches the required majority. A similar issue can occur in a 3-node cluster running with only two active MDMs.
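The voting deadlock described above can be illustrated with a minimal majority-election sketch. This is not PowerFlex's actual election code; the candidate names are borrowed from the cluster output above, and the 2-2 split is a hypothetical example of how four active voters in a 5-node cluster can fail to produce a winner:

```python
def elect(votes, cluster_size):
    """Return the candidate with a strict majority of the configured
    cluster size, or None if no candidate reaches it."""
    majority = cluster_size // 2 + 1  # 3 votes needed in a 5-node cluster
    tally = {}
    for candidate in votes:
        tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, count in tally.items():
        if count >= majority:
            return candidate
    return None

# Four remaining voters split evenly: no candidate has 3 votes, no Primary MDM.
print(elect(["MDM1", "MDM1", "MDM4", "MDM4"], cluster_size=5))  # None

# With the fifth voter restored, an even split is impossible.
print(elect(["MDM1", "MDM1", "MDM1", "MDM4", "MDM4"], cluster_size=5))  # MDM1
```

With an odd number of active voters, every election round must produce a strict majority for someone, which is why the cluster is designed to run with 3 or 5 MDMs.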
Resolution
In this particular case, the cluster healed itself after another MDM disconnection, which broke the voting tie and allowed a Primary MDM to be elected. In a similar scenario, restoring the fifth cluster node (re-adding or replacing the unconfigured MDM) should return the cluster to an odd number of voters and allow a Primary MDM to be elected.
An alternative workaround is to restart the MDM process on one of the tied nodes - this should break the tie, although several restarts may be needed.
Network disconnections should be investigated and fixed.
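Why a restart can break the tie can be shown with the same kind of simplified sketch (again, not PowerFlex internals): a restarted MDM re-joins the cluster and, in this hypothetical model, sides with the other candidate on the next election round, turning the 2-2 split into 3-1:

```python
from collections import Counter

CLUSTER_SIZE = 5
MAJORITY = CLUSTER_SIZE // 2 + 1  # 3 votes needed in a 5-node cluster

def winner(votes):
    """Return the candidate with a strict majority, or None on a tie/split."""
    candidate, count = Counter(votes).most_common(1)[0]
    return candidate if count >= MAJORITY else None

# Deadlocked 2-2 split among the four remaining voters: no Primary MDM.
votes = ["MDM1", "MDM1", "MDM4", "MDM4"]
print(winner(votes))  # None

# One tied MDM is restarted and, on re-join, votes for the other candidate:
# the 3-1 split reaches the 3-of-5 majority and a Primary MDM is elected.
votes[3] = "MDM1"
print(winner(votes))  # MDM1
```

In practice the re-joining MDM's vote is not guaranteed to fall on the other candidate, which matches the note above that several restarts may be needed.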
Impacted Versions
All PowerFlex versions
Fixed In Version
N/A - this is not a software bug; such behavior can affect any type of cluster running with an even number of voting nodes.