PowerFlex Inability To Elect The Primary MDM Causing DU

Summary: In specific cases, an MDM cluster running in DEGRADED mode and experiencing network disconnections might be unable to elect the Primary MDM, causing data unavailability (DU) until the next MDM-MDM disconnection or manual intervention.


Symptoms

Scenario
5_node MDM cluster with a single node not participating in the cluster - the node was manually unconfigured from the cluster and not replaced by any new MDM.

Symptoms
Unable to log in to SCLI or the UI on any of the MDM hosts.

SDCs are reporting IO errors, yet no DATA_FAILED event was reported in the MDM events.

MDM cluster is experiencing network issues causing disconnections between the cluster members.

 

Not fully functional MDM cluster - in this example, "MDM3" was manually unconfigured and did not participate in the cluster, but was never replaced:

Cluster:
    Name: SCALEIO, ID: 2e89c0596415569b, Mode: 5_node, State: Degraded, Active: 4/5, Replicas: 2/3
    Virtual IP Addresses: 10.106.136.31, 10.106.136.95
Master MDM:
    Name: MDM1, ID: 0x73a3e7f71453ec05
        IP Addresses: 10.106.136.71, 10.106.136.7, Management IP Addresses: 10.106.202.159, Port: 9011, Virtual IP interfaces: eth1, eth2
        Version: 3.0.1200
        Actor ID: 0x11df55433c3b4c05, Voter ID: 0x00eb43f04a811905
        Certificate Info:
            Subject:    /GN=MDM/CN=ScaleIO-10-106-202-159/L=Hopkinton/ST=Massachusetts/C=US/O=EMC/OU=ASD
            Issuer:     /GN=MDM/CN=ScaleIO-10-106-202-159/L=Hopkinton/ST=Massachusetts/C=US/O=EMC/OU=ASD
            Valid From: May 28 20:40:36 2021 GMT
            Valid To:   May 27 21:40:36 2031 GMT
            Thumbprint: 1A:EA:CE:BA:C7:A8:D2:3C:87:5D:FD:D6:1C:85:6B:82:5B:18:2D:19
Slave MDMs:
    Name: MDM3, ID: 0x6eed06ed18096200
        IP Addresses: 10.106.136.91, 10.106.136.27, Management IP Addresses: 10.106.202.179, Port: 9011, Virtual IP interfaces: eth1, eth2
        Status: Disconnected, Version: N/A
        Actor ID: 0x7ba70699061a9900, Voter ID: 0x4ee9d7976362f000, Replication State: Degraded
(...)

Constant MDM disconnections:

31651 2021-08-24 09:52:41.951 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, TB3 (ID 2e11f17503736304), has lost connection to the cluster.
31653 2021-08-24 09:55:48.900 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, TB3 (ID 2e11f17503736304), has lost connection to the cluster.
31671 2021-08-24 11:32:41.096 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, TB3 (ID 2e11f17503736304), has lost connection to the cluster.
31673 2021-08-24 11:41:14.056 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, TB2 (ID 59a25f8d34a62803), has lost connection to the cluster.
31681 2021-08-24 11:51:48.323 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31685 2021-08-24 12:01:47.822 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31699 2021-08-24 12:26:14.300 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, TB2 (ID 59a25f8d34a62803), has lost connection to the cluster.
31720 2021-08-24 13:21:13.967 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, TB2 (ID 59a25f8d34a62803), has lost connection to the cluster.
31722 2021-08-24 15:16:48.814 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31724 2021-08-24 15:55:48.604 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, TB3 (ID 2e11f17503736304), has lost connection to the cluster.
31725 2021-08-24 15:55:48.605 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, TB2 (ID 59a25f8d34a62803), has lost connection to the cluster.
31726 2021-08-24 15:55:48.616 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31730 2021-08-24 15:55:53.032 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31732 2021-08-24 15:56:48.161 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31734 2021-08-24 16:40:48.521 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.
31735 2021-08-24 16:40:48.611 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, TB2 (ID 59a25f8d34a62803), has lost connection to the cluster.
31736 2021-08-24 16:40:48.611 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, TB3 (ID 2e11f17503736304), has lost connection to the cluster.
31740 2021-08-24 16:55:48.820 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, TB3 (ID 2e11f17503736304), has lost connection to the cluster.
31742 2021-08-24 17:16:13.856 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, TB2 (ID 59a25f8d34a62803), has lost connection to the cluster.
31744 2021-08-24 17:51:47.737 MDM_CLUSTER_LOST_CONNECTION WARNING        The MDM, MDM4 (ID 7ee1ad613bb07901), has lost connection to the cluster.

No Master MDM - all SDSs stop serving IO (SDS trc.x):

24/08 15:16:52.656567 0x7fa5d72e5db8:kalive_IsBlocked:00570: Keep-Alive (KA) is blocked: TRUE
24/08 15:16:52.667527 0x7fa5d2567db8:kalive_ShouldSendKeepAlive:00345: KA aborted because SDS is blocked
24/08 15:16:52.789810 0x7fa5d66ebdb8:raidComb_WaitForRemoteResponse:07474: (combId=7e33000d0318) write to tgtId=122276c50000000c failed (rc=IO_FAULT_BLOCKED)

 

Impact

DU (data unavailability) - unable to access the data stored on PowerFlex storage.

Cause

Constant disconnections in the unsynchronized 5-node (effectively 4-node) cluster caused a tie in the vote for the Master MDM. A similar issue might occur in a 3-node cluster running with only two active members.
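The tie can be illustrated with a simplified, hypothetical majority-vote model (this is not the actual PowerFlex election protocol): a candidate must win a strict majority of the configured voters, so with 5 configured voters but only 4 active, a 2-2 split leaves no winner.

```python
def elect(votes, configured_voters):
    """Return the candidate holding a strict majority of the
    configured voters, or None if no candidate reaches it.
    Simplified model for illustration only."""
    need = configured_voters // 2 + 1  # strict majority
    for candidate, count in votes.items():
        if count >= need:
            return candidate
    return None

# Healthy 5-node cluster: 3 of 5 votes is a majority, a primary is elected.
assert elect({"MDM1": 3, "MDM2": 2}, 5) == "MDM1"

# Degraded cluster: only 4 voters remain, so a 2-2 split can never reach
# the 3-vote majority - no primary is elected (the DU in this article).
assert elect({"MDM1": 2, "MDM2": 2}, 5) is None
```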

Resolution

In this particular case, the cluster healed itself after another MDM disconnection broke the voting tie and allowed a Primary MDM to be elected. In a similar scenario, restoring the fifth cluster node (returning the cluster to an odd number of voters) should allow the Primary MDM to be elected.
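Why restoring the fifth node helps can be shown with a simplified, hypothetical majority-vote model (not the actual PowerFlex election protocol): with an odd number of active voters, any two-way split must give one side a strict majority, so a tie is impossible.

```python
def elect(votes, configured_voters):
    # A candidate wins only with a strict majority of configured voters.
    need = configured_voters // 2 + 1
    winners = [c for c, n in votes.items() if n >= need]
    return winners[0] if winners else None

# With all 5 voters active, every possible two-way split of 5 votes
# gives one candidate at least 3 - an election always succeeds:
for mdm1_votes in range(6):
    result = elect({"MDM1": mdm1_votes, "MDM2": 5 - mdm1_votes}, 5)
    assert result is not None
```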

An alternative workaround is to restart the MDM process on one of the cluster members - this should break the tie, although multiple restarts might be needed.

Network disconnections should be investigated and fixed.

Impacted Versions

All PowerFlex versions

Fixed In Version

N/A - this is not a software bug; such behavior can affect any quorum-based cluster running with an even number of voting members.

Affected Products

PowerFlex rack, VxFlex Product Family
Article Properties
Article Number: 000193828
Article Type: Solution
Last Modified: 14 Apr 2025
Version:  3