PowerFlex 3.5: Peer Disconnection When Using Replication
Summary: After configuring PowerFlex replication, the peer system status is "Decoupled" with the error message "REMOTE_PEER_MDM_DENIED_MESSAGE_AS_NO_WORKING_CLIENT_CONNECTION_TO_THIS_PEER".
Symptoms
This problem can occur immediately after configuring PowerFlex replication, but it can also appear after network changes or when the Master MDM role on either side moves to a specific node.
scli --query_replication_peer_system on one side (SiteA) returns:
query-all-Replication Peer System returned 1 Replication Peer System nodes.
Replication Peer System ID: 045a1aa61167b20f
Replication Peer System internal ID: eef8648500000000
Name: SiteB
State: Decoupled, REMOTE_PEER_MDM_DENIED_MESSAGE_AS_NO_WORKING_CLIENT_CONNECTION_TO_THIS_PEER
IP: 192.168.89.14,192.168.89.13,192.168.89.18
Port: 7611
Version: N/A
SDR-SDR connectivity status: All connected
"netstat" output looks similar to:
tcp        0      0 192.168.86.19:50470    192.168.89.14:7611     ESTABLISHED 36766/mdm-3.5.1100.
tcp        0      0 192.168.86.19:50464    192.168.89.14:7611     ESTABLISHED 36766/mdm-3.5.1100.
tcp        0      0 192.168.86.19:50216    192.168.89.14:7611     ESTABLISHED 36766/mdm-3.5.1100.
tcp        0      0 192.168.86.19:50458    192.168.89.14:7611     ESTABLISHED 36766/mdm-3.5.1100.
Notice that there are four outgoing connections to port 7611 on the peer MDM, but there are no incoming connections from SiteB to port 7611 on the local host.
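To separate the outgoing peer connections from incoming ones on the Master MDM, the socket list can be filtered by port. This is a minimal sketch using ss (netstat piped through grep works equally well); the port is the peer port from the example above:

# Outgoing connections from the local MDM to the peer's port 7611
ss -tnp state established '( dport = :7611 )'

# Incoming connections from the peer to the local MDM's port 7611
ss -tnp state established '( sport = :7611 )'

In the failing state described here, the first command returns the four sessions while the second returns nothing.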
The other side (SiteB) shows up as Decoupled, NOT_CONN, for example:
Query-all-Replication Peer System returned 1 Replication Peer System nodes.
Replication Peer System ID: 0966250f2fae770f
Replication Peer System internal ID: c0f3862b00000000
Name: SiteA
State: Decoupled, NOT_CONN
IP: 192.168.86.20,192.168.86.13,192.168.86.19
Port: 7611
Version: 3.5.1100
SDR-SDR connectivity status: All connected
"netstat" output on this side might look similar to:
B -> A tcp 0 157 192.168.89.14:7611 192.168.86.19:50470 ESTABLISHED 446371/mdm-3.5.1100 tcp 0 0 192.168.89.14:7611 192.168.86.19:50216 ESTABLISHED 446371/mdm-3.5.1100 tcp 0 0 192.168.89.14:7611 192.168.86.19:50464 ESTABLISHED 446371/mdm-3.5.1100 tcp 0 0 192.168.89.14:7611 192.168.86.19:50458 ESTABLISHED 446371/mdm-3.5.1100 tcp 0 0 192.168.89.14:54460 192.168.86.19:7611 SYN_SENT 446371/mdm-3.5.1100 tcp 0 0 192.168.89.14:54456 192.168.86.19:7611 SYN_SENT 446371/mdm-3.5.1100 tcp 0 0 192.168.89.14:54458 192.168.86.19:7611 SYN_SENT 446371/mdm-3.5.1100 tcp 0 0 192.168.89.14:54454 192.168.86.19:7611 SYN_SENT 446371/mdm-3.5.1100
There are incoming connections from SiteA (192.168.86.19) whose source port numbers match the "netstat" output on SiteA, but the outgoing connections from SiteB are stuck in the SYN_SENT state, which means they cannot complete the TCP handshake with SiteA and therefore cannot establish the MDM peering.
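A quick way to confirm that the SiteB Master MDM cannot reach SiteA on the replication port is to attempt a plain TCP connection from it. A minimal sketch using the example SiteA IP (note that nc options vary between netcat implementations):

# Attempt a TCP connection to the SiteA MDM on the peer port
nc -zv -w 5 192.168.86.19 7611

# Alternative without netcat, using bash's /dev/tcp pseudo-device
timeout 5 bash -c 'cat < /dev/null > /dev/tcp/192.168.86.19/7611' && echo open || echo unreachable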
Impact
Replication does not work. Depending on the root cause, it might not work at all, or it might fail only when a specific node becomes Master MDM on one of the sides.
Cause
This problem is caused either by MDM IP address misconfiguration or by network issues between the sites. For example, if SiteA is configured with the correct peer IP addresses but SiteB was configured with IPs that do not belong to the SiteA MDMs, this problem can occur.
Any network connectivity issue between the sites (firewall, routing, and so on) can also produce a similar problem. Other possible causes are duplicate IP addresses on either side (that is, two MDMs running with the same IP) or a network device that intercepts outgoing TCP sessions (a proxy).
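To rule out a simple misconfiguration, compare the peer IP addresses configured on each side with the MDM cluster IPs the other side actually uses. A sketch of the checks (the interface name and the IP in the arping example are placeholders for your environment):

# On each site, list the local MDM cluster members and their IPs
scli --query_cluster

# On each site, list the peer IPs this site has configured for the other site
scli --query_replication_peer_system

# Duplicate address detection: check whether another host answers for a given MDM IP
arping -D -c 3 -I eth0 192.168.89.14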
In this particular case, the SiteB MDM kept TCP sockets open towards one of the SiteA MDMs, but it was not actually connected to that MDM; the sessions were artificially kept alive by one of the routers on the path between the sites. This is how the netstat output looked on both sites:
A -> B tcp 0 0 192.168.86.19:50470 192.168.89.14:7611 ESTABLISHED 36766/mdm-3.5.1100. tcp 0 0 192.168.86.19:50464 192.168.89.14:7611 ESTABLISHED 36766/mdm-3.5.1100. tcp 0 0 192.168.86.19:50216 192.168.89.14:7611 ESTABLISHED 36766/mdm-3.5.1100. tcp 0 0 192.168.86.19:50458 192.168.89.14:7611 ESTABLISHED 36766/mdm-3.5.1100. B -> A tcp 0 0 192.168.89.14:54460 192.168.86.19:7611 ESTABLISHED 446371/mdm-3.5.1100 tcp 0 0 192.168.89.14:54456 192.168.86.19:7611 ESTABLISHED 446371/mdm-3.5.1100 tcp 0 0 192.168.89.14:54458 192.168.86.19:7611 ESTABLISHED 446371/mdm-3.5.1100 tcp 0 0 192.168.89.14:54454 192.168.86.19:7611 ESTABLISHED 446371/mdm-3.5.1100 tcp6 0 157 192.168.89.14:7611 192.168.86.19:50470 ESTABLISHED 446371/mdm-3.5.1100 tcp6 0 0 192.168.89.14:7611 192.168.86.19:50216 ESTABLISHED 446371/mdm-3.5.1100 tcp6 0 0 192.168.89.14:7611 192.168.86.19:50464 ESTABLISHED 446371/mdm-3.5.1100 tcp6 0 0 192.168.89.14:7611 192.168.86.19:50458 ESTABLISHED 446371/mdm-3.5.1100
Notice that SiteB (192.168.89.14) shows four ESTABLISHED connections to the SiteA IP address (192.168.86.19) on port 7611, but no matching incoming connections appear in the "netstat" output on SiteA: a network device on the path kept these TCP sessions alive even though they never reached the SiteA MDM.
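Capturing traffic on port 7611 on both Master MDMs at the same time can confirm such an asymmetry. A minimal sketch (eth0 is a placeholder for the replication-facing interface):

# On the SiteA Master MDM
tcpdump -ni eth0 tcp port 7611 and host 192.168.89.14

# On the SiteB Master MDM
tcpdump -ni eth0 tcp port 7611 and host 192.168.86.19

If SiteB sees its sessions being acknowledged while SiteA captures no matching packets, a device on the path is terminating or proxying the connections.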
Resolution
To resolve the problem: fix the peer MDM IP configuration, test connectivity between the sites on port TCP/7611, and switch Master MDM ownership to different nodes in the cluster and/or restart the MDM service to close the stale sockets. Example commands are sketched below.
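The connectivity test on TCP/7611 can be done with the same nc or /dev/tcp check shown in the Symptoms section. The commands below illustrate the remaining steps; the scli parameter names and the MDM service name can differ between PowerFlex releases, so verify them with scli --help and your deployment before running anything (the MDM name is a placeholder):

# Move the Master MDM role to another manager node in the cluster
scli --switch_mdm_ownership --new_master_mdm_name MDM-2

# Restart the MDM service on the node holding stale sockets, one node at a time
systemctl restart mdm.service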
Impacted Versions
PowerFlex 3.5 and above
Fixed In Version
N/A - not a PowerFlex problem