PowerFlex SDS Getting Sockets Closed With No Network Issues

Summary: SDS reports sockets closed without any network events or evidence of networking issues.


Symptoms

Scenario
SDS reports sockets closed by remote processes (SDCs, SDSs, MDMs), with no observed NIC down events, dropped frames, or packet loss.

Symptoms
The event log reports SDS connectivity loss, either a decouple or a reconnect:

2017-11-11 16:52:12.101 SDS_RECONNECTED           INFO     	 SDS: xyz_d35 (ID 67211111110089) reconnected 
2017-11-11 16:52:13.690 MDM_DATA_FAILED           CRITICAL 	 The system is now in DATA FAILURE state. Some data is unavailable. 
2017-11-11 16:52:15.791 MDM_DATA_DEGRADED         ERROR    	 The system is now in DEGRADED state. 

Before that, we see errors like the following in the SDS traces.

The SDS tried to send, and the RPC lingered for more than 1 s without a response:

11/11 16:52:04.527408 0x7ff0b19eaeb0:contNet_OscillationNotif:01720: Con 672cb111110099 - Oscillation of type 5 (RPC_LINGERED_1SEC) reported

A socket with a peer was closed from the other side because the peer did not receive our lower-level keepalives:

11/11 16:52:06.241105 0x7ff0b19e1eb0:contNet_OscillationNotif:01720: Con a71d2b5d00000078 - Oscillation of type 1 (SOCKET_DOWN) reported 

Another instance of this looks like:

11/11 16:52:06.241224 0x7ff0b19e1eb0:contNet_OscillationNotif:01720: Con a71d2b3c00000057 - Oscillation of type 2 (IO_ERROR) reported

Other indicators:
Iterations
ScaleIO's lower-level networking keepalive timer is measured in iterations, which are 100 milliseconds long.

The MDM->SDS timeout is 20 iterations (2 s), while the MDM-MDM keepalive timeout is 3 iterations (300 ms).

Twenty iterations exceeded:

11/11 16:52:11.685281 0x7ff752d1beb0:netPath_IsKaNeeded:01858:  :: Connected Live CLIENT path 0x7ff6e2192a00 of portal 0x7ff6e2192900 net 0x7ff7480e1110 socket 210 inflights 0 didn't receive message for 20 iterations from 10.124.162.109:7072. Marking as down  
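A quick way to find these events across a trace bundle is to grep for the iteration-timeout message. A minimal sketch; the heredoc holds the sample line from this article, and against a live system you would pipe the actual SDS/MDM trace files into the same grep (any on-disk trace path is site-specific, so none is assumed here):

```shell
# Count lower-level keepalive timeouts in trace output. The heredoc stands
# in for real trace files.
timeouts=$(grep -cE "didn't receive message for [0-9]+ iterations" <<'EOF'
11/11 16:52:11.685281 0x7ff752d1beb0:netPath_IsKaNeeded:01858:  :: Connected Live CLIENT path 0x7ff6e2192a00 of portal 0x7ff6e2192900 net 0x7ff7480e1110 socket 210 inflights 0 didn't receive message for 20 iterations from 10.124.162.109:7072. Marking as down
EOF
)
echo "keepalive timeouts found: $timeouts"
```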

Sockets down
These trace prints indicate when the sockets went down:

11/11 16:52:09.787793 0x7ff752cf7eb0:tgtMgr_TgtOscCB:07696: Con 672cba7400000089 Network address 10.124.130.109 - Oscillation of type SOCKET_DOWN reported

11/11 16:52:11.685290 0x7ff752d1beb0:tgtMgr_TgtOscCB:07696: Con 672cba7400000089 Network address 10.124.162.109 - Oscillation of type RCV_KA_DISCONNECT reported
11/11 16:52:11.685308 0x7ff752cf7eb0:tgtMgr_TgtOscCB:07696: Con 672cba7400000089 Network address 10.124.162.109 - Oscillation of type SOCKET_DOWN reported

This print means that the last socket to an SDS went down and is the point at which the SDS is considered disconnected:

11/11 16:52:11.685319 0x7ff752cf7eb0:tgtMgr_TgtDisconnectCB:07818: Tgt: 672cba7400000089 ConId: 672cba7400000089

The MDM will issue an addmdm command to get the SDS to reconnect. 

IO Fault Blocked
IO_FAULT_BLOCKED is commonly described as the SDS refusing IO because it cannot reach the MDM, but that description is incomplete.

The SDS sends keepalives to the MDM every second; if the MDM does not receive one for 5 s, the SDS is considered timed out and is marked as decoupled.

In the other direction, the MDM sends the SDS a "keep working" message every second. When the SDS does not receive this message for 5 s, it refuses IO with IO_FAULT_BLOCKED:

11/11 16:52:12.007045 0x7ff0b0cdfeb0:ioh_NewRequest:05490: Write to comb f778038007f - Done rc is IO_FAULT_BLOCKED (Lba 6721528 8), volume 6e1a2f4a0000075d (dit)
11/11 16:52:12.008825 0x7ff0b0ec5eb0:ioh_NewRequest:05490: Write to comb f78803903fc - Done rc is IO_FAULT_BLOCKED (Lba 5031040 6), volume 6e1a2f4c0000075f (dit)
11/11 16:52:12.017262 0x7ff0b26daeb0:ioh_NewRequest:05490: Write to comb f768037003e - Done rc is IO_FAULT_BLOCKED (Lba 15106144 16), volume 6e1a2f490000075c (dit)
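When triaging, counting blocked writes per volume helps scope the impact. A minimal awk sketch over trace lines like the above; the heredoc holds the three sample lines from this article and stands in for real trace files:

```shell
# Tally IO_FAULT_BLOCKED occurrences per volume ID. The token after the
# literal word "volume" on each line is the volume ID.
blocked=$(awk '/IO_FAULT_BLOCKED/ { for (i = 1; i <= NF; i++) if ($i == "volume") vol[$(i+1)]++ }
  END { for (v in vol) print v, vol[v] }' <<'EOF'
11/11 16:52:12.007045 0x7ff0b0cdfeb0:ioh_NewRequest:05490: Write to comb f778038007f - Done rc is IO_FAULT_BLOCKED (Lba 6721528 8), volume 6e1a2f4a0000075d (dit)
11/11 16:52:12.008825 0x7ff0b0ec5eb0:ioh_NewRequest:05490: Write to comb f78803903fc - Done rc is IO_FAULT_BLOCKED (Lba 5031040 6), volume 6e1a2f4c0000075f (dit)
11/11 16:52:12.017262 0x7ff0b26daeb0:ioh_NewRequest:05490: Write to comb f768037003e - Done rc is IO_FAULT_BLOCKED (Lba 15106144 16), volume 6e1a2f490000075c (dit)
EOF
)
echo "$blocked"
```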

 

Impact

Loss of SDS connectivity

During a DATA_DEGRADED state or Instant Maintenance Mode, this can cause data unavailability (DU).

Cause

The cause of IO failure in this example case was that the (5 s) SDS lease had not expired, but the (2 s) lower-level network timeout had.

The root cause is one or more of the following:

1- TCP/network issues
A- These typically manifest as TCP retransmits, which indicate hardware or configuration issues (cable, NIC, switch, and so on), as seen in the output of:

sar -n ETCP 1

which produces output like this:

 Linux 3.10.0-693.5.2.el7.x86_64 (SIO-DCOE-96O-3)        12/13/2017      _x86_64_        (48 CPU)

04:33:44 PM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
04:33:45 PM      0.00      0.00     50.00      0.00      0.00
04:33:46 PM      2.00      0.00     75.00      0.00      0.00
04:33:47 PM      0.00      0.00    223.00      0.00      0.00
04:33:48 PM      0.00      0.00    106.00      0.00      0.00
04:33:49 PM      2.00      0.00     58.00      0.00      0.00
04:33:50 PM      0.00      0.00      5.00      0.00      0.00
04:33:51 PM      0.00      0.00      7.00      0.00      0.00
04:33:52 PM      2.00      0.00      2.00      0.00      0.00
04:33:53 PM      0.00      0.00      1.00      0.00      0.00
^C

04:33:53 PM      0.00      0.00      0.00      0.00      0.00
Average:         0.65      0.00     99.00      0.00      0.00

As a rough severity guide for retrans/s:

  • Green = single digits per second
  • Yellow = double digits, up to 50/s
  • Red = more than 50/s
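Under these thresholds, a quick awk filter can flag red-level intervals in `sar -n ETCP 1` output. A minimal sketch; the sample lines below stand in for live sar output (retrans/s is the fifth whitespace-separated field, since the time and AM/PM tokens count as two):

```shell
# Flag sar -n ETCP intervals whose retrans/s exceeds 50 (the "red" level).
sar_sample='04:33:45 PM      0.00      0.00     50.00      0.00      0.00
04:33:46 PM      2.00      0.00     75.00      0.00      0.00
04:33:47 PM      0.00      0.00    223.00      0.00      0.00
04:33:50 PM      0.00      0.00      5.00      0.00      0.00'
red=$(echo "$sar_sample" | awk '$5 > 50 { print $1, "retrans/s =", $5 }')
echo "$red"
```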

B- On older Linux distributions, such as the SUSE 11 SP3 used on our SVMs, monitor for network retransmits with the following command:

watch -d -n 2 "netstat -s |grep retrans"

This outputs as below, with the characters that changed during the last interval highlighted:

Every 2.0s: netstat -s |grep retrans                                                                                                                                                   Wed Dec 13 09:55:10 2017

    1244070 segments retransmited
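To turn that cumulative counter into a rate, take two snapshots and divide by the interval. A sketch of the arithmetic; the first count is the sample value above, and the second count and interval are hypothetical stand-ins for a second `netstat -s` snapshot:

```shell
# Compute retransmits/s from two cumulative "segments retransmited" counters.
# In practice: c1=$(netstat -s | awk '/segments retransmited/ {print $1}'),
# then sleep for the interval and capture c2 the same way.
c1=1244070   # first snapshot (sample value from this article)
c2=1244370   # second snapshot, taken 10 s later (hypothetical value)
interval=10
rate=$(( (c2 - c1) / interval ))
echo "retransmits/s: $rate"
```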

2- CPU issues, process starvation.
This manifests as a simultaneous pause in every ScaleIO component's traces, with many sockets closing in the first trace lines after the process resumes.
The LIA, SDS, MDM/TB traces and the SDC/messages file all show the gap.
The LIA trace, for instance, shows LIA→SDS sockets closing after a 3 s (30 iterations x 100 ms) timeout:

11/11 16:52:11.597227 0x7f44c41c6eb0:netPath_IsKaNeeded:01858:  :: Connected Live SERVER path 0x7f44c4195690 of portal 0x7f44c4192bb0 net 0x83b040 socket 8 inflights 0 didn't receive message for 30 iterations from 127.0.0.1:43228. Marking as down
11/11 16:52:12.031195 0x7f44c419eeb0:liaNet_DisconnectedNotif:01553: Con aed disconnected
11/11 16:52:12.158383 0x7f44c419eeb0:liaNet_ConnectedNotif:01483: Con aed  connected
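A simple way to confirm starvation is to scan a trace for gaps between consecutive timestamps. A minimal awk sketch; the heredoc holds two trace lines from this article (which happen to be about 7 s apart) and stands in for a whole trace file:

```shell
# Flag gaps longer than 2 s between consecutive trace timestamps, a telltale
# sign of process starvation. Timestamps are field 2, as HH:MM:SS.ffffff.
gaps=$(awk '{
  split($2, t, ":")
  now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
  if (NR > 1 && now - prev > 2)
    printf "gap of %.1f s before %s %s\n", now - prev, $1, $2
  prev = now
}' <<'EOF'
11/11 16:52:04.527408 0x7ff0b19eaeb0:contNet_OscillationNotif:01720: Con 672cb111110099 - Oscillation of type 5 (RPC_LINGERED_1SEC) reported
11/11 16:52:11.685281 0x7ff752d1beb0:netPath_IsKaNeeded:01858:  :: Connected Live CLIENT path 0x7ff6e2192a00 of portal 0x7ff6e2192900 net 0x7ff7480e1110 socket 210 inflights 0 didn't receive message for 20 iterations from 10.124.162.109:7072. Marking as down
EOF
)
echo "$gaps"
```

This only works within a single day's worth of trace, since it ignores the date field.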

3- A bug somewhere in ScaleIO; for example, a non-network thread holding a CPU and preventing other threads from running.

Resolution

Workaround

No general workaround exists.

Impacted versions

All

Fixed in version

N/A 

Affected Products

PowerFlex Software, VxFlex Product Family, VxFlex Ready Node, Ready Node Series
Article Properties
Article Number: 000203040
Article Type: Solution
Last Modified: 10 Mar 2025
Version:  4