PowerFlex Adding Nodes To An Existing Cluster May Cause Network Instability
Summary: After new nodes are added to an existing cluster, frequent disconnects are observed.
Symptoms
Scenario
Widespread connectivity issues may be observed when new nodes are added to the network. Connectivity issues may also be seen outside of PowerFlex.
Symptoms
Symptoms vary, and a site may show one or more of the following:
The customer may be unable to ping certain IP addresses from one host but may be able to do so from a different host.
The customer may not be able to SSH into various network devices (such as their switch).
The server may not be able to reach external services such as rsyslog.
There may be socket allocation errors and disconnections reported by the SDCs:
2022-03-28 15:55:14.310000:4507407:NEW_SDC_CONNECTED WARNING New SDC connected. ID: c39eb4b2000000ca; IP: 192.168.192.239,192.168.176.239,192.168.168.239,192.168.184.239; GUID: 23D8FFD8-E3E3-4A68-9459-D6AFCB1480DF
...
2022-03-28 15:55:17.439000:4507422:OSCILLATION_COUNTER_PASSED_THRESHOLD WARNING SDC (Name: , ID: c39eb4b2000000ca) reports frequently exceeded socket allocation failures. Short window threshold (300 socket allocation failures in 60 seconds).
2022-03-28 15:55:17.439000:4507423:OSCILLATION_COUNTER_PASSED_THRESHOLD WARNING SDC (Name: , ID: c39eb4b2000000ca) reports frequently exceeded socket allocation failures. Medium window threshold (500 socket allocation failures in 3600 seconds).
...
2022-03-28 16:19:29.708000:4507849:SDC_DISCONNECTED WARNING SDC disconnected. ID: c39eb4b2000000ca; GUID: 23D8FFD8-E3E3-4A68-9459-D6AFCB1480DF
There will likely be SDS decouple events along with connectivity health warnings:
2022-03-28 15:55:28.223000:4507431:SDS_DECOUPLED ERROR SDS: Sds-pdc1lfisnxbsn01.fnfis.com (id: b0abeb1300000000) decoupled.
2022-03-28 15:55:34.488000:4507433:CMATRIX_POLICY_BECAME_WORSE WARNING The inter-SDS connectivity health for Protection Domain ID: 7a1f601000000000 has changed from HEALTHY to REBUILD_ALLOWED
Stability is restored when the newly added nodes are powered off.
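To gauge how often the events quoted above occur, they can be tallied from an exported event log. The following is a minimal sketch: the function name count_instability_events is hypothetical, and the path to the event log is left to the caller because this article does not state where the events are stored.

```shell
# Tally the event types quoted in this article from an exported event log.
# Argument: path to a text file containing the events (supplied by the
# caller; no default location is assumed).
count_instability_events() {
    events_file="$1"
    # -o prints only the matched event name, so each event line yields one token.
    grep -oE 'SDS_DECOUPLED|SDC_DISCONNECTED|OSCILLATION_COUNTER_PASSED_THRESHOLD|CMATRIX_POLICY_BECAME_WORSE' "$events_file" \
        | sort | uniq -c
}
```

Each output line is a count followed by an event name. A burst of these events that begins shortly after the new nodes came online is consistent with the scenario described here.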
Impact
Data unavailability (DU) can occur if the behavior is severe. More commonly, SDSs decouple and the resulting rebuilds complete successfully.
Cause
With the additional nodes on the network, the number of ARP entries exceeds the maximum configured on the server.
To determine the current ARP settings, run the following command:
sysctl -a | grep "net.ipv4.neigh.default.gc_thresh"
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024
This information will also be in sysctl.txt in the log bundle:
$ grep "net.ipv4.neigh.default.gc_thresh" /path/to/server_directory/sysctl.txt
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024
To determine how many ARP entries are currently in use, run the following command:
arp -a | grep ether | wc -l
If the number of ARP entries is near (or above) the gc_thresh2 value, this issue is the likely cause.
This information will also be in arp_av.txt in the log bundle:
$ grep "ether" /path/to/server_directory/arp_av.txt | wc -l
565
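The comparison described above can be sketched as a small helper. In the Linux kernel, gc_thresh2 acts as a soft limit above which garbage collection becomes more aggressive, and gc_thresh3 is the hard maximum. The function name classify_arp_usage and the 80% "near" margin are assumptions made for illustration, not values taken from PowerFlex documentation.

```shell
# Classify an ARP entry count against the kernel neighbor-table thresholds.
# Arguments: <entry_count> <gc_thresh2> <gc_thresh3>
classify_arp_usage() {
    count="$1"; soft="$2"; hard="$3"
    if [ "$count" -ge "$hard" ]; then
        echo "CRITICAL: $count entries, at or above gc_thresh3 ($hard)"
    elif [ "$count" -ge "$soft" ]; then
        echo "WARNING: $count entries, at or above gc_thresh2 ($soft)"
    elif [ "$count" -ge $(( soft * 80 / 100 )) ]; then
        # Assumption: "near" means within 20% of the soft limit.
        echo "NOTICE: $count entries, within 20% of gc_thresh2 ($soft)"
    else
        echo "OK: $count entries, gc_thresh2 is $soft"
    fi
}

# Usage on a live host (not executed here):
#   classify_arp_usage "$(arp -a | grep -c ether)" \
#       "$(sysctl -n net.ipv4.neigh.default.gc_thresh2)" \
#       "$(sysctl -n net.ipv4.neigh.default.gc_thresh3)"
```

With the log bundle values shown above (565 entries, gc_thresh2 = 512, gc_thresh3 = 1024), classify_arp_usage 565 512 1024 reports a WARNING, which matches the condition this article describes.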
Resolution
Workaround
Increase the ARP threshold.
- PxFM version 3.7.1 and later will automatically update the settings (this change was implemented to support 2,000 SDCs with V4.0+).
- The settings below are tested and supported for PowerFlex 3.6 and later.
- The changes below should be made on all MDMs, SDSs, and SDCs for consistency.
Update the thresholds in sysctl.conf:
file="/etc/sysctl.conf"
# Keep a copy of the original file before changing it.
cp "$file" "$file".backup
if grep -q "net.ipv4.neigh.default.gc_thresh" "$file"; then
    # Threshold lines already exist: rewrite their values in place.
    sed -i 's/net.ipv4.neigh.default.gc_thresh1 = [0-9]\+/net.ipv4.neigh.default.gc_thresh1 = 8192/' "$file"
    sed -i 's/net.ipv4.neigh.default.gc_thresh2 = [0-9]\+/net.ipv4.neigh.default.gc_thresh2 = 16384/' "$file"
    sed -i 's/net.ipv4.neigh.default.gc_thresh3 = [0-9]\+/net.ipv4.neigh.default.gc_thresh3 = 32768/' "$file"
    sysctl -p
else
    # No threshold lines yet: append all three.
    echo -e "net.ipv4.neigh.default.gc_thresh1 = 8192\nnet.ipv4.neigh.default.gc_thresh2 = 16384\nnet.ipv4.neigh.default.gc_thresh3 = 32768" >> "$file"
    sysctl -p
fi
Here is what the script does:
- It stores the path /etc/sysctl.conf in the file variable.
- It copies sysctl.conf to sysctl.conf.backup before making any changes.
- It uses grep -q to check whether any line containing "net.ipv4.neigh.default.gc_thresh" exists in the file. The -q flag suppresses output, so grep only returns a success or failure exit status.
- If grep succeeds (that is, such a line exists), it uses sed to update the values of gc_thresh1, gc_thresh2, and gc_thresh3 in place to 8192, 16384, and 32768, respectively, and then runs sysctl -p to apply the changes in sysctl.conf.
- If grep fails (that is, no such line exists), it appends the lines "…thresh1 = 8192", "…thresh2 = 16384", and "…thresh3 = 32768" to the end of the file, and then runs sysctl -p to apply the changes in sysctl.conf.
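The script above treats the three thresholds as a group: if only one or two of them are present in the file, the missing ones are not added, and the sed patterns match only lines written with exactly one space on each side of the equals sign. A more defensive variant can handle each key on its own. This is a sketch and not the tested procedure from this article; the function names set_sysctl_key and set_arp_thresholds are hypothetical, and GNU sed is assumed.

```shell
# Set one "key = value" line in a sysctl-style file: rewrite the line if the
# key is already present (any spacing around "="), otherwise append it.
# Commented-out lines are left alone.
set_sysctl_key() {
    target="$1"; key="$2"; value="$3"
    # Escape dots so the key is matched literally in the regular expressions.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${key_re}[[:space:]]*=" "$target"; then
        sed -i -E "s|^[[:space:]]*${key_re}[[:space:]]*=.*|${key} = ${value}|" "$target"
    else
        printf '%s = %s\n' "$key" "$value" >> "$target"
    fi
}

# Apply the three thresholds from this article to the given file.
set_arp_thresholds() {
    conf_file="$1"
    # Create the backup only once so a rerun does not overwrite the original.
    [ -e "$conf_file.backup" ] || cp -p "$conf_file" "$conf_file.backup"
    set_sysctl_key "$conf_file" net.ipv4.neigh.default.gc_thresh1 8192
    set_sysctl_key "$conf_file" net.ipv4.neigh.default.gc_thresh2 16384
    set_sysctl_key "$conf_file" net.ipv4.neigh.default.gc_thresh3 32768
}

# Usage as root (not executed here):
#   set_arp_thresholds /etc/sysctl.conf && sysctl -p
```

Because each key is either rewritten or appended, running the function a second time leaves the file unchanged. The values can then be confirmed with the same sysctl -a | grep command shown in the Cause section.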
Impacted Versions
N/A - The issue is not specific to PowerFlex.
Fixed In Version
N/A
For PxFM customers, see the note about version 3.7.1 in the Workaround section above.