PowerScale: Nodes split as lldpd is no longer sending heartbeat to the backend Dell switch
Summary: This article describes an issue where the _lldpd process consumes its maximum allowed memory and stops sending Link Layer Discovery Protocol (LLDP) packets to the backend switches.
Symptoms
- It is not recommended to perform a OneFS upgrade to the affected versions (9.5.0.0 to 9.5.0.5).
- RPS or the customer MUST restart all lldpd processes before performing a OneFS upgrade while on an affected version (for example, upgrading from 9.5.0.1 to 9.5.0.7). Follow Option 2 in the Resolution section.
There is a memory leak in lldpd, and a node can split if the process has consumed its maximum allowed memory (1 GB). Once the process reaches that limit, it no longer sends LLDP packets to the backend switches. A Dell-qualified PowerScale backend switch running SmartFabric Services (SFS) must receive heartbeat (LLDP) packets from a node; if three heartbeats are missed, the switch port is removed from its dedicated virtual network, and the node can no longer communicate with the cluster through that path.
During a OneFS upgrade, nodes are rebooted in succession, and each reboot takes several links down and back up. Each of these link events slowly increases the lldpd process's virtual memory usage. It is therefore highly likely that nodes split during an upgrade if the process has not been restarted recently.
This issue can occur during the following scenarios:
- OneFS Upgrades
- Normal cluster operations
The current virtual memory usage can be seen with the following command. The MAXIMUM allowed Resident Set Size (RSS) is 1,048,576 KB (1 GB). RSS is the sixth column of the ps output (the column immediately left of " - "), not counting the node-name prefix that isi_for_array adds.
# isi_for_array -s 'ps aux | grep _lldpd | grep -v grep'
Example output (the ^^^^^^ marks the RSS column):
cl950x-1# isi_for_array -s 'ps aux | grep _lldpd | grep -v grep'
cl950x-1: _lldpd 1483 0.0 3.2 273804 262168 - S 6Aug23 74:25.14 lldpd: no neighbor. (lldpd)
cl950x-1: _lldpd 1492 0.0 3.2 273804 262168 - S 6Aug23 74:31.73 lldpd: no neighbor. (lldpd)
cl950x-2: _lldpd 1483 0.0 2.9 251068 238632 - S 14Aug23 66:19.68 lldpd: no neighbor. (lldpd)
cl950x-2: _lldpd 1492 0.0 2.9 251068 238632 - S 14Aug23 66:24.72 lldpd: no neighbor. (lldpd)
cl950x-3: _lldpd 1483 0.0 2.9 251832 239420 - S 14Aug23 46:25.36 lldpd: no neighbor. (lldpd)
cl950x-3: _lldpd 1492 0.0 2.9 251832 239420 - S 14Aug23 46:32.24 lldpd: no neighbor. (lldpd)
cl950x-4: _lldpd 1487 0.0 3.1 268052 256212 - S 8Aug23 50:25.15 lldpd: no neighbor. (lldpd)
cl950x-4: _lldpd 1496 0.0 3.1 268052 256212 - S 8Aug23 50:36.34 lldpd: no neighbor. (lldpd)
cl950x-5: _lldpd 1483 0.0 3.1 273208 261552 - S 6Aug23 75:41.91 lldpd: no neighbor. (lldpd)
cl950x-5: _lldpd 1492 0.0 3.1 273208 261552 - S 6Aug23 75:35.00 lldpd: no neighbor. (lldpd)
cl950x-6: _lldpd 1482 0.0 3.2 274144 262516 - S 6Aug23 50:49.08 lldpd: no neighbor. (lldpd)
cl950x-6: _lldpd 1492 0.0 3.2 274144 262516 - S 6Aug23 51:02.88 lldpd: no neighbor. (lldpd)
cl950x-7: _lldpd 1483 0.0 3.2 274004 262380 - S 6Aug23 50:51.55 lldpd: no neighbor. (lldpd)
cl950x-7: _lldpd 1492 0.0 3.2 274004 262380 - S 6Aug23 51:03.26 lldpd: no neighbor. (lldpd)
cl950x-8: _lldpd 1483 0.0 2.9 251176 238744 - S 14Aug23 46:40.93 lldpd: no neighbor. (lldpd)
cl950x-8: _lldpd 1492 0.0 2.9 251176 238744 - S 14Aug23 46:49.57 lldpd: no neighbor. (lldpd)
                                     ^^^^^^
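To print just the PID and RSS values per node, a minimal sketch (an assumption for convenience, not from the original article; the awk runs on each node, so $6 is the RSS column of the local ps output, and isi_for_array prefixes each line with the node name):
# isi_for_array -s 'ps aux | grep _lldpd | grep -v grep | awk '"'"'{print $2, $6}'"'"''
Any RSS value approaching 1,048,576 KB indicates the process should be restarted.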
The speed at which the lldpd process consumes memory depends on several factors, which is also why memory usage increases more than normal during OneFS upgrades:
- Network configuration size on the cluster
- Number of subnets created from the network configuration
- Number of network events, such as link down or up
- Recurring reboot events
The amount of time it takes for the _lldpd process to reach the MAXIMUM allowed memory varies from cluster to cluster. However, there is a correlation between network configuration size and time to failure: the more groupnets, subnets, and pools that are configured, the sooner the issue can occur.
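To gauge how large a cluster's network configuration is, the configured objects can be listed with the standard OneFS network CLI (a sketch; exact output formats vary by release):
# isi network groupnets list
# isi network subnets list
# isi network pools list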
Cause
A memory leak in the lldpd process on OneFS 9.5.0.0 through 9.5.0.5. Each network link event grows the process's memory usage until it reaches the 1 GB limit, after which the process stops sending LLDP packets to the backend switches.
Resolution
WARNING: While on an affected version (9.5.0.0 to 9.5.0.5), all lldpd processes MUST be restarted (Option 2 below) before performing any OneFS upgrade.
There are several options to resolve or work around the issue depending on your current scenario:
- Option 1: Upgrade OneFS to 9.5.0.6 or later.
  - Note the warning above regarding restarting lldpd before any upgrade out of the affected versions.
- Option 2: As an immediate temporary workaround, restart the lldpd processes. This requires manual intervention to restart the process across the cluster:
# killall lldpd
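Killing the processes is what this article means by a restart; the daemon is expected to be started again automatically. To run the restart on every node from a single session, a minimal sketch (assuming killall behaves the same on each node):
# isi_for_array -s 'killall lldpd'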
- Option 3: A temporary workaround, after the issue is resolved, is to immediately restart any lldpd processes that are over 500 MB (500,000 KB):
# isi_for_array -s 'ps auxww | grep _lldpd | grep -v grep | awk '"'"'{print $2}'"'"' | while read pid; do procstat -r $pid | grep RSS; done | awk '"'"'{ if ($5 > 500000 && $2 == "lldpd") { command=sprintf("kill %d",$1); system(command); close(command) } }'"'"''
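The one-liner is dense because of the nested quoting that isi_for_array requires. Unrolled for readability, the same logic looks like the sketch below (to run on a single node; the field positions assume the FreeBSD procstat -r "maximum RSS" line, where column 1 is the PID, column 2 the command name, and column 5 the value in KB):
# For each local _lldpd process, kill it when its maximum RSS exceeds ~500 MB.
for pid in $(ps auxww | grep _lldpd | grep -v grep | awk '{print $2}'); do
    # procstat -r reports per-process resource usage; keep only the RSS line.
    procstat -r "$pid" | grep RSS | awk '
        $2 == "lldpd" && $5 > 500000 {       # value is in KB; 500000 KB ~ 500 MB
            system(sprintf("kill %d", $1))   # column 1 is the PID
        }'
done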
- Option 4: The same check can be run in a loop (the following command wraps the Option 3 command) inside a screen session so that it repeats every 1200 seconds (20 minutes):
# while true; do isi_for_array -s 'ps auxww | grep _lldpd | grep -v grep | awk '"'"'{print $2}'"'"' | while read pid; do procstat -r $pid | grep RSS; done | awk '"'"'{ if ($5 > 500000 && $2 == "lldpd") { command=sprintf("kill %d",$1); system(command); close(command) } }'"'"''; sleep 1200; done
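A brief usage sketch for keeping the loop running after logging out (assuming the screen utility is available on the node; the session name lldpd-watch is arbitrary):
# screen -S lldpd-watch
# (run the while loop above inside the session, then detach with Ctrl-A d)
# screen -r lldpd-watch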