PowerScale: Nodes split as lldpd is no longer sending heartbeat to the backend Dell switch
Summary: This article describes an issue where the _lldpd process consumes its maximum allowed memory and stops sending Link Layer Discovery Protocol (LLDP) packets to the backend switches.
Symptoms
- It is not recommended to perform a OneFS upgrade to the affected versions (9.5.0.0 to 9.5.0.5).
- RPS or the customer MUST restart all lldpd processes before performing a OneFS upgrade while on an affected version (for example, upgrading from 9.5.0.1 to 9.5.0.7). Follow Option 2 in the Resolution section.
There is a memory leak in lldpd, and a node can split if the process has consumed its maximum allowed memory (1 GB). Once the process reaches that limit, it no longer sends LLDP packets to the backend switches. A Dell-qualified PowerScale backend switch running SmartFabric Services (SFS) must receive heartbeat (LLDP) packets from a node; if three heartbeats are missed, the switch port is removed from its dedicated virtual network, and the node can no longer communicate with the cluster through that path.
During a OneFS upgrade, nodes are rebooted in succession, and each reboot takes several links down and back up. Each of these link events slowly increases the lldpd process's virtual memory usage. It is therefore highly likely that nodes split during an upgrade if the process has not been restarted recently.
This issue can occur during the following scenarios:
- OneFS Upgrades
- Normal cluster operations
The current virtual memory usage can be seen with the following command. The MAXIMUM allowed Resident Set Size (RSS) is 1,048,576 KB (1 GB). RSS is the sixth column of the ps output (the column immediately left of " - "), not counting the node-name prefix that isi_for_array adds.
# isi_for_array -s 'ps aux | grep _lldpd | grep -v grep'
Example output (the ^^^^^^ marks the RSS column):
cl950x-1# isi_for_array -s 'ps aux | grep _lldpd | grep -v grep'
cl950x-1: _lldpd 1483 0.0 3.2 273804 262168 - S 6Aug23 74:25.14 lldpd: no neighbor. (lldpd)
cl950x-1: _lldpd 1492 0.0 3.2 273804 262168 - S 6Aug23 74:31.73 lldpd: no neighbor. (lldpd)
cl950x-2: _lldpd 1483 0.0 2.9 251068 238632 - S 14Aug23 66:19.68 lldpd: no neighbor. (lldpd)
cl950x-2: _lldpd 1492 0.0 2.9 251068 238632 - S 14Aug23 66:24.72 lldpd: no neighbor. (lldpd)
cl950x-3: _lldpd 1483 0.0 2.9 251832 239420 - S 14Aug23 46:25.36 lldpd: no neighbor. (lldpd)
cl950x-3: _lldpd 1492 0.0 2.9 251832 239420 - S 14Aug23 46:32.24 lldpd: no neighbor. (lldpd)
cl950x-4: _lldpd 1487 0.0 3.1 268052 256212 - S 8Aug23 50:25.15 lldpd: no neighbor. (lldpd)
cl950x-4: _lldpd 1496 0.0 3.1 268052 256212 - S 8Aug23 50:36.34 lldpd: no neighbor. (lldpd)
cl950x-5: _lldpd 1483 0.0 3.1 273208 261552 - S 6Aug23 75:41.91 lldpd: no neighbor. (lldpd)
cl950x-5: _lldpd 1492 0.0 3.1 273208 261552 - S 6Aug23 75:35.00 lldpd: no neighbor. (lldpd)
cl950x-6: _lldpd 1482 0.0 3.2 274144 262516 - S 6Aug23 50:49.08 lldpd: no neighbor. (lldpd)
cl950x-6: _lldpd 1492 0.0 3.2 274144 262516 - S 6Aug23 51:02.88 lldpd: no neighbor. (lldpd)
cl950x-7: _lldpd 1483 0.0 3.2 274004 262380 - S 6Aug23 50:51.55 lldpd: no neighbor. (lldpd)
cl950x-7: _lldpd 1492 0.0 3.2 274004 262380 - S 6Aug23 51:03.26 lldpd: no neighbor. (lldpd)
cl950x-8: _lldpd 1483 0.0 2.9 251176 238744 - S 14Aug23 46:40.93 lldpd: no neighbor. (lldpd)
cl950x-8: _lldpd 1492 0.0 2.9 251176 238744 - S 14Aug23 46:49.57 lldpd: no neighbor. (lldpd)
                                     ^^^^^^
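To print just the PID and RSS values per node, a minimal sketch (an assumption for convenience, not from the original article; the awk runs on each node, so $6 is the RSS column of the local ps output, and isi_for_array prefixes each line with the node name):
# isi_for_array -s 'ps aux | grep _lldpd | grep -v grep | awk '"'"'{print $2, $6}'"'"''
Any RSS value approaching 1,048,576 KB indicates the process should be restarted.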
The speed at which the lldpd process consumes memory depends on several factors, which is also why memory usage increases more than normal during OneFS upgrades:
- Network configuration size on the cluster
- Number of subnets created from the network configuration
- Number of network events, such as link down or up
- Recurring reboot events
The amount of time it takes for the _lldpd process to reach the MAXIMUM allowed memory varies from cluster to cluster. However, there is a correlation between network configuration size and time to failure: the more groupnets, subnets, and pools that are configured, the sooner the issue can occur.
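To gauge how large a cluster's network configuration is, the configured objects can be listed with the standard OneFS network CLI (a sketch; exact output formats vary by release):
# isi network groupnets list
# isi network subnets list
# isi network pools list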
Cause
A memory leak in the lldpd process on OneFS 9.5.0.0 through 9.5.0.5. Each network link event grows the process's memory usage until it reaches the 1 GB limit, after which the process stops sending LLDP packets to the backend switches.
Resolution
WARNING: While on an affected version (9.5.0.0 to 9.5.0.5), all lldpd processes MUST be restarted (Option 2 below) before performing any OneFS upgrade.
There are several options to resolve or work around the issue depending on your current scenario:
- Option 1: Upgrade OneFS to 9.5.0.6 or later.
  - Note the warning above regarding restarting lldpd before any upgrade out of the affected versions.
- Option 2: As an immediate temporary workaround, restart the lldpd processes. This requires manual intervention to restart the process across the cluster:
# killall lldpd
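Killing the processes is what this article means by a restart; the daemon is expected to be started again automatically. To run the restart on every node from a single session, a minimal sketch (assuming killall behaves the same on each node):
# isi_for_array -s 'killall lldpd'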
- Option 3: A temporary workaround, after the issue is resolved, is to immediately restart any lldpd processes that are over 500 MB (500,000 KB):
# isi_for_array -s 'ps auxww | grep _lldpd | grep -v grep | awk '"'"'{print $2}'"'"' | while read pid; do procstat -r $pid | grep RSS; done | awk '"'"'{ if ($5 > 500000 && $2 == "lldpd") { command=sprintf("kill %d",$1); system(command); close(command) } }'"'"''
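The one-liner is dense because of the nested quoting that isi_for_array requires. Unrolled for readability, the same logic looks like the sketch below (to run on a single node; the field positions assume the FreeBSD procstat -r "maximum RSS" line, where column 1 is the PID, column 2 the command name, and column 5 the value in KB):
# For each local _lldpd process, kill it when its maximum RSS exceeds ~500 MB.
for pid in $(ps auxww | grep _lldpd | grep -v grep | awk '{print $2}'); do
    # procstat -r reports per-process resource usage; keep only the RSS line.
    procstat -r "$pid" | grep RSS | awk '
        $2 == "lldpd" && $5 > 500000 {       # value is in KB; 500000 KB ~ 500 MB
            system(sprintf("kill %d", $1))   # column 1 is the PID
        }'
done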
- Option 4: The same check can be run in a loop (the following command wraps the Option 3 command) inside a screen session so that it repeats every 1200 seconds (20 minutes):
# while true; do isi_for_array -s 'ps auxww | grep _lldpd | grep -v grep | awk '"'"'{print $2}'"'"' | while read pid; do procstat -r $pid | grep RSS; done | awk '"'"'{ if ($5 > 500000 && $2 == "lldpd") { command=sprintf("kill %d",$1); system(command); close(command) } }'"'"''; sleep 1200; done
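A brief usage sketch for keeping the loop running after logging out (assuming the screen utility is available on the node; the session name lldpd-watch is arbitrary):
# screen -S lldpd-watch
# (run the while loop above inside the session, then detach with Ctrl-A d)
# screen -r lldpd-watch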