S4048-ON VLT and LACP issue while reloading VLT peer

Question

We have a plain setup with a pair of S4048-ON switches in a VLT domain: SA and SB.

From these two switches we have connected a set of LACP port channels towards an upstream C6800 (VSS), ToR switches and F5 BIG-IPs.

Our issues start when we want to upgrade the S4048-ON pair.
While one VLT peer is reloading, we see massive packet drops (around 50%) on the LACP port channels.
The connected LACP devices don't register that the links to the reloading VLT peer are down, and it takes about 1-2 minutes until the port channels converge.

As I am not that familiar with DNOS: how can it be that the S4048 doesn't shut down its interfaces immediately when reloading?

As a workaround, we have tried to manually shut down the interfaces on the VLT peer that we want to reload, and that as expected works like a charm with convergence <1s.
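
For reference, the workaround can be sketched as the DNOS 9 commands below, run on the peer that is about to be reloaded. This is only a minimal sketch: it assumes TenGigabitEthernet 1/47 and 1/48 (the Po51 members in the configuration that follows) are the only LACP member ports, whereas in practice every VLT port-channel member on that peer would need the same treatment. The lines starting with `!` are annotations, not commands.

```
! On the VLT peer that is about to be reloaded
configure
interface TenGigabitEthernet 1/47
 shutdown
interface TenGigabitEthernet 1/48
 shutdown
end
! LACP partners should now see the links go down; then reload this peer
reload
```

Assuming the change is not saved before the reload, the ports should come back up according to the startup configuration once the switch has booted.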

interface Port-channel 51
 description Po51 - ny LACP til cisco
 no ip address
 switchport
 no spanning-tree pvst err-dis cause invalid-pvst-bpdu
 lacp fast-switchover
 vlt-peer-lag port-channel 51
 no shutdown
!
interface TenGigabitEthernet 1/47
 description Po51 - cisco lacp
 no ip address
 mtu 9216
!
 port-channel-protocol LACP
  port-channel 51 mode active
!
interface TenGigabitEthernet 1/48
 description Po51 - cisco lacp
 no ip address
 mtu 9216
!
 port-channel-protocol LACP
  port-channel 51 mode active

S4048-ON-2-01#sh vlt br
 VLT Domain Brief
------------------
 Domain ID:                      2
 Role:                           Primary
 Role Priority:                  1
 ICL Link Status:                Up
 HeartBeat Status:               Up
 VLT Peer Status:                Up
 Local Unit Id:                  0
 Version:                        6(8)
 Local System MAC address:       f4:8e:38:35:1c:08
 Remote System MAC address:      f4:8e:38:35:1d:08
 Remote system version:          6(8)
 Delay-Restore timer:            4 seconds
 Delay-Restore Abort Threshold:  60 seconds
 Peer-Routing :                  Disabled
 Peer-Routing-Timeout timer:     0 seconds
 Multicast peer-routing timeout: 150 seconds

HenrikS. · Answer

Hello,

While looking through the white papers, I see some differences:
1: We are now running version 9.13(0.3P1), but we have seen this behavior on earlier versions as well.
2: We use PVST, which should be supported?
3: We have not defined a specific VLT system-mac.

Other than that, we are not using routed-VLT and the issue only occurs while reloading one peer.

A power cut of one peer works as expected.

Shutting down the LACP interfaces before reloading works as expected.

I have compared running-config on the VLT peers, and differences are:

SwitchA:
-------------------------------------------
protocol spanning-tree pvst
 no disable
 vlan 32 hello-time 1
 vlan 32 max-age 6
 vlan 32 forward-delay 4
 vlan 1,32,1000-1020,1091-1099,1451,1456,1491-1492,2500-2510 bridge-priority 24576

SwitchB:
-------------------------------------------
protocol spanning-tree pvst
 no disable
 vlan 32 hello-time 1
 vlan 32 max-age 6
 vlan 32 forward-delay 4
 vlan 1,32,1000-1020,1091-1099,1451,1456,1491-1492,2500-2510 bridge-priority 28672

SwitchA:
-------------------------------------------
vlt domain 2
 peer-link port-channel 100
 back-up destination 10.x.x.8
 primary-priority 1
 unit-id 0
 delay-restore 4

SwitchB:
-------------------------------------------
vlt domain 2
 peer-link port-channel 100
 back-up destination 10.x.x.7
 primary-priority 8192
 unit-id 1
 delay-restore 4

The following LACP-related configuration is identical on both switches:

Switch A/B VLTi:
-------------------------------------------
interface fortyGigE 1/49
 no ip address
 mtu 9216
!
 protocol lldp
  advertise management-tlv management-address system-capabilities system-description system-name
 no shutdown
!
interface fortyGigE 1/50
 no ip address
 mtu 9216
!
 protocol lldp
  advertise management-tlv management-address system-capabilities system-description system-name
 no shutdown

Switch A/B LACP:
-------------------------------------------
interface TenGigabitEthernet 1/47
 no ip address
 mtu 9216
!
 port-channel-protocol LACP
  port-channel 51 mode active
!
 protocol lldp
  advertise management-tlv management-address system-capabilities system-description system-name
 no shutdown
!
interface TenGigabitEthernet 1/48
 no ip address
 mtu 9216
!
 port-channel-protocol LACP
  port-channel 51 mode active
!
 protocol lldp
  advertise management-tlv management-address system-capabilities system-description system-name
 no shutdown
!

Switch A/B VLTi/PO:
-------------------------------------------
interface Port-channel 100
 description VLTi
 no ip address
 channel-member fortyGigE 1/49,1/50
 no shutdown
!
interface Port-channel 51
 no ip address
 switchport
 no spanning-tree pvst err-dis cause invalid-pvst-bpdu
 lacp fast-switchover
 vlt-peer-lag port-channel 51
 no shutdown
!

genoscope2 · Answer

I have exactly the same problem, but I don't see any indication of a fix on the link given. Where is it?

HenrikS. · Answer

Just to confirm your findings: we have not yet resolved the issue, but a workaround is to shut down the LACP interfaces on the VLT peer before reloading it.
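
Before reloading, it may also be worth confirming on the peer that stays in service that the VLT domain is healthy and that Po51 still has active local members. A sketch only, assuming the DNOS 9 show commands below are available on the running release (Po51 is taken from the configuration earlier in the thread; the `!` line is an annotation):

```
! On the VLT peer that stays in service
show vlt brief
show vlt detail
show interfaces port-channel 51 brief
```

If the ICL, heartbeat and peer status are all Up and the local Po51 members are up, the remaining peer should be able to carry the traffic on its own.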

janwillem.molenaar · Answer

Hi Henrik,

We are experiencing the same issue with OS10 switches (10.5.2.0). Did you manage to fix this issue? We are about to open an SR with Dell support.

Regards,
Jan-Willem Molenaar

HenrikS. · Answer

Hello,

No, we have not yet been able to confirm a potential fix.
The main reason is that the packet loss we encountered with this issue is so severe that the impact on production systems isn't worth the risk of testing.
The issue does not happen if you just cut power or manually shut down the LACP interfaces before the reload.

On another note, we started having other issues with one of our 4 pairs of these switches this fall. Shortly after opening an SR, we were told that we had been hit by the Intel clock bug (after checking all serial numbers, all 8 (!) switches were affected), ref:
https://www.dell.com/support/article/no-no/qna44095/networking-clock-signal-q-a?lang=en

(Note: "A: No you do not need to proactively contact Dell to get replacement product. Dell will contact individual customers to coordinate for product replacement as material becomes available.")

We were never proactively contacted by Dell on this matter, and for all I know it might be some kind of root cause.

We have just replaced the units known to have failed and are planning to replace 6 more. After that, we might perform another reload test on a replaced and fully updated pair.

I advise you to open an SR and to explicitly ask them to check for known issues with your units.

-Henrik

bealdrid2 · Answer

Henrik,

Regarding the clock bug issue, can you share what kind of problems you noticed on the switches that were affected by this bug? So far I have one pair that is affected but has not shown any problems to date, and I'm just trying to get a feel for what kind of things to look out for.

HenrikS. · Answer

Well, as it affects the Intel Atom CPU, it's related to the control plane rather than the data plane.

For us, it was discovered during normal operations: we were adding a new VLAN and tagging interfaces, and just after a 'write mem' the hostname in the CLI prompt suddenly reverted to the default name:
Dell-EMC-something #

After that, we checked the serial output, saw error messages referring to writing to NVRAM, and opened an SR.

Since the KB article stated "Once encountered it is likely that the unit will not boot, and will not be recoverable. Typically the system or card will stop functioning and will hang or reboot continuously. The issue may not be observed until a reboot or power cycle occurs." we did no more tests on the units until replacement, as the dataplane was still operational at the time.

-Henrik

bealdrid2 · Answer

Henrik,

Thanks for that information. Our units are right about 4 years old now. We have been contacted regarding proactive replacement, but due to supply chain issues we have not received new units yet. Hoping we can hold out until we do.

stefangs · Answer

Hello all,

Is there already a fix for this issue? Is there an SR currently open at Dell?

Kind regards,
Stefan Geutjes

DELL-Josh Cr · Answer

Hi Stefan, Is it also an S4048-ON that you are having issues with? What version of the OS are you running?

stefangs · Answer

Hello Josh,

We experience the same issue with this version.

OS Version: 10.5.2.3
Build Version: 10.5.2.3.304
System Type: S4148F-ON
Architecture: x86_64

I hope you can bring the solution.

Kind regards,
Stefan Geutjes

DELL-Josh Cr · Answer

The best option would be to call phone support and have them look at the configuration live.

Tomas Kalabis · Answer

Hi guys,

Same issue:

S4112 with OS10 - Software version : 10.5.1.2

This is my post about it (with logs):

http://tomaskalabis.com/wordpress/dell-emc-s4112-on-vlt-and-lacp-issue-while-reloading-vlt-peer/

Tomas.

DELL-Stefan R · Answer

Hi Tomas. Thanks for the info. As Josh said in his last post, I also recommend calling phone support to have them look at the configuration live.
