N4032 <-> 2x 10GbE NICs (server bonded): lost packets when main slave interface is down
While running resiliency tests I've found what I think is a bug in the 10GbE NICs, as the problem does not happen with the 1GbE ports.
Equipment is:
- Switch: 2x N4032
- Server: PowerEdge r740 with onboard NIC (2x 1GbE + 2x 10GbE)
- OS: Ubuntu Server 16.04
The 2x N4032 are trunked. I've kept the config simple to analyze the outcome:
SWITCH1 Te1/0/16 is connected to 10Gb port1 on the server (eno1)
SWITCH2 Te1/0/16 is connected to 10Gb port2 on the server (eno2d1)
Both NICs are bonded in linux mode: active-backup, interface (bond1)
source /etc/network/interfaces.d/*

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eno3
iface eno3 inet manual
    bond-master bond0
    bond-primary eno3

auto eno4
iface eno4 inet manual
    bond-master bond0

auto bond0
iface bond0 inet static
    address 10.32.3.131
    netmask 255.255.255.0
    network 10.32.3.0
    broadcast 10.32.3.255
    gateway 10.32.3.1
    bond-mode active-backup
    bond-miimon 100
    bond-slaves none
    dns-nameservers 8.8.8.8
    dns-search addisonglobal.cloud

auto bond0:1
iface bond0:1 inet static
    address 10.32.4.131
    netmask 255.255.255.0
    network 10.32.4.0
    broadcast 10.32.4.255

auto eno1
iface eno1 inet manual
    bond-master bond1
    bond-primary eno1

auto eno2d1
iface eno2d1 inet manual
    bond-master bond1

auto bond1
iface bond1 inet static
    address 10.32.5.131
    netmask 255.255.255.0
    network 10.32.5.0
    broadcast 10.32.5.255
    bond-mode active-backup
    bond-miimon 100
    bond-slaves none
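With active-backup bonding, the kernel reports which slave is currently carrying traffic in /proc/net/bonding/bond1. A minimal sketch of pulling that field out; the sample text below is assumed output, embedded so the command is self-contained, and on the live server you would read the real file instead:

```shell
# Assumed sample of /proc/net/bonding/bond1 (embedded for illustration);
# on the real server: awk -F': ' '/Currently Active Slave/ {print $2}' /proc/net/bonding/bond1
bond_status='Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eno1
MII Status: up'
printf '%s\n' "$bond_status" | awk -F': ' '/Currently Active Slave/ {print $2}'
# prints: eno1
```

Watching this field while failing eno1 over confirms whether the server side of the failover happened, independently of what the switches see.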
When pinging interface bond1 (10.32.5.131) from another server, everything works fine. To test the failover I logged into the server and brought down interface eno1; the pings immediately moved over the trunk to SWITCH2 and out eno2d1. They keep working for about 30 packets, then I get packet loss for another ~30 packets (about 40 seconds), and then the pings come back.
You can see the packet loss coming and going (I brought down the eno1 interface at icmp_seq=10):
user@comput00:~$ ping 10.32.5.131
PING 10.32.5.131 (10.32.5.131) 56(84) bytes of data.
64 bytes from 10.32.5.131: icmp_seq=1 ttl=64 time=0.148 ms
64 bytes from 10.32.5.131: icmp_seq=2 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=3 ttl=64 time=0.153 ms
64 bytes from 10.32.5.131: icmp_seq=4 ttl=64 time=0.114 ms
64 bytes from 10.32.5.131: icmp_seq=5 ttl=64 time=0.152 ms
64 bytes from 10.32.5.131: icmp_seq=6 ttl=64 time=0.141 ms
64 bytes from 10.32.5.131: icmp_seq=7 ttl=64 time=0.147 ms
64 bytes from 10.32.5.131: icmp_seq=8 ttl=64 time=0.116 ms
64 bytes from 10.32.5.131: icmp_seq=9 ttl=64 time=0.137 ms
64 bytes from 10.32.5.131: icmp_seq=10 ttl=64 time=0.139 ms
64 bytes from 10.32.5.131: icmp_seq=11 ttl=64 time=0.158 ms
64 bytes from 10.32.5.131: icmp_seq=12 ttl=64 time=0.144 ms
64 bytes from 10.32.5.131: icmp_seq=13 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=14 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=15 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=48 ttl=64 time=0.208 ms
64 bytes from 10.32.5.131: icmp_seq=49 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=50 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=51 ttl=64 time=0.149 ms
64 bytes from 10.32.5.131: icmp_seq=52 ttl=64 time=0.144 ms
64 bytes from 10.32.5.131: icmp_seq=53 ttl=64 time=0.148 ms
64 bytes from 10.32.5.131: icmp_seq=54 ttl=64 time=0.148 ms
64 bytes from 10.32.5.131: icmp_seq=55 ttl=64 time=0.141 ms
64 bytes from 10.32.5.131: icmp_seq=56 ttl=64 time=0.151 ms
64 bytes from 10.32.5.131: icmp_seq=57 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=58 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=59 ttl=64 time=0.144 ms
64 bytes from 10.32.5.131: icmp_seq=60 ttl=64 time=0.147 ms
64 bytes from 10.32.5.131: icmp_seq=61 ttl=64 time=0.146 ms
64 bytes from 10.32.5.131: icmp_seq=62 ttl=64 time=0.147 ms
64 bytes from 10.32.5.131: icmp_seq=63 ttl=64 time=0.143 ms
64 bytes from 10.32.5.131: icmp_seq=64 ttl=64 time=0.146 ms
64 bytes from 10.32.5.131: icmp_seq=65 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=66 ttl=64 time=0.144 ms
64 bytes from 10.32.5.131: icmp_seq=67 ttl=64 time=0.147 ms
64 bytes from 10.32.5.131: icmp_seq=68 ttl=64 time=0.146 ms
64 bytes from 10.32.5.131: icmp_seq=69 ttl=64 time=0.152 ms
64 bytes from 10.32.5.131: icmp_seq=70 ttl=64 time=0.143 ms
64 bytes from 10.32.5.131: icmp_seq=71 ttl=64 time=0.147 ms
64 bytes from 10.32.5.131: icmp_seq=72 ttl=64 time=0.148 ms
64 bytes from 10.32.5.131: icmp_seq=73 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=74 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=75 ttl=64 time=0.153 ms
64 bytes from 10.32.5.131: icmp_seq=108 ttl=64 time=0.207 ms
64 bytes from 10.32.5.131: icmp_seq=109 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=110 ttl=64 time=0.148 ms
64 bytes from 10.32.5.131: icmp_seq=111 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=112 ttl=64 time=0.144 ms
64 bytes from 10.32.5.131: icmp_seq=113 ttl=64 time=0.148 ms
64 bytes from 10.32.5.131: icmp_seq=114 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=115 ttl=64 time=0.148 ms
64 bytes from 10.32.5.131: icmp_seq=116 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=117 ttl=64 time=0.148 ms
64 bytes from 10.32.5.131: icmp_seq=118 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=119 ttl=64 time=0.143 ms
64 bytes from 10.32.5.131: icmp_seq=120 ttl=64 time=0.143 ms
64 bytes from 10.32.5.131: icmp_seq=121 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=122 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=123 ttl=64 time=0.148 ms
64 bytes from 10.32.5.131: icmp_seq=124 ttl=64 time=0.144 ms
64 bytes from 10.32.5.131: icmp_seq=125 ttl=64 time=0.151 ms
64 bytes from 10.32.5.131: icmp_seq=126 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=127 ttl=64 time=0.143 ms
64 bytes from 10.32.5.131: icmp_seq=128 ttl=64 time=0.146 ms
64 bytes from 10.32.5.131: icmp_seq=129 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=130 ttl=64 time=0.147 ms
64 bytes from 10.32.5.131: icmp_seq=131 ttl=64 time=0.152 ms
64 bytes from 10.32.5.131: icmp_seq=132 ttl=64 time=0.145 ms
64 bytes from 10.32.5.131: icmp_seq=133 ttl=64 time=0.146 ms
64 bytes from 10.32.5.131: icmp_seq=134 ttl=64 time=0.149 ms
64 bytes from 10.32.5.131: icmp_seq=135 ttl=64 time=0.143 ms
Next packet loss was at icmp_seq=132.
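For what it's worth, the outage length can be read straight off the icmp_seq gaps above (15 -> 48 and 75 -> 108). A quick awk sketch over those boundary pairs; each input line is the last seq received before a gap and the first seq received after it:

```shell
# At ping's default 1-second interval, lost packets ~= seconds of outage.
printf '15 48\n75 108\n' |
    awk '{printf "lost %d packets between seq %d and %d\n", $2-$1-1, $1, $2}'
# prints:
# lost 32 packets between seq 15 and 48
# lost 32 packets between seq 75 and 108
```

So each outage is roughly 32 seconds, consistent with the "about 40 seconds" estimate above.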
Logging into the N4032 switches I noticed strange behavior in the MAC address table. While the pings are working, the MAC address associated with bond1 is learned on trunk Po1, but after roughly 40 seconds the MAC address shows as forwarded on Te1/0/16 again. That shouldn't happen, as the interface is still down on the Linux server.
ACCESS1#show mac address-table
Aging time is 300 Sec

Vlan     Mac Address           Type        Port
-------- --------------------- ----------- ---------------------
1        0009.0F09.0006        Dynamic     Te1/0/1
1        000A.F7DE.5F50        Dynamic     Te1/0/17
1        000A.F7DE.ECA9        Dynamic     Te1/0/9
1        000A.F7DE.ECAA        Dynamic     Te1/0/15
1        000A.F7DE.ECAB        Dynamic     Te1/0/15
1        000A.F7DF.0058        Dynamic     Te1/0/10
1        000A.F7DF.005A        Dynamic     Po1
1        D094.662A.D3D0        Dynamic     Te1/0/8
1        D094.662A.DC31        Dynamic     Te1/0/7
1        E4F0.0470.8F8E        Dynamic     Po1
1        E4F0.0470.8F9B        Management  Vl1
Total MAC Addresses in use: 11

ACCESS1#show mac address-table
Aging time is 300 Sec

Vlan     Mac Address           Type        Port
-------- --------------------- ----------- ---------------------
1        0009.0F09.0006        Dynamic     Te1/0/1
1        000A.F7DE.5F50        Dynamic     Te1/0/17
1        000A.F7DE.ECA9        Dynamic     Te1/0/9
1        000A.F7DE.ECAA        Dynamic     Te1/0/15
1        000A.F7DE.ECAB        Dynamic     Te1/0/15
1        000A.F7DF.0058        Dynamic     Po1
1        000A.F7DF.005A        Dynamic     Te1/0/16
1        D094.662A.D3D0        Dynamic     Te1/0/8
1        E4F0.0470.8F8E        Dynamic     Po1
1        E4F0.0470.8F9B        Management  Vl1
Total MAC Addresses in use: 10
This MAC address table swapping goes on forever, and every time that MAC gets associated with Te1/0/16 I completely lose the pings.
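One way to watch the flap without reading the whole table is to filter the bond's MAC (000A.F7DF.005A, the one moving between Po1 and Te1/0/16 above) out of captured `show mac address-table` output. A sketch over an assumed captured snippet:

```shell
# Captured 'show mac address-table' lines (assumed sample, embedded for
# illustration); in practice you would save the real switch output to a file.
table='1    000A.F7DF.0058    Dynamic    Po1
1    000A.F7DF.005A    Dynamic    Te1/0/16'
# Print the port the bond MAC is currently learned on.
printf '%s\n' "$table" | awk '/000A\.F7DF\.005A/ {print $4}'
# prints: Te1/0/16
```

Re-running this against fresh captures every few seconds makes the Po1 <-> Te1/0/16 flapping and its timing obvious.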
It's a very strange behavior. Any ideas why this is happening? I've reproduced the exact same steps on the 1GbE ports and I don't lose a single packet, which is why I suspect a bug. Or is there something else I'm missing?
Another important note: if I shut down the port directly on the switch instead of bringing down interface eno1, the pings succeed 100% with 0% packet loss.
Here is the config of the Dells; these are the simplest ones. I've already tried setting up RSTP, MST, enabling portfast, etc., and the same behavior happens every time:
Any help would be appreciated. I don't know if I'm missing anything, but since it works perfectly fine with the 1GbE ports, and also when shutting down the ports on the switches, I think something else is involved. Maybe the drivers, or the NIC? What do you think? Thanks.
Anonymous
March 26th, 2018 08:00
Have you tried different bonding modes, to see if there is any different behavior between them? Are there any messages recorded in the logs during your testing?

# show logging
The show run output does not look complete. Once you place an interface into a channel-group, that interface then only follows the commands issued to the port-channel interface, just something to keep in mind as you configure the switch.
# interface port-channel 1
# switchport mode trunk
# switchport trunk allowed vlan 1
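For reference, member ports join the port-channel with `channel-group`; the port number and LACP mode below are hypothetical, for illustration only, and should be adjusted to match the actual uplink ports:

```
# interface Te1/0/23          (hypothetical uplink port)
# channel-group 1 mode active (LACP; 'mode on' would be a static LAG)
```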
It looks like the firmware is out of date on the switch, with the newest firmware offering some improvements that may help in this situation. Could you please schedule a time to perform this update and then test again?
https://dell.to/2DVMW6Z
uleinad
March 27th, 2018 09:00
Hi Daniel,
First of all, thank you for your quick response, very much appreciated!
Yes, I've tried a different bonding mode on the Ubuntu servers (balance-rr), and some issues appeared with it too.
To give you an update: I've finished the config with VRRP between the switches, and now I don't get that random packet loss.
You're also right that the config I sent you is not correct; I checked and I had copy/pasted from an old one. Sorry for the confusion.
Both switches, NICs and servers have up-to-date firmware, as I updated them myself a couple of weeks ago.
Anyway, as I don't have this issue anymore I would mark this as resolved.
In case it helps somebody, here is the current working config, with VRRP:
Working Config: https://pastebin.com/WEEX4aeZ
I can see I'm actually on a really old firmware; thanks for the link, I will update it tomorrow.
Thanks again for your excellent support Daniel.
Anonymous
March 27th, 2018 12:00
Thanks for the great update and sharing your findings.
Cheers