March 14th, 2023 05:00

Broadcom 22.31.13.70 breaks TCP in VXLAN with TX offloading

Hello,

We have several Dell PowerEdge servers equipped with Broadcom Adv. Dual 10Gb Ethernet cards. They run Ubuntu 18.04 (kernel 5.4.0-144-generic) and host an OpenStack cloud that uses VXLAN tunnels for tenant networking.

Since the recent update of the NIC firmware to 22.31.13.70, no TCP connections inside VXLAN tunnels work anymore. Virtual machines can still send and receive ICMP packets, but TCP data connections stall. The kernel log shows the following messages:

kernel: bnxt_en 0000:19:00.0 data: hwrm req_type 0xa1 seq id 0x5716 error 0xffff
kernel: bnxt_en 0000:19:00.0 data: hwrm_tunnel_dst_port_alloc failed. rc:-95

We tried resetting the NIC firmware by disconnecting the server from all power for 10 minutes, but this did not help either. Further investigation showed that disabling TX checksum offloading (which, as the output shows, also turns off TCP segmentation offload) avoids the issue:

# ethtool --offload data tx off
Actual changes:
tx-checksumming: off
	tx-checksum-ipv4: off
	tx-checksum-ipv6: off
tcp-segmentation-offload: off
	tx-tcp-segmentation: off [requested on]
	tx-tcp6-segmentation: off [requested on]

With this workaround, connectivity was restored.
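For anyone applying this workaround on several hosts, a minimal Python sketch follows. It only wraps the `ethtool --offload <iface> tx off` call shown above and reads back the "Actual changes" report. The helper names (`offload_command`, `parse_actual_changes`, `apply_workaround`) are hypothetical, and the parser assumes the `feature: state [requested ...]` layout seen in that output.

```python
from __future__ import annotations

import subprocess


def offload_command(iface: str, enabled: bool) -> list[str]:
    """Build the ethtool call from the post; 'on' reverts the workaround."""
    return ["ethtool", "--offload", iface, "tx", "on" if enabled else "off"]


def parse_actual_changes(output: str) -> dict[str, str]:
    """Map each feature in ethtool's change report to its resulting state.

    A line like 'tx-tcp-segmentation: off [requested on]' yields 'off';
    the bracketed note records a requested state that differs from the actual one.
    """
    states: dict[str, str] = {}
    for line in output.splitlines():
        name, sep, rest = line.strip().partition(":")
        if not sep or not rest.strip():
            continue  # skips the "Actual changes:" header and blank lines
        states[name] = rest.split()[0]
    return states


def apply_workaround(iface: str, dry_run: bool = True) -> list[str]:
    """Disable TX offloads on iface (needs root); with dry_run, only return the command."""
    cmd = offload_command(iface, enabled=False)
    if not dry_run:
        # check=True raises subprocess.CalledProcessError if ethtool fails.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    print(" ".join(apply_workaround("data")))  # prints: ethtool --offload data tx off
```

`dry_run` defaults to `True` so nothing changes until you opt in. Offload settings made with ethtool do not survive a reboot or driver reload, so the workaround has to be reapplied at boot until the firmware is fixed.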

Disabling offloading, of course, costs performance and adds latency. Would you recommend downgrading to the previous firmware and disabling iDRAC firmware updates, or will a fix be available soon?

Thank you.

Moderator

March 14th, 2023 09:00

Jgraichen,

Would you do me a favor and confirm a couple of things for me?

Is this occurring on multiple platforms, or the same model devices?

What models is this affecting?

Is the rest of the host server also up to date as well?

Have you attempted any form of backflashing to get to the previous version, in order to see if it works again? If so what version did you try?

Lastly, if you like, I can submit the firmware issue on your behalf, but I would need a service tag to do so. You can send me the service tag by private message.

Let me know.

April 1st, 2023 05:00

Hi,

I would like to share that we're seeing the same behavior.

The servers are PowerEdge R6515 with a dual 10G NIC; iDRAC shows them as "Broadcom Adv. Dual 10Gb Ethernet".

Family Firmware Version 22.31.13.70

Linux detects them as "Broadcom BCM57412 NetXtreme-E 10Gb Ethernet".

The error messages are similar to those reported by @jgraichen:

===

kernel: bnxt_en 0000:41:00.0 ens3f0np0: hwrm req_type 0xa1 seq id 0x23c error 0xffff
kernel: bnxt_en 0000:41:00.0 ens3f0np0: hwrm_tunnel_dst_port_alloc failed. rc:-95
kernel: bnxt_en 0000:41:00.1 ens3f1np1: hwrm req_type 0xa1 seq id 0x23c error 0xffff
kernel: bnxt_en 0000:41:00.1 ens3f1np1: hwrm_tunnel_dst_port_alloc failed. rc:-95

Is an update expected that does not involve disabling offloading?

@DELL-Chris H 

All servers have the latest firmware applied to all components. I would prefer not to downgrade the firmware, because in my previous experience that caused lots of problems and didn't work. Do you have any tips on how to do it, should we try it again?

And how should I privately send you a service tag so you can raise this issue with the firmware team?

Thanks

April 1st, 2023 06:00

I should add that we're also seeing it on R6525 (dual EPYC) servers with two dual-port 10G NICs. The NICs have the same firmware version (the latest available via lifecycle/update).

----

kernel: bnxt_en 0000:02:00.0 eth0: Broadcom BCM57416 NetXtreme-E 10GBase-T Ethernet found at mem e1210000, node addr ---
kernel: bnxt_en 0000:41:00.0 eth2: Broadcom BCM57412 NetXtreme-E 10Gb Ethernet found at mem 90210000, node addr ---
....
kernel: bnxt_en 0000:02:00.0 ens1f0np0: hwrm req_type 0xa1 seq id 0x1f7 error 0x3
kernel: bnxt_en 0000:02:00.0 ens1f0np0: hwrm_tunnel_dst_port_alloc failed. rc:-13
kernel: bnxt_en 0000:02:00.1 ens1f1np1: hwrm req_type 0xa1 seq id 0x1f9 error 0x3
kernel: bnxt_en 0000:02:00.1 ens1f1np1: hwrm_tunnel_dst_port_alloc failed. rc:-13
kernel: bnxt_en 0000:41:00.0 ens3f0np0: hwrm req_type 0xa1 seq id 0x1f6 error 0xffff
kernel: bnxt_en 0000:41:00.0 ens3f0np0: hwrm_tunnel_dst_port_alloc failed. rc:-95
kernel: bnxt_en 0000:41:00.1 ens3f1np1: hwrm req_type 0xa1 seq id 0x1f7 error 0xffff
kernel: bnxt_en 0000:41:00.1 ens3f1np1: hwrm_tunnel_dst_port_alloc failed. rc:-95
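To see at a glance which ports fail and with which return code, a hypothetical helper can group the `hwrm_tunnel_dst_port_alloc failed` lines. In the log above, the BCM57416 10GBase-T ports (0000:02:00.x) report rc:-13 while the BCM57412 ports (0000:41:00.x) report rc:-95. Reading rc as a negated Linux errno (13 = EACCES, 95 = EOPNOTSUPP) is an assumption about the driver, not something stated in this thread.

```python
from __future__ import annotations

import errno
import re
from collections import defaultdict

# One failure line per port, e.g.
# kernel: bnxt_en 0000:41:00.0 ens3f0np0: hwrm_tunnel_dst_port_alloc failed. rc:-95
FAILURE = re.compile(
    r"bnxt_en (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"(?P<iface>\S+): hwrm_tunnel_dst_port_alloc failed\. rc:(?P<rc>-\d+)"
)


def group_by_rc(log_text: str) -> dict[int, list[str]]:
    """Group 'PCI-address interface' strings by the rc the driver logged."""
    groups: dict[int, list[str]] = defaultdict(list)
    for m in FAILURE.finditer(log_text):
        groups[int(m["rc"])].append(f"{m['pci']} {m['iface']}")
    return dict(groups)


def rc_name(rc: int) -> str:
    """Assumes rc is a negated Linux errno (e.g. -13 -> EACCES)."""
    return errno.errorcode.get(-rc, f"errno {-rc}")
```

Feed it kernel log text (for example the output of `dmesg`); the unrelated `hwrm req_type ...` lines are ignored because they do not match the pattern.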

Moderator

April 2nd, 2023 15:00

Hello, you can click on Chris's ID on his post, and you will see the messaging option on the right, as shown in the attachment.

1 Attachment

Moderator

April 2nd, 2023 18:00

In the meantime, what is the server's BIOS version, and what is the name of the connected end device?

April 4th, 2023 14:00

Hi @DELL-Young E  

Thanks for the tip on how to send a direct message.

As for the info requested, here you go:

BIOS Version: 2.10.2
iDRAC Firmware Version: 6.10.30.00

Linux kernel is: 5.4.66-flatcar (Flatcar Linux)

I'm not sure what you mean by "connected end device name". The servers are connected to Cisco Nexus 9K switches, if that is what you mean.

Thanks

Moderator

April 4th, 2023 21:00

Hello,

Regarding "Linux detects them as: Broadcom BCM57412 NetXtreme-E 10Gb Ethernet": that is the BCM57412 (10Gb SFP+).

https://dell.to/3m1ImjD

This is the exact module used in the server.

April 15th, 2023 16:00

Hello,

We are facing the same issue with the "BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller":
Ubuntu Server 22.04.1
PowerEdge C6520
NIC firmware 22.31.13.70

BIOS Version 1.9.2
iDRAC Firmware Version 6.10.30.00

Moderator

April 16th, 2023 14:00

@DELL-Chris H wrote: Is this occurring on multiple platforms, or the same model devices? What models is this affecting? Is the rest of the host server also up to date as well? Have you attempted any form of backflashing to get to the previous version, in order to see if it works again? If so what version did you try?

Hello, mobileye.storage,

Thanks for choosing Dell. Have you tried what Chris suggested earlier in the thread? You can also PM me your service tag. Have a good one.

April 16th, 2023 20:00

Hello Young E,

We see this on a Dell C6520 with the Broadcom Adv. Dual 10Gb Ethernet (BCM57412 NetXtreme-E 10Gb).
Everything else is up to date.

We backflashed the firmware from 22.31.13.70 to 22.21.07.80, and everything is working now.

April 16th, 2023 21:00

Hi, we have the exact same issue with Dell R640 and R740 with a mix of Ubuntu 18 and Ubuntu 22.

All firmware is up to date.

When we flash 22.31.13.70 onto the NIC, network connectivity for the server(s) breaks; reverting to any older version makes it work as expected again.
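Since every rollback reported in this thread (to 22.21.07.80, and to 21.85.21.91 in the Geneve case) restored connectivity, triaging a fleet only needs a numeric comparison against 22.31.13.70. The sketch below is hypothetical and works on Dell package versions as shown by iDRAC; the Broadcom 225.x numbering mentioned in the VMware KB appears to be a different scheme and is deliberately not compared.

```python
from __future__ import annotations

# Dell NIC package version reported in this thread as breaking TCP over VXLAN.
REPORTED_BROKEN = (22, 31, 13, 70)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '22.21.07.80' into (22, 21, 7, 80)."""
    return tuple(int(part) for part in text.strip().split("."))


def classify(version: str) -> str:
    """Rough triage based only on the reports in this thread."""
    v = parse_version(version)
    if v == REPORTED_BROKEN:
        return "reported broken"
    if v < REPORTED_BROKEN:
        return "older: rollback reported to work"
    return "newer: no reports in this thread"
```

Comparing integer tuples rather than raw strings avoids mis-ordering versions such as "9.x" versus "22.x", which plain string comparison would get wrong.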

June 8th, 2023 18:00

We're not sure whether we're seeing the same thing, but we're having sporadic problems with Geneve offload (NSX-T) on firmware version 22.31.13.70. Traffic generally works, but all tunnels on a single NIC have occasionally dropped. (We're using the four-port Broadcom 57504 NIC, and only one port at a time seemed to be affected.) We're not confident the firmware is the cause, but given your reports of VXLAN issues, we're downgrading to the firmware version we were running before (21.85.21.91).

June 14th, 2023 04:00

Hello Chris,

Sorry for the late reply. Apparently, I did not receive email notifications from the forum and forgot to answer. Do I have to subscribe to my own threads manually?

Is this occurring on multiple platforms, or the same model devices?


We immediately stopped all upgrades on other machines, so I cannot say whether other models are affected as well.


What models is this affecting?


In our case, it was a PowerEdge R740.


Is the rest of the host server also up to date as well?


Yes, we regularly apply all updates automatically.


Have you attempted any form of backflashing to get to the previous version, in order to see if it works again? If so what version did you try?


Yes, we reverted to version 22.21.07.80, which is working fine again.

Thank you!

Best regards,

Jan

June 16th, 2023 10:00

After some discussion, VMware support pointed us to https://kb.vmware.com/s/article/90494, which indicates that this is a known problem with Broadcom firmware. The issue is apparently fixed in the latest Broadcom firmware releases (225.x and later), but Dell hasn't picked up those firmware updates yet. We haven't seen any issues since we downgraded back to 21.85.21.91.
