Critical: network (bnxt_en module) crashes on 14G servers

Question

Hi,

We have a dozen new 14G servers. They ran (with no or minimal load) for the last few weeks without any issues.

~~After we deployed the BIOS upgrade (1.3.7) and rebooted (even power off/on), some of the servers (every third one) locked up with the kernel messages below.~~ It doesn't matter whether BIOS 1.3.7 or BIOS 1.2.11 is installed; the problem persists. Even after a few days of running (minimal or no load at all), the servers lock up.

How should we proceed from here? The servers are supposed to take on more load, and we have lost faith in their stability. We are running fully updated RHEL 7 and firmware.

The bnxt_en module dies, locks up network access, and constantly spews:

00:03:34 kernel: bnxt_en 0000:19:00.0 em1: Error (timeout: 500) msg {0x23 0x12be} len:0
00:03:34 kernel: bnxt_en 0000:19:00.1 em2: Error (timeout: 500) msg {0x51 0x12d6} len:0
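
For anyone triaging the same symptom across many hosts, a minimal Python sketch for counting these timeout lines per port might look like the following. The regular expression is derived only from the two lines quoted above, and the function name `count_timeouts` is hypothetical; other driver versions may format the message differently.

```python
import re
from collections import Counter

# Matches the bnxt_en timeout lines quoted in this post.
# Assumption: the line layout is exactly as in the two samples above.
TIMEOUT_RE = re.compile(
    r"bnxt_en (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.\d) "
    r"(?P<iface>\S+): Error \(timeout: \d+\) msg \{"
)

def count_timeouts(lines):
    """Count timeout errors per (PCI address, interface name)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("pci"), match.group("iface"))] += 1
    return counts

sample = [
    "00:03:34 kernel: bnxt_en 0000:19:00.0 em1: Error (timeout: 500) msg {0x23 0x12be} len:0",
    "00:03:34 kernel: bnxt_en 0000:19:00.1 em2: Error (timeout: 500) msg {0x51 0x12d6} len:0",
]

for (pci, iface), n in sorted(count_timeouts(sample).items()):
    print(pci, iface, n)
```

Feeding it the saved kernel log of each server gives a quick view of which ports are affected and how often.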

More of the kernel messages are on Pastebin: Kernel messages of crashed bnxt_en

Please let us know which way to proceed from here.

Thank you!

NonOnNon · Accepted Answer

Yes. The solution was to power down the server(s) and *remove* the power cords for 5 minutes. This forces a re-init of the NIC firmware.

DELL-Josh Cr · Answer

Hi, have you tried flashing back to an older BIOS? What version were you running previously? Are they all the same model of NIC?

NonOnNon · Answer

Why are my follow-ups erased from this forum? Below are more lockups with the latest DELL software: https://pastebin.com/ne446cxs

NonOnNon · Answer

I'm going to revert the BIOS on some of the servers. Please advise on further steps if this doesn't help. Clearly, running a non-1.3.x BIOS leaves us vulnerable to the latest security issues. Thanks

DELL-Josh Cr · Answer

Can you private message me the followups if they are not posting?

NonOnNon · Answer

Yes. All hardware is the same across the server base.

I reverted a couple of servers to the 1.2.11 BIOS. We upgraded Lifecycle Controller / iDRAC as well, but this is surely not connected with iDRAC.

It seems that we should go the full DELL support route.

If you have any other useful resources, particularly about the network card (Broadcom BCM57412 NetXtreme-E 10Gb Ethernet), please let me know.

Thanks!

DELL-Josh Cr · Answer

If reverting does fix it, we can look into issues with the update. If it doesn’t fix it, we can try to find out what is causing the problem. What version were you running previously? Are all the servers using the same model of NIC? Can you create a SupportAssist bundle? https://www.dell.com/support/article/us/en/04/sln306670/poweredge-14g-how-to-manually-create-the-supportassist-collection-?lang=en

NonOnNon · Answer

Unfortunately, reverting to the BIOS that was delivered with the servers doesn't help.

Again, the same lockups. Two more lockups, on two different servers running BIOS 1.2.11, are at the link below.

https://pastebin.com/Uh2vrmR0

Thank you

NonOnNon · Answer

We created a ticket with DELL; let's see what comes out of it. As the BIOS is not the issue here, it is probably due to the network card firmware (for the 10G part of the card).

DELL-Josh Cr · Answer

Can you create a SupportAssist bundle on one server with the older BIOS and one with the newer one? https://www.dell.com/support/article/us/en/04/sln306670/poweredge-14g-how-to-manually-create-the-supportassist-collection-?lang=en

Crossix · Answer

We have encountered a similar problem. Was this ever clearly identified or resolved? Thanks,

Eddy300G · Answer

Hello folks:

Has anyone run into the same problem? We are facing this issue on 3 PE R640 servers. They have the latest firmware, driver and kernel.

BIOS Version: 1.4.9

Broadcom card details

driver: bnxt_en
version: 1.8.0
firmware-version: 20.8.171.0/pkg 20.08.04.04
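
When comparing driver and firmware versions across several affected servers, it can help to turn details like the three lines above into a dictionary. This is only a sketch: it assumes the details are plain `key: value` lines as pasted here (they resemble what `ethtool -i <interface>` prints, but that is an assumption), and `parse_key_values` is a hypothetical helper name.

```python
def parse_key_values(text):
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            info[key.strip()] = value.strip()
    return info

# The card details exactly as pasted in this post.
sample = """driver: bnxt_en
version: 1.8.0
firmware-version: 20.8.171.0/pkg 20.08.04.04"""

info = parse_key_values(sample)
print(info["driver"], info["version"], info["firmware-version"])
```

Collecting one such dictionary per host makes it easy to spot servers that run a different driver or firmware combination.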

Note: the power drain did not work for us. The issue still happens.

Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.273740] INFO: rcu_sched detected stalls on CPUs/tasks:
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.279431] 35-...!: (0 ticks this GP) idle=9c8/0/0 softirq=777893253/777893253 fqs=0
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.287626] (detected by 18, t=15005 jiffies, g=122605494, c=122605493, q=75199)
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.295304] Sending NMI from CPU 18 to CPUs 35:
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.295330] NMI backtrace for cpu 35 skipped: idling at intel_idle+0x7b/0x130
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.296317] rcu_sched kthread starved for 15007 jiffies! g122605494 c122605493 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=35
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307823] rcu_sched I 0 9 2 0x80000000
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307826] Call Trace:
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307833] __schedule+0x3d6/0x8b0
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307835] schedule+0x36/0x80
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307837] schedule_timeout+0x162/0x370
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307840] ? __next_timer_interrupt+0xe0/0xe0
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307843] rcu_gp_kthread+0x5b4/0x960
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307848] kthread+0x105/0x140
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307849] ? rcu_barrier_sched+0x10/0x10
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307851] ? kthread_destroy_worker+0x50/0x50
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307853] ret_from_fork+0x35/0x40
Oct 3 16:49:58 ndc-cl-compute16 kernel: [3599033.265801] bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Down
Oct 3 16:49:58 ndc-cl-compute16 kernel: [3599033.271823] bnxt_en 0000:3b:00.0 enp59s0f0: speed changed to 0 for port enp59s0f0
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.569357] bnxt_en 0000:19:00.0 eno1: NIC Link is Down
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.751941] bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.751943] bnxt_en 0000:3b:00.0 enp59s0f0: FEC autoneg off encodings: None
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599034.318323] bnxt_en 0000:19:00.0 eno1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599034.318325] bnxt_en 0000:19:00.0 eno1: FEC autoneg off encodings: None
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.252553] bnxt_en 0000:3b:00.1 enp59s0f1d1: NIC Link is Down
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.258741] bnxt_en 0000:3b:00.1 enp59s0f1d1: speed changed to 0 for port enp59s0f1d1
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.318274] bnxt_en 0000:19:00.1 eno2d1: NIC Link is Down
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.502124] bnxt_en 0000:3b:00.1 enp59s0f1d1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.502127] bnxt_en 0000:3b:00.1 enp59s0f1d1: FEC autoneg off encodings: None
Oct 3 16:50:03 ndc-cl-compute16 kernel: [3599037.570256] bnxt_en 0000:19:00.1 eno2d1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Oct 3 16:50:03 ndc-cl-compute16 kernel: [3599037.570258] bnxt_en 0000:19:00.1 eno2d1: FEC autoneg off encodings: None
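
To see how often each port drops, the "NIC Link is Down" lines in a log like the one above can be counted per interface. The sketch below is a rough Python example; the pattern is taken only from the lines in this post, and `count_link_down` is a hypothetical name.

```python
import re
from collections import Counter

# Assumption: link-down events look exactly like the lines quoted above.
LINK_DOWN_RE = re.compile(r"bnxt_en \S+ (?P<iface>\S+): NIC Link is Down")

def count_link_down(lines):
    """Count 'NIC Link is Down' events per interface name."""
    counts = Counter()
    for line in lines:
        match = LINK_DOWN_RE.search(line)
        if match:
            counts[match.group("iface")] += 1
    return counts

# A few lines copied from the log excerpt above.
sample = [
    "Oct 3 16:49:58 ndc-cl-compute16 kernel: [3599033.265801] bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Down",
    "Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.569357] bnxt_en 0000:19:00.0 eno1: NIC Link is Down",
    "Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.751941] bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit",
    "Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.252553] bnxt_en 0000:3b:00.1 enp59s0f1d1: NIC Link is Down",
    "Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.318274] bnxt_en 0000:19:00.1 eno2d1: NIC Link is Down",
]

for iface, n in sorted(count_link_down(sample).items()):
    print(iface, n)
```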

casoul · Answer

We got the same problem on an R740xd.

BIOS: 1.4.9
Broadcom driver: 1.7.9
NIC firmware: 20.8.163/1.8.4
OS kernel: RHEL 7.3, 3.10.0-514

We think there is something wrong with the NIC firmware or the BIOS. Any suggestions?

smougey-pixtel.fr · Answer

Same problem for us on two different R740s, with both the latest and the previous Broadcom firmware (20.08 and 20.06), on RHEL 7.5.

NonOnNon · Answer

Have you reached support or found a solution?
