March 9th, 2018 12:00

Critical: network (bnxt_en module) crashes on 14G servers

Hi,

We have a dozen new 14G servers. They ran (with no or minimal load) for the last few weeks without any issues.

After we deployed the BIOS upgrade (1.3.7) and rebooted (even a full power off/on), some of the servers (roughly every third one) locked up with the kernel messages below. It doesn't matter whether BIOS 1.3.7 or BIOS 1.2.11 is installed; the problem persists. Even after a few days of running (minimal or no load at all) they lock up.

How should we proceed from here? The servers are due to take more load, and we have lost faith in their stability. We are running fully updated RHEL 7 and firmware.

The bnxt_en module dies, locks up network access, and constantly spews:

00:03:34 kernel: bnxt_en 0000:19:00.0 em1: Error (timeout: 500) msg {0x23 0x12be} len:0
00:03:34 kernel: bnxt_en 0000:19:00.1 em2: Error (timeout: 500) msg {0x51 0x12d6} len:0
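
For anyone triaging the same symptom across several hosts, the timeout lines above can be tallied per interface with a short script. This is a minimal sketch, not a Dell or Broadcom tool: the function names, and the field names `first` and `second` for the two hex values inside `msg {...}`, are my own labels, since the thread does not say what those values mean.

```python
import re
from collections import Counter

# Matches lines shaped like the two quoted above, e.g.
#   ... bnxt_en 0000:19:00.0 em1: Error (timeout: 500) msg {0x23 0x12be} len:0
TIMEOUT_RE = re.compile(
    r"bnxt_en (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"(?P<iface>\S+): Error \(timeout: (?P<timeout>\d+)\) "
    r"msg \{(?P<first>0x[0-9a-f]+) (?P<second>0x[0-9a-f]+)\} len:(?P<length>\d+)"
)


def parse_timeouts(lines):
    """Return one dict per bnxt_en timeout line; other lines are ignored."""
    events = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            events.append({
                "pci": m["pci"],
                "iface": m["iface"],
                "timeout": int(m["timeout"]),
                # "first"/"second" are hypothetical labels for the two hex values.
                "first": int(m["first"], 16),
                "second": int(m["second"], 16),
                "length": int(m["length"]),
            })
    return events


def timeouts_per_interface(lines):
    """Count timeout lines per network interface name."""
    return Counter(event["iface"] for event in parse_timeouts(lines))


# The two lines quoted in this post.
SAMPLE = [
    "00:03:34 kernel: bnxt_en 0000:19:00.0 em1: Error (timeout: 500) msg {0x23 0x12be} len:0",
    "00:03:34 kernel: bnxt_en 0000:19:00.1 em2: Error (timeout: 500) msg {0x51 0x12d6} len:0",
]

if __name__ == "__main__":
    for iface, count in sorted(timeouts_per_interface(SAMPLE).items()):
        print(iface, count)
```

In practice the input would be the lines of the kernel log (for example `/var/log/messages`, or wherever your syslog setup writes kernel messages) instead of `SAMPLE`.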

More of the kernel messages are on Pastebin: Kernel messages of crashed bnxt_en

Please let us know which way to proceed from here.

Thank you!

16 Posts

May 8th, 2018 23:00

Yes.

Solution was to power down the server(s) and *remove* power cords for 5 minutes.

This will force re-init of the NIC firmware.

Moderator

8.5K Posts

March 9th, 2018 13:00

Hi,

Have you tried flashing back to an older BIOS? What version were you running previously? Are they all the same model of NIC?

16 Posts

March 9th, 2018 14:00

Why are my follow-ups erased from this forum? 

Below are more lockups with the latest Dell software.

https://pastebin.com/ne446cxs

16 Posts

March 9th, 2018 14:00

I'm going to revert BIOS on some of the servers.

Please advise on further steps if this doesn't help.

Clearly, running a non-1.3.x BIOS leaves us vulnerable to the latest security issues.

Thanks

 

Moderator

8.5K Posts

March 9th, 2018 14:00

Can you private message me the followups if they are not posting?

16 Posts

March 9th, 2018 15:00

Yes. All hardware is the same across the server base.

I reverted a couple of servers to BIOS 1.2.11. We upgraded the Lifecycle Controller / iDRAC as well, but this is surely not related to iDRAC.

It seems we should go the full Dell support route.

If you have any other useful resources, particularly for the network card (Broadcom BCM57412 NetXtreme-E 10Gb Ethernet), please let me know.

Thanks!

 

Moderator

8.5K Posts

March 9th, 2018 15:00

If reverting does fix it, we can look into issues with the update; if it doesn’t, we can try to find out what is causing the problem. What version were you running previously? Are all the servers using the same model of NIC? Can you create a SupportAssist collection? https://www.dell.com/support/article/us/en/04/sln306670/poweredge-14g-how-to-manually-create-the-supportassist-collection-?lang=en

16 Posts

March 10th, 2018 01:00

Unfortunately, reverting to the BIOS that was delivered with the servers doesn't help.

Again, the same lockups. Two more lockups, on two different servers running BIOS 1.2.11, are at the link below.

https://pastebin.com/Uh2vrmR0

Thank you

16 Posts

March 12th, 2018 09:00

We created a ticket with Dell; let's see what comes of it.

As the BIOS is not the issue here, the cause is probably the network card firmware (for the 10G part of the card).

 

Moderator

8.5K Posts

March 12th, 2018 09:00

Can you create a SupportAssist collection on one server with the older BIOS and one with the newer one? https://www.dell.com/support/article/us/en/04/sln306670/poweredge-14g-how-to-manually-create-the-supportassist-collection-?lang=en

1 Message

May 8th, 2018 23:00

We have encountered a similar problem. Was this ever clearly identified / resolved?

 

Thanks,

1 Message

October 4th, 2018 13:00

Hello folks:

Has anyone run into the same problem? We are facing this issue on 3 PowerEdge R640 servers. They have the latest firmware, driver and kernel.

BIOS Version: 1.4.9

 

Broadcom card details 

driver: bnxt_en
version: 1.8.0
firmware-version: 20.8.171.0/pkg 20.08.04.04
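
When several servers are affected, it can help to collect these three fields from each one and compare them. The sketch below assumes the fields come from `ethtool -i <interface>` output (the post does not say how they were collected); `parse_driver_info` and `split_firmware` are hypothetical helper names, not part of any Dell or Broadcom tooling.

```python
def parse_driver_info(text):
    """Parse 'key: value' lines (as quoted above) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def split_firmware(firmware_version):
    """Split '20.8.171.0/pkg 20.08.04.04' into (firmware, package).

    If there is no '/pkg ' part, the package is returned as None.
    """
    firmware, _, package = firmware_version.partition("/pkg ")
    return firmware.strip(), (package.strip() or None)


# The values quoted in this post.
SAMPLE = """driver: bnxt_en
version: 1.8.0
firmware-version: 20.8.171.0/pkg 20.08.04.04
"""

if __name__ == "__main__":
    info = parse_driver_info(SAMPLE)
    firmware, package = split_firmware(info["firmware-version"])
    print(info["driver"], info["version"], firmware, package)
```

Running the same parser on every host and diffing the resulting dicts is one way to confirm whether the locking and non-locking servers really share the same driver and firmware levels.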

 

Note: the power drain did not work for us. The issue still happens.

 

Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.273740] INFO: rcu_sched detected stalls on CPUs/tasks:
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.279431] 35-...!: (0 ticks this GP) idle=9c8/0/0 softirq=777893253/777893253 fqs=0
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.287626] (detected by 18, t=15005 jiffies, g=122605494, c=122605493, q=75199)
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.295304] Sending NMI from CPU 18 to CPUs 35:
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.295330] NMI backtrace for cpu 35 skipped: idling at intel_idle+0x7b/0x130
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.296317] rcu_sched kthread starved for 15007 jiffies! g122605494 c122605493 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=35
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307823] rcu_sched I 0 9 2 0x80000000
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307826] Call Trace:
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307833] __schedule+0x3d6/0x8b0
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307835] schedule+0x36/0x80
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307837] schedule_timeout+0x162/0x370
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307840] ? __next_timer_interrupt+0xe0/0xe0
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307843] rcu_gp_kthread+0x5b4/0x960
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307848] kthread+0x105/0x140
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307849] ? rcu_barrier_sched+0x10/0x10
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307851] ? kthread_destroy_worker+0x50/0x50
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307853] ret_from_fork+0x35/0x40
Oct 3 16:49:58 ndc-cl-compute16 kernel: [3599033.265801] bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Down
Oct 3 16:49:58 ndc-cl-compute16 kernel: [3599033.271823] bnxt_en 0000:3b:00.0 enp59s0f0: speed changed to 0 for port enp59s0f0
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.569357] bnxt_en 0000:19:00.0 eno1: NIC Link is Down
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.751941] bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.751943] bnxt_en 0000:3b:00.0 enp59s0f0: FEC autoneg off encodings: None
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599034.318323] bnxt_en 0000:19:00.0 eno1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599034.318325] bnxt_en 0000:19:00.0 eno1: FEC autoneg off encodings: None
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.252553] bnxt_en 0000:3b:00.1 enp59s0f1d1: NIC Link is Down
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.258741] bnxt_en 0000:3b:00.1 enp59s0f1d1: speed changed to 0 for port enp59s0f1d1
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.318274] bnxt_en 0000:19:00.1 eno2d1: NIC Link is Down
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.502124] bnxt_en 0000:3b:00.1 enp59s0f1d1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.502127] bnxt_en 0000:3b:00.1 enp59s0f1d1: FEC autoneg off encodings: None
Oct 3 16:50:03 ndc-cl-compute16 kernel: [3599037.570256] bnxt_en 0000:19:00.1 eno2d1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Oct 3 16:50:03 ndc-cl-compute16 kernel: [3599037.570258] bnxt_en 0000:19:00.1 eno2d1: FEC autoneg off encodings: None
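
The log above shows each port going down and coming back up within a second. To count such down/up cycles per interface over a longer log, a rough sketch could look like the following. The definition of a "flap" (a `Down` immediately followed by an `Up` on the same interface) and the function names are my own assumptions for illustration.

```python
import re
from collections import defaultdict

# Matches the bnxt_en link-state lines quoted above, e.g.
#   ... bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Down
LINK_RE = re.compile(
    r"bnxt_en (?P<pci>\S+) (?P<iface>\S+): NIC Link is (?P<state>Up|Down)"
)


def link_transitions(lines):
    """Return {interface: [state, state, ...]} in log order."""
    per_iface = defaultdict(list)
    for line in lines:
        m = LINK_RE.search(line)
        if m:
            per_iface[m["iface"]].append(m["state"])
    return dict(per_iface)


def flap_counts(lines):
    """Count Down -> Up pairs per interface (hypothetical 'flap' definition)."""
    counts = {}
    for iface, states in link_transitions(lines).items():
        pairs = zip(states, states[1:])
        counts[iface] = sum(1 for a, b in pairs if (a, b) == ("Down", "Up"))
    return counts


# A subset of the lines quoted in this post.
SAMPLE = [
    "Oct 3 16:49:58 ndc-cl-compute16 kernel: [3599033.265801] bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Down",
    "Oct 3 16:49:58 ndc-cl-compute16 kernel: [3599033.271823] bnxt_en 0000:3b:00.0 enp59s0f0: speed changed to 0 for port enp59s0f0",
    "Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.569357] bnxt_en 0000:19:00.0 eno1: NIC Link is Down",
    "Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.751941] bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit",
    "Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599034.318323] bnxt_en 0000:19:00.0 eno1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit",
]

if __name__ == "__main__":
    for iface, count in sorted(flap_counts(SAMPLE).items()):
        print(iface, count)
```

A high flap count on all ports at the same moment, as in the log above, points at something shared by the ports (firmware, driver, or host stall) rather than a single cable or switch port, though that is only a heuristic.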

1 Message

October 10th, 2018 22:00

We have the same problem on an R740xd. BIOS: 1.4.9; Broadcom driver: 1.7.9; NIC firmware: 20.8.163/1.8.4; OS/kernel: RHEL 7.3, 3.10.0-514. We think there is something wrong with the NIC firmware or the BIOS. Any suggestions?

October 17th, 2018 03:00

Same problem for us on two different R740 servers, with either the latest or the previous Broadcom firmware (20.08 or 20.06), on RHEL 7.5.

16 Posts

November 8th, 2018 01:00

Have you reached support or found a solution?

 

 
