This post is more than 5 years old
16 Posts
0
16034
Critical: network (bnxt_en module) crashes on 14G servers
Hi,
We have a dozen new 14G servers. They had been running (with no or minimal load) for the last few weeks without any issues.
After we deployed the BIOS upgrade (1.3.7) and rebooted (even with a full power off/on), some of the servers (roughly every third one) locked up with the kernel messages below.
It doesn't matter whether BIOS 1.3.7 or BIOS 1.2.11 is installed; the problem persists. Even after a few days of running (minimal or no load at all), they lock up.
How should we proceed from here? The servers are supposed to take on more load, and we have lost faith in their stability. We are running fully updated RHEL 7 and firmware.
The bnxt_en module dies, locks up network access, and constantly spews:
00:03:34 kernel: bnxt_en 0000:19:00.0 em1: Error (timeout: 500) msg {0x23 0x12be} len:0
00:03:34 kernel: bnxt_en 0000:19:00.1 em2: Error (timeout: 500) msg {0x51 0x12d6} len:0
More of the kernel messages are on Pastebin: Kernel messages of crashed bnxt_en
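For anyone triaging the same symptom across several servers, here is a minimal Python sketch that counts these timeout errors per interface. The pattern is derived only from the two log lines quoted in this post, so other driver versions may format the message differently. The function name `count_timeouts` and the group names are made up for illustration, and the meaning of the two hex values inside `msg {...}` is not stated in this thread, so they are captured without interpretation.

```python
import re
from collections import Counter

# Pattern built from the two bnxt_en lines quoted in this thread only.
# The two hex values in "msg {...}" are captured as-is; this thread does
# not say what they mean, so the group names are deliberately neutral.
BNXT_TIMEOUT = re.compile(
    r"bnxt_en (?P<pci>\S+) (?P<iface>\S+): "
    r"Error \(timeout: (?P<timeout>\d+)\) "
    r"msg \{(?P<first>0x[0-9a-fA-F]+) (?P<second>0x[0-9a-fA-F]+)\} "
    r"len:(?P<length>\d+)"
)


def count_timeouts(lines):
    """Return a Counter mapping interface name to number of timeout errors."""
    counts = Counter()
    for line in lines:
        match = BNXT_TIMEOUT.search(line)
        if match:
            counts[match.group("iface")] += 1
    return counts


# The two lines quoted above, used as sample input.
SAMPLE = [
    "00:03:34 kernel: bnxt_en 0000:19:00.0 em1: Error (timeout: 500) msg {0x23 0x12be} len:0",
    "00:03:34 kernel: bnxt_en 0000:19:00.1 em2: Error (timeout: 500) msg {0x51 0x12d6} len:0",
]

if __name__ == "__main__":
    for iface, n in sorted(count_timeouts(SAMPLE).items()):
        print(f"{iface}: {n} timeout error(s)")
```

To use it on a real host, pass it the lines of a saved kernel log instead of `SAMPLE`. A non-zero count only shows that the message appeared; it does not identify the cause.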
Please let us know which way to proceed from here.
Thank you!
NonOnNon
16 Posts
1
May 8th, 2018 23:00
The solution was to power down the server(s) and *remove* the power cords for 5 minutes.
This forces a re-init of the NIC firmware.
DELL-Josh Cr
Moderator
8.7K Posts
0
March 9th, 2018 13:00
Hi,
Have you tried flashing back to an older BIOS? What version were you running previously? Are they all the same model of NIC?
NonOnNon
16 Posts
0
March 9th, 2018 14:00
Why are my follow-ups erased from this forum?
Below are more lockups with the latest DELL software.
https://pastebin.com/ne446cxs
NonOnNon
16 Posts
0
March 9th, 2018 14:00
I'm going to revert the BIOS on some of the servers.
Please advise on further steps if this doesn't help.
Clearly, running a non-1.3.x BIOS leaves us vulnerable to the latest security issues.
Thanks
DELL-Josh Cr
Moderator
8.7K Posts
0
March 9th, 2018 14:00
Can you private message me the followups if they are not posting?
NonOnNon
16 Posts
0
March 9th, 2018 15:00
Yes. All hardware is the same across the server base.
I reverted a couple of servers back to the 1.2.11 BIOS. We did upgrade Lifecycle Controller / iDRAC as well, but this is surely not connected with iDRAC.
It seems that we should go the full DELL support route.
If you have any other valuable resources, particularly on the network card (Broadcom BCM57412 NetXtreme-E 10Gb Ethernet), please let me know.
Thanks!
DELL-Josh Cr
Moderator
8.7K Posts
0
March 9th, 2018 15:00
If reverting does fix it, we can look into issues with the update. If it doesn't fix it, we can try to find out what is causing the problem. What version were you running previously? Are all the servers using the same model of NIC? Can you create a SupportAssist bundle? https://www.dell.com/support/article/us/en/04/sln306670/poweredge-14g-how-to-manually-create-the-supportassist-collection-?lang=en
NonOnNon
16 Posts
0
March 10th, 2018 01:00
Unfortunately, reverting to the BIOS that was delivered with the servers doesn't work.
Again, the same lockups. Two more lockups, on two different servers with BIOS 1.2.11, are at the link below.
https://pastebin.com/Uh2vrmR0
Thank you
NonOnNon
16 Posts
0
March 12th, 2018 09:00
We created a ticket with DELL; let's see what comes out of it.
As the BIOS is not the issue here, it is probably due to the network card firmware (for the 10G part of the card).
DELL-Josh Cr
Moderator
8.7K Posts
0
March 12th, 2018 09:00
Can you create a SupportAssist bundle on one server with the older BIOS and one with the newer one? https://www.dell.com/support/article/us/en/04/sln306670/poweredge-14g-how-to-manually-create-the-supportassist-collection-?lang=en
Crossix
1 Message
0
May 8th, 2018 23:00
We have encountered a similar problem. Was this ever clearly identified / resolved?
Thanks,
Eddy300G
1 Message
0
October 4th, 2018 13:00
Hello folks:
Has anyone run into the same problem? We are facing this issue on 3 PE R640s. They have the latest firmware, driver and kernel.
BIOS Version: 1.4.9
Broadcom card details
driver: bnxt_en
version: 1.8.0
firmware-version: 20.8.171.0/pkg 20.08.04.04
Note: the power drain did not work for us. The issue still happens.
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.273740] INFO: rcu_sched detected stalls on CPUs/tasks:
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.279431] 35-...!: (0 ticks this GP) idle=9c8/0/0 softirq=777893253/777893253 fqs=0
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.287626] (detected by 18, t=15005 jiffies, g=122605494, c=122605493, q=75199)
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.295304] Sending NMI from CPU 18 to CPUs 35:
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.295330] NMI backtrace for cpu 35 skipped: idling at intel_idle+0x7b/0x130
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.296317] rcu_sched kthread starved for 15007 jiffies! g122605494 c122605493 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=35
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307823] rcu_sched I 0 9 2 0x80000000
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307826] Call Trace:
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307833] __schedule+0x3d6/0x8b0
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307835] schedule+0x36/0x80
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307837] schedule_timeout+0x162/0x370
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307840] ? __next_timer_interrupt+0xe0/0xe0
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307843] rcu_gp_kthread+0x5b4/0x960
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307848] kthread+0x105/0x140
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307849] ? rcu_barrier_sched+0x10/0x10
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307851] ? kthread_destroy_worker+0x50/0x50
Oct 3 16:49:39 ndc-cl-compute16 kernel: [3599014.307853] ret_from_fork+0x35/0x40
Oct 3 16:49:58 ndc-cl-compute16 kernel: [3599033.265801] bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Down
Oct 3 16:49:58 ndc-cl-compute16 kernel: [3599033.271823] bnxt_en 0000:3b:00.0 enp59s0f0: speed changed to 0 for port enp59s0f0
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.569357] bnxt_en 0000:19:00.0 eno1: NIC Link is Down
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.751941] bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.751943] bnxt_en 0000:3b:00.0 enp59s0f0: FEC autoneg off encodings: None
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599034.318323] bnxt_en 0000:19:00.0 eno1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599034.318325] bnxt_en 0000:19:00.0 eno1: FEC autoneg off encodings: None
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.252553] bnxt_en 0000:3b:00.1 enp59s0f1d1: NIC Link is Down
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.258741] bnxt_en 0000:3b:00.1 enp59s0f1d1: speed changed to 0 for port enp59s0f1d1
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.318274] bnxt_en 0000:19:00.1 eno2d1: NIC Link is Down
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.502124] bnxt_en 0000:3b:00.1 enp59s0f1d1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Oct 3 16:50:02 ndc-cl-compute16 kernel: [3599037.502127] bnxt_en 0000:3b:00.1 enp59s0f1d1: FEC autoneg off encodings: None
Oct 3 16:50:03 ndc-cl-compute16 kernel: [3599037.570256] bnxt_en 0000:19:00.1 eno2d1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Oct 3 16:50:03 ndc-cl-compute16 kernel: [3599037.570258] bnxt_en 0000:19:00.1 eno2d1: FEC autoneg off encodings: None
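The excerpt above shows each port going Down and coming back Up within about a second. Below is a rough Python sketch for pulling those Down/Up pairs out of a longer log and measuring how long each port was down. It assumes the bracketed kernel timestamp always has six fractional digits, as in the lines above. The function name `link_flaps` is hypothetical, and the pattern is based only on the bnxt_en messages in this post.

```python
import re

# Matches the "NIC Link is Down/Up" lines shown in the excerpt above.
# Assumes the bracketed kernel timestamp has exactly six fractional digits.
LINK_EVENT = re.compile(
    r"\[\s*(?P<sec>\d+)\.(?P<usec>\d{6})\] "
    r"bnxt_en \S+ (?P<iface>\S+): NIC Link is (?P<state>Down|Up)"
)


def link_flaps(lines):
    """Return a list of (interface, downtime_in_microseconds) tuples.

    Each tuple pairs a "Link is Down" event with the next "Link is Up"
    event on the same interface. Timestamps are kept as integer
    microseconds to avoid floating-point rounding.
    """
    down_at = {}
    flaps = []
    for line in lines:
        match = LINK_EVENT.search(line)
        if not match:
            continue
        iface = match.group("iface")
        stamp = int(match.group("sec")) * 1_000_000 + int(match.group("usec"))
        if match.group("state") == "Down":
            down_at[iface] = stamp
        elif iface in down_at:
            flaps.append((iface, stamp - down_at.pop(iface)))
    return flaps


# Three lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "Oct 3 16:49:58 ndc-cl-compute16 kernel: [3599033.265801] bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Down",
    "Oct 3 16:49:58 ndc-cl-compute16 kernel: [3599033.271823] bnxt_en 0000:3b:00.0 enp59s0f0: speed changed to 0 for port enp59s0f0",
    "Oct 3 16:49:59 ndc-cl-compute16 kernel: [3599033.751941] bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit",
]

if __name__ == "__main__":
    for iface, usec in link_flaps(SAMPLE):
        print(f"{iface}: link down for {usec / 1_000_000:.6f} s")
```

A list of short flaps like this only helps correlate the link resets with the rcu_sched stall; it does not say which one caused the other.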
casoul
1 Message
0
October 10th, 2018 22:00
smougey-pixtel.fr
1 Message
0
October 17th, 2018 03:00
We have the same problem on two different R740s, with both the latest and the previous Broadcom firmware (20.08 or 20.06), on RHEL 7.5.
NonOnNon
16 Posts
0
November 8th, 2018 01:00
Have you reached support or found a solution?