Unsolved
Precision 7560 won't boot with Network Enabled
Hey all,
Work just upgraded my laptop to the new Precision 7560 running Ubuntu 20.04. It seemed to work fine the first day, but after upgrading the installed packages I stopped being able to boot. I'd get past GRUB and just get a blank screen.
Variations of passing `nomodeset` and `acpi=off` to the kernel helped a bit, but not reliably. After a couple OS recovery installs and trying the standard Ubuntu installer, I think I've isolated it to the Intel NIC. If I disable it in the BIOS, I can boot into X every time. This is under the 5.10.0-1034-oem kernel.
When I tried the standard Ubuntu installer (kernel 5.8.0), I was able to boot into X with the NIC enabled. However, the Wi-Fi and Bluetooth devices are not supported under that kernel! I have also updated to the latest BIOS, 1.2.2, without any luck.
Edit to Add: The wired NIC worked fine under kernel 5.8.0 (and under 5.10 if it happens to boot successfully), and all the Onboard Diagnostic tests pass.
I am able to view some error messages from the previous stalled boot using journalctl. It looks like it's related to the device firmware loading/unloading.
Jul 17 16:20:24 talos kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [systemd-udevd:286]
Jul 17 16:20:24 talos kernel: Modules linked in: hid_sensor_custom hid_sensor_hub intel_ishtp_loader intel_ishtp_hid nvidia_drm(PO) nvidia_modeset(PO) hid_generic nvidia(PO) f>
Jul 17 16:20:24 talos kernel: CPU: 1 PID: 286 Comm: systemd-udevd Tainted: P W O 5.10.0-1034-oem #35-Ubuntu
Jul 17 16:20:24 talos kernel: Hardware name: Dell Inc. Precision 7560/0Y1R4H, BIOS 1.2.2 06/29/2021
Jul 17 16:20:24 talos kernel: RIP: 0010:e1000_flash_cycle_ich8lan.constprop.0+0x5c/0x90 [e1000e]
Jul 17 16:20:24 talos kernel: Code: 00 00 0b 76 42 c1 e0 10 89 42 04 bb 81 96 98 00 eb 0f bf c7 10 00 00 e8 32 e8 0b e7 83 eb 01 74 10 49 8b 44 24 10 66 8b 40 04 <41> 89 c5 a8>
Jul 17 16:20:24 talos kernel: RSP: 0018:ffffa40780def920 EFLAGS: 00000202
Jul 17 16:20:24 talos kernel: RAX: ffffa40780bc4028 RBX: 0000000000353713 RCX: 0000000000000001
Jul 17 16:20:24 talos kernel: RDX: 0000000000000a38 RSI: 0000000000000001 RDI: 0000000000000a1f
Jul 17 16:20:24 talos kernel: RBP: ffffa40780def938 R08: 0000001c8a9dd609 R09: 0000000000000001
Jul 17 16:20:24 talos kernel: R10: ffff8d6a56ef1030 R11: 0000000000000000 R12: ffff8d6a56ef0f38
Jul 17 16:20:24 talos kernel: R13: 0000000080bc4028 R14: 0000000000000000 R15: ffff8d6a56ef0f38
Jul 17 16:20:24 talos kernel: FS: 00007fa01c36e880(0000) GS:ffff8d798fc40000(0000) knlGS:0000000000000000
Jul 17 16:20:24 talos kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 17 16:20:24 talos kernel: CR2: 000055644907169c CR3: 0000000118780002 CR4: 0000000000770ee0
Jul 17 16:20:24 talos kernel: PKRU: 55555554
Jul 17 16:20:24 talos kernel: Call Trace:
Jul 17 16:20:24 talos kernel: e1000_erase_flash_bank_ich8lan+0xa0/0x1a0 [e1000e]
Jul 17 16:20:24 talos kernel: e1000_update_nvm_checksum_spt+0x1f5/0x340 [e1000e]
Jul 17 16:20:24 talos kernel: e1000_validate_nvm_checksum_ich8lan+0xa1/0xd0 [e1000e]
Jul 17 16:20:24 talos kernel: e1000_probe+0x65f/0xc90 [e1000e]
Jul 17 16:20:24 talos kernel: local_pci_probe+0x48/0x80
Jul 17 16:20:24 talos kernel: pci_device_probe+0x10f/0x1c0
Jul 17 16:20:24 talos kernel: really_probe+0xfb/0x420
Jul 17 16:20:24 talos kernel: driver_probe_device+0xe9/0x160
Jul 17 16:20:24 talos kernel: device_driver_attach+0x5d/0x70
Jul 17 16:20:24 talos kernel: __driver_attach+0x8f/0x150
Jul 17 16:20:24 talos kernel: ? device_driver_attach+0x70/0x70
Jul 17 16:20:24 talos kernel: bus_for_each_dev+0x7e/0xc0
Jul 17 16:20:24 talos kernel: driver_attach+0x1e/0x20
Jul 17 16:20:24 talos kernel: bus_add_driver+0x152/0x1f0
Jul 17 16:20:24 talos kernel: driver_register+0x74/0xd0
Jul 17 16:20:24 talos kernel: ? 0xffffffffc0773000
Jul 17 16:20:24 talos kernel: __pci_register_driver+0x54/0x60
Jul 17 16:20:24 talos kernel: e1000_init_module+0x3b/0x1000 [e1000e]
Jul 17 16:20:24 talos kernel: do_one_initcall+0x48/0x1d0
Jul 17 16:20:24 talos kernel: ? _cond_resched+0x19/0x30
Jul 17 16:20:24 talos kernel: ? kmem_cache_alloc_trace+0x37a/0x430
Jul 17 16:20:24 talos kernel: ? do_init_module+0x28/0x250
Jul 17 16:20:24 talos kernel: do_init_module+0x62/0x250
Jul 17 16:20:24 talos kernel: load_module+0x11ac/0x1370
Jul 17 16:20:24 talos kernel: ? security_kernel_post_read_file+0x5c/0x70
Jul 17 16:20:24 talos kernel: ? security_kernel_post_read_file+0x5c/0x70
Jul 17 16:20:24 talos kernel: __do_sys_finit_module+0xc2/0x120
Jul 17 16:20:24 talos kernel: ? __do_sys_finit_module+0xc2/0x120
Jul 17 16:20:24 talos kernel: __x64_sys_finit_module+0x1a/0x20
Jul 17 16:20:24 talos kernel: do_syscall_64+0x38/0x90
Jul 17 16:20:24 talos kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 17 16:20:24 talos kernel: RIP: 0033:0x7fa01c8f089d
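For anyone comparing traces from several failed boots, here is a minimal Python sketch (not something used in this thread) that pulls the stuck task and the faulting symbol/module out of journal text such as the output of `journalctl -k -b -1`. The function name `summarize_lockups` and the regular expressions are illustrative assumptions based only on the line shapes in the log above.

```python
import re

# "BUG: soft lockup - CPU#1 stuck for 23s! [systemd-udevd:286]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<task>[^\]]+)\]"
)
# "RIP: 0010:symbol+0x5c/0x90 [module]"; the user-space RIP line has no "+0x" part
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)

def summarize_lockups(log_text):
    """Return one dict per soft lockup with the first kernel RIP that follows it."""
    reports = []
    current = None
    for line in log_text.splitlines():
        lockup = LOCKUP_RE.search(line)
        if lockup:
            current = {
                "cpu": int(lockup["cpu"]),
                "seconds": int(lockup["secs"]),
                "task": lockup["task"],
                "symbol": None,
                "module": None,
            }
            reports.append(current)
            continue
        rip = RIP_RE.search(line)
        if rip and current is not None and current["symbol"] is None:
            current["symbol"] = rip["symbol"]
            current["module"] = rip["module"]  # None for built-in kernel code
    return reports

# Three lines copied from the log above.
sample = """\
Jul 17 16:20:24 talos kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [systemd-udevd:286]
Jul 17 16:20:24 talos kernel: RIP: 0010:e1000_flash_cycle_ich8lan.constprop.0+0x5c/0x90 [e1000e]
Jul 17 16:20:24 talos kernel: RIP: 0033:0x7fa01c8f089d
"""

for report in summarize_lockups(sample):
    print(report)
```

In the trace above, the module in brackets is e1000e, which is consistent with the boot only succeeding when the NIC is disabled in the BIOS.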
Any suggestions would be appreciated! In the meantime I guess I'm running WiFi-only or maybe using a Thunderbolt dock.
Thanks,
Dean
xerotope
July 19th, 2021 13:00
I had some luck using a custom-compiled version of the e1000e module, built from the 3.8.7 release on sourceforge.net. I did have to disable an Ubuntu version check and Secure Boot (until I enroll my own MOK). But it boots, and the wired NIC seems to work!
xerotope
July 21st, 2021 19:00
Found a bug report on the Linux Kernel bug tracker here: https://bugzilla.kernel.org/show_bug.cgi?id=213667
The included patches also seem to fix the issue. Hopefully those make it into the OEM kernel soon!
xerotope
October 28th, 2021 18:00
As an update, a fix did make it into the OEM kernel. So now it won't hang on boot trying to write a correct checksum to the NIC NVM. Yay! https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1936998
But, while it boots right up, it won't load the e1000e driver module if the NVM checksum is bad. Booo! I'm still running a custom patched e1000e module (through DKMS) until Dell and our cyber-support team sort this out.
marrcin
November 1st, 2021 11:00
Is there any hope that someone will fix this? I don't think we can expect another kernel patch to help when the NVM checksum itself is wrong.
Windows doesn't complain and the network card works, so it probably ignores the bad checksum by default.
pindi
November 21st, 2021 01:00
Hi,
I have the same problem on my Precision 7560 with Ubuntu 20.04.
The story: after the first installation of Ubuntu 20.04 (and after updating the kernel to 5.11.0-38-generic), the NIC worked but Wi-Fi did not (Bluetooth was OK). After some searching, I had to remove the file iwlwifi-ty-a0-gf-a0.pnvm, and then everything was OK, except for a problem that forced me to replace the motherboard.
After the motherboard replacement, SBAM! The NIC didn't work anymore: problem was
Then, after upgrading to kernel 5.11.0-41-generic, I installed the related DKMS fix https://github.com/koljah-de/e1000e-dkms-debian/releases and after a reboot the NIC worked again.
Here is the problem: not every reboot succeeds now. Sometimes the system doesn't boot (with the error BUG: soft lockup - CPU#1 stuck for 23s!) or starts very slowly. If I force a restart during this time instead (holding the power button for 10 seconds), everything proceeds and Ubuntu starts as usual, with the NIC working as expected.
Note that every time I enter the BIOS and exit, the next boot fails, ALWAYS! This also prevents me from booting from an ISO, because the Ubuntu ISOs fail to start with the same problem (so I cannot test whether the NIC works with a different version of Ubuntu).
At the moment, as suggested by xerotope, I am forced to disable the NIC in the BIOS and work only with the Wi-Fi module.
Now some questions:
Thanks in advance!
xerotope
November 25th, 2021 06:00
@pindi , here's my current configuration steps:
* Use the Ubuntu OEM kernel packages. This is what the Dell ISO includes, but you can also install it just through apt-get. I'm using linux-oem-20.04c which is based on the 5.13 kernel line.
* Additional Dell OEM packages include oem-somerville-meta and oem-somerville-factory-meta
Now, the OEM kernels include the fix referenced in launchpad. However, if your NVM checksum is already bad, it will still return the error message when trying to load the e1000e module. As far as I can tell, this problem is caused either by a bad version of the driver trying to write the checksum in some kernels or a factory error programming the NIC.
My current workaround is to manually patch the driver kernel module using the patches from the original kernel bug report https://bugzilla.kernel.org/show_bug.cgi?id=213667 . To keep it up to date, I made a DKMS script to rebuild it. However, I hacked this process together, so I don't currently have any step-by-step instructions.
If I get it fully automated I'll try and get a gist or something on Github and link it here. Good luck!
xerotope
November 25th, 2021 07:00
Yeah, the bugfix just stops it from hard locking, but the checksum is still checked and the module won't load.
I'm not sure about a timetable for a fix. My company's IT team has been in contact with Dell, and the issue has been escalated to a higher-level Linux engineering team and confirmed. But there's no path to a fix yet (other than USB-C Ethernet dongles, yuck).
I also tried the ethtool/bootutil route, and it seems the consensus is the EEPROM is read-only, even if write protection is disabled in the module parameters.
So for now, patching the kernel module to bypass the checksum check entirely is my workaround.
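For readers wondering what "bad checksum" means here: to the best of my knowledge, the generic e1000e check sums the first 0x40 16-bit NVM words (word 0x3F holds the checksum itself) and expects the low 16 bits of the result to equal 0xBABA. The Python sketch below illustrates that rule only. The helper names are hypothetical, the little-endian layout of a raw dump is an assumption, and the ich8lan-specific handling seen in the call trace is not modelled.

```python
NVM_SUM = 0xBABA          # value the 16-bit sum is expected to reach
NVM_CHECKSUM_WORD = 0x3F  # index of the word that stores the checksum

def words_from_bytes(raw):
    """Decode 16-bit words from a raw dump (little-endian layout assumed)."""
    if len(raw) % 2:
        raise ValueError("dump length must be an even number of bytes")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def _require_full_block(words):
    if len(words) <= NVM_CHECKSUM_WORD:
        raise ValueError("need at least 0x40 words to check the checksum")

def checksum_is_valid(words):
    """True if words 0x00..0x3F sum to NVM_SUM modulo 2**16."""
    _require_full_block(words)
    return sum(words[:NVM_CHECKSUM_WORD + 1]) & 0xFFFF == NVM_SUM

def expected_checksum_word(words):
    """Value word 0x3F would need for the sum to come out right."""
    _require_full_block(words)
    partial = sum(words[:NVM_CHECKSUM_WORD]) & 0xFFFF
    return (NVM_SUM - partial) & 0xFFFF

# Made-up NVM contents, purely for illustration (not from a real 7560).
words = [0x1234] * NVM_CHECKSUM_WORD + [0x0000]
print(checksum_is_valid(words))            # checksum word is wrong here
words[NVM_CHECKSUM_WORD] = expected_checksum_word(words)
print(checksum_is_valid(words))            # sum now matches NVM_SUM
```

This only shows why the driver rejects the device. It does not make the NVM writable, which is the part that fails on these machines.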
pindi
November 25th, 2021 07:00
Hi @xerotope
If I understand correctly, the bugfix released in these kernel versions resolves the hang during OS boot, but does not bypass the "The NVM Checksum Is Not Valid" error, right?
And for some reason, my old motherboard probably had a NIC with a correct checksum, while the current one does not.
But do you think a complete fix for this problem might be released soon?
Unfortunately, it does not even seem possible to change the checksum with ethtool or bootutil64e (probably because writes are blocked --> "Unable to write default configuration to EEPROM").