NVIDIA A100 in Dell PowerEdge R750 with Ubuntu 20.04
Hello,
I'm trying to set up Ubuntu 20.04 LTS on a Dell PowerEdge R750 with an NVIDIA A100 80GB PCIe accelerator. I'm using the drivers provided by NVIDIA and certified for this server. Unfortunately, the NVIDIA driver crashes during system startup.
According to the documentation, both Ubuntu 20.04 and the NVIDIA A100 should be compatible with the server.
The server is: Dell Inc. PowerEdge R750/06V45N, BIOS 1.4.4 10/07/2021
Should I change something in the BIOS configuration or download a special driver from Dell?
Below are parts of the dmesg log:
[ 1.910999] pci 0000:ca:00.0: BAR 8: no space for [mem size 0x1400000000 64bit pref]
[ 1.911001] pci 0000:ca:00.0: BAR 8: failed to assign [mem size 0x1400000000 64bit pref]
[ 1.911004] pci 0000:ca:00.0: BAR 10: no space for [mem size 0x28000000 64bit pref]
[ 1.911006] pci 0000:ca:00.0: BAR 10: failed to assign [mem size 0x28000000 64bit pref]
[ 1.911008] pci 0000:ca:00.0: BAR 7: no space for [mem size 0x00500000]
[ 1.911009] pci 0000:ca:00.0: BAR 7: failed to assign [mem size 0x00500000]
[ ...more similar messages follow. Then, when the system tries to load the NVIDIA driver: ]
[ 19.732655] BUG: kernel NULL pointer dereference, address: 0000000000000228
[ 19.739101] #PF: supervisor write access in kernel mode
[ 19.745161] #PF: error_code(0x0002) - not-present page
[ 19.752704] PGD 10e066067 P4D 10e067067 PUD 1129ac067 PMD 0
[ 19.760671] Oops: 0002 [#1] SMP NOPTI
[ 19.768071] CPU: 10 PID: 1638 Comm: nvidia-persiste Tainted: P OE 5.11.0-27-generic #29~20.04.1-Ubuntu
[ 19.775043] Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.4.4 10/07/2021
[ 19.775044] RIP: 0010:_nv029611rm+0xa0/0xd0 [nvidia]
[ 19.787895] Code: 83 c3 01 44 89 ee 48 8b 47 70 e8 0b bc 54 e8 48 83 fb 07 75 bd 48 83 c4 08 44 89 f0 5b 41 5c 41 5d 41 5e 48 83 c5 10 c3 66 90 <48> 89 34 25 28 02 00 00 0f 0b 66 0f 1f 44 00 00 be 00 00 2c 02 bf
[ 19.787898] RSP: 0018:ff53ca538b0c3950 EFLAGS: 00010246
[ 19.787898] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffc33a10b0
[ ...and the rest of the dump ]
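For anyone sifting through a longer dmesg capture, the failed BAR assignments can be pulled out with a few lines of Python. This is only a sketch: it assumes the message format shown in the excerpt above ("pci <address>: BAR <n>: failed to assign [mem size 0x...]"), and the names find_bar_failures and SAMPLE are made up for illustration, not part of any existing tool.

```python
import re

# Matches lines such as:
#   pci 0000:ca:00.0: BAR 8: failed to assign [mem size 0x1400000000 64bit pref]
# The format is assumed from the dmesg excerpt in this thread.
BAR_FAILURE = re.compile(
    r"pci (?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): failed to assign \[mem size (?P<size>0x[0-9a-f]+)"
)

# Sample input copied from the log lines quoted above.
SAMPLE = """\
[ 1.910999] pci 0000:ca:00.0: BAR 8: no space for [mem size 0x1400000000 64bit pref]
[ 1.911001] pci 0000:ca:00.0: BAR 8: failed to assign [mem size 0x1400000000 64bit pref]
[ 1.911004] pci 0000:ca:00.0: BAR 10: no space for [mem size 0x28000000 64bit pref]
[ 1.911006] pci 0000:ca:00.0: BAR 10: failed to assign [mem size 0x28000000 64bit pref]
[ 1.911008] pci 0000:ca:00.0: BAR 7: no space for [mem size 0x00500000]
[ 1.911009] pci 0000:ca:00.0: BAR 7: failed to assign [mem size 0x00500000]
"""


def find_bar_failures(dmesg_text):
    """Return a list of (pci_address, bar_number, size_in_bytes) tuples."""
    return [
        (m.group("device"), int(m.group("bar")), int(m.group("size"), 16))
        for m in BAR_FAILURE.finditer(dmesg_text)
    ]


if __name__ == "__main__":
    for device, bar, size in find_bar_failures(SAMPLE):
        # Print the requested window size in MiB for easier reading.
        print(f"{device} BAR {bar}: {size / 2**20:.0f} MiB could not be assigned")
```

To run it against a live system, replace SAMPLE with the saved output of dmesg.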
Dell- Maria J
Moderator
January 31st, 2022 07:00
Hello brqd,
Thank you for your replies. I've checked the screenshot of the TPM Advanced Settings; all parameters are correct and nothing needs to be changed.
I'm sorry that the troubleshooting steps performed so far didn't resolve the issue.
I've just checked the Ubuntu Server 20.04 LTS for Dell EMC PowerEdge Servers Release Notes and found this on page 11 (Known issues):
NVIDIA out-of-box driver fails to load when system has NVIDIA GPGPUs on Ubuntu 20.04
https://dell.to/3GfO2KE
Workaround: Pass pci=realloc=off kernel parameter.
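On Ubuntu, kernel parameters are normally set through the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub, followed by running sudo update-grub and rebooting. As a rough sketch of that edit, the Python helper below adds the parameter to the text of such a file and checks a kernel command line for it. It assumes the stock layout with a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT line; the function names add_kernel_param and param_is_active are hypothetical, and nothing is written to disk.

```python
import re

PARAM = "pci=realloc=off"  # workaround quoted from the release notes

# Assumption: one double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, as in a
# default Ubuntu /etc/default/grub.
_CMDLINE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)


def add_kernel_param(grub_text, param=PARAM):
    """Return grub_text with param appended to GRUB_CMDLINE_LINUX_DEFAULT."""

    def _append(match):
        params = match.group(2).split()
        if param not in params:  # keep the edit idempotent
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    new_text, count = _CMDLINE.subn(_append, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT line not found")
    return new_text


def param_is_active(cmdline, param=PARAM):
    """Check whether param appears in a kernel command line string."""
    return param in cmdline.split()


if __name__ == "__main__":
    # Illustrative file content only; a real file has more settings.
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(sample))
```

After the reboot, the contents of /proc/cmdline can be passed to param_is_active (or simply inspected with cat /proc/cmdline) to confirm the parameter took effect.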
Thanks,
Dell- Maria J
Moderator
January 31st, 2022 00:00
Hello brqd,
I'm sorry you ran into this issue. I've checked the documentation, and the NVIDIA A100 is compatible with the PowerEdge R750, as you noted. I would like to go through the following troubleshooting steps with you:
-Is the NVIDIA A100 installed in slot 7 or 2, as shown in the Dell EMC PowerEdge R750 Installation and Service Manual (Table 5)?
https://dell.to/3AT7nAw
-Did you install this driver?
https://dell.to/3rgcckg
-Could you please check whether Secure Boot is disabled in the BIOS?
An overview of Secure Boot:
https://dell.to/3AKHd2B
-Could you please also share which version of iDRAC is installed? Support for the NVIDIA A100 80GB PCIe GPU in the PowerEdge R750, PowerEdge R750xa, and PowerEdge R7525 was added in iDRAC version 5.00.10.20:
https://dell.to/3Hg45JX
Here you can download the latest version of iDRAC:
https://dell.to/3HeICRB
Please let me know if you have any questions.
Thank you.
brqd
January 31st, 2022 01:00
Hello Maria,
My answers to your questions are below:
It's in slot 7 (installed by Dell).
I've installed the same driver (470.82.01), but from the archive mentioned in the NVIDIA installation guide: https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html
Secure Boot is disabled. Please take a look at the other parameters related to Secure Boot:
My iDRAC firmware version is 5.00.20.00.
It's not the latest version, but the number is higher than 5.00.10.20. Should I update to the latest one?
Thank you
brqd
January 31st, 2022 06:00
I have reinstalled the NVIDIA drivers from the link you sent and updated iDRAC to the newest version, 5.10.00.00.
Unfortunately, nothing has changed.
brqd
January 31st, 2022 07:00
That solved my problem.
Thank you very much!