January 29th, 2022 03:00

NVIDIA A100 in Dell PowerEdge R750 with Ubuntu 20.04

Hello, 

I'm trying to configure Ubuntu 20.04 LTS on a Dell PowerEdge R750 with an NVIDIA A100 80GB PCIe accelerator. I'm using the drivers provided by NVIDIA and certified for this server. Unfortunately, the NVIDIA driver crashes during system startup.

According to the documentation, both Ubuntu 20.04 and the NVIDIA A100 should be compatible with the server.

I'm using server: Dell Inc. PowerEdge R750/06V45N, BIOS 1.4.4 10/07/2021

Should I change something in the BIOS configuration or download a special driver from Dell?

 

Below are parts of the dmesg log:

[ 1.910999] pci 0000:ca:00.0: BAR 8: no space for [mem size 0x1400000000 64bit pref]
[ 1.911001] pci 0000:ca:00.0: BAR 8: failed to assign [mem size 0x1400000000 64bit pref]
[ 1.911004] pci 0000:ca:00.0: BAR 10: no space for [mem size 0x28000000 64bit pref]
[ 1.911006] pci 0000:ca:00.0: BAR 10: failed to assign [mem size 0x28000000 64bit pref]
[ 1.911008] pci 0000:ca:00.0: BAR 7: no space for [mem size 0x00500000]
[ 1.911009] pci 0000:ca:00.0: BAR 7: failed to assign [mem size 0x00500000]

[ ...more similar messages follow. When the system tries to load the NVIDIA drivers: ]

[ 19.732655] BUG: kernel NULL pointer dereference, address: 0000000000000228
[ 19.739101] #PF: supervisor write access in kernel mode
[ 19.745161] #PF: error_code(0x0002) - not-present page
[ 19.752704] PGD 10e066067 P4D 10e067067 PUD 1129ac067 PMD 0
[ 19.760671] Oops: 0002 [#1] SMP NOPTI
[ 19.768071] CPU: 10 PID: 1638 Comm: nvidia-persiste Tainted: P OE 5.11.0-27-generic #29~20.04.1-Ubuntu
[ 19.775043] Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.4.4 10/07/2021
[ 19.775044] RIP: 0010:_nv029611rm+0xa0/0xd0 [nvidia]
[ 19.787895] Code: 83 c3 01 44 89 ee 48 8b 47 70 e8 0b bc 54 e8 48 83 fb 07 75 bd 48 83 c4 08 44 89 f0 5b 41 5c 41 5d 41 5e 48 83 c5 10 c3 66 90 <48> 89 34 25 28 02 00 00 0f 0b 66 0f 1f 44 00 00 be 00 00 2c 02 bf
[ 19.787898] RSP: 0018:ff53ca538b0c3950 EFLAGS: 00010246
[ 19.787900] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffc33a10b0

[ ...and the rest of dump]
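
To make the BAR failures above easier to read, here is a minimal Python sketch that pulls the "failed to assign" lines out of a dmesg dump and converts the hexadecimal sizes into binary units. The names `failed_bars` and `human` are hypothetical helpers, and the regular expression only assumes the line format shown in this log.

```python
import re

# Matches kernel lines such as:
#   pci 0000:ca:00.0: BAR 8: failed to assign [mem size 0x1400000000 64bit pref]
BAR_FAIL = re.compile(
    r"pci (?P<dev>[0-9a-f:.]+): BAR (?P<bar>\d+): failed to assign "
    r"\[mem size (?P<size>0x[0-9a-f]+)"
)


def failed_bars(dmesg_text):
    """Return (device, BAR index, size in bytes) for each failed assignment."""
    return [
        (m["dev"], int(m["bar"]), int(m["size"], 16))
        for m in BAR_FAIL.finditer(dmesg_text)
    ]


def human(size):
    """Format a byte count using binary units (B, KiB, MiB, GiB)."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:g} {unit}"
        size /= 1024


# Lines copied from the log in this post.
sample = """\
[ 1.911001] pci 0000:ca:00.0: BAR 8: failed to assign [mem size 0x1400000000 64bit pref]
[ 1.911006] pci 0000:ca:00.0: BAR 10: failed to assign [mem size 0x28000000 64bit pref]
[ 1.911009] pci 0000:ca:00.0: BAR 7: failed to assign [mem size 0x00500000]
"""

for dev, bar, size in failed_bars(sample):
    print(f"{dev} BAR {bar}: {human(size)}")
# 0000:ca:00.0 BAR 8: 80 GiB
# 0000:ca:00.0 BAR 10: 640 MiB
# 0000:ca:00.0 BAR 7: 5 MiB
```

In other words, the largest region the kernel could not place for 0000:ca:00.0 is 80 GiB, which points at a PCI address-space allocation problem during boot rather than at the driver package itself.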

Moderator

278 Posts

January 31st, 2022 07:00

Hello brqd,

 

Thank you for your replies. I've checked the screenshot with the TPM Advanced Settings; all parameters are correct and nothing needs to be changed.

 

I'm sorry that the troubleshooting steps performed so far didn't resolve the issue.

I've just checked Ubuntu Server 20.04 LTS for Dell EMC PowerEdge Servers Release Notes and found this on page 11 (Known issues):

 

NVIDIA out-of-box driver fails to load when system has NVIDIA GPGPUs on Ubuntu 20.04

 

https://dell.to/3GfO2KE

 

Workaround: Pass pci=realloc=off kernel parameter.
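
As a rough illustration of applying this workaround on Ubuntu, the Python sketch below adds the parameter to the `GRUB_CMDLINE_LINUX_DEFAULT` line of a GRUB defaults file and checks a kernel command line for it. It assumes a stock Ubuntu GRUB setup where that file is `/etc/default/grub`; the function names `add_kernel_param` and `has_kernel_param` are hypothetical, and the example works on sample strings rather than touching the real file.

```python
import re


def add_kernel_param(grub_text, param="pci=realloc=off"):
    """Append param to GRUB_CMDLINE_LINUX_DEFAULT unless it is already there."""

    def patch(match):
        args = match.group(2).split()
        if param not in args:
            args.append(param)
        return match.group(1) + '"' + " ".join(args) + '"'

    return re.sub(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"([^"]*)"',
        patch,
        grub_text,
        flags=re.MULTILINE,
    )


def has_kernel_param(cmdline, param="pci=realloc=off"):
    """Check a kernel command line (contents of /proc/cmdline) for param."""
    return param in cmdline.split()


# Sample file contents; a real /etc/default/grub will have more settings.
example = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(example))
```

On a real system the patched text would be written back to `/etc/default/grub` as root (after taking a backup), followed by `sudo update-grub` and a reboot. After the reboot, passing the contents of `/proc/cmdline` to `has_kernel_param` should confirm that the kernel was started with the parameter.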

 

Thanks,

Moderator

278 Posts

January 31st, 2022 00:00

Hello brqd,

 

I am sorry you are facing this issue. I've checked the documentation, and the NVIDIA A100 is compatible with the PowerEdge R750, as you also noted. I would like to go through the following troubleshooting steps with you:

 

- Is the NVIDIA A100 installed in slot 7 or 2, as shown in the Dell EMC PowerEdge R750 Installation and Service Manual (Table 5)?

https://dell.to/3AT7nAw

 

- Did you install this driver?

 https://dell.to/3rgcckg

 

- Could you please check whether BIOS Secure Boot is disabled?

An overview of secure boot:

https://dell.to/3AKHd2B

 

- Could you please also share which version of iDRAC is installed? iDRAC version 5.00.10.20 added support for the NVIDIA A100 80GB PCIe GPU in the PowerEdge R750, PowerEdge R750xa, and PowerEdge R7525:

https://dell.to/3Hg45JX

 

Here you can download the latest version of iDRAC:

https://dell.to/3HeICRB

 

Please let me know if you have any questions.

 

Thank you.

4 Posts

January 31st, 2022 01:00

Hello Maria,

My answers are below each quoted question:

- Is the NVIDIA A100 installed in slot 7 or 2, as shown in the Dell EMC PowerEdge R750 Installation and Service Manual (Table 5)?

https://dell.to/3AT7nAw


It's in slot 7 (installed by Dell).


- Did you install this driver?

 https://dell.to/3rgcckg

I've installed the same driver (470.82.01), but from the archive mentioned in the NVIDIA installation guide: https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html

- Could you please check whether BIOS Secure Boot is disabled?

An overview of secure boot:

https://dell.to/3AKHd2B


Secure Boot is disabled. Please take a look at the other parameters related to Secure Boot:

Secure_boot_screenshot.png


- Could you please also share which version of iDRAC is installed? iDRAC version 5.00.10.20 added support for the NVIDIA A100 80GB PCIe GPU in the PowerEdge R750, PowerEdge R750xa, and PowerEdge R7525:

https://dell.to/3Hg45JX

My iDRAC Firmware Version is 5.00.20.00

It's not the latest version but the number is higher than 5.00.10.20. Should I update to the latest one?

 

Thank you

4 Posts

January 31st, 2022 06:00

I have reinstalled the NVIDIA drivers from the link you sent and updated iDRAC to the newest version, 5.10.00.00.

Unfortunately, nothing has changed.

4 Posts

January 31st, 2022 07:00

That solved my problem.

Thank you very much!
