Dell Automation: Troubleshooting NativeEdge Detection of Multiple NVIDIA GPUs in UEFI VMs
Summary: This article explains an issue where the kernel drivers fail to detect one of the NVIDIA Graphics Processing Units (GPUs) in a Unified Extensible Firmware Interface (UEFI) Virtual Machine (VM) configured with multiple GPU passthroughs.
Symptoms
The issue occurs when using Ubuntu 22.04 and passing two or more GPUs to the VM.
It occurs because the system allocates only 64 GB of MMIO space, which is not enough to support two or more GPUs, resulting in a failed BAR allocation.
The symptoms of this issue include:
- The kernel drivers do not detect one of the NVIDIA GPUs
- The dmesg log shows an error message indicating that the PCI I/O region assigned to the NVIDIA device is invalid
- The lspci -nnk output shows that the kernel driver is not in use for the affected GPU
Example of dmesg output for a failing device:
[ 268.756466] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:09:00.0)
[ 268.756481] nvidia: probe of 0000:09:00.0 failed with error -1
[ 268.756592] NVRM: The NVIDIA probe routine failed for 1 device(s).
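The lspci symptom can be confirmed by comparing the number of NVIDIA functions the bus reports against the number bound to the nvidia driver. The sample output below is illustrative; the PCI addresses and device IDs will differ on your system:

```shell
# Sample output (illustrative) as captured from: lspci -nnk -d 10de:
sample='08:00.0 3D controller [0302]: NVIDIA Corporation AD104GL [L4] [10de:27b8]
	Kernel driver in use: nvidia
09:00.0 3D controller [0302]: NVIDIA Corporation AD104GL [L4] [10de:27b8]'

# Count detected NVIDIA functions vs. those bound to the nvidia driver.
gpus=$(printf '%s\n' "$sample" | grep -c 'NVIDIA Corporation')
bound=$(printf '%s\n' "$sample" | grep -c 'Kernel driver in use: nvidia')
echo "NVIDIA GPUs: $gpus, bound to nvidia driver: $bound"
```

If the bound count is lower than the GPU count, at least one device failed its BAR allocation and its probe, matching the dmesg output above.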
Cause
The cause of this issue is the default MMIO space allocation of 64 GB, which is not sufficient to support two or more GPUs. In UEFI VMs, the OVMF firmware sets a default MMIO space of 32 GB if it is not overridden.
The affected GPU, the NVIDIA L4, occupies a BAR size of 32 GB, so two or more such GPUs require more MMIO space than is allocated, resulting in a failed BAR allocation.
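For context, OVMF exposes an experimental fw_cfg knob, opt/ovmf/X-PciMmio64Mb, that sets the 64-bit PCI MMIO aperture in MB when the VM is launched directly with QEMU. The sketch below assumes direct QEMU access and illustrative firmware and device paths; NativeEdge deployments may not expose this command line:

```shell
# Sketch: enlarge OVMF's 64-bit PCI MMIO aperture to 128 GB (131072 MB)
# so that two 32 GB GPU BARs fit with room to spare.
# Firmware path and PCI address below are illustrative.
qemu-system-x86_64 \
  -machine q35,accel=kvm -m 64G \
  -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd \
  -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=131072 \
  -device vfio-pci,host=0000:09:00.0
```

With a 64 GB aperture, two 32 GB BARs alone consume the entire window, leaving nothing for the devices' other BARs or alignment; this is why the second GPU fails to map.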
Resolution
The issue is resolved in the NativeEdge 4.2 release, which introduces revamped logic that determines the MMIO space size dynamically. As a workaround, users can switch the VM to legacy BIOS mode.
A future update will also allow users to override the MMIO space size if further issues arise.
Users running the NativeEdge 4.1 release are advised to upgrade to NativeEdge 4.2 to resolve this issue.
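Before applying the legacy BIOS workaround, the guest's current firmware mode can be confirmed from inside the VM; on Linux, the presence of /sys/firmware/efi indicates a UEFI boot:

```shell
# Report whether this Linux guest booted via UEFI or legacy BIOS.
if [ -d /sys/firmware/efi ]; then
  mode=UEFI
else
  mode=BIOS
fi
echo "Boot mode: $mode"
```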