PowerEdge: XE9680 - GPUs Disappear From Host
Summary: One or more NVIDIA HGX GPUs may disappear from XE9680 hosts with Xid and Bus Fatal Error messages.
Symptoms
Typical events and alerts that indicate this issue include:
- Host kernel logs showing Xid 74 and/or Xid 79 errors from the NVIDIA driver. On Linux, Xid error messages are written to /var/log/messages. Use the command grep "NVRM: Xid" /var/log/messages to locate all Xid messages. Example of an Xid string:
[...] NVRM: Xid (PCI:0000:1a:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
- Dell Lifecycle Logs may report a PCI1318 Bus Fatal Error:
PCI1318: A fatal error was detected on a component at bus <X> device <Y> function <Z>
- The issue may also be transient:
  - GPUs may reappear after an AC power cycle.
  - The number of affected GPUs, and the slots in which they disappear, may change between occurrences.
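The grep command above lists every Xid message. When several GPUs are involved, it can help to keep only the Xid 74 and 79 events named in this article and group them by PCI address, so you can see which GPUs are affected. The following is a minimal Python sketch, assuming the log lines follow the format of the example Xid string shown above. The helper names `find_relevant_xids` and `scan_log` are hypothetical and are not part of any NVIDIA or Dell tool.

```python
import re
from collections import defaultdict

# Assumption: lines follow the format of the example in this article, e.g.
# NVRM: Xid (PCI:0000:1a:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")

# Xid codes that this article associates with GPUs disappearing from the host.
RELEVANT_XIDS = {74, 79}


def find_relevant_xids(lines):
    """Hypothetical helper: return {pci_address: sorted Xid codes} for Xid 74/79 lines."""
    hits = defaultdict(set)
    for line in lines:
        match = XID_PATTERN.search(line)
        if match is None:
            continue  # not an NVIDIA Xid line in the expected format
        pci_address, xid = match.group(1), int(match.group(2))
        if xid in RELEVANT_XIDS:
            hits[pci_address].add(xid)
    return {addr: sorted(codes) for addr, codes in hits.items()}


def scan_log(path="/var/log/messages"):
    """Hypothetical helper: read a kernel log file and report Xid 74/79 events per GPU."""
    with open(path, errors="replace") as log:
        return find_relevant_xids(log)
```

Calling `scan_log()` on a host whose log contains only the example line above would return `{"0000:1a:00": [79]}`. The default path is the one this article gives for Linux; pass a different path if your distribution writes kernel messages elsewhere.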
Cause
Xid and Bus Fatal Error messages are often caused by PCIe Retimer errors, which can result from signal integrity issues within the PCIe architecture.
Resolution
Dell has released updated GPU accelerator firmware. This update includes enhancements to the PCIe Retimers that improve signal integrity and help prevent GPUs from falling off the bus. If bus fatal errors are present, first update to the firmware versions listed below or newer.
| GPU | Accelerator Bundle Version | Retimer Version |
|---|---|---|
| H100 | 20.24.07.10 - FW 1.5.0 | 2.10.42 |
| H200 | 20.24.07.10 - FW 1.5.0 | 2.10.42 |
| H800 | TBD* | TBD* |
| H20 | TBD* | TBD* |
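Because the guidance is "these versions or newer", installed versions have to be compared numerically rather than as text (as strings, "2.10.5" would sort after "2.10.42"). The following is a minimal Python sketch of that comparison, assuming you have already read the installed bundle and retimer versions from your own firmware inventory. It compares only the dotted bundle number (not the "FW 1.5.0" part), omits H800 and H20 because their minimums are still TBD, and uses hypothetical helper names.

```python
# Minimum versions from the table above (bundle number only; "FW 1.5.0" is not compared).
# H800 and H20 are omitted because their minimum versions are still TBD.
MINIMUM_VERSIONS = {
    "H100": {"bundle": "20.24.07.10", "retimer": "2.10.42"},
    "H200": {"bundle": "20.24.07.10", "retimer": "2.10.42"},
}


def parse_version(version):
    """Turn a dotted version such as '2.10.42' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, minimum):
    """True if installed >= minimum, comparing each numeric component in order."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter version with zeros so that "2.10" compares equal to "2.10.0".
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def firmware_is_current(gpu_model, installed_bundle, installed_retimer):
    """Hypothetical helper: True/False for H100/H200, None if no minimum is published."""
    minimum = MINIMUM_VERSIONS.get(gpu_model)
    if minimum is None:
        return None
    return (meets_minimum(installed_bundle, minimum["bundle"])
            and meets_minimum(installed_retimer, minimum["retimer"]))
```

For example, `firmware_is_current("H200", "20.24.07.10", "2.10.5")` returns `False`, because retimer 2.10.5 is older than 2.10.42 even though it would sort later as a string.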
IMPORTANT:
The server must be Powered ON for the firmware update to apply.
An AC reboot, Virtual AC Power Cycle, or full power cycle is required after the update completes for the firmware changes to take effect.
Affected Products
PowerEdge XE9680
Article Properties
Article Number: 000230721
Article Type: Solution
Last Modified: 16 Mar 2026
Version: 3