PowerEdge: XE9680 - ConnectX-7 adapters may report PCI1318
Summary: ConnectX-7 adapters may report PCI1318 "A fatal error was detected on a component" after a GPU falls off the PCIe bus with event PCI1360 "A bus fatal error was detected on a component." ...
Symptoms
After one or more GPUs fall off the PCIe bus with event PCI1360 "A bus fatal error was detected on a component," a ConnectX-7 adapter may subsequently report event PCI1318 "A fatal error was detected on a component," similar to the following:
2024-05-18 14:44:19 308 PCI1318 A fatal error was detected on a component at bus 201 device 1 function 0.
2024-05-18 14:44:16 307 PCI1360 A bus fatal error was detected on a component at slot 26.
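The PCI1318 entry gives the location as bus, device, and function numbers, whereas Linux kernel messages use the hexadecimal domain:bus:device.function form. The following is a minimal Python sketch of that conversion, assuming the numbers in the event message are decimal and the PCI domain is 0. The function name is illustrative and is not part of any Dell or Linux tool.

```python
def to_linux_pci_address(bus: int, device: int, function: int, domain: int = 0) -> str:
    """Format decimal bus/device/function numbers as a Linux PCI address.

    Assumption: the numbers in the event message are decimal and the
    PCI domain (segment) is 0 unless stated otherwise.
    """
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus, device, or function is out of range")
    return f"{domain:04x}:{bus:02x}:{device:02x}.{function:x}"


# Values taken from the PCI1318 sample: bus 201 device 1 function 0.
print(to_linux_pci_address(201, 1, 0))  # 0000:c9:01.0
```

Under these assumptions, bus 201 device 1 function 0 corresponds to 0000:c9:01.0 in kernel messages.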
The operating system (OS) log may report corresponding events:
Jun 7 17:36:41 kernel: [750851.735504] {1}[Hardware Error]: device_id: 0000:c9:01.0
Jun 7 17:36:41 kernel: [750851.760433] {1}[Hardware Error]: slot: 26
Jun 7 17:36:41 kernel: [750851.764705] {1}[Hardware Error]: secondary_bus: 0xcb
Jun 7 17:36:41 kernel: [750851.769932] {1}[Hardware Error]: vendor_id: 0x1000, device_id: 0xc030
Jun 7 17:36:41 kernel: [750851.776631] {1}[Hardware Error]: class_code: 060400
Jun 7 17:36:41 kernel: [750851.781769] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
Jun 7 17:36:41 kernel: [750851.789596] {1}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x03a10000
Jun 7 17:36:41 kernel: [750851.798029] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
Jun 7 17:36:41 kernel: [750851.804209] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
Jun 7 17:36:41 kernel: [750851.811870] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
Jun 7 17:36:41 kernel: [750851.820225] {2}[Hardware Error]: event severity: recoverable
Jun 7 17:36:41 kernel: [750851.825972] {2}[Hardware Error]: Error 0, type: fatal
Jun 7 17:36:41 kernel: [750851.831196] {2}[Hardware Error]: section_type: PCIe error
Jun 7 17:36:41 kernel: [750851.836858] {2}[Hardware Error]: port_type: 5, upstream switch port
Jun 7 17:36:41 kernel: [750851.843381] {2}[Hardware Error]: version: 3.0
Jun 7 17:36:41 kernel: [750851.848002] {2}[Hardware Error]: command: 0x0007, status: 0x0010
Jun 7 17:36:41 kernel: [750851.854269] {2}[Hardware Error]: device_id: 0000:c8:00.0
Jun 7 17:36:41 kernel: [750851.859842] {2}[Hardware Error]: slot: 0
Jun 7 17:36:41 kernel: [750851.864028] {2}[Hardware Error]: secondary_bus: 0xc9
Jun 7 17:36:41 kernel: [750851.869253] {2}[Hardware Error]: vendor_id: 0x1000, device_id: 0xc030
Jun 7 17:36:41 kernel: [750851.875953] {2}[Hardware Error]: class_code: 060400
Jun 7 17:36:41 kernel: [750851.881094] {2}[Hardware Error]: bridge: secondary_status: 0x4000, control: 0x0003
Jun 7 17:36:41 kernel: [750851.888918] {2}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x03810000
Jun 7 17:36:41 kernel: [750851.897351] {2}[Hardware Error]: aer_uncor_severity: 0x004ef030
Jun 7 17:36:41 kernel: [750851.903532] {2}[Hardware Error]: TLP Header: 60000001 c700000f 00002ee0 02b81408
Jun 7 17:36:41 kernel: [750851.911375] pcieport 0000:c9:01.0: AER: aer_status: 0x004ef030, aer_mask: 0x0000a000
Jun 7 17:36:41 kernel: [750851.919229] pcieport 0000:c9:01.0: [ 4] DLP
Jun 7 17:36:41 kernel: [750851.925431] pcieport 0000:c9:01.0: [ 5] SDES
Jun 7 17:36:41 kernel: [750851.931629] pcieport 0000:c9:01.0: [12] TLP
Jun 7 17:36:41 kernel: [750851.937826] pcieport 0000:c9:01.0: [14] CmpltTO
Jun 7 17:36:41 kernel: [750851.944020] pcieport 0000:c9:01.0: [17] RxOF
Jun 7 17:36:41 kernel: [750851.950219] pcieport 0000:c9:01.0: [18] MalfTLP
Jun 7 17:36:41 kernel: [750851.956414] pcieport 0000:c9:01.0: [19] ECRC
Jun 7 17:36:41 kernel: [750851.962611] pcieport 0000:c9:01.0: [22] UncorrIntErr
Jun 7 17:36:41 kernel: [750851.968808] pcieport 0000:c9:01.0: AER: aer_layer=Data Link Layer, aer_agent=Completer ID
Jun 7 17:36:41 kernel: [750851.977086] pcieport 0000:c9:01.0: AER: aer_uncor_severity: 0x0000f1c1
Jun 7 17:36:41 kernel: [750851.983716] pcieport 0000:c9:01.0: AER: TLP Header: 00002ee0 02b81408 00000000 00000000
Jun 7 17:36:41 kernel: [750851.991995] nvidia 0000:cb:00.0: AER: can't recover (no error_detected callback)
Jun 7 17:36:43 kernel: [750853.750472] pcieport 0000:c9:01.0: AER: Downstream Port link has been reset (0)
Jun 7 17:36:43 kernel: [750853.750491] pcieport 0000:c9:01.0: AER: device recovery failed
Jun 7 17:36:43 kernel: [750853.750592] pcieport 0000:c8:00.0: AER: aer_status: 0x00100000, aer_mask: 0x03810000
Jun 7 17:36:43 kernel: [750853.758439] pcieport 0000:c8:00.0: [20] UnsupReq (First)
Jun 7 17:36:43 kernel: [750853.765330] pcieport 0000:c8:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
Jun 7 17:36:43 kernel: [750853.773778] pcieport 0000:c8:00.0: AER: aer_uncor_severity: 0x004ef030
Jun 7 17:36:43 kernel: [750853.780409] pcieport 0000:c8:00.0: AER: TLP Header: 60000001 c700000f 00002ee0 02b81408
Jun 7 17:36:43 kernel: [750853.788685] nvidia 0000:cb:00.0: AER: can't recover (no error_detected callback)
Jun 7 17:36:43 kernel: [750853.788694] mlx5_core 0000:cc:00.0: mlx5_pci_err_detected Device state = 1 health sensors: 2 pci_status: 1. Enter, pci channel state = 2
Jun 7 17:36:43 kernel: [750853.788709] mlx5_core 0000:cc:00.0: mlx5_error_sw_reset:280:(pid 3639451): start
Jun 7 17:36:45 kernel: [750855.802512] mlx5_core 0000:cc:00.0: NIC IFC still 0 after 2000ms.
Jun 7 17:36:45 kernel: [750855.808710] mlx5_core 0000:cc:00.0: mlx5_error_sw_reset:313:(pid 3639451): end
Jun 7 17:36:45 kernel: [750856.170523] mlx5_core 0000:cc:00.0: mlx5_wait_for_pages:916:(pid 3639451): Skipping wait for vf pages stage
Jun 7 17:36:45 kernel: [750856.170526] mlx5_core 0000:cc:00.0: mlx5_wait_for_pages:916:(pid 3639451): Skipping wait for vf pages stage
Jun 7 17:36:47 kernel: [750858.440852] mlx5_core 0000:cc:00.0: mlx5_pci_err_detected Device state = 2 pci_status: 0. Exit, result = 3, need reset
Jun 7 17:36:47 kernel: [750858.440859] mpt3sas_cm6: PCI error: detected callback, state(2)!!
Cause
In the log examples above, the GPU in slot 26 has a hardware issue, which is reported to the root complex through the PCIe switch. The ConnectX-7 (CX7) adapter then detects an issue with the PCIe bus and attempts to recover.
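The aer_uncor_status values in the OS log can be decoded to see which error each port recorded. The following is a minimal Python sketch, assuming the standard bit layout of the PCIe AER Uncorrectable Error Status register and the short names that the Linux kernel prints. The table covers only bits 4 through 22, and the function name is illustrative.

```python
from typing import List

# Bit positions in the PCIe AER Uncorrectable Error Status register,
# labeled with the short names the Linux kernel prints (partial table).
AER_UNCOR_BITS = {
    4: "DLP",            # Data Link Protocol Error
    5: "SDES",           # Surprise Down Error
    12: "TLP",           # Poisoned TLP
    13: "FCP",           # Flow Control Protocol Error
    14: "CmpltTO",       # Completion Timeout
    15: "CmpltAbrt",     # Completer Abort
    16: "UnxCmplt",      # Unexpected Completion
    17: "RxOF",          # Receiver Overflow
    18: "MalfTLP",       # Malformed TLP
    19: "ECRC",          # ECRC Error
    20: "UnsupReq",      # Unsupported Request
    21: "ACSViol",       # ACS Violation
    22: "UncorrIntErr",  # Uncorrectable Internal Error
}


def decode_aer_uncor(status: int, mask: int = 0) -> List[str]:
    """Return the names of the unmasked bits set in an AER status value."""
    active = status & ~mask
    return [name for bit, name in sorted(AER_UNCOR_BITS.items())
            if active & (1 << bit)]


# Values taken from the two [Hardware Error] records in the log sample.
print(decode_aer_uncor(0x00000020))  # ['SDES']
print(decode_aer_uncor(0x00100000))  # ['UnsupReq']
```

In this sample, the port at 0000:c9:01.0 (slot 26) recorded a Surprise Down Error, and the upstream switch port at 0000:c8:00.0 recorded an Unsupported Request, which matches the "[20] UnsupReq (First)" line.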
Resolution
Update firmware for the NVIDIA HGX H100 80G 8-GPU Baseboard Assembly to version 20.24.07.10 or newer.
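To check an installed firmware version against this minimum, the dotted version strings can be compared field by field. The following is a minimal Python sketch, assuming the version is a purely numeric dotted string. How the installed version is read from the system is not shown, the function names are illustrative, and the versions passed in the example calls other than 20.24.07.10 are made-up inputs.

```python
from typing import Tuple

MINIMUM_VERSION = "20.24.07.10"  # minimum version stated in this article


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a dotted numeric version string into a tuple of integers."""
    return tuple(int(part) for part in version.strip().split("."))


def meets_minimum(installed: str, minimum: str = MINIMUM_VERSION) -> bool:
    """Return True if the installed version is the minimum or newer."""
    return parse_version(installed) >= parse_version(minimum)


# Made-up example inputs; only 20.24.07.10 comes from this article.
print(meets_minimum("20.24.07.10"))  # True
print(meets_minimum("20.24.07.09"))  # False
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "9" as newer than "10".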
Affected Products
Mellanox Family of Adapters, PowerEdge XE9680
Article Properties
Article Number: 000227854
Article Type: Solution
Last Modified: 22 Jan 2026
Version: 4