
1 Rookie • 3 Posts

April 9th, 2024 16:04

PCI1318 on Dell PowerEdge R840

I have a Dell PowerEdge R840 with a Mellanox ConnectX-5 InfiniBand PCIe card. I recently upgraded from CentOS 7 to CentOS Stream 9. Every reboot causes the iDRAC to log:

PCI1318 A fatal error was detected on a component at bus 2 device 0 function 0.

Issuing "lspci" reveals that the device at bus 2 is "02:00.0 PCI bridge: PLDA PCI Express Bridge (rev 02)". I am not sure if the issue lies with the PCI Bridge itself or the Mellanox card.
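For reference, the device can be inspected in more detail with a couple of standard lspci invocations (a sketch, assuming the bus address from the iDRAC log):

```shell
# Show verbose details for the suspect device (bus 2, device 0, function 0);
# -vv includes capability registers such as AER status, if exposed.
lspci -s 02:00.0 -vv

# Show the PCIe topology as a tree to see which devices sit behind the bridge.
lspci -tv
```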

I have updated the BIOS, iDRAC, Mellanox card firmware, and all suggested firmware upgrades from the DSU (Dell System Update) utility. None of these updates corrected the issue; every reboot still causes the PCI1318 error to be thrown in the iDRAC. Other than the iDRAC stating "System has critical issues" and the orange light blinking on the server, the machine seems to run just fine.

I would like to correct the issue so that I can return the server to its healthy state.

Moderator • 3.4K Posts

April 16th, 2024 18:13

Hello,

 

Since the PCI bridge is on the system board, it looks like it may need a system board replacement.

 

If you are OK with the way it is working, then you could export and then clear the SEL.

 

Viewing and exporting System Event Log and Lifecycle Log

https://dell.to/440QBxK

 

PowerEdge - How to View or Clear the System Event Log

https://dell.to/3xBxE8O
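As a sketch of the export-then-clear step, this can also be done from the host OS with the racadm utility (assuming racadm is installed locally; the same subcommands work over SSH to the iDRAC):

```shell
# Export the current System Event Log to a file before clearing it.
racadm getsel > sel_backup.txt

# Clear the SEL. The system health status should return to normal,
# unless the error is logged again on the next reboot.
racadm clrsel
```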

 

Moderator • 3.4K Posts

April 9th, 2024 20:46

Hello,

 

We haven't validated CentOS Stream 9.

Is the Mellanox on the HCL for that OS?

 

If you remove the Mellanox does it give the same error on boot up?

 

Check the slot for any pin damage.

 

Try a different slot. See page 61 of the Owner's Manual, link below:

100G Mellanox NICs:

Slots 3 & 4 for low profile

Slots 2 & 6 for full height

https://dell.to/4au3RgH

 

1 Rookie • 3 Posts

April 10th, 2024 18:42

@DELL-Charles R Thanks, Charles.

The Mellanox board is not included in the hardware compatibility list for CentOS Stream 9; it looks like it is only supported up to version 8. However, the InfiniBand commands seem to work just fine and I can use ibping between the hosts in my cluster.

I inspected the Mellanox board; the pins look fine. The PCIe slot on the motherboard did not have debris either. I moved the Mellanox board to another slot and still received the PCI1318 error. I also received the error with the Mellanox board completely removed.

This tells me the problem has to do with that "PLDA PCI Express Bridge" after all.

Moderator • 3.4K Posts

April 10th, 2024 20:38

You could bring it down to a minimum-to-POST configuration and see if it still has the error.

If not, you can put things back a little at a time until you find the faulting component.

 

Minimum-to-POST components:

* System board
* PIB and PIB cable (4 power + 2 sideband, connected to the MB)
* 1x supported power supply unit (PSU)
* Control panel with cable (to power the system on)
* 1x Skylake-SP 5xxx/6xxx/8xxx series processor (CPU)
* 1x DDR4 RDIMM or LRDIMM memory module (DIMM) installed in socket A1

 

1 Rookie • 3 Posts

April 16th, 2024 15:40

It's not a bad idea, but I get the feeling this PCI bridge is part of the motherboard itself, i.e. not removable. It doesn't appear in the iDRAC's hardware inventory despite the iDRAC notifying me of the issue; I had to use lspci from the OS to see what was on bus 2. Since the error is only thrown at boot and the system appears to run just fine, I am not too concerned about the PCI bridge. However, it would be nice to clear this error so I don't start brushing off the critical health warning. Do you know of a way to do that?
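One way to cross-check from the OS side (a sketch, assuming the kernel logs PCIe AER events for this device at boot) is to search the boot logs:

```shell
# Search the kernel ring buffer for PCIe/AER error messages.
dmesg | grep -iE 'aer|pcie bus error'

# Search the current boot's kernel journal for messages mentioning
# the bridge at 02:00.0.
journalctl -k -b | grep -i '02:00.0'
```

If nothing shows up, the fault is likely being caught only by platform firmware during POST, which would match the error appearing solely in the iDRAC log.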

1 Rookie • 1 Message

May 20th, 2024 12:42

This article in the Red Hat Knowledgebase may be relevant:

https://access.redhat.com/solutions/7062084