PowerEdge: Troubleshooting PCI1360 and PCI1318 Fatal BUS Errors
Summary: This article provides information about Fatal BUS errors seen with PCI1360 and PCI1318.
Instructions
Users may encounter Fatal Bus Errors such as PCI1360 and PCI1318 in the logs as seen below:
Figure 1: PCI Fatal BUS errors from the Lifecycle Controller (LC) Log
While the events may appear to be separate, they all point to the same end device; they just reference it differently.
Depending on the severity of the issue, the iDRAC may not be able to identify the slot of the end device, for example when the device is no longer detected.
If the events do not specify the device or its slot number, the BUS, Device, and Function IDs can help pinpoint it.
For example, in Figure 1 the log references BUS 13 Device 0 Function 0 (13:0:0).
Looking at a SupportAssist report (TSR), we can see that the device on 13:0:0 is indeed the NIC in slot 1:
Figure 2: PCIE Devices listed from a SupportAssist report
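The lookup described above can be sketched in a few lines of Python. This is only an illustration, not a Dell tool: the sample message text, the `inventory` mapping, and the function names are hypothetical, and the sketch assumes the log prints the IDs in decimal. Other tools may print bus numbers in a different base, so `base` is left as a parameter.

```python
import re

# Hypothetical pattern for text such as "BUS 13 Device 0 Function 0".
BDF_PATTERN = re.compile(
    r"bus\s+([0-9a-f]+)\s+device\s+([0-9a-f]+)\s+function\s+([0-9a-f]+)",
    re.IGNORECASE,
)


def parse_bdf(message, base=10):
    """Return (bus, device, function) as integers, or None if not found."""
    match = BDF_PATTERN.search(message)
    if match is None:
        return None
    try:
        return tuple(int(part, base) for part in match.groups())
    except ValueError:
        # The digits were not valid in the requested base.
        return None


def format_bdf(bdf):
    """Format the IDs in the article's bus:device:function notation."""
    bus, device, function = bdf
    return f"{bus}:{device}:{function}"


def find_device(bdf, inventory):
    """Look up the IDs in an inventory keyed by (bus, device, function)."""
    return inventory.get(bdf, "unknown device")


# Illustrative data only: in practice the inventory would be transcribed
# from the PCIE device list in the SupportAssist report (TSR).
inventory = {(13, 0, 0): "NIC in Slot 1"}
message = "Fatal error reported on BUS 13 Device 0 Function 0"

bdf = parse_bdf(message)
if bdf is not None:
    print(format_bdf(bdf), "->", find_device(bdf, inventory))
```

Keying the inventory on the full bus, device, and function tuple keeps multi-function devices distinct, since several functions can share one bus and device number.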
PCIE BUS errors can have multiple causes, such as those listed below, and the actual cause must be isolated:
- Device Firmware or OS Drivers
- Faulty PCIE Device
- Faulty PCIE Slot or Riser
- Faulty Memory DIMMs
- Faulty CPU
- Faulty Motherboard
Troubleshooting PCIE BUS errors (PCI1360, PCI1318, PCIE1363):
- Check the installed firmware and drivers and verify that they are up to date and compatible with each other.
- Check for other component errors, such as DIMM errors. If any are found, focus troubleshooting on isolating the DIMM error and, if possible, remove the DIMM to confirm that the BUS errors clear.
- Begin troubleshooting steps to isolate between the device, the slot, and the CPU or its socket on the board.
Note: We recommend engaging Dell Support to assist with troubleshooting and review of hardware logs (TSR) to best isolate the component causing issues.
- The first step would be to swap the PCIE device into another slot or riser, if possible, to check if the error follows the device to the new slot or not.
- If the other slot is controlled by a different CPU and the error follows the device to the new slot and CPU, then this would indicate the device itself as the culprit.
- If the error follows the device to the other slot but the other slot is controlled by the same CPU, the result does not yet rule out the CPU, so it is necessary to isolate further.
- If the error does not follow the device but stays on the same slot regardless of what device is installed, then the next step would be to swap the CPUs to see if the errors then move to a new slot or not.
- If the errors follow the CPU, then this would indicate the CPU as the culprit. If the errors remain unchanged, then the fault is isolated to either the motherboard or the riser.
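The swap-and-observe steps above amount to a small decision tree, which can be sketched as follows. This is only a restatement of the steps in code form, not a diagnostic tool: the function name, parameter names, and result strings are hypothetical, and the observations themselves still have to come from physically swapping parts and checking the logs.

```python
def isolate_fault(error_follows_device,
                  new_slot_on_different_cpu=False,
                  error_follows_cpu=None):
    """Map swap-test observations to the suspected component.

    error_follows_device: after moving the device to another slot or
        riser, does the error move with it?
    new_slot_on_different_cpu: is the new slot controlled by a different
        CPU than the original slot?
    error_follows_cpu: after swapping the CPUs, does the error move to a
        new slot? Use None if the CPUs have not been swapped yet.
    """
    if error_follows_device:
        if new_slot_on_different_cpu:
            # The error moved to a new slot and a new CPU with the device.
            return "device"
        # Same CPU controls both slots, so the CPU is not ruled out.
        return "inconclusive: isolate further"
    if error_follows_cpu is None:
        # The error stays on the slot; the CPU swap is the next test.
        return "next step: swap the CPUs"
    if error_follows_cpu:
        return "CPU"
    return "motherboard or riser"


# Example: the error stayed on the original slot and did not move
# after the CPUs were swapped.
print(isolate_fault(error_follows_device=False, error_follows_cpu=False))
```

Writing the branches out this way makes it easy to see which observation is still missing before a part is replaced.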