PowerEdge: What is a CTL137 Event and How to Troubleshoot it
Summary: The CTL137 event is new to iDRAC code 4.xx.xx.xx. The CTL137 event exists to notify the end user that the iDRAC lost communication with an endpoint device within the system.
Symptoms
What is a CTL137 Event?
The CTL137 event is new to iDRAC code 4.xx.xx.xx. The CTL137 event exists to notify the end user that the iDRAC lost communication with an endpoint device within the system.
Example:
"The storage controller PCIe SSD in Slot X in Bay X is unable to communicate to the BMC because either the storage controller or BMC is not responding to the commands either because of an internal error or the bus is in an error state."
Cause
What Triggers a CTL137 Event?
The iDRAC continuously polls the endpoint devices (NVMe drives, PERC, and so forth) over the I2C channel within the system for various reasons (health status changes, temperature, and so forth). If an endpoint device fails to respond to 10 consecutive I2C polls, the iDRAC logs a CTL137 event to notify the end user that it lost communication with that endpoint device.
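The consecutive-failure rule can be illustrated with a small model. This is a minimal sketch, not iDRAC firmware: the class name, field names, and the choice to log CTL137 only once per outage are assumptions made for illustration. It also emits CTL138, the restore event covered later in this article, when a poll succeeds after communication was lost.

```python
from dataclasses import dataclass, field

# Consecutive missed polls before CTL137 is logged (value taken from the article).
FAILURE_THRESHOLD = 10


@dataclass
class EndpointMonitor:
    """Toy model of per-endpoint poll tracking (hypothetical, not iDRAC code)."""
    name: str
    failed_polls: int = 0
    comm_lost: bool = False
    events: list = field(default_factory=list)

    def record_poll(self, responded: bool) -> None:
        if responded:
            # A successful poll ends any outage and clears the counter.
            if self.comm_lost:
                self.events.append("CTL138")
                self.comm_lost = False
            self.failed_polls = 0
            return
        self.failed_polls += 1
        # Assumption: the event is logged once when the threshold is reached.
        if self.failed_polls >= FAILURE_THRESHOLD and not self.comm_lost:
            self.events.append("CTL137")
            self.comm_lost = True


monitor = EndpointMonitor("PCIe SSD in Slot X in Bay X")
# 9 misses, 1 reply (counter clears), 10 misses (CTL137), 1 reply (CTL138).
for responded in [False] * 9 + [True] + [False] * 10 + [True]:
    monitor.record_poll(responded)
print(monitor.events)  # ['CTL137', 'CTL138']
```

Note that the first nine missed polls produce no event, because the single reply that follows clears the counter before the threshold is reached.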
Will the iDRAC notify you when communication is restored?
If I2C communication to the endpoint device is restored after a CTL137 event is generated, the iDRAC logs a CTL138 event to notify the user that communication with the endpoint device has been restored.
Example:
"CTL138 The communication between storage controller PCIe SSD in Slot X in Bay X and the BMC is restored"
Why does an iDRAC reset temporarily resolve the issue?
Performing an iDRAC reset forces the system to re-inventory the attached endpoint devices. In the process, the iDRAC failed-command counter for each endpoint device is reset to 0. If the underlying problem persists, the endpoint device again fails to respond to 10 consecutive I2C polls, and the iDRAC logs a new CTL137 event against that endpoint device.
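A short sketch of why the relief is only temporary, assuming the counter behaves as described above. The function names and the device labels `nvme_bay_0` and `nvme_bay_1` are hypothetical, chosen only for illustration.

```python
# Consecutive missed polls before CTL137 is logged (value taken from the article).
FAILURE_THRESHOLD = 10


def reset_idrac(counters: dict) -> dict:
    """Model the reset: every endpoint's failed-command counter returns to 0."""
    return {name: 0 for name in counters}


def misses_until_ctl137(failed_so_far: int) -> int:
    """Consecutive missed polls still needed before CTL137 is logged."""
    return max(FAILURE_THRESHOLD - failed_so_far, 0)


# Hypothetical state: one endpoint already at the threshold, one partway there.
counters = {"nvme_bay_0": 10, "nvme_bay_1": 3}
counters = reset_idrac(counters)
remaining = {name: misses_until_ctl137(n) for name, n in counters.items()}
print(remaining)  # {'nvme_bay_0': 10, 'nvme_bay_1': 10}
```

The reset only restarts the count. An endpoint that still cannot answer will miss another 10 polls in a row and the event returns.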
Why am I seeing a CTL137 event when the drive is still online within the operating system (OS)?
In some situations, the drive is still online within the OS even though a CTL137 event has been logged. This is because the I2C communication to the drives does not go through the data cables. The iDRAC sends all SMBus communication over a single SIG cable that runs from the planar to the backplane, so the issue might be with these components.
Resolution
What to look for when you see a CTL137 event.
1. Look for the obvious reasons why the drive may have failed to reply to an I2C poll from the iDRAC.
   - Did a drive fail?
   - PCIe Link Training Failure
   - PCI Down Train Errors
2. Verify that all the components within the system (NVMe drive, backplane, iDRAC) are on the latest code.
3. If steps 1 and 2 have been performed and you still see repeated CTL137 events followed by CTL138 events, you are likely seeing an intermittent signal issue between the iDRAC and the endpoint device.
   - Attempt to reseat the drive and/or the backplane cables to the planar.
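To spot the repeated CTL137/CTL138 pattern described in the last step, log entries that have already been exported and parsed can be scanned per device. The following is a minimal sketch: the `(message_id, device)` tuple format, the `count_flaps` helper, and the sample device labels are assumptions for illustration and do not reflect an actual iDRAC log format.

```python
from collections import Counter


def count_flaps(entries):
    """Count CTL137 events later followed by a CTL138 for the same device.

    `entries` is a chronological list of (message_id, device) tuples,
    a hypothetical format assumed for this sketch.
    """
    pending = set()    # devices with a CTL137 not yet matched by a CTL138
    flaps = Counter()  # completed lost/restored cycles per device
    for message_id, device in entries:
        if message_id == "CTL137":
            pending.add(device)
        elif message_id == "CTL138" and device in pending:
            pending.discard(device)
            flaps[device] += 1
    return flaps


# Made-up sample data, in chronological order.
sample = [
    ("CTL137", "Slot 1 Bay 0"),
    ("CTL138", "Slot 1 Bay 0"),
    ("CTL137", "Slot 2 Bay 0"),
    ("CTL137", "Slot 1 Bay 0"),
    ("CTL138", "Slot 1 Bay 0"),
]
flaps = count_flaps(sample)
repeated = [device for device, n in flaps.items() if n >= 2]
print(repeated)  # ['Slot 1 Bay 0']
```

A device that shows up in `repeated` has lost and regained communication more than once, which matches the intermittent signal pattern. A device with a CTL137 and no later CTL138, such as "Slot 2 Bay 0" in the sample, points back to the checks in step 1.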