PowerEdge: GPU Thermal Throttling or GPU Not Detected Issues

Summary: This article provides troubleshooting guidance for GPU thermal throttling and GPU not detected issues on Dell PowerEdge servers. These issues may occur due to temperature conditions, hardware configuration problems, or system configuration settings. ...


Symptoms

 

  • GPU performance is reduced during high workload.
  • GPU clock speed drops automatically to protect the hardware.
  • GPU temperature reaches high values during stress workloads.
  • System Event Log (SEL) shows warnings related to system inlet temperature.
  • GPU does not appear in the operating system or management tools.
  • The command nvidia-smi does not show any GPU device.
  • iDRAC or BIOS does not detect the installed GPU.

Cause

  • High ambient data center temperature
  • Insufficient airflow or blocked air intake in the server rack
  • Incorrect fan profile or thermal policy settings
  • GPU not properly seated in the PCIe slot
  • Unsupported GPU configuration or firmware mismatch
  • Outdated BIOS, iDRAC, or GPU firmware
  • Power or cable connection issues for GPU modules

Resolution

1. Check GPU Temperature and Throttle Status:

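Run the following command in the operating system to check the performance state and throttle status of NVIDIA GPUs:

nvidia-smi -q -d PERFORMANCE

The output lists each throttle reason as "Active" or "Not Active". If the thermal slowdown reasons show "Not Active", the GPU is not being thermally throttled. An "Active" Idle reason only means that no workload is running.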
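
For a scriptable version of this check, the same information can be read through the CSV query mode of nvidia-smi. The following is a minimal sketch, not a Dell-provided tool. The function name flag_thermal_throttle is hypothetical, and the clocks_throttle_reasons.* field names are assumed to be accepted by the installed driver (newer drivers may list them as clocks_event_reasons.*; run nvidia-smi --help-query-gpu to see the supported names).

```shell
# Hypothetical helper: reads "index, temp, hw_thermal, sw_thermal" CSV rows
# on stdin and prints one line per GPU that reports a thermal slowdown.
flag_thermal_throttle() {
    awk -F', *' '
        $3 == "Active" || $4 == "Active" {
            printf "GPU %s: thermal slowdown active at %s C\n", $1, $2
            found = 1
        }
        END { if (!found) print "No thermal slowdown reported" }
    '
}

# Only query the driver when nvidia-smi is installed and the query succeeds.
if command -v nvidia-smi >/dev/null 2>&1; then
    if rows=$(nvidia-smi \
        --query-gpu=index,temperature.gpu,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.sw_thermal_slowdown \
        --format=csv,noheader); then
        printf '%s\n' "$rows" | flag_thermal_throttle
    else
        echo "nvidia-smi query failed" >&2
    fi
fi
```

Running this while a stress workload is active gives the most useful result, because thermal slowdown is normally reported only under load.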

 

2. Monitor System Temperature:

  • Check the System Event Log (SEL) in iDRAC.
  • Review the Lifecycle Log for temperature warnings.
  • Verify the System Inlet Temperature from the Temperature Overview section.
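
The same temperature data can often be read from the operating system as well. The sketch below assumes that ipmitool is installed and that in-band IPMI access to the iDRAC is enabled, which this article does not confirm for any particular system. The helper name filter_temperature_events is hypothetical.

```shell
# Hypothetical helper: keep only log lines that mention temperature.
# "|| true" keeps the exit status at 0 when nothing matches.
filter_temperature_events() {
    grep -i 'temp' || true
}

if command -v ipmitool >/dev/null 2>&1; then
    # Current readings of all temperature sensors, including the inlet sensor.
    ipmitool sdr type Temperature
    # SEL entries in extended format, filtered to temperature events.
    ipmitool sel elist | filter_temperature_events
fi
```

These commands usually need root privileges. The iDRAC web interface remains the reference view if the in-band output looks incomplete.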

 

3. Improve Cooling Conditions:

  • Ensure the data center ambient temperature is within supported limits.
  • Remove airflow blockages in the rack.
  • Verify that all system fans are functioning properly.
  • Use appropriate airflow shrouds and GPU cooling kits.

 

4. Verify GPU Hardware Installation:

  • Ensure that the GPU is properly seated in the PCIe slot.
  • Check GPU power cables and connectors.
  • Confirm that the GPU is supported on the server platform.

 

5. Update System Firmware:

  • Update the server BIOS.
  • Update iDRAC firmware.
  • Update GPU drivers and firmware.

 

6. Verify GPU Detection:

Use the following command to check if the NVIDIA GPU is detected by the system:

nvidia-smi

If the GPU is not detected, check BIOS settings and hardware installation.
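
When nvidia-smi shows no device, it helps to know whether the GPU is missing from the PCIe bus or is present but not usable by the driver. The sketch below combines lspci with the nvidia-smi exit status to separate the two cases. It assumes a Linux host with lspci (pciutils) installed. The function name classify_gpu_detection and its messages are hypothetical.

```shell
# Hypothetical helper: classify the detection state from two observations.
#   $1 = number of NVIDIA PCI devices listed by lspci
#   $2 = exit status of nvidia-smi (0 means the driver can reach a GPU)
classify_gpu_detection() {
    if [ "$1" -eq 0 ]; then
        echo "not on PCIe bus: check seating, power cables, slot, BIOS settings"
    elif [ "$2" -ne 0 ]; then
        echo "on PCIe bus but not usable by driver: check driver installation"
    else
        echo "detected on PCIe bus and by driver"
    fi
}

if command -v lspci >/dev/null 2>&1; then
    # 10de is the PCI vendor ID assigned to NVIDIA.
    pci_count=$(lspci -d 10de: | grep -c .)
    nvidia-smi >/dev/null 2>&1
    classify_gpu_detection "$pci_count" "$?"
fi
```

A GPU that is absent from lspci points to the hardware and BIOS checks in this article, while a GPU that appears in lspci but not in nvidia-smi points to the driver.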

 

7. Test GPU in Another PCIe Slot:

If the GPU is not detected or continues to experience performance issues, try installing the GPU in another available PCIe slot.

  • Power off the server and disconnect power cables.
  • Remove the GPU from the current PCIe slot.
  • Install the GPU into another supported PCIe slot.
  • Reconnect power and power on the system.
  • Check whether the GPU is detected using the command nvidia-smi or from the iDRAC hardware inventory.

If the GPU is detected in another slot, the original PCIe slot may have a configuration or hardware issue.
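
To compare slots, the negotiated PCIe link width can be checked before and after moving the card. This is a rough sketch that assumes the pcie.link.width.* query fields are supported by the installed driver. The helper name report_degraded_links is hypothetical. Only the link width is compared, because the link generation can drop at idle for power saving.

```shell
# Hypothetical helper: reads "bus_id, width_current, width_max" CSV rows on
# stdin and prints GPUs whose current link width is below the maximum.
report_degraded_links() {
    awk -F', *' '
        $2 != $3 {
            printf "GPU %s: link width x%s, below maximum x%s\n", $1, $2, $3
        }
    '
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi \
        --query-gpu=pci.bus_id,pcie.link.width.current,pcie.link.width.max \
        --format=csv,noheader | report_degraded_links
fi
```

No output means every detected GPU runs at its maximum reported width. A reduced width in one slot but not in another is consistent with a slot or seating problem, although it is not proof of one.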

 

8. Run GPU Diagnostic Test:

Run the NVIDIA Data Center GPU Manager (DCGM) diagnostic tool to verify NVIDIA GPU health and detect potential hardware or thermal issues.

  1. Access the operating system through SSH or console.
  2. Run the following command to perform a long (run level 3) GPU diagnostic test:

sudo dcgmi diag -r 3

The -r option is required and selects the run level; higher levels run more tests and take longer. Level 3 performs a comprehensive diagnostic test that checks GPU memory, PCIe connectivity, and thermal behavior. Review the output to identify any hardware or performance-related issues.
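
For support cases it is useful to keep the diagnostic output in a file. The sketch below saves the result of a level 3 run (dcgmi diag -r 3) and counts failing tests. It assumes that DCGM is installed with its host engine running, and that failing tests are marked with the word "Fail" in the plain-text output. The helper name count_failed_tests and the log file name are hypothetical.

```shell
# Hypothetical helper: count lines that contain "Fail" in dcgmi diag
# output read from stdin. "|| true" keeps the status at 0 for a count of 0.
count_failed_tests() {
    grep -c 'Fail' || true
}

if command -v dcgmi >/dev/null 2>&1; then
    # Timestamped log name so repeated runs do not overwrite each other.
    log="dcgm-diag-$(date +%Y%m%d-%H%M%S).log"
    sudo dcgmi diag -r 3 | tee "$log"
    echo "Failed tests: $(count_failed_tests < "$log")"
fi
```

A count of zero is not a guarantee of healthy hardware, so the full log should still be reviewed for warnings and skipped tests.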

Affected Products

C Series, Rack Servers, XE Servers
Article Properties
Article Number: 000458921
Article Type: Solution
Last Modified: 01 May 2026
Version: 1