PowerEdge: GPU Thermal Throttling or GPU Not Detected Issues

Summary: This article provides troubleshooting guidance for GPU thermal throttling and GPU not detected issues on Dell PowerEdge servers. These issues may occur due to temperature conditions, hardware configuration problems, or system configuration settings. ...


Symptoms

 

  • GPU performance is reduced during high workload.
  • GPU clock speed drops automatically to protect the hardware.
  • GPU temperature reaches high values during stress workloads.
  • System Event Log (SEL) shows warnings related to system inlet temperature.
  • GPU does not appear in operating system or management tools.
  • The command nvidia-smi does not show any GPU device.
  • iDRAC or BIOS does not detect the installed GPU.

Cause

  • High ambient data center temperature
  • Insufficient airflow or blocked air intake in the server rack
  • Incorrect fan profile or thermal policy settings
  • GPU not properly seated in the PCIe slot
  • Unsupported GPU configuration or firmware mismatch
  • Outdated BIOS, iDRAC, or GPU firmware
  • Power or cable connection issues for GPU modules

Resolution

1. Check GPU Temperature and Throttle Status:

Run the following command in the operating system to check the performance state and throttle status of NVIDIA GPUs:

nvidia-smi -q -d performance

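If every throttle reason is reported as "Not Active", the GPU is not being throttled. An "Active" value for HW Thermal Slowdown or SW Thermal Slowdown indicates that the GPU is reducing its clocks because of temperature. The Idle reason can be "Active" when no workload is running; this is expected.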
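For systems with several GPUs, the same check can be scripted. The following is a minimal sketch, assuming that the installed driver's nvidia-smi accepts the clocks_throttle_reasons.* fields of the --query-gpu option (field names can differ between driver releases). The function names are illustrative only and are not part of any Dell or NVIDIA tool.

```python
import csv
import io
import shutil
import subprocess

# Query fields assumed to be accepted by the installed driver's nvidia-smi.
FIELDS = [
    "index",
    "name",
    "temperature.gpu",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
]


def parse_throttle_csv(text):
    """Parse 'csv,noheader' output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank lines and error messages
        index, name, temp, hw, sw = (value.strip() for value in record)
        rows.append({
            "index": index,
            "name": name,
            "temperature_c": temp,
            # True only if a thermal reason is exactly "Active".
            "thermal_throttle": "Active" in (hw, sw),
        })
    return rows


def query_throttle_status():
    """Run nvidia-smi and return parsed rows, or [] if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_throttle_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_throttle_status():
        state = "THERMAL THROTTLE" if gpu["thermal_throttle"] else "ok"
        print(f"GPU {gpu['index']} ({gpu['name']}): {gpu['temperature_c']} C, {state}")
```

Running the script during a stress workload, rather than at idle, gives the most useful result, because thermal slowdown is only reported while it is in effect.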

 

2. Monitor System Temperature:

  • Check the System Event Log (SEL) in iDRAC.
  • Review Lifecycle Log for temperature warnings.
  • Verify the System Inlet Temperature from the Temperature Overview section.
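When the iDRAC web interface is not at hand, the inlet temperature can also be read from the host operating system. The sketch below is one possible approach, not the documented Dell procedure: it assumes that ipmitool is installed, that the in-band IPMI driver is loaded, and that the sensor is named "Inlet Temp" (the name may differ between platforms). The function names are illustrative.

```python
import shutil
import subprocess

INLET_SENSOR = "Inlet Temp"  # assumed sensor name; verify on your platform


def parse_sdr_temperatures(text):
    """Return {sensor_name: degrees_c} from 'ipmitool sdr type Temperature' output."""
    readings = {}
    for line in text.splitlines():
        columns = [col.strip() for col in line.split("|")]
        if len(columns) != 5 or not columns[4].endswith("degrees C"):
            continue  # skip sensors without a reading and unexpected rows
        name = columns[0]
        value = float(columns[4].split()[0])
        # Sensor names can repeat (for example, one "Temp" row per CPU);
        # keep the hottest reading for each name.
        readings[name] = max(value, readings.get(name, value))
    return readings


def read_temperatures():
    """Run ipmitool and return parsed readings, or {} if the tool is missing."""
    if shutil.which("ipmitool") is None:
        return {}
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=False,
    )
    return parse_sdr_temperatures(result.stdout)


if __name__ == "__main__":
    inlet = read_temperatures().get(INLET_SENSOR)
    if inlet is None:
        print("Inlet temperature sensor not found; check the sensor name or use iDRAC.")
    else:
        print(f"{INLET_SENSOR}: {inlet:.0f} C")
```

Compare the reported value with the supported ambient temperature range for your server model.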

 

3. Improve Cooling Conditions:

  • Ensure the data center ambient temperature is within supported limits.
  • Remove airflow blockages in the rack.
  • Verify that all system fans are functioning properly.
  • Use appropriate airflow shrouds and GPU cooling kits.

 

4. Verify GPU Hardware Installation:

  • Ensure that the GPU is properly seated in the PCIe slot.
  • Check GPU power cables and connectors.
  • Confirm that the GPU is supported on the server platform.

 

5. Update System Firmware:

  • Update the server BIOS.
  • Update iDRAC firmware.
  • Update GPU drivers and firmware.

 

6. Verify GPU Detection:

Use the following command to check whether the NVIDIA GPU is detected by the driver in the operating system:

nvidia-smi

If no GPU is listed, confirm that the NVIDIA driver is installed and loaded, then check the BIOS settings and the hardware installation.
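It can help to separate a GPU that is missing from the PCIe bus from one that is present but not claimed by the driver. The following is a minimal sketch of that comparison on Linux, assuming that lspci (pciutils) is available. It counts devices with the NVIDIA PCI vendor ID 10de in a display controller class and compares that with the GPUs listed by nvidia-smi -L. The function names and messages are illustrative.

```python
import re
import shutil
import subprocess

# Display-class device (VGA 0300 or 3D controller 0302) with NVIDIA vendor ID 10de.
PCI_GPU = re.compile(r"\[03(?:00|02)\]:.*\[10de:[0-9a-f]{4}\]", re.IGNORECASE)


def count_pci_nvidia_gpus(lspci_nn_output):
    """Count NVIDIA display-class devices in 'lspci -nn' output."""
    return sum(1 for line in lspci_nn_output.splitlines() if PCI_GPU.search(line))


def count_driver_gpus(nvidia_smi_list_output):
    """Count 'GPU N: ...' lines printed by 'nvidia-smi -L'."""
    lines = nvidia_smi_list_output.splitlines()
    return sum(1 for line in lines if re.match(r"GPU \d+:", line))


def diagnose(pci_count, driver_count):
    """Turn the two counts into a hint about where to look next."""
    if pci_count == 0:
        return "No NVIDIA GPU on the PCIe bus: check seating, power cables, and BIOS slot settings."
    if driver_count < pci_count:
        return "GPU present on PCIe but not reported by the driver: check the driver and firmware."
    return "All GPUs found on the PCIe bus are reported by the driver."


def run(command):
    """Return stdout of a command, or '' if the executable is not installed."""
    if shutil.which(command[0]) is None:
        return ""
    return subprocess.run(command, capture_output=True, text=True, check=False).stdout


if __name__ == "__main__":
    pci = count_pci_nvidia_gpus(run(["lspci", "-nn"]))
    drv = count_driver_gpus(run(["nvidia-smi", "-L"]))
    print(f"PCIe bus: {pci} GPU(s), driver: {drv} GPU(s)")
    print(diagnose(pci, drv))
```

A GPU that is absent from the PCIe bus points to the hardware checks in step 4, while a GPU that is on the bus but missing from nvidia-smi points to the driver and firmware updates in step 5.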

 

7. Test GPU in Another PCIe Slot:

If the GPU is not detected or continues to experience performance issues, try installing the GPU in another available PCIe slot.

  • Power off the server and disconnect power cables.
  • Remove the GPU from the current PCIe slot.
  • Install the GPU into another supported PCIe slot.
  • Reconnect power and power on the system.
  • Check whether the GPU is detected using the command nvidia-smi or from the iDRAC hardware inventory.

If the GPU is detected in another slot, the original PCIe slot may have a configuration or hardware issue.

 

8. Run GPU Diagnostic Test:

Run the NVIDIA Data Center GPU Manager (DCGM) diagnostic tool to verify NVIDIA GPU health and detect potential hardware or thermal issues.

  1. Access the operating system through SSH or console.
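  2. Run the following command to perform a long-running GPU diagnostic test. The -r option selects the run level: level 1 is a quick check, and higher levels run longer and cover more tests.
sudo dcgmi diag -r 3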

This command performs a comprehensive diagnostic test that checks GPU memory, PCIe connectivity, and thermal behavior. Review the output to identify any hardware or performance-related issues.
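To make the review step easier on hosts with many GPUs, the results can be summarized by a script. The sketch below assumes that DCGM is installed with its host engine running, and that dcgmi diag prints a text table whose rows end in Pass, Fail, Warn, or Skip. The exact layout can vary between DCGM versions, so treat the parser as a starting point. The function names are illustrative.

```python
import shutil
import subprocess

RESULT_WORDS = {"Pass", "Fail", "Warn", "Skip"}


def parse_diag_table(text):
    """Extract (test name, result) pairs from the table printed by dcgmi diag."""
    results = []
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Keep only two-column rows whose second column is a known result word.
        if len(cells) == 2 and cells[1] in RESULT_WORDS:
            results.append((cells[0], cells[1]))
    return results


def failed_tests(results):
    """Return the names of tests that reported Fail or Warn."""
    return [name for name, outcome in results if outcome in ("Fail", "Warn")]


def run_diag(level=1):
    """Run 'dcgmi diag -r <level>' and return parsed results, or [] if dcgmi is missing."""
    if shutil.which("dcgmi") is None:
        return []
    result = subprocess.run(
        ["dcgmi", "diag", "-r", str(level)],
        capture_output=True, text=True, check=False,
    )
    return parse_diag_table(result.stdout)


if __name__ == "__main__":
    results = run_diag(level=1)  # quick check; raise the level for longer tests
    problems = failed_tests(results)
    if not results:
        print("No diagnostic results parsed; check that DCGM is installed and running.")
    elif problems:
        print("Tests needing attention: " + ", ".join(problems))
    else:
        print(f"All {len(results)} reported tests passed or were skipped.")
```

Any test listed as needing attention should be reviewed in the full dcgmi output, which includes the detailed failure message.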

Affected Products

C Series, Rack Servers, XE Servers
Article Properties
Article Number: 000458921
Article Type: Solution
Last Modified: 01 May 2026
Version: 1