PowerEdge: How To Troubleshoot GPU Thermal Throttling and Detection Issues

Summary: This article guides users through diagnosing and resolving GPU thermal throttling and detection problems on Dell PowerEdge servers. It covers checking GPU temperature and throttle status, reviewing system logs, improving cooling, verifying hardware installation, updating BIOS/iDRAC and GPU firmware, and running diagnostic utilities such as nvidia-smi and DCGM.


Instructions

Preparation

  • Access to the operating system with administrative privileges.
  • iDRAC or BIOS access to view system logs and settings.
  • Installed NVIDIA/CUDA driver and the nvidia-smi utility.
  • Physical access to the server for hardware checks.

Task Execution

  1. Check GPU Temperature and Throttle Status
    • Run the following command in the operating system to check GPU performance state and throttle reasons:
      nvidia-smi -q -d PERFORMANCE
    • If every throttle reason is reported as "Not Active," the GPU is not being throttled.
  2. Monitor System Temperature
    • Check the System Event Log (SEL) in iDRAC.
    • Review the Lifecycle Log for temperature warnings.
    • Verify the System Inlet Temperature from the Temperature Overview section.
  3. Improve Cooling Conditions
    • Ensure the data‑center ambient temperature is within supported limits.
    • Remove any airflow blockages in the rack.
    • Verify that all system fans are functioning properly.
    • Install appropriate airflow shrouds and GPU cooling kits if available.
  4. Verify GPU Hardware Installation
    • Confirm that the GPU is properly seated in the PCIe slot.
    • Check power cables and connectors for secure attachment.
    • Validate that the GPU model is supported on the server platform.
  5. Update System Firmware
    • Update the server BIOS to the latest version.
    • Update iDRAC firmware to the latest version.
    • Update GPU drivers and firmware to the latest releases.
  6. Verify GPU Detection
    • Use the following command to check if the GPU is detected by the system:
      nvidia-smi 
    • If the GPU is not detected, review BIOS settings and hardware installation.
  7. Test GPU in Another PCIe Slot
    • Power off the server and disconnect power cables.
    • Remove the GPU from the current PCIe slot.
    • Install the GPU into another supported PCIe slot.
    • Reconnect power and power on the system.
    • Check detection using nvidia-smi or the iDRAC hardware inventory.
    • If the GPU is detected in the new slot, the original slot may have a configuration or hardware issue.
  8. Run GPU Diagnostic Test
    1. DCGM Tool (dcgmi)
    2. NVIDIA SMI Logs
      • Run nvidia-smi to obtain a summary of GPU usage and status.
      • Run nvidia-smi -q for detailed GPU information.
      • Run nvidia-smi nvlink -s to view NVLink status and errors.
    3. OS‑Level Outputs
      • Run the following command to view PCIe details for the GPU, replacing 9b:00.0 with the PCIe bus address of your GPU:
        lspci -s 9b:00.0 -vv
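
The throttle check in step 1 can be scripted when several servers need to be reviewed. The following is a minimal sketch, not a Dell or NVIDIA tool: the helper name active_throttle_reasons is hypothetical, and it assumes the report prints one "reason : state" pair per line, as the nvidia-smi -q -d PERFORMANCE output typically does. It only filters text supplied on standard input.

```shell
# Hypothetical helper (not part of nvidia-smi): reads a PERFORMANCE report
# on stdin and prints the name of every reason whose state is exactly
# "Active". Lines ending in "Not Active" are ignored.
active_throttle_reasons() {
    grep -E ':[[:space:]]*Active[[:space:]]*$' \
        | sed -E 's/^[[:space:]]+//; s/[[:space:]]*:[[:space:]]*Active[[:space:]]*$//'
}

# Example use on a server with the NVIDIA driver installed:
#   nvidia-smi -q -d PERFORMANCE | active_throttle_reasons
```

An empty result means no reason is reported as Active. An Idle entry reported as Active usually only indicates that no workload is running, so review it separately from thermal or power entries.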

Verification

  • GPU temperature remains within normal operating range, and throttle status shows "Not Active."
  • GPU appears in the output of nvidia-smi and in the iDRAC hardware inventory.
  • No temperature-related warnings are present in the System Event Log.
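
The first two verification items can be checked from the command line. This is a rough sketch under stated assumptions: the helper names gpus_over_temp and count_listed_gpus are hypothetical, and the 85 °C limit in the usage comment is only a placeholder, so substitute the limit documented for your GPU model. Both helpers read text from standard input, using the CSV produced by the nvidia-smi --query-gpu option and the listing produced by nvidia-smi -L.

```shell
# Hypothetical helper: prints every GPU whose temperature exceeds the limit
# passed as $1. Expects CSV lines "index, name, temperature" on stdin.
gpus_over_temp() {
    awk -F', *' -v limit="$1" \
        '$3 + 0 > limit + 0 { print "GPU " $1 " (" $2 "): " $3 " C" }'
}

# Hypothetical helper: counts the GPUs in an `nvidia-smi -L` listing on stdin.
# `|| true` keeps the exit status at 0 when the count is zero.
count_listed_gpus() {
    grep -c '^GPU [0-9]' || true
}

# Example use (85 is an assumed placeholder limit, not a Dell specification):
#   nvidia-smi --query-gpu=index,name,temperature.gpu \
#       --format=csv,noheader,nounits | gpus_over_temp 85
#   nvidia-smi -L | count_listed_gpus
```

Compare the reported GPU count with the number of GPUs physically installed and with the iDRAC hardware inventory. The System Event Log still needs to be reviewed in iDRAC.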

Affected Products

Rack Servers

Products

Tower Servers, XE Servers
Article Properties
Article Number: 000452203
Article Type: How To
Last Modified: 05 May 2026
Version: 1