PowerEdge: How To Troubleshoot GPU Thermal Throttling and Detection Issues
Summary: This article guides users through diagnosing and resolving GPU thermal throttling and detection problems on Dell PowerEdge servers. It covers checking GPU temperature and throttle status, reviewing system logs, improving cooling, verifying hardware installation, updating BIOS/iDRAC and GPU firmware, and running diagnostic utilities such as nvidia-smi and DCGM. ...
Instructions
Preparation
- Access to the operating system with administrative privileges.
- iDRAC or BIOS access to view system logs and settings.
- NVIDIA/CUDA driver and the nvidia-smi utility installed.
- Physical access to the server for hardware checks.
Task Execution
- Check GPU Temperature and Throttle Status
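- Run the following command in the operating system to check the GPU performance state and throttle status:
nvidia-smi -q -d PERFORMANCE
- If the throttle reasons, in particular HW Thermal Slowdown and SW Thermal Slowdown, are shown as "Not Active," the GPU is not being thermally throttled. An "Idle" reason shown as "Active" only means that nothing is running on the GPU.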
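The "Not Active" check can also be scripted. The following is a minimal sketch, assuming the reason lines in the nvidia-smi output have the form "HW Thermal Slowdown : Not Active"; the function name active_throttle_reasons is illustrative and is not part of any NVIDIA tool. It prints only the slowdown and power-cap reasons that are currently active, so empty output means that no throttling was reported.

```shell
#!/usr/bin/env bash
# Print every slowdown or power-cap reason that is reported as "Active".
# Input: the text of `nvidia-smi -q -d PERFORMANCE` on stdin.
# Assumption: each reason is printed as "<name> : Active" or "<name> : Not Active".
active_throttle_reasons() {
  awk -F':' '
    /Slowdown|Power Cap/ {
      name = $1; state = $2
      gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim the reason name
      gsub(/^[ \t]+|[ \t]+$/, "", state)   # trim the reported state
      if (state == "Active") print name
    }'
}

# Usage on a system with the NVIDIA driver installed:
#   nvidia-smi -q -d PERFORMANCE | active_throttle_reasons
```

The "Idle" reason is deliberately not matched, because it is expected to be active whenever the GPU has no work.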
- Monitor System Temperature
- Check the System Event Log (SEL) in iDRAC.
- Review the Lifecycle Log for temperature warnings.
- Verify the System Inlet Temperature from the Temperature Overview section.
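Where the SEL is read from the operating system instead of the iDRAC web interface, temperature entries can be filtered out of the text output. The sketch below rests on two assumptions that are not stated in this article: that ipmitool is installed, and that "ipmitool sel elist" prints pipe-separated fields with the sensor name in the fourth field. The helper name sel_temperature_events is hypothetical.

```shell
#!/usr/bin/env bash
# Keep only SEL entries whose sensor field starts with "Temperature".
# Input: the text of `ipmitool sel elist` on stdin.
# Assumption: fields are separated by "|" and field 4 holds the sensor name.
sel_temperature_events() {
  awk -F'|' '
    {
      sensor = $4
      gsub(/^[ \t]+|[ \t]+$/, "", sensor)   # trim padding around the sensor name
      if (sensor ~ /^Temperature/) print
    }'
}

# Usage (ipmitool normally needs root privileges):
#   ipmitool sel elist | sel_temperature_events
```

If the field layout differs on a given system, the field number in the awk program needs to be adjusted accordingly.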
- Improve Cooling Conditions
- Ensure the data‑center ambient temperature is within supported limits.
- Remove any airflow blockages in the rack.
- Verify that all system fans are functioning properly.
- Install appropriate airflow shrouds and GPU cooling kits if available.
- Verify GPU Hardware Installation
- Confirm that the GPU is properly seated in the PCIe slot.
- Check power cables and connectors for secure attachment.
- Validate that the GPU model is supported on the server platform.
- Update System Firmware
- Update the server BIOS to the latest version.
- Update iDRAC firmware to the latest version.
- Update GPU drivers and firmware to the latest releases.
- Verify GPU Detection
- Use the following command to check if the GPU is detected by the system:
nvidia-smi
- If the GPU is not detected, review BIOS settings and hardware installation.
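A GPU can be visible on the PCIe bus and still be missing from the driver. One way to tell the two cases apart is to compare the device count from lspci with the count from nvidia-smi -L. The following is a sketch, assuming lspci -nn prints the class code and the vendor:device ID in square brackets (10de is the NVIDIA PCI vendor ID); the three function names are illustrative.

```shell
#!/usr/bin/env bash
# Count NVIDIA display-class devices in `lspci -nn` output read from stdin.
# Class codes 03xx are display/3D controllers; audio functions are skipped.
count_pci_gpus() {
  grep -Ec '\[03[0-9a-f]{2}\]:.*\[10de:[0-9a-f]{4}\]'
}

# Count GPUs in `nvidia-smi -L` output read from stdin (lines like "GPU 0: ...").
count_driver_gpus() {
  grep -Ec '^GPU [0-9]+:'
}

# Compare the two counts and print a one-line verdict.
compare_gpu_counts() {
  local pci="$1" drv="$2"
  if [ "$pci" -eq "$drv" ]; then
    echo "OK: $drv GPU(s) on the PCIe bus and in the driver"
  else
    echo "MISMATCH: $pci on the PCIe bus, $drv in the driver"
  fi
}

# Usage on a live system:
#   compare_gpu_counts "$(lspci -nn | count_pci_gpus)" "$(nvidia-smi -L | count_driver_gpus)"
```

A mismatch with the device present in lspci points toward the driver or firmware; a device missing from lspci as well points toward seating, power, the slot, or BIOS settings.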
- Test GPU in Another PCIe Slot
- Power off the server and disconnect power cables.
- Remove the GPU from the current PCIe slot.
- Install the GPU into another supported PCIe slot.
- Reconnect power and power on the system.
- Check detection using nvidia-smi or the iDRAC hardware inventory.
- If the GPU is detected in the new slot, the original slot may have a configuration or hardware issue.
- Run GPU Diagnostic Test
- DCGMi Tool
- See the DCGM utility
- For more instructions, see the article "PowerEdge: NVIDIA DataCenter GPU Manager (DCGM) install and how to run diagnostics."
- NVIDIA SMI Logs
- Run nvidia-smi to obtain a summary of GPU usage and status.
- Run nvidia-smi -q for detailed GPU information.
- Run nvidia-smi nvlink -s to view NVLink status, and nvidia-smi nvlink -e to view NVLink error counters.
- OS‑Level Outputs
- Run lspci -s 9b:00.0 -vv to view PCIe details for the GPU (replace 9b:00.0 with the PCI address of the GPU, in bus:device.function form).
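In the lspci -vv output, the LnkCap line shows the link speed and width the GPU supports and the LnkSta line shows what was actually negotiated. The following sketch extracts both for a single device, assuming the usual "Speed 16GT/s, Width x16" wording; the function name pcie_link_summary is illustrative. Run lspci as root, otherwise the capability lines may not be shown.

```shell
#!/usr/bin/env bash
# Summarize supported versus negotiated PCIe link speed and width.
# Input: `lspci -s <address> -vv` output for ONE device on stdin.
pcie_link_summary() {
  awk '
    # Return the first substring of line that matches re, or "?" if none does.
    function field(line, re) {
      return match(line, re) ? substr(line, RSTART, RLENGTH) : "?"
    }
    /LnkCap:/ { cap = field($0, "Speed [0-9.]+GT/s") " " field($0, "Width x[0-9]+") }
    /LnkSta:/ { sta = field($0, "Speed [0-9.]+GT/s") " " field($0, "Width x[0-9]+") }
    END {
      status = (cap == sta) ? "full" : "reduced"
      print "capable: " cap
      print "current: " sta
      print "link: " status
    }'
}

# Usage (replace 9b:00.0 with the PCI address of the GPU):
#   lspci -s 9b:00.0 -vv | pcie_link_summary
```

A reduced width usually deserves a closer look at seating and the riser. A reduced speed on its own is less conclusive, because GPUs can lower the link speed while idle, so it is worth repeating the check while the GPU is under load.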
Verification
- GPU temperature remains within the normal operating range, and throttle status shows "Not Active."
- GPU appears in the output of nvidia-smi and in the iDRAC hardware inventory.
- No temperature-related warnings are present in the System Event Log.
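The temperature item in the list above can be checked repeatedly from a script. The sketch below assumes input produced by nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits. The 85 C limit in the usage line is an arbitrary placeholder, not a Dell or NVIDIA specification; the actual slowdown threshold for a given GPU is listed by nvidia-smi -q -d TEMPERATURE. The function name gpus_over_limit is illustrative.

```shell
#!/usr/bin/env bash
# Report every GPU whose temperature is above the given limit (degrees C).
# Input: "index, temperature" CSV rows on stdin, one GPU per line.
# Rows with a non-numeric temperature (for example "N/A") are not reported.
gpus_over_limit() {
  local limit="$1"
  awk -F', *' -v limit="$limit" '
    $2 + 0 > limit + 0 {
      print "GPU " $1 ": " $2 " C exceeds " limit " C"
    }'
}

# Usage (85 is a placeholder limit, see the note above):
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | gpus_over_limit 85
```

Empty output means that no GPU was above the chosen limit at the time of the query.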
Affected Products
Rack Servers
Products
Tower Servers, XE Servers
Article Properties
Article Number: 000452203
Article Type: How To
Last Modified: 05 May 2026
Version: 1