PowerEdge: GPU Thermal Throttling or GPU Not Detected Issues
Summary: This article provides troubleshooting guidance for GPU thermal throttling and GPU not detected issues on Dell PowerEdge servers. These issues may occur due to temperature conditions, hardware configuration problems, or system configuration settings. ...
Symptoms
- GPU performance is reduced during high workload.
- GPU clock speed drops automatically to protect the hardware.
- GPU temperature reaches high values during stress workloads.
- System Event Log (SEL) shows warnings related to system inlet temperature.
- GPU does not appear in the operating system or management tools.
- The command nvidia-smi does not show any GPU device.
- iDRAC or BIOS does not detect the installed GPU.
Cause
- High ambient data center temperature
- Insufficient airflow or blocked air intake in the server rack
- Incorrect fan profile or thermal policy settings
- GPU not properly seated in the PCIe slot
- Unsupported GPU configuration or firmware mismatch
- Outdated BIOS, iDRAC, or GPU firmware
- Power or cable connection issues for GPU modules
Resolution
1. Check GPU Temperature and Throttle Status:
Run the following command in the operating system to check the performance state and throttle status of NVIDIA GPUs:
nvidia-smi -q -d performance
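If all throttle reasons are shown as "Not Active", the GPU is not being throttled. A reason shown as "Active" identifies why the clock speed was reduced; for example, an active thermal slowdown points to a cooling problem.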
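When several GPUs are installed, the full report can be long. The following is a minimal sketch, not an official tool, that filters the report down to the reasons marked "Active". The function name active_throttle_reasons is hypothetical, and the parsing assumes the usual "name : value" layout of the nvidia-smi report. An idle GPU typically reports "Idle" as Active, which is not a fault.

```shell
#!/bin/sh
# Print every throttle reason reported as "Active", prefixed with its GPU.
# Reads the text of `nvidia-smi -q -d PERFORMANCE` on standard input.
# (Hypothetical helper; assumes the "name : value" report layout.)
active_throttle_reasons() {
    awk -F ':' '
        /^GPU / { gpu = $0 }                 # remember the current GPU header line
        $NF ~ /^[ \t]*Active[ \t]*$/ {       # value is exactly "Active", not "Not Active"
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            print gpu ": " name
        }'
}

# Run against the live system only when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -q -d PERFORMANCE | active_throttle_reasons
fi
```

No output means that no reason was reported as Active at the moment the command ran.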
2. Monitor System Temperature:
- Check the System Event Log (SEL) in iDRAC.
- Review Lifecycle Log for temperature warnings.
- Verify the System Inlet Temperature from the Temperature Overview section.
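The same log can also be read from the operating system when an IPMI client is available. The sketch below assumes that ipmitool is installed and has access to the BMC; the function name sel_temperature_events and the keyword list are illustrative assumptions, not part of any Dell tool.

```shell
#!/bin/sh
# Keep only SEL lines that mention temperature-related keywords.
# Reads SEL text (for example from `ipmitool sel elist`) on standard input.
# (Hypothetical helper; the keyword list is an assumption.)
sel_temperature_events() {
    grep -i -E 'temp|inlet|thermal'
}

# Run against the live system only when ipmitool is installed.
if command -v ipmitool >/dev/null 2>&1; then
    ipmitool sel elist | sel_temperature_events \
        || echo "No temperature-related SEL entries found"
fi
```

The filter is deliberately broad, so review the matching lines rather than treating every match as a fault.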
3. Improve Cooling Conditions:
- Ensure the data center ambient temperature is within supported limits.
- Remove airflow blockages in the rack.
- Verify that all system fans are functioning properly.
- Use appropriate airflow shrouds and GPU cooling kits.
4. Verify GPU Hardware Installation:
- Ensure that the GPU is properly seated in the PCIe slot.
- Check GPU power cables and connectors.
- Confirm that the GPU is supported on the server platform.
5. Update System Firmware:
- Update the server BIOS.
- Update iDRAC firmware.
- Update GPU drivers and firmware.
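Before updating, it can help to record the versions that are currently installed. The following sketch assumes a Linux host that exposes the BIOS version under /sys/class/dmi/id; the function names bios_version and gpu_versions are hypothetical. The iDRAC firmware version is not covered here and can be read from the iDRAC interface.

```shell
#!/bin/sh
# Print the BIOS version string. The sysfs path can be overridden for testing.
bios_version() {
    file="${1:-/sys/class/dmi/id/bios_version}"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "unknown"
    fi
}

# Print "name, driver version, VBIOS version" for each NVIDIA GPU.
gpu_versions() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,driver_version,vbios_version --format=csv,noheader \
            || echo "nvidia-smi could not query the GPUs"
    else
        echo "nvidia-smi not found"
    fi
}

echo "BIOS: $(bios_version)"
echo "GPU:"
gpu_versions
```

Compare the reported versions with the latest releases listed for the server model before deciding which updates to apply.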
6. Verify GPU Detection:
Use the following command to check whether the NVIDIA GPU is detected by the system:
nvidia-smi
If the GPU is not detected, check BIOS settings and hardware installation.
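To narrow down where detection fails, it can be useful to compare what the PCIe bus reports with what the driver reports. The sketch below assumes a Linux host with lspci installed; the function names count_nvidia_gpus and diagnose_detection are hypothetical, and the match on "3D controller" or "VGA compatible controller" is an assumption about how the GPU is described in the lspci listing.

```shell
#!/bin/sh
# Count lspci lines that describe an NVIDIA GPU. Reads lspci output on stdin.
count_nvidia_gpus() {
    awk 'tolower($0) ~ /nvidia/ && /3D controller|VGA compatible controller/ { n++ }
         END { print n + 0 }'
}

# Interpret two counts: GPUs on the PCIe bus and GPUs listed by the driver.
diagnose_detection() {
    pci_count=$1
    smi_count=$2
    if [ "$pci_count" -eq 0 ]; then
        echo "No NVIDIA GPU on the PCIe bus: check seating, power cables, and BIOS settings"
    elif [ "$smi_count" -lt "$pci_count" ]; then
        echo "GPU visible on the PCIe bus but not to nvidia-smi: check the driver"
    else
        echo "GPU detected on the PCIe bus and by the driver"
    fi
}

# Run against the live system only when lspci is installed.
if command -v lspci >/dev/null 2>&1; then
    pci=$(lspci | count_nvidia_gpus)
    smi=$(nvidia-smi -L 2>/dev/null | awk '/^GPU / { n++ } END { print n + 0 }')
    diagnose_detection "$pci" "$smi"
fi
```

A GPU that is missing from lspci is a hardware or BIOS-level problem, while a GPU that appears in lspci but not in nvidia-smi usually points to the driver.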
7. Test GPU in Another PCIe Slot:
If the GPU is not detected or continues to experience performance issues, try installing the GPU in another available PCIe slot.
- Power off the server and disconnect power cables.
- Remove the GPU from the current PCIe slot.
- Install the GPU into another supported PCIe slot.
- Reconnect power and power on the system.
- Check whether the GPU is detected by using the nvidia-smi command or the iDRAC hardware inventory.
If the GPU is detected in another slot, the original PCIe slot may have a configuration or hardware issue.
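After moving the GPU, it may also be worth confirming that the PCIe link in the new slot trained at full width, since a narrower link can reduce performance. The sketch below is one possible check; the function name degraded_links is hypothetical, and it compares only the link width because the link generation can drop while the GPU is idle.

```shell
#!/bin/sh
# Report GPUs whose current PCIe link width is below the maximum width.
# Reads CSV lines of the form: index, bus id, current width, maximum width
degraded_links() {
    awk -F ', *' '$3 + 0 < $4 + 0 {
        print "GPU " $1 " (" $2 "): x" $3 " of x" $4
    }'
}

# Run against the live system only when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,pci.bus_id,pcie.link.width.current,pcie.link.width.max \
        --format=csv,noheader,nounits | degraded_links
fi
```

No output means that every GPU reported a link width equal to its maximum. A reduced width can also be normal for a slot that is electrically narrower than the card, so check the slot specification for the server model.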
8. Run GPU Diagnostic Test:
Run the NVIDIA Data Center GPU Manager (DCGM) diagnostic tool to verify NVIDIA GPU health and detect potential hardware or thermal issues.
- Access the operating system through SSH or console.
- Run the following command to perform an extended GPU diagnostic test. The -r option selects the run level; level 3 runs the long hardware diagnostic:
sudo dcgmi diag -r 3
This command performs a comprehensive diagnostic test that checks GPU memory, PCIe connectivity, and thermal behavior. Review the output to identify any hardware or performance-related issues.
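The diagnostic prints a table of tests and results. As a rough aid for reviewing long output, the sketch below lists only the tests whose result begins with "Fail". The function name failed_diag_tests is hypothetical, and the parsing assumes the two-column "| test | result |" table layout, which can differ between DCGM versions. The run level in the usage comment (-r 3) is likewise an assumption.

```shell
#!/bin/sh
# Print the name of every test whose result column starts with "Fail".
# Reads the text output of `dcgmi diag` on standard input.
# (Hypothetical helper; assumes a "| test | result |" table layout.)
failed_diag_tests() {
    awk -F '|' 'NF >= 3 && $3 ~ /^[ \t]*Fail/ {
        name = $2
        gsub(/^[ \t]+|[ \t]+$/, "", name)
        print name
    }'
}

# Example usage (not run automatically because the test can take a long time):
#   sudo dcgmi diag -r 3 | failed_diag_tests
```

An empty result only means that no row matched this pattern, so read the full output before concluding that the GPU is healthy.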