Integrated Dell Remote Access Controller 9 User's Guide

GPU (Accelerators) Management

Dell PowerEdge servers are shipped with Graphics Processing Units (GPUs). GPU management lets you view the GPUs connected to the system and monitor their power, temperature, and thermal information.

This is a licensed feature and is available only with the iDRAC Datacenter and Enterprise licenses. The following properties depend on the Datacenter or Enterprise license as shown; other properties are listed even without these licenses:

GPU Properties                           Datacenter License   Enterprise License
Thermal Metrics
  GPU Target Temperature                 Yes                  No
  Minimum GPU HW Slowdown Temperature    Yes                  No
  GPU Shutdown Temperature               Yes                  No
  Maximum Memory Operating Temperature   Yes                  No
  Maximum GPU Operating Temperature      Yes                  No
  Thermal Alert State                    Yes                  No
  Power Brake State                      Yes                  No
Power Metrics
  Power Supply Status                    Yes                  No
  Board Power Supply Status              Yes                  No
Telemetry
  All Telemetry reports data             Yes                  No
NOTE:
  • GPU properties are not listed for Embedded GPU cards and the Status is marked as Unknown.
  • The operating temperature may be different for AMD-based systems.
  • The number of GPU entries per PCIe slot displayed in the host may differ from that in the iDRAC.
  • When a manual AC power cycle is required after performing any component or bundled firmware updates for GPUs or Power Distribution Board (PDB) CPLDs, the SUP0545 event is displayed in the Lifecycle (LC) logs. After this event, perform a manual AC or virtual AC power cycle to avoid unexpected behavior in the server.
  • After a GPU firmware update that includes component or bundled firmware updates, perform an AC power cycle or virtual AC power cycle to complete the update and avoid unexpected GPU-related behavior in iDRAC.

The GPU must be in the ready state before the command fetches the data. The GPUStatus field in Inventory shows whether the GPU is available and whether the GPU device is responding. If the GPU is ready, GPUStatus shows OK; otherwise, it shows Unavailable.
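
As a rough illustration of the GPUStatus check described above, the following Python sketch filters a list of inventory entries down to the GPUs that report OK. The GPUStatus field name and its OK and Unavailable values come from this guide; the list-of-dictionaries shape, the Name key, and the ready_gpus function are assumptions made for the example and are not an iDRAC API.

```python
def ready_gpus(inventory):
    """Return the inventory entries whose GPUStatus is "OK".

    `inventory` is assumed to be a list of dictionaries that were
    already collected from iDRAC by some other means (hypothetical shape).
    Entries with a missing GPUStatus are treated as not ready.
    """
    return [gpu for gpu in inventory if gpu.get("GPUStatus") == "OK"]


# Hypothetical sample data; the names are placeholders.
inventory = [
    {"Name": "gpu-slot-1", "GPUStatus": "OK"},
    {"Name": "gpu-slot-2", "GPUStatus": "Unavailable"},
]

for gpu in ready_gpus(inventory):
    print(gpu["Name"], "is ready to report data")
```

Only entries that pass this check can be expected to return power, temperature, and thermal data.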

The GPU offers multiple health parameters, which can be pulled through the SMBPB interface of the NVIDIA controllers. This feature is limited to NVIDIA cards. The following health parameters are retrieved from the GPU device:
  • Power
  • Temperature
  • Thermal
NOTE: This feature is limited to NVIDIA cards. This information is not available for any other GPU that the server may support. The interval for polling the GPU cards over the PBI is 5 seconds.
NOTE: Avoid connecting to iDRAC through the USB management port or iDRAC quick sync while updating NVIDIA GPU firmware, as the connection may cause a failure in firmware update.

The host system must have the NVIDIA driver installed and running for the Power consumption, GPU target temperature, Min GPU slowdown temperature, GPU shutdown temperature, Max memory operating temperature, and Max GPU operating temperature features to be available. These values are shown as N/A if the GPU driver is not installed.

In Linux, when the card is unused, the driver down-trains the card and unloads to save power. In such cases, the Power consumption, GPU target temperature, Min GPU slowdown temperature, GPU shutdown temperature, Max memory operating temperature, and Max GPU operating temperature features are not available. Enable persistence mode for the device to prevent the driver from unloading. You can do this with the nvidia-smi tool by running the command nvidia-smi -pm 1.
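
If you want to script this step on the Linux host, a minimal Python sketch might look like the following. It only wraps the nvidia-smi -pm 1 command named above; the function names are hypothetical, and the sketch assumes that nvidia-smi is on the PATH and that the script runs with sufficient privileges (typically root).

```python
import shutil
import subprocess


def persistence_mode_command(enable=True):
    """Build the nvidia-smi argument list that sets persistence mode."""
    return ["nvidia-smi", "-pm", "1" if enable else "0"]


def enable_persistence_mode():
    """Run `nvidia-smi -pm 1` on the host and report whether it succeeded.

    Returns False if nvidia-smi is not installed or exits with an error.
    """
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(
        persistence_mode_command(True),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode == 0
```

Calling enable_persistence_mode() on the host operating system returns True when the command exits successfully. The setting made this way is not assumed to persist across reboots, so you may need to run it again at startup.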

You can generate GPU reports using Telemetry. For more information on the telemetry feature, see Telemetry Streaming.

NOTE: In RACADM, you may see dummy GPU entries with empty values. This can happen if the device is not ready to respond when iDRAC queries it for information. Perform an iDRAC racreset operation to resolve this issue.
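
To spot such dummy entries in output that you have already parsed, a small Python sketch like the one below may help. The dictionary shape, the FQDD identifier key, the sample property names, and both function names are assumptions for illustration; they do not reflect actual RACADM output.

```python
def is_dummy_entry(entry, id_keys=("FQDD",)):
    """Return True if every non-identifier value in the entry is empty.

    `entry` is assumed to be a dictionary of property names to values.
    Keys listed in `id_keys` (hypothetical identifier fields) are ignored,
    so an entry that has only identifier fields also counts as a dummy.
    """
    values = [value for key, value in entry.items() if key not in id_keys]
    return all(value is None or str(value).strip() == "" for value in values)


def find_dummy_entries(entries):
    """Return the entries that look like dummy GPU entries."""
    return [entry for entry in entries if is_dummy_entry(entry)]


# Hypothetical parsed entries; keys and values are placeholders.
entries = [
    {"FQDD": "gpu-a", "GPUStatus": "OK", "Temperature": "41"},
    {"FQDD": "gpu-b", "GPUStatus": "", "Temperature": ""},
]

for entry in find_dummy_entries(entries):
    print(entry["FQDD"], "has empty values")
```

If the check reports any entries, the reset described in the note above is the documented way to resolve the issue.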

FPGA Monitoring

Field-programmable Gate Array (FPGA) devices need real-time temperature sensor monitoring because they generate significant heat when in use. Perform the following steps to get FPGA inventory information:

  • Power off the server.
  • Install the FPGA device on the riser card.
  • Power on the server.
  • Wait until POST is complete.
  • Log in to the iDRAC GUI.
  • Navigate to System > Overview > Accelerators. You can see both GPU and FPGA sections.
  • Expand the specific FPGA component to see the following sensor information:
    • Power consumption
    • Temperature details
NOTE: You must have the iDRAC Login privilege to access FPGA information.
NOTE: Power consumption sensors are available only for the supported FPGA cards and only with the Datacenter license.
