Integrated Dell Remote Access Controller 9 User's Guide

GPU (Accelerators) Management

Dell PowerEdge servers can be shipped with Graphics Processing Units (GPUs). GPU management lets you view the GPUs connected to the system and monitor their power, temperature, and thermal information.

NOTE This is a licensed feature and is available only with iDRAC Datacenter and Enterprise licenses. The following properties require a Datacenter or Enterprise license; other properties are listed even without these licenses:
  • Thermal Metrics:
    • GPU Target Temperature
    • Minimum GPU HW Slowdown Temperature
    • GPU Shutdown Temperature
    • Maximum Memory Operating temperature
    • Maximum GPU Operating Temperature
    • Thermal Alert State
    • Power Brake State
  • Power Metrics:
    • Power Supply Status
    • Board Power Supply Status
  • Telemetry: all GPU telemetry report data
NOTE GPU properties are not listed for embedded GPU cards, and their Status is marked as Unknown.

The GPU must be in the ready state before the command can fetch data. The GPUStatus field in the inventory shows whether the GPU is available and whether the GPU device is responding. If the GPU is ready, GPUStatus shows OK; otherwise, it shows Unavailable.
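
The readiness rule above can be sketched as a small wait loop. This is only an illustration: `wait_for_gpu_ready` and the `read_status` callable are hypothetical names, and how the GPUStatus value is actually read from the inventory is left to the caller. The 5-second default mirrors the polling interval stated in this section, and the 60-second timeout is an arbitrary assumption.

```python
import time
from typing import Callable


def wait_for_gpu_ready(
    read_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once read_status() reports "OK", or False on timeout.

    read_status is a hypothetical, caller-supplied function that returns
    the current GPUStatus string ("OK" or "Unavailable").
    """
    waited = 0.0
    while True:
        if read_status().strip() == "OK":
            return True
        if waited >= timeout_s:
            return False
        # Not ready yet: pause before asking again.
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays; in normal use the default `time.sleep` applies.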

The GPU exposes multiple health parameters that can be retrieved through the SMBPB interface of the NVIDIA controllers. This feature is limited to NVIDIA cards. The following health parameters are retrieved from the GPU device:
  • Power
  • Temperature
  • Thermal
NOTE Because this feature is limited to NVIDIA cards, this information is not available for any other GPU that the server may support. The GPU cards are polled over the PBI every 5 seconds.

The host system must have the NVIDIA driver installed and running for the Power consumption, GPU target temperature, Min GPU slowdown temperature, GPU shutdown temperature, Max memory operating temperature, and Max GPU operating temperature features to be available. These values are shown as N/A if the GPU driver is not installed.

In Linux, when the card is unused, the driver down-trains the card and unloads to save power. In such cases, the Power consumption, GPU target temperature, Min GPU slowdown temperature, GPU shutdown temperature, Max memory operating temperature, and Max GPU operating temperature features are not available. Enable persistence mode for the device to prevent the driver from unloading. You can do this with the nvidia-smi tool by running the command nvidia-smi -pm 1.
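
As a minimal sketch of the step above, the helpers below build the `nvidia-smi -pm 1` command and interpret the result of a persistence-mode query. The function names are hypothetical. The sketch assumes `nvidia-smi` is on the PATH of the host operating system, that the command is run with root privileges, and that `nvidia-smi --query-gpu=persistence_mode --format=csv,noheader` prints one `Enabled` or `Disabled` line per GPU.

```python
import subprocess
from typing import List, Optional


def persistence_command(enable: bool = True, gpu_index: Optional[int] = None) -> List[str]:
    """Build the nvidia-smi argument list that toggles persistence mode."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        # -i restricts the change to one GPU; without it, all GPUs are targeted.
        cmd += ["-i", str(gpu_index)]
    cmd += ["-pm", "1" if enable else "0"]
    return cmd


def parse_persistence_states(csv_output: str) -> List[bool]:
    """Parse the per-GPU lines of a persistence_mode query.

    Assumes each non-empty line is "Enabled" or "Disabled".
    """
    return [
        line.strip().lower() == "enabled"
        for line in csv_output.splitlines()
        if line.strip()
    ]


def enable_persistence(gpu_index: Optional[int] = None) -> None:
    """Run the command on the host; raises CalledProcessError on failure."""
    subprocess.run(persistence_command(True, gpu_index), check=True)
```

For example, `persistence_command()` yields `["nvidia-smi", "-pm", "1"]`, the same command given in the text.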

You can generate GPU reports using Telemetry. For more information about the telemetry feature, see Telemetry Streaming.

NOTE In RACADM, you may see dummy GPU entries with empty values. This can happen if the device is not ready to respond when iDRAC queries the GPU device for information. Perform an iDRAC racreset operation to resolve this issue.
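
One way to spot such dummy entries in a script is sketched below. It assumes the RACADM output has already been parsed into one dictionary per GPU entry. That parsed shape, the `FQDD` identifier key, and the sample property names are illustrative assumptions, not the documented RACADM output format.

```python
from typing import Dict, List


def find_dummy_entries(entries: List[Dict[str, str]], id_key: str = "FQDD") -> List[str]:
    """Return identifiers of entries whose properties are all empty.

    entries is a hypothetical, pre-parsed list of GPU inventory records;
    id_key names the field that identifies each record.
    """
    dummies = []
    for entry in entries:
        values = [value for key, value in entry.items() if key != id_key]
        # Flag the entry only if it has properties and every one is blank.
        if values and all(not str(value).strip() for value in values):
            dummies.append(entry.get(id_key, "<unknown>"))
    return dummies
```

If the function returns any identifiers, the note above applies: perform an iDRAC racreset operation and query the inventory again.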

FPGA Monitoring

Field-programmable Gate Array (FPGA) devices need real-time temperature sensor monitoring because they generate significant heat when in use. Perform the following steps to get FPGA inventory information:

  • Power off the server.
  • Install the FPGA device on the riser card.
  • Power on the server.
  • Wait until POST is complete.
  • Log in to the iDRAC GUI.
  • Navigate to System > Overview > Accelerators. You can see both GPU and FPGA sections.
  • Expand the specific FPGA component to see the following sensor information:
    • Power consumption
    • Temperature details
NOTE You must have the iDRAC Login privilege to access FPGA information.
NOTE Power consumption sensors are available only for supported FPGA cards and only with the Datacenter license.
