PowerEdge: NVIDIA DataCenter GPU Manager (DCGM) install and how to run diagnostics

Summary: Overview of how to install NVIDIA's Data Center GPU Manager (DCGM) tool on Linux (RHEL/Ubuntu), and how to run and understand the diagnostics application.

This article is not tied to any specific product. Not all product versions are identified in this article.

Instructions

How to install DCGM in Linux:

https://developer.nvidia.com/dcgm#Downloads
https://github.com/NVIDIA/DCGM
DCGM 3.3 User and Install Guide

 

Installing the Latest DCGM
By downloading and using the software, you agree to fully comply with the terms and conditions of the NVIDIA DCGM License.
It is recommended to use the latest R450+ NVIDIA data center driver, which can be downloaded from the NVIDIA Driver Downloads page.
The recommended method is to install DCGM directly from the CUDA network repositories. Older DCGM releases are also available from the repos.
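
Before installing DCGM, you can confirm which driver version is present. A quick check, assuming the NVIDIA data center driver and nvidia-smi are already installed:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
(This should report an R450 or newer data center driver for every GPU.)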

 

Features of DCGM:

  • GPU behavior monitoring
  • GPU configuration management
  • GPU policy oversight
  • GPU health and diagnostics
  • GPU accounting and process statistics
  • NVSwitch configuration and monitoring

 

QuickStart Instructions:

Ubuntu LTS
Set up the CUDA network repository metadata and GPG key. The example shown below is for Ubuntu 20.04 on x86_64:

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"

 

Install DCGM.

$ sudo apt-get update \
&& sudo apt-get install -y datacenter-gpu-manager
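
On Ubuntu, the DCGM host engine can then be enabled the same way as on Red Hat below; a minimal sketch, assuming the package ships the same nvidia-dcgm systemd unit:

$ sudo systemctl --now enable nvidia-dcgm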

 

Red Hat
Set up the CUDA network repository metadata and GPG key. The example shown below is for Red Hat Enterprise Linux 8 on x86_64:

*Pro tip: for the RHEL 9 repo, simply replace the 8 with 9 in the URL below.*
$ sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

 

Install DCGM.

$ sudo dnf clean expire-cache \
&& sudo dnf install -y datacenter-gpu-manager

Set up the DCGM service:

$ sudo systemctl --now enable nvidia-dcgm
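
Once the service is running, a quick way to verify that the DCGM host engine is reachable is to list what it discovers:

$ dcgmi discovery -l
(This lists the GPUs, and any NVSwitches, that DCGM can see.)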

 

How to run DCGM:

Data Center GPU Manager (DCGM) gives customers a quicker way to test GPUs from within the OS. There are four levels of tests; run the level 4 test for the most in-depth results. It typically takes around 1 hour 30 minutes, but this varies with GPU type and quantity. The tool can also be configured to run the tests automatically and alert the customer; you can find more on that at this link. We advise always using the latest version; version 3.3 is the latest build at the time of writing.
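
By default the diagnostic runs against all GPUs. To target a subset, DCGM groups can be used; the sketch below follows the pattern from the DCGM user guide, and the group name, group ID, and GPU IDs are illustrative only (use the group ID returned by the create command on your system):

$ dcgmi group -c diag_group        # create an empty GPU group and note the returned group ID
$ dcgmi group -g 2 -a 0,1          # add GPUs 0 and 1 to group ID 2
$ dcgmi diag -g 2 -r 3             # run the level 3 diagnostic against only that group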

 

 

Example #1:

Command: dcgmi diag -r 1
(Screenshot: example output of dcgmi diag -r 1)

 

 

Example #2:

Command: dcgmi diag -r 2
(Screenshot: example output of dcgmi diag -r 2)

Example #3:

Command: dcgmi diag -r 3
(Screenshot: example output of dcgmi diag -r 3)

 

Example #4:

Command: dcgmi diag -r 4
(Screenshot: example output of dcgmi diag -r 4)

 

The diagnostic might miss some errors due to their niche nature, workload specificity, or the need for extended run times to detect them.
Should you see an error, investigate it to fully understand its nature.
Start by running the nvidia-bug-report.sh script (native to Linux only, not Windows) and reviewing the output file.
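
A minimal run of the bug report script, assuming it was installed along with the NVIDIA driver; it writes a compressed log bundle into the current directory:

$ sudo nvidia-bug-report.sh
(This produces nvidia-bug-report.log.gz in the working directory; review it for ECC, XID, and driver messages.)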

 

Examples of a memory alert failure:
The example below enables and starts the DCGM health monitor, followed by a check of all installed GPUs in the server. You can see that GPU3 produced a warning about SBEs (single-bit errors) and that the driver wants to retire the impacted memory address.
Command: dcgmi health -s a (this starts the health service and the "a" tells it to watch everything)
Command: dcgmi health -c (this checks all discovered GPUs and reports back on them)
(Screenshot: example output of the dcgmi health commands)
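
The same health watches can be scoped to a DCGM group instead of every GPU; a sketch, reusing the illustrative group ID from the earlier group example:

$ dcgmi health -g 2 -s a           # set health watches on group 2 only
$ dcgmi health -g 2 -c             # check and report on the GPUs in that group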

 

Another place to see the memory faults is in the output below. Edited to show only the memory-related items, it shows that the GPU encountered 3,081 SBEs, with a lifetime aggregate count of 6,161. We also see that the GPU has one page already retired due to SBEs, with an additional pending page blacklisted.
(Screenshot: memory-related excerpt of the GPU query output)
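
Comparable counters can also be pulled directly from the driver with nvidia-smi; a sketch, assuming GPU index 3 is the suspect device:

$ nvidia-smi -q -d ECC,PAGE_RETIREMENT -i 3
(This dumps the ECC error counts and the retired or pending page information for GPU 3.)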

 

If you see memory faults on a GPU, the device itself has to be reset. Either a full system reboot or issuing an nvidia-smi GPU reset against the device accomplishes this.
After the driver is unloaded, the blacklisted memory address is mapped out. When the driver reloads, the GPU gets a new address table with the impacted addresses blocked (similar to PPR on Intel CPUs).
If the GPU is not reset, the volatile and aggregate counters often continue to increment, because the GPU is still allowed to use the impacted address; each time it is hit, the counters increase.
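
A sketch of the per-device reset, again assuming GPU index 3; it requires root and that no processes are using the GPU:

$ sudo nvidia-smi --gpu-reset -i 3
(Alternatively, reboot the whole system to achieve the same result.)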

 

If you still suspect faults in one or more GPUs, run the NVIDIA fieldiag (629 diagnostics) for a more in-depth test of the targeted GPU.

 

Ensure that you use the latest and correct fieldiag for the GPU installed! This is critical!

Affected Products

C Series, PowerEdge R640, PowerEdge R6415, PowerEdge R650, PowerEdge R650xs, PowerEdge R6515, PowerEdge R6525, PowerEdge R660, PowerEdge R660xs, PowerEdge R6615, PowerEdge R6625, PowerEdge R740, PowerEdge R740XD, PowerEdge R740XD2, PowerEdge R7415 , PowerEdge R7425, PowerEdge R750, PowerEdge R750XA, PowerEdge R750xs, PowerEdge R7515, PowerEdge R7525, PowerEdge R760, PowerEdge R760XA, PowerEdge R760xd2, PowerEdge R760xs, PowerEdge R7615, PowerEdge R7625, PowerEdge R840, PowerEdge R860, PowerEdge R940, PowerEdge R940xa, PowerEdge R960, PowerEdge T550, PowerEdge T560, PowerEdge T640, PowerEdge XE8545, PowerEdge XE8640, PowerEdge XE9640, PowerEdge XE9680, Red Hat Enterprise Linux Version 7, Red Hat Enterprise Linux Version 9, Red Hat Enterprise Linux Version 8, SUSE Linux Enterprise Server 15, Ubuntu Server LTS ...
Article Properties
Article Number: 000219485
Article Type: How To
Last Modified: 27 May 2025
Version:  5