PowerEdge: NVIDIA DataCenter GPU Manager (DCGM) install and how to run diagnostics
Summary: Overview of how to install NVIDIA's DCGM (Data Center GPU Manager) tool in Linux (RHEL/Ubuntu) and how to run and understand the diagnostics application.
Instructions
How to install DCGM in Linux:
https://developer.nvidia.com/dcgm#Downloads
https://github.com/NVIDIA/DCGM
DCGM 3.3 User and Install Guide
Installing the Latest DCGM
By downloading and using the software, you agree to fully comply with the terms and conditions of the NVIDIA DCGM License.
It is recommended to use the latest R450+ NVIDIA data center driver that can be downloaded from the NVIDIA Driver Downloads page.
The recommended method is to install DCGM directly from the CUDA network repositories. Older DCGM releases are also available from the same repos.
Features of DCGM:
- GPU behavior monitoring
- GPU configuration management
- GPU policy oversight
- GPU health and diagnostics
- GPU accounting and process statistics
- NVSwitch configuration and monitoring
QuickStart Instructions:
Ubuntu LTS
Set up the CUDA network repository metadata and GPG key. The example shown below is for Ubuntu 20.04 on x86_64:
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
Install DCGM.
$ sudo apt-get update && sudo apt-get install -y datacenter-gpu-manager
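Note: on Ubuntu the package also provides the nvidia-dcgm systemd unit used in the Red Hat steps below (this is an assumption based on that section; verify with systemctl if unsure). Enable it before running any dcgmi commands:
$ sudo systemctl --now enable nvidia-dcgm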
Red Hat
Set up the CUDA network repository metadata and GPG key. The example shown below is for Red Hat Enterprise Linux 8 on x86_64:
*Pro-Tip: for the RHEL 9 repo, simply replace the 8 with 9 in the URL string below.*
$ sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
Install DCGM.
$ sudo dnf clean expire-cache \
  && sudo dnf install -y datacenter-gpu-manager
Set up the DCGM service:
$ sudo systemctl --now enable nvidia-dcgm
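To confirm the host engine came up cleanly (a minimal sanity check that applies to either distribution):
$ systemctl status nvidia-dcgm   # should report active (running)
$ dcgmi discovery -l             # lists the GPUs and any NVSwitches DCGM can manage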
How to run DCGM:
Data Center GPU Manager (DCGM) is a quick way for customers to test GPUs from within the OS. There are four levels of tests; run the level 4 test for the most in-depth results. It typically takes around 1 hour 30 minutes, but this varies with GPU type and quantity. The tool can also be configured to run the tests automatically and alert the customer; you can find more on that from this link, and see the sketch after the examples below. We advise always using the latest version; version 3.3 is the latest build.
Example #1:
Command: dcgmi diag -r 1
Example #2:
Command: dcgmi diag -r 2
Example #3:
Command: dcgmi diag -r 3
Example #4:
Command: dcgmi diag -r 4
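If you want to keep the results for your records or a support case, or want a basic form of the automatic runs mentioned above, a simple sketch is to redirect the output and schedule a run with cron (the schedule and file names here are only examples; the -j JSON option is available in recent DCGM releases):
$ dcgmi diag -r 4 | tee dcgm-diag-r4-$(hostname)-$(date +%F).log   # keep a copy of the level 4 results
$ dcgmi diag -r 3 -j > dcgm-diag-r3.json                           # machine-readable JSON for later parsing
# Example cron entry: run a quick level 1 sweep every night at 02:00 and append to a log
0 2 * * * /usr/bin/dcgmi diag -r 1 >> /var/log/dcgm-diag.log 2>&1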
The diagnostic might miss some errors due to their niche nature, workload specificity, or the need for extended run times to detect them.
Should you see an error, investigate it to fully understand its nature.
Start by running the nvidia-bug-report.sh script (native to the Linux OS only; not available on Windows) and reviewing the output file.
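A minimal sketch of collecting and skimming that report (the script typically writes nvidia-bug-report.log.gz to the current directory):
$ sudo nvidia-bug-report.sh
$ zgrep -iE "xid|ecc|retired" nvidia-bug-report.log.gz   # quick scan for Xid events and memory errors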
Examples of a memory alert failure:
The example below enables and starts the DCGM health monitor, then runs a check on all installed GPUs in the server. You can see that GPU3 produced a warning about SBEs (single-bit errors) and that the driver wants to retire the impacted memory address.
Command: dcgmi health -s a (this starts the health service and the "a" tells it to watch everything)
Command: dcgmi health -c (this checks all discovered GPUs and reports back on them)
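Put together, a minimal monitoring sketch looks like this (the 300-second interval is arbitrary):
$ dcgmi health -s a
$ dcgmi health -c
$ watch -n 300 dcgmi health -c   # re-run the check every 5 minutes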
Another place you can see the memory faults is in the output below. Edited to show only the memory-related items, it shows that the GPU encountered 3,081 SBEs, with a lifetime aggregate count of 6,161. We also see that the GPU has one page previously retired due to SBEs, with an additional pending page blacklisted.
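You can cross-check those counters outside of DCGM with nvidia-smi (a sketch; GPU index 3 is only an example matching the GPU3 warning above):
$ nvidia-smi -q -d ECC -i 3               # volatile and aggregate single/double-bit error counts
$ nvidia-smi -q -d PAGE_RETIREMENT -i 3   # retired pages and any pages pending retirement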
In the event you see memory faults on a GPU, the device itself has to be reset. A full system reboot or issuing an nvidia-smi GPU reset against the device accomplishes this.
After the driver is unloaded, the marked (blacklisted) memory address is mapped out. When the driver reloads, the GPU gets a new address table with the impacted addresses blocked (similar to Post Package Repair, or PPR, on system memory).
Failing to reset the GPU often leads to the volatile and aggregate counters continuing to increment. This is because the GPU is still allowed to use the impacted address, so each time it is hit the counters increment.
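A minimal sketch of resetting a single GPU in place (GPU index 3 is only an example; any processes using the GPU must be stopped first or the reset is refused):
$ sudo nvidia-smi --gpu-reset -i 3
$ nvidia-smi -q -d PAGE_RETIREMENT -i 3   # confirm the pending page is now retired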
If you still suspect faults in one or more GPUs, run the NVIDIA fieldiag (629 diagnostics) for a more in-depth test on the targeted GPU.
Ensure that you use the latest and correct fieldiag package for the GPU installed! This is critical!