PowerEdge: R760xa - NVLink Inactive and DCGMI Failure on NVIDIA H100 NVL with Bridge
Summary: Reports show NVLink failures on an R760xa with NVIDIA H100 NVL GPUs and NVLink bridges, with 12 of 18 links active; standby links are incorrectly identified as down due to an issue in DCGM versions up to 3.1.3.1. ...
Symptoms
The DCGMI diagnostic reports a failure due to NVLink links being down on NVIDIA H100 NVL GPUs when they are used with NVLink bridges.
The issue persists even after reseating or replacing both GPUs, all three NVLink bridges connecting the two GPUs, and the risers on which they are installed. Only 12 of the 18 NVLinks (6 on each NVLink bridge) are up.

The last two links on each NVLink bridge are always inactive.

Cause
H100 silicon has 18 NVLink connections in groups of 6, but on the H100 NVL PCIe GPU only 12 of the 18 paths are up and functional, while the remaining paths are in a standby state.
The two "inactive" links on each NVLink bridge are used for failover if there is a problem with the first four links on that bridge. The H100 PCIe GPU requires 12 active links.
Three bridges are still required to allow for failover if bad links arise (on the GPUs, the bridges, or both).
Due to a problem in DCGM version 3.1.3.1 and below, inactive NVLinks are reported as a failure.
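The link layout described above can be sketched as a small check that separates the expected standby pattern from anything else. This is a minimal illustration, not a DCGM or NVIDIA API: the function name, the 0 to 17 link indexing, and the assumption that each bridge owns a contiguous group of six links are all hypothetical, and real tools may number links differently.

```python
from typing import List

LINKS_PER_BRIDGE = 6   # from the article: 6 links on each NVLink bridge
ACTIVE_PER_BRIDGE = 4  # from the article: the first four links are active
BRIDGE_COUNT = 3       # from the article: three bridges between the two GPUs


def matches_standby_pattern(link_up: List[bool]) -> bool:
    """Return True if link states match the expected 12-of-18 layout.

    Assumption: links are indexed 0..17 and each bridge owns a contiguous
    group of six indices (hypothetical numbering for illustration).
    """
    if len(link_up) != LINKS_PER_BRIDGE * BRIDGE_COUNT:
        return False
    standby = LINKS_PER_BRIDGE - ACTIVE_PER_BRIDGE
    expected = [True] * ACTIVE_PER_BRIDGE + [False] * standby
    for bridge in range(BRIDGE_COUNT):
        start = bridge * LINKS_PER_BRIDGE
        if link_up[start:start + LINKS_PER_BRIDGE] != expected:
            return False
    return True


# Four active links followed by two standby links, repeated per bridge.
healthy = ([True] * 4 + [False] * 2) * 3

# One of the normally active links is down: not the standby pattern.
faulty = list(healthy)
faulty[1] = False
```

In this sketch, `matches_standby_pattern(healthy)` is true and `matches_standby_pattern(faulty)` is false. Only a state that does not match the pattern would merit further investigation.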
Resolution
DO NOT REPLACE ANY HARDWARE FOR THIS ISSUE.
DCGM version 3.1.6 fixes the issue.
https://docs.NVIDIA.com/datacenter/dcgm/latest/release-notes/changelog.html
The customer must download and install DCGM 3.1.6 or later to resolve the issue.
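The version requirement can be expressed as a simple comparison. The sketch below assumes the installed DCGM version is already available as a plain dotted string such as "3.1.3.1"; how that string is obtained is left out, and the helper names are hypothetical.

```python
from typing import Tuple

FIXED_VERSION = (3, 1, 6)  # first fixed release, per this article


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '3.1.3.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(installed: str) -> bool:
    """Return True if the installed DCGM version predates the fix."""
    # Tuples compare element by element, so (3, 1, 3, 1) < (3, 1, 6).
    return parse_version(installed) < FIXED_VERSION
```

For example, `needs_upgrade("3.1.3.1")` is true, while `needs_upgrade("3.1.6")` is false. Comparing integer tuples avoids the pitfalls of comparing version strings character by character.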