DCGMI diagnostics report a failure due to an NVLink link being down on NVIDIA H100 NVL GPUs connected with NVLink bridges. The issue persists even after reseating or replacing both GPUs, all three NVLink bridges connecting the two GPUs, and the risers on which they are installed. Of the 18 NVLinks (6 on each NVLink bridge), only 12 are up. The last two links on each NVLink bridge are always inactive.
H100 silicon has 18 NVLink connections in groups of 6, but on the H100 NVL PCIe GPU only 12 of the 18 paths are up and functional, while the remaining 6 are in a standby state. The two "inactive" links on each bridge are reserved for failover in case a problem occurs with the first four links on that bridge. The H100 PCIe GPU requires 12 active links. All three bridges are still required so that failover is possible if bad links arise on the GPUs or a bridge. Due to a problem in DCGM version 3.1.3.1 and below, inactive NVLinks are reported as a failure.
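The expected link pattern described above can be sketched as a small check. This is only an illustration: the function name `matches_standby_pattern`, the constants, and the assumption that link states arrive as a flat list ordered bridge by bridge (six entries per bridge, `True` meaning the link is up) are hypothetical and are not part of DCGM or any NVIDIA tool.

```python
from typing import List

# Figures taken from the advisory text.
BRIDGES = 3
LINKS_PER_BRIDGE = 6
ACTIVE_PER_BRIDGE = 4  # the first four links on each bridge are up


def matches_standby_pattern(link_up: List[bool]) -> bool:
    """Return True if the link states match the expected H100 NVL layout:
    on every bridge the first four links are up and the last two are in standby.

    Assumption (hypothetical input format): link_up holds 18 booleans,
    ordered bridge by bridge, True meaning the link is up.
    """
    if len(link_up) != BRIDGES * LINKS_PER_BRIDGE:
        raise ValueError("expected 18 link states")

    expected_group = (
        [True] * ACTIVE_PER_BRIDGE
        + [False] * (LINKS_PER_BRIDGE - ACTIVE_PER_BRIDGE)
    )
    for bridge in range(BRIDGES):
        start = bridge * LINKS_PER_BRIDGE
        group = link_up[start:start + LINKS_PER_BRIDGE]
        if group != expected_group:
            return False
    return True
```

A result of `True` corresponds to the benign case in this advisory (12 of 18 links up, with the last two links on each bridge inactive). Any other pattern, such as one of the first four links on a bridge being down, is outside the scope of this issue and would need separate investigation.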
DO NOT REPLACE ANY HARDWARE FOR THIS ISSUE. DCGM version 3.1.6 fixes the issue. Changelog: https://docs.NVIDIA.com/datacenter/dcgm/latest/release-notes/changelog.html The customer must download and install DCGM 3.1.6 or later to resolve the issue.
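Whether an installed DCGM version already contains the fix can be checked with a simple version comparison. The sketch below is a minimal example under stated assumptions: the helper names `parse_version` and `has_nvlink_fix` are hypothetical, and the version string is assumed to be a dotted numeric string obtained by whatever means is available on the system (the sketch does not call any DCGM command itself).

```python
import re
from typing import Tuple

# First DCGM version that fixes the issue, per the advisory.
FIXED_VERSION = (3, 1, 6)


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the leading dotted numeric version from a string.

    Hypothetical helper: "3.1.3.1" -> (3, 1, 3, 1); a trailing suffix
    such as "-1" is ignored.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def has_nvlink_fix(installed: str) -> bool:
    """Return True if the installed DCGM version is 3.1.6 or later."""
    return parse_version(installed) >= FIXED_VERSION
```

For example, `has_nvlink_fix("3.1.3.1")` evaluates to `False`, so the diagnostic failure on inactive NVLinks is expected and an upgrade is needed, while `has_nvlink_fix("3.1.6")` evaluates to `True`.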