Typical events and alerts that indicate this issue include:
- Host kernel logs showing Xid 74 and/or 79 from the NVIDIA driver. On Linux, Xid error messages are in /var/log/messages. Use the command grep "NVRM: Xid " on that file to locate all Xid messages. Example of an Xid string:
  [...] NVRM: Xid (PCI:0000:1a:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
- Dell Lifecycle Logs may report a PCI1318 Bus Fatal Error:
  PCI1318: A fatal error was detected on a component at bus <X> device <Y> function <Z>
The issue may also be transient in nature:
- GPUs may show up again after an AC power cycle.
- The number of missing GPUs and the slots they disappear from may change.
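As a rough illustration of the log check described above, the following Python sketch filters log lines for Xid 74 and 79 and extracts the PCI address. The regular expression is an assumption based only on the single example string shown above, and the names XID_PATTERN, XidEvent, and find_xid_events are hypothetical; other driver versions may format the message differently.

```python
import re
from typing import Iterable, List, NamedTuple

# Assumed pattern, derived from the one example Xid string in this article:
#   NVRM: Xid (PCI:0000:1a:00): 79, pid='<unknown>', ...
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),"
)


class XidEvent(NamedTuple):
    pci: str   # PCI address as printed by the driver, e.g. "PCI:0000:1a:00"
    code: int  # Xid number, e.g. 79
    line: str  # the full log line, without the trailing newline


def find_xid_events(lines: Iterable[str], codes=(74, 79)) -> List[XidEvent]:
    """Return the Xid events in `lines` whose Xid number is in `codes`."""
    events = []
    for line in lines:
        match = XID_PATTERN.search(line)
        if match is None:
            continue
        code = int(match.group("code"))
        if code in codes:
            events.append(XidEvent(match.group("pci"), code, line.rstrip("\n")))
    return events
```

To scan a host, the function could be given an open file object for /var/log/messages (for example, `find_xid_events(open("/var/log/messages", errors="replace"))`). Grouping the results by PCI address across reboots may help show whether the affected slots change over time.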
Xid and Bus Fatal Error messages are often related to PCIe Retimer errors. These errors can happen due to signal integrity issues within the PCIe architecture.
Dell has released updated GPU accelerator firmware. This update includes enhancements to the PCIe Retimers, which improve signal integrity and help prevent GPUs from falling off the bus. As a first step, if there are bus fatal errors, it is advisable to update to the firmware listed below or newer.

GPU Accelerator | Bundle Version         | Retimer Version
H100            | 20.24.07.10 - FW 1.5.0 | 2.10.42
H200            | 20.24.07.10 - FW 1.5.0 | 2.10.42
H800            | TBD*                   | TBD*
H20             | TBD*                   | TBD*

IMPORTANT: The server must be powered on for the firmware update to apply. An AC reboot, a virtual AC power cycle, or a full power cycle is required after completion for the firmware changes to take effect.
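The "listed version or newer" check can be sketched as a numeric comparison of dotted version strings. This is only an illustration: it assumes Retimer versions compare field by field as integers, the names MIN_RETIMER_VERSION, parse_version, and retimer_needs_update are hypothetical, and how the installed Retimer version is read from a server is outside the scope of this sketch.

```python
# Minimum Retimer versions taken from the versions listed above.
# H800 and H20 are omitted because their versions are still TBD.
MIN_RETIMER_VERSION = {
    "H100": "2.10.42",
    "H200": "2.10.42",
}


def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '2.10.42' into (2, 10, 42)."""
    return tuple(int(part) for part in text.strip().split("."))


def retimer_needs_update(gpu_model: str, installed_version: str) -> bool:
    """Return True if the installed Retimer version is older than the listed one.

    Raises KeyError for models without a listed minimum version.
    """
    minimum = MIN_RETIMER_VERSION[gpu_model]
    return parse_version(installed_version) < parse_version(minimum)
```

Comparing integer tuples rather than raw strings matters here: as plain strings, "2.9.50" would sort after "2.10.42", even though it is the older version.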