On an HPE ProLiant Compute XD685 server under heavy GPU stress, GPUs may drop from the PCIe bus, or a PCIe bus error may occur, when running with multiple NUMA domains per processor [example: BIOS workload profile set to High Performance Computing (HPC)]. When this occurs, the system reboots and applications stop responding.

The Linux dmesg log will contain the following entry:

NVRM: Xid (PCI:0000:xx:00):79, pid=xxxx,name=xxxx,GPU has fallen off the bus.

In addition, the HPE Integrated Lights-Out (iLO) Integrated Management Log (IML) may contain entries similar to the following:

Uncorrectable PCI Express Error Detected. Slot __ (Segment 0x_, Bus 0x__, Device 0x_, Function 0x_) Uncorrectable Error Status 0x14000

This occurs if GPU access to memory exceeds the allowable configured PCI Express timeout because of strongly ordered PCIe transactions.
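To check a captured dmesg log for the entry described above, a small parser can be used. The following is a minimal sketch, not an HPE or NVIDIA tool: the function name `find_xid_events` and the regular expression are illustrative assumptions based only on the log line quoted in this advisory, and the pattern tolerates an optional space before the Xid number.

```python
import re

# Assumed line shape, taken from the entry quoted in this advisory:
#   NVRM: Xid (PCI:0000:xx:00):79, pid=xxxx,name=xxxx,GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[^)]+)\):\s*(\d+),")


def find_xid_events(log_text, xid_code=79):
    """Return the PCI addresses of all Xid log entries with the given code.

    find_xid_events is a hypothetical helper for illustration only.
    """
    addresses = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        # Keep only entries whose Xid number equals the requested code.
        if match and int(match.group(2)) == xid_code:
            addresses.append(match.group(1))
    return addresses
```

The function takes the log as a string, so it can be fed the saved output of dmesg. An empty list means no matching Xid entry was found in that text; it does not rule out the IML error described above.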
Any HPE ProLiant Compute XD685 under heavy GPU stress loads and running any of the following operating systems:

Ubuntu 24.04
Ubuntu 22.04
Red Hat Enterprise Linux 9.4
Increase the PCIe completion timeout value and enable relaxed ordering in the NVIDIA GPU driver.

Enable GPU relaxed ordering (reboot required):

1. Edit /etc/modprobe.d/nvidia.conf
2. Add the following line to the file: options nvidia NVreg_EnablePCIERelaxedOrderingMode=1
3. Save the file
4. Reboot the system

Verify that GPU relaxed ordering is enabled (following reboot):

cat /proc/driver/nvidia/params | grep Relax

Expected result: EnablePCIERelaxedOrderingMode: 1

Minimum requirements for the above workaround (HPE highly recommends using the latest BIOS available):

System ROM version 3/14/2025: Software Details - System ROM Flash Binary - HPE ProLiant Compute XD685 (A59) Servers | HPE Support
PCIe switch firmware 4.15.01.30
Relaxed ordering enabled for the NVIDIA driver within the Linux OS

In addition, the following minimum CPLD versions should be installed:

PCIe Switch Board CPLD: 0C
MB CPLD: 07
Converter Board CPLD: 0A
OCP Retimer CPLD: 08
LP PCIe Retimer CPLD: 08

These updates require the assistance of HPE. Please contact HPE Support and refer to the following document number: a00155310.

Important: If, after updating all the necessary firmware, a system still encounters a GPU dropping off the bus or a PCI device disappearing, be sure to also update all the required drivers to the latest available versions.

Document Version | Release Date | Details
4 | February 23, 2026 | Updated document with additional information to update the drivers to the latest available versions as well, if the issue persists
3 | November 19, 2025 | Updated document with additional information
2 | October 22, 2025 | Updated to add additional information to the Resolution section
1 | April 25, 2025 | Original document release
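The post-reboot verification in the resolution (reading /proc/driver/nvidia/params and looking for EnablePCIERelaxedOrderingMode: 1) can also be scripted, for example to check many nodes. This is a minimal sketch under stated assumptions: the helper names `relaxed_ordering_enabled` and `check_system` are hypothetical, and the "name: value" line format is assumed from the expected result quoted in this advisory.

```python
from pathlib import Path

# Path and parameter name as given in the advisory.
PARAMS_PATH = Path("/proc/driver/nvidia/params")
PARAM_NAME = "EnablePCIERelaxedOrderingMode"


def relaxed_ordering_enabled(params_text):
    """Return True if the parameter is 1, False for any other value,
    or None if the parameter is not listed at all."""
    for line in params_text.splitlines():
        # Assumed format: "EnablePCIERelaxedOrderingMode: 1"
        name, sep, value = line.partition(":")
        if sep and name.strip() == PARAM_NAME:
            return value.strip() == "1"
    return None


def check_system(path=PARAMS_PATH):
    """Read the driver parameters file if present (hypothetical helper).

    Returns None when the file does not exist, for example when the
    NVIDIA driver is not loaded.
    """
    if not path.exists():
        return None
    return relaxed_ordering_enabled(path.read_text())
```

Calling `check_system()` on an affected server after the reboot should return True once the modprobe option has taken effect. A result of False or None suggests rechecking /etc/modprobe.d/nvidia.conf and confirming that the system was rebooted.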
Operating Systems Affected: Red Hat Enterprise Linux 9.4, Ubuntu 22.04 LTS, Ubuntu 24.04 LTS