On an HPE ProLiant Compute XD685 server under heavy GPU stress, GPUs may drop from the PCIe bus, or a PCIe bus error may occur, when the system is running with multiple NUMA domains per processor [example: BIOS workload profile set to High Performance Computing (HPC)]. When this occurs, the system reboots and applications stop responding. The HPE Integrated Lights-Out (iLO) Integrated Management Log (IML) may contain entries similar to the following:

Uncorrectable PCI Express Error Detected. Slot __ (Segment 0x_, Bus 0x__, Device 0x_, Function 0x_) Uncorrectable Error Status 0x14000

This occurs if GPU access to memory exceeds the allowable configured PCI Express timeout due to strongly ordered PCIe transactions.
Any HPE ProLiant Compute XD685 server under heavy GPU stress loads and running any of the following operating systems: Ubuntu 24.04, Ubuntu 22.04, or Red Hat Enterprise Linux 9.4.
Increase the PCIe completion timeout value and enable relaxed ordering in the NVIDIA GPU driver:

Enable GPU relaxed ordering (reboot required):
1. Edit /etc/modprobe.d/nvidia.conf
2. Add the following line to the file: options nvidia NVreg_EnablePCIERelaxedOrderingMode=1
3. Save the file.
4. Reboot the system.

Verify that GPU relaxed ordering is enabled (following reboot):
cat /proc/driver/nvidia/params | grep Relax
Expected result: EnablePCIERelaxedOrderingMode: 1
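
The relaxed-ordering steps above can be sketched as a small shell helper. This is an illustrative sketch, not an HPE-supplied script: the function names add_option and verify_option and the CONF_FILE and PARAMS_FILE variables are hypothetical, and it assumes the parameter appears in /proc/driver/nvidia/params in the "Name: value" form shown in the expected result. It does not change the PCIe completion timeout, and a reboot is still required between the two functions.

```shell
#!/bin/sh
# Hypothetical helper for the relaxed-ordering steps in this advisory.
# Paths default to those named in the advisory; override them for a dry run.
CONF_FILE="${CONF_FILE:-/etc/modprobe.d/nvidia.conf}"
PARAMS_FILE="${PARAMS_FILE:-/proc/driver/nvidia/params}"
OPTION_LINE='options nvidia NVreg_EnablePCIERelaxedOrderingMode=1'

# Append the option line only if the exact line is not already present,
# so running the helper twice does not duplicate the entry.
add_option() {
    if ! grep -qxF "$OPTION_LINE" "$CONF_FILE" 2>/dev/null; then
        printf '%s\n' "$OPTION_LINE" >> "$CONF_FILE"
    fi
}

# After the reboot: succeed only if the driver reports the mode as 1.
verify_option() {
    grep -qE '^EnablePCIERelaxedOrderingMode: 1[[:space:]]*$' "$PARAMS_FILE"
}
```

A possible usage, assuming root privileges: run add_option, reboot the system, then run verify_option and check its exit status (0 means relaxed ordering is reported as enabled).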
Operating Systems Affected: Ubuntu 22.04 LTS, Ubuntu 24.04 LTS, Red Hat Enterprise Linux 9.4