...
HPE ProLiant Gen10 Plus, ProLiant Gen10 Plus V2, and Apollo Gen10 Plus servers may experience uncorrectable PCIe bus errors. The affected servers are configured with AMD EPYC 7xx2- or 7xx3-series processors, where "xx" can be any characters that match an AMD processor model number.

The failure message displayed in the Integrated Management Log (IML) may resemble the following examples.

Uncorrectable PCI Express Error Detected. Slot 3 (Segment 0x0, Bus 0x43, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x40000
ACTION: Update the firmware of the failing device. If the issue persists, replace the device.

Uncorrectable PCI Express Error Detected. Slot 3 (Segment 0x0, Bus 0x43, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x44000
ACTION: Update the firmware of the failing device. If the issue persists, replace the device.

Uncorrectable PCI Express Error Detected. Slot 7 (Segment 0x0, Bus 0xCB, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x4000
ACTION: Update the firmware of the failing device. If the issue persists, replace the device.

The IML entries above indicate a "completion timeout" error signaled by an endpoint PCIe option. This is usually a device capable of high-bandwidth data transfers, such as an InfiniBand option card or a GPU.

Mellanox network and InfiniBand adapters with older firmware may signal only an uncorrectable error status of 0x40000, which indicates a malformed TLP error. This is due to a bug that is fixed with an update that can be downloaded here. Updated Mellanox adapters will signal an uncorrectable error status of 0x44000.
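The status values in the IML examples are bit masks. As a rough aid for reading them, the sketch below decodes a value into error names. It assumes the bit positions of the AER Uncorrectable Error Status register as defined in the PCI Express Base Specification, and it lists only a subset of the defined bits. The function name decode_uncorrectable_status is hypothetical and is not part of any HPE tool.

```shell
#!/usr/bin/env bash
# Sketch: decode the "Uncorrectable Error Status" value from an IML entry.
# Assumption: bit positions follow the AER Uncorrectable Error Status
# register layout in the PCI Express Base Specification (subset only).
decode_uncorrectable_status() {
    local status=$(( $1 ))          # accepts hex input such as 0x44000
    local out=""
    local entry bit name
    for entry in "12:Poisoned TLP" "14:Completion Timeout" "15:Completer Abort" \
                 "16:Unexpected Completion" "18:Malformed TLP" "20:Unsupported Request"; do
        bit=${entry%%:*}            # text before the colon: bit position
        name=${entry#*:}            # text after the colon: error name
        if (( status & (1 << bit) )); then
            out+="${out:+, }${name}"
        fi
    done
    echo "${out:-no listed bits set}"
}

decode_uncorrectable_status 0x4000    # Completion Timeout
decode_uncorrectable_status 0x40000   # Malformed TLP
decode_uncorrectable_status 0x44000   # Completion Timeout, Malformed TLP
```

Read this way, 0x44000 is the completion timeout bit and the malformed TLP bit set together, which matches the description of updated Mellanox adapters above.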
Affected server platforms are listed in the Products section.
If the server is configured with an AMD EPYC 7xx3 processor, first follow the recommendation to update the System ROM to version 3.00 (or later) as described in this customer advisory.

For servers configured with AMD EPYC 7xx2 processors, or servers configured with AMD EPYC 7xx3 processors on which updating to System ROM 3.00 does not resolve the problem, the server may have sub-optimal configuration settings that are contributing to the failure. HPE has consulted with AMD to provide recommended settings for configuration options in System Utilities. Modify the configuration settings below as indicated. Not all settings are available on all servers. If a setting is unavailable, it can be ignored.

First, reboot the server and press F9 during POST to boot to the System Utilities menu. At the System Utilities menu, navigate to System Configuration > BIOS/Platform Configuration (RBSU). Navigation to each of the settings below begins from this menu.

1. Set the Workload Profile to "Custom". From the "BIOS/Platform Configuration (RBSU)" menu, select Workload Profile > Custom. This selection is required to make the configuration settings that follow available. Press F10 to save the setting.

2. Disable Infinity Fabric Power Management. From the "BIOS/Platform Configuration (RBSU)" menu, navigate to Power and Performance Options > Advanced Power Options > Infinity Fabric Power Management > Disable. Press F10 to save the setting.

3. Set the Infinity Fabric Performance State. From the "BIOS/Platform Configuration (RBSU)" menu, navigate to Power and Performance Options > Infinity Fabric Performance State > P0. Press F10 to save the setting.

4. Configure AMD NBIO LCLK DPM Level. From the "BIOS/Platform Configuration (RBSU)" menu, navigate to Power and Performance Options > I/O Options > NBIO LCLK DPM Level. There are seven NBIO LCLK options to configure. For each one, select Static High. Press F10 to save the setting.

5. Disable C-State Efficiency Mode. From the "BIOS/Platform Configuration (RBSU)" menu, navigate to Power and Performance Options > C-State Efficiency Mode > Disable. Press F10 to save the setting.

6. Disable Data Fabric C-States. From the "BIOS/Platform Configuration (RBSU)" menu, navigate to Power and Performance Options > Data Fabric C-State Enable > Disable. Press F10 to save the setting.

7. Disable Access Control Service. From the "BIOS/Platform Configuration (RBSU)" menu, navigate to Virtualization Options > Access Control Service > Disable. Press F10 to save the setting.

8. Disable Active State Power Management. From the "BIOS/Platform Configuration (RBSU)" menu, navigate to PCIe Device Configuration > PCIe Power Management (ASPM) > Disabled. Press F10 to save the setting.

9. Set the minimum C-state. From the "BIOS/Platform Configuration (RBSU)" menu, navigate to Power and Performance Options > Minimum Processor Idle Power Core C-State. If the "cpupower" package is installed in the operating system, select C6. Otherwise, select No C-States. Press F10 to save the setting.

In addition, configure the operating system to execute the following commands on boot.

Configure cpupower using the command below.

    cpupower idle-set -d 2

Disable Access Control Services (ACS) on all PCIe devices. An example command that can be executed on Linux platforms is provided below. The command may report that it cannot be executed for some PCIe devices. This is expected behavior.

    for i in $(lspci | cut -f 1 -d " "); do setpci -v -s $i ecap_acs+6.w=0; done

Note: These commands do not persist across reboots. Add them to a startup script so that they are executed again after each reboot.

Revision History

Document Version   Release Date       Details
2                  October 22, 2024   Updated the Resolution to correct the cpupower command and added a note.
1                  May 21, 2024       Original Document Release.
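The note in the Resolution asks for the two OS-level commands to be placed in a startup script. The sketch below is one possible shape for such a script, not an HPE-supplied one. It assumes a Linux system with bash, root privileges, and the cpupower, lspci and setpci utilities installed. The helper names run, list_pci_devices and apply_pcie_workarounds, and the DRY_RUN variable, are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of a boot-time script for the OS-level commands in this advisory.
# Assumptions: runs as root; cpupower, lspci and setpci are installed.
# With DRY_RUN=1 the commands are printed instead of executed.

run() {
    if [[ "${DRY_RUN:-0}" == "1" ]]; then
        echo "$*"
    else
        "$@"
    fi
}

list_pci_devices() {
    # First column of lspci output is the device address (bus:device.function).
    lspci | cut -f 1 -d " "
}

apply_pcie_workarounds() {
    # Disable CPU idle state 2, as given in the advisory.
    run cpupower idle-set -d 2

    # Write 0 to the ACS Control register of every device lspci reports.
    # setpci fails for devices without the ACS capability; the advisory
    # describes this as expected, so the failure is ignored here.
    local dev
    for dev in $(list_pci_devices); do
        run setpci -v -s "$dev" ecap_acs+6.w=0 || true
    done
}

# apply_pcie_workarounds   # uncomment to apply the settings when the script runs
```

Running the function once with DRY_RUN=1 prints the commands without changing anything, which is a low-risk way to check the device list first. How the script is launched at boot (for example, from a systemd unit or an rc.local file) depends on the distribution and is left open here.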