...
Document Version   Release Date         Details
4                  April 12, 2023       Updated the Resolution section with the new CPLD version 0x14, and clarified the process to upgrade the components.
3                  November 15, 2022    Updated the steps in the Resolution to update the CPLD and FPGA firmware.
2                  October 25, 2022     Updated the Resolution steps.
1                  September 26, 2022   Original Document Release.

IMPORTANT: The firmware updates mentioned in the Resolution section are required to prevent the issue detailed below. If this notification is disregarded and the recommended resolution is not performed, servers may become unable to power on.

After a cold or aux power-cycle, an HPE ProLiant XL675d server configured with modular NVIDIA HGX GPUs (40GB and/or 80GB, as shown in the Scope section below) may remain in a state where the system is unable to power on. When this occurs, the four system LEDs on the front of the server continually repeat a pattern of 13 blinks followed by a short pause.

In addition to the failure to power on and the LED blink pattern described above, the following message appears in the Integrated Lights-Out (iLO) Integrated Management Log (IML):

Server Critical Fault (Service Information: Power-On Fault, GPU Board/Modules, GPU Board VRD (02h))
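To check whether a server has already logged this fault, the IML can be searched for the message quoted above. The following is a minimal sketch, assuming the IML entries have already been exported as plain text lines. The export step is not shown, and the function name `find_gpu_vrd_faults` is hypothetical rather than part of any HPE tool.

```python
import re

# Matches the IML message quoted in this advisory. The pattern is
# case-insensitive and tolerates extra text between the key phrases.
FAULT_PATTERN = re.compile(
    r"Server Critical Fault.*Power-On Fault.*GPU Board VRD \(02h\)",
    re.IGNORECASE,
)


def find_gpu_vrd_faults(iml_lines):
    """Return the IML lines that contain the GPU Board VRD power-on fault.

    Assumption: iml_lines is an iterable of strings, one IML entry per line.
    """
    return [line for line in iml_lines if FAULT_PATTERN.search(line)]
```

An empty result only means the message was not found in the supplied lines. It does not show that the server is unaffected, so the firmware updates are still required.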
Any HPE ProLiant XL675d platform with either of the following NVIDIA modular A100 GPUs:

NVIDIA HGX A100 40GB 8GPU AIR FIO BS BRD (R3V64A) - Spares Kit SPS-PCA,A100 SXM4 x8 40GB Air-cooled w/HS - P22765-001
NVIDIA HGX A100 80GB 8GPU AIR FIO BS BRD (R7B94A) - Spares Kit SPS-PCA,A100 HGX x8 80GB Air-Cooled w/HS - P38948-001

Note: This does not affect ProLiant XL675d systems with traditional PCIe form-factor GPUs.
The issue has been fixed in NVIDIA GPU baseboard FPGA firmware version v3.14. The new FPGA firmware has a mandatory dependency on server CPLD version 0x14.

On ProLiant XL675d systems with NVIDIA modular HGX A100 40GB/80GB GPUs, update the recommended firmware as soon as possible. It is critically important to follow the sequence detailed below, because the images depend on each other and because auxiliary power is automatically cycled during the sequence.

1. Download these firmware images:
   a. Latest iLO firmware.
   b. FPGA firmware version v3.14.
   c. CPLD firmware version 0x14.
2. Leave the server booted and at the OS prompt. Do not shut down the server, because doing so risks triggering the issue before the new firmware can mitigate it.
3. With the server booted throughout, update these firmware images from the "Update FW" tab of the iLO web interface, in this exact sequence:
   a. Update the iLO firmware. The iLO will reset afterwards.
   b. Update the baseboard FPGA by following the instructions in the image, but do not shut down or reboot yet.
   c. Update the server CPLD by following the instructions in the image, but do not shut down or reboot yet.
4. With that set of firmware now flashed, proceed with a graceful shutdown of the server so that the new firmware can become effective. When the server reaches standby power, it will automatically cycle auxiliary power and the iLO connection will disconnect. After approximately 30 seconds, auxiliary power will automatically return, the iLO connection will be restored, and the server will boot normally.
5. Verify that server CPLD 0x14 and NVIDIA FPGA v3.14 are in place by selecting the "Firmware & OS Software" panel, then the "Firmware" tab in iLO.

If the issue described in the Description has already occurred, contact HPE Customer Support for your country or log a case via the HPE Support Center, and reference document: a00127196en_us.
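The final verification step can be sketched as a small version check. This is an illustrative sketch only: it assumes the CPLD and FPGA version strings have been read manually from the iLO "Firmware" tab and are formatted as in this advisory ("0x14" and "v3.14"). It also assumes that a newer version satisfies the requirement and that FPGA versions compare numerically field by field. Neither assumption is stated in the advisory. All function names are hypothetical.

```python
import re

# Versions named in this advisory.
REQUIRED_CPLD = 0x14
REQUIRED_FPGA = (3, 14)


def parse_cpld(version: str) -> int:
    """Parse a hexadecimal CPLD version such as '0x14' into an integer."""
    return int(version.strip(), 16)


def parse_fpga(version: str) -> tuple:
    """Parse an FPGA version such as 'v3.14' into a tuple like (3, 14)."""
    match = re.fullmatch(r"[vV]?(\d+(?:\.\d+)*)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized FPGA version: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def firmware_is_fixed(cpld_version: str, fpga_version: str) -> bool:
    """Return True if both versions meet the levels named in the advisory.

    Assumption: versions newer than the named ones are also acceptable.
    """
    return (
        parse_cpld(cpld_version) >= REQUIRED_CPLD
        and parse_fpga(fpga_version) >= REQUIRED_FPGA
    )
```

For example, `firmware_is_fixed("0x14", "v3.14")` returns `True`, while `firmware_is_fixed("0x13", "v3.14")` returns `False` because the CPLD is still below 0x14.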