
On an HPE Apollo Z70 chassis with an AR44z Gen10 or AR64z Gen10 that has an InfiniBand adapter installed and is running the Mellanox InfiniBand OFED driver for Red Hat Enterprise Linux or SUSE Linux Enterprise Server, the kernel crash dump (kdump) stops and displays the following message:

    out of memory

This occurs with different crash kernel sizes (for example, the default of 512 MB in Red Hat Enterprise Linux 7.6) and with the maximum crash kernel size (800 MB).

The crash dump generation stops because of a memory management issue with the InfiniBand-specific Mellanox OFED driver modules. The example below is for Red Hat Enterprise Linux 7.6 and Mellanox OFED driver version 4.6-1.0.1.0:

1. Console logs with steps to verify kdump:

    # cat /proc/cmdline
    BOOT_IMAGE=/vmlinuz-4.14.0-115.7.1.el7a.aarch64 root=/dev/mapper/rhel-root ro crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap LANG=en_US.UTF-8

    # cat /proc/iomem | grep -i crash
    a0000000-bfffffff : Crash kernel

    # dmesg | grep crash
    [ 0.000000] crashkernel reserved: 0x00000000a0000000 - 0x00000000c0000000 (512 MB)

Note: The crash kernel size was set to auto by default (which is 512 MB).

    # cat /etc/sysconfig/kdump
    KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 swiotlb=noforce cma=0 reset_devices cgroup_disable=memory udev.children-max=2 panic=10 rootflags=nofail"
    KDUMP_BOOTDIR="/boot"
    KDUMP_IMG="vmlinuz"

    # systemctl start kdump.service
    ● kdump.service - Crash recovery kernel arming
       Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
       Active: active (exited) since Mon 2019-07-08 12:46:29 EDT; 20min ago
      Process: 18644 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
     Main PID: 18644 (code=exited, status=0/SUCCESS)
       CGroup: /system.slice/kdump.service
    Jul 08 12:46:26 apache5 systemd[1]: Starting Crash recovery kernel arming...
    Jul 08 12:46:29 apache5 kdumpctl[18644]: kexec: loaded kdump kernel
    Jul 08 12:46:29 apache5 systemd[1]: Started Crash recovery kernel arming.
    Jul 08 12:46:29 apache5 kdumpctl[18644]: Starting kdump: [OK]

    # cat /sys/kernel/kexec_crash_loaded
    1

Note: The above command should output "1", which confirms that the crash kernel is loaded.

2. Crash dump collection steps followed:

    # echo 8 > /proc/sysrq-trigger    - Change the log level
    # echo s > /proc/sysrq-trigger    - Sync filesystems
    # echo u > /proc/sysrq-trigger    - Remount all mounted filesystems read-only
    # echo c > /proc/sysrq-trigger    - Perform a kexec reboot to take a crash dump

3. Out of memory error logs after the crash dump:

    [ 25.001062] Kernel panic - not syncing: Out of memory and no killable processes...
    [ 25.001062]
    [ 25.010095] CPU: 0 PID: 162 Comm: kworker/u2:3 Tainted: G OE --------- 4.14.0-115.7.1.el7a.aarch64 #1
    [ 25.020687] Hardware name: HPE Apollo70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018
    [ 25.030623] Workqueue: mlx5_page_allocator pages_work_handler [mlx5_core]
    [ 25.037398] Call trace:
    [ 25.039833] [<ffff000008089df4>] dump_backtrace+0x0/0x23c
    [ 25.045218] [<ffff00000808a054>] show_stack+0x24/0x2c
    [ 25.050257] [<ffff000008848b9c>] dump_stack+0x84/0xa8
    [ 25.055296] [<ffff0000080d4890>] panic+0x138/0x2a0
    [ 25.060074] [<ffff00000820f8fc>] out_of_memory+0x37c/0x484
    [ 25.065547] [<ffff0000082154a8>] _alloc_pages_nodemask+0xa78/0xec0
    [ 25.071920] [<ffff0000011bef40>] give_pages+0x2d8/0x8a8 [mlx5_core]
    [ 25.078291] [<ffff0000011bf918>] pages_work_handler+0x50/0xf0 [mlx5_core]
    [ 25.085066] [<ffff0000080f0df0>] process_one_work+0x168/0x3a4
    [ 25.090799] [<ffff0000080f1090>] worker_thread+0x64/0x46c
    [ 25.096184] [<ffff0000080f7ffc>] kthread+0x10c/0x138
    [ 25.101135] [<ffff000008084f34>] ret_from_fork+0x10/0x18
    [ 25.106437] Kernel Offset: disabled
    [ 25.109912] CPU features: 0x5000c38
    [ 25.113386] Memory Limit: none
    [ 25.116429] Rebooting in 10 seconds..
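The kdump verification steps in the description can be collected into a small script. The following is a minimal sketch and not part of the advisory: the function names (has_crashkernel_param, crash_kernel_loaded, kdump_ready) are hypothetical, and the file paths are passed as parameters so the checks can be tried against sample files as well as the live /proc and /sys entries.

```shell
#!/bin/sh
# Sketch only: the function names below are illustrative, not from the advisory.

# Succeeds if the kernel command line stored in file $1 has a crashkernel= option.
has_crashkernel_param() {
    grep -Eq '(^|[[:space:]])crashkernel=[^[:space:]]+' "$1"
}

# Succeeds if file $1 (normally /sys/kernel/kexec_crash_loaded) contains "1".
crash_kernel_loaded() {
    [ "$(cat "$1" 2>/dev/null)" = "1" ]
}

# Prints a one-line summary; returns non-zero if either check fails.
# $1 defaults to /proc/cmdline, $2 to /sys/kernel/kexec_crash_loaded.
kdump_ready() {
    cmdline=${1:-/proc/cmdline}
    loaded=${2:-/sys/kernel/kexec_crash_loaded}
    if ! has_crashkernel_param "$cmdline"; then
        echo "crashkernel= not set on the kernel command line"
        return 1
    fi
    if ! crash_kernel_loaded "$loaded"; then
        echo "crash kernel is not loaded"
        return 1
    fi
    echo "kdump is armed"
}
```

Calling kdump_ready with no arguments checks the running system. The sketch only confirms that a crash kernel is reserved and loaded; it does not detect the out of memory condition itself, which appears only after the crash kernel boots.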
Any HPE Apollo Z70 chassis with an AR44z Gen10 or AR64z Gen10 running the Mellanox InfiniBand OFED driver for Red Hat Enterprise Linux or SUSE Linux Enterprise Server and installed with the following adapter:

    HPE InfiniBand EDR/Ethernet 100Gb 1-port 841OCP QSFP28 Adapter (HPE Part Number: P02012-B21)
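One quick way to see whether a system is running the driver module named in the call trace is to look for mlx5_core in the kernel module list. This is a minimal sketch under assumptions: the helper name module_loaded is hypothetical, and the check only shows that the module is loaded, not which adapter model is installed.

```shell
#!/bin/sh
# Sketch only: module_loaded is an illustrative helper name.
# Succeeds if module $1 appears in the first column of file $2
# (default /proc/modules, which lists one loaded module per line).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

For example, `module_loaded mlx5_core && echo "mlx5_core is loaded"` reports whether the driver is active in the running kernel.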
To generate the crash dump, blacklist the Mellanox ConnectX-5 core driver "mlx5_core" from the crash kernel (secondary kernel) to avoid memory limitation issues during the crash dump, as follows:

Edit the kdump configuration file /etc/sysconfig/kdump and append "rd.driver.blacklist=mlx5_core" to "KDUMP_COMMANDLINE_APPEND".

Example:

    # vi /etc/sysconfig/kdump
    KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 swiotlb=noforce cma=0 reset_devices cgroup_disable=memory udev.children-max=2 panic=10 rootflags=nofail rd.driver.blacklist=mlx5_core"

Restart the kdump service and start a sysrq crash dump as follows:

    # systemctl restart kdump.service
    # echo 8 > /proc/sysrq-trigger    - Change the log level
    # echo s > /proc/sysrq-trigger    - Sync filesystems
    # echo u > /proc/sysrq-trigger    - Remount all mounted filesystems read-only
    # echo c > /proc/sysrq-trigger    - Perform a kexec reboot to take a crash dump

Note: The workaround blacklists the Mellanox ConnectX-5 core driver in the crash kernel only and does not affect any other functionality of the InfiniBand driver or related applications running on the boot kernel. Both Mellanox Ethernet and InfiniBand driver debug data will still be available in the crash dump, because the module blacklisting applies to the crash kernel only.
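The manual edit of the kdump configuration described in the workaround could also be scripted. The following is a minimal sketch, not an HPE-provided tool: the function name add_kdump_option is hypothetical, and it assumes that KDUMP_COMMANDLINE_APPEND is a single double-quoted assignment on one line (as in the example from the advisory) and that the option contains no "|" or "&" characters.

```shell
#!/bin/sh
# Sketch only: add_kdump_option is an illustrative helper name.
# Appends option $2 to KDUMP_COMMANDLINE_APPEND in file $1 unless it is
# already present. A backup copy is kept as "$1.bak".
add_kdump_option() {
    conf=$1
    opt=$2
    # Nothing to do if the option is already on the assignment line.
    if grep -E '^KDUMP_COMMANDLINE_APPEND=' "$conf" | grep -Fq -- "$opt"; then
        return 0
    fi
    cp "$conf" "$conf.bak" || return 1
    # Insert the option just before the closing double quote; writing back
    # into the original file keeps its ownership and permissions.
    sed "s|^\(KDUMP_COMMANDLINE_APPEND=\".*\)\"[[:space:]]*\$|\1 $opt\"|" \
        "$conf.bak" > "$conf"
}
```

On an affected system this would be called as `add_kdump_option /etc/sysconfig/kdump rd.driver.blacklist=mlx5_core`, followed by the `systemctl restart kdump.service` step from the workaround. Running it a second time leaves the file unchanged.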
Operating Systems Affected: Red Hat Enterprise Linux 7 Server for ARM, Red Hat Enterprise Linux 8 Server, SUSE Linux Enterprise Server 12, SUSE Linux Enterprise Server 15