...
On an HPE Apollo Z70 chassis with an AR44z Gen10 or AR64z Gen10 that has an InfiniBand adapter installed and is running the Mellanox InfiniBand OFED driver for Red Hat Enterprise Linux or SUSE Linux Enterprise Server, kernel crash dump (kdump) generation stops and displays the following message:

    out of memory

This occurs with different crash kernel sizes, including the default (for example, 512 MB in Red Hat Enterprise Linux 7.6) and the maximum crash kernel size (800 MB). Crash dump generation stops because of a memory management issue in the InfiniBand-specific Mellanox OFED driver modules.

The example below is for Red Hat Enterprise Linux 7.6 and Mellanox OFED driver version 4.6-1.0.1.0:

1. Console logs with steps to verify kdump:

    # cat /proc/cmdline
    BOOT_IMAGE=/vmlinuz-4.14.0-115.7.1.el7a.aarch64 root=/dev/mapper/rhel-root ro crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap LANG=en_US.UTF-8

    # cat /proc/iomem | grep -i crash
    a0000000-bfffffff : Crash kernel

    # dmesg | grep crash
    [ 0.000000] crashkernel reserved: 0x00000000a0000000 - 0x00000000c0000000 (512 MB)

Note: The crash kernel size was set to auto by default, which reserves 512 MB on this system.

    # cat /etc/sysconfig/kdump
    KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 swiotlb=noforce cma=0 reset_devices cgroup_disable=memory udev.children-max=2 panic=10 rootflags=nofail"
    KDUMP_BOOTDIR="/boot"
    KDUMP_IMG="vmlinuz"

    # systemctl start kdump.service
    # systemctl status kdump.service
    ● kdump.service - Crash recovery kernel arming
       Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
       Active: active (exited) since Mon 2019-07-08 12:46:29 EDT; 20min ago
      Process: 18644 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
     Main PID: 18644 (code=exited, status=0/SUCCESS)
       CGroup: /system.slice/kdump.service

    Jul 08 12:46:26 apache5 systemd[1]: Starting Crash recovery kernel arming...
    Jul 08 12:46:29 apache5 kdumpctl[18644]: kexec: loaded kdump kernel
    Jul 08 12:46:29 apache5 systemd[1]: Started Crash recovery kernel arming.
    Jul 08 12:46:29 apache5 kdumpctl[18644]: Starting kdump: [OK]

    # cat /sys/kernel/kexec_crash_loaded
    1

Note: The command above outputs "1" when kexec has loaded the crash kernel.

2. Crash dump collection steps followed:

    # echo 8 > /proc/sysrq-trigger    - Change the console log level
    # echo s > /proc/sysrq-trigger    - Sync filesystems
    # echo u > /proc/sysrq-trigger    - Remount all mounted filesystems read-only
    # echo c > /proc/sysrq-trigger    - Perform a kexec reboot to take a crashdump

3. Out of memory error logs after the crash dump is triggered:

    [ 25.001062] Kernel panic - not syncing: Out of memory and no killable processes...
    [ 25.001062]
    [ 25.010095] CPU: 0 PID: 162 Comm: kworker/u2:3 Tainted: G OE --------- 4.14.0-115.7.1.el7a.aarch64 #1
    [ 25.020687] Hardware name: HPE Apollo70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018
    [ 25.030623] Workqueue: mlx5_page_allocator pages_work_handler [mlx5_core]
    [ 25.037398] Call trace:
    [ 25.039833] [<ffff000008089df4>] dump_backtrace+0x0/0x23c
    [ 25.045218] [<ffff00000808a054>] show_stack+0x24/0x2c
    [ 25.050257] [<ffff000008848b9c>] dump_stack+0x84/0xa8
    [ 25.055296] [<ffff0000080d4890>] panic+0x138/0x2a0
    [ 25.060074] [<ffff00000820f8fc>] out_of_memory+0x37c/0x484
    [ 25.065547] [<ffff0000082154a8>] _alloc_pages_nodemask+0xa78/0xec0
    [ 25.071920] [<ffff0000011bef40>] give_pages+0x2d8/0x8a8 [mlx5_core]
    [ 25.078291] [<ffff0000011bf918>] pages_work_handler+0x50/0xf0 [mlx5_core]
    [ 25.085066] [<ffff0000080f0df0>] process_one_work+0x168/0x3a4
    [ 25.090799] [<ffff0000080f1090>] worker_thread+0x64/0x46c
    [ 25.096184] [<ffff0000080f7ffc>] kthread+0x10c/0x138
    [ 25.101135] [<ffff000008084f34>] ret_from_fork+0x10/0x18
    [ 25.106437] Kernel Offset: disabled
    [ 25.109912] CPU features: 0x5000c38
    [ 25.113386] Memory Limit: none
    [ 25.116429] Rebooting in 10 seconds..
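The verification in step 1 can be sketched as a few shell helpers, run before triggering a test crash. This is a minimal sketch, not part of the advisory: the function names (crash_region_mb, has_crashkernel_arg, crash_kernel_loaded) are hypothetical, and each takes the file to inspect as an optional argument so that it can be tried against a saved copy of the file. Reading real addresses from /proc/iomem requires root.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not from the advisory.

# Print the size in MB of the first "Crash kernel" range in an iomem-style file.
# Example line: "a0000000-bfffffff : Crash kernel"
crash_region_mb() {
    local iomem_file=${1:-/proc/iomem}
    local range start end
    range=$(grep -i 'crash kernel' "$iomem_file" | head -n 1 | awk '{print $1}')
    [ -n "$range" ] || return 1
    start=${range%-*}   # hex start address, text before the dash
    end=${range#*-}     # hex end address (inclusive), text after the dash
    echo $(( (0x$end - 0x$start + 1) / 1048576 ))
}

# Succeed if the kernel command line in the given file has a crashkernel= option.
has_crashkernel_arg() {
    grep -Eq '(^| )crashkernel=' "${1:-/proc/cmdline}"
}

# Succeed if the given kexec_crash_loaded-style file contains "1".
crash_kernel_loaded() {
    [ "$(cat "${1:-/sys/kernel/kexec_crash_loaded}" 2>/dev/null)" = "1" ]
}
```

Called without arguments on the system from the example, crash_region_mb would be expected to report the 512 MB reservation shown in the dmesg output, since the range a0000000-bfffffff spans 0x20000000 bytes.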
Any HPE Apollo Z70 chassis with an AR44z Gen10 or AR64z Gen10 running the Mellanox InfiniBand OFED driver for Red Hat Enterprise Linux or SUSE Linux Enterprise Server, with the following adapter installed:

    HPE InfiniBand EDR/Ethernet 100Gb 1-port 841OCP QSFP28 Adapter (HPE Part Number: P02012-B21)
To generate the crash dump, blacklist the Mellanox ConnectX-5 core driver "mlx5_core" in the crash kernel (secondary kernel) so that it does not run into the memory limit during the crash dump. Perform the following steps:

1. Edit the kdump configuration file /etc/sysconfig/kdump and append "rd.driver.blacklist=mlx5_core" to "KDUMP_COMMANDLINE_APPEND". Example:

    # vi /etc/sysconfig/kdump
    KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 swiotlb=noforce cma=0 reset_devices cgroup_disable=memory udev.children-max=2 panic=10 rootflags=nofail rd.driver.blacklist=mlx5_core"

2. Restart the kdump service and trigger a sysrq crash dump as follows:

    # systemctl restart kdump.service
    # echo 8 > /proc/sysrq-trigger    - Change the console log level
    # echo s > /proc/sysrq-trigger    - Sync filesystems
    # echo u > /proc/sysrq-trigger    - Remount all mounted filesystems read-only
    # echo c > /proc/sysrq-trigger    - Perform a kexec reboot to take a crashdump

Note: The workaround blacklists the Mellanox ConnectX-5 core driver in the crash kernel only. It does not affect the InfiniBand driver or related applications running on the boot kernel. Because the blacklisting applies to the crash kernel only, debug data for both the Mellanox Ethernet and InfiniBand drivers is still available in the crash dump.
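The configuration edit described above can also be scripted. The following is a minimal sketch, assuming GNU sed and a KDUMP_COMMANDLINE_APPEND line whose value is enclosed in double quotes, as in the example; add_kdump_blacklist is a hypothetical helper name, not a tool shipped with kdump or with the Mellanox OFED driver. It appends the parameter only once and keeps a backup copy of the file.

```shell
#!/bin/bash
# Sketch only: add_kdump_blacklist is a hypothetical helper name.
# Appends rd.driver.blacklist=mlx5_core to KDUMP_COMMANDLINE_APPEND once.
add_kdump_blacklist() {
    local conf=${1:-/etc/sysconfig/kdump}
    local param='rd.driver.blacklist=mlx5_core'

    # Nothing to do if the parameter is already on the line.
    if grep -q "^KDUMP_COMMANDLINE_APPEND=.*${param}" "$conf"; then
        return 0
    fi

    # Fail rather than guess if the variable is missing or not double-quoted.
    grep -q '^KDUMP_COMMANDLINE_APPEND="' "$conf" || return 1

    # Keep a backup of the original file next to it.
    cp -p "$conf" "${conf}.bak"

    # Insert the parameter just before the closing double quote.
    sed -i "s/^\(KDUMP_COMMANDLINE_APPEND=\".*\)\"[[:space:]]*$/\1 ${param}\"/" "$conf"
}
```

After the helper has been run as root, restart the kdump service with "systemctl restart kdump.service" so that the crash kernel is reloaded with the new command line.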