...
Description of problem:

After the "openmpi ucx osu_bw" test, the RDMA server host was left with a SIG 6 core file when the test was run on MLX5 ROCE with bonding/teaming. This took place on the RDMA lab machine pair rdma-dev-19/20, with rdma-dev-19 as the server host in bonding and rdma-dev-20 as the client host in teaming.

On rdma-dev-19 (server):

TIME PID UID GID SIG COREFILE EXE
Mon 2022-11-28 11:39:01 EST 79390 0 0 6 present /usr/lib64/openmpi/bin/mpitests-osu_bw

total 2452
rw-r----. 1 root root 2504822 Nov 28 11:39 core.mpitests-osu_bw.0.02d616224e974648ae9e3d757a08ba58.79390.1669653541000000.lz4

Red Hat Enterprise Linux release 8.8 Beta (Ootpa)

This seems to be a regression, as the same test on RHEL8.7.0 did not produce a core file on the server side.

Version-Release number of selected component (if applicable):

Clients: rdma-dev-20
Servers: rdma-dev-19

DISTRO=RHEL-8.8.0-20221120.2

+ [22-11-28 11:37:35] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.8 Beta (Ootpa)

+ [22-11-28 11:37:35] uname -a
Linux rdma-dev-19.rdma.lab.eng.rdu2.redhat.com 4.18.0-438.el8.x86_64 #1 SMP Mon Nov 14 13:08:07 EST 2022 x86_64 x86_64 x86_64 GNU/Linux

+ [22-11-28 11:37:35] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-438.el8.x86_64 root=UUID=67eba586-c572-49ad-8973-e9030c9f66e6 ro console=tty0 rd_NO_PLYMOUTH intel_idle.max_cstate=0 intel_iommu=on iommu=on processor.max_cstate=0 crashkernel=auto resume=UUID=a124f939-9473-482f-bc5f-f093bc222674 console=ttyS1,115200

+ [22-11-28 11:37:35] rpm -q rdma-core linux-firmware
rdma-core-41.0-1.el8.x86_64
linux-firmware-20220726-110.git150864a4.el8.noarch

+ [22-11-28 11:37:35] tail /sys/class/infiniband/mlx5_2/fw_ver /sys/class/infiniband/mlx5_3/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_3/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.31.1014

+ [22-11-28 11:37:35] lspci
+ [22-11-28 11:37:35] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
04:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

Installed:
ucx-cma-1.13.0-1.el8.x86_64
ucx-ib-1.13.0-1.el8.x86_64
ucx-rdmacm-1.13.0-1.el8.x86_64

How reproducible:

100%

Steps to Reproduce:

1. Install RHEL-8.8.0-20221120.2 on rdma-dev-19/20
2. Install & execute kernel-kernel-infiniband-ucx test script
3. Watch ucx result on client side

Actual results:

On rdma-dev-19 (server host), the core file mentioned above is found.

Expected results:

No core files should be produced after the "openmpi ucx osu_bw" test.

Additional info:
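
The expected result (no core file on the server after the test) could be checked automatically. The following is only a rough sketch, not part of the test script: it assumes the core listing has the same column layout as the one quoted in the description (TIME PID UID GID SIG COREFILE EXE), and the names find_cores, ROW and SAMPLE are hypothetical helpers introduced here for illustration.

```python
import re

# Sample row copied from the listing quoted in this report.
SAMPLE = (
    "Mon 2022-11-28 11:39:01 EST 79390 0 0 6 present "
    "/usr/lib64/openmpi/bin/mpitests-osu_bw"
)

# Assumed layout: a four-token timestamp, then PID UID GID SIG COREFILE EXE.
# Other tool versions may print additional columns and would need a new pattern.
ROW = re.compile(
    r"^(?P<time>\S+ \S+ \S+ \S+)\s+"
    r"(?P<pid>\d+)\s+(?P<uid>\d+)\s+(?P<gid>\d+)\s+"
    r"(?P<sig>\d+)\s+(?P<corefile>\S+)\s+(?P<exe>\S+)\s*$"
)


def find_cores(listing, exe_suffix="mpitests-osu_bw", sig=6):
    """Return the rows whose executable ends with exe_suffix and whose signal is sig."""
    hits = []
    for line in listing.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue  # header line or unrelated output
        if m["exe"].endswith(exe_suffix) and int(m["sig"]) == sig:
            hits.append(
                {
                    "pid": int(m["pid"]),
                    "sig": int(m["sig"]),
                    "corefile": m["corefile"],
                    "exe": m["exe"],
                }
            )
    return hits


if __name__ == "__main__":
    cores = find_cores("TIME PID UID GID SIG COREFILE EXE\n" + SAMPLE)
    # A non-empty list means the server-side core from this report is present.
    print(len(cores), "matching core(s) found")
```

Feeding the server-side listing to find_cores after the test and requiring an empty result would flag this regression; the sample row above is reported as one match.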
Won't Do