...
Description of problem:
On RHEL-8.7.0, the 'openmpi ucx osu_bw' case of our ucx test failed on hosts using the bonded RoCE device mlx5_bond_0 (Mellanox ConnectX-4 Lx, MT27710), as shown in the Actual results section. The failure occurred while running the test over the RoCE fabric.

Version-Release number of selected component (if applicable):
DISTRO=RHEL-8.7.0-20220817.0
Red Hat Enterprise Linux release 8.7 Beta (Ootpa)
4.18.0-418.el8.x86_64
rdma-core-41.0-1.el8.x86_64
linux-firmware-20220726-110.git150864a4.el8.noarch

+ [22-08-18 10:00:53] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
12.28.2006

==> /sys/class/infiniband/mlx5_1/fw_ver <==
12.28.2006

==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.32.1010
+ [22-08-18 10:00:53] lspci
+ [22-08-18 10:00:53] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

How reproducible:
Seen only once so far.

Steps to Reproduce:
1. Install RHEL-8.7.0-20220817.0 on rdma-virt-02/03
2. Install & execute the kernel-kernel-infiniband-ucx test script
3. Watch the ucx result on the client side (a minimal standalone reproduction is sketched below)
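For manual retries outside the harness, the failing invocation distills to the following sketch, taken from the mpirun command recorded in the Actual results trace below; the hostfile path and the bonded device name are the ones used on rdma-virt-02/03 and will differ on other setups:

  # Distilled from the harness trace below; assumes a two-entry hostfile and
  # the bonded RoCE device mlx5_bond_0 present on these hosts.
  mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node \
         -mca btl '^vader,tcp,openib' -mca pml ucx -mca osc ucx \
         -x UCX_NET_DEVICES=mlx5_bond_0:1 \
         mpitests-osu_bw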
Actual results:
+ [22-08-18 10:06:14] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl '^vader,tcp,openib' -mca btl_openib_cpc_include rdmacm -mca btl_openib_receive_queues P,65536,256,192,128 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx5_bond_0:1 mpitests-osu_bw
OSU MPI Bandwidth Test v5.8
Size                  Bandwidth (MB/s)
[rdma-virt-02:219082:0:219082] ib_mlx5_log.c:177  Transport retry count exceeded on mlx5_bond_0:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
[rdma-virt-02:219082:0:219082] ib_mlx5_log.c:177  RC QP 0x1379 wqe[0]: SEND --e [inl len 10] [rqpn 0x1379 dlid=0 sl=0 port=1 src_path_bits=0 dgid=::ffff:172.31.40.203 sgid_index=7 traffic_class=0]
==== backtrace (tid: 219082) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x15108a68cedc]
 1 /lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x15108a689d41]
 2 /lib64/libucs.so.0(ucs_log_default_handler+0xde4) [0x15108a68e6a4]
 3 /lib64/libucs.so.0(ucs_log_dispatch+0xe4) [0x15108a68e9c4]
 4 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x27a) [0x15108a40259a]
 5 /lib64/ucx/libuct_ib.so.0(+0x3c480) [0x15108a419480]
 6 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x4d) [0x15108a40403d]
 7 /lib64/ucx/libuct_ib.so.0(+0x3a48a) [0x15108a41748a]
 8 /lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x15108ad53ada]
 9 /usr/lib64/openmpi/lib/libopen-pal.so.40(opal_progress+0x34) [0x1510a07f2f94]
10 /usr/lib64/openmpi/lib/libmpi.so.40(ompi_request_default_wait+0x12d) [0x1510a1e9659d]
11 /usr/lib64/openmpi/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0x103) [0x1510a1f02643]
12 /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Barrier+0xb0) [0x1510a1eadb70]
13 mpitests-osu_bw(+0x1fd0) [0x55d7f1079fd0]
14 /lib64/libc.so.6(__libc_start_main+0xe5) [0x1510a0f6ad85]
15 mpitests-osu_bw(+0x25de) [0x55d7f107a5de]
=================================
[rdma-virt-02:219082] *** Process received signal ***
[rdma-virt-02:219082] Signal: Aborted (6)
[rdma-virt-02:219082] Signal code: (-6)
[rdma-virt-02:219082] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x1510a1308cf0]
[rdma-virt-02:219082] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x1510a0f7eaff]
[rdma-virt-02:219082] [ 2] /lib64/libc.so.6(abort+0x127)[0x1510a0f51ea5]
[rdma-virt-02:219082] [ 3] /lib64/libucs.so.0(+0x27d46)[0x15108a689d46]
[rdma-virt-02:219082] [ 4] /lib64/libucs.so.0(ucs_log_default_handler+0xde4)[0x15108a68e6a4]
[rdma-virt-02:219082] [ 5] /lib64/libucs.so.0(ucs_log_dispatch+0xe4)[0x15108a68e9c4]
[rdma-virt-02:219082] [ 6] /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x27a)[0x15108a40259a]
[rdma-virt-02:219082] [ 7] /lib64/ucx/libuct_ib.so.0(+0x3c480)[0x15108a419480]
[rdma-virt-02:219082] [ 8] /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x4d)[0x15108a40403d]
[rdma-virt-02:219082] [ 9] /lib64/ucx/libuct_ib.so.0(+0x3a48a)[0x15108a41748a]
[rdma-virt-02:219082] [10] /lib64/libucp.so.0(ucp_worker_progress+0x2a)[0x15108ad53ada]
[rdma-virt-02:219082] [11] /usr/lib64/openmpi/lib/libopen-pal.so.40(opal_progress+0x34)[0x1510a07f2f94]
[rdma-virt-02:219082] [12] /usr/lib64/openmpi/lib/libmpi.so.40(ompi_request_default_wait+0x12d)[0x1510a1e9659d]
[rdma-virt-02:219082] [13] /usr/lib64/openmpi/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0x103)[0x1510a1f02643]
[rdma-virt-02:219082] [14] /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Barrier+0xb0)[0x1510a1eadb70]
[rdma-virt-02:219082] [15] mpitests-osu_bw(+0x1fd0)[0x55d7f1079fd0]
[rdma-virt-02:219082] [16] /lib64/libc.so.6(__libc_start_main+0xe5)[0x1510a0f6ad85]
[rdma-virt-02:219082] [17] mpitests-osu_bw(+0x25de)[0x55d7f107a5de]
[rdma-virt-02:219082] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 219082 on node 172.31.45.202
exited on signal 6 (Aborted).
--------------------------------------------------------------------------
+ [22-08-18 10:06:32] RQA_check_result -r 134 -t 'openmpi ucx osu_bw'
+ [22-08-18 10:06:32] local test_pass=0
+ [22-08-18 10:06:32] local test_skip=777
+ [22-08-18 10:06:32] test 4 -gt 0
+ [22-08-18 10:06:32] case $1 in
+ [22-08-18 10:06:32] local rc=134
+ [22-08-18 10:06:32] shift
+ [22-08-18 10:06:32] shift
+ [22-08-18 10:06:32] test 2 -gt 0
+ [22-08-18 10:06:32] case $1 in
+ [22-08-18 10:06:32] local 'msg=openmpi ucx osu_bw'
+ [22-08-18 10:06:32] shift
+ [22-08-18 10:06:32] shift
+ [22-08-18 10:06:32] test 0 -gt 0
+ [22-08-18 10:06:32] '[' -z 134 -o -z 'openmpi ucx osu_bw' ']'
+ [22-08-18 10:06:32] '[' -z /tmp/tmp.LwXAyOokgN/results_ucx-ucx-.txt ']'
+ [22-08-18 10:06:32] '[' -z /tmp/tmp.LwXAyOokgN/results_ucx-ucx-.txt ']'
+ [22-08-18 10:06:32] '[' 134 -eq 0 ']'
+ [22-08-18 10:06:32] '[' 134 -eq 777 ']'
+ [22-08-18 10:06:32] local test_result=FAIL
+ [22-08-18 10:06:32] export result=FAIL
+ [22-08-18 10:06:32] result=FAIL
+ [22-08-18 10:06:32] [[ ! -z '' ]]
+ [22-08-18 10:06:32] printf '%10s | %6s | %s\n' FAIL 134 'openmpi ucx osu_bw'
+ [22-08-18 10:06:32] set +x

----------- TEST RESULT FOR ucx -----------
Test:   openmpi ucx osu_bw
Result: FAIL
Return: 134
-------------------------------------------

Expected results:
Test to complete successfully.

Additional info:
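"Transport retry count exceeded" (CQE syndrome 0x15) means the requester side of the RC QP exhausted its transport retries without a response from the peer, i.e. traffic stopped flowing on the RoCE path mid-test rather than UCX setting up the QP incorrectly. For the next occurrence, a hedged triage sketch; rping (librdmacm-utils) and ib_send_bw (perftest) are standard tools, the peer address is taken from the dgid in the failing run, and the UCX values shown are illustrative assumptions, not a known fix:

  # 1) Confirm the raw RoCE path over the bond, independent of UCX/MPI.
  #    Peer address per the failing run (dgid ::ffff:172.31.40.203):
  rping -s -a 172.31.40.203 -C 10 -v             # on the server node
  rping -c -a 172.31.40.203 -C 10 -v             # on the client node
  ib_send_bw -d mlx5_bond_0 -i 1                 # server side
  ib_send_bw -d mlx5_bond_0 -i 1 172.31.40.203   # client side
  # 2) Rerun the failing case with UCX debug logging and relaxed RC retry
  #    settings (values here are illustrative, not a recommendation):
  mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node \
         -mca pml ucx -x UCX_NET_DEVICES=mlx5_bond_0:1 \
         -x UCX_LOG_LEVEL=debug -x UCX_RC_TIMEOUT=3s -x UCX_RC_RETRY_COUNT=7 \
         mpitests-osu_bw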
Resolution: Won't Do