...
Description of problem:

The two openmpi benchmarks listed below timed out and showed much higher latency than the RHEL8.4 results when run on a QEDR ROCE device.

mpitests-osu_get_acc_latency mpirun
mpitests-osu_acc_latency mpirun

The RHEL8.4 results for these benchmarks are from a QEDR IW device.

Version-Release number of selected component (if applicable):

Clients: rdma-dev-03
Servers: rdma-dev-02

DISTRO=RHEL-9.1.0-20220524.0

+ [22-05-31 11:31:05] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.1 Beta (Plow)

+ [22-05-31 11:31:05] uname -a
Linux rdma-dev-03.rdma.lab.eng.rdu2.redhat.com 5.14.0-96.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Thu May 19 07:21:30 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

+ [22-05-31 11:31:05] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-96.el9.x86_64 root=UUID=14045f8c-f33e-4a6d-b28e-4627b0d63394 ro console=tty0 rd_NO_PLYMOUTH intel_iommu=on iommu=on crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=1eb623ff-0653-473b-9cd5-656ed5ebb410 console=ttyS1,115200

+ [22-05-31 11:31:05] rpm -q rdma-core linux-firmware
rdma-core-37.2-1.el9.x86_64
linux-firmware-20220509-126.el9.noarch

+ [22-05-31 11:31:05] tail /sys/class/infiniband/qedr0/fw_ver /sys/class/infiniband/qedr1/fw_ver
==> /sys/class/infiniband/qedr0/fw_ver <==
8.59.1.0

==> /sys/class/infiniband/qedr1/fw_ver <==
8.59.1.0

+ [22-05-31 11:31:05] lspci
+ [22-05-31 11:31:05] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
08:00.0 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)
08:00.1 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)

Installed:
mpitests-openmpi-5.8-1.el9.x86_64
openmpi-1:4.1.1-5.el9.x86_64
openmpi-devel-1:4.1.1-5.el9.x86_64

How reproducible:

100%

Steps to Reproduce:

1. With the above build & packages, boot the RDMA server and client hosts with the QEDR iWARP device
2. Issue the following two benchmark commands on the client

timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr0:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_roce.45 --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_acc_latency

timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr0:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_roce.45 --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_get_acc_latency

Actual results:

1) mpitests-osu_acc_latency

[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59126] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59126] pml_ucx.c:289 mca_pml_ucx_init
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59126] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59126] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59126] pml_ucx.c:351 created ucp context 0x5563c8af6400, worker 0x5563c8b3aa50
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1: PSM3 can't open nic unit: 0 (err=23)
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1.mpitests-osu_acc_latency: Unable to initialize verbs
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1: PSM3 can't open nic unit: 0 (err=23)
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1.mpitests-osu_acc_latency: Unable to initialize verbs
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1: PSM3 can't open nic unit: 0 (err=23)
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1.mpitests-osu_acc_latency: Unable to initialize verbs
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56368] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56368] pml_ucx.c:289 mca_pml_ucx_init
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56368] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56368] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56368] pml_ucx.c:351 created ucp context 0x557d49675250, worker 0x557d496b9900
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59126] pml_ucx.c:182 Got proc 0 address, size 141
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59126] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56368] pml_ucx.c:182 Got proc 1 address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56368] pml_ucx.c:411 connecting to proc. 1
OSU MPI_Accumulate latency Test v5.8
Window creation: MPI_Win_allocate
Synchronization: MPI_Win_flush
Size      Latency (us)
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56368] pml_ucx.c:182 Got proc 0 address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56368] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59126] pml_ucx.c:182 Got proc 1 address, size 38
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59126] pml_ucx.c:411 connecting to proc. 1
1         3830.00
2         3830.58
4         3830.64
8         3830.10
mpirun: Forwarding signal 18 to job
+ [22-05-31 12:11:43] __MPI_check_result 1 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_acc_latency mpirun /root/hfile_one_core

2) mpitests-osu_get_acc_latency

[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1.mpitests-osu_get_acc_latency: Unable to create UD QP on qedr0
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1.mpitests-osu_get_acc_latency: Unable to initialize verbs
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1: PSM3 can't open nic unit: 0 (err=23)
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1.mpitests-osu_get_acc_latency: Unable to create UD QP on qedr0
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1.mpitests-osu_get_acc_latency: Unable to initialize verbs
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1: PSM3 can't open nic unit: 0 (err=23)
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1.mpitests-osu_get_acc_latency: Unable to create UD QP on qedr0
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1.mpitests-osu_get_acc_latency: Unable to initialize verbs
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank1: PSM3 can't open nic unit: 0 (err=23)
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56928] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56928] pml_ucx.c:289 mca_pml_ucx_init
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56928] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56928] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56928] pml_ucx.c:351 created ucp context 0x55e8dfcde250, worker 0x55e8dfd22900
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_get_acc_latency: Unable to create UD QP on qedr0
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_get_acc_latency: Unable to initialize verbs
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_get_acc_latency: Unable to create UD QP on qedr0
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_get_acc_latency: Unable to initialize verbs
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_get_acc_latency: Unable to create UD QP on qedr0
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_get_acc_latency: Unable to initialize verbs
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59839] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59839] pml_ucx.c:289 mca_pml_ucx_init
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59839] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59839] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59839] pml_ucx.c:351 created ucp context 0x55eee6a4b0b0, worker 0x55eee6a9e130
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59839] pml_ucx.c:182 Got proc 0 address, size 141
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59839] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56928] pml_ucx.c:182 Got proc 1 address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56928] pml_ucx.c:411 connecting to proc. 1
OSU MPI_Get_accumulate latency Test v5.8
Window creation: MPI_Win_create
Synchronization: MPI_Win_lock/unlock
Size      Latency (us)
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56928] pml_ucx.c:182 Got proc 0 address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:56928] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59839] pml_ucx.c:182 Got proc 1 address, size 38
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:59839] pml_ucx.c:411 connecting to proc. 1
1         4197.50
2         4158.80
4         3511.71
8         4604.89
+ [22-05-31 12:18:17] __MPI_check_result 1 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_get_acc_latency mpirun /root/hfile_one_core

Expected results:

Based on RHEL8.4 results:

OSU MPI_Accumulate latency Test v5.7
Window creation: MPI_Win_allocate
Synchronization: MPI_Win_flush
Size      Latency (us)
1         113.62
2         112.55
4         112.75
8         112.62
16        112.28
32        112.61
64        113.36
128       114.21
256       114.96
512       115.08
1024      115.97
2048      120.05
4096      127.22
8192      199.89
16384     245.07
32768     296.35
65536     333.01
131072    451.64
262144    655.62
524288    1134.11
1048576   1999.18
2097152   3613.38
4194304   6917.08
+ [21-06-22 12:13:57] __MPI_check_result 0 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_acc_latency mpirun /root/hfile_one_core

OSU MPI_Get_accumulate latency Test v5.7
Window creation: MPI_Win_create
Synchronization: MPI_Win_lock/unlock
Size      Latency (us)
1         209.78
2         202.78
4         196.63
8         196.05
16        239.80
32        196.06
64        232.56
128       196.13
256       218.99
512       220.77
1024      196.88
2048      208.63
4096      230.35
8192      294.08
16384     392.08
32768     396.60
65536     457.26
131072    564.21
262144    814.36
524288    1373.00
1048576   2415.52
2097152   4394.46
4194304   8461.62
+ [21-06-22 12:15:21] __MPI_check_result 0 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_get_acc_latency mpirun /root/hfile_one_core

Additional info:

This seems to be the same issue as bz 2092512.
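
For triage, the comparison between the actual and expected tables can be scripted. The following is only a sketch, not part of the test suite: the function names osu_rows, osu_slowdown and count_qp_errors are hypothetical, and it assumes the mpirun output was saved to a file with one log entry per line, as mpirun originally printed it.

```shell
# Hypothetical triage helpers; the names are illustrative, not from mpitests.

# Read an mpirun log on stdin and print only the OSU result rows
# ("size latency"). A row is assumed to be exactly two fields:
# an integer size and a decimal latency.
osu_rows() {
    awk 'NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+\.[0-9]+$/ { print $1, $2 }'
}

# Compare a baseline file ($1) with a current file ($2), both in
# "size latency" form, and print "size factor" for every size present
# in both, where factor = current latency / baseline latency.
osu_slowdown() {
    awk 'NR == FNR { base[$1] = $2; next }
         ($1 in base) && base[$1] > 0 { printf "%s %.1f\n", $1, $2 / base[$1] }' "$1" "$2"
}

# Read a log on stdin and count create_qp failures per errno value,
# printing "errno count" sorted by errno.
count_qp_errors() {
    grep -o 'ibv_cmd_create_qp with [0-9]*' |
        awk '{ n[$NF]++ } END { for (e in n) print e, n[e] }' |
        sort -n
}
```

Feeding the osu_acc_latency rows from this report through osu_slowdown gives a factor of roughly 34 for message sizes 1 through 8 (for example 3830.00 us against 113.62 us at size 1). On Linux, errno 22 is EINVAL and errno 95 is EOPNOTSUPP, which is how the two values reported by create_qp can be read.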
Won't Do