...
Description of problem:

The OSU acc_latency benchmark fails with the following error message:

rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)

Version-Release number of selected component (if applicable):

Clients: rdma-dev-02
Servers: rdma-perf-06

DISTRO=RHEL-8.7.0-20220524.0

+ [22-05-26 02:08:38] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.7 Beta (Ootpa)

+ [22-05-26 02:08:38] uname -a
Linux rdma-dev-02.rdma.lab.eng.rdu2.redhat.com 4.18.0-393.el8.x86_64 #1 SMP Wed May 18 12:44:50 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

+ [22-05-26 02:08:38] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-393.el8.x86_64 root=UUID=fd7a6a9d-cd42-4b62-9933-1f5f3d4c927b ro console=tty0 rd_NO_PLYMOUTH intel_iommu=on iommu=on crashkernel=auto resume=UUID=9ea769dc-0bb3-455f-a1b3-d99cd5d33215 console=ttyS1,115200

+ [22-05-26 02:08:38] rpm -q rdma-core linux-firmware
rdma-core-37.2-1.el8.x86_64
linux-firmware-20220210-107.git6342082c.el8.noarch

+ [22-05-26 02:08:38] tail /sys/class/infiniband/qedr0/fw_ver /sys/class/infiniband/qedr1/fw_ver
==> /sys/class/infiniband/qedr0/fw_ver <==
8. 59. 1. 0

==> /sys/class/infiniband/qedr1/fw_ver <==
8. 59. 1. 0

+ [22-05-26 02:08:38] lspci
+ [22-05-26 02:08:38] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
08:00.0 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)
08:00.1 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)

Installed:
mpitests-openmpi-5.8-1.el8.x86_64
openmpi-1:4.1.1-3.el8.x86_64
openmpi-devel-1:4.1.1-3.el8.x86_64

How reproducible:

100%

Steps to Reproduce:

1. Use the above build on a qedr RoCE device
2. Set up both the RDMA server and client for openmpi
3. On the client side, run the following benchmark command:

timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr0:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_roce.45 --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_acc_latency

Actual results:

[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:289 mca_pml_ucx_init
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:351 created ucp context 0x56170ef84000, worker 0x56170efd7e50
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:289 mca_pml_ucx_init
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:114 Pack local worker address, size 141
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:351 created ucp context 0x55e45dfd7160, worker 0x55e45e524ca0
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:182 Got proc 0 address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:411 connecting to proc. 0
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:182 Got proc 1 address, size 141
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:411 connecting to proc. 1
OSU MPI_Accumulate latency Test v5.8
Window creation: MPI_Win_allocate
Synchronization: MPI_Win_flush
Size          Latency (us)
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:182 Got proc 0 address, size 38
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:182 Got proc 1 address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:411 connecting to proc. 1
1             2570.11
2             2570.11
4             2570.11
8             2570.11
16            2570.18
32            2570.10
+ [22-05-26 02:41:36] __MPI_check_result 1 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_acc_latency mpirun /root/hfile_one_core

Expected results:

Normal execution with proper stats output

Additional info:
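
The create_qp failures in the output above report raw error numbers (22 and 95). Below is a minimal sketch for tallying them from a saved copy of the log. It assumes Linux errno numbering, where 22 is EINVAL and 95 is EOPNOTSUPP. The function name tally_create_qp_failures and the sample text are hypothetical and only for illustration; this is not part of the test harness.

```python
import re
from collections import Counter

# Assumption: Linux errno numbering for the two codes seen in this report.
ERRNO_NAMES = {22: "EINVAL", 95: "EOPNOTSUPP"}

# Matches the failure lines as they appear in the output above.
QP_FAIL = re.compile(r"create qp: failed on ibv_cmd_create_qp with (\d+)")


def tally_create_qp_failures(log_text):
    """Count create_qp failures per error code.

    Keys are errno names where known, otherwise the raw number as a string.
    """
    counts = Counter()
    for match in QP_FAIL.finditer(log_text):
        code = int(match.group(1))
        counts[ERRNO_NAMES.get(code, str(code))] += 1
    return counts


# Hypothetical three-line excerpt in the same shape as the real log.
sample = (
    "[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22\n"
    "[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95\n"
    "[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22\n"
)
print(dict(tally_create_qp_failures(sample)))
```

Running the same function over the full mpirun output would show how the failures split between the two codes, which may help when comparing runs.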
Won't Do