Description of problem:

The following mvapich2 benchmarks fail due to "[error_sighandler] Caught error":

FAIL | 135 | mvapich2 IMB-NBC Ireduce_scatter mpirun one_core
  [rdma-dev-25.rdma.lab.eng.rdu2.redhat.com:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
FAIL | 139 | mvapich2 IMB-RMA Unidir_put mpirun one_core
FAIL | 139 | mvapich2 IMB-RMA Bidir_put mpirun one_core
FAIL | 139 | mvapich2 IMB-RMA Put_local mpirun one_core
FAIL | 139 | mvapich2 IMB-RMA Accumulate mpirun one_core
  [rdma-dev-26.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)

Version-Release number of selected component (if applicable):

Clients: rdma-dev-26
Servers: rdma-dev-25

DISTRO=RHEL-9.1.0-20220509.3

+ [22-05-10 09:57:56] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.1 Beta (Plow)

+ [22-05-10 09:57:56] uname -a
Linux rdma-dev-26.rdma.lab.eng.rdu2.redhat.com 5.14.0-86.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 6 09:23:00 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

+ [22-05-10 09:57:56] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-86.el9.x86_64 root=/dev/mapper/rhel_rdma-dev26-root ro intel_idle.max_cstate=0 intremap=no_x2apic_optout processor.max_cstate=0 console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G:512M resume=/dev/mapper/rhel_rdma-dev-26-swap rd.lvm.lv=rhel_rdma-dev-26/root rd.lvm.lv=rhel_rdma-dev-26/swap console=ttyS1,115200n81

+ [22-05-10 09:57:56] rpm -q rdma-core linux-firmware
rdma-core-37.2-1.el9.x86_64
linux-firmware-20220209-126.el9_0.noarch

+ [22-05-10 09:57:56] tail /sys/class/infiniband/bnxt_re0/fw_ver /sys/class/infiniband/bnxt_re1/fw_ver
==> /sys/class/infiniband/bnxt_re0/fw_ver <==
219.0.112.0

==> /sys/class/infiniband/bnxt_re1/fw_ver <==
219.0.112.0

+ [22-05-10 09:57:56] lspci
+ [22-05-10 09:57:56] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
04:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)

Installed:
mpitests-mvapich2-5.8-1.el9.x86_64
mvapich2-2.3.6-3.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Bring up the RDMA hosts mentioned above with the RHEL-9.1.0 build.
2. Set up the RDMA hosts for the mvapich2 benchmark tests.
3. Run one of the mvapich2 benchmarks with the "mpirun" command, as follows (a hedged sketch of the full reproduction is included under Additional info below):
   timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 mpitests-IMB-NBC Ireduce_scatter -time 1.5

Actual results:

+ [22-05-10 10:05:07] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 mpitests-IMB-NBC Ireduce_scatter -time 1.5
[rdma-dev-25.rdma.lab.eng.rdu2.redhat.com:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance.
http://mvapich.cse.ohio-state.edu/performance/job-startup/.
[rdma-dev-25.rdma.lab.eng.rdu2.redhat.com:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1

#----------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2021.3, MPI-NBC part
#----------------------------------------------------------------
# Date                  : Tue May 10 10:05:07 2022
# Machine               : x86_64
# System                : Linux
# Release               : 5.14.0-86.el9.x86_64
# Version               : #1 SMP PREEMPT_DYNAMIC Fri May 6 09:23:00 EDT 2022
# MPI Version           : 3.1
# MPI Thread Environment:

# Calling sequence was:
# mpitests-IMB-NBC Ireduce_scatter -time 1.5

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
# List of Benchmarks to run:
# Ireduce_scatter

#-----------------------------------------------------------------------------
# Benchmarking Ireduce_scatter
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_ovrl[usec]  t_pure[usec]   t_CPU[usec]   overlap[%]
            0         1000          0.94          0.53          0.36         0.00
            4         1000          6.93          5.73          5.60        76.66
            8         1000          6.93          5.75          5.60        76.79
           16         1000          5.45          4.23          4.08        67.59
           32         1000          5.45          4.20          4.09        67.52
           64         1000          9.84          6.87          6.83        56.19
          128         1000         10.03          6.99          7.12        57.31
          256         1000          8.29          6.88          6.84        78.94
          512         1000          8.30          6.81          6.82        78.20
         1024         1000         11.16          7.36          7.45        49.08
         2048         1000         11.78          7.69          7.76        47.35
         4096         1000          9.49          7.15          7.15        67.25
         8192         1000         13.87          9.80          9.94        59.05
        16384         1000         38.18         19.26         19.10         0.91
        32768         1000         47.83         23.53         23.07         0.00
        65536          640         65.72         32.23         31.97         0.00
       131072          320        107.81         49.63         49.01         0.00
       262144          160        181.96         88.82         87.53         0.00
       524288           80        331.56        159.86        157.39         0.00
      1048576           40        715.22        354.64        350.01         0.00
      2097152           20       1413.44        695.38        685.92         0.00
      4194304           10       3006.39       1472.74       1487.87         0.00

All processes entering MPI_Finalize

[1] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[1] 8 at [0x0000558cc8ce1c50], src/mpi/comm/create_2level_comm.c[1523]
[1] 8 at [0x0000558cc8ce20c0], src/util/procmap/local_proc.c[93]
[1] 8 at [0x0000558cc8ce1e60], src/util/procmap/local_proc.c[92]
[1] 24 at [0x0000558cc811dcc0], src/mpi/group/grouputil.c[74]
[1] 8 at [0x0000558cc811db10], src/mpi/comm/create_2level_comm.c[1481]
[1] 128 at [0x0000558cc8ce1930], src/mpi/coll/ch3_shmem_coll.c[4484]
[1] 8 at [0x0000558cc8ce2010], src/util/procmap/local_proc.c[93]
[1] 8 at [0x0000558cc8ce1f60], src/util/procmap/local_proc.c[92]
[1] 8 at [0x0000558cc8ce1880], src/mpi/comm/create_2level_comm.c[942]
[1] 8 at [0x0000558cc8ce17d0], src/mpi/comm/create_2level_comm.c[940]
[1] 1024 at [0x0000558cc8ce1330], src/mpi/coll/ch3_shmem_coll.c[5254]
[1] 8 at [0x0000558cc8ce1280], src/mpi/coll/ch3_shmem_coll.c[5249]
[1] 312 at [0x0000558cc8ce10a0], src/mpi/coll/ch3_shmem_coll.c[5201]
[1] 264 at [0x0000558cc8c38920], src/mpi/coll/ch3_shmem_coll.c[5150]
[1] 8 at [0x0000558cc8c38f60], src/mpi/comm/create_2level_comm.c[2103]
[1] 8 at [0x0000558cc8c38eb0], src/mpi/comm/create_2level_comm.c[2095]
[1] 8 at [0x0000558cc8c38e00], src/util/procmap/local_proc.c[93]
[1] 8 at [0x0000558cc8c38d50], src/util/procmap/local_proc.c[92]
[1] 24 at [0x0000558cc8c38c90], src/mpid/ch3/src/mpid_vc.c[111]
[1] 16 at [0x0000558cc8c90e30], src/mpi/group/grouputil.c[74]
[1] 8 at [0x0000558cc8c38be0], src/util/procmap/local_proc.c[93]
[1] 8 at [0x0000558cc8c38b30], src/util/procmap/local_proc.c[92]
[1] 24 at [0x0000558cc8c90f90], src/mpi/group/grouputil.c[74]
[1] 8 at [0x0000558cc8c90ee0], src/mpi/comm/create_2level_comm.c[1998]
[1] 8 at [0x0000558cc8c90d80], src/mpi/comm/create_2level_comm.c[1974]
[1] 2048 at [0x0000558cc8c904e0], src/mpi/comm/create_2level_comm.c[1961]
[1] 24 at [0x0000558cc811dc00], src/mpi/group/grouputil.c[74]
[1] 8 at [0x0000558cc811df50], src/util/procmap/local_proc.c[93]
[1] 8 at [0x0000558cc811dea0], src/util/procmap/local_proc.c[92]
[1] 8 at [0x0000558cc8503800], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x0000558cc8503750], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x0000558cc85036a0], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x0000558cc83c0f80], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x0000558cc83c0c10], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x0000558cc83c0a00], src/mpid/ch3/src/mpid_rma.c[182]
[1] 504 at [0x0000558cc842fa40], src/mpi/comm/commutil.c[342]
[1] 32 at [0x0000558cc842f980], src/mpid/ch3/src/mpid_vc.c[111]
[rdma-dev-25.rdma.lab.eng.rdu2.redhat.com:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)   <<<===============================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 44242 RUNNING AT 172.31.45.125
=   EXIT CODE: 135
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Bus error (signal 7)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

+ [22-05-10 10:05:09] __MPI_check_result 135 mpitests-mvapich2 IMB-NBC Ireduce_scatter mpirun /root/hfile_one_core

Expected results:


Additional info:
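
For reference, a minimal reproduction sketch covering all of the failing cases listed in the Description. The contents of /root/hfile_one_core are an assumption (they are not captured in this report); a one_core hostfile normally lists one hostname per rank. It is also assumed that the IMB-RMA cases are driven by the mpitests-IMB-RMA binary with the subtests named in the FAIL table.

# Assumed hostfile layout: one hostname per rank (not shown in the report).
cat > /root/hfile_one_core <<'EOF'
rdma-dev-25.rdma.lab.eng.rdu2.redhat.com
rdma-dev-26.rdma.lab.eng.rdu2.redhat.com
EOF

# Re-run each failing benchmark with the same timeout wrapper used above;
# each entry is "<IMB suite> <subtest>" taken from the FAIL table.
for bench in "IMB-NBC Ireduce_scatter" \
             "IMB-RMA Unidir_put" \
             "IMB-RMA Bidir_put" \
             "IMB-RMA Put_local" \
             "IMB-RMA Accumulate"; do
    binary=mpitests-${bench%% *}     # e.g. mpitests-IMB-NBC
    subtest=${bench#* }              # e.g. Ireduce_scatter
    timeout --preserve-status --kill-after=5m 3m \
        mpirun -hostfile /root/hfile_one_core -np 2 "$binary" "$subtest" -time 1.5
done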
Won't Do