...
+++ This bug was initially created as a clone of Bug #2148553 +++

Description of problem:

All mvapich2 benchmarks fail, with RC 134 under the "mpirun" command or RC 1 under the "mpirun_rsh" command. This happens on hosts with an MT27700 ConnectX-4 device when the transport is IB0 or IB1, and specifically on the rdma-dev-19 / rdma-dev-20 host pair running as RDMA server and client, respectively.

This is a REGRESSION from RHEL-8.7.0, where all mvapich2 benchmarks PASSED on IB0 on the same HCA on the same rdma-dev-19 / rdma-dev-20 pair.

Version-Release number of selected component (if applicable):

Clients: rdma-dev-20
Servers: rdma-dev-19
DISTRO=RHEL-8.8.0-20221120.2

+ [22-11-25 16:18:29] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.8 Beta (Ootpa)
+ [22-11-25 16:18:29] uname -a
Linux rdma-dev-20.rdma.lab.eng.rdu2.redhat.com 4.18.0-438.el8.x86_64 #1 SMP Mon Nov 14 13:08:07 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
+ [22-11-25 16:18:29] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-438.el8.x86_64 root=UUID=4dcc79ce-c280-4af4-9b75-02011855b115 ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=1c9d8b9c-d969-417d-ad02-b9e6279dfac8 console=ttyS1,115200n81
+ [22-11-25 16:18:29] rpm -q rdma-core linux-firmware
rdma-core-41.0-1.el8.x86_64
linux-firmware-20220726-110.git150864a4.el8.noarch
+ [22-11-25 16:18:29] tail /sys/class/infiniband/mlx5_2/fw_ver /sys/class/infiniband/mlx5_3/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_3/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.31.1014
+ [22-11-25 16:18:29] lspci
+ [22-11-25 16:18:29] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
04:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

Installed:
  mpitests-mvapich2-5.8-1.el8.x86_64
  mvapich2-2.3.6-1.el8.x86_64

How reproducible:
100%
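Judging by the 12.28.x firmware entries above, mlx5_2 and mlx5_3 appear to be the MT27700 ConnectX-4 IB ports backing IB0/IB1. As a quick pre-check of their link state before the runs (a minimal sketch, assuming infiniband-diags, libibverbs-utils and iproute are installed; this was not part of the original trace):

  # Confirm the ConnectX-4 IB ports are ACTIVE before starting the benchmarks
  ibstat mlx5_2
  ibstat mlx5_3
  # Port/link state as reported by the iproute "rdma" tool
  rdma link show
  # Verbose device attributes (ports, GIDs, MTU) from libibverbs
  ibv_devinfo -d mlx5_2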
Steps to Reproduce:
1. Bring up the RDMA hosts mentioned above with the RHEL-8.8 build.
2. Set up the RDMA hosts for the mvapich2 benchmark tests.
3. Run one of the mvapich2 benchmarks with the "mpirun" or "mpirun_rsh" command, as shown below (a hostfile and backtrace-capture sketch follows the Additional info section).

a) "mpirun" command

timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 mpitests-IMB-MPI1 PingPong -time 1.5
*** buffer overflow detected ***: terminated
[rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][error_sighandler] Caught error: Aborted (signal 6)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 48458 RUNNING AT 172.31.0.120
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@rdma-dev-19.rdma.lab.eng.rdu2.redhat.com] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0@rdma-dev-19.rdma.lab.eng.rdu2.redhat.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@rdma-dev-19.rdma.lab.eng.rdu2.redhat.com] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

b) "mpirun_rsh" command

+ [22-11-25 14:26:27] timeout --preserve-status --kill-after=5m 3m mpirun_rsh -np 2 -hostfile /root/hfile_one_core mpitests-IMB-MPI1 PingPong -time 1.5
*** buffer overflow detected ***: terminated
[rdma-dev-19.rdma.lab.eng.rdu2.redhat.com:mpi_rank_0][error_sighandler] Caught error: Aborted (signal 6)
*** buffer overflow detected ***: terminated
[rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][error_sighandler] Caught error: Aborted (signal 6)
[rdma-dev-19.rdma.lab.eng.rdu2.redhat.com:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[rdma-dev-19.rdma.lab.eng.rdu2.redhat.com:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[rdma-dev-19.rdma.lab.eng.rdu2.redhat.com:mpispawn_0][child_handler] MPI process (rank: 0, pid: 51624) terminated with signal 6 -> abort job
[rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
[rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpispawn_1][child_handler] MPI process (rank: 1, pid: 52467) terminated with signal 6 -> abort job
[rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node 172.31.0.119 aborted: Error while reading a PMI socket (4)
+ [22-11-25 14:26:30] __MPI_check_result 1 mpitests-mvapich2 IMB-MPI1 PingPong mpirun_rsh /root/hfile_one_core

Actual results:
The benchmark ranks abort with a glibc "buffer overflow detected" SIGABRT; mpirun exits with RC 134 and mpirun_rsh exits with RC 1.

Expected results:
Normal run with stats.

Additional info:
On other hosts, such as the rdma-dev-21 / rdma-dev-22 pair with the same MT27700 ConnectX-4 device on IB0, all mvapich2 benchmarks PASSED. Also, on the rdma-perf-02 / rdma-perf-03 host pair, with an mlx5 MT27800 ConnectX-5 ib0, all mvapich2 benchmarks PASSED.
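For anyone re-running this by hand: the exact contents of /root/hfile_one_core are not captured in this report, so the hostfile below is an assumption (one rank per host, one hostname per line, which a plain mvapich2 hostfile accepts), and the core-dump steps are a suggested way to localize the "buffer overflow detected" abort, not something that was run for this report.

  # Hypothetical one-rank-per-host hostfile (actual file contents not shown above)
  printf '%s\n' \
      rdma-dev-19.rdma.lab.eng.rdu2.redhat.com \
      rdma-dev-20.rdma.lab.eng.rdu2.redhat.com > /root/hfile_one_core

  # Allow core dumps so the SIGABRT from the glibc fortify check can be inspected
  ulimit -c unlimited
  timeout --preserve-status --kill-after=5m 3m \
      mpirun_rsh -np 2 -hostfile /root/hfile_one_core mpitests-IMB-MPI1 PingPong -time 1.5

  # On the rank that aborted, pull a backtrace to see which call tripped the
  # fortify check (how the core is collected depends on the core_pattern in effect):
  coredumpctl list                               # if systemd-coredump collects cores
  coredumpctl gdb <PID>                          # backtrace of the aborted rank
  gdb -batch -ex bt /usr/lib64/mvapich2/bin/mpitests-IMB-MPI1 ./core   # if a plain core file is written (binary path is an assumption)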
Won't Do