MAINT: gdr_unmap segfault on master branch via NVSHMEM 2.10.1 on Cray Slingshot 11 with cuFFTMp · Issue #296 · NVIDIA/gdrcopy
@tylerjereddy

Description


I'm working on Cray Slingshot 11, on 2 nodes with 4 x A100 GPUs each, using the test case from https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/samples/r2c_c2r_slabs_GROMACS, modified as follows to force multi-node NVSHMEM (2.10.1):

diff --git a/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile b/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile
index 5d9fa3e..64e39be 100644
--- a/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile
+++ b/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile
@@ -15,4 +15,4 @@ $(exe): $(exe).cu
 build: $(exe)
 
 run: $(exe)
-	LD_LIBRARY_PATH="${NVSHMEM_LIB}:${CUFFT_LIB}:${LD_LIBRARY_PATH}" mpirun -oversubscribe -n 4 $(exe) 
+	LD_LIBRARY_PATH="${NVSHMEM_LIB}:${CUFFT_LIB}:${LD_LIBRARY_PATH}" /lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin/mpirun -oversubscribe -n 8 -N 4 $(exe)

I'm seeing the following output/backtrace:

Hello from rank 7/8 using GPU 3
Hello from rank 4/8 using GPU 0
Hello from rank 5/8 using GPU 1
Hello from rank 6/8 using GPU 2
Hello from rank 3/8 using GPU 3
Hello from rank 1/8 using GPU 1
Hello from rank 2/8 using GPU 2
Hello from rank 0/8 using GPU 0
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

(the message above is printed once per rank, 8 times in total)

ERR:  mh is not mapped yet
[nid001217:115514:0:115701] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x18)
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
==== backtrace (tid: 115701) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x0000000000001aa7 gdr_unmap()  ???:0
 2 0x0000000000032d92 cuda_gdrcopy_dev_unregister()  :0
 3 0x00000000000a488f cxip_unmap()  :0
 4 0x000000000008c165 cxip_rma_cb()  cxip_rma.c:0
 5 0x00000000000adfe5 cxip_evtq_progress()  :0
 6 0x0000000000081695 cxip_ep_progress()  :0
 7 0x000000000008b599 cxip_cntr_readerr()  cxip_cntr.c:0
 8 0x000000000000dfc2 nvshmemt_libfabric_progress()  /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:446
 9 0x00000000000e4bad progress_transports()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/proxy/proxy.cpp:963
10 0x00000000000e51b9 progress()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/proxy/proxy.cpp:992
11 0x000000000000a6ea start_thread()  ???:0
12 0x0000000000117a6f __GI___clone()  ???:0
=================================

My full interactive run script is below; it shows a bit more about the various dependency versions/paths:

#!/bin/bash -l
#

# setup the runtime environment
#export FI_LOG_LEVEL=debug
#export NVSHMEM_DEBUG=TRACE
export FI_HMEM=cuda
export GDRCOPY_ENABLE_LOGGING=1
# we need special CXI- and CUDA-enabled version of libfabric
# per: https://github.com/ofiwg/libfabric/issues/10001#issuecomment-2078604043
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/lib64:$LD_LIBRARY_PATH"
export PATH="/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/bin:$PATH"
export PATH="$PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/bin"
export PATH="$PATH:/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/comm_libs/12.3/nccl/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-hc255f5j4fcqhtufeisjj3pytrkv4dqt/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-hc255f5j4fcqhtufeisjj3pytrkv4dqt/lib/ucx:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/projects/hpcsoft/cos2/chicoma/cuda-compat/12.0/:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/pmix-4.2.9-7kfa4s6dwyd5wlayw24vx7jai7d4oi4x/lib"
export NVSHMEM_DISABLE_CUDA_VMM=1
export FI_CXI_OPTIMIZED_MRS=false
export NVSHMEM_REMOTE_TRANSPORT=libfabric
export MPI_HOME=/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install
export CUFFT_LIB=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib
export CUFFT_INC=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/include/cufftmp
export NVSHMEM_LIB=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib
export NVSHMEM_INC=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/include
which fi_info
echo "fi_info -l:"
fi_info -l
echo "fi_info -p cxi:"
fi_info -p cxi
cd /lustre/scratch5/treddy/march_april_2024_testing/github_projects/CUDALibrarySamples/cuFFTMp/samples/r2c_c2r_slabs_GROMACS
make clean
make build
make run
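
Since the long chain of `LD_LIBRARY_PATH` exports above is easy to get wrong, I also sanity-check the path list before running: entries that are not absolute or do not exist on disk are silently skipped by the dynamic loader, so a mistyped entry can quietly let the wrong library get resolved. A minimal sketch of such a check (the `check_ldpath` helper and the `/tmp` demo paths are just illustrative, not part of my actual script):

```shell
# Sketch: flag LD_LIBRARY_PATH entries that the dynamic loader would
# silently skip (non-absolute or nonexistent directories).
check_ldpath() {
    local IFS=':'
    for dir in $1; do
        case "$dir" in
            /*) [ -d "$dir" ] || echo "missing: $dir" ;;
            *)  echo "not absolute: $dir" ;;
        esac
    done
}

# Demo on a synthetic value; the real check would be:
#   check_ldpath "$LD_LIBRARY_PATH"
mkdir -p /tmp/ldcheck/ok
check_ldpath "/tmp/ldcheck/ok:lustre/example/relative:/tmp/ldcheck/absent"
# prints:
#   not absolute: lustre/example/relative
#   missing: /tmp/ldcheck/absent
```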

More gruesome details about libfabric, CXI, and CUDA support are described at ofiwg/libfabric#10001, but since I'm now apparently segfaulting inside gdrcopy, it would be helpful to determine what my next debugging steps should be here. I've already discussed this fairly extensively with the NVSHMEM team.

I built the latest gdrcopy master branch with gcc 12.2.0 + cuda/12.0 "modules" loaded:

make -j 32 prefix=/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install CUDA=/usr/projects/hpcsoft/cos2/chicoma/cuda/12.0 all install

It would be awesome if I could get this working somehow. Note that I was originally getting different backtraces with gdrcopy 2.3.
