NCCL_SHM_USE_CUDA_MEMCPY=1 causes hang in PyTorch · Issue #803 · NVIDIA/nccl · GitHub

Open
geohot opened this issue Mar 17, 2023 · 6 comments · May be fixed by #806

@geohot
geohot commented Mar 17, 2023

Using CUDA memcpy is much faster, so I'm trying to make it work. This is on a machine with two non-NVLinked 3090s.

# boilerplate NCCL init code

nbytes = 65536*2
buf0 = torch.randn(nbytes // 4).cuda(rank)
dist.all_reduce(buf0)
#torch.cuda.synchronize(rank)
buf0.cpu()

If I uncomment the torch.cuda.synchronize, it works; otherwise it hangs. I suspect it has to do with the stream created in shmSendProxyConnect (also, does that stream work with CUDA graphs?).

If this is something you'd like to look into, I can help with a fuller reproduction.

@sjeaugey
Member

Can you try with NCCL_CREATE_THREAD_CONTEXT=1? Creating a separate CUDA context for the progress thread is important when you try to call CUDA from within the progress thread. Otherwise, when the progress thread calls cudaMemcpy, it may get blocked behind other CUDA calls that are themselves blocked by the NCCL kernel, causing a deadlock.
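For reference, here is a minimal standalone sketch (not NCCL code) of the pattern described above: a background progress-style thread binds its own driver-API context with cuCtxSetCurrent before making any CUDA runtime calls, so its copies are not tied to the primary context that the main thread's kernels are queued on. The file name and build line are illustrative assumptions; error checking is omitted.

/* Minimal sketch (not NCCL code) of giving a progress thread its own CUDA
 * context before it issues cudaMemcpy calls. Assumed build line:
 *   gcc proxy_ctx_sketch.c -o proxy_ctx_sketch -lcuda -lcudart -lpthread
 */
#include <cuda.h>
#include <cuda_runtime.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static void *progress_thread(void *arg) {
  CUcontext ctx = (CUcontext)arg;
  /* Bind the dedicated context to this thread before any runtime call,
   * analogous to what NCCL_CREATE_THREAD_CONTEXT=1 is meant to arrange. */
  cuCtxSetCurrent(ctx);

  char host[256];
  void *dev = NULL;
  memset(host, 7, sizeof(host));
  cudaMalloc(&dev, sizeof(host));
  /* These copies run in the progress thread's own context rather than the
   * primary context the main thread uses. */
  cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
  cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
  cudaFree(dev);
  printf("progress thread: copies done (host[0]=%d)\n", host[0]);
  return NULL;
}

int main(void) {
  CUdevice dev;
  CUcontext ctx;
  cuInit(0);
  cuDeviceGet(&dev, 0);
  /* A separate (non-primary) context reserved for the progress thread. */
  cuCtxCreate(&ctx, 0, dev);

  pthread_t tid;
  pthread_create(&tid, NULL, progress_thread, (void *)ctx);
  pthread_join(tid, NULL);
  cuCtxDestroy(ctx);
  return 0;
}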

@geohot
Author
geohot commented Mar 17, 2023

Thanks for the quick reply!

I tried that; no luck. Here is a full repro: NCCL 2.17.1-1 with trunk PyTorch. It also repros on PyTorch 2.0 with the built-in NCCL. CUDA 11.8, driver 530.30.02.

import os
os.environ['NCCL_SHM_USE_CUDA_MEMCPY'] = '1'
os.environ['NCCL_CREATE_THREAD_CONTEXT'] = '1'

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def init_process(rank, size, backend='nccl'):
  os.environ['MASTER_ADDR'] = '127.0.0.1'
  os.environ['MASTER_PORT'] = '29500'
  dist.init_process_group(backend, rank=rank, world_size=size)

  buf0 = torch.randn(32768).cuda(rank)
  dist.all_reduce(buf0)
  print(f"{rank}: all reduce done")
  
  # NOTE: uncommenting this synchronize will fix it, but it shouldn't be needed
  #torch.cuda.synchronize(rank)
  buf0.cpu()
  print(f"{rank}: to cpu")

if __name__ == "__main__":
  p1 = mp.Process(target=init_process, args=(0, 2))
  p2 = mp.Process(target=init_process, args=(1, 2))
  p1.start()
  p2.start()
  p1.join()
  p2.join()

@geohot
Author
geohot commented Mar 17, 2023

Ahh! This NCCL_CREATE_THREAD_CONTEXT seems to almost fix it. If I add:

CUPFN(cuCtxSetCurrent(comm->proxyState.cudaCtx));

to shmSendProxyConnect, it works. Maybe shmSendProxyConnect is being called on a thread where that context isn't current?

@geohot
Author
geohot commented Mar 17, 2023

So I believe the bug is deeper than that. This construct exists twice in proxy.cc:

if (ncclSetThreadContext(comm) != ncclSuccess) {
  WARN("[Proxy Progress] Failed to set CUDA context on device %d", comm->cudaDev);
} else if (cudaSetDevice(comm->cudaDev) != cudaSuccess) {
  WARN("[Proxy Progress] Failed to set CUDA device %d", comm->cudaDev);
}

If ncclSetThreadContext succeeds, the else-if still goes on to call cudaSetDevice. I don't think this is right: calling cudaSetDevice resets the current context (https://stackoverflow.com/questions/62877646/what-does-cudasetdevice-do-to-a-cuda-devices-context-stack); see also the small standalone sketch after this comment.

I suspect what you meant was this:

if (cudaSetDevice(comm->cudaDev) != cudaSuccess) {
  WARN("[Proxy Progress] Failed to set CUDA device %d", comm->cudaDev);
} else if (ncclSetThreadContext(comm) != ncclSuccess) {
  WARN("[Proxy Progress] Failed to set CUDA context on device %d", comm->cudaDev);
}

If I change it to that, the code works. (NOTE: the cudaSetDevice call might be redundant in this order; doesn't the context already determine the device?)
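To make the ordering point concrete, here is a minimal, hedged driver-API sketch (again, not NCCL code) that simply observes which context is current after cudaSetDevice. On the setups discussed in this thread, cudaSetDevice ends up binding the device's primary context, which is why the thread context needs to be set after cudaSetDevice, not before. The file name and build line are assumptions.

/* Hedged sketch: observe what cudaSetDevice does to the thread's current
 * driver context. Assumed build line:
 *   gcc ctx_order_sketch.c -o ctx_order_sketch -lcuda -lcudart
 */
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
  CUdevice dev;
  CUcontext custom, primary, current;

  cuInit(0);
  cuDeviceGet(&dev, 0);
  cuDevicePrimaryCtxRetain(&primary, dev);

  cuCtxCreate(&custom, 0, dev);   /* creates AND makes `custom` current */

  cudaSetDevice(0);               /* runtime call: selects the device's primary context */
  cudaFree(0);                    /* forces runtime initialization on this thread */

  cuCtxGetCurrent(&current);
  printf("custom=%p primary=%p current=%p (%s)\n",
         (void *)custom, (void *)primary, (void *)current,
         current == custom ? "custom survived" : "custom was replaced");
  /* If `current` ends up being the primary context, the custom context set
   * beforehand has been overridden: the same effect the reordering above
   * avoids by binding the thread context last. */

  cuCtxDestroy(custom);
  cuDevicePrimaryCtxRelease(dev);
  return 0;
}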

@sjeaugey
Member
sjeaugey commented Apr 6, 2023

Sorry for the delay.

Indeed, this was broken at some point. We were only supposed to call cudaSetDevice (which would override the context we just set) if NCCL_CREATE_THREAD_CONTEXT was not set to 1.

Can you check whether the attached patch works? Your patch works too, but it calls cudaSetDevice even when we would not need to call it.
fix_proxy_context.patch.txt

sjeaugey added a commit that referenced this issue Apr 19, 2023

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788).
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and remote ID (#779).
Add build option to allow compilation against RDMA includes instead of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
@Dazz993
Dazz993 commented Apr 11, 2024

Hi, I tried the script that @geohot provided with PyTorch 2.1.1+cu121 (NCCL 2.18.6). It doesn't work for me; the program still hangs. Any ideas, comments, or suggestions? Thanks!
