
Update with_device functions in MXNet and PyTorch to skip unnecessary cudaSetDevice calls #3912


Merged · 8 commits merged into horovod:master on May 9, 2023

Conversation

@romerojosh (Collaborator) commented May 5, 2023

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

In CUDA 12.x, calls to cudaSetDevice create GPU contexts. It was found that the with_device objects used in the MXNet and PyTorch code can call cudaSetDevice(0) in their destructors if they are constructed before the Horovod background thread has set a device (e.g. before any GPU collectives have been launched). With the new CUDA 12.x behavior, this generates spurious contexts on GPU 0 for all ranks, unnecessarily consuming GPU memory.

This PR updates the behavior of with_device to set the restore_device_ value to the device value supplied to the constructor if no current GPU context is associated with the thread.
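
For illustration, a minimal sketch of that pattern (not the exact Horovod code; the real guard also handles the CPU-only device case), using cuCtxGetDevice to detect whether the calling thread already has a CUDA context. Note that calling the driver API directly like this is what leads to the linking discussion further down in this thread.

// Sketch of a with_device-style RAII guard. If the calling thread has no
// CUDA context yet, restore_device_ is set to the requested device so the
// destructor never calls cudaSetDevice(0) and creates a spurious context.
#include <cuda.h>
#include <cuda_runtime.h>

class with_device_sketch {
 public:
  explicit with_device_sketch(int device) {
    CUdevice current;
    // cuCtxGetDevice fails (e.g. CUDA_ERROR_INVALID_CONTEXT) when no context
    // is bound to this thread; in that case fall back to the requested device.
    if (cuCtxGetDevice(&current) == CUDA_SUCCESS) {
      restore_device_ = static_cast<int>(current);
    } else {
      restore_device_ = device;
    }
    cudaSetDevice(device);
  }
  ~with_device_sketch() { cudaSetDevice(restore_device_); }

 private:
  int restore_device_;
};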

@github-actions (bot) commented May 5, 2023

Unit Test Results

1 155 files (−38) · 1 155 suites (−38) · 14h 26m 37s ⏱️ (−35m 28s)
887 tests (±0): 833 ✔️ passed (±0), 54 💤 skipped (±0), 0 failed (±0)
25 708 runs (−964): 18 292 ✔️ passed (−592), 7 416 💤 skipped (−372), 0 failed (±0)

Results for commit a0a88fe. ± Comparison against base commit 1ea0067.

♻️ This comment has been updated with latest results.

@github-actions (bot) commented May 5, 2023

Unit Test Results (with flaky tests)

1 270 files (−37) · 1 270 suites (−37) · 15h 8m 22s ⏱️ (−30m 23s)
887 tests (±0): 832 ✔️ passed (−1), 54 💤 skipped (±0), 1 failed (+1)
28 616 runs (−948): 19 993 ✔️ passed (−583), 8 622 💤 skipped (−366), 1 failed (+1)

For more details on these failures, see this check.

Results for commit a0a88fe. ± Comparison against base commit 1ea0067.

♻️ This comment has been updated with latest results.

@maxhgerlach (Collaborator) commented May 5, 2023

GitHub builds with docker-compose don't work right now, but the Jenkins ppc64le build ran into an actual linker error (CUDA 10.2.89):

E ImportError: /opt/anaconda3/envs/wmlce/lib/python3.9/site-packages/horovod/torch/mpi_lib_v2.cpython-39-powerpc64le-linux-gnu.so: undefined symbol: cuCtxGetDevice

https://powerci.osuosl.org/job/Horovod_PPC64LE_GPU_PIPELINE/view/change-requests/job/PR-3912/2/console

@EnricoMi (Collaborator) left a comment

minor whitespace suggestions

@maxhgerlach (Collaborator)

@romerojosh, build should be fixed on master

@maxhgerlach added this to the v0.28.0 milestone on May 5, 2023
@romerojosh (Collaborator, Author)

Hmm, the driver linking error is unfortunate. I thought maybe that would happen. I can try and fix the build to link in the driver libraries. Alternatively, there is a much simpler option here that doesn't require adding CUDA driver APIs at all.

The current usage of this with_device class is unnecessary. The purpose of that object is to "reset" the Horovod thread to the device it was previously using. However, the Horovod thread is not really associated with a specific device; we just set the device as required based on the tensors we are processing (see: https://github.com/horovod/horovod/blob/master/horovod/common/ops/gpu_operations.cc#L36). In fact, we never "reset" the device after performing a GPU collective operation. As a simpler solution, I propose removing the CUDA device reset in with_device altogether (or replacing all usage of with_device with cudaSetDevice() calls).

@maxhgerlach (Collaborator) commented May 5, 2023

The current usage of this with_device class is unnecessary. The purpose of that object is to "reset" the Horovod thread to the device it was previously using. However, the Horovod thread is not really associated with a specific device; we just set the device as required based on the tensors we are processing

I am not super familiar with these concerns, but apparently with_device is not only instantiated in the Horovod background thread, but also in Torch "foreground" threads: for instance within TorchOpContext::AllocateOutput (1), which is directly called by DoReducescatter (2) and related functions. DoBroadcast has a similar usage (3). Would we then need to be more careful there, so we don't interfere with other framework operations?

(1):

with_device device_context(output_devices_.at(output_index));

(2):
hvd_context->AllocateOutput(output_shape, &hvd_output, &ready_event));

(3):
with_device device_guard(device);

@romerojosh (Collaborator, Author)

Ah, good catch. I wasn't aware of any usage of the allocation functions in the mpi_ops definitions (it seems only ReduceScatter does that). Also, I thought PyTorch broadcast was implemented similarly to MXNet, where the tensor creation is handled in Python like this:

if rank() == root_rank:
    output = tensor.copy()
else:
    output = mx.nd.zeros(shape=tensor.shape, ctx=tensor.context,
                         dtype=tensor.dtype)

but apparently not.

Ok, I'll move forward with the current implementation then.

@maxhgerlach (Collaborator)

I wasn't aware of any usage of the allocation functions in the mpi_ops definitions (seems it is only ReduceScatter that does it).

True, it felt a bit unconventional to rework it like that in #3824, but also more straightforward than exposing ReducescatterOp::ComputeOutputShapeForRank to Python (or duplicating the logic there).

@romerojosh (Collaborator, Author)

Alright, pushed changes to apply @EnricoMi's suggested whitespace fixes. I also added a CUDA driver link to the CMake build, which should hopefully resolve the build/run issues.

The HVD_GPU_DRIVER_CHECK macro I added for ROCm is just best-effort since I do not have a system to test this on and the CI doesn't test ROCm builds.
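
For context, driver-API check macros of this kind usually follow a shape like the sketch below. This is only an illustration of the general pattern, not the HVD_GPU_DRIVER_CHECK definition from this PR; a ROCm/HIP variant would use the corresponding HIP error APIs instead.

// Illustrative error-check macro for CUDA driver API calls; the name and
// error handling here are placeholders, not the actual Horovod macro.
#include <cuda.h>
#include <stdexcept>
#include <string>

#define DRIVER_CHECK_SKETCH(call)                                            \
  do {                                                                       \
    CUresult res_ = (call);                                                  \
    if (res_ != CUDA_SUCCESS) {                                              \
      const char* msg_ = nullptr;                                            \
      cuGetErrorString(res_, &msg_);  /* human-readable driver error text */ \
      throw std::logic_error(std::string(#call " failed: ") +                \
                             (msg_ != nullptr ? msg_ : "unknown error"));    \
    }                                                                        \
  } while (0)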

@maxhgerlach (Collaborator)

Jenkins passed! 👍

The docker-compose fix from #3913 seems to be picked up already (maybe because this is a fork PR and the GitHub CI pipeline runs on master config?), but not the docs build fix. I think the failing readthedocs build should not block the Buildkite GPU pipelines, so this might run through without rebasing to master first.

@maxhgerlach (Collaborator) left a comment

LGTM, provided that the GPU test pipeline passes.

romerojosh added 6 commits May 5, 2023 14:26
… cudaSetDevice calls.

Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh force-pushed the context_fix_cuda12 branch from 78b42b8 to 739e4ec on May 5, 2023 at 21:29
@maxhgerlach (Collaborator) commented May 8, 2023

Hmm, the method using cuCtxGetDevice apparently doesn't work when one tries to use a CUDA-aware Horovod binary in a CPU-only context. In the CI, CPU-running tests for the GPU-aware docker image fail with ImportError: libcuda.so.1: cannot open shared object file: No such file or directory: https://github.com/horovod/horovod/actions/runs/4897708229/jobs/8747477066?pr=3912

I think that's not an uncommon use case. Would it be feasible to only load the driver library dynamically if a GPU device is available at run time?

@romerojosh (Collaborator, Author)

Ok. I think the way around this is to load the driver library and functions at runtime using dlopen/dlsym. I will implement this.
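
A minimal sketch of that approach, with hypothetical function and variable names (only the dlopen/dlsym mechanism and the cuCtxGetDevice signature come from the discussion above):

// Resolve cuCtxGetDevice at runtime so the extension does not link against
// libcuda.so.1 and still imports cleanly on CPU-only machines.
#include <dlfcn.h>
#include <cuda.h>  // only for the CUresult/CUdevice typedefs

using cuCtxGetDevice_t = CUresult (*)(CUdevice*);

cuCtxGetDevice_t LoadCuCtxGetDevice() {
  // Returns nullptr when no driver library is present; callers then skip the
  // context check and fall back to the previous behavior.
  void* handle = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_LOCAL);
  if (handle == nullptr) {
    return nullptr;
  }
  return reinterpret_cast<cuCtxGetDevice_t>(dlsym(handle, "cuCtxGetDevice"));
}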

romerojosh added 2 commits May 8, 2023 10:54
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh force-pushed the context_fix_cuda12 branch from 4a6dc6c to a0a88fe on May 8, 2023 at 21:55
@maxhgerlach (Collaborator)

Thanks for the update @romerojosh! Looks like this works great. I'll go forward and merge.

@maxhgerlach merged commit 004fd0d into horovod:master on May 9, 2023