
Update with_device functions in MXNet and PyTorch to skip unnecessary cudaSetDevice calls #3912


Merged · 8 commits merged into horovod:master on May 9, 2023

Conversation

@romerojosh (Collaborator) commented May 5, 2023

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

In CUDA 12.x, calls to cudaSetDevice create GPU contexts. It was found that the with_device objects used in the MXNet and PyTorch code can call cudaSetDevice(0) in their destructors if they are constructed before the Horovod background thread has set a device (e.g. before any GPU collectives have been launched). With the new CUDA 12.x behavior, this generates spurious contexts on GPU 0 for all ranks, unnecessarily consuming GPU memory.

This PR updates the behavior of with_device to set the restore_device_ value to the device value supplied to the constructor if no current GPU context is associated with the thread.
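
For illustration, a minimal sketch of that pattern (not the exact Horovod code; the real guard also handles the CPU-only device case), using cuCtxGetDevice to detect whether the calling thread already has a CUDA context. Note that calling the driver API directly like this is what leads to the linking discussion further down in this thread.

// Sketch of a with_device-style RAII guard. If the calling thread has no
// CUDA context yet, restore_device_ is set to the requested device so the
// destructor never calls cudaSetDevice(0) and creates a spurious context.
#include <cuda.h>
#include <cuda_runtime.h>

class with_device_sketch {
 public:
  explicit with_device_sketch(int device) {
    CUdevice current;
    // cuCtxGetDevice fails (e.g. CUDA_ERROR_INVALID_CONTEXT) when no context
    // is bound to this thread; in that case fall back to the requested device.
    if (cuCtxGetDevice(&current) == CUDA_SUCCESS) {
      restore_device_ = static_cast<int>(current);
    } else {
      restore_device_ = device;
    }
    cudaSetDevice(device);
  }
  ~with_device_sketch() { cudaSetDevice(restore_device_); }

 private:
  int restore_device_;
};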

@github-actions (bot) commented May 5, 2023

Unit Test Results

1 155 files (−38) · 1 155 suites (−38) · 14h 26m 37s ⏱️ (−35m 28s)
887 tests (±0): 833 ✔️ passed (±0), 54 💤 skipped (±0), 0 failed (±0)
25 708 runs (−964): 18 292 ✔️ passed (−592), 7 416 💤 skipped (−372), 0 failed (±0)

Results for commit a0a88fe. ± Comparison against base commit 1ea0067.

♻️ This comment has been updated with latest results.

@github-actions (bot) commented May 5, 2023

Unit Test Results (with flaky tests)

1 270 files (−37) · 1 270 suites (−37) · 15h 8m 22s ⏱️ (−30m 23s)
887 tests (±0): 832 ✔️ passed (−1), 54 💤 skipped (±0), 1 failed (+1)
28 616 runs (−948): 19 993 ✔️ passed (−583), 8 622 💤 skipped (−366), 1 failed (+1)

For more details on these failures, see this check.

Results for commit a0a88fe. ± Comparison against base commit 1ea0067.

♻️ This comment has been updated with latest results.

@maxhgerlach (Collaborator) commented May 5, 2023

GitHub builds with docker-compose don't work right now, but the Jenkins ppc64le build ran into an actual linker error (CUDA 10.2.89):

E ImportError: /opt/anaconda3/envs/wmlce/lib/python3.9/site-packages/horovod/torch/mpi_lib_v2.cpython-39-powerpc64le-linux-gnu.so: undefined symbol: cuCtxGetDevice

https://powerci.osuosl.org/job/Horovod_PPC64LE_GPU_PIPELINE/view/change-requests/job/PR-3912/2/console

@EnricoMi (Collaborator) left a comment

minor whitespace suggestions

@maxhgerlach (Collaborator)

@romerojosh, build should be fixed on master

@maxhgerlach added this to the v0.28.0 milestone on May 5, 2023
@romerojosh (Collaborator, Author)

Hmm, the driver linking error is unfortunate. I thought maybe that would happen. I can try and fix the build to link in the driver libraries. Alternatively, there is a much simpler option here that doesn't require adding CUDA driver APIs at all.

The current usage of this with_device class is unnecessary. The purpose of that object is to "reset" the Horovod thread to the device it was previously using. However, the Horovod thread is not really associated with a specific device; we just set the device as required based on the tensors we are processing (see: https://github.com/horovod/horovod/blob/master/horovod/common/ops/gpu_operations.cc#L36). In fact, we never "reset" the device after performing a GPU collective operation. As a simpler solution, I propose removing the CUDA device reset in with_device altogether (or replacing all usage of with_device with cudaSetDevice() calls).

@maxhgerlach (Collaborator) commented May 5, 2023

The current usage of this with_device class is unnecessary. The purpose of that object is to "reset" the Horovod thread to the device it was previously using. However, the Horovod thread is not really associated with a specific device; we just set the device as required based on the tensors we are processing

I am not super familiar with these concerns, but apparently with_device is not only instantiated in the Horovod background thread, but also in Torch "foreground" threads: for instance within TorchOpContext::AllocateOutput (1), which is directly called by DoReducescatter (2) and related functions. DoBroadcast has a similar usage (3). Would we then need to be more careful there, so we don't interfere with other framework operations?

(1):

with_device device_context(output_devices_.at(output_index));

(2):
hvd_context->AllocateOutput(output_shape, &hvd_output, &ready_event));

(3):
with_device device_guard(device);

@romerojosh (Collaborator, Author)

Ah, good catch. I wasn't aware of any usage of the allocation functions in the mpi_ops definitions (it seems only ReduceScatter does that). Also, I thought PyTorch broadcast was implemented similarly to MXNet, where the tensor creation is handled in Python like this:

if rank() == root_rank:
    output = tensor.copy()
else:
    output = mx.nd.zeros(shape=tensor.shape, ctx=tensor.context,
                         dtype=tensor.dtype)

but apparently not.

Ok, I'll move forward with the current implementation then.

@maxhgerlach (Collaborator)

I wasn't aware of any usage of the allocation functions in the mpi_ops definitions (seems it is only ReduceScatter that does it).

True, it felt a bit unconventional to rework it like that in #3824, but also more straightforward than exposing ReducescatterOp::ComputeOutputShapeForRank to Python (or duplicating the logic there).

@romerojosh (Collaborator, Author)

Alright, pushed changes to apply @EnricoMi's suggested whitespace fixes. I also added a CUDA driver link to the CMake build, which should hopefully resolve the build/run issues.

The HVD_GPU_DRIVER_CHECK macro I added for ROCm is just best-effort since I do not have a system to test this on and the CI doesn't test ROCm builds.
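
For context, driver-API check macros of this kind usually follow a shape like the sketch below. This is only an illustration of the general pattern, not the HVD_GPU_DRIVER_CHECK definition from this PR; a ROCm/HIP variant would use the corresponding HIP error APIs instead.

// Illustrative error-check macro for CUDA driver API calls; the name and
// error handling here are placeholders, not the actual Horovod macro.
#include <cuda.h>
#include <stdexcept>
#include <string>

#define DRIVER_CHECK_SKETCH(call)                                            \
  do {                                                                       \
    CUresult res_ = (call);                                                  \
    if (res_ != CUDA_SUCCESS) {                                              \
      const char* msg_ = nullptr;                                            \
      cuGetErrorString(res_, &msg_);  /* human-readable driver error text */ \
      throw std::logic_error(std::string(#call " failed: ") +                \
                             (msg_ != nullptr ? msg_ : "unknown error"));    \
    }                                                                        \
  } while (0)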

@maxhgerlach (Collaborator)

Jenkins passed! 👍

The docker-compose fix from #3913 seems to be picked up already (maybe because this is a fork PR and the GitHub CI pipeline runs on master config?), but not the docs build fix. I think the failing readthedocs build should not block the Buildkite GPU pipelines, so this might run through without rebasing to master first.

@maxhgerlach (Collaborator) left a comment

LGTM, provided that the GPU test pipeline passes.

romerojosh added 6 commits May 5, 2023 14:26
… cudaSetDevice calls.

Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh force-pushed the context_fix_cuda12 branch from 78b42b8 to 739e4ec on May 5, 2023 at 21:29
@maxhgerlach (Collaborator) commented May 8, 2023

Hmm, the method using cuCtxGetDevice apparently doesn't work when one tries to use a CUDA-aware Horovod binary in a CPU-only context. In the CI, CPU-running tests for the GPU-aware docker image fail with ImportError: libcuda.so.1: cannot open shared object file: No such file or directory: https://github.com/horovod/horovod/actions/runs/4897708229/jobs/8747477066?pr=3912

I think that's not an uncommon use case. Would it be feasible to only load the driver library dynamically if a GPU device is available at run time?

@romerojosh (Collaborator, Author)

Ok. I think the way around this is to load the driver library and functions at runtime using dlopen/dlsym. I will implement this.
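
A minimal sketch of that approach, with hypothetical function and variable names (only the dlopen/dlsym mechanism and the cuCtxGetDevice signature come from the discussion above):

// Resolve cuCtxGetDevice at runtime so the extension does not link against
// libcuda.so.1 and still imports cleanly on CPU-only machines.
#include <dlfcn.h>
#include <cuda.h>  // only for the CUresult/CUdevice typedefs

using cuCtxGetDevice_t = CUresult (*)(CUdevice*);

cuCtxGetDevice_t LoadCuCtxGetDevice() {
  // Returns nullptr when no driver library is present; callers then skip the
  // context check and fall back to the previous behavior.
  void* handle = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_LOCAL);
  if (handle == nullptr) {
    return nullptr;
  }
  return reinterpret_cast<cuCtxGetDevice_t>(dlsym(handle, "cuCtxGetDevice"));
}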

romerojosh added 2 commits May 8, 2023 10:54
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh force-pushed the context_fix_cuda12 branch from 4a6dc6c to a0a88fe on May 8, 2023 at 21:55
@maxhgerlach (Collaborator)

Thanks for the update @romerojosh! Looks like this works great. I'll go forward and merge.

@maxhgerlach merged commit 004fd0d into horovod:master on May 9, 2023