Update with_device functions in MXNet and PyTorch to skip unnecessary cudaSetDevice calls #3912
Conversation
Unit Test Results (with flaky tests): 1 270 files (−37), 1 270 suites (−37), 15h 8m 22s ⏱️ (−30m 23s). For more details on these failures, see this check. Results for commit a0a88fe. ± Comparison against base commit 1ea0067. ♻️ This comment has been updated with latest results.
GitHub builds with docker-compose don't work right now, but the Jenkins ppc64le build ran into an actual linker error (CUDA 10.2.89):
minor whitespace suggestions
@romerojosh, the build should be fixed on master.
Hmm, the driver linking error is unfortunate. I thought maybe that would happen. I can try and fix the build to link in the driver libraries. Alternatively, there is a much simpler option here that doesn't require adding CUDA driver APIs at all. The current usage of this …
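For reference, the context check at the heart of this discussion needs the CUDA driver API, since the runtime API doesn't expose a query for whether the calling thread already has a context. A minimal hypothetical sketch (the helper name is made up, not the PR's code):

```cpp
#include <cuda.h>

// Hypothetical helper (not the PR's code): returns true if the calling
// thread already has a GPU context bound to it.
bool ThreadHasCurrentContext() {
  CUcontext ctx = nullptr;
  // cuCtxGetCurrent succeeds and sets ctx to NULL when no context has
  // been created for this thread yet.
  if (cuCtxGetCurrent(&ctx) != CUDA_SUCCESS) {
    return false;
  }
  return ctx != nullptr;
}
```

Calling any driver API like this is what introduces the link-time dependency on libcuda that tripped up the ppc64le build.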
I am not super familiar with these concerns. But apparently there are usages:
(1) horovod/horovod/torch/adapter_v2.cc, line 146 (in 1ea0067)
(2) horovod/horovod/torch/mpi_ops_v2.cc, line 764 (in 1ea0067)
(3) horovod/horovod/torch/mpi_ops_v2.cc, line 544 (in 1ea0067)
Ah, good catch. I wasn't aware of any usage of the allocation functions in the … (see horovod/horovod/mxnet/mpi_ops.py, lines 424 to 428 in 1ea0067), but apparently not. Ok, I'll move forward with the current implementation then.
True, it felt a bit unconventional to rework it like that in #3824, but also more straightforward than exposing …
Alright, pushed changes to apply @EnricoMi's suggested whitespace changes. I also added a CUDA driver link to the CMake build, which should hopefully resolve the build/run issues. The …
Jenkins passed! 👍 The …
LGTM, provided that the GPU test pipeline passes.
Hmm, the method using the CUDA driver API means libcuda has to be present at load time, which won't be the case on machines without a GPU driver installed. I think that's not an uncommon use case. Would it be feasible to only load the driver library dynamically if a GPU device is available at run time?
Ok. I think the way around this is to load the driver library and functions at runtime using `dlopen`/`dlsym`.
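A minimal sketch of that runtime-loading approach (illustrative only: the loader name and soname handling are assumptions, not the PR's actual code):

```cpp
#include <dlfcn.h>
#include <cuda.h>  // needed only for the CUresult/CUcontext types, not for linking

// Function pointer type matching the driver API's cuCtxGetCurrent.
using cuCtxGetCurrent_t = CUresult (*)(CUcontext*);

// Illustrative loader: returns the driver entry point, or nullptr when the
// driver library is absent (e.g. CPU-only machines). A nullptr lets the
// caller skip the context check instead of failing at load time.
cuCtxGetCurrent_t LoadCuCtxGetCurrent() {
  void* handle = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_LOCAL);
  if (handle == nullptr) {
    return nullptr;
  }
  return reinterpret_cast<cuCtxGetCurrent_t>(
      dlsym(handle, "cuCtxGetCurrent"));
}
```

Because nothing references libcuda symbols directly, the binary still loads on machines without a GPU driver, which addresses the concern above.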
Thanks for the update @romerojosh! Looks like this works great. I'll go ahead and merge.
Checklist before submitting
Description
In CUDA 12.x, calls to `cudaSetDevice` create GPU contexts. It was found that the `with_device` objects used in the MXNet and PyTorch code can execute `cudaSetDevice(0)` in the destructor if called before the Horovod background thread sets a device (e.g. before launching any GPU collectives). With the new CUDA 12.x behavior, this will generate spurious contexts for all ranks on GPU 0, unnecessarily consuming GPU memory.

This PR updates the behavior of `with_device` to set the `restore_device_` value to the `device` value supplied to the constructor if no current GPU context is associated with the thread.
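A minimal sketch of the described constructor/destructor logic (simplified: CPU-device handling and error checking are omitted, and the driver API is called directly rather than through a runtime-loaded pointer):

```cpp
#include <cuda.h>          // driver API: cuCtxGetCurrent
#include <cuda_runtime.h>  // runtime API: cudaGetDevice / cudaSetDevice

class with_device {
 public:
  explicit with_device(int device) {
    CUcontext ctx = nullptr;
    bool has_ctx =
        (cuCtxGetCurrent(&ctx) == CUDA_SUCCESS) && (ctx != nullptr);
    if (has_ctx) {
      // A context is already bound to this thread: remember the real
      // current device so the destructor can restore it.
      cudaGetDevice(&restore_device_);
    } else {
      // No context yet: cudaGetDevice would report device 0 by default,
      // and the destructor's cudaSetDevice(0) would then create a
      // spurious context on GPU 0 under CUDA 12.x. Restoring to `device`
      // instead avoids ever touching GPU 0 from this thread.
      restore_device_ = device;
    }
    cudaSetDevice(device);
  }
  ~with_device() { cudaSetDevice(restore_device_); }

 private:
  int restore_device_ = 0;
};
```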