Fixed NCCL warning caused by barrier if using idist by vfdev-5 · Pull Request #2254 · pytorch/ignite · GitHub

Fixed NCCL warning caused by barrier if using idist #2254

Merged: vfdev-5 merged 1 commit into master from fix-2212-nccl-barrier-warning on Oct 9, 2021

@vfdev-5 (Collaborator) commented Oct 9, 2021

Fixes #2212

Description:

  • Fixed NCCL warning caused by barrier if using idist

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

github-actions bot added the "module: distributed" label Oct 9, 2021
@vfdev-5 force-pushed the fix-2212-nccl-barrier-warning branch from 7574bff to 8148cec October 9, 2021 21:06
@vfdev-5 requested a review from sdesrozis October 9, 2021 21:07
@sdesrozis (Contributor) left a comment

This is the perfect fix for the warning!

@vfdev-5 merged commit e810c08 into master Oct 9, 2021
@vfdev-5 deleted the fix-2212-nccl-barrier-warning branch October 9, 2021 21:42
@sdesrozis (Contributor) commented

IMO, this could be the default implementation of barrier for the native model, with regard to #2213.

@vfdev-5 (Collaborator, Author) commented Oct 9, 2021

I'm not entirely sure about this warning. I looked at detectron2 and took inspiration from its code. Before fixing it, I checked for the warning with PyTorch nightly and, surprisingly, there was no warning... As this is purely an NCCL thing, I'd avoid generalizing that.
I also think that the following code is partially responsible for the warning:

if torch.cuda.is_available():
    torch.cuda.set_device(self._local_rank)

which was called after the barrier...
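
For readers following the thread, here is a minimal sketch of the idea discussed above. It is not the code merged in this PR. It assumes a PyTorch version whose torch.distributed.barrier accepts the device_ids argument (valid only for the NCCL backend) and an already initialized process group. The helper names barrier_kwargs and setup_device_then_barrier are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict, Optional


def barrier_kwargs(backend: str, local_rank: Optional[int]) -> Dict[str, Any]:
    """Build keyword arguments for torch.distributed.barrier.

    device_ids is only meaningful for the NCCL backend, so any other
    backend (or an unknown local rank) gets an empty dict.
    """
    if backend == "nccl" and local_rank is not None:
        return {"device_ids": [local_rank]}
    return {}


def setup_device_then_barrier(local_rank: int) -> None:
    """Hypothetical helper: bind the process to its GPU, then synchronize.

    Assumes torch.distributed.init_process_group has already been called.
    """
    # Imported lazily so barrier_kwargs can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    # Select the device *before* the barrier, the reverse of the order
    # described in the comment above.
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)

    # Tell NCCL explicitly which device this rank uses for the barrier.
    dist.barrier(**barrier_kwargs(dist.get_backend(), local_rank))
```

Keeping the keyword selection in a separate pure function means non-NCCL backends such as gloo still get a plain barrier() call, which matches the wish above not to generalize an NCCL-specific workaround.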

Labels: module: distributed
Linked issue: Fix NCCL warning caused by barrier if using idist (#2212)