WIP: Cheaper threadfence by alex-breslow-amd · Pull Request #1765 · ROCm/rccl · GitHub

WIP: Cheaper threadfence #1765


Open: alex-breslow-amd wants to merge 1 commit into base: develop


alex-breslow-amd
Contributor

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?
Removes the cache invalidate instruction that would otherwise be part of __threadfence (device scope). This speeds up small sizes. I want to run the CI tests on it to see if it's safe.
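For readers less familiar with the distinction, here is a rough host-side C++ analogy, not the code in this PR. It assumes the dropped cache invalidate plays roughly the role of the acquire half of a full fence, so the producer only needs a release fence before publishing. The names `payload`, `ready`, `produce`, and `consume` are hypothetical and exist only for illustration.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration; none of these come from the PR.
static int payload = 0;
static std::atomic<int> ready{0};

// Producer: write the data, then raise a flag.
// A release fence orders the earlier write before the flag store,
// without the acquire half that a full (seq_cst) fence would also impose.
void produce(int value) {
    payload = value;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);
}

// Consumer: the acquire side lives here, where stale data must be avoided.
int consume() {
    while (ready.load(std::memory_order_relaxed) == 0) {
        std::this_thread::yield();  // wait until the flag is raised
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}
```

In real use `produce` and `consume` would run on different threads. Whether the same reasoning holds for every caller of the device-scope fence in RCCL is exactly what the CI run is meant to check.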

Why were the changes made?
Speed

How was the outcome achieved?
See above

Additional Documentation:
In progress, ignore for now. Just noodling on this.

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

Better comment

Make asm volatile to avoid reordering by compiler
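As a general illustration of the second commit's idea (not the actual inline assembly in this PR), a minimal GCC/Clang sketch of an `asm volatile` statement with a `"memory"` clobber might look like the following. The helper names `compiler_barrier` and `publish` are hypothetical.

```cpp
// Hypothetical helper, not taken from the PR. GCC/Clang syntax.
// `volatile` keeps the compiler from deleting the asm statement, and the
// "memory" clobber tells it that memory may be read or written here, so
// loads and stores are not moved across this point by the compiler.
static inline void compiler_barrier() {
    asm volatile("" ::: "memory");
}

// Writes the data before the flag, with the barrier in between so the
// compiler keeps the two stores in program order.
int publish(int* data, int* flag, int value) {
    *data = value;
    compiler_barrier();
    *flag = 1;
    return *data;
}
```

This only constrains the compiler. Ordering as seen by other threads or devices still depends on the hardware fence instruction placed inside the asm.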