8000 Fix ROCm 6.1 issues by upsj · Pull Request #1670 · ginkgo-project/ginkgo · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Fix ROCm 6.1 issues #1670

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Aug 25, 2024
Merged

Fix ROCm 6.1 issues #1670

merged 3 commits into from
Aug 25, 2024

Conversation

upsj
Copy link
Member
@upsj upsj commented Aug 21, 2024

The CSR lookup tests and all usages of the lookup functionality fails with ROCm 6.1 for some reason. These workarounds were only necessary in older ROCm versions anyways.

TODO:

  • Fix sorting segfault
  • Fix ParILUT segfault (probably related to the sorting segfault)
  • Fix searching hang (not a hang, just a slow test)

@upsj upsj self-assigned this Aug 21, 2024
@ginkgo-bot ginkgo-bot added the mod:core This is related to the core module. label Aug 21, 2024
@upsj upsj changed the title Fix ROCm assertion issues Fix ROCm 6.1 issues Aug 21, 2024
@upsj upsj marked this pull request as draft August 21, 2024 14:23
@upsj upsj force-pushed the fix_rocm_assertions branch from e2f332a to 9c82da0 Compare August 24, 2024 08:51
@upsj upsj requested a review from MarcelKoch August 24, 2024 08:51
@upsj upsj marked this pull request as ready for review August 24, 2024 08:51
@upsj upsj requested a review from a team August 24, 2024 08:51
Comment on lines 120 to 121
auto other = __shfl_xor_sync(config::full_lane_mask, els[i],
num_threads / 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not use tile.shfl_xor for CUDA ?

Copy link
Member Author
@upsj upsj Aug 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could, but that means we'd have to add tile again, which seems to be part of the reason why the compiler generates bad code in the first place, so I'd prefer using the same level of abstraction in both places.

auto other = __shfl_xor_sync(config::full_lane_mask, els[i],
num_threads / 2);
#else
auto other = __shfl_xor(els[i], num_threads / 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this changes should be the same as use tiled_partition<config::warp_size>.shlf_xor not <num_thread>
It will be similar to sycl now because sycl does not support subwarp operation and I need to perform it by assuming they are always perform full warp (which is true though)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that is deliberate, since the full warp is always running this code, so there should be no effect of the smaller width. I'll check if there is any performance difference though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the diff smaller, I reverted it back to using num_threads

Copy link
Member
@yhmtsai yhmtsai Aug 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry. my first comment might be confusing.
I thought you can change it to tile_partitionconfig::warp_size for all.
With the current version, you can use the num_threads in width. so is the segfault from something in cooperative group class? not from compiler compiles sub_warp wrongly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can change it to that, but I thought since we know this is a smaller shuffle, we can potentially give the hardware a hint for faster execution, just a guess though. The segfault comes from a weird combination of the member variables of thread_block_tile (which should be optimized away since they are unused) and the variable-size shuffle width. But compiler bugs like this are very fragile and can disappear randomly, so I wanted to remove as many things that I saw trigger the bug as possible

upsj added 3 commits August 25, 2024 16:56
This causes some kernels on ROCm debug builds to fail
There is some weird interaction between inlining of shfl_xor and the
(otherwise unused) members of thread_block_tile.
The easiest way of working around it is to inline them explicitly as
__shfl_xor(_sync).
@upsj upsj force-pushed the fix_rocm_assertions branch from 9c82da0 to acb4ccc Compare August 25, 2024 14:56
@upsj upsj requested a review from yhmtsai August 25, 2024 14:59
@upsj upsj added the 1:ST:ready-to-merge This PR is ready to merge. label Aug 25, 2024
@upsj upsj merged commit c09529f into develop Aug 25, 2024
10 of 14 checks passed
@upsj upsj deleted the fix_rocm_assertions branch August 25, 2024 19:01
MarcelKoch pushed a commit to MarcelKoch/ginkgo that referenced this pull request Dec 2, 2024
This fixes some issues with assertions and the bitonic sorting kernels on ROCm 6.x

Related PR: ginkgo-project#1670
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1:ST:ready-to-merge This PR is ready to merge. mod:core This is related to the core module.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0