Fix ROCm 6.1 issues #1670
Conversation
Force-pushed from e2f332a to 9c82da0.
auto other = __shfl_xor_sync(config::full_lane_mask, els[i],
                             num_threads / 2);
Can we not use tile.shfl_xor for CUDA?
We could, but that means we'd have to add tile again, which seems to be part of the reason why the compiler generates bad code in the first place, so I'd prefer using the same level of abstraction in both places.
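For context, a self-contained sketch of the pattern the diff below uses; the guard macro, the helper name, and the config::full_lane_mask definition are assumptions for illustration, not Ginkgo's actual code:

```cpp
#include <cstdint>

namespace config {
// assumption: mask selecting all 32 lanes of the warp
constexpr std::uint32_t full_lane_mask = 0xffffffffu;
}

// Exchange a value with the lane whose id is the caller's lane id XOR
// (num_threads / 2), using the raw shuffle intrinsic on both backends.
template <int num_threads>
__device__ int shuffle_xor_partner(int value)
{
#ifdef __HIPCC__
    // HIP/ROCm: no mask parameter, the active wavefront participates.
    return __shfl_xor(value, num_threads / 2);
#else
    // CUDA: the *_sync intrinsic requires an explicit participation mask.
    return __shfl_xor_sync(config::full_lane_mask, value, num_threads / 2);
#endif
}
```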
auto other = __shfl_xor_sync(config::full_lane_mask, els[i],
                             num_threads / 2);
#else
auto other = __shfl_xor(els[i], num_threads / 2);
This change should be the same as using tiled_partition<config::warp_size>.shfl_xor, not <num_threads>. It will be similar to SYCL now, because SYCL does not support subwarp operations and I need to perform them by assuming they always operate on the full warp (which is true, though).
Yes, that is deliberate: since the full warp is always running this code, there should be no effect from the smaller width. I'll check if there is any performance difference, though.
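The reasoning behind the equivalence: as long as every lane of the warp executes the shuffle, XOR-ing a lane id with a mask smaller than num_threads can never leave the aligned group of num_threads lanes, so a full-warp shuffle returns the same partner value as a sub-warp one. A sketch using CUDA's cooperative groups directly (rather than Ginkgo's own group:: wrappers), with the helper name and a warp size of 32 assumed:

```cpp
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

template <int num_threads>
__device__ int shuffle_xor_partner_cg(int value)
{
    // Full-warp tile (assuming config::warp_size == 32): every lane of the
    // warp participates in the shuffle.
    auto warp = cg::tiled_partition<32>(cg::this_thread_block());
    // num_threads / 2 < num_threads, so the XOR-ed partner lane lies in the
    // same aligned group of num_threads lanes; the result matches what a
    // tiled_partition<num_threads> shuffle would return, provided the whole
    // warp reaches this point.
    return warp.shfl_xor(value, num_threads / 2);
}
```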
To make the diff smaller, I reverted it back to using num_threads
Sorry, my first comment might be confusing. I thought you could change it to tiled_partition<config::warp_size> everywhere. With the current version, you can still use num_threads as the width. So is the segfault caused by something in the cooperative groups class, not by the compiler compiling the sub-warp shuffle wrongly?
You can change it to that, but since we know this is a smaller shuffle, I thought we could potentially give the hardware a hint for faster execution (just a guess, though). The segfault comes from a weird combination of the member variables of thread_block_tile (which should be optimized away, since they are unused) and the variable-size shuffle width. But compiler bugs like this are very fragile and can disappear randomly, so I wanted to remove as many of the things I saw triggering the bug as possible.
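The "hint" would presumably be the optional width argument of the shuffle intrinsics, which confines the operation to aligned groups of that many lanes; whether it actually helps is hardware-dependent. A sketch, again with an assumed guard macro and helper name:

```cpp
// Same exchange as in the earlier sketch, but passing num_threads as the
// optional width argument so the shuffle is restricted to aligned groups
// of num_threads lanes.
template <int num_threads>
__device__ int shuffle_xor_partner_hinted(int value)
{
#ifdef __HIPCC__
    return __shfl_xor(value, num_threads / 2, num_threads);
#else
    return __shfl_xor_sync(0xffffffffu, value, num_threads / 2, num_threads);
#endif
}
```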
There is some weird interaction between the inlining of shfl_xor and the (otherwise unused) members of thread_block_tile, which causes some kernels in ROCm debug builds to fail. The easiest way of working around it is to inline them explicitly as __shfl_xor(_sync).
Force-pushed from 9c82da0 to acb4ccc.
This fixes some issues with assertions and the bitonic sorting kernels on ROCm 6.x. Related PR: ginkgo-project#1670
The CSR lookup tests and all usages of the lookup functionality fail with ROCm 6.1 for some reason. These workarounds were only necessary in older ROCm versions anyway.
TODO: