Fix ROCm 6.1 issues #1670
Conversation
Force-pushed from e2f332a to 9c82da0.
auto other = __shfl_xor_sync(config::full_lane_mask, els[i],
                             num_threads / 2);
Can we not use tile.shfl_xor for CUDA?
We could, but that means we'd have to add tile again, which seems to be part of the reason why the compiler generates bad code in the first place, so I'd prefer using the same level of abstraction in both places.
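For context, a self-contained sketch of the pattern the diff below uses; the guard macro, the helper name, and the config::full_lane_mask definition are assumptions for illustration, not Ginkgo's actual code:

```cpp
#include <cstdint>

namespace config {
// assumption: mask selecting all 32 lanes of the warp
constexpr std::uint32_t full_lane_mask = 0xffffffffu;
}

// Exchange a value with the lane whose id is the caller's lane id XOR
// (num_threads / 2), using the raw shuffle intrinsic on both backends.
template <int num_threads>
__device__ int shuffle_xor_partner(int value)
{
#ifdef __HIPCC__
    // HIP/ROCm: no mask parameter, the active wavefront participates.
    return __shfl_xor(value, num_threads / 2);
#else
    // CUDA: the *_sync intrinsic requires an explicit participation mask.
    return __shfl_xor_sync(config::full_lane_mask, value, num_threads / 2);
#endif
}
```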
auto other = __shfl_xor_sync(config::full_lane_mask, els[i],
                             num_threads / 2);
#else
auto other = __shfl_xor(els[i], num_threads / 2);
This change should be the same as using tiled_partition<config::warp_size>.shfl_xor, not <num_threads>. It will be similar to SYCL now, because SYCL does not support subwarp operations and I need to perform them by assuming they always operate on the full warp (which is true, though).
Yes, that is deliberate: since the full warp is always running this code, there should be no effect from the smaller width. I'll check if there is any performance difference, though.
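The reasoning behind the equivalence: as long as every lane of the warp executes the shuffle, XOR-ing a lane id with a mask smaller than num_threads can never leave the aligned group of num_threads lanes, so a full-warp shuffle returns the same partner value as a sub-warp one. A sketch using CUDA's cooperative groups directly (rather than Ginkgo's own group:: wrappers), with the helper name and a warp size of 32 assumed:

```cpp
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

template <int num_threads>
__device__ int shuffle_xor_partner_cg(int value)
{
    // Full-warp tile (assuming config::warp_size == 32): every lane of the
    // warp participates in the shuffle.
    auto warp = cg::tiled_partition<32>(cg::this_thread_block());
    // num_threads / 2 < num_threads, so the XOR-ed partner lane lies in the
    // same aligned group of num_threads lanes; the result matches what a
    // tiled_partition<num_threads> shuffle would return, provided the whole
    // warp reaches this point.
    return warp.shfl_xor(value, num_threads / 2);
}
```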
To make the diff smaller, I reverted it back to using num_threads
Sorry, my first comment might be confusing. I thought you could change it to tiled_partition<config::warp_size> everywhere. With the current version, you can still use num_threads as the width. So is the segfault caused by something in the cooperative groups class, not by the compiler compiling the sub-warp shuffle wrongly?
You can change it to that, but since we know this is a smaller shuffle, I thought we could potentially give the hardware a hint for faster execution (just a guess, though). The segfault comes from a weird combination of the member variables of thread_block_tile (which should be optimized away, since they are unused) and the variable-size shuffle width. But compiler bugs like this are very fragile and can disappear randomly, so I wanted to remove as many of the things I saw triggering the bug as possible.
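The "hint" would presumably be the optional width argument of the shuffle intrinsics, which confines the operation to aligned groups of that many lanes; whether it actually helps is hardware-dependent. A sketch, again with an assumed guard macro and helper name:

```cpp
// Same exchange as in the earlier sketch, but passing num_threads as the
// optional width argument so the shuffle is restricted to aligned groups
// of num_threads lanes.
template <int num_threads>
__device__ int shuffle_xor_partner_hinted(int value)
{
#ifdef __HIPCC__
    return __shfl_xor(value, num_threads / 2, num_threads);
#else
    return __shfl_xor_sync(0xffffffffu, value, num_threads / 2, num_threads);
#endif
}
```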
There is some weird interaction between the inlining of shfl_xor and the (otherwise unused) members of thread_block_tile, which causes some kernels in ROCm debug builds to fail. The easiest way of working around it is to inline them explicitly as __shfl_xor(_sync).
Force-pushed from 9c82da0 to acb4ccc.
This fixes some issues with assertions and the bitonic sorting kernels on ROCm 6.x. Related PR: ginkgo-project#1670
The CSR lookup tests and all usages of the lookup functionality fail with ROCm 6.1 for some reason. These workarounds were only necessary in older ROCm versions anyway.
TODO: