Description
Summary:
Currently, reduce-then-scan kernels in oneDPL are reliant on underlying hardware supporting sub-group sizes of 32 (and 16 for some special cases). We enforce this through the usage of the sycl::reqd_sub_group_size
attribute and a fallback to the legacy scan-then-propagate implementation if sub-group sizes of 32 are not present.
Problem Statement:
Our reduce-then-scan implementation should ideally be able to support hardware with arbitrary sub-group sizes without sacrificing performance. This would enable us to use reduce-then-scan across all GPU accelerators instead of relying on our legacy implementation. It would also allow us to gracefully handle the scenarios which require special workarounds due to sub-group hardware bugs such as #2133.
Additional Context:
The decision to require a sub-group size of 32 was made to be able to use this sub-group size as a compile time constant throughout the reduce-then-scan implementation as opposed to relying on the sub_group
class' get_local_range
runtime query. Brief experimentation was performed during initial development with removing this sub-group size requirement and relying on run-time sub-group size queries, but large performance hits were observed.
The critical portion of code that is affected by this change is the sub-group level scan:
. I had experimented with replacing the sub-group size integral template with a run-time sub-group size variable and enabling partial loop unrolling, but I saw large performance hits. We should take a deeper dive into this issue and what can be done to resolve it.