Closed
Description
Describe the bug
TritonAMDGPULowerInstructionSchedHints::createLocalPrefetchSchedule() assumes that the IR has the following cluster order:
// Prefetch Schema cluster order and staging.
// for i in (...):
// local_stores: stage=i+1
// global_loads: stage=i+2
// compute: stage=i
// local_load: stage=i+1
// tail: stage=i
but TritonAMDGPUReorderInstructionsPass::scheduleGlobalLoadLocalStore() reorders the clusters so that global_loads comes before local_stores:
// for i in (...):
// global_loads: stage=i+2
// local_stores: stage=i+1
// compute: stage=i
// local_load: stage=i+1
// tail: stage=i
As a result, TritonAMDGPULowerInstructionSchedHints::createLocalPrefetchSchedule() no longer works.
Triton inserts a sync/barrier before local_store, so global_loads, local_stores, and compute end up on different sides of that barrier, and the sched.group masks stop working.
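To make the mismatch concrete, here is a minimal toy model in Python. It is only a sketch of the ordering problem described above: the cluster names are taken from the comments in this report, but `insert_barriers`, `regions`, and `same_region` are hypothetical helpers, not Triton APIs, and the model assumes (as stated above) that a barrier is placed immediately before local_stores.

```python
# Toy model only: these helpers are illustrative and are NOT Triton APIs.

# Cluster order that createLocalPrefetchSchedule() expects (per this report).
EXPECTED_ORDER = ["local_stores", "global_loads", "compute", "local_load", "tail"]
# Cluster order after scheduleGlobalLoadLocalStore() (per this report).
REORDERED = ["global_loads", "local_stores", "compute", "local_load", "tail"]


def insert_barriers(clusters):
    """Assumption: a barrier is placed right before each local_stores cluster."""
    out = []
    for cluster in clusters:
        if cluster == "local_stores":
            out.append("barrier")
        out.append(cluster)
    return out


def regions(clusters):
    """Split a loop body into barrier-delimited regions."""
    result, current = [], []
    for cluster in clusters:
        if cluster == "barrier":
            result.append(current)
            current = []
        else:
            current.append(cluster)
    result.append(current)
    return result


def same_region(clusters, names):
    """True if all named clusters end up in one barrier-free region."""
    body = insert_barriers(clusters)
    return any(all(name in region for name in names) for region in regions(body))


interleaved = ["global_loads", "local_stores", "compute"]
print(same_region(EXPECTED_ORDER, interleaved))
print(same_region(REORDERED, interleaved))
```

In this model the expected order keeps global_loads, local_stores, and compute in a single barrier-free region, while the reordered body leaves global_loads on the other side of the barrier, which is the situation the sched.group masks cannot handle.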
Environment details
Latest Triton source (tip of tree).