[FEATURE] Add associative_scan support by ThomasRaoux · Pull Request #1858 · triton-lang/triton

[FEATURE] Add associative_scan support #1858


Merged · 2 commits merged into triton-lang:main on Jun 29, 2023

Conversation

ThomasRaoux (Collaborator):

Implement associative_scan in the front end and implement lowering to LLVM for the blocked layout where the scan happens along the fastest-moving dimension. This will later be generalized to support more layouts.
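For context, a minimal usage sketch of the new front-end API (assuming the tl.associative_scan(input, axis, combine_fn) signature exposed in later Triton releases; the kernel and tensor names are illustrative):

import torch
import triton
import triton.language as tl

# The combine function must be an associative binary op; addition gives a cumulative sum.
@triton.jit
def combine_add(a, b):
    return a + b

@triton.jit
def cumsum_kernel(x_ptr, y_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    # Inclusive scan along the fastest-moving (and only) dimension.
    y = tl.associative_scan(x, 0, combine_add)
    tl.store(y_ptr + offs, y)

x = torch.randn(1024, device="cuda")
y = torch.empty_like(x)
cumsum_kernel[(1,)](x, y, BLOCK=1024)
assert torch.allclose(y, torch.cumsum(x, 0), atol=1e-3)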

ThomasRaoux force-pushed the scan2 branch 2 times, most recently from 317dc4a to 8091ffb on June 29, 2023 06:36
ThomasRaoux marked this pull request as ready for review on June 29, 2023 14:56
Jokeren (Contributor) commented Jun 29, 2023:

Thanks! Will review soon

// Return the number of elements per thread along non-axis dims.
unsigned getNumParallelElementsPerThread();
// Return the number of threads per warp along non-axis dims.
unsigned getNumParrallelThreadsPerWarp();
Jokeren (Contributor):

ditto

// Return the number of threads per warp along non-axis dims.
unsigned getNumParrallelThreadsPerWarp();
// Return the flat numbers of threads computing independent scan results.
unsigned getNumParrallelThreadsPerCTA();
Jokeren (Contributor):

I'm not sure what it returns, judging from the function name.

ThomasRaoux (Collaborator, Author):

hopefully the comment is explicit enough?
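To make the intended semantics concrete, a small worked example of what these helpers presumably compute (a hypothetical blocked layout; the real logic lives in the C++ scan-lowering helper, so treat the numbers as an illustration, not a spec):

import math

# Hypothetical blocked layout for a 2-D tensor; the scan runs along axis = 1,
# the fastest-moving dimension handled by this PR.
threads_per_warp = [4, 8]
warps_per_cta = [2, 2]
axis = 1

# Along the scan axis: 8 lanes per warp and 2 warps cooperate on one scan.
axis_threads_per_warp = threads_per_warp[axis]                          # 8
axis_warps = warps_per_cta[axis]                                        # 2

# Along the remaining ("parallel") dims, each thread owns independent scan rows.
parallel_threads_per_warp = math.prod(
    n for d, n in enumerate(threads_per_warp) if d != axis)             # 4
parallel_warps = math.prod(
    n for d, n in enumerate(warps_per_cta) if d != axis)                # 2

# Flat number of threads computing independent scan results in the CTA.
parallel_threads_per_cta = parallel_threads_per_warp * parallel_warps   # 8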

// Return the flat numbers of threads computing independent scan results.
unsigned getNumParrallelThreadsPerCTA();
// Return the number of warps per CTA along axis dim.
unsigned getNumAxisWarps();
Jokeren (Contributor):

getAxisNumWarps?

// Return the number of threads per warp along axis dim.
unsigned getAxisNumThreadsPerWarp();
// Return the number of blocks along axis dim.
unsigned getNumAxisBlocks();
Jokeren (Contributor):

getAxisNumBlocks?

//
def TT_ScanOp: TT_Op<"scan",
[Pure,
SameOperandsEncoding,
Jokeren (Contributor):

Does it have SameOperandsAndResultEncoding and SameOperandsAndResultElementType?

ThomasRaoux (Collaborator, Author):

good point, added it.

return std::make_tuple(laneIdAxis, warpIdAxis, flatIdParallel);
}

// Naive lowering of the scan op as a fallback for cases that we don't know
Jokeren (Contributor):

I thought emitFastScan is already not a naive lowering, because it does use warp shuffles and is not a fallback.

ThomasRaoux (Collaborator, Author):

oops yes this comment was out of date.

// reduction into shared memory. Each parallel scan and each warp will store its
// own partial reductions. The shared memory is organized as follow:
// -----------------------------------------------------------------
// chunk 0: | scan 0 warp 0 | scan 1 warp 0 | scan 0 warp 1 | scan 1 warp 1 |
Jokeren (Contributor):

It's not clear to me what scan 1 and scan 0 are.
I get the idea though after reading the code

ThomasRaoux (Collaborator, Author):

those numbers are meant to be the non-axis dimension. I improved the comment a bit. Let me know if you think it could still be clarified.
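For readers following the thread, a conceptual sketch of the hierarchy the comment describes (plain NumPy, add-scan only; the per-warp partials array stands in for the shared-memory chunks, and the indexing is an assumption based on the comment above, not the actual LLVM lowering):

import numpy as np

def block_scan(x, lanes_per_warp):
    # Inclusive add-scan over the last axis of x. Each row of x is an independent
    # ("parallel") scan; the row is split into warps of lanes_per_warp elements:
    #   1. each warp scans its own slice,
    #   2. the per-warp totals (partial reductions) are gathered per row,
    #      which is what goes through shared memory in the real lowering,
    #   3. an exclusive scan of those totals gives each warp the prefix to add.
    num_scans, n = x.shape
    num_warps = n // lanes_per_warp
    out = np.empty_like(x)
    partials = np.empty((num_scans, num_warps), dtype=x.dtype)  # shared-memory stand-in
    for s in range(num_scans):
        for w in range(num_warps):
            sl = slice(w * lanes_per_warp, (w + 1) * lanes_per_warp)
            out[s, sl] = np.cumsum(x[s, sl])   # step 1: intra-warp scan
            partials[s, w] = out[s, sl][-1]    # step 2: warp total
    for s in range(num_scans):
        prefix = np.concatenate(([0], np.cumsum(partials[s])[:-1]))  # step 3
        for w in range(num_warps):
            sl = slice(w * lanes_per_warp, (w + 1) * lanes_per_warp)
            out[s, sl] += prefix[w]
    return out

x = np.arange(2 * 64, dtype=np.int64).reshape(2, 64)
assert np.array_equal(block_scan(x, lanes_per_warp=32), np.cumsum(x, axis=1))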

ThomasRaoux requested a review from Jokeren on June 29, 2023 20:40
ThomasRaoux merged commit 3be0608 into triton-lang:main on Jun 29, 2023
@@ -0,0 +1,15 @@
#ifndef TRITON_CONVERSION_TRITONGPU_TO_LLVM_SCAN_OP_H
Reviewer:

nit: use #pragma once

ThomasRaoux (Collaborator, Author):

This doesn't seem to be the convention followed in the Triton project.

pingzhuu pushed a commit to siliconflow/triton that referenced this pull request Apr 2, 2024