Insights: triton-lang/triton
Overview
60 Pull requests merged by 31 people
-
[Proton][AMD] Fix peak TB/s and support gfx950 specs
#7175 merged
Jun 24, 2025 -
[NFC] Move getTiedArgs into TritonGPU utils
#7277 merged
Jun 23, 2025 -
[Tutorial] Fix `06-fused-attention.py` of FP8 provider
#7043 merged
Jun 23, 2025 -
[Hopper][WS] Update pipeline to get GEMM/FA working
#7136 merged
Jun 23, 2025 -
[AMD] Added a canonicalizer to ConcatOp
#7273 merged
Jun 23, 2025 -
[AMD] Support splatted scale in MFMA
#7270 merged
Jun 23, 2025 -
[AMD][BACKEND] Do not pipeline via AsyncCopyGlobalToLocal if the load width is less than 32bit
#7250 merged
Jun 23, 2025 -
Improve detection of loop carries in triton frontend
#7200 merged
Jun 23, 2025 -
[BACKEND] Add a new pass to insert fence.proxy.async for write after read hazard
#7262 merged
Jun 22, 2025 -
[AMD] NFC: Tidy up FP8 variant support cases
#7267 merged
Jun 22, 2025 -
[Backend] Bump to llvm/llvm-project@570885128351
#7266 merged
Jun 22, 2025 -
[IR] tune the rematerialization heuristic to avoid harmful rematerialization
#7240 merged
Jun 21, 2025 -
[AMD] Fix pointer canonicalizer when propagating discardable attrs
#7242 merged
Jun 21, 2025 -
[PROTON-DEV] Fix build issues
#7257 merged
Jun 20, 2025 -
[ConSan] Concurrency Sanitizer - initial scaffolding and introduction of TritonInstrument dialect
#7157 merged
Jun 20, 2025 -
[Backend] Add a shared layout for padding
#7212 merged
Jun 20, 2025 -
[FRONTEND] Remove hardcoded warp size
#7253 merged
Jun 20, 2025 -
[AMD] Rewrite extract_slice op implementation
#7128 merged
Jun 20, 2025 -
[Backend] Assert that num threads is always power of 2 (NFC)
#7251 merged
Jun 20, 2025 -
[PROTON-DEV] Add Sched Barrier Pass To Prevent Instruction Reordering Outside Proton Record Regions
#7180 merged
Jun 20, 2025 -
[Backend] Bump to llvm/llvm-project@1b83f10072b322a206ffcaf737b42fe5c2d95b89
#7252 merged
Jun 20, 2025 -
[Blackwell] Fix codegen for `tmem_load` of Nx1xf32
#7234 merged
Jun 20, 2025 -
[BACKEND] hint to LLVM that we can bound threadIdx.x
#7249 merged
Jun 20, 2025 -
[BACKEND] Share ld/st.shared lowering between convert_layout and local_load/store
#7248 merged
Jun 20, 2025 -
[gluon] fix lint
#7246 merged
Jun 20, 2025 -
[PROTON-DEV] Support long clock for long-running kernels
#7228 merged
Jun 20, 2025 -
[gluon] fix some AMD compilation issues + skip tests on AMD for now
#7215 merged
Jun 19, 2025 -
Partially Revert "[LAYOUTS] Enable diagonal iteration unconditionally (#7218)"
#7245 merged
Jun 19, 2025 -
[KERNELS] move back to using host-side TMA for gathers
#7237 merged
Jun 19, 2025 -
[NFC][BACKEND] Rewrite convert_layout in a more functional way
#7241 merged
Jun 19, 2025 -
[NFC] Add is_hopper helper and rename is_hopper -> is_hopper_or_newer
#7224 merged
Jun 19, 2025 -
[AMD]Enable a few tests on HIP
#7236 merged
Jun 19, 2025 -
[LAYOUTS] Enable diagonal iteration unconditionally
#7218 merged
Jun 19, 2025 -
[kernels] revert bias subtiling changes
#7232 merged
Jun 18, 2025 -
[LAYOUTS] Enable generic swizzling on AMD
#7225 merged
Jun 18, 2025 -
[Triton] Clean up unused/old env vars (NFC)
#7229 merged
Jun 18, 2025 -
[Gluon] Fix linear layout MLIR->Python; fix CTA layout equality
#7230 merged
Jun 18, 2025 -
[Gluon][TTNG] Add async_copy ops including mbarrier arrive op
#7220 merged
Jun 18, 2025 -
[Gluon][Tutorial] Merge d64 and d128 attn kernels
#7226 merged
Jun 18, 2025 -
[Warp Specialization] Fix iterator invalidation
#7223 merged
Jun 18, 2025 -
[NVIDIA] L2 cache hints only for sm >= 80
#7219 merged
Jun 18, 2025 -
[BACKEND] Move lowering of CF as the last step of conversion to LLVM
#7213 merged
Jun 18, 2025 -
[KERNELS] Skip idle_sms on AMD
#7217 merged
Jun 18, 2025 -
[Blackwell] Fix `tmem_subslice` lowering for packed sub-32B layouts
#7207 merged
Jun 18, 2025 -
[KERNELS] no longer enforce persistent when is used
#7214 merged
Jun 18, 2025 -
[KERNELS] fix handling of `opt_flags.idle_sms`
#7211 merged
Jun 18, 2025 -
[KERNELS] added option and test to set idle sms in matmul_ogs
#7210 merged
Jun 18, 2025 -
[PROTON-DEV] Add SamplingStrategy::SELECTIVE for instrumentation
#7208 merged
Jun 18, 2025 -
[kernels] moved `reinterpret` to before tma creation
#7205 merged
Jun 18, 2025 -
Fix out-of-bounds load in mxfp_matmul test kernel.
#7193 merged
Jun 17, 2025 -
[KERNELS] Fix bf16 x mxfp4 when EVEN_K is False
#7203 merged
Jun 17, 2025 -
[AMD][gfx12] WMMA AMD16x16x32 support for i4 operands
#7012 merged
Jun 17, 2025 -
[Gluon] Implement attention kernels for d64 and d128
#7009 merged
Jun 17, 2025 -
[kernels] use more host TMA for X, W, Mx in persistent matmul
#7182 merged
Jun 17, 2025 -
[Tutorial] Improve dhead=128 ws performance for attention
#7195 merged
Jun 17, 2025 -
[Pipeliner] Fix backward scheduling over `ttg.local_load`
#7194 merged
Jun 17, 2025 -
[Bench][AMD] Fix torch ref routing and enable CI
#7183 merged
Jun 17, 2025 -
[BACKEND] Implement generic swizzling when lowering `convert_layout`
#6982 merged
Jun 17, 2025 -
[BACKEND] simpler codegen for linear layouts
#7201 merged
Jun 17, 2025 -
[BACKEND] Workaround for ptxas bug in matrix descriptor arithmetic
#7197 merged
Jun 17, 2025
21 Pull requests opened by 16 people
-
[AMD] expose core pipeliner utilities and integrate in AMD pipeliner
#7222 opened
Jun 18, 2025 -
[wip] Logging debug info before async ops
#7231 opened
Jun 18, 2025 -
[Gluon][Tutorial] Optimize attention kernel
#7238 opened
Jun 19, 2025 -
Allow customization of the subscript operator for triton values
#7239 opened
Jun 19, 2025 -
[Layouts] Infer slice encoding for SplitOp result
#7247 opened
Jun 20, 2025 -
[Backend] Bump to llvm/llvm-project@570885128351
#7254 opened
Jun 20, 2025 -
Notes from 2025-03-12 community meetup
#7255 opened
Jun 20, 2025 -
Notes from 2025-05-01 community meetup
#7256 opened
Jun 20, 2025 -
[PROTON] Intra kernel profiling
#7258 opened
Jun 20, 2025 -
[KERNELS] some matmul refactoring
#7259 opened
Jun 21, 2025 -
[AMD] Use permlanex16 for shuffleXor on rdna
#7269 opened
Jun 23, 2025 -
[Frontend] Fix scope enter to do a deep copy of scopes
#7271 opened
Jun 23, 2025 -
[IR] Avoid rematerialization for non-associative reduce op
#7272 opened
Jun 23, 2025 -
[AMD] Loosed constraints for MemDescSubviewOp
#7274 opened
Jun 23, 2025 -
[TMA] Correctly get TMA Block Shape for SwizzledShared Blocks
#7275 opened
Jun 23, 2025 -
Updated CMakeLists.txt to install headers and the triton library
#7276 opened
Jun 23, 2025 -
[Warp Specialization] Fix WAR async+generic proxy for warp spec
#7278 opened
Jun 23, 2025 -
[README] Mention `make dev-install-llvm` for custom LLVM build
#7279 opened
Jun 23, 2025 -
[AMD] guard FoldTrueCmpI from tensors
#7281 opened
Jun 23, 2025 -
[AMD] Implement `tl.extra.hip.memrealtime` for timing
#7282 opened
Jun 23, 2025 -
[WIP!] [AMD] Add tilesPerWarp parameter to mfma layout
#7283 opened
Jun 23, 2025
5 Issues closed by 5 people
-
Problems building `triton` v3.2.0 in offline mode
#6919 closed
Jun 23, 2025 -
Reduction is duplicated in TTIR -> TTGIR with num_stages>1 causing strange inconsistencies
#6647 closed
Jun 21, 2025 -
AsyncCopyGlobalToLocalOpConversion::matchAndRewrite failure
#7243 closed
Jun 19, 2025 -
[AMD] Fix redundant data masking computations for stores
#5496 closed
Jun 19, 2025
10 Issues opened by 10 people
-
AOT Type Hint for Tensor / Block / Block Ptr
#7280 opened
Jun 23, 2025 -
Higher shared_memory usage in Triton 3.3
#7268 opened
Jun 23, 2025 -
Add support for installation of header files and built artifacts
#7265 opened
Jun 22, 2025 -
Butterfly shuffles in reductions trigger racecheck because they are not numerically stable
#7264 opened
Jun 22, 2025 -
Which Triton version support 2080Ti, P100 and MI50?
#7263 opened
Jun 22, 2025 -
Large Grid Size Triggers Kernel No-Op
#7260 opened
Jun 21, 2025 -
ICE "llvm::SmallVectorTemplateCommon<long> Assertion `idx < size()' failed"
#7244 opened
Jun 19, 2025 -
Why fused attn tutorial cannot pass bwd testop?
#7216 opened
Jun 18, 2025 -
If we perform a load without executing a matmul operation, the memory access won’t be coalesced.
#7202 opened
Jun 17, 2025 -
AMD/MI300X performance is lacking compared to torch.matmul
#7199 opened
Jun 17, 2025
17 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[AMD] Add HIP AOT support to compile.py tool
#7007 commented on
Jun 23, 2025 • 20 new comments -
Fix: PY_SSIZE_T_CLEAN macro must be defined for '#' formats
#6928 commented on
Jun 19, 2025 • 2 new comments -
[do_bench] synchronize before first function call
#7124 commented on
Jun 18, 2025 • 0 new comments -
WIP: Use variadic argument cuda launcher
#6788 commented on
Jun 23, 2025 • 0 new comments -
[AMD] Use decomposed path for scaled dot software emulation
#6337 commented on
Jun 22, 2025 • 0 new comments -
[Proton][Dialect] Middle-end support of the Proton Dialect and the frontend Python package
#5677 commented on
Jun 20, 2025 • 0 new comments -
[Proton][Dialect] Add Proton Device Memory Buffer Init and Allocate Pass
#5606 commented on
Jun 20, 2025 • 0 new comments -
unable to build triton
#7088 commented on
Jun 23, 2025 • 0 new comments -
Adding Metal Backend to Triton
#4824 commented on
Jun 23, 2025 • 0 new comments -
Unsupported DotOp found when converting TritonGPU to LLVM
#6951 commented on
Jun 22, 2025 • 0 new comments -
3D tensor can't sum
#6039 commented on
Jun 21, 2025 • 0 new comments -
Optimizing Shared Memory Usage
#4756 commented on
Jun 20, 2025 • 0 new comments -
Triton 3.1.0 failed with a simple tl.dot and then tl.store example
#5557 commented on
Jun 19, 2025 • 0 new comments -
Microscaling dtypes in triton?
#6054 commented on
Jun 18, 2025 • 0 new comments -
Remove setuptools requirement
#7192 commented on
Jun 18, 2025 • 0 new comments -
int4 support
#675 commented on
Jun 17, 2025 • 0 new comments -
Triton 3.3 Performance Regression on Small Gemms
#7096 commented on
Jun 17, 2025 • 0 new comments