Pulse · triton-lang/triton · GitHub

June 16, 2025 – June 23, 2025

Overview

81 Active pull requests

15 Active issues

60 Pull requests merged by 31 people

[Proton][AMD] Fix peak TB/s and support gfx950 specs
#7175 merged Jun 24, 2025
[NFC] Move getTiedArgs into TritonGPU utils
#7277 merged Jun 23, 2025
[Tutorial] Fix 06-fused-attention.py of FP8 provider
#7043 merged Jun 23, 2025
[Hopper][WS] Update pipeline to get GEMM/FA working
#7136 merged Jun 23, 2025
[AMD] Added a canonicalizer to ConcatOp
#7273 merged Jun 23, 2025
[AMD] Support splatted scale in MFMA
#7270 merged Jun 23, 2025
[AMD][BACKEND] Do not pipeline via AsyncCopyGlobalToLocal if the load width is less than 32bit
#7250 merged Jun 23, 2025
Improve detection of loop carries in triton frontend
#7200 merged Jun 23, 2025
[BACKEND] Add a new pass to insert fence.proxy.async for write after read hazard
#7262 merged Jun 22, 2025
[AMD] NFC: Tidy up FP8 variant support cases
#7267 merged Jun 22, 2025
[Backend] Bump to llvm/llvm-project@570885128351
#7266 merged Jun 22, 2025
[IR] tune the rematerialization heuristic to avoid harmful rematerialization
#7240 merged Jun 21, 2025
[AMD] Fix pointer canonicalizer when propagating discardable attrs
#7242 merged Jun 21, 2025
[PROTON-DEV] Fix build issues
#7257 merged Jun 20, 2025
[ConSan] Concurrency Sanitizer - initial scaffolding and introduction of TritonInstrument dialect
#7157 merged Jun 20, 2025
[Backend] Add a shared layout for padding
#7212 merged Jun 20, 2025
[FRONTEND] Remove hardcoded warp size
#7253 merged Jun 20, 2025
[AMD] Rewrite extract_slice op implementation
#7128 merged Jun 20, 2025
[Backend] Assert that num threads is always power of 2 (NFC)
#7251 merged Jun 20, 2025
[PROTON-DEV] Add Sched Barrier Pass To Prevent Instruction Reordering Outside Proton Record Regions
#7180 merged Jun 20, 2025
[Backend] Bump to llvm/llvm-project@1b83f10072b322a206ffcaf737b42fe5c2d95b89
#7252 merged Jun 20, 2025
[Blackwell] Fix codegen for tmem_load of Nx1xf32
#7234 merged Jun 20, 2025
[BACKEND] hint to LLVM that we can bound threadIdx.x
#7249 merged Jun 20, 2025
[BACKEND] Share ld/st.shared lowering between convert_layout and local_load/store
#7248 merged Jun 20, 2025
[gluon] fix lint
#7246 merged Jun 20, 2025
[PROTON-DEV] Support long clock for long-running kernels
#7228 merged Jun 20, 2025
[gluon] fix some AMD compilation issues + skip tests on AMD for now
#7215 merged Jun 19, 2025
Partially Revert "[LAYOUTS] Enable diagonal iteration unconditionally (#7218)"
#7245 merged Jun 19, 2025
[KERNELS] move back to using host-side TMA for gathers
#7237 merged Jun 19, 2025
[NFC][BACKEND] Rewrite convert_layout in a more functional way
#7241 merged Jun 19, 2025
[NFC] Add is_hopper helper and rename is_hopper -> is_hopper_or_newer
#7224 merged Jun 19, 2025
[AMD]Enable a few tests on HIP
#7236 merged Jun 19, 2025
[LAYOUTS] Enable diagonal iteration unconditionally
#7218 merged Jun 19, 2025
[kernels] revert bias subtiling changes
#7232 merged Jun 18, 2025
[LAYOUTS] Enable generic swizzling on AMD
#7225 merged Jun 18, 2025
[Triton] Clean up unused/old env vars (NFC)
#7229 merged Jun 18, 2025
[Gluon] Fix linear layout MLIR->Python; fix CTA layout equality
#7230 merged Jun 18, 2025
[Gluon][TTNG] Add async_copy ops including mbarrier arrive op
#7220 merged Jun 18, 2025
[Gluon][Tutorial] Merge d64 and d128 attn kernels
#7226 merged Jun 18, 2025
[Warp Specialization] Fix iterator invalidation
#7223 merged Jun 18, 2025
[NVIDIA] L2 cache hints only for sm >= 80
#7219 merged Jun 18, 2025
[BACKEND] Move lowering of CF as the last step of conversion to LLVM
#7213 merged Jun 18, 2025
[KERNELS] Skip idle_sms on AMD
#7217 merged Jun 18, 2025
[Blackwell] Fix tmem_subslice lowering for packed sub-32B layouts
#7207 merged Jun 18, 2025
[KERNELS] no longer enforce persistent when is used
#7214 merged Jun 18, 2025
[KERNELS] fix handling of opt_flags.idle_sms
#7211 merged Jun 18, 2025
[KERNELS] added option and test to set idle sms in matmul_ogs
#7210 merged Jun 18, 2025
[PROTON-DEV] Add SamplingStrategy::SELECTIVE for instrumentation
#7208 merged Jun 18, 2025
[kernels] moved reinterpret to before tma creation
#7205 merged Jun 18, 2025
Fix out-of-bounds load in mxfp_matmul test kernel.
#7193 merged Jun 17, 2025
[KERNELS] Fix bf16 x mxfp4 when EVEN_K is False
#7203 merged Jun 17, 2025
[AMD][gfx12] WMMA AMD16x16x32 support for i4 operands
#7012 merged Jun 17, 2025
[Gluon] Implement attention kernels for d64 and d128
#7009 merged Jun 17, 2025
[kernels] use more host TMA for X, W, Mx in persistent matmul
#7182 merged Jun 17, 2025
[Tutorial] Improve dhead=128 ws performance for attention
#7195 merged Jun 17, 2025
[Pipeliner] Fix backward scheduling over ttg.local_load
#7194 merged Jun 17, 2025
[Bench][AMD] Fix torch ref routing and enable CI
#7183 merged Jun 17, 2025
[BACKEND] Implement generic swizzling when lowering convert_layout
#6982 merged Jun 17, 2025
[BACKEND] simpler codegen for linear layouts
#7201 merged Jun 17, 2025
[BACKEND] Workaround for ptxas bug in matrix descriptor arithmetic
#7197 merged Jun 17, 2025

21 Pull requests opened by 16 people

[AMD] expose core pipeliner utilities and integrate in AMD pipeliner
#7222 opened Jun 18, 2025
[wip] Logging debug info before async ops
#7231 opened Jun 18, 2025
[Gluon][Tutorial] Optimize attention kernel
#7238 opened Jun 19, 2025
Allow customization of the subscript operator for triton values
#7239 opened Jun 19, 2025
[Layouts] Infer slice encoding for SplitOp result
#7247 opened Jun 20, 2025
[Backend] Bump to llvm/llvm-project@570885128351
#7254 opened Jun 20, 2025
Notes from 2025-03-12 community meetup
#7255 opened Jun 20, 2025
Notes from 2025-05-01 community meetup
#7256 opened Jun 20, 2025
[PROTON] Intra kernel profiling
#7258 opened Jun 20, 2025
[KERNELS] some matmul refactoring
#7259 opened Jun 21, 2025
[AMD] Use permlanex16 for shuffleXor on rdna
#7269 opened Jun 23, 2025
[Frontend] Fix scope enter to do a deep copy of scopes
#7271 opened Jun 23, 2025
[IR] Avoid rematerialization for non-associative reduce op
#7272 opened Jun 23, 2025
[AMD] Loosed constraints for MemDescSubviewOp
#7274 opened Jun 23, 2025
[TMA] Correctly get TMA Block Shape for SwizzledShared Blocks
#7275 opened Jun 23, 2025
Updated CMakeLists.txt to install headers and the triton library
#7276 opened Jun 23, 2025
[Warp Specialization] Fix WAR async+generic proxy for warp spec
#7278 opened Jun 23, 2025
[README] Mention `make dev-install-llvm` for custom LLVM build
#7279 opened Jun 23, 2025
[AMD] guard FoldTrueCmpI from tensors
#7281 opened Jun 23, 2025
[AMD] Implement `tl.extra.hip.memrealtime` for timing
#7282 opened Jun 23, 2025
[WIP!] [AMD] Add tilesPerWarp parameter to mfma layout
#7283 opened Jun 23, 2025

5 Issues closed by 5 people

Problems building `triton` v3.2.0 in offline mode
#6919 closed Jun 23, 2025
Reduction is duplicated in TTIR -> TTGIR with num_stages>1 causing strange inconsistencies
#6647 closed Jun 21, 2025
AsyncCopyGlobalToLocalOpConversion::matchAndRewrite failure
#7243 closed Jun 19, 2025
[AMD] Fix redundant data masking computations for stores
#5496 closed Jun 19, 2025
AMD ReorderInstruction pass will reorder the global_load ahead of local_store and break the local_prefetch logic which will miss match TritonAMDGPULowerInstructionSchedHints::createLocalPrefetchSchedule code logic
#6750 closed Jun 19, 2025

10 Issues opened by 10 people

AOT Type Hint for Tensor / Block / Block Ptr
#7280 opened Jun 23, 2025
Higher shared_memory usage in Triton 3.3
#7268 opened Jun 23, 2025
Add support for installation of header files and built artifacts
#7265 opened Jun 22, 2025
Butterfly shuffles in reductions trigger racecheck because they are not numerically stable
#7264 opened Jun 22, 2025
Which Triton version support 2080Ti, P100 and MI50?
#7263 opened Jun 22, 2025
Large Grid Size Triggers Kernel No-Op
#7260 opened Jun 21, 2025
ICE "llvm::SmallVectorTemplateCommon<long> Assertion `idx < size()' failed"
#7244 opened Jun 19, 2025
Why fused attn tutorial cannot pass bwd testop?
#7216 opened Jun 18, 2025
If we perform a load without executing a matmul operation, the memory access won’t be coalesced.
#7202 opened Jun 17, 2025
AMD/MI300X performance is lacking compared to torch.matmul
#7199 opened Jun 17, 2025

17 Unresolved conversations

Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.

[AMD] Add HIP AOT support to compile.py tool
#7007 commented on Jun 23, 2025 • 20 new comments
Fix: PY_SSIZE_T_CLEAN macro must be defined for '#' formats
#6928 commented on Jun 19, 2025 • 2 new comments
[do_bench] synchronize before first function call
#7124 commented on Jun 18, 2025 • 0 new comments
WIP: Use variadic argument cuda launcher
#6788 commented on Jun 23, 2025 • 0 new comments
[AMD] Use decomposed path for scaled dot software emulation
#6337 commented on Jun 22, 2025 • 0 new comments
[Proton][Dialect] Middle-end support of the Proton Dialect and the frontend Python package
#5677 commented on Jun 20, 2025 • 0 new comments
[Proton][Dialect] Add Proton Device Memory Buffer Init and Allocate Pass
#5606 commented on Jun 20, 2025 • 0 new comments
unable to build triton
#7088 commented on Jun 23, 2025 • 0 new comments
Adding Metal Backend to Triton
#4824 commented on Jun 23, 2025 • 0 new comments
Unsupported DotOp found when converting TritonGPU to LLVM
#6951 commented on Jun 22, 2025 • 0 new comments
3D tensor can't sum
#6039 commented on Jun 21, 2025 • 0 new comments
Optimizing Shared Memory Usage
#4756 commented on Jun 20, 2025 • 0 new comments
Triton 3.1.0 failed with a simple tl.dot and then tl.store example
#5557 commented on Jun 19, 2025 • 0 new comments
Microscaling dtypes in triton?
#6054 commented on Jun 18, 2025 • 0 new comments
Remove setuptools requirement
#7192 commented on Jun 18, 2025 • 0 new comments
int4 support
#675 commented on Jun 17, 2025 • 0 new comments
Triton 3.3 Performance Regression on Small Gemms
#7096 commented on Jun 17, 2025 • 0 new comments