[TMA] Correctly get TMA Block Shape for SwizzledShared Blocks by NikhilAPatel · Pull Request #7275 · triton-lang/triton · GitHub

[TMA] Correctly get TMA Block Shape for SwizzledShared Blocks #7275


Closed
wants to merge 1 commit into from

Conversation

NikhilAPatel
Contributor

When calculating the TMA Block Shape, we previously assumed the encoding was always an NVMMASharedEncodingAttr. However, when the inner dimension is less than 8, the actual encoding is a SwizzledSharedEncodingAttr. Under our old assumption, casting to NVMMASharedEncodingAttr would read garbage values for fields like elementBitWidth, swizzleBytes, fp4Padded, transposed, and packedSize. In practice, this often caused fp4Padded to be incorrectly interpreted as True, which doubled the calculated Block Shape. In certain cases, this resulted in unpredictable behavior during TMA load/store operations.

When we have a SwizzledSharedEncodingAttr, the only modification necessary to calculate the Block Shape is to limit each dimension to 256, the max size for a TMA block pointer.

We also need to fix something similar when calculating the unpacked offset layout.
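To make the intended fix concrete, here is a minimal sketch (illustrative only, not Triton's actual C++ implementation; the helper name and constant are hypothetical) of the clamping described above for the swizzled-shared case:

```python
# Hypothetical sketch of the SwizzledSharedEncodingAttr path described above.
# The only adjustment needed is to cap each dimension of the block shape at
# 256, the maximum size per dimension for a TMA block pointer.

TMA_MAX_BLOCK_DIM = 256  # assumed constant; per-dimension TMA limit

def swizzled_tma_block_shape(block_shape):
    """Clamp each dimension of the block shape to the TMA maximum."""
    return [min(dim, TMA_MAX_BLOCK_DIM) for dim in block_shape]
```

For example, a 512x64 block would be clamped to 256x64, while shapes already within the limit pass through unchanged.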

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because the changes should already be covered by test_tensor_descriptor_store and test_make_tensor_descriptor_matmul in test/unit/language/test_tensor_descriptor.py, and by test_tensor_descriptor_reduce, test_tma_gather, test_tma_scatter, test_host_tensor_descriptor_load, and test_host_tensor_descriptor_matmul in test/unit/cuda/test_tensor_descriptor.py.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@NikhilAPatel NikhilAPatel requested a review from bertmaher June 23, 2025 18:21
@NikhilAPatel NikhilAPatel requested a review from ptillet as a code owner June 23, 2025 18:21
Contributor
@peterbell10 peterbell10 left a comment


If there is a descriptor with SwizzledSharedEncodingAttr then that's a bug and the cast op is expected to assert at runtime. Do you have a reproducer that results in a non-nvmma encoding for descriptors?

@NikhilAPatel
Contributor Author

If there is a descriptor with SwizzledSharedEncodingAttr then that's a bug and the cast op is expected to assert at runtime. Do you have a reproducer that results in a non-nvmma encoding for descriptors?

Sorry, you're right: there were a few commits I needed to cherry-pick after our cutoff to see the correct behavior. With those included, the descriptor shows up as nvmma as expected. Appreciate the clarification.
