[TMA] Correctly get TMA Block Shape for SwizzledShared Blocks #7275
When calculating the TMA Block Shape, we previously assumed the encoding was always an `NVMMASharedEncodingAttr`. However, when the inner dimension is less than 8, the actual encoding is a `SwizzledSharedEncodingAttr`. Under our old assumption, casting to `NVMMASharedEncodingAttr` would result in reading garbage values for fields like `elementBitWidth`, `swizzleBytes`, `fp4Padded`, `transposed`, and `packedSize`. In practice, this often led to `fp4Padded` being incorrectly interpreted as `True`, causing the calculated Block Shape to be multiplied by 2. In certain cases, this resulted in unpredictable behavior during TMA load/store operations.

When we have a `SwizzledSharedEncodingAttr`, the only modification necessary to calculate the Block Shape is to limit each dimension to 256, the max size for a TMA block pointer. We also need to make a similar fix when calculating the unpacked offset layout.
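For illustration, here is a minimal, self-contained C++ sketch of the corrected dispatch. The struct names mirror the real MLIR attributes but carry only the fields needed here, and the helper `getTMABlockShape` and its signature are hypothetical; only the clamp-to-256 rule for swizzled encodings and the `fp4Padded` doubling come from the description above.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-ins for the real MLIR attributes (NVMMASharedEncodingAttr /
// SwizzledSharedEncodingAttr); names and fields here are illustrative.
struct NVMMASharedEncoding {
  bool fp4Padded; // only meaningful on a genuine NVMMA-shared encoding
};
struct SwizzledSharedEncoding {};

// Max size of each dimension of a TMA block pointer (from the PR text).
constexpr int64_t kMaxTMADim = 256;

// Hypothetical helper for the swizzled case: per the description, the only
// adjustment needed is clamping every dimension to 256.
std::vector<int64_t> getTMABlockShape(const SwizzledSharedEncoding &,
                                      std::vector<int64_t> shape) {
  for (auto &dim : shape)
    dim = std::min(dim, kMaxTMADim);
  return shape;
}

// NVMMA case: here it is safe to read fp4Padded (and, in the real code,
// elementBitWidth, swizzleBytes, transposed, and packedSize). The bug was
// reaching this path for swizzled encodings, where fp4Padded is garbage
// and often reads as true, doubling the computed block shape.
std::vector<int64_t> getTMABlockShape(const NVMMASharedEncoding &enc,
                                      std::vector<int64_t> shape) {
  if (enc.fp4Padded)
    shape.back() *= 2; // illustrative: padded fp4 packs two elements per slot
  return shape;
}

int main() {
  // Inner dimension < 8, so this block really carries a swizzled encoding;
  // oversized dimensions get clamped to 256 instead of mis-scaled by 2.
  SwizzledSharedEncoding swizzled;
  auto shape = getTMABlockShape(swizzled, {512, 4});
  std::printf("%lld x %lld\n", (long long)shape[0], (long long)shape[1]);
  // prints: 256 x 4
  return 0;
}
```

In the compiler itself the distinction would come from a checked cast on the encoding attribute (e.g. `dyn_cast`) rather than from C++ overloading, but the effect is the same: NVMMA-only fields are never read through a mistyped encoding.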
New contributor declaration

- I am not making a trivial change, such as fixing a typo in a comment.
- I have written a PR description following these rules.
- I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

Select one of the following.
- `/test` for `lit` tests
- `/unittest` for C++ tests
- `/python/test` for end-to-end tests

This change is covered by `test_tensor_descriptor_store` and `test_make_tensor_descriptor_matmul` in `test/unit/language/test_tensor_descriptor.py`, and by `test_tensor_descriptor_reduce`, `test_tma_gather`, `test_tma_scatter`, `test_host_tensor_descriptor_load`, and `test_host_tensor_descriptor_matmul` in `test/unit/cuda/test_tensor_descriptor.py`.

Select one of the following.
- I have not added any `lit` tests.
- The `lit` tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)