[TMA] Correctly get TMA Block Shape for SwizzledShared Blocks #7275
When calculating the TMA Block Shape, we previously assumed the encoding was always an `NVMMASharedEncodingAttr`. However, when the inner dimension is less than 8, the actual encoding is a `SwizzledSharedEncodingAttr`. Under our old assumption, casting to `NVMMASharedEncodingAttr` would result in reading garbage values for fields like `elementBitWidth`, `swizzleBytes`, `fp4Padded`, `transposed`, and `packedSize`. In practice, this often led to `fp4Padded` being incorrectly interpreted as `True`, causing the calculated Block Shape to be multiplied by 2. In certain cases, this resulted in unpredictable behavior during TMA load/store operations.

When we have a `SwizzledSharedEncodingAttr`, the only modification necessary to calculate the Block Shape is to limit each dimension to 256, the max size for a TMA block pointer. We also need to make a similar fix when calculating the unpacked offset layout.
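For illustration, here is a minimal, self-contained C++ sketch of the corrected dispatch. The struct names mirror the real MLIR attributes but carry only the fields needed here, and the helper `getTMABlockShape` and its signature are hypothetical; only the clamp-to-256 rule for swizzled encodings and the `fp4Padded` doubling come from the description above.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-ins for the real MLIR attributes (NVMMASharedEncodingAttr /
// SwizzledSharedEncodingAttr); names and fields here are illustrative.
struct NVMMASharedEncoding {
  bool fp4Padded; // only meaningful on a genuine NVMMA-shared encoding
};
struct SwizzledSharedEncoding {};

// Max size of each dimension of a TMA block pointer (from the PR text).
constexpr int64_t kMaxTMADim = 256;

// Hypothetical helper for the swizzled case: per the description, the only
// adjustment needed is clamping every dimension to 256.
std::vector<int64_t> getTMABlockShape(const SwizzledSharedEncoding &,
                                      std::vector<int64_t> shape) {
  for (auto &dim : shape)
    dim = std::min(dim, kMaxTMADim);
  return shape;
}

// NVMMA case: here it is safe to read fp4Padded (and, in the real code,
// elementBitWidth, swizzleBytes, transposed, and packedSize). The bug was
// reaching this path for swizzled encodings, where fp4Padded is garbage
// and often reads as true, doubling the computed block shape.
std::vector<int64_t> getTMABlockShape(const NVMMASharedEncoding &enc,
                                      std::vector<int64_t> shape) {
  if (enc.fp4Padded)
    shape.back() *= 2; // illustrative: padded fp4 packs two elements per slot
  return shape;
}

int main() {
  // Inner dimension < 8, so this block really carries a swizzled encoding;
  // oversized dimensions get clamped to 256 instead of mis-scaled by 2.
  SwizzledSharedEncoding swizzled;
  auto shape = getTMABlockShape(swizzled, {512, 4});
  std::printf("%lld x %lld\n", (long long)shape[0], (long long)shape[1]);
  // prints: 256 x 4
  return 0;
}
```

In the compiler itself the distinction would come from a checked cast on the encoding attribute (e.g. `dyn_cast`) rather than from C++ overloading, but the effect is the same: NVMMA-only fields are never read through a mistyped encoding.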
New contributor declaration

- I am not making a trivial change, such as fixing a typo in a comment.
- I have written a PR description following these rules.
- I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

Select one of the following.
- `/test` for `lit` tests
- `/unittest` for C++ tests
- `/python/test` for end-to-end tests

This change is covered by `test_tensor_descriptor_store` and `test_make_tensor_descriptor_matmul` in `test/unit/language/test_tensor_descriptor.py`, and by `test_tensor_descriptor_reduce`, `test_tma_gather`, `test_tma_scatter`, `test_host_tensor_descriptor_load`, and `test_host_tensor_descriptor_matmul` in `test/unit/cuda/test_tensor_descriptor.py`.

Select one of the following.
- I have not added any `lit` tests.
- The `lit` tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)