[Bug]: Incorrect MFMA Peak FLOPs Calculations for BF16 and F16 in `gfx941/0200_system-speed-of-light.yaml` · Issue #700 · ROCm/rocprofiler-compute
While validating peak throughput calculations on MI300X, I noticed that the MFMA metrics for BF16 and F16 in gfx941/0200_system-speed-of-light.yaml assume 4096 FLOPs per cycle per CU:
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
This is incorrect. According to the CDNA3 whitepaper, Table 1, the correct peak throughput for BF16 and F16 MFMA is 2048 FLOPs per cycle per CU. The corrected expression should be:
peak: ((($max_sclk * $cu_per_gpu) * 2048) / 1000)
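As a rough sanity check, here is a minimal Python sketch that evaluates both expressions. The hardware values are my assumptions and are not read from the profiler: 304 CUs and a 2100 MHz peak engine clock for MI300X, with `$max_sclk` taken to be in MHz so that the result is in GFLOPs. The function name `mfma_peak_gflops` is hypothetical and only mirrors the YAML expression.

```python
# Sketch of the YAML peak expression:
#   peak = ((max_sclk * cu_per_gpu) * flops_per_cycle_per_cu) / 1000
# Assumption: max_sclk is in MHz, so the result is in GFLOPs.

def mfma_peak_gflops(max_sclk_mhz, cu_per_gpu, flops_per_cycle_per_cu):
    """Evaluate the peak expression for one MFMA data type (hypothetical helper)."""
    return (max_sclk_mhz * cu_per_gpu * flops_per_cycle_per_cu) / 1000

# Assumed MI300X values (not taken from the issue or from the profiler).
MAX_SCLK_MHZ = 2100
CU_PER_GPU = 304

current = mfma_peak_gflops(MAX_SCLK_MHZ, CU_PER_GPU, 4096)    # value used in the YAML today
corrected = mfma_peak_gflops(MAX_SCLK_MHZ, CU_PER_GPU, 2048)  # value proposed in this issue

print(f"current:   {current / 1000:.1f} TFLOPs")    # 2614.9 TFLOPs
print(f"corrected: {corrected / 1000:.1f} TFLOPs")  # 1307.4 TFLOPs
```

Under those assumptions the corrected expression gives about 1307.4 TFLOPs, while the current one reports exactly twice that.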
Other MFMA-related metrics such as F8, F32, F64, and I8 appear to follow a similar pattern and may also require review. Let me know how you'd prefer to track those.
Thanks.
Linux Distribution
NA
ROCm Compute Profiler Version
NA
GPU
AMD MI300X
ROCm Version
No response
Cluster name (if applicable)
No response
Reproducer
Shared the code snippet from src
Expected behavior
No response
Relevant log output
Screenshots
No response
Additional Context
No response