Description
Proposed Plan
A100 GPU with 108 physical SMs
- grid_dim = (108, 1, 1), block_dim = (128, 1, 1)
- 96 workers (96 SMs), 48 schedulers (12 SMs)
H100 GPU with 132 physical SMs
- grid_dim = (132, 1, 1), block_dim = (384, 1, 1)
- 128 workers (128 SMs), 16 schedulers (4 SMs)
- Each worker uses 128 producer threads (TMA) and 256 consumer threads (tensor cores), which together make up the 384-thread block
B200 GPU with 160 physical SMs
- grid_dim = (4, 4, 10), block_dim = (384, 1, 1)
- 144 workers (144 SMs), 64 schedulers (16 SMs)
H20 GPU with 78 physical SMs
- grid_dim = (78, 1, 1), block_dim = (384, 1, 1)
- Option 1: 64 workers (64 SMs), 56 schedulers (14 SMs)
- Option 2: 72 workers (72 SMs), 24 schedulers (6 SMs)
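
The numbers above can be sanity-checked with a short script. This is only a sketch: the `Plan` class, its field names, and the `check` and `schedulers_per_sm` helpers are hypothetical names chosen for illustration. The checks encode what the figures appear to imply (one thread block per physical SM, and worker SMs plus scheduler SMs adding up to the whole GPU); neither rule is stated explicitly in the plan.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Plan:
    """One proposed launch configuration (field names are illustrative)."""
    name: str
    physical_sms: int
    grid_dim: tuple
    block_dim: tuple
    worker_sms: int      # SMs running workers (one worker per SM)
    scheduler_sms: int   # SMs reserved for schedulers
    schedulers: int      # total scheduler count


# Values copied from the plan above.
PLANS = [
    Plan("A100", 108, (108, 1, 1), (128, 1, 1), 96, 12, 48),
    Plan("H100", 132, (132, 1, 1), (384, 1, 1), 128, 4, 16),
    Plan("B200", 160, (4, 4, 10), (384, 1, 1), 144, 16, 64),
    Plan("H20 option 1", 78, (78, 1, 1), (384, 1, 1), 64, 14, 56),
    Plan("H20 option 2", 78, (78, 1, 1), (384, 1, 1), 72, 6, 24),
]


def check(plan):
    """Return a list of inconsistencies found in one plan (empty if none)."""
    problems = []
    # Assumption: the grid launches exactly one block per physical SM.
    if prod(plan.grid_dim) != plan.physical_sms:
        problems.append("grid size differs from the physical SM count")
    # Assumption: every SM is either a worker SM or a scheduler SM.
    if plan.worker_sms + plan.scheduler_sms != plan.physical_sms:
        problems.append("worker and scheduler SMs do not cover the GPU")
    # Schedulers should spread evenly over the scheduler SMs.
    if plan.schedulers % plan.scheduler_sms != 0:
        problems.append("schedulers do not divide evenly across their SMs")
    return problems


def schedulers_per_sm(plan):
    """Number of schedulers hosted on each scheduler SM."""
    return plan.schedulers // plan.scheduler_sms


if __name__ == "__main__":
    for plan in PLANS:
        status = "ok" if not check(plan) else "; ".join(check(plan))
        print(f"{plan.name}: {schedulers_per_sm(plan)} schedulers/SM, {status}")
```

Under these assumptions every configuration passes, and each one places four schedulers on a scheduler SM. For H20, the two options trade 8 worker SMs for 32 schedulers.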