Stars
📚FFPA(Split-D): Extend FlashAttention with Split-D for large headdim, O(1) GPU SRAM complexity, 1.8x~3x↑🎉 faster than SDPA EA.
📚LeetCUDA: Modern CUDA Learning Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Core Kernels, HGEMM, FA-2 MMA etc.🔥
Flash Attention in ~100 lines of CUDA (forward pass only)
Learning materials for Stanford CS149: Parallel Computing
A web slideshow framework for building slides in Markdown with the same theme as Prof. Jiang Yanyan's slides (based on Reveal.js)