The official implementation of SpargeAttn, a universal sparse attention mechanism that accelerates language, image, and video models.
- `python>=3.9`, `torch>=2.3.0`
- CUDA:
  - `>=12.8` for Blackwell
  - `>=12.4` for fp8 support on Ada
  - `>=12.3` for fp8 support on Hopper
  - `>=12.0` for Ampere
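To check which row applies to your setup, you can query PyTorch directly (a minimal sketch using standard PyTorch APIs; the capability-to-architecture mapping in the comment is general NVIDIA knowledge, not SpargeAttn-specific):

```python
import torch

# CUDA toolkit version this torch build was compiled against
print(f"torch {torch.__version__}, CUDA {torch.version.cuda}")

# Compute capability maps to the architectures above:
# (10, 0)/(12, 0) Blackwell, (9, 0) Hopper, (8, 9) Ada, (8, 0)/(8, 6) Ampere
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm_{major}{minor}")
```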
```bash
pip install ninja        # for parallel compilation
python setup.py install  # or: pip install -e .
```
- `spas_sage2_attn_meansim_cuda`: SpargeAttn based on SageAttention2.
- `spas_sage_attn_meansim_cuda`: SpargeAttn based on SageAttention.
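Both functions are drop-in replacements for scaled dot-product attention on CUDA tensors. A minimal sketch of calling the default API (the import path, tensor layout, and `is_causal` keyword are assumptions based on common attention-kernel conventions; check the example scripts for the exact signature):

```python
import torch
# Import path assumed; adjust to the package name installed by setup.py.
from spas_sage_attn import spas_sage2_attn_meansim_cuda

# q, k, v: (batch, num_heads, seq_len, head_dim), fp16/bf16 on GPU
q = torch.randn(2, 24, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 24, 4096, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 24, 4096, 64, dtype=torch.float16, device="cuda")

# Output has the same shape as q; the sparsity hyper-parameters are
# normally set per layer by the tuning step rather than passed by hand.
out = spas_sage2_attn_meansim_cuda(q, k, v, is_causal=False)
```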
Tuning:
```bash
# sequential tuning
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --tune

# parallel tuning; this will use all GPUs available on the machine
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --tune --parallel_tune
```
Inference:
```bash
# `--compile` is optional and will slow down the first inference run.
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --compile
```
Note: We provide pre-tuned hyper-parameters (`CogVideoX-2b_0.06_0.07.pt`) that allow you to run the inference script directly. However, for better speed and quality, we recommend re-tuning: the provided hyper-parameters were tuned with SpargeAttn based on SageAttention, whereas the default API is now based on SageAttention2.
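The tuned checkpoint is an ordinary PyTorch file, so you can inspect what tuning produced (a sketch; the structure of the dict, one entry of hyper-parameters per attention module, is an assumption, and key names follow the model's module naming):

```python
import torch

# Load the pre-tuned hyper-parameter dict shipped with the repo.
tuned = torch.load(
    "evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt", map_location="cpu"
)

# Print a few entries to see which modules were tuned and with what values.
for name, params in list(tuned.items())[:3]:
    print(name, params)
```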
Note: `--compile` is optional and further accelerates video generation, but it adds a one-time compilation overhead to the first video generation.
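The first-run overhead is standard `torch.compile` behavior: compilation happens lazily on the first call, and the compiled graph is reused afterwards. A generic, SpargeAttn-independent illustration:

```python
import torch

def f(x):
    return torch.sin(x) + torch.cos(x)

compiled_f = torch.compile(f)        # no compilation yet
x = torch.randn(1000, device="cuda")
compiled_f(x)                        # first call: compiles (slow, one-time)
compiled_f(x)                        # later calls: reuse the compiled graph
```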
Tuning and inference usage is similar to that of CogVideoX.
Here’s a list of the models tuned so far; see Hugging Face for all tuned checkpoints. Our approach is universal, and we warmly welcome contributions! Feel free to submit a pull request to support more models. 🚀
| model name | example script | tuned ckpt |
|---|---|---|
| CogVideoX-2b | `evaluate/cogvideo_example.py` | link |
| Wan2.1-T2V-1.3B | `evaluate/wan_example.py` | link |
| Flux | `evaluate/flux_example.py` | TBD |
Note: All experiments in the above table and our paper used SpargeAttn based on SageAttention. An updated implementation based on SageAttention2 is now available, offering a further 30% speedup.
Figure: The quality of video generation on Mochi.
Figure: End-to-end performance on NIAH (needle-in-a-haystack).
If you use this code or find our work valuable, please cite:
```bibtex
@misc{zhang2025spargeattn,
  title={SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference},
  author={Jintao Zhang and Chendong Xiang and Haofeng Huang and Jia Wei and Haocheng Xi and Jun Zhu and Jianfei Chen},
  year={2025},
  eprint={2502.18137},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.18137},
}

@inproceedings{zhang2025sageattention,
  title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}

@misc{zhang2024sageattention2,
  title={SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization},
  author={Jintao Zhang and Haofeng Huang and Pengle Zhang and Jia Wei and Jun Zhu and Jianfei Chen},
  year={2024},
  eprint={2411.10958},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2411.10958},
}
```