Stars
[CVPR 2025] Official Implementation of LOCORE: Image Re-ranking with Long-Context
Scalable and memory-optimized training of diffusion models
Quantized Attention achieves speedups of 2-5x and 3-11x compared to FlashAttention and xformers, respectively, without losing end-to-end metrics across language, image, and video models.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Codebase for the Progressive Mixed-Precision Decoding paper.
Minimalistic 4D-parallelism distributed training framework for educational purposes
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
Moments Retrieval Project Webpage (temporal)
[ICML'24] Recurrent Early Exits for Federated Learning with Heterogeneous Clients
Efficient LLM Inference Acceleration using Prompting
This repository contains an implementation of the models introduced in the paper Dialog-based Interactive Image Retrieval. The network is implemented using PyTorch, and the rest of the framework is …