Stars
[ICLR 2025] MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
🧑🚀 Summary of the world's best LLM resources (video generation, agents, coding assistance, data processing, model training, model inference, o1 models, MCP, small language models, vision-language models)
Explore the Multimodal “Aha Moment” on a 2B Model
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos.
MoBA: Mixture of Block Attention for Long-Context LLMs
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper
[CVPR 2025] Official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models"
Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
[CVPR 2025] Reasoning to Attend: Try to Understand How <SEG> Token Works (Rui Qian, Xin Yin, Dejing Dou)
Align Anything: Training All-modality Model with Feedback
[ArXiv] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
[NeurIPS 2021] [T-PAMI] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Official Repository of MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
⚡️SwanLab - an open-source, modern AI training tracking and visualization tool. Supports cloud and self-hosted use. Integrated with PyTorch / Transformers / LLaMA Factory / Swift / Ultralytics…
Use PEFT or full-parameter training for CPT/SFT/DPO/GRPO on 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen2.5-VL, Qwen2.5-Omni, Qwen2-Audio, Ovis2, InternVL3, Llava, GLM4…
[ICLR2025] LLaVA-HR: High-Resolution Large Language-Vision Assistant
ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capability for Large Vision-Language Models
DeepSeek-VL: Towards Real-World Vision-Language Understanding
[IEEE TPAMI] Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation
[CVPR2024] GSVA: Generalized Segmentation via Multimodal Large Language Models
[NeurIPS 2024] Visual Perception by Large Language Model’s Weights
[NeurIPS 2024 Best Paper][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ult…
[CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation
Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.