Lists (32)
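AI object removal
Caption competition
Grounding
LLM surveys
Traditional NLP understanding
Traditional OCR
Other
Image editing
Multimodal large models
Multimodal reasoning and thinking models
Multimodal understanding + image editing
Multimodal video understanding
Large model fine-tuning
Large language models
Large language model chain-of-thought
Fun applications
Reinforcement learning
Inference acceleration frameworks
Text inpainting
Text-to-video retrieval
Text-to-image generation
Model layer visualization
Multi-layer poster generation
Traceability codes and QR codes
Neural network slides (PPT)
Pure-vision segmentation, detection, and recognition
Representation
Video processing tools
Video temporal grounding
Training frameworks
Quantization
Audio tools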
Stars
A Bulletproof Way to Generate Structured JSON from Language Models
A SOTA open-source image editing model that aims to provide performance comparable to closed-source models such as GPT-4o and Gemini 2 Flash.
SGLang is a fast serving framework for large language models and vision language models.
StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and te…
[CVPR 2025] Official repo for ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
Skywork-R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
verl: Volcano Engine Reinforcement Learning for LLMs
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen2.5-VL, Qwen2.5-Omni, Qwen2-Audio, Ovis2, InternVL3, Llava, GLM4…
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
An official implementation for "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval"
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
[CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".
[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Repository for 23'MM accepted paper "Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Grounding"
[ECCV 2024] SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation
[ECCV 2024] This is an official implementation for "PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model"
Project Page for "LISA: Reasoning Segmentation via Large Language Model"
[ECCV 2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
OMG-LLaVA and OMG-Seg codebase [CVPR-24 and NeurIPS-24]
Project for "HyperSeg: Towards Universal Visual Segmentation with Large Language Model".
NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Official implementation of "InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models"