HUST
- Wuhan, China
Stars
The official project for the paper "Visual Text Processing: A Comprehensive Review and Unified Evaluation"
Solve Visual Understanding with Reinforced VLMs
Witness the aha moment of VLM with less than $3.
OCR & Document Extraction using vision models
Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
A high-quality, one-stop open-source data extraction tool for converting PDF to Markdown and JSON.
GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 20…
[ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
MINT-1T: A one trillion token multimodal interleaved dataset.
[CVPR 2024] Official repository for "MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model"
Curated tutorials and resources for Large Language Models, AI Painting, and more.
✨✨Latest Advances on Multimodal Large Language Models
[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥
Ongoing research training transformer models at scale
An open-source framework for training large multimodal models.
Langchain-Chatchat (formerly Langchain-ChatGLM): RAG and Agent applications built on Langchain and language models such as ChatGLM, Qwen, and Llama | Langchain-Chatchat (formerly langchain-ChatGLM), local knowledge based LLM (like ChatGLM, Qwen and…
🦜🔗 Build context-aware reasoning applications
A toolbox of OCR models and algorithms based on MindSpore
InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editin…
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Open-source code for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/)
CDLA: A Chinese document layout analysis (CDLA) dataset
[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch implementation of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
Code release for ConvNeXt V2 model
CV-CUDA™ is an open-source, GPU accelerated library for cloud-scale image processing and computer vision.