- Algomatic
- Japan (UTC +09:00)
- https://sites.google.com/view/yusukemikami
- in/yusukemikami
⭐ VLM
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA), built toward GPT-4V-level capabilities and beyond (see the inference sketch after this list).
[ICLR 2024] Fine-tuning LLaMA to follow instructions within 1 hour using only 1.2M parameters.
A state-of-the-art open visual language model | multimodal pre-trained model
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
Open-source code for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/)
Caption-Anything is a versatile tool combining image segmentation, visual captioning, and ChatGPT, generating tailored captions with diverse controls for user preferences. https://huggingface.co/sp…
LAVIS - A One-stop Library for Language-Vision Intelligence (see the usage sketch after this list).
Papers and resources on controllable generation with diffusion models, including ControlNet, DreamBooth, and IP-Adapter (a ControlNet sketch follows this list).
A library for advanced large language model reasoning
ChatBridge, an approach to learning a unified multimodal model to interpret, correlate, and reason about various modalities without relying on all combinations of paired data.
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VisualGPT (CVPR 2022): GPT as a decoder for vision-language models.
Code and documentation to train Stanford's Alpaca models and generate the data.
Recent LLM-based computer vision and related work. Comments and contributions welcome!
Code for the paper "ViperGPT: Visual Inference via Python Execution for Reasoning"
Official code for VisProg (CVPR 2023 Best Paper!)
An open-source framework for training large multimodal models.
A self-hosted, offline, ChatGPT-like chatbot. Powered by Llama 2. 100% private, with no data leaving your device. New: Code Llama support!
VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
Reading list for research topics in embodied vision
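The LLaVA entry at the top of this list ships its own training and serving code; as a minimal, hedged illustration of visual instruction following with a LLaVA-1.5 checkpoint, the sketch below assumes the Hugging Face transformers port (`llava-hf/llava-1.5-7b-hf`, `LlavaForConditionalGeneration`) rather than the original repository's CLI, and a local placeholder image `example.jpg`.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: the Hugging Face port of LLaVA-1.5, not the original repo's scripts.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder path
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

# The processor tokenizes the prompt and preprocesses the image together.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```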
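LAVIS pairs each model with matched preprocessors through a single `load_model_and_preprocess` entry point; the captioning sketch below follows the library's published example, with the image path as a placeholder.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a BLIP captioning model together with its matched image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption for the preprocessed image.
print(model.generate({"image": image}))
```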
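The controllable-generation reading list centers on ControlNet-style conditioning; as one concrete, hedged example of the technique (written against the diffusers library, not code from that repository), the sketch below conditions Stable Diffusion 1.5 on a Canny edge map derived from a placeholder reference image.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler

# Build a Canny edge map from a local reference image (placeholder path).
reference = np.array(Image.open("reference.jpg").convert("RGB"))
edges = cv2.Canny(reference, 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Stable Diffusion 1.5 conditioned on the edge map via a pretrained Canny ControlNet.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

image = pipe("a photo of a cat", image=canny_image, num_inference_steps=20).images[0]
image.save("controlled.png")
```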