Junfeng Wu1,2 · Yi Jiang2† · Chuofan Ma2,3
Yuliang Liu1 · Hengshuang Zhao3
Zehuan Yuan2 · Song Bai2* · Xiang Bai1*
1HUST 2ByteDance 3HKU
†project lead *corresponding author
This repo implements Liquid, a scalable and unified autoregressive generation paradigm that seamlessly integrates multimodal comprehension and generation.
2025-03-25: Data processing and model pretraining scripts have been updated in Data.md and TRAIN.md.
2025-03-04: Text-to-image and visual understanding evaluation scripts for Liquid are released in EVAL.md.
2025-02-28: Paper, demo, model, and project page for Liquid are all released.
- Liquid-7B-IT (instruction-tuned multimodal model with instruction-following ability)
- [✅] Web Demo
- [✅] Evaluation
- [✅] Checkpoints
- [✅] Training Codes
- Liquid-0.5B~32B-Pretrain (multimodal extension models at six scales, from 0.5B to 32B, across three model families)
- [ ] Checkpoints
Using Liquid for inference or evaluation doesn't require complex environment dependencies. Since it is essentially a HuggingFace-format language model, you only need the `transformers` library and a few basic components to run it. Refer to EVAL.md for recommended versions.
If deploying on a GPU with less than 30GB VRAM, you may need to enable `load_in_8bit` in `AutoModelForCausalLM.from_pretrained` in `app.py` for image generation to avoid out-of-memory errors.
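Concretely, loading the model follows the standard `transformers` pattern. A minimal sketch (the dtype and quantization settings are illustrative choices, not the repo's pinned configuration):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "Junfeng5/Liquid_V1_7B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # dtype choice is illustrative
    device_map="auto",
    # load_in_8bit=True,         # enable on GPUs with <30GB VRAM for image generation
)
```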
```bash
pip install gradio==4.44.1
pip install gradio_client==1.3.0
cd evaluation
python app.py
```
```bash
# text-only dialogue
python inference_t2t.py --model_path Junfeng5/Liquid_V1_7B --prompt "Write me a poem about Machine Learning."

# image understanding
python inference_i2t.py --model_path Junfeng5/Liquid_V1_7B --image_path samples/baklava.png --prompt 'How to make this pastry?'

# image generation; add --load_8bit for GPUs with less than 30GB VRAM
python inference_t2i.py --model_path Junfeng5/Liquid_V1_7B --prompt "young blue dragon with horn lightning in the style of dd fantasy full body"
```
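Under the hood these scripts issue standard HuggingFace `generate` calls. A minimal text-only sketch, continuing from the loading snippet above (sampling parameters are illustrative and may differ from the script defaults):

```python
# Continuing from the loading sketch above; sampling settings are illustrative
# and may not match the defaults used in inference_t2t.py.
prompt = "Write me a poem about Machine Learning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Strip the prompt tokens before decoding so only the completion is printed.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```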
- We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation.
- Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration with a single large language model (LLM), eliminating the need for external pretrained visual embeddings such as CLIP.
- For the first time, Liquid uncovers a scaling law: the performance drop caused by the unified training of visual and language tasks diminishes as the model size increases.
- Furthermore, the unified token space enables visual generation and comprehension tasks to mutually enhance each other (a conceptual sketch follows below).
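To make the "unified token space" concrete, here is an illustrative sketch of how discrete image codes can share one vocabulary with text tokens; the vocabulary and codebook sizes below are assumptions for illustration, not values taken from this repo:

```python
# Conceptual sketch (not the repo's code): image patches are quantized by a VQ
# tokenizer into discrete codes, which are appended to the text vocabulary so
# that a single LLM predicts both modalities as one token stream.
TEXT_VOCAB_SIZE = 32000      # assumed base LLM vocabulary size
IMAGE_CODEBOOK_SIZE = 8192   # assumed VQ codebook size

def image_code_to_token_id(code: int) -> int:
    """Map a discrete VQ image code into the shared LLM vocabulary."""
    assert 0 <= code < IMAGE_CODEBOOK_SIZE
    return TEXT_VOCAB_SIZE + code

def token_id_to_image_code(token_id: int) -> int:
    """Inverse mapping, used when decoding generated image tokens back to VQ codes."""
    assert token_id >= TEXT_VOCAB_SIZE
    return token_id - TEXT_VOCAB_SIZE

# A mixed sequence is then a single stream of token ids:
# [text ids ...] + [image_code_to_token_id(c) for c in vq_codes] + [text ids ...]
```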
- Liquid: a scalable and versatile unified multimodal generator supporting visual understanding, visual generation, and multimodal generation.
- Liquid can generate high-quality, photorealistic images at any aspect ratio from language prompts in an autoregressive paradigm.
- Liquid exhibits a clear scaling law in multimodal generation across model sizes (0.5B to 32B).
This project is licensed under the MIT License - see the LICENSE file for details.
If you find this project useful, please consider citing:
```bibtex
@article{wu2024liquid,
  title={Liquid: Language models are scalable multi-modal generators},
  author={Wu, Junfeng and Jiang, Yi and Ma, Chuofan and Liu, Yuliang and Zhao, Hengshuang and Yuan, Zehuan and Bai, Song and Bai, Xiang},
  journal={arXiv preprint arXiv:2412.04332},
  year={2024}
}
```