BLIP3-o is a unified multimodal model that combines the reasoning and instruction-following strengths of autoregressive models with the generative power of diffusion models. Unlike prior works that diffuse VAE features or raw pixels, BLIP3-o diffuses semantically rich CLIP image features, enabling a powerful and efficient architecture for both image understanding and generation.
📖 Arxiv
- Fully Open-Source:
  - Pretraining Data: 24 Million Detailed Captions, 5 Million Short Captions, 4 Million JourneyDB
  - Instruction Tuning Data: 60k GPT-4o Distilled Instruction Tuning Data
  - Model Weights: 4B, 8B
  - Training Code
- Unified Architecture: for both image understanding and generation.
- CLIP Feature Diffusion: Directly diffuses semantic vision features for stronger alignment and performance.
- State-of-the-art performance: across a wide range of image understanding and generation benchmarks.
- [2025/05/29] 🔥 Check out the new branch Qwen3-Siglip2. It supports SigLIP2 as the image understanding vision encoder and Qwen3 as the AR backbone, with flexible training strategies (either sequential or joint training) for both image understanding and generation.
- [2025/05/22] 🔥 Evaluation for image understanding and generation: please check the folder eval.
- [2025/05/20] 🔥 Welcome to discuss with us if you have any questions. Discord: https://discord.gg/SsVYdV84bw or WeChat.
- [2025/05/19] 🔥 We understand this is a large codebase, so we have shared a high-level overview of its Code Structure. Feel free to open an issue if you encounter any problems.
- [2025/05/16] 🔥 We've published a dataset of 24 million images with detailed captions (BLIP3o Pretrain Long Caption) and 5 million images with short captions (BLIP3o Pretrain Short Caption). All images and their captions are compressed into tar archives, so no separate image URL downloads or manual unzipping are required.
- [2025/05/16] 🔥 We've reorganized and cleaned up the repository to ensure a clear, well-structured codebase. Please give the training and inference scripts a try, and feel free to open an issue if you run into any problems. We apologize for any confusion caused by our original codebase release.
You can try out BLIP3-o in your browser using our interactive Demo.
Install the package for training
conda create -n blip3o python=3.11 -y
conda activate blip3o
pip install --upgrade pip setuptools
pip install -r requirements.txt
- BLIP3o-4B: 4B
- BLIP3o-8B: 8B
You can download our checkpoint:
python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/BLIP3o-Model-8B', repo_type='model'))"
and run the inference code
python inference.py /HF_model/checkpoint/path/
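If you prefer a single Python step, the two commands above can be chained: the local path returned by snapshot_download is the checkpoint path that inference.py takes as its argument. This assumes you run it from the repository root where inference.py lives.

```python
import subprocess
from huggingface_hub import snapshot_download

# Download the checkpoint (or reuse the cached copy) and get its local path.
ckpt_path = snapshot_download(repo_id="BLIP3o/BLIP3o-Model-8B", repo_type="model")

# Run the repository's inference script on the downloaded checkpoint.
subprocess.run(["python", "inference.py", ckpt_path], check=True)
```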
We include two scripts: slurm.sh for multi-node training on Slurm clusters, and run.sh for debugging.
For both slurm.sh and run.sh, you need to set the Hugging Face cache directory HF_HOME, the training data folder IMG_FOLDER, and the output model save folder OUTPUT_FOLDER.
For our open-source model training, we combine the pretraining datasets: long caption, short caption, and JourneyDB.
You can download all the datasets and then modify the dataloader in the code to load whichever subset you need:
import glob
import os
from datasets import load_dataset

data_files = (
    glob.glob(os.path.join('/long/caption/path', "*.tar"))
    + glob.glob(os.path.join('/short/caption/path', "*.tar"))
    + glob.glob(os.path.join('/JourneyDB/caption/path', "*.tar"))
)
train_dataset = load_dataset("webdataset", data_files=data_files, split="train", num_proc=128)
When training the diffusion transformer from scratch, we recommend using a large number of training steps (at least 150k steps) along with a cosine annealing learning rate schedule that decays from 1×10⁻⁴ down to 1×10⁻⁵.
When finetuning with our BLIP3o-60k, we recommend using a large number of training steps (at least 10k steps) along with a cosine annealing learning rate schedule that decays from 1×10⁻⁴ down to 0.
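For reference, here is a minimal sketch of such a schedule using PyTorch's built-in cosine annealing. The model, optimizer, and step counts below are placeholders, not the actual configuration in slurm.sh/run.sh:

```python
import torch

# Placeholder model and optimizer; in practice these come from the training script.
model = torch.nn.Linear(1152, 1152)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Pretraining from scratch: cosine decay from 1e-4 down to 1e-5 over >= 150k steps.
total_steps = 150_000
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=1e-5)

# Finetuning on BLIP3o-60k: cosine decay from 1e-4 down to 0 over >= 10k steps.
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000, eta_min=0.0)

for step in range(total_steps):
    # ... compute loss, loss.backward() ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```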
We also provide two CLIP + Diffusion models:
- [EVA-CLIP + SDXL]: The model checkpoint already includes the diffusion decoder (diffusion-decoder). The EVA-CLIP vision tower weights can be downloaded here: EVA-CLIP, and the EVA-CLIP preprocessing is included in the training code: EVA-CLIP-preprocess.
- [SigLIP2 + SANA]: The model checkpoint is available here: SigLIP2_SANA.
First, download the model checkpoint:
python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/SigLIP2_SANA', repo_type='model'))"
Then update img_path = 'fig.jpg' to any local image you like, and run the inference code:
cd siglip2_sana
python inference.py /HF_model/SigLIP2_SANA/path/
And you will get reconstruction.png.
- Text → Text
- Image → Text (Image Understanding)
- Text → Image (Image Generation)
- Image → Image (Image Editing)
- Multitask Training (mixed training of image generation and understanding)
- CLIP + MSE
- CLIP + Flow Matching (see the sketch after this list)
- VAE + Flow Matching
- Transfusion, LMFusion
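To make the objective choices above concrete, below is a minimal, illustrative sketch of a flow-matching loss on CLIP image features. The feature width, token count, and the tiny velocity network are placeholders standing in for the actual diffusion transformer, and conditioning on the autoregressive backbone is omitted:

```python
import torch
import torch.nn as nn

feature_dim = 1152   # placeholder CLIP/SigLIP feature width
num_tokens = 64      # placeholder number of visual feature tokens

# Tiny stand-in for the diffusion transformer that predicts the velocity field.
velocity_net = nn.Sequential(
    nn.Linear(feature_dim, 4 * feature_dim),
    nn.GELU(),
    nn.Linear(4 * feature_dim, feature_dim),
)

def flow_matching_loss(clip_features: torch.Tensor) -> torch.Tensor:
    """Rectified-flow style objective: regress the velocity from noise to the clean CLIP features."""
    noise = torch.randn_like(clip_features)
    t = torch.rand(clip_features.shape[0], 1, 1)      # per-sample timestep in (0, 1)
    x_t = (1 - t) * noise + t * clip_features         # linear interpolation between noise and data
    target_velocity = clip_features - noise           # d(x_t)/dt along the straight path
    return torch.mean((velocity_net(x_t) - target_velocity) ** 2)

# Example: a batch of 2 images, each encoded as 64 CLIP feature tokens.
loss = flow_matching_loss(torch.randn(2, num_tokens, feature_dim))
loss.backward()
```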
- Qwen-2.5-VL
- LLaMA 3
We suggest using Qwen-2.5-VL as the backbone; we are still fixing some tokenizer issues for LLaMA 3.
- Webdataset
- Json
Most of our training data is stored in WebDataset format and loaded with Hugging Face datasets. To download the datasets:
👉 Pretrain
You can download the datasets by
python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/BLIP3o-Pretrain', repo_type='dataset'))"
And load them directly with the Hugging Face WebDataset loader, pointing data_files at the downloaded .tar shards:
data_files = glob.glob(os.path.join('/path/to/BLIP3o-Pretrain', "*.tar"))
train_dataset = load_dataset("webdataset", data_files=data_files, split="train", num_proc=128)
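As a quick sanity check after loading, you can inspect one sample. The column names below ('jpg', 'txt') are typical WebDataset keys and are only an assumption here; check train_dataset.column_names for the actual fields in the shards:

```python
# Hypothetical inspection of the first sample; the field names depend on the tar contents.
print(train_dataset.column_names)   # e.g. ['__key__', '__url__', 'jpg', 'txt']
sample = train_dataset[0]
image = sample.get("jpg")           # decoded PIL image if the shards store .jpg files
caption = sample.get("txt")         # caption string if the shards store .txt files
print(type(image), caption)
```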
👉 BLIP3o-60k
You can download the instruction tuning dataset by
python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/BLIP3o-60k', repo_type='dataset'))"
💥 In general, BLIP3o-60k can help pretrained T2I models achieve a 5–7 point absolute score improvement on the GenEval and DPG benchmarks.
Figure: Qualitative results of BLIP3-o.
Model | Pretrain Data | GenEval | DPG | WISE |
---|---|---|---|---|
4B (open source) | 30 million open-source data | 0.81 | 79.36 | 0.50 |
8B (open source) | 30 million open-source data | 0.83 | 80.73 | 0.52 |
8B (paper reported) | 30 million open-source + 30 million proprietary data | 0.84 | 81.60 | 0.62 |
To cite the paper and model
@article{chen2025blip3,
title={BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset},
author={Chen, Jiuhai and Xu, Zhiyang and Pan, Xichen and Hu, Yushi and Qin, Can and Goldstein, Tom and Huang, Lifu and Zhou, Tianyi and Xie, Saining and Savarese, Silvio and others},
journal={arXiv preprint arXiv:2505.09568},
year={2025}
}