Official code repo for our work Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources.
This repo supports:
- Data quality score generation with DFN/CLIP and MLM-Filter
- High-quality data selection based on the quality scores and resharding into webdataset format
- Multimodal Sequence Packing for large-scale image-text datasets in webdataset format (supporting both caption data and interleaved data)
- Pre-training with packed multimodal sequences
- Supervised fine-tuning on both small-scale SFT data like LLaVA-665k and large-scale SFT data like MAmmoTH-VL-10M
- Evaluation on a series of multimodal benchmarks
- [3/31/2025] 🔥 We released the instruction-tuned and pre-trained base model checkpoints at Open-Qwen2VL and Open-Qwen2VL-Base.
- [3/31/2025] 🔥 We released all pre-training data in webdataset format at Open-Qwen2VL-Data.
- [3/31/2025] 🔥 We released the technical report for Open-Qwen2VL.
conda create -n openqwen2vl python=3.10
conda activate openqwen2vl
pip install -e prismatic-vlms
If you need to pre-train or SFT the MLLM, install flash-attention:
pip install flash-attn --no-build-isolation
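Optionally, you can sanity-check the environment before launching a run. This is a minimal sketch (assuming a CUDA GPU, which flash-attn requires), not part of the repo's scripts:

```python
# Optional environment check: CUDA availability, bf16 support, flash-attn import.
import torch
import flash_attn

print("CUDA available:", torch.cuda.is_available())
print("bf16 supported:", torch.cuda.is_bf16_supported())
print("flash-attn version:", flash_attn.__version__)
```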
import requests
import torch
from PIL import Image
from prismatic import load
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# Load a pretrained VLM (either local path, or ID to auto-download from the HF Hub)
vlm = load("Open-Qwen2VL")
vlm.to(device, dtype=torch.bfloat16)
# Download an image and specify a prompt
image_url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
image = [vlm.vision_backbone.image_transform(Image.open(requests.get(image_url, stream=True).raw).convert("RGB")).unsqueeze(0)]
user_prompt = "<image>\nDescribe the image."
# Generate!
generated_text = vlm.generate_batch(
    image,
    [user_prompt],
    do_sample=False,
    max_new_tokens=512,
    min_length=1,
)
print(generated_text[0])
We have released all our pre-training image-text caption data in webdataset format at Open-Qwen2VL-Data. Please download it with `huggingface-cli download` or directly with `git clone`.
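Alternatively, here is a minimal sketch using the huggingface_hub Python API; the repo id and local directory below are assumptions, so replace them with the values shown on the Open-Qwen2VL-Data dataset page:

```python
# Sketch: download the webdataset shards via the huggingface_hub API.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="weizhiwang/Open-Qwen2VL-Data",  # assumed repo id; check the dataset page
    repo_type="dataset",
    local_dir="./Open-Qwen2VL-Data",         # hypothetical local directory
)
```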
Then please run
bash mm_sequence_packing/multiprocess_sequence_packing_image_to_pil.sh 0 4 504 datacomp
bash mm_sequence_packing/multiprocess_sequence_packing_image_to_pil.sh 0 4 326 ccs
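For intuition, multimodal sequence packing greedily concatenates image-text samples into fixed-length token sequences so that little context is wasted on padding. Below is a minimal first-fit sketch of the idea, not the script's actual implementation; the 4096-token budget and the sample fields are assumptions:

```python
# Conceptual sketch of first-fit multimodal sequence packing.
# Assumption: each sample knows its total token count (text + image patch tokens).
MAX_LEN = 4096  # hypothetical context length

def pack(samples):
    bins = []  # each bin: {"samples": [...], "used": token count}
    for s in sorted(samples, key=lambda x: x["num_tokens"], reverse=True):
        if s["num_tokens"] > MAX_LEN:
            continue  # the real pipeline may truncate or split over-long samples
        for b in bins:
            if b["used"] + s["num_tokens"] <= MAX_LEN:
                b["samples"].append(s)
                b["used"] += s["num_tokens"]
                break
        else:
            bins.append({"samples": [s], "used": s["num_tokens"]})
    return bins

packed = pack([{"id": i, "num_tokens": 300 + 40 * (i % 9)} for i in range(1000)])
fill = sum(b["used"] for b in packed) / (len(packed) * MAX_LEN)
print(f"{len(packed)} packed sequences, fill rate {fill:.2%}")
```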
Prior to training, please follow the sequence packing instructions in the README to prepare the pickle files for each subdataset.
Then please run the training script
bash prismatic-vlms/train.sh ${CKPTID} ${STAGE} ${BSZ} ${PER_GPU_BSZ}
Here are the parameters for training:
- `CKPTID`: the ID for the saved checkpoint;
- `STAGE`: choose between `pretrain` and `full-pretrain`; `full-pretrain` makes the vision encoder trainable as well;
- `BSZ`: the global batch size;
- `PER_GPU_BSZ`: the batch size for each GPU. If global_bsz != num_gpus * per_gpu_bsz, gradient accumulation will be applied.
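For example, with hypothetical numbers, the gradient accumulation implied by these arguments is easy to check:

```python
# Hypothetical run: 8 GPUs, global batch size 256, per-GPU batch size 8.
num_gpus, global_bsz, per_gpu_bsz = 8, 256, 8
assert global_bsz % (num_gpus * per_gpu_bsz) == 0
grad_accum_steps = global_bsz // (num_gpus * per_gpu_bsz)
print(grad_accum_steps)  # 4 micro-batches are accumulated before each optimizer step
```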
Please first download and unzip the images of MAmmoTH-VL-10M. Then run python data_prepare/split_mammoth_10m.py to split each instruction example into a single JSON file.
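For reference, the split roughly amounts to the following (a sketch only; data_prepare/split_mammoth_10m.py is the authoritative script, and the file paths and "id" field below are assumptions):

```python
# Sketch: split one large annotation JSON into one file per instruction example.
import json, os

src = "mammoth_vl_10m.json"       # hypothetical combined annotation file
out_dir = "mammoth_vl_10m_split"  # hypothetical output directory
os.makedirs(out_dir, exist_ok=True)

with open(src) as f:
    examples = json.load(f)  # note: loading 10M examples at once needs ample RAM

for idx, example in enumerate(examples):
    name = str(example.get("id", idx))
    with open(os.path.join(out_dir, f"{name}.json"), "w") as f:
        json.dump(example, f)
```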
Please run the training script
bash prismatic-vlms/fine_tune_mammoth.sh ${CKPT_PATH} ${CKPTID}
Here are the parameters for training:
- `CKPT_PATH`: the path to the pre-trained MLLM checkpoint after the pre-training stage;
- `CKPTID`: the ID for the saved checkpoint.
Then please run the training script
bash prismatic-vlms/fine_tune.sh ${CKPT_PATH} ${CKPTID} ${DATAPATH}
Here are the parameters for training:
- `CKPT_PATH`: the path to the pre-trained MLLM checkpoint after the pre-training stage;
- `CKPTID`: the ID for the saved checkpoint;
- `DATAPATH`: the path to the SFT dataset JSON file.
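The SFT JSON is expected to follow the LLaVA-style conversation format; the record below (written as a Python literal) is an illustrative assumption based on LLaVA-665k, and the exact field names may differ:

```python
# Assumed LLaVA-665k-style SFT record; field names are an illustration, not a spec.
sft_example = {
    "id": "example-0001",
    "image": "images/example-0001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the image."},
        {"from": "gpt", "value": "A short description of the image goes here."},
    ],
}
```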
Please follow docs/Eval.md for evaluation instructions.
Please follow docs/DATA_Filter.md for data filtering instructions.
Please cite our paper if you find this repository interesting or helpful:
@article{Open-Qwen2VL,
title={Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources},
author={Wang, Weizhi and Tian, Yu and Yang, Linjie and Wang, Heng and Yan, Xifeng},
journal={arXiv preprint arXiv:2504.00595},
year={2025}
}
Our codebase is developed based on prismatic-vlms and vlm-evaluation.