Koustuv Sinha¹, Michael Rabbat¹, Yann LeCun¹², Saining Xie², Zhuang Liu¹
¹FAIR Meta AI, ²New York University, ³Meta Reality Labs
🔥 Core Discovery: Visual generation capability naturally arises from understanding! With just 200K samples + co-training, LLMs can be taught to generate visual embeddings without extensive pretraining.
- 🚀 Understanding <=> Generation: Better VQA → Better Image Generation, and vice versa!
- 💪 Efficient: Co-training unlocks generation with only ~200K samples, versus the millions typically needed
- 🧠 LLM Capability Transfer: Generate rare concepts like "Chhogori" & solve visual puzzles
- 🎯 Minor Architectural Changes: Simple instruction tuning morphs an LLM into a unified model
Fun fact: MetaMorph is inspired by the biological process of metamorphosis! Transforming an LLM into a unified model is just like transforming a caterpillar into a butterfly :)
- [04/14/25] 🔥 We released MetaMorph! We also release GPU training scripts.
- Installation
- Model Weights and Play with MetaMorph
- MetaMorph Instruction Tuning Data
- Train
- Potential Questions & Answers
- Clone this repository and navigate into the codebase
```bash
git clone https://github.com/facebookresearch/metamorph
cd metamorph
```
- Install Packages
```bash
conda create -n metamorph python=3.11
conda activate metamorph
pip install -e .
```
We are still working on open-sourcing the model weights. In the meantime, for those who are able to train MetaMorph themselves, we wrote a detailed usage guide in `inference/DEMO.md` and provide a sample model loading and demo script in `inference/demo.py`.
MetaMorph mostly uses publicly available datasets for training. That said, this repo enables the usage of any data! See below for formatting instructions.
- ImageQA: We use Cambrian-7M from [Tong et al., 2024], a curated collection of publicly available instruction tuning datasets
Example: Prompt: <image_start><image><image_end> What is the animal in the image? Response: It is a burmilla cat.
- VideoQA: Includes VideoStar [Zohar et al., 2024] and ShareVideo [Zhang et al., 2024] datasets, processed at 1 FPS
Example: Prompt: <image_start><image><image_end> <image_start><image><image_end><image_start><image><image_end> What's the color of the dog in this video? (a) white (b) yellow (c) black Please only answer a single letter and nothing else Response: b
- Text-to-Image Data: Image-text pairs curated through the MetaCLIP pipeline [Xu et al., 2024]
  - Note: MetaCLIP data is not publicly available, but you can substitute with other public image-text pair datasets or text-to-image datasets such as:
    - CC12M (Conceptual 12M)
    - LAION
    - COYO
Example: Prompt: Generate an image of a puppy. Response: Here is an image based on your request: <image_start><image><image_end>
- Video Data:
  - HowTo100M [Miech et al., 2019]
  - SomethingSomethingV2 [Goyal et al., 2017]
Example (Forward Frame Prediction): Prompt: <image_start><image><image_end> Can you predict what happens in the next 3 frames, each 5 seconds apart? Response: <image_start><image><image_end> <image_start><image><image_end> <image_start><image><image_end>
- Visual Thinking Data: Curated from:
  - Visual CoT [Shao et al., 2024]
  - VStar [Wu and Xie, 2024]
Example: Prompt: Think visually before you answer the question: what is the color of the hat in the image <image_start><image><image_end> Response: Let's consider this visually: <image_start><image><image_end> black.
- Image-to-Image Data:
  - InstructPix2Pix [Brooks et al., 2023]
  - Aurora [Krojer et al., 2024]
Example: Prompt: <image_start><image><image_end> Make this a rainy day. Response: <image_start><image><image_end>
- All examples above show how to format your own data
- `<image_start>` and `<image_end>` are special tokens that indicate the presence of image tokens
- Images are internally processed into visual tokens, but are shown as `<image>` in the examples for clarity
- Each data sample should be formatted as a conversation with a prompt and response (a minimal sketch follows below)
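To make the formatting concrete, here is a minimal sketch of a single sample expressed as a Python dict. The keys `prompt` and `response` are illustrative assumptions rather than the repo's exact schema; follow the LLaVA-style format described in the Train section when preparing real data.

```python
# Illustrative only: one training sample as a Python dict.
# The keys "prompt" and "response" are assumptions for this sketch; follow the
# LLaVA-style conversation format described in the Train section for real data.
sample = {
    "prompt": "Generate an image of a puppy.",
    "response": "Here is an image based on your request: <image_start><image><image_end>",
}

# "<image>" is a readable placeholder; during training it is replaced by the
# visual tokens computed for the corresponding image.
```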
This section describes the training process, which follows the instruction tuning recipe developed in LLaVA-1.5 and Cambrian-1 and uses a 2-stage approach:
- Pretraining the MLP Connector: Connects vision and language.
- Fine-tuning: Optimizes both the LLM and the connector together.
We provide a detailed training breakdown in the linked document.
- For GPU training with limited resources, you can adjust `per_device_train_batch_size` and increase `gradient_accumulation_steps` to maintain the global batch size: Global Batch Size = `per_device_train_batch_size` × `gradient_accumulation_steps` × `num_gpus`.
The hyperparameters for both pretraining and fine-tuning stages are detailed below.
- Per GPU Batch Size: 14
| Base LLM | Global Batch Size | Learning Rate | Epochs | Max Length |
|---|---|---|---|---|
| LLaMA-3.1 8B | 448 | 3.74e-5 | 1 | 4096 |
- Per GPU Batch Size: 6
| Base LLM | Global Batch Size | Learning Rate | Epochs | Max Length |
|---|---|---|---|---|
| LLaMA-3.1 8B | 1536 | 6.93e-5 | 1 | 4096 |
To train with different batch sizes, use the following formula (from Cambrian-1) to adjust the learning rate:
$$\text{Optimal Learning Rate} = \text{Base Learning Rate} \times \sqrt{\frac{\text{Batch Size}}{\text{Base Batch Size}}}$$
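As a quick sanity check on both formulas, here is a small sketch; the GPU count and gradient accumulation values are placeholders, not the exact configuration used for the runs above.

```python
import math

# Placeholder per-device settings; substitute your own hardware configuration.
per_device_train_batch_size = 14
gradient_accumulation_steps = 1
num_gpus = 32  # hypothetical GPU count

# Global Batch Size = per_device_train_batch_size x gradient_accumulation_steps x num_gpus
global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(global_batch_size)  # 14 * 1 * 32 = 448, matching the first table above

# Square-root learning-rate scaling (from Cambrian-1) when the batch size changes.
base_lr, base_batch_size = 3.74e-5, 448
new_batch_size = 1536
optimal_lr = base_lr * math.sqrt(new_batch_size / base_batch_size)
print(f"{optimal_lr:.2e}")  # ~6.93e-05, consistent with the second table above
```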
For debugging / non-SLURM training, refer to the sample scripts: `scripts/pretrain_1node.sh` and `scripts/finetune_1node.sh`.
For multi-node training using a SLURM system, refer to the sample script provided: `scripts/slurm_train.sh`.
To train with other data sources or custom datasets, we support the commonly used LLaVA data format. For handling very large files, we recommend using the JSONL format instead of JSON. This approach optimizes memory usage through lazy data loading.
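As a rough illustration of why JSONL helps here, the sketch below indexes a JSONL file by byte offset and parses each sample only when it is accessed. This is not the repo's actual data loader, and the file name in the usage comment is hypothetical.

```python
import json

class LazyJsonlDataset:
    """Index a JSONL file by byte offset so samples are parsed only on access."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        with open(path, "rb") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip():
                    self.offsets.append(offset)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Seek directly to the requested line and decode only that sample.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return json.loads(f.readline())

# Hypothetical usage:
# dataset = LazyJsonlDataset("my_custom_data.jsonl")
# print(len(dataset), dataset[0])
```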
We provide the following resources in the `visualization` folder to help with training visualization:

- `train.py`: The main training script.
- `submit.sh`: A SLURM submission script for distributed training.
- `Train_Visualization.md`: A detailed guide to help you set up and understand the training visualization process.
Here are some questions we received from colleagues and friends, along with their corresponding answers:
Q: Why is the CFG level used in fine-tuning the diffusion model so high (>0.7)?
A: This is because we start with a pretrained diffusion model and aim to transform the conditioning from CLIP text to SigLIP Image. A higher CFG level (e.g., 0.7) ensures the model continues generating high-quality images while gradually adapting to the new conditioning in the remaining fraction (e.g., 0.3). Empirically, this approach yields the best balance in terms of image quality.
Q: Why do you use 64 tokens for one image? Can I use more tokens?
A: Yes, you can use more tokens! We chose 64 tokens to allow training with longer videos. This configuration enables us to include more frames within a limited context length, making it optimal for our use case.
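For a back-of-the-envelope sense of why 64 tokens per image helps with video, assuming the 4096 max sequence length from the hyperparameter tables above and ignoring text tokens:

```python
# Rough frame budget at 64 visual tokens per image, assuming a 4096-token
# context and ignoring the text tokens a real sample also needs.
max_length = 4096
tokens_per_image = 64
print(max_length // tokens_per_image)  # up to 64 frames before any text
```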
Q: I don’t have access to MetaCLIP data. Can I still train the model?
A: Absolutely! You can substitute MetaCLIP with publicly available image-text datasets such as CC12M, LAION, or COYO. Our method is a general framework for converting LLMs into unified multimodal models and does not rely on proprietary datasets. Just make sure to format the data into instruction-tuning style pairs as described in the paper.
Q: How many image-text pairs do I need for training?
A: For co-training, good results can be achieved with as few as 200K image-text pairs. However, if you are training solely on generation data without co-training, you may need a larger dataset.
Q: Do I need all the video datasets for training?
A: No, video datasets are not strictly required. They are used to showcase the model’s capabilities. You can start with image-text pairs and ImageQA data, which should suffice for initial training.
Q: Will you open-source the model weights?
A: We are working through the legal approval process for the weights, so stay tuned :)
If you find MetaMorph useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{tong2024metamorph,
  title={Metamorph: Multimodal understanding and generation via instruction tuning},
  author={Tong, Shengbang and Fan, David and Zhu, Jiachen and Xiong, Yunyang and Chen, Xinlei and Sinha, Koustuv and Rabbat, Michael and LeCun, Yann and Xie, Saining and Liu, Zhuang},
  journal={arXiv preprint arXiv:2412.14164},
  year={2024}
}
```
The majority of MetaMorph is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: Cambrian-1 and LLaVA are licensed under the Apache 2.0 license.
- LLaVA: Part of the codebase starts from the amazing LLaVA.
- Cambrian-1: Part of the codebase starts from Cambrian-1.
- LLaMA: We thank the LLaMA team for their continued contributions to the open-source community and for providing the LLaMA-3 checkpoints.