This repository is an Ultralytics fork of Apple's official implementation of *MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training* (CVPR 2024) by Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, and Oncel Tuzel.
It provides code for inference, training, and evaluation of MobileCLIP models trained on DataCompDR datasets.
- Update 2024/11/22: iOS app released for real-time zero-shot image classification with MobileCLIP. Explore the iOS app.
- Update 2024/06/13: Training scripts for OpenCLIP on DataCompDR datasets are now available. See training/.
- Update 2024/06/13: MobileCLIP models and DataCompDR datasets are hosted on Hugging Face in the MobileCLIP/DataCompDR Collection.
- The smallest variant, MobileCLIP-S0, achieves comparable zero-shot performance to OpenAI's ViT-B/16 while being 4.8x faster and 2.8x smaller.
- MobileCLIP-S2 surpasses SigLIP's ViT-B/16 in average zero-shot performance, is 2.3x faster and 2.1x smaller, and is trained with 3x fewer seen samples.
- MobileCLIP-B (LT) attains a zero-shot ImageNet accuracy of 77.2%, outperforming recent models like DFN, SigLIP, and even OpenAI's ViT-L/14@336.
- A dedicated iOS app demonstrates real-time performance on mobile devices.
To set up the environment and install the package:

```bash
conda create -n clipenv python=3.10
conda activate clipenv
pip install -e .
```
To download pretrained checkpoints:

```bash
source get_pretrained_models.sh  # Downloads files to the `checkpoints` directory.
```
To use models from the official repository:

```python
import torch
from PIL import Image
import mobileclip

model, _, preprocess = mobileclip.create_model_and_transforms("mobileclip_s0", pretrained="/path/to/mobileclip_s0.pt")
tokenizer = mobileclip.get_tokenizer("mobileclip_s0")

image = preprocess(Image.open("docs/fig_accuracy_latency.png").convert("RGB")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
For an example of loading data from Hugging Face, see hf_dataset_example.py.
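For quick experimentation, streaming a DataCompDR split directly with the Hugging Face `datasets` library might look like the sketch below. The dataset id `apple/DataCompDR-12M` and the inspected fields are assumptions here; `hf_dataset_example.py` is the authoritative reference.

```python
# Hedged sketch: stream one record from a DataCompDR dataset on the Hub.
# The dataset id "apple/DataCompDR-12M" is an assumption; see
# hf_dataset_example.py for the repository's own loading code.
from datasets import load_dataset

dataset = load_dataset("apple/DataCompDR-12M", split="train", streaming=True)
sample = next(iter(dataset))  # fetch a single record without a full download
print(sample.keys())
```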
MobileCLIP models are natively supported in OpenCLIP. To use them, install `open_clip` and `timm` from source:

```bash
conda create -n clipenv python=3.10
conda activate clipenv
pip install git+https://github.com/mlfoundations/open_clip
pip install git+https://github.com/huggingface/pytorch-image-models
```
Example inference:

```python
import open_clip
from mobileclip.modules.common.mobileone import reparameterize_model

model, _, preprocess = open_clip.create_model_and_transforms("MobileCLIP-S2", pretrained="datacompdr")
tokenizer = open_clip.get_tokenizer("MobileCLIP-S2")

# For inference/model exporting, reparameterize first
model.eval()
model = reparameterize_model(model)

# ... follow further examples in the OpenCLIP repository ...
```
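Once reparameterized, the model behaves like any other OpenCLIP model. Below is a minimal zero-shot sketch reusing `model`, `preprocess`, and `tokenizer` from the snippet above; the image path and captions are placeholders.

```python
import torch
from PIL import Image

image = preprocess(Image.open("docs/fig_accuracy_latency.png").convert("RGB")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize and compare, exactly as in the mobileclip example above.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```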
Available variants in OpenCLIP:

- MobileCLIP-S1 (`datacompdr`)
- MobileCLIP-S2 (`datacompdr`)
- MobileCLIP-B (`datacompdr`)
- MobileCLIP-B (`datacompdr_lt`)
Comprehensive evaluation results are available in the results directory.
To reproduce results, use the provided script for zero-shot evaluation on the ImageNet-1k dataset.
For evaluation on all 38 datasets, follow the instructions in the DataComp repository.
```bash
# Run evaluation with a single GPU
python eval/zeroshot_imagenet.py --model-arch mobileclip_s0 --model-path /path/to/mobileclip_s0.pt
```
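Conceptually, zero-shot evaluation builds a text classifier from class-name prompts and scores image embeddings against it. The simplified sketch below illustrates the idea; the function names, prompt template, and data loader are placeholders, not the actual implementation in `eval/zeroshot_imagenet.py`.

```python
import torch

def build_zeroshot_classifier(model, tokenizer, classnames, template="a photo of a {}"):
    """Encode one prompt per class and L2-normalize the embeddings."""
    with torch.no_grad():
        text = tokenizer([template.format(name) for name in classnames])
        weights = model.encode_text(text)
        weights /= weights.norm(dim=-1, keepdim=True)
    return weights  # shape: (num_classes, embed_dim)

def zeroshot_top1_accuracy(model, classifier, loader):
    """Score normalized image embeddings against the class-prompt classifier."""
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            features = model.encode_image(images)
            features /= features.norm(dim=-1, keepdim=True)
            preds = (features @ classifier.T).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total
```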
Compare with other models using the OpenCLIP Results CSV.
| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets | PyTorch Checkpoint |
| --- | --- | --- | --- | --- | --- | --- |
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 | mobileclip_s0.pt |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 | mobileclip_s1.pt |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 | mobileclip_s2.pt |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 | mobileclip_b.pt |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 | mobileclip_blt.pt |
Note: MobileCLIP-B (LT) is trained for 300k iterations with a constant learning rate schedule and 300k iterations with a cosine learning rate schedule.
If you find this code useful, please cite:

```bibtex
@InProceedings{mobileclip2024,
  author    = {Vasu, Pavan Kumar Anasosalu and Pouransari, Hadi and Faghri, Fartash and Vemulapalli, Raviteja and Tuzel, Oncel},
  title     = {MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
}
```
This codebase builds upon multiple open-source contributions. See ACKNOWLEDGEMENTS for details.
We welcome your contributions! If you have suggestions, improvements, or want to get involved, please open an issue or submit a pull request.