This repository is the official implementation of FG-CLIP, a new-generation text-image cross-modal model that excels in fine-grained discrimination and embedding.
FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
- [2025/05/13] We have uploaded the dataset FineHARD (Fine-Grained Visual Grounding+Recaption+Hard Negative Dataset) to 🤗 Hugging Face (https://huggingface.co/datasets/qihoo360/FineHARD)!
- [2025/05/09] We have uploaded the model to 🤗 Hugging Face (https://huggingface.co/qihoo360/fg-clip-large), which supports quick and easy usage!
- [2025/05/09] We have updated the FG-CLIP GitHub repository; you can now test our models!
- [2025/05/09] We released the FG-CLIP paper: Fine-Grained Visual and Textual Alignment.
- [2025/05/02] FG-CLIP has been accepted by ICML'25.
FG-CLIP's training proceeds in two stages: the first stage leverages global-level caption-image pairs to achieve initial fine-grained alignment, while the second stage adds region-level supervision, including detailed region captions and positive/negative region descriptions, to further refine the alignment.
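As a rough illustration of the stage-1 objective, the sketch below shows a standard CLIP-style symmetric contrastive loss over global image and text embeddings. This is a minimal formulation of our own for illustration, not the repository's training code; stage 2 adds analogous region-level terms (region-caption contrast and positive/negative description discrimination) on top of it.

import torch
import torch.nn.functional as F

def global_contrastive_loss(image_feats, text_feats, logit_scale):
    # Symmetric InfoNCE over L2-normalized global image/text embeddings
    # (the standard CLIP-style objective, shown only to illustrate stage 1).
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * image_feats @ text_feats.T
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2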
conda create -n FGCLIP python=3.10 -y
conda activate FGCLIP
cd FG-CLIP && pip install -e .
🤗 ViT-B@224px
🤗 ViT-L@336px
import torch
from PIL import Image
from transformers import (
AutoImageProcessor,
AutoTokenizer,
AutoModelForCausalLM,
)
model_root = "qihoo360/fg-clip-base"
image_size=224
model = AutoModelForCausalLM.from_pretrained(model_root,trust_remote_code=True).cuda()
device = model.device
tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)
img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))
image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)
# NOTE Short captions: max_length=77 && walk_short_pos=True
walk_short_pos = True
captions=["a photo of a cat", "a photo of a dog"]
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
# NOTE Long captions: max_length=248 && walk_short_pos=False
# ......
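# Illustrative sketch (not part of the original demo): encoding a long caption with the
# settings noted above. The caption text here is a made-up example.
long_captions = ["A fluffy white cat with light-colored eyes lying next to an orange tabby cat on a dark sofa."]
long_caption_input = torch.tensor(tokenizer(long_captions, max_length=248, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
with torch.no_grad():
    long_text_feature = model.get_text_features(long_caption_input, walk_short_pos=False)
# The demo below continues with the short captions defined above.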
with torch.no_grad():
    image_feature = model.get_image_features(image_input)
    text_feature = model.get_text_features(caption_input, walk_short_pos=walk_short_pos)
image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
logits_per_image = image_feature @ text_feature.T
logits_per_image = model.logit_scale.exp() * logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
# [[9.9997e-01, 3.3485e-05]]
import math
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))
image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)
with torch.no_grad():
    dense_image_feature = model.get_image_dense_features(image_input)
captions = ["white cat"]
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
text_feature = model.get_text_features(caption_input, walk_short_pos=True)
text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)
similarity = dense_image_feature.squeeze() @ text_feature.squeeze().T
similarity = similarity.cpu().numpy()
patch_size = int(math.sqrt(similarity.shape[0]))
original_shape = (patch_size, patch_size)
show_image = similarity.reshape(original_shape)
plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title('Similarity Visualization')
plt.axis('off')
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat.png")
We develop a high-quality visual grounding dataset named FineHARD, featuring precise region-specific captions and challenging negative samples. The dataset is curated from GRIT images. The process begins with generating detailed image captions using CogVLM2-19B, ensuring comprehensive and nuanced descriptions that capture the full context of each image. We then use spaCy to parse the captions and extract referring expressions. The images and referring expressions are subsequently fed into the pretrained object detection model YOLO-World to obtain the associated bounding boxes. Non-maximum suppression is applied to eliminate overlapping bounding boxes, and only detections with predicted confidence scores higher than 0.4 are retained. This process yields 12 million images and 40 million bounding boxes with fine-grained region captions.
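The snippet below is a minimal sketch of the expression-extraction and box-filtering steps described above. It assumes spaCy noun chunks as a stand-in for the referring-expression parsing and leaves the CogVLM2-19B captioning and YOLO-World detection calls out of scope; the 0.4 confidence threshold follows the description, while the NMS IoU threshold is an assumed value.

import spacy
import torch
from torchvision.ops import nms

nlp = spacy.load("en_core_web_sm")

def extract_referring_expressions(caption: str):
    # Approximate referring-expression extraction by taking the noun phrases
    # of the detailed image caption.
    return [chunk.text for chunk in nlp(caption).noun_chunks]

def filter_detections(boxes: torch.Tensor, scores: torch.Tensor,
                      score_thr: float = 0.4, iou_thr: float = 0.5):
    # Keep detections above the confidence threshold (0.4, as described above),
    # then remove overlapping boxes with non-maximum suppression.
    keep = scores > score_thr
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thr)
    return boxes[kept], scores[kept]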
For data preparation, please refer to: [Data: FineHARD]
Our training and inference code is built entirely on the Transformers library provided by Hugging Face, which makes it easy to use and reproduce. The training script is provided in the scripts directory.
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
Our training script supports DeepSpeed ZeRO-2, TF32 acceleration, and bf16 precision (note that fp16 precision may cause NaN gradients). If your environment does not support these features, turn off TF32 and replace the DeepSpeed launcher with torchrun.
bash scripts/train.sh
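For reference, the TF32 switches mentioned above correspond to the following PyTorch flags (an illustration only; in practice the provided script and DeepSpeed configuration control these settings):

import torch

# Disable TF32 matmul/conv kernels on GPUs that do not support them.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False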
Download share-captioner_coco_lcs_sam_1246k_1107.json from the following link: https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json
Download the COCO captions from the following link and put them into data/coco/annotations/: https://github.com/tylin/coco-caption
Download the COCO images from the following link and put them into data/coco: https://cocodataset.org/dataset
Download the DCI captions from the following link and put them into data/densely_captioned_images: https://github.com/facebookresearch/DCI
Download ImageNet-1K from the following link and put it into data/IN1K_val: https://image-net.org/
Download ImageNet-v2 from the following link and put it into data/imagenetv2-matched-frequency-format-val: https://opendatalab.com/OpenDataLab/ImageNetV2/tree/main
FG-CLIP
├── ...
├── data
│   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── coco
│   │   ├── train2017
│   │   ├── val2017
│   │   └── annotations
│   │       └── captions_val2017.json
│   ├── fgovd
│   │   ├── h_attributes_llava.jsonl
│   │   ├── m_attributes_llava.jsonl
│   │   ├── e_attributes_llava.jsonl
│   │   └── shuffle_negatives_llava.jsonl
│   ├── sam
│   │   └── images
│   │       └── 138
│   ├── densely_captioned_images
│   │   └── annotations
│   │       └── ...
│   ├── IN1K_val
│   │   ├── val
│   │   └── imagenet2012_mapclsloc.txt
│   └── imagenetv2-matched-frequency-format-val
└── ...
bash scripts/eval/eval_long.sh
bash scripts/eval/eval_coco_retrieval.sh
bash scripts/eval/eval_coco_boxcls.sh
bash scripts/eval/eval_fgovd.sh
bash scripts/eval/eval_in1k.sh
bash scripts/eval/eval_v2.sh
We are seeking academic interns in the multimodal field. If you are interested, please send your resume to xiechunyu@360.cn.
If you find FG-CLIP useful for your research and applications, please cite using this BibTeX:
@article{xie2025fgclip,
title={FG-CLIP: Fine-Grained Visual and Textual Alignment},
author={Chunyu Xie and Bin Wang and Fanjing Kong and Jincheng Li and Dawei Liang and Gengshen Zhang and Dawei Leng and Yuhui Yin},
year={2025},
eprint={2505.05071},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.05071},
}
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of those original licenses. The content of this project itself is licensed under the Apache License 2.0.
This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!