This repository is the official implementation of FG-CLIP, a new-generation text-image cross-modal model that excels in fine-grained discrimination and embedding.
FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
- [2025/05/13] We have uploaded the dataset FineHARD (Fine-Grained Visual Grounding+Recaption+Hard Negative Dataset) to 🤗 Hugging Face (https://huggingface.co/datasets/qihoo360/FineHARD)!
- [2025/05/09] We have uploaded the model to 🤗 Hugging Face (https://huggingface.co/qihoo360/fg-clip-large), which supports quick and easy usage!
- [2025/05/09] We have updated the FG-CLIP GitHub repository; you can now test our models!
- [2025/05/09] We released the FG-CLIP paper: Fine-Grained Visual and Textual Alignment.
- [2025/05/02] FG-CLIP has been accepted by ICML'25.
FG-CLIP's training proceeds in two stages: the first stage leverages global-level caption-image pairs to achieve initial fine-grained alignment, while the second stage adds region-level supervision, including detailed region captions and positive/negative region descriptions, to further refine the alignment.
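As a rough illustration of the stage-1 objective, the sketch below shows a standard CLIP-style symmetric contrastive loss over global image and text embeddings. This is a minimal formulation of our own for illustration, not the repository's training code; stage 2 adds analogous region-level terms (region-caption contrast and positive/negative description discrimination) on top of it.

import torch
import torch.nn.functional as F

def global_contrastive_loss(image_feats, text_feats, logit_scale):
    # Symmetric InfoNCE over L2-normalized global image/text embeddings
    # (the standard CLIP-style objective, shown only to illustrate stage 1).
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * image_feats @ text_feats.T
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2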
conda create -n FGCLIP python=3.10 -y
conda activate FGCLIP
cd FG-CLIP && pip install -e .
🤗 ViT-B@224px
🤗 ViT-L@336px
import torch
from PIL import Image
from transformers import (
AutoImageProcessor,
AutoTokenizer,
AutoModelForCausalLM,
)
model_root = "qihoo360/fg-clip-base"
image_size=224
model = AutoModelForCausalLM.from_pretrained(model_root,trust_remote_code=True).cuda()
device = model.device
tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)
img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))
image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)
# NOTE Short captions: max_length=77 && walk_short_pos=True
walk_short_pos = True
captions=["a photo of a cat", "a photo of a dog"]
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
# NOTE Long captions: max_length=248 && walk_short_pos=False
# ......
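# Illustrative sketch (not part of the original demo): encoding a long caption with the
# settings noted above. The caption text here is a made-up example.
long_captions = ["A fluffy white cat with light-colored eyes lying next to an orange tabby cat on a dark sofa."]
long_caption_input = torch.tensor(tokenizer(long_captions, max_length=248, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
with torch.no_grad():
    long_text_feature = model.get_text_features(long_caption_input, walk_short_pos=False)
# The demo below continues with the short captions defined above.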
with torch.no_grad():
    image_feature = model.get_image_features(image_input)
    text_feature = model.get_text_features(caption_input, walk_short_pos=walk_short_pos)
image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
logits_per_image = image_feature @ text_feature.T
logits_per_image = model.logit_scale.exp() * logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
# [[9.9997e-01, 3.3485e-05]]
import math
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))
image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)
with torch.no_grad():
    dense_image_feature = model.get_image_dense_features(image_input)
captions = ["white cat"]
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
text_feature = model.get_text_features(caption_input, walk_short_pos=True)
text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)
similarity = dense_image_feature.squeeze() @ text_feature.squeeze().T
similarity = similarity.cpu().numpy()
patch_size = int(math.sqrt(similarity.shape[0]))
original_shape = (patch_size, patch_size)
show_image = similarity.reshape(original_shape)
plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title('Similarity Visualization')
plt.axis('off')
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat.png")
We develop a high-quality visual grounding dataset named FineHARD, featuring precise region-specific captions and challenging negative samples. The dataset is curated from GRIT images. The process begins with generating detailed image captions using CogVLM2-19B, ensuring comprehensive and nuanced descriptions that capture the full context of each image. We then use spaCy to parse the captions and extract referring expressions. The images and referring expressions are subsequently fed into the pretrained object detection model YOLO-World to obtain the associated bounding boxes. Non-maximum suppression is applied to eliminate overlapping bounding boxes, and only detections with predicted confidence scores higher than 0.4 are retained. This process yields 12 million images and 40 million bounding boxes with fine-grained region captions.
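The snippet below is a minimal sketch of the expression-extraction and box-filtering steps described above. It assumes spaCy noun chunks as a stand-in for the referring-expression parsing and leaves the CogVLM2-19B captioning and YOLO-World detection calls out of scope; the 0.4 confidence threshold follows the description, while the NMS IoU threshold is an assumed value.

import spacy
import torch
from torchvision.ops import nms

nlp = spacy.load("en_core_web_sm")

def extract_referring_expressions(caption: str):
    # Approximate referring-expression extraction by taking the noun phrases
    # of the detailed image caption.
    return [chunk.text for chunk in nlp(caption).noun_chunks]

def filter_detections(boxes: torch.Tensor, scores: torch.Tensor,
                      score_thr: float = 0.4, iou_thr: float = 0.5):
    # Keep detections above the confidence threshold (0.4, as described above),
    # then remove overlapping boxes with non-maximum suppression.
    keep = scores > score_thr
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thr)
    return boxes[kept], scores[kept]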
For data preparation, please refer to: [Data: FineHARD]
Our training and inference code is built entirely on the Transformers library provided by Hugging Face, which makes it easy to use and reproduce. The training script is provided in the scripts directory.
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
Our training script supports DeepSpeed ZeRO-2, TF32 acceleration, and bf16 precision (note that fp16 precision may cause NaN gradients). If your environment does not support these features, turn off TF32 and replace the DeepSpeed launcher with torchrun.
bash scripts/train.sh
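For reference, the TF32 switches mentioned above correspond to the following PyTorch flags (an illustration only; in practice the provided script and DeepSpeed configuration control these settings):

import torch

# Disable TF32 matmul/conv kernels on GPUs that do not support them.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False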
Download share-captioner_coco_lcs_sam_1246k_1107.json from the following link: https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json
Download the COCO captions from the following link and put them into data/coco/annotations/: https://github.com/tylin/coco-caption
Download the COCO images from the following link and put them into data/coco: https://cocodataset.org/dataset
Download the DCI captions from the following link and put them into data/densely_captioned_images: https://github.com/facebookresearch/DCI
Download ImageNet-1K from the following link and put it into data/IN1K_val: https://image-net.org/
Download ImageNet-v2 from the following link and put it into data/imagenetv2-matched-frequency-format-val: https://opendatalab.com/OpenDataLab/ImageNetV2/tree/main
FG-CLIP
├── ...
├── data
│   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── coco
│   │   ├── train2017
│   │   ├── val2017
│   │   └── annotations
│   │       └── captions_val2017.json
│   ├── fgovd
│   │   ├── h_attributes_llava.jsonl
│   │   ├── m_attributes_llava.jsonl
│   │   ├── e_attributes_llava.jsonl
│   │   └── shuffle_negatives_llava.jsonl
│   ├── sam
│   │   └── images
│   │       └── 138
│   ├── densely_captioned_images
│   │   └── annotations
│   │       └── ...
│   ├── IN1K_val
│   │   ├── val
│   │   └── imagenet2012_mapclsloc.txt
│   └── imagenetv2-matched-frequency-format-val
└── ...
bash scripts/eval/eval_long.sh
bash scripts/eval/eval_coco_retrieval.sh
bash scripts/eval/eval_coco_boxcls.sh
bash scripts/eval/eval_fgovd.sh
bash scripts/eval/eval_in1k.sh
bash scripts/eval/eval_v2.sh
We are seeking academic interns in the multimodal field. If you are interested, please send your resume to xiechunyu@360.cn.
If you find FG-CLIP useful for your research and applications, please cite using this BibTeX:
@article{xie2025fgclip,
title={FG-CLIP: Fine-Grained Visual and Textual Alignment},
author={Chunyu Xie and Bin Wang and Fanjing Kong and Jincheng Li and Dawei Liang and Gengshen Zhang and Dawei Leng and Yuhui Yin},
year={2025},
eprint={2505.05071},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.05071},
}
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of those original licenses. The content of this project itself is licensed under the Apache License 2.0.
This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!