FG-CLIP: Fine-Grained Visual and Textual Alignment

This repository is the official implementation of FG-CLIP, a new generation of text-image cross-modal model that excels in fine-grained discrimination and embedding.

FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
arXiv ICML HF-model HF-data

🔥 News

Contents

Model Framework

FG-CLIP's training proceeds in two stages: the first stage leverages global-level caption-image pairs to achieve initial fine-grained alignment, while the second stage supplements these with region-level data, including detailed region captions and positive/negative region descriptions, to further refine the alignment.
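
The exact training objectives are defined in the paper and training code; as a rough, illustrative sketch of the two ingredients described above (global image-text contrastive alignment plus region-level discrimination against hard negative descriptions), the shapes involved look roughly like the following. The function names and loss forms here are illustrative assumptions, not the repository's API.

import torch
import torch.nn.functional as F

def global_contrastive_loss(image_feats, text_feats, logit_scale):
    # Stage 1 (illustrative): symmetric InfoNCE over matching image/caption pairs.
    image_feats = F.normalize(image_feats, dim=-1)   # (B, D)
    text_feats = F.normalize(text_feats, dim=-1)     # (B, D)
    logits = logit_scale * image_feats @ text_feats.T
    targets = torch.arange(image_feats.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def region_hard_negative_loss(region_feats, pos_text_feats, neg_text_feats, logit_scale):
    # Stage 2 (illustrative): each region embedding should score its positive region
    # description above its K hard negative descriptions.
    region_feats = F.normalize(region_feats, dim=-1)            # (N, D)
    pos = F.normalize(pos_text_feats, dim=-1).unsqueeze(1)      # (N, 1, D)
    neg = F.normalize(neg_text_feats, dim=-1)                   # (N, K, D)
    candidates = torch.cat([pos, neg], dim=1)                   # (N, 1+K, D)
    logits = logit_scale * torch.einsum("nd,nkd->nk", region_feats, candidates)
    targets = torch.zeros(region_feats.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)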

Install

conda create -n FGCLIP python=3.10 -y
conda activate FGCLIP
cd FG-CLIP && pip install -e .

Model Zoo

🤗 ViT-B@224px
🤗 ViT-L@336px

Quick Start 🤗

Load Model

import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)


model_root = "qihoo360/fg-clip-base"
image_size = 224
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()

device = model.device

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)
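
The snippet above assumes a CUDA device. If one is not available, a minor variation (not from the repository's docs) is to pick the device at runtime:

# Hedged variant: fall back to CPU when CUDA is unavailable.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).to(device)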

Retrieval

img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size, image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

# NOTE Short captions: max_length=77 && walk_short_pos=True
walk_short_pos = True
captions = ["a photo of a cat", "a photo of a dog"]
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)

# NOTE Long captions: max_length=248 && walk_short_pos=False
# ......

with torch.no_grad():
    image_feature = model.get_image_features(image_input)
    text_feature = model.get_text_features(caption_input, walk_short_pos=walk_short_pos)
    image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)

logits_per_image = image_feature @ text_feature.T
logits_per_image = model.logit_scale.exp() * logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
# [[9.9997e-01, 3.3485e-05]]
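
Following the NOTE above, the long-caption path only changes the tokenizer length (248) and sets walk_short_pos=False. A minimal sketch, where the caption text itself is just an illustrative example:

# Hedged sketch of long-caption retrieval: max_length=248, walk_short_pos=False.
long_captions = ["a photo of a white cat with dark patches lying on a wooden floor next to a sofa"]
long_caption_input = torch.tensor(
    tokenizer(long_captions, max_length=248, padding="max_length", truncation=True).input_ids,
    dtype=torch.long, device=device,
)

with torch.no_grad():
    long_text_feature = model.get_text_features(long_caption_input, walk_short_pos=False)
    long_text_feature = long_text_feature / long_text_feature.norm(p=2, dim=-1, keepdim=True)

long_logits = model.logit_scale.exp() * image_feature @ long_text_feature.T
print(long_logits)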

Dense Feature Visualization

import math
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt


img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size, image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

with torch.no_grad():
    dense_image_feature = model.get_image_dense_features(image_input)
    captions = ["white cat"]
    caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
    text_feature = model.get_text_features(caption_input, walk_short_pos=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)

similarity = dense_image_feature.squeeze() @ text_feature.squeeze().T
similarity = similarity.cpu().numpy()
patch_size = int(math.sqrt(similarity.shape[0]))

original_shape = (patch_size, patch_size)
show_image = similarity.reshape(original_shape)

plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title('Similarity Visualization')
plt.axis('off')
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat.png")
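
The saved heatmap is at patch resolution. To compare it against the input directly, the patch-level map can be upsampled to the image size and overlaid on the photo. This is a generic visualization sketch built on the variables above, not code from the repository, and the output filename is an assumption.

import torch.nn.functional as F

# Hedged sketch: upsample the patch-level similarity map and overlay it on the image.
sim = torch.from_numpy(show_image)[None, None]    # (1, 1, P, P) patch-level similarity
sim_up = F.interpolate(sim, size=(image_size, image_size), mode="bilinear", align_corners=False)
sim_up = sim_up.squeeze().numpy()

plt.figure(figsize=(6, 6))
plt.imshow(image)                 # PIL image already resized to image_size x image_size
plt.imshow(sim_up, alpha=0.5)     # semi-transparent similarity overlay
plt.axis('off')
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat_overlay.png")  # assumed output path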

Train

Data Preparation

We develop a high-quality visual grounding dataset named FineHARD, featuring precise region-specific captions and challenging negative samples. The dataset is curated from GRIT images. The process begins with generating detailed image captions using CogVLM2-19B, ensuring comprehensive and nuanced descriptions that capture the full context of each image. We then use SpaCy to parse the captions and extract referring expressions. The images and referring expressions are then fed into a pretrained object detection model, YOLO-World, to obtain the associated bounding boxes. Non-maximum suppression is applied to eliminate overlapping bounding boxes, retaining only those with predicted confidence scores higher than 0.4 (a sketch of this filtering step is shown below). This process yields 12 million images and 40 million bounding boxes with fine-grained region captions.
For data preparation, please refer to: [Data: FineHARD]
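
The box-filtering step described above (a 0.4 confidence threshold followed by non-maximum suppression) can be expressed with standard torchvision ops. This is an illustrative sketch of that single step, not the repository's data pipeline code; the IoU threshold is an assumption, since only the score cutoff is stated.

import torch
from torchvision.ops import nms

def filter_boxes(boxes, scores, iou_threshold=0.5, score_threshold=0.4):
    # Drop low-confidence detections, then suppress overlapping boxes.
    # Boxes are (x1, y1, x2, y2); the 0.5 IoU threshold is an assumption.
    keep_score = scores > score_threshold
    boxes, scores = boxes[keep_score], scores[keep_score]
    keep_nms = nms(boxes, scores, iou_threshold)
    return boxes[keep_nms], scores[keep_nms]

# Example with dummy detections: two heavily overlapping boxes and one low-confidence box.
boxes = torch.tensor([[0., 0., 100., 100.], [5., 5., 105., 105.], [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.85, 0.3])
print(filter_boxes(boxes, scores))  # keeps the 0.9 box, drops its overlap and the 0.3 box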

Ready for Training

Our training and inference code is built entirely on the Hugging Face transformers library, which makes it easy to use and to reproduce. We provide the training script in the scripts directory.
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
The training script supports DeepSpeed ZeRO-2, TF32 acceleration, and bf16 precision (note that fp16 precision may cause NaN gradients). If your setup does not support these, turn off TF32 and replace the DeepSpeed launcher with torchrun.

bash scripts/train.sh

Evaluation

Data Preparation

Download share-captioner_coco_lcs_sam_1246k_1107.json from the following link: https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json

Download the COCO captions from the following link and put them into data/coco/annotations/: https://github.com/tylin/coco-caption

Download COCO from the following link and put it into data/coco: https://cocodataset.org/dataset

Download the DCI captions from the following link and put them into data/densely_captioned_images: https://github.com/facebookresearch/DCI

Download ImageNet-1K from the following link and put it into data/IN1K_val: https://image-net.org/

Download ImageNet-V2 from the following link and put it into data/imagenetv2-matched-frequency-format-val: https://opendatalab.com/OpenDataLab/ImageNetV2/tree/main

FG-CLIP
├── ...
├── data
│   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── coco
│   │   ├── train2017
│   │   ├── val2017
│   │   ├── annotations
│   │   │   ├── captions_val2017.json
│   ├── fgovd
│   │   ├── h_attributes_llava.jsonl
│   │   ├── m_attributes_llava.jsonl
│   │   ├── e_attributes_llava.jsonl
│   │   ├── shuffle_negatives_llava.jsonl
│   ├── sam
│   │   ├── images
│   │   │   ├── 138
│   ├── densely_captioned_images
│   │   ├── annotations
│   │   │   ├── ...
│   ├── IN1K_val
│   │   ├── val
│   │   ├── imagenet2012_mapclsloc.txt
│   ├── imagenetv2-matched-frequency-format-val
├── ...

Long/Short caption image-text retrieval

bash scripts/eval/eval_long.sh
bash scripts/eval/eval_coco_retrieval.sh

Bounding box classification of COCO

bash scripts/eval/eval_coco_boxcls.sh

FG-OVD benchmark

bash scripts/eval/eval_fgovd.sh

Zero-Shot classification

bash scripts/eval/eval_in1k.sh
bash scripts/eval/eval_v2.sh

We Are Hiring

We are seeking academic interns in the Multimodal field. If interested, please send your resume to xiechunyu@360.cn.

Citation

If you find FG-CLIP useful for your research and applications, please cite using this BibTeX:

@article{xie2025fgclip,
      title={FG-CLIP: Fine-Grained Visual and Textual Alignment}, 
      author={Chunyu Xie and Bin Wang and Fanjing Kong and Jincheng Li and Dawei Liang and Gengshen Zhang and Dawei Leng and Yuhui Yin},
      year={2025},
      eprint={2505.05071},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.05071}, 
}

License

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache License 2.0.

Related Projects

This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!
