Official PyTorch implementation of [Decoupled Global-Local Alignment for Improving Compositional Understanding](https://arxiv.org/abs/2504.16801)
Decoupled Global-Local Alignment for Improving Compositional Understanding
Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang
DeGLA is a novel fine-tuning framework designed to enhance CLIP's compositional understanding while mitigating the catastrophic forgetting of pre-trained knowledge that often occurs during fine-tuning. To achieve this, DeGLA combines a more effective negative-sample generation pipeline with an innovative training framework. Experimental results demonstrate that our approach establishes a new state of the art in both compositional understanding and general performance. For any inquiries, please contact xiaoxinghhh@gmail.com or raise an issue. Thank you for your attention.
- [2025/04/24]:✨The training code and pretrained weights of DeGLA have been released.
- [2025/04/24]:✨The DeGLA paper has been submitted to arXiv.
- We propose a simple yet effective negative caption generation pipeline that harnesses the in-context learning capability of Large Language Models (LLMs) to produce high-quality negative captions, facilitating hard-negative-based fine-tuning.
- We introduce the DeGLA framework, which employs a self-distillation mechanism within the global alignment to maintain the model's inherent general comprehension capabilities. Additionally, it combines Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to improve vision-language compositional understanding.
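The negative-caption pipeline can be sketched as a few-shot prompt builder. The instruction wording, function name, and in-context examples below are illustrative assumptions, not the actual prompt used in the paper:

```python
# Hypothetical few-shot prompt for asking an LLM to produce hard negative
# captions; the examples and instruction text are assumptions for illustration.
ICL_EXAMPLES = [
    ("A dog chasing a cat in the yard.", "A cat chasing a dog in the yard."),
    ("A man in a red shirt holding a blue umbrella.",
     "A man in a blue shirt holding a red umbrella."),
]

def build_negative_caption_prompt(caption: str) -> str:
    """Assemble an in-context prompt that asks an LLM to minimally edit a
    caption (swapping objects, attributes, or relations) into a hard negative."""
    lines = [
        "Rewrite each caption into a plausible but incorrect hard negative",
        "by swapping objects, attributes, or relations. Keep edits minimal.",
        "",
    ]
    for pos, neg in ICL_EXAMPLES:
        lines.append(f"Caption: {pos}")
        lines.append(f"Negative: {neg}")
        lines.append("")
    lines.append(f"Caption: {caption}")
    lines.append("Negative:")
    return "\n".join(lines)
```

The returned string would be sent to the LLM, whose completion after `Negative:` is taken as the hard negative caption for that image-text pair.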
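To give a feel for the local alignment, here is a toy sketch of the Text-Grounded Contrast idea: one image embedding is contrasted against its positive caption and the LLM-generated hard negatives. The function name, shapes, and temperature are our assumptions, not the official implementation:

```python
import numpy as np

def text_grounded_contrast(img, pos_txt, neg_txts, tau=0.07):
    """Toy sketch (our naming, not the official code): cross-entropy over
    cosine similarities between an image embedding and one positive caption
    plus several hard negative captions; index 0 is the positive class."""
    img = img / np.linalg.norm(img)
    txts = np.vstack([pos_txt] + list(neg_txts))
    txts = txts / np.linalg.norm(txts, axis=1, keepdims=True)
    logits = txts @ img / tau            # temperature-scaled similarities
    logits -= logits.max()               # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])             # loss is low when the positive wins
```

An Image-Grounded Contrast term would mirror this with the roles of image and text swapped; the actual DeGLA losses operate on batches and are defined in the released training code.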
- Release training code
- Release model weights
- Release training data
Our work is based on openclip, NegCLIP, and CE-CLIP; you can refer to these repositories for environment setup, then adapt them according to our code and proceed with training. Alternatively, you can set up the environment as detailed below:
```shell
conda create -n DeGLA python=3.9 -y
conda activate DeGLA
pip install -r requirements.txt
```
Our CUDA version is 12.1. You can adjust the versions of the relevant libraries, such as PyTorch, according to your CUDA version.
Our hard negative data is released on Baidu Yun, Google Drive, and Hugging Face.
```shell
git clone https://github.com/xiaoxing2001/DeGLA
cd DeGLA
./scripts/train_DeGLA.sh
```
Our weights are released on Baidu Yun, Google Drive, and Hugging Face. Our compositional reasoning evaluation is based on other repositories: for ARO, please visit ARO; for SugarCrepe, please visit SugarCrepe; for VALSE, please visit VALSE.
This project is based on CE-CLIP, NegCLIP, and openclip; thanks for their great work.
This project is released under the MIT license. Please see the LICENSE file for more information.
If you find this repository useful, please cite it using the following BibTeX entry.
```bibtex
@misc{hu2025decoupledgloballocalalignmentimproving,
      title={Decoupled Global-Local Alignment for Improving Compositional Understanding},
      author={Xiaoxing Hu and Kaicheng Yang and Jun Wang and Haoran Xu and Ziyong Feng and Yupei Wang},
      year={2025},
      eprint={2504.16801},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.16801},
}
```