Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Heng Tao Shen
This is an official PyTorch implementation of the paper UMP: Unified Modality-aware Prompt Tuning for Text-Video Retrieval (under review). In this work, we
- present UMP, a simple yet effective method that extends prompt tuning with pre-trained models for fast adaptation to text-video retrieval.
- devise a lightweight UPG module that generates modality-aware prompt tokens to facilitate the learning of a well-aligned unified representation.
- introduce a parameter-free STS module that fully exploits the spatial-temporal information among video tokens and prompt tokens.
Unified Representation Alignment: existing methods adopt isolated text and video prompt tokens to prompt the text and video encoders separately (e.g., VoP), making it challenging for models to learn a well-aligned unified representation, i.e., a modality gap arises.
Fig. 1. Visualization of the modality gap for (a) VoP and (b) UMP. Paired text-video inputs are fed into the text and video encoders separately, and the embeddings from the last layers are visualized via UMAP.
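The procedure behind Fig. 1 can be reproduced roughly as follows. This is a minimal sketch, assuming the text and video embeddings from the last encoder layers have already been dumped as NumPy arrays; the file names and shapes below are illustrative, not part of the released code.

```python
# Minimal sketch of a Fig. 1-style visualization (illustrative only).
# Assumes paired text/video embeddings were saved as (N, D) NumPy arrays;
# the file names below are hypothetical.
import numpy as np
import umap
import matplotlib.pyplot as plt

text_emb = np.load("text_embeddings.npy")    # (N, D) text features
video_emb = np.load("video_embeddings.npy")  # (N, D) pooled video features

# Project both modalities into one shared 2-D space with a single UMAP fit,
# so the distance between the two clusters reflects the modality gap.
reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
points = reducer.fit_transform(np.concatenate([text_emb, video_emb], axis=0))

n = len(text_emb)
plt.scatter(points[:n, 0], points[:n, 1], s=5, label="text")
plt.scatter(points[n:, 0], points[n:, 1], s=5, label="video")
plt.legend()
plt.title("Modality gap (UMAP)")
plt.savefig("modality_gap.png", dpi=200)
```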
Prompt tuning, an emerging parameter-efficient strategy, leverages the powerful knowledge of large-scale pre-trained image-text models (e.g., CLIP) to swiftly adapt to downstream tasks. Despite its effectiveness, adapting prompt tuning to text-video retrieval faces two limitations: i) existing methods adopt two isolated sets of prompt tokens to prompt the two modal branches separately, making it challenging to learn a well-aligned unified representation, i.e., a modality gap arises; ii) video encoders typically utilize a fixed pre-trained visual backbone, neglecting the incorporation of spatial-temporal information.
Fig. 2. Performance comparison of parameter-efficient methods (including our UMP) and full fine-tuning (FULL). The diameter of each bubble is proportional to the number of trainable parameters.
To this end, we propose a simple yet effective method, named Unified Modality-aware Prompt Tuning (UMP), for text-video retrieval. Concretely, we first introduce a Unified Prompt Generation (UPG) module to dynamically produce modality-aware prompt tokens, enabling the perception of prior semantic information from both video and text inputs. These prompt tokens are simultaneously injected into the two branches, bridging the semantic gap between the two modalities in a unified-adjusting manner. Then, we design a parameter-free Spatial-Temporal Shift (STS) module to facilitate both intra- and inter-communication among video tokens and prompt tokens in the spatial-temporal dimension. Notably, extensive experiments on four widely used benchmarks show that UMP achieves new state-of-the-art performance compared to existing prompt-tuning methods without introducing excessive parameters.
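To make the idea of a parameter-free shift concrete, below is a rough sketch in the spirit of the STS module. It is our own illustrative approximation rather than the released implementation: a fraction of the feature channels of each frame's tokens (video patch tokens together with the injected prompt tokens) is shifted to the neighboring frames, so tokens exchange temporal information without adding any learnable parameters.

```python
# Illustrative, parameter-free temporal shift over per-frame tokens.
# This sketches the general idea behind a spatial-temporal shift; it is NOT
# the exact STS module from the paper.
import torch


def temporal_shift(tokens: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """tokens: (B, T, N, D) = batch, frames, tokens per frame (video + prompt), channels."""
    B, T, N, D = tokens.shape
    fold = int(D * shift_ratio) // 2
    out = torch.zeros_like(tokens)
    out[:, 1:, :, :fold] = tokens[:, :-1, :, :fold]                  # shift forward in time
    out[:, :-1, :, fold:2 * fold] = tokens[:, 1:, :, fold:2 * fold]  # shift backward in time
    out[:, :, :, 2 * fold:] = tokens[:, :, :, 2 * fold:]             # remaining channels unchanged
    return out


# Example: 12 frames, 49 patch tokens + 8 prompt tokens, 512-d features.
x = torch.randn(2, 12, 57, 512)
print(temporal_shift(x).shape)  # torch.Size([2, 12, 57, 512])
```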
The UMP framework depends on the following main requirements:
- torch == 1.8.1+cu114
- transformers == 4.6.1
- opencv-python == 4.5.3
- tqdm
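As a quick environment sanity check (a hypothetical snippet, not part of the repository), the installed versions can be printed like this:

```python
# Quick sanity check of the main dependencies (illustrative only).
import torch, transformers, cv2, tqdm

print("torch:", torch.__version__)                # expected 1.8.1 (+ matching CUDA build)
print("transformers:", transformers.__version__)  # expected 4.6.1
print("opencv:", cv2.__version__)                 # expected 4.5.3
print("tqdm:", tqdm.__version__)
```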
We train our model on the MSR-VTT-9k, MSR-VTT-7k, DiDeMo, LSMDC, and ActivityNet datasets. Please refer to this repo for data preparation.
You need to download the corresponding data files from the provided OneDrive link and specify the data path via `--videos_dir` when training. Then run `run_msrvtt.sh` to train the model.
For simple training on MSR-VTT-9k with default hyperparameters:
bash run_msrvtt.sh
or run in the terminal directly:
python train.py \
--exp_name={exp_name} \
--videos_dir={videos_dir} \
--batch_size=32 \
--noclip_lr=3e-5 \
--transformer_dropout=0.3 \
--dataset_name=MSRVTT \
--msrvtt_train_file=9k
To evaluate the trained model:
python test.py \
--exp_name={exp_name} \
--videos_dir={videos_dir} \
--batch_size=32 \
--load_epoch=-1 \
--dataset_name=MSRVTT \
--msrvtt_train_file=9k
Note
We are continuously refactoring our code; please be patient and wait for the latest updates!
If you find this repository helpful for your project, please consider citing our work:
@article{zhang2024ump,
  title={UMP: Unified Modality-aware Prompt Tuning for Text-Video Retrieval},
  author={Zhang, Haonan and Zeng, Pengpeng and Gao, Lianli and Song, Jingkuan and Shen, Heng Tao},
  journal={arXiv preprint arXiv:XXX.XXX},
  year={2024}
}
This paper and codebase are motivated by and derived from the following papers and their projects: