1Wangxuan Institute of Computer Technology, Peking University 2Gaoling School of Artificial Intelligence, Renmin University of China 3School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
The official repository for STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding.
This code is based on PyTorch 2.4.0, pytorch-cuda 11.8, and torchvision 0.19.0.
For the complete environment configuration, see environment.yaml.
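A minimal setup sketch, assuming environment.yaml is a conda environment specification (the name "stop" below is only a placeholder; use the name actually declared in the file):
# Create the environment from the provided specification
conda env create -f environment.yaml
# Activate it (replace "stop" with the name declared in environment.yaml)
conda activate stop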
Download the CLIP pre-trained weights and place them in ${HOME}/models/pretrained. Then modify the pretrained_dir variable in the shell scripts to the path where you stored the models.
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
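For example (a sketch assuming the default ${HOME}/models/pretrained location; adjust the path if you store the weights elsewhere):
# Download CLIP ViT-B/32 directly into the pre-trained weights directory
mkdir -p ${HOME}/models/pretrained
wget -P ${HOME}/models/pretrained https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# Then point pretrained_dir in the shell scripts (e.g., scripts/msrvtt.sh) to this directory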
MSR-VTT: Download the splits and captions from CLIP4Clip:
wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip
Download the videos from Frozen-in-Time:
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
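After downloading, extract both archives. The destination directories below are only an illustrative assumption; use whatever paths your data configuration expects:
# Extract the splits/captions and the raw videos (target paths are assumptions)
unzip msrvtt_data.zip -d ${HOME}/data/msrvtt_data
unzip MSRVTT.zip -d ${HOME}/data/MSRVTT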
Video preprocessing can be done with preprocess/compress_video.py.
python preprocess/compress_video.py --input_root [raw_video_path] --output_root [compressed_video_path]
This script compresses the videos to 3 fps with a width of 224 (or a height of 224). Modify the variables in the script to suit your needs.
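Internally, this kind of compression is typically an ffmpeg call; a rough single-video equivalent (an illustrative sketch, not necessarily the exact command used by compress_video.py) is:
# Illustrative only: resample to 3 fps and scale the shorter side to 224, keeping aspect ratio
ffmpeg -i input.mp4 -vf "fps=3,scale='if(gt(iw,ih),-2,224)':'if(gt(iw,ih),224,-2)'" output.mp4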
Taking the MSR-VTT dataset as an example, you can run the following commands:
chmod +x ./scripts/msrvtt.sh
bash ./scripts/msrvtt.sh
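The script is expected to set the data and model paths before launching training. A hypothetical skeleton (all variable names and entry-point arguments below are illustrative assumptions, not the actual contents of scripts/msrvtt.sh):
#!/bin/bash
# Hypothetical skeleton only; the real script defines its own variables and arguments
pretrained_dir=${HOME}/models/pretrained   # CLIP weights downloaded above
data_dir=${HOME}/data/MSRVTT               # compressed videos and annotation files
python main.py --pretrained_dir ${pretrained_dir} --data_dir ${data_dir}   # illustrative entry point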
Comparison with the state of the art on the MSR-VTT dataset. For a fair comparison, all methods use CLIP-ViT-B/32 as the backbone. “Params” denotes the number of trainable parameters of each method. The best and second-best results are marked in RED and BLUE, respectively.
If you find this code useful for your research, please cite our paper.
@article{liu2025stop,
title={STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding},
author={Liu, Zichen and Xu, Kunlun and Su, Bing and Zou, Xu and Peng, Yuxin and Zhou, Jiahuan},
journal={arXiv preprint arXiv:2503.15973},
year={2025}
}
Our code is based on DGL and CLIP4Clip. Thanks for their impressive work!
For any questions, feel free to contact us (lzc20180720@stu.pku.edu.cn).
You are welcome to visit our laboratory homepage (OV3 Lab) for more information about our papers, source code, and datasets.