CAT-V

Official PyTorch implementation of Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting

🚀 Updates

🕹️ Demo

🛠️ Getting Started

Set up a conda environment (Python >= 3.10):

conda create -n cat2 python=3.10 -y
conda activate cat2

Install the requirements:

pip install -e .

Download checkpoints:

cd checkpoints && \
./download_ckpts.sh && \
cd ..
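
Before running inference, it can help to confirm that the setup steps above took effect. The following is a minimal sketch, not part of the repository: the helper names `python_ok` and `list_checkpoints` are hypothetical, and it assumes only what the steps above state (Python >= 3.10 and a `checkpoints` directory populated by `download_ckpts.sh`). It makes no assumption about which checkpoint files the script downloads.

```python
import sys
from pathlib import Path

# Minimum interpreter version stated in the setup instructions.
MIN_PYTHON = (3, 10)


def python_ok(version=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version[:2]) >= minimum


def list_checkpoints(ckpt_dir="checkpoints"):
    """Return sorted names of non-script files in ckpt_dir.

    Returns an empty list if the directory does not exist.
    The directory name is taken from the download step; the
    file names inside it are not assumed.
    """
    path = Path(ckpt_dir)
    if not path.is_dir():
        return []
    return sorted(
        p.name for p in path.iterdir() if p.is_file() and p.suffix != ".sh"
    )


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Checkpoint files found:", list_checkpoints())
```

Run it from the repository root with the `cat2` environment active. An empty checkpoint list suggests the download step did not complete.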

🏃 Run

bash inference.sh

📖 Citation

If you find this work useful for your research or applications, please cite using this BibTeX:

@inproceedings{tang2025cat-v,
  title={Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting},
  author={Tang, Yunlong and Bi, Jing and Hua, Hang and Xiao, Yunzhong and Song, Yizhi and Liu, Pinxin and Huang, Chao and Feng, Mingqian and Guo, Junjia and Liu, Zhuo and Song, Luchuan and Liang, Susan and Shimada, Daiki and Vosoughi, Ali and He, Jinxi and He, Liu and Zhang, Zeliang and Luo, Jiebo and Xu, Chenliang},
  journal={arXiv},
  year={2025}
}

🙏 Acknowledgements

This work was supported by Sony Group Corporation. We would like to thank Sayaka Nakamura and Jerry Jun Yokono for their insightful discussions.

We are also grateful to the following awesome projects that CAT-V builds on:

👩‍💻 Contributors

Our project wouldn't be possible without the contributions of these amazing people! Thank you all for making this project better.

Yunlong Tang @ University of Rochester
Jing Bi @ University of Rochester
Chao Huang @ University of Rochester
Susan Liang @ University of Rochester
Daiki Shimada @ Sony Group Corporation
Hang Hua @ University of Rochester
Yunzhong Xiao @ Carnegie Mellon University
Yizhi Song @ Purdue University
Pinxin Liu @ University of Rochester
Mingqian Feng @ University of Rochester
Junjia Guo @ University of Rochester
Zhuo Liu @ University of Rochester
Luchuan Song @ University of Rochester
Ali Vosoughi @ University of Rochester
Jinxi He @ University of Rochester
Liu He @ Purdue University
Zeliang Zhang @ University of Rochester
Jiebo Luo @ University of Rochester
Chenliang Xu @ University of Rochester
