Zihang Lai, Andrea Vedaldi
Visual Geometry Group, University of Oxford
A plug-and-play transformer layer to turn image-based models into state-of-the-art video models using point tracking.
Tracktention is a novel architectural module that improves temporal consistency in video tasks like depth estimation and colorization. It leverages modern point trackers to explicitly align features across frames using attention — converting powerful image-based models into robust, temporally aware video models with minimal overhead.
- Tracktention Layer: Enhances existing ViT and ConvNet backbones with motion-aware temporal reasoning.
- Plug-and-Play: Easily integrates into existing models like Depth Anything.
- Lightweight: Only ~17M additional parameters with minimal runtime overhead.
- State-of-the-Art: Outperforms leading video models in depth prediction and video colorization benchmarks.
Tracktention consists of three stages (see the sketch after this list):
- Attentional Sampling: Pool features from image tokens to track tokens using cross-attention.
- Track Transformer: Propagate features along tracks for temporal consistency.
- Attentional Splatting: Redistribute processed track tokens back to image tokens.
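
The PyTorch sketch below illustrates these three stages for intuition only. It is not the released implementation: the tensor shapes, module sizes, and the use of `nn.MultiheadAttention` / `nn.TransformerEncoder` are our assumptions.

```python
import torch
import torch.nn as nn


class TracktentionSketch(nn.Module):
    """Minimal, illustrative sketch of the three Tracktention stages.

    Assumed shapes: image tokens `x` are (B, T, N, C) for T frames of N tokens
    each; track embeddings `track_emb` are (B, T, K, C) for K point tracks,
    e.g. encodings of CoTracker3 track coordinates.
    """

    def __init__(self, dim=384, heads=8, depth=2):
        super().__init__()
        # Attentional Sampling: track tokens query image tokens.
        self.sample_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Track Transformer: temporal self-attention along each track.
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.track_transformer = nn.TransformerEncoder(layer, depth)
        # Attentional Splatting: image tokens query the processed track tokens.
        self.splat_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, track_emb):
        B, T, N, C = x.shape
        K = track_emb.shape[2]

        # 1) Attentional Sampling: pool image features into track tokens
        #    (per-frame cross-attention: queries = tracks, keys/values = image tokens).
        q = track_emb.reshape(B * T, K, C)
        kv = x.reshape(B * T, N, C)
        track_tok, _ = self.sample_attn(q, kv, kv)

        # 2) Track Transformer: propagate features along time for each track.
        track_tok = track_tok.reshape(B, T, K, C).permute(0, 2, 1, 3)    # (B, K, T, C)
        track_tok = self.track_transformer(track_tok.reshape(B * K, T, C))
        track_tok = track_tok.reshape(B, K, T, C).permute(0, 2, 1, 3)    # (B, T, K, C)

        # 3) Attentional Splatting: redistribute track features back to image tokens
        #    (queries = image tokens, keys/values = track tokens), added residually.
        q = x.reshape(B * T, N, C)
        kv = track_tok.reshape(B * T, K, C)
        out, _ = self.splat_attn(q, kv, kv)
        return x + out.reshape(B, T, N, C)
```

In a real backbone, a layer like this would be inserted between existing transformer blocks, so the image model keeps its pretrained weights while gaining track-guided temporal reasoning.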
We use CoTracker3 to generate point tracks.
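
As a rough end-to-end illustration, the snippet below obtains point tracks from CoTracker3 via torch.hub and feeds them to the sketch above. The hub entry-point name, the `grid_size` argument, and the naive coordinate embedding are assumptions rather than the official pipeline; consult the CoTracker3 repository and the released Tracktention code for the exact interfaces.

```python
import torch
import torch.nn as nn

# Dummy 16-frame clip; real usage would load actual video frames,
# since tracking random noise produces meaningless tracks.
video = torch.randn(1, 16, 3, 384, 384)  # (B, T, 3, H, W)

# Point tracks from CoTracker3 (hub entry-point name assumed).
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline")
pred_tracks, pred_visibility = cotracker(video, grid_size=20)  # tracks: (B, T, K, 2)

# Naive track embedding: project normalized (x, y) coordinates to the token
# dimension. Tracktention may encode tracks differently; this is a placeholder.
coord_embed = nn.Linear(2, 384)
track_emb = coord_embed(pred_tracks / video.shape[-1])

# Per-frame image tokens from an image backbone (random placeholders here).
x = torch.randn(1, 16, 576, 384)  # (B, T, N, C)

layer = TracktentionSketch(dim=384)
out = layer(x, track_emb)  # temporally refined tokens, same shape as x
```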
Note: Usage instructions will be provided once the codebase is officially released.
If you use this code or Tracktention in your research, please cite:
@inproceedings{lai2025tracktention,
  title={Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better},
  author={Zihang Lai and Andrea Vedaldi},
  booktitle={CVPR},
  year={2025}
}