Chenxin An¹, Lean Wang², Xu Sun², Lingpeng Kong¹, Qi Liu¹
This repo contains the official implementation of our paper "Temporal Reasoning Transfer from Text to Video".
Please refer to `./probing` for details on the probing experiments.
For LongVA experiments, we mix the Open-LLaVA-NeXT dataset with our T3 dataset. The data mixing process is implemented in `t3_sft/data_creation.py`. The script handles:
- Loading and processing the Open-LLaVA-NeXT data
- Loading our Video T3 dataset, which covers various temporal-reasoning aspects
The data mixing script allows for the following (sketched in the example below):
- Customizable dataset ratios (see Table 2 of the main paper and Figure 9 of the Appendix for best practices on mixing ratios)
- Text-length filtering (to avoid OOM when GPU memory is limited)
- Token-length analysis and visualization
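
The actual implementation lives in `t3_sft/data_creation.py`; the snippet below is only a minimal, self-contained sketch of the mixing logic, assuming both datasets are stored as LLaVA-style JSON lists with a `conversations` field. The file names, ratio, length threshold, tokenizer choice, and the `count_tokens` helper are all illustrative, not the repo's actual settings.

```python
import json
import random

from transformers import AutoTokenizer  # pip install transformers

# Illustrative paths/parameters -- the real script may differ.
OPEN_LLAVA_NEXT_JSON = "open_llava_next.json"
VIDEO_T3_JSON = "video_t3.json"
T3_RATIO = 0.2          # fraction of the final mix drawn from T3 (cf. Table 2)
MAX_TEXT_TOKENS = 4096  # drop overly long samples to avoid OOM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

def count_tokens(sample):
    """Token count over all conversation turns of a LLaVA-style sample."""
    text = " ".join(turn["value"] for turn in sample["conversations"])
    return len(tokenizer(text).input_ids)

def load_filtered(path, max_tokens):
    """Load a JSON list of samples, keeping only those under the length cap."""
    with open(path) as f:
        data = json.load(f)
    return [s for s in data if count_tokens(s) <= max_tokens]

llava_data = load_filtered(OPEN_LLAVA_NEXT_JSON, MAX_TEXT_TOKENS)
t3_data = load_filtered(VIDEO_T3_JSON, MAX_TEXT_TOKENS)

# Subsample T3 so it makes up roughly T3_RATIO of the final mixture.
n_t3 = int(len(llava_data) * T3_RATIO / (1 - T3_RATIO))
random.seed(42)
mixed = llava_data + random.sample(t3_data, min(n_t3, len(t3_data)))
random.shuffle(mixed)

with open("mixed_sft_data.json", "w") as f:
    json.dump(mixed, f)
print(f"{len(llava_data)} LLaVA + {len(mixed) - len(llava_data)} T3 = {len(mixed)} samples")

# Optional: visualize the token-length distribution of the mixture.
import matplotlib.pyplot as plt
plt.hist([count_tokens(s) for s in mixed], bins=50)
plt.xlabel("tokens per sample")
plt.savefig("token_lengths.png")
```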
We use the LongVA codebase for training LongVA models. Please set up the environment according to LongVA's instructions. The training script is located at `t3_sft/longva_exp/longva_t3.sh`.
For Qwen2VL models, we use LLaMA-Factory for fine-tuning. Please set up the environment according to LLaMA-Factory's instructions. The training configurations and scripts for the 7B and 72B models can be found under `t3_sft/qwen_exp/7b` and `t3_sft/qwen_exp/72b`, respectively.
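
The authoritative configs are the ones shipped under `t3_sft/qwen_exp/7b` and `t3_sft/qwen_exp/72b`. As a rough illustration of what a LLaMA-Factory SFT config for a Qwen2-VL model looks like, the sketch below writes one out: the key names follow LLaMA-Factory's YAML schema, but every value here (model path, dataset name, hyperparameters, output dir) is a placeholder, not the setting used in the paper.

```python
import yaml  # pip install pyyaml

# Placeholder values -- consult t3_sft/qwen_exp/7b for the real settings.
config = {
    "model_name_or_path": "Qwen/Qwen2-VL-7B-Instruct",
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "full",
    "dataset": "t3_mixed_sft",  # must be registered in LLaMA-Factory's dataset_info.json
    "template": "qwen2_vl",
    "cutoff_len": 4096,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1.0e-5,
    "num_train_epochs": 1.0,
    "bf16": True,
    "output_dir": "saves/qwen2vl-7b-t3",
}

with open("qwen2vl_7b_t3_sft.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

# Launch with LLaMA-Factory's CLI:
#   llamafactory-cli train qwen2vl_7b_t3_sft.yaml
```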