If our project helps you, please give us a star ⭐ and cite our paper.
```bash
git clone https://github.com/steven-ccq/ViLAMP.git
cd ViLAMP
# Python 3.8
pip install -r requirements.txt
```
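If you prefer an isolated environment, a minimal conda setup might look like this (a sketch; the environment name `vilamp` is our own choice, not part of the repo):

```bash
# Create and activate a Python 3.8 environment (name is arbitrary)
conda create -n vilamp python=3.8 -y
conda activate vilamp
pip install -r requirements.txt
```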
Please download 🤗ViLAMP-llava-qwen and place it in the `models/` directory.
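One way to fetch the weights is with the `huggingface-cli` tool (a sketch; `<repo-id>` is a placeholder for the actual repository id shown on the model page):

```bash
# Download the checkpoint into models/ (replace <repo-id> accordingly)
huggingface-cli download <repo-id> --local-dir models/ViLAMP-llava-qwen
```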
We provide evaluation scripts for five benchmarks: Video-MME, MLVU, LongVideoBench, ActivityNetQA, and EgoSchema. For more details, please refer to the `scripts/eval/` directory.
Before running these scripts, please download the evaluation datasets and place them in the `dataset/` directory. The input arguments are described below:
- `--dataset_path`: path to the test dataset.
- `--video_dir`: path to the folder containing the videos required for testing.
- `--output_dir`: path to the folder where the results will be saved.
- `--version`: path to the model.
- `--split`: portion of the dataset to evaluate, in the format `i_N`, where `N` is the total number of splits and `i` is the index of the current split (starting from 1). Defaults to `1_1`.
- `--max_frame_num`: maximum number of frames to sample per video. Defaults to `600`.
Here is an example of evaluating on Video-MME:
```bash
python exp_vMME.py \
    --dataset_path dataset/Video-MME/videomme/test-00000-of-00001.parquet \
    --video_dir dataset/Video-MME/data \
    --output_dir dataset/Video-MME/output \
    --version models/ViLAMP-llava-qwen \
    --split 1_1 \
    --max_frame_num 600
```
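Since `--split` shards the benchmark, you can evaluate partitions in parallel and merge the outputs afterwards. A two-way split might look like this (a sketch; the GPU assignment via `CUDA_VISIBLE_DEVICES` is our assumption):

```bash
# Evaluate both halves of Video-MME concurrently, one process per GPU
CUDA_VISIBLE_DEVICES=0 python exp_vMME.py \
    --dataset_path dataset/Video-MME/videomme/test-00000-of-00001.parquet \
    --video_dir dataset/Video-MME/data \
    --output_dir dataset/Video-MME/output \
    --version models/ViLAMP-llava-qwen \
    --split 1_2 &
CUDA_VISIBLE_DEVICES=1 python exp_vMME.py \
    --dataset_path dataset/Video-MME/videomme/test-00000-of-00001.parquet \
    --video_dir dataset/Video-MME/data \
    --output_dir dataset/Video-MME/output \
    --version models/ViLAMP-llava-qwen \
    --split 2_2 &
wait
```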
We offer training scripts designed for both single-node and multi-node environments. For more detailed instructions, please check the `scripts/train/` directory.
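As a sketch, launching a run could be as simple as the following (the script name is a placeholder; use the actual file names you find in `scripts/train/`):

```bash
# Single-node training (replace with the real script name)
bash scripts/train/<your_training_script>.sh
```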
To streamline dataset organization, we use the `training_data.yaml` file to consolidate training data. Before starting training, make sure your dataset is registered in this file. We include a small example, `example-10.json`, that demonstrates the expected dataset format.
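For illustration, a registration entry might look like the sketch below; the field names (`datasets`, `json_path`, `sampling_strategy`) are assumptions modeled on common LLaVA-style data configs, so match them to the entries already present in `training_data.yaml`:

```yaml
# Hypothetical entry; align keys with the existing file
datasets:
  - json_path: dataset/example-10.json
    sampling_strategy: all
```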
Note that if your training dataset requires specific processing, you will need to modify `llava/train/train.py`: insert your custom video processing function before line 1225 to accommodate your dataset's needs.
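To see the exact insertion point before editing, you can print the surrounding context (a minimal sketch):

```bash
# Show lines 1215-1235 of train.py, centered on the insertion point
sed -n '1215,1235p' llava/train/train.py
```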
```bibtex
@article{cheng2025vilamp,
  title={Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation},
  author={Cheng, Chuanqi and Guan, Jian and Wu, Wei and Yan, Rui},
  journal={arXiv preprint arXiv:2504.02438},
  year={2025}
}
```