PyTorch implementation for the paper:
Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering
Ahjeong Seo, Gi-Cheon Kang, Joonhan Park, and Byoung-Tak Zhang
In ACL 2021
Requirements: Python 3.7, PyTorch 1.2.0
- Download the TGIF-QA dataset and refer to the paper for details.
- Download the MSVD-QA and MSRVTT-QA datasets.
- Appearance Features
- For local features, we used Faster R-CNN pre-trained on Visual Genome. Please see this link.
- After extracting object features with Faster R-CNN, you can convert them to an HDF5 file (a sketch of this conversion step appears after this list) with a simple run:
python adaptive_detection_features_converter.py
- For global features, we used ResNet152 provided by torchvision. Please see this link.
- Motion Features
- For local features, we use RoIAlign with the bounding boxes obtained from Faster R-CNN (a sketch also appears after this list). Please see this link.
- For global features, we use I3D pre-trained on Kinetics. Please see this link.
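As a rough illustration of the conversion step above, here is a minimal, self-contained sketch of packing per-frame Faster R-CNN object features into an HDF5 file with h5py. The file name, group layout, and feature shapes are assumptions for illustration; the actual format is whatever adaptive_detection_features_converter.py produces.

```python
import h5py
import numpy as np

# Assumed shapes: 36 frames per video, 10 object regions per frame,
# 2048-d Faster R-CNN features and 4-d bounding boxes per region.
num_frames, num_objects, feat_dim = 36, 10, 2048

def write_video_features(h5_path, video_ids, load_frame_features):
    """Pack per-video object features and boxes into one HDF5 file.

    `load_frame_features(vid)` is a stand-in for however you read the raw
    Faster R-CNN outputs; it should return (features, boxes) arrays of shape
    (num_frames, num_objects, feat_dim) and (num_frames, num_objects, 4).
    """
    with h5py.File(h5_path, 'w') as f:
        for vid in video_ids:
            feats, boxes = load_frame_features(vid)
            grp = f.create_group(str(vid))  # one group per video id (assumed layout)
            grp.create_dataset('features', data=feats.astype(np.float32))
            grp.create_dataset('boxes', data=boxes.astype(np.float32))

# Example with random data standing in for real Faster R-CNN outputs:
def fake_loader(vid):
    return (np.random.randn(num_frames, num_objects, feat_dim),
            np.random.rand(num_frames, num_objects, 4))

write_video_features('btup_f_obj10.hdf5', ['video0', 'video1'], fake_loader)
```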
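Similarly, a minimal sketch of the RoIAlign step for motion local features, using torchvision.ops.roi_align to pool I3D feature-map activations inside Faster R-CNN boxes. The 7x7 map size, 224x224 image coordinates, box count, and pooling settings are illustrative assumptions, not the repository's exact configuration.

```python
import torch
from torchvision.ops import roi_align

# Assumed I3D feature map for one clip: (batch=1, channels=1024, H=7, W=7),
# and 10 Faster R-CNN boxes in (x1, y1, x2, y2) image coordinates.
feature_map = torch.randn(1, 1024, 7, 7)
boxes = torch.rand(10, 4) * 224
boxes[:, 2:] += boxes[:, :2]  # ensure x2 >= x1 and y2 >= y1

# Boxes are passed as a list of per-image tensors; spatial_scale maps the
# assumed 224x224 image coordinates onto the 7x7 feature map.
local_motion = roi_align(feature_map, [boxes], output_size=(1, 1),
                         spatial_scale=7.0 / 224, sampling_ratio=2)
print(local_motion.shape)  # torch.Size([10, 1024, 1, 1]): one vector per box
```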
We uploaded our extracted features:
- TGIF-QA
  - res152_avgpool.hdf5: appearance global features (3 GB)
  - tgif_btup_f_obj10.hdf5: appearance local features (30 GB)
  - tgif_i3d_hw7_perclip_avgpool.hdf5: motion global features (3 GB)
  - tgif_i3d_roialign_hw7_perclip_avgpool.hdf5: motion local features (59 GB)
- MSRVTT-QA
  - msrvtt_res152_avgpool.hdf5: appearance global features (1.7 GB)
  - msrvtt_btup_f_obj10.hdf5: appearance local features (17 GB)
  - msrvtt_i3d_avgpool_perclip.hdf5: motion global features (1.7 GB)
  - msrvtt_i3d_roialign_perclip_obj10.hdf5: motion local features (34 GB)
- MSVD-QA
  - msvd_res152_avgpool.hdf5: appearance global features (220 MB)
  - msvd_btup_f_obj10.hdf5: appearance local features (2.2 GB)
  - msvd_i3d_avgpool_perclip.hdf5: motion global features (220 MB)
  - msvd_i3d_roialign_perclip_obj10.hdf5: motion local features (4.2 GB)
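To sanity-check a downloaded file, you can inspect it with h5py. The internal layout (flat datasets vs. one group per video) is an assumption here; this sketch just prints whatever keys and shapes the file actually contains.

```python
import h5py

# Inspect a downloaded feature file, e.g. the TGIF-QA appearance global features.
with h5py.File('res152_avgpool.hdf5', 'r') as f:
    for key in list(f.keys())[:5]:  # peek at the first few entries only
        item = f[key]
        if isinstance(item, h5py.Dataset):
            print(key, item.shape, item.dtype)
        else:  # a group: list the shapes of its member datasets
            print(key, {k: v.shape for k, v in item.items()
                        if isinstance(v, h5py.Dataset)})
```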
Simple run for TGIF-QA (the Count task is shown here):
CUDA_VISIBLE_DEVICES=0 python main.py --task Count --batch_size 32
For MSRVTT-QA, run:
CUDA_VISIBLE_DEVICES=0 python main_msrvtt.py --task MS-QA --batch_size 32
For MSVD-QA, run:
CUDA_VISIBLE_DEVICES=0 python main_msvd.py --task MS-QA --batch_size 32
By default, our model saves a checkpoint at every epoch. You can change the save path with the --save_path option.
Each checkpoint is named '[TASK]_[PERFORMANCE].pth' by default. To evaluate a saved checkpoint, run:
CUDA_VISIBLE_DEVICES=0 python main.py --test --checkpoint [NAME] --task Count --batch_size 32
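If you need to inspect a checkpoint outside of main.py, it can be opened directly with torch.load. The file name below follows the '[TASK]_[PERFORMANCE].pth' convention and is hypothetical, and whether the file holds a bare state_dict or a wrapper dict is an assumption; adjust to the actual contents.

```python
import torch

# Hypothetical checkpoint name following the '[TASK]_[PERFORMANCE].pth'
# convention; substitute the file your training run actually produced.
ckpt = torch.load('Count_3.75.pth', map_location='cpu')

# Assuming a bare state_dict (parameter name -> tensor), print a few entries.
for name, tensor in list(ckpt.items())[:5]:
    print(name, tuple(tensor.shape))
```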
Performance on the TGIF-QA dataset:
Model | Count (MSE, lower is better) | Action (Acc. %) | Trans. (Acc. %) | FrameQA (Acc. %) |
---|---|---|---|---|
MASN | 3.75 | 84.4 | 87.4 | 59.5 |
You can download our pre-trained models via these links: Count, Action, Trans., FrameQA.
Performance on the MSRVTT-QA and MSVD-QA datasets:
Model | MSRVTT-QA (Acc. %) | MSVD-QA (Acc. %) |
---|---|---|
MASN | 35.2 | 38.0 |
If this repository is helpful for your research, we'd really appreciate it if you could cite the following paper:
@article{seo2021attend,
title={Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering},
author={Seo, Ahjeong and Kang, Gi-Cheon and Park, Joonhan and Zhang, Byoung-Tak},
journal={arXiv preprint arXiv:2106.10446},
year={2021}
}
MIT License
This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (2015-0-00310-SW.StarLab/25%, 2017-0-01772-VTT/25%, 2018-0-00622-RMI/25%, 2019-0-01371-BabyMind/25%) grant funded by the Korean government.