GitHub - eminorhan/mae_st: Spatiotemporal Masked Autoencoders for video-based SSL

Masked Autoencoders As Spatiotemporal Learners

This is my personal copy of Facebook's Spatiotemporal Masked Autoencoders repository customized for my own purposes. The code here can be used to train and finetune spatiotemporal MAEs.

Usage examples

Training: To train a spatiotemporal MAE model with a ViT-H/14 architecture from scratch on your data, use pretrain.py, e.g.:

python -u pretrain.py \
    --data_dirs DATA_DIRS \
    --datafile_dir DATAFILE_DIR \
    --save_prefix INFORMATIVE_SAVE_PREFIX \
    --output_dir OUTPUT_DIR \
    --model 'mae_vit_huge_patch14' \
    --batch_size_per_gpu 1 \
    --accum_iter 1 \
    --epochs 100000 \
    --num_frames 16 \
    --img_size 224 \
    --decoder_embed_dim 512 \
    --decoder_depth 4 \
    --pin_mem \
    --t_patch_size 2 \
    --repeat_aug 16 \
    --sampling_rate 8 \
    --lr 0.0001 \
    --weight_decay 0.05 \
    --mask_ratio 0.9 \
    --pred_t_dim 16 \
    --clip_grad 0.1

Here, DATA_DIRS is a list of directories containing the video files, DATAFILE_DIR is the directory where a .csv file containing all the training video file paths (optionally, with the corresponding class labels) will be saved, and OUTPUT_DIR is the directory where the checkpoints and training logs will be saved.
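For illustration only, the sketch below shows one way a .csv file listing video paths could be built from a list of data directories. It is not the code used by pretrain.py: the function name write_video_manifest, the file name train.csv, the set of video extensions, and the one-path-per-row layout are all assumptions made for this example.

```python
import csv
from pathlib import Path

# Assumption: which file extensions count as videos (not taken from the repository).
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mkv", ".webm"}


def write_video_manifest(data_dirs, datafile_dir, filename="train.csv"):
    """Write one video path per row to datafile_dir/filename and return that path.

    Hypothetical helper: the real column layout used by pretrain.py may differ.
    """
    # Recursively collect video files from every data directory, in sorted order.
    paths = sorted(
        str(p)
        for d in data_dirs
        for p in Path(d).rglob("*")
        if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
    )

    # Create the output directory if it does not exist yet.
    out_dir = Path(datafile_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename

    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        for path in paths:
            writer.writerow([path])  # a class label could be added as a second column

    return out_path
```

A call such as write_video_manifest(["/data/videos_a", "/data/videos_b"], "/data/datafiles") would then produce a single .csv covering both directories (the paths here are placeholders).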

Finetuning on videos: To finetune a ViT-H/14 model on a downstream video recognition task, use finetune.py, e.g.:

python -u finetune.py \
    --train_dir TRAIN_DIR \
    --val_dir VAL_DIR \
    --datafile_dir DATAFILE_DIR \
    --save_prefix INFORMATIVE_SAVE_PREFIX \
    --output_dir OUTPUT_DIR \
    --finetune SPATIOTEMPORAL_MAE_CHECKPOINT \
    --num_classes 174 \
    --model 'vit_huge_patch14' \
    --batch_size_per_gpu 4 \
    --accum_iter 1 \
    --epochs 100000 \
    --num_frames 16 \
    --input_size 224 \
    --pin_mem \
    --t_patch_size 2 \
    --repeat_aug 1 \
    --sampling_rate 8 \
    --blr 0.0024 \
    --clip_grad 5.0 \
    --mixup 0 \
    --cutmix 0.0

Here, TRAIN_DIR and VAL_DIR are the directories containing the training and validation videos, respectively, and SPATIOTEMPORAL_MAE_CHECKPOINT is the path to the pretrained spatiotemporal MAE checkpoint the model is initialized with (use "" here if you would like to finetune the model from scratch without any pretraining).
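Note that the command above passes a base learning rate (--blr) rather than an absolute one. In the original MAE code, the absolute learning rate is derived as blr × effective batch size / 256. The sketch below works through that arithmetic, assuming this repository keeps the same convention (not verified here); the function names and the GPU count of 8 are hypothetical and chosen only for the example.

```python
def effective_batch_size(batch_size_per_gpu: int, accum_iter: int, num_gpus: int) -> int:
    """Number of samples contributing to one optimizer step."""
    return batch_size_per_gpu * accum_iter * num_gpus


def absolute_lr(blr: float, eff_batch_size: int, base_batch_size: int = 256) -> float:
    """Linear scaling rule: lr grows proportionally with the effective batch size."""
    return blr * eff_batch_size / base_batch_size


# Values from the finetuning command above; num_gpus=8 is a hypothetical setup.
eff = effective_batch_size(batch_size_per_gpu=4, accum_iter=1, num_gpus=8)
lr = absolute_lr(blr=0.0024, eff_batch_size=eff)
print(f"effective batch size: {eff}, absolute lr: {lr:.6f}")
```

Under this rule, changing the number of GPUs or --accum_iter changes the absolute learning rate even when --blr stays fixed.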

Finetuning on images: To finetune a ViT-H/14 model on a downstream image recognition task (e.g. ImageNet), use finetune_on_image.py, e.g.:

python -u finetune_on_image.py \
    --train_data_path TRAIN_DATA_PATH \
    --val_data_path VAL_TRAIN_DATA_PATH \
    --save_prefix INFORMATIVE_SAVE_PREFIX \
    --output_dir OUTPUT_DIR \
    --finetune SPATIOTEMPORAL_MAE_CHECKPOINT \
    --num_classes 1000 \
    --model 'vit_huge_patch14' \
    --batch_size_per_gpu 4 \
    --accum_iter 1 \
    --epochs 100000 \
    --num_frames 16 \
    --input_size 224 \
    --pin_mem \
    --t_patch_size 2 \
    --blr 0.0024 \
    --clip_grad 5.0 \
    --mixup 0 \
    --cutmix 0.0

Here, TRAIN_DATA_PATH and VAL_TRAIN_DATA_PATH are the directories containing the training and validation images, respectively, and SPATIOTEMPORAL_MAE_CHECKPOINT is the path to the pretrained spatiotemporal MAE checkpoint the model is initialized with. This script effectively makes a static video clip from each image by repeating the image 16 times (num_frames). This allows us to use the pretrained spatiotemporal MAE model as is, without any modifications to the architecture.
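The idea of turning an image into a static clip can be sketched as follows. This is a minimal NumPy illustration, not the code in finetune_on_image.py: the function name image_to_static_clip and the (C, T, H, W) axis order are assumptions. The token count at the end uses the flags from the command above (patch size 14 from the model name, --input_size 224, --num_frames 16, --t_patch_size 2) to show that the static clip has the same shape a real 16-frame video clip would have.

```python
import numpy as np


def image_to_static_clip(image: np.ndarray, num_frames: int = 16) -> np.ndarray:
    """Repeat a (C, H, W) image along a new time axis to get a (C, T, H, W) clip."""
    if image.ndim != 3:
        raise ValueError(f"expected a (C, H, W) array, got shape {image.shape}")
    # Insert a length-1 time axis after the channel axis, then repeat it num_frames times.
    return np.repeat(image[:, np.newaxis, :, :], num_frames, axis=1)


# Random stand-in for a preprocessed 224x224 RGB image.
image = np.random.rand(3, 224, 224).astype(np.float32)
clip = image_to_static_clip(image, num_frames=16)
print(clip.shape)  # (3, 16, 224, 224)

# Number of spatiotemporal patches for this clip:
# (224 / 14)^2 spatial patches per temporal slice, 16 / 2 temporal slices.
spatial_patches = (224 // 14) ** 2
temporal_patches = 16 // 2
print(spatial_patches * temporal_patches)  # 2048
```

Because every frame is identical, the clip carries no motion information; the repetition only serves to match the input shape the video model expects.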

