Selecting a Diverse Set of Aesthetically-Pleasing and Representative Video Thumbnails Using Reinforcement Learning
- From "Selecting a Diverse Set of Aesthetically-Pleasing and Representative Video Thumbnails Using Reinforcement Learning". Proc. of the IEEE Int. Conf. on Image Processing (ICIP), Oct. 2023.
- Written by Evlampios Apostolidis, Georgios Balaouras, Vasileios Mezaris and Ioannis Patras.
- This software can be used for training a deep learning architecture for video thumbnail selection, which quantifies the representativeness and the aesthetic quality of the selected thumbnails using deterministic reward functions, and integrates a frame-picking mechanism that takes the frames' diversity into account. After being trained on a collection of videos, RL-DiVTS's Thumbnail Selector is capable of selecting a diverse set of representative and aesthetically-pleasing thumbnails for unseen videos, according to a user-specified number of required thumbnails.
- The PyTorch implementation of the ARL-VTS video thumbnail selection method that is also evaluated in the paper is available at ARL-VTS. The PyTorch implementations of the AC-SUM-GAN and CA-SUM video summarization methods that are also evaluated in the paper are available at AC-SUM-GAN and CA-SUM, respectively.
Developed, checked and verified on an Ubuntu 20.04.5 PC with an NVIDIA RTX 2080Ti GPU and an i5-11600K CPU. Main packages required:
| Python | PyTorch | CUDA Version | cuDNN Version | TensorBoard | TensorFlow | NumPy | H5py |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 3.8(.8) | 1.7.1 | 11.4 | 8005 | 2.4.0 | 2.4.1 | 1.20.2 | 2.10.0 |
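As a quick, optional sanity check (not part of the repository), the installed package versions can be printed and compared against the ones listed above:

```python
# Optional sanity check (not part of the repository): print the versions of the
# main packages and compare them against the table above.
import torch
import tensorflow as tf
import numpy as np
import h5py

print("PyTorch:", torch.__version__)                 # expected: 1.7.1
print("GPU available:", torch.cuda.is_available())   # True if a CUDA-capable GPU is visible
print("cuDNN:", torch.backends.cudnn.version())      # expected: 8005
print("TensorFlow:", tf.__version__)                 # expected: 2.4.1
print("NumPy:", np.__version__)                      # expected: 1.20.2
print("h5py:", h5py.__version__)                     # expected: 2.10.0
```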
Structured h5 files with the video features and annotations of the OVP and YouTube datasets are available within the data folder. These files have the following structure:
/key
    /features                2D-array with shape (n_steps, feature-dimension); feature vectors representing the content of the video frames, extracted from the pool5 layer of a GoogleNet trained on the ImageNet dataset
    /aesthetic_scores_mean   1D-array with shape (n_steps); scores representing the aesthetic quality of the video frames, computed as the softmax of the values in the final layer of a Fully Convolutional Network model trained on the AVA dataset
    /n_frames                number of frames in the original video
    /ssim_matrix             2D-array with shape (M, n_frames); the structural similarity scores between each of the M most-selected thumbnails by the human annotators and the entire frame sequence (to support evaluation using 'Precision at 5', the number of thumbnails selected by the human annotators was set equal to five; however, M can be more than five if there exist more than five key-frames with the same ranking according to the number of selections made by the human annotators); the structural similarity scores were computed using Python's structural_similarity function (from the scikit-image library)
    /top3_thumbnail_ids      the indices of the three most-selected thumbnails by the human annotators (can be more than three if there exist more than three key-frames with the same ranking according to the number of selections made by the human annotators)
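For illustration, the snippet below (not part of the repository; the file path is hypothetical) shows how the above structure can be inspected with h5py:

```python
# Minimal sketch (not part of the repository) for inspecting the structured h5
# files described above; the file path below is hypothetical.
import h5py

with h5py.File("data/OVP/ovp.h5", "r") as hdf:            # hypothetical path to the h5 file
    for video_key in hdf.keys():
        group = hdf[video_key]
        features = group["features"][...]                 # (n_steps, feature-dimension)
        aesthetics = group["aesthetic_scores_mean"][...]  # (n_steps,)
        n_frames = group["n_frames"][()]                  # frames in the original video
        ssim = group["ssim_matrix"][...]                  # (M, n_frames)
        top3_ids = group["top3_thumbnail_ids"][...]       # ground-truth thumbnail indices
        print(video_key, features.shape, n_frames, top3_ids)
```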
Original videos and annotations for each dataset are also available here:
Setup for the training process:
- In `data_loader.py`, specify the path to the h5 file of the used dataset, and the path to the JSON file containing data about the utilized data splits.
- In `configs.py`, define the directory where the analysis results will be saved.

Arguments in `configs.py`:
| Parameter name | Description | Default Value | Options |
| --- | --- | --- | --- |
| `--mode` | Mode for the configuration. | 'train' | 'train', 'test' |
| `--verbose` | Print or not training messages. | 'false' | 'true', 'false' |
| `--video_type` | Used dataset for training the model. | 'OVP' | 'OVP', 'Youtube' |
| `--input_size` | Size of the input feature vectors. | 1024 | int > 0 |
| `--hidden_size` | Number of features in the LSTM hidden state. | 512 | int > 0 |
| `--num_layers` | Number of LSTM recurrent layers. | 2 | int > 0 |
| `--n_episodes` | Number of training episodes per epoch. | 800010 | int > 0 |
| `--selected_thumbs` | Number of selected thumbnails. | 6 | int > 0 |
| `--n_epochs` | Number of training epochs. | 150 | int > 0 |
| `--batch_size` | Size of the training batch; 40 for 'OVP' and 32 for 'Youtube'. | 40 | 0 < int ≤ len(Dataset) |
| `--seed` | Chosen number for generating reproducible random numbers. | None | None, int |
| `--exp` | Experiment serial number. | 1000 | int |
| `--clip` | Gradient norm clipping parameter. | 5.0 | float |
| `--lr` | Value of the adopted learning rate. | 1e-4 | float |
| `--split_index` | Index of the utilized data split. | 0 | 0 ≤ int ≤ 9 |
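The arguments above are defined in configs.py. For illustration only (this is not the repository's configuration code), a subset of them could be declared and parsed with argparse as follows:

```python
# Illustrative argparse sketch mirroring part of the table above; the actual
# argument definitions live in the repository's configs.py.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--mode", type=str, default="train", choices=["train", "test"])
parser.add_argument("--verbose", type=str, default="false", choices=["true", "false"])
parser.add_argument("--video_type", type=str, default="OVP", choices=["OVP", "Youtube"])
parser.add_argument("--input_size", type=int, default=1024)
parser.add_argument("--hidden_size", type=int, default=512)
parser.add_argument("--num_layers", type=int, default=2)
parser.add_argument("--selected_thumbs", type=int, default=6)
parser.add_argument("--n_epochs", type=int, default=150)
parser.add_argument("--batch_size", type=int, default=40)
parser.add_argument("--seed", type=int, default=None)
parser.add_argument("--exp", type=int, default=1000)
parser.add_argument("--clip", type=float, default=5.0)
parser.add_argument("--lr", type=float, default=1e-4)
parser.add_argument("--split_index", type=int, default=0)

# Example: parse the arguments used in the training command shown below.
args = parser.parse_args(["--split_index", "0", "--video_type", "OVP", "--exp", "1"])
print(args.video_type, args.split_index, args.lr)
```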
To train the model using one of the aforementioned datasets, and for a number of randomly created splits of the dataset (where in each split 80% of the data is used for training and 20% for testing), use the corresponding JSON file that is included in the data/splits directory. This file contains the 5 randomly-generated splits that were utilized in our experiments.
For training the model using a single split, run:
python model/main.py --split_index N --n_epochs E --batch_size B --video_type 'dataset_name' --exp ID
where `N` refers to the index of the used data split, `E` refers to the number of training epochs, `B` refers to the batch size, `dataset_name` refers to the name of the used dataset, and `ID` refers to the index of the folder that will be used for storing the analysis results (default naming format: expID). For example, to run an experiment using the first data split of the OVP dataset, with a batch of 40 videos (full-batch) and 150 training epochs, and store the analysis results in the exp1 folder, execute the following command:
python model/main.py --split_index 0 --n_epochs 150 --batch_size 40 --video_type 'OVP' --exp 1
Alternatively, to train the model for all 5 splits and all 5 different seeds, use the run_ovp_splits.sh and/or run_youtube_splits.sh scripts and do the following:
chmod +x model/run_ovp_splits.sh # Makes the script executable.
chmod +x model/run_youtube_splits.sh # Makes the script executable.
sh model/run_ovp_splits.sh # Runs the script.
sh model/run_youtube_splits.sh # Runs the script.
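For reference, the sketch below is a rough Python equivalent of what these scripts do: launch one training run per data split and per seed. The seed values and the experiment-ID scheme are assumptions made for illustration, not the scripts' actual settings.

```python
# Rough, illustrative Python equivalent of run_ovp_splits.sh / run_youtube_splits.sh:
# one training run per (seed, split) combination. Seed values and experiment IDs
# below are assumptions, not the scripts' actual settings.
import subprocess

dataset = "OVP"            # or "Youtube"
batch_size = 40            # 40 for 'OVP', 32 for 'Youtube'
seeds = [1, 2, 3, 4, 5]    # hypothetical seed values

exp_id = 1
for seed in seeds:
    for split_index in range(5):
        subprocess.run([
            "python", "model/main.py",
            "--split_index", str(split_index),
            "--n_epochs", "150",
            "--batch_size", str(batch_size),
            "--video_type", dataset,
            "--seed", str(seed),
            "--exp", str(exp_id),
        ], check=True)
        exp_id += 1
```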
Please note that after each training epoch the algorithm performs an evaluation step, using the trained model to compute the scores for the frames of each video of the test set. These scores are then used by the provided evaluation scripts to assess the overall performance of the model.
The progress of the training can be monitored via the TensorBoard platform by:
- opening a Terminal and executing the following command:
tensorboard --logdir=/path/to/log-directory --host=localhost
- and then opening a browser and pasting the URL returned by the above command.
Given a test video, the top-3 selected key-frames among all annotators for this video are considered as the ground-truth thumbnails. As a side note, through this procedure some videos are associated with more than 3 ground-truth thumbnails, due to the existence of more than 3 key-frames with the same ranking according to the number of selections made by the human annotators. Nevertheless, in our evaluations we use the three thumbnails that come first according to the MSD (Most Significant Digit) Radix Sort of Python (which, e.g., sorts frame #20 before frame #3 based on the most significant digit).
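For intuition, this most-significant-digit ordering can be reproduced by sorting the frame indices as strings (illustration only):

```python
# Illustration only: lexicographic (most-significant-digit first) ordering of
# frame indices, which places frame #20 before frame #3, as described above.
frame_ids = [3, 20, 100, 7]
msd_order = sorted(frame_ids, key=str)
print(msd_order)  # [100, 20, 3, 7]
```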
In terms of evaluation, we applied the "top-3 matching" approach that measures the overlap between the top-3 machine- and human-selected thumbnails per video. We expressed this overlap as a scalar ranging in [0,1] and computed the average score over all videos of the test set.
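A minimal sketch of how such a "top-3 matching" score could be computed (illustrative only, not the repository's evaluation code):

```python
# Illustrative sketch of the "top-3 matching" measure described above: the overlap
# between the top-3 machine- and human-selected thumbnails, normalized to [0, 1]
# and averaged over all videos of the test set.
def top3_matching(machine_top3, ground_truth_ids):
    return len(set(machine_top3) & set(ground_truth_ids)) / 3.0

def average_top3_matching(per_video_pairs):
    scores = [top3_matching(machine, gt) for machine, gt in per_video_pairs]
    return sum(scores) / len(scores)

# Toy example with two test videos
pairs = [([4, 18, 95], [4, 95, 120]),   # overlap 2/3
         ([7, 33, 60], [33, 7, 60])]    # overlap 3/3
print(average_top3_matching(pairs))     # 0.8333...
```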
The utilized model selection criterion relies on the maximization of the received reward, and indicates the training epoch at which a well-trained model was obtained. We provide the evaluation/evaluate_all_exp.sh script, which evaluates the trained models of the architecture and automatically selects a well-trained one, for each conducted experiment. To run this script, define:
- the `h5_file_path` in `compute_score.py`,
- the `base_path` in `evaluate_all_exp.sh`,
- the `init_id` in `evaluate_all_exp.sh`,
- and the 'dataset_name' in `evaluate_all_exp.sh`,

and run
sh evaluation/evaluate_all_exp.sh '$exp_id' '$dataset_name'
where `$exp_id` is the ID of the first (out of five in total) evaluated experiment, and `$dataset_name` refers to the dataset being used.
If you find our work or code useful in your work, please cite the following publication:
E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, "Selecting a Diverse Set of Aesthetically-Pleasing and Representative Video Thumbnails Using Reinforcement Learning", Proc. of the 2023 IEEE Int. Conf. on Image Processing (ICIP 2023), Oct. 2023.
Copyright © 2023, Evlampios Apostolidis, Georgios Balaouras, Vasileios Mezaris, Ioannis Patras / CERTH-ITI. All rights reserved. This code is provided for academic, non-commercial use only. Redistribution and use in source and binary forms, with or without modification, are permitted for academic non-commercial use provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation provided with the distribution.
This software is provided by the authors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the authors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.