Implementation of *You said that?* in Python.

Dependencies:
- Python 3
- CUDA
- FFmpeg
- SciPy
- OpenCV
- dlib
This project generates a video of a talking face: given an audio segment and a face image, the method produces a video of the input face lip-synced with the input audio. The method has two major components:
- Data preprocessing: The project uses the LRS2 dataset. First, the raw audio and the frames are extracted from each input video. From the audio, Mel-frequency cepstral coefficient (MFCC) features are extracted (using scipy) for every 0.35 s of audio. From the frames, for every 0.35 s window, a ground-truth frame (the center frame of the window) and a feature frame (a random frame far from the ground-truth frame) are selected. The lower half of the ground-truth frame is masked and concatenated with the feature frame, forming the identity input (see the preprocessing sketch below).
- Architecture: An encoder-decoder CNN generates the talking video. The audio features and the identity features feed the audio and the identity encoders, respectively. The outputs of the two encoders are concatenated and passed to the image decoder, which generates the output frame (see the model sketch below).
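A minimal sketch of the preprocessing step, assuming the 0.35 s windowing described above (the function and parameter names are illustrative, not the repository's actual API; `compute_mfcc` stands in for an MFCC routine built on scipy's FFT/DCT primitives):

```python
import numpy as np
from scipy.io import wavfile

WINDOW_SECS = 0.35  # one training sample per 0.35 s of audio

def identity_input(gt_frame, feat_frame):
    """Mask the lower half of the ground-truth frame and stack it
    channel-wise with the feature frame (giving an H x W x 6 identity input)."""
    masked = gt_frame.copy()
    masked[masked.shape[0] // 2:] = 0  # hide the mouth region
    return np.concatenate([masked, feat_frame], axis=-1)

def make_samples(wav_path, frames, compute_mfcc):
    """Pair each 0.35 s audio window with its MFCC features, identity input,
    and ground-truth (center) frame. `frames` is the list of video frames for
    the clip; `compute_mfcc` is a stand-in MFCC routine."""
    rate, signal = wavfile.read(wav_path)
    step = int(rate * WINDOW_SECS)
    fps = len(frames) * rate / len(signal)             # video frame rate
    for start in range(0, len(signal) - step + 1, step):
        audio_feat = compute_mfcc(signal[start:start + step], rate)
        center = int((start + step / 2) / rate * fps)  # center frame of the window
        # feature frame: a random frame far (here, > 1 s) from the center frame
        far = [i for i in range(len(frames)) if abs(i - center) > fps]
        feat_frame = frames[np.random.choice(far)]
        yield audio_feat, identity_input(frames[center], feat_frame), frames[center]
```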
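A minimal Keras sketch of the encoder-decoder layout (Keras is assumed from the `.h5` model files; the layer counts, sizes, input shapes, and L1 loss are placeholder assumptions, not the repository's exact configuration):

```python
from tensorflow.keras import layers, Model

def build_model(audio_shape=(12, 35, 1), identity_shape=(112, 112, 6)):
    # Audio encoder: convolves the MFCC window down to a compact embedding.
    audio_in = layers.Input(audio_shape)
    a = layers.Conv2D(64, 3, padding="same", activation="relu")(audio_in)
    a = layers.MaxPooling2D()(a)
    a = layers.Conv2D(128, 3, padding="same", activation="relu")(a)
    a = layers.Flatten()(a)
    a = layers.Dense(256, activation="relu")(a)

    # Identity encoder: encodes the masked ground-truth frame stacked with
    # the feature frame (6 channels in total).
    id_in = layers.Input(identity_shape)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(id_in)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)

    # Image decoder: the concatenated embeddings are upsampled back to a frame.
    z = layers.Concatenate()([a, x])
    z = layers.Dense(14 * 14 * 256, activation="relu")(z)
    z = layers.Reshape((14, 14, 256))(z)
    z = layers.Conv2DTranspose(128, 3, strides=2, padding="same", activation="relu")(z)
    z = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(z)
    out = layers.Conv2DTranspose(3, 3, strides=2, padding="same", activation="sigmoid")(z)

    model = Model([audio_in, id_in], out)
    model.compile(optimizer="adam", loss="mae")  # L1 reconstruction loss (assumed)
    return model
```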
To run the project:
- Feature extraction: Extract and store the audio and identity features.
  python3 extract_features.py -d=data/
- Train: Train the model on the processed data.
  python3 train.py -f=features/
- Generate video: Generate the output video for a given audio and image input (a sketch of the final assembly step follows below).
  python3 predict.py -m=saved_models/best_model.h5 -t=1.mp4 -f=frame.jpg
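Internally, generation presumably proceeds window by window: predict one frame per 0.35 s of audio, write the frames to a video, and mux the driving audio back in with ffmpeg. A rough sketch of that last assembly step (the file names and the 25 fps rate are assumptions, not the repository's actual values):

```python
import subprocess
import cv2

def assemble_video(frames, audio_path, out_path="output.mp4", fps=25):
    """Write the predicted frames to a silent video, then mux in the audio."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter("silent.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (w, h))
    for frame in frames:
        writer.write(frame)  # frames are expected as BGR uint8 arrays
    writer.release()
    # Combine the generated frames with the driving audio track via ffmpeg.
    subprocess.run(["ffmpeg", "-y", "-i", "silent.mp4", "-i", audio_path,
                    "-c:v", "copy", "-c:a", "aac", "-shortest", out_path],
                   check=True)
```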
The following additional parameters can be tuned:
>> python3 train.py
usage: train.py [-f FEATURE_PATH] [-b BATCH_SIZE]
[-e NO_OF_EPOCHS] [-spe STEPS_PER_EPOCH]
[-g NO_OF_GPUs] [-s NO_OF_STILL_IMAGES]
[-md MODEL_DIRECTORY]
Optional arguments:
-f FEATURE_PATH, --feature_path
Path to the directory containing the extracted features.
-b BATCH_SIZE, --batch_size
Batch size of the data to be used while training. Default value is 64.
-e NO_OF_EPOCHS, --epochs
Number of epochs to be used while training. Default value is 20.
-spe STEPS_PER_EPOCH, --steps_per_epoch
Number of steps to be used per epoch while training. Default value is 1000.
-g NO_OF_GPUs, --no_of_gpus
Total number of GPUs to be used while training. If the number of GPUs is greater than 1, multi-GPU support will be enabled. Default value is 1.
-s NO_OF_STILL_IMAGES, --no_of_still_images
Number of still images to be used while training. Default value is 1.
-md MODEL_DIRECTORY, --model_directory
Path to save the trained model. Default directory is "saved_models/".
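For example, to train on two GPUs with a larger batch size and more epochs:
python3 train.py -f=features/ -b=128 -e=30 -g=2 -md=saved_models/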