- `checkpoints` - weights of the pre-trained model
- `data_files` - input video and target audio to be synced
- `face_detection` - a model to detect faces in a frame (ref. https://github.com/Rudrabha/Wav2Lip)
- `models` - the SOTA model for the lip-sync task, Wav2Lip (ref. https://github.com/Rudrabha/Wav2Lip)
- `requirements.txt` - pinned packages to be installed
- `sync.py` - the actual Python script to be run by the end user (sketched below)
- `utils.py` - a utility script for `sync.py`
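For orientation, here is a minimal sketch of how these pieces typically fit together in a Wav2Lip-style pipeline. It is an illustration only, not the actual contents of `sync.py`; the checkpoint filename and the exact loading steps are assumptions based on the reference Wav2Lip repo.

```python
import torch
from models import Wav2Lip        # generator architecture from the models/ folder
import face_detection             # S3FD-based detector from the face_detection/ folder

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained generator weights from checkpoints/
# (the filename is an assumption; use whichever checkpoint you downloaded).
checkpoint = torch.load("checkpoints/wav2lip_gan.pth", map_location=device)
state_dict = {k.replace("module.", ""): v for k, v in checkpoint["state_dict"].items()}

model = Wav2Lip()
model.load_state_dict(state_dict)
model = model.to(device).eval()

# Face detector used to locate the mouth region in every video frame.
detector = face_detection.FaceAlignment(face_detection.LandmarksType._2D,
                                        flip_input=False, device=device)
```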
- Create a virtual environment

  ```
  conda create --name listed_1 python=3.6
  ```
- Activate the environment

  ```
  conda activate listed_1
  ```
- Clone the repo

  ```
  git clone https://github.com/adityagandhamal/Assgn1.git
  ```
- Move into the repo

  ```
  cd Assgn1
  ```
- Install the dependencies

  ```
  pip install -r requirements.txt
  ```

  [Note: If the process gets stuck (which usually happens while building dependency wheels), install the packages one by one instead.]
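If that happens, a small helper like the one below (a sketch, not part of the repo) installs each pinned package individually so a single slow or failing wheel does not stall the whole run:

```python
# install_one_by_one.py - run from the Assgn1 directory
import subprocess
import sys

with open("requirements.txt") as f:
    for line in f:
        pkg = line.strip()
        if not pkg or pkg.startswith("#"):
            continue  # skip blank lines and comments
        # check=False: keep going even if one package fails to build
        subprocess.run([sys.executable, "-m", "pip", "install", pkg], check=False)
```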
- Download the face detection model and place it in `face_detection/detection/sfd/` as `s3fd.pth`
- Download the weights of the pre-trained Wav2Lip + GAN model and place the file in `checkpoints/`
- Run the script

  ```
  python sync.py
  ```
- You'll obtain an output file `listed_out.mp4`

[Sample demo videos embedded here: `vid_in_trim2.mp4`, `output10_trim.mp4`, `download.1.mp4`]
The sample above is just a demo to give a notion of the task; the actual output video is attached as a Drive link in the mail. Given the limitations of the pre-trained model and the nature of the input video (the subject frequently disappears from the scene), both the video and the audio were trimmed using a third-party website (the same trim can also be done locally, as sketched below).
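For reference, a local trim with ffmpeg is one alternative to a third-party website. This is only a sketch; the filenames, start time, and duration are placeholders, and it assumes ffmpeg is installed on the system:

```python
# trim_clip.py - cut a short segment out of the input video (ffmpeg must be on PATH)
import subprocess

subprocess.run([
    "ffmpeg",
    "-i", "data_files/input_video.mp4",   # placeholder input path
    "-ss", "00:00:05",                    # start of the segment
    "-t", "20",                           # duration in seconds
    "-c", "copy",                         # copy streams without re-encoding
    "data_files/input_video_trim.mp4",    # placeholder output path
], check=True)
```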
Also, make sure to run the code on a GPU instance, as the process gets killed on a CPU. A screenshot of the killed CPU run is attached below as proof. You can confirm a GPU is visible before running, as sketched below.
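A quick way to check that a CUDA-capable GPU is visible to PyTorch before launching `sync.py` (a generic check, not part of the repo):

```python
import torch

# Abort early with a clear message instead of letting the run be killed on CPU.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected - run this on a GPU instance.")

print(f"Using GPU: {torch.cuda.get_device_name(0)}")
```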