Official repository for our CVPR 2025 paper InsTaG: Learning Personalized 3D Talking Head from Few-Second Video.
Tested on Ubuntu 18.04, CUDA 11.3 / 11.7, PyTorch 1.12.1 / 1.13.1
```bash
git submodule update --init --recursive
conda env create --file environment.yml
conda activate instag
pip install "git+https://github.com/facebookresearch/pytorch3d.git"
pip install tensorflow-gpu==2.10.0
```
If you encounter installation problems with `diff-gaussian-rasterization` or `gridencoder`, please refer to diff-gaussian-rasterization and torch-ngp.
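As a fallback, the two CUDA extensions can usually be built by hand. A minimal sketch, assuming the usual submodule layout (`submodules/diff-gaussian-rasterization` and `gridencoder`; adjust the paths to your checkout):

```bash
# Hedged example: build the CUDA extensions manually inside the instag environment.
# The paths below are assumptions about the repository layout.
pip install ./submodules/diff-gaussian-rasterization
pip install ./gridencoder
```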
- Prepare pre-trained weights for the tools and the 3DMM model for head pose estimation.

  ```bash
  bash scripts/prepare.sh
  ```
- Download the 3DMM model from Basel Face Model 2009:

  ```bash
  # 1. Copy 01_MorphableModel.mat to data_utils/face_tracking/3DMM/
  # 2. Run the following:
  cd data_utils/face_tracking
  python convert_BFM.py
  ```
- Prepare the environment for EasyPortrait:

  ```bash
  # Prepare mmcv.
  conda activate instag
  pip install -U openmim
  mim install mmcv-full==1.7.1 prettytable

  # Download the model weight.
  wget "https://rndml-team-cv.obs.ru-moscow-1.hc.sbercloud.ru/datasets/easyportrait/experiments/models/fpn-fp-512.pth" -O data_utils/easyportrait/fpn-fp-512.pth
  ```
- Prepare the environment for sapiens:

  ```bash
  conda create -n sapiens_lite python=3.10
  conda activate sapiens_lite
  conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=12.1 -c pytorch -c nvidia
  pip install opencv-python tqdm json-tricks

  # Download the sapiens models. Git LFS is required.
  # We choose the 0.3b models by default to save storage and time; the 2b models perform better.
  # Users in Mainland China may manually switch to the HF mirror in the script for faster downloads.
  bash scripts/prepare_sapiens.sh
  ```
This code is provided for research purposes only. The author makes no warranties, express or implied, as to the accuracy, completeness, or fitness for a particular purpose of the code. Use this code at your own risk.
The author explicitly prohibits the use of this code for any malicious or illegal activities. By using this code, you agree to comply with all applicable laws and regulations, and you agree not to use it to harm others or to perform any actions that would be considered unethical or illegal.
The author will not be responsible for any damages, losses, or issues that arise from the use of this code.
Users are required to use this code responsibly and ethically.
The video data should be separated into two parts. In our paper, we take 5 long videos to compose the pre-training set, stored in `./data/pretrain`, and use the others for testing.
Our recommended structure is as follows:
```
./data/
├── pretrain                # pre-training set
│   ├── <ID 1>
│   │   └── <ID 1>.mp4      # original video
│   ├── <ID 2>
│   │   └── <ID 2>.mp4      # original video
│   ...
├── <ID 3>                  # other data for test
│   └── <ID 3>.mp4          # original video
...
```
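For example, a hedged sketch of laying out one pre-training identity ("Obama1") and one test identity ("cnn"), assuming you have already downloaded the source videos (file paths are placeholders):

```bash
# Hypothetical source paths; place each video under its own ID folder.
mkdir -p data/pretrain/Obama1 data/cnn
mv /path/to/obama_video.mp4 data/pretrain/Obama1/Obama1.mp4
mv /path/to/cnn_video.mp4 data/cnn/cnn.mp4
```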
- Put the training video under `data/<ID>/<ID>.mp4`. The video must be 25 FPS, with all frames containing the talking person. The resolution should be about 512x512.
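  If your footage does not meet these requirements, a hedged re-encoding example with ffmpeg (the crop values are placeholders; adjust them so the speaker stays centered and fully visible):

  ```bash
  # Illustrative only: re-encode to 25 FPS and a roughly 512x512 square crop.
  ffmpeg -i raw_video.mp4 -r 25 -vf "crop=in_h:in_h,scale=512:512" -c:v libx264 -c:a aac data/<ID>/<ID>.mp4
  ```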
- Run the script to process the video.

  ```bash
  # Required.
  python data_utils/process.py data/<ID>/<ID>.mp4

  # Optional. To retain at least 12s of data for evaluation.
  python data_utils/split.py data/<ID>/<ID>.mp4
  ```
- Obtain Action Units: run `FeatureExtraction` in OpenFace, then rename and move the output CSV file to `data/<ID>/au.csv`.
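  A hedged example invocation (the OpenFace binary location and output naming depend on your installation):

  ```bash
  # Illustrative only: extract Action Units with OpenFace, then move the CSV into place.
  ./OpenFace/build/bin/FeatureExtraction -f data/<ID>/<ID>.mp4 -out_dir data/<ID>/openface -aus
  mv data/<ID>/openface/<ID>.csv data/<ID>/au.csv
  ```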
- Generate tooth masks.

  ```bash
  export PYTHONPATH=./data_utils/easyportrait
  python ./data_utils/easyportrait/create_teeth_mask.py ./data/<ID>
  ```
- Generate geometry priors. Only for adaptation; not required for pre-training data.

  ```bash
  conda activate sapiens_lite

  # Generates geometry priors for the first 500 images by default, using 4 GPUs.
  # You can configure this at ./data_utils/sapiens/lite/scripts .
  bash ./data_utils/sapiens/run.sh ./data/<ID>
  ```
In our paper, we use DeepSpeech features for evaluation. Additionally, we find that AVE from SyncTalk performs remarkably well, despite a bit of instability.
- DeepSpeech: the extractor we used in our experiments for evaluation.

  ```bash
  # Saved to data/<name>.npy
  python data_utils/deepspeech_features/extract_ds_features.py --input data/<name>.wav
  ```
- Wav2Vec: performs better than DeepSpeech in most cases.

  ```bash
  # Saved to data/<name>_eo.npy
  python data_utils/wav2vec.py --wav data/<name>.wav --save_feats
  ```
- AVE: the best lip synchronization in most few-shot cases, especially for English, but may cause jitter and crashes.

  ```bash
  # No operation needed in this step.
  ```
- HuBERT: good generalizability for non-English languages such as Chinese and other OOD situations; works better with a longer training video. Notably, in our tests, HuBERT does not perform ideally in extreme few-shot scenarios, possibly due to its high feature dimension. We recommend using at least 10s of data for training.

  ```bash
  # Borrowed from GeneFace. English pre-trained.
  # Saved to data/<name>_hu.npy
  python data_utils/hubert.py --wav data/<name>.wav
  ```
We assume all pre-training video data is in `./data/pretrain`. The IDs of the videos to use can be specified in `./pretrain_face.py` and `./pretrain_mouth.py`.
In our paper, we use five videos for pre-training, including three videos "Obama1", "Jae-in", "Shaheen" from GeneFace, and "may", "macron" from TalkingGaussian.
After organizing and pre-processing the videos, run the script to start pre-training.
```bash
# Several scripts are available in ./scripts for different audio extractors.
bash scripts/pretrain_con.sh data/pretrain output/<project_name> <GPU_ID>
```
Memory Hints: Please be cautious about computer memory consumption. Each 5 minutes of training data requires about 12GB of RAM for preloading. You may implement on-the-fly loading yourself to reduce the consumption.
Checkpoints: We provide four pre-training weights: Google Drive. You may unzip and put them in `./output` for testing. Note that these pre-training weights are for research purposes and correspond to our paper; you may need to retrain the model with customized data, language, and configuration to reach the performance you need.
Additionally, a trial weight pre-trained with data including Chinese videos is provided here.
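For reference, a hedged example of placing the downloaded weights (the archive name is a placeholder for whichever checkpoint you downloaded):

```bash
# Hypothetical archive name; unzip the downloaded pre-training weights into ./output .
unzip instag_pretrain_weights.zip -d output/
```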
Given a new identity, run the script to train a new person-specific model based on the pre-training.
```bash
# Audio encoder and pre-training checkpoint can be configured in the script.
# By default, 10s of data is used.
bash scripts/train_xx_few.sh data/<ID> output/<project_name> <GPU_ID>
```
The video of the new identity will be separated into a training clip and a test clip of at least 12s. Evaluation metrics will be calculated and reported on the test set.
The videos used for testing in our paper are fetched from DFRF ("cnn", "cnn2", "english_w") and GeneFace ("Lieu"). You can start with these videos to check that the environment is installed correctly.
Here are some options that may help customization. Some defaults are set in the scripts; a combined usage example follows the list.
- `--audio_extractor`: Specify the type of audio extractor to use. Options: `deepspeech` (default), `ave`, `esperanto`, `hubert`.
- `--N_views`: For training only. The number of frames used in training, at 25 frames per second. `-1` denotes using the whole training clip.
- `--long`: For training only. Specify it if data is sufficient, i.e., when using minutes or more of video for training. Geometry regularization is disabled in this mode with `--N_views -1`, so you can skip generating the geometry priors.
- `--all_for_train`: For training only. Merge the training and test clips to train the model. Note that after specifying this, the reported metrics become invalid.
- `--use_train`: For inference only. Use the pose sequence of the training set to drive the rendering.
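For example, a hedged sketch of using these options at inference time (the training-only flags are configured inside the training scripts, as noted above):

```bash
# Hedged example: render the test clip with HuBERT features instead of the default DeepSpeech.
python synthesize_fuse.py -S data/<ID> -M output/<project_name> --eval --audio_extractor hubert
```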
Run the script to render the test clip.
```bash
# Saved to output/<project_name>/test/ours_None/renders
# Specify `--audio_extractor` if not using DeepSpeech.
python synthesize_fuse.py -S data/<ID> -M output/<project_name> --eval
```
You can also specify an audio track to drive the talking head.
```bash
# Specify `--audio_extractor` if not using DeepSpeech.
# If AVE is used, please set the path of a .wav file for `--audio`.
python synthesize_fuse.py -S data/<ID> -M output/<project_name> --dilate --use_train --audio <preprocessed_audio_feature>.npy --audio_extractor ave
```
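For extractors other than AVE, a hedged end-to-end example with a custom clip, assuming a hypothetical file `data/demo.wav` and the default DeepSpeech extractor:

```bash
# Extract DeepSpeech features for the custom audio (saved to data/demo.npy).
python data_utils/deepspeech_features/extract_ds_features.py --input data/demo.wav

# Drive the talking head with the extracted feature.
python synthesize_fuse.py -S data/<ID> -M output/<project_name> --dilate --use_train --audio data/demo.npy
```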
Consider citing as below if you find this repository helpful to your project:
```bibtex
@inproceedings{li2025instag,
  title={InsTaG: Learning Personalized 3D Talking Head from Few-Second Video},
  author={Li, Jiahe and Zhang, Jiawei and Bai, Xiao and Zheng, Jin and Zhou, Jun and Gu, Lin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```
This code is developed on gaussian-splatting with simple-knn, and a modified diff-gaussian-rasterization. Partial code is from RAD-NeRF, DFRF, GeneFace, and AD-NeRF. The teeth mask is from EasyPortrait. Geometry priors are from sapiens. Thanks for these great projects!