diffusers==0.17.0
- Download the VITON-HD dataset
- Clothes feature extraction: use `extract_dino_fea.py` (a sketch of what this step does follows the tree below)
Once the dataset is downloaded, the folder structure should look like this:
├── VITON-HD
│   ├── test_pairs.txt
│   ├── train_pairs.txt
│   ├── [train | test]
│   │   ├── image
│   │   │   ├── [000006_00.jpg | 000008_00.jpg | ...]
│   │   ├── cloth
│   │   │   ├── [000006_00.jpg | 000008_00.jpg | ...]
│   │   ├── cloth-mask
│   │   │   ├── [000006_00.jpg | 000008_00.jpg | ...]
│   │   ├── image-parse-v3
│   │   │   ├── [000006_00.png | 000008_00.png | ...]
│   │   ├── openpose_img
│   │   │   ├── [000006_00_rendered.png | 000008_00_rendered.png | ...]
│   │   ├── openpose_json
│   │   │   ├── [000006_00_keypoints.json | 000008_00_keypoints.json | ...]
│   │   ├── dino_fea
│   │   │   ├── [000006_00.pt | 000008_00.pt | ...]
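A minimal sketch of what `extract_dino_fea.py` plausibly does, assuming a DINOv2 backbone loaded from torch.hub (the actual script may use a different DINO variant, input size, or output key):

```python
import os
import torch
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
# Backbone choice is an assumption; swap in the repo's actual DINO model.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

cloth_dir, fea_dir = "VITON-HD/train/cloth", "VITON-HD/train/dino_fea"
os.makedirs(fea_dir, exist_ok=True)
with torch.no_grad():
    for name in sorted(os.listdir(cloth_dir)):
        x = preprocess(Image.open(os.path.join(cloth_dir, name)).convert("RGB"))
        # DINOv2 hub models expose patch tokens via forward_features
        fea = model.forward_features(x.unsqueeze(0).to(device))["x_norm_patchtokens"]
        torch.save(fea.cpu(), os.path.join(fea_dir, name.rsplit(".", 1)[0] + ".pt"))
```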
- Download the DressCode dataset
- To improve the performance of our warping module, we found that in-shop images with a white background yield better results. We therefore provide pre-extracted masks for removing the background, which you can download here. Once downloaded, extract the mask files and place them in the dataset folder alongside the corresponding images (a background-whitening sketch follows the tree below).
- Clothes feature extraction: use `extract_dino_fea.py`
Once the dataset is downloaded, the folder structure should look like this:
├── DressCode
│   ├── test_pairs_paired.txt
│   ├── test_pairs_unpaired.txt
│   ├── train_pairs.txt
│   ├── [dresses | lower_body | upper_body]
│   │   ├── test_pairs_paired.txt
│   │   ├── test_pairs_unpaired.txt
│   │   ├── train_pairs.txt
│   │   ├── images
│   │   │   ├── [013563_0.jpg | 013563_1.jpg | 013564_0.jpg | 013564_1.jpg | ...]
│   │   ├── masks
│   │   │   ├── [013563_1.png | 013564_1.png | ...]
│   │   ├── keypoints
│   │   │   ├── [013563_2.json | 013564_2.json | ...]
│   │   ├── label_maps
│   │   │   ├── [013563_4.png | 013564_4.png | ...]
│   │   ├── skeletons
│   │   │   ├── [013563_5.jpg | 013564_5.jpg | ...]
│   │   ├── dense
│   │   │   ├── [013563_5.png | 013563_5_uv.npz | 013564_5.png | 013564_5_uv.npz | ...]
│   │   ├── dino_fea
│   │   │   ├── [013563_1.pt | 013564_1.pt | ...]
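The background removal that the masks enable is a simple composite; a sketch, where the file names follow the tree above and the 128 threshold is an assumption:

```python
import numpy as np
from PIL import Image

def whiten_background(image_path: str, mask_path: str) -> Image.Image:
    """Replace everything outside the garment mask with white."""
    img = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.uint8).copy()
    mask = np.asarray(Image.open(mask_path).convert("L"), dtype=np.uint8)
    img[mask < 128] = 255  # background pixels -> white
    return Image.fromarray(img)

whiten_background("DressCode/upper_body/images/013563_1.jpg",
                  "DressCode/upper_body/masks/013563_1.png").save("013563_1_white.jpg")
```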
- Download the VVT dataset (ask the authors)
- Pose estimation with OpenPose (a keypoint-loading sketch follows the tree below)
- Clothes feature extraction: use `extract_dino_fea_vtt.py`
Once the dataset is prepared, the folder structure should look like this:
├── VVT
│   ├── test_pairs.txt
│   ├── train_pairs.txt
│   ├── clothes_person
│   │   ├── dino_fea
│   │   ├── img
│   ├── train_frames
│   │   ├── [4be21d0a1-n11 | 4be21d09i-k11 | ...]
│   ├── train_frames_parsing
│   │   ├── [4be21d0a1-n11 | 4be21d09i-k11 | ...]
│   ├── train_openpose_img
│   │   ├── [4be21d0a1-n11 | 4be21d09i-k11 | ...]
│   ├── train_openpose_json
│   │   ├── [4be21d0a1-n11 | 4be21d09i-k11 | ...]
│   ├── test_frames
│   │   ├── [4he21d00f-g11 | 4he21d00f-k11 | ...]
│   ├── test_frames_parsing
│   │   ├── [4he21d00f-g11 | 4he21d00f-k11 | ...]
│   ├── test_openpose_img
│   │   ├── [4he21d00f-g11 | 4he21d00f-k11 | ...]
│   ├── test_openpose_json
│   │   ├── [4he21d00f-g11 | 4he21d00f-k11 | ...]
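For reference, the files in the openpose_json folders follow the standard OpenPose output layout (flat x, y, confidence triples per person); a small loader sketch:

```python
import json
import numpy as np

def load_openpose(path: str) -> np.ndarray:
    """Return the first person's keypoints as a (num_joints, 3) array of x, y, confidence."""
    with open(path) as f:
        data = json.load(f)
    kps = np.array(data["people"][0]["pose_keypoints_2d"], dtype=np.float32)
    return kps.reshape(-1, 3)
```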
- Download the TikTok dataset
- Pre-processing:
  - OpenPose for pose estimation
  - human parsing with SCHP
  - feature extraction: use `extract_dino_fea.py`
Once the dataset is processed, the folder structure should look like this:
├── TikTok
│   ├── test_pairs.txt
│   ├── train_pairs.txt
│   ├── cloth
│   │   ├── [00008.png | 00009.png | ...]
│   ├── dino_fea
│   │   ├── [00008.pt | 00009.pt | ...]
│   ├── openpose_img
│   │   ├── [00001 | 00002 | ...]
│   ├── openpose_json
│   │   ├── [00001 | 00002 | ...]
│   ├── parse
│   │   ├── [00001 | 00002 | ...]
│   ├── parse_lip
│   │   ├── [00001 | 00002 | ...]
│   ├── TikTok_dataset
│   │   ├── [00001 | 00002 | ...]
- Download Stable Diffusion 1.5
- Download the MAE pretrained model
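A minimal sketch of loading the downloaded weights with diffusers==0.17.0, assuming the standard `runwayml/stable-diffusion-v1-5` layout and an official MAE release checkpoint (both paths are assumptions):

```python
import torch
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel

sd_path = "runwayml/stable-diffusion-v1-5"  # or a local download of SD 1.5
vae = AutoencoderKL.from_pretrained(sd_path, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(sd_path, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(sd_path, subfolder="scheduler")

# Official MAE releases store the weights under a "model" key.
mae_state = torch.load("mae_pretrain_vit_large.pth", map_location="cpu")["model"]
```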
unet
- agnostic_norm_hair_have_background: used for VITON, DressCode, and wild video, but this model has a bug in the down fuse
- agnostic_nonorm_hair_have_background_black: nonorm; keeps the background and hair, but the agnostic cloth region uses a black mask
- agnostic_nonorm_hair_have_background_gray: nonorm; keeps the background and hair, but the agnostic cloth region uses the usual gray mask
- model_VITON_512_fixbug: the model above with the down-fuse bug fixed
- model_VITON_512_DINO_large_large2: rebuttal model; moves the edge map into the cross attention and raises the garment size to 1022; requires the files with 'new' in their names
- model_VTT_192_256_1030_fixbug: VTT
- model_VTT_192_256_1030_fixbug_long: VTT
- model_TikTok_512_fixbug_1107: TikTok
- model_TikTok_512_fixbug_1109_atr: TikTok, parsing uses ATR
- model_TikTok_512_fixbug_1109_lip: TikTok, parsing uses LIP
- model_TikTok_rebuttal_ft_from_TikTok: fine-tuned from the 1109_lip model, with classifier-free guidance added
- model_VITON_512_DINO_large_large_TikTok2: model_VITON_512_DINO_large_large2 fine-tuned on the TikTok dataset; requires the files with 'new' in their names
vae
- HR_VITON_vae: the general-purpose VAE fine-tuned with EMASC
- model_VTT_vae: the VAE for the VVT dataset
Train the network for VITON and DressCode
accelerate launch --mixed_precision="fp16" anydoor_train.py
Train the network for TikTok (only the dataset differs)
accelerate launch --mixed_precision="fp16" anydoor_train_TikTok.py
Train the VAE using EMASC as in LaDI-VTON (a sketch of the EMASC idea follows below)
train_cloth_vae_agnostic.py
train_cloth_vae_agnostic_TikTok.py
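For orientation, the core EMASC idea from LaDI-VTON that these scripts build on: small learned convolutions carry intermediate VAE-encoder features into the decoder so high-frequency detail survives reconstruction in the masked region. A rough sketch with illustrative shapes and names, not the repo's exact module:

```python
import torch
import torch.nn as nn

class EMASCBlock(nn.Module):
    """Mask-aware skip connection: project encoder features for the decoder."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )

    def forward(self, enc_feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # keep only the features outside the inpainting mask, then project them
        return self.proj(enc_feat * (1.0 - mask))
```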
Revise the dataset and model in config.py, then
python infer.py
Note: remember to adjust the dataset's starting position (a hypothetical illustration follows).
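One hypothetical way to adjust the starting position: wrap the dataset in a Subset so inference resumes from a given index (the TensorDataset here is a stand-in for the real try-on dataset):

```python
import torch
from torch.utils.data import Subset, TensorDataset

full = TensorDataset(torch.arange(100))          # stand-in for the try-on test set
start = 40                                       # resume from sample 40 instead of 0
dataset = Subset(full, range(start, len(full)))
```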
`python infer_video.py`: basic video inference; usable for TikTok, VVT, and wild videos
Note: VTTDataset's `set_group` selects which source data is used
`python infer_video_vtt_list.py`: for VVT, predicts multiple cases at once
`python infer_video_mae_guided.py`: adds MAE guidance
`python infer_video_guided.py`: adds CLIP guidance, based on this CLIP guidance example (sketched below)
`python infer_video_mae_clip_guided.py`: adds both MAE and CLIP-similarity guidance
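The guided scripts all follow the same classifier-guidance pattern: at each denoising step, nudge the latents along the gradient of a similarity score between the decoded prediction and a reference. A hedged sketch of the CLIP-similarity variant; the callables and scale are illustrative, not the repo's exact code:

```python
import torch

def clip_guidance_step(latents, decode_fn, clip_image_encoder, ref_embed, scale=100.0):
    """One guidance update: move latents toward higher CLIP similarity."""
    latents = latents.detach().requires_grad_(True)
    image = decode_fn(latents)                      # differentiable decode to pixels
    embed = clip_image_encoder(image)
    embed = embed / embed.norm(dim=-1, keepdim=True)
    sim = (embed * ref_embed).sum()                 # cosine similarity to the reference
    grad = torch.autograd.grad(sim, latents)[0]
    return latents.detach() + scale * grad
```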
FID and KID for the image try-on network. Note that the ground-truth images must be resized to match the prediction size (see the sketch below):
fidelity --input1 VITON_test_unpaired --input2 /data1/hzj/zalando-hd-resized/test/image_512/ -g 1 -f -k
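A sketch of the resize step, assuming 512x384 predictions (the VITON-HD aspect ratio) and the folder names from the command above:

```python
import os
from PIL import Image

src = "/data1/hzj/zalando-hd-resized/test/image"
dst = "/data1/hzj/zalando-hd-resized/test/image_512"
os.makedirs(dst, exist_ok=True)
for name in os.listdir(src):
    img = Image.open(os.path.join(src, name)).convert("RGB")
    img.resize((384, 512), Image.BICUBIC).save(os.path.join(dst, name))  # (W, H)
```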
- When training with anydoor_train, loading our own checkpoint does not require deleting the conv layer.
- wild_config.py and WildVideoDataset.py: try-on directly on wild videos.
- The pipeline should use pcm rather than paser_upper_mask; pcm's zeroed-out region is smaller.
- One unresolved issue: `agnostic = agnostic * (1 - pcm) + pcm * torch.zeros_like(agnostic)` seems to cause a data leak, for reasons unknown (our guess is that the gray mask's value is not exactly 0), since the visualizations all look identical.
- Dataset bug fix: ImageDraw's gray does not normalize to 0 but to a small value, 0.00392163, so there was previously a data leak (demonstrated in the first sketch below).
- Added classifier-free guidance, which clearly improves results on both the TikTok and VITON datasets; train with one_stage_train.py. To minimize code changes, the guidance scale is hard-coded in blende_cloth_pipeline.py (second sketch below).
- The edge map is concatenated to the garment so they go through cross attention together, and the DINO feature input is raised to 1022; this requires the files ending in 'new'. At inference time, manually change the input channels of unet_emasc (including the copies under video_models/) from 11 to 7, because the conditional branch no longer takes the edge map (third sketch below).
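The gray-mask data leak can be reproduced in a few lines: PIL's gray (128) maps to 0.00392..., not 0, after the usual [-1, 1] normalization:

```python
from PIL import Image, ImageDraw
from torchvision import transforms

img = Image.new("RGB", (4, 4))
ImageDraw.Draw(img).rectangle([0, 0, 3, 3], fill="gray")  # gray = (128, 128, 128)
t = transforms.Normalize([0.5] * 3, [0.5] * 3)(transforms.ToTensor()(img))
print(t[0, 0, 0].item())  # 0.0039215..., not exactly 0
```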
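The classifier-free guidance step hard-coded in blende_cloth_pipeline.py presumably follows the standard form; a sketch, with an illustrative guidance scale:

```python
import torch

def cfg_noise_pred(unet, latents, t, cond_embed, uncond_embed, guidance_scale=2.0):
    """Standard classifier-free guidance: extrapolate from uncond toward cond."""
    noise = unet(torch.cat([latents] * 2), t,
                 encoder_hidden_states=torch.cat([uncond_embed, cond_embed])).sample
    noise_uncond, noise_cond = noise.chunk(2)
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```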
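The 11 -> 7 channel edit can also be sketched programmatically against a diffusers-style UNet; attribute names assume UNet2DConditionModel, and the repo's unet_emasc (and its video_models/ copies) may instead hard-code the count in the model definition:

```python
import torch.nn as nn

def shrink_conv_in(unet, in_channels: int = 7):
    """Rebuild conv_in for 7 input channels, since the edge map is no longer fed in."""
    old = unet.conv_in
    unet.conv_in = nn.Conv2d(in_channels, old.out_channels,
                             kernel_size=old.kernel_size,
                             stride=old.stride, padding=old.padding)
    return unet
```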