VHASR
This repository is the official implementation of VHASR. The paper has been accepted at EMNLP 2024.

Prepare dataset

Download the image and audio data for each dataset and place them in VHASR/dataset/{dataset}/image and VHASR/dataset/{dataset}/audio, respectively:

VHASR/
│
└── dataset/
    ├── ADE20k/
    │   ├── image/
    │   │   ├── ADE_train_00000001.jpg
    │   │   ├── ADE_train_00000002.jpg
    │   │   └── ......
    │   ├── audio/
    │   │   ├── ade20k_train_0000000000000001_90.ogg
    │   │   ├── ade20k_train_0000000000000002_72.ogg
    │   │   └── ......
    │   └── train_data/
    ├── COCO/
    ├── Flickr8k/
    └── OpenImages/

VHASR/dataset/{dataset}/train_data/{split}/img.txt and VHASR/dataset/{dataset}/train_data/{split}/wav.scp list the image paths and audio paths required for each split.
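
As an optional convenience, a minimal sanity-check sketch is shown below (not part of the repository). It assumes each line of img.txt and wav.scp ends with a file path, e.g. a Kaldi-style "utterance-id path" pair; adjust the parsing if your lists use a different layout.

# check_lists.py -- illustrative sketch: verify that every file referenced in
# img.txt / wav.scp actually exists before training.
import os
import sys

def check_list(list_path):
    missing = 0
    with open(list_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            path = line.split()[-1]  # assumption: the last field is the file path
            if not os.path.exists(path):
                missing += 1
                print(f"missing: {path}")
    print(f"{list_path}: {missing} missing file(s)")

if __name__ == "__main__":
    # e.g. python check_lists.py dataset/ADE20k/train_data/{split}/img.txt
    for list_file in sys.argv[1:]:
        check_list(list_file)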

The audio of Flickr8k is available at https://sls.csail.mit.edu/downloads/placesaudio/downloads.cgi.

The audio for ADE20k, COCO, and OpenImages is available at https://google.github.io/localized-narratives/.

Prepare pretrained model

Download clip-vit-base-patch32 and place it in VHASR/pretrained_model/clip-vit-base-patch32 (a scripted download sketch follows the layout below):

VHASR/
│
└── pretrained_model/
    └── clip-vit-base-patch32/
        ├── config.json
        ├── merges.txt
        ├── pytorch_model.bin
        └── ......
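
If you prefer downloading the checkpoint programmatically, a minimal sketch using the huggingface_hub client is shown below; it assumes the intended checkpoint is "openai/clip-vit-base-patch32" on the Hugging Face Hub and that huggingface_hub is installed.

# Illustrative download sketch (assumes `pip install huggingface_hub` and that
# the intended checkpoint is "openai/clip-vit-base-patch32" on the Hub).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openai/clip-vit-base-patch32",
    local_dir="pretrained_model/clip-vit-base-patch32",
)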

Install packages

pip install -r requirements.txt

Train

Download the base model "speech_paraformer_asr-en-16k-vocab4199-pytorch":

from modelscope import snapshot_download
snapshot_download('damo/speech_paraformer_asr-en-16k-vocab4199-pytorch', local_dir='{path_to_save_model}')

Copy the model file and the other config files to VHASR/pretrained_model/VHASR_base:

cp -rn {path_to_save_model}/speech_paraformer_asr-en-16k-vocab4199-pytorch/* VHASR/pretrained_model/VHASR_base

Start training.

cd VHASR
CUDA_VISIBLE_DEVICES=1 python src/finetune.py \
--model_name "pretrained_model/VHASR_base" \
--output_dir "{dataset}_checkpoint" \
--data_path "dataset/{dataset}/train_data" \
--epoch 120 

Following our paper, you can use VHASR pretrained on SpokenCOCO as the base model for training.

cd VHASR
CUDA_VISIBLE_DEVICES=1 python src/finetune.py \
--model_name "pretrained_model/VHASR_pretrain" \
--output_dir "{dataset}_checkpoint" \
--data_path "dataset/{dataset}/train_data" \
--epoch 120 

After training, place the trained model file and other configuration files in the same folder for subsequent testing.

cd VHASR
mkdir pretrained_model/my_VHASR_{dataset}
ls pretrained_model/VHASR_base/ | grep -v model.pb | xargs -i cp -r pretrained_model/VHASR_base/{} pretrained_model/my_VHASR_{dataset}
cp {dataset}_checkpoint/valid.acc.best.pb pretrained_model/my_VHASR_{dataset}/model.pb
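
Before moving on to testing, you can optionally confirm the assembled folder contains the key files; a minimal sketch is shown below. It only checks the files named in the pretrained-model layout further down (model.pb, config.yaml, am.mvn), and the folder name is an illustrative example.

# Illustrative check: make sure the assembled model folder has the key files.
import os

model_dir = "pretrained_model/my_VHASR_ADE20k"  # example path; use your own dataset name
for name in ("model.pb", "config.yaml", "am.mvn"):
    if not os.path.isfile(os.path.join(model_dir, name)):
        print(f"missing: {name}")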

Test

Test your own trained model. The "merge_method" argument can be set to 1, 2, or 3, corresponding to $M_1$, $M_2$, and $M_3$ in the paper, respectively.

cd VHASR
CUDA_VISIBLE_DEVICES=1 python src/evaluate.py \
--model_name "pretrained_model/my_VHASR_{dataset}" \
--data_path "dataset/{dataset}/train_data" \
--merge_method 3

You can download the trained VHASR models from Google Drive or Hugging Face and put the files in VHASR/pretrained_model/VHASR_{dataset}:

VHASR/
│
└── pretrained_model/
    ├── VHASR_ADE20k/
    │   ├── am.mvn
    │   ├── config.yaml
    │   ├── model.pb
    │   └── ......
    ├── VHASR_COCO/
    ├── VHASR_Flickr8k/
    └── VHASR_OpenImages/

Start testing.

cd VHASR
CUDA_VISIBLE_DEVICES=1 python src/evaluate.py \
--model_name "pretrained_model/VHASR_{dataset}" \
--data_path "dataset/{dataset}/train_data" \
--merge_method 3

Infer

We also provide an inference interface. If you are interested in our work, you can train your own model on large-scale data and run inference with the following command. Note that you should place the CLIP config files in '{model_file}/clip_config', as in the four pretrained models we provide.

cd VHASR
CUDA_VISIBLE_DEVICES=1 python src/infer.py \
--model_name "{path_to_model_folder}" \
--speech_path "{path_to_speech}" \
--image_path "{path_to_image}" \
--merge_method 3
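
For transcribing many (speech, image) pairs, a small wrapper around the command above is sketched below; the pair list and model folder are illustrative assumptions, and only the documented CLI flags are used.

# Illustrative batch-inference sketch: invoke src/infer.py once per pair.
import subprocess

pairs = [  # hypothetical (speech, image) pairs
    ("samples/utt1.wav", "samples/img1.jpg"),
    ("samples/utt2.wav", "samples/img2.jpg"),
]

for speech_path, image_path in pairs:
    subprocess.run(
        [
            "python", "src/infer.py",
            "--model_name", "pretrained_model/VHASR_COCO",  # any trained VHASR folder
            "--speech_path", speech_path,
            "--image_path", image_path,
            "--merge_method", "3",
        ],
        check=True,
    )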

Statement

Most of the code in this repository is modified from FunASR v0.8.8 (https://github.com/modelscope/FunASR/tree/v0.8.8).

Citation

@misc{hu2024vhasrmultimodalspeechrecognition,
      title={VHASR: A Multimodal Speech Recognition System With Vision Hotwords}, 
      author={Jiliang Hu and Zuchao Li and Ping Wang and Haojun Ai and Lefei Zhang and Hai Zhao},
      year={2024},
      eprint={2410.00822},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2410.00822}, 
}

License: cc-by-nc-4.0
