This is a PyTorch implementation of the INTERSPEECH 2025 main conference paper "CMSP-ST: Cross-modal Mixup with Speech Purification for End-to-End Speech Translation".
- Python version >= 3.8
- PyTorch
- To install fairseq version 0.12.2 and develop locally:

```shell
cd fairseq
pip install --editable ./
```
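Before going further, the requirements above can be sanity-checked with a small script. This is only a sketch; `check_environment` is a hypothetical helper, not part of the repo:

```python
import importlib.util
import sys

def check_environment():
    """Return a list of problems with the requirements listed above:
    Python >= 3.8 and importable torch / fairseq (0.12.2 expected)."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python >= 3.8 required")
    for mod in ("torch", "fairseq"):
        # find_spec only checks importability; it does not verify the version.
        if importlib.util.find_spec(mod) is None:
            problems.append(f"missing package: {mod}")
    return problems

print(check_environment())  # an empty list means the environment looks OK
```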
- MuST-C: Download the MuST-C v1.0 dataset and place it in `./st/dataset/MuST-C/`.
- CoVoST-2: Download the CoVoST-2 dataset and place it in `./st/dataset/CoVoST/`.
- HuBERT Model: Download the HuBERT Base model and place it in `./models/pretrain/`.
- WMT: Download the WMT 14 / 16 datasets and place them in `./mt/dataset/WMT/`.
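The expected layout can be verified before preprocessing with a short helper. This is a sketch; `missing_dirs` and its repo-root argument are assumptions, not part of the codebase:

```python
import os

# Directories the steps above expect, relative to the repo root (from this README).
REQUIRED_DIRS = [
    "st/dataset/MuST-C",
    "st/dataset/CoVoST",
    "models/pretrain",
    "mt/dataset/WMT",
]

def missing_dirs(root="."):
    """Return the required dataset/model directories not yet present under root."""
    return [d for d in REQUIRED_DIRS if not os.path.isdir(os.path.join(root, d))]

if __name__ == "__main__":
    for d in missing_dirs():
        print(f"missing: {d}")
```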
- cd ./data/st/s2t_raw/
- bash `prep_mustc_data.sh` or `prep_covost_data.sh`
- cd ./data/mt/s2t_raw/
- bash `prep_mtl_mustc_mt.sh` or `prep_mtl_covost_mt.sh` (for multi-task learning)
- bash `prep_exp_mustc_mt.sh` or `prep_exp_covost_mt.sh` (for expanded data)
- cd ./scripts/pretrain/
- bash `train_mtl_mt.sh` and `average_cpt.sh`
- bash `train_exp_mt.sh` and `average_cpt.sh`
- bash `train_exp_mtl_mt.sh` and `average_cpt.sh`
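The `average_cpt.sh` step averages the weights of the last few checkpoints before evaluation. Conceptually, the averaging looks like the following minimal sketch over already-loaded state dicts; `average_state_dicts` is illustrative only, not the repo's actual code (which presumably wraps fairseq's `scripts/average_checkpoints.py`):

```python
def average_state_dicts(states):
    """Average parameter values across checkpoint state dicts.
    Works for any values supporting + and / (plain floats here;
    torch tensors in a real checkpoint)."""
    assert states, "need at least one state dict"
    avg = {}
    for k in states[0]:
        total = states[0][k]
        for s in states[1:]:
            total = total + s[k]
        avg[k] = total / len(states)
    return avg

# Toy usage: two "checkpoints" with one scalar parameter each.
ckpts = [{"w": 1.0}, {"w": 3.0}]
print(average_state_dicts(ckpts))  # {'w': 2.0}
```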
- cd ./scripts/train/
- bash `train_xxxxx_xx2xx.sh` and `evaluation.sh`