> Forked from [InternLM/xtuner](https://github.com/InternLM/xtuner)

An efficient, flexible and full-featured toolkit for fine-tuning LLMs (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)

# LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

[📄[Paper](https://arxiv.org/abs/2407.15415)]

The training code is released in XTuner; more details will be added in the near future. Thank you for your attention!

## Introduction

We introduce LLaST, a framework for building high-performance Large Language Model based speech-to-text translation systems. We address the limitations of end-to-end speech translation (E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. It demonstrates superior performance on the CoVoST-2 benchmark and showcases exceptional scaling capabilities powered by LLMs. We believe this effective method will serve as a strong baseline for speech translation and provide insights for future improvements of LLM-based speech translation frameworks.
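The dual-LoRA idea can be illustrated with a minimal sketch: a frozen base weight plus two independent low-rank adapters that can be enabled separately. This is a generic illustration under assumed names and shapes (the `asr`/`st` adapter names, ranks, and init are hypothetical), not LLaST's actual implementation.

```python
import numpy as np

class DualLoRALinear:
    """Sketch of a linear layer carrying two LoRA adapters.

    The frozen base weight W stays fixed; each adapter contributes a
    low-rank update x A^T B^T scaled by alpha / rank. Standard LoRA
    init (A random, B zero) makes each adapter a no-op at start.
    """

    def __init__(self, in_dim, out_dim, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(out_dim, in_dim))  # frozen base weight
        self.scale = alpha / rank
        # Two independent adapters, e.g. one for the ASR-augmented stage
        # and one for the speech-translation stage (hypothetical naming).
        self.adapters = {
            name: (rng.normal(size=(rank, in_dim)),  # A: random init
                   np.zeros((out_dim, rank)))        # B: zero init
            for name in ("asr", "st")
        }

    def forward(self, x, active=("asr", "st")):
        y = x @ self.W.T
        for name in active:
            A, B = self.adapters[name]
            y = y + self.scale * (x @ A.T) @ B.T
        return y
```

Keeping the two adapters separate lets each training stage update only its own low-rank parameters while the base LLM weights remain frozen.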

## Model List

| Model | Speech Encoder | LLM | HuggingFace | ModelScope |
| --- | --- | --- | --- | --- |
| LLaST-2B | Whisper-Large | TinyLlama | TBD | TBD |
| LLaST-8B | Whisper-Large | Llama2-7B-Instruct | TBD | TBD |

## Training LLaST

### Data Preparation

- Download data from [CommonVoice](https://commonvoice.mozilla.org).

- Prepare the tsv data as follows:

  ```
  covost2/tsv
  ├── covost_v2.de_en.dev.tsv
  ├── covost_v2.de_en.test.tsv
  ```

- Prepare the multilingual data as follows:

  ```
  covost2/audio
  ├── de
  ├── en
  ├── es
  ├── fr
  ├── it
  ├── ja
  └── zh-CN
  ```

- Prepare the audio data as follows:

  ```
  covost2/audio/fr/clips_16k
  ├── common_voice_fr_20241860.wav
  ├── common_voice_fr_20241864.wav
  ├── common_voice_fr_20241868.wav
  ├── common_voice_fr_20241872.wav
  └── common_voice_fr_20241875.wav
  ```
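A quick sanity check that the tsv rows and the audio layout above agree can look like the sketch below. The `path` column name follows the CoVoST v2 tsv format, but treat it as an assumption and adjust to the actual header of your files; `check_covost_split` is a hypothetical helper, not part of xtuner.

```python
import csv
from pathlib import Path

def check_covost_split(tsv_path, audio_dir):
    """Return clip file names listed in the tsv but missing on disk.

    Assumes each tsv row has a `path` column naming the original clip
    (e.g. an .mp3), and that the 16 kHz conversion kept the stem while
    changing the extension to .wav under <audio_dir> (clips_16k).
    """
    audio_dir = Path(audio_dir)
    missing = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            clip = Path(row["path"]).with_suffix(".wav").name
            if not (audio_dir / clip).exists():
                missing.append(clip)
    return missing
```

Running this per language pair before training surfaces download or resampling gaps early, instead of mid-epoch as data-loader errors.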

### Training with XTuner

1. Install XTuner:

   ```shell
   git clone git@github.com:ChenX17/xtuner.git
   cd xtuner
   git checkout add_llast
   ```

2. Launch training:

   ```shell
   export XTUNER_DATASET_TIMEOUT=120
   export HF_EVALUATE_OFFLINE=1
   export HF_DATASETS_OFFLINE=1
   export TRANSFORMERS_OFFLINE=1
   python xtuner/tools/train.py workspace/configs/llast_2b_tinyllama_chat.py --deepspeed deepspeed_zero2
   ```

## Evaluation

```shell
export HF_EVALUATE_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
python xtuner/tools/test.py workspace/configs/llast_2b_tinyllama_chat.py --checkpoint work_dir/xxxx/epoch_1.pth/mp_rank_00_model_states.pt --launcher slurm
```
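Speech translation on CoVoST-2 is conventionally scored with corpus BLEU. For reference, here is a self-contained single-reference BLEU sketch on whitespace tokens; it is not the metric implementation the test script uses (that depends on the xtuner config), and production evaluation should use a standard tool such as sacreBLEU.

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis."""
    clipped = [0] * max_n   # clipped n-gram matches
    total = [0] * max_n     # candidate n-gram counts
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            total[n - 1] += max(len(h) - n + 1, 0)
            rc = _ngrams(r, n)
            clipped[n - 1] += sum(min(c, rc[g]) for g, c in _ngrams(h, n).items())
    if min(total) == 0 or min(clipped) == 0:
        return 0.0
    # Geometric mean of n-gram precisions times the brevity penalty.
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_prec)
```

Note that BLEU scores are only comparable when tokenization is held fixed, which is why a standardized scorer is preferred for reporting benchmark numbers.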

Citation

@inproceedings{chen2024llast,
  title = {LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models},
  author = {Chen, Xi and Zhang, Songyang and Bai, Qibing and Chen, Kai and Nakamura, Satoshi},
  booktitle = {Findings of the Association for Computational Linguistics (ACL),},
  year = {2024}
}

## Acknowledgement
