ArTST: Arabic Text and Speech Transformer
ArabicNLP 2023, ACL 2025
ArTST is a pre-trained Arabic text and speech transformer supporting open-source speech technologies for the Arabic language. The model architecture in this first edition follows the unified-modal framework SpeechT5, which was recently released for English, and is focused on Modern Standard Arabic (MSA), with plans to extend the model to dialectal and code-switched Arabic in future editions. We pre-trained the model from scratch on MSA speech and text data, and fine-tuned it for the following tasks: Automatic Speech Recognition (ASR), Text-To-Speech synthesis (TTS), and spoken dialect identification.
- June 2025 — ArTSTv1.5: MSA pre-training with diacritics. Pre-trained ArTST from scratch on the MGB2 and Tashkeela datasets.
- May 2025 — ArTSTv2 and ArTSTv3 make it to ACL (main conference). Our paper, "Dialectal Coverage and Generalization in Arabic Speech Recognition", was officially accepted at ACL 2025. A significant milestone for dialectal Arabic ASR research.
- April 2025 — A milestone in dialectal ASR: 17 fine-tuned checkpoints released. We've expanded ArTSTv2's reach by publishing 17 fine-tuned models across Arabic dialects. Explore them all on HuggingFace.
- April 2025 — ArTSTv3 goes live on HuggingFace. ArTSTv3 ASR model cards are now available: MGB2 version and QASR version.
- March 2025 — ArTSTv3: Multilingual pre-training for Arabic + English, French, Spanish. A major upgrade for ArTST: ArTSTv3 now spans Arabic dialects and adds multilingual support. Available on HuggingFace.
- December 2024 — Fine-tuning made easy. Released a practical notebook for fine-tuning with the Hugging Face Trainer; run it on Google Colab.
- October 2024 — ArTSTv2 released with HuggingFace integration. ArTSTv2 is now live along with model cards: ASR v2 (MGB2) and ASR v2 (QASR).
- October 2024 — ArTSTv2: Dialectal pre-training for Arabic. Pre-trained ArTST from scratch on 17 Arabic dialects.
- October 2024 — ArTSTv1 joins HuggingFace. The first version of our ASR model is now available here: ASR v1.
- February 2024 — Bug fix. Addressed key checkpoint-loading issues.
- February 2024 — TTS support launched. We've released ArTST TTS for Arabic via HuggingFace's Transformers: TTS model.
- December 2023 — Speech-to-Text (ASR) demo now on HuggingFace Spaces. Try the ArTST ASR model in real time: Demo here.
- November 2023 — Text-to-Speech (TTS) demo released. Experience our TTS model in action: Demo here.
- October 2023 — ArTST goes open-source. Model weights are now publicly accessible on HuggingFace.
- October 2023 — ArTST recognized at EMNLP 2023. Our work was accepted at the ArabicNLP workshop at EMNLP 2023.
Model | Pre-training Dataset | Checkpoint | Tokenizer |
---|---|---|---|
ArTST v1 base | MGB2 | Hugging Face | Hugging Face |
ArTST v1.5 base | MGB2 + Tashkeela | Hugging Face | Hugging Face |
ArTST v2 base | Dialects | Hugging Face | Hugging Face |
ArTST v3 base | Multilingual | Hugging Face | Hugging Face |
Model | Fine-tuning Dataset | Checkpoint | Tokenizer |
---|---|---|---|
ArTST v1 ASR | MGB2 | Hugging Face | Hugging Face |
ArTST v1 TTS | ClArTTS | Hugging Face | Hugging Face |
ArTST* TTS | ClArTTS | Hugging Face | Hugging Face |
ArTST v2 ASR | QASR | Hugging Face (safetensors) | Hugging Face |
ArTST v2 ASR | MGB2 | Hugging Face | Hugging Face |
ArTST v2 ASR | QASR | Hugging Face | Hugging Face |
ArTST v2 ASR | Dialects | Hugging Face | Hugging Face |
ArTST v3 ASR | MGB2 | Hugging Face | Hugging Face |
ArTST v3 ASR | QASR | Hugging Face | Hugging Face |
ArTST v3 ASR | Multilingual | soon | soon |
Python version: 3.8+
- Clone this repo and set up the environment:

```bash
cd ArTST
conda create -n artst python=3.8
conda activate artst
pip install -r requirements.txt
```
- Install fairseq:

```bash
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
python setup.py build_ext --inplace
```
- Download checkpoints:

```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/MBZUAI/ArTST
```
Transformers version: 4.46.3
```python
import torch
from transformers import (
    SpeechT5ForSpeechToText,
    SpeechT5Processor,
    SpeechT5Tokenizer,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "mbzuai/artst-v2-asr"  # or "mbzuai/artst_asr" for v1
tokenizer = SpeechT5Tokenizer.from_pretrained(model_id)
processor = SpeechT5Processor.from_pretrained(model_id, tokenizer=tokenizer)
model = SpeechT5ForSpeechToText.from_pretrained(model_id).to(device)
```
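To go from a loaded model to a transcription, here is a minimal sketch building on the snippet above. The `soundfile` dependency, the 16 kHz example file `audio.wav`, and the `max_length` value are our assumptions, not part of the original instructions:

```python
import soundfile as sf

# Read a 16 kHz mono waveform (hypothetical example file).
speech, sampling_rate = sf.read("audio.wav")

# Extract input features and move tensors to the chosen device.
inputs = processor(audio=speech, sampling_rate=sampling_rate, return_tensors="pt").to(device)

# Decode to token IDs, then detokenize to Arabic text.
predicted_ids = model.generate(**inputs, max_length=256)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```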
```python
import torch
from artst.tasks.artst import ArTSTTask
from artst.models.artst import ArTSTTransformerModel

# Load the fairseq checkpoint and point the task config at your data.
checkpoint = torch.load('checkpoint.pt')
checkpoint['cfg']['task'].t5_task = 't2s'  # or 's2t' for ASR
checkpoint['cfg']['task'].data = 'path-to-folder-with-checkpoints'

# Rebuild the task and model from the stored config, then load the weights.
task = ArTSTTask.setup_task(checkpoint['cfg']['task'])
model = ArTSTTransformerModel.build_model(checkpoint['cfg']['model'], task)
model.load_state_dict(checkpoint['model'])
```
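One follow-up step we would suggest (our addition, not part of the original snippet): fairseq builds the model in training mode, so switch it to inference mode before decoding:

```python
model.eval()  # disable dropout for deterministic inference
if torch.cuda.is_available():
    model = model.cuda()
```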
For pre-training, follow the steps for preparing the wav2vec 2.0 manifest here and preparing HuBERT labels here.
For fine-tuning the TTS task, an extra column is required in the speech manifest file for the speaker embedding, which we generate with SpeechBrain (see the sketch below). Here is a DATA_ROOT sample folder structure that contains manifest samples.
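As a rough illustration of the speaker-embedding step, the sketch below uses SpeechBrain's x-vector encoder. The specific model, the normalization, and the output file name are our assumptions; check the repo scripts for the exact recipe:

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Hypothetical choice: the x-vector encoder commonly paired with SpeechT5-style TTS.
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

signal, sr = torchaudio.load("speaker.wav")  # placeholder 16 kHz utterance
with torch.no_grad():
    embedding = classifier.encode_batch(signal)                   # shape (1, 1, 512)
    embedding = torch.nn.functional.normalize(embedding, dim=-1)  # unit-norm vector

# Save a flat 512-dim vector to reference from the manifest's extra column.
torch.save(embedding.squeeze(), "speaker_xvector.pt")
```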
Pretrain:
Please use fairseq-preprocess to generate the index and bin files for the text data. We use SentencePiece to pre-process the text; our SPM models and dictionary are provided in this repo. Process the text with the SPM model first, then run fairseq-preprocess with the provided dictionary to get the index and bin files. Note that after SPM processes sentences, the resulting text should have individual characters separated by spaces. A sketch of this pipeline follows below.
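Here is a hedged sketch of that text pipeline using the SentencePiece Python API; the file names (`artst_spm.model`, `train.txt`, `dict.txt`, `data-bin/text`) are placeholders, not the repo's actual paths:

```python
import sentencepiece as spm

# Load the provided SPM model (hypothetical path).
sp = spm.SentencePieceProcessor(model_file="artst_spm.model")

# Encode each line to pieces; for ArTST the resulting pieces are
# individual characters separated by spaces.
with open("train.txt", encoding="utf-8") as fin, \
     open("train.spm.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)
        fout.write(" ".join(pieces) + "\n")

# Then binarize with the provided dictionary (CLI call, paths illustrative):
#   fairseq-preprocess --only-source --trainpref train.spm.txt \
#       --srcdict dict.txt --destdir data-bin/text
```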
For fine-tuning, a simple text file containing the corresponding text on each line suffices. See here for a sample manifest. Normalize the text as we did for training/evaluation using this script.
The bash files contain the parameters and hyperparameters used for pre-training and fine-tuning. Find more details on training arguments here.
```bash
bash /scripts/pretrain/train.sh   # pre-training
bash /scripts/ASR/finetune.sh     # ASR fine-tuning
bash /scripts/TTS/finetune.sh     # TTS fine-tuning
bash /scripts/ASR/inference.sh    # ASR inference
bash /scripts/TTS/inference.sh    # TTS inference
```
ArTST is built on the SpeechT5 architecture. If you use any of the ArTST models, please cite:
```bibtex
@inproceedings{toyin2023artst,
  title={ArTST: Arabic Text and Speech Transformer},
  author={Toyin, Hawau and Djanibekov, Amirbek and Kulkarni, Ajinkya and Aldarmaki, Hanan},
  booktitle={Proceedings of ArabicNLP 2023},
  pages={41--51},
  year={2023}
}

@misc{djanibekov2025dialectalcoveragegeneralizationarabic,
  title={Dialectal Coverage And Generalization in Arabic Speech Recognition},
  author={Amirbek Djanibekov and Hawau Olamide Toyin and Raghad Alshalan and Abdullah Alitr and Hanan Aldarmaki},
  year={2025},
  eprint={2411.05872},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.05872},
}
```