```bash
conda create --name DuplexMamba python=3.9
conda activate DuplexMamba
pip install -r requirements.txt
pip install -e src/transformers/
pip install -e src/speechbrain/
```
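After installation, a quick import check can confirm that the CUDA-dependent packages built correctly (a minimal sketch; it assumes a CUDA build of torch):

```bash
# Verify torch sees a GPU and the Mamba kernels import cleanly.
python -c "import torch, torchaudio; print(torch.__version__, torch.cuda.is_available())"
python -c "import causal_conv1d, mamba_ssm; print('causal-conv1d and mamba-ssm OK')"
```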
You may need to install lower or higher versions of `torch`, `torchaudio`, `causal-conv1d`, and `mamba-ssm` depending on your hardware and system; make sure the versions you choose are compatible with one another. If the installation of `causal-conv1d` or `mamba-ssm` fails, you can manually download the corresponding `.whl` files from the [causal-conv1d releases](https://github.com/Dao-AILab/causal-conv1d/releases) and [mamba releases](https://github.com/state-spaces/mamba/releases) pages and install them.
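A manual wheel install then looks something like this (the filenames below are hypothetical; use the wheels matching your Python, CUDA, and torch versions):

```bash
# Hypothetical wheel names: substitute the files you actually downloaded
# from the release pages for your Python/CUDA/torch combination.
pip install ./causal_conv1d-1.1.1+cu118torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
pip install ./mamba_ssm-1.1.1+cu118torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
```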
- Download [mamba-2.8b-hf](https://huggingface.co/state-spaces/mamba-2.8b-hf) into the `model` folder (see the download sketch after this list), then run:

  ```bash
  python safetensor2bin.py
  ```
- Download the checkpoint of our trained ASR model and the checkpoints for all four stages of the DuplexMamba model from DuplexMamba and save them in the `checkpoints` folder. If you only need the model for inference, you can simply download the Stage 4 checkpoint.
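If you use the Hugging Face CLI, the base-model download step might look like the following sketch (it assumes the weights are the `state-spaces/mamba-2.8b-hf` repository and that `safetensor2bin.py` expects them under `model/`; verify both against the script before running):

```bash
# Sketch: fetch the mamba-2.8b-hf weights into the model folder, then
# convert the safetensors weights to the .bin format used by the scripts.
huggingface-cli download state-spaces/mamba-2.8b-hf --local-dir model/mamba-2.8b-hf
python safetensor2bin.py
```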
- Our training code requires all data to be stored in a LibriSpeech-style format (see the layout sketch after this list).
- For the raw data of Stage 1 and Stage 2, you can download LibriSpeech, TED-LIUM, mls_eng_10k, and VoiceAssistant-400K.
- The state discrimination dataset we used can be accessed here.
- The preprocessed data for Stage 3 and Stage 4 can be downloaded from here.
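For reference, the LibriSpeech layout nests audio under speaker and chapter directories, with one transcript file per chapter (the IDs below are illustrative):

```
<data_folder>/train-clean-100/
└── 19/                        # speaker ID
    └── 198/                   # chapter ID
        ├── 19-198-0000.flac
        ├── 19-198-0001.flac
        └── 19-198.trans.txt   # one "UTT_ID TRANSCRIPT" line per utterance
```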
Stage 1 Multimodal Alignment:

```bash
torchrun --nproc-per-node 6 train_stage1.py hparams/S2S/train_stage1.yaml --data_folder <YOUR_PATH_TO_DATASETS> --precision bf16
```

Stage 2 Multimodal Instruction Tuning:

```bash
torchrun --nproc-per-node 6 train_stage2.py hparams/S2S/train_stage2.yaml --data_folder <YOUR_PATH_TO_DATASETS> --precision bf16
```

Stage 3 Input State Discrimination:

```bash
torchrun --nproc-per-node 6 train_stage3.py hparams/S2S/train_stage3.yaml --data_folder <YOUR_PATH_TO_DATASETS> --precision bf16
```

Stage 4 Streaming Alignment:

```bash
torchrun --nproc-per-node 1 train_stage4.py hparams/S2S/train_stage4.yaml --data_folder <YOUR_PATH_TO_DATASETS> --precision bf16
```
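Stages 1-3 above assume 6 GPUs and Stage 4 assumes 1; if your machine differs, adjust `--nproc-per-node` to your GPU count, e.g. a 2-GPU Stage 1 run:

```bash
# Same recipe on 2 GPUs. Per-GPU settings come from the YAML, so the
# effective global batch size will differ from the 6-GPU configuration.
torchrun --nproc-per-node 2 train_stage1.py hparams/S2S/train_stage1.yaml \
    --data_folder <YOUR_PATH_TO_DATASETS> --precision bf16
```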
```bash
python CustomGenerator.py duplex/duplex.yaml --precision bf16 --wav_path example/rlhf-57762.flac
```
We also provide the `duplex_voice_assistant()` method in the `duplex_inference.py` script for simulating duplex conversations. You can modify `wav_list` on line 236 and `output_dir` on line 239 of the script, then run the following command to start the experiment:

```bash
python duplex_inference.py duplex/duplex.yaml --precision bf16
```
We acknowledge the wonderful work of Mamba, Vision Mamba, and ConMamba; we borrowed their implementations of Mamba, bidirectional Mamba, and ConMamba, respectively. The training recipes are adapted from SpeechBrain.
If you find this work helpful, please consider citing:
```bibtex
@misc{lu2025duplexmambaenhancingrealtimespeech,
      title={DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities},
      author={Xiangyu Lu and Wang Xu and Haoyu Wang and Hongyun Zhou and Haiyan Zhao and Conghui Zhu and Tiejun Zhao and Muyun Yang},
      year={2025},
      eprint={2502.11123},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.11123},
}
```
This project is licensed under the GNU General Public License v3.0. It is based on Mamba-ASR, which is also licensed under the GPL.