
DuplexMamba (paper: https://arxiv.org/abs/2502.11123)

Architecture

(Figure: DuplexMamba architecture)

(Figure: duplex decoding)

Prerequisites

Install Packages

conda create --name DuplexMamba python=3.9
conda activate DuplexMamba
pip install -r requirements.txt
pip install -e src/transformers/
pip install -e src/speechbrain/

Depending on your hardware and system, you may need different versions of torch, torchaudio, causal-conv1d, and mamba-ssm; make sure the versions you install are mutually compatible. If the installation of causal-conv1d or mamba-ssm fails, you can manually download the matching .whl files from the causal-conv1d releases and mamba releases pages and install them.
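
For example, a manual wheel installation might look like this (the file names below are illustrative; pick the wheels from the release pages that match your Python, torch, and CUDA versions):

pip install causal_conv1d-1.1.1+cu118torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
pip install mamba_ssm-1.1.1+cu118torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl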

Pretrained Model and Checkpoints

  1. Download the mamba-2.8b-hf model into the model folder, then run:

    python safetensor2bin.py
    
  2. Download the checkpoint of our trained ASR model and the checkpoints for all four stages of the DuplexMamba model from DuplexMamba, and save them in the checkpoints folder. If you only need the model for inference, downloading the Stage 4 checkpoint is sufficient.
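
If you fetch the base model from the Hugging Face hub, one option is the huggingface-cli tool (the repository id state-spaces/mamba-2.8b-hf is assumed here from the model name; verify it before downloading):

huggingface-cli download state-spaces/mamba-2.8b-hf --local-dir model/mamba-2.8b-hf

After both steps, the folders should roughly look like this (directory names are illustrative; match them to the paths your configs expect):

model/
└── mamba-2.8b-hf/   # base Mamba weights; safetensor2bin.py converts them to .bin
checkpoints/
├── ASR/             # trained ASR checkpoint
├── stage1/
├── stage2/
├── stage3/
└── stage4/          # sufficient on its own for inference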

Training

(Figure: training data)

(Figure: datasets)

  1. Our training code requires all data to be stored in a format similar to LibriSpeech (see the layout sketch after this list).
  2. For the raw data for Stage 1 and Stage 2, you can download LibriSpeech, TED-LIUM, mls_eng_10k, and VoiceAssistant-400K.
  3. The state discrimination dataset we used can be accessed here.
  4. The preprocessed data for Stage 3 and Stage 4 can be downloaded from here.
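
A minimal sketch of the expected LibriSpeech-style layout (speaker/chapter directories containing audio files plus a per-chapter transcript; the exact paths are illustrative):

<YOUR_PATH_TO_DATASETS>/LibriSpeech/train-clean-100/
└── 19/198/
    ├── 19-198-0000.flac
    ├── 19-198-0001.flac
    └── 19-198.trans.txt   # one line per utterance: "19-198-0000 <transcript>"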

Stage 1 Multimodal Alignment:

torchrun --nproc-per-node 6 train_stage1.py hparams/S2S/train_stage1.yaml --data_folder <YOUR_PATH_TO_DATASETS> --precision bf16

Stage 2 Multimodal Instruction Tuning:

torchrun --nproc-per-node 6 train_stage2.py hparams/S2S/train_stage2.yaml --data_folder <YOUR_PATH_TO_DATASETS> --precision bf16

Stage 3 Input State Discrimination:

torchrun --nproc-per-node 6 train_stage3.py hparams/S2S/train_stage3.yaml --data_folder <YOUR_PATH_TO_DATASETS> --precision bf16

Stage 4 Streaming Alignment:

torchrun --nproc-per-node 1 train_stage4.py hparams/S2S/train_stage4.yaml --data_folder <YOUR_PATH_TO_DATASETS> --precision bf16
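
torchrun's --nproc-per-node flag sets the number of worker processes (typically one per GPU); adjust it to your hardware. For example, to run Stage 2 on a single GPU:

torchrun --nproc-per-node 1 train_stage2.py hparams/S2S/train_stage2.yaml --data_folder <YOUR_PATH_TO_DATASETS> --precision bf16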

Inference

python CustomGenerator.py duplex/duplex.yaml --precision bf16 --wav_path example/rlhf-57762.flac
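
To run inference on your own audio, point --wav_path at a different file (the path below is a placeholder):

python CustomGenerator.py duplex/duplex.yaml --precision bf16 --wav_path /path/to/your_audio.wav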

We also provide the duplex_voice_assistant() method in the duplex_inference.py script for simulating duplex conversations. You can modify wav_list on line 236 and output_dir on line 239 of the script, then run the following command to start the experiment:

python duplex_inference.py duplex/duplex.yaml --precision bf16

A Simple Case

(Figure: a simple case)

Acknowledgement

We acknowledge the wonderful work of Mamba, Vision Mamba, and ConMamba, from which we borrowed the implementations of Mamba, bidirectional Mamba, and ConMamba. The training recipes are adapted from SpeechBrain.

Citation

If you find this work helpful, please consider citing:

@misc{lu2025duplexmambaenhancingrealtimespeech,
      title={DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities}, 
      author={Xiangyu Lu and Wang Xu and Haoyu Wang and Hongyun Zhou and Haiyan Zhao and Conghui Zhu and Tiejun Zhao and Muyun Yang},
      year={2025},
      eprint={2502.11123},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.11123}, 
}

License

This project is licensed under the GNU General Public License v3.0. It is based on Mamba-ASR, which is also licensed under the GPL.
