This repository provides a Cog container for Dia, a 1.6 billion parameter text-to-speech model developed by Nari Labs. Dia generates highly realistic dialogue audio directly from text, including multiple speakers and non-verbal sounds like (laughs).
Model Links:
- Original Model: nari-labs/Dia-1.6B on Hugging Face
- Original Code: github.com/nari-labs/dia
- This Cog packaging by: zsxkib on GitHub / @zsakib_ on Twitter
Prerequisites:
- Docker: Required by Cog to build and run the container. Install Docker.
- Cog: Builds and runs this model locally. Install Cog.
- NVIDIA GPU: Required to run this model.
Running this model locally is straightforward with Cog. It handles building the environment and downloading the model weights automatically.
- Clone this repository:

  ```bash
  git clone https://github.com/zsxkib/cog-dia.git
  cd cog-dia
  ```
- Run a prediction: The first time you run `cog predict`, it builds the container and downloads the weights, which takes a few minutes. Subsequent runs are much faster.

  ```bash
  # Example prediction
  cog predict -i text="[S1] This is a test using Cog! [S2] It downloads the weights automatically. (laughs)"
  ```
Cog will output the path to the generated `.wav` file. You can pass other inputs too:

```bash
cog predict \
  -i text="[S1] Another example! [S2] With different settings." \
  -i cfg_scale=3.5 \
  -i temperature=1.1
```
Check `predict.py` for all available inputs like `audio_prompt`, `seed`, etc.
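For orientation, a Cog predictor exposing these inputs would look roughly like the sketch below. The input names come from the examples above, but the descriptions and default values are placeholders, so treat `predict.py` as the source of truth:

```python
# Hypothetical sketch of the predictor interface -- input names come from the
# examples above; the defaults and descriptions here are illustrative guesses.
from typing import Optional

from cog import BasePredictor, Input, Path


class Predictor(BasePredictor):
    def predict(
        self,
        text: str = Input(description="Dialogue, e.g. '[S1] Hi. [S2] Hello! (laughs)'"),
        audio_prompt: Optional[Path] = Input(description="Optional audio prompt for voice conditioning", default=None),
        cfg_scale: float = Input(description="Classifier-free guidance scale", default=3.0),
        temperature: float = Input(description="Sampling temperature", default=1.0),
        seed: Optional[int] = Input(description="Random seed; omit for a random one", default=None),
    ) -> Path:
        """Generate speech and return the path to the output .wav file."""
        ...
```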
Cog uses `cog.yaml` to define the environment and `predict.py` to run the model. The `setup()` function in `predict.py` automatically downloads the model weights from a Replicate CDN using `pget` if they aren't already cached locally within the container.
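In essence, that download step boils down to something like the snippet below, assuming `pget` is available inside the container. The cache directory and URL here are placeholders (the real values live in `predict.py`); `pget -x` downloads a tarball and extracts it in one step:

```python
# Minimal sketch of the setup() download logic. MODEL_CACHE and WEIGHTS_URL
# are placeholders, not the actual values used by this repository.
import os
import subprocess

MODEL_CACHE = "checkpoints"
WEIGHTS_URL = "https://weights.replicate.delivery/..."  # placeholder CDN URL


def download_weights(url: str, dest: str) -> None:
    """Fetch and extract the model weights with pget, skipping if already cached."""
    if not os.path.exists(dest):
        subprocess.check_call(["pget", "-x", url, dest])
```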
The original Dia model is licensed under Apache 2.0. This Cog packaging code is MIT licensed. Please respect the original model's usage restrictions.
⭐ Star this repo on GitHub!
👋 Follow me on Twitter/X