NeuralCodecs is a .NET library of neural audio codec and TTS model implementations written entirely in C#. It includes SNAC, DAC, Encodec, and Dia, along with advanced audio processing tools.
- SNAC: Multi-Scale Neural Audio Codec
- Support for multiple sampling rates: 24kHz, 32kHz, and 44.1kHz
- Attention mechanisms with adjustable window sizes for improved quality
- Automatic resampling for input flexibility
- DAC: Descript Audio Codec
- Supports multiple sampling rates: 16kHz, 24kHz, and 44.1kHz
- Configurable encoder/decoder architecture with variable rates
- Flexible bitrate configurations from 8kbps to 16kbps
- Encodec: Meta's Encodec neural audio compression
- Supports stereo audio at 24kHz and 48kHz sample rates
- Variable bitrate compression (1.5-24 kbps)
- Neural language model for enhanced compression quality
- Direct file compression to .ecdc format
- Dia: Nari Labs' Dia text-to-speech model
- 1.6B parameter text-to-speech model for highly realistic dialogue generation
- Direct transcript-to-speech generation with emotion and tone control
- Audio-conditioned generation for voice cloning and style transfer
- Support for non-verbal communications (laughter, coughing, throat clearing, etc.)
- Speaker-aware dialogue generation with [S1] and [S2] tags
- Custom dynamic speed control to handle Dia's issue with automatic speed-up on long inputs
- AudioTools: Advanced audio processing utilities
- Based on Descript's audiotools Python package
- Extended with .NET-specific optimizations and additional features
- Audio filtering, transformation, and effects processing
- Works with Descript's AudioSignal or Tensors
- Audio Visualization: Example project includes spectrogram generation and comparison tools
- .NET 8.0 or later
- TorchSharp with a libtorch backend compatible with your platform
- NAudio (for audio processing)
- SkiaSharp (for visualization features)
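TorchSharp loads its native libtorch backend from a separate runtime package. As a sketch (the NuGet package IDs below are the standard TorchSharp runtime packages; pick the one matching your hardware):

```shell
# CPU-only backend
dotnet add package TorchSharp-cpu

# or a CUDA backend for your OS
dotnet add package TorchSharp-cuda-windows
dotnet add package TorchSharp-cuda-linux
```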
Install the main package from NuGet:
dotnet add package NeuralCodecs
Or via the Package Manager Console:
Install-Package NeuralCodecs
Models are downloaded automatically when given a Hugging Face user/model ID, or can be downloaded separately:
SNAC Models - Available from hubertsiuzdak's Hugging Face
DAC Models - Available from Descript's Hugging Face
Encodec Models - Available from Meta's Hugging Face
Dia Model - Available from Nari Labs' Hugging Face
- Requires both Dia model weights and DAC codec for full audio generation
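For example, a Hugging Face repo ID can be passed in place of a local path to trigger the automatic download (a sketch; the repo ID shown is hubertsiuzdak's upstream SNAC checkpoint, and the exact ID format accepted is an assumption):

```csharp
// Download and cache the model weights from Hugging Face automatically
// (repo ID is an assumption based on the upstream SNAC project).
var snac = await NeuralCodecs.CreateSNACAsync("hubertsiuzdak/snac_24khz");
```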
Here's a simple example to get you started:
using NeuralCodecs;
// Load a SNAC model
var model = await NeuralCodecs.CreateSNACAsync("path/to/model.pt");
// Process audio
float[] audioData = LoadAudioFile("input.wav");
var compressed = model.ProcessAudio(audioData, sampleRate: 24000);
// Save the result
SaveAudioFile("output.wav", compressed);
For more detailed examples, see the examples section below.
There are several ways to load a model:
// Load SNAC model with static method provided for built-in models
var model = await NeuralCodecs.CreateSNACAsync("model.pt");
- SNACConfig provides premade configurations for 24kHz, 32kHz, and 44.1kHz sampling rates.
var model = await NeuralCodecs.CreateSNACAsync(modelPath, SNACConfig.SNAC24Khz);
- Allows the use of custom loader implementations
// Load model with default config from IModelLoader instance
var torchLoader = NeuralCodecs.CreateTorchLoader();
var model = await torchLoader.LoadModelAsync<SNAC, SNACConfig>("model.pt");
// For Encodec with custom bandwidth and settings
var encodecConfig = new EncodecConfig {
SampleRate = 48000,
Bandwidth = 12.0f,
Channels = 2, // Stereo audio
Normalize = true
};
var encodecModel = await torchLoader.LoadModelAsync<Encodec, EncodecConfig>("encodec_model.pt", encodecConfig);
- Allows the use of custom model implementations with built-in or custom loaders
// Load custom model with factory method
var model = await torchLoader.LoadModelAsync<CustomModel, CustomConfig>(
"model.pt",
config => new CustomModel(config, ...),
config);
Models can be loaded in PyTorch or Safetensors format.
The AudioTools namespace provides extensive audio processing capabilities:
var audio = new Tensor(...); // Load or create audio tensor
// Apply effects
var processedAudio = AudioEffects.ApplyCompressor(
audio,
sampleRate: 48000,
threshold: -20f,
ratio: 4.0f);
// Compute spectrograms and transforms
var spectrogram = DSP.MelSpectrogram(audio, sampleRate);
var stft = DSP.STFT(audio, windowSize: 1024, hopSize: 512, windowType: "hann");
There are two main ways to process audio:
- Using the simplified ProcessAudio method:
// Compress audio in one step
var processedAudio = model.ProcessAudio(audioData, sampleRate);
- Using separate encode and decode steps:
// Encode audio to compressed format
var codes = model.Encode(buffer);
// Decode back to audio
var processedAudio = model.Decode(codes);
- Saving the processed audio: use your preferred method to write WAV files, e.g. NAudio:
// using NAudio
await using var writer = new WaveFileWriter(
outputPath,
new WaveFormat(model.Config.SamplingRate, channels: model.Channels)
);
writer.WriteSamples(processedAudio, 0, processedAudio.Length);
Encodec provides additional capabilities:
// Set target bandwidth for compression (supported values depend on model)
encodecModel.SetTargetBandwidth(12.0f); // 12 kbps
// Get available bandwidth options
var availableBandwidths = encodecModel.TargetBandwidths; // e.g. [1.5, 3, 6, 12, 24]
// Use language model for enhanced compression quality
var lm = await encodecModel.GetLanguageModel();
// Apply LM during encoding/decoding for better quality
// Direct file compression
await EncodecCompressor.CompressToFileAsync(encodecModel, audioTensor, "audio.ecdc", useLm: true);
// Decompress from file
var (decompressedAudio, sampleRate) = await EncodecCompressor.DecompressFromFileAsync("audio.ecdc");
Dia is a 1.6B parameter text-to-speech model that generates highly realistic dialogue directly from transcripts:
// Load Dia model with optional DAC codec
var diaConfig = new DiaConfig
{
LoadDACModel = true,
SampleRate = 44100
};
var diaModel = await NeuralCodecs.CreateDiaAsync("model.pt", diaConfig);
// or use LoadDACModel = false in config and manually load DAC:
diaModel.LoadDacModel("dac_model.pt");
// Basic text-to-speech generation
var text = "[S1] Hello, how are you today? [S2] I'm doing great, thanks for asking!";
var audioOutput = diaModel.Generate(
text: text,
maxTokens: 1000,
cfgScale: 3.0f,
temperature: 1.2f,
topP: 0.95f);
// Voice cloning with audio prompt
var audioPromptPath = "reference_voice.wav";
var clonedAudio = diaModel.Generate(
text: "[S1] This is my cloned voice speaking new words.",
audioPromptPath: audioPromptPath,
maxTokens: 1000);
// Batch generation for multiple texts
var texts = new List<string>
{
"[S1] First dialogue line.",
"[S2] Second dialogue line with (laughs) non-verbal."
};
var batchResults = diaModel.Generate(texts, maxTokens: 800);
// Save generated audio
Dia.SaveAudio("output.wav", audioOutput);
Audio Speed Correction: Dia includes built-in speed correction to handle the automatic speed-up issue on longer inputs:
var diaConfig = new DiaConfig
{
LoadDACModel = true,
SampleRate = 44100,
// Configure speed correction method
SpeedCorrectionMethod = AudioSpeedCorrectionMethod.Hybrid, // Default: best quality
// Configure slowdown mode
SlowdownMode = AudioSlowdownMode.Dynamic // Default: adapts to text length
};
- None: No speed correction applied
- TorchSharp: TorchSharp-based linear interpolation
- Hybrid: Combines TorchSharp and NAudio methods (recommended)
- NAudioResampling: Uses NAudio resampling for speed correction
- All: Creates separate outputs using all methods (for testing/comparison)
- Static: Uses a fixed slowdown factor
- Dynamic: Adjusts slowdown based on text length (recommended)
Speed Correction Examples:
// For highest quality output (default)
var highQualityConfig = new DiaConfig
{
SpeedCorrectionMethod = AudioSpeedCorrectionMethod.Hybrid,
SlowdownMode = AudioSlowdownMode.Dynamic
};
// For testing multiple correction methods
var testConfig = new DiaConfig
{
SpeedCorrectionMethod = AudioSpeedCorrectionMethod.All // Generates multiple output variants
};
// For no speed correction (fastest processing)
var fastConfig = new DiaConfig
{
SpeedCorrectionMethod = AudioSpeedCorrectionMethod.None
};
Text Format Requirements:
- Always begin input text with the [S1] speaker tag
- Alternate between [S1] and [S2] for dialogue (repeating the same speaker tag consecutively may impact generation)
- Keep input text a moderate length (roughly 10-20 seconds of corresponding audio)
Non-Verbal Communications: Dia supports various non-verbal tags. Some work more consistently than others (laughs, chuckles), but be prepared for occasional unexpected output from some tags (sneezes, applause, coughs ...)
var textWithNonVerbals = "[S1] I can't believe it! (gasps) [S2] That's amazing! (laughs)";
Supported non-verbals: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
Voice Cloning Best Practices:
- Provide 5-10 seconds of reference audio for optimal results
- Include the transcript of the reference audio before your generation text
- Use correct speaker tags in the reference transcript
- For duration estimates, figure approximately 86 tokens per second of generated audio
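The tokens-per-second rule of thumb above can be turned into a quick token-budget helper (a hypothetical convenience function, not part of the library):

```csharp
using System;

// Rough maxTokens budget for a target clip length, using the
// ~86 tokens-per-second rule of thumb (approximate, not exact).
static int EstimateMaxTokens(double targetSeconds, double tokensPerSecond = 86.0)
    => (int)Math.Ceiling(targetSeconds * tokensPerSecond);

// e.g. budget for roughly 10 seconds of audio:
int maxTokens = EstimateMaxTokens(10); // 860
Console.WriteLine(maxTokens);
```

The result can be passed as the `maxTokens` argument to `Generate`.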
// Voice cloning example with transcript
var referenceTranscript = "[S1] This is the reference voice speaking clearly.";
var newText = "[S1] Now I will say something completely different.";
var clonedOutput = diaModel.Generate(
text: referenceTranscript + " " + newText,
audioPromptPath: "reference.wav");
Memory Usage: As with the Python implementation, ~10-11GB of GPU memory is required for the Dia model with the DAC codec.
Speed Comparison (RTX 3090): In limited testing (Windows, no torch compile), the C# implementation shows a slight performance improvement over the original Python version:
- Python (original): ~35 tokens/second (without torch.compile)
- C# (NeuralCodecs): ~40 tokens/second
Performance Notes:
- TorchSharp currently lacks torch.compile support, which limits potential speed gains compared to PyTorch
- Dia's performance is reduced on Windows machines compared to Linux environments
- Actual performance will vary based on hardware configuration, text length, and generation parameters
Check out the Example project for a complete implementation, including:
- Model loading and configuration
- Audio processing workflows
- Command-line interface implementation
- Audio Visualization
The example includes tools for visualizing and comparing audio spectrograms:
Audio before and after compression with DAC Codec 24kHz
- SNAC - hubertsiuzdak's original Python implementation
- Descript Audio Codec - Descript's original Python implementation
- Encodec - Meta's original Python implementation
- Dia - Nari Labs' original Python implementation
Suggestions and contributions are welcome! Here's how you can help:
- Bug Reports: Submit issues with reproduction steps
- Feature Requests: Propose new codec implementations or features
- Code Contributions: Submit pull requests with improvements
- Documentation: Help improve examples and documentation
- Testing: Test with different models and platforms
This project is licensed under the Apache-2.0 License, see the LICENSE file for more information.
This project uses libraries under several different licenses, see THIRD-PARTY-NOTICES for more information.