NeuralCodecs is a .NET library of neural audio codec and TTS model implementations written entirely in C#. It includes SNAC, DAC, Encodec, and Dia, along with advanced audio processing tools.
- SNAC: Multi-Scale Neural Audio Codec
- Support for multiple sampling rates: 24kHz, 32kHz, and 44.1kHz
- Attention mechanisms with adjustable window sizes for improved quality
- Automatic resampling for input flexibility
- DAC: Descript Audio Codec
- Supports multiple sampling rates: 16kHz, 24kHz, and 44.1kHz
- Configurable encoder/decoder architecture with variable rates
- Flexible bitrate configurations from 8kbps to 16kbps
- Encodec: Meta's Encodec neural audio compression
- Supports stereo audio at 24kHz and 48kHz sample rates
- Variable bitrate compression (1.5-24 kbps)
- Neural language model for enhanced compression quality
- Direct file compression to .ecdc format
- Dia: Nari Labs' Dia text-to-speech model
- 1.6B parameter text-to-speech model for highly realistic dialogue generation
- Direct transcript-to-speech generation with emotion and tone control
- Audio-conditioned generation for voice cloning and style transfer
- Support for non-verbal communications (laughter, coughing, throat clearing, etc.)
- Speaker-aware dialogue generation with [S1] and [S2] tags
- Custom dynamic speed control to handle Dia's issue with automatic speed-up on long inputs
- AudioTools: Advanced audio processing utilities
- Based on Descript's audiotools Python package
- Extended with .NET-specific optimizations and additional features
- Audio filtering, transformation, and effects processing
- Works with Descript's AudioSignal or Tensors
- Audio Visualization: Example project includes spectrogram generation and comparison tools
- .NET 8.0 or later
- TorchSharp with a libtorch backend compatible with your platform
- NAudio (for audio processing)
- SkiaSharp (for visualization features)
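TorchSharp loads its native libtorch backend from a separate runtime package. As a sketch (the NuGet package IDs below are the standard TorchSharp runtime packages; pick the one matching your hardware):

```shell
# CPU-only backend
dotnet add package TorchSharp-cpu

# or a CUDA backend for your OS
dotnet add package TorchSharp-cuda-windows
dotnet add package TorchSharp-cuda-linux
```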
Install the main package from NuGet:
dotnet add package NeuralCodecs
Or via the Package Manager Console:
Install-Package NeuralCodecs
Models are downloaded automatically when given a Hugging Face user/model ID, or can be downloaded separately:
SNAC Models - Available from hubertsiuzdak's Hugging Face
DAC Models - Available from Descript's Hugging Face
Encodec Models - Available from Meta's Hugging Face
Dia Model - Available from Nari Labs' Hugging Face
- Requires both Dia model weights and DAC codec for full audio generation
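For example, a Hugging Face repo ID can be passed in place of a local path to trigger the automatic download (a sketch; the repo ID shown is hubertsiuzdak's upstream SNAC checkpoint, and the exact ID format accepted is an assumption):

```csharp
// Download and cache the model weights from Hugging Face automatically
// (repo ID is an assumption based on the upstream SNAC project).
var snac = await NeuralCodecs.CreateSNACAsync("hubertsiuzdak/snac_24khz");
```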
Here's a simple example to get you started:
using NeuralCodecs;
// Load a SNAC model
var model = await NeuralCodecs.CreateSNACAsync("path/to/model.pt");
// Process audio
float[] audioData = LoadAudioFile("input.wav");
var compressed = model.ProcessAudio(audioData, sampleRate: 24000);
// Save the result
SaveAudioFile("output.wav", compressed);
For more detailed examples, see the examples section below.
There are several ways to load a model:
// Load SNAC model with static method provided for built-in models
var model = await NeuralCodecs.CreateSNACAsync("model.pt");
- SNACConfig provides premade configurations for 24kHz, 32kHz, and 44.1kHz sampling rates.
var model = await NeuralCodecs.CreateSNACAsync(modelPath, SNACConfig.SNAC24Khz);
- Allows the use of custom loader implementations
// Load model with default config from IModelLoader instance
var torchLoader = NeuralCodecs.CreateTorchLoader();
var model = await torchLoader.LoadModelAsync<SNAC, SNACConfig>("model.pt");
// For Encodec with custom bandwidth and settings
var encodecConfig = new EncodecConfig {
SampleRate = 48000,
Bandwidth = 12.0f,
Channels = 2, // Stereo audio
Normalize = true
};
var encodecModel = await torchLoader.LoadModelAsync<Encodec, EncodecConfig>("encodec_model.pt", encodecConfig);
- Allows the use of custom model implementations with built-in or custom loaders
// Load custom model with factory method
var model = await torchLoader.LoadModelAsync<CustomModel, CustomConfig>(
"model.pt",
config => new CustomModel(config, ...),
config);
Models can be loaded in PyTorch or Safetensors format.
The AudioTools namespace provides extensive audio processing capabilities:
var audio = new Tensor(...); // Load or create audio tensor
// Apply effects
var processedAudio = AudioEffects.ApplyCompressor(
audio,
sampleRate: 48000,
threshold: -20f,
ratio: 4.0f);
// Compute spectrograms and transforms
var spectrogram = DSP.MelSpectrogram(audio, sampleRate);
var stft = DSP.STFT(audio, windowSize: 1024, hopSize: 512, windowType: "hann");
There are two main ways to process audio:
- Using the simplified ProcessAudio method:
// Compress audio in one step
var processedAudio = model.ProcessAudio(audioData, sampleRate);
- Using separate encode and decode steps:
// Encode audio to compressed format
var codes = model.Encode(buffer);
// Decode back to audio
var processedAudio = model.Decode(codes);
- Saving the processed audio: use your preferred method to write WAV files, e.g. NAudio:
// using NAudio
await using var writer = new WaveFileWriter(
outputPath,
new WaveFormat(model.Config.SamplingRate, channels: model.Channels)
);
writer.WriteSamples(processedAudio, 0, processedAudio.Length);
Encodec provides additional capabilities:
// Set target bandwidth for compression (supported values depend on model)
encodecModel.SetTargetBandwidth(12.0f); // 12 kbps
// Get available bandwidth options
var availableBandwidths = encodecModel.TargetBandwidths; // e.g. [1.5, 3, 6, 12, 24]
// Use language model for enhanced compression quality
var lm = await encodecModel.GetLanguageModel();
// Apply LM during encoding/decoding for better quality
// Direct file compression
await EncodecCompressor.CompressToFileAsync(encodecModel, audioTensor, "audio.ecdc", useLm: true);
// Decompress from file
var (decompressedAudio, sampleRate) = await EncodecCompressor.DecompressFromFileAsync("audio.ecdc");
Dia is a 1.6B parameter text-to-speech model that generates highly realistic dialogue directly from transcripts:
// Load Dia model with optional DAC codec
var diaConfig = new DiaConfig
{
LoadDACModel = true,
SampleRate = 44100
};
var diaModel = await NeuralCodecs.CreateDiaAsync("model.pt", diaConfig);
// or use LoadDACModel = false in config and manually load DAC:
diaModel.LoadDacModel("dac_model.pt");
// Basic text-to-speech generation
var text = "[S1] Hello, how are you today? [S2] I'm doing great, thanks for asking!";
var audioOutput = diaModel.Generate(
text: text,
maxTokens: 1000,
cfgScale: 3.0f,
temperature: 1.2f,
topP: 0.95f);
// Voice cloning with audio prompt
var audioPromptPath = "reference_voice.wav";
var clonedAudio = diaModel.Generate(
text: "[S1] This is my cloned voice speaking new words.",
audioPromptPath: audioPromptPath,
maxTokens: 1000);
// Batch generation for multiple texts
var texts = new List<string>
{
"[S1] First dialogue line.",
"[S2] Second dialogue line with (laughs) non-verbal."
};
var batchResults = diaModel.Generate(texts, maxTokens: 800);
// Save generated audio
Dia.SaveAudio("output.wav", audioOutput);
Audio Speed Correction: Dia includes built-in speed correction to handle the automatic speed-up issue on longer inputs:
var diaConfig = new DiaConfig
{
LoadDACModel = true,
SampleRate = 44100,
// Configure speed correction method
SpeedCorrectionMethod = AudioSpeedCorrectionMethod.Hybrid, // Default: best quality
// Configure slowdown mode
SlowdownMode = AudioSlowdownMode.Dynamic // Default: adapts to text length
};
- None: No speed correction applied
- TorchSharp: TorchSharp-based linear interpolation
- Hybrid: Combines TorchSharp and NAudio methods (recommended)
- NAudioResampling: Uses NAudio resampling for speed correction
- All: Creates separate outputs using all methods (for testing/comparison)
- Static: Uses a fixed slowdown factor
- Dynamic: Adjusts slowdown based on text length (recommended)
Speed Correction Examples:
// For highest quality output (default)
var highQualityConfig = new DiaConfig
{
SpeedCorrectionMethod = AudioSpeedCorrectionMethod.Hybrid,
SlowdownMode = AudioSlowdownMode.Dynamic
};
// For testing multiple correction methods
var testConfig = new DiaConfig
{
SpeedCorrectionMethod = AudioSpeedCorrectionMethod.All // Generates multiple output variants
};
// For no speed correction (fastest processing)
var fastConfig = new DiaConfig
{
SpeedCorrectionMethod = AudioSpeedCorrectionMethod.None
};
Text Format Requirements:
- Always begin input text with the [S1] speaker tag
- Alternate between [S1] and [S2] for dialogue (repeating the same speaker tag consecutively may impact generation)
- Keep input text a moderate length (roughly 10-20 seconds of corresponding audio)
Non-Verbal Communications: Dia supports various non-verbal tags. Some work more consistently than others (laughs, chuckles), but be prepared for occasional unexpected output from some tags (sneezes, applause, coughs ...)
var textWithNonVerbals = "[S1] I can't believe it! (gasps) [S2] That's amazing! (laughs)";
Supported non-verbals: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
Voice Cloning Best Practices:
- Provide 5-10 seconds of reference audio for optimal results
- Include the transcript of the reference audio before your generation text
- Use correct speaker tags in the reference transcript
- For duration estimates, figure approximately 86 tokens per second of generated audio
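The tokens-per-second rule of thumb above can be turned into a quick token-budget helper (a hypothetical convenience function, not part of the library):

```csharp
using System;

// Rough maxTokens budget for a target clip length, using the
// ~86 tokens-per-second rule of thumb (approximate, not exact).
static int EstimateMaxTokens(double targetSeconds, double tokensPerSecond = 86.0)
    => (int)Math.Ceiling(targetSeconds * tokensPerSecond);

// e.g. budget for roughly 10 seconds of audio:
int maxTokens = EstimateMaxTokens(10); // 860
Console.WriteLine(maxTokens);
```

The result can be passed as the `maxTokens` argument to `Generate`.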
// Voice cloning example with transcript
var referenceTranscript = "[S1] This is the reference voice speaking clearly.";
var newText = "[S1] Now I will say something completely different.";
var clonedOutput = diaModel.Generate(
text: referenceTranscript + " " + newText,
audioPromptPath: "reference.wav");
Memory Usage: As with the Python implementation, ~10-11GB of GPU memory is required for the Dia model with the DAC codec.
Speed Comparison (RTX 3090): In limited testing (Windows, no torch compile), the C# implementation shows a slight performance improvement over the original Python version:
- Python (original): ~35 tokens/second (without torch.compile)
- C# (NeuralCodecs): ~40 tokens/second
Performance Notes:
- TorchSharp currently lacks torch.compile support, which limits potential speed gains compared to PyTorch
- Dia's performance is reduced on Windows machines compared to Linux environments
- Actual performance will vary based on hardware configuration, text length, and generation parameters
Check out the Example project for a complete implementation, including:
- Model loading and configuration
- Audio processing workflows
- Command-line interface implementation
- Audio Visualization
The example includes tools for visualizing and comparing audio spectrograms:
Audio before and after compression with DAC Codec 24kHz
- SNAC - hubertsiuzdak's original Python implementation
- Descript Audio Codec - Descript's original Python implementation
- Encodec - Meta's original Python implementation
- Dia - Nari Labs' original Python implementation
Suggestions and contributions are welcome! Here's how you can help:
- Bug Reports: Submit issues with reproduction steps
- Feature Requests: Propose new codec implementations or features
- Code Contributions: Submit pull requests with improvements
- Documentation: Help improve examples and documentation
- Testing: Test with different models and platforms
This project is licensed under the Apache-2.0 License, see the LICENSE file for more information.
This project uses libraries under several different licenses, see THIRD-PARTY-NOTICES for more information.