CVSSP @ University of Surrey
- Guildford
- xinhaomei.github.io
Stars
Efficient Training of Audio Transformers with Patchout
A collection of datasets for the purpose of emotion recognition/detection in speech.
A 6-million Audio-Caption Paired Dataset Built with an LLM- and ALM-based Automatic Pipeline
Codec for paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
Janus-Series: Unified Multimodal Understanding and Generation Models
SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders.
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings
Implementation of Band Split Roformer, SOTA Attention network for music source separation out of ByteDance AI Labs
The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
Utilities intended for use with Llama models.
Welcome to the Llama Cookbook! This is your go-to guide for Building with Llama: Getting started with Inference, Fine-Tuning, RAG. We also show you how to solve end-to-end problems using Llama mode…
Gemma open-weight LLM library, from Google DeepMind
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. In ICML 2024.
Code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"
Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.
[NeurIPS 2024 Best Paper][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ult…
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
Zero-Shot Speech Editing and Text-to-Speech in the Wild
Audio Codec Speech processing Universal PERformance Benchmark
Aty-TTS: Improving fairness for spoken language understanding in atypical speech with Text-to-Speech
A simple library for Fréchet Audio Distance (FAD) calculation
Vector (and Scalar) Quantization, in PyTorch
A lightweight library for PyTorch training tools and utilities
Baseline multi-resolution cross network model trained using the Divide and Remaster Dataset