- Baltimore, Maryland
- https://steventan0110.github.io/
Stars
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
Train your Agent model via our easy and efficient framework
Model Context Protocol Servers
Benchmark data and code for MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
JARVIS, a system to connect LLMs with the ML community. Paper: https://arxiv.org/pdf/2303.17580.pdf
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
The official repo of the Qwen2-Audio chat and pretrained large audio-language models proposed by Alibaba Cloud.
Official PyTorch implementation for "MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens" (ACL 2025 Findings)
PyTorch implementation of VALL-E (zero-shot text-to-speech); reproduced demo: https://lifeiteng.github.io/valle/index.html
Audio-visual corruption modeling from our CVPR 2023 paper "Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring"
[CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
This repository contains the code for the paper "End-to-End Multimodal Emotion Recognition Using Deep Neural Networks".
Reverse engineering of the Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice
AI for Screeps, a multiplayer programming strategy game
Efficient face emotion recognition in photos and videos
A collection of resources and papers on Vector Quantized Variational Autoencoder (VQ-VAE) and its application
Multi-Scale Neural Audio Codec (SNAC) compresses audio into discrete codes at a low bitrate
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models (ICASSP 2024)
NAACL '24 (Best Demo Paper Runner-Up) / MLSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference
Reading list for research topics in multimodal machine learning
[ICML 2024 Best Paper] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (https://arxiv.org/abs/2310.16834)
Implementation of NaturalSpeech 2, a zero-shot speech and singing synthesizer, in PyTorch
Image-to-Image Translation in PyTorch
Image-to-image translation with conditional adversarial nets
AudioLDM: Generate speech, sound effects, music and beyond, with text.
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild", published at ACM Multimedia 2020. For the HD commercial model, please try out Sync Labs.