Stars
🧑🏫 60+ Implementations/tutorials of deep learning papers with side-by-side notes 📝; including transformers (original, xl, switch, feedback, vit, ...), optimizers (adam, adabelief, sophia, ...), ga…
A curated list of peer-reviewed papers on theoretical and practical aspects of drivers' attention, used for the paper "Attention for Vision-Based Assistive and Automated Driving: A Review of Algorithms …
A new large-scale dataset for vision-based drowsiness detection
PyTorch implementation of MoCo: https://arxiv.org/abs/1911.05722
An awesome face technology repository.
State-of-the-art 2D and 3D Face Analysis Project
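OCR software, free and offline. Open-source, free offline OCR software. Supports screenshots and batch image import, PDF document recognition, exclusion of watermarks and headers/footers, and scanning/generating QR codes. Includes built-in language libraries for multiple languages.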
Collection of various algorithms in mathematics, machine learning, computer science and physics implemented in C++ for educational purposes.
Course to get into Large Language Models (LLMs) with roadmaps and Colab notebooks.
Controllable and fast Text-to-Speech for over 7000 languages!
Simple text to phones converter for multiple languages
eSpeak NG is an open source speech synthesizer that supports more than a hundred languages and accents.
Multilingual Automatic Speech Recognition with word-level timestamps and confidence
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
The Microsoft Scalable Noisy Speech Dataset (MS-SNSD) is a noisy speech dataset that can scale to arbitrary sizes depending on the number of speakers, noise types, and Speech to Noise Ratio (SNR) l…
A Non-Autoregressive Transformer-based Text-to-Speech, supporting a family of SOTA transformers with supervised and unsupervised duration modeling. This project grows with the research community, …
Chinese text normalization for speech processing
Production First and Production Ready End-to-End Speech Recognition Toolkit
Tacotron 2 - PyTorch implementation with faster-than-realtime inference
GAN-based Mel-Spectrogram Inversion Network for Text-to-Speech Synthesis
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal domains, for both inference and training.