An application for desktop STT using OpenAI-Whisper
Type in any application using your voice. WinSTT is a desktop application that uses OpenAI's Whisper speech-to-text (STT) model for voice typing. It transcribes speech into text, supports over 99 languages, and can run locally without an internet connection.
The built-in Windows speech-to-text is slow, inaccurate, and unintuitive. This app provides customizable hotkey activation and fast, accurate transcription for rapid typing. This is especially useful for those who write articles, blog posts, and even conversations.
- Download the .exe file from the latest release in the Releases section.
- First, clone the repo:
git clone https://github.com/dahshury/WinSTT
- Navigate to the cloned directory:
cd WinSTT
- Initialize the environment and install the requirements:
CPU VERSION
conda env create -f env.yaml
GPU VERSION
conda env create -f env-gpu.yaml
Linux users only: additional setup for PyAudio
For Linux, you need to install PortAudio, which PyAudio depends on. Use the following commands to install PortAudio on common Linux distributions:
- Debian/Ubuntu:
sudo apt update
sudo apt install portaudio19-dev libxcb1 libxcb-cursor0 libxcb-keysyms1 libxcb-render0 libxcb-shape0 libxcb-shm0 libxcb-xfixes0 libxcb-icccm4 libxcb-image0 libxcb-sync1 libxcb-xinerama0 libxcb-randr0 libxcb-util1 libx11-xcb1 libxrender1 libxkbcommon-x11-0
- Activate the environment:
conda activate WinSTT
- Start the GUI by running the following command:
python winSTT.py
- Alternatively, you can run the Python script listener.py, which contains the default functionality:
python -m utils.listener
Hold the Alt+Ctrl+A key combination to start recording, and release it to stop. There can be a very slight delay between pressing the keys and the app starting to listen to your microphone, so start speaking only after you hear the audio cue.
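The hold-to-record behaviour can be sketched as a small state machine. This is only an illustration, not WinSTT's actual code: the class name HoldToRecord, its key_down/key_up methods, and the callbacks are hypothetical, and a real application would feed it events from a keyboard hook library instead of the simulated key presses used here.

```python
from typing import Callable, Iterable, Set


class HoldToRecord:
    """Hypothetical helper: fires callbacks when a key combo is held or released."""

    def __init__(self, combo: Iterable[str],
                 on_start: Callable[[], None],
                 on_stop: Callable[[], None]) -> None:
        self.combo = frozenset(k.lower() for k in combo)
        self.on_start = on_start
        self.on_stop = on_stop
        self.pressed: Set[str] = set()
        self.recording = False

    def key_down(self, key: str) -> None:
        self.pressed.add(key.lower())
        # Start only once, when every key of the combo is down.
        if not self.recording and self.combo <= self.pressed:
            self.recording = True
            self.on_start()

    def key_up(self, key: str) -> None:
        self.pressed.discard(key.lower())
        # Releasing any key of the combo ends the recording.
        if self.recording and not (self.combo <= self.pressed):
            self.recording = False
            self.on_stop()


# Simulated key events for the default Alt+Ctrl+A combination.
events = []
hotkey = HoldToRecord(["ctrl", "alt", "a"],
                      on_start=lambda: events.append("start"),
                      on_stop=lambda: events.append("stop"))
for key in ("ctrl", "alt", "a"):
    hotkey.key_down(key)
hotkey.key_down("a")  # auto-repeat while held must not restart recording
hotkey.key_up("a")
```

In this sketch on_start would play the audio cue and open the microphone stream, and on_stop would hand the captured audio to the transcription step.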
- Releasing the keys transcribes the audio you recorded and pastes the text at your text cursor in any application. The processing speed depends on the model chosen and on your computer's capabilities.
- The app has a "record key" button that lets you change the key combination you hold to record. Click "record key", press and hold the keys you want to use, then click stop to save the new combination.
- This tool is powered by Hugging Face's ASR models, primarily Whisper by OpenAI. The larger the model, the better the accuracy and the slower the speed. Try the model that best suits your hardware and needs.
- When you load the app for the first time, please wait for the model files to download (about 1 GB for the CPU version, 3 GB for the GPU version). How long this takes depends on your internet connection. Once the model is downloaded, no internet connection is needed unless you change the model. The first recording after that might be pasted a little more slowly than subsequent ones.
- The app automatically detects whether speech is present in the audio. If not, or if an error occurs, it shows a message inside the app and writes it to the logs folder.
- The application only records while the record key is held down.
- You can run this app on a CPU, in which case it uses a quantized Whisper-Turbo by default. However, if you have a CUDA GPU, the app runs the full version, which increases both speed and accuracy and is highly recommended.
- The application does not transcribe audio that is less than 0.5 seconds long. If your sentence is short, consider holding the keys until at least 0.5 seconds have passed.
- Some antivirus programs may flag .exe files generated by PyInstaller, such as the current releases, as suspicious. This is a known issue. Rest assured, the binaries are clean and safe. The app has passed most of VirusTotal's tests, which you can check out here; the remaining detections are false positives.
- Silero's Voice Activity Detection (VAD) is used to prevent hallucinations when a recording starts with silence and to avoid processing empty files.
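The 0.5-second minimum mentioned in the notes above amounts to a duration check on the recorded clip. A minimal sketch, assuming the audio is held as a count of samples at a known sample rate; the function names and the 16 kHz default are assumptions made for illustration (16 kHz is the rate Whisper models expect as input), not values taken from the WinSTT source.

```python
MIN_DURATION_S = 0.5  # minimum clip length stated in the notes


def clip_duration(num_samples: int, sample_rate: int) -> float:
    """Return the clip length in seconds."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def long_enough(num_samples: int, sample_rate: int = 16000,
                min_duration: float = MIN_DURATION_S) -> bool:
    """True if the clip is at least min_duration seconds long."""
    return clip_duration(num_samples, sample_rate) >= min_duration


# A 0.25 s clip at 16 kHz is skipped; a 1 s clip is passed on.
short_clip_ok = long_enough(4000)
full_clip_ok = long_enough(16000)
```

A check like this would run right after the keys are released, before any voice activity detection or transcription is attempted.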