WinSTT

A desktop speech-to-text (STT) application powered by OpenAI's Whisper

Type in any application using your voice. WinSTT is a desktop tool that uses OpenAI's Whisper STT model for voice typing. It transcribes speech into text, supports about 99 languages, and runs locally without an internet connection.

Why

The built-in Windows speech-to-text is slow, inaccurate, and unintuitive. This app provides customizable hotkey activation and fast, accurate transcription for rapid typing. It is especially useful for writing articles, blog posts, and even conversations.

Setup

Precompiled Binary (Recommended for Windows Users)

Download the .exe file for the latest release from the Releases section.

Python Version Setup

Install Dependencies

First, clone the repo:

```
git clone https://github.com/dahshury/WinSTT
```

Navigate to the cloned directory:
```
cd WinSTT
```

Initialize the environment and install the requirements:

CPU version:
```
conda env create -f env.yaml
```

GPU version:
```
conda env create -f env-gpu.yaml
```

Linux users only: additional setup for PyAudio

On Linux, you need to install PortAudio, which PyAudio depends on. Use the following commands for your distribution:

Debian/Ubuntu:

```
sudo apt update
sudo apt install portaudio19-dev libxcb1 libxcb-cursor0 libxcb-keysyms1 libxcb-render0 libxcb-shape0 libxcb-shm0 libxcb-xfixes0 libxcb-icccm4 libxcb-image0 libxcb-sync1 libxcb-xinerama0 libxcb-randr0 libxcb-util1 libx11-xcb1 libxrender1 libxkbcommon-x11-0
```

Activate the environment:
```
conda activate WinSTT
```

Start The App

Start the GUI by running:
```
python winSTT.py
```

Alternatively, you can run the listener.py script as a module, which contains the default functionality without the GUI:
```
python -m utils.listener
```

Usage

Hold the Alt+Ctrl+A key combination to start recording, and release it to stop. There can be a slight delay between pressing the keys and the app starting to listen to your microphone, so start speaking only after you hear the audio cue.

Releasing the key transcribes the audio you recorded and pastes the text wherever your text cursor is, in any application. The processing speed depends on the chosen model and your computer's capabilities.
The app contains a "record key" button that lets you change the key combination you hold to record. Click "record key", press and hold the keys you want to use, then click stop to save the new combination.
This tool is powered by Hugging Face's ASR models, primarily Whisper by OpenAI. Larger models are more accurate but slower. Try the model that best suits your hardware and needs.
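
The hold-to-record behaviour described above can be sketched as a small state tracker. This is only an illustration, not WinSTT's actual code: the class name `HotkeyRecorder`, its methods, and the key names are hypothetical, and a real app would feed it events from a keyboard hook.

```python
class HotkeyRecorder:
    """Tracks a key combination and reports when recording should start or stop.

    Hypothetical helper for illustration; not part of WinSTT.
    """

    def __init__(self, combo=("ctrl", "alt", "a")):
        self.combo = frozenset(combo)  # keys that must all be held
        self.pressed = set()           # keys currently held down
        self.recording = False

    def on_press(self, key):
        """Register a key press; return "start" once the full combo is held."""
        self.pressed.add(key)
        if not self.recording and self.combo <= self.pressed:
            self.recording = True
            return "start"
        return None

    def on_release(self, key):
        """Register a key release; return "stop" when a combo key is let go."""
        self.pressed.discard(key)
        if self.recording and key in self.combo:
            self.recording = False
            return "stop"
        return None
```

Recording starts only when every key in the combination is down and stops as soon as any one of them is released, which matches the "hold to record, release to transcribe" flow.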

Notes

When you load the app for the first time, please wait for the model files to download (about 1 GB for the CPU version, 3 GB for the GPU version). The download time depends on your internet connection. Once the model is downloaded, no internet connection is needed unless you change the model. The first recording after that might be pasted a little more slowly than subsequent ones.
The app automatically detects whether speech is present in the audio. If it is not, or if an error occurs, the app shows a message in its window and writes it to the logs folder.
The application only records while the record key is held down.
You can run this app on a CPU, in which case it uses the quantized Whisper-Turbo model by default. If you have a CUDA GPU, the app runs the full version, which improves both speed and accuracy and is highly recommended.
The application does not transcribe audio that is shorter than 0.5 seconds. If your sentence is short, keep holding the key until at least 0.5 seconds have passed.
Some antivirus programs may flag the current releases as suspicious because the .exe files are generated by PyInstaller. This is a known issue. Rest assured, the binaries are clean and safe. The app has passed most of VirusTotal's tests, and the remaining detections are false positives.
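
Two of the notes above, the 0.5-second minimum and the CPU/GPU model choice, can be sketched as follows. This is a minimal illustration under stated assumptions, not the app's implementation: the function names, the variant labels, and the 16 kHz sample rate are hypothetical.

```python
MIN_DURATION_S = 0.5  # minimum clip length stated in the notes above


def clip_duration(num_samples, sample_rate=16000):
    """Length of a mono clip in seconds (16 kHz is an assumed sample rate)."""
    return num_samples / sample_rate


def should_transcribe(num_samples, sample_rate=16000):
    """Skip clips shorter than the minimum duration."""
    return clip_duration(num_samples, sample_rate) >= MIN_DURATION_S


def cuda_available():
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def pick_model_variant(has_cuda):
    """Full model on a CUDA GPU, quantized model on CPU (labels are illustrative)."""
    return "full" if has_cuda else "quantized"
```

For example, `should_transcribe(4000)` is false at 16 kHz because 4000 samples is only 0.25 seconds of audio.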

Acknowledgments

Silero's Voice Activity Detection (VAD) is used to prevent hallucinations when a recording starts with silence and to avoid processing empty audio files.
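
To illustrate the idea of gating transcription on detected speech, here is a deliberately simple stand-in that uses an RMS energy threshold. It is not Silero VAD, which is a neural model and far more robust. The function names and the threshold value are assumptions made for this sketch.

```python
import math


def rms(samples):
    """Root-mean-square level of float samples in the range [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def has_speech(samples, threshold=0.01):
    """Crude energy gate: treat the clip as speech if its RMS exceeds the threshold.

    The threshold is an illustrative guess; a real VAD such as Silero
    classifies speech rather than measuring loudness.
    """
    return rms(samples) > threshold
```

A clip that fails this check would be dropped before it reaches the model, so silence is never transcribed.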
