A speech-to-text application using state-of-the-art open-source models.
- Convert speech to text using OpenAI's Whisper model
- Record audio directly from your microphone
- Open existing audio files for transcription
- Save transcriptions to text files
- Save recorded audio to WAV files
- Noise cancellation for improved audio quality
- Split long recordings into chunks for better processing
- View detailed segment information for each transcription
- Model finetuning on your voice for improved accuracy
- Support for multiple Whisper model sizes:
- tiny (fast but less accurate)
- base (good balance)
- small (better accuracy)
- medium (high accuracy)
- large (best accuracy but slower and requires more memory)
- NEW: Overlay mode for real-time transcription that can insert text anywhere
1. Clone the repository:

   ```bash
   git clone https://github.com/example/stt-app.git
   cd stt-app
   ```

2. Install dependencies:

   ```bash
   pip install -e .
   ```

3. For the overlay functionality, install additional system dependencies:

   ```bash
   sudo apt-get install xdotool python3-xlib
   ```

4. Run the application:

   ```bash
   python -m stt_app.main
   ```

   Or run the overlay mode:

   ```bash
   python run_overlay.py
   ```
To build a Debian package:

1. Install build dependencies:

   ```bash
   sudo apt-get install debhelper dh-python python3-setuptools
   ```

2. Build the package:

   ```bash
   cd stt-app
   dpkg-buildpackage -us -uc
   ```

3. Install the package:

   ```bash
   sudo dpkg -i ../stt-app_0.1.0-1_all.deb
   sudo apt-get install -f  # Install any missing dependencies
   ```
1. Launch the application by running `stt-app` or `stt-app-gui`, or from your applications menu.

2. Choose a Whisper model from the dropdown menu.

3. Configure recording settings:
   - Enable/disable noise cancellation
   - Set the chunk duration for processing

4. Either:
   - Click "Record" to start recording from your microphone, then "Stop Recording" when finished.
   - Click "Open Audio File" to select an existing audio file.

5. Click "Transcribe" to convert the speech to text.

6. The transcription will appear in the text area. You can:
   - View the full transcription in the "Transcription" tab
   - Examine individual segments in the "Segments" tab
   - View and manipulate chunks in the "Chunks" tab
   - Save the transcription by clicking "Save Text"
The overlay mode provides a Windows 11-like experience for speech-to-text input anywhere on your system.
1. Launch the overlay by running:

   ```bash
   python run_overlay.py
   ```

2. A floating window will appear on your screen.

3. Keyboard shortcuts:
   - Alt+Shift+S: Start/stop listening
   - Alt+Shift+I: Insert transcribed text at the current cursor position
   - Alt+Shift+C: Clear the transcription
   - Alt+Shift+O: Show/hide the overlay window

4. To use:
   - Start listening by clicking the "Start Listening" button or pressing Alt+Shift+S
   - Speak clearly into your microphone
   - The transcription appears in real-time in the overlay
   - Position your cursor where you want to insert the text
   - Click "Insert Text" or press Alt+Shift+I to paste the transcribed text

5. The overlay window can be dragged to any position on your screen.
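On X11, inserting text at the cursor is typically done by shelling out to `xdotool` (one of the overlay's system dependencies). A minimal sketch; the function name and `dry_run` flag are illustrative, not part of the application:

```python
import subprocess


def insert_text_at_cursor(text, dry_run=False):
    """Type `text` into whichever window currently has focus via xdotool.

    --clearmodifiers releases any held hotkey modifiers (e.g. Alt+Shift)
    so they do not alter the typed characters; `--` ends option parsing
    in case the text starts with a dash.
    """
    cmd = ["xdotool", "type", "--clearmodifiers", "--", text]
    if dry_run:
        return cmd  # allow inspecting the command without an X display
    subprocess.run(cmd, check=True)
```

This requires a running X session; on Wayland, `xdotool` generally only works inside XWayland windows.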
The application supports finetuning the Whisper model on your voice to improve transcription accuracy:
1. Go to Model → Finetune Model... to open the finetuning dialog.

2. In the Data Collection tab:
   - Click "Next Prompt" to display a random text prompt
   - Click "Record (5s)" to record yourself reading the prompt
   - Repeat this process several times to build a training dataset
   - Recording at least 10 samples is recommended for good results

3. In the Model Finetuning tab:
   - Select the base model to finetune (e.g., "base")
   - Set training parameters (epochs and batch size)
   - Click "Start Finetuning" to begin the process
   - Monitor progress in the log display
   - Once completed, your finetuned model will be available to use

4. To use a finetuned model:
   - Go to Model → Finetuned Models and select your model
   - Or click "Load Selected Model" in the finetuning dialog
The finetuning process adapts the model to your voice, accent, and speech patterns, which can significantly improve transcription accuracy.
```text
usage: stt-app [-h] [--model {tiny,base,small,medium,large}] [--device DEVICE] [--audio-file AUDIO_FILE]

Speech to Text Application

optional arguments:
  -h, --help            show this help message and exit
  --model {tiny,base,small,medium,large}, -m {tiny,base,small,medium,large}
                        Whisper model to use (default: base)
  --device DEVICE, -d DEVICE
                        Device to use for inference (cpu, cuda, default: auto-detect)
  --audio-file AUDIO_FILE, -a AUDIO_FILE
                        Audio file to transcribe on startup
```
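The help text above maps directly onto Python's `argparse`; a sketch of an equivalent parser (illustrative, not the application's source):

```python
import argparse


def build_parser():
    """Recreate the stt-app command-line interface shown above."""
    parser = argparse.ArgumentParser(
        prog="stt-app", description="Speech to Text Application"
    )
    parser.add_argument(
        "--model", "-m",
        choices=["tiny", "base", "small", "medium", "large"],
        default="base",
        help="Whisper model to use (default: base)",
    )
    parser.add_argument(
        "--device", "-d",
        help="Device to use for inference (cpu, cuda; default: auto-detect)",
    )
    parser.add_argument(
        "--audio-file", "-a",
        help="Audio file to transcribe on startup",
    )
    return parser
```

Leaving `--device` unset (`None`) signals the application to auto-detect, typically preferring CUDA when available.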
For the overlay mode to work properly, you'll need:
- Python 3.7 or later
- PyQt5
- whisper
- keyboard
- pyperclip
- python-xlib
- xdotool (system package)
The overlay mode allows you to use speech-to-text functionality system-wide, similar to the Windows 11 speech input feature (Win+H).
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI Whisper for the speech recognition models
- PyQt5 for the GUI framework
- noisereduce for audio noise reduction
- Transformers for model finetuning capabilities