afif-malghani/stt-linux

STT App

A speech-to-text application built on state-of-the-art open-source models.

Features

  • Convert speech to text using OpenAI's Whisper model
  • Record audio directly from your microphone
  • Open existing audio files for transcription
  • Save transcriptions to text files
  • Save recorded audio to WAV files
  • Noise cancellation for improved audio quality
  • Split long recordings into chunks for better processing
  • View detailed segment information for each transcription
  • Model finetuning on your voice for improved accuracy
  • Support for multiple Whisper model sizes:
    • tiny (fast but less accurate)
    • base (good balance)
    • small (better accuracy)
    • medium (high accuracy)
    • large (best accuracy but slower and requires more memory)
  • NEW: Overlay mode for real-time transcription that can insert text anywhere
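
As a rough illustration of the model sizes and segment information listed above, the sketch below loads a Whisper model and formats its segments. It assumes the openai-whisper package (imported as whisper) is what the application uses under the hood; the helper names transcribe and format_segments are hypothetical and not taken from the project's code.

```python
# Model names as listed in the feature list above.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large")


def transcribe(path, model_name="base"):
    """Transcribe an audio file; returns Whisper's result dictionary."""
    if model_name not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {model_name}")
    # Imported lazily so the rest of this sketch works without the package.
    import whisper  # third-party; assumed to be installed via openai-whisper

    model = whisper.load_model(model_name)
    return model.transcribe(path)


def format_segments(result):
    """Turn the 'segments' list of a Whisper result into printable lines."""
    lines = []
    for seg in result.get("segments", []):
        text = seg["text"].strip()
        lines.append(f"[{seg['start']:.2f} -> {seg['end']:.2f}] {text}")
    return lines
```

Calling format_segments(transcribe("recording.wav", "small")) would give one line per segment with its start and end time in seconds; "recording.wav" is a placeholder file name.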

Installation

From Source

  1. Clone the repository:

    git clone https://github.com/example/stt-app.git
    cd stt-app
    
  2. Install dependencies:

    pip install -e .
    
  3. For the overlay functionality, install additional system dependencies:

    sudo apt-get install xdotool python3-xlib
    
  4. Run the application:

    python -m stt_app.main  # For the standard application
    

    Or run the overlay mode:

    python run_overlay.py  # For the overlay transcription mode
    

Debian Package

To build a Debian package:

  1. Install build dependencies:

    sudo apt-get install debhelper dh-python python3-setuptools
    
  2. Build the package:

    cd stt-app
    dpkg-buildpackage -us -uc
    
  3. Install the package:

    sudo dpkg -i ../stt-app_0.1.0-1_all.deb
    sudo apt-get install -f  # Install any missing dependencies
    

Usage

Standard Application

  1. Launch the application by running stt-app or stt-app-gui, or open it from your applications menu.

  2. Choose a Whisper model from the dropdown menu.

  3. Configure recording settings:

    • Enable/disable noise cancellation
    • Set the chunk duration for processing
  4. Either:

    • Click "Record" to start recording from your microphone, then "Stop Recording" when finished.
    • Click "Open Audio File" to select an existing audio file.
  5. Click "Transcribe" to convert the speech to text.

  6. The transcription will appear in the text area. You can:

    • View the full transcription in the "Transcription" tab
    • Examine individual segments in the "Segments" tab
    • View and manipulate chunks in the "Chunks" tab
    • Save the transcription by clicking "Save Text"
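
To make the chunk duration setting more concrete, here is a minimal sketch of how a recording could be divided into fixed-length chunks before transcription. The function name chunk_bounds is hypothetical, and the application's actual chunking logic may differ (for example, it might split on silence instead).

```python
import math


def chunk_bounds(total_seconds, chunk_seconds):
    """Return (start, end) pairs in seconds that cover the whole recording.

    Every chunk is chunk_seconds long except possibly the last one,
    which is shortened to end exactly at total_seconds.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    count = math.ceil(total_seconds / chunk_seconds)
    return [
        (i * chunk_seconds, min((i + 1) * chunk_seconds, total_seconds))
        for i in range(count)
    ]
```

With a 65-second recording and a 30-second chunk duration, this yields three chunks, the last of which is 5 seconds long.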

Overlay Mode (NEW)

The overlay mode provides a Windows 11-like experience for speech-to-text input anywhere on your system.

  1. Launch the overlay by running:

    python run_overlay.py
    
  2. A floating window will appear on your screen.

  3. Keyboard shortcuts:

    • Alt+Shift+S: Start/stop listening
    • Alt+Shift+I: Insert transcribed text at the current cursor position
    • Alt+Shift+C: Clear the transcription
    • Alt+Shift+O: Show/hide the overlay window
  4. To use:

    • Start listening by clicking the "Start Listening" button or pressing Alt+Shift+S
    • Speak clearly into your microphone
    • The transcription appears in real time in the overlay
    • Position your cursor where you want to insert the text
    • Click "Insert Text" or press Alt+Shift+I to paste the transcribed text
  5. The overlay window can be dragged to any position on your screen.
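
The shortcut and insertion steps above can be sketched as follows. This is an assumption about how such an overlay might work, not the project's implementation: it types the text into the focused window with xdotool type, whereas the real overlay may instead copy the text to the clipboard with pyperclip and paste it. The action names and helper functions are hypothetical.

```python
import shutil
import subprocess

# Shortcuts from the list above, mapped to hypothetical action names.
HOTKEYS = {
    "alt+shift+s": "toggle_listening",
    "alt+shift+i": "insert_text",
    "alt+shift+c": "clear_transcription",
    "alt+shift+o": "toggle_overlay",
}


def build_type_command(text):
    """Build the xdotool command that types text into the focused window."""
    # "--" ends option parsing so text starting with "-" is not read as a flag.
    return ["xdotool", "type", "--clearmodifiers", "--", text]


def insert_text(text):
    """Type text at the cursor position; returns False if there is no text."""
    if not text:
        return False
    if shutil.which("xdotool") is None:
        raise RuntimeError("xdotool is not installed")
    subprocess.run(build_type_command(text), check=True)
    return True


def register_hotkeys(handlers):
    """Bind each shortcut to a callable from handlers (action name -> function)."""
    import keyboard  # third-party; on Linux it typically needs root privileges

    for combo, action in HOTKEYS.items():
        keyboard.add_hotkey(combo, handlers[action])
```

Keeping the command construction separate from the subprocess call makes it possible to inspect what would be typed without touching the desktop session.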

Finetuning

The application supports finetuning the Whisper model on your voice to improve transcription accuracy:

  1. Go to Model → Finetune Model... to open the finetuning dialog.

  2. In the Data Collection tab:

    • Click "Next Prompt" to display a random text prompt
    • Click "Record (5s)" to record yourself reading the prompt
    • Repeat this process several times to build a training dataset
    • It's recommended to record at least 10 samples for good results
  3. In the Model Finetuning tab:

    • Select the base model to finetune (e.g., "base")
    • Set training parameters (epochs and batch size)
    • Click "Start Finetuning" to begin the process
    • Monitor progress in the log display
    • Once completed, your finetuned model will be available to use
  4. To use a finetuned model:

    • Go to Model → Finetuned Models and select your model
    • Or click "Load Selected Model" in the finetuning dialog

The finetuning process adapts the model to your voice, accent, and speech patterns, which can significantly improve transcription accuracy.
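
One way to picture the data collection step is as a list of (recording, prompt) pairs. The sketch below stores such pairs in a JSON Lines file and checks the recommended minimum of 10 samples. The file format, field names, and function names are all hypothetical; the application's own dataset layout is not documented here.

```python
import json
from pathlib import Path

MIN_SAMPLES = 10  # the recommended minimum mentioned above


def add_sample(manifest_path, audio_path, prompt):
    """Append one recording and the prompt that was read to the manifest."""
    record = {"audio": str(audio_path), "text": prompt}
    with open(manifest_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_samples(manifest_path):
    """Read all samples back; a missing manifest counts as an empty dataset."""
    path = Path(manifest_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def ready_for_finetuning(samples, minimum=MIN_SAMPLES):
    """True once enough samples have been recorded."""
    return len(samples) >= minimum
```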

Command Line Options

usage: stt-app [-h] [--model {tiny,base,small,medium,large}] [--device DEVICE] [--audio-file AUDIO_FILE]

Speech to Text Application

optional arguments:
  -h, --help            show this help message and exit
  --model {tiny,base,small,medium,large}, -m {tiny,base,small,medium,large}
                        Whisper model to use (default: base)
  --device DEVICE, -d DEVICE
                        Device to use for inference (cpu, cuda, default: auto-detect)
  --audio-file AUDIO_FILE, -a AUDIO_FILE
                        Audio file to transcribe on startup
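
For reference, an argparse parser along the following lines would accept the options shown in the help text above. It is a reconstruction from that help output rather than the project's actual code, and resolve_device is a hypothetical helper that assumes auto-detection means "use CUDA if PyTorch reports it as available, otherwise the CPU".

```python
import argparse

MODELS = ["tiny", "base", "small", "medium", "large"]


def build_parser():
    """Build a parser that mirrors the documented stt-app options."""
    parser = argparse.ArgumentParser(
        prog="stt-app", description="Speech to Text Application"
    )
    parser.add_argument("--model", "-m", choices=MODELS, default="base",
                        help="Whisper model to use (default: base)")
    parser.add_argument("--device", "-d", default=None,
                        help="Device to use for inference (cpu or cuda)")
    parser.add_argument("--audio-file", "-a", default=None,
                        help="Audio file to transcribe on startup")
    return parser


def resolve_device(requested=None):
    """Return the requested device, or guess one when none was given."""
    if requested:
        return requested
    try:
        import torch  # third-party; only needed for auto-detection
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

For example, build_parser().parse_args(["-m", "small", "-a", "talk.wav"]) selects the small model and queues talk.wav (a placeholder name) for transcription.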

Requirements for Overlay Mode

For the overlay mode to work properly, you'll need:

  • Python 3.7 or later
  • PyQt5
  • whisper
  • keyboard
  • pyperclip
  • python-xlib
  • xdotool (system package)

The overlay mode allows you to use speech-to-text functionality system-wide, similar to the Windows 11 speech input feature (Win+H).
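
Before launching the overlay, it can help to confirm that the requirements listed above are present. The following is a small sketch of such a check using only the standard library; the function name is hypothetical, and the import names (for example Xlib for python-xlib) are the usual ones for these packages rather than something this project specifies.

```python
import importlib.util
import shutil
import sys

# Requirement name from the list above -> module name used to import it.
PYTHON_MODULES = {
    "PyQt5": "PyQt5",
    "whisper": "whisper",
    "keyboard": "keyboard",
    "pyperclip": "pyperclip",
    "python-xlib": "Xlib",
}


def check_overlay_requirements():
    """Return a dict mapping each requirement to True (found) or False."""
    status = {"python>=3.7": sys.version_info >= (3, 7)}
    for name, module in PYTHON_MODULES.items():
        # find_spec looks the module up without importing it.
        status[name] = importlib.util.find_spec(module) is not None
    # xdotool is a system executable, so look for it on PATH.
    status["xdotool"] = shutil.which("xdotool") is not None
    return status


if __name__ == "__main__":
    for requirement, found in check_overlay_requirements().items():
        print(f"{requirement}: {'ok' if found else 'missing'}")
```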

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments
