STT Server

Real-time Speech-to-Text WebSocket server for conversational agents.

Overview

STT Server provides low-latency streaming transcription with turn-taking awareness. It processes continuous audio streams and returns transcription segments as they become available, distinguishing between intermediate (tentative) and final results.

The server solves the problem of integrating speech recognition into conversational AI systems where:

  • Low latency is critical for natural conversation flow
  • Turn boundaries need to be detected automatically
  • Transcription should be progressive (showing partial results)

Architecture

The system uses a multi-stage async pipeline:

Audio → [VAD] → [ASR] → [Sink] → WebSocket
         │        │        │
         │        │        └─ Serializes to JSON and sends to client
         │        └─ Transcribes audio using Canary-Qwen-2.5B
         └─ Detects speech/silence, segments by turn boundaries

VAD Stage - Uses Silero VAD to classify audio frames as speech or silence. A hysteresis-based state machine prevents chatter at silence boundaries: audio chunks are emitted on small gaps (0.3s) for continuous transcription, and end-of-turn signals on large gaps (1.5s).
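
A minimal sketch of that gap logic (the class, method names, and frame handling here are illustrative assumptions, not the actual implementation):

# Illustrative sketch of the VAD stage's gap logic; the real stage wraps
# Silero VAD, and these names and structures are assumptions.
SMALL_GAP_S = 0.3  # emit buffered audio for continuous transcription
LARGE_GAP_S = 1.5  # signal end of turn

class VadGate:
    def __init__(self, frame_s: float):
        self.frame_s = frame_s    # duration of one audio frame in seconds
        self.in_speech = False
        self.silence_s = 0.0
        self.buffer: list[bytes] = []

    def feed(self, frame: bytes, is_speech: bool):
        """Yield ("chunk", audio) on small gaps, ("end_of_turn", None) on large ones."""
        if is_speech:
            self.in_speech = True
            self.silence_s = 0.0
            self.buffer.append(frame)
            return
        if not self.in_speech:
            return  # idle silence; nothing buffered yet
        self.silence_s += self.frame_s
        if self.buffer and self.silence_s >= SMALL_GAP_S:
            yield ("chunk", b"".join(self.buffer))
            self.buffer.clear()
        if self.silence_s >= LARGE_GAP_S:
            self.in_speech = False
            yield ("end_of_turn", None)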

ASR Stage - Transcribes audio chunks using Canary-Qwen-2.5B (2.5B parameter model). Maintains audio overlap for context continuity and uses semi-global alignment to merge overlapping transcriptions.
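
Roughly, the overlap handling looks like this (a sketch under assumed names: transcribe() stands in for the Canary-Qwen wrapper, the overlap length is an illustrative value, and merge_by_overlap is the strops function described below):

# Sketch of the ASR stage's overlap handling; OVERLAP_BYTES and the
# transcribe() callable are assumptions.
from strops import merge_by_overlap

OVERLAP_BYTES = 16000 * 2  # ~1 s of 16-bit PCM at 16 kHz kept for context

class OverlapMerger:
    def __init__(self, transcribe):
        self.transcribe = transcribe
        self.words: list[str] = []
        self.tail = b""  # audio carried over from the previous chunk

    def on_chunk(self, chunk: bytes) -> list[str]:
        audio = self.tail + chunk
        self.words = merge_by_overlap(self.words, self.transcribe(audio))
        self.tail = audio[-OVERLAP_BYTES:]
        return self.words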

Sink Stage - Terminal stage that serializes transcription segments to JSON and sends them directly to the WebSocket client.

Sub-Projects

nemo_lite

Lightweight inference-only wrapper for Canary-Qwen-2.5B. Located in nemo_lite/.

The official NVIDIA NeMo toolkit has heavy dependencies that are problematic for deployment:

  • lhotse, nv-one-logger-*, fiddle, lightning, hydra - none needed for inference

nemo_lite provides the same transcription capability with minimal dependencies:

  • torch, torchaudio - Core tensor operations
  • transformers, peft, safetensors - Model loading
  • librosa - Mel filterbank (matches NeMo exactly)

Key components:

  • AudioPreprocessor - Converts PCM to mel spectrogram (128 features, 16kHz)
  • FastConformer - 32-layer encoder (1024 dim, 8x temporal downsampling)
  • Qwen3-1.7B + LoRA - Text generation via HuggingFace transformers
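
A hypothetical usage sketch (CanaryQwen is the class in nemo_lite/model.py per the project layout; the constructor and transcribe() signature are assumptions):

# Hypothetical nemo_lite usage; only the CanaryQwen class name comes from
# the project layout, the method names are assumptions.
import numpy as np
from nemo_lite.model import CanaryQwen

model = CanaryQwen()                        # weights load on first use (assumed)
audio = np.zeros(16000, dtype=np.float32)   # 1 s of 16 kHz mono PCM
print(model.transcribe(audio))              # assumed API: float PCM in, text out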

strops

Rust/Python library for word sequence merging. Located in strops-rs/.

Provides a merge_by_overlap(prev, new) function that uses semi-global alignment to find where the suffix of the previous transcription overlaps with the prefix of the new one. This maintains context continuity when audio is processed in overlapping chunks.

>>> from strops import merge_by_overlap
>>> merge_by_overlap(["The", "quick", "brown", "fox"], ["brown", "fox", "jumps"])
["The", "quick", "brown", "fox", "jumps"]

Built with maturin and pyo3 for Python bindings.

Usage

Prerequisites

  • NVIDIA GPU with CUDA - Required for real-time performance (RTX 2070 or better recommended)
  • CPU mode - Works but significantly slower, not suitable for real-time use
  • Nix - For dependency management (or manually install Python dependencies)
  • ~5GB disk space - For model weights (downloaded on first run)

Running the Server

Development mode:

# Enter the Nix development shell
nix develop

# Run the server
python stt_server/server.py --port 15751 --host 0.0.0.0

# Or with CPU mode
STT_DEVICE=cpu python stt_server/server.py

Using the built package:

nix build .#stt-server
./result/bin/stt-server --port 15751

The server exposes:

  • GET /health - Health check endpoint
  • WebSocket /ws/transcribe - Streaming transcription
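
A quick liveness check against a local instance (host and port taken from the examples above; the response body is not specified here):

# Ping the health endpoint of a locally running server.
import urllib.request
print(urllib.request.urlopen("http://localhost:15751/health").read())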

WebSocket Protocol

Client sends AudioFrame messages:

{
  "samples": "<base64-encoded 16-bit PCM>",
  "sampleRate": 16000,
  "channels": 1
}

Server sends TranscriptionSegment messages:

{
  "text": "transcribed text here",
  "isFinal": false,
  "isEndOfTurn": false
}
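
Putting the two message types together, a minimal streaming client might look like this (the websockets library choice, localhost URL, and frame size are assumptions; the message shapes follow the protocol above):

# Minimal client sketch for the protocol above; the `websockets` dependency
# and the localhost URL are assumptions.
import asyncio, base64, json
import numpy as np
import websockets

async def stream(pcm16: np.ndarray) -> None:
    async with websockets.connect("ws://localhost:15751/ws/transcribe") as ws:
        for start in range(0, len(pcm16), 1600):  # ~100 ms frames at 16 kHz
            await ws.send(json.dumps({
                "samples": base64.b64encode(pcm16[start:start + 1600].tobytes()).decode(),
                "sampleRate": 16000,
                "channels": 1,
            }))
        while True:  # print segments until the server signals end of turn
            seg = json.loads(await ws.recv())
            print("final  " if seg["isFinal"] else "partial", seg["text"])
            if seg["isEndOfTurn"]:
                break

asyncio.run(stream(np.zeros(16000 * 3, dtype=np.int16)))  # placeholder audio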

Test Client

Stream from microphone:

python -m stt_server.scripts.stt_client

The client offers interactive device selection when multiple microphones are available.

Stream from audio file:

python -m stt_server.scripts.stt_client path/to/audio.mp3

Supports WAV, FLAC, MP3, OGG with automatic resampling to 16kHz mono.
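
If you prepare audio yourself instead, the equivalent load-and-resample step is straightforward with librosa (already a dependency above; the 16-bit conversion matches the AudioFrame format, though the client's internals may differ):

# Load any supported format as 16 kHz mono, then convert to 16-bit PCM for
# the AudioFrame "samples" field; librosa.load resamples automatically.
import librosa
import numpy as np

samples, _ = librosa.load("path/to/audio.mp3", sr=16000, mono=True)
pcm16 = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)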

NixOS Service

Step 1: Add stt-server to your flake inputs

# flake.nix
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    stt-server.url = "github:breakds/stt-server";
  };

  outputs = { nixpkgs, stt-server, ... }: {
    nixosConfigurations.your-host = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      specialArgs = { inherit stt-server; };
      modules = [ ./configuration.nix ];
    };
  };
}

Step 2: Configure the service

# configuration.nix
{ stt-server, ... }:

{
  imports = [ stt-server.nixosModules.default ];

  nixpkgs.overlays = [ stt-server.overlays.default ];

  # Required for CUDA support
  nixpkgs.config.cudaSupport = true;

  services.stt-server = {
    enable = true;
    port = 15751;
    host = "0.0.0.0";
    device = "cuda";  # or "cpu"
    openFirewall = true;  # for internal network access
  };
}

Configuration options:

  • port - Server port (default: 15751)
  • host - Bind address (default: "0.0.0.0")
  • device - "cuda" or "cpu" (default: "cuda")
  • package - The stt-server package to use
  • openFirewall - Open TCP port in firewall (default: false)

Notes:

  • Model weights are cached in /var/cache/stt-server (managed by systemd)
  • For CUDA support, ensure nixpkgs.config.cudaSupport = true is set
  • After adding openFirewall = true, you may need to run sudo systemctl restart firewall for the port to open

For Developers

Development Shells

Main development:

nix develop

Python environment with all ML dependencies (torch, transformers, etc.) and dev tools (basedpyright, ruff).

strops development:

nix develop .#strops

Rust toolchain for developing the strops library.

Testing

python -m unittest discover -s stt_server/tests

Project Structure

stt-server/
├── stt_server/           # Main Python package
│   ├── server.py         # FastAPI WebSocket server
│   ├── session.py        # Transcription session management
│   ├── pipeline.py       # Async pipeline infrastructure
│   ├── data_types.py     # Pydantic models for protocol
│   ├── stages/           # Pipeline stages (VAD, ASR, Sink)
│   ├── scripts/          # CLI tools (stt_client)
│   └── tests/            # Unit tests
├── nemo_lite/            # Lightweight Canary-Qwen wrapper
│   ├── model.py          # Main CanaryQwen class
│   ├── preprocessing.py  # Mel spectrogram extraction
│   ├── conformer_lite/   # FastConformer encoder
│   ├── qwen/             # Qwen3 LLM wrapper
│   └── weights.py        # Weight loading utilities
├── strops-rs/            # Rust/Python sequence alignment
│   ├── src/              # Rust source
│   └── nix/              # Nix packaging
└── nix/                  # Nix configuration
    ├── development.nix   # Dev shell
    ├── release.nix       # Package/module exports
    ├── packages/         # Nix packages
    └── modules/          # NixOS modules

License

MIT
