
Architecture

Behnam Ebrahimi edited this page Mar 29, 2026 · 1 revision

Overview

Vayu is organized as a single Python package (whisper_mlx) with a clean separation of concerns.

Module Map

whisper_mlx/
├── __init__.py        # Public API exports
├── __main__.py        # python -m whisper_mlx entry point
├── cli.py             # Command-line interface (argparse)
├── lightning.py       # LightningWhisperMLX wrapper class
├── transcribe.py      # Core transcription pipeline (~730 lines)
├── whisper.py         # Whisper neural network architecture (MLX)
├── audio.py           # Audio loading (FFmpeg) & mel spectrogram
├── decoding.py        # Token generation, beam search, sampling
├── tokenizer.py       # Tiktoken wrapper with Whisper special tokens
├── load_models.py     # Model loading with path security validation
├── writers.py         # Output format writers (txt, srt, vtt, tsv, json)
├── timing.py          # Word-level timestamps (cross-attention + DTW)
├── utils.py           # Model name mappings & format helpers
├── constants.py       # Centralized constants
├── speculative.py     # Speculative decoding (experimental)
└── assets/            # Mel filters, tokenizer vocabularies
    ├── mel_filters.npz
    ├── gpt2.tiktoken
    └── multilingual.tiktoken

Data Flow

Audio File (any format)
    │
    ▼ FFmpeg via load_audio()
16kHz PCM Waveform (float32)
    │
    ▼ log_mel_spectrogram()
Mel Spectrogram (80/128 channels x 3000 frames)
    │
    ▼ Batched: stack batch_size segments
    │
    ├── AudioEncoder
    │   └── Conv layers + Transformer → Audio features
    │
    └── TextDecoder
        ├── Language detection (if needed)
        ├── Autoregressive token generation
        ├── Temperature fallback on quality failures
        └── Cross-attention extraction (for word timestamps)
            │
            ▼
    Segments with text & timestamps
            │
            ▼ Output writers
    .txt / .srt / .vtt / .tsv / .json
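The 30-second chunking step in the flow above can be sketched in plain Python. This is an illustrative stand-in, not the package's actual implementation (which operates on NumPy/MLX arrays); `split_into_chunks` is a hypothetical helper, while `SAMPLE_RATE` and `CHUNK_LENGTH` match the constants table later on this page:

```python
# Hypothetical sketch of the 30-second chunking stage (not the real
# whisper_mlx code): split a PCM waveform into fixed-size segments,
# zero-padding the final one so every chunk is exactly 30 s long.
SAMPLE_RATE = 16_000                    # Hz, Whisper's required input rate
CHUNK_LENGTH = 30                       # seconds per segment
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH  # 480,000 samples per chunk

def split_into_chunks(waveform: list[float]) -> list[list[float]]:
    """Split a waveform into zero-padded 30-second chunks."""
    chunks = []
    for start in range(0, len(waveform), N_SAMPLES):
        chunk = waveform[start:start + N_SAMPLES]
        chunk = chunk + [0.0] * (N_SAMPLES - len(chunk))  # pad the tail
        chunks.append(chunk)
    return chunks

# A 45-second waveform yields two chunks; the second is half padding.
chunks = split_into_chunks([0.1] * (45 * SAMPLE_RATE))
```

Padding to a full 30 s matters because the encoder always consumes a fixed 3,000-frame mel spectrogram per chunk.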

Key Design Patterns

Singleton — ModelHolder

Models are cached to avoid reloading on repeated calls:

class ModelHolder:
    @staticmethod
    def get_model(path_or_hf_repo, dtype) -> "Whisper":
        # Returns the cached model if path and dtype match a prior load
        ...
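The caching pattern can be shown with a self-contained sketch. The real `ModelHolder` loads MLX `Whisper` weights; here a stub string stands in for the model so only the pattern is visible:

```python
# Minimal sketch of the ModelHolder caching pattern (stubbed loader,
# not the actual MLX weight loading in whisper_mlx).
class ModelHolder:
    _cache: dict = {}  # (path, dtype) -> loaded model

    @classmethod
    def get_model(cls, path_or_hf_repo: str, dtype: str):
        key = (path_or_hf_repo, dtype)
        if key not in cls._cache:
            # The real package would load Whisper weights here.
            cls._cache[key] = f"model<{path_or_hf_repo}:{dtype}>"
        return cls._cache[key]

a = ModelHolder.get_model("mlx-community/whisper-turbo", "float16")
b = ModelHolder.get_model("mlx-community/whisper-turbo", "float16")
assert a is b  # same path+dtype returns the cached instance
```

Keying the cache on `(path, dtype)` means requesting the same model at a different precision triggers a fresh load rather than returning a mismatched cached copy.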

Factory — Writers, Tokenizers, Model Resolution

get_writer("srt", output_dir)     # Returns WriteSRT instance
get_tokenizer(multilingual=True)  # Returns configured Tokenizer
resolve_model_path("turbo")       # Returns HuggingFace repo URL
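The writer factory reduces to a dictionary dispatch. The sketch below is simplified (the real writer classes format and stream segments to files); the `_WRITERS` table and constructor signature here are illustrative assumptions:

```python
# Simplified sketch of the get_writer factory; real writers also
# implement per-format segment serialization.
class _Writer:
    def __init__(self, output_dir: str):
        self.output_dir = output_dir

class WriteTXT(_Writer): pass
class WriteSRT(_Writer): pass
class WriteVTT(_Writer): pass

_WRITERS = {"txt": WriteTXT, "srt": WriteSRT, "vtt": WriteVTT}

def get_writer(output_format: str, output_dir: str) -> _Writer:
    """Return a writer instance for the requested output format."""
    if output_format not in _WRITERS:
        raise ValueError(f"unknown output format: {output_format!r}")
    return _WRITERS[output_format](output_dir)

writer = get_writer("srt", "./out")
```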

Wrapper — LightningWhisperMLX

Wraps the full transcribe() API behind a simpler interface for common use cases: it maps friendly model names to their HuggingFace repos and resolves quantization settings.

Temperature Fallback Strategy

When a decoded segment fails quality checks (high compression ratio or low log probability), Vayu retries with progressively higher sampling temperatures:

Temps: [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
         │
    t=0.0 → Greedy decode
         │ Fails quality check?
    t=0.2 → Light sampling
         │ Still failing?
    t=0.4+ → Continue escalating

This escalation recovers from the repetition loops and hallucinations that greedy decoding can produce, while keeping deterministic output whenever the quality checks pass.
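The fallback loop above can be sketched with a stubbed decoder. The threshold values and result fields below are illustrative, not necessarily Vayu's exact ones:

```python
# Sketch of the temperature-fallback strategy: retry decoding at rising
# temperatures until the segment passes both quality checks.
TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)

def decode_with_fallback(decode_fn,
                         compression_ratio_threshold=2.4,
                         logprob_threshold=-1.0):
    """Return the first result whose quality checks pass, else the last."""
    result = None
    for t in TEMPERATURES:
        result = decode_fn(t)
        ok = (result["compression_ratio"] <= compression_ratio_threshold
              and result["avg_logprob"] >= logprob_threshold)
        if ok:
            return result
    return result  # last attempt, even if it still fails the checks

# Stub decoder: greedy (t=0.0) output is degenerate; t=0.2 succeeds.
def fake_decode(t):
    if t == 0.0:
        return {"compression_ratio": 3.1, "avg_logprob": -0.5, "t": t}
    return {"compression_ratio": 1.8, "avg_logprob": -0.4, "t": t}

result = decode_with_fallback(fake_decode)
```

A high compression ratio flags repetitive text (it compresses too well), and a low average log probability flags output the model itself was unsure about.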

Security

Model Path Validation

load_models.py validates all model paths against a whitelist to prevent path traversal attacks:

  • HuggingFace cache directory
  • /usr/local/share system directory
  • Custom directories via WHISPER_MLX_MODEL_DIRS environment variable

Symbolic links are resolved before validation.
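The resolve-then-check idea can be sketched with `pathlib`. The directory list and function name below are illustrative; the real checks live in `whisper_mlx/load_models.py`:

```python
from pathlib import Path

# Sketch of allowlist-based model path validation (illustrative names,
# not the exact whisper_mlx implementation).
ALLOWED_DIRS = [Path("/usr/local/share").resolve()]

def is_allowed_model_path(candidate: str) -> bool:
    """Resolve symlinks/'..' first, then require the path to sit
    under one of the allowed directories."""
    resolved = Path(candidate).resolve()
    return any(resolved.is_relative_to(d) for d in ALLOWED_DIRS)

# A ".."-based traversal escapes the allowed tree and is rejected.
print(is_allowed_model_path("/usr/local/share/models/turbo"))      # True
print(is_allowed_model_path("/usr/local/share/../../etc/passwd"))  # False
```

Resolving before checking is the crucial ordering: validating the raw string would let `..` segments or symlinks escape the allowlist after the check.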

Audio Constants

All audio processing uses Whisper's standard parameters:

Constant           Value      Meaning
SAMPLE_RATE        16,000 Hz  Input sample rate
N_FFT              400        FFT window size (samples)
HOP_LENGTH         160        Samples between frames
CHUNK_LENGTH       30 s       Audio chunk size
N_FRAMES           3,000      Mel frames per chunk
TOKENS_PER_SECOND  50         One token per 20 ms of audio
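These values are mutually consistent, which a few lines of arithmetic confirm:

```python
# Sanity-check the relationships between Whisper's audio constants.
SAMPLE_RATE = 16_000    # Hz
N_FFT = 400             # FFT window size (25 ms at 16 kHz)
HOP_LENGTH = 160        # samples between frames (10 ms)
CHUNK_LENGTH = 30       # seconds per chunk
N_FRAMES = 3_000        # mel frames per chunk
TOKENS_PER_SECOND = 50  # one token per 20 ms of audio

assert CHUNK_LENGTH * SAMPLE_RATE // HOP_LENGTH == N_FRAMES  # 30 s / 10 ms hop
assert 1000 // TOKENS_PER_SECOND == 20                       # 20 ms per token
assert N_FFT / SAMPLE_RATE == 0.025                          # 25 ms window
```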
