
Architecture

Behnam Ebrahimi edited this page Mar 29, 2026 · 1 revision

Overview

Vayu is organized as a single Python package (whisper_mlx) with a clean separation of concerns.

Module Map

whisper_mlx/
├── __init__.py        # Public API exports
├── __main__.py        # python -m whisper_mlx entry point
├── cli.py             # Command-line interface (argparse)
├── lightning.py       # LightningWhisperMLX wrapper class
├── transcribe.py      # Core transcription pipeline (~730 lines)
├── whisper.py         # Whisper neural network architecture (MLX)
├── audio.py           # Audio loading (FFmpeg) & mel spectrogram
├── decoding.py        # Token generation, beam search, sampling
├── tokenizer.py       # Tiktoken wrapper with Whisper special tokens
├── load_models.py     # Model loading with path security validation
├── writers.py         # Output format writers (txt, srt, vtt, tsv, json)
├── timing.py          # Word-level timestamps (cross-attention + DTW)
├── utils.py           # Model name mappings & format helpers
├── constants.py       # Centralized constants
├── speculative.py     # Speculative decoding (experimental)
└── assets/            # Mel filters, tokenizer vocabularies
    ├── mel_filters.npz
    ├── gpt2.tiktoken
    └── multilingual.tiktoken

Data Flow

Audio File (any format)
    │
    ▼ FFmpeg via load_audio()
16kHz PCM Waveform (float32)
    │
    ▼ log_mel_spectrogram()
Mel Spectrogram (80/128 channels x 3000 frames)
    │
    ▼ Batched: stack batch_size segments
    │
    ├── AudioEncoder
    │   └── Conv layers + Transformer → Audio features
    │
    └── TextDecoder
        ├── Language detection (if needed)
        ├── Autoregressive token generation
        ├── Temperature fallback on quality failures
        └── Cross-attention extraction (for word timestamps)
            │
            ▼
    Segments with text & timestamps
            │
            ▼ Output writers
    .txt / .srt / .vtt / .tsv / .json
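The 30-second chunking step in the flow above can be sketched in plain Python. This is an illustrative stand-in, not the package's actual implementation (which operates on NumPy/MLX arrays); `split_into_chunks` is a hypothetical helper, while `SAMPLE_RATE` and `CHUNK_LENGTH` match the constants table later on this page:

```python
# Hypothetical sketch of the 30-second chunking stage (not the real
# whisper_mlx code): split a PCM waveform into fixed-size segments,
# zero-padding the final one so every chunk is exactly 30 s long.
SAMPLE_RATE = 16_000                    # Hz, Whisper's required input rate
CHUNK_LENGTH = 30                       # seconds per segment
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH  # 480,000 samples per chunk

def split_into_chunks(waveform: list[float]) -> list[list[float]]:
    """Split a waveform into zero-padded 30-second chunks."""
    chunks = []
    for start in range(0, len(waveform), N_SAMPLES):
        chunk = waveform[start:start + N_SAMPLES]
        chunk = chunk + [0.0] * (N_SAMPLES - len(chunk))  # pad the tail
        chunks.append(chunk)
    return chunks

# A 45-second waveform yields two chunks; the second is half padding.
chunks = split_into_chunks([0.1] * (45 * SAMPLE_RATE))
```

Padding to a full 30 s matters because the encoder always consumes a fixed 3,000-frame mel spectrogram per chunk.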

Key Design Patterns

Singleton — ModelHolder

Models are cached to avoid reloading on repeated calls:

class ModelHolder:
    @staticmethod
    def get_model(path_or_hf_repo, dtype) -> "Whisper":
        # Returns the cached model if path and dtype match a prior load
        ...
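The caching pattern can be shown with a self-contained sketch. The real `ModelHolder` loads MLX `Whisper` weights; here a stub string stands in for the model so only the pattern is visible:

```python
# Minimal sketch of the ModelHolder caching pattern (stubbed loader,
# not the actual MLX weight loading in whisper_mlx).
class ModelHolder:
    _cache: dict = {}  # (path, dtype) -> loaded model

    @classmethod
    def get_model(cls, path_or_hf_repo: str, dtype: str):
        key = (path_or_hf_repo, dtype)
        if key not in cls._cache:
            # The real package would load Whisper weights here.
            cls._cache[key] = f"model<{path_or_hf_repo}:{dtype}>"
        return cls._cache[key]

a = ModelHolder.get_model("mlx-community/whisper-turbo", "float16")
b = ModelHolder.get_model("mlx-community/whisper-turbo", "float16")
assert a is b  # same path+dtype returns the cached instance
```

Keying the cache on `(path, dtype)` means requesting the same model at a different precision triggers a fresh load rather than returning a mismatched cached copy.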

Factory — Writers, Tokenizers, Model Resolution

get_writer("srt", output_dir)     # Returns WriteSRT instance
get_tokenizer(multilingual=True)  # Returns configured Tokenizer
resolve_model_path("turbo")       # Returns HuggingFace repo URL
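The writer factory reduces to a dictionary dispatch. The sketch below is simplified (the real writer classes format and stream segments to files); the `_WRITERS` table and constructor signature here are illustrative assumptions:

```python
# Simplified sketch of the get_writer factory; real writers also
# implement per-format segment serialization.
class _Writer:
    def __init__(self, output_dir: str):
        self.output_dir = output_dir

class WriteTXT(_Writer): pass
class WriteSRT(_Writer): pass
class WriteVTT(_Writer): pass

_WRITERS = {"txt": WriteTXT, "srt": WriteSRT, "vtt": WriteVTT}

def get_writer(output_format: str, output_dir: str) -> _Writer:
    """Return a writer instance for the requested output format."""
    if output_format not in _WRITERS:
        raise ValueError(f"unknown output format: {output_format!r}")
    return _WRITERS[output_format](output_dir)

writer = get_writer("srt", "./out")
```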

Wrapper — LightningWhisperMLX

Wraps the full transcribe() API behind a simpler interface for common use cases: it maps friendly model names to their HuggingFace repos and resolves quantization settings.

Temperature Fallback Strategy

When a decoded segment fails quality checks (high compression ratio or low log probability), Vayu retries with progressively higher sampling temperatures:

Temps: [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
         │
    t=0.0 → Greedy decode
         │ Fails quality check?
    t=0.2 → Light sampling
         │ Still failing?
    t=0.4+ → Continue escalating

This escalation recovers from the repetition loops and hallucinations that greedy decoding can produce, while keeping deterministic output whenever the quality checks pass.
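The fallback loop above can be sketched with a stubbed decoder. The threshold values and result fields below are illustrative, not necessarily Vayu's exact ones:

```python
# Sketch of the temperature-fallback strategy: retry decoding at rising
# temperatures until the segment passes both quality checks.
TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)

def decode_with_fallback(decode_fn,
                         compression_ratio_threshold=2.4,
                         logprob_threshold=-1.0):
    """Return the first result whose quality checks pass, else the last."""
    result = None
    for t in TEMPERATURES:
        result = decode_fn(t)
        ok = (result["compression_ratio"] <= compression_ratio_threshold
              and result["avg_logprob"] >= logprob_threshold)
        if ok:
            return result
    return result  # last attempt, even if it still fails the checks

# Stub decoder: greedy (t=0.0) output is degenerate; t=0.2 succeeds.
def fake_decode(t):
    if t == 0.0:
        return {"compression_ratio": 3.1, "avg_logprob": -0.5, "t": t}
    return {"compression_ratio": 1.8, "avg_logprob": -0.4, "t": t}

result = decode_with_fallback(fake_decode)
```

A high compression ratio flags repetitive text (it compresses too well), and a low average log probability flags output the model itself was unsure about.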

Security

Model Path Validation

load_models.py validates all model paths against a whitelist to prevent path traversal attacks:

  • HuggingFace cache directory
  • /usr/local/share system directory
  • Custom directories via WHISPER_MLX_MODEL_DIRS environment variable

Symbolic links are resolved before validation.
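The resolve-then-check idea can be sketched with `pathlib`. The directory list and function name below are illustrative; the real checks live in `whisper_mlx/load_models.py`:

```python
from pathlib import Path

# Sketch of allowlist-based model path validation (illustrative names,
# not the exact whisper_mlx implementation).
ALLOWED_DIRS = [Path("/usr/local/share").resolve()]

def is_allowed_model_path(candidate: str) -> bool:
    """Resolve symlinks/'..' first, then require the path to sit
    under one of the allowed directories."""
    resolved = Path(candidate).resolve()
    return any(resolved.is_relative_to(d) for d in ALLOWED_DIRS)

# A ".."-based traversal escapes the allowed tree and is rejected.
print(is_allowed_model_path("/usr/local/share/models/turbo"))      # True
print(is_allowed_model_path("/usr/local/share/../../etc/passwd"))  # False
```

Resolving before checking is the crucial ordering: validating the raw string would let `..` segments or symlinks escape the allowlist after the check.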

Audio Constants

All audio processing uses Whisper's standard parameters:

Constant           Value      Meaning
SAMPLE_RATE        16,000 Hz  Input sample rate
N_FFT              400        FFT window size (samples)
HOP_LENGTH         160        Samples between frames
CHUNK_LENGTH       30 s       Audio chunk size
N_FRAMES           3,000      Mel frames per chunk
TOKENS_PER_SECOND  50         One token per 20 ms of audio
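These values are mutually consistent, which a few lines of arithmetic confirm:

```python
# Sanity-check the relationships between Whisper's audio constants.
SAMPLE_RATE = 16_000    # Hz
N_FFT = 400             # FFT window size (25 ms at 16 kHz)
HOP_LENGTH = 160        # samples between frames (10 ms)
CHUNK_LENGTH = 30       # seconds per chunk
N_FRAMES = 3_000        # mel frames per chunk
TOKENS_PER_SECOND = 50  # one token per 20 ms of audio

assert CHUNK_LENGTH * SAMPLE_RATE // HOP_LENGTH == N_FRAMES  # 30 s / 10 ms hop
assert 1000 // TOKENS_PER_SECOND == 20                       # 20 ms per token
assert N_FFT / SAMPLE_RATE == 0.025                          # 25 ms window
```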
