# Architecture

Behnam Ebrahimi edited this page Mar 29, 2026 · 1 revision
Vayu is organized as a single Python package (whisper_mlx) with a clean separation of concerns.
```
whisper_mlx/
├── __init__.py       # Public API exports
├── __main__.py       # python -m whisper_mlx entry point
├── cli.py            # Command-line interface (argparse)
├── lightning.py      # LightningWhisperMLX wrapper class
├── transcribe.py     # Core transcription pipeline (~730 lines)
├── whisper.py        # Whisper neural network architecture (MLX)
├── audio.py          # Audio loading (FFmpeg) & mel spectrogram
├── decoding.py       # Token generation, beam search, sampling
├── tokenizer.py      # Tiktoken wrapper with Whisper special tokens
├── load_models.py    # Model loading with path security validation
├── writers.py        # Output format writers (txt, srt, vtt, tsv, json)
├── timing.py         # Word-level timestamps (cross-attention + DTW)
├── utils.py          # Model name mappings & format helpers
├── constants.py      # Centralized constants
├── speculative.py    # Speculative decoding (experimental)
└── assets/           # Mel filters, tokenizer vocabularies
    ├── mel_filters.npz
    ├── gpt2.tiktoken
    └── multilingual.tiktoken
```
```
Audio File (any format)
        │
        ▼ FFmpeg via load_audio()
16kHz PCM Waveform (float32)
        │
        ▼ log_mel_spectrogram()
Mel Spectrogram (80/128 channels x 3000 frames)
        │
        ▼ Batched: stack batch_size segments
        │
        ├── AudioEncoder
        │     └── Conv layers + Transformer → Audio features
        │
        └── TextDecoder
              ├── Language detection (if needed)
              ├── Autoregressive token generation
              ├── Temperature fallback on quality failures
              └── Cross-attention extraction (for word timestamps)
        │
        ▼
Segments with text & timestamps
        │
        ▼ Output writers
.txt / .srt / .vtt / .tsv / .json
```
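The batching step in the pipeline above can be sketched in pure Python. This is a simplified illustration (the real pipeline operates on MLX arrays of mel frames, and `chunk_waveform` is an illustrative name, not the actual function):

```python
SAMPLE_RATE = 16_000                     # Hz, Whisper's fixed input rate
CHUNK_LENGTH = 30                        # seconds per chunk
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH   # 480,000 samples per 30 s chunk

def chunk_waveform(waveform, batch_size=4):
    """Split a PCM waveform into fixed 30 s chunks, then group the
    chunks into batches of batch_size for the encoder."""
    chunks = [waveform[i:i + N_SAMPLES] for i in range(0, len(waveform), N_SAMPLES)]
    if not chunks:
        return []
    # Whisper expects fixed-size input, so zero-pad the final partial chunk.
    chunks[-1] = chunks[-1] + [0.0] * (N_SAMPLES - len(chunks[-1]))
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
```

For example, 70 seconds of audio yields three chunks (the last one zero-padded), which with `batch_size=2` become one full batch and one partial batch.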
Models are cached to avoid reloading on repeated calls:

```python
class ModelHolder:
    @staticmethod
    def get_model(path_or_hf_repo, dtype) -> Whisper:
        ...  # Returns cached model if same path+dtype
```

Other helper entry points:

```python
get_writer("srt", output_dir)     # Returns WriteSRT instance
get_tokenizer(multilingual=True)  # Returns configured Tokenizer
resolve_model_path("turbo")       # Returns HuggingFace repo URL
```

The `lightning.py` wrapper simplifies the full `transcribe()` API for common use cases: it maps friendly model names to HuggingFace repos and handles quantization resolution.
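The caching behavior can be sketched as a single-slot cache keyed on `(path, dtype)`. This is illustrative only: the `loader` parameter and class name are stand-ins, not the actual implementation:

```python
class ModelHolderSketch:
    """Single-slot model cache: keeps the most recently loaded model and
    reuses it while the (path, dtype) key is unchanged."""
    _model = None
    _key = None

    @classmethod
    def get_model(cls, path_or_hf_repo, dtype, loader):
        key = (path_or_hf_repo, dtype)
        if cls._key != key:                       # cache miss: path or dtype changed
            cls._model = loader(path_or_hf_repo, dtype)
            cls._key = key
        return cls._model                         # cache hit: reuse loaded model
```

Repeated calls with the same arguments return the same object without invoking the loader again.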
When a decoded segment fails quality checks (high compression ratio or low log probability), Vayu retries with progressively higher sampling temperatures:

```
Temperatures: [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

t=0.0  → Greedy decode
  │        Fails quality check?
t=0.2  → Light sampling
  │        Still failing?
t=0.4+ → Continue escalating
```

This avoids hallucinations while maintaining output quality.
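The escalation loop can be sketched as follows. The thresholds are illustrative (they mirror common Whisper defaults, not values confirmed from this source), and `decode_fn` stands in for the real decoder:

```python
TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
COMPRESSION_RATIO_THRESHOLD = 2.4   # above this, output looks repetitive
LOGPROB_THRESHOLD = -1.0            # below this, the model was not confident

def decode_with_fallback(decode_fn):
    """Try each temperature in order until a decode passes quality checks.

    decode_fn(t) returns (text, compression_ratio, avg_logprob)."""
    result = None
    for t in TEMPERATURES:
        result = decode_fn(t)
        _text, compression_ratio, avg_logprob = result
        if (compression_ratio <= COMPRESSION_RATIO_THRESHOLD
                and avg_logprob >= LOGPROB_THRESHOLD):
            return t, result           # passed quality checks: stop escalating
    return TEMPERATURES[-1], result    # every temperature failed; keep last attempt
```

Greedy decoding (t=0.0) is tried first, and sampling noise is introduced only when needed, so clean segments pay no quality cost.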
`load_models.py` validates all model paths against a whitelist to prevent path traversal attacks:

- HuggingFace cache directory
- `/usr/local/share` system directory
- Custom directories via the `WHISPER_MLX_MODEL_DIRS` environment variable

Symbolic links are resolved before validation.
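A minimal sketch of this scheme, assuming the whitelist entries above (the exact directory paths and function names here are illustrative, not copied from the source):

```python
import os

def allowed_dirs():
    """Whitelist: HF cache, system dir, plus WHISPER_MLX_MODEL_DIRS entries."""
    dirs = [
        os.path.expanduser("~/.cache/huggingface"),  # HuggingFace cache
        "/usr/local/share",                          # system directory
    ]
    extra = os.environ.get("WHISPER_MLX_MODEL_DIRS", "")
    dirs += [d for d in extra.split(os.pathsep) if d]
    return dirs

def is_allowed_model_path(path):
    """Resolve symlinks and '..' first, then require the real path to
    live under one of the whitelisted directories."""
    real = os.path.realpath(path)
    for base in allowed_dirs():
        base = os.path.realpath(base)
        if os.path.commonpath([real, base]) == base:
            return True
    return False
```

Resolving with `realpath()` before the prefix check is the important step: a traversal payload like `…/share/../../etc/passwd` normalizes to a path outside the whitelist and is rejected.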
All audio processing uses Whisper's standard parameters:

| Constant | Value | Meaning |
|---|---|---|
| `SAMPLE_RATE` | 16,000 Hz | Input sample rate |
| `N_FFT` | 400 | FFT window size |
| `HOP_LENGTH` | 160 | Samples between frames |
| `CHUNK_LENGTH` | 30 seconds | Audio chunk size |
| `N_FRAMES` | 3,000 | Mel frames per chunk |
| `TOKENS_PER_SECOND` | 50 | 20 ms per token |
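These constants are interdependent; a quick check of the arithmetic:

```python
SAMPLE_RATE = 16_000
HOP_LENGTH = 160
CHUNK_LENGTH = 30
N_FRAMES = 3_000
TOKENS_PER_SECOND = 50

# One mel frame every HOP_LENGTH samples → 100 frames per second,
# so a 30 s chunk yields exactly N_FRAMES frames.
frames_per_second = SAMPLE_RATE // HOP_LENGTH              # 100
assert CHUNK_LENGTH * frames_per_second == N_FRAMES        # 30 * 100 == 3000

# 50 tokens per second means each token covers 20 ms, i.e. two mel frames.
ms_per_token = 1000 // TOKENS_PER_SECOND                   # 20
frames_per_token = frames_per_second // TOKENS_PER_SECOND  # 2
```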