
Examples and Use Cases

Behnam Ebrahimi edited this page Mar 29, 2026 · 1 revision


Podcast Transcription

Transcribe a long podcast episode with high accuracy:

from whisper_mlx import LightningWhisperMLX

whisper = LightningWhisperMLX(model="distil-large-v3", batch_size=12)
result = whisper.transcribe("podcast_episode.mp3", language="en")

# Save as plain text
with open("transcript.txt", "w") as f:
    f.write(result["text"])

CLI equivalent:

vayu podcast_episode.mp3 --model distil-large-v3 --batch-size 12 -f txt -o ./transcripts

Subtitle Generation

Generate SRT subtitles for a video:

vayu video_audio.mp3 --batch-size 12 --output-format srt --word-timestamps True

With word-by-word highlighting (karaoke-style):

vayu video_audio.mp3 --batch-size 12 -f srt \
    --word-timestamps True \
    --highlight-words True \
    --max-line-width 42 \
    --max-line-count 2

Generate WebVTT for HTML5 video:

vayu video_audio.mp3 --batch-size 12 -f vtt --word-timestamps True
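vayu writes SRT files directly, but the same formatting can be done from Python when you need custom post-processing. A minimal sketch (not part of the library) that renders Whisper-style segment dicts — start/end in seconds, as returned by transcribe in the examples below — into SRT text:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render segment dicts (start, end, text) as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Hard-coded segments in the shape transcribe() returns
segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": " Let's get started."},
]
print(segments_to_srt(segments))
```

Write the returned string to a .srt file and most video players will pick it up alongside the video.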

Meeting Notes

Transcribe a meeting recording with speaker context:

from whisper_mlx import LightningWhisperMLX

whisper = LightningWhisperMLX(model="large-v3", batch_size=6)

result = whisper.transcribe(
    "meeting.m4a",
    language="en",
    word_timestamps=True,
    initial_prompt="Meeting participants: Alice, Bob, Charlie. Topic: Q4 planning.",
)

# Print timestamped segments
for seg in result["segments"]:
    minutes = int(seg["start"] // 60)
    seconds = int(seg["start"] % 60)
    print(f"[{minutes:02d}:{seconds:02d}] {seg['text'].strip()}")

The initial_prompt helps the model with proper nouns and domain-specific vocabulary.


Batch Processing

Transcribe an entire directory of audio files:

# All MP3 files in a directory
vayu recordings/*.mp3 --batch-size 12 -f all -o ./transcripts

# JSON output per file (the batch continues past files that fail)
vayu recordings/*.mp3 --batch-size 12 -f json -o ./transcripts

In Python:

from pathlib import Path
from whisper_mlx import LightningWhisperMLX

whisper = LightningWhisperMLX(model="distil-large-v3", batch_size=12)

audio_dir = Path("recordings")
output_dir = Path("transcripts")
output_dir.mkdir(exist_ok=True)

for audio_file in sorted(audio_dir.glob("*.mp3")):
    print(f"Processing: {audio_file.name}")
    result = whisper.transcribe(str(audio_file), language="en")

    output_path = output_dir / f"{audio_file.stem}.txt"
    output_path.write_text(result["text"])
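If one corrupt file should not abort the whole batch, wrap the per-file work in a try/except and collect failures instead of raising. A sketch using a plain helper (the `transcribe_all` function is illustrative, not part of the library):

```python
from pathlib import Path

def transcribe_all(transcribe, audio_dir, output_dir, pattern="*.mp3"):
    """Run `transcribe` over every matching file; collect failures instead of raising."""
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)
    failures = []
    for audio_file in sorted(Path(audio_dir).glob(pattern)):
        try:
            result = transcribe(str(audio_file))
            (output_dir / f"{audio_file.stem}.txt").write_text(result["text"])
        except Exception as exc:  # keep the batch alive; report at the end
            failures.append((audio_file.name, exc))
    return failures

# Usage with the model from above:
# whisper = LightningWhisperMLX(model="distil-large-v3", batch_size=12)
# failed = transcribe_all(lambda p: whisper.transcribe(p, language="en"),
#                         "recordings", "transcripts")
```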

Translation

Translate non-English audio to English text:

from whisper_mlx import LightningWhisperMLX

whisper = LightningWhisperMLX(model="large-v3", batch_size=6)

# Spanish audio → English text
result = whisper.transcribe("spanish_interview.mp3", language="es", task="translate")
print(result["text"])  # English translation

CLI equivalent:

vayu spanish_interview.mp3 --language es --task translate --batch-size 6

YouTube Video Transcription

# Step 1: Extract audio with yt-dlp
yt-dlp -x --audio-format mp3 -o "video.mp3" "https://youtube.com/watch?v=VIDEO_ID"

# Step 2: Transcribe
vayu video.mp3 --batch-size 12 -f srt -o ./subtitles

Audio Clip Processing

Transcribe specific portions of a long audio file:

# Process only 0-60s and 120-180s
vayu long_recording.mp3 --clip-timestamps "0,60,120,180" --batch-size 12
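The flag's value is a flat comma-separated list that pairs up into (start, end) ranges. A small illustrative parser (not part of vayu) makes the pairing explicit:

```python
def parse_clip_timestamps(spec: str):
    """Turn "0,60,120,180" into [(0.0, 60.0), (120.0, 180.0)] second ranges."""
    values = [float(v) for v in spec.split(",")]
    if len(values) % 2:
        raise ValueError("clip timestamps must come in start,end pairs")
    return list(zip(values[::2], values[1::2]))

print(parse_clip_timestamps("0,60,120,180"))  # → [(0.0, 60.0), (120.0, 180.0)]
```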

Low Memory Usage

For Macs with limited RAM or when running alongside other apps:

from whisper_mlx import LightningWhisperMLX

# 4-bit quantized model uses ~4x less memory
whisper = LightningWhisperMLX(model="distil-large-v3", quant="4bit", batch_size=6)
result = whisper.transcribe("audio.mp3")

CLI alternative — use a tiny model with a high batch size:

vayu audio.mp3 --model tiny --batch-size 32

JSON Processing Pipeline

Extract structured data from transcriptions:

import json
from whisper_mlx import LightningWhisperMLX

whisper = LightningWhisperMLX(model="distil-large-v3", batch_size=12)

result = whisper.transcribe("interview.mp3", language="en", word_timestamps=True)

# Build a searchable index
index = []
for seg in result["segments"]:
    index.append({
        "start": seg["start"],
        "end": seg["end"],
        "text": seg["text"].strip(),
        "words": seg.get("words", []),
    })

with open("searchable_index.json", "w") as f:
    json.dump(index, f, indent=2)
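With the index on disk, a keyword search over segment text is a few lines. A sketch against the index structure built above (the `search_index` helper is illustrative):

```python
import json

def search_index(index_path: str, query: str):
    """Return (start, end, text) for every segment containing `query`."""
    with open(index_path) as f:
        index = json.load(f)
    q = query.lower()
    return [
        (entry["start"], entry["end"], entry["text"])
        for entry in index
        if q in entry["text"].lower()
    ]

# for start, end, text in search_index("searchable_index.json", "budget"):
#     print(f"{start:.1f}s-{end:.1f}s: {text}")
```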

Reading from stdin

Pipe audio directly from another process:

# From FFmpeg conversion
ffmpeg -i video.mkv -vn -ar 16000 -ac 1 -f wav - | vayu - --output-name video_transcript

# From a download
curl -sL "https://example.com/audio.mp3" | vayu - --output-name downloaded

Multiple Output Formats

Generate all formats at once for different consumers:

vayu lecture.mp3 --batch-size 12 -f all -o ./output --word-timestamps True
# Creates: lecture.txt, lecture.srt, lecture.vtt, lecture.tsv, lecture.json
  • .txt — for reading / search indexing
  • .srt — for video players
  • .vtt — for web embedding
  • .tsv — for spreadsheet analysis
  • .json — for programmatic access
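The .tsv output loads directly with the standard csv module. A sketch, assuming the usual Whisper column layout of start, end, text (verify the header row of a generated file before relying on it):

```python
import csv

def read_transcript_tsv(path: str):
    """Load a Whisper-style TSV (start, end, text columns) into a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))

# rows = read_transcript_tsv("output/lecture.tsv")
# print(rows[0]["text"])
```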
