Local on-Mac audio → timestamped transcript pipeline. Accepts Olympus
.ds2/.dss recordings (via the sibling ds2-converter project) as well as
.mp3/.wav/.m4a/.flac/.opus/.ogg/.aac/.aiff/.aif, normalizes
everything to 16 kHz mono WAV, runs Silero VAD to skip silence, then runs
mlx-whisper (whisper-large-v3-turbo on Apple MLX). Output is a
.transcript.txt (segment-level timestamps inline) and a .transcript.json
sidecar (full Whisper output incl. word timestamps).
The open-source dss-codec WASM decoder we use mis-decodes DS2 files
recorded with Insert mode (audio spliced into the middle of an existing
recording). The codec-stuck and bitrate-anomaly detectors will warn when a
file triggers the bug. When that happens:
- Open the offending `.DS2` file in DSS Player for Mac (Olympus's official decoder; it handles inserts correctly).
- File → Save As → AIFF to export the audio.
- Drop the resulting `.aif` into `~/Audio/Inbox/` and re-run `transcribe`.

The pipeline now accepts AIFF inputs alongside the other formats, so the recovered audio flows through the same VAD → Whisper path as everything else.
The transcript is meant to be pasted into a Claude.ai project for the final stitch / cleanup step — that step is intentionally not automated here.

Directory layout:

    ~/Audio/Inbox/                 you drop files here
    ~/Audio/Done/YYYY-MM-DD/       sources move here after success
    ~/Audio/Out/YYYY-MM-DD/        transcripts land here
    ~/Audio/Prompts/default.txt    (optional) glossary loaded automatically

Install:

    brew install ffmpeg pipx
    pipx ensurepath
    # ds2-converter must already be `npm link`-ed so `ds2-convert` is on PATH
    cd ~/Code/transcribe
    pipx install --editable .
    pipx inject transcribe silero-vad   # for VAD pre-segmentation

Usage:

    # Process everything in ~/Audio/Inbox/
    transcribe
    # Or transcribe specific files (sources are left in place)
    transcribe path/to/recording1.ds2 path/to/recording2.mp3

The first run downloads the whisper-large-v3-turbo weights (~800 MB) into `~/.cache/huggingface/hub/`. Subsequent runs reuse the cached weights.

- Format detection: `.ds2`/`.dss` go through `ds2-convert`; everything else goes through `ffmpeg` for normalization to 16 kHz mono WAV.
- Silero VAD: speech regions are detected and merged aggressively (gaps < 2 s collapsed, blips < 0.5 s dropped, ±0.5 s padding). Silence between speech regions is never sent to Whisper. Disable with `--no-vad`.
- mlx-whisper transcription with `whisper-large-v3-turbo` (English accuracy on par with `large-v3` at 6-8× the speed). Decoder tuned for solo dictation: `condition_on_previous_text=False`, `hallucination_silence_threshold=2.0`, plus tightened `no_speech_threshold`, `compression_ratio_threshold`, `logprob_threshold`.
- Initial prompt: a glossary/style hint of up to 224 tokens that biases the decoder's vocabulary priors. See "Prompts" below.
- Bag-of-Hallucinations filter: drops segments matching known Whisper hallucinations ("Thank you for watching", "you", "Bye", etc.) and segments with within-segment 3-gram looping. Per Barański et al., arXiv:2501.11378.

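
The Bag-of-Hallucinations step can be pictured with a small sketch. This is not the pipeline's actual code: the function names, the phrase set (only the three phrases quoted above are taken from this README), and the `min_repeats=3` loop threshold are all assumptions chosen for illustration.

```python
import re

# Assumed phrase list: only these three phrases are quoted in the README.
KNOWN_HALLUCINATIONS = {"thank you for watching", "you", "bye"}


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s']", "", text.lower())
    return " ".join(cleaned.split())


def has_trigram_loop(text, min_repeats=3):
    """True if any 3-word sequence occurs at least min_repeats times.

    min_repeats is a hypothetical threshold, not a documented setting.
    """
    words = normalize(text).split()
    counts = {}
    for i in range(len(words) - 2):
        gram = tuple(words[i:i + 3])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] >= min_repeats:
            return True
    return False


def filter_segments(segments):
    """Drop segments that are a known hallucination or loop on a 3-gram."""
    kept = []
    for seg in segments:
        if normalize(seg["text"]) in KNOWN_HALLUCINATIONS:
            continue
        if has_trigram_loop(seg["text"]):
            continue
        kept.append(seg)
    return kept
```

Matching on the normalized whole segment (rather than a substring) keeps a legitimate sentence that merely contains the word "you" from being dropped.
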
A 224-token (~150-180 word) prefix is sent to Whisper to bias vocabulary and style priors. Three sources are checked, highest priority first:
- `--prompt "..."` or `--prompt-file PATH` on the command line.
- Sidecar file `<audio>.prompt` in the same directory as the audio (e.g., `~/Audio/Inbox/foo.mp3.prompt`).
- `~/Audio/Prompts/default.txt` (loaded automatically if it exists).

`--no-prompt` disables all sources for one run.

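
The priority order above could be resolved roughly as follows. This is a minimal sketch under stated assumptions: `resolve_prompt` and its parameter names are hypothetical, and the real CLI may read and trim files differently.

```python
from pathlib import Path
from typing import Optional

# Default glossary location as described in this README.
DEFAULT_PROMPT_FILE = Path.home() / "Audio" / "Prompts" / "default.txt"


def resolve_prompt(
    audio: Path,
    cli_prompt: Optional[str] = None,
    cli_prompt_file: Optional[Path] = None,
    no_prompt: bool = False,
    default_file: Path = DEFAULT_PROMPT_FILE,
) -> Optional[str]:
    """Return the prompt text to use for one audio file, or None."""
    if no_prompt:                      # --no-prompt wins over everything
        return None
    if cli_prompt is not None:         # --prompt "..."
        return cli_prompt
    if cli_prompt_file is not None:    # --prompt-file PATH
        return Path(cli_prompt_file).read_text(encoding="utf-8").strip()
    # Sidecar: foo.mp3 -> foo.mp3.prompt, in the same directory.
    sidecar = audio.with_name(audio.name + ".prompt")
    for candidate in (sidecar, default_file):
        if candidate.is_file():
            return candidate.read_text(encoding="utf-8").strip()
    return None
```
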
What works in a prompt: glossary of proper nouns spelled the way you want them, domain vocabulary, 1-2 sentences in your preferred style. The model follows the style of the prompt — it does NOT follow instructions like "transcribe carefully" or "fix grammar."
What doesn't work: instructions, long backstory (anything past 224 tokens is silently dropped), style hints that contradict the audio.
Known limitation: because `condition_on_previous_text=False` (a must for suppressing hallucinations), the prompt only seeds the first ~30 s of each recording. Names mentioned later in a long file are decoded without the prompt's vocabulary bias. The Whisper architecture doesn't have a clean way around this.

`recording.transcript.txt`:

    [00:00.000 → 00:03.420] Hello, this is a test recording.
    [00:03.420 → 00:07.812] Thinking out loud about a few things today.

`recording.transcript.json` contains the full mlx-whisper result dict (segments with word-level timestamps, language detection, etc.).

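
For reference, the text format shown above can be reproduced from a segment's `start`, `end`, and `text` fields with a few lines of Python. The helper names are hypothetical, and how the real tool renders recordings longer than an hour is not shown in this README; this sketch simply lets the minutes field grow past 59.

```python
def format_timestamp(seconds):
    """Render seconds as MM:SS.mmm (minutes are not capped at 59)."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segment(seg):
    """Render one Whisper-style segment dict as a transcript line."""
    start = format_timestamp(seg["start"])
    end = format_timestamp(seg["end"])
    return f"[{start} → {end}] {seg['text'].strip()}"
```
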

    transcribe [--quiet | --show-segments]
               [--prompt TEXT | --prompt-file PATH | --no-prompt]
               [--no-vad]
               [--model MODEL]
               [files ...]

| Flag | Effect |
|---|---|
| (default) | tqdm progress bar (% through audio + ETA) |
| `-v` / `--show-segments` | streams each Whisper segment to stdout as decoded |
| `-q` / `--quiet` | silent until each file finishes |
| `--prompt TEXT` | inline initial_prompt (≤224 tokens; overrides sidecar/default) |
| `--prompt-file PATH` | read prompt from a file (overrides sidecar/default) |
| `--no-prompt` | disable all prompt sources for this run |
| `--no-vad` | skip Silero VAD pre-segmentation (transcribe whole audio) |
| `--model MODEL` | override the HuggingFace MLX model id |

| Config | RTF (12 min file) | Notes |
|---|---|---|
| large-v3 + decoder defaults | 0.22x | original baseline |
| large-v3-turbo + tightened decoder | 0.12x | speed win, English parity |
| + initial prompt | 0.12x | no speed cost |
| + BoH filter | 0.12x | trivial post-process |
| + Silero VAD (default) | 0.20x | ~1.7x slower, restores some detail lost to tightened thresholds, skips silence |

VAD costs speed because each VAD region is a separate Whisper seek. We use aggressive merging (`min_silence_duration_ms=2000`) so a typical recording yields ~30-60 regions, not hundreds. If a clean recording doesn't need VAD's anti-hallucination defense, `--no-vad` restores the faster path.

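
To make the merging rules concrete (gaps < 2 s collapsed, blips < 0.5 s dropped, ±0.5 s padding), here is a pure-Python sketch that post-processes a list of `(start, end)` speech regions in seconds. It does not call Silero VAD itself, and the function name and the order of the three steps are assumptions; the pipeline may instead rely on Silero's own parameters to get the same effect.

```python
def merge_regions(regions, total_s, max_gap_s=2.0, min_len_s=0.5, pad_s=0.5):
    """Merge, prune, and pad speech regions given as (start, end) seconds."""
    # 1. Collapse gaps shorter than max_gap_s into the previous region.
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < max_gap_s:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # 2. Drop blips shorter than min_len_s.
    kept = [r for r in merged if r[1] - r[0] >= min_len_s]

    # 3. Pad each side, clamp to the file, and re-merge any overlap
    #    that the padding created.
    out = []
    for start, end in kept:
        start = max(0.0, start - pad_s)
        end = min(total_s, end + pad_s)
        if out and start <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((start, end))
    return out
```

Fewer, longer regions mean fewer Whisper seeks, which is where the time goes.
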

- `word_timestamps=True` memory leak in long batch jobs (mlx-examples #1254). Single-file invocations are unaffected; a future launchd watcher should restart the process between batches.
- Prompt leakage: Whisper occasionally inserts prompt vocabulary into the transcript during silent stretches. Mitigated by VAD + BoH filter, but worth eyeballing if a recording has long silences.
- dss-codec stuck-output regions: the upstream WASM decoder (`hirparak/dss-codec`) occasionally loses sync on certain DS2 byte patterns and emits a constant sample value (typically ±32767) for several seconds before recovering. Whisper hears the resulting flat tone as "voice-like" and hallucinates gibberish on top of it. The pipeline detects these regions and prints a warning like:

      ⚠ dss-codec stuck: 1 constant-value region(s), 11.9s lost 18.7s → 30.5s (constant sample value -32767)

  If you see this warning, the corresponding seconds of audio are unrecoverable and the transcript around them is likely hallucination. The detector only flags the symptom; the cause lives upstream in the codec.
- dss-codec bitrate anomaly: a complementary detector for the same upstream bug. Healthy DS-5000 `ds2_qp` files have a tightly clustered compressed bitrate of ~3545 bytes/sec; when the WASM decoder mis-decodes a file, it tends to consume extra input bytes per output second, pushing the observed bitrate well above the expected rate. The pipeline warns when the observed bitrate exceeds the expected rate by ≥20%:

      ⚠ ds2 bitrate anomaly: 5131 bytes/sec (expected ~3545 for ds2_qp, +45%). Decoder likely mis-decoded — preserve DS500339.DS2 for re-decoding with another tool.

  This warning sometimes fires when the stuck-region detector doesn't, catching files where the codec stays just below the constant-run threshold but still mis-decodes content. The `.DS2` file itself is fine, and your Olympus recorder will play it correctly. To recover the audio, re-decode the preserved `.DS2` with DSS Player Pro R5 (Windows) or DSS Player for Mac (uses Olympus's own decoder).
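
The two dss-codec detectors described above can be sketched in a few lines. Everything here is illustrative: the function names, the `min_run_s=1.0` run threshold, and dividing the whole file size by the decoded duration (ignoring any header bytes) are assumptions, not the pipeline's actual implementation. Only the ~3545 bytes/sec figure and the 20% tolerance come from this README.

```python
# Expected compressed bitrate per format; the ds2_qp figure is from the README.
EXPECTED_BYTES_PER_SEC = {"ds2_qp": 3545}


def find_stuck_regions(samples, sample_rate=16_000, min_run_s=1.0):
    """Return (start_s, end_s, value) for each long run of one sample value.

    Note: this flags any constant run, including exact digital silence;
    min_run_s is a hypothetical threshold.
    """
    regions = []
    run_start = 0
    for i in range(1, len(samples) + 1):
        # Close the current run at end of data or when the value changes.
        if i == len(samples) or samples[i] != samples[run_start]:
            if (i - run_start) / sample_rate >= min_run_s:
                regions.append(
                    (run_start / sample_rate, i / sample_rate, samples[run_start])
                )
            run_start = i
    return regions


def bitrate_excess(file_bytes, decoded_seconds, fmt="ds2_qp", tolerance=0.20):
    """Return the fractional excess over the expected bitrate, or None if OK."""
    expected = EXPECTED_BYTES_PER_SEC[fmt]
    observed = file_bytes / decoded_seconds
    excess = observed / expected - 1.0
    return excess if excess >= tolerance else None
```

With the numbers from the example warning, `bitrate_excess(5131 * 100, 100.0)` comes out near 0.45, matching the `+45%` in the log line.
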