robrawks/transcribe

transcribe

Local on-Mac audio → timestamped transcript pipeline. Accepts Olympus .ds2/.dss recordings (via the sibling ds2-converter project) as well as .mp3/.wav/.m4a/.flac/.opus/.ogg/.aac/.aiff/.aif, normalizes everything to 16 kHz mono WAV, runs Silero VAD to skip silence, then runs mlx-whisper (whisper-large-v3-turbo on Apple MLX). Output is a .transcript.txt (segment-level timestamps inline) and a .transcript.json sidecar (full Whisper output incl. word timestamps).

DS2 Insert-mode recovery workflow

The open-source dss-codec WASM decoder we use mis-decodes DS2 files recorded with Insert mode (audio spliced into the middle of an existing recording). The codec-stuck and bitrate-anomaly detectors will warn when a file triggers the bug. When that happens:

  1. Open the offending .DS2 file in DSS Player for Mac (Olympus's official decoder — it handles inserts correctly).
  2. File → Save As → AIFF to export the audio.
  3. Drop the resulting .aif into ~/Audio/Inbox/ and re-run transcribe.

The pipeline now accepts AIFF inputs alongside the other formats, so the recovered audio flows through the same VAD → Whisper path as everything else.

The transcript is meant to be pasted into a Claude.ai project for the final stitch / cleanup step — that step is intentionally not automated here.

Layout

~/Audio/Inbox/                       you drop files here
~/Audio/Done/YYYY-MM-DD/             sources move here after success
~/Audio/Out/YYYY-MM-DD/              transcripts land here
~/Audio/Prompts/default.txt          (optional) glossary loaded automatically

Prerequisites

brew install ffmpeg pipx
pipx ensurepath
# ds2-converter must already be `npm link`-ed so `ds2-convert` is on PATH

Install

cd ~/Code/transcribe
pipx install --editable .
pipx inject transcribe silero-vad     # for VAD pre-segmentation

Usage

# Process everything in ~/Audio/Inbox/
transcribe

# Or transcribe specific files (sources are left in place)
transcribe path/to/recording1.ds2 path/to/recording2.mp3

The first run downloads the whisper-large-v3-turbo weights (~800 MB) into ~/.cache/huggingface/hub/. Subsequent runs reuse the cached copy.

What's in the pipeline (in order)

  1. Format detection: .ds2/.dss files go through ds2-convert; everything else goes through ffmpeg for normalization to 16 kHz mono WAV.
  2. Silero VAD — speech regions are detected and merged aggressively (gaps < 2 s collapsed, blips < 0.5 s dropped, ±0.5 s padding). Silence between speech regions is skipped from Whisper entirely. Disable with --no-vad.
  3. mlx-whisper transcription with whisper-large-v3-turbo (English accuracy parity with large-v3 at 6-8× speed). Decoder tuned for solo dictation: condition_on_previous_text=False, hallucination_silence_threshold=2.0, plus tightened no_speech_threshold, compression_ratio_threshold, logprob_threshold.
  4. Initial prompt — a 224-token glossary/style hint that biases the decoder's vocabulary priors. See "Prompts" below.
  5. Bag-of-Hallucinations filter — drops segments matching known Whisper hallucinations ("Thank you for watching", "you", "Bye", etc.) and segments with within-segment 3-gram looping. Per Barański et al., arxiv 2501.11378.
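
The filter in step 5 can be pictured roughly as follows. This is a minimal sketch, not the pipeline's actual code: the phrase set holds only the examples quoted above, and the names `KNOWN_HALLUCINATIONS`, `normalize`, `has_looping_trigram`, and `keep_segment`, along with the repeat threshold of 3, are assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative subset only; the real bag is assumed to be longer.
KNOWN_HALLUCINATIONS = {"thank you for watching", "you", "bye"}


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split())


def has_looping_trigram(text: str, max_repeats: int = 3) -> bool:
    """True if any 3-word sequence occurs at least max_repeats times (assumed threshold)."""
    words = normalize(text).split()
    trigrams = Counter(zip(words, words[1:], words[2:]))
    return any(count >= max_repeats for count in trigrams.values())


def keep_segment(text: str) -> bool:
    """Drop whole-segment matches against the bag and segments that loop on a 3-gram."""
    if normalize(text) in KNOWN_HALLUCINATIONS:
        return False
    return not has_looping_trigram(text)
```

Matching on the whole normalized segment, rather than on substrings, keeps a legitimate sentence that merely contains the word "you" from being dropped.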

Prompts

A prefix of up to 224 tokens (~150-180 words) is sent to Whisper to bias vocabulary and style priors. Three sources are checked, highest priority first:

  1. --prompt "..." or --prompt-file PATH on the command line.
  2. Sidecar file <audio>.prompt in the same directory as the audio (e.g., ~/Audio/Inbox/foo.mp3.prompt).
  3. ~/Audio/Prompts/default.txt (loaded automatically if it exists).

--no-prompt disables all sources for one run.
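
The priority order above can be sketched as a single lookup function. This is an illustration under stated assumptions, not the project's code: `resolve_prompt` and its parameter names are hypothetical, and the sketch assumes prompt files are UTF-8 text whose surrounding whitespace can be stripped.

```python
from pathlib import Path
from typing import Optional

DEFAULT_PROMPT = Path.home() / "Audio" / "Prompts" / "default.txt"


def resolve_prompt(
    audio: Path,
    cli_prompt: Optional[str] = None,
    cli_prompt_file: Optional[Path] = None,
    no_prompt: bool = False,
    default_path: Path = DEFAULT_PROMPT,
) -> Optional[str]:
    """Return the first prompt found, checking sources from highest priority down."""
    if no_prompt:
        return None  # --no-prompt disables every source
    if cli_prompt is not None:
        return cli_prompt  # --prompt "..."
    if cli_prompt_file is not None:
        return cli_prompt_file.read_text(encoding="utf-8").strip()  # --prompt-file PATH
    # Sidecar keeps the audio extension: foo.mp3 -> foo.mp3.prompt
    sidecar = audio.with_name(audio.name + ".prompt")
    for candidate in (sidecar, default_path):
        if candidate.is_file():
            return candidate.read_text(encoding="utf-8").strip()
    return None
```

For example, `resolve_prompt(Path("~/Audio/Inbox/foo.mp3").expanduser())` would return the contents of `foo.mp3.prompt` if it exists, and otherwise fall back to the default glossary.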

What works in a prompt: glossary of proper nouns spelled the way you want them, domain vocabulary, 1-2 sentences in your preferred style. The model follows the style of the prompt — it does NOT follow instructions like "transcribe carefully" or "fix grammar."

What doesn't work: instructions, long backstory (anything past 224 tokens is silently dropped), style hints that contradict the audio.

Known limitation: because condition_on_previous_text=False (an anti-hallucination must), the prompt only seeds the first ~30 s of each recording. Names mentioned later in a long file decode fresh and won't get the prompt's vocabulary bias. The Whisper architecture doesn't have a clean way around this.

Output format

recording.transcript.txt:

[00:00.000 → 00:03.420]  Hello, this is a test recording.
[00:03.420 → 00:07.812]  Thinking out loud about a few things today.

recording.transcript.json contains the full mlx-whisper result dict (segments with word-level timestamps, language detection, etc.).
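
To make the text format concrete, here is a rough sketch of how one line could be rendered from a segment in the JSON sidecar. The `start`, `end`, and `text` keys are the standard Whisper segment fields; the helper names `format_timestamp` and `format_segment` are hypothetical, and letting the minutes field run past 59 for long recordings is an assumption, since the sample above only shows sub-minute values.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.mmm; minutes are not wrapped at 60 (assumption)."""
    total_ms = round(seconds * 1000)
    minutes, rest_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rest_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segment(segment: dict) -> str:
    """Build one transcript line: '[start → end]  text'."""
    start = format_timestamp(segment["start"])
    end = format_timestamp(segment["end"])
    return f"[{start} → {end}]  {segment['text'].strip()}"


print(format_segment({"start": 0.0, "end": 3.42, "text": " Hello, this is a test recording."}))
```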

CLI flags

transcribe [--quiet | --show-segments]
           [--prompt TEXT | --prompt-file PATH | --no-prompt]
           [--no-vad]
           [--model MODEL]
           [files ...]

Flag                     Effect
(default)                tqdm progress bar (% through audio + ETA)
-v / --show-segments     streams each Whisper segment to stdout as decoded
-q / --quiet             silent until each file finishes
--prompt TEXT            inline initial_prompt (≤224 tokens; overrides sidecar/default)
--prompt-file PATH       read prompt from a file (overrides sidecar/default)
--no-prompt              disable all prompt sources for this run
--no-vad                 skip Silero VAD pre-segmentation (transcribe whole audio)
--model MODEL            override the HuggingFace MLX model id

Performance notes (M1 Pro 16 GB)

Config                                RTF (12 min file)   Notes
large-v3 + decoder defaults           0.22x               original baseline
large-v3-turbo + tightened decoder    0.12x               speed win, English parity
+ initial prompt                      0.12x               no speed cost
+ BoH filter                          0.12x               trivial post-process
+ Silero VAD (default)                0.20x               ~1.7x slower, restores some detail lost to tightened thresholds, skips silence

The VAD slowdown is real because each VAD region is a separate Whisper seek. We use aggressive merging (min_silence_duration_ms=2000) so a typical recording yields ~30-60 regions, not hundreds. If you don't need VAD's anti-hallucination defense for a clean recording, --no-vad restores the faster path.
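
The merging rules (gaps < 2 s collapsed, blips < 0.5 s dropped, ±0.5 s padding) can be sketched as a post-processing pass over raw speech regions. This is an illustration only: the pipeline may instead pass equivalent parameters straight to Silero, and the function name `merge_regions` and the order of operations (merge, then drop, then pad) are assumptions.

```python
from typing import Iterable, List, Tuple

Region = Tuple[float, float]  # (start_s, end_s)


def merge_regions(
    regions: Iterable[Region],
    total_s: float,
    min_gap_s: float = 2.0,
    min_len_s: float = 0.5,
    pad_s: float = 0.5,
) -> List[Region]:
    """Collapse short gaps, drop short blips, then pad and clamp to the audio length."""
    merged: List[Region] = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < min_gap_s:
            # Gap is shorter than min_gap_s: extend the previous region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    kept = [(s, e) for s, e in merged if e - s >= min_len_s]
    return [(max(0.0, s - pad_s), min(total_s, e + pad_s)) for s, e in kept]


raw = [(1.0, 3.0), (4.0, 6.0), (10.0, 10.2), (20.0, 25.0)]
print(merge_regions(raw, total_s=26.0))
```

Because surviving regions are at least 2 s apart, the 0.5 s padding on each side cannot make neighbours overlap. Fewer, longer regions mean fewer separate Whisper seeks, which is where the slowdown comes from.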

Known footguns

  • word_timestamps=True memory leak in long batch jobs (mlx-examples #1254). Single-file invocations are unaffected; a future launchd watcher should restart the process between batches.

  • Prompt-leakage: Whisper occasionally inserts prompt vocabulary into the transcript during silent stretches. Mitigated by VAD + BoH filter, but worth eyeballing if a recording has long silences.

  • dss-codec stuck-output regions: the upstream WASM decoder (hirparak/dss-codec) occasionally loses sync on certain DS2 byte patterns and emits a constant sample value (typically ±32767) for several seconds before recovering. Whisper hears the resulting flat tone as "voice-like" and hallucinates gibberish on top of it. The pipeline detects these regions and prints a warning like:

    ⚠ dss-codec stuck: 1 constant-value region(s), 11.9s lost
        18.7s →   30.5s  (constant sample value -32767)
    

    If you see this warning, the corresponding seconds of audio are unrecoverable and the transcript around them is likely hallucination. The detector only flags the symptom; the cause lives upstream in the codec.
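
    A detector for this symptom can be as simple as scanning the decoded PCM for long runs of one sample value. The sketch below shows the idea rather than the pipeline's actual code: the name `find_stuck_regions` and the 1-second minimum run length are assumptions, since the real threshold is not stated here.

```python
from typing import List, Sequence, Tuple


def find_stuck_regions(
    samples: Sequence[int],
    sample_rate: int = 16_000,
    min_run_s: float = 1.0,  # assumed threshold, not taken from the pipeline
) -> List[Tuple[float, float, int]]:
    """Return (start_s, end_s, value) for each run of identical samples >= min_run_s."""
    min_run = int(min_run_s * sample_rate)
    regions = []
    run_start = 0
    for i in range(1, len(samples) + 1):
        # A run ends at the end of the data or when the sample value changes.
        if i == len(samples) or samples[i] != samples[run_start]:
            if i - run_start >= min_run:
                regions.append((run_start / sample_rate, i / sample_rate, samples[run_start]))
            run_start = i
    return regions
```

    As written, this also flags long stretches of exact digital silence (constant 0); a real detector would likely want to treat that case separately.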

  • dss-codec bitrate anomaly: a complementary detector for the same upstream bug. Healthy DS-5000 ds2_qp files have a tightly-clustered compressed bitrate of ~3545 bytes/sec; when the WASM decoder mis-decodes a file, it tends to consume extra input bytes per output second, pushing the observed bitrate well above the expected rate. The pipeline warns when the observed bitrate exceeds the expected rate by ≥20%:

    ⚠ ds2 bitrate anomaly: 5131 bytes/sec (expected ~3545 for ds2_qp, +45%).
      Decoder likely mis-decoded — preserve DS500339.DS2 for re-decoding
      with another tool.
    

    This warning sometimes fires when the stuck-region detector doesn't, catching files where the codec stays just below the constant-run threshold but still mis-decodes content. The .DS2 file itself is fine — your Olympus recorder will play it correctly. To recover the audio, re-decode the preserved .DS2 with DSS Player Pro R5 (Windows) or DSS Player for Mac (uses Olympus's own decoder).
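
    The check itself is simple arithmetic: input bytes divided by decoded seconds, compared against the expected rate. The sketch below is hedged accordingly: `bitrate_anomaly` is a hypothetical name, only the ds2_qp figure quoted above is included, and the sketch assumes header bytes are negligible relative to the audio payload.

```python
from typing import Optional

# Only the ds2_qp figure quoted above; other recording modes are omitted.
EXPECTED_BYTES_PER_SEC = {"ds2_qp": 3545}


def bitrate_anomaly(
    file_bytes: int,
    decoded_duration_s: float,
    mode: str = "ds2_qp",
    tolerance: float = 0.20,
) -> Optional[float]:
    """Return the relative excess (0.45 means +45%) if it meets the tolerance, else None."""
    expected = EXPECTED_BYTES_PER_SEC[mode]
    observed = file_bytes / decoded_duration_s
    excess = observed / expected - 1.0
    return excess if excess >= tolerance else None


# A file consuming 5131 bytes per decoded second, as in the warning above.
excess = bitrate_anomaly(file_bytes=5131 * 720, decoded_duration_s=720.0)
if excess is not None:
    print(f"ds2 bitrate anomaly: +{excess:.0%}")
```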
