
Word Timestamps Guide

Behnam Ebrahimi edited this page Mar 29, 2026 · 1 revision


Vayu can extract precise word-level timestamps using cross-attention alignment and Dynamic Time Warping (DTW).

Basic Usage

from whisper_mlx import LightningWhisperMLX

whisper = LightningWhisperMLX(model="distil-large-v3", batch_size=12)
result = whisper.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"[{word['start']:.2f} - {word['end']:.2f}] {word['word']} ({word['probability']:.0%})")

Output:

[0.00 - 0.32] Hello (98%)
[0.32 - 0.58] world, (95%)
[0.58 - 0.84] this (97%)
[0.84 - 1.10] is (96%)
[1.10 - 1.52] a (94%)
[1.52 - 1.98] test. (93%)

Word Timing Data Structure

Each word in the words array contains:

{
    "word": "Hello",         # The word text (with leading space for non-first words)
    "start": 0.00,           # Start time in seconds (rounded to 2 decimals)
    "end": 0.32,             # End time in seconds (rounded to 2 decimals)
    "probability": 0.98      # Confidence score (0.0 to 1.0)
}

How It Works

Algorithm Overview

  1. Cross-Attention Extraction: During decoding, the model's cross-attention weights are captured. These weights show which audio frames each text token "attends to"
  2. Alignment Matrix: Attention weights are aggregated across heads and layers into a token-to-frame alignment matrix
  3. Median Filtering: A median filter (width=7) smooths the alignment to reduce noise
  4. Dynamic Time Warping (DTW): Finds the optimal monotonic alignment between tokens and frames using Numba-accelerated DTW
  5. Word Grouping: Tokens are grouped into words, and punctuation is merged with adjacent words
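Step 4 above can be sketched in pure NumPy. This is not Vayu's actual Numba-accelerated implementation — just an illustrative DTW over a token-by-frame cost matrix (lower cost = stronger attention), returning the monotonic token-to-frame path:

```python
import numpy as np

def dtw_backtrace(cost: np.ndarray) -> list[tuple[int, int]]:
    """Find the minimal-cost monotonic path through a token-by-frame
    cost matrix. Returns (token_index, frame_index) pairs in order."""
    n_tokens, n_frames = cost.shape
    acc = np.full((n_tokens + 1, n_frames + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance both token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # Backtrace from the bottom-right corner to recover the path.
    path = []
    i, j = n_tokens, n_frames
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Each token's word boundary then falls at the first and last frame it is paired with on this path, which is what makes the timestamps frame-accurate.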

Punctuation Merging

Vayu automatically merges punctuation with the correct word:

  • Prepend punctuation (merged with the following word): " ' ( [ { -
  • Append punctuation (merged with the preceding word): " ' . , ! ? : ) ] }

This ensures clean word boundaries in the output.
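The merge rule can be illustrated with a small standalone sketch (the function name `merge_punctuation` and its two-pass structure are illustrative, not Vayu's internal code). Append punctuation folds into the preceding word in a forward pass; prepend punctuation folds into the following word in a backward pass, so each merged word keeps real start/end times:

```python
def merge_punctuation(words, prepend="\"'([{-", append="\"'.,!?:)]}"):
    """Fold standalone punctuation tokens into the adjacent word.
    `words` is a list of dicts with "word", "start", "end" keys
    (the same shape as Vayu's output)."""
    merged = []
    for w in words:
        token = w["word"].strip()
        if merged and token and all(c in append for c in token):
            # Append punctuation: glue onto the preceding word.
            merged[-1]["word"] += token
            merged[-1]["end"] = w["end"]
        else:
            merged.append(dict(w))
    # Backward pass: prepend punctuation attaches to the following word.
    out = []
    for w in reversed(merged):
        token = w["word"].strip()
        if out and token and all(c in prepend for c in token):
            out[-1]["word"] = token + out[-1]["word"]
            out[-1]["start"] = w["start"]
        else:
            out.append(dict(w))
    return out[::-1]
```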

Subtitle Workflows

SRT with Word Highlighting

vayu audio.mp3 -f srt --word-timestamps True --highlight-words True

Produces SRT with <u> tags that underline words as they're spoken:

1
00:00:00,000 --> 00:00:01,980
<u>Hello</u> world, this is a test.

2
00:00:00,000 --> 00:00:01,980
Hello <u>world,</u> this is a test.
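Cues like the ones above can be generated directly from the word list. This is a minimal sketch, not the CLI's actual writer — the helper names `srt_time` and `highlight_cues` are illustrative, and here each cue simply spans the highlighted word's own start/end times:

```python
def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(t * 1000))
    h, rem = divmod(ms, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def highlight_cues(segment):
    """Yield one SRT cue per word, underlining the word being spoken."""
    texts = [w["word"].strip() for w in segment["words"]]
    for idx, w in enumerate(segment["words"], start=1):
        line = " ".join(
            f"<u>{t}</u>" if i == idx - 1 else t
            for i, t in enumerate(texts)
        )
        yield f"{idx}\n{srt_time(w['start'])} --> {srt_time(w['end'])}\n{line}\n"
```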

VTT for Web Video

vayu audio.mp3 -f vtt --word-timestamps True --highlight-words True

Formatting Controls

# Max 42 characters per line, 2 lines per subtitle
vayu audio.mp3 -f srt --word-timestamps True --max-line-width 42 --max-line-count 2

# Max 5 words per line
vayu audio.mp3 -f srt --word-timestamps True --max-words-per-line 5

Integration with Video Editors

Extracting Word-Level Data for Editing

import json

result = whisper.transcribe("video_audio.mp3", word_timestamps=True)

# Export as JSON for import into editing tools
words = []
for seg in result["segments"]:
    for w in seg.get("words", []):
        words.append({
            "text": w["word"].strip(),
            "start_ms": int(w["start"] * 1000),
            "end_ms": int(w["end"] * 1000),
            "confidence": w["probability"],
        })

with open("word_timings.json", "w") as f:
    json.dump(words, f, indent=2)

Generating EDL (Edit Decision List)

# Find timestamps for a specific phrase
search_phrase = "important announcement"
for seg in result["segments"]:
    if search_phrase.lower() in seg["text"].lower():
        print(f"Found at {seg['start']:.2f}s - {seg['end']:.2f}s")
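The segment-level search above can be tightened to word-level spans using the `words` array. A sketch (the helper `find_phrase_words` is illustrative, not part of the API), matching case-insensitively and ignoring trailing punctuation:

```python
def find_phrase_words(result, phrase):
    """Locate a phrase at word granularity, returning (start, end)
    in seconds for each match."""
    target = phrase.lower().split()
    spans = []
    for seg in result["segments"]:
        words = seg.get("words", [])
        clean = [w["word"].strip().lower().strip(".,!?:;\"'") for w in words]
        for i in range(len(clean) - len(target) + 1):
            if clean[i:i + len(target)] == target:
                spans.append((words[i]["start"], words[i + len(target) - 1]["end"]))
    return spans
```

The resulting spans are tight enough to paste straight into an editor's in/out points, instead of the looser segment boundaries.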

Language-Specific Behavior

CJK Languages

Chinese, Japanese, Thai, Lao, Myanmar, and Cantonese use character-level splitting instead of whitespace-based word splitting:

result = whisper.transcribe("chinese.mp3", language="zh", word_timestamps=True)
# Each "word" is typically one or two characters

Whitespace Languages

All other languages split on whitespace and punctuation boundaries, grouping subword tokens into complete words.
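The grouping step can be pictured with a small sketch (illustrative, not Vayu's tokenizer code): BPE-style subword tokens carry a leading space when they start a new word, so tokens without one are appended to the word in progress:

```python
def group_subword_tokens(tokens):
    """Group decoded subword strings into words: a token that begins
    with a space (or is the first token) starts a new word; other
    tokens continue the previous one."""
    words = []
    for tok in tokens:
        if not words or tok.startswith(" "):
            words.append(tok)
        else:
            words[-1] += tok
    return words
```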

Performance Considerations

Word timestamp extraction adds processing overhead:

  • DTW alignment runs for each decoded segment (Numba JIT-compiled for speed)
  • Cross-attention caching requires additional memory during decoding
  • Median filtering adds a small CPU cost

Recommendation: Only enable word timestamps when you need them. For plain transcription without timing, omit the flag for faster processing.

Accuracy Tips

  1. Use a larger model — larger models produce better cross-attention patterns
  2. Clean audio — background noise degrades alignment accuracy
  3. Specify language — correct language improves tokenization and alignment
  4. Check probability — low-probability words may have inaccurate timestamps. Filter by word["probability"] > 0.5 for higher confidence
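Tip 4 can be applied with a one-liner over the result structure (the helper name `confident_words` and the 0.5 default are illustrative — tune the threshold to your audio):

```python
def confident_words(result, threshold=0.5):
    """Keep only words whose alignment confidence exceeds `threshold`."""
    return [
        w
        for seg in result["segments"]
        for w in seg.get("words", [])
        if w["probability"] > threshold
    ]
```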
