
Word Timestamps Guide

Behnam Ebrahimi edited this page Mar 29, 2026 · 1 revision


Vayu can extract precise word-level timestamps using cross-attention alignment and Dynamic Time Warping (DTW).

Basic Usage

from whisper_mlx import LightningWhisperMLX

whisper = LightningWhisperMLX(model="distil-large-v3", batch_size=12)
result = whisper.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"[{word['start']:.2f} - {word['end']:.2f}] {word['word']} ({word['probability']:.0%})")

Output:

[0.00 - 0.32] Hello (98%)
[0.32 - 0.58] world, (95%)
[0.58 - 0.84] this (97%)
[0.84 - 1.10] is (96%)
[1.10 - 1.52] a (94%)
[1.52 - 1.98] test. (93%)

Word Timing Data Structure

Each word in the words array contains:

{
    "word": "Hello",         # The word text (with leading space for non-first words)
    "start": 0.00,           # Start time in seconds (rounded to 2 decimals)
    "end": 0.32,             # End time in seconds (rounded to 2 decimals)
    "probability": 0.98      # Confidence score (0.0 to 1.0)
}

How It Works

Algorithm Overview

  1. Cross-Attention Extraction: During decoding, the model's cross-attention weights are captured. These weights show which audio frames each text token "attends to"
  2. Alignment Matrix: Attention weights are aggregated across heads and layers into a token-to-frame alignment matrix
  3. Median Filtering: A median filter (width=7) smooths the alignment to reduce noise
  4. Dynamic Time Warping (DTW): Finds the optimal monotonic alignment between tokens and frames using Numba-accelerated DTW
  5. Word Grouping: Tokens are grouped into words, and punctuation is merged with adjacent words
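Step 4 above can be sketched in pure NumPy. This is not Vayu's actual Numba-accelerated implementation — just an illustrative DTW over a token-by-frame cost matrix (lower cost = stronger attention), returning the monotonic token-to-frame path:

```python
import numpy as np

def dtw_backtrace(cost: np.ndarray) -> list[tuple[int, int]]:
    """Find the minimal-cost monotonic path through a token-by-frame
    cost matrix. Returns (token_index, frame_index) pairs in order."""
    n_tokens, n_frames = cost.shape
    acc = np.full((n_tokens + 1, n_frames + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance both token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # Backtrace from the bottom-right corner to recover the path.
    path = []
    i, j = n_tokens, n_frames
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Each token's word boundary then falls at the first and last frame it is paired with on this path, which is what makes the timestamps frame-accurate.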

Punctuation Merging

Vayu automatically merges punctuation with the correct word:

  • Prepend punctuation (merged with the following word): " ' ( [ { -
  • Append punctuation (merged with the preceding word): " ' . , ! ? : ) ] }

This ensures clean word boundaries in the output.
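The merge rule can be illustrated with a small standalone sketch (the function name `merge_punctuation` and its two-pass structure are illustrative, not Vayu's internal code). Append punctuation folds into the preceding word in a forward pass; prepend punctuation folds into the following word in a backward pass, so each merged word keeps real start/end times:

```python
def merge_punctuation(words, prepend="\"'([{-", append="\"'.,!?:)]}"):
    """Fold standalone punctuation tokens into the adjacent word.
    `words` is a list of dicts with "word", "start", "end" keys
    (the same shape as Vayu's output)."""
    merged = []
    for w in words:
        token = w["word"].strip()
        if merged and token and all(c in append for c in token):
            # Append punctuation: glue onto the preceding word.
            merged[-1]["word"] += token
            merged[-1]["end"] = w["end"]
        else:
            merged.append(dict(w))
    # Backward pass: prepend punctuation attaches to the following word.
    out = []
    for w in reversed(merged):
        token = w["word"].strip()
        if out and token and all(c in prepend for c in token):
            out[-1]["word"] = token + out[-1]["word"]
            out[-1]["start"] = w["start"]
        else:
            out.append(dict(w))
    return out[::-1]
```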

Subtitle Workflows

SRT with Word Highlighting

vayu audio.mp3 -f srt --word-timestamps True --highlight-words True

Produces SRT with <u> tags that underline words as they're spoken:

1
00:00:00,000 --> 00:00:01,980
<u>Hello</u> world, this is a test.

2
00:00:00,000 --> 00:00:01,980
Hello <u>world,</u> this is a test.
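Cues like the ones above can be generated directly from the word list. This is a minimal sketch, not the CLI's actual writer — the helper names `srt_time` and `highlight_cues` are illustrative, and here each cue simply spans the highlighted word's own start/end times:

```python
def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(t * 1000))
    h, rem = divmod(ms, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def highlight_cues(segment):
    """Yield one SRT cue per word, underlining the word being spoken."""
    texts = [w["word"].strip() for w in segment["words"]]
    for idx, w in enumerate(segment["words"], start=1):
        line = " ".join(
            f"<u>{t}</u>" if i == idx - 1 else t
            for i, t in enumerate(texts)
        )
        yield f"{idx}\n{srt_time(w['start'])} --> {srt_time(w['end'])}\n{line}\n"
```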

VTT for Web Video

vayu audio.mp3 -f vtt --word-timestamps True --highlight-words True

Formatting Controls

# Max 42 characters per line, 2 lines per subtitle
vayu audio.mp3 -f srt --word-timestamps True --max-line-width 42 --max-line-count 2

# Max 5 words per line
vayu audio.mp3 -f srt --word-timestamps True --max-words-per-line 5

Integration with Video Editors

Extracting Word-Level Data for Editing

import json

result = whisper.transcribe("video_audio.mp3", word_timestamps=True)

# Export as JSON for import into editing tools
words = []
for seg in result["segments"]:
    for w in seg.get("words", []):
        words.append({
            "text": w["word"].strip(),
            "start_ms": int(w["start"] * 1000),
            "end_ms": int(w["end"] * 1000),
            "confidence": w["probability"],
        })

with open("word_timings.json", "w") as f:
    json.dump(words, f, indent=2)

Generating EDL (Edit Decision List)

# Find timestamps for a specific phrase
search_phrase = "important announcement"
for seg in result["segments"]:
    if search_phrase.lower() in seg["text"].lower():
        print(f"Found at {seg['start']:.2f}s - {seg['end']:.2f}s")
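The segment-level search above can be tightened to word-level spans using the `words` array. A sketch (the helper `find_phrase_words` is illustrative, not part of the API), matching case-insensitively and ignoring trailing punctuation:

```python
def find_phrase_words(result, phrase):
    """Locate a phrase at word granularity, returning (start, end)
    in seconds for each match."""
    target = phrase.lower().split()
    spans = []
    for seg in result["segments"]:
        words = seg.get("words", [])
        clean = [w["word"].strip().lower().strip(".,!?:;\"'") for w in words]
        for i in range(len(clean) - len(target) + 1):
            if clean[i:i + len(target)] == target:
                spans.append((words[i]["start"], words[i + len(target) - 1]["end"]))
    return spans
```

The resulting spans are tight enough to paste straight into an editor's in/out points, instead of the looser segment boundaries.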

Language-Specific Behavior

CJK Languages

Chinese, Japanese, Thai, Lao, Myanmar, and Cantonese use character-level splitting instead of whitespace-based word splitting:

result = whisper.transcribe("chinese.mp3", language="zh", word_timestamps=True)
# Each "word" is typically one or two characters

Whitespace Languages

All other languages split on whitespace and punctuation boundaries, grouping subword tokens into complete words.
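The grouping step can be pictured with a small sketch (illustrative, not Vayu's tokenizer code): BPE-style subword tokens carry a leading space when they start a new word, so tokens without one are appended to the word in progress:

```python
def group_subword_tokens(tokens):
    """Group decoded subword strings into words: a token that begins
    with a space (or is the first token) starts a new word; other
    tokens continue the previous one."""
    words = []
    for tok in tokens:
        if not words or tok.startswith(" "):
            words.append(tok)
        else:
            words[-1] += tok
    return words
```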

Performance Considerations

Word timestamp extraction adds processing overhead:

  • DTW alignment runs for each decoded segment (Numba JIT-compiled for speed)
  • Cross-attention caching requires additional memory during decoding
  • Median filtering adds a small CPU cost

Recommendation: Only enable word timestamps when you need them. For plain transcription without timing, omit the flag for faster processing.

Accuracy Tips

  1. Use a larger model — larger models produce better cross-attention patterns
  2. Clean audio — background noise degrades alignment accuracy
  3. Specify language — correct language improves tokenization and alignment
  4. Check probability — low-probability words may have inaccurate timestamps. Filter by word["probability"] > 0.5 for higher confidence
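Tip 4 can be applied with a one-liner over the result structure (the helper name `confident_words` and the 0.5 default are illustrative — tune the threshold to your audio):

```python
def confident_words(result, threshold=0.5):
    """Keep only words whose alignment confidence exceeds `threshold`."""
    return [
        w
        for seg in result["segments"]
        for w in seg.get("words", [])
        if w["probability"] > threshold
    ]
```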
