# Word Timestamps Guide
Vayu can extract precise word-level timestamps using cross-attention alignment and Dynamic Time Warping (DTW).
```python
from whisper_mlx import LightningWhisperMLX

whisper = LightningWhisperMLX(model="distil-large-v3", batch_size=12)
result = whisper.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"[{word['start']:.2f} - {word['end']:.2f}] {word['word']} ({word['probability']:.0%})")
```

Output:

```
[0.00 - 0.32] Hello (98%)
[0.32 - 0.58] world, (95%)
[0.58 - 0.84] this (97%)
[0.84 - 1.10] is (96%)
[1.10 - 1.52] a (94%)
[1.52 - 1.98] test. (93%)
```
Each word in the `words` array contains:
```python
{
    "word": "Hello",     # The word text (with leading space for non-first words)
    "start": 0.00,       # Start time in seconds (rounded to 2 decimals)
    "end": 0.32,         # End time in seconds (rounded to 2 decimals)
    "probability": 0.98  # Confidence score (0.0 to 1.0)
}
```

- Cross-Attention Extraction: During decoding, the model's cross-attention weights are captured. These weights show which audio frames each text token "attends to"
- Alignment Matrix: Attention weights are aggregated across heads and layers into a token-to-frame alignment matrix
- Median Filtering: A median filter (width=7) smooths the alignment to reduce noise
- Dynamic Time Warping (DTW): Finds the optimal monotonic alignment between tokens and frames using Numba-accelerated DTW
- Word Grouping: Tokens are grouped into words, and punctuation is merged with adjacent words
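The median-filtering and DTW steps above can be sketched in isolation. The following is an illustrative toy (NumPy only, filter width 3 instead of 7, and `dtw_path` is not Vayu's internal function), assuming a token-by-frame attention matrix is already in hand:

```python
import numpy as np

def dtw_path(cost):
    """Optimal monotonic token-to-frame alignment over a cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # Backtrack from the bottom-right corner to recover the path
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy alignment matrix: 3 tokens x 6 audio frames of attention weights
attention = np.eye(3, 6) + np.eye(3, 6, k=1)

# Median filter along the frame axis (width 3 for this toy; the real pipeline uses 7)
smoothed = attention.copy()
for r in range(attention.shape[0]):
    for c in range(1, attention.shape[1] - 1):
        smoothed[r, c] = np.median(attention[r, c - 1:c + 2])

# DTW minimizes cost, so negate to follow the strongest attention
path = dtw_path(-smoothed)

# A token's start frame is the first frame on its alignment path
starts = {tok: min(f for t, f in path if t == tok) for tok, _ in path}
print(starts)
```

In Vayu the matrix comes from cross-attention weights aggregated over heads and layers, the inner DTW loop is Numba-compiled, and frame indices are then converted to seconds.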
Vayu automatically merges punctuation with the correct word:
- Prepend punctuation (merged with the following word): `" ' ( [ { -`
- Append punctuation (merged with the preceding word): `" ' . , ! ? : ) ] }`
This ensures clean word boundaries in the output.
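A minimal sketch of that merge behavior on standalone punctuation tokens (illustrative only; `merge_punctuation` is not Vayu's API, and the real implementation also folds the punctuation's timestamps into the adjacent word):

```python
# Hypothetical helper mirroring the rules above; not Vayu's internal code.
PREPEND = set('"\'([{-')     # attach to the following word
APPEND = set('"\').,!?:]}')  # attach to the preceding word

def merge_punctuation(words):
    merged = []
    pending = ""  # prepend punctuation waiting for the next word
    for w in words:
        if w in APPEND and merged:
            merged[-1] += w
        elif w in PREPEND:
            pending += w
        else:
            merged.append(pending + w)
            pending = ""
    return merged

print(merge_punctuation(['"', 'Hello', 'world', ',', 'test', '.']))
# → ['"Hello', 'world,', 'test.']
```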
```
vayu audio.mp3 -f srt --word-timestamps True --highlight-words True
```

Produces SRT with `<u>` tags that underline words as they're spoken:

```
1
00:00:00,000 --> 00:00:01,980
<u>Hello</u> world, this is a test.

2
00:00:00,000 --> 00:00:01,980
Hello <u>world,</u> this is a test.
```
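Blocks like those can be generated from word timings roughly as follows. This is a sketch, not Vayu's actual SRT writer, and it mirrors the sample's full-segment timecodes (a real writer may instead bound each block by the word's own start/end):

```python
def srt_time(t):
    """Format seconds as an SRT timecode (HH:MM:SS,mmm)."""
    ms = int(round(t * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def highlight_blocks(words, seg_start, seg_end):
    """One numbered SRT block per word, underlining that word."""
    blocks = []
    for i in range(len(words)):
        text = " ".join(
            f"<u>{w['word']}</u>" if j == i else w["word"]
            for j, w in enumerate(words)
        )
        blocks.append(f"{i + 1}\n{srt_time(seg_start)} --> {srt_time(seg_end)}\n{text}")
    return blocks

words = [{"word": "Hello", "start": 0.00, "end": 0.32},
         {"word": "world,", "start": 0.32, "end": 0.58}]
print("\n\n".join(highlight_blocks(words, 0.0, 1.98)))
```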
```
vayu audio.mp3 -f vtt --word-timestamps True --highlight-words True
```

```
# Max 42 characters per line, 2 lines per subtitle
vayu audio.mp3 -f srt --word-timestamps True --max-line-width 42 --max-line-count 2

# Max 5 words per line
vayu audio.mp3 -f srt --word-timestamps True --max-words-per-line 5
```

```python
import json

result = whisper.transcribe("video_audio.mp3", word_timestamps=True)

# Export as JSON for import into editing tools
words = []
for seg in result["segments"]:
    for w in seg.get("words", []):
        words.append({
            "text": w["word"].strip(),
            "start_ms": int(w["start"] * 1000),
            "end_ms": int(w["end"] * 1000),
            "confidence": w["probability"],
        })

with open("word_timings.json", "w") as f:
    json.dump(words, f, indent=2)
```

```python
# Find timestamps for a specific phrase
search_phrase = "important announcement"
for seg in result["segments"]:
    if search_phrase.lower() in seg["text"].lower():
        print(f"Found at {seg['start']:.2f}s - {seg['end']:.2f}s")
```

Chinese, Japanese, Thai, Lao, Myanmar, and Cantonese use character-level splitting instead of whitespace-based word splitting:
```python
result = whisper.transcribe("chinese.mp3", language="zh", word_timestamps=True)
# Each "word" is typically one or two characters
```

All other languages split on whitespace and punctuation boundaries, grouping subword tokens into complete words.
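The segment-level phrase search shown earlier only narrows a match to a whole segment; with word timestamps you can pin it to the exact words. `find_phrase` below is a hypothetical helper, demonstrated against a hand-built `result`:

```python
# Flatten all words, then scan for a consecutive run matching the phrase.
# `result` is assumed to come from transcribe(..., word_timestamps=True).
def find_phrase(result, phrase):
    words = [w for seg in result["segments"] for w in seg.get("words", [])]
    target = phrase.lower().split()
    for i in range(len(words) - len(target) + 1):
        window = words[i:i + len(target)]
        texts = [w["word"].strip().lower().strip(".,!?") for w in window]
        if texts == target:
            return window[0]["start"], window[-1]["end"]
    return None

result = {"segments": [{"words": [
    {"word": " an", "start": 3.1, "end": 3.2},
    {"word": " important", "start": 3.2, "end": 3.7},
    {"word": " announcement.", "start": 3.7, "end": 4.3},
]}]}
print(find_phrase(result, "important announcement"))  # → (3.2, 4.3)
```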
Word timestamp extraction adds processing overhead:
- DTW alignment runs for each decoded segment (Numba JIT-compiled for speed)
- Cross-attention caching requires additional memory during decoding
- Median filtering adds a small CPU cost
Recommendation: Only enable word timestamps when you need them. For plain transcription without timing, omit the flag for faster processing.
- Use a larger model — larger models produce better cross-attention patterns
- Clean audio — background noise degrades alignment accuracy
- Specify language — correct language improves tokenization and alignment
- Check probability: low-probability words may have inaccurate timestamps. Filter by `word["probability"] > 0.5` for higher confidence
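That probability filter can be a short helper; `confident_words` below is illustrative (not part of Vayu), shown on a hand-built `result`:

```python
# Keep only words whose confidence clears a threshold (0.5 per the tip above).
def confident_words(result, threshold=0.5):
    return [
        w
        for seg in result["segments"]
        for w in seg.get("words", [])
        if w["probability"] > threshold
    ]

result = {"segments": [{"words": [
    {"word": "Hello", "start": 0.0, "end": 0.32, "probability": 0.98},
    {"word": "uh", "start": 0.32, "end": 0.4, "probability": 0.31},
]}]}
print([w["word"] for w in confident_words(result)])  # → ['Hello']
```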