Generate accurate .srt subtitle files by aligning a plain text transcript (no timestamps) to a YouTube video or local audio/video file.
SDT uses forced alignment: it takes your known transcript and synchronizes it with the audio using stable-ts (built on OpenAI Whisper). This is typically faster and more accurate than transcribing from scratch, because the model already knows what is being said and only needs to work out when.
Input: YouTube URL + Plain Transcript → Output: Timed .srt Subtitle File
- 🎬 YouTube support: paste a URL and the audio is downloaded automatically via yt-dlp
- 📁 Local files: works with any video/audio format (MP4, MKV, MP3, WAV, etc.)
- 🌍 Multilingual: supports all Whisper languages (English, Chinese, Japanese, etc.)
- 🚀 GPU acceleration: auto-detects a CUDA GPU for fast alignment
- ✂️ Smart segmentation: automatically splits subtitles at natural breakpoints
- 📓 Colab ready: includes a notebook for easy cloud usage with a free GPU
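As a rough illustration of what smart segmentation involves, the sketch below groups word-level timestamps into subtitle cues. It is not SDT's actual implementation: `Word` and `segment_words` are hypothetical names, sentence-ending punctuation is an assumed definition of a "natural breakpoint", and the limits default to 42 characters and 5.0 seconds only to mirror the `--max-chars` and `--max-duration` CLI defaults.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One aligned word (hypothetical structure, not SDT's own type)."""
    text: str
    start: float  # seconds
    end: float    # seconds


# Assumption: sentence-ending punctuation marks a natural breakpoint.
BREAK_MARKS = (".", "!", "?", "。", "！", "？")


def segment_words(words, max_chars=42, max_duration=5.0):
    """Group timed words into (start, end, text) cues."""
    cues, current = [], []
    for word in words:
        if current:
            candidate = " ".join(w.text for w in current + [word])
            too_long = len(candidate) > max_chars
            too_slow = word.end - current[0].start > max_duration
            # Close the current cue before it would exceed either limit.
            if too_long or too_slow:
                cues.append(current)
                current = []
        current.append(word)
        # Close the cue right after a sentence-ending word.
        if word.text.endswith(BREAK_MARKS):
            cues.append(current)
            current = []
    if current:
        cues.append(current)
    return [(c[0].start, c[-1].end, " ".join(w.text for w in c)) for c in cues]


words = [
    Word("Hello", 0.0, 0.4), Word("world.", 0.5, 1.0),
    Word("This", 1.2, 1.4), Word("is", 1.5, 1.6), Word("fine", 1.7, 2.0),
]
for start, end, text in segment_words(words):
    print(f"{start:.1f}-{end:.1f}: {text}")
```

Joining words with spaces suits English; languages written without spaces, such as Chinese or Japanese, would need a different join and length measure.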
pip install -r requirements.txt

Note: FFmpeg must be installed on your system for YouTube downloads.
# From YouTube
python -m sdt -i "https://youtube.com/watch?v=VIDEO_ID" -t transcript.txt -o output.srt
# From local file
python -m sdt -i video.mp4 -t transcript.txt
# Chinese with large model
python -m sdt -i video.mp4 -t transcript.txt -l zh -m large-v3
# Preview without saving
python -m sdt -i audio.mp3 -t script.txt --preview

from sdt import download_audio, align_transcript
from sdt.srt_writer import write_srt
# 1. Get audio
audio_path = download_audio("https://youtube.com/watch?v=VIDEO_ID")
# 2. Align transcript
with open("transcript.txt", "r", encoding="utf-8") as f:
    transcript = f.read()
result = align_transcript(audio_path, transcript, language="en")
# 3. Generate SRT
write_srt(result, "output.srt")

Open SDT_Colab.ipynb in Google Colab for a ready-to-use notebook with a free GPU.
| Flag | Description | Default |
|---|---|---|
| `-i`, `--input` | YouTube URL or local file path | required |
| `-t`, `--transcript` | Path to plain text transcript | required |
| `-o`, `--output` | Output file path | auto-named |
| `-l`, `--language` | Language code (en, zh, ja, ...) | auto-detect |
| `-m`, `--model` | Whisper model size | medium |
| `--max-chars` | Max characters per subtitle | 42 |
| `--max-duration` | Max seconds per subtitle | 5.0 |
| `--format` | Output format (srt or vtt) | srt |
| `--preview` | Print to console, don't save | off |
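To show how the two `--format` choices differ on disk, here is a minimal sketch that renders the same cues as SRT or WebVTT. The functions `format_timestamp` and `render_cues` are hypothetical and written for illustration; SDT's own writer may be organized differently. The format details themselves are standard: SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header and uses a period.

```python
def format_timestamp(seconds, fmt="srt"):
    """Render seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT)."""
    if fmt not in ("srt", "vtt"):
        raise ValueError(f"unsupported format: {fmt!r}")
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    sep = "," if fmt == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def render_cues(cues, fmt="srt"):
    """Render (start, end, text) cues as one subtitle document string."""
    if fmt not in ("srt", "vtt"):
        raise ValueError(f"unsupported format: {fmt!r}")
    # WebVTT files begin with a header line followed by a blank line.
    lines = ["WEBVTT", ""] if fmt == "vtt" else []
    for index, (start, end, text) in enumerate(cues, start=1):
        if fmt == "srt":
            lines.append(str(index))  # SRT cues are numbered from 1
        lines.append(
            f"{format_timestamp(start, fmt)} --> {format_timestamp(end, fmt)}"
        )
        lines.append(text)
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


cues = [(0.0, 1.0, "Hello world."), (1.2, 2.0, "This is fine")]
print(render_cues(cues, "srt"))
print(render_cues(cues, "vtt"))
```

Working in integer milliseconds before splitting into hours, minutes, and seconds avoids a rounded value such as 59.9996 seconds appearing as `60,000` in the seconds field.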
| Model | Parameters | English | Multilingual | Speed |
|---|---|---|---|---|
| `tiny` | 39M | ⭐⭐ | ⭐ | ⚡⚡⚡⚡ |
| `base` | 74M | ⭐⭐⭐ | ⭐⭐ | ⚡⚡⚡ |
| `small` | 244M | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⚡⚡ |
| `medium` | 769M | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⚡ |
| `large-v3` | 1.5B | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 🐢 |
- Python ≥ 3.9
- FFmpeg (for YouTube downloads)
- NVIDIA GPU (optional, but recommended for speed)