# YATSEE -- Yet Another Tool for Speech Extraction & Enrichment
YATSEE is a local-first, privacy-respecting, end-to-end data pipeline that systematically refines raw meeting audio into a clean, searchable, and auditable intelligence layer. It automates the tedious work of downloading, transcribing, and normalizing unstructured conversations, turning public noise into actionable intelligence.

Public records are often public in name only. Civic business is frequently buried in four-hour livestreams and jargon-filled transcripts that are technically accessible but functionally opaque: following along costs an interested citizen hours of time and a working knowledge of the jargon.

YATSEE solves that by using a carefully tuned local LLM to transform that wall of text into a high-signal summary, extracting the specific votes, contracts, and policy debates that matter. It's a tool for creating the clarity and accountability that modern civic discourse requires, with or without the government's help.
## Quick Start

Follow these steps to get YATSEE running.
### 1. Clone the repository

```bash
git clone https://github.com/alias454/yatsee.git
cd yatsee
```

### 2. Create your local config

```bash
# Copy the template to create your local config file
cp yatsee.conf yatsee.toml
```

- Open `yatsee.toml` in any text editor.
- Add at least one entity with the required fields (see the sketch below):
  - `entity` (unique identifier)
  - `sources.youtube.youtube_path` (YouTube channel/playlist)
- This is the minimum viable entity needed for the downloader.
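For example, a minimal entity entry might look like the sketch below. The table and key layout here are assumptions for illustration; treat the comments in the `yatsee.conf` template as the authoritative reference.

```toml
# Hypothetical minimal entity; section names are illustrative, follow the template.
[entities.springfield_city_council]
entity = "springfield_city_council"  # unique identifier, used with the -e flag

[entities.springfield_city_council.sources.youtube]
youtube_path = "https://www.youtube.com/@SpringfieldCityCouncil/streams"
```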
### 3. Run the setup script

```bash
chmod +x setup.sh
./setup.sh
```

- Installs Python dependencies
- Downloads NLP models (spaCy, etc.)
- Checks for GPU (CUDA/MPS) and warns if only CPU is available

### 4. Activate the virtual environment

```bash
source .venv/bin/activate
```

Python ≥3.10 recommended. CPU works, but GPU/MPS accelerates transcription.
### 5. Generate entity configs

```bash
python yatsee_build_config.py --create
```

- Uses the entity info in `yatsee.toml` to:
  - Create the main data directory (default `./data`) for the pipeline
  - Initialize per-entity pipeline configs
  - Set up any optional data structures (titles, people, replacements, etc.)
- Important: Run this after `setup.sh` and after adding at least one entity.
### 6. Run the pipeline

See the Script Summary below for the individual commands. The pipeline:

- Processes audio/video in `downloads/`
- Converts to `.flac`/`.wav` in `audio/`
- Generates transcripts, normalizes text, and produces summaries
- All scripts are modular: you can run them individually or as a pipeline

### 7. Launch the search demo

```bash
streamlit run yatsee_search_demo.py -- -e entity_name_configured
```

- Provides semantic and structured search over transcripts and summaries

Notes:

- `entity` is a unique key identifier for all scripts. Keep it consistent.
- Each pipeline stage ensures its output directories exist; do not create them manually.
- Optional: You can edit additional pipeline settings (like per-entity hotwords or divisions) in the generated config.
## Pipeline Overview

A modular pipeline for extracting, enriching, and summarizing civic meeting audio data:

- `downloads/` → raw video/audio
- `audio/` → converted `.wav` or `.flac`
- `transcripts_<model>/` → `.vtt` + flat `.txt`
- `normalized/` → cleaned, structured `.txt`
- `summary/` → `.md` or `.yaml` summaries
YATSEE is designed as a collection of independent tools. While they work best as a unified pipeline, each script can be run standalone as long as the input data matches the Interface Contract.
#### Download audio

- Script: `yatsee_download_audio.py`
- Input: YouTube URL (bestaudio)
- Output: `.mp4` or `.webm` to `downloads/`
- Tool: `yt-dlp`
- Purpose: Archive livestream audio for local processing
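Conceptually, this stage performs the equivalent of the following `yt-dlp` call (the URL and output template are illustrative, not the script's exact naming scheme):

```bash
# Fetch the best available audio-only stream into downloads/
yt-dlp -f bestaudio -o 'downloads/%(title)s.%(ext)s' \
  'https://www.youtube.com/watch?v=VIDEO_ID'
```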
#### Convert audio

- Script: `yatsee_format_audio.py`
- Input: `.mp4` or `.webm` from `downloads/`
- Output: `.wav` or `.flac` to `audio/`
- Tool: `ffmpeg`
- Format settings:
  - WAV: `-ar 16000 -ac 1 -sample_fmt s16 -c:a pcm_s16le`
  - FLAC: `-ar 16000 -ac 1 -sample_fmt s16 -c:a flac`
- Notes:
  - Supports chunked output for long audio
  - Optional overlap between chunks to prevent cutting phrases
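Using the settings above, the FLAC conversion is equivalent to an `ffmpeg` invocation like this (filenames are illustrative):

```bash
# 16 kHz mono, 16-bit samples, FLAC codec (matches the format settings above)
ffmpeg -i downloads/meeting.mp4 \
  -ar 16000 -ac 1 -sample_fmt s16 -c:a flac \
  audio/meeting.flac
```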
#### Transcribe audio

- Script: `yatsee_transcribe_audio.py`
- Input: `.flac` from `audio/`
- Output: `.vtt` to `transcripts_<model>/`
- Tool: `whisper` or `faster-whisper`
- Notes:
  - Supports stitching chunked audio back into a single transcript
  - Accepts model selection: `small`, `medium`, `large`, etc.
  - `faster-whisper` improves performance if installed
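For reference, the stock `whisper` CLI can produce the same kind of `.vtt` output directly; the pipeline script's own flags may differ:

```bash
# Transcribe one file with the medium model, writing WebVTT output
whisper audio/meeting.flac --model medium \
  --output_format vtt --output_dir transcripts_medium/
```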
#### Slice VTT to flat text

- Script: `yatsee_slice_vtt.py`
- Input: `.vtt` from `transcripts_<model>/`
- Output: `.txt` (timestamp-free) to the same folder
#### Slice VTT to JSONL

- Script: `yatsee_slice_vtt.py`
- Input: `.vtt` from `transcripts_<model>/`
- Output: `.jsonl` to the same folder
- Purpose: JSONL segments (sliced transcript for embeddings/search)
#### Normalize text

- Script: `yatsee_normalize_structure.py`
- Input: `.punct.txt` from `normalized/`
- Output: `.norm.txt` to `normalized/`
- Tool: `spaCy`
- Purpose: Segment text into readable sentences and normalize punctuation/spacing
#### Summarize transcripts

- Script: `yatsee_summarize_transcripts.py`
- Input: `.txt` from `normalized/`
- Output: `.md` or `.yaml` to `summary/`
- Tool: `ollama`
- Notes:
  - Supports short and long-form summaries
  - Optional YAML output (e.g., vote logs, action items, discussion summaries)
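Summarization relies on a locally running `ollama` server. A minimal sketch of that interaction over its HTTP API (the model name is an assumption; use whatever model you have pulled):

```bash
# Ask a local ollama instance for a one-off summary (default port 11434)
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize the key votes in this transcript: ...",
  "stream": false
}'
```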
All scripts are modular and can be run independently or as part of an automated workflow.
#### Index data

- Script: `yatsee_index_data.py`
- Input: `.txt` from `normalized/`
- Input: `.md` from `summary/`
- Output: embeddings to `yatsee_db/`
- Tool: `ChromaDB`
- Notes:
  - Generates embeddings from raw transcripts and summaries into a searchable civic intelligence database
  - Vector search (semantic): uses ChromaDB with the `BAAI/bge-small-en-v1.5` model to allow fuzzy, concept-based queries (e.g., "Find discussions about road repairs")
#### Search demo

- Script: `yatsee_search_demo.py`
- Input: `.txt` from `normalized/`
- Input: `.md` from `summary/`
- Input: embeddings from `yatsee_db/`
- Tool: `Streamlit` and `ChromaDB`
- Notes:
  - UI: A simple web interface built with Streamlit to provide an overview of the generated transcripts and summaries
  - Planned graph search (relational): extract structured data (votes, contracts, appointments) into a knowledge graph to trace connections between people and money
## Directory Layout

```
data/
└── <entity_handle>/
    ├── downloads/             ← Raw input (audio/video)
    ├── audio/                 ← Converted 16kHz mono files
    ├── transcripts_<model>/   ← VTTs + initial flat .txt files
    ├── normalized/            ← Cleaned + structured output (spaCy)
    ├── summary/               ← Generated meeting summaries (.md/.yaml)
    ├── yatsee_db/             ← Vector database files (ChromaDB)
    ├── prompts/               ← Optional default prompt overrides (created by user)
    └── conf.toml              ← Localized entity config
```
## Config Hierarchy

```
Global TOML
  |
  +--> Entity handle
         |
         +--> Local config (hotwords, divisions, data_path)
                |
                +--> Pipeline stage (downloads, audio, transcripts)
```
## Prompt Overrides

```
./prompts/                      # default prompts for all entities
└── research/
    └── prompts.toml            # default prompts & routing for 'research' job type

./data/
└── defined_entity/             # entity-specific data
    └── prompts/
        └── research/
            └── prompts.toml    # full override for defined_entity 'research' job type

./data/
└── generic_entity/             # another entity with no override
    └── prompts/
        └── research/
                                # no file, falls back to default in prompts/research/prompts.toml
```
**Behavior**:
- Loader first checks `data/<entity>/prompts/<job_type>/prompts.toml`.
- If found → full override of defaults.
- If not found → fall back to `prompts/<job_type>/prompts.toml`.
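The same lookup can be expressed as a small shell sketch (a minimal illustration, assuming the paths shown in the trees above):

```bash
# Resolve the prompts file for one entity/job type, mirroring the loader's fallback
entity="defined_entity"
job_type="research"
override="data/${entity}/prompts/${job_type}/prompts.toml"
if [[ -f "$override" ]]; then
  prompts_file="$override"                           # full override of the defaults
else
  prompts_file="prompts/${job_type}/prompts.toml"    # fall back to the defaults
fi
echo "Using: $prompts_file"
```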
## Tested Hardware

This pipeline was developed and tested on the following setup:
- CPU: Intel Core i7-10750H (6 cores / 12 threads, up to 5.0 GHz)
- RAM: 32 GB DDR4
- GPU: NVIDIA GeForce RTX 2060 (6 GB VRAM, CUDA 12.8)
- Storage: NVMe SSD
- OS: Fedora Linux
- Shell: Bash
- Python: 3.10 or newer
Additional testing was performed on Apple Silicon (macOS):
- Model: Mac Mini (M4 Base)
- CPU: Apple M4 (10 cores / 4 performance cores, up to 120 GB/s memory bandwidth)
- RAM: 16 GB
- Storage: NVMe SSD
- OS: macOS Sonoma / Sequoia
- Shell: ZSH
- Python: 3.9 or newer
GPU acceleration was enabled for Whisper / faster-whisper using CUDA 12.8 and NVIDIA driver 570.144 on Linux. Note that faster-whisper has limited or no support for MPS.

Note: Audio transcription was much slower on the Mac than on the Linux machine; it works, but expect significantly longer runtimes.

Note: The pipeline works on CPU-only systems without a GPU. However, transcription (especially with Whisper or faster-whisper) will be much slower than on systems with CUDA or MPS acceleration.

⚠️ Not tested on Windows. Use at your own risk on Windows platforms.
## Manual Installation (if not using setup.sh)

If you cannot use the setup script, ensure you have `ffmpeg` and `yt-dlp` installed via your package manager, then install the Python requirements.

External tools:

- `yt-dlp` – Download livestream audio from YouTube
- `ffmpeg` – Convert audio to `.flac` or `.wav` format

Python dependencies:

- `toml` – Needed for reading the TOML config
- `requests` – Needed for interacting with the ollama API, if installed
- `torch` – Required for Whisper and model inference (with or without CUDA)
- `pyyaml` – YAML output support (for summaries)
- `whisper` – Audio transcription (standard)
- `spacy` – Sentence segmentation + text cleanup
  - Model: `en_core_web_sm` (or larger)
- `faster-whisper` – Audio transcription (optional)
- `ollama` – Run local LLMs for summarization
macOS (Homebrew):

```bash
brew install yt-dlp ffmpeg
```

Fedora:

```bash
sudo dnf install yt-dlp ffmpeg
```

Debian/Ubuntu:

```bash
sudo apt-get update
sudo apt-get install ffmpeg
sudo curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -o /usr/local/bin/yt-dlp
sudo chmod a+rx /usr/local/bin/yt-dlp
```

You can use pip to install the core requirements:
```bash
pip install torch torchaudio tqdm
pip install --upgrade git+https://github.com/openai/whisper.git
pip install toml pyyaml spacy
python -m spacy download en_core_web_sm
```

Optionally, install faster-whisper:

```bash
pip install faster-whisper  # optional, for better performance
```

On first run, Whisper will download a model (e.g., `base`, `medium`). Ensure you have enough RAM.
Ollama is used for generating Markdown or YAML summaries from transcripts. Install:

```bash
pip install requests
curl -fsSL https://ollama.com/install.sh | sh
```

See https://ollama.com for supported models and system requirements.
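Once Ollama is installed, pull a model before running the summarizer, for example (model choice is up to you):

```bash
ollama pull llama3.2          # download a local model
ollama run llama3.2 "hello"   # quick smoke test
```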
## Script Summary

Run each script in sequence or independently as needed. All scripts accept the `-e` (entity) flag to route data to the correct folders defined in `yatsee.toml`.
| Script | Purpose |
|---|---|
| `python3 yatsee_download_audio.py -e <entity>` | Download audio from YouTube URLs |
| `python3 yatsee_format_audio.py -e <entity>` | Convert downloaded files to `.flac` or `.wav` |
| `python3 yatsee_transcribe_audio.py -e <entity>` | Transcribe audio files to `.vtt` |
| `python3 yatsee_slice_vtt.py -e <entity>` | Slice and segment `.vtt` files |
| `python3 yatsee_normalize_structure.py -e <entity>` | Clean and normalize text structure |
| `python3 yatsee_summarize_transcripts.py -e <entity>` | Generate summaries from cleaned transcripts |
| `python3 yatsee_index_data.py -e <entity>` | Vectorize and index embeddings |
| `streamlit run yatsee_search_demo.py -- -e <entity>` | Search summaries and transcripts |