
YATSEE Audio Transcription Pipeline

YATSEE -- Yet Another Tool for Speech Extraction & Enrichment

YATSEE is a local-first, end-to-end data pipeline designed to systematically refine raw meeting audio into a clean, searchable, and auditable intelligence layer. It automates the tedious work of downloading, transcribing, and normalizing unstructured conversations.

It is a privacy-respecting toolkit for anyone who wants to turn public noise into actionable intelligence.

Why This Exists

Public records are often public in name only. Civic business is frequently buried in four-hour livestreams and jargon-filled transcripts that are technically accessible but functionally opaque. For an interested citizen, the barrier to entry is hours of viewing time and a working knowledge of procedural jargon.

YATSEE solves that by using a carefully tuned local LLM to transform that wall of text into a high-signal summary—extracting the specific votes, contracts, and policy debates that matter. It's a tool for creating the clarity and accountability that modern civic discourse requires, with or without the government's help.


🚀 Quick Start (Plug-and-Play)

Follow these steps to get YATSEE running.


1. Clone the Repository

git clone https://github.com/alias454/yatsee.git
cd yatsee

2. Edit Initial Configuration

# Copy the template to create your local config file
cp yatsee.conf yatsee.toml
  • Open yatsee.toml in any text editor.
  • Add at least one entity with the required fields (a minimal sketch follows):
    • entity (unique identifier)
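
As a purely illustrative sketch, assuming entities live under their own TOML tables (the commented template in yatsee.conf documents the real schema), an entity block might look like:

# Hypothetical layout; see the comments in yatsee.conf for the actual schema
[entities.example_city_council]
entity = "example_city_council"   # unique identifier used by every script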

3. Run the Setup Script

chmod +x setup.sh
./setup.sh
  • Installs Python dependencies
  • Downloads NLP models (spaCy, etc.)
  • Checks for GPU (CUDA/MPS) and warns if only CPU is available

4. Activate the Python Environment

source .venv/bin/activate

Python ≥3.10 recommended. CPU works, but GPU/MPS accelerates transcription.


5. Run the Config Builder

python yatsee_build_config.py --create
  • Uses the entity info in yatsee.toml to:
    • Create the main data directory (default ./data) for the pipeline
    • Initialize per-entity pipeline configs, including:
      • sources.youtube.youtube_path (YouTube channel/playlist)
      • Any optional data structures like titles, people, replacements, etc.
  • This is the minimum viable entity setup needed for the downloader.
  • Important: Run this after setup.sh and after adding at least one entity.

6. Run the Processing Pipeline

See the Script Summary below; a typical end-to-end sequence is shown after this list.
  • Processes audio/video in downloads/
  • Converts to .flac/.wav in audio/
  • Generates transcripts, normalizes text, and produces summaries
  • All scripts are modular: you can run them individually or as a pipeline
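
A typical end-to-end run over a configured entity; the entity handle below is illustrative:

python3 yatsee_download_audio.py -e example_city_council
python3 yatsee_format_audio.py -e example_city_council
python3 yatsee_transcribe_audio.py -e example_city_council
python3 yatsee_slice_vtt.py -e example_city_council
python3 yatsee_normalize_structure.py -e example_city_council
python3 yatsee_summarize_transcripts.py -e example_city_council
python3 yatsee_index_data.py -e example_city_council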

7. Launch the Demo Search UI

streamlit run yatsee_search_demo.py -- -e entity_name_configured
  • Provides semantic and structured search over transcripts and summaries

⚠️ Notes for New Users

  • entity is a unique key identifier for all scripts. Keep it consistent.
  • Each pipeline stage ensures directories exist for output; do not manually create them.
  • Optional: You can edit additional pipeline settings (like per-entity hotwords or divisions) in the generated config.

📌 YATSEE Audio Pipeline Overview

A modular pipeline for extracting, enriching, and summarizing civic meeting audio data.

Pipeline Flow Overview

  1. downloads/ → raw video/audio
  2. audio/ → converted .wav or .flac
transcripts_<model>/ → .vtt + flat .txt
  4. normalized/ → cleaned, structured .txt
  5. summary/ → .md or .yaml summaries

YATSEE is designed as a collection of independent tools. While they work best as a unified pipeline, each script can be run standalone as long as the input data matches the Interface Contract.

1. Automated Download

  • Script: yatsee_download_audio.py
  • Input: YouTube URL (bestaudio)
  • Output: .mp4 or .webm to downloads/
  • Tool: yt-dlp
    • Purpose: Archive livestream audio for local processing

2. Convert Audio

  • Script: yatsee_format_audio.py
  • Input: .mp4 or .webm from downloads/
  • Output: .wav or .flac to audio/
  • Tool: ffmpeg
  • Format Settings (equivalent standalone ffmpeg commands are shown after this list):
    • WAV: -ar 16000 -ac 1 -sample_fmt s16 -c:a pcm_s16le
    • FLAC: -ar 16000 -ac 1 -sample_fmt s16 -c:a flac
  • Notes:
    • Supports chunked output for long audio
    • Optional overlap between chunks to prevent cutting phrases
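
For reference, the format settings above correspond roughly to these standalone ffmpeg invocations (input/output names are illustrative; the script also handles chunking and naming):

ffmpeg -i downloads/meeting.webm -ar 16000 -ac 1 -sample_fmt s16 -c:a pcm_s16le audio/meeting.wav
ffmpeg -i downloads/meeting.webm -ar 16000 -ac 1 -sample_fmt s16 -c:a flac audio/meeting.flac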

3. Transcribe Audio

  • Script: yatsee_transcribe_audio.py
  • Input: .flac from audio/
  • Output: .vtt to transcripts_<model>/
  • Tool: whisper or faster-whisper
  • Notes:
    • Supports stitching chunked audio back into a single transcript
    • Accepts model selection: small, medium, large, etc.
    • Faster-whisper improves performance if installed (a minimal usage sketch follows this list)
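
A minimal sketch of the faster-whisper path, assuming CUDA is available; the model size, device, and file path are illustrative, and the script reads its own settings from the entity config:

from faster_whisper import WhisperModel

# Illustrative settings; device="cpu" with compute_type="int8" also works
model = WhisperModel("medium", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio/meeting.flac")
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")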

4. Clean and Normalize

a. Strip Timestamps

  • Script: yatsee_slice_vtt.py
  • Input: .vtt from transcripts_<model>/
  • Output: .txt (timestamp-free) to same folder

b. JSONL Segmentation and Text Compression

  • Script: yatsee_slice_vtt.py
  • Input: .vtt from transcripts_<model>/
  • Output: .jsonl to same folder
    • Purpose: JSONL segments (sliced transcript for embeddings/search); a hypothetical record is shown below.
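
As a purely hypothetical illustration (the script defines the real schema), each JSONL line might carry a segment's timing and text:

{"start": "00:01:15.000", "end": "00:01:22.500", "text": "The motion passes five to two."}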

c. Text Normalization

  • Script: yatsee_normalize_structure.py
  • Input: .punct.txt from normalized/
  • Output: .norm.txt to normalized/
  • Tool: spaCy
    • Purpose: Segment text into readable sentences and normalize punctuation/spacing (a minimal spaCy sketch follows this list).
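
A minimal sketch of the spaCy pass, assuming en_core_web_sm; the real script layers punctuation and spacing normalization on top of sentence segmentation:

import spacy

nlp = spacy.load("en_core_web_sm")
text = open("normalized/meeting.punct.txt").read()  # illustrative path
doc = nlp(text)
for sent in doc.sents:
    print(sent.text.strip())  # one cleaned sentence per line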

5. Summarize Transcripts using AI (Local LLM)

  • Script: yatsee_summarize_transcripts.py
  • Input: .txt from normalized/
  • Output: .md or .yaml to summary/
  • Tool: ollama
  • Notes:
    • Supports short and long-form summaries
    • Optional YAML output (e.g., vote logs, action items, discussion summaries); a sketch of the Ollama API call follows.
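
A minimal sketch of how a summary request to a local model might look through Ollama's HTTP API (the requests package is listed under Required Python Packages); the model name, prompt, and path are illustrative:

import requests

transcript = open("normalized/meeting.norm.txt").read()  # illustrative path

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",  # any model pulled into ollama
        "prompt": "Summarize the key votes and action items:\n\n" + transcript,
        "stream": False,
    },
)
print(resp.json()["response"])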

All scripts are modular and can be run independently or as part of an automated workflow.

Index and Search (In Development)

Index Data

  • Script: yatsee_index_data.py
  • Input: .txt from normalized/
  • Input: .md from summary/
  • Output: embeddings to yatsee_db/
  • Tool: ChromaDB
  • Notes:
    • Generates embeddings from raw transcripts and summaries, building a searchable civic intelligence database.
    • Vector Search (Semantic): Uses ChromaDB with the BAAI/bge-small-en-v1.5 model to allow fuzzy, concept-based queries (e.g., "Find discussions about road repairs"). A minimal indexing sketch follows this list.
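
A minimal sketch of the indexing approach, assuming ChromaDB's persistent client and the BAAI/bge-small-en-v1.5 model loaded via sentence-transformers; the collection name, path, and chunk contents are illustrative:

import chromadb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
client = chromadb.PersistentClient(path="data/example_city_council/yatsee_db")
collection = client.get_or_create_collection("transcripts")

chunks = ["Discussion of the road repair contract ...", "Vote on the budget amendment ..."]
collection.add(
    ids=[f"seg-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=encoder.encode(chunks).tolist(),
)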

Search (In Development)

  • Script: yatsee_search_demo.py
  • Input: .txt from normalized/
  • Input: .md from summary/
  • Input: embeddings from yatsee_db/
  • Tool: Streamlit and ChromaDB
  • Notes:
    • UI: A simple web interface built with Streamlit to provide an overview of the generated transcripts and summaries.
    • Planned Graph Search (Relational): Extract structured data (Votes, Contracts, Appointments) into a knowledge graph to trace connections between people and money. A sketch of the existing vector search follows this list.
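
Semantic queries embed the question with the same model and search the collection; a minimal sketch continuing the indexing example above:

# Fuzzy, concept-based lookup (e.g., "Find discussions about road repairs")
query_vec = encoder.encode(["discussions about road repairs"]).tolist()
results = collection.query(query_embeddings=query_vec, n_results=5)
print(results["documents"][0])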

📁 Filesystem Layout

data/
└── <entity_handle>/
    ├── downloads/                ← Raw input (audio/video)
    ├── audio/                    ← Converted 16kHz mono files
    ├── transcripts_<model>/      ← VTTs + initial flat .txt files
    ├── normalized/               ← Cleaned + structured output (spaCy)
    ├── summary/                  ← Generated meeting summaries (.md/.yaml)
    ├── yatsee_db/                ← Vector database files (ChromaDB)
    ├── prompts/                  ← Optional default prompt overrides (created by user)
    └── conf.toml                 ← Localized entity config

Config file routing/load order

Global TOML
    |
    +--> Entity handle
            |
            +--> Local config (hotwords, divisions, data_path)
                    |
                    +--> Pipeline stage (downloads, audio, transcripts)

Prompt override layout example:

./prompts/                      # default prompts for all entities
  └── research/
      └── prompts.toml          # default prompts & routing for 'research' job type

./data/
  └── defined_entity/           # entity-specific data
      └── prompts/
          └── research/
              └── prompts.toml  # full override for defined_entity 'research' job type

./data/
  └── generic_entity/           # another entity with no override
      └── prompts/
          └── research/
              # no file, falls back to default in prompts/research/prompts.toml

**Behavior** (a minimal loader sketch follows):

  - The loader first checks `data/<entity>/prompts/<job_type>/prompts.toml`.
  - If found → full override of the defaults.
  - If not found → fall back to `prompts/<job_type>/prompts.toml`.
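
A minimal sketch of that fallback, assuming a pathlib-based loader; the function name is illustrative:

from pathlib import Path

def resolve_prompts(entity: str, job_type: str, data_dir: Path = Path("data")) -> Path:
    """Return the entity override if it exists, else the shared default."""
    override = data_dir / entity / "prompts" / job_type / "prompts.toml"
    default = Path("prompts") / job_type / "prompts.toml"
    return override if override.exists() else default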

⚙️ Requirements & Setup

This pipeline was developed and tested on the following setup:

  • CPU: Intel Core i7-10750H (6 cores / 12 threads, up to 5.0 GHz)
  • RAM: 32 GB DDR4
  • GPU: NVIDIA GeForce RTX 2060 (6 GB VRAM, CUDA 12.8)
  • Storage: NVMe SSD
  • OS: Fedora Linux
  • Shell: Bash
  • Python: 3.10 or newer

Additional testing was performed on Apple Silicon (macOS):

  • Model: Mac Mini (M4 Base)
  • CPU: Apple M4 (10 cores: 4 performance + 6 efficiency; up to 120 GB/s memory bandwidth)
  • RAM: 16 GB
  • Storage: NVMe SSD
  • OS: macOS Sonoma / Sequoia
  • Shell: ZSH
  • Python: 3.9 or newer

GPU acceleration was enabled for Whisper / faster-whisper using CUDA 12.8 and NVIDIA driver 570.144 on Linux. Note that faster-whisper has limited or no support for MPS.

Note: Audio transcription was noticeably slower on the Mac than on Linux; it works, but expect longer run times.

Note: The pipeline works on CPU-only systems without a GPU. However, transcription (especially with Whisper or faster-whisper) will be much slower compared to systems with CUDA-enabled GPU acceleration or MPS.

⚠️ Not tested on Windows. Use at your own risk on Windows platforms.


Manual Installation (If not using setup.sh)

If you cannot use the setup script, ensure you have ffmpeg and yt-dlp installed via your package manager, then install the Python requirements:

Required CLI Tools

  • yt-dlp – Download livestream audio from YouTube
  • ffmpeg – Convert audio to .flac or .wav format

Required Python Packages

  • toml – Read the TOML config files
  • requests – Interact with the ollama API, if installed
  • torch – Required for Whisper and model inference (with or without CUDA)
  • pyyaml – YAML output support (for summaries)
  • whisper – Audio transcription (standard)
  • spacy – Sentence segmentation + text cleanup
    • Model: en_core_web_sm (or larger)

Optional:

  • faster-whisper – Audio transcription (optional)

Optional: Summarization with Local LLM

  • ollama – Run local LLMs for summarization

Install CLI tools for your platform

macOS (Homebrew):

brew install yt-dlp ffmpeg

Fedora:

sudo dnf install yt-dlp ffmpeg

Debian/Ubuntu:

sudo apt-get update
sudo apt-get install ffmpeg
sudo curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -o /usr/local/bin/yt-dlp
sudo chmod a+rx /usr/local/bin/yt-dlp

Install Python Packages

You can use pip to install the core requirements:

Install:

pip install torch torchaudio tqdm
pip install --upgrade git+https://github.com/openai/whisper.git

pip install toml pyyaml spacy
python -m spacy download en_core_web_sm

Install faster-whisper (Optional for faster transcription)

Install:

pip install faster-whisper  # optional, for better performance

On first run, it will download a model (e.g., base, medium). Ensure you have enough RAM.

Install ollama (Optional)

Used for generating markdown or YAML summaries from transcripts.

Install:

pip install requests

curl -fsSL https://ollama.com/install.sh | sh

See https://ollama.com for supported models and system requirements.


🚀 Running the Pipeline

Run each script in sequence or independently as needed. All scripts accept the -e (entity) flag to route data to the correct folders defined in yatsee.toml.

📄 Script Summary

| Script | Purpose |
| --- | --- |
| python3 yatsee_download_audio.py -e <entity> | Download audio from YouTube URLs |
| python3 yatsee_format_audio.py -e <entity> | Convert downloaded files to .flac or .wav |
| python3 yatsee_transcribe_audio.py -e <entity> | Transcribe audio files to .vtt |
| python3 yatsee_slice_vtt.py -e <entity> | Slice and segment .vtt files |
| python3 yatsee_normalize_structure.py -e <entity> | Clean and normalize text structure |
| python3 yatsee_summarize_transcripts.py -e <entity> | Generate summaries from cleaned transcripts |
| python3 yatsee_index_data.py -e <entity> | Vectorize and index embeddings |
| streamlit run yatsee_search_demo.py -- -e <entity> | Search summaries and transcripts |
