forked from emcf/thepipe

Feed PDFs, URLs, Slides, YouTube, and more into Vision-Language models with one line of code⚡

skyler14/thepipe

 
 

Pipeline Illustration

Extract clean data from anything → Feed it to any LLM


What is thepipe?

thepipe extracts clean markdown, images, and structured data from complex sources — PDFs, URLs, codebases, databases, and more. It works out-of-the-box with any LLM, VLM, or RAG pipeline.

Key Features

| Feature | Description |
| --- | --- |
| 📄 Universal Extraction | PDFs, DOCX, PPTX, images, audio, video, Jupyter notebooks, spreadsheets |
| 🌐 Web & Cloud | URLs, GitHub repos, YouTube transcription, Google Drive |
| 💾 Databases | PostgreSQL, MySQL, MariaDB, SQLite, DuckDB, JDBC URLs |
| 📊 Data Formats | Parquet, ORC, Feather/Arrow, CSV, JSONL, Excel |
| 🔍 Code Analysis | 90%+ token savings with intelligent digests & dependency mapping |
| 🤖 Agent Mode | Seamlessly integrate with AI coding assistants via named pipes |

Quick Start

pip install thepipe-api

Basic Usage

from thepipe.scraper import scrape_file
from thepipe.core import chunks_to_messages

# Extract from any source
chunks = scrape_file("document.pdf")

# Ready for any LLM
messages = chunks_to_messages(chunks)

CLI Usage

# Scrape a file
thepipe document.pdf -f

# Scrape a URL
thepipe https://example.com -f

# Scrape a codebase with intelligent analysis
thepipe ./my-project --options '{"code_relations": "auto"}' -f

# Query a database
thepipe "postgresql://user:pass@host/db" --db "SELECT * FROM users" -f
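
When the CLI is driven from a script, hand-quoting the JSON passed to --options is easy to get wrong. The following is a minimal sketch that assumes only the flags shown above (--options and -f); the helper name build_thepipe_command is hypothetical and not part of thepipe.

```python
import json
import shlex


def build_thepipe_command(source, options=None):
    """Build an argument list for the thepipe CLI (hypothetical helper)."""
    cmd = ["thepipe", source]
    if options:
        # json.dumps produces valid JSON, so no manual quote escaping is needed
        cmd += ["--options", json.dumps(options)]
    cmd.append("-f")
    return cmd


cmd = build_thepipe_command("./my-project", {"code_relations": "auto"})
# shlex.join renders the list as a shell-safe command line for logging
print(shlex.join(cmd))
```

Passing the list directly to subprocess.run(cmd, check=True) avoids the shell, so the JSON never needs shell quoting at all.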

🔥 Code Analysis (90%+ Token Savings)

For codebases, use code_relations mode for intelligent digests:

thepipe ./repo --options '{"code_relations": "auto"}' -f

Benefits:

  • 🔗 Dependency mapping across imports
  • 🏷️ Semantic tagging (auth, database, API, testing)
  • 📊 Full codebase context in minimal tokens
  • 🌍 Supports Python, JS/TS, Dart, Swift, Kotlin, Ruby, Go, Rust, C/C++, Java, +155 more
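
To make "dependency mapping across imports" concrete, here is a rough sketch of the general idea for Python sources, using only the standard-library ast module. It is not thepipe's implementation; the function local_imports and the toy files dictionary are illustrative assumptions.

```python
import ast


def local_imports(source, known_modules):
    """Return the modules from known_modules that `source` imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "pkg.sub" -> "pkg"
            if root in known_modules:
                found.add(root)
    return sorted(found)


# Toy in-memory "repository": module name -> source text
files = {
    "app": "import db\nfrom auth import login\nimport os\n",
    "auth": "import db\n",
    "db": "import sqlite3\n",
}
graph = {name: local_imports(src, set(files)) for name, src in files.items()}
print(graph)
```

A digest built from such a graph can list each file once with its local dependencies, instead of including every file's full contents.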

💾 Database Support

Connection Formats

# PostgreSQL
thepipe "postgresql://user:pass@host:5432/db" --db "SELECT * FROM table"

# MySQL / MariaDB
thepipe "mysql://user:pass@host:3306/db" --db "SELECT * FROM table"

# JDBC URLs (auto-converted)
thepipe "jdbc:mysql://host:3306/db" --db "SELECT * FROM table"

# SQLite
thepipe "sqlite:///path/to/database.db" --db "SELECT * FROM table"

# DuckDB
thepipe "duckdb:///analytics.duckdb" --db "SELECT * FROM table"
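
The "auto-converted" note for JDBC URLs can be illustrated with a small sketch. It assumes the conversion amounts to dropping the leading jdbc: prefix so the remainder parses like the other connection strings; normalize_connection_url is a hypothetical name, and thepipe's actual conversion may do more.

```python
from urllib.parse import urlsplit


def normalize_connection_url(url):
    """Strip a leading 'jdbc:' prefix (assumed conversion, for illustration)."""
    if url.lower().startswith("jdbc:"):
        url = url[len("jdbc:"):]
    return url


converted = normalize_connection_url("jdbc:mysql://host:3306/db")
parts = urlsplit(converted)
print(converted)
print(parts.scheme, parts.hostname, parts.port, parts.path)
```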

Data File Formats

| Format | Extensions | Backend |
| --- | --- | --- |
| Parquet | .parquet, .parq | DuckDB |
| ORC | .orc | DuckDB |
| Feather/Arrow | .feather, .arrow, .ipc | DuckDB |
| JSON Lines | .jsonl, .ndjson | DuckDB |
| CSV | .csv | DuckDB |
| Excel | .xlsx, .xls | Pandas → DuckDB |

# Query data files directly
thepipe data.parquet --db "SELECT * FROM parquet_data LIMIT 10"
thepipe logs.jsonl --db "SELECT * FROM jsonl_data WHERE level = 'error'"

🔌 Named Pipe (FIFO) Input

thepipe accepts named pipes as input sources — useful for streaming data:

# Create a FIFO
mkfifo /tmp/my_pipe

# thepipe reads from it (blocks until data arrives)
thepipe /tmp/my_pipe -f &

# Write data to the pipe
echo '{"key": "value"}' > /tmp/my_pipe

Content type is auto-detected via Magika.
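
The same FIFO round trip can be sketched in Python with the standard library (POSIX only). This illustrates how a reader can detect a named pipe and block until a writer supplies data; it is not thepipe's own code, and is_fifo and read_fifo are hypothetical helper names.

```python
import os
import stat
import tempfile
import threading


def is_fifo(path):
    """True if path exists and is a named pipe."""
    return stat.S_ISFIFO(os.stat(path).st_mode)


def read_fifo(path):
    """Block until a writer opens the FIFO, then read until it closes."""
    with open(path, "r") as pipe:
        return pipe.read()


tmp_dir = tempfile.mkdtemp()
fifo_path = os.path.join(tmp_dir, "my_pipe")
os.mkfifo(fifo_path)
detected = is_fifo(fifo_path)


def writer():
    # Opening for write blocks until the reader has opened the other end
    with open(fifo_path, "w") as pipe:
        pipe.write('{"key": "value"}\n')


thread = threading.Thread(target=writer)
thread.start()
data = read_fifo(fifo_path)
thread.join()

# Clean up the FIFO and its temporary directory
os.remove(fifo_path)
os.rmdir(tmp_dir)
print(detected, data)
```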


🤖 Agent Mode (LLM Inference Delegation)

When running inside an AI coding assistant, thepipe can delegate LLM calls back to the host agent:

thepipe document.pdf --options '{"llm_provider": "agent"}' -f

How it works:

  1. thepipe creates named pipes in /tmp/thepipe_pipes/.
  2. thepipe outputs each query wrapped in <<<THEPIPE_LLM_QUERY>>> markers.
  3. The host agent reads the query, runs the LLM call, and writes the response back.
  4. thepipe reads the response and continues processing.

This avoids double API charges when running inside Antigravity, Claude Code, or similar tools.
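
On the agent side, the marker step can be sketched as a small parser that pulls queries out of thepipe's output. Only the <<<THEPIPE_LLM_QUERY>>> marker comes from the description above; the closing marker, the newline framing, and the function names are assumptions made for illustration, so check the actual output format before relying on them.

```python
import re

QUERY_START = "<<<THEPIPE_LLM_QUERY>>>"      # marker named in this README
QUERY_END = "<<<END_THEPIPE_LLM_QUERY>>>"    # hypothetical closing marker


def wrap_query(prompt):
    """Frame a prompt between the start and (assumed) end markers."""
    return f"{QUERY_START}\n{prompt}\n{QUERY_END}"


def extract_queries(output):
    """Return every marker-delimited query found in the output text."""
    pattern = re.escape(QUERY_START) + r"\n(.*?)\n" + re.escape(QUERY_END)
    return re.findall(pattern, output, flags=re.DOTALL)


stream = "scraping page 1...\n" + wrap_query("Describe this image.") + "\ndone\n"
queries = extract_queries(stream)
print(queries)
```

Explicit markers let the agent separate LLM queries from ordinary progress output on the same stream.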


Supported Sources

| Source | Input Types | Multimodal |
| --- | --- | --- |
| Documents | .pdf, .docx, .pptx, .txt, .md | ✔️ |
| Spreadsheets | .csv, .xlsx, .xls | |
| Images | .jpg, .png, .gif | ✔️ |
| Audio/Video | .mp3, .wav, .mp4, .mov | ✔️ |
| Code | .py, .js, .ts, .java, +155 more | |
| Notebooks | .ipynb | ✔️ |
| Archives | .zip | ✔️ |
| Web | http://, https:// | ✔️ |
| GitHub | github.com/user/repo | ✔️ |
| YouTube | youtube.com/watch?v=... | ✔️ |
| Databases | SQL connection strings | |
| Data Files | .parquet, .orc, .feather, .jsonl | |
| Named Pipes | FIFOs (auto-detected) | ✔️ |

LLM Integration

OpenAI

from openai import OpenAI
from thepipe.scraper import scrape_file
from thepipe.core import chunks_to_messages

client = OpenAI()
chunks = scrape_file("document.pdf")
messages = [{"role": "user", "content": "Summarize this document:"}]
messages += chunks_to_messages(chunks)

response = client.chat.completions.create(model="gpt-4o", messages=messages)

LlamaIndex

from thepipe.scraper import scrape_file

chunks = scrape_file("document.pdf")
documents = [chunk.to_llamaindex() for chunk in chunks]

Environment Variables

# OpenAI / VLM
export OPENAI_API_KEY=sk-...
export DEFAULT_AI_MODEL=gpt-4o

# GitHub (for repo scraping)
export GITHUB_TOKEN=ghp_...

# Audio transcription limit (seconds)
export MAX_WHISPER_DURATION=600

# Image hosting
export HOST_IMAGES=true
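
For reference, a script could read these variables as sketched below. The fallback values (gpt-4o, 600, false) are assumptions taken from the example values above, not documented defaults, and load_settings is a hypothetical helper.

```python
import os


def load_settings(env=None):
    """Read thepipe-related environment variables into a dict."""
    env = os.environ if env is None else env
    return {
        "api_key": env.get("OPENAI_API_KEY"),
        "model": env.get("DEFAULT_AI_MODEL", "gpt-4o"),  # assumed fallback
        "github_token": env.get("GITHUB_TOKEN"),
        # Seconds of audio to transcribe; 600 is an assumed fallback
        "max_whisper_duration": int(env.get("MAX_WHISPER_DURATION", "600")),
        "host_images": env.get("HOST_IMAGES", "false").strip().lower() == "true",
    }


settings = load_settings({"MAX_WHISPER_DURATION": "120", "HOST_IMAGES": "TRUE"})
print(settings)
```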

Installation Options

# Basic install
pip install thepipe-api

# Full install (video, audio, web scraping)
apt-get install -y ffmpeg
pip install "thepipe-api[full]"  # quotes keep shells such as zsh from expanding the brackets
python -m playwright install --with-deps chromium

AI Registration

thepipe can self-register with your AI coding assistant, enabling it to call the tool directly.

# Register with Claude Code
thepipe --register code

# Register with Google Antigravity
thepipe --register agent

# Show all registration options
thepipe --register help

# Generate manual instructions for any chat interface (ChatGPT, Claude.ai, etc.)
thepipe --register

Contributing

git clone https://github.com/emcf/thepipe.git
cd thepipe
pip install -r requirements.txt
python -m pytest tests/

License

MIT License — see LICENSE for details.

Sponsors

Support thepipe development: Become a sponsor

Book us with Cal.com
