thepi.pe

Extract clean data from anything → Feed it to any LLM

What is thepipe?

thepipe extracts clean markdown, images, and structured data from complex sources — PDFs, URLs, codebases, databases, and more. It works out-of-the-box with any LLM, VLM, or RAG pipeline.

Key Features

Feature	Description
📄 Universal Extraction	PDFs, DOCX, PPTX, images, audio, video, Jupyter notebooks, spreadsheets
🌐 Web & Cloud	URLs, GitHub repos, YouTube transcription, Google Drive
💾 Databases	PostgreSQL, MySQL, MariaDB, SQLite, DuckDB, JDBC URLs
📊 Data Formats	Parquet, ORC, Feather/Arrow, CSV, JSONL, Excel
🔍 Code Analysis	90%+ token savings with intelligent digests & dependency mapping
🤖 Agent Mode	Seamlessly integrate with AI coding assistants via named pipes

Quick Start

pip install thepipe-api

Basic Usage

from thepipe.scraper import scrape_file
from thepipe.core import chunks_to_messages

# Extract from any source
chunks = scrape_file("document.pdf")

# Ready for any LLM
messages = chunks_to_messages(chunks)

CLI Usage

# Scrape a file
thepipe document.pdf -f

# Scrape a URL
thepipe https://example.com -f

# Scrape a codebase with intelligent analysis
thepipe ./my-project --options '{"code_relations": "auto"}' -f

# Query a database
thepipe "postgresql://user:pass@host/db" --db "SELECT * FROM users" -f
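
When calling the CLI from a script, building the argument list in Python avoids shell-quoting problems with the JSON passed to `--options`. The sketch below is a minimal wrapper under stated assumptions: `build_thepipe_command` and `run_thepipe` are hypothetical helpers, not part of thepipe, and they only use the flags shown above (`-f`, `--options`, `--db`).

```python
import json
import subprocess


def build_thepipe_command(source, options=None, db_query=None):
    """Return an argv list for the thepipe CLI (hypothetical helper)."""
    cmd = ["thepipe", source]
    if options is not None:
        # json.dumps yields the JSON string that --options receives
        cmd += ["--options", json.dumps(options)]
    if db_query is not None:
        cmd += ["--db", db_query]
    cmd.append("-f")
    return cmd


def run_thepipe(source, **kwargs):
    """Run the CLI; requires thepipe to be installed and on PATH."""
    return subprocess.run(build_thepipe_command(source, **kwargs), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_thepipe(...) to execute it
    print(build_thepipe_command("./my-project", options={"code_relations": "auto"}))
```

Passing a list to `subprocess.run` skips the shell entirely, so the JSON and SQL strings need no extra escaping.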

🔥 Code Analysis (90%+ Token Savings)

For codebases, use code_relations mode for intelligent digests:

thepipe ./repo --options '{"code_relations": "auto"}' -f

Benefits:

🔗 Dependency mapping across imports
🏷️ Semantic tagging (auth, database, API, testing)
📊 Full codebase context in minimal tokens
🌍 Supports Python, JS/TS, Dart, Swift, Kotlin, Ruby, Go, Rust, C/C++, Java, +155 more

💾 Database Support

Connection Formats

# PostgreSQL
thepipe "postgresql://user:pass@host:5432/db" --db "SELECT * FROM table"

# MySQL / MariaDB
thepipe "mysql://user:pass@host:3306/db" --db "SELECT * FROM table"

# JDBC URLs (auto-converted)
thepipe "jdbc:mysql://host:3306/db" --db "SELECT * FROM table"

# SQLite
thepipe "sqlite:///path/to/database.db" --db "SELECT * FROM table"

# DuckDB
thepipe "duckdb:///analytics.duckdb" --db "SELECT * FROM table"
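
One way to picture the JDBC auto-conversion is as stripping the `jdbc:` prefix so that the remainder parses like the other connection URLs. The following is only a sketch of that idea, not thepipe's actual implementation; `normalize_connection_url` is a hypothetical name.

```python
from urllib.parse import urlsplit


def normalize_connection_url(url):
    """Strip a leading 'jdbc:' prefix (illustrative assumption, not thepipe's code)."""
    if url.lower().startswith("jdbc:"):
        url = url[len("jdbc:"):]
    return url


if __name__ == "__main__":
    parts = urlsplit(normalize_connection_url("jdbc:mysql://host:3306/db"))
    # The remaining URL splits into the usual scheme/host/port/path pieces
    print(parts.scheme, parts.hostname, parts.port, parts.path)
```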

Data File Formats

Format	Extensions	Backend
Parquet	`.parquet`, `.parq`	DuckDB
ORC	`.orc`	DuckDB
Feather/Arrow	`.feather`, `.arrow`, `.ipc`	DuckDB
JSON Lines	`.jsonl`, `.ndjson`	DuckDB
CSV	`.csv`	DuckDB
Excel	`.xlsx`, `.xls`	Pandas → DuckDB

# Query data files directly
thepipe data.parquet --db "SELECT * FROM parquet_data LIMIT 10"
thepipe logs.jsonl --db "SELECT * FROM jsonl_data WHERE level = 'error'"
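
For scripts that need to decide up front whether a file can be queried with `--db`, the format table above can be encoded as a lookup. This sketch is built only from the extensions listed in that table; `DATA_FILE_FORMATS` and `data_format_for` are hypothetical names, not thepipe APIs.

```python
from pathlib import Path

# Extension -> (format, backend), copied from the table above
DATA_FILE_FORMATS = {
    ".parquet": ("Parquet", "DuckDB"),
    ".parq": ("Parquet", "DuckDB"),
    ".orc": ("ORC", "DuckDB"),
    ".feather": ("Feather/Arrow", "DuckDB"),
    ".arrow": ("Feather/Arrow", "DuckDB"),
    ".ipc": ("Feather/Arrow", "DuckDB"),
    ".jsonl": ("JSON Lines", "DuckDB"),
    ".ndjson": ("JSON Lines", "DuckDB"),
    ".csv": ("CSV", "DuckDB"),
    ".xlsx": ("Excel", "Pandas → DuckDB"),
    ".xls": ("Excel", "Pandas → DuckDB"),
}


def data_format_for(path):
    """Return (format, backend) for a data file, or None if the extension is not listed."""
    return DATA_FILE_FORMATS.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ("data.parquet", "logs.jsonl", "notes.txt"):
        print(name, data_format_for(name))
```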

🔌 Named Pipe (FIFO) Input

thepipe accepts named pipes as input sources — useful for streaming data:

# Create a FIFO
mkfifo /tmp/my_pipe

# thepipe reads from it (blocks until data arrives)
thepipe /tmp/my_pipe -f &

# Write data to the pipe
echo '{"key": "value"}' > /tmp/my_pipe

Content type is auto-detected via Magika.
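
The blocking behaviour of a FIFO can be tried from Python with only the standard library. In this POSIX-only sketch, a background thread writes to a temporary FIFO while the main thread reads from it, standing in for thepipe as the reader. The function names and the temporary path are illustrative.

```python
import os
import tempfile
import threading


def write_to_fifo(path, payload):
    # Opening a FIFO for writing blocks until a reader opens the other end
    with open(path, "w") as fifo:
        fifo.write(payload)


def fifo_round_trip(payload):
    """Send payload through a temporary FIFO and return what the reader receives."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "my_pipe")
        os.mkfifo(path)
        writer = threading.Thread(target=write_to_fifo, args=(path, payload))
        writer.start()
        # The reader blocks until data arrives, then reads until the writer closes
        with open(path) as fifo:
            received = fifo.read()
        writer.join()
        return received


if __name__ == "__main__":
    print(fifo_round_trip('{"key": "value"}\n'))
```

To feed thepipe instead, keep the writer and run `thepipe /tmp/my_pipe -f` as the reader in a separate process.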

🤖 Agent Mode (LLM Inference Delegation)

When running inside an AI coding assistant, thepipe can delegate LLM calls back to the host agent:

thepipe document.pdf --options '{"llm_provider": "agent"}' -f

How it works:

1. thepipe creates named pipes in /tmp/thepipe_pipes/.
2. It outputs each query with <<<THEPIPE_LLM_QUERY>>> markers.
3. The host agent reads the query, executes the LLM call, and writes the response.
4. thepipe reads the response and continues.

This avoids double API charges when running inside Antigravity, Claude Code, or similar tools.
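
On the host side, the agent has to pick the query out of thepipe's output. The sketch below shows one way that might look. The exact framing is not documented here, so the code assumes the query follows the first `<<<THEPIPE_LLM_QUERY>>>` marker and runs to the next marker or to the end of the output. `extract_query` is a hypothetical helper, not a thepipe API.

```python
QUERY_MARKER = "<<<THEPIPE_LLM_QUERY>>>"


def extract_query(output, marker=QUERY_MARKER):
    """Return the text after the first marker, up to the next marker or end of output.

    ASSUMPTION: this framing is a guess for illustration; check the real
    output of thepipe before relying on it.
    """
    _, found, rest = output.partition(marker)
    if not found:
        return None  # no query in this output
    query, _, _ = rest.partition(marker)
    return query.strip()


if __name__ == "__main__":
    sample = f"log line\n{QUERY_MARKER}\nDescribe this image.\n{QUERY_MARKER}\n"
    print(extract_query(sample))
```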

Supported Sources

Source	Input Types	Multimodal
Documents	`.pdf`, `.docx`, `.pptx`, `.txt`, `.md`	✔️
Spreadsheets	`.csv`, `.xlsx`, `.xls`	❌
Images	`.jpg`, `.png`, `.gif`	✔️
Audio/Video	`.mp3`, `.wav`, `.mp4`, `.mov`	✔️
Code	`.py`, `.js`, `.ts`, `.java`, +155 more	❌
Notebooks	`.ipynb`	✔️
Archives	`.zip`	✔️
Web	`http://`, `https://`	✔️
GitHub	`github.com/user/repo`	✔️
YouTube	`youtube.com/watch?v=...`	✔️
Databases	SQL connection strings	❌
Data Files	`.parquet`, `.orc`, `.feather`, `.jsonl`	❌
Named Pipes	FIFOs (auto-detected)	✔️

LLM Integration

OpenAI

from openai import OpenAI
from thepipe.scraper import scrape_file
from thepipe.core import chunks_to_messages

client = OpenAI()
chunks = scrape_file("document.pdf")
messages = [{"role": "user", "content": "Summarize this document:"}]
messages += chunks_to_messages(chunks)

response = client.chat.completions.create(model="gpt-4o", messages=messages)

LlamaIndex

from thepipe.scraper import scrape_file

chunks = scrape_file("document.pdf")
documents = [chunk.to_llamaindex() for chunk in chunks]

Environment Variables

# OpenAI / VLM
export OPENAI_API_KEY=sk-...
export DEFAULT_AI_MODEL=gpt-4o

# GitHub (for repo scraping)
export GITHUB_TOKEN=ghp_...

# Audio transcription limit (seconds)
export MAX_WHISPER_DURATION=600

# Image hosting
export HOST_IMAGES=true
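
A quick pre-flight check can report which of these variables are unset before a long scrape starts. This is a sketch only: the variable names come from the list above, the descriptions are paraphrased, and `missing_env_vars` is a hypothetical helper. Which variables are actually required depends on the sources being scraped.

```python
import os

# Names taken from the list above; descriptions are paraphrased
ENV_VARS = {
    "OPENAI_API_KEY": "OpenAI / VLM access",
    "DEFAULT_AI_MODEL": "model name, e.g. gpt-4o",
    "GITHUB_TOKEN": "GitHub repo scraping",
    "MAX_WHISPER_DURATION": "audio transcription limit in seconds",
    "HOST_IMAGES": "image hosting",
}


def missing_env_vars(environ=None):
    """Return the names from ENV_VARS that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in ENV_VARS if not environ.get(name)]


if __name__ == "__main__":
    for name in missing_env_vars():
        print(f"{name} is not set ({ENV_VARS[name]})")
```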

Installation Options

# Basic install
pip install thepipe-api

# Full install (video, audio, web scraping)
apt-get install -y ffmpeg
pip install "thepipe-api[full]"
python -m playwright install --with-deps chromium

AI Registration

thepipe can self-register with your AI coding assistant, enabling it to call the tool directly.

# Register with Claude Code
thepipe --register code

# Register with Google Antigravity
thepipe --register agent

# Show all registration options
thepipe --register help

# Generate manual instructions for any chat interface (ChatGPT, Claude.ai, etc.)
thepipe --register

Contributing

git clone https://github.com/emcf/thepipe.git
cd thepipe
pip install -r requirements.txt
python -m pytest tests/

License

MIT License — see LICENSE for details.
