thepipe extracts clean markdown, images, and structured data from complex sources — PDFs, URLs, codebases, databases, and more. It works out-of-the-box with any LLM, VLM, or RAG pipeline.
| Feature | Description |
|---|---|
| 📄 Universal Extraction | PDFs, DOCX, PPTX, images, audio, video, Jupyter notebooks, spreadsheets |
| 🌐 Web & Cloud | URLs, GitHub repos, YouTube transcription, Google Drive |
| 💾 Databases | PostgreSQL, MySQL, MariaDB, SQLite, DuckDB, JDBC URLs |
| 📊 Data Formats | Parquet, ORC, Feather/Arrow, CSV, JSONL, Excel |
| 🔍 Code Analysis | 90%+ token savings with intelligent digests & dependency mapping |
| 🤖 Agent Mode | Seamlessly integrate with AI coding assistants via named pipes |
```bash
pip install thepipe-api
```

```python
from thepipe.scraper import scrape_file
from thepipe.core import chunks_to_messages

# Extract from any source
chunks = scrape_file("document.pdf")

# Ready for any LLM
messages = chunks_to_messages(chunks)
```

```bash
# Scrape a file
thepipe document.pdf -f

# Scrape a URL
thepipe https://example.com -f

# Scrape a codebase with intelligent analysis
thepipe ./my-project --options '{"code_relations": "auto"}' -f

# Query a database
thepipe "postgresql://user:pass@host/db" --db "SELECT * FROM users" -f
```

For codebases, use `code_relations` mode for intelligent digests:
```bash
thepipe ./repo --options '{"code_relations": "auto"}' -f
```

Benefits:
- 🔗 Dependency mapping across imports
- 🏷️ Semantic tagging (auth, database, API, testing)
- 📊 Full codebase context in minimal tokens
- 🌍 Supports Python, JS/TS, Dart, Swift, Kotlin, Ruby, Go, Rust, C/C++, Java, +155 more
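The analysis can also be scripted; a minimal sketch driving the documented CLI from Python, assuming `thepipe` is on your PATH and `./my-project` is a placeholder for the repository to digest:

```python
import subprocess

# Run the documented CLI from a script; ./my-project is a placeholder path.
subprocess.run(
    ["thepipe", "./my-project", "--options", '{"code_relations": "auto"}', "-f"],
    check=True,
)
```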
```bash
# PostgreSQL
thepipe "postgresql://user:pass@host:5432/db" --db "SELECT * FROM table"

# MySQL / MariaDB
thepipe "mysql://user:pass@host:3306/db" --db "SELECT * FROM table"

# JDBC URLs (auto-converted)
thepipe "jdbc:mysql://host:3306/db" --db "SELECT * FROM table"

# SQLite
thepipe "sqlite:///path/to/database.db" --db "SELECT * FROM table"

# DuckDB
thepipe "duckdb:///analytics.duckdb" --db "SELECT * FROM table"
```
Supported data file formats:

| Format | Extensions | Backend |
|---|---|---|
| Parquet | .parquet, .parq | DuckDB |
| ORC | .orc | DuckDB |
| Feather/Arrow | .feather, .arrow, .ipc | DuckDB |
| JSON Lines | .jsonl, .ndjson | DuckDB |
| CSV | .csv | DuckDB |
| Excel | .xlsx, .xls | Pandas → DuckDB |
```bash
# Query data files directly
thepipe data.parquet --db "SELECT * FROM parquet_data LIMIT 10"
thepipe logs.jsonl --db "SELECT * FROM jsonl_data WHERE level = 'error'"
```

thepipe accepts named pipes as input sources — useful for streaming data:
```bash
# Create a FIFO
mkfifo /tmp/my_pipe

# thepipe reads from it (blocks until data arrives)
thepipe /tmp/my_pipe -f &

# Write data to the pipe
echo '{"key": "value"}' > /tmp/my_pipe
```

Content type is auto-detected via Magika.
When running inside an AI coding assistant, thepipe can delegate LLM calls back to the host agent:
```bash
thepipe document.pdf --options '{"llm_provider": "agent"}' -f
```

How it works:
- thepipe creates named pipes in `/tmp/thepipe_pipes/`
- Outputs the query with `<<<THEPIPE_LLM_QUERY>>>` markers
- Agent reads the query, executes the LLM call, writes the response
- thepipe continues seamlessly
This avoids double API charges when running inside Antigravity, Claude Code, or similar tools.
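For illustration, here is a sketch of what the agent side of that loop might look like. Only the `/tmp/thepipe_pipes/` directory and the `<<<THEPIPE_LLM_QUERY>>>` marker come from the protocol described above; the `"*_query"`/`"*_response"` pipe naming and the `call_llm` helper are hypothetical:

```python
import glob
import os

PIPE_DIR = "/tmp/thepipe_pipes"     # from the protocol description
MARKER = "<<<THEPIPE_LLM_QUERY>>>"  # from the protocol description

def call_llm(prompt: str) -> str:
    # Hypothetical: replace with the host agent's own completion call.
    return "..."

# Hypothetical "*_query"/"*_response" pairing; check your thepipe version
# for the actual pipe names.
for query_pipe in glob.glob(os.path.join(PIPE_DIR, "*_query")):
    with open(query_pipe) as f:  # blocks until thepipe writes a query
        payload = f.read()
    if MARKER in payload:
        prompt = payload.split(MARKER, 1)[-1]
        with open(query_pipe.replace("_query", "_response"), "w") as f:
            f.write(call_llm(prompt))  # thepipe resumes once it reads this
```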
Supported sources:

| Source | Input Types | Multimodal |
|---|---|---|
| Documents | .pdf, .docx, .pptx, .txt, .md | ✔️ |
| Spreadsheets | .csv, .xlsx, .xls | ❌ |
| Images | .jpg, .png, .gif | ✔️ |
| Audio/Video | .mp3, .wav, .mp4, .mov | ✔️ |
| Code | .py, .js, .ts, .java, +155 more | ❌ |
| Notebooks | .ipynb | ✔️ |
| Archives | .zip | ✔️ |
| Web | http://, https:// | ✔️ |
| GitHub | github.com/user/repo | ✔️ |
| YouTube | youtube.com/watch?v=... | ✔️ |
| Databases | SQL connection strings | ❌ |
| Data Files | .parquet, .orc, .feather, .jsonl | ❌ |
| Named Pipes | FIFOs (auto-detected) | ✔️ |
```python
from openai import OpenAI
from thepipe.scraper import scrape_file
from thepipe.core import chunks_to_messages

client = OpenAI()
chunks = scrape_file("document.pdf")
messages = [{"role": "user", "content": "Summarize this document:"}]
messages += chunks_to_messages(chunks)
response = client.chat.completions.create(model="gpt-4o", messages=messages)
```
For LlamaIndex, convert chunks directly to documents:

```python
from thepipe.scraper import scrape_file

chunks = scrape_file("document.pdf")
documents = [chunk.to_llamaindex() for chunk in chunks]
```
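From there, the documents drop into a standard LlamaIndex pipeline; a minimal sketch, assuming the llama-index package is installed and each converted chunk is a LlamaIndex Document:

```python
from llama_index.core import VectorStoreIndex
from thepipe.scraper import scrape_file

# Scrape, convert, and build a queryable vector index.
chunks = scrape_file("document.pdf")
documents = [chunk.to_llamaindex() for chunk in chunks]

index = VectorStoreIndex.from_documents(documents)
print(index.as_query_engine().query("What is this document about?"))
```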
Configure thepipe via environment variables:

```bash
# OpenAI / VLM
export OPENAI_API_KEY=sk-...
export DEFAULT_AI_MODEL=gpt-4o

# GitHub (for repo scraping)
export GITHUB_TOKEN=ghp_...

# Audio transcription limit (seconds)
export MAX_WHISPER_DURATION=600

# Image hosting
export HOST_IMAGES=true
```
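These can also be set per-process from Python; a minimal sketch with placeholder values (assumption: if your thepipe version reads them at import time, set them before importing):

```python
import os

# Placeholder values mirroring the environment variables above.
os.environ["DEFAULT_AI_MODEL"] = "gpt-4o"
os.environ["MAX_WHISPER_DURATION"] = "600"
os.environ["HOST_IMAGES"] = "true"

from thepipe.scraper import scrape_file

chunks = scrape_file("talk.mp3")  # transcription respects MAX_WHISPER_DURATION
```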
```bash
# Basic install
pip install thepipe-api

# Full install (video, audio, web scraping)
apt-get install -y ffmpeg
pip install thepipe-api[full]
python -m playwright install --with-deps chromium
```

thepipe can self-register with your AI coding assistant, enabling it to call the tool directly.
```bash
# Register with Claude Code
thepipe --register code

# Register with Google Antigravity
thepipe --register agent

# Show all registration options
thepipe --register help

# Generate manual instructions for any chat interface (ChatGPT, Claude.ai, etc.)
thepipe --register
```

To set up for development:

```bash
git clone https://github.com/emcf/thepipe.git
cd thepipe
pip install -r requirements.txt
python -m pytest tests/
```

MIT License — see LICENSE for details.
Support thepipe development: Become a sponsor