From 575ece8eebe796e841a81fda2ac90c4e2daf55ca Mon Sep 17 00:00:00 2001 From: Pierre Brunelle Date: Sun, 24 May 2026 19:46:15 -0700 Subject: [PATCH] feat: add pixeltable skill for multimodal AI data infrastructure Pixeltable replaces LangChain + pandas + vector DB with declarative tables and computed columns. This skill covers table creation, embedding indexes, similarity search, UDFs, views with iterators, and 25+ AI provider integrations. Includes reference docs for core API, agentic patterns, video/RAG workflows, anti-patterns, and provider catalog. Co-authored-by: Cursor --- skills/pixeltable/README.md | 33 + skills/pixeltable/SKILL.md | 520 ++++++++ .../pixeltable/references/agentic-patterns.md | 368 ++++++ .../references/agents-memory-mcp.md | 289 +++++ skills/pixeltable/references/anti-patterns.md | 388 ++++++ skills/pixeltable/references/core-api.md | 1146 +++++++++++++++++ .../pixeltable/references/ml-data-pipeline.md | 282 ++++ skills/pixeltable/references/providers.md | 591 +++++++++ .../pixeltable/references/video-rag-agents.md | 251 ++++ skills/pixeltable/references/workflows.md | 642 +++++++++ 10 files changed, 4510 insertions(+) create mode 100644 skills/pixeltable/README.md create mode 100644 skills/pixeltable/SKILL.md create mode 100644 skills/pixeltable/references/agentic-patterns.md create mode 100644 skills/pixeltable/references/agents-memory-mcp.md create mode 100644 skills/pixeltable/references/anti-patterns.md create mode 100644 skills/pixeltable/references/core-api.md create mode 100644 skills/pixeltable/references/ml-data-pipeline.md create mode 100644 skills/pixeltable/references/providers.md create mode 100644 skills/pixeltable/references/video-rag-agents.md create mode 100644 skills/pixeltable/references/workflows.md diff --git a/skills/pixeltable/README.md b/skills/pixeltable/README.md new file mode 100644 index 0000000..e92c3eb --- /dev/null +++ b/skills/pixeltable/README.md @@ -0,0 +1,33 @@ +# Pixeltable + +Build multimodal AI 
applications with Pixeltable -- declarative tables replace LangChain + pandas + vector DB with one system. Automates chunking, embedding, retrieval, tool-calling agents, and 25+ AI provider integrations via computed columns that run on insert. + +## Triggers + +This skill is activated by the following keywords: + +- `pixeltable` +- `multimodal` +- `computed columns` +- `embedding index` +- `pxt.udf` +- `similarity search` +- `RAG pipeline` +- `video frames` +- `document chunks` + +## What it covers + +- Creating tables with multimodal column types (Image, Video, Audio, Document) +- Computed columns that auto-execute on insert +- Embedding indexes and similarity search +- UDFs and query functions +- Views with iterators (frame extraction, document chunking) +- 25+ AI provider integrations (OpenAI, Anthropic, Gemini, etc.) +- FastAPI serving and production patterns + +## Links + +- [Documentation](https://docs.pixeltable.com/) +- [GitHub](https://github.com/pixeltable/pixeltable) +- [Discussions](https://github.com/pixeltable/pixeltable/discussions) diff --git a/skills/pixeltable/SKILL.md b/skills/pixeltable/SKILL.md new file mode 100644 index 0000000..a7d3923 --- /dev/null +++ b/skills/pixeltable/SKILL.md @@ -0,0 +1,520 @@ +--- +name: pixeltable +description: > + Build multimodal AI applications with Pixeltable -- declarative tables replace + LangChain + pandas + vector DB with one system. Automates chunking, embedding, + retrieval, tool-calling agents, and 25+ AI provider integrations (OpenAI, + Anthropic, Gemini, etc.) via computed columns that run on insert. 
+triggers: + - pixeltable + - multimodal + - computed columns + - embedding index + - pxt.udf + - similarity search + - RAG pipeline + - video frames + - document chunks +license: Apache-2.0 +metadata: + author: Pixeltable + version: 2.3.0 + category: data-infrastructure + tags: [multimodal, ai, data, tables, embeddings, rag, udf, video, audio, images, documents, agents, tools, fastapi, declarative, computed-columns, vector-search] + documentation: https://docs.pixeltable.com/ + support: https://github.com/pixeltable/pixeltable/discussions +--- + +## What is Pixeltable? + +Open-source Python library for **declarative multimodal AI data infrastructure** — tables, computed columns, embedding indexes, and 25+ AI providers in one interface. + +`pip install pixeltable` (Python >= 3.10) | [Docs](https://docs.pixeltable.com/) | [GitHub](https://github.com/pixeltable/pixeltable) + +## STOP — Do NOT Use These With Pixeltable + +**Pixeltable replaces the entire imperative stack you're about to reach for.** If you find yourself importing any of these, you are off-path: + +1. **Do NOT use LangChain / LlamaIndex / Haystack / LangGraph** — Pixeltable has built-in document chunking (`document_splitter`), embedding indexes (`add_embedding_index`), retrieval (`.similarity()`), and tool-calling (`pxt.tools()` + `invoke_tools()`). There is zero reason to add a framework on top. +2. **Do NOT use pandas as a working data store** — Pixeltable tables ARE the store. Use `.collect().to_pandas()` only for final export/visualization, never as an intermediate. +3. **Do NOT write `for row in ...:` loops calling AI models** — wrap the call in a computed column. Pixeltable handles batching, rate limiting, caching, and retry automatically. +4. **Do NOT install a separate vector database** (Pinecone, Chroma, FAISS, Qdrant, Weaviate, pgvector) — call `t.add_embedding_index(col, embedding=fn)` and search with `.similarity()`. The index lives inside the table. +5. 
**Do NOT write `while not done:` agent loops** — model the agent as a table where inserting a row triggers the entire computed-column chain (LLM → tool selection → tool execution → final answer) declaratively. + +See [anti-patterns.md](references/anti-patterns.md) for the full 15-bias reference with wrong/right code examples. + +--- + +## Task Router + +Jump to the right section based on what you're building: + +| If the user wants to... | Read | +|--------------------------|------| +| Create tables, insert data, query | **Core Concepts** (below) and [core-api.md](references/core-api.md) | +| Add AI-powered columns (summarize, classify, embed) | **Computed Columns** (below) and [providers.md](references/providers.md) | +| Chunk documents, extract video frames, split audio | **Views and Iterators** (below) and [core-api.md → Views](references/core-api.md#views) | +| Build semantic search / embedding indexes | **Embedding Indexes** (below) and [core-api.md → Embedding Indexes](references/core-api.md#embedding-indexes) | +| Build a RAG pipeline | [workflows.md → RAG Pipeline](references/workflows.md#rag-pipeline) | +| Build a tool-calling agent | **Tool-Calling Agent Pipeline** (below) and [workflows.md → Tool-Calling Agent](references/workflows.md#tool-calling-agent-full-production-example) | +| Build an agent with persistent memory | [agents-memory-mcp.md](references/agents-memory-mcp.md) — chat history, knowledge bank, user scoping | +| Use MCP tools with an agent | [agents-memory-mcp.md → Adding MCP Tools](references/agents-memory-mcp.md#adding-mcp-tools) | +| Use `invoke_tools()` with OpenAI, Groq, Gemini, Bedrock | [agents-memory-mcp.md → Multi-Provider](references/agents-memory-mcp.md#multi-provider-invoke_tools) | +| Build a video RAG agent (video + search + agent) | [video-rag-agents.md](references/video-rag-agents.md) — dedicated combined recipe | +| Process video (frames, transcription, visual search) | [workflows.md → Video Analysis 
Pipeline](references/workflows.md#video-analysis-pipeline) | +| Process images (classify, tag, search) | [workflows.md → Image Classification and Search](references/workflows.md#image-classification-and-search) | +| Process audio (transcribe, summarize) | [workflows.md → Audio Transcription](references/workflows.md#audio-transcription-and-analysis) | +| Wrangle data for ML training (label, version, export) | [ml-data-pipeline.md](references/ml-data-pipeline.md) — ingest, enrich, snapshot, PyTorch export | +| Export to PyTorch, Parquet, or pandas | [ml-data-pipeline.md → Export for Training](references/ml-data-pipeline.md#export-for-training) | +| Look up structured data with `retrieval_udf` | [ml-data-pipeline.md → Retrieval UDFs](references/ml-data-pipeline.md#retrieval-udfs-for-structured-data-lookup) | +| Retry failed computed columns | **Error Handling** (below) — `recompute_columns()` | +| Use agentic patterns (chaining, routing, parallelization, eval-optimize) | [agentic-patterns.md](references/agentic-patterns.md) — 6 patterns + 2 reasoning strategies | +| Run batch processing (ingest, compute, export, exit) | [workflows.md → Batch Processing](references/workflows.md#batch-processing-pattern) | +| Configure rate limits, media storage, API keys | [core-api.md → Configuration](references/core-api.md#configuration) | +| Export to CSV, JSON, Parquet, LanceDB | [core-api.md → Export](references/core-api.md#export-csv-json-parquet-lancedb) | +| Export to SQL databases (Postgres, Snowflake, SQLite) | [core-api.md → Export to SQL](references/core-api.md#export-to-sql-databases) | +| Share tables across teams (`publish`, `replicate`) | [core-api.md → Data Sharing](references/core-api.md#data-sharing-and-replication) | +| Compare multiple AI providers | [workflows.md → Multi-Provider Comparison](references/workflows.md#multi-provider-comparison) | +| Build a FastAPI web app (hand-written endpoints) | [workflows.md → FastAPI App 
Pattern](references/workflows.md#fastapi-app-pattern) | +| Serve tables/queries via FastAPIRouter (v0.6+) | [workflows.md → FastAPIRouter](references/workflows.md#fastapirouter-declarative-serving-v06) and [core-api.md → Serving](references/core-api.md#serving-fastapirouter) | +| Serve via CLI (`pxt serve` + TOML config) | [core-api.md → pxt serve](references/core-api.md#pxt-serve-cli) | +| Store media in Pixeltable Cloud (`pxtfs://`) | [core-api.md → Media Destinations](references/core-api.md#media-destinations-cloud-storage) | +| Write UDFs or query functions | **UDFs** / **Query Functions** (below) and [core-api.md → UDFs](references/core-api.md#udfs) | +| Use `pxt.tools()` and `invoke_tools()` for agents | **Tool-Calling Agent Pipeline** (below) and [core-api.md → Tools and Agents](references/core-api.md#tools-and-agents) | +| Avoid common mistakes (wrong imports, broken schemas, serialization) | **Common Pitfalls** (below) and [core-api.md → Common Pitfalls](references/core-api.md#common-pitfalls) | +| Understand what NOT to use with Pixeltable (LangChain, pandas, vector DBs) | [anti-patterns.md](references/anti-patterns.md) — 15 training-distribution biases with wrong/right code | +| Look up a specific provider's import and output shape | [providers.md → Quick Reference](references/providers.md#quick-reference) | + +## Critical Warnings — Read Before Writing Code + +1. **`openai.vision` does not exist** — use `openai.chat_completions` with `image_url` content blocks +2. **Cast to `pxt.String` before embedding** — use `.text.astype(pxt.String)` on AI function outputs before `add_embedding_index` +3. **`if_exists='ignore'` won't fix bugs** — if a computed column has wrong logic, you must `drop_column()` then recreate; re-running is a silent no-op +4. **Import `frame_iterator` as a function** — `from pixeltable.functions.video import frame_iterator`, NOT `from pixeltable.iterators import FrameIterator` +5. 
**Use `string=` keyword in similarity** — always `t.col.similarity(string=query)`, not positional + +See [Common Pitfalls](#common-pitfalls) below for full details and code examples. + +## Starting a New Project + +Scaffold a complete Pixeltable project from the [Starter Kit](https://github.com/pixeltable/pixeltable-starter-kit) in one command: + +```bash +# Application templates (each builds on a structural pattern) +uvx pixeltable-new --template knowledge-base my-kb # web UI + API +uvx pixeltable-new --template chat-agent my-agent # web UI + API +uvx pixeltable-new --template audio-transcription my-podcast # web UI + API +uvx pixeltable-new --template full-stack-showcase my-sitewatch # web UI + API (complete reference app) +uvx pixeltable-new --template video-search my-video-app # API only +uvx pixeltable-new --template media-indexing my-pipe # API + batch +uvx pixeltable-new --template image-dataset my-dataset # API + batch + +# Structural patterns (API/pipeline scaffolds) +uvx pixeltable-new myapp # default: declarative serving pattern +uvx pixeltable-new myapp --backend # FastAPI API scaffold (headless) +uvx pixeltable-new myapp --batch # batch processing script with export_sql + +# Discovery +uvx pixeltable-new --list # show all patterns + templates +``` + +Each template builds on one of the three structural patterns (serving, backend, batch), so you already know how to run and deploy it. 
+ +## Core Concepts + +### Tables and Column Types + +```python +import pixeltable as pxt + +pxt.create_dir('my_project', if_exists='ignore') + +t = pxt.create_table('my_project.documents', { + 'title': pxt.String, + 'content': pxt.String, + 'image': pxt.Image, + 'video': pxt.Video, + 'audio': pxt.Audio, + 'doc': pxt.Document, + 'metadata': pxt.Json, + 'score': pxt.Float, + 'count': pxt.Int, + 'is_active': pxt.Bool, + 'created_at': pxt.Timestamp, +}, if_exists='ignore') +``` + +Available types: `String`, `Int`, `Float`, `Bool`, `Image`, `Video`, `Audio`, `Document`, `Json`, `Array`, `Timestamp`, `Date`, `UUID`, `Binary`. Use `pxt.Required[pxt.String]` for non-nullable. + +### Tables with Auto-Generated Keys + +Use `uuid7()` for auto-generated primary keys (recommended for production): + +```python +from pixeltable.functions.uuid import uuid7 + +t = pxt.create_table('my_project.items', { + 'content': pxt.String, + 'uuid': uuid7(), # auto-generated on insert + 'timestamp': pxt.Timestamp, +}, primary_key=['uuid'], if_exists='ignore') +``` + +### Inserting Data + +```python +t.insert([{'title': 'Doc 1', 'content': 'Hello world', 'score': 0.95}]) # list of dicts +t.insert(title='Doc 2', content='Single row', score=0.75) # keyword syntax +t.insert(source='path/to/data.csv') # from file +``` + +### Computed Columns + +Auto-run on insert. 
Chain AI providers, UDFs, or expressions: + +```python +from pixeltable.functions.openai import chat_completions + +t.add_computed_column( + summary=chat_completions( + messages=[{'role': 'user', 'content': t.content}], + model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore' +) + +t.add_computed_column(upper_title=t.title.upper(), if_exists='ignore') +``` + +### Querying + +```python +results = t.select(t.title, t.score).collect() +results = t.where(t.score > 0.8).select(t.title, t.content).collect() +results = t.order_by(t.score, asc=False).limit(10).select(t.title).collect() +count = t.count() +df = t.select(t.title, t.score).collect().to_pandas() +items = list(t.select(title=t.title, score=t.score).collect().to_pydantic(MyModel)) +``` + +### Views and Iterators + +Split rows into sub-rows (chunking, frame extraction, audio splitting): + +```python +from pixeltable.functions.document import document_splitter +from pixeltable.functions.video import frame_iterator +from pixeltable.functions.string import string_splitter +from pixeltable.functions.audio import audio_splitter + +# Chunk documents into 300-token pieces (requires: pip install tiktoken) +chunks = pxt.create_view( + 'my_project.doc_chunks', t, + iterator=document_splitter(t.doc, separators='token_limit', limit=300), + if_exists='ignore' +) + +# Extract video frames at 1 fps +frames = pxt.create_view( + 'my_project.video_frames', t, + iterator=frame_iterator(t.video, fps=1.0), + if_exists='ignore' +) + +# Split text into sentences +sentences = pxt.create_view( + 'my_project.sentences', t, + iterator=string_splitter(t.content, separators='sentence'), + if_exists='ignore' +) + +# Split audio into 30-second chunks +audio_chunks = pxt.create_view( + 'my_project.audio_chunks', t, + iterator=audio_splitter(audio=t.audio, duration=30.0), + if_exists='ignore' +) + +# Filtered view (no iterator needed) +active = pxt.create_view( + 'my_project.active', t.where(t.is_active == True), + 
 if_exists='ignore' +) +``` + +### Embedding Indexes and Similarity Search + +```python +from pixeltable.functions.huggingface import clip, sentence_transformer + +embed_fn = clip.using(model_id='openai/clip-vit-base-patch32') +t.add_embedding_index('content', embedding=embed_fn, if_exists='ignore') + +# Search +sim = t.content.similarity(string='search query') +results = t.order_by(sim, asc=False).limit(5).select(t.title, t.content, sim).collect() + +# Image search with text (multimodal CLIP; needs an embedding index on t.image) +sim = t.image.similarity(string='a photo of a cat') +results = t.order_by(sim, asc=False).limit(5).select(t.image, sim).collect() +``` + +### Built-in Image and Video Functions + +```python +from pixeltable.functions import image as pxt_image +from pixeltable.functions.video import extract_audio + +# Image thumbnails and encoding +t.add_computed_column( + thumbnail=pxt_image.b64_encode( + pxt_image.thumbnail(t.image, size=(320, 320)) + ), + if_exists='ignore' +) + +# Extract audio from video +t.add_computed_column( + audio=extract_audio(t.video, format='mp3'), + if_exists='ignore' +) +``` + +### User-Defined Functions (UDFs) + +```python +@pxt.udf +def clean_text(text: str) -> str: + return text.strip().lower() + +@pxt.udf +def safe_length(text: str | None) -> int: + return 0 if text is None else len(text) + +t.add_computed_column(cleaned=clean_text(t.content), if_exists='ignore') +``` + +### Query Functions (also usable as agent tools) + +```python +@pxt.query +def search_documents(query_text: str, limit: int = 10): + sim = t.content.similarity(string=query_text) + return t.order_by(sim, asc=False).limit(limit).select(t.title, t.content, sim) + +results = search_documents('machine learning').collect() +``` + +## Tool-Calling Agent Pipeline + +Inserting a row triggers the entire computed column chain automatically. 
+ +```python +import pixeltable as pxt +from pixeltable.functions.anthropic import messages, invoke_tools +from datetime import datetime + +tools = pxt.tools(web_search, search_documents) # @pxt.udf + @pxt.query + +@pxt.udf +def assemble_context(question: str, tool_outputs: list | None, doc_context: list | None) -> str: + tool_str = str(tool_outputs) if tool_outputs else 'N/A' + doc_str = '\n'.join( + f"- {item.get('text', '')}" for item in (doc_context or []) if isinstance(item, dict) + ) or 'N/A' + return (f"QUESTION: {question}\n\n" + f"\n{tool_str}\n\n\n" + f"\n{doc_str}\n") + +agent = pxt.create_table('my_project.agent', { + 'prompt': pxt.String, 'timestamp': pxt.Timestamp, + 'system_prompt': pxt.String, 'max_tokens': pxt.Int, 'temperature': pxt.Float, +}, if_exists='ignore') + +# LLM selects tools → execute tools → RAG retrieval → assemble → final answer +agent.add_computed_column(initial_response=messages( + model='claude-sonnet-4-20250514', + messages=[{'role': 'user', 'content': [{'type': 'text', 'text': agent.prompt}]}], + tools=tools, tool_choice=tools.choice(required=True), + max_tokens=agent.max_tokens, + model_kwargs={'system': agent.system_prompt, 'temperature': agent.temperature}, +), if_exists='ignore') + +agent.add_computed_column(tool_output=invoke_tools(tools, agent.initial_response), if_exists='ignore') +agent.add_computed_column(doc_context=search_documents(agent.prompt), if_exists='ignore') +agent.add_computed_column(context=assemble_context(agent.prompt, agent.tool_output, agent.doc_context), if_exists='ignore') + +agent.add_computed_column(final_response=messages( + model='claude-sonnet-4-20250514', + messages=[{'role': 'user', 'content': [{'type': 'text', 'text': agent.context}]}], + max_tokens=agent.max_tokens, + model_kwargs={'system': agent.system_prompt, 'temperature': agent.temperature}, +), if_exists='ignore') + +agent.add_computed_column(answer=agent.final_response.content[0].text, if_exists='ignore') + +# Usage 
+agent.insert([{'prompt': 'What is quantum computing?', 'timestamp': datetime.now(), + 'system_prompt': 'You are a helpful assistant.', 'max_tokens': 1024, 'temperature': 0.7}]) +result = agent.where(agent.prompt == 'What is quantum computing?').select(agent.answer).collect() +``` + +## AI Provider Integrations + +Built-in functions for 25+ providers in `pixeltable.functions.*`: + +| Provider | Module | Key Functions | +|----------|--------|---------------| +| OpenAI | `openai` | `chat_completions` (supports multimodal/vision via messages), `embeddings`, `image_generations`, `speech`, `transcriptions` | +| Anthropic | `anthropic` | `messages`, `invoke_tools` | +| Gemini | `gemini` | `generate_content`, `invoke_tools` | +| Hugging Face | `huggingface` | `clip`, `sentence_transformer`, `detr_for_object_detection` | +| Together | `together` | `chat_completions`, `embeddings`, `image_generations` | +| Fireworks | `fireworks` | `chat_completions`, `embeddings` | +| Ollama | `ollama` | `chat_completions`, `embeddings` | +| Mistral | `mistralai` | `chat_completions`, `embeddings` | +| Groq | `groq` | `chat_completions`, `invoke_tools` | +| DeepSeek | `deepseek` | `chat_completions` | +| Replicate | `replicate` | `run` | +| Voyage AI | `voyageai` | `embed` | +| Bedrock | `bedrock` | `converse`, `invoke_tools` | +| OpenRouter | `openrouter` | `chat_completions` | +| Whisper | `whisper` | `transcribe` (local transcription) | +| WhisperX | `whisperx` | `transcribe` (local, with speaker diarization) | +| Twelve Labs | `twelvelabs` | `embed` (video understanding) | +| Jina AI | `jina` | `embeddings`, `rerank` | +| BFL FLUX | `bfl` | `generate`, `edit`, `expand`, `fill` (image generation/editing) | +| RunwayML | `runwayml` | `text_to_video`, `image_to_video`, `text_to_image`, `video_to_video` | +| fal.ai | `fal` | `run` (execute any fal.ai model) | +| Reve | `reve` | `create`, `edit`, `remix` (image generation) | +| Microsoft Fabric | `fabric` | `chat_completions`, `embeddings` (Azure OpenAI via 
Fabric) | +| llama.cpp | `llama_cpp` | `create_chat_completion` (local GGUF models) | +| YOLOX | `yolox` | `yolox` (object detection) | + +## Import/Export + +```python +# From CSV / Parquet +t = pxt.create_table('dir.from_csv', source='data.csv') +t = pxt.create_table('dir.from_parquet', source='data.parquet') + +# With schema overrides (remap columns to media types) +t = pxt.create_table('dir.data', source='data.csv', + schema_overrides={'image_col': pxt.Image, 'doc_col': pxt.Document}) + +# From Hugging Face +from pixeltable.io import import_huggingface_dataset +import datasets +ds = datasets.load_dataset('squad', split='train[:1000]') +t = import_huggingface_dataset('dir.squad', ds) + +# From pandas +from pixeltable.io import import_pandas +t = import_pandas('dir.from_df', df) + +# Export +from pixeltable.io import export_parquet +export_parquet(t, 'output/') +``` + +## Idempotent Operations and Error Handling + +CRITICAL: Always use `if_exists='ignore'` on every `create_*` and `add_*` call. 
+ +```python +# Fault-tolerant inserts +status = t.insert(rows, on_error='ignore') +# Inspect errors +t.where(t.summary.errortype != None).select(t.title, t.summary.errormsg).collect() +# Retry failed columns +t.recompute_columns(columns=['summary'], where=t.summary.errortype != None) +``` + +## Common Pitfalls + +| # | Wrong | Correct | +|---|-------|---------| +| 1 | `openai.vision(prompt=..., image=t.image)` | `openai.chat_completions(messages=[{'role':'user','content':[{'type':'text','text':'...'}, {'type':'image_url','image_url':{'url':t.image}}]}], model='gpt-4o-mini').choices[0].message.content` | +| 2 | `from pixeltable.iterators import FrameIterator` | `from pixeltable.functions.video import frame_iterator` | +| 3 | `t.add_embedding_index('transcript', ...)` on Json col | Extract `.text.astype(pxt.String)` first, then index | +| 4 | Fix code + re-run with `if_exists='ignore'` | Must `t.drop_column('col')` then recreate; re-run is a no-op | +| 5 | `{'type':'image', 'data': t.image}` in messages | Use `{'type':'image_url', 'image_url':{'url': t.image}}` | +| 6 | `t.content.similarity(query)` (positional) | `t.content.similarity(string=query)` (keyword) | +| 7 | Schema corruption (`IntegrityError`) | `pip install -U pixeltable && rm -rf ~/.pixeltable` (destructive: this deletes the default local Pixeltable store, including all tables and stored media; export anything you need first) | +| 8 | `.collect()` or `pxt.get_table()` inside `@pxt.query` | `@pxt.query` compiles the body at decoration time with expression placeholders, so don't call `.collect()` or `insert()`, and don't reference tables that may not exist. Use a plain `def` for imperative logic | +| 9 | `'id': pxt.String` as primary key | PK columns must be non-nullable. Use `pxt.Required[pxt.String]` or `uuid7()` as a computed default | +| 10 | Module-level `Table` object used in FastAPI endpoint | `Table` objects are thread-bound. Call `pxt.get_table()` inside each endpoint function, not at module level | + +Full examples in [core-api.md → Common Pitfalls](references/core-api.md#common-pitfalls). 
+ +## Table Management + +```python +pxt.list_tables() +t = pxt.get_table('my_project.my_table') +pxt.drop_table('my_project.my_table') +pxt.drop_dir('my_project', force=True) +t.describe() +t.columns() + +# Snapshots (point-in-time copy) +snap = pxt.create_snapshot('my_project.snapshot_v1', t, if_exists='ignore') + +# Update and delete +t.update({'score': 1.0}, where=t.category == 'important') +t.delete(where=t.is_active == False) +``` + +## Building Apps with Pixeltable + +- Pixeltable IS the data layer — no ORM, no SQLAlchemy +- **Prefer `FastAPIRouter`** (v0.6+) over hand-written endpoints — `add_insert_route`, `add_query_route`, `add_delete_route` generate endpoints from tables and `@pxt.query` functions +- Use `background=True` on `add_insert_route` for long-running inserts (returns a job handle, client polls for completion) +- FastAPI endpoints: use `def` not `async def` (Pixeltable is synchronous) +- Business logic in `@pxt.udf` / `@pxt.query`, not in endpoint handlers +- Schema in one file, queries co-located with routes in each router file +- Insert a row → entire computed column chain runs automatically + +```python +from pixeltable.serving import FastAPIRouter +import pixeltable as pxt + +router = FastAPIRouter(prefix="/api/data", tags=["data"]) +docs = pxt.get_table("app.documents") + +router.add_insert_route(docs, path="/upload", uploadfile_inputs=["document"], + inputs=["timestamp"], outputs=["uuid"], background=True) +router.add_delete_route(docs, path="/delete") + +@pxt.query +def list_docs(): + return docs.select(uuid=docs.uuid, name=docs.document).order_by(docs.timestamp, asc=False) + +router.add_query_route(path="/list", query=list_docs, method="get") +``` + +Reference: [Pixeltable Starter Kit](https://github.com/pixeltable/pixeltable-starter-kit) | [workflows.md → FastAPIRouter](references/workflows.md#fastapirouter-declarative-serving-v06) | [core-api.md → Serving](references/core-api.md#serving-fastapirouter) + +## Resources + +- [Starter 
Kit](https://github.com/pixeltable/pixeltable-starter-kit) — 3 structural patterns + 7 application templates: + - **Patterns**: `backend/` (FastAPI + React), `batch/` (no HTTP server), `serving/` (`pxt serve` + TOML) + - **app.py templates** (have UI, run `python app.py`): `knowledge-base`, `chat-agent`, `audio-transcription`, `full-stack-showcase` + - **pxt-serve templates** (API only, run `python schema.py` then `pxt serve `): `video-search`, `media-indexing`, `image-dataset` + - All `app.py` templates include port auto-detection (probes upward from 8000; override with `PORT` env var) + - Scaffold with [`pixeltable-new`](https://github.com/pixeltable/pixeltable-new): `uvx pixeltable-new --template my-app` +- [MCP Server](https://github.com/pixeltable/mcp-server-pixeltable-developer) — Explore Pixeltable tables via MCP +- [LLM Docs](https://docs.pixeltable.com/llms-full.txt) — Complete documentation as plain text | [llms.txt](https://www.pixeltable.com/llms.txt) + +## Reference Files + +| File | Coverage | +|------|----------| +| [core-api.md](references/core-api.md) | Tables, querying, views, embeddings, UDFs, tools, **serving (FastAPIRouter)**, B-tree indexes, recompute, config, data sharing, SQL export | +| [providers.md](references/providers.md) | Quick-reference table + full examples for all 25+ AI providers | +| [workflows.md](references/workflows.md) | RAG, video analysis, image classification, audio, multi-provider, agent, **batch processing**, FastAPI, **FastAPIRouter**, export | +| [video-rag-agents.md](references/video-rag-agents.md) | Video + transcript/frame retrieval + tool-calling agent | +| [agents-memory-mcp.md](references/agents-memory-mcp.md) | Agent with persistent memory, MCP integration, multi-provider invoke_tools | +| [ml-data-pipeline.md](references/ml-data-pipeline.md) | Ingest, enrich, version, export to PyTorch/Parquet/pandas | +| [agentic-patterns.md](references/agentic-patterns.md) | 6 architectural patterns + 2 reasoning strategies | 
+| [anti-patterns.md](references/anti-patterns.md) | 15 training-distribution biases LLMs bring; wrong/right code for each | diff --git a/skills/pixeltable/references/agentic-patterns.md b/skills/pixeltable/references/agentic-patterns.md new file mode 100644 index 0000000..6ba84d6 --- /dev/null +++ b/skills/pixeltable/references/agentic-patterns.md @@ -0,0 +1,368 @@ +# Agentic Patterns + +Six architectural patterns and two reasoning strategies for building AI agents with Pixeltable. Every pattern uses declarative computed columns — no async code, no orchestration framework, no loop management. + +**Core principle**: Your agent _is_ a table. Each step is a computed column. The engine resolves dependencies, parallelizes independent columns, caches results, and persists every intermediate step automatically. + +## Contents + +- [Prompt Chaining](#prompt-chaining) — sequential multi-step generation +- [Routing](#routing) — classify intent, dispatch to specialized handlers +- [Parallelization](#parallelization) — independent analyses on same input +- [Tool Use](#tool-use) — LLM selects and calls external functions +- [Evaluator-Optimizer](#evaluator-optimizer) — generate, judge, refine +- [Orchestrator-Worker](#orchestrator-worker) — decompose, delegate, synthesize +- [ReAct](#react-reasoning--acting) — reason-act-observe loop +- [Planning](#planning) — plan upfront, then execute + +--- + +## Prompt Chaining + +Sequential steps where each output feeds into the next. 
+ +```python +import pixeltable as pxt +from pixeltable.functions.openai import chat_completions + +chain = pxt.create_table('demo.chain', {'topic': pxt.String}, if_exists='ignore') + +# Step 1: Generate outline +chain.add_computed_column( + outline=chat_completions( + messages=[{'role': 'user', 'content': 'Create a 3-point outline about: ' + chain.topic}], + model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore') + +# Step 2: Write draft from outline (depends on step 1) +chain.add_computed_column( + draft=chat_completions( + messages=[{'role': 'user', 'content': 'Write article based on outline:\n\n' + chain.outline}], + model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore') + +# Step 3: Polish draft (depends on step 2) +chain.add_computed_column( + final=chat_completions( + messages=[{'role': 'user', 'content': 'Edit for clarity and conciseness:\n\n' + chain.draft}], + model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore') + +chain.insert([{'topic': 'benefits of declarative AI pipelines'}]) +``` + +**When to use**: Content generation, data transformation pipelines, multi-step extraction. + +## Routing + +Classify input and dispatch to specialized handlers. + +```python +router = pxt.create_table('demo.router', {'query': pxt.String}, if_exists='ignore') + +# Classify intent +router.add_computed_column( + intent=chat_completions( + messages=[{ + 'role': 'user', + 'content': 'Classify into exactly one word — technical, billing, or general:\n\n' + router.query + }], + model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore') + +# Route to specialized prompt +@pxt.udf +def route_prompt(intent: str, query: str) -> list[dict]: + prompts = { + 'technical': 'You are a senior technical support engineer.', + 'billing': 'You are a billing specialist. 
Be empathetic.', + 'general': 'You are a friendly customer service representative.', + } + system = prompts.get(intent.strip().lower(), prompts['general']) + return [{'role': 'system', 'content': system}, {'role': 'user', 'content': query}] + +router.add_computed_column( + routed_messages=route_prompt(router.intent, router.query), + if_exists='ignore') + +router.add_computed_column( + response=chat_completions( + messages=router.routed_messages, model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore') +``` + +**When to use**: Customer support, multi-domain Q&A, content moderation. + +## Parallelization + +Multiple independent analyses on the same input — auto-parallelized by the engine. + +```python +parallel = pxt.create_table('demo.parallel', {'text': pxt.String}, if_exists='ignore') + +# Three independent columns (no dependencies → run concurrently) +parallel.add_computed_column( + sentiment=chat_completions( + messages=[{'role': 'user', 'content': 'Sentiment (positive/negative/neutral):\n\n' + parallel.text}], + model='gpt-4o-mini' + ).choices[0].message.content, if_exists='ignore') + +parallel.add_computed_column( + entities=chat_completions( + messages=[{'role': 'user', 'content': 'Extract named entities as JSON:\n\n' + parallel.text}], + model='gpt-4o-mini' + ).choices[0].message.content, if_exists='ignore') + +parallel.add_computed_column( + summary=chat_completions( + messages=[{'role': 'user', 'content': 'Summarize in one sentence:\n\n' + parallel.text}], + model='gpt-4o-mini' + ).choices[0].message.content, if_exists='ignore') + +# Merge results (depends on all three → runs after they complete) +@pxt.udf +def merge(sentiment: str, entities: str, summary: str) -> dict: + return {'sentiment': sentiment.strip(), 'entities': entities.strip(), 'summary': summary.strip()} + +parallel.add_computed_column( + report=merge(parallel.sentiment, parallel.entities, parallel.summary), + if_exists='ignore') +``` + +**When to use**: Document analysis, 
multi-aspect evaluation, feature extraction. + +## Tool Use + +LLM chooses which tools to call; Pixeltable executes them automatically. + +```python +from pixeltable.functions.openai import chat_completions, invoke_tools + +@pxt.udf +def get_weather(city: str) -> str: + """Get current weather for a city.""" + data = {'tokyo': 'Rainy, 65F', 'london': 'Cloudy, 58F', 'paris': 'Sunny, 72F'} + return data.get(city.lower(), f'No data for {city}') + +@pxt.udf +def get_stock_price(symbol: str) -> str: + """Get current stock price.""" + prices = {'AAPL': '$178.50', 'GOOGL': '$141.25', 'MSFT': '$378.90'} + return prices.get(symbol.upper(), f'No data for {symbol}') + +tools = pxt.tools(get_weather, get_stock_price) + +agent = pxt.create_table('demo.tool_agent', {'query': pxt.String}, if_exists='ignore') + +agent.add_computed_column( + response=chat_completions( + messages=[{'role': 'user', 'content': agent.query}], + model='gpt-4o-mini', tools=tools, + ), if_exists='ignore') + +agent.add_computed_column( + tool_output=invoke_tools(tools, agent.response), + if_exists='ignore') + +agent.insert([ + {'query': "What's the weather in Tokyo?"}, + {'query': "What's Apple's stock price?"}, +]) +``` + +**When to use**: Any agent that needs external data or actions. See also [agents-memory-mcp.md](agents-memory-mcp.md) for memory and MCP integration. + +## Evaluator-Optimizer + +Generate → judge → refine loop as three chained columns. 
+ +```python +evaluator = pxt.create_table('demo.evaluator', {'brief': pxt.String}, if_exists='ignore') + +# Generate first draft +evaluator.add_computed_column( + draft=chat_completions( + messages=[{'role': 'user', 'content': 'Write a marketing tagline for:\n\n' + evaluator.brief}], + model='gpt-4o-mini' + ).choices[0].message.content, if_exists='ignore') + +# LLM-as-judge evaluates the draft +evaluator.add_computed_column( + evaluation=chat_completions( + messages=[{ + 'role': 'user', + 'content': 'Rate clarity and creativity (1-10) with feedback:\n\nTagline: ' + evaluator.draft + }], + model='gpt-4o-mini' + ).choices[0].message.content, if_exists='ignore') + +# Refine based on feedback +evaluator.add_computed_column( + refined=chat_completions( + messages=[{ + 'role': 'user', + 'content': 'Improve based on feedback:\n\nOriginal: ' + evaluator.draft + '\n\nFeedback: ' + evaluator.evaluation + }], + model='gpt-4o-mini' + ).choices[0].message.content, if_exists='ignore') +``` + +**When to use**: Content quality control, code review pipelines, iterative refinement. + +## Orchestrator-Worker + +Central agent decomposes tasks, specialized worker tables handle sub-tasks. + +```python +# Worker A: Summarizer (reusable table-as-UDF) +summarizer = pxt.create_table('demo.summarizer', {'text': pxt.String}, if_exists='ignore') +summarizer.add_computed_column( + summary=chat_completions( + messages=[{'role': 'user', 'content': 'Summarize:\n\n' + summarizer.text}], + model='gpt-4o-mini' + ).choices[0].message.content, if_exists='ignore') + +# Worker B: Fact-checker +checker = pxt.create_table('demo.checker', {'claim': pxt.String}, if_exists='ignore') +checker.add_computed_column( + assessment=chat_completions( + messages=[{'role': 'user', 'content': 'Is this plausible? 
Reply PLAUSIBLE or DUBIOUS:\n\n' + checker.claim}], + model='gpt-4o-mini' + ).choices[0].message.content, if_exists='ignore') + +# Wrap worker tables as callable UDFs +summarize_fn = pxt.udf(summarizer, return_value=summarizer.summary) +fact_check_fn = pxt.udf(checker, return_value=checker.assessment) + +# Orchestrator: calls workers in parallel, then synthesizes +orchestrator = pxt.create_table('demo.orchestrator', {'article': pxt.String}, if_exists='ignore') +orchestrator.add_computed_column(summary=summarize_fn(text=orchestrator.article), if_exists='ignore') +orchestrator.add_computed_column(fact_check=fact_check_fn(claim=orchestrator.article), if_exists='ignore') + +orchestrator.add_computed_column( + briefing=chat_completions( + messages=[{ + 'role': 'user', + 'content': 'Write editorial note:\n\nSummary: ' + orchestrator.summary + '\n\nFact-check: ' + orchestrator.fact_check + }], + model='gpt-4o-mini' + ).choices[0].message.content, if_exists='ignore') +``` + +**Key technique**: `pxt.udf(table, return_value=table.col)` wraps an entire table pipeline as a callable function. Workers are reusable across multiple orchestrators. + +**When to use**: Research assistants, report generation, multi-agent systems. + +## ReAct (Reasoning + Acting) + +Agent alternates between reasoning and acting in a loop. Each step is a row. 
+ +```python +@pxt.udf +def lookup_population(country: str) -> str: + """Look up country population.""" + populations = {'united states': '331 million', 'brazil': '214 million', 'germany': '84 million'} + return populations.get(country.lower(), 'Not available') + +react_tools = pxt.tools(lookup_population) + +react = pxt.create_table('demo.react', { + 'step': pxt.Int, 'prompt': pxt.String, 'system_prompt': pxt.String, +}, if_exists='ignore') + +react.add_computed_column( + response=chat_completions( + messages=[ + {'role': 'system', 'content': react.system_prompt}, + {'role': 'user', 'content': react.prompt} + ], + model='gpt-4o-mini', tools=react_tools, + ), if_exists='ignore') + +react.add_computed_column( + answer=react.response.choices[0].message.content, + if_exists='ignore') + +react.add_computed_column( + tool_output=invoke_tools(react_tools, react.response), + if_exists='ignore') + +# Reasoning loop — each iteration is a new row +SYSTEM = "Answer step by step. Use tools when needed. Say FINAL ANSWER when done." +question = "Which has a larger population, Brazil or Germany?" +history = [] + +for step in range(1, 5): + prompt = question + ('\n\nObservations so far:\n' + '\n'.join(history) if history else '') + react.insert([{'step': step, 'prompt': prompt, 'system_prompt': SYSTEM}]) + + row = react.where(react.step == step).select(react.answer, react.tool_output).collect()[0] + if row['tool_output']: + history.append(f'Step {step}: {row["tool_output"]}') + if row['answer'] and 'FINAL' in row['answer'].upper(): + break +``` + +**When to use**: Multi-step research, complex reasoning requiring external data. + +## Planning + +Generate a complete plan upfront, then execute all steps. + +```python +import json + +planner = pxt.create_table('demo.planner', {'question': pxt.String}, if_exists='ignore') + +# Generate plan as JSON +planner.add_computed_column( + plan_text=chat_completions( + messages=[{ + 'role': 'user', + 'content': 'Break into 2-3 research steps. 
Return JSON: {"steps": ["step1", "step2"]}\n\n' + planner.question + }], + model='gpt-4o-mini' + ).choices[0].message.content, if_exists='ignore') + +# Format plan into execution prompt +@pxt.udf +def format_plan(plan_json: str, question: str) -> str: + try: + data = json.loads(plan_json) + steps = data if isinstance(data, list) else data.get('steps', []) + step_list = '\n'.join(f'{i+1}. {s}' for i, s in enumerate(steps)) + except Exception: + step_list = '1. ' + question + return f'Answer each sub-question, then synthesize:\n\nOriginal: {question}\n\n{step_list}' + +planner.add_computed_column( + exec_prompt=format_plan(planner.plan_text, planner.question), + if_exists='ignore') + +planner.add_computed_column( + answer=chat_completions( + messages=[{'role': 'user', 'content': planner.exec_prompt}], + model='gpt-4o-mini' + ).choices[0].message.content, if_exists='ignore') +``` + +**When to use**: Complex questions, multi-step research, structured problem solving. + +## Comparison with Traditional Frameworks + +| Concept | Pixeltable | LangChain / CrewAI / LangGraph | +|---------|-----------|-------------------------------| +| Pipeline step | Computed column | Function in a chain/loop | +| Parallel execution | Independent columns (automatic) | `asyncio.gather` / explicit | +| Persistence | Built-in — every intermediate stored | Separate logging/DB layer | +| Caching | Automatic — same input never recomputed | Manual memoization | +| Reusable sub-agent | `pxt.udf(table, return_value=...)` | Agent class with `.run()` | +| Error recovery | `recompute_columns(where=errortype != None)` | Re-run entire pipeline | +| Observability | Query any column on any row | Attach tracing callbacks | + +Patterns compose naturally — an orchestrator can use routing in its dispatch, tool use within workers, and ReAct reasoning inside tool loops, all without special glue code. 
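
The "independent columns run automatically in parallel" row can be pictured as a dependency graph over columns. The sketch below is a toy model in plain Python, not Pixeltable's scheduler: it assumes a hypothetical `deps` map that mirrors the Parallelization example, and it uses the standard library's `graphlib` to group columns into waves that could run concurrently.

```python
from graphlib import TopologicalSorter

# Hypothetical column -> dependencies map, mirroring the Parallelization example:
# three analyses read only the input column; `report` reads all three.
deps = {
    'text': set(),
    'sentiment': {'text'},
    'entities': {'text'},
    'summary': {'text'},
    'report': {'sentiment', 'entities', 'summary'},
}


def execution_waves(dependencies):
    """Group columns into waves; columns in the same wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every column whose inputs are complete
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(deps)
for index, wave in enumerate(waves):
    print(index, wave)
```

In this model `sentiment`, `entities`, and `summary` land in the same wave because none of them reads another's output, while `report` waits for all three.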
diff --git a/skills/pixeltable/references/agents-memory-mcp.md b/skills/pixeltable/references/agents-memory-mcp.md new file mode 100644 index 0000000..5a6bfff --- /dev/null +++ b/skills/pixeltable/references/agents-memory-mcp.md @@ -0,0 +1,289 @@ +# Agent with Memory and MCP Tools + +A production recipe combining a tool-calling agent with persistent memory (chat history + knowledge bank) and external MCP server integration. The agent remembers past conversations, retrieves stored facts, and can call both local tools and remote MCP tools. + +## Workflow + +1. Create a chat history table with embedding index for semantic recall +2. Create a memory bank table for long-lived facts and preferences +3. Write `@pxt.query` retrieval functions for both (filtered by `user_id`) +4. Write local `@pxt.udf` tools (including a `save_memory` tool for the LLM to store facts) +5. (Optional) Load MCP tools with `pxt.mcp_udfs()` and combine with local tools +6. Bundle all tools with `pxt.tools()` +7. Create agent table with computed column chain: LLM -> invoke_tools -> context assembly -> final answer +8. After each agent response, save the conversation to chat history for future recall + +## Full Pipeline + +```python +import pixeltable as pxt +from pixeltable.functions.openai import chat_completions, embeddings +from pixeltable.functions.openai import invoke_tools as openai_invoke_tools +from pixeltable.functions.huggingface import sentence_transformer +from datetime import datetime + +pxt.create_dir('agent_app', if_exists='ignore') + +# ── 1. Memory: Chat History ───────────────────────────────────────── +# Stores every user and assistant message with embeddings for recall. 
+ +chat_history = pxt.create_table('agent_app.chat_history', { + 'role': pxt.String, # 'user' or 'assistant' + 'content': pxt.String, + 'timestamp': pxt.Timestamp, + 'user_id': pxt.String, +}, if_exists='ignore') + +embed_fn = sentence_transformer.using(model_id='all-MiniLM-L6-v2') +chat_history.add_embedding_index('content', string_embed=embed_fn, if_exists='ignore') + +@pxt.query +def recall_chat_history(query_text: str, user_id: str, top_k: int = 5): + """Retrieve past conversation turns relevant to the current query.""" + sim = chat_history.content.similarity(string=query_text) + return ( + chat_history + .where((chat_history.user_id == user_id) & (sim > 0.5)) + .order_by(sim, asc=False) + .limit(top_k) + .select(chat_history.role, chat_history.content, sim=sim) + ) + +# ── 2. Memory: Knowledge Bank ─────────────────────────────────────── +# Stores user preferences, facts, and persistent notes. + +memory_bank = pxt.create_table('agent_app.memory_bank', { + 'content': pxt.String, + 'category': pxt.String, # 'preference', 'fact', 'note' + 'user_id': pxt.String, + 'timestamp': pxt.Timestamp, +}, if_exists='ignore') + +memory_bank.add_embedding_index('content', string_embed=embed_fn, if_exists='ignore') + +@pxt.query +def recall_memories(query_text: str, user_id: str, top_k: int = 3): + """Retrieve relevant stored memories for a user.""" + sim = memory_bank.content.similarity(string=query_text) + return ( + memory_bank + .where((memory_bank.user_id == user_id) & (sim > 0.5)) + .order_by(sim, asc=False) + .limit(top_k) + .select(memory_bank.content, memory_bank.category, sim=sim) + ) + +# Seed memories +memory_bank.insert([ + {'content': 'User prefers concise answers with code examples.', + 'category': 'preference', 'user_id': 'user_1', 'timestamp': datetime.now()}, + {'content': 'Project uses FastAPI with Python 3.12.', + 'category': 'fact', 'user_id': 'user_1', 'timestamp': datetime.now()}, +]) + +# ── 3. 
Local tools ────────────────────────────────────────────────── + +@pxt.udf +def get_weather(city: str) -> str: + """Get current weather for a city.""" + weather_data = { + 'new york': 'Sunny, 72F', 'london': 'Cloudy, 58F', + 'tokyo': 'Rainy, 65F', 'paris': 'Partly cloudy, 68F', + } + return weather_data.get(city.lower(), f'Weather data not available for {city}') + +@pxt.udf +def save_memory(content: str, category: str, user_id: str) -> str: + """Save a new fact or preference to the user's memory bank.""" + memory_bank.insert([{ + 'content': content, 'category': category, + 'user_id': user_id, 'timestamp': datetime.now(), + }]) + return f'Saved to memory: {content}' + +# ── 4. MCP tools (optional) ───────────────────────────────────────── +# Load tools from any MCP-compliant server and combine with local tools. + +# mcp_tools = pxt.mcp_udfs('http://localhost:8000/mcp') +# tools = pxt.tools(get_weather, save_memory, recall_memories, *mcp_tools) + +# Without MCP: +tools = pxt.tools(get_weather, save_memory, recall_memories) + +# ── 5. Context assembly ───────────────────────────────────────────── + +@pxt.udf +def build_prompt( + question: str, + tool_outputs: list | None, + chat_context: list | None, + memory_context: list | None, +) -> str: + parts = [f"USER QUESTION: {question}"] + + if memory_context: + mem_str = '\n'.join( + f"- [{item.get('category', '?')}] {item.get('content', '')}" + for item in memory_context if isinstance(item, dict) + ) + parts.append(f"\n[USER MEMORIES]\n{mem_str}") + + if chat_context: + chat_str = '\n'.join( + f"- {item.get('role', '?')}: {item.get('content', '')}" + for item in chat_context if isinstance(item, dict) + ) + parts.append(f"\n[RECENT CONVERSATION]\n{chat_str}") + + if tool_outputs: + parts.append(f"\n[TOOL RESULTS]\n{tool_outputs}") + + return '\n'.join(parts) + +# ── 6. 
Agent pipeline ─────────────────────────────────────────────── + +agent = pxt.create_table('agent_app.agent', { + 'prompt': pxt.String, + 'user_id': pxt.String, + 'timestamp': pxt.Timestamp, + 'system_prompt': pxt.String, + 'max_tokens': pxt.Int, + 'temperature': pxt.Float, +}, if_exists='ignore') + +# Step 1: Tool selection +agent.add_computed_column( + initial_response=chat_completions( + messages=[{'role': 'user', 'content': agent.prompt}], + model='gpt-4o-mini', + tools=tools, + ), if_exists='ignore') + +# Step 2: Execute tools +agent.add_computed_column( + tool_output=openai_invoke_tools(tools, agent.initial_response), + if_exists='ignore') + +# Step 3: Retrieve memory context (runs in parallel as separate computed columns) +agent.add_computed_column( + chat_context=recall_chat_history(agent.prompt, agent.user_id), + if_exists='ignore') + +agent.add_computed_column( + memory_context=recall_memories(agent.prompt, agent.user_id), + if_exists='ignore') + +# Step 4: Assemble prompt +agent.add_computed_column( + context=build_prompt( + agent.prompt, agent.tool_output, + agent.chat_context, agent.memory_context), + if_exists='ignore') + +# Step 5: Final response +agent.add_computed_column( + final_response=chat_completions( + messages=[ + {'role': 'system', 'content': agent.system_prompt}, + {'role': 'user', 'content': agent.context}, + ], + model='gpt-4o-mini', + max_tokens=agent.max_tokens, + temperature=agent.temperature, + ), if_exists='ignore') + +agent.add_computed_column( + answer=agent.final_response.choices[0].message.content, + if_exists='ignore') +``` + +## Usage + +```python +# Ask a question — memory and tools are used automatically +agent.insert([{ + 'prompt': 'What is the weather in Tokyo? Remember that I like brief answers.', + 'user_id': 'user_1', + 'timestamp': datetime.now(), + 'system_prompt': 'You are a helpful assistant. 
Use tools and memories to personalize your response.', + 'max_tokens': 512, + 'temperature': 0.7, +}]) + +result = agent.order_by(agent.timestamp, asc=False).limit(1).select(agent.answer).collect() + +# Save the conversation to chat history for future recall +agent_row = agent.order_by(agent.timestamp, asc=False).limit(1).select( + agent.prompt, agent.answer, agent.user_id, agent.timestamp).collect() +row = agent_row[0] + +chat_history.insert([ + {'role': 'user', 'content': row['prompt'], + 'user_id': row['user_id'], 'timestamp': row['timestamp']}, + {'role': 'assistant', 'content': row['answer'], + 'user_id': row['user_id'], 'timestamp': datetime.now()}, +]) +``` + +## Adding MCP Tools + +Connect to any MCP-compliant server to extend the agent with external tools: + +```python +# Load tools from an MCP server +mcp_tools = pxt.mcp_udfs('http://localhost:8000/mcp') + +# Inspect available tools +for tool in mcp_tools: + print(f'- {tool.name}: {tool.comment()}') + +# Combine with local tools +tools = pxt.tools(get_weather, save_memory, recall_memories, *mcp_tools) +``` + +MCP tools are called via `invoke_tools()` exactly like local UDFs — no special handling needed. + +## Multi-Provider invoke_tools + +The agent pipeline works with any provider that supports tool calling: + +| Provider | Import | invoke_tools | +|----------|--------|-------------| +| OpenAI | `from pixeltable.functions.openai import invoke_tools` | `openai.invoke_tools(tools, response)` | +| Anthropic | `from pixeltable.functions.anthropic import invoke_tools` | `anthropic.invoke_tools(tools, response)` | +| Groq | `from pixeltable.functions.groq import invoke_tools` | `groq.invoke_tools(tools, response)` | +| Gemini | `from pixeltable.functions.gemini import invoke_tools` | `gemini.invoke_tools(tools, response)` | +| Bedrock | `from pixeltable.functions.bedrock import invoke_tools` | `bedrock.invoke_tools(tools, response)` | + +To switch providers, change the import and the LLM call function. 
The `tools` object and `invoke_tools()` pattern stay the same. + +## How It Works + +1. **Chat history** — Every conversation is stored in a table with an embedding index. The `recall_chat_history` query retrieves semantically relevant past turns for the current user. + +2. **Memory bank** — Long-lived facts and preferences are stored separately. The `recall_memories` query retrieves relevant memories. The `save_memory` tool lets the LLM itself save new facts during conversation. + +3. **User scoping** — All queries filter by `user_id`, so multiple users can share the same tables without seeing each other's data. + +4. **MCP integration** — `pxt.mcp_udfs()` loads tools from any MCP server as regular Pixeltable UDFs. They're bundled with `pxt.tools()` and executed with `invoke_tools()` just like local functions. + +## Adapting This Recipe + +- **Add document RAG**: Create a document chunking view and add a `search_documents` query to the tools list +- **Add image memory**: Use CLIP embeddings on an image column for visual memory recall +- **Serve via API**: Wrap in a FastAPI endpoint — see [workflows.md → FastAPI App Pattern](workflows.md#fastapi-app-pattern) +- **Use Anthropic instead**: Swap `chat_completions` → `messages` and `openai.invoke_tools` → `anthropic.invoke_tools` — see [providers.md → Quick Reference](providers.md#quick-reference) + +## Agent with Memory Checklist + +- [ ] Chat history table created with `user_id`, `role`, `content`, `timestamp` columns +- [ ] Embedding index added on chat history `content` column +- [ ] Memory bank table created with `user_id`, `content`, `category` columns +- [ ] Embedding index added on memory bank `content` column +- [ ] Recall queries filter by `user_id` (multi-tenant safety) +- [ ] Recall queries use `.similarity(string=...)` with keyword argument and a minimum threshold +- [ ] `save_memory` tool has a clear docstring so the LLM knows when to store facts +- [ ] Tools bundled with `pxt.tools()` — includes both 
local UDFs and MCP tools if any +- [ ] `invoke_tools()` import matches the LLM provider used +- [ ] Agent response saved to chat history after each interaction (both user and assistant turns) +- [ ] Tested with multiple user IDs to verify scoping works diff --git a/skills/pixeltable/references/anti-patterns.md b/skills/pixeltable/references/anti-patterns.md new file mode 100644 index 0000000..9288fa3 --- /dev/null +++ b/skills/pixeltable/references/anti-patterns.md @@ -0,0 +1,388 @@ +# Anti-Patterns: Training-Distribution Biases LLMs Bring to Pixeltable + +LLMs are trained on millions of imperative Python examples using pandas, LangChain, standalone vector DBs, and raw loops. These priors are **wrong for Pixeltable**. This page lists every common bias and the correct idiomatic shape. + +## The 5 Macro Biases (High Priority) + +These are structural — getting any one wrong means the entire solution is non-idiomatic. + +### 1. Framework addiction (LangChain / LlamaIndex / Haystack / LangGraph) + +**Wrong:** +```python +from langchain.text_splitter import RecursiveCharacterTextSplitter +from langchain_community.vectorstores import Chroma +from langchain_openai import OpenAIEmbeddings, ChatOpenAI +from langchain.chains import RetrievalQA + +splitter = RecursiveCharacterTextSplitter(chunk_size=512) +chunks = splitter.split_documents(docs) +vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings()) +chain = RetrievalQA.from_chain_type(ChatOpenAI(), retriever=vectorstore.as_retriever()) +``` + +**Right:** +```python +import pixeltable as pxt +from pixeltable.functions.document import document_splitter +from pixeltable.functions.openai import chat_completions, embeddings + +docs = pxt.create_table('app.docs', {'doc': pxt.Document}, if_exists='ignore') +chunks = pxt.create_view('app.chunks', docs, + iterator=document_splitter(docs.doc, separators='token_limit', limit=512), + if_exists='ignore') +chunks.add_embedding_index('text', 
embedding=embeddings(model='text-embedding-3-small'), if_exists='ignore') +``` + +**Why:** Pixeltable handles chunking, embedding, indexing, and retrieval natively. Adding a framework on top creates redundant abstraction, breaks incremental updates, and loses version control. + +--- + +### 2. pandas as working store + +**Wrong:** +```python +import pandas as pd + +df = pd.read_csv('data.csv') +df['summary'] = df['text'].apply(lambda x: call_openai(x)) +df['embedding'] = df['text'].apply(lambda x: get_embedding(x)) +df.to_parquet('output.parquet') +``` + +**Right:** +```python +import pixeltable as pxt +from pixeltable.functions.openai import chat_completions, embeddings + +t = pxt.create_table('app.data', source='data.csv', if_exists='ignore') +t.add_computed_column(summary=chat_completions( + messages=[{'role': 'user', 'content': 'Summarize: ' + t.text}], + model='gpt-4o-mini' +).choices[0].message.content, if_exists='ignore') +t.add_embedding_index('text', embedding=embeddings(model='text-embedding-3-small'), if_exists='ignore') + +# Export ONLY at the end if needed +df = t.select(t.text, t.summary).collect().to_pandas() +``` + +**Why:** pandas has no persistence, no incremental computation, no automatic retry on API failures, and no version control. Pixeltable tables persist, recompute only new/failed rows, and maintain full history. + +--- + +### 3. 
For-loops calling AI models + +**Wrong:** +```python +results = [] +for _, row in df.iterrows(): + response = openai.chat.completions.create( + model='gpt-4o-mini', + messages=[{'role': 'user', 'content': row['text']}] + ) + results.append(response.choices[0].message.content) +df['summary'] = results +``` + +**Right:** +```python +from pixeltable.functions.openai import chat_completions + +t.add_computed_column( + summary=chat_completions( + messages=[{'role': 'user', 'content': t.text}], + model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore' +) +``` + +**Why:** Computed columns handle batching, rate limiting (configured in `~/.pixeltable/config.toml`), automatic caching (never re-calls for unchanged rows), error isolation per row, and retry via `recompute_columns()`. A for-loop has none of this. + +--- + +### 4. Separate vector database + +**Wrong:** +```python +import chromadb +from chromadb.utils import embedding_functions + +client = chromadb.Client() +ef = embedding_functions.OpenAIEmbeddingFunction(api_key=os.environ['OPENAI_API_KEY']) +collection = client.create_collection("docs", embedding_function=ef) +collection.add(documents=texts, ids=ids) +results = collection.query(query_texts=["search query"], n_results=5) +``` + +**Right:** +```python +from pixeltable.functions.openai import embeddings + +t.add_embedding_index('text', + embedding=embeddings(model='text-embedding-3-small'), + if_exists='ignore') + +sim = t.text.similarity(string='search query') +results = t.order_by(sim, asc=False).limit(5).select(t.text, sim).collect() +``` + +**Why:** The embedding index lives inside the table — it updates automatically when rows are inserted, shares the same version history, and requires no separate service. Querying uses the same expression language as everything else. + +--- + +### 5. 
While-loop agent patterns + +**Wrong:** +```python +messages = [{"role": "user", "content": user_query}] +while True: + response = openai.chat.completions.create(model="gpt-4o", messages=messages, tools=tools) + if response.choices[0].finish_reason == "stop": + break + tool_calls = response.choices[0].message.tool_calls + for tc in tool_calls: + result = execute_tool(tc) + messages.append({"role": "tool", "content": result, "tool_call_id": tc.id}) +``` + +**Right:** +```python +from pixeltable.functions.openai import chat_completions, invoke_tools + +tools = pxt.tools(search_docs, get_weather) + +agent = pxt.create_table('app.agent', {'prompt': pxt.String}, if_exists='ignore') +agent.add_computed_column(response=chat_completions( + messages=[{'role': 'user', 'content': agent.prompt}], + model='gpt-4o', tools=tools, tool_choice=tools.choice(required=True) +), if_exists='ignore') +agent.add_computed_column(tool_output=invoke_tools(tools, agent.response), if_exists='ignore') +agent.add_computed_column(final=chat_completions( + messages=[{'role': 'user', 'content': agent.prompt + '\n\nContext: ' + agent.tool_output.astype(pxt.String)}], + model='gpt-4o' +).choices[0].message.content, if_exists='ignore') + +agent.insert([{'prompt': 'What is the weather in NYC?'}]) +``` + +**Why:** The declarative chain persists every intermediate result, enables debugging by inspecting any column, retries individual steps without re-running the whole chain, and maintains a complete audit trail. The while-loop loses all intermediate state on failure. 
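
To make the retry claim concrete, here is a toy model in plain Python. It is not how Pixeltable is implemented: `run_chain`, `outline`, and `draft` are hypothetical names, and a dict stands in for a persisted row. The only point is that once each step's output is stored, a retry can resume at the failed step instead of restarting the chain.

```python
def run_chain(row, steps):
    """Run named steps in order, storing each output on the row.

    Steps whose output is already stored are skipped, so a retry resumes
    at the first missing step.
    """
    for name, fn in steps:
        if name in row:  # already computed and kept from an earlier run
            continue
        try:
            row[name] = fn(row)
        except Exception as exc:
            row['error'] = f'{name}: {exc}'  # record the failure, keep earlier outputs
            return row
    row.pop('error', None)  # chain completed, clear any stale error
    return row


calls = {'outline': 0, 'draft': 0}
flaky = {'fail': True}  # simulates a transient provider error


def outline(row):
    calls['outline'] += 1
    return 'outline for ' + row['topic']


def draft(row):
    calls['draft'] += 1
    if flaky['fail']:
        raise RuntimeError('simulated API timeout')
    return 'draft from ' + row['outline']


steps = [('outline', outline), ('draft', draft)]
row = run_chain({'topic': 'tables'}, steps)  # draft fails, outline is kept
flaky['fail'] = False
row = run_chain(row, steps)  # only draft runs again
print(calls, row)
```

A bare `while` loop that keeps `messages` in a local variable has nothing comparable to resume from once the process dies.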
+ +--- + +## The Full 15-Bias Reference + +| # | LLM's prior reaches for | Correct Pixeltable shape | Why the prior is wrong | +|---|--------------------------|--------------------------|------------------------| +| 1 | LangChain / LlamaIndex / Haystack / LangGraph | `create_view` + iterator + `add_embedding_index` + `pxt.tools()` | Redundant abstraction; breaks incremental updates | +| 2 | `pandas.DataFrame` as working store | Pixeltable table is the store; `.to_pandas()` for export only | No persistence, no incremental, no versioning | +| 3 | `for row in ...:` calling AI per row | Computed column | No batching, no rate limits, no caching, no retry | +| 4 | Pinecone / Chroma / FAISS / Qdrant / pgvector | `t.add_embedding_index(col, embedding=fn)` | Separate service; no auto-update; no version control | +| 5 | Embeddings as `list[list[float]]` in memory | Stored as computed column with type `pxt.Array` | Volatile; lost on restart; can't query | +| 6 | `while not done:` agent loop | Table where insert triggers computed-column chain | Loses intermediate state; no audit trail | +| 7 | `cv2.VideoCapture` / Pillow loops for media | `frame_iterator` + `pixeltable.functions.image.*` | No persistence; manual frame management | +| 8 | `psycopg2` / `sqlalchemy` against `~/.pixeltable/pgdata` | SDK only (never touch embedded Postgres) | Corrupts internal schema; breaks versioning | +| 9 | `async def` FastAPI endpoints calling Pixeltable | `def` endpoints (Pixeltable is synchronous) | Deadlocks or silent failures under async | +| 10 | Drop + recreate tables as "initialization" | `if_exists='ignore'` on `create_table` / `create_view` | Data loss; breaks incremental computation | +| 11 | `if_exists='ignore'` to "update" column logic | `t.drop_column('col')` then recreate | `if_exists='ignore'` is a no-op if column exists | +| 12 | Threading `api_key=` into every provider call | Environment variables or `~/.pixeltable/config.toml` | Leaks keys; breaks portability | +| 13 | 
`openai-whisper` / `faster-whisper` imperative | `whisper.transcribe` or `openai.transcriptions` as computed column | No caching; manual error handling | +| 14 | Pydantic / dataclass schemas for table definition | `{'col': pxt.Type}` dict | Pixeltable has its own type system; Pydantic adds nothing | +| 15 | Chat history in Python `list` or Redis | Table with embedding index for semantic memory retrieval | Volatile or disconnected from the data layer | + +## Per-Bias Code Examples (5, 7–15) + +### 5. Embeddings as raw lists + +**Wrong:** +```python +embeddings_cache = [] +for text in texts: + emb = openai.embeddings.create(input=text, model="text-embedding-3-small") + embeddings_cache.append(emb.data[0].embedding) +# Now what? Save to pickle? Rebuild on every restart? +``` + +**Right:** +```python +from pixeltable.functions.openai import embeddings +t.add_embedding_index('text', embedding=embeddings(model='text-embedding-3-small'), if_exists='ignore') +``` + +### 7. cv2 / Pillow loops for video/image processing + +**Wrong:** +```python +import cv2 +cap = cv2.VideoCapture('video.mp4') +frames = [] +frame_count = 0 +while cap.isOpened(): + ret, frame = cap.read() + if not ret: + break + if frame_count % 30 == 0: + frames.append(frame) + frame_count += 1 +``` + +**Right:** +```python +from pixeltable.functions.video import frame_iterator + +frames = pxt.create_view('app.frames', videos, + iterator=frame_iterator(videos.video, fps=1.0), + if_exists='ignore') +``` + +### 8. Direct Postgres access + +**Wrong:** +```python +import psycopg2 +conn = psycopg2.connect(dbname='pixeltable', host='/tmp/.s.PGSQL.5432') +cur = conn.cursor() +cur.execute("SELECT * FROM ...") # NEVER DO THIS +``` + +**Right:** Always use the Pixeltable SDK. The embedded Postgres is an implementation detail. + +### 9.
async def with Pixeltable + +**Wrong:** +```python +@app.post("/query") +async def query_endpoint(q: str): + results = t.where(t.text.contains(q)).collect() # May deadlock + return results +``` + +**Right:** +```python +@app.post("/query") +def query_endpoint(q: str): + results = t.where(t.text.contains(q)).select(t.text, t.score).collect() + return results.to_pandas().to_dict(orient='records') +``` + +### 10. Drop + recreate as init + +**Wrong:** +```python +pxt.drop_table('app.data', force=True) +t = pxt.create_table('app.data', {'text': pxt.String}) +``` + +**Right:** +```python +t = pxt.create_table('app.data', {'text': pxt.String}, if_exists='ignore') +``` + +### 11. if_exists='ignore' to update logic + +**Wrong:** +```python +# Bug in summary prompt — "fix" by re-running: +t.add_computed_column(summary=fixed_expression, if_exists='ignore') +# ↑ SILENT NO-OP — column already exists with old logic +``` + +**Right:** +```python +t.drop_column('summary') +t.add_computed_column(summary=fixed_expression) +``` + +### 12. Hardcoding API keys + +**Wrong:** +```python +from pixeltable.functions.openai import chat_completions +t.add_computed_column(resp=chat_completions(..., api_key='sk-abc123')) +``` + +**Right:** Set `OPENAI_API_KEY` env var or add to `~/.pixeltable/config.toml`: +```toml +[openai] +api_key = 'sk-...' +``` + +### 13. Imperative whisper + +**Wrong:** +```python +import whisper +model = whisper.load_model("base") +for audio_file in audio_files: + result = model.transcribe(audio_file) + transcripts.append(result["text"]) +``` + +**Right:** +```python +from pixeltable.functions.whisper import transcribe + +t.add_computed_column( + transcript=transcribe(t.audio, model='base').text, + if_exists='ignore' +) +``` + +### 14. Pydantic schemas + +**Wrong:** +```python +from pydantic import BaseModel + +class Document(BaseModel): + title: str + content: str + embedding: list[float] + +# Then trying to map this to Pixeltable somehow... 
+``` + +**Right:** +```python +t = pxt.create_table('app.docs', { + 'title': pxt.String, + 'content': pxt.String, +}, if_exists='ignore') +# Embeddings are computed, not schema-declared +t.add_embedding_index('content', embedding=embed_fn, if_exists='ignore') +``` + +### 15. Chat history in lists or Redis + +**Wrong:** +```python +chat_history = [] # Lost on restart +# or +import redis +r = redis.Redis() +r.lpush(f"chat:{user_id}", json.dumps(message)) +``` + +**Right:** +```python +memory = pxt.create_table('app.memory', { + 'role': pxt.String, + 'content': pxt.String, + 'session_id': pxt.String, + 'timestamp': pxt.Timestamp, +}, if_exists='ignore') +memory.add_embedding_index('content', + embedding=embeddings.using(model='text-embedding-3-small'), + if_exists='ignore') + +# Retrieve relevant past context +sim = memory.content.similarity(string=current_query) +context = memory.where(memory.session_id == sid).order_by(sim, asc=False).limit(5).collect() +``` + +--- + +## Cross-References + +- [SKILL.md → Critical Warnings](../SKILL.md#critical-warnings--read-before-writing-code) — hallucinated API fixes +- [SKILL.md → Common Pitfalls](../SKILL.md#common-pitfalls) — wrong/right table for specific APIs +- [core-api.md → Common Pitfalls](core-api.md#common-pitfalls) — extended examples +- [Migration guides](https://docs.pixeltable.com/migrate/from-agent-frameworks) — porting from LangChain/LlamaIndex diff --git a/skills/pixeltable/references/core-api.md b/skills/pixeltable/references/core-api.md new file mode 100644 index 0000000..2bb38f1 --- /dev/null +++ b/skills/pixeltable/references/core-api.md @@ -0,0 +1,1146 @@ +# Pixeltable Core API Reference + +Complete reference for table operations, querying, computed columns, views, embedding indexes, UDFs, tools, and configuration. 
+ +## Contents + +- [Table Creation](#table-creation) (basic, primary key, UUID, from source) +- [Querying](#querying) (select, where, order by, pandas, Pydantic) +- [Computed Columns](#computed-columns) +- [Views](#views) (filtered, document chunking, video frames, string splitting, audio splitting) +- [Built-in Functions](#built-in-image-functions) (image, video, string) +- [Embedding Indexes](#embedding-indexes) (add index, similarity search, distance metrics) +- [UDFs](#udfs) (basic, optional args, batch, aggregate, retrieval) +- [Update and Delete](#update-and-delete) +- [Table Operations](#table-operations) +- [Snapshots](#snapshots) +- [Tools and Agents](#tools-and-agents) (create tools, agent pipeline, MCP) +- [Serving (FastAPIRouter)](#serving-fastapirouter) (add_insert_route, add_query_route, add_delete_route, background jobs, pxt serve) +- [Export](#export-csv-json-parquet-lancedb) (CSV, JSON, Parquet, LanceDB, SQL) +- [Configuration](#configuration) (API keys, config.toml, rate limiting, media destinations, pxtfs://) +- [Performance Tips](#performance-tips) + +--- + +## Table Creation + +### Basic Table + +```python +import pixeltable as pxt + +t = pxt.create_table('dir.table_name', { + 'col1': pxt.String, + 'col2': pxt.Int, + 'col3': pxt.Float, + 'col4': pxt.Bool, + 'col5': pxt.Image, + 'col6': pxt.Video, + 'col7': pxt.Audio, + 'col8': pxt.Document, + 'col9': pxt.Json, + 'col10': pxt.Array[(3, 4), pxt.Float], # 3x4 float array + 'col11': pxt.Timestamp, + 'col12': pxt.Date, + 'col13': pxt.UUID, + 'col14': pxt.Binary, +}, if_exists='ignore') +``` + +### Table with Primary Key + +```python +t = pxt.create_table('dir.table', { + 'id': pxt.Required[pxt.String], + 'data': pxt.String, +}, primary_key=['id'], if_exists='ignore') +``` + +### Table with Auto-Generated UUID Primary Key + +Production-ready pattern using uuid7() for automatic unique IDs: + +```python +from pixeltable.functions.uuid import uuid7 + +t = pxt.create_table('dir.items', { + 'content': 
pxt.String, + 'uuid': uuid7(), # auto-generated on insert + 'timestamp': pxt.Timestamp, +}, primary_key=['uuid'], if_exists='ignore') + +# No need to provide uuid when inserting +from datetime import datetime +t.insert([{'content': 'Hello', 'timestamp': datetime.now()}]) +``` + +### Table from Data Source + +```python +t = pxt.create_table('dir.from_csv', source='data.csv') +t = pxt.create_table('dir.from_parquet', source='data.parquet') +t = pxt.create_table('dir.data', source='data.csv', + schema_overrides={'image_col': pxt.Image, 'doc_col': pxt.Document}) +``` + +## Querying + +### Select + +```python +results = t.collect() # all columns +results = t.select(t.col1, t.col2).collect() # specific columns +results = t.select(t.col1, doubled=t.col2 * 2).collect() # with expressions +``` + +### Where (Filter) + +```python +results = t.where(t.col2 > 10).select(t.col1).collect() +results = t.where((t.col2 > 10) & (t.col1 != 'exclude')).collect() +results = t.where(t.col1.like('%pattern%')).collect() +``` + +### Order By / Limit / Count / Sample + +```python +results = t.order_by(t.col2, asc=False).limit(10).collect() +total = t.count() +filtered = t.where(t.score > 0.5).count() + +# Pagination with offset +page2 = t.order_by(t.col2).limit(10, offset=10).collect() + +# Random sample (reproducible with seed) +sample = t.sample(n=100, seed=42).select(t.col1, t.col2).collect() +``` + +### Conversions + +```python +df = t.select(t.col1, t.col2).collect().to_pandas() # to pandas +items = list(t.select(title=t.title, score=t.score).collect().to_pydantic(M)) # to Pydantic (names must match) +t.insert([pydantic_model_instance]) # insert Pydantic models +first_5 = t.head(5) + +# return_rows=True: get computed columns back from insert without a follow-up query +status = t.insert([row], return_rows=True) +data = status.rows[0] # dict with ALL columns including computed +``` + +## Computed Columns + +```python +# Simple expression +t.add_computed_column(upper_name=t.name.upper(), 
if_exists='ignore') + +# Using a UDF +t.add_computed_column(result=my_udf(t.input_col), if_exists='ignore') + +# Using an AI provider +from pixeltable.functions.openai import chat_completions +t.add_computed_column( + summary=chat_completions( + messages=[{'role': 'user', 'content': t.text}], + model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore' +) + +# Drop column +t.drop_column('column_name') + +# Recompute failed or outdated columns (critical for error recovery) +t.recompute_columns(columns=['summary']) +t.recompute_columns(columns=['summary'], where=t.summary.errortype != None) +``` + +## Views + +### Filtered View + +```python +v = pxt.create_view('dir.active', t.where(t.is_active == True), if_exists='ignore') +``` + +### Document Chunking + +```python +from pixeltable.functions.document import document_splitter + +# Separators: 'token_limit', 'sentence', 'heading', 'page', or combine: 'page, sentence' +chunks = pxt.create_view('dir.chunks', t, + iterator=document_splitter(t.doc, separators='token_limit', limit=300), + if_exists='ignore') + +# With metadata extraction and image extraction (PDF) +chunks = pxt.create_view('dir.chunks', t, + iterator=document_splitter(t.doc, separators='page, sentence', + metadata='title,heading,page', elements=['text', 'image']), + if_exists='ignore') +``` + +### Video Frame Extraction + +```python +from pixeltable.functions.video import frame_iterator + +frames = pxt.create_view('dir.frames', t, iterator=frame_iterator(t.video, fps=1.0), if_exists='ignore') +# Options: fps=N, num_frames=N, keyframes_only=True +# Output columns: frame (Image), frame_idx, pos_msec, pos_frame +``` + +### String / Audio Splitting + +```python +from pixeltable.functions.string import string_splitter +from pixeltable.functions.audio import audio_splitter + +sentences = pxt.create_view('dir.sentences', t, + iterator=string_splitter(text=t.content, separators='sentence'), if_exists='ignore') +audio_chunks = 
pxt.create_view('dir.audio_chunks', t, + iterator=audio_splitter(audio=t.audio, duration=30.0), if_exists='ignore') +``` + +## Built-in Image Functions + +```python +from pixeltable.functions import image as pxt_image + +# Thumbnail generation +t.add_computed_column( + thumb=pxt_image.thumbnail(t.image, size=(320, 320)), + if_exists='ignore') + +# Base64 encoding (useful for API responses and Anthropic vision) +t.add_computed_column( + b64=pxt_image.b64_encode(t.image), + if_exists='ignore') + +# Combined: thumbnail + base64 (common pattern for APIs) +t.add_computed_column( + thumbnail=pxt_image.b64_encode( + pxt_image.thumbnail(t.image, size=(320, 320)) + ), + if_exists='ignore') + +# Base64 with explicit format +t.add_computed_column( + png_b64=pxt_image.b64_encode(t.image, 'png'), + if_exists='ignore') +``` + +## Built-in Image Functions (Additional) + +```python +from pixeltable.functions.image import draw_bounding_boxes + +# Draw detection results on images (pairs with DETR/YOLOX output) +t.add_computed_column( + annotated=draw_bounding_boxes(t.image, t.detections), + if_exists='ignore') +``` + +## Built-in Video Functions + +```python +from pixeltable.functions.video import ( + extract_audio, resize, crop, concat_videos, + with_audio, pan, mix_audio, overlay_image, +) + +# Extract audio track from video +t.add_computed_column( + audio=extract_audio(t.video, format='mp3'), + if_exists='ignore') + +# Resize video +t.add_computed_column( + resized=resize(t.video, width=640, height=480), + if_exists='ignore') + +# Crop video region +t.add_computed_column( + cropped=crop(t.video, x=100, y=100, w=400, h=300), + if_exists='ignore') + +# Concatenate two videos +t.add_computed_column( + combined=concat_videos(t.intro_video, t.main_video), + if_exists='ignore') + +# Replace audio track on a video +t.add_computed_column( + with_new_audio=with_audio(t.video, t.narration), + if_exists='ignore') + +# Ken Burns pan effect on an image (creates video from still image) 
+t.add_computed_column( + clip=pan(t.image, duration=5.0, zoom_start=1.0, zoom_end=1.3), + if_exists='ignore') + +# Mix (overlay) two audio tracks +t.add_computed_column( + mixed=mix_audio(t.narration, t.background_music), + if_exists='ignore') + +# Overlay image (watermark) on video +t.add_computed_column( + watermarked=overlay_image(t.video, t.logo, x=10, y=10), + if_exists='ignore') +``` + +## Built-in String Functions + +```python +from pixeltable.functions import string as pxt_str + +# String length +t.add_computed_column(text_len=pxt_str.len(t.content), if_exists='ignore') +``` + +## Embedding Indexes + +### Add Index + +```python +from pixeltable.functions.huggingface import clip, sentence_transformer + +# CLIP (multimodal: text + image) +embed_fn = clip.using(model_id='openai/clip-vit-base-patch32') +t.add_embedding_index('image_col', embedding=embed_fn, if_exists='ignore') + +# Sentence Transformers (text) +embed_fn = sentence_transformer.using(model_id='all-MiniLM-L6-v2') +t.add_embedding_index('text_col', embedding=embed_fn, if_exists='ignore') + +# Sentence Transformers (multilingual, high quality, recommended for production) +embed_fn = sentence_transformer.using(model_id='intfloat/multilingual-e5-large-instruct') +t.add_embedding_index('text_col', string_embed=embed_fn, if_exists='ignore') + +# OpenAI embeddings +from pixeltable.functions.openai import embeddings +t.add_embedding_index('text_col', embedding=embeddings.using(model='text-embedding-3-small'), if_exists='ignore') +``` + +### Similarity Search + +```python +# Text +sim = t.text_col.similarity(string='search query') +results = t.order_by(sim, asc=False).limit(10).select(t.text_col, sim).collect() + +# Text with threshold filter +sim = t.text_col.similarity(string='search query') +results = t.where(sim > 0.5).order_by(sim, asc=False).limit(10).select(t.text_col, sim).collect() + +# Image with text (multimodal) +sim = t.image_col.similarity(string='a red car') +results = t.order_by(sim, 
asc=False).limit(5).select(t.image_col, sim).collect() + +# Image with image +sim = t.image_col.similarity(image='path/to/query.jpg') +results = t.order_by(sim, asc=False).limit(5).select(t.image_col, sim).collect() +``` + +### Distance Metrics + +```python +t.add_embedding_index('col', embedding=fn, metric='cosine') # default +t.add_embedding_index('col', embedding=fn, metric='ip') # inner product +t.add_embedding_index('col', embedding=fn, metric='l2') # euclidean +``` + +## B-Tree Indexes + +For efficient range queries and equality lookups on non-embedding columns: + +```python +# Add B-tree index for fast filtering +t.add_btree_index('category', if_exists='ignore') +t.add_btree_index('timestamp', if_exists='ignore') + +# Drop an index +t.drop_index('index_name') +``` + +## UDFs + +### Basic + +```python +@pxt.udf +def my_function(x: str) -> str: + return x.upper() +``` + +### With Optional Args + +```python +from typing import Optional + +@pxt.udf +def safe_process(value: Optional[str], default: str = '') -> str: + return value if value is not None else default +``` + +### Batch UDF + +```python +from pixeltable.func import Batch + +@pxt.udf(batch_size=32) +def batch_process(texts: Batch[str]) -> Batch[list[float]]: + return model.encode(texts).tolist() +``` + +### Aggregate UDF + +```python +@pxt.uda +class MyAggregator(pxt.Aggregator): + def __init__(self): + self.sum = 0 + self.count = 0 + + def update(self, val: int) -> None: + self.sum += val + self.count += 1 + + def value(self) -> float: + return self.sum / self.count if self.count > 0 else 0.0 +``` + +### Retrieval UDF (for AI Tool Use) + +```python +lookup_fn = pxt.retrieval_udf(t, name='lookup_items', description='Look up items by name', + parameters=['name'], limit=5) +``` + +### Custom Iterator + +Define custom iterators that produce multiple output rows from a single input: + +```python +@pxt.iterator +class SlidingWindowIterator: + """Produce overlapping windows from a text.""" + def 
+__init__(self, text: str, window_size: int = 100, stride: int = 50): + self.text = text + self.window_size = window_size + self.stride = stride + + def __next__(self) -> dict: # yields {'window': str} + ... +``` + +### List Iterator + +Split a list/array column into one row per element: + +```python +from pixeltable.functions import list_iterator + +# Explode a JSON array column into individual rows +items = pxt.create_view('dir.items', t, + iterator=list_iterator(t.tags), + if_exists='ignore') +``` + +## Update and Delete + +```python +t.update({'score': 1.0}, where=t.category == 'important') +t.delete(where=t.is_active == False) +``` + +### return_rows=True (insert-then-read) + +Get all column values (including computed columns) back from `insert()`, `update()`, or `batch_update()` without a follow-up query: + +```python +# Anti-pattern: insert then query +t.insert([row]) +result = t.where(t.id == value).select(...).collect() +data = result[0] + +# Correct: return_rows=True +status = t.insert([row], return_rows=True) +data = status.rows[0] # dict with ALL columns including computed +``` + +For typed access, use Pydantic `model_validate()` with `extra="ignore"` (row dicts contain every column): + +```python +from typing import Any + +from pydantic import BaseModel + +class AgentResult(BaseModel): + model_config = {"extra": "ignore"} + answer: str | None = None + tool_output: Any = None + +status = agent.insert([{"prompt": user_input}], return_rows=True) +result = AgentResult.model_validate(status.rows[0]) +``` + +**When to use which:** +- `return_rows=True` -- insert/update and read computed columns back in one call +- `to_pydantic()` -- reading from a `ResultSet` (after `.collect()`) +- `model_validate()` -- reading from `status.rows` (plain dicts from `return_rows=True`) + +## Table Operations + +```python +t.rename_column('old_name', 'new_name') +t.add_column(new_col=pxt.String) +t.drop_column('col_name') +t.describe() +t.columns() + +# Directory management +pxt.list_dirs() 
+pxt.list_tables() +contents = pxt.get_dir_contents('my_dir') +``` + +## Recompute Columns + +Re-run computed columns on existing rows. Critical for retrying after API errors or rate limits: + +```python +# Recompute all rows for a column +t.recompute_columns(columns=['summary']) + +# Recompute only failed rows (most common pattern) +t.recompute_columns(columns=['summary'], where=t.summary.errortype != None) + +# Recompute specific rows matching a condition +t.recompute_columns(columns=['label'], where=t.category == 'pending') +``` + +## Snapshots and Version History + +Point-in-time copies of tables: + +```python +snap = pxt.create_snapshot('dir.snap_v1', t, if_exists='ignore') +# Query the snapshot like any table +snap.select(snap.col1).collect() + +# View table version history +versions = t.get_versions() +``` + +## Tools and Agents + +### Create Tools from UDFs and Query Functions + +```python +@pxt.udf +def web_search(keywords: str) -> str: + """Search the web for information.""" + from duckduckgo_search import DDGS + with DDGS() as ddgs: + results = list(ddgs.news(keywords=keywords, max_results=5)) + return '\n'.join(f"{r['title']}: {r['body']}" for r in results) if results else 'No results.' + +@pxt.query +def search_docs(query_text: str): + """Search documents by semantic similarity.""" + sim = chunks.text.similarity(string=query_text) + return chunks.order_by(sim, asc=False).limit(10).select(chunks.text, sim) + +tools = pxt.tools(web_search, search_docs) +``` + +### Full Tool-Calling Agent Pipeline + +The agent pipeline uses chained computed columns. 
Inserting a row triggers the entire pipeline: + +```python +from pixeltable.functions.anthropic import messages, invoke_tools + +agent = pxt.create_table('project.agent', { + 'prompt': pxt.String, + 'timestamp': pxt.Timestamp, + 'initial_system_prompt': pxt.String, + 'final_system_prompt': pxt.String, + 'max_tokens': pxt.Int, + 'temperature': pxt.Float, +}, if_exists='ignore') + +# Step 1: Initial LLM call with tool selection +agent.add_computed_column( + initial_response=messages( + model='claude-sonnet-4-20250514', + messages=[{'role': 'user', 'content': [{'type': 'text', 'text': agent.prompt}]}], + tools=tools, + tool_choice=tools.choice(required=True), + max_tokens=agent.max_tokens, + model_kwargs={ + 'system': agent.initial_system_prompt, + 'temperature': agent.temperature, + }, + ), + if_exists='ignore', +) + +# Step 2: Execute the tools the LLM selected +agent.add_computed_column( + tool_output=invoke_tools(tools, agent.initial_response), + if_exists='ignore', +) + +# Step 3: RAG context retrieval +agent.add_computed_column( + doc_context=search_docs(agent.prompt), + if_exists='ignore', +) + +# Step 4: Assemble context with a UDF +agent.add_computed_column( + context=assemble_context(agent.prompt, agent.tool_output, agent.doc_context), + if_exists='ignore', +) + +# Step 5: Final LLM call with full context +agent.add_computed_column( + final_response=messages( + model='claude-sonnet-4-20250514', + messages=[{'role': 'user', 'content': [{'type': 'text', 'text': agent.context}]}], + max_tokens=agent.max_tokens, + model_kwargs={ + 'system': agent.final_system_prompt, + 'temperature': agent.temperature, + }, + ), + if_exists='ignore', +) + +# Step 6: Extract answer text +agent.add_computed_column( + answer=agent.final_response.content[0].text, + if_exists='ignore', +) +``` + +### Using the Agent Pipeline + +```python +from datetime import datetime + +agent.insert([{ + 'prompt': 'What are the latest developments in quantum computing?', + 'timestamp': 
datetime.now(), + 'initial_system_prompt': 'Identify the best tool(s) to answer the query.', + 'final_system_prompt': 'Provide a clear answer. Cite sources when possible.', + 'max_tokens': 1024, + 'temperature': 0.7, +}]) + +result = agent.order_by(agent.timestamp, asc=False).limit(1).select(agent.answer).collect() +``` + +### MCP Integration + +```python +udfs = pxt.mcp_udfs('http://localhost:8080/sse') +``` + +--- + +## Serving (FastAPIRouter) + +`pixeltable.serving.FastAPIRouter` (v0.6+) is a subclass of FastAPI's `APIRouter` that generates endpoints from tables and `@pxt.query` functions. No Pydantic models or hand-written handlers needed. + +### add_insert_route + +```python +from pixeltable.serving import FastAPIRouter +import pixeltable as pxt + +router = FastAPIRouter(prefix="/api/data", tags=["data"]) +docs = pxt.get_table("app.documents") + +# Synchronous insert — returns inserted row fields +router.add_insert_route(docs, path="/upload/image", + uploadfile_inputs=["image"], inputs=["timestamp"], outputs=["uuid", "thumbnail"]) + +# Background insert — returns job handle for polling +router.add_insert_route(docs, path="/upload/document", + uploadfile_inputs=["document"], inputs=["timestamp"], outputs=["uuid"], + background=True) +# Client receives { "job_url": "http://host/jobs/{id}" } +# Poll GET /jobs/{id} → { "status": "pending" | "done" | "error", "result": {...} } +``` + +Parameters: +- `uploadfile_inputs` — column names sent as `UploadFile` (multipart form) +- `inputs` — column names sent as form fields +- `outputs` — column names to return after insert +- `background=True` — return immediately with a job URL; client polls for result + +### add_query_route + +```python +@pxt.query +def search_docs(query_text: str): + sim = chunks.text.similarity(string=query_text) + return chunks.where(sim > 0.3).order_by(sim, asc=False).select( + text=chunks.text, sim=sim).limit(20) + +router.add_query_route(path="/search", query=search_docs, method="post") +# POST 
/api/data/search {"query_text": "..."} → { "rows": [...] } + +@pxt.query +def list_docs(): + return docs.select(uuid=docs.uuid, name=docs.document).order_by(docs.timestamp, asc=False) + +router.add_query_route(path="/list", query=list_docs, method="get") +# GET /api/data/list → { "rows": [...] } +``` + +### add_delete_route + +```python +# Delete by primary key +router.add_delete_route(docs, path="/delete") +# POST /api/data/delete {"uuid": "..."} → { "num_rows": 1 } + +# Delete by non-PK column +router.add_delete_route(chat, path="/delete-conversation", match_columns=["conversation_id"]) +``` + +### Architecture pattern + +``` +setup_pixeltable.py — flat module: creates tables, views, indexes on import +routers/data.py — pxt.get_table() + @pxt.query + add_*_route +routers/search.py — pxt.get_table() + @pxt.query + add_*_route +main.py — import setup_pixeltable; from routers import data, search +``` + +See [workflows.md → FastAPIRouter](workflows.md#fastapirouter-declarative-serving-v06) for a complete example. + +### pxt serve (CLI) + +Define routes in `pyproject.toml` (standard Python convention) or a standalone `pixeltable.toml`, then run `pxt serve`: + +```toml +# In pyproject.toml (alongside [project] and dependencies) +# Requires [build-system] + [tool.setuptools] py-modules = ["schema"] +# so pxt serve can import schema.py without PYTHONPATH hacks. 
+ +[[tool.pixeltable.service]] +name = "pipeline" +prefix = "/api" +port = 8000 + +[[tool.pixeltable.service.routes]] +type = "query" +path = "/search" +query = "schema:search_documents" # colon-separated: module:attribute +method = "post" + +[[tool.pixeltable.service.routes]] +type = "insert" +path = "/ingest/document" +table = "pipeline.documents" +inputs = ["title", "body", "source_id"] +outputs = ["uuid"] + +[[tool.pixeltable.service.routes]] +type = "delete" +path = "/delete/document" +table = "pipeline.documents" +``` + +```bash +pxt serve pipeline # serves at http://localhost:8000 +pxt serve pipeline --port 9000 +``` + +Insert routes can auto-export to a serving DB on every request: + +```toml +[[tool.pixeltable.service.routes]] +type = "insert" +path = "/ingest/document" +table = "pipeline.documents" +inputs = ["title", "body", "source_id"] +outputs = ["uuid"] + +[tool.pixeltable.service.routes.export_sql] +db_connect = "postgresql+psycopg://user:pass@host/db" +table = "processed_documents" +method = "insert" +``` + +`pxt serve` generates a complete FastAPI app with OpenAPI docs at `/docs`. Same capabilities as `FastAPIRouter` (insert, query, delete, background jobs). See the [Starter Kit `serving/` directory](https://github.com/pixeltable/pixeltable-starter-kit/tree/main/serving) for a working example. 
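Background insert routes (`background=True`) return a `job_url` that the client polls until the job leaves the `pending` state. The following is a minimal client-side sketch of that loop, assuming the response shape given in the `add_insert_route` comments (a `status` of `pending`, `done`, or `error`, plus a `result` field). The helper names `fetch_json` and `wait_for_job`, the polling interval, and the timeout are hypothetical and not part of Pixeltable:

```python
import json
import time
import urllib.request
from typing import Any, Callable


def fetch_json(url: str) -> dict[str, Any]:
    """GET a URL and decode its JSON body using only the standard library."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_job(
    job_url: str,
    fetch: Callable[[str], dict[str, Any]] = fetch_json,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> Any:
    """Poll job_url until the job is done; return its 'result' field.

    Assumes (from the route comments) that the endpoint returns
    {"status": "pending" | "done" | "error", "result": {...}}.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch(job_url)
        status = job.get('status')
        if status == 'done':
            return job.get('result')
        if status == 'error':
            raise RuntimeError(f'background job failed: {job}')
        # Any other status is treated as still pending.
        if time.monotonic() >= deadline:
            raise TimeoutError(f'job at {job_url} not finished after {timeout_s}s')
        time.sleep(interval_s)
```

Passing `fetch` as a parameter keeps the loop testable without a running server; a real client could keep the stdlib default or swap in another HTTP library.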
+ +--- + +## Data Sharing and Replication + +Share tables across teams or environments: + +```python +# Publish a table version (makes it shareable) +t.publish() + +# Replicate a published table (creates a local synchronized copy) +replica = pxt.replicate('dir.local_copy', source_table_uri) + +# Sync changes +replica.pull() # fetch latest from source +replica.push() # push local changes to source +``` + +## Export (CSV, JSON, Parquet, LanceDB) + +```python +import pixeltable as pxt + +t = pxt.get_table('myapp/documents') + +# Export to CSV +pxt.io.export_csv(t, '/data/documents.csv') + +# Export to JSON +pxt.io.export_json(t, '/data/documents.json') + +# Export to Parquet +pxt.io.export_parquet(t, '/data/documents.parquet') + +# Export to LanceDB (vector DB) +pxt.io.export_lancedb(t, db_uri='/data/lance', table_name='docs') + +# Export filtered query results +results = t.where(t.score > 0.8).select(t.title, t.score) +pxt.io.export_csv(results, '/data/filtered.csv') + +# Other formats +df = t.collect().to_pandas() # Pandas DataFrame +ds = t.to_pytorch_dataset(['image']) # PyTorch DataLoader +coco = t.to_coco_dataset() # COCO format +``` + +--- + +## Export to SQL Databases + +```python +from pixeltable.io.sql import export_sql + +# Export full table to SQLite +export_sql(t, 'my_table', db_connect_str='sqlite:///data.db') + +# Export filtered query with column rename +export_sql( + t.where(t.score > 0.8).select(product_name=t.name, price=t.price), + 'filtered_products', + db_connect_str='sqlite:///data.db', +) + +# Append to existing SQL table +export_sql(t, 'products', db_connect_str=connection_string, if_exists='insert') + +# Replace existing SQL table +export_sql(t, 'products', db_connect_str=connection_string, if_exists='replace') + +# Cloud databases (PostgreSQL, Snowflake, etc.) 
+export_sql(t, 'products', db_connect_str='postgresql+psycopg://user:pass@host:5432/db') +``` + +--- + +## Configuration + +### API Keys + +```python +# Via init +pxt.init({'openai.api_key': 'sk-...', 'anthropic.api_key': 'sk-ant-...'}) + +# Via environment variables (recommended) +# OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY / GEMINI_API_KEY, +# TOGETHER_API_KEY, FIREWORKS_API_KEY, MISTRAL_API_KEY, GROQ_API_KEY, +# DEEPSEEK_API_KEY, VOYAGE_API_KEY, REPLICATE_API_TOKEN, HF_AUTH_TOKEN, +# OPENROUTER_API_KEY, FAL_API_KEY, REVE_API_KEY, TWELVELABS_API_KEY, +# BEDROCK_API_KEY +``` + +### config.toml + +Located at `~/.pixeltable/config.toml`: + +```toml +[pixeltable] +file_cache_size_g = 250 +time_zone = "America/Los_Angeles" +hide_warnings = true +verbosity = 2 + +[openai] +api_key = 'sk-...' +# For Azure OpenAI, add these to the same [openai] section: +# base_url = 'https://my-deployment.openai.azure.com/' +# api_version = '2024-02-01' + +# Per-model rate limits (requests per minute) +[openai.rate_limits] +gpt-4o = 500 +gpt-4o-mini = 1000 +tts-1 = 50 +dall-e-3 = 10 + +[anthropic] +api_key = 'sk-ant-...' + +[mistral] +api_key = 'my-mistral-key' +rate_limit = 600 +``` + +### Rate Limiting + +Default: 600 requests per minute per provider. 
Configure in `config.toml`: + +```toml +# Single rate limit for all models of a provider +[fireworks] +rate_limit = 300 + +# Per-model rate limits +[openai.rate_limits] +gpt-4o = 500 +gpt-4o-mini = 1000 +``` + +Custom resource pools for non-built-in APIs: + +```python +@pxt.udf(resource_pool='request-rate:my_service') +def call_custom_api(prompt: str) -> str: + return requests.post('https://my-api.com/generate', json={'prompt': prompt}).json()['text'] +``` + +### Media Destinations (Cloud Storage) + +Store media files in S3, GCS, Azure, or other cloud storage instead of locally: + +```toml +# config.toml — global default +[pixeltable] +input_media_dest = "s3://my-bucket/input/" +output_media_dest = "s3://my-bucket/output/" +``` + +```bash +# Or via environment variables +export PIXELTABLE_INPUT_MEDIA_DEST="s3://my-bucket/input/" +export PIXELTABLE_OUTPUT_MEDIA_DEST="s3://my-bucket/output/" +``` + +```python +# Per-column destination (overrides global default) +t.add_computed_column( + thumbnail=pxt_image.thumbnail(t.image, size=(256, 256)), + destination='s3://my-bucket/thumbnails/', + if_exists='ignore', +) +``` + +Supported providers: Amazon S3, Google Cloud Storage (`gs://`), Azure Blob Storage (`wasbs://`), Cloudflare R2, Backblaze B2, Tigris. + +**Pixeltable Cloud (home bucket):** Free R2-backed storage. No AWS credentials needed: + +```python +# Use pxtfs:// URI as a destination +t.add_computed_column( + thumbnail=pxt_image.thumbnail(t.image, size=(256, 256)), + destination='pxtfs://org:db/home/thumbnails/', +) +``` + +```bash +# Or set globally +export PIXELTABLE_API_KEY="pxt_..." +export PIXELTABLE_OUTPUT_MEDIA_DEST="pxtfs://org:db/home/" +``` + +See [Cloud Storage docs](https://docs.pixeltable.com/integrations/cloud-storage). 
+ +## Common Pitfalls + +### Deprecated/Wrong Imports + +```python +# WRONG — openai.vision does not exist +from pixeltable.functions.openai import vision +description = vision(prompt='Describe', image=t.image) + +# CORRECT — use chat_completions with multimodal messages +from pixeltable.functions.openai import chat_completions +description = chat_completions( + messages=[{ + 'role': 'user', + 'content': [ + {'type': 'text', 'text': 'Describe this image.'}, + {'type': 'image_url', 'image_url': {'url': t.image}} + ] + }], + model='gpt-4o-mini' +).choices[0].message.content + +# WRONG — FrameIterator class import +from pixeltable.iterators import FrameIterator +pxt.create_view('v', t, iterator=FrameIterator.create(video=t.video, fps=1)) + +# CORRECT — function import +from pixeltable.functions.video import frame_iterator +pxt.create_view('v', t, iterator=frame_iterator(t.video, fps=1), if_exists='ignore') +``` + +### Cast to String Before Embedding + +AI functions often return `Json` or complex objects. Embedding indexes require `String` columns: + +```python +# WRONG — transcriptions returns a Json object, not a String +t.add_computed_column(transcript=openai.transcriptions(audio=t.audio, model='whisper-1'), if_exists='ignore') +t.add_embedding_index('transcript', embedding=embed_fn) # silently fails + +# CORRECT — extract .text and cast +t.add_computed_column( + transcript=openai.transcriptions(audio=t.audio, model='whisper-1').text.astype(pxt.String), + if_exists='ignore') +t.add_embedding_index('transcript', embedding=embed_fn, if_exists='ignore') +``` + +This applies to any computed column used as an embedding source — always ensure it evaluates to `pxt.String`. + +### The `if_exists='ignore'` Trap + +If you create a column with buggy logic, fixing the code and re-running does **NOT** update the column. 
`if_exists='ignore'` silently skips the already-existing (broken) column: + +```python +# Bug: wrong model name +t.add_computed_column(summary=openai.chat_completions(..., model='nonexistent'), if_exists='ignore') + +# Fixing the code and re-running does NOTHING — old column persists +t.add_computed_column(summary=openai.chat_completions(..., model='gpt-4o-mini'), if_exists='ignore') + +# FIX: drop the column first, then recreate +t.drop_column('summary') +t.add_computed_column(summary=openai.chat_completions(..., model='gpt-4o-mini'), if_exists='ignore') + +# OR: wipe the entire namespace during development +pxt.drop_dir('my_project', force=True) +``` + +### Other Pitfalls + +```python +# Image in messages: use image_url, never raw pxt.Image +messages=[{'role': 'user', 'content': [ + {'type': 'text', 'text': 'Describe.'}, + {'type': 'image_url', 'image_url': {'url': t.image}} # NOT {'type': 'image', 'data': t.image} +]}] + +# Similarity: always use string= keyword +sim = t.content.similarity(string=query_text) # NOT .similarity(query_text) +``` + +Schema corruption (`IntegrityError`): upgrade and reset the local store with `pip install -U pixeltable && rm -rf ~/.pixeltable`. This removes the default local Pixeltable home directory, including its stored tables and media, so use it only when that data is disposable. + +### `@pxt.query` Eager Compilation + +`@pxt.query` compiles the function body at **decoration time** by calling it with expression placeholders. 
This means: + +```python +# WRONG — .collect() executes during decoration, not at call time +@pxt.query +def find_similar(ref_id: str): + ref = t.where(t.uuid == ref_id).select(t.embedding).collect() # FAILS at decoration + return t.order_by(t.embedding.similarity(ref[0]['embedding'])).limit(5) + +# CORRECT — use a plain def for imperative logic that needs .collect() +def find_similar(ref_id: str) -> list[dict]: + ref = t.where(t.uuid == ref_id).select(t.embedding).collect() + return list(t.order_by(t.embedding.similarity(ref[0]['embedding'])).limit(5).collect()) + +# WRONG — references a table that may not exist yet +@pxt.query +def search(): + t = pxt.get_table('maybe.missing') # FAILS if table doesn't exist at decoration time + return t.select(t.col) +``` + +### Nullable Primary Keys + +Primary key columns must be non-nullable. Bare `pxt.String` is nullable by default: + +```python +# WRONG — nullable PK rejected at table creation +t = pxt.create_table('dir.items', { + 'id': pxt.String, # nullable! +}, primary_key=['id']) + +# CORRECT — explicit non-nullable +t = pxt.create_table('dir.items', { + 'id': pxt.Required[pxt.String], +}, primary_key=['id']) + +# CORRECT — uuid7() computed default (recommended) +from pixeltable.functions.uuid import uuid7 +t = pxt.create_table('dir.items', { + 'content': pxt.String, + 'uuid': uuid7(), +}, primary_key=['uuid']) +``` + +### Thread-Safety in FastAPI + +`Table` objects are bound to the thread that created them. 
In FastAPI (which dispatches sync endpoints to a thread pool), call `pxt.get_table()` inside each endpoint: + +```python +# WRONG — module-level Table used across threads +docs = pxt.get_table('app.documents') + +@app.get('/count') +def count(): + return {'count': docs.count()} # fails: wrong thread + +# CORRECT — get a fresh handle per request +@app.get('/count') +def count(): + docs = pxt.get_table('app.documents') + return {'count': docs.count()} +``` + +### `document_splitter` with `token_limit` + +The `token_limit` separator requires the `tiktoken` package: + +```bash +pip install tiktoken +``` + +Without it, `document_splitter(t.doc, separators='token_limit', ...)` raises `RequestError: This feature requires the tiktoken package`. + +## Performance Tips + +- Batch inserts for efficiency +- Use `on_error='ignore'` to continue past row failures +- Use `batch_size` in `@pxt.udf(batch_size=32)` for GPU models +- Embedding indexes use HNSW for fast approximate nearest neighbor search +- Use `t.insert(source='file.csv')` instead of loading into memory for large datasets +- Use `keyframes_only=True` in `frame_iterator` for efficient video processing +- Use `thumbnail()` + `b64_encode()` for API-friendly image responses +- Configure rate limits in `config.toml` to avoid 429 errors on provider APIs +- Use `recompute_columns(where=t.col.errortype != None)` to retry only failed rows +- Use `add_btree_index()` on columns used frequently in `where()` filters +- Cast AI function outputs to `pxt.String` with `.astype(pxt.String)` before embedding indexing +- During development, use `pxt.drop_dir('dir', force=True)` to reset schema cleanly diff --git a/skills/pixeltable/references/ml-data-pipeline.md b/skills/pixeltable/references/ml-data-pipeline.md new file mode 100644 index 0000000..56ede38 --- /dev/null +++ b/skills/pixeltable/references/ml-data-pipeline.md @@ -0,0 +1,282 @@ +# ML Data Wrangling Pipeline + +A complete recipe for processing multimodal data (video, audio, 
images, documents) into training-ready datasets. Covers ingestion, enrichment with AI models, dataset versioning, and export to PyTorch, Parquet, and pandas. + +## Ingest Raw Data + +```python +import pixeltable as pxt +from pixeltable.functions.video import frame_iterator +from pixeltable.functions.openai import chat_completions +from pixeltable.functions.huggingface import clip, detr_for_object_detection +from pixeltable.functions import image as pxt_image + +pxt.create_dir('ml_data', if_exists='ignore') + +# From local files, URLs, or cloud storage (S3, GCS, Azure) +images = pxt.create_table('ml_data.images', { + 'image': pxt.Image, + 'filename': pxt.String, + 'split': pxt.String, # 'train', 'val', 'test' +}, if_exists='ignore') + +images.insert([ + {'image': 'path/to/cat_01.jpg', 'filename': 'cat_01.jpg', 'split': 'train'}, + {'image': 'path/to/dog_01.jpg', 'filename': 'dog_01.jpg', 'split': 'train'}, + {'image': 's3://bucket/images/bird.jpg', 'filename': 'bird.jpg', 'split': 'val'}, +]) + +# From CSV with schema overrides for media columns +labeled_data = pxt.create_table('ml_data.labeled', + source='annotations.csv', + schema_overrides={'image_path': pxt.Image}, + if_exists='ignore') + +# From Hugging Face datasets +from pixeltable.io import import_huggingface_dataset +import datasets +ds = datasets.load_dataset('cifar10', split='train[:500]') +cifar = import_huggingface_dataset('ml_data.cifar', ds, if_exists='ignore') +``` + +## Explore and Sample + +```python +# Quick look at the data +first_5 = images.head(5) +total = images.count() +train_count = images.where(images.split == 'train').count() + +# Random sample for exploration +sample = images.sample(n=10, seed=42).select(images.image, images.filename).collect() +``` + +## Enrich with AI Models + +```python +# Resize images for consistent training input (thumbnail preserves aspect ratio) +images.add_computed_column( + resized=pxt_image.thumbnail(images.image, size=(224, 224)), + if_exists='ignore') + +# 
Auto-classify with a vision LLM +images.add_computed_column( + label=chat_completions( + messages=[{ + 'role': 'user', + 'content': [ + {'type': 'text', 'text': 'Classify this image into exactly one word: cat, dog, bird, or other.'}, + {'type': 'image_url', 'image_url': {'url': images.image}} + ] + }], + model='gpt-4o-mini', + ).choices[0].message.content, + if_exists='ignore') + +# Object detection for bounding boxes +detect = detr_for_object_detection.using(model_id='facebook/detr-resnet-50') +images.add_computed_column( + detections=detect(images.image, threshold=0.8), + if_exists='ignore') + +# Visualize detections (draw bounding boxes on images) +from pixeltable.functions.image import draw_bounding_boxes +images.add_computed_column( + annotated=draw_bounding_boxes(images.image, images.detections), + if_exists='ignore') + +# Generate captions +images.add_computed_column( + caption=chat_completions( + messages=[{ + 'role': 'user', + 'content': [ + {'type': 'text', 'text': 'Describe this image in one sentence.'}, + {'type': 'image_url', 'image_url': {'url': images.image}} + ] + }], + model='gpt-4o-mini', + ).choices[0].message.content, + if_exists='ignore') + +# Add CLIP embeddings for similarity search and deduplication +embed_fn = clip.using(model_id='openai/clip-vit-base-patch32') +images.add_embedding_index('image', embedding=embed_fn, if_exists='ignore') +``` + +## Curate: Filter, Deduplicate, Quality Check + +```python +# Test on a small sample first (recommended workflow) +sample = images.limit(5).select(images.image, images.label, images.caption).collect() + +# Filter by label +cats = images.where(images.label == 'cat').select(images.image, images.caption).collect() + +# Find near-duplicates via similarity +sim = images.image.similarity(image='path/to/reference.jpg') +near_dupes = images.where(sim > 0.95).select(images.filename, sim).collect() + +# Review errors from computed columns +errors = images.where(images.label.errortype != None).select( + 
images.filename, images.label.errormsg).collect() + +# Recompute failed columns (critical for retrying after API errors) +images.recompute_columns(columns=['label'], where=images.label.errortype != None) +``` + +## Video Frame Extraction + +```python +videos = pxt.create_table('ml_data.videos', { + 'video': pxt.Video, + 'category': pxt.String, +}, if_exists='ignore') + +frames = pxt.create_view('ml_data.frames', videos, + iterator=frame_iterator(videos.video, fps=1.0), + if_exists='ignore') + +frames.add_computed_column( + resized=pxt_image.thumbnail(frames.frame, size=(224, 224)), + if_exists='ignore') +``` + +## Retrieval UDFs for Structured Data Lookup + +```python +# Create a lookup function for enrichment across tables +products = pxt.create_table('ml_data.products', { + 'sku': pxt.String, + 'name': pxt.String, + 'category': pxt.String, +}, if_exists='ignore') + +get_product = pxt.retrieval_udf( + products, + name='get_product', + description='Look up a product by SKU', + parameters=['sku'], + limit=1, +) + +# Use as a computed column for cross-table enrichment +# orders.add_computed_column(product_info=get_product(sku=orders.product_sku), if_exists='ignore') +``` + +## Version with Snapshots + +```python +# Take a point-in-time snapshot before exporting +snap_v1 = pxt.create_snapshot('ml_data.images_v1', images, if_exists='ignore') + +# Later, take another snapshot after adding more data +# snap_v2 = pxt.create_snapshot('ml_data.images_v2', images, if_exists='ignore') + +# Query any snapshot like a regular table +snap_v1.select(snap_v1.filename, snap_v1.label).limit(5).collect() +``` + +## Export for Training + +```python +# To PyTorch Dataset (recommended for training loops) +train_query = images.where(images.split == 'train').select( + images.resized, images.label) + +pytorch_ds = train_query.to_pytorch_dataset(image_format='pt') + +from torch.utils.data import DataLoader +dataloader = DataLoader(pytorch_ds, batch_size=32, num_workers=4) + +# Iterate in a 
training loop +for batch in dataloader: + imgs, labels = batch # imgs: (32, 3, 224, 224) tensor + # ... training step ... + break + +# To Parquet (for Spark, DuckDB, or cross-platform sharing) +from pixeltable.io import export_parquet + +export_parquet( + images.where(images.split == 'train').select( + images.filename, images.label, images.caption), + 'output/train/') + +export_parquet( + images.where(images.split == 'val').select( + images.filename, images.label, images.caption), + 'output/val/') + +# To pandas (for quick analysis or CSV export) +df = images.select( + images.filename, images.label, images.caption +).collect().to_pandas() +df.to_csv('output/annotations.csv', index=False) +``` + +## Key Patterns + +### Test Before Deploying + +Always test transformations on a small sample before committing: + +```python +# 1. Test the expression inline +result = images.limit(5).select( + images.image, label=chat_completions(...).choices[0].message.content +).collect() + +# 2. Review results, then deploy as a computed column +images.add_computed_column(label=chat_completions(...).choices[0].message.content, if_exists='ignore') +``` + +### Error Handling and Recomputation + +```python +# Insert with error tolerance +status = images.insert(rows, on_error='ignore') +print(f'Inserted: {status.num_rows}, Errors: {status.num_excs}') + +# Find and inspect failed rows +errors = images.where(images.label.errortype != None).select( + images.filename, images.label.errormsg).collect() + +# Retry failed computations (e.g., after fixing rate limits) +images.recompute_columns(columns=['label'], where=images.label.errortype != None) +``` + +### PyTorch Dataset Options + +| Parameter | Values | Description | +|-----------|--------|-------------| +| `image_format` | `'pt'` | CxHxW float tensors in [0, 1] | +| `image_format` | `'np'` | HxWxC uint8 arrays in [0, 255] | + +Data is cached to disk for efficient repeated loading. Use `num_workers > 0` in DataLoader for parallel loading. 
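To make the two `image_format` layouts in the table concrete, here is a small dependency-free sketch that converts an HxWxC image with values in [0, 255] into CxHxW floats in [0, 1]. It assumes the `'pt'` layout is simply the `'np'` layout transposed and rescaled by 255; the table suggests this but does not state it, and `hwc_uint8_to_chw_float` is a hypothetical helper, not part of Pixeltable or PyTorch.

```python
def hwc_uint8_to_chw_float(pixels):
    """Convert nested lists shaped [H][W][C] (0-255) to [C][H][W] floats in [0, 1]."""
    height = len(pixels)
    width = len(pixels[0])
    channels = len(pixels[0][0])
    return [
        [[pixels[y][x][ch] / 255.0 for x in range(width)] for y in range(height)]
        for ch in range(channels)
    ]


# A 1x2 RGB image: one pure-red pixel followed by one pure-green pixel.
tiny_image = [[[255, 0, 0], [0, 255, 0]]]
chw = hwc_uint8_to_chw_float(tiny_image)
```

Here `chw` has three channel planes of shape 1x2, which is the layout a training loop would see for each image in a `'pt'` batch under the stated assumption.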
+ +## Building Blocks + +| Step | Function | Purpose | +|------|----------|---------| +| Ingest | `create_table(source='file.csv')` | Load from CSV, Parquet, URLs, S3 | +| Ingest | `import_huggingface_dataset()` | Load from Hugging Face Hub | +| Explore | `t.head(5)`, `t.count()`, `t.sample(n)` | Quick data inspection | +| Enrich | `add_computed_column(label=...)` | Auto-label with AI models | +| Enrich | `detr_for_object_detection()` | Bounding box detection | +| Visualize | `draw_bounding_boxes(image, detections)` | Overlay detections on images | +| Search | `add_embedding_index()` + `.similarity()` | Find similar / deduplicate | +| Curate | `.where(col.errortype != None)` | Review failed transformations | +| Retry | `recompute_columns(columns=[...], where=...)` | Re-run failed computations | +| Version | `create_snapshot('name', table)` | Point-in-time dataset copy | +| Export | `to_pytorch_dataset(image_format='pt')` | PyTorch DataLoader-ready | +| Export | `export_parquet(query, 'path/')` | Parquet files for sharing | +| Export | `.collect().to_pandas()` | pandas DataFrame | +| Lookup | `pxt.retrieval_udf(table, ...)` | Structured data enrichment | + +## Adapting This Recipe + +- **Audio data**: Use `audio_splitter` and `transcriptions` to create labeled audio datasets — see [workflows.md → Audio Transcription](workflows.md#audio-transcription-and-analysis) +- **Document data**: Use `document_splitter` to chunk PDFs into training examples — see [workflows.md → RAG Pipeline](workflows.md#rag-pipeline) +- **Add human labels**: Export to Label Studio, annotate, then re-import +- **Multi-GPU training**: The PyTorch dataset supports `DistributedSampler` with standard PyTorch patterns diff --git a/skills/pixeltable/references/providers.md b/skills/pixeltable/references/providers.md new file mode 100644 index 0000000..f889b5c --- /dev/null +++ b/skills/pixeltable/references/providers.md @@ -0,0 +1,591 @@ +# Pixeltable AI Provider Reference + +Complete examples for 
all 25+ built-in AI provider integrations. All functions live in `pixeltable.functions.*`. + +## Quick Reference + +Use this table to find the correct import, function, and output accessor for each provider: + +| Provider | Import | Function | Extract answer | +|----------|--------|----------|----------------| +| OpenAI | `from pixeltable.functions.openai import chat_completions` | `chat_completions(messages=..., model='gpt-4o-mini')` | `.choices[0].message.content` | +| OpenAI Embeddings | `from pixeltable.functions.openai import embeddings` | `embeddings(input=..., model='text-embedding-3-small')` | `.data[0].embedding` | +| OpenAI TTS | `from pixeltable.functions.openai import speech` | `speech(input=..., model='tts-1', voice='alloy')` | *(returns Audio directly)* | +| OpenAI Transcription | `from pixeltable.functions.openai import transcriptions` | `transcriptions(audio=..., model='whisper-1')` | `.text` | +| OpenAI DALL-E | `from pixeltable.functions.openai import image_generations` | `image_generations(prompt=..., model='dall-e-3')` | `.data[0].url` | +| Anthropic | `from pixeltable.functions.anthropic import messages` | `messages(messages=..., model='claude-sonnet-4-20250514', max_tokens=1024)` | `.content[0].text` | +| Gemini | `from pixeltable.functions.gemini import generate_content, embed_content` | `generate_content(contents=..., model='gemini-2.0-flash')` | *(returns text directly)* | +| Together | `from pixeltable.functions.together import chat_completions` | `chat_completions(messages=..., model='meta-llama/...')` | `.choices[0].message.content` | +| Fireworks | `from pixeltable.functions.fireworks import chat_completions` | `chat_completions(messages=..., model='accounts/fireworks/...')` | `.choices[0].message.content` | +| Ollama | `from pixeltable.functions.ollama import chat_completions` | `chat_completions(messages=..., model='llama3.1')` | `.choices[0].message.content` | +| Mistral | `from pixeltable.functions.mistralai import chat_completions` 
| `chat_completions(messages=..., model='mistral-large-latest')` | `.choices[0].message.content` | +| Groq | `from pixeltable.functions.groq import chat_completions` | `chat_completions(messages=..., model='llama-3.1-70b-versatile')` | `.choices[0].message.content` | +| DeepSeek | `from pixeltable.functions.deepseek import chat_completions` | `chat_completions(messages=..., model='deepseek-chat')` | `.choices[0].message.content` | +| OpenRouter | `from pixeltable.functions.openrouter import chat_completions` | `chat_completions(messages=..., model='anthropic/claude-sonnet-4-20250514')` | `.choices[0].message.content` | +| Hugging Face CLIP | `from pixeltable.functions.huggingface import clip` | `clip.using(model_id='openai/clip-vit-base-patch32')` | *(use as embedding index)* | +| Hugging Face ST | `from pixeltable.functions.huggingface import sentence_transformer` | `sentence_transformer.using(model_id='all-MiniLM-L6-v2')` | *(use as embedding index)* | +| Whisper (Local) | `from pixeltable.functions.whisper import transcribe` | `transcribe(audio=..., model='base')` | *(returns text directly)* | +| WhisperX (Local) | `from pixeltable.functions.whisperx import transcribe` | `transcribe(audio=..., model='large-v2', diarize=True)` | *(returns JSON with segments)* | +| Voyage AI | `from pixeltable.functions.voyageai import embed` | `embed(input=..., model='voyage-2')` | *(returns embedding directly)* | +| Jina AI | `from pixeltable.functions.jina import embeddings` | `embeddings(text=..., model='jina-embeddings-v3')` | *(use as embedding index)* | +| Twelve Labs | `from pixeltable.functions.twelvelabs import embed` | `embed(video_segment=..., model_name='marengo3.0')` | *(use as video embedding index)* | +| BFL FLUX | `from pixeltable.functions.bfl import generate` | `generate(prompt=..., width=1024, height=1024)` | *(returns Image directly)* | +| RunwayML | `from pixeltable.functions.runwayml import text_to_video` | `text_to_video(prompt=..., model='gen4.5')` | 
`['output'][0]` cast to `pxt.Video` | +| fal.ai | `from pixeltable.functions.fal import run` | `run(input=json, app='fal-ai/flux/schnell')` | *(returns JSON)* | +| Reve | `from pixeltable.functions.reve import create` | `create(prompt=...)` | *(returns Image directly)* | +| Fabric | `from pixeltable.functions.fabric import chat_completions` | `chat_completions(messages=..., model='gpt-4.1')` | `.choices[0].message.content` | +| llama.cpp | `from pixeltable.functions.llama_cpp import create_chat_completion` | `create_chat_completion(messages=..., repo_id='...', repo_filename='*q5_k_m.gguf')` | `.choices[0].message.content` | +| YOLOX | `from pixeltable.functions.yolox import yolox` | `yolox(image=...)` | *(returns detection JSON)* | +| Replicate | `from pixeltable.functions.replicate import run` | `run(input=json, model='owner/model')` | *(returns JSON)* | +| Bedrock | `from pixeltable.functions.bedrock import converse` | `converse(messages=..., model='...')` | `.output.message.content[0].text` | + +**Key patterns**: OpenAI-compatible providers (Together, Fireworks, Ollama, Mistral, Groq, DeepSeek, OpenRouter, Fabric) all return `.choices[0].message.content`. Anthropic returns `.content[0].text`. Embedding functions are used with `add_embedding_index()`, not accessed directly. Image generation functions (BFL, Reve) return `pxt.Image` directly. 
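The accessor patterns above can be illustrated on plain dictionaries shaped like the JSON each provider family returns. This is a sketch for orientation only: in Pixeltable the accessors are applied to column expressions, and `extract_text` and the sample payloads are hypothetical.

```python
def extract_text(provider, response):
    """Pull the answer text out of a response dict, by provider family (illustrative)."""
    if provider == 'anthropic':
        # Mirrors .content[0].text
        return response['content'][0]['text']
    if provider == 'bedrock':
        # Mirrors .output.message.content[0].text
        return response['output']['message']['content'][0]['text']
    # OpenAI-compatible providers: mirrors .choices[0].message.content
    return response['choices'][0]['message']['content']


# Made-up payloads with the same nesting as the accessors in the table.
openai_like = {'choices': [{'message': {'content': 'hello'}}]}
anthropic_like = {'content': [{'text': 'hello'}]}
bedrock_like = {'output': {'message': {'content': [{'text': 'hello'}]}}}
```

Keeping the OpenAI-compatible shape as the fallback branch reflects how many providers in the table share it.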
+ +--- + +## Full Examples + +## OpenAI + +### Chat Completions + +```python +from pixeltable.functions.openai import chat_completions + +# Basic +t.add_computed_column( + response=chat_completions( + messages=[{'role': 'user', 'content': t.prompt}], + model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore', +) + +# With system message +t.add_computed_column( + response=chat_completions( + messages=[ + {'role': 'system', 'content': 'You are a helpful assistant.'}, + {'role': 'user', 'content': t.prompt} + ], + model='gpt-4o', + max_tokens=1000, + temperature=0.7 + ).choices[0].message.content, + if_exists='ignore', +) + +# Vision (image analysis) +t.add_computed_column( + description=chat_completions( + messages=[{ + 'role': 'user', + 'content': [ + {'type': 'text', 'text': 'Describe this image.'}, + {'type': 'image_url', 'image_url': {'url': t.image}} + ] + }], + model='gpt-4o' + ).choices[0].message.content, + if_exists='ignore', +) + +# JSON mode +t.add_computed_column( + structured=chat_completions( + messages=[{'role': 'user', 'content': t.text}], + model='gpt-4o-mini', + response_format={'type': 'json_object'} + ).choices[0].message.content, + if_exists='ignore', +) +``` + +### Embeddings + +```python +from pixeltable.functions.openai import embeddings + +t.add_computed_column( + embed=embeddings(input=t.text, model='text-embedding-3-small').data[0].embedding, + if_exists='ignore', +) + +# As index +t.add_embedding_index('text', embedding=embeddings.using(model='text-embedding-3-small'), if_exists='ignore') +``` + +### Image Generation (DALL-E) + +```python +from pixeltable.functions.openai import image_generations + +t.add_computed_column( + generated=image_generations(prompt=t.description, model='dall-e-3', size='1024x1024').data[0].url, + if_exists='ignore', +) +``` + +### Speech (TTS) + +```python +from pixeltable.functions.openai import speech + +t.add_computed_column(audio=speech(input=t.text, model='tts-1', voice='alloy'), 
if_exists='ignore') +``` + +### Transcription + +```python +from pixeltable.functions.openai import transcriptions + +t.add_computed_column(transcript=transcriptions(audio=t.audio_file, model='whisper-1').text, if_exists='ignore') +``` + +## Anthropic + +```python +from pixeltable.functions.anthropic import messages + +# Basic +t.add_computed_column( + response=messages( + messages=[{'role': 'user', 'content': [{'type': 'text', 'text': t.prompt}]}], + model='claude-sonnet-4-20250514', + max_tokens=1024 + ).content[0].text, + if_exists='ignore', +) + +# With system prompt +t.add_computed_column( + response=messages( + messages=[{'role': 'user', 'content': [{'type': 'text', 'text': t.prompt}]}], + model='claude-sonnet-4-20250514', + system='You are an expert analyst.', + max_tokens=2048 + ).content[0].text, + if_exists='ignore', +) + +# With tool calling +from pixeltable.functions.anthropic import messages, invoke_tools + +tools = pxt.tools(search_fn, lookup_fn) +t.add_computed_column( + response=messages( + messages=[{'role': 'user', 'content': [{'type': 'text', 'text': t.prompt}]}], + model='claude-sonnet-4-20250514', + tools=tools, + tool_choice=tools.choice(required=True), + max_tokens=1024, + ), + if_exists='ignore', +) +t.add_computed_column( + tool_results=invoke_tools(tools, t.response), + if_exists='ignore', +) +``` + +## Google Gemini + +```python +from pixeltable.functions.gemini import generate_content, embed_content + +# Text generation +t.add_computed_column(response=generate_content(contents=t.prompt, model='gemini-2.0-flash'), if_exists='ignore') + +# Embeddings (for add_embedding_index) +t.add_embedding_index( + 'text', + string_embed=embed_content.using(model='gemini-embedding-2-preview'), +) + +# Multimodal: pass images alongside text +t.add_computed_column( + vision=generate_content(contents=[t.image, t.prompt], model='gemini-2.0-flash'), + if_exists='ignore', +) +``` + +## Together AI + +```python +from pixeltable.functions.together import 
chat_completions + +t.add_computed_column( + response=chat_completions( + messages=[{'role': 'user', 'content': t.prompt}], + model='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo' + ).choices[0].message.content, + if_exists='ignore', +) +``` + +## Fireworks + +```python +from pixeltable.functions.fireworks import chat_completions + +t.add_computed_column( + response=chat_completions( + messages=[{'role': 'user', 'content': t.prompt}], + model='accounts/fireworks/models/llama-v3p1-70b-instruct' + ).choices[0].message.content, + if_exists='ignore', +) +``` + +## Ollama (Local) + +```python +from pixeltable.functions.ollama import chat_completions, embeddings + +# Chat +t.add_computed_column( + response=chat_completions( + messages=[{'role': 'user', 'content': t.prompt}], + model='llama3.1' + ).choices[0].message.content, + if_exists='ignore', +) + +# Embeddings +t.add_computed_column(embed=embeddings(input=t.text, model='nomic-embed-text'), if_exists='ignore') +``` + +## Mistral AI + +```python +from pixeltable.functions.mistralai import chat_completions + +t.add_computed_column( + response=chat_completions( + messages=[{'role': 'user', 'content': t.prompt}], + model='mistral-large-latest' + ).choices[0].message.content, + if_exists='ignore', +) +``` + +## Groq + +```python +from pixeltable.functions.groq import chat_completions + +t.add_computed_column( + response=chat_completions( + messages=[{'role': 'user', 'content': t.prompt}], + model='llama-3.1-70b-versatile' + ).choices[0].message.content, + if_exists='ignore', +) +``` + +## DeepSeek + +```python +from pixeltable.functions.deepseek import chat_completions + +t.add_computed_column( + response=chat_completions( + messages=[{'role': 'user', 'content': t.prompt}], + model='deepseek-chat' + ).choices[0].message.content, + if_exists='ignore', +) +``` + +## OpenRouter + +```python +from pixeltable.functions.openrouter import chat_completions + +t.add_computed_column( + response=chat_completions( + messages=[{'role': 
'user', 'content': t.prompt}], + model='anthropic/claude-sonnet-4-20250514' + ).choices[0].message.content, + if_exists='ignore', +) +``` + +## Hugging Face + +### CLIP (Multimodal Embeddings) + +```python +from pixeltable.functions.huggingface import clip + +embed_fn = clip.using(model_id='openai/clip-vit-base-patch32') +t.add_embedding_index('image', embedding=embed_fn, if_exists='ignore') + +sim = t.image.similarity(string='a photo of a dog') +results = t.order_by(sim, asc=False).limit(5).select(t.image, sim).collect() +``` + +### Sentence Transformers + +```python +from pixeltable.functions.huggingface import sentence_transformer + +embed_fn = sentence_transformer.using(model_id='all-MiniLM-L6-v2') +t.add_embedding_index('text', embedding=embed_fn, if_exists='ignore') + +# For multilingual / high-quality (recommended for production) +embed_fn = sentence_transformer.using(model_id='intfloat/multilingual-e5-large-instruct') +t.add_embedding_index('text', string_embed=embed_fn, if_exists='ignore') +``` + +### Object Detection (DETR) + +```python +from pixeltable.functions.huggingface import detr_for_object_detection + +detect = detr_for_object_detection.using(model_id='facebook/detr-resnet-50') +t.add_computed_column(detections=detect(t.image, threshold=0.8), if_exists='ignore') +``` + +## Whisper (Local) + +```python +from pixeltable.functions.whisper import transcribe + +t.add_computed_column(transcript=transcribe(audio=t.audio, model='base'), if_exists='ignore') +``` + +## Voyage AI + +```python +from pixeltable.functions.voyageai import embed + +t.add_computed_column(embed=embed(input=t.text, model='voyage-2'), if_exists='ignore') +``` + +## WhisperX (Local) + +Enhanced local transcription with word-level timestamps and speaker diarization. 
+ +```python +from pixeltable.functions.whisperx import transcribe + +# Basic transcription +t.add_computed_column( + transcript=transcribe(audio=t.audio, model='large-v2'), + if_exists='ignore') + +# With speaker diarization (requires HF_TOKEN for pyannote) +t.add_computed_column( + transcript=transcribe(audio=t.audio, model='large-v2', diarize=True), + if_exists='ignore') +``` + +## Jina AI + +Embeddings and reranking for search pipelines. + +```python +from pixeltable.functions.jina import embeddings, rerank + +# Embeddings (multilingual, 89+ languages) +t.add_embedding_index('text', + embedding=embeddings.using(model='jina-embeddings-v3', task='retrieval.passage'), + if_exists='ignore') + +# Reranking search results +t.add_computed_column( + ranked=rerank( + query=t.query, + documents=t.candidates, + model='jina-reranker-v2-base-multilingual', + top_n=3, + return_documents=True, + ), if_exists='ignore') +``` + +## Twelve Labs + +Video understanding via multimodal embeddings. + +```python +from pixeltable.functions.twelvelabs import embed + +# Add video embedding index for semantic video search +t.add_embedding_index('video', + embedding=embed.using(model_name='marengo3.0'), + if_exists='ignore') + +# Search videos by text query +sim = t.video.similarity(string='person giving a presentation') +results = t.order_by(sim, asc=False).limit(5).select(t.video, sim).collect() +``` + +## BFL FLUX + +Image generation and editing with Black Forest Labs FLUX models. 
+ +```python +from pixeltable.functions.bfl import generate, edit, expand, fill + +# Text-to-image generation +t.add_computed_column( + image=generate(prompt=t.description, width=1024, height=1024), + if_exists='ignore') + +# Edit an existing image +t.add_computed_column( + edited=edit(image=t.image, prompt='Make the sky more dramatic'), + if_exists='ignore') + +# Expand image canvas (outpainting) +t.add_computed_column( + expanded=expand(image=t.image, prompt='Extend the landscape', top=200, right=200), + if_exists='ignore') + +# Inpaint masked region +t.add_computed_column( + filled=fill(image=t.image, mask=t.mask, prompt='A wooden bench'), + if_exists='ignore') +``` + +## RunwayML + +AI video generation and transformation. + +```python +from pixeltable.functions.runwayml import text_to_video, image_to_video + +# Generate video from text +t.add_computed_column( + video=text_to_video( + prompt=t.description, model='gen4.5', ratio='1280:720', duration=5, + ).astype(pxt.Video), + if_exists='ignore') + +# Animate an image into a video +t.add_computed_column( + video=image_to_video( + prompt=t.description, image=t.image, model='gen4.5', ratio='1280:720', + ).astype(pxt.Video), + if_exists='ignore') +``` + +## fal.ai + +Run any model on fal.ai's inference platform. + +```python +from pixeltable.functions.fal import run + +# Image generation with FLUX Schnell +t.add_computed_column( + result=run( + input={'prompt': t.description, 'image_size': 'landscape_16_9'}, + app='fal-ai/flux/schnell', + ), if_exists='ignore') +``` + +## Reve + +Image generation, editing, and remixing. 
+ +```python +from pixeltable.functions.reve import create, edit, remix + +# Text-to-image +t.add_computed_column( + image=create(prompt=t.description), + if_exists='ignore') + +# Edit an existing image +t.add_computed_column( + edited=edit(image=t.image, edit_instruction='Make it look like a watercolor painting'), + if_exists='ignore') +``` + +## Microsoft Fabric + +Azure OpenAI models via Microsoft Fabric notebooks (no API key needed in Fabric environment). + +```python +from pixeltable.functions.fabric import chat_completions, embeddings + +# Chat +t.add_computed_column( + response=chat_completions( + messages=[{'role': 'user', 'content': t.prompt}], + model='gpt-4.1', + ).choices[0].message.content, + if_exists='ignore') + +# Embeddings +t.add_embedding_index('text', + embedding=embeddings.using(model='text-embedding-3-small'), + if_exists='ignore') +``` + +## llama.cpp + +Run local GGUF models via llama.cpp (auto-downloaded from Hugging Face). + +```python +from pixeltable.functions.llama_cpp import create_chat_completion + +t.add_computed_column( + response=create_chat_completion( + messages=[{'role': 'user', 'content': t.prompt}], + repo_id='Qwen/Qwen2.5-0.5B-Instruct-GGUF', + repo_filename='*q5_k_m.gguf', + ), if_exists='ignore') +``` + +## Replicate + +Run any model on Replicate's cloud platform. + +```python +from pixeltable.functions.replicate import run + +t.add_computed_column( + result=run(input={'prompt': t.description}, model='stability-ai/sdxl'), + if_exists='ignore') +``` + +## Bedrock + +AWS Bedrock models. 
+ +```python +from pixeltable.functions.bedrock import converse, invoke_tools + +# Chat +t.add_computed_column( + response=converse( + messages=[{'role': 'user', 'content': [{'text': t.prompt}]}], + model='anthropic.claude-sonnet-4-20250514-v1:0', + ).output.message.content[0].text, + if_exists='ignore') + +# Tool calling +tools = pxt.tools(search_fn, lookup_fn) +t.add_computed_column( + response=converse( + messages=[{'role': 'user', 'content': [{'text': t.prompt}]}], + model='anthropic.claude-sonnet-4-20250514-v1:0', + tools=tools, + ), if_exists='ignore') +t.add_computed_column( + tool_results=invoke_tools(tools, t.response), + if_exists='ignore') +``` + +## YOLOX + +Local object detection. + +```python +from pixeltable.functions.yolox import yolox + +t.add_computed_column(detections=yolox(t.image), if_exists='ignore') +``` diff --git a/skills/pixeltable/references/video-rag-agents.md b/skills/pixeltable/references/video-rag-agents.md new file mode 100644 index 0000000..d751730 --- /dev/null +++ b/skills/pixeltable/references/video-rag-agents.md @@ -0,0 +1,251 @@ +# Video RAG Agent + +A complete recipe that combines video processing, document/transcript retrieval, and a tool-calling agent into one pipeline. Insert a video and a question — the agent automatically searches frames, transcripts, and documents to answer it. + +## Full Pipeline + +```python +import pixeltable as pxt +from pixeltable.functions.video import frame_iterator, extract_audio +from pixeltable.functions.audio import audio_splitter +from pixeltable.functions.string import string_splitter +from pixeltable.functions.openai import chat_completions, transcriptions +from pixeltable.functions.huggingface import clip, sentence_transformer +from pixeltable.functions.anthropic import messages, invoke_tools +from pixeltable.functions import image as pxt_image +from datetime import datetime + +pxt.create_dir('vrag', if_exists='ignore') + +# ── 1. 
Video ingestion ────────────────────────────────────────────── + +videos = pxt.create_table('vrag.videos', { + 'video': pxt.Video, + 'title': pxt.String, +}, if_exists='ignore') + +# ── 2. Keyframe extraction + CLIP visual search ───────────────────── + +frames = pxt.create_view('vrag.frames', videos, + iterator=frame_iterator(videos.video, keyframes_only=True), + if_exists='ignore') + +frames.add_computed_column( + thumbnail=pxt_image.b64_encode( + pxt_image.thumbnail(frames.frame, size=(320, 320))), + if_exists='ignore') + +frames.add_embedding_index('frame', + embedding=clip.using(model_id='openai/clip-vit-base-patch32'), + if_exists='ignore') + +# Describe each frame with a vision LLM +frames.add_computed_column( + description=chat_completions( + messages=[{ + 'role': 'user', + 'content': [ + {'type': 'text', 'text': 'Describe this video frame in one sentence.'}, + {'type': 'image_url', 'image_url': {'url': frames.frame}} + ] + }], + model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore') + +# ── 3. Audio extraction → transcription → sentence embedding ──────── + +videos.add_computed_column( + audio=extract_audio(videos.video, format='mp3'), + if_exists='ignore') + +audio_chunks = pxt.create_view('vrag.audio_chunks', videos, + iterator=audio_splitter(audio=videos.audio, duration=30.0), + if_exists='ignore') + +audio_chunks.add_computed_column( + transcription=transcriptions( + audio=audio_chunks.audio_chunk, model='whisper-1'), + if_exists='ignore') + +sentences = pxt.create_view('vrag.sentences', + audio_chunks.where(audio_chunks.transcription != None), + iterator=string_splitter( + text=audio_chunks.transcription.text, separators='sentence'), + if_exists='ignore') + +embed_fn = sentence_transformer.using(model_id='all-MiniLM-L6-v2') +sentences.add_embedding_index('text', string_embed=embed_fn, if_exists='ignore') + +# ── 4. 
Query functions (become agent tools) ────────────────────────── + +@pxt.query +def search_video_frames(query_text: str): + """Search video frames by visual similarity using CLIP.""" + sim = frames.frame.similarity(string=query_text) + return frames.order_by(sim, asc=False).limit(10).select( + frames.description, frames.thumbnail, sim=sim) + +@pxt.query +def search_transcripts(query_text: str): + """Search video transcripts by semantic similarity.""" + sim = sentences.text.similarity(string=query_text) + return sentences.where(sim > 0.5).order_by(sim, asc=False).select( + sentences.text, sim=sim).limit(20) + +@pxt.udf +def web_search(keywords: str) -> str: + """Search the web for additional context.""" + from duckduckgo_search import DDGS + with DDGS() as ddgs: + results = list(ddgs.news(keywords=keywords, max_results=5)) + return '\n'.join( + f"{r['title']}: {r['body']}" for r in results + ) if results else 'No results.' + +# ── 5. Context assembly ───────────────────────────────────────────── + +@pxt.udf +def assemble_context( + question: str, + tool_outputs: list | None, + transcript_context: list | None, + frame_context: list | None, +) -> str: + parts = [f"QUESTION: {question}"] + + tool_str = str(tool_outputs) if tool_outputs else 'N/A' + parts.append(f"\n\n{tool_str}\n") + + if transcript_context: + transcript_str = '\n'.join( + f"- {item.get('text', '')}" + for item in transcript_context if isinstance(item, dict) + ) or 'N/A' + else: + transcript_str = 'N/A' + parts.append(f"\n\n{transcript_str}\n") + + if frame_context: + frame_str = '\n'.join( + f"- {item.get('description', '')}" + for item in frame_context if isinstance(item, dict) + ) or 'N/A' + else: + frame_str = 'N/A' + parts.append(f"\n\n{frame_str}\n") + + return '\n'.join(parts) + +# ── 6. 
Agent pipeline ─────────────────────────────────────────────── + +tools = pxt.tools(web_search, search_transcripts, search_video_frames) + +agent = pxt.create_table('vrag.agent', { + 'prompt': pxt.String, + 'timestamp': pxt.Timestamp, + 'system_prompt': pxt.String, + 'max_tokens': pxt.Int, + 'temperature': pxt.Float, +}, if_exists='ignore') + +# Step 1: Initial LLM call — tool selection +agent.add_computed_column( + initial_response=messages( + model='claude-sonnet-4-20250514', + messages=[{'role': 'user', 'content': [{'type': 'text', 'text': agent.prompt}]}], + tools=tools, + tool_choice=tools.choice(required=True), + max_tokens=agent.max_tokens, + model_kwargs={'system': agent.system_prompt, 'temperature': agent.temperature}, + ), if_exists='ignore') + +# Step 2: Execute the tools the LLM selected +agent.add_computed_column( + tool_output=invoke_tools(tools, agent.initial_response), + if_exists='ignore') + +# Step 3: RAG context from transcripts and frames +agent.add_computed_column( + transcript_context=search_transcripts(agent.prompt), + if_exists='ignore') + +agent.add_computed_column( + frame_context=search_video_frames(agent.prompt), + if_exists='ignore') + +# Step 4: Assemble all context +agent.add_computed_column( + context=assemble_context( + agent.prompt, agent.tool_output, + agent.transcript_context, agent.frame_context), + if_exists='ignore') + +# Step 5: Final LLM call with full context +agent.add_computed_column( + final_response=messages( + model='claude-sonnet-4-20250514', + messages=[{'role': 'user', 'content': [{'type': 'text', 'text': agent.context}]}], + max_tokens=agent.max_tokens, + model_kwargs={ + 'system': 'Answer based on the video transcripts, visual descriptions, and tool results. 
Cite timestamps when possible.', + 'temperature': agent.temperature, + }, + ), if_exists='ignore') + +# Step 6: Extract answer +agent.add_computed_column( + answer=agent.final_response.content[0].text, + if_exists='ignore') +``` + +## Usage + +```python +# Insert videos +videos.insert([ + {'video': 'lecture.mp4', 'title': 'ML Lecture'}, + {'video': 'https://example.com/demo.mp4', 'title': 'Product Demo'}, +]) + +# Ask a question — the full pipeline runs automatically +agent.insert([{ + 'prompt': 'What visual examples does the lecturer use to explain gradient descent?', + 'timestamp': datetime.now(), + 'system_prompt': 'Use search_video_frames for visual content and search_transcripts for spoken content.', + 'max_tokens': 1024, + 'temperature': 0.7, +}]) + +result = agent.order_by(agent.timestamp, asc=False).limit(1).select(agent.answer).collect() +``` + +## How It Works + +The pipeline is a chain of computed columns. Inserting a row into `agent` triggers these steps automatically: + +1. **Initial LLM call** — Claude selects which tools to call (transcript search, frame search, web search) +2. **Tool execution** — `invoke_tools()` runs the selected `@pxt.query` / `@pxt.udf` functions +3. **RAG retrieval** — Transcript and frame similarity searches run in parallel as computed columns +4. **Context assembly** — A UDF merges tool outputs, transcript excerpts, and visual descriptions +5. 
**Final LLM call** — Claude synthesizes everything into a grounded answer + +### Key building blocks + +| Concept | Function | Purpose | +|---------|----------|---------| +| `frame_iterator` | `pxt.create_view(..., iterator=frame_iterator(...))` | Extract video keyframes | +| `audio_splitter` | `pxt.create_view(..., iterator=audio_splitter(...))` | Split audio into chunks | +| `transcriptions` | `t.add_computed_column(transcription=transcriptions(...))` | Transcribe audio chunks | +| `string_splitter` | `pxt.create_view(..., iterator=string_splitter(...))` | Split transcript into sentences | +| `add_embedding_index` | `t.add_embedding_index('col', embedding=fn)` | Enable similarity search | +| `@pxt.query` | `def search_transcripts(query_text: str): ...` | Reusable retrieval + agent tool | +| `pxt.tools()` | `tools = pxt.tools(fn1, fn2)` | Bundle functions as LLM tools | +| `invoke_tools()` | `invoke_tools(tools, response)` | Execute the tools the LLM chose | + +## Adapting This Recipe + +- **Swap providers**: Replace `messages` (Anthropic) with `chat_completions` (OpenAI/Together/etc.) 
— see [providers.md](providers.md#quick-reference) for import and output shapes +- **Add document RAG**: Add a `document_splitter` view and a `search_documents` query function to the tools list +- **Use local models**: Replace OpenAI transcription with `whisper.transcribe()` and use `ollama.chat_completions` for the LLM — see [workflows.md → Local LLM Pipeline](workflows.md#local-llm-pipeline-ollama) +- **Serve via API**: Wrap the pipeline in a FastAPI endpoint — see [workflows.md → FastAPI App Pattern](workflows.md#fastapi-app-pattern) diff --git a/skills/pixeltable/references/workflows.md b/skills/pixeltable/references/workflows.md new file mode 100644 index 0000000..56ca79f --- /dev/null +++ b/skills/pixeltable/references/workflows.md @@ -0,0 +1,642 @@ +# Pixeltable End-to-End Workflow Templates + +Complete, production-ready workflow templates combining multiple Pixeltable features. + +## Contents + +- [RAG Pipeline](#rag-pipeline) +- [Video Analysis Pipeline](#video-analysis-pipeline) +- [Image Classification and Search](#image-classification-and-search) +- [Audio Transcription and Analysis](#audio-transcription-and-analysis) +- [Multi-Provider Comparison](#multi-provider-comparison) +- [Tool-Calling Agent (Full Production Example)](#tool-calling-agent-full-production-example) +- [Local LLM Pipeline (Ollama)](#local-llm-pipeline-ollama) +- [FastAPI App Pattern](#fastapi-app-pattern) (hand-written endpoints) +- [FastAPIRouter — Declarative Serving (v0.6+)](#fastapirouter-declarative-serving-v06) (preferred) +- [Batch Processing Pattern](#batch-processing-pattern) +- [Export Workflow](#export-workflow) + +--- + +### RAG Pipeline + +```python +import pixeltable as pxt +from pixeltable.functions.document import document_splitter +from pixeltable.functions.openai import chat_completions, embeddings + +pxt.create_dir('rag', if_exists='ignore') + +docs = pxt.create_table('rag.documents', { + 'doc': pxt.Document, + 'title': pxt.String, +}, if_exists='ignore') + +chunks = pxt.create_view('rag.chunks', docs, +
iterator=document_splitter(docs.doc, separators='token_limit', limit=300, metadata='title,heading'), + if_exists='ignore') + +chunks.add_embedding_index('text', + embedding=embeddings.using(model='text-embedding-3-small'), + if_exists='ignore') + +docs.insert([ + {'doc': 'path/to/document.pdf', 'title': 'My Document'}, + {'doc': 'https://example.com/page.html', 'title': 'Web Page'}, +]) + +@pxt.query +def retrieve(question: str, top_k: int = 5): + sim = chunks.text.similarity(string=question) + return chunks.order_by(sim, asc=False).limit(top_k).select(chunks.text, chunks.title, sim) + +context = retrieve('What is machine learning?').collect() +``` + +### Video Analysis Pipeline + +```python +import pixeltable as pxt +from pixeltable.functions.video import frame_iterator, extract_audio +from pixeltable.functions.audio import audio_splitter +from pixeltable.functions.string import string_splitter +from pixeltable.functions.openai import chat_completions, transcriptions +from pixeltable.functions.huggingface import clip, sentence_transformer +from pixeltable.functions import image as pxt_image + +pxt.create_dir('video', if_exists='ignore') + +videos = pxt.create_table('video.library', { + 'video': pxt.Video, 'title': pxt.String +}, if_exists='ignore') + +# 1. Keyframe extraction + CLIP visual search +frames = pxt.create_view('video.frames', videos, + iterator=frame_iterator(videos.video, keyframes_only=True), + if_exists='ignore') + +frames.add_computed_column( + thumbnail=pxt_image.b64_encode( + pxt_image.thumbnail(frames.frame, size=(320, 320))), + if_exists='ignore') + +frames.add_embedding_index('frame', + embedding=clip.using(model_id='openai/clip-vit-base-patch32'), + if_exists='ignore') + +# 2. 
Audio extraction -> transcription -> sentence embedding +videos.add_computed_column( + audio=extract_audio(videos.video, format='mp3'), + if_exists='ignore') + +audio_chunks = pxt.create_view('video.audio_chunks', videos, + iterator=audio_splitter(audio=videos.audio, duration=30.0), + if_exists='ignore') + +audio_chunks.add_computed_column( + transcription=transcriptions( + audio=audio_chunks.audio_chunk, model='whisper-1'), + if_exists='ignore') + +sentences = pxt.create_view('video.sentences', + audio_chunks.where(audio_chunks.transcription != None), + iterator=string_splitter( + text=audio_chunks.transcription.text, separators='sentence'), + if_exists='ignore') + +embed_fn = sentence_transformer.using(model_id='all-MiniLM-L6-v2') +sentences.add_embedding_index('text', string_embed=embed_fn, if_exists='ignore') + +# 3. Describe frames with vision LLM +frames.add_computed_column( + description=chat_completions( + messages=[{ + 'role': 'user', + 'content': [ + {'type': 'text', 'text': 'Describe this video frame in one sentence.'}, + {'type': 'image_url', 'image_url': {'url': frames.frame}} + ] + }], + model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore') + +# Visual search +sim = frames.frame.similarity(string='person riding a bicycle') +results = frames.order_by(sim, asc=False).limit(10).select( + frames.frame, frames.description, sim).collect() + +# Transcript search +@pxt.query +def search_transcripts(query_text: str): + sim = sentences.text.similarity(string=query_text) + return sentences.where(sim > 0.7).order_by(sim, asc=False).select( + sentences.text, sim=sim + ).limit(20) +``` + +### Image Classification and Search + +```python +import pixeltable as pxt +from pixeltable.functions.openai import chat_completions +from pixeltable.functions.huggingface import clip +from pixeltable.functions import image as pxt_image + +pxt.create_dir('images', if_exists='ignore') + +catalog = pxt.create_table('images.catalog', { + 'image': pxt.Image, 
'filename': pxt.String, +}, if_exists='ignore') + +catalog.add_computed_column( + thumbnail=pxt_image.b64_encode( + pxt_image.thumbnail(catalog.image, size=(320, 320))), + if_exists='ignore') + +catalog.add_computed_column( + tags=chat_completions( + messages=[{ + 'role': 'user', + 'content': [ + {'type': 'text', 'text': 'List 5 descriptive tags as a comma-separated list.'}, + {'type': 'image_url', 'image_url': {'url': catalog.image}} + ] + }], + model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore') + +embed_fn = clip.using(model_id='openai/clip-vit-base-patch32') +catalog.add_embedding_index('image', embedding=embed_fn, if_exists='ignore') + +sim = catalog.image.similarity(string='sunset over the ocean') +results = catalog.order_by(sim, asc=False).limit(5).select( + catalog.image, catalog.tags, sim).collect() +``` + +### Audio Transcription and Analysis + +```python +import pixeltable as pxt +from pixeltable.functions.openai import transcriptions, chat_completions + +pxt.create_dir('audio', if_exists='ignore') + +recordings = pxt.create_table('audio.recordings', { + 'audio': pxt.Audio, 'speaker': pxt.String, +}, if_exists='ignore') + +recordings.add_computed_column( + transcript=transcriptions(audio=recordings.audio, model='whisper-1').text, + if_exists='ignore') + +recordings.add_computed_column( + summary=chat_completions( + messages=[ + {'role': 'system', 'content': 'Summarize in 2-3 sentences.'}, + {'role': 'user', 'content': recordings.transcript} + ], + model='gpt-4o-mini' + ).choices[0].message.content, + if_exists='ignore') +``` + +### Multi-Provider Comparison + +```python +import pixeltable as pxt +from pixeltable.functions.openai import chat_completions as openai_chat +from pixeltable.functions.anthropic import messages as anthropic_msg +from pixeltable.functions.together import chat_completions as together_chat + +pxt.create_dir('compare', if_exists='ignore') +prompts = pxt.create_table('compare.prompts', {'prompt': pxt.String}, 
if_exists='ignore') + +prompts.add_computed_column( + openai=openai_chat( + messages=[{'role': 'user', 'content': prompts.prompt}], model='gpt-4o-mini' + ).choices[0].message.content, if_exists='ignore') + +prompts.add_computed_column( + anthropic=anthropic_msg( + messages=[{'role': 'user', 'content': [{'type': 'text', 'text': prompts.prompt}]}], + model='claude-sonnet-4-20250514', max_tokens=1024 + ).content[0].text, if_exists='ignore') + +prompts.add_computed_column( + llama=together_chat( + messages=[{'role': 'user', 'content': prompts.prompt}], + model='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo' + ).choices[0].message.content, if_exists='ignore') + +prompts.insert([{'prompt': 'Explain quantum computing simply.'}]) +results = prompts.select( + prompts.prompt, prompts.openai, prompts.anthropic, prompts.llama).collect() +``` + +### Tool-Calling Agent (Full Production Example) + +Complete agent pipeline as used in the [Pixeltable Starter Kit](https://github.com/pixeltable/pixeltable-starter-kit): + +```python +import pixeltable as pxt +from pixeltable.functions.anthropic import messages, invoke_tools +from pixeltable.functions.huggingface import sentence_transformer, clip +from pixeltable.functions.document import document_splitter +from pixeltable.functions import image as pxt_image +from datetime import datetime + +pxt.create_dir('app', if_exists='ignore') + +# --- Data pipelines --- +documents = pxt.create_table('app.documents', {'document': pxt.Document}, if_exists='ignore') +chunks = pxt.create_view('app.chunks', documents, + iterator=document_splitter(documents.document, + separators='page, sentence', metadata='title,heading,page'), + if_exists='ignore') + +embed_fn = sentence_transformer.using(model_id='intfloat/multilingual-e5-large-instruct') +chunks.add_embedding_index('text', string_embed=embed_fn, if_exists='ignore') + +images = pxt.create_table('app.images', {'image': pxt.Image}, if_exists='ignore') +images.add_computed_column( + 
thumbnail=pxt_image.b64_encode(pxt_image.thumbnail(images.image, size=(320, 320))), + if_exists='ignore') +images.add_embedding_index('image', + embedding=clip.using(model_id='openai/clip-vit-base-patch32'), if_exists='ignore') + +# --- Query functions (become tools + RAG context) --- +@pxt.query +def search_documents(query_text: str): + sim = chunks.text.similarity(string=query_text) + return chunks.where(sim > 0.5).order_by(sim, asc=False).select( + chunks.text, sim=sim).limit(20) + +@pxt.query +def search_images(query_text: str): + sim = images.image.similarity(string=query_text) + return images.where(sim > 0.25).order_by(sim, asc=False).select( + encoded_image=pxt_image.b64_encode( + pxt_image.thumbnail(images.image, size=(224, 224)), 'png'), + sim=sim).limit(5) + +@pxt.udf +def web_search(keywords: str) -> str: + """Search the web using DuckDuckGo.""" + from duckduckgo_search import DDGS + with DDGS() as ddgs: + results = list(ddgs.news(keywords=keywords, max_results=5)) + return '\n'.join( + f"{r['title']}: {r['body']}" for r in results + ) if results else 'No results.' 
+ +@pxt.udf +def assemble_context(question: str, tool_outputs: list | None, doc_context: list | None) -> str: + tool_str = str(tool_outputs) if tool_outputs else 'N/A' + doc_str = '\n'.join( + f"- {item.get('text', '')}" for item in (doc_context or []) if isinstance(item, dict) + ) or 'N/A' + return (f"QUESTION: {question}\n\n" + f"\n{tool_str}\n\n\n" + f"\n{doc_str}\n") + +# --- Agent pipeline --- +tools = pxt.tools(web_search, search_documents) + +agent = pxt.create_table('app.agent', { + 'prompt': pxt.String, + 'timestamp': pxt.Timestamp, + 'system_prompt': pxt.String, + 'max_tokens': pxt.Int, + 'temperature': pxt.Float, +}, if_exists='ignore') + +agent.add_computed_column( + initial_response=messages( + model='claude-sonnet-4-20250514', + messages=[{'role': 'user', 'content': agent.prompt}], + tools=tools, + tool_choice=tools.choice(required=True), + max_tokens=agent.max_tokens, + model_kwargs={'system': agent.system_prompt, 'temperature': agent.temperature}, + ), if_exists='ignore') + +agent.add_computed_column(tool_output=invoke_tools(tools, agent.initial_response), if_exists='ignore') +agent.add_computed_column(doc_context=search_documents(agent.prompt), if_exists='ignore') +agent.add_computed_column( + context=assemble_context(agent.prompt, agent.tool_output, agent.doc_context), + if_exists='ignore') + +agent.add_computed_column( + final_response=messages( + model='claude-sonnet-4-20250514', + messages=[{'role': 'user', 'content': agent.context}], + max_tokens=agent.max_tokens, + model_kwargs={'system': 'Answer based on context. 
Cite sources.', 'temperature': agent.temperature}, + ), if_exists='ignore') + +agent.add_computed_column(answer=agent.final_response.content[0].text, if_exists='ignore') + +# --- Usage --- +agent.insert([{ + 'prompt': 'What are the latest AI breakthroughs?', + 'timestamp': datetime.now(), + 'system_prompt': 'Use tools to gather information, then answer.', + 'max_tokens': 1024, + 'temperature': 0.7, +}]) +result = agent.order_by(agent.timestamp, asc=False).limit(1).select(agent.answer).collect() +``` + +### Local LLM Pipeline (Ollama) + +```python +import pixeltable as pxt +from pixeltable.functions.ollama import chat_completions, embeddings + +pxt.create_dir('local', if_exists='ignore') +t = pxt.create_table('local.data', {'text': pxt.String}, if_exists='ignore') + +t.add_computed_column( + analysis=chat_completions( + messages=[{'role': 'user', 'content': 'Analyze: ' + t.text}], + model='llama3.1' + ).choices[0].message.content, if_exists='ignore') + +t.add_embedding_index('text', + embedding=embeddings.using(model='nomic-embed-text'), + if_exists='ignore') + +t.insert([{'text': 'Machine learning fundamentals'}]) +sim = t.text.similarity(string='neural networks') +results = t.order_by(sim, asc=False).limit(5).select(t.text, sim).collect() +``` + +### FastAPI App Pattern + +Production-ready pattern for web apps with Pixeltable: + +```python +# setup_pixeltable.py -- Run once to initialize schema +import pixeltable as pxt +from pixeltable.functions.uuid import uuid7 +from pixeltable.functions.document import document_splitter +from pixeltable.functions.huggingface import sentence_transformer + +pxt.drop_dir('app', force=True) +pxt.create_dir('app', if_exists='ignore') + +documents = pxt.create_table('app.documents', { + 'document': pxt.Document, + 'uuid': uuid7(), + 'timestamp': pxt.Timestamp, +}, primary_key=['uuid'], if_exists='ignore') + +chunks = pxt.create_view('app.chunks', documents, + iterator=document_splitter( + documents.document, separators='page, 
sentence', + metadata='title,heading,page'), + if_exists='ignore') + +embed_fn = sentence_transformer.using( + model_id='intfloat/multilingual-e5-large-instruct') +chunks.add_embedding_index('text', string_embed=embed_fn, if_exists='ignore') + +@pxt.query +def search_documents(query_text: str): + sim = chunks.text.similarity(string=query_text) + return chunks.where(sim > 0.5).order_by(sim, asc=False).select( + chunks.text, sim=sim, title=chunks.title + ).limit(20) +``` + +```python +# main.py -- FastAPI app (use def, not async def) +from fastapi import FastAPI +from pydantic import BaseModel +import pixeltable as pxt + +app = FastAPI() + +class SearchRequest(BaseModel): + query: str + +class SearchResult(BaseModel): + text: str + sim: float + title: str | None = None + +class SearchResponse(BaseModel): + query: str + results: list[SearchResult] + +@app.post("/api/search", response_model=SearchResponse) +def search(body: SearchRequest): # sync, not async + table = pxt.get_table('app.chunks') + sim = table.text.similarity(string=body.query) + result = ( + table.where(sim > 0.3) + .order_by(sim, asc=False) + .select(text=table.text, sim=sim, title=table.title) + .limit(20) + .collect() + ) + items = list(result.to_pydantic(SearchResult)) # direct conversion + return SearchResponse(query=body.query, results=items) +``` + +### FastAPIRouter — Declarative Serving (v0.6+) + +`pixeltable.serving.FastAPIRouter` generates endpoints from tables and `@pxt.query` functions — no Pydantic models, no hand-written handlers. It's a subclass of FastAPI's `APIRouter`.
+ +```python +# setup_pixeltable.py — flat module, runs on import +import pixeltable as pxt +from pixeltable.functions.uuid import uuid7 +from pixeltable.functions.document import document_splitter +from pixeltable.functions.huggingface import sentence_transformer + +pxt.create_dir('app', if_exists='ignore') + +docs = pxt.create_table('app.documents', { + 'document': pxt.Document, 'uuid': uuid7(), 'timestamp': pxt.Timestamp, +}, primary_key=['uuid'], if_exists='ignore') + +chunks = pxt.create_view('app.chunks', docs, + iterator=document_splitter(docs.document, separators='page, sentence', metadata='title,heading,page'), + if_exists='ignore') + +embed_fn = sentence_transformer.using(model_id='intfloat/multilingual-e5-large-instruct') +chunks.add_embedding_index('text', idx_name='chunks_embed', string_embed=embed_fn, if_exists='ignore') +``` + +```python +# routers/data.py — queries co-located with routes +import pixeltable as pxt +from pixeltable.serving import FastAPIRouter + +router = FastAPIRouter(prefix="/api/data", tags=["data"]) +docs = pxt.get_table("app.documents") +chunks = pxt.get_table("app.chunks") + +# Upload with background processing (returns job handle, client polls /jobs/{id}) +router.add_insert_route(docs, path="/upload", + uploadfile_inputs=["document"], inputs=["timestamp"], outputs=["uuid"], + background=True) + +router.add_delete_route(docs, path="/delete") + +@pxt.query +def list_docs(): + return docs.select(uuid=docs.uuid, name=docs.document, timestamp=docs.timestamp).order_by(docs.timestamp, asc=False) + +@pxt.query +def search_docs(query_text: str): + sim = chunks.text.similarity(string=query_text) + return chunks.where(sim > 0.3).order_by(sim, asc=False).select( + text=chunks.text, sim=sim, title=chunks.title).limit(20) + +router.add_query_route(path="/list", query=list_docs, method="get") +router.add_query_route(path="/search", query=search_docs, method="post") +``` + +```python +# main.py +from fastapi import FastAPI +import 
setup_pixeltable # noqa: F401 — triggers schema init +from routers import data + +app = FastAPI() +app.include_router(data.router) +``` + +Key points: +- **`add_insert_route`** — generates POST endpoint from table columns. Use `uploadfile_inputs` for file uploads, `background=True` for long-running inserts +- **`add_query_route`** — wraps a `@pxt.query` function as GET or POST. Returns `{ "rows": [...] }` automatically +- **`add_delete_route`** — generates POST endpoint for row deletion by primary key or `match_columns` +- **Schema in one file, queries in routers** — `setup_pixeltable.py` creates tables/views/indexes on import. Routers get table refs via `pxt.get_table()` and define `@pxt.query` locally +- **Only write custom endpoints** for multi-table side effects (e.g., agent insert + chat history saves) + +#### return_rows=True for hand-written endpoints + +When you do need a hand-written endpoint (multi-table side effects, conditional logic), use `return_rows=True` to read computed columns back without a follow-up query: + +```python +from typing import Any + +from pydantic import BaseModel + +# `router`, `QueryRequest`, `agent_table`, and `chat_table` are assumed to be defined elsewhere in the app +class AgentResult(BaseModel): + model_config = {"extra": "ignore"} + answer: str | None = None + tool_output: Any = None + +@router.post("/query") +def agent_query(request: QueryRequest): + status = agent_table.insert( + [{"prompt": request.prompt}], return_rows=True + ) + result = AgentResult.model_validate(status.rows[0]) + # Conditional: save to chat history based on computed result + if result.answer: + chat_table.insert([{"role": "assistant", "content": result.answer}]) + return result +``` + +`extra="ignore"` keeps only the declared fields: `status.rows` dicts contain every column, and the extras are dropped during validation. This is already Pydantic's default, so setting it explicitly mainly documents intent and guards against a stricter inherited config such as `extra="forbid"`. + +Reference: [Pixeltable Starter Kit](https://github.com/pixeltable/pixeltable-starter-kit) | [core-api.md → Serving](core-api.md#serving-fastapirouter) + +### Batch Processing Pattern + +Use Pixeltable as a batch processing engine: no HTTP server, no FastAPI.
A Python script that creates the schema, inserts data, lets computed columns process it, exports results to a serving database, and exits. Run it as a Cloud Run Job, ECS Task, K8s Job, Lambda, or a cron container. + +```python +# schema.py: declarative schema (idempotent) +import pixeltable as pxt +from pixeltable.functions.huggingface import sentence_transformer +from pixeltable.functions.string import string_splitter +from pixeltable.functions.uuid import uuid7 + +pxt.create_dir('pipeline', if_exists='ignore') +embed_fn = sentence_transformer.using(model_id='all-MiniLM-L6-v2') + +documents = pxt.create_table('pipeline.documents', { + 'title': pxt.String, + 'body': pxt.String, + 'source_id': pxt.String, + 'uuid': uuid7(), + 'timestamp': pxt.Timestamp, +}, primary_key=['uuid'], if_exists='ignore') + +sentences = pxt.create_view( + 'pipeline.sentences', documents, + iterator=string_splitter(text=documents.body, separators='sentence'), + if_exists='ignore', +) +sentences.add_embedding_index( + 'text', idx_name='sentences_embed', string_embed=embed_fn, if_exists='ignore' +) +``` + +```python +# pipeline.py: ingest, compute, export, exit +import json +from datetime import datetime +from pixeltable.io.sql import export_sql +import schema + +SERVING_DB_URL = 'postgresql+psycopg://user:pass@host/db' + +with open('batch.json') as f: + batch = json.load(f) + +now = datetime.now() +for row in batch['documents']: + row.setdefault('timestamp', now) + +# Insert triggers computed columns: chunking, embeddings, etc. 
+schema.documents.insert(batch['documents']) + +# Export structured results to serving DB +export_sql( + schema.documents.select( + schema.documents.source_id, + schema.documents.title, + schema.documents.body, + ), + 'processed_documents', + db_connect_str=SERVING_DB_URL, + if_exists='replace', +) + +# Verify semantic search works +sim = schema.sentences.text.similarity(string='test query') +results = (schema.sentences.order_by(sim, asc=False) + .limit(3).select(schema.sentences.text, sim=sim).collect()) +``` + +Key points: +- `schema.py` is a flat module that creates everything on import (idempotent) +- `pipeline.py` is the driver: load data, insert, export, exit +- Computed columns fire automatically on insert (chunking, embeddings, LLM calls) +- `export_sql` pushes processed data to any SQL database (Postgres, MySQL, Snowflake, SQLite) +- Set `PIXELTABLE_HOME=/tmp/pixeltable` for ephemeral containers +- Use the `destination` parameter on `add_computed_column` to route generated media to cloud buckets (S3, GCS, Azure Blob) + +Reference: [Starter Kit `batch/` directory](https://github.com/pixeltable/pixeltable-starter-kit/tree/main/batch) + +### Export Workflow + +```python +from pixeltable.io import export_parquet + +# To Parquet +export_parquet(t, 'output/my_data/') + +# Query result to Parquet +query = t.where(t.score > 0.8).select(t.title, t.content, t.score) +export_parquet(query, 'output/filtered/') + +# To pandas +df = t.select(t.title, t.content).collect().to_pandas() +df.to_csv('output/data.csv', index=False) +```
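
A quick sanity check after the CSV export can catch a silently empty or mis-shaped file. The following is a minimal sketch using only the standard library. `check_csv_export` is a hypothetical helper (not part of Pixeltable), the `title` and `content` columns are assumed from the `select` call above, and the sample file is a stand-in for the real export.

```python
import csv
import os
import tempfile


def check_csv_export(path: str, expected_columns: list[str]) -> int:
    """Return the number of data rows, raising if the header is not as expected."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        if header != expected_columns:
            raise ValueError(f'unexpected columns: {header}')
        return sum(1 for _ in reader)


# Stand-in for the file that `df.to_csv('output/data.csv', index=False)` would write.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'data.csv')
    with open(sample_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'content'])  # assumed column names
        writer.writerow(['Doc A', 'First document body'])
        writer.writerow(['Doc B', 'Second document body'])
    row_count = check_csv_export(sample_path, ['title', 'content'])
```

In a batch job, a check like this could run right after the export so the script fails fast instead of publishing an empty or malformed file.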