Open Ingest is a distributed ingestion API that turns unstructured data into LLM-ready outputs: layout-aware markdown and structured data from charts and figures.
It runs on Tensorlake's serverless orchestration engine: Python functions executed inside Firecracker microVMs, with S3 as the durable substrate for RPC, queues, and checkpoints. There is no separate queue or orchestration engine to wire up; the runtime treats S3 as the system of record, so every stage of an ingestion pipeline is durable and resumable by default.
This codebase has been in production for over a year, processing documents for Tensorlake's customers across finance, healthcare, and legal workloads. We're open-sourcing it so developers can build their own ingestion pipelines on the same runtime: fork the repo, write Python functions for whatever extraction, enrichment, or routing logic your project needs, and Tensorlake's orchestration engine handles scheduling, scaling, and durability.
Open Ingest is also the reference implementation for ingestion projects using Tensorlake's orchestration engine: if you want to see how durable execution, microVM sandboxing, and S3-backed coordination compose into a real production system, the code is here.
Works out of the box — point it at a PDF, get back layout-aware markdown, tables, and structured data. Pick an OCR backend, supply a key, you're done.
And every layer is swappable if you want to go further:
- BYO OCR: 4 backends ship (DotsOCR, Azure DI, Textract, Gemini); register your own in `src/tensorlake_docai/ocr/`.
- BYO LLM: provider-agnostic; supply only the keys for what you use (OpenAI, Anthropic, Gemini).
- BYO schema: Pydantic `BaseModel` or JSON Schema for structured extraction with citations.
- Forkable stages: table merging, VLM enrichment, page classification, structured extraction are all standalone `@function()`s you can swap or extend.
- `--local` for hacking, `tl deploy` for production: same `ParseRequest`, two runners.
Given a PDF, image, DOCX, or Office file, the workflow:
- Normalizes the file. MIME type is detected from content (via `python-magic`) with filename and `Content-Type` as fallbacks, then each format is routed through a dedicated path:

  | Input | What happens | Tool | Downstream MIME |
  |---|---|---|---|
  | PDF | Passed through; page count via `pypdf` | - | `application/pdf` |
  | JPG / PNG / single-page TIFF | Passed through as 1 page | - | `image/*` |
  | Multi-page TIFF | Frame count via `PIL.Image.n_frames` | Pillow | `image/tiff` |
  | DOCX | Parsed in-process to structured pages + bboxes; a preview PDF is rendered and returned as base64 | `docx_parsing` | `text/html` |
  | DOC | Converted to DOCX, then routed to the DOCX path | LibreOffice (`soffice`) | `text/html` |
  | XLSX / XLS / XLSM | Each sheet → HTML table → split on empty rows → Markdown; sheet name kept as `page_class` | `pandas` + `markdownify` | `text/table` |
  | TXT / HTML / CSV / XML / MD | Decoded as UTF-8, with `chardet` fallback for other encodings (CSV preserved for structured extraction) | `chardet` | `text/plain` or `text/csv` |
  | P7M (PKCS#7) | Inner payload extracted, MIME re-detected, then routed by the new type | `openssl cms/smime/pkcs7` | depends on inner content |
  | Anything else LibreOffice can open (RTF, ODT, PPT, …) | Converted to PDF | LibreOffice (`soffice`) | `application/pdf` |

  Quota and `pages_to_parse` validation runs after normalization, so out-of-range pages fail before any OCR or VLM cost is incurred.

- Runs OCR with the provider of your choice. Four backends ship:

  | `ocr_model` | Provider | Best for |
  |---|---|---|
  | `dots-ocr` (`model03`) | DotsOCR + Ovis2.5 on a CUDA GPU | Complex documents. Open-sourced with the full serving setup (vLLM, two-stage Ovis figure OCR, masked-region retries). Needs your own GPU host (`--local`) or a managed Tensorlake deployment. |
  | `azure-di` (`model01`) | Azure Document Intelligence | Fast cloud OCR with cell-level table bboxes |
  | `textract` (`model02`) | AWS Textract | Native PDF, async S3 jobs |
  | `gemini` | Google Gemini VLM | VLM-powered semantic OCR |

- Enriches with VLM passes (optional): table summarization, figure summarization, table merging for cross-page tables, chart extraction, page classification, signature detection.

- Extracts structured data against a JSON schema with citation tracking.

- Returns a single `ParsedDocument` (pages, fragments, tables, structured outputs, usage): no DB, no webhooks.

See docs/pipeline.md for the full DAG and
docs/models.md for the OCR backend comparison.
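
The content-first detection order described in the normalization step can be sketched as follows. This is a minimal illustration, not the repository's code: `sniff_content` is a hypothetical stand-in for `python-magic` that recognizes only a handful of magic numbers, and `detect_mime` is an assumed name.

```python
import mimetypes
from typing import Optional

# Hypothetical stand-in for python-magic: a few well-known file signatures.
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
]


def sniff_content(data: bytes) -> Optional[str]:
    """Return a MIME type if the leading bytes match a known signature."""
    for magic, mime in _SIGNATURES:
        if data.startswith(magic):
            return mime
    return None


def detect_mime(
    data: bytes,
    filename: Optional[str] = None,
    content_type: Optional[str] = None,
) -> str:
    """Content first, then the filename extension, then the Content-Type header."""
    mime = sniff_content(data)
    if mime:
        return mime
    if filename:
        guessed, _encoding = mimetypes.guess_type(filename)
        if guessed:
            return guessed
    if content_type:
        # Drop parameters such as "; charset=utf-8".
        header = content_type.split(";", 1)[0].strip().lower()
        if header:
            return header
    return "application/octet-stream"
```

Checking content before the filename means a PDF uploaded as `scan.png` is still routed down the PDF path; the filename and header only matter when the bytes are inconclusive.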
Each stage is independent — toggle only what you need. Every feature below is gated by one field on ParseRequest and exposed as a CLI flag on examples/parse_pdf.py. Run python examples/parse_pdf.py --help for the full list, or docs/running.md §4 for the field-by-field map.
Layout-aware text, tables, figures, charts, formulas, headers, footers, page numbers, and reading order — across four interchangeable OCR backends (docs/models.md). Output is a ParsedDocument with per-fragment bounding boxes and ref_ids, so anything downstream can re-locate the source pixels.

| Feature | CLI flag | What it does |
|---|---|---|
| Cross-page table merging | `--table-merging` | Stitches tables that wrap across pages, and same-page tables split by intervening text. Uses a fast Gemini "is this a continuation?" prompt per pair and falls back to a visual-alignment pass when column counts disagree. |
| Table summarization | `--table-summarization` | One-sentence VLM summary attached to each table; `--table-summarization-prompt` overrides the default. |
| Table cell grounding | `--table-cell-grounding` | Per-cell bounding boxes, useful for click-to-source UIs and entity location. |
| Table output format | `--table-output-mode {markdown,html,json}` | Markdown by default; HTML preserves merged cells and spans. |

| Feature | CLI flag | What it does |
|---|---|---|
| Figure summarization | `--figure-summarization` | VLM caption per figure; `--figure-summarization-prompt` overrides. |
| Figure grounding | `--figure-grounding` | Bounding boxes for text regions inside figures. |
| Chart extraction | `--chart-extraction` | Extracts the underlying data series as JSON (line, bar, pie, scatter). |
| Figure OCR prompt | `--figure-ocr-prompt` | Override the DotsOCR figure-OCR prompt (`dots-ocr` only). |

| Feature | CLI flag | What it does |
|---|---|---|
| Key-value extraction | `--key-value-extraction` | Pulls key/value pairs out of form-shaped regions without a schema. |
| Structured extraction | (use `extract_structured.py`) | Schema-driven extraction with citations (JSON Schema or Pydantic). See docs/running.md. |
| Form filling | `form_filling=FormFillingRequest(...)` | Fills the source PDF/DOCX with extracted values and returns a base64 of the filled doc. Python-only; see docs/running.md. |

| Feature | CLI flag | What it does |
|---|---|---|
| Signature detection | `--detect-signature` | Locates signatures via Textract; needs AWS keys. |
| Barcode detection | `--detect-barcode` | Decodes 1D/2D barcodes (QR, Code-128, etc.). |

| Feature | CLI flag | What it does |
|---|---|---|
| Page classification | `--classify NAME:DESCRIPTION` (repeatable) | Multi-label classification (or multi-class with `--classification-type multi_class`) using natural-language class definitions. |
| Cross-page header detection | `--xpage-header-detection` | Drops repeating page headers/footers from the output. |
| Page selection | `--pages 1 2 5` | 1-indexed; saves both money and time on long docs. |
| Chunking | `--chunk-strategy {none,page,section,fragment}` | Controls the granularity of `chunks[]` in the output. |
| Drop fragment types | `--ignore-sections page_footer figure` | Filters unwanted fragments from the final document. |
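
Page selection is 1-indexed and, as noted earlier, validated after normalization so that bad page numbers fail before any OCR cost. A minimal sketch of that kind of check, assuming a hypothetical `validate_pages` helper (the sorting and de-duplication are illustrative choices, not confirmed behavior of the repository):

```python
from typing import Iterable, List, Optional


def validate_pages(
    pages_to_parse: Optional[Iterable[int]], page_count: int
) -> List[int]:
    """Return the sorted, de-duplicated 1-indexed pages to process.

    None selects every page. Any page outside 1..page_count raises
    ValueError, so the failure happens before expensive stages run.
    """
    if page_count < 1:
        raise ValueError("document has no pages")
    if pages_to_parse is None:
        return list(range(1, page_count + 1))
    selected = sorted(set(pages_to_parse))
    out_of_range = [p for p in selected if p < 1 or p > page_count]
    if out_of_range:
        raise ValueError(
            f"pages out of range 1..{page_count}: {out_of_range}"
        )
    return selected
```

For a 10-page document, `validate_pages([5, 1, 2, 2], 10)` yields `[1, 2, 5]`, while asking for page 0 or page 11 raises before any backend is called.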
Two ways to run — same ParseRequest payload, different runner:
- Local (`--local`): no deploy, no Tensorlake account. Every task runs in your Python process. Best for reading, debugging, and iterating on a request.
- Remote on Tensorlake: `tl deploy` once, then each task runs in its own container with autoscaling and retries.
With pip:

```bash
git clone https://github.com/tensorlakeai/openingest
cd openingest
pip install -e .
export GEMINI_API_KEY=...
python examples/parse_pdf.py --file my.pdf --ocr-model gemini --local
```

Or with uv:

```bash
git clone https://github.com/tensorlakeai/openingest
cd openingest
uv sync --extra cpu     # CPU machine (installs torch + transformers)
# uv sync --extra gpu   # Linux GPU machine (installs CUDA torch + transformers)
# pip install vllm      # additionally required for dots-ocr on a GPU machine
export GEMINI_API_KEY=...
python examples/parse_pdf.py --file my.pdf --ocr-model gemini --local
```

Results are written to `debug/`.
Add `--draw-bboxes` to also write `debug/bbox_page_N.png` per page, with fragment bounding boxes overlaid on the rendered page image, which is handy for sanity-checking layout output. Local files only.
TENSORLAKE_API_KEY is not required for local runs.
--local only changes where the workflow itself runs — it doesn't bypass external OCR backends. Each backend needs its own provider keys (azure-di, textract, gemini); dots-ocr needs a CUDA-equipped host.
```bash
cp .env.example .env
$EDITOR .env                      # add TENSORLAKE_API_KEY + provider keys you'll use
set -a; source .env; set +a
bash scripts/sync-secrets.sh      # push provider keys from .env to Tensorlake
tl deploy src/workflow.py
python examples/parse_pdf.py --file my.pdf --ocr-model azure-di
```

`scripts/sync-secrets.sh` reads your `.env` and pushes the provider keys the
workflow declares in `@function(secrets=[...])` to Tensorlake via
`tl secrets set`. Re-run it whenever those values change.
tl deploy packages every @function() in src/workflow.py and ships it to Tensorlake's pool — re-deploy when those source files change.
For structured extraction (works in both modes):

```bash
python examples/extract_structured.py --file invoice.pdf --schema Invoice --local
```

Once deployed, you can view and inspect job runs in the Tensorlake UI.
See docs/running.md for the full walkthrough — per-backend key
matrix, VLM/structured-extraction/page-classification examples, and the
ParseRequest knobs that toggle each DAG stage.
The pipeline is provider-agnostic — supply only the keys for the backends you
plan to use. The .env.example file groups them by feature:
- Tensorlake (required if you want a deployed application): `TENSORLAKE_API_KEY`
- DotsOCR on a CUDA GPU (`dots-ocr`): no provider keys. Run it on any CUDA-equipped host via `--local`, or on a managed Tensorlake GPU deployment. The `@function()` decorators are already pinned to `H100`/`A100-80GB`, but GPU workers aren't part of the open serverless tier today, so reach out to support@tensorlake.ai if you'd like one provisioned. Weights (`rednote-hilab/dots.mocr` and `AIDC-AI/Ovis2.5-9B`) are pulled from Hugging Face Hub on first cold-start.
- Azure Document Intelligence (`azure-di`): `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT`, `AZURE_DOCUMENT_INTELLIGENCE_KEY`
- AWS Textract (`textract`): `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `S3_BUCKET_NAME`
- Google Gemini VLM (`gemini`): `GEMINI_API_KEY`
- VLM enrichment / structured extraction: any of `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`

Missing keys silently disable the dependent feature — the rest of the pipeline keeps running.
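
Because a missing key disables its feature silently, a small preflight check can make the gaps visible before a run. The sketch below uses the environment variable names listed above; the helper names (`missing_keys`, `configured_llm_keys`) and the grouping are assumptions for illustration, not the repository's own logic.

```python
import os
from typing import Dict, List, Mapping, Optional

# Env var names come from the list above; this grouping is illustrative.
OCR_BACKEND_KEYS: Dict[str, List[str]] = {
    "dots-ocr": [],  # no provider keys, but needs a CUDA-equipped host
    "azure-di": [
        "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
        "AZURE_DOCUMENT_INTELLIGENCE_KEY",
    ],
    "textract": [
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "AWS_REGION",
        "S3_BUCKET_NAME",
    ],
    "gemini": ["GEMINI_API_KEY"],
}
LLM_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY"]


def missing_keys(
    backend: str, env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """List the env vars a backend needs that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in OCR_BACKEND_KEYS[backend] if not env.get(key)]


def configured_llm_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List the LLM provider keys that are set (any one is enough)."""
    env = os.environ if env is None else env
    return [key for key in LLM_KEYS if env.get(key)]


if __name__ == "__main__":
    for name in OCR_BACKEND_KEYS:
        gaps = missing_keys(name)
        print(f"{name}: {'ok' if not gaps else 'missing ' + ', '.join(gaps)}")
    print("LLM keys set:", configured_llm_keys() or "none")
```

Passing an explicit mapping instead of reading `os.environ` keeps the check easy to test and lets it run against a parsed `.env` file.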
Every stage in the workflow is a Tensorlake @function() you can fork, replace,
or add to. Common extension points:
- Bring your own OCR backend. Drop a file in `src/tensorlake_docai/ocr/`, register it in `ocr/__init__.py`, widen the `ocr_model` enum in `pipeline/api.py`, and import it in `workflow.py`. By calling `route_after_ocr` at the end of your task, the new backend automatically participates in table merging, structured extraction, VLM enrichment, and the unified output format. Step-by-step walkthrough in `src/tensorlake_docai/ocr/README.md`.
- Add a VLM enrichment pass. Table/figure summarization, chart extraction, and signature detection all live in `src/tensorlake_docai/vlm/cloud.py` as batched passes over the document; add another by following the same shape.
- Drop in your own structured-extraction schema. Define a Pydantic `BaseModel` in your own code (no SDK edits needed) and pass `json.dumps(YourModel.model_json_schema())` to `StructuredExtractionRequest`. See `examples/extract_structured.py` for a runnable end-to-end wiring and `docs/running.md` §"Bringing your own schema" for the recommended pattern. The SDK also ships a few sample schemas (`Invoice`, `Customer`, `BankStatement`, `Receipt`) in `tensorlake_docai.extraction.schema_collections` you can import for quick testing.

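As a rough illustration of the bring-your-own-schema step, the sketch below defines a Pydantic v2 model and serializes its JSON Schema with `json.dumps(Model.model_json_schema())`, as described above. The `PurchaseOrder` and `LineItem` models are made up for this example, and the sketch stops at producing the schema string; how it is passed to `StructuredExtractionRequest` is shown in `examples/extract_structured.py`, not here.

```python
import json
from typing import List, Optional

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    """One row of the order (hypothetical example model)."""

    description: str
    quantity: int
    unit_price: float


class PurchaseOrder(BaseModel):
    """Top-level schema to extract (hypothetical example model)."""

    po_number: str = Field(description="Purchase order identifier")
    vendor: str
    total: float
    notes: Optional[str] = None  # optional: absent from "required"
    line_items: List[LineItem] = Field(default_factory=list)


# The JSON Schema string that the text above says to hand to the
# structured-extraction request.
schema_json = json.dumps(PurchaseOrder.model_json_schema())

if __name__ == "__main__":
    print(json.dumps(json.loads(schema_json), indent=2))
```

Field descriptions travel with the schema, so they are a natural place to put extraction hints for ambiguous fields.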
The dots-ocr backend doubles as the reference implementation for serving a
GPU model on this pipeline — vLLM engine setup, model caching across
containers, two-stage classification → extraction, and masked-region retries
all live in src/tensorlake_docai/ocr/dots_ocr.py and figure_ocr.py.
```
openingest/
├── examples/                 # parse_pdf.py, extract_structured.py
├── src/
│   ├── workflow.py           # `tl deploy` entrypoint; must sit one level above the package
│   └── tensorlake_docai/
│       ├── pipeline/         # file_converter, routing, output_formatter, api
│       ├── ocr/              # azure, textract, gemini, dots_ocr_*, figure_ocr
│       ├── vlm/              # cloud VLM summarization, grounding, chart extraction
│       ├── extraction/       # structured extraction + chunking + schemas
│       ├── tables/           # cross-page merging, cell grounding, correction
│       ├── postprocess/      # header correction, formatter, output cleaner
│       ├── models/           # ParseResult, PageLayout, etc.
│       ├── providers/        # LLM client wrappers
│       └── prompts/          # prompt templates
├── docs/                     # pipeline.md, models.md, running.md, deployment.md
├── pyproject.toml
└── .env.example
```

- License: Apache-2.0
- Maintenance: bug fixes and security patches, light feature work
- Issues: github.com/tensorlakeai/openingest/issues
- Security: report to support@tensorlake.ai (see `SECURITY.md`)
- Contributing: see `CONTRIBUTING.md`

This pipeline stands on the shoulders of several open-source models and libraries. Credit and thanks to:
- DotsOCR (rednote-hilab): the layout + OCR model behind `dots-ocr`. A small amount of preprocessing and output-cleaning code is adapted from the upstream repo (see SPDX headers in `src/tensorlake_docai/ocr/` and `src/tensorlake_docai/postprocess/output_cleaner.py`).
- Ovis2.5-9B (AIDC-AI): the VLM used for figure OCR classification and extraction when `ocr_model='dots-ocr'`.
- vLLM: the inference server used inside the GPU OCR container.
- jdeskew: skew-correction used during page preprocessing.

The commercial OCR/VLM providers above (Azure, AWS, Google) are accessed via their public APIs using your own keys; no provider code is redistributed in this repo.