ingo

Minimal shell runtime to ingest Spanish legal PDFs into Upstash Vector and query them from the same CLI.

What This Repo Does

ingo runs an end-to-end ingest pipeline (fetch -> OCR -> chunk -> embed -> cleanup) and exposes a query command over the same Upstash Vector namespace. It supports two deployment roles: INGO_ROLE=all for ingest/query workers and INGO_ROLE=query for query-only devices where ingest commands are blocked.

Features

Single CLI for ingest and retrieval (bin/ingo).
Ingest stages: fetch, ocr, chunk, embed, cleanup, plus run orchestration.
Query endpoint integration via query with validated --top-k.
Relevance gate (strict mode by default) to reject low-signal OCR output.
HTTP wrapper with timeout and retry/backoff controls.
Marker-based embed skip (.jsonl.embedded) to avoid re-upserting unchanged chunks.

Requirements

Runtime dependencies checked by bin/ingo doctor:

bash
curl
jq
awk
grep
tesseract (with INGO_LANG data installed, default: spa)
pdftotext
pdftoppm

Notes:

Query-only role (INGO_ROLE=query) requires only query dependencies (curl, jq) and Upstash credentials.
Ingest role (INGO_ROLE=all) requires the full OCR + text processing stack above.

Quick Start (5 Minutes)

Copy env template:

cp .env.example .env

Set required values in .env:

UPSTASH_VECTOR_REST_URL
UPSTASH_VECTOR_REST_TOKEN

Verify setup:

bin/ingo doctor

Run one ingest pass:

bin/ingo run --dir data/ingest --strict

Run one query:

bin/ingo query "¿Qué exige la licencia ambiental para vertimientos?" --top-k 8

Command Reference

Command	Purpose	Key Flags	Role Restriction
`bin/ingo doctor`	Validate dependencies and required env vars	none	Available in all roles (checks OCR stack only when role is not `query`)
`bin/ingo fetch`	Discover PDFs in inbox, download one URL, or run document-only seed crawl into corpus artifacts	`--url URL`, `--seeds FILE`, `--crawl-depth N`, `--allow-hosts FILE`, `--manifest FILE`, `--progress-every N`, `--snapshot-pages`, `--verbose`, `--reset-manifests`, `--dir DIR`	Blocked when `INGO_ROLE=query`
`bin/ingo ocr`	OCR PDFs into `data/raw/*.txt`	`--dir DIR`	Blocked when `INGO_ROLE=query`
`bin/ingo chunk`	Convert OCR text to chunk JSONL	`--strict`, `--no-strict`	Blocked when `INGO_ROLE=query`
`bin/ingo embed`	Upsert chunks into Upstash Vector	`--force`	Blocked when `INGO_ROLE=query`
`bin/ingo cleanup`	Remove local intermediates based on cleanup policy	`--markers`	Blocked when `INGO_ROLE=query`
`bin/ingo run`	Run fetch -> ocr -> chunk -> embed -> cleanup pipeline	`--url URL`, `--dir DIR`, `--strict`, `--no-strict`, `--force`	Blocked when `INGO_ROLE=query`
`bin/ingo query "<question>"`	Query indexed content in configured namespace	`--top-k N` (positive integer only)	Available in all roles

Common flows:

Local folder ingest:

bin/ingo run --dir data/ingest --strict

URL one-shot ingest:

bin/ingo run --url "https://example.com/doc.pdf" --strict

Seed-crawl discovery (corpus mode via fetch):

bin/ingo fetch \
  --seeds data/corpus/seeds/seed_url.txt \
  --crawl-depth 2 \
  --allow-hosts data/corpus/config/allowlist.txt \
  --manifest data/corpus/manifests/gdb_documents.ndjson \
  --progress-every 10

Notes:

Seed crawl is append-only by default for manifests/ledgers (no automatic reset).
Already-seen URLs in the manifest are skipped (reason: already_seen_url).
Extraction in fetch --seeds is scoped to files downloaded in the current run.
Host policy in seed crawl is seed hosts + --allow-hosts (merged into a runtime allow-hosts file).
Use --reset-manifests only when you explicitly want a clean ledger for a new run.

Runtime visibility:

crawl-progress: discovery/queue phase counters.
collect-progress: probe/download phase counters.
extract-progress: extraction phase counters.
End-of-run reason breakdowns are printed from skipped/error ledgers.

Optional fallback to export eligible pages as PDF when no document links are found on a page:

bin/ingo fetch \
  --seeds data/corpus/seeds/seed_url.txt \
  --crawl-depth 2 \
  --allow-hosts data/corpus/config/allowlist.txt \
  --manifest data/corpus/manifests/gdb_documents.ndjson \
  --snapshot-pages

Manual curation before indexing:

# review downloaded corpus files
find data/corpus/downloads -type f | sort
# remove files you do not want indexed
rm -f "data/corpus/downloads/unwanted-file.pdf"

Index curated corpus using existing pipeline:

bin/ingo chunk --no-strict
bin/ingo embed

Inspect crawl ledgers:

jq -r '.status' data/corpus/manifests/gdb_documents.ndjson | sort | uniq -c
jq -r '.reason' data/corpus/manifests/gdb_skipped.ndjson | sort | uniq -c
jq -r '.error' data/corpus/manifests/gdb_errors.ndjson | sort | uniq -c

Query-only device:

INGO_ROLE="query" bin/ingo doctor
INGO_ROLE="query" bin/ingo query "resume obligaciones de vertimientos" --top-k 5

Configuration

Use .env.example as a starter, then adjust based on runtime defaults below.

Variable	Required	Default (`lib/env.sh`)	Description
`UPSTASH_VECTOR_REST_URL`	Yes	none	Base URL for Upstash Vector REST API.
`UPSTASH_VECTOR_REST_TOKEN`	Yes	none	Bearer token for Upstash Vector REST API.
`INGO_INBOX`	No	`$HOME/ingest`	Source directory for PDFs.
`INGO_RAW_DIR`	No	`data/raw`	OCR output directory (relative to repo root).
`INGO_CHUNK_DIR`	No	`data/chunks`	Chunk JSONL output directory (relative to repo root).
`INGO_NAMESPACE`	No	`legal-co`	Namespace used for upsert and query.
`INGO_LANG`	No	`spa`	OCR language passed to Tesseract; must be installed.
`INGO_CHUNK_SIZE`	No	`1400`	Target chunk size (characters).
`INGO_CHUNK_OVERLAP`	No	`180`	Character overlap between consecutive chunks.
`INGO_CLEANUP_TEXT`	No	`1`	When `1`, cleanup deletes OCR/chunk intermediates.
`INGO_ROLE`	No	`all`	`all` enables ingest+query; `query` blocks ingest commands.
`INGO_RELEVANCE_MODE`	No	`strict`	Relevance gate mode for chunking (`strict` or effectively off).
`INGO_MIN_TERM_MATCHES`	No	`2`	Minimum term matches for strict relevance gate.
`INGO_REJECTED_DIR`	No	`data/rejected`	Directory for rejected raw text files.
`INGO_RELEVANCE_TERMS`	No	`ambiental,licencia,vertimiento,emision,resolucion,decreto,articulo,autoridad,ministerio,agua,suelo,aire`	Comma-separated relevance terms.
`INGO_CORPUS_DIR`	No	`data/corpus`	Base directory for crawl state, downloaded documents, extracted text, and manifests.
`INGO_CRAWL_DEPTH`	No	`2`	Maximum crawl depth used by `fetch --seeds`.
`INGO_ALLOWED_HOSTS_FILE`	No	`data/corpus/config/allow_hosts.txt`	Extra hosts merged with seed hosts for runtime crawl allow policy.
`INGO_INCLUDE_EXTENSIONS`	No	`pdf,docx,xlsx,xlsm`	Comma-separated allowlist for document-only crawl downloads.
`INGO_EXCLUDE_EXTENSIONS`	No	`zip,png,jpg,jpeg,gif,webp,svg,ico,js,css,map,woff,woff2,ttf,eot,mp3,mp4,mov,avi`	Comma-separated denylist for crawl filtering.
`INGO_PROGRESS_EVERY`	No	`25`	Print crawl progress every N processed URLs.
`INGO_SNAPSHOT_PAGES_TO_PDF`	No	`0`	When `1`, attempt `wkhtmltopdf` page snapshots for eligible pages with no discovered document links.
`INGO_CRAWL_ALLOW_HTTP`	No	`0`	When `0`, crawl upgrades `http://` URLs to `https://` by default to avoid insecure/dead HTTP endpoints.
`INGO_SKIP_PROBE_FOR_ALLOWED_EXTENSIONS`	No	`1`	When `1`, skip MIME probe round-trip for allowlisted extensions (`pdf/docx/xlsx/xlsm`) to speed up large runs.

Spreadsheet handling:

.xlsx and .xlsm are downloaded and converted to text for chunking/embedding.
Extraction priority is xlsx2csv -> in2csv -> Python openpyxl fallback.
If none are available, spreadsheet extraction is marked unsupported and the original files are still preserved. | INGO_HTTP_CONNECT_TIMEOUT | No | 5 | HTTP connect timeout (seconds). | | INGO_HTTP_READ_TIMEOUT | No | 30 | HTTP max request time (seconds). | | INGO_HTTP_DOWNLOAD_TIMEOUT | No | 120 | HTTP max time for document downloads (seconds). | | INGO_HTTP_RETRY_ATTEMPTS | No | 2 | Number of retry attempts for retriable failures. | | INGO_HTTP_RETRY_BACKOFF_MIN | No | 1 | Initial retry backoff (seconds). | | INGO_HTTP_RETRY_BACKOFF_MAX | No | 8 | Maximum retry backoff (seconds). | | INGO_HTTP_RETRY_BACKOFF_FACTOR | No | 2 | Backoff multiplier between retries. | | INGO_HTTP_RETRY_AFTER_MAX | No | INGO_HTTP_RETRY_BACKOFF_MAX | Maximum accepted Retry-After sleep (seconds). |

Legacy compatibility:

INGO_HTTP_RETRY_MAX is still accepted and mapped to INGO_HTTP_RETRY_ATTEMPTS.
INGO_HTTP_RETRY_BACKOFF is still accepted and mapped to INGO_HTTP_RETRY_BACKOFF_MIN.
Runtime exports both legacy names for compatibility with older scripts/tests.

Data Layout

data/
  ingest/             # optional local inbox if you set INGO_INBOX=data/ingest
  raw/                # OCR text + source metadata (*.txt, *.meta)
  chunks/             # chunk files (*.jsonl) and embed markers (*.jsonl.embedded)
  rejected/           # OCR text rejected by relevance gate or empty-content check

Default inbox is $HOME/ingest unless overridden by INGO_INBOX or command --dir.

Operational Notes

Relevance behavior:
- chunk --strict (or default INGO_RELEVANCE_MODE=strict) rejects low-signal text into INGO_REJECTED_DIR.
- chunk --no-strict disables relevance filtering for that run.
Marker behavior:
- embed skips files when chunk hash + namespace match existing .jsonl.embedded marker.
- embed --force bypasses marker skip and re-upserts.
Query output shape:
- JSON object with match_count and matches[].
- Each match includes normalized fields such as id, score, text, source, section, article, page, date_indexed.

Testing

Run all test scripts:

for f in tests/*_test.sh; do bash "$f"; done

Key coverage areas:

query_top_k_validation_test.sh: validates --top-k acceptance/rejection rules.
doctor_lang_check_test.sh: validates exact OCR language detection behavior.
http_wrapper_test.sh: validates timeout/retry/backoff + wrapper integration in fetch/embed/query.
embed_marker_rerun_test.sh: validates marker skip/re-embed/force behavior.
path_key_collision_test.sh: validates collision-safe artifact keys and meta lookup fidelity.

Troubleshooting

missing env var: UPSTASH_VECTOR_REST_URL or UPSTASH_VECTOR_REST_TOKEN:
- Define both in .env or shell environment.
missing OCR language data for INGO_LANG=spa:
- Install Tesseract language data for INGO_LANG and verify with tesseract --list-langs.
invalid --top-k: must be a positive integer:
- Use an integer greater than zero (for example --top-k 8).
ingest commands are disabled on this device (INGO_ROLE=query):
- Expected in query-only mode; use INGO_ROLE=all for ingest commands.

Roadmap / Improvements

See IMPROVEMENTS.md for ongoing improvement notes. See docs/corpus-fetch-workflow.md for the crawl-enabled fetch workflow. See docs/document-only-crawl.md for document-only vector-ready crawl behavior.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ingo

What This Repo Does

Features

Requirements

Quick Start (5 Minutes)

Command Reference

Configuration

Data Layout

Operational Notes

Testing

Troubleshooting

Roadmap / Improvements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
.github/workflows		.github/workflows
bin		bin
docs		docs
lib		lib
tests		tests
.env.example		.env.example
.gitignore		.gitignore
IMPROVEMENTS.md		IMPROVEMENTS.md
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

ingo

What This Repo Does

Features

Requirements

Quick Start (5 Minutes)

Command Reference

Configuration

Data Layout

Operational Notes

Testing

Troubleshooting

Roadmap / Improvements

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages