Minimal shell runtime to ingest Spanish legal PDFs into Upstash Vector and query them from the same CLI.
ingo runs an end-to-end ingest pipeline (fetch -> OCR -> chunk -> embed -> cleanup) and exposes a query command over the same Upstash Vector namespace. It supports two deployment roles: INGO_ROLE=all for ingest/query workers and INGO_ROLE=query for query-only devices where ingest commands are blocked.
- Single CLI for ingest and retrieval (`bin/ingo`).
- Ingest stages: `fetch`, `ocr`, `chunk`, `embed`, `cleanup`, plus `run` orchestration.
- Query endpoint integration via `query` with validated `--top-k`.
- Relevance gate (`strict` mode by default) to reject low-signal OCR output.
- HTTP wrapper with timeout and retry/backoff controls.
- Marker-based embed skip (`.jsonl.embedded`) to avoid re-upserting unchanged chunks.
Runtime dependencies checked by `bin/ingo doctor`:
- `bash`
- `curl`
- `jq`
- `awk`
- `grep`
- `tesseract` (with `INGO_LANG` data installed, default: `spa`)
- `pdftotext`
- `pdftoppm`
Notes:
- Query-only role (`INGO_ROLE=query`) requires only query dependencies (`curl`, `jq`) and Upstash credentials.
- Ingest role (`INGO_ROLE=all`) requires the full OCR + text processing stack above.
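
The role-dependent dependency check described above can be sketched roughly as follows. This is an illustration only, not the actual `doctor` implementation: the function names `deps_for_role` and `missing_deps` are hypothetical, and the tool lists are simply copied from the lists above.

```bash
# Hypothetical helper: tools each role needs, per the dependency lists above.
deps_for_role() {
  case "$1" in
    query) echo "curl jq" ;;
    *)     echo "bash curl jq awk grep tesseract pdftotext pdftoppm" ;;
  esac
}

# Hypothetical helper: print every command that is not on PATH.
# Returns 0 only when nothing is missing.
missing_deps() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}
```

A caller could run `missing_deps $(deps_for_role "${INGO_ROLE:-all}")` and, for the ingest role, additionally confirm the OCR language with `tesseract --list-langs | grep -qx "${INGO_LANG:-spa}"`.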
- Copy env template: `cp .env.example .env`
- Set required values in `.env`: `UPSTASH_VECTOR_REST_URL` and `UPSTASH_VECTOR_REST_TOKEN`
- Verify setup: `bin/ingo doctor`
- Run one ingest pass: `bin/ingo run --dir data/ingest --strict`
- Run one query: `bin/ingo query "¿Qué exige la licencia ambiental para vertimientos?" --top-k 8`

| Command | Purpose | Key Flags | Role Restriction |
|---|---|---|---|
| `bin/ingo doctor` | Validate dependencies and required env vars | none | Available in all roles (checks OCR stack only when role is not `query`) |
| `bin/ingo fetch` | Discover PDFs in inbox, download one URL, or run document-only seed crawl into corpus artifacts | `--url URL`, `--seeds FILE`, `--crawl-depth N`, `--allow-hosts FILE`, `--manifest FILE`, `--progress-every N`, `--snapshot-pages`, `--verbose`, `--reset-manifests`, `--dir DIR` | Blocked when `INGO_ROLE=query` |
| `bin/ingo ocr` | OCR PDFs into `data/raw/*.txt` | `--dir DIR` | Blocked when `INGO_ROLE=query` |
| `bin/ingo chunk` | Convert OCR text to chunk JSONL | `--strict`, `--no-strict` | Blocked when `INGO_ROLE=query` |
| `bin/ingo embed` | Upsert chunks into Upstash Vector | `--force` | Blocked when `INGO_ROLE=query` |
| `bin/ingo cleanup` | Remove local intermediates based on cleanup policy | `--markers` | Blocked when `INGO_ROLE=query` |
| `bin/ingo run` | Run `fetch -> ocr -> chunk -> embed -> cleanup` pipeline | `--url URL`, `--dir DIR`, `--strict`, `--no-strict`, `--force` | Blocked when `INGO_ROLE=query` |
| `bin/ingo query "<question>"` | Query indexed content in configured namespace | `--top-k N` (positive integer only) | Available in all roles |
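
Two rules in the table above, the role restriction and the `--top-k` check, can be sketched in a few lines of shell. This is a minimal sketch under assumptions: the function names are hypothetical, and the real CLI may structure these checks differently. The error text matches the message listed in the troubleshooting entries.

```bash
# Hypothetical guard: refuse ingest commands on query-only devices.
require_ingest_role() {
  if [ "${INGO_ROLE:-all}" = "query" ]; then
    echo "ingest commands are disabled on this device (INGO_ROLE=query)" >&2
    return 1
  fi
}

# Hypothetical validator: accept only digits, and only values greater than zero.
is_valid_top_k() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, signed, decimal, or non-numeric input
  esac
  [ "$1" -gt 0 ]
}
```

With these, `is_valid_top_k 8` succeeds, while `0`, `-3`, `2.5`, and an empty value are all rejected.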
Common flows:
Local folder ingest:

    bin/ingo run --dir data/ingest --strict

URL one-shot ingest:

    bin/ingo run --url "https://example.com/doc.pdf" --strict

Seed-crawl discovery (corpus mode via `fetch`):

    bin/ingo fetch \
      --seeds data/corpus/seeds/seed_url.txt \
      --crawl-depth 2 \
      --allow-hosts data/corpus/config/allowlist.txt \
      --manifest data/corpus/manifests/gdb_documents.ndjson \
      --progress-every 10

Notes:
- Seed crawl is append-only by default for manifests/ledgers (no automatic reset).
- Already-seen URLs in the manifest are skipped (reason: `already_seen_url`).
- Extraction in `fetch --seeds` is scoped to files downloaded in the current run.
- Host policy in seed crawl is `seed hosts + --allow-hosts` (merged into a runtime allow-hosts file).
- Use `--reset-manifests` only when you explicitly want a clean ledger for a new run.
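
To illustrate the host policy in the notes above, here is a rough sketch of how seed hosts and an extra allow-hosts file could be merged into one list. It is not the crawler's actual code: `url_host` and `merge_allow_hosts` are hypothetical names, and the sketch assumes one URL or host per line, `#` comments, and no IPv6 literals.

```bash
# Hypothetical helper: extract the host part of a URL.
url_host() {
  local rest="${1#*://}"   # drop the scheme
  rest="${rest%%/*}"       # drop the path
  rest="${rest##*@}"       # drop any userinfo
  printf '%s\n' "${rest%%:*}"   # drop the port
}

# Hypothetical helper: hosts from the seeds file plus the allow-hosts file,
# lowercased and de-duplicated.
merge_allow_hosts() {
  local seeds_file="$1" allow_file="$2" line
  {
    while IFS= read -r line || [ -n "$line" ]; do
      case "$line" in ''|'#'*) continue ;; esac
      url_host "$line"
    done < "$seeds_file"
    if [ -f "$allow_file" ]; then
      grep -v -e '^[[:space:]]*$' -e '^#' "$allow_file"
    fi
  } | tr '[:upper:]' '[:lower:]' | sort -u
}
```

For example, `merge_allow_hosts data/corpus/seeds/seed_url.txt data/corpus/config/allowlist.txt > /tmp/runtime_allow_hosts.txt` would write the merged list to a scratch file (the output path here is an assumption).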
Runtime visibility:
- `crawl-progress`: discovery/queue phase counters.
- `collect-progress`: probe/download phase counters.
- `extract-progress`: extraction phase counters.
- End-of-run reason breakdowns are printed from skipped/error ledgers.
Optional fallback to export eligible pages as PDF when no document links are found on a page:

    bin/ingo fetch \
      --seeds data/corpus/seeds/seed_url.txt \
      --crawl-depth 2 \
      --allow-hosts data/corpus/config/allowlist.txt \
      --manifest data/corpus/manifests/gdb_documents.ndjson \
      --snapshot-pages

Manual curation before indexing:

    # review downloaded corpus files
    find data/corpus/downloads -type f | sort
    # remove files you do not want indexed
    rm -f "data/corpus/downloads/unwanted-file.pdf"

Index curated corpus using existing pipeline:

    bin/ingo chunk --no-strict
    bin/ingo embed

Inspect crawl ledgers:

    jq -r '.status' data/corpus/manifests/gdb_documents.ndjson | sort | uniq -c
    jq -r '.reason' data/corpus/manifests/gdb_skipped.ndjson | sort | uniq -c
    jq -r '.error' data/corpus/manifests/gdb_errors.ndjson | sort | uniq -c

Query-only device:

    INGO_ROLE="query" bin/ingo doctor
    INGO_ROLE="query" bin/ingo query "resume obligaciones de vertimientos" --top-k 5

Use `.env.example` as a starter, then adjust based on runtime defaults below.

| Variable | Required | Default (`lib/env.sh`) | Description |
|---|---|---|---|
| `UPSTASH_VECTOR_REST_URL` | Yes | none | Base URL for Upstash Vector REST API. |
| `UPSTASH_VECTOR_REST_TOKEN` | Yes | none | Bearer token for Upstash Vector REST API. |
| `INGO_INBOX` | No | `$HOME/ingest` | Source directory for PDFs. |
| `INGO_RAW_DIR` | No | `data/raw` | OCR output directory (relative to repo root). |
| `INGO_CHUNK_DIR` | No | `data/chunks` | Chunk JSONL output directory (relative to repo root). |
| `INGO_NAMESPACE` | No | `legal-co` | Namespace used for upsert and query. |
| `INGO_LANG` | No | `spa` | OCR language passed to Tesseract; must be installed. |
| `INGO_CHUNK_SIZE` | No | `1400` | Target chunk size (characters). |
| `INGO_CHUNK_OVERLAP` | No | `180` | Character overlap between consecutive chunks. |
| `INGO_CLEANUP_TEXT` | No | `1` | When `1`, cleanup deletes OCR/chunk intermediates. |
| `INGO_ROLE` | No | `all` | `all` enables ingest+query; `query` blocks ingest commands. |
| `INGO_RELEVANCE_MODE` | No | `strict` | Relevance gate mode for chunking (`strict` or effectively off). |
| `INGO_MIN_TERM_MATCHES` | No | `2` | Minimum term matches for strict relevance gate. |
| `INGO_REJECTED_DIR` | No | `data/rejected` | Directory for rejected raw text files. |
| `INGO_RELEVANCE_TERMS` | No | `ambiental,licencia,vertimiento,emision,resolucion,decreto,articulo,autoridad,ministerio,agua,suelo,aire` | Comma-separated relevance terms. |
| `INGO_CORPUS_DIR` | No | `data/corpus` | Base directory for crawl state, downloaded documents, extracted text, and manifests. |
| `INGO_CRAWL_DEPTH` | No | `2` | Maximum crawl depth used by `fetch --seeds`. |
| `INGO_ALLOWED_HOSTS_FILE` | No | `data/corpus/config/allow_hosts.txt` | Extra hosts merged with seed hosts for runtime crawl allow policy. |
| `INGO_INCLUDE_EXTENSIONS` | No | `pdf,docx,xlsx,xlsm` | Comma-separated allowlist for document-only crawl downloads. |
| `INGO_EXCLUDE_EXTENSIONS` | No | `zip,png,jpg,jpeg,gif,webp,svg,ico,js,css,map,woff,woff2,ttf,eot,mp3,mp4,mov,avi` | Comma-separated denylist for crawl filtering. |
| `INGO_PROGRESS_EVERY` | No | `25` | Print crawl progress every N processed URLs. |
| `INGO_SNAPSHOT_PAGES_TO_PDF` | No | `0` | When `1`, attempt `wkhtmltopdf` page snapshots for eligible pages with no discovered document links. |
| `INGO_CRAWL_ALLOW_HTTP` | No | `0` | When `0`, crawl upgrades `http://` URLs to `https://` to avoid insecure/dead HTTP endpoints. |
| `INGO_SKIP_PROBE_FOR_ALLOWED_EXTENSIONS` | No | `1` | When `1`, skip MIME probe round-trip for allowlisted extensions (`pdf`/`docx`/`xlsx`/`xlsm`) to speed up large runs. |
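
To make `INGO_CHUNK_SIZE` and `INGO_CHUNK_OVERLAP` concrete, the sketch below splits text into windows of `size` characters where each window starts `size - overlap` characters after the previous one. It is an assumption-laden illustration, not the chunker `ingo` ships: `chunk_text` is a hypothetical name, each input line is chunked independently, and whether `substr` counts characters or bytes depends on the awk implementation and locale.

```bash
# Hypothetical helper: print overlapping fixed-size windows of the input text,
# one window per output line.
# Usage: chunk_text TEXT [SIZE] [OVERLAP]
chunk_text() {
  printf '%s\n' "$1" | awk -v size="${2:-1400}" -v overlap="${3:-180}" '
    BEGIN {
      step = size - overlap
      if (step <= 0) {
        print "overlap must be smaller than size" > "/dev/stderr"
        exit 1
      }
    }
    {
      for (start = 1; start <= length($0); start += step) {
        print substr($0, start, size)
        # Stop once this window already reaches the end of the line.
        if (start + size > length($0)) break
      }
    }
  '
}
```

With the defaults, consecutive chunks share 180 characters, so a sentence cut at a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.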

Spreadsheet handling:
- `.xlsx` and `.xlsm` are downloaded and converted to text for chunking/embedding.
- Extraction priority is `xlsx2csv` -> `in2csv` -> Python `openpyxl` fallback.
- If none are available, spreadsheet extraction is marked `unsupported` and the original files are still preserved.

| Variable | Required | Default (`lib/env.sh`) | Description |
|---|---|---|---|
| `INGO_HTTP_CONNECT_TIMEOUT` | No | `5` | HTTP connect timeout (seconds). |
| `INGO_HTTP_READ_TIMEOUT` | No | `30` | HTTP max request time (seconds). |
| `INGO_HTTP_DOWNLOAD_TIMEOUT` | No | `120` | HTTP max time for document downloads (seconds). |
| `INGO_HTTP_RETRY_ATTEMPTS` | No | `2` | Number of retry attempts for retriable failures. |
| `INGO_HTTP_RETRY_BACKOFF_MIN` | No | `1` | Initial retry backoff (seconds). |
| `INGO_HTTP_RETRY_BACKOFF_MAX` | No | `8` | Maximum retry backoff (seconds). |
| `INGO_HTTP_RETRY_BACKOFF_FACTOR` | No | `2` | Backoff multiplier between retries. |
| `INGO_HTTP_RETRY_AFTER_MAX` | No | `INGO_HTTP_RETRY_BACKOFF_MAX` | Maximum accepted `Retry-After` sleep (seconds). |

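
The retry settings above imply a capped exponential backoff. The sketch below prints the delay before each retry under that reading; it is an illustration rather than the wrapper's actual code, `backoff_schedule` is a hypothetical name, and it assumes whole-second integer values.

```bash
# Hypothetical helper: print one delay (seconds) per retry attempt,
# multiplying by the factor each time and capping at the maximum.
backoff_schedule() {
  local attempts="${INGO_HTTP_RETRY_ATTEMPTS:-2}"
  local delay="${INGO_HTTP_RETRY_BACKOFF_MIN:-1}"
  local max="${INGO_HTTP_RETRY_BACKOFF_MAX:-8}"
  local factor="${INGO_HTTP_RETRY_BACKOFF_FACTOR:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if [ "$delay" -gt "$max" ]; then
      delay="$max"
    fi
    echo "$delay"
    delay=$(( delay * factor ))
    i=$(( i + 1 ))
  done
}
```

With the defaults this yields delays of 1 and 2 seconds; raising `INGO_HTTP_RETRY_ATTEMPTS` to 5 yields 1, 2, 4, 8, 8. The timeout variables would plausibly map onto curl's `--connect-timeout` and `--max-time` options, though that mapping is an assumption here.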
Legacy compatibility:
- `INGO_HTTP_RETRY_MAX` is still accepted and mapped to `INGO_HTTP_RETRY_ATTEMPTS`.
- `INGO_HTTP_RETRY_BACKOFF` is still accepted and mapped to `INGO_HTTP_RETRY_BACKOFF_MIN`.
- Runtime exports both legacy names for compatibility with older scripts/tests.
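
One way to implement the legacy mapping above is with nested parameter defaults. This is a sketch, not the contents of `lib/env.sh`: the function name `resolve_retry_env` is hypothetical, and the precedence (new name wins when both are set) is an assumption.

```bash
# Hypothetical helper: resolve new names from legacy ones, then export both.
resolve_retry_env() {
  # Use the new variable if set, else the legacy one, else the documented default.
  : "${INGO_HTTP_RETRY_ATTEMPTS:=${INGO_HTTP_RETRY_MAX:-2}}"
  : "${INGO_HTTP_RETRY_BACKOFF_MIN:=${INGO_HTTP_RETRY_BACKOFF:-1}}"
  # Mirror the resolved values back to the legacy names for older scripts.
  INGO_HTTP_RETRY_MAX="$INGO_HTTP_RETRY_ATTEMPTS"
  INGO_HTTP_RETRY_BACKOFF="$INGO_HTTP_RETRY_BACKOFF_MIN"
  export INGO_HTTP_RETRY_ATTEMPTS INGO_HTTP_RETRY_BACKOFF_MIN \
         INGO_HTTP_RETRY_MAX INGO_HTTP_RETRY_BACKOFF
}
```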

    data/
      ingest/     # optional local inbox if you set INGO_INBOX=data/ingest
      raw/        # OCR text + source metadata (*.txt, *.meta)
      chunks/     # chunk files (*.jsonl) and embed markers (*.jsonl.embedded)
      rejected/   # OCR text rejected by relevance gate or empty-content check

Default inbox is $HOME/ingest unless overridden by INGO_INBOX or command --dir.
- Relevance behavior:
  - `chunk --strict` (or default `INGO_RELEVANCE_MODE=strict`) rejects low-signal text into `INGO_REJECTED_DIR`.
  - `chunk --no-strict` disables relevance filtering for that run.
- Marker behavior:
  - `embed` skips files when chunk hash + namespace match an existing `.jsonl.embedded` marker.
  - `embed --force` bypasses marker skip and re-upserts.
- Query output shape:
  - JSON object with `match_count` and `matches[]`.
  - Each match includes normalized fields such as `id`, `score`, `text`, `source`, `section`, `article`, `page`, `date_indexed`.
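
The strict relevance gate described above can be approximated as follows. Treat this as a sketch under assumptions: it counts how many distinct terms from `INGO_RELEVANCE_TERMS` appear at least once (case-insensitive substring match), which may differ from how `chunk` actually scores text, it does no accent normalization, and the function names are hypothetical.

```bash
# Hypothetical helper: count distinct relevance terms found in a text file.
count_term_matches() {
  local file="$1" term count=0
  local terms="${INGO_RELEVANCE_TERMS:-ambiental,licencia,vertimiento,emision,resolucion,decreto,articulo,autoridad,ministerio,agua,suelo,aire}"
  local IFS=','   # split the term list on commas
  for term in $terms; do
    if grep -qiF -- "$term" "$file"; then
      count=$(( count + 1 ))
    fi
  done
  echo "$count"
}

# Hypothetical helper: succeed when the file meets the minimum term count.
passes_relevance_gate() {
  [ "$(count_term_matches "$1")" -ge "${INGO_MIN_TERM_MATCHES:-2}" ]
}
```

A file that fails the gate would then be moved to `INGO_REJECTED_DIR` instead of being chunked.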
Run all test scripts:

    for f in tests/*_test.sh; do bash "$f"; done

Key coverage areas:
- `query_top_k_validation_test.sh`: validates `--top-k` acceptance/rejection rules.
- `doctor_lang_check_test.sh`: validates exact OCR language detection behavior.
- `http_wrapper_test.sh`: validates timeout/retry/backoff + wrapper integration in fetch/embed/query.
- `embed_marker_rerun_test.sh`: validates marker skip/re-embed/force behavior.
- `path_key_collision_test.sh`: validates collision-safe artifact keys and meta lookup fidelity.

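
The one-line loop runs every script but does not report which ones failed, and its exit status reflects only the last script. A slightly longer variant, offered here as an optional sketch (the function name `run_tests` is hypothetical and not part of the repository), records failures and returns non-zero if any script fails:

```bash
# Hypothetical helper: run every *_test.sh in a directory and report results.
# Returns 0 only when all scripts exit successfully.
run_tests() {
  local dir="${1:-tests}" f failed=0
  for f in "$dir"/*_test.sh; do
    [ -e "$f" ] || continue   # no matching files: the glob stays literal
    if bash "$f"; then
      echo "PASS $f"
    else
      echo "FAIL $f"
      failed=$(( failed + 1 ))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

Calling `run_tests tests` from the repository root gives a per-script summary that is easier to use in CI.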
- `missing env var: UPSTASH_VECTOR_REST_URL` or `UPSTASH_VECTOR_REST_TOKEN`:
  - Define both in `.env` or shell environment.
- `missing OCR language data for INGO_LANG=spa`:
  - Install Tesseract language data for `INGO_LANG` and verify with `tesseract --list-langs`.
- `invalid --top-k: must be a positive integer`:
  - Use an integer greater than zero (for example `--top-k 8`).
- `ingest commands are disabled on this device (INGO_ROLE=query)`:
  - Expected in query-only mode; use `INGO_ROLE=all` for ingest commands.

See IMPROVEMENTS.md for ongoing improvement notes. See docs/corpus-fetch-workflow.md for the crawl-enabled fetch workflow. See docs/document-only-crawl.md for document-only vector-ready crawl behavior.