A robust, multi-strategy document intelligence system designed for precise triage, extraction, and semantic chunking of diverse PDF documents.
It intelligently routes documents to the most cost-effective extraction strategy and provides an agentic RAG system with mathematically verifiable provenance.
- Intelligent Triage: Automatically detects document types (Financial, Legal, Survey) and Layout Complexity.
- Multi-Strategy Extraction: Routes to Strategy A (pdfplumber), B (docling), or C (Vision VLMs) based on triage.
- PageIndex Navigation: High-level hierarchical indexing of document sections for efficient retrieval.
- Agentic Query Interface: A LangGraph agent that routes queries to either structured SQL (Quantitative) or semantic Vector (Conceptual) backends.
- Audit Mode: Third-party verification of agent claims using LLM-based auditing and deterministic fallback checks.
- Provenance Enforcement: Every answer is strictly cited with
document_name,page_number, andbounding_boxes.
- Triage Agent: Determines the origin (Digital vs. Scanned) and Domain.
- Extraction Logic: Implements 5 chunking rules (Text, Table, List, Figure, Header) via
ChunkingEngine. - Indexing Engine: Builds the
PageIndextree and populatesRefineryVectorStore(ChromaDB) andFactTableStore(SQLite). - Query Agent: Orchestrates retrieval and synthesis with mandatory source verification.
- Python 3.11+
- Virtual Environment (e.g.,
.venv)
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .Create a .env file with your OpenRouter credentials:
OPENROUTER_API_KEY="your-sk-or-v1-key-here"Run the orchestrator to process your PDFs and create LDUs:
PYTHONPATH=src python scripts/run_phase2.pyInteractive manual testing via the terminal:
# Ask a conceptual question (Vector)
PYTHONPATH=src python scripts/ask.py "Who prepared the report?"
# Ask a quantitative question (SQL)
PYTHONPATH=src python scripts/ask.py "What is the report_year?"Run the end-to-end operational test and guardrail checks:
# Core operational test (SQL + Vector + Audit)
PYTHONPATH=src python scripts/test_phase4_operational.py
# Rubric Compliance: Anti-Hallucination Guardrail
PYTHONPATH=src pytest tests/agents/test_provenance_enforcement.py -v
# Rubric Compliance: SQL Injection Protection
PYTHONPATH=src pytest tests/indexing/test_fact_table_queries.py -vThe application generates the following deterministic artifacts in the .refinery/ directory:
.refinery/extraction_ledger.jsonl: Full performance and confidence logs..refinery/pageindex/: JSON representation of document hierarchies..refinery/extracted/: Structured JSON output of raw extractions..refinery/facts.db: Persistent SQLite store for quantitative data..refinery/ldus_sample.json: Exported Logical Document Units for rubric validation.