Document Intelligence Refinery

A robust, multi-strategy document intelligence system designed for precise triage, extraction, and semantic chunking of diverse PDF documents.

It routes each document to the most cost-effective extraction strategy and provides an agentic RAG system whose answers carry verifiable provenance: every claim is cited by document name, page number, and bounding boxes.

🚀 Key Features

Intelligent Triage: Automatically detects document types (Financial, Legal, Survey) and Layout Complexity.
Multi-Strategy Extraction: Routes to Strategy A (pdfplumber), B (docling), or C (Vision VLMs) based on triage.
PageIndex Navigation: High-level hierarchical indexing of document sections for efficient retrieval.
Agentic Query Interface: A LangGraph agent that routes queries to either structured SQL (Quantitative) or semantic Vector (Conceptual) backends.
Audit Mode: Third-party verification of agent claims using LLM-based auditing and deterministic fallback checks.
Provenance Enforcement: Every answer is strictly cited with document_name, page_number, and bounding_boxes.
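
As a rough illustration of what provenance enforcement could look like, the sketch below rejects any answer that lacks a complete citation. The field names document_name, page_number, and bounding_boxes come from the feature list above; the Citation and Answer classes, the enforce_provenance function, and the (x0, y0, x1, y1) box layout are assumptions made for this example, not the project's actual API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    """One provenance record (hypothetical class; field names follow the README)."""
    document_name: str
    page_number: int
    bounding_boxes: tuple  # assumed layout: a tuple of (x0, y0, x1, y1) boxes


@dataclass
class Answer:
    """An answer plus the citations that support it (hypothetical class)."""
    text: str
    citations: list = field(default_factory=list)


def enforce_provenance(answer: Answer) -> Answer:
    """Return the answer unchanged, or raise if any citation is missing or incomplete."""
    if not answer.citations:
        raise ValueError("answer has no citations")
    for c in answer.citations:
        if not c.document_name or c.page_number < 1 or not c.bounding_boxes:
            raise ValueError(f"incomplete citation: {c}")
    return answer
```

Raising instead of returning a flag means an uncited answer cannot reach the user by accident; the real guardrail may behave differently.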

🛠️ Architecture

Triage Agent: Determines the origin (Digital vs. Scanned) and Domain.
Extraction Logic: Implements 5 chunking rules (Text, Table, List, Figure, Header) via ChunkingEngine.
Indexing Engine: Builds the PageIndex tree and populates RefineryVectorStore (ChromaDB) and FactTableStore (SQLite).
Query Agent: Orchestrates retrieval and synthesis with mandatory source verification.
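
The hand-off from triage to extraction can be pictured with the minimal sketch below. The three strategies and the Digital vs. Scanned distinction come from this README; the TriageResult class, the "low"/"high" complexity labels, and the exact routing rules are assumptions for illustration only, and the real router may weigh cost and confidence differently.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    A = "pdfplumber"   # cheapest: plain digital text
    B = "docling"      # layout-aware extraction
    C = "vision_vlm"   # most expensive: vision models


@dataclass
class TriageResult:
    """Hypothetical triage output; the labels below are assumed, not the project's."""
    origin: str             # "digital" or "scanned"
    layout_complexity: str  # "low" or "high"


def choose_strategy(triage: TriageResult) -> Strategy:
    """Pick the cheapest strategy that is assumed able to handle the document."""
    if triage.origin == "scanned":
        return Strategy.C  # no text layer, so fall back to vision
    if triage.layout_complexity == "high":
        return Strategy.B  # digital, but tables or columns need layout analysis
    return Strategy.A
```

For example, choose_strategy(TriageResult("digital", "low")) returns Strategy.A under these assumed rules.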

📦 Setup Instructions

1. Prerequisites

Python 3.11+
Virtual Environment (e.g., .venv)

2. Installation

python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

3. Environment Configuration

Create a .env file with your OpenRouter credentials:

OPENROUTER_API_KEY="your-sk-or-v1-key-here"
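
To fail fast when the key is missing, a small check such as the one below can run before any model call. It only assumes the variable name shown above; the get_openrouter_key helper is hypothetical, and whether the project loads .env automatically is not stated here, so the sketch reads from the process environment.

```python
import os


def get_openrouter_key(env=os.environ) -> str:
    """Return the OpenRouter key, or raise a clear error if it is missing."""
    # Strip whitespace and surrounding quotes in case the raw .env value is passed through.
    key = env.get("OPENROUTER_API_KEY", "").strip().strip('"')
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set; add it to .env or export it")
    return key
```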

🎮 How to Test

1. Extraction Pipeline (Phase 2 & 3)

Run the orchestrator to process your PDFs and create Logical Document Units (LDUs):

PYTHONPATH=src python scripts/run_phase2.py

2. Manual Query Interface (Phase 4 & 5)

Query the agent manually from the terminal:

# Ask a conceptual question (Vector)
PYTHONPATH=src python scripts/ask.py "Who prepared the report?"

# Ask a quantitative question (SQL)
PYTHONPATH=src python scripts/ask.py "What is the report_year?"

3. Automated Verification

Run the end-to-end operational test and guardrail checks:

# Core operational test (SQL + Vector + Audit)
PYTHONPATH=src python scripts/test_phase4_operational.py

# Rubric Compliance: Anti-Hallucination Guardrail
PYTHONPATH=src pytest tests/agents/test_provenance_enforcement.py -v

# Rubric Compliance: SQL Injection Protection
PYTHONPATH=src pytest tests/indexing/test_fact_table_queries.py -v
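
For context on the SQL injection check, the sketch below shows the standard defense with Python's sqlite3 module: bound parameters instead of string formatting. The facts table, its columns, the lookup_fact helper, and the sample row are made up for this example and do not describe the actual FactTableStore schema.

```python
import sqlite3


def lookup_fact(conn, document_name, fact_key):
    """Look up values using bound parameters, so user input is never parsed as SQL."""
    cur = conn.execute(
        "SELECT value FROM facts WHERE document_name = ? AND fact_key = ?",
        (document_name, fact_key),
    )
    return [row[0] for row in cur.fetchall()]


# In-memory demo with an assumed schema and an invented row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (document_name TEXT, fact_key TEXT, value TEXT)")
conn.execute("INSERT INTO facts VALUES (?, ?, ?)", ("report.pdf", "report_year", "2023"))

# A hostile key is treated as a literal string and simply matches nothing.
hostile = "report_year'; DROP TABLE facts; --"
print(lookup_fact(conn, "report.pdf", "report_year"))
print(lookup_fact(conn, "report.pdf", hostile))
```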

📂 Artifacts

The application generates the following deterministic artifacts in the .refinery/ directory:

.refinery/extraction_ledger.jsonl: Full performance and confidence logs.
.refinery/pageindex/: JSON representation of document hierarchies.
.refinery/extracted/: Structured JSON output of raw extractions.
.refinery/facts.db: Persistent SQLite store for quantitative data.
.refinery/ldus_sample.json: Exported Logical Document Units for rubric validation.
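
Because the ledger is a .jsonl file, it can be inspected one JSON object per line. The sketch below makes no assumption about the fields inside each entry; the read_ledger helper is hypothetical and not part of the project.

```python
import json
from pathlib import Path


def read_ledger(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    entries = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries
```

Calling read_ledger(".refinery/extraction_ledger.jsonl") after an extraction run would return one dict per logged entry.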
