A Corrective RAG pipeline that routes a question, retrieves supporting context from a local vector store, optionally falls back to web search, and runs graders (relevance + grounding + answer-quality) before returning an answer.
Principle: if an answer can’t be supported by evidence, the system should search / retry / refuse, rather than hallucinate.
Given an input question, the LangGraph state machine follows this pattern:
- Route the question (VectorStore/RAG vs Web Search) using an LLM router.
- Retrieve documents from a local Chroma vector store.
- Grade retrieved docs for relevance.
- If evidence is insufficient, run a Web Search (Tavily) and append the results as an evidence Document.
- Generate an answer from the current evidence set.
- Grade the generation:
  - Is it grounded in the provided documents?
  - Does it answer the question?
You can see the execution trace in the console logs (e.g. `---ROUTE QUESTION---`, `---RETRIEVE---`, `---WEB SEARCH---`, `---CHECK HALLUCINATIONS---`, etc.).
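The corrective loop above can be sketched in plain Python, with the LLM router and graders stubbed out as callables. The function and parameter names here are illustrative, not the actual node implementations in `graph/nodes/`:

```python
def corrective_rag(question, route, retrieve, grade_docs, web_search,
                   generate, grade_generation):
    """Sketch of the corrective control flow: route -> retrieve -> grade ->
    (web fallback) -> generate -> grade generation."""
    docs = []
    if route(question) == "vectorstore":
        # Keep only documents the relevance grader accepts.
        docs = [d for d in retrieve(question) if grade_docs(question, d)]
    if not docs:
        # Insufficient evidence: fall back to web search.
        docs = [web_search(question)]
    answer = generate(question, docs)
    grounded, answers = grade_generation(answer, docs, question)
    if grounded and answers:
        return answer
    return None  # caller may retry or refuse rather than hallucinate

# Toy wiring: everything is routed to web search, graders always pass.
answer = corrective_rag(
    "What is CRAG?",
    route=lambda q: "websearch",
    retrieve=lambda q: [],
    grade_docs=lambda q, d: True,
    web_search=lambda q: "CRAG augments RAG with retrieval grading.",
    generate=lambda q, docs: docs[0],
    grade_generation=lambda a, docs, q: (True, True),
)
print(answer)  # -> "CRAG augments RAG with retrieval grading."
```

In the real project this wiring lives in `graph/graph.py` as LangGraph nodes and conditional edges rather than direct function calls.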
Based on the current project layout shown in this repo listing:
- `main.py`: demo entrypoint (loads env + calls `app.invoke({"question": ...})`).
- `ingestion.py`: defines ingestion + vector store setup and exposes a `retriever` used by the graph. The local Chroma persistence directory is `./.chroma/` (gitignored).
- `graph/graph.py`: builds and compiles the LangGraph app. ⚠️ Note: the current graph builder also exports a Mermaid PNG via `app.get_graph().draw_mermaid_png(output_file_path="graph.png")`.
- `graph/state.py`: the graph state schema.
- `graph/node_constants.py`: node name constants (`RETRIEVE`, `GRADE_DOCUMENTS`, `WEBSEARCH`, `GENERATE`, …).
- `graph/nodes/`: node implementations (`retrieve`, `grade_documents`, `web_search`, `generate`).
- `graph/chains/`: prompt + LLM "chains" and graders (router, relevance grader, hallucination/grounding grader, answer grader).
- `graph.png`: exported graph diagram (generated by the graph builder).
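Each node under `graph/nodes/` follows the usual LangGraph convention: a function that takes the current state and returns a partial state update, which LangGraph merges back in. A hypothetical `retrieve` node might look like this (field names assumed, retrieval stubbed):

```python
from typing import Any

def retrieve(state: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical node shape: read from state, return a partial update.
    The real node would call retriever.invoke(question); stubbed here."""
    question = state["question"]
    documents = [f"doc about {question}"]
    return {"documents": documents, "question": question}

# LangGraph merges the returned dict into the graph state.
new_state = {**{"question": "agents"}, **retrieve({"question": "agents"})}
print(new_state["documents"])  # -> ['doc about agents']
```

The actual state fields are defined in `graph/state.py`; this sketch just shows the state-in, partial-update-out contract.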
- Python 3.11+
- Environment variables:
  - `OPENAI_API_KEY` (LLM + embeddings)
  - `TAVILY_API_KEY` (web search fallback)
  - `USER_AGENT` (optional but recommended; silences some HTTP warnings)
- Optional (observability / LangSmith): `LANGCHAIN_TRACING_V2`, `LANGCHAIN_PROJECT`, `LANGCHAIN_API_KEY`
```bash
python -m venv .venv
source .venv/bin/activate
```

If you have requirements.txt:

```bash
pip install -r requirements.txt
```

Otherwise (minimal set matching current imports):

```bash
pip install -U python-dotenv langgraph langchain-openai langchain-chroma langchain-tavily langchain-community langchain-text-splitters chromadb
```

Create `.env` in the repo root:

```
OPENAI_API_KEY=sk-xxxxxxxx
TAVILY_API_KEY=tvly-xxxxxxxx
USER_AGENT=ozgunes-corrective-rag/1.0
# Optional: LangSmith tracing
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=CorrectiveRAGProject
LANGCHAIN_API_KEY=lsv2_xxxxx
```

A template is included as `.env.example`.
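To fail fast on a missing key before the graph runs, a small check like the following can help. This is an optional convenience, not part of the project; it only uses the variable names listed above:

```python
import os

# Required keys from the environment table above.
REQUIRED = ("OPENAI_API_KEY", "TAVILY_API_KEY")

def missing_env(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [k for k in REQUIRED if not environ.get(k)]

if missing_env():
    print("Missing env vars:", ", ".join(missing_env()))
```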
ingestion.py defines the ingestion pipeline (URLs/loaders → chunking → embeddings → Chroma persist to ./.chroma/).
Depending on your current implementation, the store may be created:
- by running `ingestion.py` directly, if it has a `__main__` entrypoint, or
- on import, when `retriever` is initialized.
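To make the chunking step concrete, here is a naive fixed-window splitter. It is only an illustration of what "chunking" means in the pipeline above; the real `ingestion.py` would use `RecursiveCharacterTextSplitter` from langchain-text-splitters:

```python
def split_text(text: str, chunk_size: int = 250, overlap: int = 50) -> list[str]:
    """Naive sliding-window splitter: fixed-size chunks with a fixed overlap,
    standing in for the real text splitter."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = split_text("a" * 600)
print([len(c) for c in chunks])  # -> [250, 250, 200]
```

Overlap between consecutive chunks helps retrieval: a sentence cut at a chunk boundary still appears whole in the neighboring chunk.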
```bash
python main.py
```

This project has two evidence paths:

- Curated corpus (Chroma): documents indexed by your `ingestion.py`.
- Live web fallback (Tavily): triggered when graded evidence is insufficient. Tavily results are wrapped into a `langchain_core.documents.Document` and appended to the evidence list.
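The fallback's "wrap and append" step looks roughly like this. To stay runnable without LangChain installed, a dataclass stands in for `langchain_core.documents.Document`; the field names mirror the real class, but the helper itself is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def append_web_results(documents: list, tavily_results: list[dict]) -> list:
    """Join Tavily result snippets into one evidence Document and append it,
    mirroring the web_search node described above (result keys assumed)."""
    joined = "\n".join(r["content"] for r in tavily_results)
    documents.append(Document(page_content=joined,
                              metadata={"source": "websearch"}))
    return documents

docs = append_web_results([], [{"content": "snippet one"},
                               {"content": "snippet two"}])
print(docs[0].page_content)  # -> snippet one\nsnippet two
```

Tagging the metadata with the source lets the grounding grader (or a UI) distinguish curated evidence from live web evidence.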
Some models (notably gpt-3.5-turbo) don’t support OpenAI Structured Outputs via json_schema.
LangChain falls back to function_calling. This is safe; the fix is to explicitly set method="function_calling" when you build structured graders.
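A sketch of how such a grader might be built, assuming a Pydantic schema and a `ChatOpenAI` model (the LangChain-specific lines are shown commented so the schema itself stays runnable without an API key):

```python
from pydantic import BaseModel, Field

class GradeDocuments(BaseModel):
    """Schema the relevance grader must fill in."""
    binary_score: str = Field(description="Is the document relevant? 'yes' or 'no'")

# With langchain-openai installed, force the function-calling method so that
# gpt-3.5-turbo (which lacks json_schema structured outputs) still works:
#
#   from langchain_openai import ChatOpenAI
#   llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
#   grader = llm.with_structured_output(GradeDocuments, method="function_calling")

print(GradeDocuments(binary_score="yes").binary_score)  # -> yes
```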
Prefer split packages (already used in your latest code):

```python
from langchain_chroma import Chroma
from langchain_tavily import TavilySearch
```

Set `USER_AGENT` in `.env` to silence the warning.
If your graders return string labels like `"yes"` / `"no"` in `binary_score`, make sure you check explicitly:

```python
if score.binary_score == "yes":
    ...
```

In Python, any non-empty string (including `"no"`) is truthy.
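The pitfall is easy to reproduce; the helper names here are just for illustration:

```python
def is_relevant_naive(binary_score: str) -> bool:
    return bool(binary_score)  # bug: "no" is a non-empty string, so truthy

def is_relevant(binary_score: str) -> bool:
    return binary_score.strip().lower() == "yes"  # explicit comparison

print(is_relevant_naive("no"))  # -> True  (wrong)
print(is_relevant("no"))        # -> False (correct)
print(is_relevant("yes"))       # -> True
```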
- Add a small UI (`streamlit_app.py`) with:
  - chat
  - route badge (RAG vs Web)
  - evidence panel (sources + snippets)
  - grounded / not grounded indicators
- Standardize output schema: `answer`, `route`, `web_search_used`, `grounded`, `answers_question`, `sources[]`
- Add `eval/` with a tiny test set and a report: grounding pass rate + answer-quality pass rate
- Add a license
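The standardized output schema proposed above could take a shape like the following `TypedDict`. The field names come from the roadmap item; everything else (types, the example values) is a sketch, not existing code:

```python
from typing import TypedDict

class RAGResult(TypedDict):
    """One possible shape for the standardized pipeline result."""
    answer: str
    route: str               # "vectorstore" | "websearch" (values assumed)
    web_search_used: bool
    grounded: bool
    answers_question: bool
    sources: list[str]

# Example result a UI or eval harness could consume uniformly.
result: RAGResult = {
    "answer": "…",
    "route": "websearch",
    "web_search_used": True,
    "grounded": True,
    "answers_question": True,
    "sources": ["https://example.com"],
}
print(result["route"])  # -> websearch
```

A fixed schema like this makes the planned `eval/` report straightforward: grounding pass rate is just the mean of `grounded` over the test set.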
