A production-style Retrieval Augmented Generation (RAG) pipeline built using local LLMs and open-source tools.
This project extends a basic RAG implementation into the kind of architecture used in modern production AI systems by adding:
- Query rewriting
- Hybrid search (vector + keyword)
- Metadata filtering
- Neural reranking
- Context compression
The system runs entirely locally using Ollama and demonstrates how modern GenAI retrieval pipelines are built.
The pipeline follows a typical production RAG workflow:
User Query
↓
Query Rewrite (LLM)
↓
Hybrid Retrieval
├─ Vector Search (FAISS)
└─ Keyword Search (BM25)
↓
Metadata Filtering
↓
Cross-Encoder Reranking
↓
Context Compression
↓
LLM Answer Generation
This design improves retrieval quality, answer accuracy, and system scalability.
User queries are rewritten using an LLM to improve document retrieval.
Example:
User: explain redis
↓
Rewrite: redis caching architecture and use cases
This improves retrieval accuracy when queries are vague or incomplete.
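A minimal sketch of the rewrite step, assuming the ollama Python client and a locally pulled model (the model name below is a placeholder; the real one lives in app/config.py):

```python
import ollama  # assumes the ollama Python client and a running Ollama server

REWRITE_PROMPT = (
    "Rewrite the user query into a short, keyword-rich search query for "
    "document retrieval. Return only the rewritten query.\n\nQuery: {query}"
)

def rewrite_query(query: str, model: str = "llama3") -> str:
    # "llama3" is a placeholder; use whichever model is configured locally
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(query=query)}],
    )
    return response["message"]["content"].strip()

# rewrite_query("explain redis")
# -> something like "redis caching architecture and use cases"
```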
The system combines:
- Vector search (FAISS) for semantic similarity
- BM25 keyword search for exact matches
This hybrid approach captures both semantic meaning and exact keywords, which typically improves retrieval quality over either method alone.
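A simplified illustration of the score fusion, assuming rank_bm25, FAISS, and a Sentence Transformers encoder; the tiny corpus, model name, and 50/50 weighting are placeholders:

```python
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# Toy corpus for illustration only
docs = [
    "Redis is an in-memory data store used for caching and session storage.",
    "PostgreSQL is a relational database with strong consistency guarantees.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product == cosine on unit vectors
index.add(doc_vecs)

bm25 = BM25Okapi([doc.lower().split() for doc in docs])

def hybrid_search(query: str, k: int = 2, alpha: float = 0.5) -> list[str]:
    # Vector scores, re-aligned to corpus order
    q_vec = encoder.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q_vec, len(docs))
    vec_scores = np.zeros(len(docs))
    vec_scores[ids[0]] = scores[0]
    # Keyword scores from BM25
    kw_scores = np.asarray(bm25.get_scores(query.lower().split()))

    # Min-max normalize both score lists so they are comparable, then blend
    def norm(s: np.ndarray) -> np.ndarray:
        return (s - s.min()) / (s.max() - s.min() + 1e-9)

    combined = alpha * norm(vec_scores) + (1 - alpha) * norm(kw_scores)
    return [docs[i] for i in np.argsort(combined)[::-1][:k]]
```

Min-max normalization puts the two score scales on common ground before blending; reciprocal rank fusion (listed under future upgrades) is a common alternative that avoids score calibration entirely.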
Documents contain metadata fields such as:
- topic
- source
- tags
Filtering allows retrieval from specific document groups such as:
topic = "database"
source = "docs"
This is common in enterprise knowledge bases and internal search systems.
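A minimal filtering sketch; the document schema here (a dict with "text" and "metadata" keys) is an assumption for illustration:

```python
# Keep only documents whose metadata matches every requested key/value pair.
def apply_metadata_filter(documents: list[dict], filters: dict | None) -> list[dict]:
    if not filters:
        return documents
    return [
        doc for doc in documents
        if all(doc["metadata"].get(key) == value for key, value in filters.items())
    ]

docs = [
    {"text": "Redis caching patterns", "metadata": {"topic": "database", "source": "docs"}},
    {"text": "Team onboarding guide", "metadata": {"topic": "hr", "source": "wiki"}},
]

apply_metadata_filter(docs, {"topic": "database", "source": "docs"})
# -> only the Redis document remains
```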
Retrieved documents are reranked using a cross-encoder model:
cross-encoder/ms-marco-MiniLM-L-6-v2
This step evaluates the relevance between the query and each document and selects the most relevant context.
Reranking dramatically improves answer quality compared to raw vector search.
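A short sketch of the reranking step using the sentence-transformers CrossEncoder wrapper; the top_k value and candidate list are illustrative:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    # The cross-encoder scores each (query, document) pair jointly instead of
    # comparing independently computed embeddings.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

Because the cross-encoder reads the query and document together, it judges relevance more precisely than bi-encoder similarity, at the cost of one forward pass per candidate, which is why it runs only on the shortlist produced by hybrid retrieval.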
The retrieved documents are compressed using an LLM before answering.
Benefits:
- reduces token usage
- removes irrelevant information
- improves answer precision
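A minimal sketch of the compression step, again assuming the ollama Python client (prompt wording and model name are placeholders):

```python
import ollama  # assumes a running Ollama server with a local model pulled

COMPRESS_PROMPT = (
    "Extract only the sentences from the context that help answer the question. "
    "Do not add any new information.\n\nQuestion: {query}\n\nContext:\n{context}"
)

def compress_context(query: str, documents: list[str], model: str = "llama3") -> str:
    context = "\n\n".join(documents)
    response = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": COMPRESS_PROMPT.format(query=query, context=context),
        }],
    )
    return response["message"]["content"].strip()
```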
production_rag
│
├── app
│ ├── config.py
│ │
│ ├── ingestion
│ │ └── load_docs.py
│ │
│ ├── embeddings
│ │ └── embedder.py
│ │
│ ├── retrieval
│ │ ├── vector_store.py
│ │ ├── hybrid_search.py
│ │ ├── reranker.py
│ │ └── metadata_filter.py
│ │
│ ├── pipeline
│ │ ├── query_rewrite.py
│ │ ├── context_compressor.py
│ │ └── rag_pipeline.py
│ │
│ └── llm
│ └── ollama_client.py
│
├── data
│ └── docs.txt
│
├── main.py
└── requirements.txt
The architecture is modular so components can be swapped easily (vector DB, embedding models, LLMs, etc.).
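For example, retrieval backends could be typed against a small interface so FAISS can be swapped for another vector store; the Protocol below is purely illustrative, not the repo's actual interface:

```python
from typing import Protocol

class Retriever(Protocol):
    """Minimal retriever interface any vector store wrapper could implement."""
    def search(self, query: str, k: int = 5) -> list[str]: ...

def retrieve(retriever: Retriever, query: str) -> list[str]:
    # The rest of the pipeline depends only on this interface, so FAISS,
    # Chroma, or a remote vector DB client can be dropped in unchanged.
    return retriever.search(query, k=5)
```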
Clone the repository:
git clone https://github.com/yourusername/production-rag-system.git
cd production-rag-system
Create a virtual environment:
python -m venv venv
source venv/bin/activate    # on Windows: venv\Scripts\activate
Install dependencies:
pip install -r requirements.txt
Start the local assistant (make sure Ollama is installed and running, and that the model configured in app/config.py has been pulled):
python main.py
Example interaction:
You: What is Redis used for?
Assistant: Redis is an in-memory data store commonly used for caching, session storage, and real-time analytics.
- Python
- Ollama (local LLM runtime)
- Sentence Transformers
- FAISS (vector search)
- BM25 (keyword search)
- Cross-encoder reranking
This project was built to understand how real-world RAG systems are designed, including:
- retrieval pipelines
- hybrid search strategies
- reranking models
- modular AI system architecture
Future upgrades could include:
- multi-query retrieval
- reciprocal rank fusion
- document chunk scoring
- retrieval caching
- evaluation metrics
- FastAPI inference server
MIT License