A production-style Retrieval Augmented Generation (RAG) pipeline built using local LLMs and open-source tools.
This project extends a basic RAG implementation into the kind of architecture used in modern production AI systems by adding:
- Query rewriting
- Hybrid search (vector + keyword)
- Metadata filtering
- Neural reranking
- Context compression
The system runs entirely locally using Ollama and demonstrates how modern GenAI retrieval pipelines are built.
The pipeline follows a typical production RAG workflow:
User Query
↓
Query Rewrite (LLM)
↓
Hybrid Retrieval
├─ Vector Search (FAISS)
└─ Keyword Search (BM25)
↓
Metadata Filtering
↓
Cross-Encoder Reranking
↓
Context Compression
↓
LLM Answer Generation
This design improves retrieval quality, answer accuracy, and system scalability.
User queries are rewritten using an LLM to improve document retrieval.
Example:
User: explain redis
↓
Rewrite: redis caching architecture and use cases
This improves retrieval accuracy when queries are vague or incomplete.
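A minimal sketch of the rewrite step, assuming the ollama Python client and a locally pulled model (the model name below is a placeholder; the real one lives in app/config.py):

```python
import ollama  # assumes the ollama Python client and a running Ollama server

REWRITE_PROMPT = (
    "Rewrite the user query into a short, keyword-rich search query for "
    "document retrieval. Return only the rewritten query.\n\nQuery: {query}"
)

def rewrite_query(query: str, model: str = "llama3") -> str:
    # "llama3" is a placeholder; use whichever model is configured locally
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(query=query)}],
    )
    return response["message"]["content"].strip()

# rewrite_query("explain redis")
# -> something like "redis caching architecture and use cases"
```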
The system combines:
- Vector search (FAISS) for semantic similarity
- BM25 keyword search for exact matches
This hybrid approach captures both semantic meaning and exact keywords, which typically improves retrieval quality over either method alone.
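A simplified illustration of the score fusion, assuming rank_bm25, FAISS, and a Sentence Transformers encoder; the tiny corpus, model name, and 50/50 weighting are placeholders:

```python
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# Toy corpus for illustration only
docs = [
    "Redis is an in-memory data store used for caching and session storage.",
    "PostgreSQL is a relational database with strong consistency guarantees.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product == cosine on unit vectors
index.add(doc_vecs)

bm25 = BM25Okapi([doc.lower().split() for doc in docs])

def hybrid_search(query: str, k: int = 2, alpha: float = 0.5) -> list[str]:
    # Vector scores, re-aligned to corpus order
    q_vec = encoder.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q_vec, len(docs))
    vec_scores = np.zeros(len(docs))
    vec_scores[ids[0]] = scores[0]
    # Keyword scores from BM25
    kw_scores = np.asarray(bm25.get_scores(query.lower().split()))

    # Min-max normalize both score lists so they are comparable, then blend
    def norm(s: np.ndarray) -> np.ndarray:
        return (s - s.min()) / (s.max() - s.min() + 1e-9)

    combined = alpha * norm(vec_scores) + (1 - alpha) * norm(kw_scores)
    return [docs[i] for i in np.argsort(combined)[::-1][:k]]
```

Min-max normalization puts the two score scales on common ground before blending; reciprocal rank fusion (listed under future upgrades) is a common alternative that avoids score calibration entirely.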
Documents contain metadata fields such as:
- topic
- source
- tags
Filtering allows retrieval from specific document groups such as:
topic = "database"
source = "docs"
This is common in enterprise knowledge bases and internal search systems.
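A minimal filtering sketch; the document schema here (a dict with "text" and "metadata" keys) is an assumption for illustration:

```python
# Keep only documents whose metadata matches every requested key/value pair.
def apply_metadata_filter(documents: list[dict], filters: dict | None) -> list[dict]:
    if not filters:
        return documents
    return [
        doc for doc in documents
        if all(doc["metadata"].get(key) == value for key, value in filters.items())
    ]

docs = [
    {"text": "Redis caching patterns", "metadata": {"topic": "database", "source": "docs"}},
    {"text": "Team onboarding guide", "metadata": {"topic": "hr", "source": "wiki"}},
]

apply_metadata_filter(docs, {"topic": "database", "source": "docs"})
# -> only the Redis document remains
```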
Retrieved documents are reranked using a cross-encoder model:
cross-encoder/ms-marco-MiniLM-L-6-v2
This step evaluates the relevance between the query and each document and selects the most relevant context.
Reranking dramatically improves answer quality compared to raw vector search.
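A short sketch of the reranking step using the sentence-transformers CrossEncoder wrapper; the top_k value and candidate list are illustrative:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    # The cross-encoder scores each (query, document) pair jointly instead of
    # comparing independently computed embeddings.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

Because the cross-encoder reads the query and document together, it judges relevance more precisely than bi-encoder similarity, at the cost of one forward pass per candidate, which is why it runs only on the shortlist produced by hybrid retrieval.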
The retrieved documents are compressed using an LLM before answering.
Benefits:
- reduces token usage
- removes irrelevant information
- improves answer precision
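A minimal sketch of the compression step, again assuming the ollama Python client (prompt wording and model name are placeholders):

```python
import ollama  # assumes a running Ollama server with a local model pulled

COMPRESS_PROMPT = (
    "Extract only the sentences from the context that help answer the question. "
    "Do not add any new information.\n\nQuestion: {query}\n\nContext:\n{context}"
)

def compress_context(query: str, documents: list[str], model: str = "llama3") -> str:
    context = "\n\n".join(documents)
    response = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": COMPRESS_PROMPT.format(query=query, context=context),
        }],
    )
    return response["message"]["content"].strip()
```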
production_rag
│
├── app
│ ├── config.py
│ │
│ ├── ingestion
│ │ └── load_docs.py
│ │
│ ├── embeddings
│ │ └── embedder.py
│ │
│ ├── retrieval
│ │ ├── vector_store.py
│ │ ├── hybrid_search.py
│ │ ├── reranker.py
│ │ └── metadata_filter.py
│ │
│ ├── pipeline
│ │ ├── query_rewrite.py
│ │ ├── context_compressor.py
│ │ └── rag_pipeline.py
│ │
│ └── llm
│ └── ollama_client.py
│
├── data
│ └── docs.txt
│
├── main.py
└── requirements.txt
The architecture is modular so components can be swapped easily (vector DB, embedding models, LLMs, etc.).
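For example, retrieval backends could be typed against a small interface so FAISS can be swapped for another vector store; the Protocol below is purely illustrative, not the repo's actual interface:

```python
from typing import Protocol

class Retriever(Protocol):
    """Minimal retriever interface any vector store wrapper could implement."""
    def search(self, query: str, k: int = 5) -> list[str]: ...

def retrieve(retriever: Retriever, query: str) -> list[str]:
    # The rest of the pipeline depends only on this interface, so FAISS,
    # Chroma, or a remote vector DB client can be dropped in unchanged.
    return retriever.search(query, k=5)
```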
Clone the repository:
git clone https://github.com/yourusername/production-rag-system.git
cd production-rag-system
Create a virtual environment:
python -m venv venv
source venv/bin/activate    # on Windows: venv\Scripts\activate
Install dependencies:
pip install -r requirements.txt
Start the local assistant (make sure Ollama is installed and running, and that the model configured in app/config.py has been pulled):
python main.py
Example interaction:
You: What is Redis used for?
Assistant: Redis is an in-memory data store commonly used for caching, session storage, and real-time analytics.
- Python
- Ollama (local LLM runtime)
- Sentence Transformers
- FAISS (vector search)
- BM25 (keyword search)
- Cross-encoder reranking
This project was built to understand how real-world RAG systems are designed, including:
- retrieval pipelines
- hybrid search strategies
- reranking models
- modular AI system architecture
Future upgrades could include:
- multi-query retrieval
- reciprocal rank fusion
- document chunk scoring
- retrieval caching
- evaluation metrics
- FastAPI inference server
MIT License