Production RAG System

A production-style Retrieval-Augmented Generation (RAG) pipeline built with local LLMs and open-source tools.

This project upgrades a basic RAG implementation into a real-world architecture used in modern AI systems by adding:

  • Query rewriting
  • Hybrid search (vector + keyword)
  • Metadata filtering
  • Neural reranking
  • Context compression

The system runs entirely locally using Ollama and demonstrates how modern GenAI retrieval pipelines are built.


Architecture

The pipeline follows a typical production RAG workflow:

User Query
   ↓
Query Rewrite (LLM)
   ↓
Hybrid Retrieval
   ├─ Vector Search (FAISS)
   └─ Keyword Search (BM25)
   ↓
Metadata Filtering
   ↓
Cross-Encoder Reranking
   ↓
Context Compression
   ↓
LLM Answer Generation

This design improves retrieval quality, answer accuracy, and system scalability.
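The stage chain above can be sketched as one function where every stage is injected as a callable. The stage names below are illustrative only; the actual project splits them across the app/pipeline and app/retrieval modules.

```python
# Illustrative sketch of the RAG pipeline as a chain of injected stages.
# Every parameter name here is hypothetical, not the project's real API.

def answer(query, rewrite_query, hybrid_retrieve, filter_meta,
           rerank, compress, generate):
    """Run the full pipeline; each stage is passed in as a callable."""
    rewritten = rewrite_query(query)          # Query Rewrite (LLM)
    candidates = hybrid_retrieve(rewritten)   # Vector (FAISS) + Keyword (BM25)
    candidates = filter_meta(candidates)      # Metadata Filtering
    top_docs = rerank(rewritten, candidates)  # Cross-Encoder Reranking
    context = compress(rewritten, top_docs)   # Context Compression
    return generate(rewritten, context)       # LLM Answer Generation
```

Injecting the stages keeps each one independently swappable, which is the modularity goal described below in Project Structure.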


Key Features

Query Rewriting

User queries are rewritten using an LLM to improve document retrieval.

Example:

User: explain redis
↓
Rewrite: redis caching architecture and use cases

This improves retrieval accuracy when queries are vague or incomplete.
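A minimal sketch of the rewrite step, assuming the LLM is exposed as a plain callable (in this project that would be the Ollama client); the prompt wording and function names are illustrative, not the project's exact code.

```python
# Hypothetical sketch of LLM-based query rewriting.
# `llm` stands in for a call to the local Ollama model.

REWRITE_PROMPT = (
    "Rewrite the user query into a specific, keyword-rich search query.\n"
    "Return only the rewritten query.\n\n"
    "Query: {query}"
)

def rewrite_query(query: str, llm=None) -> str:
    """Ask the LLM for a retrieval-friendly rewrite; fall back to the original."""
    prompt = REWRITE_PROMPT.format(query=query)
    rewritten = llm(prompt) if llm else ""
    return rewritten.strip() or query
```

Falling back to the original query keeps retrieval working even when the model returns an empty or unusable rewrite.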


Hybrid Search

The system combines:

  • Vector search (FAISS) for semantic similarity
  • BM25 keyword search for exact matches

This hybrid approach captures both meaning and exact terminology, which typically improves retrieval quality over either method alone.
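One common way to combine the two result sets is weighted score fusion. The sketch below is dependency-free and assumes each retriever returns a dict of document id to raw score; in the project these would come from FAISS and BM25 respectively.

```python
def fuse_scores(vector_hits, bm25_hits, alpha=0.5):
    """Blend min-max-normalized vector and keyword scores per document.

    vector_hits / bm25_hits: dict of doc_id -> raw score.
    alpha weights the vector side; a doc missing from one side scores 0 there.
    Returns (doc_id, fused_score) pairs sorted best-first.
    """
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    v, k = normalize(vector_hits), normalize(bm25_hits)
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0)
             for d in set(v) | set(k)}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)
```

Normalizing first matters because FAISS distances and BM25 scores live on very different scales.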


Metadata Filtering

Documents contain metadata fields such as:

topic
source
tags

Filtering allows retrieval from specific document groups such as:

topic = "database"
source = "docs"

This is common in enterprise knowledge bases and internal search systems.
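The filter itself can be a simple exact-match predicate over the metadata fields. The document shape below (a dict with a "metadata" sub-dict) is an assumption for illustration, not necessarily the project's internal representation.

```python
def filter_by_metadata(docs, **criteria):
    """Keep documents whose metadata matches every given field exactly.

    Each doc is assumed to be a dict with a "metadata" sub-dict,
    e.g. {"text": "...", "metadata": {"topic": "database", "source": "docs"}}.
    """
    return [
        d for d in docs
        if all(d.get("metadata", {}).get(k) == v for k, v in criteria.items())
    ]
```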


Neural Reranking

Retrieved documents are reranked using a cross-encoder model:

cross-encoder/ms-marco-MiniLM-L-6-v2

This step scores each (query, document) pair for relevance and keeps only the most relevant context.

Reranking typically improves answer quality compared to raw vector search alone.
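A sketch of the reranking step with the scorer injected so the example stays dependency-free. In the project, `score_pairs` would wrap the `predict` method of a sentence-transformers `CrossEncoder` loaded with the model named above.

```python
def rerank(query, docs, score_pairs, top_k=3):
    """Score (query, doc) pairs and keep the top_k most relevant docs.

    `score_pairs` takes a list of (query, doc) tuples and returns one
    relevance score per pair -- e.g. a wrapper around
    CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict.
    """
    scores = score_pairs([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

Unlike a bi-encoder, the cross-encoder reads query and document together, which is why this stage runs on a small candidate set rather than the whole corpus.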


Context Compression

The retrieved documents are compressed using an LLM before answering.

Benefits:

  • reduces token usage
  • removes irrelevant information
  • improves answer precision
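A minimal sketch of the compression step, assuming the LLM is again a plain callable; the prompt wording and function names are hypothetical, not the project's exact code.

```python
# Hypothetical sketch of LLM-based context compression.
# `llm` stands in for a call to the local Ollama model.

COMPRESS_PROMPT = (
    "Keep only the sentences relevant to the question below. "
    "Do not add new information.\n\n"
    "Question: {query}\n\nContext:\n{context}"
)

def compress_context(query, docs, llm):
    """Concatenate retrieved docs and ask the LLM to strip irrelevant text."""
    context = "\n\n".join(docs)
    return llm(COMPRESS_PROMPT.format(query=query, context=context)).strip()
```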

Project Structure

production_rag
│
├── app
│   ├── config.py
│   │
│   ├── ingestion
│   │   └── load_docs.py
│   │
│   ├── embeddings
│   │   └── embedder.py
│   │
│   ├── retrieval
│   │   ├── vector_store.py
│   │   ├── hybrid_search.py
│   │   ├── reranker.py
│   │   └── metadata_filter.py
│   │
│   ├── pipeline
│   │   ├── query_rewrite.py
│   │   ├── context_compressor.py
│   │   └── rag_pipeline.py
│   │
│   └── llm
│       └── ollama_client.py
│
├── data
│   └── docs.txt
│
├── main.py
└── requirements.txt

The architecture is modular so components can be swapped easily (vector DB, embedding models, LLMs, etc.).


Installation

Clone the repository:

git clone https://github.com/yourusername/production-rag-system.git
cd production-rag-system

Create a virtual environment:

python -m venv venv
source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Running the System

Make sure Ollama is installed and running locally, then start the assistant:

python main.py

Example interaction:

You: What is Redis used for?
Assistant: Redis is an in-memory data store commonly used for caching, session storage, and real-time analytics.

Tech Stack

  • Python
  • Ollama (local LLM runtime)
  • Sentence Transformers
  • FAISS (vector search)
  • BM25 (keyword search)
  • Cross-encoder reranking

Learning Goals

This project was built to understand how real-world RAG systems are designed, including:

  • retrieval pipelines
  • hybrid search strategies
  • reranking models
  • modular AI system architecture

Possible Improvements

Future upgrades could include:

  • multi-query retrieval
  • reciprocal rank fusion
  • document chunk scoring
  • retrieval caching
  • evaluation metrics
  • FastAPI inference server

License

MIT License
