Local RAG

A Retrieval-Augmented Generation (RAG) system that runs entirely on your local machine. Chat about your documents; no data ever leaves your computer.

Built with: Python, LangChain, Ollama, ChromaDB


Features

  • Privacy — Zero external API calls. All processing happens locally.
  • GPU Accelerated — CUDA-powered embeddings for fast vectorization.
  • Hybrid Search — Combines semantic (vector) + keyword (BM25) retrieval with Reciprocal Rank Fusion.
  • Re-ranking — FlashRank cross-encoder for improved relevance.
  • Multi-Format — Supports PDF, Markdown, Text, and Word documents.
  • Configurable — YAML-based configuration for all parameters.
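To illustrate the Hybrid Search feature, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is not taken from this repository: the function name `reciprocal_rank_fusion`, the document IDs, and the constant `k=60` (a commonly used default for RRF) are assumptions made for illustration, and the project's own implementation in src/vectorstore.py may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers (IDs are made up).
vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic (vector) ranking
bm25_hits = ["doc_b", "doc_d", "doc_a"]     # keyword (BM25) ranking

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) rise to the top, which is why RRF is a popular way to combine retrievers whose raw scores are not directly comparable.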

Quick Start

Prerequisites

  • Python 3.11 or 3.12 (3.13 not yet supported)
  • NVIDIA GPU with CUDA (recommended for performance)
  • Ollama installed and running
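The prerequisites above can be checked with a short script. This is a sketch, not part of the repository: the helper names `python_is_supported` and `cuda_is_available` are hypothetical, and the CUDA check is only meaningful once PyTorch has been installed (step 2 below); before that it simply reports `False`.

```python
import sys

# Versions listed as supported in the prerequisites.
SUPPORTED_VERSIONS = {(3, 11), (3, 12)}


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter's major.minor version is supported."""
    return (version_info[0], version_info[1]) in SUPPORTED_VERSIONS


def cuda_is_available():
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # optional at this point; installed in a later step
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


print("Python supported:", python_is_supported())
print("CUDA available:  ", cuda_is_available())
```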

1. Clone & Setup

# Clone the repository
git clone https://github.com/yourusername/local-rag.git
cd local-rag

# Create virtual environment
python -m venv .venv

# Activate (Windows PowerShell)
.venv\Scripts\Activate.ps1

# Activate (Linux/macOS)
source .venv/bin/activate

2. Install Dependencies

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install project dependencies
pip install -r requirements.txt

3. Setup Ollama

# Pull the LLM model
ollama pull llama3

# Start Ollama server (if not running)
ollama serve

4. Ingest Documents

# Add your documents to data/documents/ then run:
python main.py ingest ./data/documents

# Or ingest with reset (clears existing vectors)
python main.py ingest ./data/documents --reset

5. Start Chatting

python main.py chat

Usage

CLI Commands

Command                               Description
python main.py ingest <path>          Ingest documents from a file or directory
python main.py ingest <path> --reset  Clear the vector store and re-ingest
python main.py chat                   Start an interactive chat session

Chat Commands

Command  Description
/help    Show available commands
/exit    Exit the chat

Configuration

All settings are managed in config/config.yaml:

llm:
  model: "llama3"              # Ollama model to use
  temperature: 0.0             # 0 = deterministic responses

embedding:
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
  device: "cuda"               # Use GPU for embeddings

retrieval:
  chunk_size: 500              # Characters per chunk
  k_retrieved: 20              # Candidates for re-ranking
  k_final: 5                   # Final context to LLM
  use_reranker: true           # Enable FlashRank
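To make `chunk_size` concrete, the sketch below splits text into fixed-size character chunks. It is a simplified illustration only: the function `chunk_text` is hypothetical, and the project's actual splitter in src/ingest.py may split on separators or use overlap rather than cutting at exact character offsets.

```python
def chunk_text(text, chunk_size=500):
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# A 1,200-character document yields two full chunks and one partial chunk.
sample = "a" * 1200
chunks = chunk_text(sample, chunk_size=500)
print([len(c) for c in chunks])
```

Smaller chunks give more precise retrieval hits but less context per hit; larger chunks do the opposite, so `chunk_size` is worth tuning per document collection.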

Architecture

Documents --> Ingestion --> Chunks --> Embedding (CUDA) --> ChromaDB

User Question --> Query Embedding --> Retrieval (Hybrid + RRF) --> Re-ranking (FlashRank, if enabled) --> Ollama --> Answer + Sources
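The query path narrows a wide candidate set (`k_retrieved`) down to a small final context (`k_final`). The sketch below shows that two-stage funnel with toy scoring functions. Everything in it is an assumption for illustration: `retrieve_then_rerank`, `word_overlap`, and `overlap_ratio` are hypothetical stand-ins, not the hybrid retriever or FlashRank re-ranker that the project uses.

```python
def word_overlap(query, chunk):
    """Toy first-stage score: number of distinct words shared with the query."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def overlap_ratio(query, chunk):
    """Toy re-ranking score: shared words relative to chunk length."""
    words = chunk.lower().split()
    return word_overlap(query, chunk) / len(words) if words else 0.0


def retrieve_then_rerank(query, chunks, k_retrieved=20, k_final=5):
    """Keep the k_retrieved best candidates, re-score them, return k_final."""
    candidates = sorted(
        chunks, key=lambda c: word_overlap(query, c), reverse=True
    )[:k_retrieved]
    reranked = sorted(
        candidates, key=lambda c: overlap_ratio(query, c), reverse=True
    )
    return reranked[:k_final]


docs = [
    "ChromaDB stores the embedded chunks on disk",
    "Ollama serves the llama3 model locally",
    "BM25 ranks chunks by keyword overlap with the query",
    "The weather was pleasant",
]
top = retrieve_then_rerank(
    "how are chunks stored in ChromaDB", docs, k_retrieved=3, k_final=2
)
print(top)
```

The cheap first stage keeps recall high, while the more expensive second stage only has to score `k_retrieved` candidates, which is the usual motivation for a retrieve-then-rerank design.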

Project Structure

local-rag/
├── main.py                 # CLI entry point
├── config/
│   └── config.yaml         # Configuration file
├── src/
│   ├── config.py           # Config validation (Pydantic)
│   ├── ingest.py           # Document loading & chunking
│   ├── vectorstore.py      # ChromaDB + embeddings + retrieval
│   ├── rag.py              # RAG orchestration
│   └── utils.py            # VRAM monitoring utilities
├── data/
│   ├── documents/          # Your documents go here
│   └── chroma_db/          # Persisted vector store
├── tests/                  # Test suite
└── requirements.txt        # Dependencies

Testing

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html

Troubleshooting

Issue                     Solution
CUDA not available        Ensure NVIDIA drivers and the CUDA toolkit are installed
Ollama connection failed  Run ollama serve in a separate terminal
Python version error      Use Python 3.11 or 3.12
Out of memory             Reduce k_retrieved in the config or use a smaller embedding model
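For the "Ollama connection failed" case, a quick reachability check can help confirm whether the server is up before starting a chat. This is a sketch using only the standard library; the helper name `ollama_is_reachable` is hypothetical, and the URL assumes Ollama's default local address (`http://localhost:11434`), which may differ if the server was started with a custom host or port.

```python
import urllib.request


def ollama_is_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError.
        return False


if ollama_is_reachable():
    print("Ollama is reachable.")
else:
    print("Ollama is not reachable; try running `ollama serve`.")
```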

License

MIT License - see LICENSE for details.
