A Retrieval-Augmented Generation (RAG) system that runs entirely on your local machine. Chat with your documents without any data leaving it.
- Privacy — Zero external API calls. All processing happens locally.
- GPU Accelerated — CUDA-powered embeddings for fast vectorization.
- Hybrid Search — Combines semantic (vector) + keyword (BM25) retrieval with Reciprocal Rank Fusion.
- Re-ranking — FlashRank cross-encoder for improved relevance.
- Multi-Format — Supports PDF, Markdown, Text, and Word documents.
- Configurable — YAML-based configuration for all parameters.
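
Reciprocal Rank Fusion, named in the Hybrid Search bullet above, merges several ranked lists by giving each document the sum of 1 / (k + rank) over the lists it appears in. The sketch below is only an illustration of that formula, not this project's code: the function name `reciprocal_rank_fusion`, the document IDs, and the constant `k=60` (a commonly used default) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the vector retriever and the BM25 retriever.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Documents that rank well in both lists (`doc_a`, `doc_b`) rise above documents that appear in only one, which is the point of combining semantic and keyword retrieval.
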
- Python 3.11 or 3.12 (3.13 not yet supported)
- NVIDIA GPU with CUDA (recommended for performance)
- Ollama installed and running
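
The three prerequisites above can be checked before installing anything else. The following is a rough preflight sketch rather than part of the project: the function names are hypothetical, the CUDA check assumes PyTorch may or may not be installed yet, and the Ollama check assumes the server listens on its default address, `http://localhost:11434`.

```python
import sys
import urllib.error
import urllib.request

SUPPORTED_PYTHON = {(3, 11), (3, 12)}


def python_supported(version=None):
    """Return True if the (major, minor) version is 3.11 or 3.12."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


def cuda_available():
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def ollama_running(url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at the assumed Ollama address."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("Python 3.11/3.12:", python_supported())
    print("CUDA available:  ", cuda_available())
    print("Ollama reachable:", ollama_running())
```

A `False` for CUDA is not fatal, since the GPU is listed as recommended rather than required; the other two checks correspond to hard requirements.
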
# Clone the repository
git clone https://github.com/yourusername/local-rag.git
cd local-rag
# Create virtual environment
python -m venv .venv
# Activate (Windows PowerShell)
.venv\Scripts\Activate.ps1
# Activate (Linux/macOS)
source .venv/bin/activate
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install project dependencies
pip install -r requirements.txt
# Pull the LLM model
ollama pull llama3
# Start Ollama server (if not running)
ollama serve
# Add your documents to data/documents/ then run:
python main.py ingest ./data/documents
# Or ingest with reset (clears existing vectors)
python main.py ingest ./data/documents --reset
# Start an interactive chat session
python main.py chat

| Command | Description |
|---|---|
| `python main.py ingest <path>` | Ingest documents from a file or directory |
| `python main.py ingest <path> --reset` | Clear vector store and re-ingest |
| `python main.py chat` | Start interactive chat session |

Commands available inside the chat session:

| Command | Description |
|---|---|
| `/help` | Show available commands |
| `/exit` | Exit the chat |

All settings are managed in config/config.yaml:
llm:
  model: "llama3" # Ollama model to use
  temperature: 0.0 # 0 = deterministic responses
embedding:
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
  device: "cuda" # Use GPU for embeddings
retrieval:
  chunk_size: 500 # Characters per chunk
  k_retrieved: 20 # Candidates for re-ranking
  k_final: 5 # Final context to LLM
  use_reranker: true # Enable FlashRank

Documents --> Ingestion --> Chunks --> Embedding (CUDA) --> ChromaDB
User Question --> Query Embedding --> Retrieval (Hybrid + RRF) --> Ollama --> Answer + Sources
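
On the query path above, `k_retrieved` candidates are narrowed to `k_final` passages before anything reaches the LLM. The sketch below shows that narrowing step in isolation, under stated assumptions: `select_context` is a hypothetical name, and the word-overlap scorer is a toy stand-in for the FlashRank re-ranker, not its API.

```python
def term_overlap(question, text):
    """Toy relevance score: how many question words appear in the text."""
    question_terms = set(question.lower().split())
    return len(question_terms & set(text.lower().split()))


def select_context(question, candidates, k_retrieved=20, k_final=5,
                   use_reranker=True, rerank_score=term_overlap):
    """Narrow retrieval candidates to the passages sent to the LLM.

    `candidates` is assumed to be ordered best-first by the retriever.
    """
    shortlist = candidates[:k_retrieved]
    if use_reranker:
        # sorted() is stable, so ties keep the retriever's original order.
        shortlist = sorted(shortlist,
                           key=lambda text: rerank_score(question, text),
                           reverse=True)
    return shortlist[:k_final]


# Hypothetical retrieved passages, best-first according to the retriever.
candidates = [
    "ollama serves the model",
    "chunk size is set in config",
    "the config file sets chunk size and model",
    "unrelated text",
]
context = select_context("where is chunk size set", candidates, k_final=2)
print(context)
```

Keeping `k_retrieved` well above `k_final` gives the re-ranker room to promote passages the first-stage retriever ranked low; with `use_reranker=False` the function simply truncates the retriever's own ordering.
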
local-rag/
├── main.py # CLI entry point
├── config/
│ └── config.yaml # Configuration file
├── src/
│ ├── config.py # Config validation (Pydantic)
│ ├── ingest.py # Document loading & chunking
│ ├── vectorstore.py # ChromaDB + embeddings + retrieval
│ ├── rag.py # RAG orchestration
│ └── utils.py # VRAM monitoring utilities
├── data/
│ ├── documents/ # Your documents go here
│ └── chroma_db/ # Persisted vector store
├── tests/ # Test suite
└── requirements.txt # Dependencies
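
The tree above lists `src/config.py` as the place where configuration is validated with Pydantic. As a dependency-free illustration of the kind of checks such a module might perform, the sketch below uses standard-library dataclasses instead of Pydantic. The class name, the loader function, and the specific rules (positive sizes, `k_final` no larger than `k_retrieved`) are assumptions; only the field names and defaults come from the `retrieval:` section of `config/config.yaml`.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RetrievalConfig:
    """Mirror of the `retrieval:` section, with assumed sanity checks."""
    chunk_size: int = 500
    k_retrieved: int = 20
    k_final: int = 5
    use_reranker: bool = True

    def __post_init__(self):
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if self.k_final <= 0:
            raise ValueError("k_final must be positive")
        if self.k_final > self.k_retrieved:
            raise ValueError("k_final cannot exceed k_retrieved")


def load_retrieval_config(raw):
    """Build a RetrievalConfig from an already-parsed `retrieval:` mapping."""
    known = {field.name for field in fields(RetrievalConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown retrieval keys: {sorted(unknown)}")
    return RetrievalConfig(**raw)


config = load_retrieval_config(
    {"chunk_size": 500, "k_retrieved": 20, "k_final": 5, "use_reranker": True}
)
print(config)
```

Rejecting unknown keys catches typos such as `k_retreived` at startup instead of silently falling back to a default.
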
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=src --cov-report=html

| Issue | Solution |
|---|---|
| CUDA not available | Update the NVIDIA driver and confirm PyTorch was installed from the CUDA (`cu121`) wheel index rather than a CPU-only build |
| Ollama connection failed | Run `ollama serve` in a separate terminal |
| Python version error | Use Python 3.11 or 3.12 |
| Out of memory | Reduce `k_retrieved` in the config or use a smaller embedding model |
MIT License - see LICENSE for details.