A self-directed deep dive into RAG (Retrieval-Augmented Generation) — built to understand how modern AI pipelines work under the hood, from raw text ingestion all the way to semantic search.
Status: 🚧 In Development — architecture planned, implementation in progress
Upload any PDF and ask it questions in plain English. Rather than keyword matching, the system understands meaning — surfacing the most relevant passages using vector similarity search.
PDF → Parse → Chunk → Embed → Store in Vector DB → Query → Answer
This is the same fundamental pipeline behind tools like ChatGPT's file uploads, Notion AI, and enterprise document search systems.
https://realpython.com/chromadb-vector-database/ https://realpython.com/tutorials/ai/
https://www.datacamp.com/ https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐
│ PDF Input │────▶│ Text Parser │────▶│ Chunking Engine │
└─────────────┘ └──────────────┘ └──────────┬──────────┘
│
┌──────────▼──────────┐
│ Embedding Model │
│ (Sentence-BERT) │
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ Vector Store │
│ (ChromaDB) │
└──────────┬──────────┘
│
┌──────────────────────▼────────────────────────┐
│ Query Interface — Semantic Similarity Search │
└───────────────────────────────────────────────┘
| Component | Tool | Why |
|---|---|---|
| PDF Parsing | PyMuPDF / pdfplumber |
Robust text + layout extraction |
| Embeddings | sentence-transformers |
Pretrained SBERT models, no fine-tuning needed |
| Vector Database | ChromaDB |
Lightweight, local-first vector store |
| Language | Python 3.10+ | Industry standard for ML pipelines |
- Text Embeddings — converting unstructured text into dense numerical vectors that encode semantic meaning
- Chunking Strategy — splitting documents into overlapping segments to preserve context across chunk boundaries
- Cosine Similarity — measuring vector proximity to rank retrieved passages by relevance
- RAG Pipeline Design — decoupling retrieval from generation; understanding where each component lives in the stack
- Vector Databases — indexing, storing, and querying high-dimensional embeddings efficiently
pdf-rag/
├── data/ # Sample PDFs for testing
├── src/
│ ├── ingest.py # PDF loading and text extraction
│ ├── chunker.py # Text splitting logic
│ ├── embed.py # Embedding generation via sentence-transformers
│ ├── store.py # ChromaDB read/write interface
│ └── query.py # Similarity search and result ranking
├── notebooks/ # Exploratory work and model comparisons
├── requirements.txt
└── README.md
- PDF ingestion and text extraction
- Chunking with overlap
- Embedding pipeline with
sentence-transformers - ChromaDB integration
- Semantic query interface
- Notebook with embedding visualizations (UMAP/t-SNE)
- Swap and benchmark multiple SBERT models
- Optional: lightweight frontend (Streamlit or React)
RAG is one of the most widely deployed patterns in production AI systems today. Rather than treating it as a black box, I wanted to learn it myself. From how embeddings encode meaning, to how vector databases work.
[Your Name] — [your-email@northeastern.edu] — LinkedIn — GitHub