🔍 PDF RAG — Retrieval-Augmented Generation from Scratch

A self-directed deep dive into RAG (Retrieval-Augmented Generation) — built to understand how modern AI pipelines work under the hood, from raw text ingestion all the way to semantic search.

Status: 🚧 In Development — architecture planned, implementation in progress

💡 What This Project Does

Upload any PDF and ask it questions in plain English. Rather than matching keywords, the system embeds the question and each passage as vectors and surfaces the passages whose vectors lie closest to the question's, so results are ranked by meaning rather than by shared wording.

PDF → Parse → Chunk → Embed → Store in Vector DB → Query → Answer

This is broadly the same retrieval pattern used by tools like ChatGPT's file uploads, Notion AI, and enterprise document search systems.
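
As a rough illustration of the "Query" step, the sketch below ranks passages by cosine similarity to a query vector. The three-dimensional vectors are made up by hand for readability and are an assumption of this example; in the planned pipeline they would come from an embedding model. The function names `cosine_similarity` and `top_k` are hypothetical and not part of the repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, passages, k=2):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-written toy vectors, for illustration only (not real embeddings).
passages = [
    ("Refunds are issued within 30 days.", [0.9, 0.1, 0.0]),
    ("The office is closed on public holidays.", [0.1, 0.9, 0.1]),
    ("Contact support to return an item.", [0.8, 0.2, 0.1]),
]
# Pretend embedding of the question "How do I get my money back?"
query_vec = [1.0, 0.0, 0.0]

for score, text in top_k(query_vec, passages):
    print(f"{score:.3f}  {text}")
```

Note that the query shares no keywords with the two passages it retrieves; the ranking depends only on how close the vectors are, which is the property the real embedding model is meant to provide.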

Resources/to-do:

https://realpython.com/chromadb-vector-database/
https://realpython.com/tutorials/ai/
https://www.datacamp.com/
https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers
https://www.sbert.net/docs/sentence_transformer/pretrained_models.html

🧱 System Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────────────┐
│  PDF Input  │────▶│  Text Parser │────▶│   Chunking Engine   │
└─────────────┘     └──────────────┘     └──────────┬──────────┘
                                                     │
                                          ┌──────────▼──────────┐
                                          │  Embedding Model    │
                                          │  (Sentence-BERT)    │
                                          └──────────┬──────────┘
                                                     │
                                          ┌──────────▼──────────┐
                                          │  Vector Store       │
                                          │  (ChromaDB)         │
                                          └──────────┬──────────┘
                                                     │
                              ┌──────────────────────▼────────────────────────┐
                              │  Query Interface — Semantic Similarity Search  │
                              └───────────────────────────────────────────────┘

🛠️ Tech Stack

Component	Tool	Why
PDF Parsing	`PyMuPDF` / `pdfplumber`	Robust text + layout extraction
Embeddings	`sentence-transformers`	Pretrained SBERT models, no fine-tuning needed
Vector Database	`ChromaDB`	Lightweight, local-first vector store
Language	Python 3.10+	Industry standard for ML pipelines
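
A minimal sketch of how `sentence-transformers` and `ChromaDB` from the table above might fit together. Several things here are assumptions rather than decisions made in this project: the model name `all-MiniLM-L6-v2`, the collection name `pdf_chunks`, the in-memory client, and the helper names `make_ids`, `build_index`, and `search`. The third-party imports are done inside the functions so the file can be loaded before the dependencies are installed.

```python
def make_ids(chunks):
    """Build one stable string id per chunk (ChromaDB requires unique ids)."""
    return [f"chunk-{i}" for i in range(len(chunks))]


def build_index(chunks, collection_name="pdf_chunks", model_name="all-MiniLM-L6-v2"):
    """Embed text chunks and store them in an in-memory ChromaDB collection."""
    # Lazy imports: both packages must be installed before calling this function.
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # Unit-length vectors make the ranking by L2 distance match the ranking
    # by cosine similarity.
    embeddings = model.encode(chunks, normalize_embeddings=True).tolist()

    client = chromadb.Client()  # in-memory; contents are lost when the process exits
    collection = client.get_or_create_collection(name=collection_name)
    collection.add(ids=make_ids(chunks), documents=chunks, embeddings=embeddings)
    return model, collection


def search(model, collection, question, n_results=3):
    """Return the stored chunks closest to the question, best match first."""
    query_embedding = model.encode([question], normalize_embeddings=True).tolist()
    # Never ask for more results than the collection holds.
    n_results = min(n_results, collection.count())
    result = collection.query(query_embeddings=query_embedding, n_results=n_results)
    return result["documents"][0]  # one list of documents per query
```

Usage would look roughly like `model, collection = build_index(chunks)` followed by `search(model, collection, "What is the refund policy?")`, where `chunks` is a list of strings produced by the chunking step.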

📐 Key Concepts Explored

Text Embeddings — converting unstructured text into dense numerical vectors that encode semantic meaning
Chunking Strategy — splitting documents into overlapping segments to preserve context across chunk boundaries
Cosine Similarity — measuring vector proximity to rank retrieved passages by relevance
RAG Pipeline Design — decoupling retrieval from generation; understanding where each component lives in the stack
Vector Databases — indexing, storing, and querying high-dimensional embeddings efficiently
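
The chunking idea above can be sketched in a few lines. This version splits on raw character counts, which is an assumption made for simplicity; a real chunker would more likely split on sentences or tokens. The function name `chunk_text` and the default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, overlap=4):
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.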

🗂️ Planned Project Structure

pdf-rag/
├── data/               # Sample PDFs for testing
├── src/
│   ├── ingest.py       # PDF loading and text extraction
│   ├── chunker.py      # Text splitting logic
│   ├── embed.py        # Embedding generation via sentence-transformers
│   ├── store.py        # ChromaDB read/write interface
│   └── query.py        # Similarity search and result ranking
├── notebooks/          # Exploratory work and model comparisons
├── requirements.txt
└── README.md

🗺️ Roadmap

PDF ingestion and text extraction
Chunking with overlap
Embedding pipeline with sentence-transformers
ChromaDB integration
Semantic query interface
Notebook with embedding visualizations (UMAP/t-SNE)
Swap and benchmark multiple SBERT models
Optional: lightweight frontend (Streamlit or React)

🎯 Why I Built This

RAG is one of the most widely deployed patterns in production AI systems today. Rather than treating it as a black box, I wanted to learn how each piece works, from how embeddings encode meaning to how vector databases store and search them.

📬 Contact

[Your Name] — [your-email@northeastern.edu] — LinkedIn — GitHub
