A local NotebookLM clone that lets you upload PDF documents, ask questions about them, and get cited answers — all running entirely on your machine. No cloud APIs, no subscriptions.
When you upload a PDF, the app extracts the text and splits it into overlapping chunks using a sentence-aware splitter. This means chunks respect sentence boundaries instead of cutting words in half, and they overlap slightly so important ideas that span two chunks aren't lost.
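The app uses LlamaIndex's `SentenceSplitter` for this (see `rag/chunking.py`); a minimal pure-Python sketch of the same idea, with a deliberately naive sentence regex:

```python
import re

def chunk_sentences(text, max_chars=500, overlap_sentences=1):
    """Split text into chunks that respect sentence boundaries,
    overlapping by a few sentences so ideas spanning chunks survive."""
    # Naive split on ., !, ? followed by whitespace (illustration only)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        if current and sum(len(s) for s in current) + len(sent) > max_chars:
            chunks.append(" ".join(current))
            # carry the last sentence(s) forward as overlap
            current = current[-overlap_sentences:]
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The real splitter counts tokens rather than characters, but the boundary-respecting overlap works the same way.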
Each chunk is converted into a numerical vector (embedding) using a language model (BAAI/bge-base-en-v1.5). These vectors capture the meaning of the text, not just keywords. They're stored in Qdrant, a vector database optimized for similarity search.
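Similarity search over those vectors reduces to nearest-neighbor lookup by cosine similarity. Qdrant does this at scale with approximate indexes; a toy pure-Python illustration of the underlying operation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(query_vec, stored):
    """stored: list of (chunk_text, vector); returns the best-matching chunk."""
    return max(stored, key=lambda item: cosine(query_vec, item[1]))[0]
```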
When you ask a question, the LLM (with thinking disabled for speed) decomposes it into multiple targeted search queries. For example, "How does knowledge distillation compare to pruning?" might become separate queries for "knowledge distillation technique" and "model pruning methods." This retrieves more relevant chunks than a single query would.
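The actual planner prompt lives in `rag/planner.py`; a sketch of the parsing side, assuming (hypothetically) the LLM is asked to return one sub-query per line, possibly with bullets or numbering:

```python
def parse_subqueries(llm_output, max_queries=4):
    """Parse an LLM planning response into clean sub-query strings.
    Assumes one query per line; strips bullet/number prefixes."""
    queries = []
    for line in llm_output.splitlines():
        cleaned = line.strip().lstrip("-*0123456789. ").strip()
        if cleaned:
            queries.append(cleaned)
    return queries[:max_queries]
```

Capping the count keeps the retrieval fan-out (and latency) bounded.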
The decomposed queries are run in parallel against Qdrant. Results are deduplicated by content to avoid redundant chunks, then passed to the reranker.
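Content-based deduplication can be as simple as keying on the chunk text while preserving first-seen order (a sketch; the real logic is in `rag/retrieval.py`):

```python
def dedupe_chunks(result_lists):
    """Merge parallel search results, keeping the first occurrence of each chunk."""
    seen, merged = set(), []
    for results in result_lists:
        for chunk in results:
            key = chunk["text"]
            if key not in seen:
                seen.add(key)
                merged.append(chunk)
    return merged
```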
Initial retrieval casts a wide net. A cross-encoder model (BAAI/bge-reranker-base) then re-scores every retrieved chunk by looking at the query and chunk together, producing much more accurate relevance rankings than the initial embedding similarity alone.
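The re-scoring step pairs the query with each chunk and sorts by the cross-encoder's score. A sketch with the scorer injected as a function (the real scorer wraps BAAI/bge-reranker-base; the dummy word-overlap scorer below is purely illustrative):

```python
def rerank(query, chunks, score_fn, top_k=8):
    """Score each (query, chunk) pair jointly and keep the best top_k."""
    scored = [(score_fn(query, c["text"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

Because the cross-encoder sees query and chunk together, it can catch relevance signals (negation, comparisons) that bi-encoder embedding similarity misses.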
The best-ranked chunks are stitched together in document order and fed directly to the LLM. The model generates a Markdown-formatted answer with LaTeX math support and citations pointing back to specific document sections like [paper.pdf#3].
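Stitching in document order might look like the following sketch, assuming each chunk carries its source document name and position (hypothetical field names):

```python
def stitch_context(chunks):
    """Order reranked chunks by (document, position) and join them for the
    prompt, tagging each with a citation marker like [paper.pdf#3]."""
    ordered = sorted(chunks, key=lambda c: (c["doc"], c["index"]))
    return "\n\n".join(f"[{c['doc']}#{c['index']}] {c['text']}" for c in ordered)
```

Restoring document order matters: reranking sorts by relevance, which would otherwise present the chunks to the LLM in a scrambled narrative order.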
Chat history is summarized into a compact 2–3 sentence recap by the LLM (with thinking disabled) before each new question. This gives the model conversational context without bloating the prompt, which is important for smaller models with limited context windows.
The embedding and reranking models automatically detect your hardware:
- Apple Silicon → Metal Performance Shaders (MPS)
- NVIDIA GPU → CUDA
- No GPU → CPU fallback
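With PyTorch this detection is a pair of availability checks; a minimal sketch that also falls back to CPU when torch itself is absent:

```python
def pick_device():
    """Prefer Apple's MPS, then CUDA, then CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent on older torch builds
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"
```

The returned string is what sentence-transformers style loaders accept as a `device` argument.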
You can check/uncheck documents in the sidebar. Only checked documents are included in retrieval, so you can focus the model's attention on specific papers. Individual documents can also be deleted with the × button.
Documents and chat history exist only for the current browser session. Closing the tab or refreshing the page clears all data — nothing persists between sessions.
| Optimization | Effect |
|---|---|
| Consistent `num_ctx` | All Ollama calls use the same context size to prevent costly model reloads between requests |
| Thinking disabled for intermediates | Planning and chat-summary calls skip Qwen3's `<think>` blocks, saving 5–10s per query |
| No intermediate summarization | Retrieved context is fed directly to the final answer instead of through an extra summarization LLM call |
| Single retrieval round | One retrieval pass instead of iterative multi-round, cutting 1–2 extra LLM calls |
| No output token limit on answers | Streaming answer generation runs until the model finishes naturally — no truncation |
| Model preloading at startup | Embedding and reranker models load during container startup, not on the first query |
| Real-time status indicators | Pulsing status messages (Planning → Searching → Reranking → Generating) keep the UI responsive |
| Component | Technology |
|---|---|
| LLM | gemma3:4b via Ollama (local) |
| Embeddings | BAAI/bge-base-en-v1.5 |
| Reranker | BAAI/bge-reranker-base |
| Vector DB | Qdrant |
| Backend | FastAPI + Uvicorn |
| Frontend | Vanilla HTML/CSS/JS + Marked.js + KaTeX |
| PDF Parsing | pypdf |
| Chunking | LlamaIndex SentenceSplitter |
| Containerization | Docker Compose |
- MPS (Apple Silicon) or CUDA (NVIDIA GPU) device recommended
- Docker Desktop installed and running
- Ollama installed (ollama.com/download or `brew install ollama`)
Ollama runs natively on your host to access the GPU (Docker can't access Metal/MPS).
In a separate terminal (keep it running):
```shell
ollama serve
```

You should see it listening on http://127.0.0.1:11434.
In another terminal, pull the model while Ollama is serving:
```shell
ollama pull gemma3:4b
```

This downloads ~5 GB. Verify it's there:

```shell
ollama list
```

Then clone the repo and start the services:

```shell
git clone https://github.com/your-username/i-hate-reading.git
cd i-hate-reading
docker compose up --build
```

This starts:
- Qdrant (vector database) on port 6333
- notebook-agent (backend + frontend) on port 8000
First build takes a few minutes to download dependencies and models.
Go to http://localhost:8000 in your browser.
- Upload one or more PDF documents
- Wait for processing to complete
- Ask questions about your documents
- Get cited, formatted answers with math rendering
```shell
# Stop Docker services
docker compose down

# Stop Ollama (Ctrl+C in the terminal running `ollama serve`)
```

Environment variables are configured in docker-compose.yml under `notebook-agent`:
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_HOST` | `http://host.docker.internal:11434` | Ollama API URL |
| `OLLAMA_MODEL` | `gemma3:4b` | LLM model name |
| `QDRANT_HOST` | `http://qdrant:6333` | Qdrant URL |
| `NUM_CTX` | `8192` | Context window size for all LLM calls |
| `RERANK_MODEL` | `BAAI/bge-reranker-base` | Cross-encoder model |
| `ENABLE_RERANK` | `1` | Toggle reranking (`0` to disable) |
| `MAX_CONTEXT_CHUNKS` | `8` | Max chunks in final prompt |
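Inside the container these are read as plain environment lookups; a sketch of the defaults above (variable names from the table, module placement assumed):

```python
import os

# Defaults mirror docker-compose.yml; env vars override them.
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://host.docker.internal:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "gemma3:4b")
QDRANT_HOST = os.getenv("QDRANT_HOST", "http://qdrant:6333")
NUM_CTX = int(os.getenv("NUM_CTX", "8192"))
RERANK_MODEL = os.getenv("RERANK_MODEL", "BAAI/bge-reranker-base")
ENABLE_RERANK = os.getenv("ENABLE_RERANK", "1") == "1"
MAX_CONTEXT_CHUNKS = int(os.getenv("MAX_CONTEXT_CHUNKS", "8"))
```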
```
i-hate-reading/
├── docker-compose.yml
├── .env
├── README.md
└── agent/
    ├── Dockerfile
    ├── requirements.txt
    ├── app.py                 # FastAPI server
    ├── rag/
    │   ├── pipeline.py        # RAG orchestrator
    │   ├── planner.py         # Query decomposition
    │   ├── retrieval.py       # Vector search
    │   ├── rerank.py          # Cross-encoder reranking
    │   ├── embeddings.py      # Text → vectors
    │   ├── chunking.py        # Document splitting
    │   ├── ingestion.py       # PDF → chunks → Qdrant
    │   ├── llm.py             # Ollama API wrapper
    │   └── vector_store.py    # Qdrant client
    └── static/
        ├── index.html
        ├── css/styles.css
        └── js/app.js
```
Built by lemonjerome.