A local NotebookLM clone that lets you upload PDF documents, ask questions about them, and get cited answers — all running entirely on your machine. No cloud APIs, no subscriptions.
When you upload a PDF, the app extracts the text and splits it into overlapping chunks using a sentence-aware splitter. This means chunks respect sentence boundaries instead of cutting words in half, and they overlap slightly so important ideas that span two chunks aren't lost.
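The app uses LlamaIndex's `SentenceSplitter` for this (see `rag/chunking.py`); a minimal pure-Python sketch of the same idea, with a deliberately naive sentence regex:

```python
import re

def chunk_sentences(text, max_chars=500, overlap_sentences=1):
    """Split text into chunks that respect sentence boundaries,
    overlapping by a few sentences so ideas spanning chunks survive."""
    # Naive split on ., !, ? followed by whitespace (illustration only)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        if current and sum(len(s) for s in current) + len(sent) > max_chars:
            chunks.append(" ".join(current))
            # carry the last sentence(s) forward as overlap
            current = current[-overlap_sentences:]
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The real splitter counts tokens rather than characters, but the boundary-respecting overlap works the same way.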
Each chunk is converted into a numerical vector (embedding) using a language model (BAAI/bge-base-en-v1.5). These vectors capture the meaning of the text, not just keywords. They're stored in Qdrant, a vector database optimized for similarity search.
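Similarity search over those vectors reduces to nearest-neighbor lookup by cosine similarity. Qdrant does this at scale with approximate indexes; a toy pure-Python illustration of the underlying operation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(query_vec, stored):
    """stored: list of (chunk_text, vector); returns the best-matching chunk."""
    return max(stored, key=lambda item: cosine(query_vec, item[1]))[0]
```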
When you ask a question, the LLM (with thinking disabled for speed) decomposes it into multiple targeted search queries. For example, "How does knowledge distillation compare to pruning?" might become separate queries for "knowledge distillation technique" and "model pruning methods." This retrieves more relevant chunks than a single query would.
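The actual planner prompt lives in `rag/planner.py`; a sketch of the parsing side, assuming (hypothetically) the LLM is asked to return one sub-query per line, possibly with bullets or numbering:

```python
def parse_subqueries(llm_output, max_queries=4):
    """Parse an LLM planning response into clean sub-query strings.
    Assumes one query per line; strips bullet/number prefixes."""
    queries = []
    for line in llm_output.splitlines():
        cleaned = line.strip().lstrip("-*0123456789. ").strip()
        if cleaned:
            queries.append(cleaned)
    return queries[:max_queries]
```

Capping the count keeps the retrieval fan-out (and latency) bounded.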
The decomposed queries are run in parallel against Qdrant. Results are deduplicated by content to avoid redundant chunks, then passed to the reranker.
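Content-based deduplication can be as simple as keying on the chunk text while preserving first-seen order (a sketch; the real logic is in `rag/retrieval.py`):

```python
def dedupe_chunks(result_lists):
    """Merge parallel search results, keeping the first occurrence of each chunk."""
    seen, merged = set(), []
    for results in result_lists:
        for chunk in results:
            key = chunk["text"]
            if key not in seen:
                seen.add(key)
                merged.append(chunk)
    return merged
```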
Initial retrieval casts a wide net. A cross-encoder model (BAAI/bge-reranker-base) then re-scores every retrieved chunk by looking at the query and chunk together, producing much more accurate relevance rankings than the initial embedding similarity alone.
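The re-scoring step pairs the query with each chunk and sorts by the cross-encoder's score. A sketch with the scorer injected as a function (the real scorer wraps BAAI/bge-reranker-base; the dummy word-overlap scorer below is purely illustrative):

```python
def rerank(query, chunks, score_fn, top_k=8):
    """Score each (query, chunk) pair jointly and keep the best top_k."""
    scored = [(score_fn(query, c["text"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

Because the cross-encoder sees query and chunk together, it can catch relevance signals (negation, comparisons) that bi-encoder embedding similarity misses.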
The best-ranked chunks are stitched together in document order and fed directly to the LLM. The model generates a Markdown-formatted answer with LaTeX math support and citations pointing back to specific document sections like [paper.pdf#3].
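Stitching in document order might look like the following sketch, assuming each chunk carries its source document name and position (hypothetical field names):

```python
def stitch_context(chunks):
    """Order reranked chunks by (document, position) and join them for the
    prompt, tagging each with a citation marker like [paper.pdf#3]."""
    ordered = sorted(chunks, key=lambda c: (c["doc"], c["index"]))
    return "\n\n".join(f"[{c['doc']}#{c['index']}] {c['text']}" for c in ordered)
```

Restoring document order matters: reranking sorts by relevance, which would otherwise present the chunks to the LLM in a scrambled narrative order.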
Chat history is summarized into a compact 2–3 sentence recap by the LLM (with thinking disabled) before each new question. This gives the model conversational context without bloating the prompt, which is important for smaller models with limited context windows.
The embedding and reranking models automatically detect your hardware:
- Apple Silicon → Metal Performance Shaders (MPS)
- NVIDIA GPU → CUDA
- No GPU → CPU fallback
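With PyTorch this detection is a pair of availability checks; a minimal sketch that also falls back to CPU when torch itself is absent:

```python
def pick_device():
    """Prefer Apple's MPS, then CUDA, then CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent on older torch builds
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"
```

The returned string is what sentence-transformers style loaders accept as a `device` argument.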
You can check/uncheck documents in the sidebar. Only checked documents are included in retrieval, so you can focus the model's attention on specific papers. Individual documents can also be deleted with the × button.
Documents and chat history exist only for the current browser session. Closing the tab or refreshing the page clears all data — nothing persists between sessions.
| Optimization | Effect |
|---|---|
| Consistent `num_ctx` | All Ollama calls use the same context size to prevent costly model reloads between requests |
| Thinking disabled for intermediates | Planning and chat-summary calls skip Qwen3's `<think>` blocks, saving 5–10s per query |
| No intermediate summarization | Retrieved context is fed directly to the final answer instead of through an extra summarization LLM call |
| Single retrieval round | One retrieval pass instead of iterative multi-round, cutting 1–2 extra LLM calls |
| No output token limit on answers | Streaming answer generation runs until the model finishes naturally — no truncation |
| Model preloading at startup | Embedding and reranker models load during container startup, not on the first query |
| Real-time status indicators | Pulsing status messages (Planning → Searching → Reranking → Generating) keep the UI responsive |
| Component | Technology |
|---|---|
| LLM | gemma3:4b via Ollama (local) |
| Embeddings | BAAI/bge-base-en-v1.5 |
| Reranker | BAAI/bge-reranker-base |
| Vector DB | Qdrant |
| Backend | FastAPI + Uvicorn |
| Frontend | Vanilla HTML/CSS/JS + Marked.js + KaTeX |
| PDF Parsing | pypdf |
| Chunking | LlamaIndex SentenceSplitter |
| Containerization | Docker Compose |
- MPS (Apple Silicon) or CUDA (NVIDIA GPU) device recommended
- Docker Desktop installed and running
- Ollama installed (ollama.com/download or `brew install ollama`)
Ollama runs natively on your host to access the GPU (Docker can't access Metal/MPS).
In a separate terminal (keep it running):
```shell
ollama serve
```

You should see it listening on http://127.0.0.1:11434.
In another terminal, pull the model while Ollama is serving:
```shell
ollama pull gemma3:4b
```

This downloads ~5 GB. Verify it's there:

```shell
ollama list
```

Then clone the repo and start the services:

```shell
git clone https://github.com/your-username/i-hate-reading.git
cd i-hate-reading
docker compose up --build
```

This starts:
- Qdrant (vector database) on port 6333
- notebook-agent (backend + frontend) on port 8000
First build takes a few minutes to download dependencies and models.
Go to http://localhost:8000 in your browser.
- Upload one or more PDF documents
- Wait for processing to complete
- Ask questions about your documents
- Get cited, formatted answers with math rendering
```shell
# Stop Docker services
docker compose down

# Stop Ollama (Ctrl+C in the terminal running `ollama serve`)
```

Environment variables are configured in docker-compose.yml under `notebook-agent`:
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_HOST` | `http://host.docker.internal:11434` | Ollama API URL |
| `OLLAMA_MODEL` | `gemma3:4b` | LLM model name |
| `QDRANT_HOST` | `http://qdrant:6333` | Qdrant URL |
| `NUM_CTX` | `8192` | Context window size for all LLM calls |
| `RERANK_MODEL` | `BAAI/bge-reranker-base` | Cross-encoder model |
| `ENABLE_RERANK` | `1` | Toggle reranking (`0` to disable) |
| `MAX_CONTEXT_CHUNKS` | `8` | Max chunks in final prompt |
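Inside the container these are read as plain environment lookups; a sketch of the defaults above (variable names from the table, module placement assumed):

```python
import os

# Defaults mirror docker-compose.yml; env vars override them.
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://host.docker.internal:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "gemma3:4b")
QDRANT_HOST = os.getenv("QDRANT_HOST", "http://qdrant:6333")
NUM_CTX = int(os.getenv("NUM_CTX", "8192"))
RERANK_MODEL = os.getenv("RERANK_MODEL", "BAAI/bge-reranker-base")
ENABLE_RERANK = os.getenv("ENABLE_RERANK", "1") == "1"
MAX_CONTEXT_CHUNKS = int(os.getenv("MAX_CONTEXT_CHUNKS", "8"))
```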
```
i-hate-reading/
├── docker-compose.yml
├── .env
├── README.md
└── agent/
    ├── Dockerfile
    ├── requirements.txt
    ├── app.py                 # FastAPI server
    ├── rag/
    │   ├── pipeline.py        # RAG orchestrator
    │   ├── planner.py         # Query decomposition
    │   ├── retrieval.py       # Vector search
    │   ├── rerank.py          # Cross-encoder reranking
    │   ├── embeddings.py      # Text → vectors
    │   ├── chunking.py        # Document splitting
    │   ├── ingestion.py       # PDF → chunks → Qdrant
    │   ├── llm.py             # Ollama API wrapper
    │   └── vector_store.py    # Qdrant client
    └── static/
        ├── index.html
        ├── css/styles.css
        └── js/app.js
```
Built by lemonjerome.