A multi-turn conversational RAG (Retrieval-Augmented Generation) application for querying PDF documents. It combines hybrid vector + keyword search via Weaviate, a structured 4-node LangGraph reasoning pipeline, and a real-time streaming interface over WebSockets.
- Multi-turn conversations — full conversation history with context-aware reformulation
- Hybrid search — combines dense vector similarity and BM25 keyword search (50/50 alpha) for best-of-both-worlds retrieval
- Streaming responses — tokens stream in real time to the browser via WebSocket
- Persistent sessions — conversation state is stored in SQLite; restarting the server preserves history
- REST fallback — non-streaming HTTP endpoint available alongside WebSocket
```
User (Streamlit UI)
   │
   │ WebSocket ws://localhost:8000/ws/chat/{session_id}
   ▼
FastAPI Backend (api_service.py)
   │
   ▼
LangGraph RAG Pipeline
   ├── 1. Contextualize — analyze history, extract current intent
   ├── 2. Extract Terms — distill 1–3 precise search queries via LLM
   ├── 3. Retrieve — hybrid search (vector + BM25) against Weaviate
   └── 4. Generate — stream grounded answer from LLM
   │
   ▼
Conversation Memory (SQLite conversations.db)
```
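The four stages above compose linearly. A minimal sketch of that flow as plain Python functions (the names and trivial bodies here are illustrative stand-ins; the real pipeline wires equivalent steps together as LangGraph nodes in `graph/builder.py`):

```python
# Sketch of the four-stage flow. Each real node calls an LLM or Weaviate;
# here each is a trivial stand-in so the composition itself is visible.

def contextualize(history: list[str], question: str) -> str:
    # Real node: LLM reads history and reformulates; here a pass-through.
    return question

def extract_terms(question: str) -> list[str]:
    # Real node: LLM distills 1-3 search queries; here a single query.
    return [question.lower()]

def retrieve(terms: list[str]) -> list[str]:
    # Real node: hybrid vector + BM25 search against Weaviate.
    return [f"chunk matching '{t}'" for t in terms]

def generate(question: str, chunks: list[str]) -> str:
    # Real node: LLM streams an answer grounded in the retrieved chunks.
    return f"Answer to '{question}' grounded in {len(chunks)} chunk(s)"

def run_pipeline(history: list[str], question: str) -> str:
    q = contextualize(history, question)
    chunks = retrieve(extract_terms(q))
    return generate(q, chunks)
```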
| Layer | Technology |
|---|---|
| Frontend | Streamlit |
| Backend | FastAPI + WebSocket |
| RAG Pipeline | LangGraph |
| Vector Database | Weaviate (local Docker) |
| LLM & Embeddings | OpenAI-compatible API (gpt-4o-mini / text-embedding-3-large) |
| Conversation Memory | SQLite via LangGraph SqliteSaver |
| Document Loader | PyPDF |
```
.
├── config.py                        # All configuration constants (reads from .env)
├── embeddings.py                    # Embedding client wrapper
├── vectorstore.py                   # PDF loading, chunking, and Weaviate indexing
│
├── graph/
│   ├── state.py                     # GraphState TypedDict
│   ├── nodes.py                     # Pipeline nodes + generate_stream
│   └── builder.py                   # Graph wiring and compilation
│
├── api/
│   ├── lifespan.py                  # FastAPI startup/shutdown + thread pool
│   └── routes.py                    # HTTP and WebSocket endpoints
│
├── ui/
│   ├── styles.py                    # Custom CSS
│   ├── websocket_client.py          # WebSocket communication helpers
│   ├── sidebar.py                   # Settings sidebar
│   └── chat.py                      # Chat history and input handling
│
├── hybrid_search_graph_history.py   # Entry point — run pipeline from terminal
├── api_service.py                   # Entry point — FastAPI server
├── streamlit_app.py                 # Entry point — Streamlit UI
│
├── reset_weaviate.py                # Utility: wipe and re-index Weaviate collection
├── .env.example                     # Environment variable template
├── requirements.txt                 # Core dependencies
└── requirements_auth.txt            # Extended deps (Firebase, Firestore, websocket-client)
```
- Python 3.10+
- Docker (for Weaviate)
- An OpenAI-compatible API key
```bash
docker run -d \
  -p 8080:8080 \
  -p 50051:50051 \
  cr.weaviate.io/semitechnologies/weaviate:latest
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Copy .env.example to .env and add your API key:

```bash
cp .env.example .env
# then edit .env:
# LLM_API_KEY=sk-...
```

All other settings (model names, ports, PDF path) are in config.py.
Place your PDF in the project root. By default the app looks for computer_architecture.pdf. To use a different file, update PDF_PATH in config.py.
```bash
uvicorn api_service:app --reload --host 0.0.0.0 --port 8000
```

On first startup the backend automatically loads the PDF, chunks it, embeds it, and indexes it in Weaviate. Subsequent starts skip this step if the collection already exists.
```bash
streamlit run streamlit_app.py
```

Open http://localhost:8501 in your browser.
All settings are in config.py and can be overridden via environment variables or by editing the file directly:
| Variable | Default | Description |
|---|---|---|
| `PDF_PATH` | `computer_architecture.pdf` | Path to the source PDF |
| `LLM_MODEL` | `gpt-4o-mini` | Chat model |
| `EMBEDDING_MODEL` | `text-embedding-3-large` | Embedding model |
| `LLM_API_BASE` | `https://api.openai.com/v1` | API base URL (swap for any OpenAI-compatible endpoint) |
| `LLM_API_KEY` | (from `.env`) | API key |
| `WEAVIATE_URL` | `http://localhost:8080` | Weaviate HTTP endpoint |
| `EMBED_BATCH_SIZE` | `128` | Chunks per embedding request |
| `API_HOST` / `API_PORT` | `localhost` / `8000` | FastAPI server binding |
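Overrides follow the usual environment-variable-with-default pattern. A sketch of how a `config.py` like this typically reads them (illustrative; the actual file may differ in detail):

```python
import os

# Each setting: environment variable if set, otherwise the documented default.
PDF_PATH = os.getenv("PDF_PATH", "computer_architecture.pdf")
LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4o-mini")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-large")
LLM_API_BASE = os.getenv("LLM_API_BASE", "https://api.openai.com/v1")
LLM_API_KEY = os.getenv("LLM_API_KEY", "")
WEAVIATE_URL = os.getenv("WEAVIATE_URL", "http://localhost:8080")
# Numeric settings arrive as strings and must be converted.
EMBED_BATCH_SIZE = int(os.getenv("EMBED_BATCH_SIZE", "128"))
API_HOST = os.getenv("API_HOST", "localhost")
API_PORT = int(os.getenv("API_PORT", "8000"))
```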
The pipeline is a directed graph with four nodes executed in sequence:
1. Contextualize — The LLM reads the last 8 messages and the current question, then produces a `CONTEXT_SUMMARY` describing what prior context is needed. For standalone questions it passes through unchanged.
2. Extract Terms — The LLM distills the reformulated question into 1–3 precise search terms optimized for retrieval.
3. Retrieve — For each search term, a hybrid search runs against Weaviate combining:
   - Dense vector similarity (`text-embedding-3-large`)
   - BM25 keyword search
   - Equal 50/50 alpha weighting

   Results are deduplicated by `chunk_id` and the top 10 chunks by score are kept.
4. Generate — The LLM streams an answer using the retrieved chunks and the context summary as grounding.
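The merge step inside Retrieve can be sketched in plain Python. Assuming each term's hits carry a `chunk_id` and a fused score (as the description above implies), deduplication keeps the best score per chunk and only the top 10 survive:

```python
def merge_results(per_term_hits: list[list[dict]], top_k: int = 10) -> list[dict]:
    """Deduplicate hits by chunk_id (keeping the best score), return top_k.

    Sketch of the merge described above; the real node operates on
    Weaviate hybrid-search results rather than plain dicts.
    """
    best: dict[str, dict] = {}
    for hits in per_term_hits:
        for hit in hits:
            cid = hit["chunk_id"]
            # A chunk may match several search terms; keep its best score.
            if cid not in best or hit["score"] > best[cid]["score"]:
                best[cid] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]
```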
Each session has a session_id (e.g. session_a3f2c1b0). State is persisted in conversations.db (SQLite) via LangGraph's SqliteSaver. Reusing the same session ID after a server restart restores the full conversation history.
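The session ID format shown (`session_` plus 8 hex characters) suggests generation along these lines — a sketch only; the actual `/session` endpoint may construct IDs differently:

```python
import uuid

def new_session_id() -> str:
    # e.g. "session_a3f2c1b0": "session_" plus 8 random lowercase hex chars.
    return f"session_{uuid.uuid4().hex[:8]}"
```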
On first run, the PDF is:
- Loaded page-by-page with `PyPDFLoader`
- Split into ~800-token chunks with 100-token overlap
- Assigned a unique `chunk_id` per chunk
- Embedded in batches and stored in the Weaviate collection `BookChunk_hist`
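The split step is a sliding window: each chunk overlaps the previous one so sentences at chunk boundaries stay retrievable. An illustrative sketch over words (the real loader measures length in tokens, likely via a LangChain text splitter):

```python
def chunk_words(words: list[str], size: int = 800, overlap: int = 100) -> list[list[str]]:
    """Sliding-window chunking: consecutive chunks share `overlap` items.

    Sketch only; the app splits by tokens rather than whitespace words.
    """
    step = size - overlap  # advance by size minus overlap each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # last chunk reached the end of the document
    return chunks
```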
| Method | Path | Description |
|---|---|---|
| `GET` | `/` | API info and available endpoints |
| `GET` | `/health` | Weaviate connectivity check |
| `GET` | `/session` | Generate a new session ID |
| `WS` | `/ws/chat/{session_id}` | Streaming chat (WebSocket) |
| `POST` | `/chat/{session_id}` | Non-streaming chat (REST) |
| `GET` | `/debug/history/{session_id}` | Inspect conversation history for a session |
| `GET` | `/docs` | Interactive Swagger UI |
Client → Server
```json
{ "question": "What is pipelining?" }
```

Server → Client (in order)

```json
{ "type": "connected", "session_id": "...", "message": "Connected to RAG chat service" }
{ "type": "processing" }
{ "type": "stream_start", "search_queries": ["pipelining", "instruction hazards"] }
{ "type": "token", "content": "P" }
{ "type": "token", "content": "ipelining" }
{ "type": "complete", "answer": "...", "sources": [...], "search_queries": [...] }
```

Example non-streaming request over REST:

```bash
curl -X POST http://localhost:8000/chat/my-session \
  -H "Content-Type: application/json" \
  -d '{"question": "What is cache coherence?"}'
```

Force re-indexing (deletes the Weaviate collection so the next startup rebuilds it):
```bash
python reset_weaviate.py
```

Run the pipeline from the terminal (no UI):
```bash
python hybrid_search_graph_history.py
```

Weaviate connection refused

Confirm Docker is running and the container is up:
```bash
docker ps
curl http://localhost:8080/v1/.well-known/ready
```

LLM_API_KEY not found
Make sure .env exists in the project root with LLM_API_KEY=... set, or export the variable in your shell.
PDF not found on startup
Check that your PDF file name matches PDF_PATH in config.py and that it is placed in the project root.
Using a different LLM provider
Set LLM_API_BASE to any OpenAI-compatible endpoint (e.g. Azure OpenAI, or a local Ollama server with OpenAI compatibility) and update LLM_MODEL accordingly.
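For example, a `.env` pointing at a local Ollama server might look like this (hypothetical values; use whatever model you have pulled locally):

```ini
# .env — OpenAI-compatible local endpoint (illustrative values)
LLM_API_BASE=http://localhost:11434/v1
LLM_MODEL=llama3.1
# Ollama ignores the key, but the client library requires one to be set.
LLM_API_KEY=ollama
```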