Commit 953fde4

docs: add architecture diagrams, examples, and project report
- docs/architecture.md: full system diagrams for indexing pipeline, query pipeline, DNA extraction, dependency analysis, and MCP layer
- docs/examples/README.md: real example outputs for all 5 MCP tools with performance benchmarks and retrieval quality metrics
- docs/project-report.md: complete assignment documentation covering RAG + prompt engineering implementation, performance metrics, challenges/solutions, future improvements, and ethical considerations
1 parent add41ac commit 953fde4

3 files changed

Lines changed: 863 additions & 0 deletions

docs/architecture.md

Lines changed: 320 additions & 0 deletions
@@ -0,0 +1,320 @@
# OpenCodeIntel — System Architecture

## Overview

OpenCodeIntel is a RAG-based (Retrieval-Augmented Generation) code intelligence system. It indexes codebases into a vector database, then assembles per-task context for AI coding assistants via the Model Context Protocol (MCP).

---

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                       Developer's Machine                       │
│                                                                 │
│  ┌──────────┐    ┌────────────────┐    ┌──────────────────┐     │
│  │  Claude  │    │  Cursor / VSC  │    │  Gemini CLI /    │     │
│  │  Code    │    │  + Copilot     │    │  Any MCP client  │     │
│  └────┬─────┘    └───────┬────────┘    └────────┬─────────┘     │
│       │                  │                      │               │
│       └──────────────────┼──────────────────────┘               │
│                          │ MCP Protocol                         │
└──────────────────────────┼──────────────────────────────────────┘
                           │
               ┌───────────▼───────────┐
               │      MCP Server       │
               │  mcp.opencodeintel.com│
               │  (FastMCP, Railway)   │
               │                       │
               │  Dual transport:      │
               │  • stdio (local)      │
               │  • streamable-http    │
               └───────────┬───────────┘
                           │ REST API
               ┌───────────▼───────────┐
               │      Backend API      │
               │  api.opencodeintel.com│
               │  (FastAPI, Railway)   │
               └──┬──────────┬─────────┘
                  │          │
       ┌──────────▼──┐  ┌────▼──────────┐
       │  Supabase   │  │  Pinecone     │
       │  (Postgres) │  │  (Vector DB)  │
       │             │  │               │
       │  - users    │  │  - embeddings │
       │  - repos    │  │  - 1536/3072d │
       │  - api_keys │  │  - cosine sim │
       │  - deps     │  └───────────────┘
       └─────────────┘
```

---

## Indexing Pipeline (RAG — Knowledge Base Construction)

```
Repository (GitHub URL or local path)
            │
            ▼
┌─────────────────────────┐
│  Repo Cloning / Access  │
│  (GitPython)            │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  File Filtering         │
│  - Skip: node_modules,  │
│    .git, build dirs     │
│  - Include: .py, .ts,   │
│    .tsx, .js, .go etc   │
│  - include_paths filter │
│    (Path.parts, exact)  │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Tree-sitter Parsing    │ ← Language: Python, JS, TS, TSX
│  (Function-level        │
│   extraction, v2)       │
│                         │
│  Extracts per chunk:    │
│  - function name        │
│  - docstring            │
│  - parameters           │
│  - return type          │
│  - surrounding context  │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Embedding Generation   │ ← OpenAI text-embedding-3-small
│  (AsyncOpenAI)          │   or text-embedding-3-large
│                         │
│  Batch size: 100        │
│  Parallel files: 10     │
│  Rich text format:      │
│  "File: X\nFunc: Y\n    │
│   Params: Z\nBody: ..." │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Pinecone Upsert        │ ← Serverless index, AWS us-east-1
│  (Batch: 100 vectors)   │   Cosine similarity metric
│                         │
│  Metadata stored:       │
│  - file_path            │
│  - function_name        │
│  - repo_id              │
│  - chunk_type           │
└─────────────────────────┘
```
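
Three steps from the diagram above can be sketched in a few lines: the `include_paths` filter matched on whole path components, the rich text format sent to the embedding model, and batching in groups of 100. This is a minimal illustration, not the project's code. The names `is_included`, `format_chunk`, and `batched` are hypothetical, and treating an empty filter as "include everything" is an assumption.

```python
from pathlib import PurePosixPath
from typing import Iterator, Sequence

BATCH_SIZE = 100  # batch size named in the diagram


def is_included(file_path: str, include_paths: Sequence[str]) -> bool:
    """Match on whole path components, so "src/api" does not match "src/api_v2"."""
    if not include_paths:
        return True  # assumption: an empty filter includes every file
    parts = PurePosixPath(file_path).parts
    for prefix in include_paths:
        prefix_parts = PurePosixPath(prefix).parts
        if parts[: len(prefix_parts)] == prefix_parts:
            return True
    return False


def format_chunk(file_path: str, func_name: str, params: str, body: str) -> str:
    """Build the text to embed, following the layout shown in the diagram."""
    return f"File: {file_path}\nFunc: {func_name}\nParams: {params}\nBody: {body}"


def batched(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start : start + size]
```

Comparing `Path.parts` tuples rather than raw strings is what makes the match exact: a plain `startswith("src/api")` check would also accept `src/api_v2/routes.py`.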

---

## Query Pipeline (RAG — Retrieval + Context Assembly)

```
User Task: "add rate limiting to the settings endpoints"
            │
            ▼
┌─────────────────────────┐
│  Query Expansion        │ ← SearchEnhancer (GPT-4o-mini)
│  (search_enhancer.py)   │   Generates 3 semantic variants
│                         │   to improve recall
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Pinecone Semantic      │
│  Search                 │
│  - top-k results        │
│  - cosine similarity    │
│  - repo_id filter       │
│  - keyword boosting     │
│    for exact matches    │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  1-Hop Dependency       │ ← dependency_analyzer.py
│  Expansion              │   Tree-sitter AST import graph
│  (context_assembler.py) │   Adds files that import or
│                         │   are imported by top results
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Rules Matching         │ ← Reads CLAUDE.md / AGENTS.md /
│  (context_assembler.py) │   .cursorrules / CONVENTIONS.md
│                         │   Splits by ## headers
│                         │   Matches sections to task
│                         │   using regex + keyword overlap
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Token Budget Packing   │ ← Default budget: 1500 tokens
│  (context_assembler.py) │   Ranks files by relevance score
│                         │   Packs until budget is hit
│                         │   ~1 token per 4 chars estimate
└───────────┬─────────────┘
            │
            ▼
Assembled Context (Markdown):
  → backend/routes/settings.py (94% relevant)
  → backend/services/user_limits.py (87% relevant)
  → backend/middleware/auth.py (81% relevant)
  → Rule: Use LimitCheckError, not a new exception
  → Rule: require_auth on all user routes
  Total: ~1,400 tokens
```
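
The last two stages of the diagram above (rules matching and token budget packing) might look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of `context_assembler.py`: the names `split_rules`, `match_rules`, `Candidate`, and `pack` are hypothetical, keyword overlap is approximated by shared words of four or more letters, and packing stops at the first candidate that would exceed the budget.

```python
import re
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    """Rough estimate from the diagram: about 1 token per 4 characters."""
    return max(1, len(text) // 4)


def split_rules(markdown: str) -> dict[str, str]:
    """Split a rules file into sections keyed by their "## " header."""
    sections: dict[str, str] = {}
    title = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            title = line[3:].strip()
            sections[title] = ""
        elif title is not None:
            sections[title] += line + "\n"
    return sections


def _keywords(text: str) -> set[str]:
    # Assumption: words shorter than 4 letters ("to", "the") carry no signal.
    return set(re.findall(r"[a-z_]{4,}", text.lower()))


def match_rules(task: str, sections: dict[str, str], min_overlap: int = 1) -> list[str]:
    """Return titles of sections sharing at least `min_overlap` keywords with the task."""
    task_words = _keywords(task)
    return [
        title
        for title, body in sections.items()
        if len(task_words & _keywords(title + " " + body)) >= min_overlap
    ]


@dataclass
class Candidate:
    path: str
    text: str
    score: float  # relevance in [0, 1]


def pack(candidates: list[Candidate], budget: int = 1500) -> list[Candidate]:
    """Greedy packing: take candidates by descending score until the budget is hit."""
    packed: list[Candidate] = []
    used = 0
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(cand.text)
        if used + cost > budget:
            break  # assumption: stop at the first candidate that does not fit
        packed.append(cand)
        used += cost
    return packed
```

A variant that skips an oversized candidate and keeps trying smaller ones would fill the budget more tightly, at the cost of occasionally preferring a less relevant file.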

---

## DNA Extraction Pipeline (Prompt Engineering Component)

```
Repository Files
            │
            ▼
┌──────────────────────────┐
│  dna_extractor.py        │
│  (Tree-sitter AST scan)  │
│                          │
│  Detects:                │
│  - Auth patterns         │
│  (middleware, decorators,│
│   ownership checks)      │
│  - Service patterns      │
│  (singletons, DI)        │
│  - DB patterns           │
│  (ORM, RLS, ID types)    │
│  - Error patterns        │
│  (exception classes)     │
│  - Logging patterns      │
│  - Naming conventions    │
│  - Style conventions     │
│  (async ratio, type      │
│   hint coverage)         │
└───────────┬──────────────┘
            │
            ▼
      AGENTS.md file
 (Machine-readable rules
  for AI coding assistants)
```
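
To make the "style conventions" item above concrete, here is a small sketch that computes an async ratio and a type hint coverage figure for one Python file. It deliberately swaps in Python's standard `ast` module for tree-sitter so the example has no dependencies. The function name `style_stats` is hypothetical, and counting a function as "hinted" when it has a return annotation is an assumption about how coverage could be defined.

```python
import ast


def style_stats(source: str) -> dict[str, float]:
    """Return the share of async functions and of functions with a return annotation."""
    tree = ast.parse(source)
    funcs = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    if not funcs:
        return {"async_ratio": 0.0, "type_hint_coverage": 0.0}

    async_count = sum(isinstance(f, ast.AsyncFunctionDef) for f in funcs)
    # Assumption: a function counts as hinted when it declares a return type.
    hinted_count = sum(f.returns is not None for f in funcs)
    return {
        "async_ratio": async_count / len(funcs),
        "type_hint_coverage": hinted_count / len(funcs),
    }
```

Aggregating these ratios across a repository would give the kind of signal that could be turned into a rule such as "most handlers are async; new handlers should be too".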

---

## Dependency Analysis Pipeline

```
Repository Source Files
            │
            ▼
┌─────────────────────────┐
│  dependency_analyzer.py │
│  (Tree-sitter AST)      │
│                         │
│  Languages:             │
│  Python: import X,      │
│          from X import Y│
│  JS/TS/TSX: import X    │
│             require('X')│
│                         │
│  Resolves relative paths│
│  Handles index.ts files │
│  Uses Path.parts for    │
│  exact prefix matching  │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Supabase Storage       │
│  file_dependencies table│
│  (source → target edges)│
└───────────┬─────────────┘
            │
            ▼
  Context assembler uses
  dep graph to expand
  1 hop beyond search hits
```
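
The 1-hop expansion at the bottom of the diagram above can be sketched as a lookup over source → target edges. This is a minimal illustration that assumes the edges have already been loaded as `(source, target)` tuples; the function name `expand_one_hop` and the in-memory representation are hypothetical, not the project's API.

```python
from collections import defaultdict


def expand_one_hop(hits: list[str], edges: list[tuple[str, str]]) -> list[str]:
    """Return the search hits followed by files that import them or that they import."""
    imports: dict[str, set[str]] = defaultdict(set)      # source -> files it imports
    imported_by: dict[str, set[str]] = defaultdict(set)  # target -> files importing it
    for source, target in edges:
        imports[source].add(target)
        imported_by[target].add(source)

    expanded = list(hits)
    seen = set(hits)
    for hit in hits:
        # Sorted only to keep the output deterministic.
        for neighbor in sorted(imports[hit] | imported_by[hit]):
            if neighbor not in seen:
                seen.add(neighbor)
                expanded.append(neighbor)
    return expanded
```

Only direct neighbors of the original hits are added; neighbors of neighbors are left out, which keeps the expansion bounded on densely connected codebases.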

---

## MCP Protocol Layer

```
MCP Client (Claude Code / Cursor / etc.)
                │
                │ JSON-RPC over stdio OR HTTP
                ▼
┌─────────────────────────────────┐
│  FastMCP Server (server.py)     │
│                                 │
│  Tools exposed:                 │
│  ┌─────────────────────────┐    │
│  │  search_code            │    │
│  │  get_context_for_task   │    │ ← Core RAG tool
│  │  get_codebase_dna       │    │
│  │  get_dependency_graph   │    │
│  │  analyze_impact         │    │
│  │  analyze_code_style     │    │
│  │  add_repository         │    │
│  │  index_repository       │    │
│  │  list_repositories      │    │
│  └─────────────────────────┘    │
│                                 │
│  Auth: MCP_API_KEY (ci_ prefix) │
│  SHA-256 hashed in DB           │
└─────────────────────────────────┘
```
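
The auth note in the diagram above (a `ci_`-prefixed key stored as a SHA-256 hash) could be implemented along these lines. This is a sketch using only the Python standard library. The function names, the use of `secrets.token_urlsafe(32)` to generate the random part, and the constant-time comparison are assumptions rather than details confirmed by the source.

```python
import hashlib
import hmac
import secrets

KEY_PREFIX = "ci_"  # prefix named in the diagram


def hash_api_key(raw_key: str) -> str:
    """Return the hex SHA-256 digest that would be stored instead of the raw key."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def generate_api_key() -> tuple[str, str]:
    """Create a new key. Returns (raw key shown once to the user, hash to store)."""
    raw_key = KEY_PREFIX + secrets.token_urlsafe(32)
    return raw_key, hash_api_key(raw_key)


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Check a presented key against the stored hash in constant time."""
    if not presented.startswith(KEY_PREFIX):
        return False
    return hmac.compare_digest(hash_api_key(presented), stored_hash)
```

Storing only the digest means a leaked database row does not reveal a usable key, and the prefix lets the server reject malformed credentials before hashing.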

---

## Tech Stack Summary

| Layer | Technology |
|---|---|
| Backend API | FastAPI, Python 3.11+, Uvicorn |
| Frontend | React 18, TypeScript, Vite, Tailwind, shadcn/ui, Bun |
| MCP Server | FastMCP, Python, Streamable HTTP + stdio |
| Vector DB | Pinecone (Serverless, AWS us-east-1) |
| Embeddings | OpenAI text-embedding-3-small / text-embedding-3-large |
| Code Parsing | tree-sitter (Python, JS, TS, TSX) |
| Relational DB | Supabase (PostgreSQL + RLS) |
| Auth | JWT + API key (SHA-256 hashed) |
| Infrastructure | Railway (backend + MCP), Vercel (frontend) |
| CI/CD | GitHub Actions (path-filtered per service) |
| Package Manager | Bun (frontend), pip (backend/MCP) |

---

## Data Flow: End-to-End

```
1. Developer adds repo URL at opencodeintel.com
   ▼
2. Backend clones repo, tree-sitter parses all files
   ▼
3. Functions/chunks embedded via OpenAI → stored in Pinecone
   ▼
4. Import graph built → stored in Supabase
   ▼
5. DNA extracted → AGENTS.md written to repo
   ▼
6. Developer configures MCP in Claude Code / Cursor
   ▼
7. AI assistant calls get_context_for_task("add rate limiting...")
   ▼
8. MCP server → Backend: query expansion + vector search
   ▼
9. Top hits + 1-hop deps + matching rules assembled
   ▼
10. Context package returned to AI assistant (~1,400 tokens)
   ▼
11. AI generates correct code, first try
```
