Commit 9915bab

Merge pull request #295 from DevanshuNEU/docs/assignment-project-report
docs: add architecture diagrams, examples, and project report
2 parents add41ac + 89698cf commit 9915bab

3 files changed

Lines changed: 869 additions & 0 deletions

docs/architecture.md

Lines changed: 324 additions & 0 deletions
@@ -0,0 +1,324 @@
# OpenCodeIntel — System Architecture

## Overview

![System Architecture](https://github.com/user-attachments/assets/5c4f5846-184f-46ec-b4c4-a444cbb6540d)

OpenCodeIntel is a RAG-based (Retrieval-Augmented Generation) code intelligence system. It indexes codebases into a vector database, then assembles per-task context for AI coding assistants via the Model Context Protocol (MCP).

---

## High-Level Architecture

```text
┌────────────────────────────────────────────────────────────────┐
│                      Developer's Machine                       │
│                                                                │
│  ┌──────────┐    ┌──────────────────┐    ┌──────────────────┐  │
│  │  Claude  │    │ Cursor / VS Code │    │  Gemini CLI /    │  │
│  │   Code   │    │    + Copilot     │    │  Any MCP client  │  │
│  └────┬─────┘    └────────┬─────────┘    └────────┬─────────┘  │
│       │                   │                       │            │
│       └───────────────────┼───────────────────────┘            │
│                           │ MCP Protocol                       │
└───────────────────────────┼────────────────────────────────────┘
                            │
                ┌───────────▼───────────┐
                │      MCP Server       │
                │ mcp.opencodeintel.com │
                │  (FastMCP, Railway)   │
                │                       │
                │  Dual transport:      │
                │  • stdio (local)      │
                │  • streamable-http    │
                └───────────┬───────────┘
                            │ REST API
                ┌───────────▼───────────┐
                │      Backend API      │
                │ api.opencodeintel.com │
                │  (FastAPI, Railway)   │
                └──┬──────────┬─────────┘
                   │          │
        ┌──────────▼──┐  ┌────▼──────────┐
        │  Supabase   │  │   Pinecone    │
        │ (Postgres)  │  │  (Vector DB)  │
        │             │  │               │
        │ - users     │  │ - embeddings  │
        │ - repos     │  │ - 1536/3072d  │
        │ - api_keys  │  │ - cosine sim  │
        │ - deps      │  └───────────────┘
        └─────────────┘
```
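
The Pinecone box above lists cosine similarity over 1536- or 3072-dimensional embeddings. As a reminder of what that metric computes, here is a minimal plain-Python sketch. The function name `cosine_similarity` is illustrative only; in the system described here the scoring happens inside Pinecone, not in application code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The metric is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Because the score depends only on direction, two chunks whose embeddings differ in magnitude but point the same way score 1.0.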

---

## Indexing Pipeline (RAG — Knowledge Base Construction)

```text
Repository (GitHub URL or local path)
            │
            ▼
┌─────────────────────────┐
│  Repo Cloning / Access  │
│  (GitPython)            │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  File Filtering         │
│  - Skip: node_modules,  │
│    .git, build dirs     │
│  - Include: .py, .ts,   │
│    .tsx, .js, .go etc   │
│  - include_paths filter │
│    (Path.parts — exact) │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Tree-sitter Parsing    │ ← Language: Python, JS, TS, TSX
│  (Function-level        │
│   extraction — v2)      │
│                         │
│  Extracts per chunk:    │
│  - function name        │
│  - docstring            │
│  - parameters           │
│  - return type          │
│  - surrounding context  │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Embedding Generation   │ ← OpenAI text-embedding-3-small
│  (AsyncOpenAI)          │   or text-embedding-3-large
│                         │
│  Batch size: 100        │
│  Parallel files: 10     │
│  Rich text format:      │
│  "File: X\nFunc: Y\n    │
│   Params: Z\nBody: ..." │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Pinecone Upsert        │ ← Serverless index, AWS us-east-1
│  (Batch: 100 vectors)   │   Cosine similarity metric
│                         │
│  Metadata stored:       │
│  - file_path            │
│  - function_name        │
│  - repo_id              │
│  - chunk_type           │
└─────────────────────────┘
```
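
Three steps in the pipeline above lend themselves to a short illustration: component-wise path filtering with `Path.parts`, the "rich text" string that gets embedded, and batching in groups of 100. The sketch below is not the project's code. The names `should_index`, `is_under`, `embedding_text`, and `batched` are hypothetical, and the contents of `SKIP_DIRS` and `INCLUDE_SUFFIXES` are assumptions, since the diagram only says "build dirs" and ".go etc".

```python
from pathlib import PurePosixPath
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Assumed values: the diagram names these sets only partially.
SKIP_DIRS = {"node_modules", ".git", "build", "dist"}
INCLUDE_SUFFIXES = {".py", ".ts", ".tsx", ".js", ".go"}


def is_under(path: PurePosixPath, prefix: PurePosixPath) -> bool:
    """Compare whole path components, so 'src/api' does not match 'src/api_v2'."""
    return path.parts[: len(prefix.parts)] == prefix.parts


def should_index(rel_path: str, include_paths: Sequence[str] = ()) -> bool:
    """Decide whether a repo-relative file path enters the indexing pipeline."""
    path = PurePosixPath(rel_path)
    if any(part in SKIP_DIRS for part in path.parts):
        return False
    if path.suffix not in INCLUDE_SUFFIXES:
        return False
    if include_paths and not any(
        is_under(path, PurePosixPath(p)) for p in include_paths
    ):
        return False
    return True


def embedding_text(file_path: str, func_name: str,
                   params: Sequence[str], body: str) -> str:
    """Build the text that is embedded, following the format in the diagram."""
    return (
        f"File: {file_path}\nFunc: {func_name}\n"
        f"Params: {', '.join(params)}\nBody: {body}"
    )


def batched(items: Sequence[T], size: int = 100) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

Matching on `parts` rather than on a string prefix is what makes the filter exact: a plain `startswith("src/api")` would also admit `src/api_v2/users.py`.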

---

## Query Pipeline (RAG — Retrieval + Context Assembly)

```text
User Task: "add rate limiting to the settings endpoints"
            │
            ▼
┌─────────────────────────┐
│  Query Expansion        │ ← SearchEnhancer (GPT-4o-mini)
│  (search_enhancer.py)   │   Generates 3 semantic variants
│                         │   to improve recall
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Pinecone Semantic      │
│  Search                 │
│  - top-k results        │
│  - cosine similarity    │
│  - repo_id filter       │
│  - keyword boosting     │
│    for exact matches    │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  1-Hop Dependency       │ ← dependency_analyzer.py
│  Expansion              │   Tree-sitter AST import graph
│  (context_assembler.py) │   Adds files that import or
│                         │   are imported by top results
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Rules Matching         │ ← Reads CLAUDE.md / AGENTS.md /
│  (context_assembler.py) │   .cursorrules / CONVENTIONS.md
│                         │   Splits by ## headers
│                         │   Matches sections to task
│                         │   using regex + keyword overlap
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Token Budget Packing   │ ← Default budget: 1500 tokens
│  (context_assembler.py) │   Ranks files by relevance score
│                         │   Packs until budget is hit
│                         │   ~1 token per 4 chars estimate
└───────────┬─────────────┘
            │
            ▼
Assembled Context (Markdown):
  → backend/routes/settings.py (94% relevant)
  → backend/services/user_limits.py (87% relevant)
  → backend/middleware/auth.py (81% relevant)
  → Rule: Use LimitCheckError, not a new exception
  → Rule: require_auth on all user routes
  Total: ~1,400 tokens
```
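
The token budget packing stage can be pictured as a greedy loop over candidates sorted by relevance. The following is a minimal sketch under stated assumptions, not the contents of `context_assembler.py`: the `Candidate` type and both function names are hypothetical, and "packs until budget is hit" is read here as "stop at the first file that no longer fits".

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Candidate:
    path: str
    content: str
    score: float  # relevance score, higher is better


def estimate_tokens(text: str) -> int:
    """Rough estimate from the diagram: about 1 token per 4 characters."""
    return (len(text) + 3) // 4  # ceiling division


def pack_context(candidates: Iterable[Candidate], budget: int = 1500) -> List[Candidate]:
    """Take candidates in descending score order until the next one would overflow."""
    packed: List[Candidate] = []
    used = 0
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(cand.content)
        if used + cost > budget:
            break  # assumption: stop here rather than try smaller, lower-ranked files
        packed.append(cand)
        used += cost
    return packed
```

An alternative policy would be to `continue` instead of `break`, which fills leftover budget with smaller files at the cost of skipping a higher-ranked one. The source does not say which policy the assembler uses.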

---

## DNA Extraction Pipeline (Prompt Engineering Component)

```text
Repository Files
            │
            ▼
┌──────────────────────────────┐
│  dna_extractor.py            │
│  (Tree-sitter AST scan)      │
│                              │
│  Detects:                    │
│  - Auth patterns             │
│    (middleware, decorators,  │
│     ownership checks)        │
│  - Service patterns          │
│    (singletons, DI)          │
│  - DB patterns               │
│    (ORM, RLS, ID types)      │
│  - Error patterns            │
│    (exception classes)       │
│  - Logging patterns          │
│  - Naming conventions        │
│  - Style conventions         │
│    (async ratio, type        │
│     hint coverage)           │
└───────────┬──────────────────┘
            │
            ▼
      AGENTS.md file
  (Machine-readable rules
   for AI coding assistants)
```

---

## Dependency Analysis Pipeline

```text
Repository Source Files
            │
            ▼
┌─────────────────────────┐
│  dependency_analyzer.py │
│  (Tree-sitter AST)      │
│                         │
│  Languages:             │
│  Python: import X,      │
│          from X import Y│
│  JS/TS/TSX: import X    │
│             require('X')│
│                         │
│  Resolves relative paths│
│  Handles index.ts files │
│  Uses Path.parts for    │
│  exact prefix matching  │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│  Supabase Storage       │
│  file_dependencies table│
│  (source → target edges)│
└───────────┬─────────────┘
            │
            ▼
  Context assembler uses
  dep graph to expand
  1 hop beyond search hits
```
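
The 1-hop expansion over the stored source → target edges can be sketched as a single pass over the edge list. This is an illustration, not the project's implementation: `expand_one_hop` and the `Edge` alias are hypothetical names, and edges are assumed to arrive as `(source, target)` pairs of repo-relative paths, where `source` imports `target`.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (source file, target file that it imports)


def expand_one_hop(hits: Iterable[str], edges: Iterable[Edge]) -> List[str]:
    """Return the search hits followed by their direct import neighbours."""
    hit_list = list(dict.fromkeys(hits))  # de-duplicate, keep ranking order
    hit_set = set(hit_list)
    neighbours: Set[str] = set()
    for source, target in edges:
        if source in hit_set and target not in hit_set:
            neighbours.add(target)  # a hit imports this file
        if target in hit_set and source not in hit_set:
            neighbours.add(source)  # this file imports a hit
    # Neighbours are appended in sorted order so the output is deterministic.
    return hit_list + sorted(neighbours)
```

Files two or more hops away are deliberately left out, which keeps the expansion bounded before the token budget is applied.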

---

## MCP Protocol Layer

```text
MCP Client (Claude Code / Cursor / etc.)
                │
                │ JSON-RPC over stdio OR HTTP
                ▼
┌─────────────────────────────────┐
│  FastMCP Server (server.py)     │
│                                 │
│  Tools exposed:                 │
│  ┌─────────────────────────┐    │
│  │  search_code            │    │
│  │  get_context_for_task   │    │ ← Core RAG tool
│  │  get_codebase_dna       │    │
│  │  get_dependency_graph   │    │
│  │  analyze_impact         │    │
│  │  analyze_code_style     │    │
│  │  add_repository         │    │
│  │  index_repository       │    │
│  │  list_repositories      │    │
│  └─────────────────────────┘    │
│                                 │
│  Auth: MCP_API_KEY (ci_ prefix) │
│        SHA-256 hashed in DB     │
└─────────────────────────────────┘
```
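
The auth note above (keys with a `ci_` prefix, stored as SHA-256 hashes) suggests a flow like the one below. This is a sketch using only the standard library. The function names are hypothetical, and the key length and the use of `secrets.token_urlsafe` are assumptions rather than details from the source.

```python
import hashlib
import hmac
import secrets

KEY_PREFIX = "ci_"


def generate_api_key() -> str:
    """Create a new key; the random part and its length are assumptions."""
    return KEY_PREFIX + secrets.token_urlsafe(32)


def hash_api_key(api_key: str) -> str:
    """Return the hex SHA-256 digest, which is what would be stored in the DB."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Check prefix, then compare digests in constant time."""
    if not presented.startswith(KEY_PREFIX):
        return False
    return hmac.compare_digest(hash_api_key(presented), stored_hash)
```

Storing only the digest means a database leak does not expose usable keys, and `hmac.compare_digest` avoids leaking how many leading characters matched.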

---

## Tech Stack Summary

| Layer | Technology |
|---|---|
| Backend API | FastAPI, Python 3.11+, Uvicorn |
| Frontend | React 18, TypeScript, Vite, Tailwind, shadcn/ui, Bun |
| MCP Server | FastMCP, Python, Streamable HTTP + stdio |
| Vector DB | Pinecone (Serverless, AWS us-east-1) |
| Embeddings | OpenAI text-embedding-3-small / text-embedding-3-large |
| Code Parsing | tree-sitter (Python, JS, TS, TSX) |
| Relational DB | Supabase (PostgreSQL + RLS) |
| Auth | JWT + API key (SHA-256 hashed) |
| Infrastructure | Railway (backend + MCP), Vercel (frontend) |
| CI/CD | GitHub Actions (path-filtered per service) |
| Package Manager | Bun (frontend), pip (backend/MCP) |

---

## Data Flow: End-to-End

```text
1. Developer adds repo URL at opencodeintel.com

2. Backend clones repo, tree-sitter parses all files

3. Functions/chunks embedded via OpenAI → stored in Pinecone

4. Import graph built → stored in Supabase

5. DNA extracted → AGENTS.md written to repo

6. Developer configures MCP in Claude Code / Cursor

7. AI assistant calls get_context_for_task("add rate limiting...")

8. MCP server → Backend: query expansion + vector search

9. Top hits + 1-hop deps + matching rules assembled

10. Context package returned to AI assistant (~1,400 tokens)

11. AI generates correct code, first try
```
