docs/project-report.md
16 additions & 16 deletions
@@ -1,4 +1,4 @@
-
# OpenCodeIntel — Project Report
+
# OpenCodeIntel: Project Report
**Course:** Generative AI
**Project:** AI-Powered Code Intelligence System
@@ -14,13 +14,13 @@
OpenCodeIntel (OCI) is an open-source, production-grade generative AI system that solves a real problem: AI coding assistants like Claude Code, Cursor, and Copilot do not know your codebase. They hallucinate function names, use the wrong patterns, and put code in the wrong files because they lack context.
-
OCI builds a persistent, semantic knowledge base from your codebase and delivers exactly the right context — the relevant files, functions, and project rules — to any AI assistant the moment it needs them, via the Model Context Protocol (MCP).
+
OCI builds a persistent, semantic knowledge base from your codebase and delivers exactly the right context (the relevant files, functions, and project rules) to any AI assistant the moment it needs them, via the Model Context Protocol (MCP).
This is the open-source alternative to Augment Code, which raised $252M to solve the same problem for enterprise teams. OCI solves it for every developer, free and self-hostable.
- Prompt Engineering: per-task context assembly with systematic rule injection
---
@@ -30,16 +30,16 @@ See [`docs/architecture.md`](./architecture.md) for full diagrams.
The system has four layers:
-
**Layer 1 — Indexing (RAG Knowledge Base Construction)**
+
**Layer 1: Indexing (RAG Knowledge Base Construction)**
Repositories are parsed with tree-sitter at the function level. Each function is embedded using OpenAI `text-embedding-3-small` and stored in Pinecone (vector database). Import relationships are extracted into a dependency graph stored in Supabase.
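The function-level chunking and import extraction described above can be sketched as follows. This is a minimal illustration, not the project's indexer: it substitutes Python's standard-library `ast` module for tree-sitter so that it runs without dependencies, it handles Python source only, and the names `FunctionChunk`, `extract_function_chunks`, and `extract_imports` are hypothetical. The embedding and Pinecone upsert steps are omitted.

```python
import ast
from dataclasses import dataclass


@dataclass
class FunctionChunk:
    """One function-level unit that would later be embedded and stored."""
    name: str
    start_line: int
    end_line: int
    source: str


def extract_function_chunks(source: str) -> list[FunctionChunk]:
    """Return one chunk per (sync or async) function, in file order."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            text = ast.get_source_segment(source, node) or ""
            chunks.append(
                FunctionChunk(node.name, node.lineno, node.end_lineno, text)
            )
    chunks.sort(key=lambda chunk: chunk.start_line)
    return chunks


def extract_imports(source: str) -> list[str]:
    """Return the sorted module names imported by this file.

    These names are the raw edges of a dependency graph. Relative imports
    without a module name (``from . import x``) are skipped in this sketch.
    """
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return sorted(modules)
```

Each chunk's `source` is the text that would be sent to the embedding model, and the import list is what would be written to the dependency graph table.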
-
**Layer 2 — Retrieval (RAG Query Pipeline)**
+
**Layer 2: Retrieval (RAG Query Pipeline)**
Given a natural language task description, the system expands the query into semantic variants, runs cosine similarity search against Pinecone, expands results by one hop through the dependency graph, and matches project rules from `CLAUDE.md` / `AGENTS.md` / `.cursorrules`.
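Two of the retrieval steps above, cosine similarity ranking and one-hop dependency expansion, can be sketched in plain Python. This is an illustration under stated assumptions: an in-memory dictionary stands in for Pinecone, query expansion and rule matching are left out, and the function names and the shape of `imports_graph` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2):
    """Rank every indexed file against the query and keep the k best."""
    scored = [(path, cosine_similarity(query_vec, vec)) for path, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def expand_one_hop(hits: list[str], imports_graph: dict[str, list[str]]) -> list[str]:
    """Append the direct imports of each hit, preserving order and removing duplicates."""
    expanded = list(hits)
    seen = set(hits)
    for path in hits:
        for dependency in imports_graph.get(path, []):
            if dependency not in seen:
                seen.add(dependency)
                expanded.append(dependency)
    return expanded
```

Expanding by exactly one hop pulls in the files a hit directly imports without walking the whole graph, which keeps the candidate set small before the token budget is applied.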
Retrieved chunks are ranked by relevance score and packed into a token budget (default 1,500 tokens). Matched rules are injected. The result is a structured markdown context package delivered to the AI assistant.
-
**Layer 4 — MCP Protocol**
+
**Layer 4: MCP Protocol**
An MCP server exposes all capabilities as tools consumable by any MCP-compatible AI client (Claude Code, Cursor, Copilot, Gemini CLI) via stdio or streamable HTTP.
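For orientation, a tool invocation over MCP is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below only builds such a message for the stdio transport, where each message is a single newline-terminated line of JSON. The helper name `build_tool_call` and the `repo_id` argument are assumptions for illustration; the actual argument schema of each OCI tool is not shown in this report, and a real session begins with an `initialize` handshake that is omitted here.

```python
import json
from itertools import count

# Monotonically increasing request ids, as JSON-RPC expects one per request.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a single line of JSON."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # json.dumps emits no raw newlines by default, so the trailing "\n"
    # is the only line break and acts as the message delimiter.
    return json.dumps(message) + "\n"
```

A client would write the returned line to the server's standard input, for example `build_tool_call("get_codebase_dna", {"repo_id": "demo"})`, and read the matching response from its standard output.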
Rules files are split by `##` markdown headers into discrete sections. Each section is matched against the user's task using regex patterns and keyword overlap scoring. Only relevant rule sections are included — not the entire file — to stay within the token budget.
+
Rules files are split by `##` markdown headers into discrete sections. Each section is matched against the user's task using regex patterns and keyword overlap scoring. Only relevant rule sections are included (not the entire file) to stay within the token budget.
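A minimal sketch of this section matching, assuming a simple word-overlap score: it splits on `##` headers and keeps sections whose overlap with the task clears a threshold. The regex-pattern half of the matching is not reproduced, and the function names and the `threshold` value of 0.2 are hypothetical rather than the project's settings.

```python
import re


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Split a rules file into (title, body) pairs at each `## ` header.

    Text before the first `## ` header is ignored; `###` sub-headers stay
    inside the body of their parent section.
    """
    sections = []
    title, body = None, []
    for line in markdown.splitlines():
        if line.startswith("## "):
            if title is not None:
                sections.append((title, "\n".join(body).strip()))
            title, body = line[3:].strip(), []
        elif title is not None:
            body.append(line)
    if title is not None:
        sections.append((title, "\n".join(body).strip()))
    return sections


def tokenize(text: str) -> set[str]:
    """Lowercase word set used for overlap scoring."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def overlap_score(task: str, title: str, body: str) -> float:
    """Fraction of the task's words that also appear in the section."""
    task_words = tokenize(task)
    if not task_words:
        return 0.0
    return len(task_words & tokenize(title + " " + body)) / len(task_words)


def match_rules(task: str, markdown: str, threshold: float = 0.2):
    """Return the sections scoring at or above the threshold, best first."""
    scored = [
        (overlap_score(task, title, body), title, body)
        for title, body in split_sections(markdown)
    ]
    scored = [item for item in scored if item[0] >= threshold]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(title, body) for _, title, body in scored]
```

Because only the matching sections are returned, an unrelated section such as testing conventions costs no tokens when the task concerns authentication.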
**Token Budget Management**
Context is packed greedily within a configurable token budget (default: 1,500 tokens, estimated at 1 token per 4 characters). Files are ordered by relevance score. The assembler stops adding files when the budget would be exceeded. This constraint forces precision: the system returns exactly what's needed, not everything it finds.
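The greedy packing rule can be sketched as below. The 4-characters-per-token estimate and the 1,500-token default come from the description above; rounding the estimate up and the names `estimate_tokens` and `pack_context` are assumptions made for this illustration.

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count at 1 token per 4 characters, rounded up."""
    return (len(text) + 3) // 4


def pack_context(files: list[tuple[str, float, str]], budget: int = 1500):
    """Greedily pack (path, score, text) entries into the token budget.

    Files are taken in descending relevance order, and packing stops at the
    first file that would push the total over the budget.
    """
    ranked = sorted(files, key=lambda entry: entry[1], reverse=True)
    packed, used = [], 0
    for path, _score, text in ranked:
        cost = estimate_tokens(text)
        if used + cost > budget:
            break  # stop here rather than skipping ahead to smaller files
        packed.append(path)
        used += cost
    return packed, used
```

Stopping at the first overflow, rather than continuing to look for smaller files that still fit, keeps the output strictly ordered by relevance: a low-scoring file never displaces a higher-scoring one just because it is short.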
**System Prompt Design for DNA Extraction**
-
The `get_codebase_dna` tool uses tree-sitter to statically analyze the codebase and extract architectural patterns — auth patterns, service patterns, DB patterns, error patterns, naming conventions — without calling an LLM. This deterministic extraction produces machine-readable rules that serve as grounding context for AI code generation, reducing hallucination of project-specific names and patterns.
+
The `get_codebase_dna` tool uses tree-sitter to statically analyze the codebase and extract architectural patterns (auth patterns, service patterns, DB patterns, error patterns, naming conventions) without calling an LLM. This deterministic extraction produces machine-readable rules that serve as grounding context for AI code generation, reducing hallucination of project-specific names and patterns.
---
@@ -142,7 +142,7 @@ The `get_codebase_dna` tool uses tree-sitter to statically analyze the codebase
|`middleware/auth.py`| JWT + API key authentication |
-
All services use the singleton pattern (instantiated once in `dependencies.py`, injected via FastAPI `Depends()`). This is critical for Pinecone connection reuse — creating a new connection per request would add ~800ms of latency.
+
All services use the singleton pattern (instantiated once in `dependencies.py`, injected via FastAPI `Depends()`). This is critical for Pinecone connection reuse; creating a new connection per request would add ~800ms of latency.
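One common way to get a single shared instance behind a dependency provider is to cache the provider function. The sketch below shows that idea with the standard library only. `VectorSearchService` and `get_search_service` are hypothetical names, and whether `dependencies.py` uses `lru_cache` or module-level instances is not stated in this report.

```python
from functools import lru_cache


class VectorSearchService:
    """Stand-in for a service that holds an expensive client connection."""

    instances_created = 0

    def __init__(self) -> None:
        # A real service would open its vector-database connection here,
        # which is the cost the singleton pattern avoids repeating.
        type(self).instances_created += 1


@lru_cache(maxsize=1)
def get_search_service() -> VectorSearchService:
    """Dependency provider: builds the service once, then returns the cached instance."""
    return VectorSearchService()
```

In a FastAPI route, the provider would be declared as a parameter default, for example `service: VectorSearchService = Depends(get_search_service)`, so every request receives the same instance and the connection is established once per process.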
**Multi-tenancy:** Every Pinecone query filters by `repo_id` (UUID). Supabase Row-Level Security (RLS) enforces that users can only access their own repositories at the database level, independent of application logic.
Dashboard for repository management, API key generation, and indexing status. Built with shadcn/ui components and Tailwind CSS. All server state managed via TanStack Query (`useQuery`/`useMutation`) — no raw `fetch` in `useEffect`.
+
Dashboard for repository management, API key generation, and indexing status. Built with shadcn/ui components and Tailwind CSS. All server state managed via TanStack Query (`useQuery`/`useMutation`); no raw `fetch` in `useEffect`.
### 4.4 Infrastructure
-
-**Backend + MCP:** Railway (separate services, shared environment — using`MCP_API_KEY` to avoid collision with backend's `API_KEY`)
+
- **Backend + MCP:** Railway (separate services, shared environment; uses `MCP_API_KEY` to avoid collision with backend's `API_KEY`)
- **Frontend:** Vercel
- **CI/CD:** GitHub Actions with path-filtered jobs (backend tests only run when `backend/**` changes)
- **Domains:** GoDaddy → Vercel/Railway
@@ -257,19 +257,19 @@ Switching embedding models between `text-embedding-3-small` (1536d) and `text-em
## 8. Ethical Considerations
**Data Privacy**
-
Code is a form of intellectual property. OCI's hosted service stores repository embeddings (vector representations, not raw source code) in Pinecone and metadata in Supabase. API keys are stored as SHA-256 hashes — the raw key is shown once and never persisted. Users retain full ownership of their data and can delete repositories at any time. The self-hosted deployment option (Docker Compose) gives users complete control over data residency.
+
Code is a form of intellectual property. OCI's hosted service stores repository embeddings (vector representations, not raw source code) in Pinecone and metadata in Supabase. API keys are stored as SHA-256 hashes; the raw key is shown once and never persisted. Users retain full ownership of their data and can delete repositories at any time. The self-hosted deployment option (Docker Compose) gives users complete control over data residency.
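The hash-only storage of API keys can be sketched with the standard library. This is an illustration, not the project's code: the `oci_` prefix, the key length, and the function names are assumptions. Only the SHA-256 hashing and the show-once behavior come from the description above.

```python
import hashlib
import hmac
import secrets


def hash_api_key(raw_key: str) -> str:
    """Return the hex SHA-256 digest that would be persisted instead of the key."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def generate_api_key() -> tuple[str, str]:
    """Create a new key and its hash.

    The raw key is returned once so it can be shown to the user;
    only the hash is meant to be stored.
    """
    raw_key = "oci_" + secrets.token_urlsafe(32)  # prefix is hypothetical
    return raw_key, hash_api_key(raw_key)


def verify_api_key(presented_key: str, stored_hash: str) -> bool:
    """Hash the presented key and compare in constant time."""
    return hmac.compare_digest(hash_api_key(presented_key), stored_hash)
```

A plain SHA-256 digest is reasonable here because the keys are long random strings rather than user-chosen passwords, so there is no low-entropy input to brute-force.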
**Access Control**
Multi-tenancy is enforced at two layers: application-level (all queries filter by `repo_id` tied to the authenticated user) and database-level (Supabase Row-Level Security). These two independent layers prevent cross-user data leakage even if one layer has a bug.
**Potential Misuse**
-
The context assembly feature could theoretically be used to extract sensitive patterns from a codebase (auth logic, secret handling). OCI mitigates this by requiring user authentication for all repository access and by not storing raw source code — only embeddings.
+
The context assembly feature could theoretically be used to extract sensitive patterns from a codebase (auth logic, secret handling). OCI mitigates this by requiring user authentication for all repository access and by not storing raw source code, only embeddings.
**Bias and Representation**
The embedding model (OpenAI `text-embedding-3-small`) may perform better on English-language identifiers and comments than on other languages. Codebases with non-English naming conventions may see lower retrieval recall. This is a known limitation.
**Copyright**
-
OCI does not reproduce or redistribute source code. It stores vector embeddings (real-valued floating point arrays) which significantly reduces the risk of reconstructing original source code. Retrieval returns file paths and function signatures to help the AI locate relevant code — not the code itself verbatim (unless the user has authorized access to that repo).
+
OCI does not reproduce or redistribute source code. It stores vector embeddings (real-valued floating-point arrays), which significantly reduces the risk of reconstructing the original source code. Retrieval returns file paths and function signatures to help the AI locate relevant code, not the code itself verbatim (unless the user has authorized access to that repo).
**Content Filtering**
The system does not filter for malicious code patterns. It indexes whatever the user points it at. Users are responsible for ensuring they have authorization to index the repositories they connect.