Commit 4e65fd8

Merge pull request #296 from DevanshuNEU/docs/genai-submission-artifacts
docs: add prompt engineering inventory and clean prose pass
2 parents 9915bab + c1b8edc commit 4e65fd8

3 files changed

Lines changed: 353 additions & 19 deletions

docs/architecture.md

Lines changed: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
1-
# OpenCodeIntel System Architecture
1+
# OpenCodeIntel: System Architecture
22

33
## Overview
44

@@ -54,7 +54,7 @@ OpenCodeIntel is a RAG-based (Retrieval-Augmented Generation) code intelligence
5454

5555
---
5656

57-
## Indexing Pipeline (RAG Knowledge Base Construction)
57+
## Indexing Pipeline (RAG: Knowledge Base Construction)
5858

5959
```text
6060
Repository (GitHub URL or local path)
@@ -117,7 +117,7 @@ Repository (GitHub URL or local path)
117117

118118
---
119119

120-
## Query Pipeline (RAG Retrieval + Context Assembly)
120+
## Query Pipeline (RAG: Retrieval + Context Assembly)
121121

122122
```text
123123
User Task: "add rate limiting to the settings endpoints"

docs/project-report.md

Lines changed: 16 additions & 16 deletions
@@ -1,4 +1,4 @@
1-
# OpenCodeIntel Project Report
1+
# OpenCodeIntel: Project Report
22

33
**Course:** Generative AI
44
**Project:** AI-Powered Code Intelligence System
@@ -14,13 +14,13 @@
1414

1515
OpenCodeIntel (OCI) is an open-source, production-grade generative AI system that solves a real problem: AI coding assistants like Claude Code, Cursor, and Copilot do not know your codebase. They hallucinate function names, use wrong patterns, and put code in wrong files because they lack context.
1616

17-
OCI builds a persistent, semantic knowledge base from your codebase and delivers exactly the right context the relevant files, functions, and project rules to any AI assistant the moment it needs them, via the Model Context Protocol (MCP).
17+
OCI builds a persistent, semantic knowledge base from your codebase and delivers exactly the right context (the relevant files, functions, and project rules) to any AI assistant the moment it needs them, via the Model Context Protocol (MCP).
1818

1919
This is the open-source alternative to Augment Code, which raised $252M to solve the same problem for enterprise teams. OCI solves it for every developer, free and self-hostable.
2020

2121
**Generative AI components implemented:**
22-
- Retrieval-Augmented Generation (RAG)primary
23-
- Prompt Engineering per-task context assembly with systematic rule injection
22+
- Retrieval-Augmented Generation (RAG): primary component
23+
- Prompt Engineering: per-task context assembly with systematic rule injection
2424

2525
---
2626

@@ -30,16 +30,16 @@ See [`docs/architecture.md`](./architecture.md) for full diagrams.
3030

3131
The system has four layers:
3232

33-
**Layer 1 Indexing (RAG Knowledge Base Construction)**
33+
**Layer 1: Indexing (RAG Knowledge Base Construction)**
3434
Repositories are parsed with tree-sitter at the function level. Each function is embedded using OpenAI `text-embedding-3-small` and stored in Pinecone (vector database). Import relationships are extracted into a dependency graph stored in Supabase.
3535

36-
**Layer 2 Retrieval (RAG Query Pipeline)**
36+
**Layer 2: Retrieval (RAG Query Pipeline)**
3737
Given a natural language task description, the system expands the query into semantic variants, runs cosine similarity search against Pinecone, expands results by one hop through the dependency graph, and matches project rules from `CLAUDE.md` / `AGENTS.md` / `.cursorrules`.
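
The retrieval steps in the Layer 2 description can be sketched roughly as below. This is an illustrative sketch, not OCI's actual code: the function names, the in-memory `index` dict (standing in for Pinecone), and the dict-based `graph` (standing in for the dependency graph in Supabase) are all hypothetical. Query expansion and rule matching are left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k):
    """Return the k paths whose vectors are most similar to the query.

    `index` is a hypothetical in-memory stand-in for the vector store:
    a dict mapping a path to its embedding.
    """
    scored = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [path for path, _ in scored[:k]]


def expand_one_hop(hits, graph):
    """Append direct neighbors of each hit, preserving order, no duplicates.

    `graph` is a hypothetical adjacency dict: path -> list of imported paths.
    """
    expanded = list(hits)
    seen = set(hits)
    for path in hits:
        for neighbor in graph.get(path, []):
            if neighbor not in seen:
                seen.add(neighbor)
                expanded.append(neighbor)
    return expanded
```

Keeping the original hits first and appending neighbors afterwards means the similarity ranking is preserved when the expanded list is later trimmed to a budget.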
3838

39-
**Layer 3 Context Assembly (Prompt Engineering)**
39+
**Layer 3: Context Assembly (Prompt Engineering)**
4040
Retrieved chunks are ranked by relevance score and packed into a token budget (default 1,500 tokens). Matched rules are injected. The result is a structured markdown context package delivered to the AI assistant.
4141

42-
**Layer 4 MCP Protocol**
42+
**Layer 4: MCP Protocol**
4343
An MCP server exposes all capabilities as tools consumable by any MCP-compatible AI client (Claude Code, Cursor, Copilot, Gemini CLI) via stdio or streamable HTTP.
4444

4545
### Architecture Diagram
@@ -114,15 +114,15 @@ CLAUDE.md → AGENTS.md → .cursorrules → .codeintel/rules.md
114114
→ CODING_GUIDELINES.md
115115
```
116116

117-
Rules files are split by `##` markdown headers into discrete sections. Each section is matched against the user's task using regex patterns and keyword overlap scoring. Only relevant rule sections are included not the entire file to stay within the token budget.
117+
Rules files are split by `##` markdown headers into discrete sections. Each section is matched against the user's task using regex patterns and keyword overlap scoring. Only relevant rule sections are included (not the entire file) to stay within the token budget.
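
A minimal sketch of the section splitting and keyword-overlap matching described above. The helper names, the four-character minimum for keywords, and the `min_overlap` threshold are assumptions made for illustration; the report does not specify OCI's regex patterns or scoring weights.

```python
import re


def split_sections(markdown):
    """Split a rules file into (title, body) pairs on '## ' headers.

    Text before the first '## ' header is ignored in this sketch.
    """
    sections = []
    title, body = None, []
    for line in markdown.splitlines():
        if line.startswith("## "):
            if title is not None:
                sections.append((title, "\n".join(body).strip()))
            title, body = line[3:].strip(), []
        elif title is not None:
            body.append(line)
    if title is not None:
        sections.append((title, "\n".join(body).strip()))
    return sections


def keywords(text):
    """Lowercased word tokens of length >= 4 (a crude stopword filter)."""
    return {w for w in re.findall(r"[a-z0-9_]+", text.lower()) if len(w) >= 4}


def match_rules(task, markdown, min_overlap=1):
    """Return titles of sections sharing keywords with the task, best first."""
    task_words = keywords(task)
    matched = []
    for title, body in split_sections(markdown):
        overlap = len(task_words & keywords(title + " " + body))
        if overlap >= min_overlap:
            matched.append((overlap, title))
    matched.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in matched]
```

For the task "add rate limiting to the settings endpoints", a section titled "Rate limiting" that mentions endpoints would be selected, while an unrelated "Database" section would be left out of the context package.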
118118

119119
**Token Budget Management**
120120

121121
Context is packed greedily within a configurable token budget (default: 1,500 tokens, estimated at 1 token per 4 characters). Files are ordered by relevance score. The assembler stops adding files when the budget would be exceeded. This constraint forces precision: the system returns exactly what's needed, not everything it finds.
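
The greedy packing described above might look like the following sketch. The function names and the `(path, score, text)` tuple shape are hypothetical, and rounding the 4-characters-per-token estimate up is an assumption; only the default budget of 1,500 tokens, the ordering by relevance, and the stop-on-overflow behavior come from the report.

```python
def estimate_tokens(text):
    """Estimate tokens at 1 token per 4 characters, rounded up (assumed)."""
    return (len(text) + 3) // 4


def pack_context(files, budget=1500):
    """Greedily pack files into the token budget, highest score first.

    `files` is a list of (path, score, text) tuples. Packing stops at the
    first file that would push the total over the budget.
    """
    packed, used = [], 0
    for path, _score, text in sorted(files, key=lambda f: f[1], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget:
            break
        packed.append(path)
        used += cost
    return packed, used
```

Stopping at the first overflow, rather than skipping ahead to smaller files, keeps the output a strict prefix of the relevance ranking.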
122122

123123
**System Prompt Design for DNA Extraction**
124124

125-
The `get_codebase_dna` tool uses tree-sitter to statically analyze the codebase and extract architectural patterns auth patterns, service patterns, DB patterns, error patterns, naming conventions without calling an LLM. This deterministic extraction produces machine-readable rules that serve as grounding context for AI code generation, reducing hallucination of project-specific names and patterns.
125+
The `get_codebase_dna` tool uses tree-sitter to statically analyze the codebase and extract architectural patterns (auth patterns, service patterns, DB patterns, error patterns, naming conventions) without calling an LLM. This deterministic extraction produces machine-readable rules that serve as grounding context for AI code generation, reducing hallucination of project-specific names and patterns.
126126

127127
---
128128

@@ -142,7 +142,7 @@ The `get_codebase_dna` tool uses tree-sitter to statically analyze the codebase
142142
| `search_v2/` | Function-level extraction (tree-sitter) |
143143
| `middleware/auth.py` | JWT + API key authentication |
144144

145-
All services use the singleton pattern (instantiated once in `dependencies.py`, injected via FastAPI `Depends()`). This is critical for Pinecone connection reuse creating a new connection per request would add ~800ms of latency.
145+
All services use the singleton pattern (instantiated once in `dependencies.py`, injected via FastAPI `Depends()`). This is critical for Pinecone connection reuse; creating a new connection per request would add ~800ms of latency.
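
One common way to get a single shared instance is a cached provider function, sketched below. This is not necessarily how `dependencies.py` does it: `VectorStoreClient` and `get_vector_store` are hypothetical names, and `functools.lru_cache` is a swapped-in technique chosen to keep the example self-contained.

```python
from functools import lru_cache


class VectorStoreClient:
    """Hypothetical stand-in for a client that is expensive to connect."""

    instances_created = 0

    def __init__(self):
        # A real client would open its network connection here.
        VectorStoreClient.instances_created += 1


@lru_cache(maxsize=1)
def get_vector_store():
    """Create the client on first call and return the same object afterwards."""
    return VectorStoreClient()
```

In a FastAPI route, a provider like this would typically be declared as a parameter default, for example `store = Depends(get_vector_store)`, so every request handler receives the same already-connected client.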
146146

147147
**Multi-tenancy:** Every Pinecone query filters by `repo_id` (UUID). Supabase Row-Level Security (RLS) enforces that users can only access their own repositories at the database level, independent of application logic.
148148

@@ -156,11 +156,11 @@ Tools exposed: `search_code`, `get_context_for_task`, `get_codebase_dna`, `get_d
156156

157157
### 4.3 Frontend (React 18, TypeScript, Vite)
158158

159-
Dashboard for repository management, API key generation, and indexing status. Built with shadcn/ui components and Tailwind CSS. All server state managed via TanStack Query (`useQuery`/`useMutation`) no raw `fetch` in `useEffect`.
159+
Dashboard for repository management, API key generation, and indexing status. Built with shadcn/ui components and Tailwind CSS. All server state managed via TanStack Query (`useQuery`/`useMutation`); no raw `fetch` in `useEffect`.
160160

161161
### 4.4 Infrastructure
162162

163-
- **Backend + MCP:** Railway (separate services, shared environment — using `MCP_API_KEY` to avoid collision with backend's `API_KEY`)
163+
- **Backend + MCP:** Railway (separate services, shared environment; uses `MCP_API_KEY` to avoid collision with backend's `API_KEY`)
164164
- **Frontend:** Vercel
165165
- **CI/CD:** GitHub Actions with path-filtered jobs (backend tests only run when `backend/**` changes)
166166
- **Domains:** GoDaddy → Vercel/Railway
@@ -257,19 +257,19 @@ Switching embedding models between `text-embedding-3-small` (1536d) and `text-em
257257
## 8. Ethical Considerations
258258

259259
**Data Privacy**
260-
Code is a form of intellectual property. OCI's hosted service stores repository embeddings (vector representations, not raw source code) in Pinecone and metadata in Supabase. API keys are stored as SHA-256 hashes the raw key is shown once and never persisted. Users retain full ownership of their data and can delete repositories at any time. The self-hosted deployment option (Docker Compose) gives users complete control over data residency.
260+
Code is a form of intellectual property. OCI's hosted service stores repository embeddings (vector representations, not raw source code) in Pinecone and metadata in Supabase. API keys are stored as SHA-256 hashes; the raw key is shown once and never persisted. Users retain full ownership of their data and can delete repositories at any time. The self-hosted deployment option (Docker Compose) gives users complete control over data residency.
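
A minimal sketch of the hash-only key storage described above, using only the Python standard library. The function names and the 32-byte key length are assumptions; the report states only that keys are stored as SHA-256 hashes and that the raw key is shown once.

```python
import hashlib
import hmac
import secrets


def hash_api_key(raw_key):
    """Return the hex SHA-256 digest that would be persisted."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def generate_api_key():
    """Create a new key; return (raw_key, stored_hash).

    The raw key is meant to be shown to the user once and then discarded.
    """
    raw_key = secrets.token_urlsafe(32)
    return raw_key, hash_api_key(raw_key)


def verify_api_key(presented_key, stored_hash):
    """Hash the presented key and compare in constant time."""
    return hmac.compare_digest(hash_api_key(presented_key), stored_hash)
```

Because only the digest is stored, a leaked database row does not reveal a usable key; the trade-off is that a lost key cannot be recovered, only replaced.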
261261

262262
**Access Control**
263263
Multi-tenancy is enforced at two layers: application-level (all queries filter by `repo_id` tied to the authenticated user) and database-level (Supabase Row-Level Security). These two independent layers prevent cross-user data leakage even if one layer has a bug.
264264

265265
**Potential Misuse**
266-
The context assembly feature could theoretically be used to extract sensitive patterns from a codebase (auth logic, secret handling). OCI mitigates this by requiring user authentication for all repository access and by not storing raw source code only embeddings.
266+
The context assembly feature could theoretically be used to extract sensitive patterns from a codebase (auth logic, secret handling). OCI mitigates this by requiring user authentication for all repository access and by not storing raw source code, only embeddings.
267267

268268
**Bias and Representation**
269269
The embedding model (OpenAI `text-embedding-3-small`) may perform better on English-language identifiers and comments than on other languages. Codebases with non-English naming conventions may see lower retrieval recall. This is a known limitation.
270270

271271
**Copyright**
272-
OCI does not reproduce or redistribute source code. It stores vector embeddings (real-valued floating point arrays) which significantly reduces the risk of reconstructing original source code. Retrieval returns file paths and function signatures to help the AI locate relevant code not the code itself verbatim (unless the user has authorized access to that repo).
272+
OCI does not reproduce or redistribute source code. It stores vector embeddings (real-valued floating point arrays) which significantly reduces the risk of reconstructing original source code. Retrieval returns file paths and function signatures to help the AI locate relevant code, not the code itself verbatim (unless the user has authorized access to that repo).
273273

274274
**Content Filtering**
275275
The system does not filter for malicious code patterns. It indexes whatever the user points it at. Users are responsible for ensuring they have authorization to index the repositories they connect.
