feat(dna): Add CodeDNA extractor for codebase pattern analysis #205

Merged
DevanshuNEU merged 8 commits into OpenCodeIntel:main from DevanshuNEU:feature/codedna-extractor on Jan 12, 2026

@DevanshuNEU

Collaborator

Summary

Adds a CodeDNA extractor that analyzes codebases to extract architectural patterns, conventions, and constraints. This helps AI assistants understand how to write code consistent with existing patterns.

Changes

New Files

  • backend/services/dna_extractor.py - Core DNA extraction service

Modified Files

  • backend/routes/analysis.py - Added /repos/{repo_id}/dna endpoint
  • backend/dependencies.py - Added dna_extractor singleton
  • mcp-server/server.py - Added get_codebase_dna MCP tool

Features

Pattern Detection

  • Framework Detection: FastAPI, Django, Django REST Framework, Starlette, Flask, aiohttp, Tornado, Express, Next.js, NestJS
  • Auth Patterns: Middleware, decorators, ownership checks, auth context types
  • Middleware Patterns: Framework-specific middleware detection
  • Database Patterns: ORM detection (Django ORM, SQLAlchemy, Prisma, Tortoise, Supabase), ID types, timestamps, RLS, cascades
  • Error Patterns: Exception classes, HTTP exception usage, error logging
  • Logging Patterns: Logger imports, log levels, structured logging, metrics
  • Test Patterns: Framework (pytest/unittest), fixtures, mocks, factories, coverage
  • Config Patterns: Env loading, settings structure, secrets handling
  • Naming Conventions: Function, class, constant, and file naming styles
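
The naming-convention check in the last bullet could be approximated with a few regular expressions. This is a minimal sketch, not the extractor's actual code: `STYLE_PATTERNS`, `classify_name`, and `dominant_style` are hypothetical names, and the real service may classify identifiers differently.

```python
import re
from collections import Counter

# Hypothetical style patterns (assumption: the PR does not list the exact rules).
STYLE_PATTERNS = {
    "snake_case": re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$"),
    "camelCase": re.compile(r"^[a-z]+[a-z0-9]*([A-Z][a-z0-9]*)+$"),
    "PascalCase": re.compile(r"^([A-Z][a-z0-9]+)+$"),
    "UPPER_SNAKE": re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$"),
}


def classify_name(name):
    """Return the first style whose pattern matches, else "other"."""
    for style, pattern in STYLE_PATTERNS.items():
        if pattern.match(name):
            return style
    # Single lowercase words such as "run" are ambiguous and land here.
    return "other"


def dominant_style(names):
    """Return the most common recognised style, or None if nothing matched."""
    counts = Counter(classify_name(n) for n in names)
    counts.pop("other", None)
    return counts.most_common(1)[0][0] if counts else None
```

Calling `dominant_style` once per category (functions, classes, constants, file stems) would give one style per category, which is the shape the bullet describes.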

Robustness

  • File content caching to avoid re-reading files
  • Size limits (1MB per file, 5000 files max)
  • Encoding fallbacks (utf-8, latin-1, cp1252)
  • Binary file detection
  • Symlink handling
  • Performance logging (duration, files read/skipped/errors)
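
The safeguards above can be combined into a single read helper. The sketch below is an illustration under assumptions: the PR names `_safe_read_file()`, `MAX_FILE_SIZE`, and a null-byte check, but the signature, the dict cache, and the choice to skip symlinks outright are guesses made for this example.

```python
from pathlib import Path

MAX_FILE_SIZE = 1024 * 1024  # 1MB per file, as stated in the PR
ENCODINGS = ("utf-8", "latin-1", "cp1252")  # fallback order from the PR


def safe_read_file(path, cache):
    """Return the file's text, or None when the file is skipped.

    `cache` is a plain dict keyed by path (assumption: the real cache
    may be an attribute on the extractor instead).
    """
    p = Path(path)
    key = str(p)
    if key in cache:
        return cache[key]  # avoid re-reading the same file
    try:
        # Assumption: symlinks are skipped rather than resolved.
        if p.is_symlink() or not p.is_file():
            return None
        if p.stat().st_size > MAX_FILE_SIZE:
            return None
        raw = p.read_bytes()
    except OSError:
        return None
    if b"\x00" in raw:  # null bytes indicate a binary file
        return None
    for encoding in ENCODINGS:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        cache[key] = text
        return text
    return None
```

One detail worth knowing: `latin-1` maps every possible byte to a character, so a decode with it never fails. With the order listed in the PR, the `cp1252` fallback is only reached if the order is changed.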

Output Formats

  • JSON format for programmatic access
  • Markdown format for AI consumption
  • Results cached in Supabase for fast retrieval
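
To illustrate the two output formats, here is a rough sketch of a profile object that renders itself as JSON and as Markdown. `DnaProfile` is a hypothetical, trimmed stand-in for the PR's `CodebaseDNA` dataclass. The field names `detected_framework`, `auth_decorators`, and `middleware_patterns` appear in the commit notes, but their types and the Markdown layout are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DnaProfile:
    """Hypothetical, trimmed stand-in for the PR's CodebaseDNA dataclass."""

    detected_framework: str = "unknown"
    auth_decorators: list = field(default_factory=list)
    middleware_patterns: list = field(default_factory=list)

    def to_json(self) -> str:
        # JSON output for programmatic access.
        return json.dumps(asdict(self), indent=2)

    def to_markdown(self) -> str:
        # Markdown output intended for an AI assistant's context window.
        lines = ["# Codebase DNA", "", "## Framework", self.detected_framework]
        sections = (
            ("Auth decorators", self.auth_decorators),
            ("Middleware", self.middleware_patterns),
        )
        for title, items in sections:
            if items:  # omit empty sections to keep the output short
                lines += ["", f"## {title}"]
                lines += [f"- `{item}`" for item in items]
        return "\n".join(lines)
```

Skipping empty sections is a design choice of this sketch; it keeps the Markdown compact when a repository has no detectable patterns in a category.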

API

GET /api/v1/repos/{repo_id}/dna?format=json|markdown
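
A client call to this endpoint might look like the sketch below, using only the standard library. The base URL, the optional bearer token, and the helper names `build_dna_url` and `fetch_dna` are assumptions for illustration; the PR does not describe how the route is authenticated.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def build_dna_url(base_url, repo_id, fmt="json"):
    """Build the DNA endpoint URL; `fmt` must be "json" or "markdown"."""
    if fmt not in ("json", "markdown"):
        raise ValueError(f"unsupported format: {fmt!r}")
    path = f"/api/v1/repos/{quote(repo_id, safe='')}/dna"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'format': fmt})}"


def fetch_dna(base_url, repo_id, fmt="json", token=None):
    """Fetch the DNA profile: a parsed object for json, a string for markdown."""
    request = Request(build_dna_url(base_url, repo_id, fmt))
    if token:
        # Assumption: bearer-token auth; adjust to the deployment's scheme.
        request.add_header("Authorization", f"Bearer {token}")
    with urlopen(request, timeout=30) as response:
        body = response.read().decode("utf-8")
    return json.loads(body) if fmt == "json" else body
```

For example, `fetch_dna("http://localhost:8000", "my-repo-id", fmt="markdown")` would return the Markdown profile as a string, assuming a backend is running at that placeholder address.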

MCP Tool

get_codebase_dna(repo_id: string)

Returns a formatted DNA profile that AI assistants can consult before generating code.

Known Limitations

  • Logging detection may miss some framework-specific patterns
  • JavaScript/TypeScript framework detection is basic compared to Python
  • Pattern detection is based on string matching, not full AST analysis

These will be addressed in follow-up PRs based on real usage feedback.
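
The string-matching limitation is easy to see in a small example. The following is a simplified sketch of substring-based framework detection, not the PR's `_detect_framework()`; the marker table and scoring are assumptions chosen for illustration.

```python
# Hypothetical marker table: substrings whose presence suggests a framework.
FRAMEWORK_MARKERS = {
    "FastAPI": ("from fastapi import", "import fastapi"),
    "Flask": ("from flask import", "import flask"),
    "Django": ("from django", "import django"),
}


def detect_framework(file_contents):
    """Return the framework whose markers appear in the most files, or None."""
    scores = {name: 0 for name in FRAMEWORK_MARKERS}
    for text in file_contents:
        for name, markers in FRAMEWORK_MARKERS.items():
            if any(marker in text for marker in markers):
                scores[name] += 1
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Because the check is a plain substring test, a commented-out line such as `# from flask import Flask` still counts as a Flask hit. An AST-based pass (for Python, via the standard `ast` module) would ignore comments and string literals, at the cost of parsing every file.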

- Add DNAExtractor service that extracts architectural patterns
- Extract auth patterns (middleware, decorators, ownership checks)
- Extract service patterns (singletons, dependencies.py)
- Extract database patterns (UUID, TIMESTAMPTZ, RLS, cascades)
- Extract error handling and logging patterns
- Extract naming conventions
- Add /repos/{repo_id}/dna endpoint with json/markdown format
- Add get_codebase_dna MCP tool for AI assistants
- Cache DNA in repository_insights table
- Fixed save_to_cache to use existing table schema
- Fixed load_from_cache to read from architecture_patterns JSONB
- DNA stored as {codebase_dna: {...}} inside architecture_patterns
- Add detected_framework field to CodebaseDNA
- Add _detect_framework() for FastAPI/Starlette/Flask/Django/Express/Next/Nest
- Add _extract_middleware_patterns() for framework-specific middleware detection
- Improve _extract_auth_patterns() with Starlette/Flask/Django patterns
- Add middleware_patterns field to output
- Update to_markdown() with framework and middleware sections
- Add TestPattern and ConfigPattern dataclasses
- Add Django ORM, SQLAlchemy, Prisma, Tortoise ORM detection
- Add Django + DRF framework detection
- Add aiohttp, tornado framework detection
- Add Django middleware patterns (MIDDLEWARE, MiddlewareMixin, hooks)
- Add DRF permission_classes and authentication_classes detection
- Add test framework detection (pytest, unittest, django.test)
- Add mock library detection (unittest.mock, responses, pytest-mock)
- Add config pattern detection (dotenv, environs, django-environ, pydantic)
- Add secrets handling detection (AWS Secrets Manager, Vault, env vars)
- Update to_markdown() with test and config sections
- Add file content cache to avoid re-reading files
- Add MAX_FILE_SIZE (1MB) and MAX_FILES (5000) limits
- Add _safe_read_file() with encoding fallbacks (utf-8, latin-1, cp1252)
- Add binary file detection (null bytes check)
- Add symlink handling in _discover_files()
- Add path validation in extract_dna()
- Add performance stats logging (files_read, skipped, errors, duration)
- Add .venv and site-packages to SKIP_DIRS
- Add deduplication for auth_decorators list
- Add logging.getLogger() pattern detection
- Add structlog detection
- Improve log level detection (.info, .debug, etc)
- Use _safe_read_file in logging pattern extraction
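
The cache layout mentioned in the notes above, with the DNA stored as `{codebase_dna: {...}}` inside the `architecture_patterns` JSONB column, could be handled with two small helpers. This is a sketch under assumptions: the function names are hypothetical, and whether the real `save_to_cache` preserves other keys in the column is not stated in the PR.

```python
def wrap_dna_for_cache(existing_patterns, dna):
    """Return a new architecture_patterns value with the DNA nested inside.

    Assumption: other keys already in the column are preserved.
    """
    merged = dict(existing_patterns or {})
    merged["codebase_dna"] = dna
    return merged


def unwrap_dna_from_cache(architecture_patterns):
    """Return the cached DNA dict, or None when nothing usable is cached."""
    if not isinstance(architecture_patterns, dict):
        return None
    return architecture_patterns.get("codebase_dna")
```

Nesting under a dedicated key lets the DNA share an existing column without a schema change, which matches the note about reusing the existing table schema.
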
@vercel

vercel Bot commented Jan 11, 2026

@DevanshuNEU is attempting to deploy a commit to the Dev's projects Team on Vercel.

A member of the Team first needs to authorize it.

@vercel

vercel Bot commented Jan 12, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project | Deployment | Review | Updated (UTC)
opencodeintel | Ignored | Preview | Jan 12, 2026 0:57am

@DevanshuNEU DevanshuNEU merged commit 56d2309 into OpenCodeIntel:main Jan 12, 2026
6 checks passed