CLI-first semantic code search with MCP integration
Production Release (v2.5.56): Stable and actively maintained. LanceDB is now the default backend for better performance and stability.
A modern, fast, and intelligent code search tool that understands your codebase through semantic analysis and AST parsing. Built with Python, powered by LanceDB, and designed for developer productivity.
- Semantic Search: Find code by meaning, not just keywords
- AST-Aware Parsing: Understands code structure (functions, classes, methods)
- Multi-Language Support: 13 languages - Python, JavaScript, TypeScript, C#, Dart/Flutter, PHP, Ruby, Java, Go, Rust, HTML, and Markdown/Text (with extensible architecture)
- Knowledge Graph: Temporal knowledge graph with KuzuDB for entity extraction and relationship mapping (kg build, kg status, kg query)
- Interactive Visualization: D3.js-powered visualization with 5+ views (Treemap, Sunburst, Force Graph, Knowledge Graph, Heatmap)
- Development Narratives: Generate git history narratives with the story command (markdown, JSON, HTML output)
- Real-time Indexing: File watching with automatic index updates
- Automatic Version Tracking: Smart reindexing on tool upgrades
- Local-First: Complete privacy with on-device processing
- Zero Configuration: Auto-detects project structure and languages
- CLI-First Design: Simple commands for immediate productivity
- Rich Output: Syntax highlighting, similarity scores, context
- Fast Performance: Sub-second search responses, efficient indexing with pipeline parallelism (37% faster); IVF-PQ vector index delivers 4.9x faster queries (3.4ms vs 16.7ms)
- Modern Architecture: Async-first, type-safe, modular design
- Semi-Automatic Reindexing: Multiple strategies without daemon processes
- 17 MCP Tools: Comprehensive MCP integration for AI assistants (search, analysis, documentation, KG, story generation)
- Chat Mode: LLM-powered code Q&A with iterative refinement (up to 30 queries), deep search, and KG query tools
- CodeT5+ Embeddings: Code-specific embeddings via the index-code command (Salesforce/codet5p-110m-embedding)
- Vector Database: LanceDB (serverless, file-based)
- Embedding Models: Configurable sentence transformers with GPU acceleration
- Smart Reindexing: Search-triggered, Git hooks, scheduled tasks, and manual options
- Extensible Parsers: Plugin architecture for new languages
- Configuration Management: Project-specific settings
- Production Ready: Write buffering, auto-indexing, comprehensive error handling
- Performance: Apple Silicon M4 Max optimizations (2-4x speedup with MPS)
# Install from PyPI (recommended)
pip install mcp-vector-search
# Or with UV (faster)
uv pip install mcp-vector-search
# Or install from source
git clone https://github.com/bobmatnyc/mcp-vector-search.git
cd mcp-vector-search
uv sync && uv pip install -e .
Verify Installation:
# Check that all dependencies are installed correctly
mcp-vector-search doctor
# Should show all ✓ marks
# If you see missing dependencies, try:
pip install --upgrade mcp-vector-search
The fastest way to get started is a single, completely hands-off command:
# Smart zero-config setup (recommended)
mcp-vector-search setup
What setup does automatically:
- ✅ Detects your project's languages and file types
- ✅ Initializes semantic search with optimal settings
- ✅ Indexes your entire codebase
- ✅ Configures ALL installed MCP platforms (Claude Code, Cursor, etc.)
- ✅ Uses native Claude CLI integration (claude mcp add) when available
- ✅ Falls back to .mcp.json if Claude CLI is not available
- ✅ Sets up file watching for auto-reindex
- ✅ Zero user input required!
Behind the scenes:
- Server name: mcp (for consistency with other MCP projects)
- Command: uv run python -m mcp_vector_search.mcp.server {PROJECT_ROOT}
- File watching: Enabled via MCP_ENABLE_FILE_WATCHING=true
- Integration method: Native claude mcp add (or .mcp.json fallback)
Example output:
Smart Setup for mcp-vector-search
Detecting project...
  ✅ Found 3 language(s): Python, JavaScript, TypeScript
  ✅ Detected 8 file type(s)
  ✅ Found 2 platform(s): claude-code, cursor
Configuring...
  ✅ Embedding model: sentence-transformers/all-MiniLM-L6-v2
Initializing...
  ✅ Vector database created
  ✅ Configuration saved
Indexing codebase...
  ✅ Indexing completed in 12.3s
Configuring MCP integrations...
  ✅ Using Claude CLI for automatic setup
  ✅ Registered with Claude CLI
  ✅ Configured 2 platform(s)
Setup Complete!
Options:
# Force re-setup
mcp-vector-search setup --force
# Verbose output for debugging (shows Claude CLI commands)
mcp-vector-search setup --verbose
For more control over the installation process:
# Manual setup with MCP integration
mcp-vector-search install --with-mcp
# Custom file extensions
mcp-vector-search install --extensions .py,.js,.ts,.dart
# Skip automatic indexing
mcp-vector-search install --no-auto-index
# Just initialize (no indexing or MCP)
mcp-vector-search init
Automatic (Recommended):
# One command sets up all detected platforms
mcp-vector-search setup
Manual Platform Installation:
# Add Claude Code integration (project-scoped)
mcp-vector-search install claude-code
# Add Cursor IDE integration (global)
mcp-vector-search install cursor
# See all available platforms
mcp-vector-search install list
Note: The setup command uses native claude mcp add when Claude CLI is available, providing better integration than manual .mcp.json creation.
# Remove specific platform
mcp-vector-search uninstall claude-code
# Remove all integrations
mcp-vector-search uninstall --all
# List configured integrations
mcp-vector-search uninstall list
# Search your code
mcp-vector-search search "authentication logic"
mcp-vector-search search "database connection setup"
mcp-vector-search search "error handling patterns"
# Index your codebase (if not done during setup)
mcp-vector-search index
# Index with code-specific embeddings (CodeT5+)
mcp-vector-search index-code
# Check project status
mcp-vector-search status
# Start file watching (auto-update index)
mcp-vector-search watch
# Interactive visualization (5+ views)
mcp-vector-search visualize
# Generate development narrative from git history
mcp-vector-search story
# Knowledge graph operations
mcp-vector-search kg build
mcp-vector-search kg status
mcp-vector-search kg query "find all Python functions"
# Chat mode with LLM
mcp-vector-search chat "explain the authentication flow"
# Code analysis
mcp-vector-search analyze complexity
mcp-vector-search analyze dead-code
The CLI includes intelligent command suggestions for typos:
# Typos are detected automatically and the closest command is suggested
$ mcp-vector-search serach "auth"
No such command 'serach'. Did you mean 'search'?
$ mcp-vector-search indx
No such command 'indx'. Did you mean 'index'?
See docs/guides/cli-usage.md for more details.
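Suggestions like these are commonly produced with fuzzy string matching. The following is a minimal sketch using Python's standard difflib module. It is an illustration only and not necessarily how mcp-vector-search (or its CLI framework) implements the feature; the COMMANDS list and both helper functions are hypothetical.

```python
import difflib
from typing import List, Optional

# Hypothetical subset of command names, taken from the examples in this README.
COMMANDS: List[str] = [
    "search", "index", "status", "watch",
    "visualize", "story", "chat", "analyze",
]


def suggest_command(typed: str, commands: Optional[List[str]] = None) -> Optional[str]:
    """Return the closest known command name, or None if nothing is similar enough."""
    candidates = COMMANDS if commands is None else commands
    # cutoff=0.6 is difflib's default similarity threshold (assumed here, not tuned).
    matches = difflib.get_close_matches(typed, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else None


def format_error(typed: str) -> str:
    """Build an error message in the style shown above."""
    suggestion = suggest_command(typed)
    message = f"No such command '{typed}'."
    if suggestion is not None:
        message += f" Did you mean '{suggestion}'?"
    return message


if __name__ == "__main__":
    print(format_error("serach"))
    print(format_error("indx"))
```

A stricter cutoff reduces misleading suggestions at the cost of missing heavier typos.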
This project uses semantic versioning with an automated release workflow.
- make version-show - Display current version
- make release-patch - Create patch release
- make publish - Publish to PyPI
See docs/development/versioning.md for complete documentation.
Context-aware code review using your entire codebase as context, not just diff analysis!
Traditional code review tools only see individual files or diffs. MCP Vector Search analyzes code with full codebase context by:
- Semantic Search: Finding related patterns and similar implementations
- Knowledge Graph: Understanding dependencies and callers
- LLM Analysis: Deep analysis with language-specific standards
- Smart Caching: 5x speedup with intelligent result caching
# Security review of your codebase
mvs analyze review security
# Review a pull request with full context
mvs analyze review-pr --baseline main --head feature-branch
# Review only changed files (fast!)
mvs analyze review security --changed-only --baseline main
# Run multiple review types at once
mvs analyze review --types security,quality,architecture
| Type | Focus | Key Checks |
|---|---|---|
| security | OWASP Top 10, CWE | SQL injection, XSS, auth flaws, hardcoded secrets |
| architecture | SOLID principles | Coupling, circular deps, god classes, SRP violations |
| performance | Efficiency | N+1 queries, O(n²) algorithms, blocking I/O |
| quality | Maintainability | Code smells, duplication, magic numbers, dead code |
| testing | Test coverage | Missing tests, edge cases, test quality |
| documentation | Code docs | Missing docstrings, TODOs, outdated comments |
The killer feature is reviewing PRs using the entire codebase as context:
# Review PR with context-aware analysis
mvs analyze review-pr --baseline main --format github-json
# For each changed file, finds:
# ✓ Similar patterns in codebase (consistency checking)
# ✓ Callers and dependencies (impact analysis)
# ✓ Existing tests (coverage gaps)
# ✓ Language-specific idioms (12 languages supported)
Context Strategy:
Changed File → Vector Search (similar patterns)
             → Knowledge Graph (callers, deps)
             → Test Discovery (coverage)
             → LLM Analysis (with full context)
             → Actionable Comments
12 languages with language-specific idioms, anti-patterns, and security checks:
Python • TypeScript • JavaScript • Java • C# • Ruby • Go • Rust • PHP • Swift • Kotlin • Scala
Each language has tailored standards:
- Python: PEP 8, type hints, context managers, SQL injection patterns
- TypeScript: Strict mode, no any, XSS patterns
- Java: SOLID principles, Optional over null, XXE patterns
- Ruby: Guard clauses, blocks, RuboCop standards
- Go: Error handling, goroutines, interfaces
Create .mcp-vector-search/review-instructions.yaml:
language_standards:
  python:
    - "Enforce type hints on all public functions"
    - "Use Pydantic for data validation"
scope_standards:
  src/auth:
    - "All auth functions must have audit logging"
custom_review_focus:
  security:
    - "Flag any hardcoded credentials"
Automatically reads and applies standards from your existing config files:
- Python: pyproject.toml, .flake8, mypy.ini, ruff.toml
- TypeScript: tsconfig.json, .eslintrc.json
- Ruby: .rubocop.yml
- Java: checkstyle.xml, pom.xml
- +8 more languages
# .github/workflows/code-review.yml
- name: Review PR
  run: |
    mvs analyze review-pr \
      --baseline ${{ github.base_ref }} \
      --format sarif \
      --output review.sarif
- name: Upload to Security tab
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: review.sarif
- console: Rich, colored output for humans
- json: Machine-readable structured data
- sarif: GitHub Security tab integration
- markdown: Reports for documentation
- github-json: PR comments (summary + inline)
- Vector Search: <0.5s (find relevant code)
- KG Queries: <0.2s (relationships)
- LLM Analysis: 10-15s (deep analysis)
- Cache Hit: 5x speedup on repeat reviews
Smart Caching: Unchanged code chunks return cached findings instantly.
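The caching idea above can be sketched as a lookup keyed by a hash of the chunk's content, so that an unchanged chunk never reaches the slow analysis step twice. This is a minimal illustration under stated assumptions: the FindingsCache class, its key layout, and the analyze callback are hypothetical and do not describe the tool's actual cache.

```python
import hashlib
from typing import Callable, Dict, List


class FindingsCache:
    """Cache review findings keyed by review type plus a hash of the chunk text."""

    def __init__(self) -> None:
        self._store: Dict[str, List[str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(chunk_text: str, review_type: str) -> str:
        # Any edit to the chunk changes the digest, which invalidates the entry.
        digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        return f"{review_type}:{digest}"

    def get_or_compute(
        self,
        chunk_text: str,
        review_type: str,
        analyze: Callable[[str], List[str]],
    ) -> List[str]:
        """Return cached findings, calling analyze() only on a cache miss."""
        key = self._key(chunk_text, review_type)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        findings = analyze(chunk_text)
        self._store[key] = findings
        return findings


if __name__ == "__main__":
    cache = FindingsCache()
    slow_review = lambda text: ["hardcoded secret"] if "password =" in text else []
    code = 'password = "hunter2"'
    cache.get_or_compute(code, "security", slow_review)  # miss: runs the analysis
    cache.get_or_compute(code, "security", slow_review)  # hit: analysis skipped
    print(cache.hits, cache.misses)
```

Including the review type in the key keeps a security finding from being returned for a quality review of the same chunk.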
Complete Documentation: Architecture, examples, best practices
CI/CD Integration Guide: GitHub Actions, GitLab CI, pre-commit hooks
Multi-Language Support: 12 languages with standards
Audit codebases against their stated privacy policies using semantic code search and knowledge graph analysis.
# Install with auditor dependencies
pip install 'mcp-vector-search[auditor]'
# Run a privacy audit
mvs audit run --target /path/to/repo --policy /path/to/repo/PRIVACY.md
# Check for policy/code drift
mvs audit drift-check --target /path/to/repo --policy /path/to/repo/PRIVACY.md
# Verify a certification
mvs audit verify audits/<target>/latest/
# List audit history
mvs audit list
- Extract Claims: Parses privacy policies into testable assertions (hybrid text analysis + LLM)
- Collect Evidence: Queries the codebase via vector search, hybrid search, and knowledge graph
- Judge Verdicts: LLM evaluates each claim against evidence (PASS / FAIL / INSUFFICIENT / MANUAL_REVIEW)
- Certify: Produces signed certification documents with per-claim verdicts and evidence
- Dual output: Certification saved in both auditor repo and target repo
- Multiple LLM backends: OpenRouter and Anthropic (auto-detected)
- GitHub Actions: On-demand audit workflow + daily drift detection
- GPG signing: Optional cryptographic signatures for certification integrity
- Auto-issue creation: GitHub issues for claims needing review
- .audit-ignore.yml: Suppress specific claims with documented justifications
See docs/features/privacy-auditor.md for full documentation.
# One command to do everything (recommended)
mcp-vector-search setup
# What it does automatically:
# - Detects project languages and file types
# - Initializes semantic search
# - Indexes entire codebase
# - Configures all detected MCP platforms
# - Sets up file watching
# - Zero configuration needed!
# Force re-setup
mcp-vector-search setup --force
# Verbose output for debugging
mcp-vector-search setup --verbose
Key Features:
- Zero Configuration: No user input required
- Smart Detection: Automatically discovers languages and platforms
- Comprehensive: Handles init + index + MCP setup in one command
- Idempotent: Safe to run multiple times
- Fast: Timeout-protected scanning (won't hang on large projects)
- Team-Friendly: Commit .mcp.json to share configuration
When to use:
- ✅ First-time project setup
- ✅ Team onboarding
- ✅ Quick testing in new codebases
- ✅ Setting up multiple MCP platforms at once
# Manual setup with more control
mcp-vector-search install
# Install with all MCP integrations
mcp-vector-search install --with-mcp
# Custom file extensions
mcp-vector-search install --extensions .py,.js,.ts
# Skip automatic indexing
mcp-vector-search install --no-auto-index
# Platform-specific MCP integration
mcp-vector-search install claude-code # Project-scoped
mcp-vector-search install cursor # Global
mcp-vector-search install windsurf # Global
mcp-vector-search install vscode # Global
# List available platforms
mcp-vector-search install list
When to use:
- Use install when you need fine-grained control over extensions, models, or MCP platforms
- Use setup for quick, zero-config onboarding (recommended)
# Remove specific platform
mcp-vector-search uninstall claude-code
# Remove all integrations
mcp-vector-search uninstall --all
# List configured integrations
mcp-vector-search uninstall list
# Skip backup creation
mcp-vector-search uninstall claude-code --no-backup
# Alias (same as uninstall)
mcp-vector-search remove claude-code
# Basic initialization (no indexing or MCP)
mcp-vector-search init
# Custom configuration
mcp-vector-search init --extensions .py,.js,.ts --embedding-model sentence-transformers/all-MiniLM-L6-v2
# Force re-initialization
mcp-vector-search init --force
Note: For most users, use setup instead of init. The init command is for advanced users who want manual control.
# Index all files
mcp-vector-search index
# Index specific directory
mcp-vector-search index /path/to/code
# Force re-indexing
mcp-vector-search index --force
# Reindex entire project
mcp-vector-search index reindex
# Reindex entire project (explicit)
mcp-vector-search index reindex --all
# Reindex entire project without confirmation
mcp-vector-search index reindex --force
# Reindex specific file
mcp-vector-search index reindex path/to/file.py
# Basic search
mcp-vector-search search "function that handles user authentication"
# Adjust similarity threshold
mcp-vector-search search "database queries" --threshold 0.7
# Limit results
mcp-vector-search search "error handling" --limit 10
# Search in specific context
mcp-vector-search search similar "path/to/function.py:25"
# Setup all auto-indexing strategies
mcp-vector-search auto-index setup --method all
# Setup specific strategies
mcp-vector-search auto-index setup --method git-hooks
mcp-vector-search auto-index setup --method scheduled --interval 60
# Check for stale files and auto-reindex
mcp-vector-search auto-index check --auto-reindex --max-files 10
# View auto-indexing status
mcp-vector-search auto-index status
# Remove auto-indexing setup
mcp-vector-search auto-index teardown --method all
# Start watching for changes
mcp-vector-search watch
# Check watch status
mcp-vector-search watch status
# Enable/disable watching
mcp-vector-search watch enable
mcp-vector-search watch disable
# Basic status
mcp-vector-search status
# Detailed information
mcp-vector-search status --verbose
# View configuration
mcp-vector-search config show
# Update settings
mcp-vector-search config set similarity_threshold 0.8
mcp-vector-search config set embedding_model microsoft/codebert-base
# Configure indexing behavior
mcp-vector-search config set skip_dotfiles true # Skip dotfiles (default)
mcp-vector-search config set respect_gitignore true # Respect .gitignore (default)
# Get specific setting
mcp-vector-search config get skip_dotfiles
mcp-vector-search config get respect_gitignore
# List available models
mcp-vector-search config models
# List all configuration keys
mcp-vector-search config list-keys
# Index with CodeT5+ embeddings (code-optimized)
mcp-vector-search index-code
# Feature-flagged via environment variable
export MCP_CODE_ENRICHMENT=true
mcp-vector-search index-code
# Launch visualization server
mcp-vector-search visualize
# Start on custom port
mcp-vector-search visualize --port 8080
# Available views:
# - Treemap: Hierarchical view with size/complexity encoding
# - Sunburst: Radial hierarchical view
# - Force Graph: Network visualization of code relationships
# - Knowledge Graph: Entity and relationship visualization
# - Heatmap: Complexity and quality heatmap
# Generate development narrative from git history
mcp-vector-search story
# Output formats
mcp-vector-search story --format markdown
mcp-vector-search story --format json
mcp-vector-search story --format html
# Serve as HTTP endpoint
mcp-vector-search story --serve
# Extract-only mode (no LLM)
mcp-vector-search story --no-llm
# Custom LLM model
mcp-vector-search story --model gpt-4o
# Build knowledge graph
mcp-vector-search kg build
# Check knowledge graph status
mcp-vector-search kg status
# Query knowledge graph
mcp-vector-search kg query "find all Python functions"
mcp-vector-search kg query "show classes in module auth"
# Browse document ontology (file-level document classification)
mcp-vector-search kg ontology
mcp-vector-search kg ontology --category guide # filter by category
mcp-vector-search kg ontology --verbose # include file paths
# Knowledge graph entities:
# - CodeFile, Function, Class, Person
# - ProgrammingLanguage, ProgrammingFramework
# - Document (file-level, with doc_category classification)
# - Topic (hierarchical taxonomy)
# Ask questions about your codebase
mcp-vector-search chat "explain the authentication flow"
mcp-vector-search chat "how does error handling work?"
# Iterative refinement (up to 30 queries)
# Automatically uses deep search and KG query tools
# Advanced reasoning mode
mcp-vector-search chat "architectural patterns" --think
# Filter by files
mcp-vector-search chat "validation logic" --files "src/*.py"
# Complexity analysis
mcp-vector-search analyze complexity
# Dead code detection
mcp-vector-search analyze dead-code
# Output formats
mcp-vector-search analyze complexity --json
mcp-vector-search analyze complexity --sarif
mcp-vector-search analyze complexity --output-format markdown
# CI/CD integration
mcp-vector-search analyze complexity --fail-on-smell
MCP Vector Search includes several query-time optimizations that are automatically enabled as your index grows.
IVF-PQ Index is built automatically after indexing more than 256 rows. It uses Inverted File with Product Quantization to partition vectors into clusters, so queries scan only a relevant subset rather than the full index. The index parameters adapt to your data: num_partitions = clamp(sqrt(N), 16, 512) and num_sub_vectors = dim // 4.
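As a worked example of the two formulas above, the sketch below computes the index parameters for a given row count and embedding dimension. The function name and return shape are hypothetical, and truncating the square root to an integer is an assumption (the source does not say how sqrt(N) is rounded).

```python
import math
from typing import Dict, Optional

# Threshold stated above: the index is built after more than 256 rows.
MIN_ROWS_FOR_INDEX = 256


def ivf_pq_params(num_rows: int, dim: int) -> Optional[Dict[str, int]]:
    """Return IVF-PQ parameters, or None when the table is too small to index."""
    if num_rows <= MIN_ROWS_FOR_INDEX:
        return None
    # num_partitions = clamp(sqrt(N), 16, 512); integer truncation is assumed.
    num_partitions = max(16, min(512, math.isqrt(num_rows)))
    # num_sub_vectors = dim // 4
    num_sub_vectors = dim // 4
    return {"num_partitions": num_partitions, "num_sub_vectors": num_sub_vectors}


if __name__ == "__main__":
    # 384 is the embedding dimension of all-MiniLM-L6-v2.
    for rows in (200, 300, 10_000, 1_000_000):
        print(rows, ivf_pq_params(rows, dim=384))
```

For a 384-dimensional model this gives 96 sub-vectors regardless of index size, while the partition count grows from 17 at 300 rows to the cap of 512 at one million rows.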
Two-stage retrieval improves precision on top of the IVF-PQ scan: the engine probes 20 IVF partitions (nprobes=20) and fetches 5x the requested candidates (refine_factor=5), then reranks them with exact cosine similarity. This applies to both the LanceDB and legacy vector backends.
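The over-fetch-then-rerank idea can be illustrated in pure Python. This is a toy sketch, not the engine's code: rounding each component stands in for the lossy product-quantized representation, partition probing (nprobes) is omitted entirely, and all names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lossy(v: Sequence[float]) -> List[int]:
    """Stand-in for quantization: keep only the rounded components (assumption)."""
    return [round(x) for x in v]


def two_stage_search(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int,
    refine_factor: int = 5,
) -> List[str]:
    """Fetch k * refine_factor approximate candidates, then rerank them exactly."""
    # Stage 1: cheap, approximate ranking on the lossy vectors.
    approx_query = lossy(query)
    approx_ranked = sorted(
        vectors,
        key=lambda doc_id: cosine(approx_query, lossy(vectors[doc_id])),
        reverse=True,
    )
    candidates = approx_ranked[: k * refine_factor]
    # Stage 2: exact cosine similarity on the small candidate set only.
    reranked = sorted(
        candidates,
        key=lambda doc_id: cosine(query, vectors[doc_id]),
        reverse=True,
    )
    return reranked[:k]


if __name__ == "__main__":
    vectors = {"a": [1.0, 0.0], "b": [0.9, 0.4], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
    query = [0.9, 0.45]
    print(two_stage_search(query, vectors, k=1, refine_factor=1))
    print(two_stage_search(query, vectors, k=1, refine_factor=2))
```

In this toy data the lossy stage cannot tell "a" from "b", so fetching a single candidate returns "a", while over-fetching lets the exact rerank promote "b", the true nearest neighbour.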
Contextual chunking prepends a compact metadata header to each chunk before embedding, so the vector captures file, language, class, and function context rather than code text alone. Format: File: core/search.py | Lang: python | Class: Engine | Fn: search | Uses: lancedb. Based on Anthropic research showing 35-49% fewer retrieval failures.
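A header in the format shown above could be assembled as in the following sketch. The function names are hypothetical, and two details are assumptions not stated in the source: empty fields are omitted, and the header is separated from the code by a newline.

```python
from typing import Optional, Sequence


def build_context_header(
    file_path: str,
    language: str,
    class_name: Optional[str] = None,
    function_name: Optional[str] = None,
    uses: Sequence[str] = (),
) -> str:
    """Build a compact 'File: ... | Lang: ... | ...' metadata header."""
    parts = [f"File: {file_path}", f"Lang: {language}"]
    # Assumption: fields with no value are left out of the header.
    if class_name:
        parts.append(f"Class: {class_name}")
    if function_name:
        parts.append(f"Fn: {function_name}")
    if uses:
        parts.append("Uses: " + ", ".join(uses))
    return " | ".join(parts)


def contextualize(chunk_text: str, header: str) -> str:
    """Prepend the header so the embedding sees metadata and code together."""
    return f"{header}\n{chunk_text}"


if __name__ == "__main__":
    header = build_context_header(
        "core/search.py", "python",
        class_name="Engine", function_name="search", uses=["lancedb"],
    )
    print(contextualize("def search(self, query): ...", header))
```

The text passed to the embedding model is then the header plus the chunk, rather than the chunk alone.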
| Optimization | Impact |
|---|---|
| IVF-PQ index + two-stage retrieval | 4.9x faster queries (3.4ms vs 16.7ms median) |
| Contextual chunking | 35-49% fewer retrieval failures |
| Pipeline parallelism | 37% faster indexing |
| Apple Silicon MPS | 2-4x faster embedding generation |
See docs/performance/search-optimizations.md for technical details and benchmark methodology.
LanceDB is now the default vector database for better performance and stability:
- Serverless Architecture: No separate server process needed
- Better Scaling: Superior performance for large codebases (>100k chunks)
- File-Based Storage: Simple directory-based persistence
- Fewer Corruption Issues: More stable than ChromaDB's HNSW indices
- Write Buffering: 2-4x faster indexing with accumulated batch writes
To use ChromaDB (legacy), set environment variable:
export MCP_VECTOR_SEARCH_BACKEND=chromadb
Migrate existing ChromaDB database:
mcp-vector-search migrate db chromadb-to-lancedb
See docs/LANCEDB_BACKEND.md for detailed documentation.
2-4x speedup on Apple Silicon with automatic hardware detection:
- MPS Backend: Metal Performance Shaders GPU acceleration for embeddings
- Intelligent Batch Sizing: Auto-detects GPU memory (384-512 for M4 Max with 128GB RAM)
- Multi-Core Optimization: Utilizes all 12 performance cores efficiently
- Zero Configuration: Automatically enabled on Apple Silicon Macs
Environment variables for tuning:
export MCP_VECTOR_SEARCH_MPS_BATCH_SIZE=512 # Override MPS batch size
export MCP_VECTOR_SEARCH_BATCH_SIZE=128 # Override all backends
Multiple strategies to keep your index up-to-date without daemon processes:
- Search-Triggered: Automatically checks for stale files during searches
- Git Hooks: Triggers reindexing after commits, merges, checkouts
- Scheduled Tasks: System-level cron jobs or Windows tasks
- Manual Checks: On-demand via CLI commands
- Periodic Checker: In-process periodic checks for long-running apps
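All of these strategies rest on the same check: deciding which files changed since they were last indexed. A minimal sketch of that check, assuming staleness is determined by comparing modification times, follows. The function and the shape of the recorded-mtime mapping are hypothetical, and the real tool may use a different signal such as content hashes.

```python
import os
from typing import Dict, List


def find_stale_files(indexed_mtimes: Dict[str, float], max_files: int = 10) -> List[str]:
    """Return up to max_files paths that changed or disappeared since indexing.

    indexed_mtimes maps each indexed path to the modification time recorded
    when it was last indexed (a hypothetical bookkeeping structure).
    """
    stale: List[str] = []
    for path, recorded_mtime in indexed_mtimes.items():
        try:
            current_mtime = os.path.getmtime(path)
        except OSError:
            # The file was deleted or is unreadable, so its index entry is outdated.
            stale.append(path)
            continue
        if current_mtime > recorded_mtime:
            stale.append(path)
    # Cap the batch so a check triggered during a search stays cheap.
    return stale[:max_files]


if __name__ == "__main__":
    recorded = {__file__: 0.0}  # pretend this file was indexed at the epoch
    print(find_stale_files(recorded))
```

The max_files cap mirrors the --max-files option of the auto-index check command shown earlier in this README.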
# Setup all strategies
mcp-vector-search auto-index setup --method all
# Check status
mcp-vector-search auto-index status
Projects are configured via .mcp-vector-search/config.json:
{
  "project_root": "/path/to/project",
  "file_extensions": [".py", ".js", ".ts"],
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
  "similarity_threshold": 0.75,
  "languages": ["python", "javascript", "typescript"],
  "watch_files": true,
  "cache_embeddings": true,
  "skip_dotfiles": true,
  "respect_gitignore": true
}
skip_dotfiles (default: true)
- Controls whether files and directories starting with "." are skipped during indexing
- Whitelisted directories are always indexed regardless of this setting:
  - .github/ - GitHub workflows and actions
  - .gitlab-ci/ - GitLab CI configuration
  - .circleci/ - CircleCI configuration
- When false: All dotfiles are indexed (subject to gitignore rules if respect_gitignore is true)
respect_gitignore (default: true)
- Controls whether .gitignore patterns are respected during indexing
- When false: Files in .gitignore are indexed (subject to skip_dotfiles if enabled)
force_include_patterns (default: [])
- Glob patterns to force-include files/directories even if they are gitignored
- Patterns support ** for recursive matching (e.g., repos/**/*.java matches all Java files in repos/ and subdirectories)
- Force-include patterns override .gitignore rules, allowing selective indexing of gitignored directories
- Example use case: Index specific file types in a gitignored repos/ directory
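The way these three settings interact can be sketched as a single decision function. This is an illustration under stated assumptions, not the indexer's actual logic: the function names are hypothetical, gitignore status is passed in as a flag instead of being computed, the glob translation is simplified, and force-include is assumed to override only .gitignore (not skip_dotfiles).

```python
import re
from pathlib import PurePosixPath
from typing import Sequence

# Dot-directories that the README says are always indexed.
ALWAYS_INDEXED_DOT_DIRS = {".github", ".gitlab-ci", ".circleci"}


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a glob where ** spans directories and * stays in one segment."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def should_index(
    path: str,
    is_gitignored: bool,
    skip_dotfiles: bool = True,
    respect_gitignore: bool = True,
    force_include_patterns: Sequence[str] = (),
) -> bool:
    """Decide whether a project-relative POSIX path should be indexed."""
    forced = any(glob_to_regex(p).match(path) for p in force_include_patterns)
    # Force-include patterns override .gitignore rules.
    if respect_gitignore and is_gitignored and not forced:
        return False
    if skip_dotfiles:
        for part in PurePosixPath(path).parts:
            if part.startswith(".") and part not in ALWAYS_INDEXED_DOT_DIRS:
                return False
    return True


if __name__ == "__main__":
    patterns = ["repos/**/*.java", "repos/**/*.kt"]
    print(should_index("repos/lib/Foo.java", True, force_include_patterns=patterns))
    print(should_index("repos/lib/Foo.class", True, force_include_patterns=patterns))
```

With these patterns, Java and Kotlin sources under a gitignored repos/ directory are indexed while compiled artifacts in the same directory stay excluded.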
Example: Force-include Java files from gitignored directory
# Set force_include_patterns via JSON list
mcp-vector-search config set force_include_patterns '["repos/**/*.java", "repos/**/*.kt"]'
# Or add patterns one at a time (requires custom CLI command)
# This allows .gitignore to exclude repos/ from git, but mcp-vector-search still indexes Java/Kotlin files
Example config.json with force_include_patterns:
{
  "respect_gitignore": true,
  "force_include_patterns": [
    "repos/**/*.java",
    "repos/**/*.kt",
    "vendor/internal/**/*.go"
  ]
}
Default Behavior (Recommended for most projects):
# Skip dotfiles AND respect .gitignore
mcp-vector-search config set skip_dotfiles true
mcp-vector-search config set respect_gitignore true
Index Everything (Useful for deep code analysis):
# Index all files including dotfiles and gitignored files
mcp-vector-search config set skip_dotfiles false
mcp-vector-search config set respect_gitignore false
Index Dotfiles but Respect .gitignore:
# Index configuration files but skip build artifacts
mcp-vector-search config set skip_dotfiles false
mcp-vector-search config set respect_gitignore true
Skip Dotfiles but Ignore .gitignore:
# Useful when you want to index files in .gitignore but skip hidden config files
mcp-vector-search config set skip_dotfiles true
mcp-vector-search config set respect_gitignore false
Selective Gitignore Override with Force-Include Patterns:
# Index specific file types from gitignored directories
# Example: .gitignore excludes repos/, but you want to index Java/Kotlin files
mcp-vector-search config set respect_gitignore true
mcp-vector-search config set force_include_patterns '["repos/**/*.java", "repos/**/*.kt"]'
# This allows:
# - .gitignore to exclude repos/ from git (keeps your repo clean)
# - mcp-vector-search to index Java/Kotlin files in repos/ (semantic search)
# - Other files in repos/ (e.g., .class, .jar) remain excluded
- Parser Registry: Extensible system for language-specific parsing
- Semantic Indexer: Efficient code chunking and embedding generation
- Vector Database: LanceDB for similarity search
- File Watcher: Real-time monitoring and incremental updates
- CLI Interface: Rich, user-friendly command-line experience
MCP Vector Search supports 13 programming languages with full semantic search capabilities:
| Language | Extensions | Status | Features |
|---|---|---|---|
| Python | .py, .pyw | ✅ Full | Functions, classes, methods, docstrings |
| JavaScript | .js, .jsx, .mjs | ✅ Full | Functions, classes, JSDoc, ES6+ syntax |
| TypeScript | .ts, .tsx | ✅ Full | Interfaces, types, generics, decorators |
| C# | .cs | ✅ Full | Classes, interfaces, structs, enums, methods, XML docs, attributes |
| Dart | .dart | ✅ Full | Functions, classes, widgets, async, dartdoc |
| PHP | .php, .phtml | ✅ Full | Classes, methods, traits, PHPDoc, Laravel patterns |
| Ruby | .rb, .rake, .gemspec | ✅ Full | Modules, classes, methods, RDoc, Rails patterns |
| Java | .java | ✅ Full | Classes, methods, annotations, interfaces |
| Go | .go | ✅ Full | Functions, structs, interfaces, packages |
| Rust | .rs | ✅ Full | Functions, structs, traits, implementations |
| HTML | .html, .htm | ✅ Full | Semantic content extraction, heading hierarchy, text chunking |
| Text/Markdown | .txt, .md, .markdown | ✅ Basic | Semantic chunking for documentation |
HTML Support (Unreleased):
- Semantic Extraction: Content from h1-h6, p, section, article, main, aside, nav, header, footer
- Intelligent Chunking: Based on heading hierarchy (h1-h6)
- Context Preservation: Maintains class and id attributes for searchability
- Script/Style Filtering: Ignores non-content elements
- Use Cases: Static sites, documentation, web templates, HTML fragments
Dart/Flutter Support (v0.4.15):
- Widget Detection: StatelessWidget, StatefulWidget recognition
- State Classes: Automatic parsing of _WidgetNameState patterns
- Async Support: Future and async function handling
- Dartdoc: Triple-slash comment extraction
- Tree-sitter AST: Fast, accurate parsing with regex fallback
PHP Support (v0.5.0):
- Class Detection: Classes, interfaces, traits
- Method Extraction: Public, private, protected, static methods
- Magic Methods: __construct, __get, __set, __call, etc.
- PHPDoc: Full comment extraction
- Laravel Patterns: Controllers, Models, Eloquent support
- Tree-sitter AST: Fast parsing with regex fallback
Ruby Support (v0.5.0):
- Module/Class Detection: Full namespace support (::)
- Method Extraction: Instance and class methods
- Special Syntax: Method names with ?, ! support
- Attribute Macros: attr_accessor, attr_reader, attr_writer
- RDoc: Comment extraction (# and =begin...=end)
- Rails Patterns: ActiveRecord, Controllers support
- Tree-sitter AST: Fast parsing with regex fallback
We welcome contributions! Please see our Contributing Guide for details.
# Clone the repository
git clone https://github.com/bobmatnyc/mcp-vector-search.git
cd mcp-vector-search
# Install development environment (includes dependencies + editable install)
make dev
# Test CLI from source (recommended during development)
./scripts/dev-mcp version # Shows [DEV] indicator
./scripts/dev-mcp search "test" # No reinstall needed after code changes
# Run tests and quality checks
make test-unit # Run unit tests
make quality # Run linting and type checking
make fix # Auto-fix formatting issues
# View all available targets
make help
For detailed development workflow and dev-mcp usage, see the Development section below.
- Create a new parser in src/mcp_vector_search/parsers/
- Extend the BaseParser class
- Register the parser in parsers/registry.py
- Add tests and documentation
- Indexing Speed: ~1000 files/minute (typical Python project)
- Search Latency: 3.4ms median with IVF-PQ index (4.9x faster than without)
- Memory Usage: ~50MB baseline + ~1MB per 1000 code chunks
- Storage: ~1KB per code chunk (compressed embeddings)
- Tree-sitter Integration: Currently using regex fallback parsing (Tree-sitter setup needs improvement)
- Search Relevance: Embedding model may need tuning for code-specific queries
- Error Handling: Some edge cases may not be gracefully handled
- Documentation: API documentation is minimal
- Testing: Limited test coverage, needs real-world validation
We're actively seeking feedback on:
- Search Quality: How relevant are the search results for your codebase?
- Performance: How does indexing and search speed feel in practice?
- Usability: Is the CLI interface intuitive and helpful?
- Language Support: Which languages would you like to see added next?
- Features: What functionality is missing for your workflow?
Please open an issue or start a discussion to share your experience!
- Core CLI interface
- Multi-language parsing (13 languages: Python, JavaScript, TypeScript, C#, Dart, PHP, Ruby, Java, Go, Rust, HTML, Markdown, Text)
- LanceDB default backend (ChromaDB legacy support)
- Apple Silicon optimizations (2-4x speedup with MPS)
- File watching and auto-reindexing
- MCP server implementation with 17 tools
- Advanced search modes (semantic, contextual, similar code)
- Code analysis tools (complexity, dead code detection, code smells)
- Interactive D3.js visualization (5+ views: Treemap, Sunburst, Force Graph, KG, Heatmap)
- Knowledge Graph with KuzuDB (entity extraction, relationship mapping)
- Development narrative generation (story command)
- Chat mode with LLM integration (iterative refinement, up to 30 queries)
- CodeT5+ code-specific embeddings
- Pipeline parallelism (37% faster indexing)
- Production-ready performance (write buffering, GPU acceleration, async pipeline)
- IVF-PQ vector index with two-stage retrieval (4.9x faster queries)
- Contextual chunking (metadata-enriched embeddings, 35-49% fewer retrieval failures)
- CodeRankEmbed model support (nomic-ai/CodeRankEmbed, 768d, 8K context)
- Document ontology with 23 categories (kg ontology command)
- Hybrid search (vector + keyword + BM25)
- Additional language support (more languages beyond 13)
- IDE extensions (VS Code, JetBrains)
- Team collaboration features
- Advanced code refactoring suggestions
- Real-time collaboration on knowledge graph
- Multi-project knowledge graph federation
Stage A: Local Development & Testing
# Setup development environment
make dev
# Run development tests
make test-unit
# Run CLI from source (recommended during development)
./dev-mcp version # Visual [DEV] indicator
./dev-mcp status # Any command works
./dev-mcp search "auth" # Immediate feedback on changes
# Run quality checks
make quality
# Alternative: use uv run directly
uv run mcp-vector-search version
The ./dev-mcp script provides a streamlined way to run the CLI from source code during development, eliminating the need for repeated installations.
Key Features:
- Visual [DEV] Indicator: Shows [DEV] prefix to distinguish from installed version
- No Reinstall Required: Reflects code changes immediately
- Complete Argument Forwarding: Works with all CLI commands and options
- Verbose Mode: Debug output with --verbose flag
- Built-in Help: Script usage with --help
Usage Examples:
# Basic commands (note the [DEV] prefix in output)
./dev-mcp version
./dev-mcp status
./dev-mcp index
./dev-mcp search "authentication logic"
# With CLI options
./dev-mcp search "error handling" --limit 10
./dev-mcp index --force
# Script verbose mode (shows Python interpreter, paths)
./dev-mcp --verbose search "database"
# Script help (shows dev-mcp usage, not CLI help)
./dev-mcp --help
# CLI command help (forwards --help to the CLI)
./dev-mcp search --help
./dev-mcp index --help
When to Use:
- ./dev-mcp: Development workflow (runs from source code)
- mcp-vector-search: Production usage (runs installed version via pipx/pip)
Benefits:
- Instant Feedback: Changes to source code are reflected immediately
- No Build Step: Skip the reinstall cycle during active development
- Clear Context: Visual [DEV] indicator prevents confusion about which version is running
- Error Handling: Built-in checks for uv installation and project structure
Requirements:
- Must have uv installed (pip install uv)
- Must run from project root directory
- Requires pyproject.toml in current directory
Stage B: Local Deployment Testing
# Build and test clean deployment
./scripts/deploy-test.sh
# Test on other projects
cd ~/other-project
mcp-vector-search init && mcp-vector-search index
Stage C: PyPI Publication
# Publish to PyPI
./scripts/publish.sh
# Verify published version
pip install mcp-vector-search --upgrade
./scripts/workflow.sh # Show workflow overview
See DEVELOPMENT.md for detailed development instructions.
For comprehensive documentation, see docs/index.md - the complete documentation hub.
- Installation Guide - Complete installation instructions
- First Steps - Quick start tutorial
- Configuration - Basic configuration
- Searching Guide - Master semantic code search
- Indexing Guide - Indexing strategies and optimization
- CLI Usage - Advanced CLI features
- MCP Integration - AI tool integration
- File Watching - Real-time index updates
- CLI Commands - Complete command reference
- Configuration Options - All configuration settings
- Features - Feature overview
- Architecture - System architecture
- Contributing - How to contribute
- Testing - Testing guide
- Code Quality - Linting and formatting
- API Reference - Internal API docs
- Deployment - Release and deployment guide
- Troubleshooting - Common issues and solutions
- Performance - Performance optimization
- Extending - Adding new features
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Elastic License 2.0 - see LICENSE file for details.
Note: This software may not be provided to third parties as a hosted or managed service.
- LanceDB for vector database
- Tree-sitter for parsing infrastructure
- Sentence Transformers for embeddings
- Typer for CLI framework
- Rich for beautiful terminal output
Built with ❤️ for developers who love efficient code search