Problem
The learn MCP tool gets stuck or appears frozen when indexing large codebases with many folders and subfolders.
Root Causes
- No .gitignore support: discover_files() uses Path.rglob("*") with only a hardcoded SKIP_DIRS set. Nested node_modules, build artifacts, and generated files all get traversed and indexed, causing 10x+ slowdown on real projects.
- No checkpointing: Metadata is saved only after ALL embeddings complete. An interruption at chunk 99K of 100K loses all progress and requires a full restart.
- Destructive full-index: index_codebase() deletes the entire ChromaDB collection before re-creating it, preventing any form of resumption.
- Weak progress reporting: File discovery phase emits zero progress events. Embedding progress lacks ETA, making the tool appear frozen.
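To make the first root cause concrete, here is a minimal sketch of directory pruning with os.walk(). It is not the project's actual code: the SKIP_DIRS contents, the is_ignored callback, and the max_files default are all assumptions chosen for illustration. The key difference from Path.rglob("*") is that an ignored directory is removed from the walk before descent, so its contents are never visited at all.

```python
import os
from pathlib import Path
from typing import Callable, Iterator

# Hypothetical defaults for illustration; the real SKIP_DIRS may differ.
SKIP_DIRS = {".git", "node_modules", "__pycache__"}


def _rel_posix(root: str, dirpath: str, name: str) -> str:
    """Path of name relative to root, using forward slashes."""
    rel = os.path.relpath(os.path.join(dirpath, name), root)
    return rel.replace(os.sep, "/")


def discover_files(
    root: str,
    is_ignored: Callable[[str], bool] = lambda rel_path: False,
    max_files: int = 50_000,  # assumed default, not taken from the source
) -> Iterator[Path]:
    """Yield files under root, pruning ignored directories before descent."""
    found = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # In-place assignment to dirnames tells os.walk not to descend
        # into the removed entries (this only works with topdown=True,
        # which is the default).
        dirnames[:] = sorted(
            d for d in dirnames
            if d not in SKIP_DIRS
            and not is_ignored(_rel_posix(root, dirpath, d) + "/")
        )
        for name in sorted(filenames):
            if is_ignored(_rel_posix(root, dirpath, name)):
                continue
            if found >= max_files:
                # Assumption: stop quietly at the limit. A real
                # implementation might log a warning or raise instead.
                return
            found += 1
            yield Path(dirpath) / name
```

The is_ignored callback keeps the sketch free of third-party imports. With the pathspec package, one plausible way to build it is `spec = pathspec.PathSpec.from_lines("gitwildmatch", gitignore_lines)` followed by `is_ignored=spec.match_file`. This handles a single top-level .gitignore only; nested .gitignore files would need extra logic.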
Proposed Fix
- Rewrite discover_files() to use os.walk() with pathspec for .gitignore support and directory pruning
- Add a max_files safety limit (addresses SECURITY_REVIEW HIGH-003: Unbounded Resource Consumption)
- Switch from delete_collection()+create_collection() to get_or_create_collection()+upsert()
- Add resumable checkpointing with atomic writes every 1000 chunks
- Improve progress reporting with discovery events and ETA
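The checkpointing and ETA items above could look roughly like the following sketch. The function names (save_checkpoint, load_checkpoint, eta_seconds, embed_all), the JSON checkpoint layout, and the report callback are hypothetical and not taken from the codebase; only the 1000-chunk interval comes from the proposal. The atomic write uses the common pattern of writing to a temporary file and then calling os.replace().

```python
import json
import os
import tempfile
import time
from typing import Any, Callable, Optional, Sequence

CHECKPOINT_EVERY = 1000  # chunk interval named in the proposal


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace is atomic when source and target share a filesystem,
        # so readers see either the old checkpoint or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"done": 0}


def eta_seconds(done: int, total: int, elapsed: float) -> Optional[float]:
    """Linear estimate of remaining time; None until there is data."""
    if done <= 0 or elapsed <= 0:
        return None
    return elapsed * (total - done) / done


def embed_all(
    chunks: Sequence[Any],
    embed: Callable[[Any], None],
    checkpoint_path: str,
    report: Callable[[int, int, Optional[float]], None] = print,
    every: int = CHECKPOINT_EVERY,
) -> int:
    """Embed chunks, resuming from the last checkpoint; return count done."""
    start_index = load_checkpoint(checkpoint_path)["done"]
    started = time.monotonic()
    for i in range(start_index, len(chunks)):
        embed(chunks[i])
        done = i + 1
        if done % every == 0 or done == len(chunks):
            save_checkpoint(checkpoint_path, {"done": done})
            eta = eta_seconds(
                done - start_index,
                len(chunks) - start_index,
                time.monotonic() - started,
            )
            report(done, len(chunks), eta)
    return len(chunks) - start_index
```

After an interruption, chunks processed since the last checkpoint are embedded again on resume. In this sketch that is only safe if the write to the vector store is idempotent, which is what the switch to upsert() would provide.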
Impact
- No changes to MCP tool interface (learn params stay the same)
- Backward compatible (all new params have defaults)
- Also addresses security findings HIGH-003 and LOW-009