feat: monorepo subset indexing -- include_paths support (OPE-106)#266
Conversation
Thread include_paths from API through entire indexing chain:
- POST /repos/{id}/index/async accepts IndexConfig with include_paths
- _run_async_indexing passes include_paths to indexer
- indexer._discover_code_files filters files by include_paths
- indexer.extract_functions_v2 passes include_paths to tree-sitter
- indexer.index_repository_with_progress passes include_paths to discovery
- tree_sitter_extractor.extract_from_repo already supported include_paths
New endpoint:
- GET /repos/{id}/directories returns top-level dirs with file counts
(for UI directory picker in monorepo subset selection)
Requested by Trevor Keith (Solid/trysolid.com) for Effect-TS monorepo
(200K+ functions across 30+ packages, only needs 2).
|
@DevanshuNEU is attempting to deploy a commit to the Dev's projects Team on Vercel. A member of the Team first needs to authorize it. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review infoConfiguration used: Repository UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdded monorepo subdirectory listing and subset indexing: new GET route to list top-level repo directories with per-directory code-file counts; introduced Changes
Sequence DiagramsequenceDiagram
actor Client
participant API as "API Routes"
participant Indexer as "Indexer Service"
participant FS as "File System"
Client->>API: GET /repo/{id}/directories
API->>FS: scan top-level dirs (sync)
FS-->>API: directories + per-dir file counts
API-->>Client: return directories
Client->>API: POST /repo/{id}/index (IndexConfig incl. include_paths)
API->>Indexer: schedule _run_async_indexing(repo_id, include_paths, incremental)
Indexer->>Indexer: index_repository_with_progress(include_paths)
Indexer->>FS: discover code files (filtered by include_paths)
FS-->>Indexer: filtered file list
Indexer->>Indexer: extract_functions_v2(include_paths)
Indexer-->>API: progress updates (include_paths)
API-->>Client: indexing started / progress updates
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/routes/repos.py`:
- Around line 320-321: The incremental indexing branch ignores include_paths so
changed-file selection can include unrelated files; update the indexing handler
in backend/routes/repos.py (the function that accepts incremental: bool and
include_paths: Optional[List[str]]) to apply the same include_paths filtering
used in the full-index path when computing the set of files to index during
incremental=True (including the path/subset selection when a last commit
exists), i.e., after computing changed_files or candidate_files in the
incremental branch, filter them by matching include_paths patterns (respecting
the same glob/substring logic used in the full-index code) before proceeding
with indexing.
- Around line 199-207: The synchronous filesystem traversal in
get_repo_directories (using local_path.iterdir() and item.rglob()) must be moved
into a separate blocking helper function (e.g., _scan_repo_directories or
scan_repo_dirs) that performs the directory iteration and computes file_count,
then call that helper from the async route via asyncio.to_thread(...) to offload
blocking I/O; ensure the helper returns the data structure expected by
get_repo_directories and replace the inline iteration and rglob usage with the
to_thread call so the event loop is not blocked.
In `@backend/services/indexer_optimized.py`:
- Around line 135-145: The current loop over repo_path.rglob('*') still scans
the whole repo and the include-path check uses rel.startswith which can produce
false matches (e.g., "packages/effectx" matching "packages/effect"); change the
scan to only iterate the specified include_paths when include_paths is set
(build roots = [repo_path / p for p in include_paths] and rglob from each root,
falling back to repo_path.rglob when include_paths is empty), and replace the
startswith check with a segment-aware test on rel (e.g., require rel == p or
rel.startswith(p + os.sep) for each p in include_paths, or compare
Path(file_path).parts prefix equality) so only exact-path or subpath matches are
included; keep existing skip_dirs, extensions, and file_path.is_dir checks
intact (refer to repo_path.rglob, include_paths, file_path, rel, skip_dirs,
extensions).
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/services/indexer_optimized.py (1)
700-707: 🛠️ Refactor suggestion | 🟠 MajorAdd return type annotation
-> intto the function signature.The
index_repository_with_progressmethod at lines 700-707 omits the return type annotation required by the coding guidelines. The function returns a numeric count (line 835:return len(all_functions_data)).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/services/indexer_optimized.py` around lines 700 - 707, The function index_repository_with_progress is missing a return type annotation; update its signature to declare it returns an int (i.e., add "-> int" to the def line for index_repository_with_progress) so the signature matches the actual return value (len(all_functions_data)); ensure any corresponding type hints or imports remain valid.
♻️ Duplicate comments (3)
backend/services/indexer_optimized.py (1)
135-145:⚠️ Potential issue | 🟠 MajorFix subset matching and scan scope in
_discover_code_files.Line 135 still scans the full repo in subset mode, and Line 144 uses prefix string matching that can over-include sibling paths (e.g.,
packages/effectxforpackages/effect).Proposed fix
- for file_path in repo_path.rglob('*'): + normalized_includes = [ + Path(p.strip("/")) for p in (include_paths or []) if p and p.strip("/") + ] + scan_roots = [repo_path / p for p in normalized_includes] if normalized_includes else [repo_path] + + for root in scan_roots: + if not root.exists() or not root.is_dir(): + continue + for file_path in root.rglob('*'): if file_path.is_dir(): continue if any(skip in file_path.parts for skip in skip_dirs): continue if file_path.suffix in extensions: - # Subset filter: only include files under specified paths - if include_paths: - rel = str(file_path.relative_to(repo_path)) - if not any(rel.startswith(p) for p in include_paths): - continue - code_files.append(file_path) + if normalized_includes: + rel = file_path.relative_to(repo_path) + if not any(rel == inc or inc in rel.parents for inc in normalized_includes): + continue + code_files.append(file_path)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/services/indexer_optimized.py` around lines 135 - 145, _in _discover_code_files the loop currently walks the entire repo even when include_paths is set and uses simple string startswith on rel which can match sibling prefixes; change behavior so when include_paths is provided you only iterate files under those include_paths (e.g., iterate repo_path / p for each p in include_paths rather than repo_path.rglob('*')), or filter using path-component-aware checks (compare tuples of Path.parts or ensure rel.startswith(p + os.sep)) to avoid matching names like "packages/effectx" for "packages/effect"; update variables repo_path, include_paths, rel, and file_path handling accordingly so scan scope is restricted and matching is done on path boundaries.backend/routes/repos.py (2)
346-353:⚠️ Potential issue | 🟠 MajorHonor
include_pathsin the incremental branch (or bypass incremental when subset mode is requested).When
incremental=Trueandlast_commitexists,include_pathsis not applied, so files outside the requested subset can be indexed.Proposed fix
- if incremental and last_commit: + if incremental and last_commit and not include_paths: logger.info("Async INCREMENTAL indexing", repo_id=repo_id, last_commit=last_commit[:8]) total_functions = await indexer.incremental_index_repository( repo_id, repo["local_path"], last_commit ) index_type = "incremental" @@ - else: + else: + if incremental and last_commit and include_paths: + logger.info( + "Subset indexing requested; skipping incremental path to honor include_paths", + repo_id=repo_id, + include_paths=include_paths, + ) logger.info("Async FULL indexing with progress", repo_id=repo_id)Also applies to: 392-397
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/routes/repos.py` around lines 346 - 353, The incremental branch ignores include_paths causing out-of-subset files to be indexed; update the logic around the incremental check so that when include_paths is provided you either call a version of indexer.incremental_index_repository that accepts and applies include_paths (pass include_paths through) or skip incremental and call the full indexer.index_repository with include_paths instead; modify the block that uses incremental, last_commit and calls indexer.incremental_index_repository (and the corresponding similar block later) to honor include_paths and set index_type accordingly.
199-207:⚠️ Potential issue | 🟠 MajorMove repository scanning off the async event loop.
Line 199 and Line 204 perform blocking filesystem traversal (
iterdir,rglob) in an async route, which can stall the loop on large repos.Proposed fix
+ def _scan_dirs() -> List[dict]: + extensions = {".py", ".js", ".jsx", ".ts", ".tsx"} + out: List[dict] = [] + for item in sorted(local_path.iterdir()): + if item.is_dir() and item.name not in skip and not item.name.startswith("."): + file_count = sum( + 1 for f in item.rglob("*") + if f.is_file() and f.suffix in extensions + and not any(s in f.parts for s in skip) + ) + out.append({ + "name": item.name, + "path": str(item.relative_to(local_path)), + "file_count": file_count, + }) + return out + - dirs = [] - for item in sorted(local_path.iterdir()): - if item.is_dir() and item.name not in skip and not item.name.startswith("."): - # count code files in this directory - extensions = {".py", ".js", ".jsx", ".ts", ".tsx"} - file_count = sum( - 1 for f in item.rglob("*") - if f.is_file() and f.suffix in extensions - and not any(s in f.parts for s in skip) - ) - dirs.append({ - "name": item.name, - "path": str(item.relative_to(local_path)), - "file_count": file_count, - }) + dirs = await asyncio.to_thread(_scan_dirs)Based on learnings: Applies to backend/**/*.py : Use async/await for I/O operations in Python backend.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/routes/repos.py` around lines 199 - 207, The blocking filesystem traversal using local_path.iterdir() and item.rglob() (which computes file_count and uses the extensions set) is running inside an async route handler and should be moved off the event loop; refactor by extracting the directory scanning logic that builds file_count (and uses extensions and skip) into a synchronous helper function (e.g., sync_scan_directory(local_path, skip, extensions)) and call it from the async handler via await asyncio.to_thread(...) or loop.run_in_executor(...), returning the result so the async route no longer performs direct iterdir()/rglob() calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@backend/services/indexer_optimized.py`:
- Around line 700-707: The function index_repository_with_progress is missing a
return type annotation; update its signature to declare it returns an int (i.e.,
add "-> int" to the def line for index_repository_with_progress) so the
signature matches the actual return value (len(all_functions_data)); ensure any
corresponding type hints or imports remain valid.
---
Duplicate comments:
In `@backend/routes/repos.py`:
- Around line 346-353: The incremental branch ignores include_paths causing
out-of-subset files to be indexed; update the logic around the incremental check
so that when include_paths is provided you either call a version of
indexer.incremental_index_repository that accepts and applies include_paths
(pass include_paths through) or skip incremental and call the full
indexer.index_repository with include_paths instead; modify the block that uses
incremental, last_commit and calls indexer.incremental_index_repository (and the
corresponding similar block later) to honor include_paths and set index_type
accordingly.
- Around line 199-207: The blocking filesystem traversal using
local_path.iterdir() and item.rglob() (which computes file_count and uses the
extensions set) is running inside an async route handler and should be moved off
the event loop; refactor by extracting the directory scanning logic that builds
file_count (and uses extensions and skip) into a synchronous helper function
(e.g., sync_scan_directory(local_path, skip, extensions)) and call it from the
async handler via await asyncio.to_thread(...) or loop.run_in_executor(...),
returning the result so the async route no longer performs direct
iterdir()/rglob() calls.
In `@backend/services/indexer_optimized.py`:
- Around line 135-145: _in _discover_code_files the loop currently walks the
entire repo even when include_paths is set and uses simple string startswith on
rel which can match sibling prefixes; change behavior so when include_paths is
provided you only iterate files under those include_paths (e.g., iterate
repo_path / p for each p in include_paths rather than repo_path.rglob('*')), or
filter using path-component-aware checks (compare tuples of Path.parts or ensure
rel.startswith(p + os.sep)) to avoid matching names like "packages/effectx" for
"packages/effect"; update variables repo_path, include_paths, rel, and file_path
handling accordingly so scan scope is restricted and matching is done on path
boundaries.
1. Add -> int return annotation to index_repository_with_progress 2. Skip incremental indexing when include_paths is set (incremental_index_repository uses git diff which doesn't understand subset boundaries) 3. Move blocking rglob() out of async handler into _scan_directories() sync helper, called via asyncio.to_thread() to avoid blocking the event loop on large repos 4. Replace full-repo rglob + startswith filter in _discover_code_files with targeted subtree walking (only iterate include_paths dirs) -- eliminates prefix collision risk (packages/effectx no longer matches packages/effect)
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (4)
backend/routes/repos.py (4)
205-228: Add return type annotation to endpoint.The endpoint is missing a return type annotation. Consider adding
-> dictfor consistency with coding guidelines requiring type hints on all function signatures.Proposed fix
`@router.get`("/{repo_id}/directories") async def get_repo_directories( repo_id: str, auth: AuthContext = Depends(require_auth), -): +) -> dict:The use of
asyncio.to_thread(_scan_directories, local_path)correctly offloads blocking filesystem I/O from the event loop.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/routes/repos.py` around lines 205 - 228, The endpoint function get_repo_directories is missing a return type annotation; update its signature to include an explicit return type (-> dict) to satisfy the project's typing guidelines, e.g., change the def get_repo_directories(...) line to include -> dict, leaving the body and use of asyncio.to_thread(_scan_directories, local_path) unchanged and referencing get_repo_directories and _scan_directories to locate the function.
181-202: Consider using a more specific return type annotation.The return type
listis generic. Per coding guidelines requiring type hints on all function signatures, a more specific annotation improves clarity and enables better static analysis.Proposed fix
-def _scan_directories(local_path: Path) -> list: +def _scan_directories(local_path: Path) -> List[dict]:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/routes/repos.py` around lines 181 - 202, The function _scan_directories currently returns a generic list; update its signature to a precise type such as List[Dict[str, Union[str, int]]] (or list[dict[str, str | int]] if using Python 3.10+), add the necessary typing imports (List, Dict, Union or Any) at the top of the module, and adjust any tests/type checks accordingly so the returned list of dicts with keys "name" (str), "path" (str) and "file_count" (int) is fully reflected in the annotation.
459-463: Consider validatinginclude_pathsentries.The
include_pathsfield accepts arbitrary strings without validation. Consider adding Pydantic validators to:
- Reject empty strings
- Reject path traversal patterns (e.g.,
..components)- Optionally normalize paths (strip leading/trailing slashes)
This provides defense-in-depth even if downstream code handles invalid paths safely.
Example validation
+from pydantic import field_validator + class IndexConfig(BaseModel): """Optional config for indexing -- supports monorepo subset selection.""" include_paths: Optional[List[str]] = None # e.g. ["packages/effect", "packages/schema"] incremental: bool = True + + `@field_validator`("include_paths") + `@classmethod` + def validate_paths(cls, v: Optional[List[str]]) -> Optional[List[str]]: + if v is None: + return v + validated = [] + for p in v: + p = p.strip().strip("/") + if not p: + raise ValueError("Empty path not allowed") + if ".." in p.split("/"): + raise ValueError(f"Path traversal not allowed: {p}") + validated.append(p) + return validated if validated else None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/routes/repos.py` around lines 459 - 463, The IndexConfig.include_paths currently accepts arbitrary strings; add Pydantic validation on IndexConfig to (1) reject empty strings, (2) reject any path containing '..' path-segments (path traversal), and (3) normalize entries by stripping leading/trailing slashes; implement this with a validator on "include_paths" (or validator("include_paths", each_item=True)) that iterates items, strips whitespace/slashes, checks for empty result and for any '..' components (split on os.path.sep or "/"), and raises ValueError with a clear message for invalid entries so downstream code receives sanitized, normalized path strings.
617-621: WebSocket indexing endpoint lacksinclude_pathssupport for subset selection.The
websocket_indexfunction does not acceptinclude_pathsas a parameter and callsindex_repository_with_progresswithout it, while the async HTTP indexing flow (_run_async_indexing) supportsinclude_pathsfor the same underlying indexer method. This creates an inconsistency between two alternative indexing paths.Consider either:
- Adding
include_pathsparameter towebsocket_indexand propagating it to the indexer call for feature parity, or- Documenting why WebSocket indexing intentionally excludes this capability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/routes/repos.py` around lines 617 - 621, The websocket_index handler currently calls index_repository_with_progress(repo_id, repo["local_path"], progress_callback) without supporting include_paths, causing feature mismatch with the async HTTP path; update the websocket_index function signature to accept an include_paths parameter (same name and semantics as in _run_async_indexing) and pass it through to index_repository_with_progress, or if deliberate, add a clear comment and validation that websocket_index does not accept include_paths; locate websocket_index and the index_repository_with_progress call in backend/routes/repos.py and mirror how _run_async_indexing handles include_paths to ensure consistent behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/routes/repos.py`:
- Around line 205-228: The endpoint function get_repo_directories is missing a
return type annotation; update its signature to include an explicit return type
(-> dict) to satisfy the project's typing guidelines, e.g., change the def
get_repo_directories(...) line to include -> dict, leaving the body and use of
asyncio.to_thread(_scan_directories, local_path) unchanged and referencing
get_repo_directories and _scan_directories to locate the function.
- Around line 181-202: The function _scan_directories currently returns a
generic list; update its signature to a precise type such as List[Dict[str,
Union[str, int]]] (or list[dict[str, str | int]] if using Python 3.10+), add the
necessary typing imports (List, Dict, Union or Any) at the top of the module,
and adjust any tests/type checks accordingly so the returned list of dicts with
keys "name" (str), "path" (str) and "file_count" (int) is fully reflected in the
annotation.
- Around line 459-463: The IndexConfig.include_paths currently accepts arbitrary
strings; add Pydantic validation on IndexConfig to (1) reject empty strings, (2)
reject any path containing '..' path-segments (path traversal), and (3)
normalize entries by stripping leading/trailing slashes; implement this with a
validator on "include_paths" (or validator("include_paths", each_item=True))
that iterates items, strips whitespace/slashes, checks for empty result and for
any '..' components (split on os.path.sep or "/"), and raises ValueError with a
clear message for invalid entries so downstream code receives sanitized,
normalized path strings.
- Around line 617-621: The websocket_index handler currently calls
index_repository_with_progress(repo_id, repo["local_path"], progress_callback)
without supporting include_paths, causing feature mismatch with the async HTTP
path; update the websocket_index function signature to accept an include_paths
parameter (same name and semantics as in _run_async_indexing) and pass it
through to index_repository_with_progress, or if deliberate, add a clear comment
and validation that websocket_index does not accept include_paths; locate
websocket_index and the index_repository_with_progress call in
backend/routes/repos.py and mirror how _run_async_indexing handles include_paths
to ensure consistent behavior.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
backend/routes/repos.pybackend/services/indexer_optimized.py
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/services/indexer_optimized.py
…docs 1. Add -> dict return type to get_repo_directories 2. Add -> List[dict] return type to _scan_directories 3. Add Pydantic validator on IndexConfig.include_paths: - Rejects empty strings - Rejects path traversal (.. segments) - Normalizes leading/trailing slashes 4. Add docstring to websocket_index explaining it does NOT support include_paths (older pattern, use HTTP async endpoint instead) All 4 findings verified against code before fixing.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/routes/repos.py`:
- Around line 464-473: The include_paths validator uses deprecated Pydantic v1
`@validator` and calls strip() on pre=True values which raises AttributeError for
non-string input; replace it with Pydantic v2 `@field_validator` on
"include_paths" (use mode="before" and each_item=True) and add an explicit type
check in sanitize_path: if the value is not a str raise a validation error
(TypeError or ValueError) with a clear message, otherwise perform v =
v.strip().strip("/"), then enforce non-empty and no ".." segments as before;
update the decorator and the sanitize_path function name references accordingly.
Project requires pydantic>=2.0.0 and uses field_validator elsewhere (playground routes). Replace deprecated @validator with @field_validator: - mode='before' instead of pre=True - Iterate list manually instead of each_item=True (removed in v2) - Explicit isinstance check for non-string input - Same security behavior: rejects empty, traversal, normalizes slashes
There was a problem hiding this comment.
🧹 Nitpick comments (1)
backend/routes/repos.py (1)
459-480: LGTM! The validator correctly implements Pydantic v2 syntax and handles edge cases (non-strings, empty strings, path traversal).Optional: Consider normalizing Windows-style backslashes for cross-platform clients:
Optional enhancement
if not isinstance(item, str): raise ValueError(f"include_paths entries must be strings, got {type(item).__name__}") + item = item.replace("\\", "/") item = item.strip().strip("/")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/routes/repos.py` around lines 459 - 480, IndexConfig.sanitize_paths currently rejects path-traversal and empty strings but doesn't normalize Windows backslashes; update sanitize_paths to first replace backslashes with forward slashes (e.g., item = item.replace("\\\\", "/")) before strip/strip("/") and the ".." check so clients submitting "packages\\foo" are normalized to "packages/foo" and still validated; keep the existing type checks, trimming, empty string check, and path-traversal check and return the cleaned list for include_paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/routes/repos.py`:
- Around line 459-480: IndexConfig.sanitize_paths currently rejects
path-traversal and empty strings but doesn't normalize Windows backslashes;
update sanitize_paths to first replace backslashes with forward slashes (e.g.,
item = item.replace("\\\\", "/")) before strip/strip("/") and the ".." check so
clients submitting "packages\\foo" are normalized to "packages/foo" and still
validated; keep the existing type checks, trimming, empty string check, and
path-traversal check and return the cleaned list for include_paths.
Defensive measure -- converts Windows-style backslashes to forward slashes before validation. Costs nothing, prevents edge cases if infra or client platform changes.
|
The latest updates on your projects. Learn more about Vercel for GitHub. 1 Skipped Deployment
|
Monorepo subset indexing. Users can select which directories to index instead of the entire repo.
Source: Direct request from Trevor Keith (Solid/trysolid.com) for Effect-TS monorepo (200K+ functions, 30+ packages, only needed 2).
Changes:
Closes OPE-106
Summary by CodeRabbit