Skip to content

feat: monorepo subset indexing -- include_paths support (OPE-106)#266

Merged
DevanshuNEU merged 5 commits into
OpenCodeIntel:mainfrom
DevanshuNEU:feat/subset-indexing
Feb 28, 2026
Merged

feat: monorepo subset indexing -- include_paths support (OPE-106)#266
DevanshuNEU merged 5 commits into
OpenCodeIntel:mainfrom
DevanshuNEU:feat/subset-indexing

Conversation

@DevanshuNEU

@DevanshuNEU DevanshuNEU commented Feb 28, 2026

Copy link
Copy Markdown
Collaborator

Monorepo subset indexing. Users can select which directories to index instead of the entire repo.

Source: Direct request from Trevor Keith (Solid/trysolid.com) for Effect-TS monorepo (200K+ functions, 30+ packages, only needed 2).

Changes:

  • POST /repos/{id}/index/async accepts IndexConfig with include_paths
  • New GET /repos/{id}/directories returns top-level dirs with file counts
  • Indexer threads include_paths through entire chain
  • tree_sitter_extractor already supported include_paths

Closes OPE-106

Summary by CodeRabbit

  • New Features
    • View top-level repository directories with per-directory code file counts.
    • Submit indexing scoped to specific subdirectories via an indexing configuration (include paths + incremental flag).
    • Indexing progress, responses, and UI now display the selected include paths.
    • Incremental indexing is automatically disabled when include paths are provided.
    • WebSocket-based indexing does not support include-path subsets; use the async HTTP indexing endpoint.

Thread include_paths from API through entire indexing chain:
- POST /repos/{id}/index/async accepts IndexConfig with include_paths
- _run_async_indexing passes include_paths to indexer
- indexer._discover_code_files filters files by include_paths
- indexer.extract_functions_v2 passes include_paths to tree-sitter
- indexer.index_repository_with_progress passes include_paths to discovery
- tree_sitter_extractor.extract_from_repo already supported include_paths

New endpoint:
- GET /repos/{id}/directories returns top-level dirs with file counts
  (for UI directory picker in monorepo subset selection)

Requested by Trevor Keith (Solid/trysolid.com) for Effect-TS monorepo
(200K+ functions across 30+ packages, only needs 2).
@vercel

vercel Bot commented Feb 28, 2026

Copy link
Copy Markdown

@DevanshuNEU is attempting to deploy a commit to the Dev's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Feb 28, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c14515 and f4d61c9.

📒 Files selected for processing (1)
  • backend/routes/repos.py

📝 Walkthrough

Walkthrough

Added monorepo subdirectory listing and subset indexing: new GET route to list top-level repo directories with per-directory code-file counts; introduced IndexConfig (include_paths, incremental) and propagated include_paths through async indexing and the optimized indexer; incremental disabled when include_paths is used.

Changes

Cohort / File(s) Summary
API: repo endpoints & config
backend/routes/repos.py
Added get_repo_directories() to enumerate top-level dirs and file counts; added IndexConfig (include_paths?: List[str], incremental: bool) with sanitization; threaded include_paths and incremental through index_repository_async() and _run_async_indexing(); adjusted progress/response payloads; noted WebSocket lacks include_paths support.
Indexer service: selective discovery & extraction
backend/services/indexer_optimized.py
Added include_paths support to _discover_code_files(), extract_functions_v2(), index_repository_v2(), and index_repository_with_progress() so discovery/extraction/filtering respect provided subtrees; surfaced include_paths in logs/progress.
Manifests / imports
pyproject.toml, requirements.txt, backend/routes/...
Minor manifest/import adjustments to support new types (Path, List, field_validator) and expose new public entities.

Sequence Diagram

sequenceDiagram
    actor Client
    participant API as "API Routes"
    participant Indexer as "Indexer Service"
    participant FS as "File System"

    Client->>API: GET /repo/{id}/directories
    API->>FS: scan top-level dirs (sync)
    FS-->>API: directories + per-dir file counts
    API-->>Client: return directories

    Client->>API: POST /repo/{id}/index (IndexConfig incl. include_paths)
    API->>Indexer: schedule _run_async_indexing(repo_id, include_paths, incremental)
    Indexer->>Indexer: index_repository_with_progress(include_paths)
    Indexer->>FS: discover code files (filtered by include_paths)
    FS-->>Indexer: filtered file list
    Indexer->>Indexer: extract_functions_v2(include_paths)
    Indexer-->>API: progress updates (include_paths)
    API-->>Client: indexing started / progress updates
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped the root and counted every hidden trail,
Picked only chosen branches where focused searches prevail.
Paths in my pocket, incremental left behind,
I mapped the subset gardens and left no file behind.
Hooray — a neat, small harvest for code and rabbit kind.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: monorepo subset indexing -- include_paths support (OPE-106)' clearly and specifically describes the main change: adding monorepo subset indexing functionality with include_paths support. It aligns directly with the primary feature introduced across the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@DevanshuNEU

Copy link
Copy Markdown
Collaborator Author

@CodeRabbit review

@coderabbitai

coderabbitai Bot commented Feb 28, 2026

Copy link
Copy Markdown
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/routes/repos.py`:
- Around line 320-321: The incremental indexing branch ignores include_paths so
changed-file selection can include unrelated files; update the indexing handler
in backend/routes/repos.py (the function that accepts incremental: bool and
include_paths: Optional[List[str]]) to apply the same include_paths filtering
used in the full-index path when computing the set of files to index during
incremental=True (including the path/subset selection when a last commit
exists), i.e., after computing changed_files or candidate_files in the
incremental branch, filter them by matching include_paths patterns (respecting
the same glob/substring logic used in the full-index code) before proceeding
with indexing.
- Around line 199-207: The synchronous filesystem traversal in
get_repo_directories (using local_path.iterdir() and item.rglob()) must be moved
into a separate blocking helper function (e.g., _scan_repo_directories or
scan_repo_dirs) that performs the directory iteration and computes file_count,
then call that helper from the async route via asyncio.to_thread(...) to offload
blocking I/O; ensure the helper returns the data structure expected by
get_repo_directories and replace the inline iteration and rglob usage with the
to_thread call so the event loop is not blocked.

In `@backend/services/indexer_optimized.py`:
- Around line 135-145: The current loop over repo_path.rglob('*') still scans
the whole repo and the include-path check uses rel.startswith which can produce
false matches (e.g., "packages/effectx" matching "packages/effect"); change the
scan to only iterate the specified include_paths when include_paths is set
(build roots = [repo_path / p for p in include_paths] and rglob from each root,
falling back to repo_path.rglob when include_paths is empty), and replace the
startswith check with a segment-aware test on rel (e.g., require rel == p or
rel.startswith(p + os.sep) for each p in include_paths, or compare
Path(file_path).parts prefix equality) so only exact-path or subpath matches are
included; keep existing skip_dirs, extensions, and file_path.is_dir checks
intact (refer to repo_path.rglob, include_paths, file_path, rel, skip_dirs,
extensions).

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a563d3 and 781fdd2.

📒 Files selected for processing (2)
  • backend/routes/repos.py
  • backend/services/indexer_optimized.py

Comment thread backend/routes/repos.py
Comment thread backend/routes/repos.py
Comment thread backend/services/indexer_optimized.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/services/indexer_optimized.py (1)

700-707: 🛠️ Refactor suggestion | 🟠 Major

Add return type annotation -> int to the function signature.

The index_repository_with_progress method at lines 700-707 omits the return type annotation required by the coding guidelines. The function returns a numeric count (line 835: return len(all_functions_data)).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/services/indexer_optimized.py` around lines 700 - 707, The function
index_repository_with_progress is missing a return type annotation; update its
signature to declare it returns an int (i.e., add "-> int" to the def line for
index_repository_with_progress) so the signature matches the actual return value
(len(all_functions_data)); ensure any corresponding type hints or imports remain
valid.
♻️ Duplicate comments (3)
backend/services/indexer_optimized.py (1)

135-145: ⚠️ Potential issue | 🟠 Major

Fix subset matching and scan scope in _discover_code_files.

Line 135 still scans the full repo in subset mode, and Line 144 uses prefix string matching that can over-include sibling paths (e.g., packages/effectx for packages/effect).

Proposed fix
-        for file_path in repo_path.rglob('*'):
+        normalized_includes = [
+            Path(p.strip("/")) for p in (include_paths or []) if p and p.strip("/")
+        ]
+        scan_roots = [repo_path / p for p in normalized_includes] if normalized_includes else [repo_path]
+
+        for root in scan_roots:
+            if not root.exists() or not root.is_dir():
+                continue
+            for file_path in root.rglob('*'):
                 if file_path.is_dir():
                     continue
                 if any(skip in file_path.parts for skip in skip_dirs):
                     continue
                 if file_path.suffix in extensions:
-                # Subset filter: only include files under specified paths
-                if include_paths:
-                    rel = str(file_path.relative_to(repo_path))
-                    if not any(rel.startswith(p) for p in include_paths):
-                        continue
-                code_files.append(file_path)
+                    if normalized_includes:
+                        rel = file_path.relative_to(repo_path)
+                        if not any(rel == inc or inc in rel.parents for inc in normalized_includes):
+                            continue
+                    code_files.append(file_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/services/indexer_optimized.py` around lines 135 - 145, _in
_discover_code_files the loop currently walks the entire repo even when
include_paths is set and uses simple string startswith on rel which can match
sibling prefixes; change behavior so when include_paths is provided you only
iterate files under those include_paths (e.g., iterate repo_path / p for each p
in include_paths rather than repo_path.rglob('*')), or filter using
path-component-aware checks (compare tuples of Path.parts or ensure
rel.startswith(p + os.sep)) to avoid matching names like "packages/effectx" for
"packages/effect"; update variables repo_path, include_paths, rel, and file_path
handling accordingly so scan scope is restricted and matching is done on path
boundaries.
backend/routes/repos.py (2)

346-353: ⚠️ Potential issue | 🟠 Major

Honor include_paths in the incremental branch (or bypass incremental when subset mode is requested).

When incremental=True and last_commit exists, include_paths is not applied, so files outside the requested subset can be indexed.

Proposed fix
-        if incremental and last_commit:
+        if incremental and last_commit and not include_paths:
             logger.info("Async INCREMENTAL indexing", repo_id=repo_id, last_commit=last_commit[:8])
             total_functions = await indexer.incremental_index_repository(
                 repo_id,
                 repo["local_path"],
                 last_commit
             )
             index_type = "incremental"
@@
-        else:
+        else:
+            if incremental and last_commit and include_paths:
+                logger.info(
+                    "Subset indexing requested; skipping incremental path to honor include_paths",
+                    repo_id=repo_id,
+                    include_paths=include_paths,
+                )
             logger.info("Async FULL indexing with progress", repo_id=repo_id)

Also applies to: 392-397

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/repos.py` around lines 346 - 353, The incremental branch
ignores include_paths causing out-of-subset files to be indexed; update the
logic around the incremental check so that when include_paths is provided you
either call a version of indexer.incremental_index_repository that accepts and
applies include_paths (pass include_paths through) or skip incremental and call
the full indexer.index_repository with include_paths instead; modify the block
that uses incremental, last_commit and calls
indexer.incremental_index_repository (and the corresponding similar block later)
to honor include_paths and set index_type accordingly.

199-207: ⚠️ Potential issue | 🟠 Major

Move repository scanning off the async event loop.

Line 199 and Line 204 perform blocking filesystem traversal (iterdir, rglob) in an async route, which can stall the loop on large repos.

Proposed fix
+    def _scan_dirs() -> List[dict]:
+        extensions = {".py", ".js", ".jsx", ".ts", ".tsx"}
+        out: List[dict] = []
+        for item in sorted(local_path.iterdir()):
+            if item.is_dir() and item.name not in skip and not item.name.startswith("."):
+                file_count = sum(
+                    1 for f in item.rglob("*")
+                    if f.is_file() and f.suffix in extensions
+                    and not any(s in f.parts for s in skip)
+                )
+                out.append({
+                    "name": item.name,
+                    "path": str(item.relative_to(local_path)),
+                    "file_count": file_count,
+                })
+        return out
+
-    dirs = []
-    for item in sorted(local_path.iterdir()):
-        if item.is_dir() and item.name not in skip and not item.name.startswith("."):
-            # count code files in this directory
-            extensions = {".py", ".js", ".jsx", ".ts", ".tsx"}
-            file_count = sum(
-                1 for f in item.rglob("*")
-                if f.is_file() and f.suffix in extensions
-                and not any(s in f.parts for s in skip)
-            )
-            dirs.append({
-                "name": item.name,
-                "path": str(item.relative_to(local_path)),
-                "file_count": file_count,
-            })
+    dirs = await asyncio.to_thread(_scan_dirs)

Based on learnings: Applies to backend/**/*.py : Use async/await for I/O operations in Python backend.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/repos.py` around lines 199 - 207, The blocking filesystem
traversal using local_path.iterdir() and item.rglob() (which computes file_count
and uses the extensions set) is running inside an async route handler and should
be moved off the event loop; refactor by extracting the directory scanning logic
that builds file_count (and uses extensions and skip) into a synchronous helper
function (e.g., sync_scan_directory(local_path, skip, extensions)) and call it
from the async handler via await asyncio.to_thread(...) or
loop.run_in_executor(...), returning the result so the async route no longer
performs direct iterdir()/rglob() calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@backend/services/indexer_optimized.py`:
- Around line 700-707: The function index_repository_with_progress is missing a
return type annotation; update its signature to declare it returns an int (i.e.,
add "-> int" to the def line for index_repository_with_progress) so the
signature matches the actual return value (len(all_functions_data)); ensure any
corresponding type hints or imports remain valid.

---

Duplicate comments:
In `@backend/routes/repos.py`:
- Around line 346-353: The incremental branch ignores include_paths causing
out-of-subset files to be indexed; update the logic around the incremental check
so that when include_paths is provided you either call a version of
indexer.incremental_index_repository that accepts and applies include_paths
(pass include_paths through) or skip incremental and call the full
indexer.index_repository with include_paths instead; modify the block that uses
incremental, last_commit and calls indexer.incremental_index_repository (and the
corresponding similar block later) to honor include_paths and set index_type
accordingly.
- Around line 199-207: The blocking filesystem traversal using
local_path.iterdir() and item.rglob() (which computes file_count and uses the
extensions set) is running inside an async route handler and should be moved off
the event loop; refactor by extracting the directory scanning logic that builds
file_count (and uses extensions and skip) into a synchronous helper function
(e.g., sync_scan_directory(local_path, skip, extensions)) and call it from the
async handler via await asyncio.to_thread(...) or loop.run_in_executor(...),
returning the result so the async route no longer performs direct
iterdir()/rglob() calls.

In `@backend/services/indexer_optimized.py`:
- Around line 135-145: _in _discover_code_files the loop currently walks the
entire repo even when include_paths is set and uses simple string startswith on
rel which can match sibling prefixes; change behavior so when include_paths is
provided you only iterate files under those include_paths (e.g., iterate
repo_path / p for each p in include_paths rather than repo_path.rglob('*')), or
filter using path-component-aware checks (compare tuples of Path.parts or ensure
rel.startswith(p + os.sep)) to avoid matching names like "packages/effectx" for
"packages/effect"; update variables repo_path, include_paths, rel, and file_path
handling accordingly so scan scope is restricted and matching is done on path
boundaries.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a563d3 and 781fdd2.

📒 Files selected for processing (2)
  • backend/routes/repos.py
  • backend/services/indexer_optimized.py

1. Add -> int return annotation to index_repository_with_progress
2. Skip incremental indexing when include_paths is set (incremental_index_repository
   uses git diff which doesn't understand subset boundaries)
3. Move blocking rglob() out of async handler into _scan_directories() sync helper,
   called via asyncio.to_thread() to avoid blocking the event loop on large repos
4. Replace full-repo rglob + startswith filter in _discover_code_files with targeted
   subtree walking (only iterate include_paths dirs) -- eliminates prefix collision
   risk (packages/effectx no longer matches packages/effect)
@DevanshuNEU

Copy link
Copy Markdown
Collaborator Author

@CodeRabbit review

@coderabbitai

coderabbitai Bot commented Feb 28, 2026

Copy link
Copy Markdown
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (4)
backend/routes/repos.py (4)

205-228: Add return type annotation to endpoint.

The endpoint is missing a return type annotation. Consider adding -> dict for consistency with coding guidelines requiring type hints on all function signatures.

Proposed fix
 `@router.get`("/{repo_id}/directories")
 async def get_repo_directories(
     repo_id: str,
     auth: AuthContext = Depends(require_auth),
-):
+) -> dict:

The use of asyncio.to_thread(_scan_directories, local_path) correctly offloads blocking filesystem I/O from the event loop.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/repos.py` around lines 205 - 228, The endpoint function
get_repo_directories is missing a return type annotation; update its signature
to include an explicit return type (-> dict) to satisfy the project's typing
guidelines, e.g., change the def get_repo_directories(...) line to include ->
dict, leaving the body and use of asyncio.to_thread(_scan_directories,
local_path) unchanged and referencing get_repo_directories and _scan_directories
to locate the function.

181-202: Consider using a more specific return type annotation.

The return type list is generic. Per coding guidelines requiring type hints on all function signatures, a more specific annotation improves clarity and enables better static analysis.

Proposed fix
-def _scan_directories(local_path: Path) -> list:
+def _scan_directories(local_path: Path) -> List[dict]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/repos.py` around lines 181 - 202, The function
_scan_directories currently returns a generic list; update its signature to a
precise type such as List[Dict[str, Union[str, int]]] (or list[dict[str, str |
int]] if using Python 3.10+), add the necessary typing imports (List, Dict,
Union or Any) at the top of the module, and adjust any tests/type checks
accordingly so the returned list of dicts with keys "name" (str), "path" (str)
and "file_count" (int) is fully reflected in the annotation.

459-463: Consider validating include_paths entries.

The include_paths field accepts arbitrary strings without validation. Consider adding Pydantic validators to:

  1. Reject empty strings
  2. Reject path traversal patterns (e.g., .. components)
  3. Optionally normalize paths (strip leading/trailing slashes)

This provides defense-in-depth even if downstream code handles invalid paths safely.

Example validation
+from pydantic import field_validator
+
 class IndexConfig(BaseModel):
     """Optional config for indexing -- supports monorepo subset selection."""
     include_paths: Optional[List[str]] = None  # e.g. ["packages/effect", "packages/schema"]
     incremental: bool = True
+
+    `@field_validator`("include_paths")
+    `@classmethod`
+    def validate_paths(cls, v: Optional[List[str]]) -> Optional[List[str]]:
+        if v is None:
+            return v
+        validated = []
+        for p in v:
+            p = p.strip().strip("/")
+            if not p:
+                raise ValueError("Empty path not allowed")
+            if ".." in p.split("/"):
+                raise ValueError(f"Path traversal not allowed: {p}")
+            validated.append(p)
+        return validated if validated else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/repos.py` around lines 459 - 463, The
IndexConfig.include_paths currently accepts arbitrary strings; add Pydantic
validation on IndexConfig to (1) reject empty strings, (2) reject any path
containing '..' path-segments (path traversal), and (3) normalize entries by
stripping leading/trailing slashes; implement this with a validator on
"include_paths" (or validator("include_paths", each_item=True)) that iterates
items, strips whitespace/slashes, checks for empty result and for any '..'
components (split on os.path.sep or "/"), and raises ValueError with a clear
message for invalid entries so downstream code receives sanitized, normalized
path strings.

617-621: WebSocket indexing endpoint lacks include_paths support for subset selection.

The websocket_index function does not accept include_paths as a parameter and calls index_repository_with_progress without it, while the async HTTP indexing flow (_run_async_indexing) supports include_paths for the same underlying indexer method. This creates an inconsistency between two alternative indexing paths.

Consider either:

  1. Adding include_paths parameter to websocket_index and propagating it to the indexer call for feature parity, or
  2. Documenting why WebSocket indexing intentionally excludes this capability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/repos.py` around lines 617 - 621, The websocket_index handler
currently calls index_repository_with_progress(repo_id, repo["local_path"],
progress_callback) without supporting include_paths, causing feature mismatch
with the async HTTP path; update the websocket_index function signature to
accept an include_paths parameter (same name and semantics as in
_run_async_indexing) and pass it through to index_repository_with_progress, or
if deliberate, add a clear comment and validation that websocket_index does not
accept include_paths; locate websocket_index and the
index_repository_with_progress call in backend/routes/repos.py and mirror how
_run_async_indexing handles include_paths to ensure consistent behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/routes/repos.py`:
- Around line 205-228: The endpoint function get_repo_directories is missing a
return type annotation; update its signature to include an explicit return type
(-> dict) to satisfy the project's typing guidelines, e.g., change the def
get_repo_directories(...) line to include -> dict, leaving the body and use of
asyncio.to_thread(_scan_directories, local_path) unchanged and referencing
get_repo_directories and _scan_directories to locate the function.
- Around line 181-202: The function _scan_directories currently returns a
generic list; update its signature to a precise type such as List[Dict[str,
Union[str, int]]] (or list[dict[str, str | int]] if using Python 3.10+), add the
necessary typing imports (List, Dict, Union or Any) at the top of the module,
and adjust any tests/type checks accordingly so the returned list of dicts with
keys "name" (str), "path" (str) and "file_count" (int) is fully reflected in the
annotation.
- Around line 459-463: The IndexConfig.include_paths currently accepts arbitrary
strings; add Pydantic validation on IndexConfig to (1) reject empty strings, (2)
reject any path containing '..' path-segments (path traversal), and (3)
normalize entries by stripping leading/trailing slashes; implement this with a
validator on "include_paths" (or validator("include_paths", each_item=True))
that iterates items, strips whitespace/slashes, checks for empty result and for
any '..' components (split on os.path.sep or "/"), and raises ValueError with a
clear message for invalid entries so downstream code receives sanitized,
normalized path strings.
- Around line 617-621: The websocket_index handler currently calls
index_repository_with_progress(repo_id, repo["local_path"], progress_callback)
without supporting include_paths, causing feature mismatch with the async HTTP
path; update the websocket_index function signature to accept an include_paths
parameter (same name and semantics as in _run_async_indexing) and pass it
through to index_repository_with_progress, or if deliberate, add a clear comment
and validation that websocket_index does not accept include_paths; locate
websocket_index and the index_repository_with_progress call in
backend/routes/repos.py and mirror how _run_async_indexing handles include_paths
to ensure consistent behavior.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 781fdd2 and 022c297.

📒 Files selected for processing (2)
  • backend/routes/repos.py
  • backend/services/indexer_optimized.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/services/indexer_optimized.py

…docs

1. Add -> dict return type to get_repo_directories
2. Add -> List[dict] return type to _scan_directories
3. Add Pydantic validator on IndexConfig.include_paths:
   - Rejects empty strings
   - Rejects path traversal (.. segments)
   - Normalizes leading/trailing slashes
4. Add docstring to websocket_index explaining it does NOT support
   include_paths (older pattern, use HTTP async endpoint instead)

All 4 findings verified against code before fixing.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/routes/repos.py`:
- Around line 464-473: The include_paths validator uses deprecated Pydantic v1
`@validator` and calls strip() on pre=True values which raises AttributeError for
non-string input; replace it with Pydantic v2 `@field_validator` on
"include_paths" (use mode="before" and each_item=True) and add an explicit type
check in sanitize_path: if the value is not a str raise a validation error
(TypeError or ValueError) with a clear message, otherwise perform v =
v.strip().strip("/"), then enforce non-empty and no ".." segments as before;
update the decorator and the sanitize_path function name references accordingly.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 022c297 and fb0a985.

📒 Files selected for processing (1)
  • backend/routes/repos.py

Comment thread backend/routes/repos.py Outdated
Project requires pydantic>=2.0.0 and uses field_validator elsewhere
(playground routes). Replace deprecated @validator with @field_validator:
- mode='before' instead of pre=True
- Iterate list manually instead of each_item=True (removed in v2)
- Explicit isinstance check for non-string input
- Same security behavior: rejects empty, traversal, normalizes slashes

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
backend/routes/repos.py (1)

459-480: LGTM! The validator correctly implements Pydantic v2 syntax and handles edge cases (non-strings, empty strings, path traversal).

Optional: Consider normalizing Windows-style backslashes for cross-platform clients:

Optional enhancement
             if not isinstance(item, str):
                 raise ValueError(f"include_paths entries must be strings, got {type(item).__name__}")
+            item = item.replace("\\", "/")
             item = item.strip().strip("/")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/repos.py` around lines 459 - 480, IndexConfig.sanitize_paths
currently rejects path-traversal and empty strings but doesn't normalize Windows
backslashes; update sanitize_paths to first replace backslashes with forward
slashes (e.g., item = item.replace("\\\\", "/")) before strip/strip("/") and the
".." check so clients submitting "packages\\foo" are normalized to
"packages/foo" and still validated; keep the existing type checks, trimming,
empty string check, and path-traversal check and return the cleaned list for
include_paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/routes/repos.py`:
- Around line 459-480: IndexConfig.sanitize_paths currently rejects
path-traversal and empty strings but doesn't normalize Windows backslashes;
update sanitize_paths to first replace backslashes with forward slashes (e.g.,
item = item.replace("\\\\", "/")) before strip/strip("/") and the ".." check so
clients submitting "packages\\foo" are normalized to "packages/foo" and still
validated; keep the existing type checks, trimming, empty string check, and
path-traversal check and return the cleaned list for include_paths.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb0a985 and 6c14515.

📒 Files selected for processing (1)
  • backend/routes/repos.py

Defensive measure -- converts Windows-style backslashes to forward
slashes before validation. Costs nothing, prevents edge cases if
infra or client platform changes.
@vercel

vercel Bot commented Feb 28, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
opencodeintel Ignored Ignored Preview Feb 28, 2026 9:57pm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant