feat: POST /repos/analyze -- pre-clone directory analysis via GitHub API (OPE-109) #267
…API (OPE-109)
New endpoint that returns a repo's directory structure WITHOUT cloning.
Uses GitHub Tree API to fetch the full file tree instantly (~200ms),
then groups by directory with code file counts.
Implementation:
- _github_headers(): builds auth headers with optional GITHUB_TOKEN
- _fetch_directory_tree(): calls GitHub Tree API, groups files by directory,
smart monorepo detection (packages/, libs/, apps/ grouped one level deeper)
- AnalyzeRepoRequest: Pydantic model with URL validation
- POST /repos/analyze: orchestrates metadata + tree fetch, returns directory
list with suggestion flag for large repos (>500 files or >10 dirs)
Key design decisions:
- Route placed before /{repo_id} to avoid path conflict
- Monorepo-aware grouping: 'packages/effect' not just 'packages'
- No auth required (same as viewing a public GitHub repo)
- Reuses RepoValidator.CODE_EXTENSIONS and SKIP_DIRS for consistency
- Returns 'suggestion: large_repo' to trigger directory picker in frontend
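A minimal sketch of the monorepo-aware grouping described above, assuming the tree has already been reduced to a list of file paths. The function name `group_by_directory` and the three constant sets are illustrative stand-ins; the real extension and skip lists live in `RepoValidator` and are not shown in this thread.

```python
from __future__ import annotations

from collections import Counter
from pathlib import PurePosixPath

# Illustrative stand-ins (assumptions); the PR reuses RepoValidator's real sets.
CODE_EXTENSIONS = {".py", ".ts", ".js"}
SKIP_DIRS = {"node_modules", ".git", "dist"}
MONOREPO_ROOTS = {"packages", "libs", "apps"}


def group_by_directory(paths: list[str]) -> dict[str, int]:
    """Count code files per directory, going one level deeper under monorepo roots."""
    counts: Counter = Counter()
    for raw in paths:
        path = PurePosixPath(raw)
        parts = path.parts
        if path.suffix not in CODE_EXTENSIONS:
            continue  # not a code file
        if any(part in SKIP_DIRS for part in parts[:-1]):
            continue  # lives under a skipped directory
        if len(parts) == 1:
            key = "."  # file at the repository root
        elif parts[0] in MONOREPO_ROOTS and len(parts) > 2:
            key = f"{parts[0]}/{parts[1]}"  # e.g. 'packages/effect'
        else:
            key = parts[0]
        counts[key] += 1
    return dict(counts)


example = group_by_directory([
    "README.md",
    "setup.py",
    "src/app.py",
    "packages/effect/src/index.ts",
    "packages/schema/src/index.ts",
    "packages/effect/node_modules/x/index.js",
])
```

With these inputs, `packages/effect` and `packages/schema` get separate entries instead of collapsing into a single `packages` bucket, and the file under `node_modules` is ignored.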
📝 Walkthrough
Adds a new POST /repos/analyze endpoint and supporting logic to validate GitHub URLs, fetch repo metadata and the GitHub tree via API, group directories, count files, estimate functions, and return aggregated pre-clone analysis; includes comprehensive tests for validation and tree-processing.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Backend
participant GitHub
Client->>Backend: POST /repos/analyze { url }
activate Backend
Backend->>Backend: Validate URL with _GITHUB_URL_RE
alt invalid
Backend-->>Client: 400 Validation Error
else valid
Backend->>GitHub: GET /repos/{owner}/{repo}
activate GitHub
GitHub-->>Backend: repo metadata (default_branch, size, stars, language)
deactivate GitHub
Backend->>GitHub: GET /repos/{owner}/{repo}/git/trees/{branch}?recursive=1
activate GitHub
GitHub-->>Backend: tree (paths, types)
deactivate GitHub
Backend->>Backend: Group directories, count files, estimate functions, set suggestions
Backend-->>Client: 200 {metadata, directories, totals}
end
deactivate Backend
🧹 Nitpick comments (2)
backend/routes/repos.py (2)
287-304: Consider reusing a single AsyncClient for both API calls.
The endpoint creates two separate httpx.AsyncClient instances: one for the metadata fetch (line 287) and another inside _fetch_directory_tree (line 185). Connection reuse would be more efficient.
♻️ Proposed refactor to reuse client
Pass the client as an optional parameter to _fetch_directory_tree:
 async def _fetch_directory_tree(
     owner: str,
     repo: str,
     branch: str,
+    client: httpx.AsyncClient | None = None,
 ) -> dict:
     ...
-    async with httpx.AsyncClient(timeout=15.0) as client:
-        response = await client.get(url, headers=_github_headers())
+    if client is None:
+        async with httpx.AsyncClient(timeout=15.0) as client:
+            response = await client.get(url, headers=_github_headers())
+            # ... rest of processing
+    else:
+        response = await client.get(url, headers=_github_headers())
Then in analyze_repository:
-    async with httpx.AsyncClient(timeout=10.0) as client:
+    async with httpx.AsyncClient(timeout=15.0) as client:
         meta_resp = await client.get(...)
-
-    # Fetch directory tree
-    tree_data = await _fetch_directory_tree(owner, repo_name, default_branch)
+        # Fetch directory tree
+        tree_data = await _fetch_directory_tree(owner, repo_name, default_branch, client=client)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/routes/repos.py` around lines 287 - 304, The metadata fetch opens an httpx.AsyncClient and _fetch_directory_tree opens another—reuse one client instead: add an optional client parameter to _fetch_directory_tree (e.g., def _fetch_directory_tree(..., client: Optional[httpx.AsyncClient]=None) / async def) and, if client is None, create and close a local AsyncClient otherwise use the passed client; update the call in analyze_repository (where meta_resp is fetched with httpx.AsyncClient) to pass that same client into _fetch_directory_tree; ensure the passed client is used for requests, that timeout/headers (from _github_headers()) are preserved, and that only callers that don’t pass a client keep creating/closing their own AsyncClient.
269-275: Add rate limiting and caching to the /analyze endpoint to prevent abuse of GitHub API quota.
This endpoint requires no authentication and makes two GitHub API calls per request without caching. An attacker could repeatedly request the same repository, exhausting your GitHub API quota and causing the service to be rate limited.
Consider implementing:
- In-memory or Redis caching with 5-minute TTL (similar to the /validate-repo endpoint)
- Request rate limiting per client IP using the existing @rate_limit decorator
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/routes/repos.py` around lines 269 - 275, The /analyze endpoint (analyze_repository handling AnalyzeRepoRequest) currently makes two unauthenticated GitHub calls with no caching or per-IP throttling; add the same caching and rate-limiting used by /validate-repo: apply the existing `@rate_limit` decorator to analyze_repository and add a 5-minute TTL cache lookup/insert (in-memory or Redis, consistent with the validate-repo implementation) keyed by repository identifier (e.g., owner/name + ref/branch) so repeated requests return cached results instead of calling GitHub; ensure cache hit returns the same dict response and cache misses execute the current logic then store the result with TTL.
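To make the caching suggestion concrete, here is a rough in-memory sketch of a TTL cache keyed by repository. It is an illustration only: the `TTLCache` class, the injectable `clock` parameter, and the `analyze:owner/repo` key format are assumptions, and the project's existing Redis cache is not shown in this thread.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Tiny in-memory cache whose entries expire ttl_seconds after being set."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None  # cache miss
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # drop the stale entry
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def cache_key(owner: str, repo: str) -> str:
    """Hypothetical key format; lowercased so casing differences share one entry."""
    return f"analyze:{owner.lower()}/{repo.lower()}"
```

On a cache hit the handler would return the stored response and skip both GitHub calls; on a miss it would run the existing logic and call `set` with the result.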
Test coverage:
- URL regex: standard, trailing slash, http, rejects non-github/subpaths/no-repo
- AnalyzeRepoRequest: valid, whitespace strip, rejects empty/non-github
- IndexConfig: valid paths, slash normalization, backslash normalization, rejects empty/traversal/nested-traversal/non-string, allows None
- _fetch_directory_tree: flat repo grouping, monorepo package-level grouping, node_modules skipping, large_repo suggestion, small repo no suggestion
Also manually verified on Effect-TS/effect (1,767 files, 33 packages, correct monorepo grouping at packages/* level).
23/23 pass in 2.8s.
…exing)
Indexing is function-level, not file-level. Tier limits are function-based (2K free, 20K pro, 500K enterprise). But the analyze endpoint only returned file counts -- users couldn't compare against their limits.
Now each directory entry includes estimated_functions (file_count * 25, same multiplier RepoValidator uses for tier checks). Response also includes total_estimated_functions for the whole repo.
Effect-TS example:
packages/effect: 958 files, ~23,950 functions
packages/schema: 203 files, ~5,075 functions
Total: 1,767 files, ~44,175 functions
User on Pro tier (20K limit) can immediately see they need to pick a subset.
24 tests pass (1 new for function estimation).
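The arithmetic behind the Effect-TS numbers can be reproduced with a few lines. This is a sketch only: the multiplier and tier limits are the values quoted in the commit message, while the names `FUNCTIONS_PER_FILE`, `TIER_LIMITS`, `estimate_functions`, and `fits_tier` are made up for illustration.

```python
FUNCTIONS_PER_FILE = 25  # multiplier quoted in the commit message

# Tier limits as quoted in the commit message (functions, not files).
TIER_LIMITS = {"free": 2_000, "pro": 20_000, "enterprise": 500_000}


def estimate_functions(file_count: int) -> int:
    """Rough function estimate for a directory or a whole repo."""
    return file_count * FUNCTIONS_PER_FILE


def fits_tier(file_count: int, tier: str) -> bool:
    """True if the estimated function count is within the tier's limit."""
    return estimate_functions(file_count) <= TIER_LIMITS[tier]
```

For example, `estimate_functions(958)` gives 23,950 for packages/effect, which already exceeds the Pro limit on its own, while packages/schema (203 files, 5,075 estimated functions) fits comfortably.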
1. _fetch_directory_tree now accepts optional client param -- analyze_repository opens one AsyncClient for both the metadata and tree API calls instead of two
2. Results cached for 5 min (same TTL as validate-repo) keyed by owner/repo, avoids redundant GitHub API calls on page refresh or retry
3. Cache uses existing Redis cache from dependencies (same as validate-repo)
24 tests pass.
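The optional-client pattern from item 1 can be illustrated without any network access. In this sketch `StubClient` is a stand-in for `httpx.AsyncClient` that only counts how many instances get opened; `fetch_tree` and `analyze` are simplified, hypothetical versions of the PR's functions, and the branch name `main` is an assumption.

```python
import asyncio
from typing import Optional


class StubClient:
    """Stand-in for httpx.AsyncClient; counts how many instances were opened."""

    opened = 0

    async def __aenter__(self) -> "StubClient":
        StubClient.opened += 1
        return self

    async def __aexit__(self, *exc_info) -> None:
        return None

    async def get(self, url: str) -> dict:
        return {"url": url}  # a real client would perform the HTTP request


async def fetch_tree(owner: str, repo: str, branch: str,
                     client: Optional[StubClient] = None) -> dict:
    url = f"/repos/{owner}/{repo}/git/trees/{branch}?recursive=1"
    if client is not None:
        return await client.get(url)  # caller owns the client's lifetime
    async with StubClient() as own_client:  # no client passed: open and close our own
        return await own_client.get(url)


async def analyze(owner: str, repo: str) -> dict:
    async with StubClient() as client:  # one client shared by both calls
        meta = await client.get(f"/repos/{owner}/{repo}")
        tree = await fetch_tree(owner, repo, "main", client=client)
    return {"meta": meta, "tree": tree}


if __name__ == "__main__":
    print(asyncio.run(analyze("Effect-TS", "effect")))
```

Keeping the `client is None` branch means existing callers of the helper keep working unchanged, while the endpoint pays for only one connection pool per request.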
Actionable comments posted: 3
🧹 Nitpick comments (1)
backend/routes/repos.py (1)
276-284: Use the compiled GitHub URL regex in request validation for consistent failures.
Current substring validation can pass malformed domains (e.g., containing github.com as a substring) and defer rejection to route-level parsing.
♻️ Suggested cleanup
 def validate_url(cls, v: str) -> str:
     v = v.strip().rstrip("/")
     if not v:
         raise ValueError("GitHub URL is required")
-    if "github.com" not in v.lower():
-        raise ValueError("Only GitHub URLs are supported")
+    if not _GITHUB_URL_RE.match(v):
+        raise ValueError("Invalid GitHub URL. Expected: https://github.com/owner/repo")
     return v
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/routes/repos.py` around lines 276 - 284, The field validator validate_url (decorated with `@field_validator`("github_url")) currently uses a substring check which allows malformed domains; replace the substring check with a match against the project’s compiled GitHub URL regex used by route-level parsing (use the same compiled regex object rather than "github.com" in v.lower()) and raise ValueError when the regex does not match after trimming and rstrip("/"), so validation fails consistently at request parsing rather than later in route handling.
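For illustration, the difference between a substring check and an anchored regex can be sketched as follows. The pattern below is hypothetical, since the PR's actual `_GITHUB_URL_RE` is not shown in this thread, and `parse_github_url` is an invented helper name.

```python
import re
from typing import Tuple

# Hypothetical pattern (assumption); the PR's real _GITHUB_URL_RE may differ.
GITHUB_URL_RE = re.compile(
    r"^https?://github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)$"
)


def parse_github_url(url: str) -> Tuple[str, str]:
    """Return (owner, repo) or raise ValueError for anything that is not a repo URL."""
    cleaned = url.strip().rstrip("/")
    if not cleaned:
        raise ValueError("GitHub URL is required")
    match = GITHUB_URL_RE.match(cleaned)
    if match is None:
        raise ValueError("Invalid GitHub URL. Expected: https://github.com/owner/repo")
    return match.group(1), match.group(2)
```

Because the pattern is anchored at both ends, hosts such as `fakegithub.com` and URLs with extra path segments are rejected at validation time rather than later in the route.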
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/routes/repos.py`:
- Around line 185-196: The GitHub API calls in _fetch_directory_tree() and
analyze_repository() currently only check status codes and can raise unhandled
httpx.RequestError (network/timeout) or ValueError (JSON parse); wrap the
httpx.AsyncClient GET/response.json() calls in try/except blocks that catch
httpx.RequestError and ValueError, log the exception, and raise HTTPException
with a clear status (e.g., 504 Gateway Timeout or 502 Bad Gateway) and
actionable detail message; ensure you reference the existing functions
_fetch_directory_tree() and analyze_repository() and keep existing status-code
handling for response.status_code != 200 inside the try so transport/parse
failures are handled gracefully.
- Line 183: The constructed URL for the GitHub tree API currently interpolates
branch directly into url (variable url built from _GITHUB_API_BASE and
owner/repo/branch), which breaks for branch names containing “/”; fix by
percent-encoding the branch path segment before interpolation (use
urllib.parse.quote with safe='' to ensure '/' becomes '%2F') and then rebuild
url using the encoded_branch variable so owner, repo, and encoded_branch are
concatenated into the API path.
In `@backend/tests/test_analyze_repo.py`:
- Around line 53-58: The tests currently catch any Exception which is too broad;
change the assertions to expect pydantic's ValidationError for model validation
failures from AnalyzeRepoRequest (and similarly for IndexConfig tests) by
importing ValidationError from pydantic and using pytest.raises(ValidationError)
instead of pytest.raises(Exception) so the tests only pass for the specific
validation errors raised by the `@field_validator` logic.
A repo's directory structure rarely changes (a new package is added maybe once a month). A 24hr cache saves GitHub API calls at ~3-5KB per entry, so 1,000 cached repos come to at most ~5MB of Redis. Effectively free.
…ation
4 findings verified and fixed:
1. Wrap GitHub API calls in try/except for httpx.TimeoutException, httpx.RequestError, and ValueError (JSON parse). Returns proper 504/502 instead of unhandled crash.
2. URL-encode branch name with urllib.parse.quote(safe='') so branches like 'feature/foo' produce 'feature%2Ffoo' in the GitHub Tree API URL instead of breaking the path.
3. Tests now use pytest.raises(ValidationError) instead of broad Exception -- catches only Pydantic validation errors as intended. Added test_rejects_malformed_github_domain for 'fakegithub.com'.
4. AnalyzeRepoRequest validator uses _GITHUB_URL_RE regex instead of substring check -- rejects 'fakegithub.com' and 'notgithub.com.evil'.
25 tests pass.
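The branch-encoding fix in item 2 comes down to one standard-library call. A small sketch, assuming `_GITHUB_API_BASE` is `https://api.github.com` (its value is not shown in this thread) and using an invented helper name `tree_url`:

```python
from urllib.parse import quote

GITHUB_API_BASE = "https://api.github.com"  # assumed value of _GITHUB_API_BASE


def tree_url(owner: str, repo: str, branch: str) -> str:
    """Build the Tree API URL with the branch percent-encoded as one path segment."""
    # safe="" makes quote() encode "/" as "%2F"; by default "/" is left untouched.
    encoded_branch = quote(branch, safe="")
    return (
        f"{GITHUB_API_BASE}/repos/{owner}/{repo}"
        f"/git/trees/{encoded_branch}?recursive=1"
    )
```

Without `safe=""`, a branch like `feature/foo` would be read as two path segments and the request would target the wrong resource.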
@CodeRabbit review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/routes/repos.py`:
- Around line 322-323: Update the misleading cache-duration comments that say “5
minutes” to correctly state the actual TTL used (_ANALYZE_CACHE_TTL) which is 24
hours; locate the comments near the _ANALYZE_CACHE_TTL definition and the
docstring/comment around the results caching (references: _ANALYZE_CACHE_TTL and
the function/class docstring that mentions results are cached) and change the
text from “5 minutes” to “24 hours” or refer to _ANALYZE_CACHE_TTL to avoid
hardcoding a stale duration.
- Around line 164-170: The /analyze handler currently uses the server
GITHUB_TOKEN via _github_headers() to call GitHub and immediately calls
_fetch_directory_tree() with repository metadata, which can leak private repo
info to unauthenticated callers; update the /analyze flow to detect whether the
incoming request is authenticated and if not, check metadata.get("private")
before calling _fetch_directory_tree() and return a 403 or an explicit error for
private repos; alternatively require a caller-supplied GitHub access token for
private repos and validate it before invoking _fetch_directory_tree(); ensure
the checks reference the /analyze request context, the metadata.get("private")
field, the _fetch_directory_tree() call, and the _github_headers() usage so
private repos are never fetched for unauthenticated requests.
1. Security: reject private repos with 403 before fetching directory tree. Server GITHUB_TOKEN could access private repos via GitHub API, which would leak private repo structure to unauthenticated /analyze callers. Same check the playground validation already does.
2. Fix stale comments that said '5 minutes' when TTL was bumped to 24 hours.
25 tests pass.
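The ordering that matters for the security fix (check `private` first, fetch the tree second) can be sketched in isolation. Everything named here is hypothetical: `PrivateRepoError` stands in for whatever 403 response the route raises, and `fetch_tree` is passed in as a callable so the sketch stays free of network code.

```python
from typing import Callable


class PrivateRepoError(Exception):
    """Hypothetical error; the real route presumably raises an HTTP 403 instead."""

    status_code = 403


def ensure_public(metadata: dict) -> None:
    """Reject repositories whose GitHub metadata marks them as private."""
    if metadata.get("private"):
        raise PrivateRepoError("Private repositories cannot be analyzed")


def analyze(metadata: dict, fetch_tree: Callable[[], dict]) -> dict:
    ensure_public(metadata)  # must run before any tree request is made
    return fetch_tree()
```

Since the check runs before `fetch_tree` is called, a server-side token with access to private repos never gets used on behalf of an unauthenticated caller.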
What
New endpoint that returns a repo's directory structure WITHOUT cloning. Uses GitHub Tree API (~200ms) to fetch the full file tree, then groups by directory with code file counts.
Why
PR #266 added include_paths to the indexing chain, but users could only select directories AFTER cloning. This endpoint enables directory selection BEFORE cloning -- instant feedback, zero wasted bandwidth.
Requested by Trevor Keith (Solid/trysolid.com) for Effect-TS monorepo (200K+ functions, only needs 2 packages).
Endpoint
Key decisions
Depends on
Closes OPE-109
Summary by CodeRabbit