
feat: POST /repos/analyze -- pre-clone directory analysis via GitHub API (OPE-109)#267

Merged
DevanshuNEU merged 7 commits into
OpenCodeIntel:mainfrom
DevanshuNEU:feat/pre-clone-analysis
Feb 28, 2026

DevanshuNEU (Collaborator) commented Feb 28, 2026

What

New endpoint that returns a repo's directory structure WITHOUT cloning. It uses the GitHub Tree API (~200ms) to fetch the full file tree, then groups files by directory and reports code file counts.

Why

PR #266 added include_paths to the indexing chain, but users could only select directories AFTER cloning. This endpoint enables directory selection BEFORE cloning -- instant feedback, zero wasted bandwidth.

Requested by Trevor Keith (Solid/trysolid.com) for the Effect-TS monorepo (200K+ functions, only 2 packages needed).

Endpoint

POST /api/v1/repos/analyze
Body: { "github_url": "https://github.com/Effect-TS/effect" }

Response: {
  "owner": "Effect-TS",
  "repo": "effect",
  "default_branch": "main",
  "total_files": 1761,
  "size_kb": 45000,
  "directories": [
    {"name": "packages/effect", "path": "packages/effect", "file_count": 487},
    {"name": "packages/schema", "path": "packages/schema", "file_count": 203}
  ],
  "suggestion": "large_repo"
}

Key decisions

  • Route at /repos/analyze (before /{repo_id} to avoid path conflict)
  • Monorepo-aware grouping: packages/, libs/, apps/ grouped one level deeper
  • No auth required (public GitHub repos are public)
  • Reuses RepoValidator.CODE_EXTENSIONS and SKIP_DIRS for consistency
  • Returns suggestion: large_repo flag for frontend to trigger directory picker
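
The monorepo-aware grouping decision can be sketched roughly as follows. This is a minimal illustration, not the code in backend/routes/repos.py: `CODE_EXTENSIONS` and `SKIP_DIRS` here are assumed stand-ins for the `RepoValidator` constants (whose real contents are not shown on this page), and `group_by_directory` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumed stand-ins for RepoValidator.CODE_EXTENSIONS and SKIP_DIRS.
CODE_EXTENSIONS = {".py", ".ts", ".tsx", ".js", ".go", ".rs"}
SKIP_DIRS = {"node_modules", ".git", "dist", "build"}
# Roots that are grouped one level deeper, per the PR description.
MONOREPO_ROOTS = {"packages", "libs", "apps"}


def group_by_directory(paths: list[str]) -> dict[str, int]:
    """Count code files per directory, going one level deeper under monorepo roots."""
    counts: Counter = Counter()
    for raw in paths:
        path = PurePosixPath(raw)
        parts = path.parts
        if path.suffix not in CODE_EXTENSIONS:
            continue  # not a code file
        if any(part in SKIP_DIRS for part in parts[:-1]):
            continue  # lives under a skipped directory
        if len(parts) == 1:
            key = "."  # file at the repository root
        elif parts[0] in MONOREPO_ROOTS and len(parts) >= 3:
            key = f"{parts[0]}/{parts[1]}"  # e.g. 'packages/effect'
        else:
            key = parts[0]
        counts[key] += 1
    return dict(counts)
```

With this shape, `packages/effect/src/a.ts` is counted under `packages/effect` rather than `packages`, which is what lets a user pick individual packages.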

Depends on

Closes OPE-109

Summary by CodeRabbit

  • New Features
    • Added pre-clone GitHub repository analysis endpoint: view repo metadata (default branch, size, stars, language), directory breakdown with file counts, estimated function counts, and large-repo suggestions before importing.
  • Tests
    • Added comprehensive tests for URL validation, request validation, directory grouping, file/function estimates, large-repo detection, and error handling.

…API (OPE-109)

New endpoint that returns a repo's directory structure WITHOUT cloning.
Uses GitHub Tree API to fetch the full file tree instantly (~200ms),
then groups by directory with code file counts.

Implementation:
- _github_headers(): builds auth headers with optional GITHUB_TOKEN
- _fetch_directory_tree(): calls GitHub Tree API, groups files by directory,
  smart monorepo detection (packages/, libs/, apps/ grouped one level deeper)
- AnalyzeRepoRequest: Pydantic model with URL validation
- POST /repos/analyze: orchestrates metadata + tree fetch, returns directory
  list with suggestion flag for large repos (>500 files or >10 dirs)

Key design decisions:
- Route placed before /{repo_id} to avoid path conflict
- Monorepo-aware grouping: 'packages/effect' not just 'packages'
- No auth required (same as viewing a public GitHub repo)
- Reuses RepoValidator.CODE_EXTENSIONS and SKIP_DIRS for consistency
- Returns 'suggestion: large_repo' to trigger directory picker in frontend
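
The large-repo rule stated above (>500 files or >10 dirs) could look something like the sketch below. The constant and function names are hypothetical, and returning `None` for small repos is an assumption based on the "small repo no suggestion" test case, not confirmed behavior.

```python
from typing import Optional

# Thresholds taken from the commit message; the names are illustrative.
LARGE_REPO_FILE_THRESHOLD = 500
LARGE_REPO_DIR_THRESHOLD = 10


def suggest(total_files: int, directory_count: int) -> Optional[str]:
    """Return 'large_repo' when either threshold is exceeded, else None (assumed)."""
    if total_files > LARGE_REPO_FILE_THRESHOLD or directory_count > LARGE_REPO_DIR_THRESHOLD:
        return "large_repo"
    return None
```

A frontend would then open the directory picker only when the `suggestion` field is set.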

coderabbitai Bot commented Feb 28, 2026

📥 Commits

Reviewing files that changed from the base of the PR and between 2535cd8 and 1bd52b9.

📒 Files selected for processing (1)
  • backend/routes/repos.py
Walkthrough

Adds a new POST /repos/analyze endpoint and supporting logic to validate GitHub URLs, fetch repo metadata and the GitHub tree via API, group directories, count files, estimate functions, and return aggregated pre-clone analysis; includes comprehensive tests for validation and tree-processing.

Changes

| Cohort / File(s) | Summary |
|---|---|
| GitHub pre-clone analysis: backend/routes/repos.py | Added _GITHUB_API_BASE, _GITHUB_URL_RE, _github_headers. Implemented _fetch_directory_tree(owner, repo, branch) to call GitHub Tree API, group directories, count files, estimate functions, and flag large repos. Added AnalyzeRepoRequest model and analyze_repository POST /analyze endpoint returning metadata plus directory analysis. |
| Tests for analysis feature: backend/tests/test_analyze_repo.py | New tests covering URL regex and request validation, IndexConfig checks, mocked GitHub Tree API responses via helper _make_tree, directory grouping (including monorepos), skipping node_modules, file/function counts, large-repo suggestion, and async httpx client mocking. Check regex edge cases and grouping logic. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Backend
    participant GitHub
    Client->>Backend: POST /repos/analyze { url }
    activate Backend
    Backend->>Backend: Validate URL with _GITHUB_URL_RE
    alt invalid
        Backend-->>Client: 400 Validation Error
    else valid
        Backend->>GitHub: GET /repos/{owner}/{repo}
        activate GitHub
        GitHub-->>Backend: repo metadata (default_branch, size, stars, language)
        deactivate GitHub
        Backend->>GitHub: GET /repos/{owner}/{repo}/git/trees/{branch}?recursive=1
        activate GitHub
        GitHub-->>Backend: tree (paths, types)
        deactivate GitHub
        Backend->>Backend: Group directories, count files, estimate functions, set suggestions
        Backend-->>Client: 200 {metadata, directories, totals}
    end
    deactivate Backend

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I sniff the branches without a clone,

Counting files where quiet functions hone.
Metadata hums, directories align,
A pre-clone peek — discovery is mine! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a POST /repos/analyze endpoint for pre-clone GitHub repository analysis via the GitHub API, with the ticket reference OPE-109. |


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
backend/routes/repos.py (2)

287-304: Consider reusing a single AsyncClient for both API calls.

The endpoint creates two separate httpx.AsyncClient instances—one for the metadata fetch (line 287) and another inside _fetch_directory_tree (line 185). Connection reuse would be more efficient.

♻️ Proposed refactor to reuse client

Pass the client as an optional parameter to _fetch_directory_tree:

 async def _fetch_directory_tree(
     owner: str, repo: str, branch: str,
+    client: httpx.AsyncClient | None = None,
 ) -> dict:
     ...
-    async with httpx.AsyncClient(timeout=15.0) as client:
-        response = await client.get(url, headers=_github_headers())
+    if client is None:
+        async with httpx.AsyncClient(timeout=15.0) as client:
+            response = await client.get(url, headers=_github_headers())
+            # ... rest of processing
+    else:
+        response = await client.get(url, headers=_github_headers())

Then in analyze_repository:

-    async with httpx.AsyncClient(timeout=10.0) as client:
+    async with httpx.AsyncClient(timeout=15.0) as client:
         meta_resp = await client.get(...)
-
-    # Fetch directory tree
-    tree_data = await _fetch_directory_tree(owner, repo_name, default_branch)
+        # Fetch directory tree
+        tree_data = await _fetch_directory_tree(owner, repo_name, default_branch, client=client)
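
One way to get the optional-client behavior without repeating the processing code in both branches is `contextlib.AsyncExitStack`. The following is a minimal sketch under stated assumptions: `FakeClient` is a hypothetical stand-in for `httpx.AsyncClient` so the example runs offline, and `fetch_tree` and `analyze` are illustrative names rather than the project's functions.

```python
import asyncio
from contextlib import AsyncExitStack
from typing import Optional


class FakeClient:
    """Hypothetical stand-in for httpx.AsyncClient so the sketch runs offline."""

    opened = 0  # how many clients have been opened, for demonstration

    async def __aenter__(self) -> "FakeClient":
        FakeClient.opened += 1
        return self

    async def __aexit__(self, *exc_info) -> None:
        return None

    async def get(self, url: str) -> dict:
        return {"url": url}


async def fetch_tree(branch: str, client: Optional[FakeClient] = None) -> dict:
    async with AsyncExitStack() as stack:
        if client is None:
            # No shared client supplied: open one that closes when the stack exits.
            client = await stack.enter_async_context(FakeClient())
        response = await client.get(f"/git/trees/{branch}?recursive=1")
        # Processing is written once, regardless of who owns the client.
        return response


async def analyze() -> dict:
    # A single client serves both the metadata call and the tree call.
    async with FakeClient() as client:
        meta = await client.get("/repos/owner/repo")
        tree = await fetch_tree("main", client=client)
    return {"meta": meta, "tree": tree}
```

A caller-supplied client is never entered into the stack, so it is left open for its owner to close.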
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/repos.py` around lines 287 - 304, The metadata fetch opens an
httpx.AsyncClient and _fetch_directory_tree opens another—reuse one client
instead: add an optional client parameter to _fetch_directory_tree (e.g., def
_fetch_directory_tree(..., client: Optional[httpx.AsyncClient]=None) / async
def) and, if client is None, create and close a local AsyncClient otherwise use
the passed client; update the call in analyze_repository (where meta_resp is
fetched with httpx.AsyncClient) to pass that same client into
_fetch_directory_tree; ensure the passed client is used for requests, that
timeout/headers (from _github_headers()) are preserved, and that only callers
that don’t pass a client keep creating/closing their own AsyncClient.

269-275: Add rate limiting and caching to the /analyze endpoint to prevent abuse of GitHub API quota.

This endpoint requires no authentication and makes two GitHub API calls per request without caching. An attacker could repeatedly request the same repository, exhausting your GitHub API quota and causing the service to rate limit.

Consider implementing:

  • In-memory or Redis caching with 5-minute TTL (similar to the /validate-repo endpoint)
  • Request rate limiting per client IP using the existing @rate_limit decorator
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/repos.py` around lines 269 - 275, The /analyze endpoint
(analyze_repository handling AnalyzeRepoRequest) currently makes two
unauthenticated GitHub calls with no caching or per-IP throttling; add the same
caching and rate-limiting used by /validate-repo: apply the existing `@rate_limit`
decorator to analyze_repository and add a 5-minute TTL cache lookup/insert
(in-memory or Redis, consistent with the validate-repo implementation) keyed by
repository identifier (e.g., owner/name + ref/branch) so repeated requests
return cached results instead of calling GitHub; ensure cache hit returns the
same dict response and cache misses execute the current logic then store the
result with TTL.
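
As a rough illustration of the caching half of this suggestion, an in-memory TTL cache keyed by repository might look like the sketch below. `TTLCache` and `cache_key` are hypothetical names; the project's actual cache (the review mentions Redis as an option) is not shown here, and lowercasing the key is a design choice made for this example because GitHub owner and repo names are case-insensitive.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def cache_key(owner: str, repo: str) -> str:
    """Build a case-insensitive key for one repository (illustrative format)."""
    return f"analyze:{owner.lower()}/{repo.lower()}"
```

On a cache hit the handler would return the stored response directly; on a miss it would run the two GitHub calls and store the result.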

📥 Commits

Reviewing files that changed from the base of the PR and between d3ffafa and 7a7bdd5.

📒 Files selected for processing (1)
  • backend/routes/repos.py

Test coverage:
- URL regex: standard, trailing slash, http, rejects non-github/subpaths/no-repo
- AnalyzeRepoRequest: valid, whitespace strip, rejects empty/non-github
- IndexConfig: valid paths, slash normalization, backslash normalization,
  rejects empty/traversal/nested-traversal/non-string, allows None
- _fetch_directory_tree: flat repo grouping, monorepo package-level grouping,
  node_modules skipping, large_repo suggestion, small repo no suggestion

Also manually verified on Effect-TS/effect (1,767 files, 33 packages,
correct monorepo grouping at packages/* level).

23/23 pass in 2.8s.
…exing)

Indexing is function-level, not file-level. Tier limits are function-based
(2K free, 20K pro, 500K enterprise). But the analyze endpoint only returned
file counts -- users couldn't compare against their limits.

Now each directory entry includes estimated_functions (file_count * 25,
same multiplier RepoValidator uses for tier checks). Response also includes
total_estimated_functions for the whole repo.

Effect-TS example:
  packages/effect: 958 files, ~23,950 functions
  packages/schema: 203 files, ~5,075 functions
  Total: 1,767 files, ~44,175 functions

User on Pro tier (20K limit) can immediately see they need to pick a subset.

24 tests pass (1 new for function estimation).
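
The estimate described above can be sketched as follows. Only the multiplier (25) and the tier limits come from the commit message; `FUNCTIONS_PER_FILE`, `TIER_LIMITS`, and the helper names are illustrative.

```python
FUNCTIONS_PER_FILE = 25  # multiplier stated in the commit message

# Tier limits as listed in the commit message (functions, not files).
TIER_LIMITS = {"free": 2_000, "pro": 20_000, "enterprise": 500_000}


def estimate_functions(file_count: int) -> int:
    """Estimate the number of functions from a code file count."""
    return file_count * FUNCTIONS_PER_FILE


def fits_tier(file_count: int, tier: str) -> bool:
    """True when the estimated function count is within the tier's limit."""
    return estimate_functions(file_count) <= TIER_LIMITS[tier]
```

For the Effect-TS numbers quoted above, 203 files fit within the Pro limit while 958 or 1,767 files do not.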
1. _fetch_directory_tree now accepts optional client param -- analyze_repository
   opens one AsyncClient for both the metadata and tree API calls instead of two
2. Results cached for 5 min (same TTL as validate-repo) keyed by owner/repo,
   avoids redundant GitHub API calls on page refresh or retry
3. Cache uses existing Redis cache from dependencies (same as validate-repo)

24 tests pass.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
backend/routes/repos.py (1)

276-284: Use the compiled GitHub URL regex in request validation for consistent failures.

Current substring validation can pass malformed domains (e.g., containing github.com as a substring) and defer rejection to route-level parsing.

♻️ Suggested cleanup
     def validate_url(cls, v: str) -> str:
         v = v.strip().rstrip("/")
         if not v:
             raise ValueError("GitHub URL is required")
-        if "github.com" not in v.lower():
-            raise ValueError("Only GitHub URLs are supported")
+        if not _GITHUB_URL_RE.match(v):
+            raise ValueError("Invalid GitHub URL. Expected: https://github.com/owner/repo")
         return v
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/repos.py` around lines 276 - 284, The field validator
validate_url (decorated with `@field_validator`("github_url")) currently uses a
substring check which allows malformed domains; replace the substring check with
a match against the project’s compiled GitHub URL regex used by route-level
parsing (use the same compiled regex object rather than "github.com" in
v.lower()) and raise ValueError when the regex does not match after trimming and
rstrip("/"), so validation fails consistently at request parsing rather than
later in route handling.
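
To make the difference between the substring check and an anchored regex concrete, here is a small standalone sketch. The pattern below is an assumption: the project's actual `_GITHUB_URL_RE` is not shown on this page, and `validate_github_url` is a hypothetical plain function standing in for the Pydantic field validator.

```python
import re

# Assumed pattern; the real _GITHUB_URL_RE may differ.
GITHUB_URL_RE = re.compile(
    r"^https?://github\.com/(?P<owner>[A-Za-z0-9-]+)/(?P<repo>[A-Za-z0-9._-]+)/?$"
)


def validate_github_url(value: str) -> str:
    """Trim the input and require an exact https://github.com/owner/repo shape."""
    value = value.strip().rstrip("/")
    if not value:
        raise ValueError("GitHub URL is required")
    if not GITHUB_URL_RE.match(value):
        raise ValueError("Invalid GitHub URL. Expected: https://github.com/owner/repo")
    return value
```

Because the host is anchored between `://` and the next `/`, hosts that merely contain `github.com` as a substring no longer pass.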
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/routes/repos.py`:
- Around line 185-196: The GitHub API calls in _fetch_directory_tree() and
analyze_repository() currently only check status codes and can raise unhandled
httpx.RequestError (network/timeout) or ValueError (JSON parse); wrap the
httpx.AsyncClient GET/response.json() calls in try/except blocks that catch
httpx.RequestError and ValueError, log the exception, and raise HTTPException
with a clear status (e.g., 504 Gateway Timeout or 502 Bad Gateway) and
actionable detail message; ensure you reference the existing functions
_fetch_directory_tree() and analyze_repository() and keep existing status-code
handling for response.status_code != 200 inside the try so transport/parse
failures are handled gracefully.
- Line 183: The constructed URL for the GitHub tree API currently interpolates
branch directly into url (variable url built from _GITHUB_API_BASE and
owner/repo/branch), which breaks for branch names containing “/”; fix by
percent-encoding the branch path segment before interpolation (use
urllib.parse.quote with safe='' to ensure '/' becomes '%2F') and then rebuild
url using the encoded_branch variable so owner, repo, and encoded_branch are
concatenated into the API path.

In `@backend/tests/test_analyze_repo.py`:
- Around line 53-58: The tests currently catch any Exception which is too broad;
change the assertions to expect pydantic's ValidationError for model validation
failures from AnalyzeRepoRequest (and similarly for IndexConfig tests) by
importing ValidationError from pydantic and using pytest.raises(ValidationError)
instead of pytest.raises(Exception) so the tests only pass for the specific
validation errors raised by the `@field_validator` logic.

📥 Commits

Reviewing files that changed from the base of the PR and between 7a7bdd5 and 82ef4ef.

📒 Files selected for processing (2)
  • backend/routes/repos.py
  • backend/tests/test_analyze_repo.py

Directory structure of a repo barely changes (a new package is added maybe
once a month). A 24hr cache saves GitHub API calls and costs ~3-5KB per entry,
so 1000 cached repos take at most ~5MB of Redis. Effectively free.
…ation

4 findings verified and fixed:

1. Wrap GitHub API calls in try/except for httpx.TimeoutException,
   httpx.RequestError, and ValueError (JSON parse). Returns proper
   504/502 instead of unhandled crash.

2. URL-encode branch name with urllib.parse.quote(safe='') so
   branches like 'feature/foo' produce 'feature%2Ffoo' in the
   GitHub Tree API URL instead of breaking the path.

3. Tests now use pytest.raises(ValidationError) instead of broad
   Exception -- catches only Pydantic validation errors as intended.
   Added test_rejects_malformed_github_domain for 'fakegithub.com'.

4. AnalyzeRepoRequest validator uses _GITHUB_URL_RE regex instead of
   substring check -- rejects 'fakegithub.com' and 'notgithub.com.evil'.

25 tests pass.
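
Finding 2 above (branch encoding) can be illustrated in isolation. This is a sketch, not the project's code: `tree_url` is a hypothetical helper, and the value of `GITHUB_API_BASE` is assumed to be the public GitHub API host.

```python
from urllib.parse import quote

GITHUB_API_BASE = "https://api.github.com"  # assumed value of _GITHUB_API_BASE


def tree_url(owner: str, repo: str, branch: str) -> str:
    """Build the Tree API URL with the branch encoded as a single path segment."""
    # safe="" makes quote() encode "/" as well, so "feature/foo" stays one segment.
    encoded_branch = quote(branch, safe="")
    return f"{GITHUB_API_BASE}/repos/{owner}/{repo}/git/trees/{encoded_branch}?recursive=1"
```

Without `safe=""`, `quote` leaves `/` untouched by default, and a branch such as `feature/foo` would be split across two path segments.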
@DevanshuNEU

Collaborator Author

@CodeRabbit review

@coderabbitai

coderabbitai Bot commented Feb 28, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/routes/repos.py`:
- Around line 322-323: Update the misleading cache-duration comments that say “5
minutes” to correctly state the actual TTL used (_ANALYZE_CACHE_TTL) which is 24
hours; locate the comments near the _ANALYZE_CACHE_TTL definition and the
docstring/comment around the results caching (references: _ANALYZE_CACHE_TTL and
the function/class docstring that mentions results are cached) and change the
text from “5 minutes” to “24 hours” or refer to _ANALYZE_CACHE_TTL to avoid
hardcoding a stale duration.
- Around line 164-170: The /analyze handler currently uses the server
GITHUB_TOKEN via _github_headers() to call GitHub and immediately calls
_fetch_directory_tree() with repository metadata, which can leak private repo
info to unauthenticated callers; update the /analyze flow to detect whether the
incoming request is authenticated and if not, check metadata.get("private")
before calling _fetch_directory_tree() and return a 403 or an explicit error for
private repos; alternatively require a caller-supplied GitHub access token for
private repos and validate it before invoking _fetch_directory_tree(); ensure
the checks reference the /analyze request context, the metadata.get("private")
field, the _fetch_directory_tree() call, and the _github_headers() usage so
private repos are never fetched for unauthenticated requests.

📥 Commits

Reviewing files that changed from the base of the PR and between 82ef4ef and 2535cd8.

📒 Files selected for processing (2)
  • backend/routes/repos.py
  • backend/tests/test_analyze_repo.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/tests/test_analyze_repo.py

1. Security: reject private repos with 403 before fetching directory tree.
   Server GITHUB_TOKEN could access private repos via GitHub API, which
   would leak private repo structure to unauthenticated /analyze callers.
   Same check the playground validation already does.

2. Fix stale comments that said '5 minutes' when TTL was bumped to 24 hours.

25 tests pass.
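
The ordering that matters in fix 1 above is that the private check runs before any tree fetch. A minimal sketch, assuming a stand-in `Forbidden` exception in place of the framework's HTTP 403 error and hypothetical `ensure_public` and `analyze` helpers:

```python
from typing import Callable


class Forbidden(Exception):
    """Hypothetical stand-in for an HTTP 403 error raised by the web framework."""

    status_code = 403


def ensure_public(metadata: dict) -> None:
    """Reject repositories that GitHub reports as private."""
    if metadata.get("private"):
        raise Forbidden("Private repositories cannot be analyzed")


def analyze(metadata: dict, fetch_tree: Callable[[str], dict]) -> dict:
    # The guard runs first, so the tree is never fetched for a private repo.
    ensure_public(metadata)
    return fetch_tree(metadata["default_branch"])
```

Since the server token may be able to read private repositories, the guard relies on the `private` flag in the metadata response rather than on the tree call failing.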
@DevanshuNEU

Collaborator Author

@CodeRabbit review

@coderabbitai

coderabbitai Bot commented Feb 28, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@DevanshuNEU DevanshuNEU merged commit 6379d8f into OpenCodeIntel:main Feb 28, 2026
7 checks passed