Skip to content

fix: durable repo-state v0.1: lazy re-clone + stuck-job recovery (#311)#316

Merged
DevanshuNEU merged 5 commits into
OpenCodeIntel:mainfrom
DevanshuNEU:fix/durable-repo-state-v0.1
Jun 8, 2026
Merged

fix: durable repo-state v0.1: lazy re-clone + stuck-job recovery (#311)#316
DevanshuNEU merged 5 commits into
OpenCodeIntel:mainfrom
DevanshuNEU:fix/durable-repo-state-v0.1

Conversation

@DevanshuNEU

@DevanshuNEU DevanshuNEU commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Summary

Repos stored on Railway's ephemeral disk vanished on every redeploy while Supabase still marked them indexed, so all file-based ops (graph, DNA, style, search context) 404'd until a manual re-index. This makes local_path a cache hint and lazily re-clones from the git remote on demand, and recovers indexing jobs left stuck in indexing when their process dies.

Closes / implements

What changed

Backend (backend/...):

  • services/repo_manager.py - new ensure_clone(repo) chokepoint: warm hit is a sub-ms stat; cache miss re-clones from git_url@branch via asyncio.to_thread, guarded by an in-process per-repo lock + clone-to-temp-then-atomic-rename. New RepoCloneError (503) for unrecoverable misses.
  • routes/analysis.py - graph/impact/insights/style/DNA route through ensure_clone just-in-time (after the cache check; impact validates input first, then re-clones).
  • routes/repos.py - directory listing, sync index, async index, and the legacy WebSocket index path all re-hydrate via ensure_clone.
  • services/supabase_service.py - stamp indexing_started_at on every transition to indexing; try_set_indexing_status re-claims stale/legacy rows via an atomic PostgREST .or_ CAS; new reset_stuck_indexing_jobs(). Both write paths degrade gracefully if migration 003 is not yet applied.
  • main.py - startup sweep resets orphaned indexing rows on boot.
  • migrations/003_add_indexing_started_at.sql - adds indexing_started_at TIMESTAMPTZ + partial index.

Tests:

  • tests/test_durable_repo_state.py - 13 tests: ensure_clone warm/cold/concurrency/503, stuck-job stamping/steal/sweep/fallback, and a route-wiring integration test.

ADR adherence

Pipeline

  • /oci-pipeline ran clean - BLOCKING: 0; MAJOR: 2 (both fixed - route integration test added; graceful migration fallback added)
  • CodeRabbit comments addressed (pending PR)
  • /oci-defend 7/7 passed (run after CI green)
  • All stack tests pass:
    • Backend: cd backend && pytest tests/ - 405 passed
    • Frontend: n/a - no frontend changes
    • MCP: n/a - no mcp-server changes

How to test

  1. Index a public repo, confirm graph/DNA work.
  2. Simulate a redeploy: rm -rf backend/repos/{repo_id} (wipe the working tree, leave Supabase/Pinecone intact).
  3. Call GET /api/v1/repos/{repo_id}/dependencies (or DNA/style/index).
  4. Stuck job: set a repo's status='indexing' in Supabase, restart the backend.

Expected: step 3 re-clones from git_url and returns 200 with no manual re-index (first call slower, subsequent calls fast); step 4 - the startup sweep resets the row to error and the user can re-trigger indexing.

Deployment notes

  • No new env vars required
  • DB migration REQUIRED - run backend/migrations/003_add_indexing_started_at.sql in the Supabase SQL Editor before deploying this code. The code degrades gracefully if it's missing (no 500s), but stuck-job recovery stays off until the column exists.
  • No off-limits files modified
  • No mcp package version change
  • No breaking change to /api/v1/* contracts - status changes (500 to 503 on unrecoverable miss; 404 to re-clone on /directories) are handled generically by the MCP _safe_error_message; no URL/shape/field/auth change. Cross-stack contract check passed.

Tradeoff annotation (Paired-Ship Protocol)

  • We chose lazy on-demand re-clone over eager re-clone-all at startup because eager risks the 300s Railway healthcheck timeout on a cold container with many/large repos - the exact surface behind a prior silent prod outage (F-011) - and clones repos nobody queries this deploy.
  • Latency tradeoff: the first access per repo after a redeploy pays the full clone cost (seconds to minutes); every access after is a sub-ms stat. We accept one slow request over a permanent 404.
  • Consistency: strengthened (no permanent stuck-indexing state; clone re-hydrates from the authoritative remote).
  • Operational: easier to debug (INFO log + repos_recloned metric on re-clone; WARN on sweep), and the single chokepoint is the seam for the deferred pro-tier (persistent volume) and private-repo support (feat: durable private-repo re-clone (authenticated clone + token-expiry handling) #315) without touching call sites again.

Risk and rollback

  • Symptom: if ensure_clone were wrong, repo file-ops would 503 (clear, actionable) rather than silently serve stale or 500. Stuck-job sweep only ever touches rows already in indexing.
  • Blast radius: the chokepoint fronts all file-based ops, but is purely additive on the warm path (existing behavior unchanged when the clone is present).
  • Rollback: revert this PR. No data migration to undo (the new column is additive and nullable; leaving it is harmless).
  • Time-to-rollback: < 5 min (revert + redeploy).

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed recovery of stuck indexing jobs on system startup
    • Improved repository cloning with atomic operations and proper cleanup on failure
  • New Features

    • Added automatic repository re-cloning when working tree is unavailable (e.g., after container redeployment)
    • Enhanced error handling for clone failures with actionable error messages
    • Prevents concurrent clone attempts for the same repository
  • Tests

    • Added comprehensive test coverage for repository state durability and concurrent operations

…nt (OpenCodeIntel#311)

Railway containers have ephemeral disk; a redeploy wipes ./repos while Pinecone and Supabase survive, so every file-reading op 404'd on existing repos. Reframe local_path as a cache hint (git remote is source of truth) and funnel all ~10 consumers through one async ensure_clone chokepoint that lazily re-clones on a cache miss.

Chose lazy on-demand re-clone over eager re-clone-all at startup: eager risks the 300s Railway healthcheck timeout on a cold container (the surface behind silent prod outage F-011). Tradeoff: first access per repo post-redeploy pays the full clone cost; every access after is a sub-ms stat. Correctness over a one-time latency hit vs a permanent 404.

Concurrency: in-process per-repo asyncio.Lock + clone to temp dir then atomic os.rename, so simultaneous/crashed clones never leave a half-populated dir. Clone runs via asyncio.to_thread to keep the event loop free. Failure surfaces as an actionable 503 (RepoCloneError), not an opaque 500. In analyze_impact, input validation runs BEFORE re-clone so a path-traversal attempt is rejected without triggering external work. Warm path unchanged.
)

An indexer that dies mid-job (a redeploy guarantees this) left repositories.status='indexing' forever, and try_set_indexing then permanently blocked retry. Add indexing_started_at (migration 003) stamped on every transition to 'indexing', plus a startup sweep (resets all orphaned 'indexing' rows to 'error') and steal-on-retry (re-claims rows older than 30 min or with a null stamp) via an atomic PostgREST .or_ CAS that cannot double-claim.

Migrations are applied manually (no runner), so both write paths degrade gracefully if 003 is not yet applied: update_repository_status retries without the column, try_set_indexing falls back to the original CAS. A wrong deploy order no longer 500s every index; it only loses stuck-job recovery until the migration runs.
ensure_clone warm no-op, cold re-clone, concurrent-callers-clone-once (lock + atomic rename), actionable-503 on missing-url and failed-clone. Stuck-job: started_at stamping, steal-on-retry filter incl. legacy null, fresh-job-not-stolen, startup sweep, graceful fallback when migration 003 is absent. Integration: /dependencies route actually invokes ensure_clone.
@vercel

vercel Bot commented Jun 8, 2026

Copy link
Copy Markdown

@DevanshuNEU is attempting to deploy a commit to the Dev's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@DevanshuNEU, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 11 minutes and 33 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f0a293fa-f71a-48c7-a383-c5a915b4b128

📥 Commits

Reviewing files that changed from the base of the PR and between 97b549b and 0694ada.

📒 Files selected for processing (3)
  • backend/services/repo_manager.py
  • backend/services/supabase_service.py
  • backend/tests/test_durable_repo_state.py
📝 Walkthrough

Walkthrough

This PR implements durable repository state v0.1, enabling the system to survive container redeploys and recover from stuck indexing jobs. It adds lazy re-cloning from git_url when a repository's local working tree is missing, tracking when indexing begins via timestamps, and resetting orphaned indexing jobs at startup.

Changes

Durable repository state with lazy re-clone and stuck job recovery

Layer / File(s) Summary
Database schema and orphan detection
backend/migrations/003_add_indexing_started_at.sql, backend/services/supabase_service.py
Migration adds indexing_started_at TIMESTAMPTZ column and partial index on repositories table filtered to status = 'indexing'. Service layer defines 30-minute orphan detection threshold constant.
Repository clone management and error handling
backend/services/repo_manager.py
Introduces RepoCloneError (503 HTTPException) for re-clone failures. RepositoryManager gains per-repository asyncio locks to serialize concurrent clone attempts for the same repo. New ensure_clone method checks for existing .git directory (cache hit) or re-clones from git_url/branch by offloading to worker thread with atomic temp-dir + rename into canonical path.
Indexing status lifecycle and stuck job recovery
backend/services/supabase_service.py, backend/main.py
Status transitions now stamp indexing_started_at when entering indexing state, with graceful fallback if column is missing. try_set_indexing_status becomes orphan-aware atomic claiming: succeeds when repo is not actively indexing or timestamp exceeds threshold, falling back to original compare-and-set on advanced-query failure. New reset_stuck_indexing_jobs helper resets orphaned indexing rows to error at startup, logging warnings.
Route integration of lazy re-clone
backend/routes/analysis.py, backend/routes/repos.py
Analysis endpoints (dependencies, impact, insights, style-analysis, dna) and indexing operations (get_repo_directories, index_repository, index_repository_async, websocket_index) call ensure_clone before accessing repo files. Impact endpoint validates file_path before clone to avoid expensive operations on invalid inputs. WebSocket handler logs errors and closes with clone-unavailable code on re-clone failure.
Test coverage for durable repo state
backend/tests/test_durable_repo_state.py
Tests validate ensure_clone warm/cold paths, 503 error handling for missing/invalid git_url and clone failures, concurrency safety (single clone across simultaneous callers), timestamp stamping for indexing transitions, orphan detection via null/stale timestamps, steal/retry claiming logic, failure when fresh job owns row, bulk reset of stuck jobs with count, fallback when timestamp column is missing, and route wiring (dependencies endpoint awaits ensure_clone).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Poem

🐰 A rabbit hopped through the deployed box,
Found repos wiped by redeploy's locks,
"Fear not!" it cried with git-clone might,
Lazy re-clones keep indexing right,
Stuck jobs reset, orphans take flight! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'fix: durable repo-state v0.1: lazy re-clone + stuck-job recovery (#311)' directly and clearly summarizes the main changes: implementing lazy re-clone and stuck-job recovery mechanisms.
Linked Issues check ✅ Passed All acceptance criteria from issue #311 are met: lazy re-clone on missing repos with concurrency safety via asyncio locks, stuck-job recovery via indexing_started_at stamping and reset logic, no regressions, and comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #311: re-clone logic, stuck-job recovery, database migration, endpoint integration, and tests—no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
backend/services/repo_manager.py (1)

212-215: ⚡ Quick win

Consider explicit error handling instead of ignore_errors=True.

Line 214 uses shutil.rmtree(canonical, ignore_errors=True) which silently swallows all errors including permission issues or I/O failures. This could mask real problems.

♻️ Proposed refactor
             git.Repo.clone_from(git_url, tmp, branch=branch, depth=1)
             # Clear any leftover partial dir before the atomic swap.
             if canonical.exists():
-                shutil.rmtree(canonical, ignore_errors=True)
+                shutil.rmtree(canonical)
             os.rename(tmp, canonical)  # atomic on the same filesystem

This way, if removal fails due to a real issue (e.g., permission denied), the exception will be caught by the outer try/except at line 216 and properly logged/surfaced as RepoCloneError.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/services/repo_manager.py` around lines 212 - 215, The current call
shutil.rmtree(canonical, ignore_errors=True) swallows important filesystem
errors; change it to explicitly handle errors by removing ignore_errors=True and
either let the OSError propagate or catch OSError/Exception and raise
RepoCloneError (the repo clone error type used in this module) with the original
exception chained; update the atomic-swap block around canonical/tmp and ensure
the exception is not silently ignored so the outer try/except (or callers
expecting RepoCloneError) can log/surface the real failure.
backend/services/supabase_service.py (1)

123-129: ⚡ Quick win

Consider adding a safety comment about filter string construction.

Line 128 uses f-string interpolation to build the PostgREST filter expression. While safe here (cutoff is server-generated from datetime.utcnow()), this pattern could be risky if adapted for user input elsewhere.

📝 Suggested comment
         cutoff = (datetime.utcnow() - timedelta(minutes=STUCK_INDEXING_THRESHOLD_MINUTES)).isoformat()
         try:
+            # Note: cutoff is server-generated timestamp, not user input. String interpolation
+            # is required by PostgREST .or_ filter syntax; never use with untrusted data.
             result = self.client.table("repositories").update(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/services/supabase_service.py` around lines 123 - 129, The PostgREST
filter string built in the call to
self.client.table("repositories").update(...).eq("id", repo_id).or_(...) uses an
f-string (cutoff) to interpolate values into the filter expression; add a clear
safety comment immediately above this block (around the cutoff computation and
the or_ call) noting that the current cutoff is server-controlled
(datetime.utcnow()) and explaining that any future use with user-supplied values
must be parameterized or properly escaped to avoid injection risks—also mention
STUCK_INDEXING_THRESHOLD_MINUTES so reviewers know the time-based origin of the
value and to prefer client/SDK-parameterized filters if variable input is
introduced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/services/repo_manager.py`:
- Line 202: The current _clone_into_place method uses shutil.rmtree(canonical,
ignore_errors=True) which can silently swallow real errors; change this to
explicitly check for existence (e.g., canonical.exists() or os.path.exists) and
remove the directory only when present, and wrap the removal in a try/except
that logs and re-raises or handles OSError so failures aren’t ignored. Locate
the call to shutil.rmtree in _clone_into_place and replace the ignore_errors
behavior with an existence check and explicit error handling that records the
exception (via the service logger) before failing or proceeding.

In `@backend/services/supabase_service.py`:
- Around line 98-111: The current exception handler wrongly tests membership of
"indexing_started_at" in updates (which will always be true when status ==
"indexing"), so the fallback retry is never executed; change the handler in the
method performing the repository status update (the block around
self.client.table("repositories").update(...).execute()) to detect a
missing-column migration failure by inspecting the exception (e.g., check str(e)
for "indexing_started_at" or a DB-specific error code/message) and only then
perform the retry with self.client.table("repositories").update({"status":
status}).eq("id", repo_id).execute(); otherwise re-raise the exception. Ensure
you reference the updates variable, the retry update({"status": status}) call,
and migration 003 in the new logic.

---

Nitpick comments:
In `@backend/services/repo_manager.py`:
- Around line 212-215: The current call shutil.rmtree(canonical,
ignore_errors=True) swallows important filesystem errors; change it to
explicitly handle errors by removing ignore_errors=True and either let the
OSError propagate or catch OSError/Exception and raise RepoCloneError (the repo
clone error type used in this module) with the original exception chained;
update the atomic-swap block around canonical/tmp and ensure the exception is
not silently ignored so the outer try/except (or callers expecting
RepoCloneError) can log/surface the real failure.

In `@backend/services/supabase_service.py`:
- Around line 123-129: The PostgREST filter string built in the call to
self.client.table("repositories").update(...).eq("id", repo_id).or_(...) uses an
f-string (cutoff) to interpolate values into the filter expression; add a clear
safety comment immediately above this block (around the cutoff computation and
the or_ call) noting that the current cutoff is server-controlled
(datetime.utcnow()) and explaining that any future use with user-supplied values
must be parameterized or properly escaped to avoid injection risks—also mention
STUCK_INDEXING_THRESHOLD_MINUTES so reviewers know the time-based origin of the
value and to prefer client/SDK-parameterized filters if variable input is
introduced.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 963331dd-d264-4d5e-a03f-58b0796216e5

📥 Commits

Reviewing files that changed from the base of the PR and between 7a29b89 and 97b549b.

📒 Files selected for processing (7)
  • backend/main.py
  • backend/migrations/003_add_indexing_started_at.sql
  • backend/routes/analysis.py
  • backend/routes/repos.py
  • backend/services/repo_manager.py
  • backend/services/supabase_service.py
  • backend/tests/test_durable_repo_state.py

Comment thread backend/services/repo_manager.py
Comment thread backend/services/supabase_service.py
_clone_into_place: drop ignore_errors on the canonical-dir cleanup so a failed removal surfaces (ensure_clone wraps it into a logged RepoCloneError) instead of being swallowed and causing a confusing rename failure later.

update_repository_status / try_set_indexing_status: gate the missing-column migration fallback on the error actually naming indexing_started_at, so an unrelated failure (network, constraint) re-raises instead of being masked by the retry. CodeRabbit's 'fallback never runs' reading was incorrect (the not-in check does execute the retry), but narrowing the catch to the migration case is the right hardening.

Add an injection-safety note above the PostgREST .or_ filter: cutoff is server-controlled (utcnow + STUCK_INDEXING_THRESHOLD_MINUTES); any future user-supplied filter value must be SDK-parameterized. Tests: unrelated errors now re-raise rather than hit the fallback.
@vercel

vercel Bot commented Jun 8, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
opencodeintel Ready Ready Preview, Comment Jun 8, 2026 8:02pm

@DevanshuNEU DevanshuNEU merged commit ca451f0 into OpenCodeIntel:main Jun 8, 2026
8 checks passed
DevanshuNEU added a commit that referenced this pull request Jun 11, 2026
Railway assigns a dynamic $PORT and runs its healthcheck against it. The
Dockerfile hardcoded --port 8000, so every deploy after healthcheckPath was
added to railway.json failed with "service unavailable" on /health. #316 and
#318 have both been stuck for days; prod only survived on the pre-healthcheck
#293 image until that replica was knocked out. Bind ${PORT:-8000} (fallback for
local/compose) and make the internal healthcheck read the same port.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[OCI] Durable repo-state v0.1: survive Railway redeploy (lazy re-clone) + recover stuck indexing jobs

1 participant