fix: durable repo-state v0.1: lazy re-clone + stuck-job recovery (#311)#316
Conversation
…nt (OpenCodeIntel#311) Railway containers have ephemeral disk; a redeploy wipes ./repos while Pinecone and Supabase survive, so every file-reading op 404'd on existing repos. Reframe local_path as a cache hint (git remote is source of truth) and funnel all ~10 consumers through one async ensure_clone chokepoint that lazily re-clones on a cache miss. Chose lazy on-demand re-clone over eager re-clone-all at startup: eager risks the 300s Railway healthcheck timeout on a cold container (the surface behind silent prod outage F-011). Tradeoff: first access per repo post-redeploy pays the full clone cost; every access after is a sub-ms stat. Correctness over a one-time latency hit vs a permanent 404. Concurrency: in-process per-repo asyncio.Lock + clone to temp dir then atomic os.rename, so simultaneous/crashed clones never leave a half-populated dir. Clone runs via asyncio.to_thread to keep the event loop free. Failure surfaces as an actionable 503 (RepoCloneError), not an opaque 500. In analyze_impact, input validation runs BEFORE re-clone so a path-traversal attempt is rejected without triggering external work. Warm path unchanged.
) An indexer that dies mid-job (a redeploy guarantees this) left repositories.status='indexing' forever, and try_set_indexing then permanently blocked retry. Add indexing_started_at (migration 003) stamped on every transition to 'indexing', plus a startup sweep (resets all orphaned 'indexing' rows to 'error') and steal-on-retry (re-claims rows older than 30 min or with a null stamp) via an atomic PostgREST .or_ CAS that cannot double-claim. Migrations are applied manually (no runner), so both write paths degrade gracefully if 003 is not yet applied: update_repository_status retries without the column, try_set_indexing falls back to the original CAS. A wrong deploy order no longer 500s every index; it only loses stuck-job recovery until the migration runs.
ensure_clone warm no-op, cold re-clone, concurrent-callers-clone-once (lock + atomic rename), actionable-503 on missing-url and failed-clone. Stuck-job: started_at stamping, steal-on-retry filter incl. legacy null, fresh-job-not-stolen, startup sweep, graceful fallback when migration 003 is absent. Integration: /dependencies route actually invokes ensure_clone.
|
@DevanshuNEU is attempting to deploy a commit to the Dev's projects Team on Vercel. A member of the Team first needs to authorize it. |
|
Warning Review limit reached
More reviews will be available in 11 minutes and 33 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR implements durable repository state v0.1, enabling the system to survive container redeploys and recover from stuck indexing jobs. It adds lazy re-cloning from git_url when a repository's local working tree is missing, tracking when indexing begins via timestamps, and resetting orphaned indexing jobs at startup. ChangesDurable repository state with lazy re-clone and stuck job recovery
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
backend/services/repo_manager.py (1)
212-215: ⚡ Quick winConsider explicit error handling instead of
ignore_errors=True.Line 214 uses
shutil.rmtree(canonical, ignore_errors=True)which silently swallows all errors including permission issues or I/O failures. This could mask real problems.♻️ Proposed refactor
git.Repo.clone_from(git_url, tmp, branch=branch, depth=1) # Clear any leftover partial dir before the atomic swap. if canonical.exists(): - shutil.rmtree(canonical, ignore_errors=True) + shutil.rmtree(canonical) os.rename(tmp, canonical) # atomic on the same filesystemThis way, if removal fails due to a real issue (e.g., permission denied), the exception will be caught by the outer try/except at line 216 and properly logged/surfaced as RepoCloneError.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/services/repo_manager.py` around lines 212 - 215, The current call shutil.rmtree(canonical, ignore_errors=True) swallows important filesystem errors; change it to explicitly handle errors by removing ignore_errors=True and either let the OSError propagate or catch OSError/Exception and raise RepoCloneError (the repo clone error type used in this module) with the original exception chained; update the atomic-swap block around canonical/tmp and ensure the exception is not silently ignored so the outer try/except (or callers expecting RepoCloneError) can log/surface the real failure.backend/services/supabase_service.py (1)
123-129: ⚡ Quick winConsider adding a safety comment about filter string construction.
Line 128 uses f-string interpolation to build the PostgREST filter expression. While safe here (cutoff is server-generated from
datetime.utcnow()), this pattern could be risky if adapted for user input elsewhere.📝 Suggested comment
cutoff = (datetime.utcnow() - timedelta(minutes=STUCK_INDEXING_THRESHOLD_MINUTES)).isoformat() try: + # Note: cutoff is server-generated timestamp, not user input. String interpolation + # is required by PostgREST .or_ filter syntax; never use with untrusted data. result = self.client.table("repositories").update(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/services/supabase_service.py` around lines 123 - 129, The PostgREST filter string built in the call to self.client.table("repositories").update(...).eq("id", repo_id).or_(...) uses an f-string (cutoff) to interpolate values into the filter expression; add a clear safety comment immediately above this block (around the cutoff computation and the or_ call) noting that the current cutoff is server-controlled (datetime.utcnow()) and explaining that any future use with user-supplied values must be parameterized or properly escaped to avoid injection risks—also mention STUCK_INDEXING_THRESHOLD_MINUTES so reviewers know the time-based origin of the value and to prefer client/SDK-parameterized filters if variable input is introduced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/services/repo_manager.py`:
- Line 202: The current _clone_into_place method uses shutil.rmtree(canonical,
ignore_errors=True) which can silently swallow real errors; change this to
explicitly check for existence (e.g., canonical.exists() or os.path.exists) and
remove the directory only when present, and wrap the removal in a try/except
that logs and re-raises or handles OSError so failures aren’t ignored. Locate
the call to shutil.rmtree in _clone_into_place and replace the ignore_errors
behavior with an existence check and explicit error handling that records the
exception (via the service logger) before failing or proceeding.
In `@backend/services/supabase_service.py`:
- Around line 98-111: The current exception handler wrongly tests membership of
"indexing_started_at" in updates (which will always be true when status ==
"indexing"), so the fallback retry is never executed; change the handler in the
method performing the repository status update (the block around
self.client.table("repositories").update(...).execute()) to detect a
missing-column migration failure by inspecting the exception (e.g., check str(e)
for "indexing_started_at" or a DB-specific error code/message) and only then
perform the retry with self.client.table("repositories").update({"status":
status}).eq("id", repo_id).execute(); otherwise re-raise the exception. Ensure
you reference the updates variable, the retry update({"status": status}) call,
and migration 003 in the new logic.
---
Nitpick comments:
In `@backend/services/repo_manager.py`:
- Around line 212-215: The current call shutil.rmtree(canonical,
ignore_errors=True) swallows important filesystem errors; change it to
explicitly handle errors by removing ignore_errors=True and either let the
OSError propagate or catch OSError/Exception and raise RepoCloneError (the repo
clone error type used in this module) with the original exception chained;
update the atomic-swap block around canonical/tmp and ensure the exception is
not silently ignored so the outer try/except (or callers expecting
RepoCloneError) can log/surface the real failure.
In `@backend/services/supabase_service.py`:
- Around line 123-129: The PostgREST filter string built in the call to
self.client.table("repositories").update(...).eq("id", repo_id).or_(...) uses an
f-string (cutoff) to interpolate values into the filter expression; add a clear
safety comment immediately above this block (around the cutoff computation and
the or_ call) noting that the current cutoff is server-controlled
(datetime.utcnow()) and explaining that any future use with user-supplied values
must be parameterized or properly escaped to avoid injection risks—also mention
STUCK_INDEXING_THRESHOLD_MINUTES so reviewers know the time-based origin of the
value and to prefer client/SDK-parameterized filters if variable input is
introduced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 963331dd-d264-4d5e-a03f-58b0796216e5
📒 Files selected for processing (7)
backend/main.pybackend/migrations/003_add_indexing_started_at.sqlbackend/routes/analysis.pybackend/routes/repos.pybackend/services/repo_manager.pybackend/services/supabase_service.pybackend/tests/test_durable_repo_state.py
_clone_into_place: drop ignore_errors on the canonical-dir cleanup so a failed removal surfaces (ensure_clone wraps it into a logged RepoCloneError) instead of being swallowed and causing a confusing rename failure later. update_repository_status / try_set_indexing_status: gate the missing-column migration fallback on the error actually naming indexing_started_at, so an unrelated failure (network, constraint) re-raises instead of being masked by the retry. CodeRabbit's 'fallback never runs' reading was incorrect (the not-in check does execute the retry), but narrowing the catch to the migration case is the right hardening. Add an injection-safety note above the PostgREST .or_ filter: cutoff is server-controlled (utcnow + STUCK_INDEXING_THRESHOLD_MINUTES); any future user-supplied filter value must be SDK-parameterized. Tests: unrelated errors now re-raise rather than hit the fallback.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Railway assigns a dynamic $PORT and runs its healthcheck against it. The Dockerfile hardcoded --port 8000, so every deploy after healthcheckPath was added to railway.json failed with "service unavailable" on /health. #316 and #318 have both been stuck for days; prod only survived on the pre-healthcheck #293 image until that replica was knocked out. Bind ${PORT:-8000} (fallback for local/compose) and make the internal healthcheck read the same port.
Summary
Repos stored on Railway's ephemeral disk vanished on every redeploy while Supabase still marked them
indexed, so all file-based ops (graph, DNA, style, search context) 404'd until a manual re-index. This makeslocal_patha cache hint and lazily re-clones from the git remote on demand, and recovers indexing jobs left stuck inindexingwhen their process dies.Closes / implements
oci/decisions/2026-06-08-durable-repo-state-v0.1.mdfix/durable-repo-state-v0.1What changed
Backend (
backend/...):services/repo_manager.py- newensure_clone(repo)chokepoint: warm hit is a sub-ms stat; cache miss re-clones fromgit_url@branchviaasyncio.to_thread, guarded by an in-process per-repo lock + clone-to-temp-then-atomic-rename. NewRepoCloneError(503) for unrecoverable misses.routes/analysis.py- graph/impact/insights/style/DNA route throughensure_clonejust-in-time (after the cache check; impact validates input first, then re-clones).routes/repos.py- directory listing, sync index, async index, and the legacy WebSocket index path all re-hydrate viaensure_clone.services/supabase_service.py- stampindexing_started_aton every transition toindexing;try_set_indexing_statusre-claims stale/legacy rows via an atomic PostgREST.or_CAS; newreset_stuck_indexing_jobs(). Both write paths degrade gracefully if migration 003 is not yet applied.main.py- startup sweep resets orphanedindexingrows on boot.migrations/003_add_indexing_started_at.sql- addsindexing_started_at TIMESTAMPTZ+ partial index.Tests:
tests/test_durable_repo_state.py- 13 tests: ensure_clone warm/cold/concurrency/503, stuck-job stamping/steal/sweep/fallback, and a route-wiring integration test.ADR adherence
indexing_started_atcolumn; clone is ephemeral cache)local_pathreframed as cache hint, git remote authoritative)Path.exists()stat (sub-ms); cold path is bounded bygit clone --depth 1time and runs off the event loop. No p50/p95 harness exists yet (tracked in feat: surface real search observability (p50/p95/p99 latency, cost/query, cache-hit-rate) #313); no new hot-path latency added.Pipeline
/oci-pipelineran clean - BLOCKING: 0; MAJOR: 2 (both fixed - route integration test added; graceful migration fallback added)/oci-defend7/7 passed (run after CI green)cd backend && pytest tests/- 405 passedHow to test
rm -rf backend/repos/{repo_id}(wipe the working tree, leave Supabase/Pinecone intact).GET /api/v1/repos/{repo_id}/dependencies(or DNA/style/index).status='indexing'in Supabase, restart the backend.Expected: step 3 re-clones from
git_urland returns 200 with no manual re-index (first call slower, subsequent calls fast); step 4 - the startup sweep resets the row toerrorand the user can re-trigger indexing.Deployment notes
backend/migrations/003_add_indexing_started_at.sqlin the Supabase SQL Editor before deploying this code. The code degrades gracefully if it's missing (no 500s), but stuck-job recovery stays off until the column exists.mcppackage version change/api/v1/*contracts - status changes (500 to 503 on unrecoverable miss; 404 to re-clone on/directories) are handled generically by the MCP_safe_error_message; no URL/shape/field/auth change. Cross-stack contract check passed.Tradeoff annotation (Paired-Ship Protocol)
repos_reclonedmetric on re-clone; WARN on sweep), and the single chokepoint is the seam for the deferred pro-tier (persistent volume) and private-repo support (feat: durable private-repo re-clone (authenticated clone + token-expiry handling) #315) without touching call sites again.Risk and rollback
ensure_clonewere wrong, repo file-ops would 503 (clear, actionable) rather than silently serve stale or 500. Stuck-job sweep only ever touches rows already inindexing.Summary by CodeRabbit
Release Notes
Bug Fixes
New Features
Tests