[OCI] Durable repo-state v0.1: survive Railway redeploy (lazy re-clone) + recover stuck indexing jobs #311

Description

@DevanshuNEU

Problem

Cloned repos are stored on the Railway container's ephemeral disk (`repo_manager.py:17`, `Path("./repos")`) and treated as the source of truth. `railway.json` mounts no volume, so every redeploy or restart wipes the disk, while Supabase still holds `local_path` rows pointing at paths that no longer exist. There is no re-clone path anywhere: `_sync_existing_repos` does NOT re-clone (it only registers directories already on disk). As a result, after a redeploy every clone-dependent operation (index, re-index, analyze, DNA, dependencies, context) silently 404s or returns empty results for all existing repos.

Separately, if the indexer crashes or the service redeploys mid-job, the repo is left in `status=indexing` forever, and `try_set_indexing` then blocks the user from retrying. No reaper exists.

Verified against code 2026-06-01. See DevVault audit: `oci/architecture/improvement-roadmap.md` (#6, #7), `failure-modes.md`, `consistency-model.md`.

Scope (v0.1)

IN (cloud-agnostic, correct regardless of how LOCAL_MODE #305 resolves):

  • Lazy re-clone from `git_url` when the local clone is missing. `local_path` becomes a cache hint, not a source of truth. Any analysis op checks for the clone and re-clones from the stored `git_url`/`branch` if absent.
  • Stuck-indexing-job recovery: add `indexing_started_at`; a startup/background reaper resets jobs older than a threshold from `indexing` to `error` (or `cloned`) so the user can retry.
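A minimal sketch of the reaper described in the second bullet, for discussion only. It uses an in-process `sqlite3` table as a stand-in for the real store; the table name `repositories`, the 30-minute threshold, and the function name `reap_stuck_indexing` are assumptions, not existing code. Only `status`, `indexing`, `error`, and `indexing_started_at` come from this issue.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Assumed threshold; the issue only says "older than a threshold".
STUCK_AFTER = timedelta(minutes=30)


def reap_stuck_indexing(conn, now=None, stuck_after=STUCK_AFTER, reset_to="error"):
    """Reset jobs stuck in 'indexing' past the threshold; return how many were reset.

    Timestamps are assumed to be stored as UTC ISO-8601 strings with second
    precision, so plain string comparison orders them correctly.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - stuck_after).isoformat(timespec="seconds")
    # A single conditional UPDATE keeps the check and the reset atomic.
    # Rows whose indexing_started_at is NULL are not matched by the comparison.
    cur = conn.execute(
        "UPDATE repositories "
        "SET status = ?, indexing_started_at = NULL "
        "WHERE status = 'indexing' AND indexing_started_at < ?",
        (reset_to, cutoff),
    )
    conn.commit()
    return cur.rowcount
```

The same function could run once at startup and then periodically in the background. Passing `reset_to="cloned"` covers the alternative target state mentioned above. Rows that predate the new column would need a backfill or a separate rule, since a NULL `indexing_started_at` never matches.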

OUT (deferred until after the LOCAL_MODE #305 decision, since it is cloud-tier-specific):

  • Railway persistent volume vs object store (S3/R2) as the durable repo store. The lazy-re-clone model above makes prod correct without committing to this infra choice now.

Acceptance criteria

  • After a simulated redeploy (delete `./repos/{id}`), the next analysis/search/index op for that repo re-clones from `git_url` and succeeds, with no user-visible 404.
  • Re-clone is idempotent and concurrency-safe (two simultaneous ops on the same repo do not double-clone or corrupt the dir).
  • An indexing job whose process died (leaving `status=indexing`) is reset to a retryable state once its age exceeds the reaper threshold; the user can then re-trigger indexing.
  • No regression to cloud bring-up; existing indexed repos keep working.
  • ADR in DevVault `oci/decisions/` governs this work; code reviewed against it.
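One way the first two criteria (re-clone after a simulated redeploy, idempotent and concurrency-safe) might be met is sketched below. This is an illustration under stated assumptions, not the existing `repo_manager.py` code: `ensure_clone`, `_git_clone`, and the per-repo lock table are hypothetical names, and the lock only serializes callers inside one process. A multi-worker deployment would need a cross-process lock instead.

```python
import shutil
import subprocess
import threading
from pathlib import Path

REPOS_ROOT = Path("./repos")  # the path cited from repo_manager.py

# Hypothetical per-repo locks so two ops on the same repo do not double-clone.
_locks = {}
_locks_guard = threading.Lock()


def _lock_for(repo_id):
    with _locks_guard:
        return _locks.setdefault(repo_id, threading.Lock())


def _git_clone(git_url, branch, dest):
    # Plain `git clone` of one branch; raises CalledProcessError on failure.
    subprocess.run(
        ["git", "clone", "--branch", branch, git_url, str(dest)],
        check=True,
    )


def ensure_clone(repo_id, git_url, branch, root=REPOS_ROOT, clone=_git_clone):
    """Return the local clone path, re-cloning from git_url if it is missing."""
    target = root / repo_id
    with _lock_for(repo_id):
        if (target / ".git").exists():
            return target  # cache hit: the clone is already on disk
        tmp = root / f".{repo_id}.tmp"
        shutil.rmtree(tmp, ignore_errors=True)     # stale temp dir from a crash
        shutil.rmtree(target, ignore_errors=True)  # partial or corrupt leftover
        root.mkdir(parents=True, exist_ok=True)
        clone(git_url, branch, tmp)
        # Publish only a complete clone, so readers never see a half-written dir.
        tmp.rename(target)
        return target
```

Each clone-dependent operation would call `ensure_clone(...)` with the stored `git_url` and `branch` before touching the directory, so `local_path` is only ever a cache hint. The `clone` parameter exists so tests can substitute a fake and avoid network access.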
