fix: adversarial review — correctness + security cluster (M1/M2/M3, A1/A2, A3, U1/U2, S1, AU1) by russellromney · Pull Request #62 · russellromney/ripclone

russellromney · 2026-06-25T23:41:29Z

Works through the correctness + security findings from ADVERSARIAL_REVIEW_2026-06-25.md (PR #60). Each fix is its own commit; the full cargo test suite is green after each. io_uring code (Linux-only) was additionally typechecked with cargo check --all-targets in a Linux container (toolchain 1.96.0) since dev is macOS.

Correctness-critical cluster

perf: parallel blob-pack build, lock-free proxy shaper, and zero-copy extraction #1 ref-write ordering "a newer sync never loses" (M1/M2/M3) — shared should_replace_ref; FileRefStore read-compare-then-rename under an in-process lock; SqlRefStore single atomic conditional upsert (ON CONFLICT … WHERE on sqlite/pg/libsql, @ripl IF() on MySQL); S3 branch saves through ETag CAS. Property test (File+SQLite always; pg/mysql/libsql/S3 env-gated).
experiment: parallel pre-built head-blobs pack download (Hybrid mode) #2 queue settlement + crash-loop dead-letter (A1/A2) — conditional finish/ack; attempts column + RIPCLONE_QUEUE_MAX_ATTEMPTS cap → dead-letter to terminal failed. All four adapters. Late-ack + dead-letter tests.
chore: move benchmarks to benchmark/ and trim stale docs #3 durable two-phase phase 2 (A3) — ephemeral cross-process workers build phase 2 inline before acking done (under the held repo_lock); crash → reclaimed/retried/dead-lettered via experiment: parallel pre-built head-blobs pack download (Hybrid mode) #2. In-process server keeps the fast background spawn.

Higher-risk cluster

Add opt-in Linux io_uring worktree writer #4 io_uring writer UAF (U1/U2) — never drop a PendingDirectWindow while the kernel still owns its buffers: EINTR-retrying submit/wait, drain-to-quiescence on harvest-error and Drop, SAFETY note on push_multiple. Linux-only; typechecked in a container.
perf(build): incremental history reuse + deferred archive + bitmaps + parallelism + e2e matrix #5 GC grace on reference time, not mtime (S1) — reference-time double-check: re-collect the reachable set (reading through the ref cache) before deleting and drop any candidate a concurrent sync started referencing during the pass. Deterministic reuse-race test (verified it fails without the recheck).
[codex] overlap io_uring write windows #6 per-repo read authz for private repos (AU1, +AU3) — the shared token is not per-repo authz; cache hits served private repos without re-checking the caller. New AccessVerifier proves access via a provider info/refs probe (public=anon, private=caller's credential), cached; gate runs at every read entry point before the cache-hit return; visibility derived from the provider, not the client header. Enforced by default; RIPCLONE_TRUST_GATEWAY=1 restores single-tenant trust mode. (Architecture note: there is no separate signing gateway in this repo — public is anonymous, private uses the caller's git credential — so this is backend per-repo verification, not a gateway-signed principal.)

Not addressed here

Lower-severity findings: rate limiter keying (AU2), SSRF/header-smuggling (AU4/AU5), symlink/depth correctness (F1/P1), and the perf items (R1–R12).

🤖 Generated with Claude Code

…1/M2/M3) The README promises "a newer sync never loses to an older one." That was enforced on none of: the file ref store (the default when no S3), S3 branch refs, and was a get-then-write TOCTOU on every SQL store — so the same race resolved differently depending on where refs happened to live. - Hoist the compare-policy into one shared `should_replace_ref` helper (same-commit always wins; otherwise newer-or-equal `synced_at` wins; a missing timestamp on either side defers to the backend's atomic tie-break). - FileRefStore: give save/save_branch the read-compare-then-rename they lacked, serialized by an in-process write lock (the file store is per-host by design). Also fixes two writers fighting over the shared `.json.tmp`. - SqlRefStore: replace get-then-unconditional-upsert with a single atomic conditional upsert via `MetaDb::save_ordered`. sqlite/postgres/libsql use `ON CONFLICT DO UPDATE ... WHERE`; MySQL (no WHERE on ON DUPLICATE KEY) uses an `@ripl := (...)` IF() session-variable computed once against the original row. Verified policy parity across all four dialects. - S3RefStore: route branch saves through the same ETag CAS as HEAD; unify both into one read-compare-CAS `save_keyed` with a bounded retry loop, and add `S3Storage::put_object_cas` (precondition failure → Ok(false) to retry). Adds the concurrency property test (suggested test #1): N tasks race save_branch with shuffled synced_at; the max must always win. FileRefStore and SQLite run unconditionally (both failed before this change); postgres/ mysql/libsql/S3 arms run when their connection env vars are set. Refs: ADVERSARIAL_REVIEW_2026-06-25.md M1/M2/M3 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…(A1/A2) Two cross-process queue hazards on a shared mirror: - A2 (double-settle): after a time-based reclaim re-owns a job, the original slow-but-alive worker's `finish` still settled it — the UPDATE matched only `WHERE id = ?`. `finish` is now conditional on still owning the claim (`WHERE id = ? AND worker_id = ? AND status = 'claimed'`) and returns whether it actually settled; `ack` threads the worker id and returns that bool. The worker logs and discards a rejected (stale) result instead of clobbering the current owner's build. - A1 (crash-loop): a SIGKILL/OOM build never acks, so its claim goes stale and was requeued forever with no terminal state. Added an `attempts` column (incremented on each claim) and a `RIPCLONE_QUEUE_MAX_ATTEMPTS` cap (default 5): `reclaim_stale` now dead-letters a stale claim that is at/over the cap to terminal `failed` (clearing its credential) and only requeues the rest. Applied identically across all four adapters (sqlite, postgres, mysql, libsql) with a best-effort `attempts` column migration for legacy tables. New tests: a reclaimed worker's late ack is rejected; a hard-killed build dead-letters after the cap. Refs: ADVERSARIAL_REVIEW_2026-06-25.md A1/A2 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two-phase publish acked the job `done` after phase 1 (depth=1) and built the full history in a detached `tokio::spawn`. On a long-lived in-process server that task survives, but an ephemeral/cross-process worker that exits right after acking loses it — the full clonepack is never built and nothing re-drives it. Phase 2 also ran after `do_sync` returned, outside `repo_lock`. `do_sync` / `build_and_publish_two_phase` now take an `inline_full_history` flag. The cross-process worker (`process_build_job` when the queue is not in-process, `inproc_wait() == false`) builds phase 2 inline before returning, so the job is acked `done` only once the full clonepack is durable; if that worker dies mid-build the claim goes stale and is reclaimed/retried (and dead-lettered after the cap — see A1/A2). Running inline keeps phase 2 under the `repo_lock` that `process_build_job` already holds across `do_sync`, so it is serialized against other syncs for the repo. The long-lived in-process server keeps the background spawn, so `/sync` stays fast there. Note: a faithful "SIGKILL the worker right after the phase-1 ack" regression test needs real subprocess control; the in-process e2e harness can't simulate it. The inline path's correctness is covered by the existing cross-process clone e2e (full history present after the job reports Done), and the re-drive-on-death backstop by the A1/A2 dead-letter test. Refs: ADVERSARIAL_REVIEW_2026-06-25.md A3 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…fers (U1/U2) A `PendingDirectWindow` owns the buffers the kernel reads/writes for its in-flight SQEs — the file content, the path `CString`, and the kernel-written `statx` results. The kernel owns those until the matching CQEs are reaped, so a window must never be dropped while still in flight. Two paths broke that: - An interrupted wait (`submit_and_wait` returning EINTR) with windows in flight propagated as an error, unwinding and dropping the window while the kernel still owned its buffers → UAF / data race (incl. the kernel writing freed `statx` memory). All submit/wait calls now retry EINTR via `submit_uninterrupted` / `submit_and_wait_uninterrupted` instead of returning. - `Drop` did a best-effort `flush_deferred_writes`, which bails on the first harvest error and leaves later windows in the deque to drop in flight. `Drop` now drains every pending window to `is_complete()` (`drain_pending_to_quiescence`, bounded, EINTR-retrying). The harvest error path likewise quiesces its owned window before returning the error. Adds the missing SAFETY note on the `push_multiple` unsafe block tying each SQE's buffer lifetime to its `PendingDirectWindow` living in `pending_windows` until complete. io_uring is Linux-only (`cfg(target_os = "linux")`); these changes were typechecked with `cargo check --all-targets` in a Linux container (toolchain 1.96.0) since the dev host is macOS. A true UAF regression test needs ASan + kernel EINTR fault injection (report's suggested test #5) and isn't added here; the existing scheduler tests cover happy-path no-regression. Refs: ADVERSARIAL_REVIEW_2026-06-25.md U1/U2 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The grace cutoff that protects in-flight uploads is anchored on each object's mtime — when it was last *written*. A sync that re-references an already-stored object without re-uploading it (a reused pack/chunk) leaves that object's mtime old, so to GC it looks unreachable-and-aged-out even though a concurrent sync just started pointing at it. The reachable set is snapshotted before the long list-every-object + walk-every-manifest pass, so any sync that lands during the pass is invisible and its reused objects can be deleted out from under it. Add a reference-time double-check: after building the delete candidates, re-collect the reachable set reading *through* the ref cache (a new `fresh` mode on `collect_reachable_hashes` that invalidates each branch before loading, a no-op for non-caching stores) and drop any candidate that became reachable during the pass. Freshly *written* objects stay protected by the mtime grace; this closes the reused-object window, leaving only a sub-second recheck→delete gap in place of the whole-pass window. Test (suggested #4): a concurrent sync re-references an aged reused object mid-pass (reproduced deterministically via a stale ref cache — first scan sees the pre-sync ref, recheck reads through to the fresh ref); the object must survive. Verified it fails without the recheck. Refs: ADVERSARIAL_REVIEW_2026-06-25.md S1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The shared RIPCLONE_SERVER_TOKEN only authenticates "may talk to this backend"; it is not per-repo authorization. Private-repo reads were gated by the caller's credential only on a cache *miss* (the upstream fetch). On a cache hit the ref response — including signed artifact URLs — was returned before the credential was looked at, so any holder of the shared token could read any already-cached private repo. A standing backend token widened this, and visibility was read from a client-supplied header that fails open to "public". There is no separate auth gateway in this architecture (public repos are anonymous; private repos are gated by the caller's own git credential), so the fix is per-repo verification in the backend, not a signed principal: - New `auth::access::AccessVerifier`. `HttpAccessVerifier` proves access the way a clone does — a `git-upload-pack` `info/refs` probe against the provider (anonymous → public; with the caller's credential → authorized private), provider-agnostic, results cached for `RIPCLONE_REPO_AUTH_TTL_SECS` (default 60s). - `authorize_repo_read` gate runs at every repo-read entry point BEFORE serving or signing — get_ref, sync, git smart-http (info/refs + upload-pack), and repo status — and crucially before the ref-response cache-hit return. Public → serve anonymously; authorized private → serve (private URL TTL); denied → 403. - Visibility is now derived from the verifier, not the client header. - On by default. `RIPCLONE_TRUST_GATEWAY=1` restores single-tenant trust mode (skip the check, header visibility) for a network-isolated self-host; a startup log states which mode is active. Documented in docs/BACKENDS.md. Tests: gate maps public/authorized/denied correctly and falls back to the header in trust mode; a denied caller gets 403 through the real `refs` route (previously 200 from cache). e2e harness uses trust mode (its file:// origins can't be probed over HTTP). Refs: ADVERSARIAL_REVIEW_2026-06-25.md AU1 (+ AU3 visibility fail-open) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…gin) - rustfmt across the branch: I had run `cargo test` after each fix but not `cargo fmt --check`, so several files (mostly multi-line SQL/signatures from the queue and auth changes) tripped the lint job. Formatting only. - `e2e_local.sh` clones from a `file://` origin, which the new per-repo access enforcement (AU1) can't probe over HTTP, so the real-binary e2e 403'd on clone. Set `RIPCLONE_TRUST_GATEWAY=1` there — the documented single-tenant escape hatch — matching what the in-process e2e harness already does. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

russellromney and others added 6 commits June 25, 2026 16:14

russellromney changed the title ~~fix: adversarial review correctness cluster (M1/M2/M3, A1/A2, A3)~~ fix: adversarial review — correctness + security cluster (M1/M2/M3, A1/A2, A3, U1/U2, S1, AU1) Jun 26, 2026

russellromney merged commit 3cb2ba4 into main Jun 26, 2026
7 checks passed

russellromney mentioned this pull request Jun 26, 2026

fix: remaining adversarial-review findings (M4/M5/M6/M7, S2/S3, A4, F2, P2, AU6/AU7, R8, U6) #69

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: adversarial review — correctness + security cluster (M1/M2/M3, A1/A2, A3, U1/U2, S1, AU1)#62

fix: adversarial review — correctness + security cluster (M1/M2/M3, A1/A2, A3, U1/U2, S1, AU1)#62
russellromney merged 7 commits into
mainfrom
fix/adversarial-correctness-cluster

russellromney commented Jun 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

russellromney commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Correctness-critical cluster

Higher-risk cluster

Not addressed here

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

russellromney commented Jun 25, 2026 •

edited

Loading