fix: adversarial review — correctness + security cluster (M1/M2/M3, A1/A2, A3, U1/U2, S1, AU1)#62
Merged
Conversation
…1/M2/M3) The README promises "a newer sync never loses to an older one." That was enforced on none of: the file ref store (the default when no S3), S3 branch refs, and was a get-then-write TOCTOU on every SQL store — so the same race resolved differently depending on where refs happened to live. - Hoist the compare-policy into one shared `should_replace_ref` helper (same-commit always wins; otherwise newer-or-equal `synced_at` wins; a missing timestamp on either side defers to the backend's atomic tie-break). - FileRefStore: give save/save_branch the read-compare-then-rename they lacked, serialized by an in-process write lock (the file store is per-host by design). Also fixes two writers fighting over the shared `.json.tmp`. - SqlRefStore: replace get-then-unconditional-upsert with a single atomic conditional upsert via `MetaDb::save_ordered`. sqlite/postgres/libsql use `ON CONFLICT DO UPDATE ... WHERE`; MySQL (no WHERE on ON DUPLICATE KEY) uses an `@ripl := (...)` IF() session-variable computed once against the original row. Verified policy parity across all four dialects. - S3RefStore: route branch saves through the same ETag CAS as HEAD; unify both into one read-compare-CAS `save_keyed` with a bounded retry loop, and add `S3Storage::put_object_cas` (precondition failure → Ok(false) to retry). Adds the concurrency property test (suggested test #1): N tasks race save_branch with shuffled synced_at; the max must always win. FileRefStore and SQLite run unconditionally (both failed before this change); postgres/ mysql/libsql/S3 arms run when their connection env vars are set. Refs: ADVERSARIAL_REVIEW_2026-06-25.md M1/M2/M3 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(A1/A2) Two cross-process queue hazards on a shared mirror: - A2 (double-settle): after a time-based reclaim re-owns a job, the original slow-but-alive worker's `finish` still settled it — the UPDATE matched only `WHERE id = ?`. `finish` is now conditional on still owning the claim (`WHERE id = ? AND worker_id = ? AND status = 'claimed'`) and returns whether it actually settled; `ack` threads the worker id and returns that bool. The worker logs and discards a rejected (stale) result instead of clobbering the current owner's build. - A1 (crash-loop): a SIGKILL/OOM build never acks, so its claim goes stale and was requeued forever with no terminal state. Added an `attempts` column (incremented on each claim) and a `RIPCLONE_QUEUE_MAX_ATTEMPTS` cap (default 5): `reclaim_stale` now dead-letters a stale claim that is at/over the cap to terminal `failed` (clearing its credential) and only requeues the rest. Applied identically across all four adapters (sqlite, postgres, mysql, libsql) with a best-effort `attempts` column migration for legacy tables. New tests: a reclaimed worker's late ack is rejected; a hard-killed build dead-letters after the cap. Refs: ADVERSARIAL_REVIEW_2026-06-25.md A1/A2 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two-phase publish acked the job `done` after phase 1 (depth=1) and built the full history in a detached `tokio::spawn`. On a long-lived in-process server that task survives, but an ephemeral/cross-process worker that exits right after acking loses it — the full clonepack is never built and nothing re-drives it. Phase 2 also ran after `do_sync` returned, outside `repo_lock`. `do_sync` / `build_and_publish_two_phase` now take an `inline_full_history` flag. The cross-process worker (`process_build_job` when the queue is not in-process, `inproc_wait() == false`) builds phase 2 inline before returning, so the job is acked `done` only once the full clonepack is durable; if that worker dies mid-build the claim goes stale and is reclaimed/retried (and dead-lettered after the cap — see A1/A2). Running inline keeps phase 2 under the `repo_lock` that `process_build_job` already holds across `do_sync`, so it is serialized against other syncs for the repo. The long-lived in-process server keeps the background spawn, so `/sync` stays fast there. Note: a faithful "SIGKILL the worker right after the phase-1 ack" regression test needs real subprocess control; the in-process e2e harness can't simulate it. The inline path's correctness is covered by the existing cross-process clone e2e (full history present after the job reports Done), and the re-drive-on-death backstop by the A1/A2 dead-letter test. Refs: ADVERSARIAL_REVIEW_2026-06-25.md A3 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fers (U1/U2) A `PendingDirectWindow` owns the buffers the kernel reads/writes for its in-flight SQEs — the file content, the path `CString`, and the kernel-written `statx` results. The kernel owns those until the matching CQEs are reaped, so a window must never be dropped while still in flight. Two paths broke that: - An interrupted wait (`submit_and_wait` returning EINTR) with windows in flight propagated as an error, unwinding and dropping the window while the kernel still owned its buffers → UAF / data race (incl. the kernel writing freed `statx` memory). All submit/wait calls now retry EINTR via `submit_uninterrupted` / `submit_and_wait_uninterrupted` instead of returning. - `Drop` did a best-effort `flush_deferred_writes`, which bails on the first harvest error and leaves later windows in the deque to drop in flight. `Drop` now drains every pending window to `is_complete()` (`drain_pending_to_quiescence`, bounded, EINTR-retrying). The harvest error path likewise quiesces its owned window before returning the error. Adds the missing SAFETY note on the `push_multiple` unsafe block tying each SQE's buffer lifetime to its `PendingDirectWindow` living in `pending_windows` until complete. io_uring is Linux-only (`cfg(target_os = "linux")`); these changes were typechecked with `cargo check --all-targets` in a Linux container (toolchain 1.96.0) since the dev host is macOS. A true UAF regression test needs ASan + kernel EINTR fault injection (report's suggested test #5) and isn't added here; the existing scheduler tests cover happy-path no-regression. Refs: ADVERSARIAL_REVIEW_2026-06-25.md U1/U2 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The grace cutoff that protects in-flight uploads is anchored on each object's mtime — when it was last *written*. A sync that re-references an already-stored object without re-uploading it (a reused pack/chunk) leaves that object's mtime old, so to GC it looks unreachable-and-aged-out even though a concurrent sync just started pointing at it. The reachable set is snapshotted before the long list-every-object + walk-every-manifest pass, so any sync that lands during the pass is invisible and its reused objects can be deleted out from under it. Add a reference-time double-check: after building the delete candidates, re-collect the reachable set reading *through* the ref cache (a new `fresh` mode on `collect_reachable_hashes` that invalidates each branch before loading, a no-op for non-caching stores) and drop any candidate that became reachable during the pass. Freshly *written* objects stay protected by the mtime grace; this closes the reused-object window, leaving only a sub-second recheck→delete gap in place of the whole-pass window. Test (suggested #4): a concurrent sync re-references an aged reused object mid-pass (reproduced deterministically via a stale ref cache — first scan sees the pre-sync ref, recheck reads through to the fresh ref); the object must survive. Verified it fails without the recheck. Refs: ADVERSARIAL_REVIEW_2026-06-25.md S1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The shared RIPCLONE_SERVER_TOKEN only authenticates "may talk to this backend"; it is not per-repo authorization. Private-repo reads were gated by the caller's credential only on a cache *miss* (the upstream fetch). On a cache hit the ref response — including signed artifact URLs — was returned before the credential was looked at, so any holder of the shared token could read any already-cached private repo. A standing backend token widened this, and visibility was read from a client-supplied header that fails open to "public". There is no separate auth gateway in this architecture (public repos are anonymous; private repos are gated by the caller's own git credential), so the fix is per-repo verification in the backend, not a signed principal: - New `auth::access::AccessVerifier`. `HttpAccessVerifier` proves access the way a clone does — a `git-upload-pack` `info/refs` probe against the provider (anonymous → public; with the caller's credential → authorized private), provider-agnostic, results cached for `RIPCLONE_REPO_AUTH_TTL_SECS` (default 60s). - `authorize_repo_read` gate runs at every repo-read entry point BEFORE serving or signing — get_ref, sync, git smart-http (info/refs + upload-pack), and repo status — and crucially before the ref-response cache-hit return. Public → serve anonymously; authorized private → serve (private URL TTL); denied → 403. - Visibility is now derived from the verifier, not the client header. - On by default. `RIPCLONE_TRUST_GATEWAY=1` restores single-tenant trust mode (skip the check, header visibility) for a network-isolated self-host; a startup log states which mode is active. Documented in docs/BACKENDS.md. Tests: gate maps public/authorized/denied correctly and falls back to the header in trust mode; a denied caller gets 403 through the real `refs` route (previously 200 from cache). e2e harness uses trust mode (its file:// origins can't be probed over HTTP). Refs: ADVERSARIAL_REVIEW_2026-06-25.md AU1 (+ AU3 visibility fail-open) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gin) - rustfmt across the branch: I had run `cargo test` after each fix but not `cargo fmt --check`, so several files (mostly multi-line SQL/signatures from the queue and auth changes) tripped the lint job. Formatting only. - `e2e_local.sh` clones from a `file://` origin, which the new per-repo access enforcement (AU1) can't probe over HTTP, so the real-binary e2e 403'd on clone. Set `RIPCLONE_TRUST_GATEWAY=1` there — the documented single-tenant escape hatch — matching what the in-process e2e harness already does. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Works through the correctness + security findings from
ADVERSARIAL_REVIEW_2026-06-25.md(PR #60). Each fix is its own commit; the fullcargo testsuite is green after each. io_uring code (Linux-only) was additionally typechecked withcargo check --all-targetsin a Linux container (toolchain 1.96.0) since dev is macOS.Correctness-critical cluster
should_replace_ref; FileRefStore read-compare-then-rename under an in-process lock; SqlRefStore single atomic conditional upsert (ON CONFLICT … WHEREon sqlite/pg/libsql,@riplIF() on MySQL); S3 branch saves through ETag CAS. Property test (File+SQLite always; pg/mysql/libsql/S3 env-gated).finish/ack;attemptscolumn +RIPCLONE_QUEUE_MAX_ATTEMPTScap → dead-letter to terminalfailed. All four adapters. Late-ack + dead-letter tests.done(under the heldrepo_lock); crash → reclaimed/retried/dead-lettered via experiment: parallel pre-built head-blobs pack download (Hybrid mode) #2. In-process server keeps the fast background spawn.Higher-risk cluster
PendingDirectWindowwhile the kernel still owns its buffers: EINTR-retrying submit/wait, drain-to-quiescence on harvest-error and Drop, SAFETY note onpush_multiple. Linux-only; typechecked in a container.AccessVerifierproves access via a providerinfo/refsprobe (public=anon, private=caller's credential), cached; gate runs at every read entry point before the cache-hit return; visibility derived from the provider, not the client header. Enforced by default;RIPCLONE_TRUST_GATEWAY=1restores single-tenant trust mode. (Architecture note: there is no separate signing gateway in this repo — public is anonymous, private uses the caller's git credential — so this is backend per-repo verification, not a gateway-signed principal.)Not addressed here
Lower-severity findings: rate limiter keying (AU2), SSRF/header-smuggling (AU4/AU5), symlink/depth correctness (F1/P1), and the perf items (R1–R12).
🤖 Generated with Claude Code