
fix(spec-003 v0.8.3): admit-tokenless on duplicate during settling window #69

Merged
Augustas11 merged 5 commits into main from feat/tofu-settling-gate
Jun 12, 2026

Conversation

@Augustas11

Owner

Summary

Replaces v0.8.2's hard-TOFU reject with admit-tokenless when a tokenless connect arrives for a provider_id that already has an unrevoked token row.

The DB partial unique index (v0.8.2) is the canonical "at most one valid bearer per provider_id" enforcement. The wire-level close was redundant and structurally incompatible with the settling-window deploy plan that PRs #44 + Entries 60-62 designed against.

Why now

Discovered during the 2026-06-12 v1.3.1 deploy attempt to Pearl (DECISION_CRITERIA Entry 66). Cascading failure:

  1. 13:55Z — Coordinator v1.3.1 deployed; restart kicked existing pool.
  2. Within 5 min — air5 + air8gb (both on v1.2.5 binaries) reconnected tokenless. Coordinator self-minted tokens for each, returned in hello_ack. v1.2.5 silently dropped the unknown assigned_provider_token field (Swift JSON decoder behavior).
  3. 22:47Z — air8gb disconnected (NAT blip), reconnected tokenless, hit v0.8.2's hard-TOFU gate: HasActiveTokenForProvider == true → CloseInvalidToken / "invalid_token". Permanently rejected.
  4. 22:49Z — Coordinator rolled back to v1.3.0-24-g87b3a6b via coordinator.prev swap. Both providers recovered on next reconnect.
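The silent drop in step 2 is ordinary lenient-decoder behavior. The sketch below reproduces it in Go as an analogue (the affected binaries are Swift, and Go's encoding/json likewise ignores unknown keys by default). The struct names and the `type` field are hypothetical; only `assigned_provider_token` comes from the timeline above.

```go
package main

import "encoding/json"

// helloAckOld models the ack as an older client understands it.
// Hypothetical shape: it has no field for the token.
type helloAckOld struct {
	Type string `json:"type"`
}

// helloAckNew models a client that knows about the token field.
type helloAckNew struct {
	Type                  string `json:"type"`
	AssignedProviderToken string `json:"assigned_provider_token,omitempty"`
}

// decodeOld decodes a frame with the old schema. encoding/json ignores
// unknown object keys by default, so the token is dropped without an error.
func decodeOld(frame []byte) (helloAckOld, error) {
	var ack helloAckOld
	err := json.Unmarshal(frame, &ack)
	return ack, err
}

// decodeNew decodes the same frame with the newer schema and keeps the token.
func decodeNew(frame []byte) (helloAckNew, error) {
	var ack helloAckNew
	err := json.Unmarshal(frame, &ack)
	return ack, err
}
```

Decoding the same frame through both functions succeeds in both cases. The old schema reports no error, which is why the provider had nothing to present on its next reconnect.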

The structural defect: PR #44's three-auditor pass + the focused security re-audit evaluated the security model (TOFU does prevent credential capture — correct) but did not audit the deployment plan against the "existing tokenless providers cannot persist tokens" constraint. The settling window cannot work if the wire-level reject bricks the old-binary cohort on first reconnect.

What changes

phase4-coordinator/internal/ws/server.go

  • resolveProvisionalToken maps ErrActiveTokenAlreadyExists to provisionalTokenSkip (admit without ack token) instead of provisionalTokenRejectTOFU (close).
  • Pre-INSERT HasActiveTokenForProvider SELECT call removed — it was a TOCTOU-racy belt for the unique-index suspenders, and the suspenders are the only canonical enforcement.
  • provisionalTokenRejectTOFU enum value + the v1/v2 ack-write close-on-reject branches deleted.
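A minimal sketch of the mapping described in the bullets above, assuming a narrow issuer interface. `tokenIssuer`, `memIssuer`, and `provisionalTokenIssued` are hypothetical names for illustration; the in-memory map only stands in for the partial unique index and is not the coordinator's store.

```go
package main

import "errors"

// ErrActiveTokenAlreadyExists stands in for the error returned when the
// unique index rejects a second active row for the same provider_id.
var ErrActiveTokenAlreadyExists = errors.New("active token already exists")

type provisionalTokenOutcome int

const (
	provisionalTokenIssued provisionalTokenOutcome = iota // hypothetical name: mint succeeded
	provisionalTokenSkip                                  // admit without an ack token
)

// tokenIssuer is a hypothetical narrow view of the token store.
type tokenIssuer interface {
	IssueToken(providerID string) (string, error)
}

// memIssuer emulates "at most one active token per provider_id" with a map.
type memIssuer struct{ active map[string]string }

func (m *memIssuer) IssueToken(providerID string) (string, error) {
	if _, ok := m.active[providerID]; ok {
		return "", ErrActiveTokenAlreadyExists
	}
	tok := "tok-" + providerID
	m.active[providerID] = tok
	return tok, nil
}

// resolveProvisionalToken always attempts the mint and lets the store decide.
// There is no pre-INSERT existence check, so there is no TOCTOU window.
func resolveProvisionalToken(issuer tokenIssuer, providerID string) (provisionalTokenOutcome, string) {
	tok, err := issuer.IssueToken(providerID)
	switch {
	case err == nil:
		return provisionalTokenIssued, tok
	case errors.Is(err, ErrActiveTokenAlreadyExists):
		// Constraint failure: admit the connection without a bearer.
		return provisionalTokenSkip, ""
	default:
		// Other store errors also skipped at this revision, per the later
		// fix-pass-5 notes in this PR.
		return provisionalTokenSkip, ""
	}
}
```

Calling `resolveProvisionalToken` twice for the same provider_id yields an issued token first and a skip with an empty token second, with exactly one entry left in the map.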

phase4-coordinator/internal/ws/provider_token_self_serve_test.go

Renamed + rewrote the TOFU regression test:

  • Old: TestSelfServeProvisionalTokenRejectedWhenActiveTokenExists (asserted CloseInvalidToken)
  • New: TestSelfServeProvisionalTokenAdmitsTokenlessWhenActiveTokenExists

Asserts:

  • hello_ack received (not close frame)
  • assigned_provider_token == "" in the ack (loser is bearer-less)
  • Exactly one active row in DB (DB constraint prevented duplicate mint)

specs/SPEC-003-open-onboarding.md

Bumped v0.8.2 → v0.8.3:

  • FR-C9.1 rewritten: coordinator MUST attempt the mint on every tokenless provisional admission; on ErrActiveTokenAlreadyExists, admit tokenless without ack token.
  • FR-C9.4 rewritten: the partial unique index is the canonical "at most one valid bearer per provider_id" enforcement (not a wire-level reject). Settling-window contract = admit-tokenless. Post-flip contract = validateProviderToken upgrade-gate reject. Operator runbook now mandates coordinator-cli list-tokens audit + revoke-and-coordinate at flag-flip.

beta/DECISION_CRITERIA.md

Entry 66 — full deploy postmortem + the methodology lesson: deployment plans for security-sensitive contract changes need an adversarial review against "what existing field consumers in the wild do," not just "what the spec says they should do."

Security property preserved

The credential-capture attack the codex security audit on PR #44 (MAJOR-1) closed remains closed. An attacker still cannot mint a parallel bearer for someone else's provider_id — the INSERT fails on the unique constraint. The only thing that changes is what happens to the losing connection on the wire: admit-bearer-less instead of close-with-reject.

After flag-flip (RequireProviderTokens=true), tokenless connects are still rejected at the WS upgrade layer by validateProviderToken BEFORE reaching admission. The DB constraint at that layer becomes defense in depth.

Operator runbook update at flag-flip

Pre-flip checklist (now normative in FR-C9.4):

  1. coordinator-cli list-tokens — audit for duplicate or unexpected active rows
  2. coordinator-cli revoke-token --token-prefix <prefix> on any rows belonging to the wrong party
  3. Coordinate with the legitimate party so they can mint cleanly on their next reconnect (revoke + restart, OR ensure they're on a binary that can persist assigned_provider_token)
  4. Only THEN flip auth.require_provider_tokens=true + SIGHUP

Test plan

  • go test ./... full coordinator suite green
  • New test TestSelfServeProvisionalTokenAdmitsTokenlessWhenActiveTokenExists passes
  • All existing token tests still pass (no regressions on the v0.8.2 TOCTOU-safety properties)
  • Pre-fix test name (...RejectedWhen...) no longer exists; renamed + rewritten

Not in this PR (carried separately)

  • PR #62 (release: conditional Developer ID signing + notarization; unblocks macOS 26.3.1 launchd): release-pipeline Developer ID signing + notarization, which gates the next deploy on macOS 26.3.1+.
  • Revoke the two orphan tokens before any future v1.3.1+ coordinator deploy. Prefixes: 372c3372023e (air5), d630849e8acb (air8gb).
  • Fix nginx duplicate-zone source-of-truth in phase4-coordinator/dist/nginx-coordinator.streamvc.live.conf:18-19 — operator hot-patched the live conf; source still has the bug.

🤖 Generated with Claude Code

Augustas11 and others added 2 commits June 12, 2026 07:14
…ndow

Replaces v0.8.2's hard-TOFU reject with admit-tokenless when a tokenless
connect arrives for a provider_id that already has an unrevoked token
row. The DB partial unique index (v0.8.2) is the canonical "at most one
valid bearer per provider_id" enforcement; the wire-level close was
redundant and structurally incompatible with the settling-window
deploy plan.

Discovery (2026-06-12 deploy attempt; full writeup in DECISION_CRITERIA
Entry 66):

  1. Coordinator v1.3.1 deployed to Pearl at 13:55Z.
  2. air5 + air8gb (v1.2.5 binaries) reconnected tokenless within 5
     minutes; coordinator self-minted tokens, returned them in
     hello_ack frames.
  3. v1.2.5 binaries silently dropped the unknown assigned_provider_token
     field (Swift JSON decoder ignores unknown keys).
  4. air8gb disconnected at 22:47Z, reconnected tokenless, hit v0.8.2
     hard-TOFU gate, was permanently rejected with invalid_token.
  5. Coordinator rolled back to v1.3.0-24 at 22:49Z to recover both
     providers.

The structural defect: v0.8.1 + v0.8.2 audited the security model
(TOFU does prevent credential capture — correct) but did not audit the
deployment plan against the "existing tokenless providers cannot
persist tokens" constraint. The settling window cannot work if the
wire-level reject bricks the old-binary cohort on first reconnect.

Fix:

- resolveProvisionalToken maps ErrActiveTokenAlreadyExists to
  provisionalTokenSkip (admit without ack token) instead of
  provisionalTokenRejectTOFU (close).
- Pre-INSERT HasActiveTokenForProvider SELECT call removed (was a
  TOCTOU-racy belt for the unique-index suspenders; the suspenders
  are the only canonical enforcement).
- provisionalTokenRejectTOFU enum value + v1/v2 ack-write
  close-on-reject branches deleted.

Security property preserved: the partial unique index still prevents
minting a parallel bearer for someone else's provider_id, so whoever
races first owns the bearer. Subsequent tokenless connects are admitted
bearer-less, surfacing the operator-action need at flag-flip time via
coordinator-cli list-tokens audit.

Regression test renamed:
- TestSelfServeProvisionalTokenRejectedWhenActiveTokenExists ->
  TestSelfServeProvisionalTokenAdmitsTokenlessWhenActiveTokenExists

Asserts hello_ack received (not close), assigned_provider_token == "",
exactly one active row in DB.

SPEC-003 bumped v0.8.2 -> v0.8.3 with FR-C9.1 mint-conditional-on-DB-
constraint rewrite + FR-C9.4 admit-tokenless contract + explicit
operator runbook for flag-flip audit.

Full Go suite green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…book

Three codex auditors (code-reviewer, security-reviewer, architect) ran
in parallel on the v0.8.3 commit 1fc7eaa. Convergent MAJOR finding from
security + architect: v0.8.3 admit-tokenless still let a duplicate
tokenless connect REPLACE the legitimate session via Registry's
last-writer-wins on provider_id, so the attacker got the pool slot,
received buyer traffic, and accrued billing identity under the
victim's provider_id. This re-opened M1-1 / XSEC-1 in a different form.

Fix (security MAJOR-1 + architect MAJOR-1):

- New pool.AuthState enum:
  - AuthBearerValidated
  - AuthSelfMinted
  - AuthBearerlessDuplicate
- pool.Provider.AuthState field with omitempty JSON tag (preserves
  /poolz output for default-zero pre-v0.8.3 providers).
- resolveProvisionalToken returns (outcome, token, authState) — sets
  AuthBearerlessDuplicate on ErrActiveTokenAlreadyExists.
- v1/v2 ack sites set entry.AuthState before registerProviderSession.
- pool.Registry.Register returns (oldConn, registered bool); refuses
  to evict an existing session when the new entry is AuthBearerlessDuplicate.
- WS handler closes the duplicate with CloseInvalidToken when
  registration is refused.
- pool.Provider.RoutingEligible() returns false for AuthBearerlessDuplicate
  (excludes from buyer routing + billing identity).

Fix (security MAJOR-2 + architect MAJOR-2):

- Flag-flip runbook in FR-C9.4 rewritten on last_used_at freshness.
- Pure "row existence" via coordinator-cli list-tokens is NOT
  operational evidence under v0.8.3.
- Unproven rows (last_used_at IS NULL) MUST be revoked + legitimate
  provider asked to reconnect, OR operator coordinates out-of-band
  via coordinator-cli issue-token for a pinned-tier token.

Doc cleanups (code MAJOR-1/3, architect MAJOR-3, NITs):

- All Entry 63 -> Entry 66 references throughout SPEC-003 + test files
- Stale "TOFU close path" / "MUST treat as TOFU rejection" / "MUST close
  the tokenless connection" comments removed from tokens.go and
  server.go (TokenIssuer interface comment, ErrActiveTokenAlreadyExists
  comment, constraint-failure mapping, HasActiveTokenForProvider doc).
- provisionalTokenOutcome doc accuracy: two outcomes, no reject path,
  with v0.8.3 fix-pass-3 reference.

Interface narrowing (code MINOR-3 + architect MINOR-1):

- HasActiveTokenForProvider removed from TokenIssuer interface.
- Retained on *auth.Store for operator tooling with explicit doc-
  comment warning against re-introducing it as a pre-flight gate
  (TOCTOU window).

New regression tests:

- TestSelfServeProvisionalTokenEvictionDefenseProtectsRoutableSession
  (security MAJOR-1 regression — fails on pre-fix code).
- TestSelfServeProvisionalTokenAdmitsTokenlessOnAuthResponseV2WhenActiveTokenExists
  (code MAJOR-2 / architect MINOR-3 — v2 admit-tokenless coverage).
- Existing admit-tokenless v1 test extended with AuthState assertion
  + RoutingEligible assertion.

SPEC-003 FR-C9.4 rewritten as layered normative contract:
- (a) Storage invariant
- (b) Settling-window wire contract
- (c) Routing + billing + registry consequences (new)
- (d) Post-flip wire contract
- (e) Operator flag-flip runbook with last_used_at gate

DECISION_CRITERIA Entry 67 records full deploy postmortem + the
methodology lesson: any change to admission semantics requires explicit
re-evaluation of pool-registration, routing, and billing consequences,
not just wire-frame behavior.

Full Go suite green (172 self-serve coverage in WS package; all
existing pool/auth/buyer/explorer tests pass).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Augustas11

Owner Author

fix-pass-3 (commit 1619f97) — three-auditor pass + structural fix

Three codex auditors (code-reviewer, security-reviewer, architect) ran in parallel on the v0.8.3 commit 1fc7eaa:

| Auditor | Verdict | CRITICAL / MAJOR / MINOR / NIT |
| --- | --- | --- |
| code-reviewer | REQUEST CHANGES | 0 / 2 / 3 / 0 |
| security-reviewer | BLOCK MERGE | 0 / 2 / 0 / 1 |
| architect | refactor-needed | 0 / 3 / 3 / 0 |

Convergent MAJOR (security + architect)

v0.8.3's admit-tokenless removed the wire-level close, but the v1/v2 paths still called registerProviderSession, which keys by provider_id and last-writer-wins-evicts the existing session (pool/provider.go:191-224). Attacker:

  • Doesn't get a bearer (DB unique index prevents that ✓)
  • Gets the pool slot (eviction defense missing ✗)
  • Receives buyer traffic ✗
  • Accrues billing identity under the victim's provider_id

That re-opened M1-1 / XSEC-1 (provider impersonation) in a different shape.

Fix

Added pool.AuthState as a first-class enum + field on pool.Provider:

  • AuthBearerValidated
  • AuthSelfMinted
  • AuthBearerlessDuplicate

resolveProvisionalToken now returns (outcome, token, authState). v1/v2 ack sites set entry.AuthState before registerProviderSession.
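As a rough illustration of why an `omitempty` tag keeps /poolz output stable for entries that never had the field set, here is a sketch that assumes the enum is string-backed. The string values, the `provider_id` field, and `poolzJSON` are hypothetical; only the three enum names and the `auth_state` key come from this PR.

```go
package main

import "encoding/json"

// AuthState is sketched as a string-backed enum (an assumption); its zero
// value is the empty string.
type AuthState string

const (
	AuthBearerValidated     AuthState = "bearer_validated"     // hypothetical wire value
	AuthSelfMinted          AuthState = "self_minted"          // hypothetical wire value
	AuthBearerlessDuplicate AuthState = "bearerless_duplicate" // hypothetical wire value
)

// Provider is a cut-down stand-in for pool.Provider.
type Provider struct {
	ProviderID string    `json:"provider_id"`
	AuthState  AuthState `json:"auth_state,omitempty"`
}

// poolzJSON renders one provider the way a status endpoint might.
func poolzJSON(p Provider) string {
	b, err := json.Marshal(p)
	if err != nil {
		return ""
	}
	return string(b)
}
```

With the zero value, `poolzJSON(Provider{ProviderID: "air5"})` produces `{"provider_id":"air5"}`, so consumers that predate the field see no change.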

Three concurrent gates on AuthBearerlessDuplicate:

  1. pool.Registry.Register returns (oldConn, registered bool); refuses to evict an existing session in favor of a bearer-less duplicate. WS handler closes the duplicate with CloseInvalidToken / "invalid_token".
  2. pool.Provider.RoutingEligible() returns false — excluded from buyer routing AND billing identity (billing is keyed on routed providers).
  3. Surfaced in /poolz JSON via auth_state field (omitempty, so default-value entries don't change pre-v0.8.3 output).
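The first two gates can be sketched as follows. This is a simplified model and not the code in pool/provider.go: the real Register returns the old connection, while this sketch returns the old entry, and returning nil on refusal is an assumption. `NewRegistry`, `Lookup`, and the field names are hypothetical, and state/slot checks are left out of the predicate.

```go
package main

import "sync"

type AuthState string

const (
	AuthBearerValidated     AuthState = "bearer_validated"
	AuthSelfMinted          AuthState = "self_minted"
	AuthBearerlessDuplicate AuthState = "bearerless_duplicate"
)

// Provider is a cut-down stand-in for pool.Provider.
type Provider struct {
	ID        string
	AuthState AuthState
}

// RoutingEligible excludes bearer-less duplicates from routing (gate 2).
func (p *Provider) RoutingEligible() bool {
	return p.AuthState != AuthBearerlessDuplicate
}

// Registry keys sessions by provider_id.
type Registry struct {
	mu   sync.Mutex
	byID map[string]*Provider
}

func NewRegistry() *Registry { return &Registry{byID: map[string]*Provider{}} }

// Register stores p and returns the entry it replaced, if any. It refuses to
// evict an existing session in favor of a bearer-less duplicate (gate 1).
func (r *Registry) Register(p *Provider) (old *Provider, registered bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	existing := r.byID[p.ID]
	if existing != nil && p.AuthState == AuthBearerlessDuplicate {
		return nil, false // caller closes the duplicate connection
	}
	r.byID[p.ID] = p
	return existing, true
}

// Lookup returns the current session for a provider_id, or nil.
func (r *Registry) Lookup(id string) *Provider {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.byID[id]
}
```

In this model, a second connect marked AuthBearerlessDuplicate leaves the first session in place, and the handler would then close the duplicate when `registered` is false.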

Convergent MAJOR-2 (security + architect): flag-flip runbook ungrounded

Old runbook: "audit list-tokens for duplicate or unexpected rows." But duplicates are prevented by the unique index (so "duplicate" is unobservable) and "unexpected" was undefined.

New runbook (FR-C9.4(e)): gate flag flip on last_used_at freshness within 24h. A row with last_used_at IS NULL proves no binary has ever successfully authenticated with that token — could be the legitimate provider whose binary can't persist tokens OR an attacker who never had a usable binary. Either way: revoke + reconnect (race for fresh mint) OR coordinate out-of-band via coordinator-cli issue-token for a pinned-tier token.

Other findings (mechanical)

  • Code MAJOR-1 + architect MAJOR-3: Stale "TOFU close path" / "MUST treat as TOFU rejection" comments cleaned up across tokens.go, server.go, messages.go.
  • Code MAJOR-2 + architect MINOR-3: Added v2 auth_response duplicate-admit-tokenless test.
  • Code MINOR-3 + architect MINOR-1: HasActiveTokenForProvider removed from TokenIssuer interface. Retained on *auth.Store for operator tooling with explicit doc-comment warning against re-introducing it as a pre-flight gate (would reintroduce the TOCTOU window v0.8.2 closed).
  • Entry 63 → Entry 66 references corrected throughout SPEC-003 + test file.
  • NIT-1: Stale comments cleaned up.

New regression tests

| Test | What it locks | Fails on |
| --- | --- | --- |
| TestSelfServeProvisionalTokenEvictionDefenseProtectsRoutableSession | Bearer-less duplicate cannot evict a routable session | Pre-fix code (v0.8.3 1fc7eaa) |
| TestSelfServeProvisionalTokenAdmitsTokenlessOnAuthResponseV2WhenActiveTokenExists | v2 admit-tokenless symmetric with v1 | Pre-fix (v2 untested) |
| (extended) TestSelfServeProvisionalTokenAdmitsTokenlessWhenActiveTokenExists | Now also asserts AuthBearerlessDuplicate marking + RoutingEligible()==false | Pre-fix-pass-3 |

SPEC-003 FR-C9.4 rewritten as layered normative contract

  • (a) Storage invariant (unchanged from v0.8.2)
  • (b) Settling-window wire contract (5-step MUST list, including AuthBearerlessDuplicate marking)
  • (c) NEW: Routing + billing + registry consequences — codifies the eviction defense + routing exclusion + /poolz visibility
  • (d) Post-flip wire contract (DB constraint as defense in depth)
  • (e) NEW: Operator flag-flip runbook with last_used_at freshness gate

DECISION_CRITERIA

Entry 67 records the postmortem + the methodology lesson:

Any change to admission semantics requires explicit re-evaluation of pool-registration, routing, and billing consequences, not just wire-frame behavior.

Tests

Full Go suite green. 9 self-serve tests pass in WS package.

🤖 Generated with Claude Code

Three-auditor convergent MAJOR on fix-pass-3 (commit 1619f97):
buyer/server.go used a parallel baseRoutingEligible predicate
(State+SlotsFree only) that bypassed pool.Provider.RoutingEligible()
and re-opened the bearer-less-duplicate routing/billing capture
window. The fix-pass-3 claim of routing exclusion was false in the
money path.

Routing/capacity sites delegate to p.RoutingEligible():
- handleModels (/v1/models capacity)
- chat-completions candidate loop
- validatePinnedProviderForRequest (X-MacProvider-Provider hard-pin)

Observability sites that branch on HashStatus (internalTier2Metadata,
applyHashVerification) rename baseRoutingEligible -> hasAvailableSlot
and exclude AuthBearerlessDuplicate while preserving slot/state
semantics — those branches need to count Mismatched/Uncatalogued
providers and would lose visibility under the stricter predicate.

Scope correction to RoutingEligible: removed HashStatus
Mismatch/Invalid exclusion. HashStatus filtering is operator-
configurable (Tier-2 hash enforcement off => mismatches still route)
and a non-config-aware predicate cannot model that. Buyer routing
already calls tier2ProviderExcludedStatus alongside RoutingEligible
when hash enforcement is active. Predicate now exclusively concerns
credential trust (AuthState) + slot availability. Pool unit test
inverted (Mismatch/Invalid MUST route through the predicate).

Regression tests added:
- TestChatCompletionsExcludesAuthBearerlessDuplicate
  (503 + no upstream + no provider header)
- TestChatCompletionsPinnedHeaderRejectsAuthBearerlessDuplicate
  (hard-pin path)
- TestModelsExcludesAuthBearerlessDuplicateFromCapacity
  (provider_count + total_slots exclude bearer-less)
- TestSelfServeProvisionalTokenAdmitsTokenlessOnAuthResponseV2WhenActiveTokenExistsRetentionCleanup
  (SPEC-010 catalog fields force retainSpec010=true; verifies
  R-7.9.7 defer releases the retention entry on duplicate-admit)

Eviction-defense test (code MINOR-1): positive AuthBearerValidated
assertion replacing the prior negation-only check. Future regressions
landing empty AuthState (pre-FR-C9 conflation) cannot pass silently.

DECISION_CRITERIA Entry 67 reformatted for skimmability (no semantic
change). Entry 68 added documenting fix-pass-4.

Deferred to follow-up tracking issue (not bundled to keep PR
reviewable): Phase 5 gateway /poolz decode + auth_state aggregation
(SPEC-002 contract), AuthMintFailed enum for DB-failure
observability, coordinator-cli pre-flip-audit --max-last-used-age=24h
command, explorer auth_state exposure.

Full phase4-coordinator suite green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@antfleet-ops


fix-pass-4 pushed in 51c1a18. Architectural changes:

  • Routing/capacity sites (handleModels, chat-completions, hard-pin) delegate to p.RoutingEligible()
  • Observability sites (Tier-2 metadata, hash verification) renamed to hasAvailableSlot with explicit AuthBearerlessDuplicate exclusion (preserves HashStatus branching needed for operator visibility)
  • pool.Provider.RoutingEligible() scope corrected — HashStatus filtering removed (was overaggressive when Tier-2 hash enforcement is operator-disabled). Predicate now exclusively concerns credential trust (AuthState) + slot availability. Tier-2 hash filtering stays in config-aware buyer code.
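One way to picture the layering after the scope correction: the pool predicate knows only credential trust and slots, and the buyer side adds hash filtering when enforcement is on. This is a sketch under assumptions. The `tier2ProviderExcludedStatus` signature, the `HashStatus` values, and `candidates` are hypothetical stand-ins for the buyer code.

```go
package main

type AuthState string

const (
	AuthBearerValidated     AuthState = "bearer_validated"
	AuthBearerlessDuplicate AuthState = "bearerless_duplicate"
)

type HashStatus string

const (
	HashVerified HashStatus = "verified" // hypothetical value
	HashMismatch HashStatus = "mismatch"
	HashInvalid  HashStatus = "invalid"
)

// Provider is a cut-down stand-in for pool.Provider.
type Provider struct {
	ID         string
	AuthState  AuthState
	HashStatus HashStatus
	SlotsFree  int
}

// RoutingEligible covers credential trust and slot availability only.
// It deliberately ignores HashStatus.
func (p Provider) RoutingEligible() bool {
	return p.AuthState != AuthBearerlessDuplicate && p.SlotsFree > 0
}

// tier2ProviderExcludedStatus is sketched with an assumed signature.
func tier2ProviderExcludedStatus(s HashStatus) bool {
	return s == HashMismatch || s == HashInvalid
}

// candidates is the config-aware buyer-side filter: hash status only
// matters when the operator has enabled Tier-2 hash enforcement.
func candidates(pool []Provider, enforceHash bool) []string {
	var ids []string
	for _, p := range pool {
		if !p.RoutingEligible() {
			continue
		}
		if enforceHash && tier2ProviderExcludedStatus(p.HashStatus) {
			continue
		}
		ids = append(ids, p.ID)
	}
	return ids
}
```

With enforcement off, a mismatched provider still routes while a bearer-less duplicate never does. Turning enforcement on removes the mismatched provider without touching the pool predicate.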

Three buyer regression tests added:

  • TestChatCompletionsExcludesAuthBearerlessDuplicate (503, no upstream invocation, no provider header)
  • TestChatCompletionsPinnedHeaderRejectsAuthBearerlessDuplicate (hard-pin path)
  • TestModelsExcludesAuthBearerlessDuplicateFromCapacity (capacity exclusion)

Code MINOR-1 / MINOR-2 / NIT addressed:

  • Eviction-defense test: positive AuthBearerValidated assertion (no more silent empty-state regression)
  • v2 retention-cleanup variant: SPEC-010 catalog fields force retainSpec010=true, verifies R-7.9.7 defer on duplicate-admit
  • Entry 67 reformatted; Entry 68 added documenting fix-pass-4

Architect's larger items deferred to tracking issue #82 (Phase 5 gateway aggregation, AuthMintFailed enum, coordinator-cli pre-flip-audit command, explorer auth_state) — kept out of this PR to maintain reviewability and atomicity.

Full phase4-coordinator test suite green.

This commit reconciles two parallel v0.8.3 drafts of FR-C9.4 and adds
fix-pass-5 hardening from a three-codex-auditor pass on the composition.

## Composition (DECISION_CRITERIA Entry 71)

PR #78 (merged to main): self-heal-on-NULL-row + strict-reject-on-USED-row,
keeping HasActiveTokenForProvider in the TokenIssuer interface.

PR #69 (this branch, six revisions): pool.AuthState first-class enum,
pool.Provider.RoutingEligible() single authority, Registry eviction
defense, buyer-side routing refactor delegating to RoutingEligible.

The composed v0.8.4 contract:

- NULL last_used_at row                  -> self-heal: revoke + remint (AuthSelfMinted)
- NOT NULL last_used_at row              -> strict TOFU reject
- DB error in revoke/has-active          -> fail-closed RejectTOFU
- IssueToken ErrActiveTokenAlreadyExists -> admit AuthBearerlessDuplicate (race-loss)
- IssueToken transient DB error          -> fail-closed RejectTOFU + AuthMintFailed
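
The five cases above can be read as one decision function. The sketch below is an interpretation, not the coordinator's code: `decide`, `rowState`, the outcome names, and `errTransient` are hypothetical. The first-connect case with no row, and the empty AuthState on the first two reject paths, are assumptions because the table does not state them.

```go
package main

import "errors"

var ErrActiveTokenAlreadyExists = errors.New("active token already exists")

// errTransient is a hypothetical stand-in for any other store failure.
var errTransient = errors.New("transient db error")

type outcome int

const (
	outcomeIssued     outcome = iota // ack carries a freshly minted token
	outcomeSkip                      // admit without an ack token
	outcomeRejectTOFU                // close the connection
)

type AuthState string

const (
	AuthSelfMinted          AuthState = "self_minted"
	AuthBearerlessDuplicate AuthState = "bearerless_duplicate"
	AuthMintFailed          AuthState = "mint_failed"
)

type rowState int

const (
	rowNone   rowState = iota // no active row (assumed first connect)
	rowUnused                 // active row, last_used_at IS NULL
	rowUsed                   // active row, last_used_at NOT NULL
)

// decide maps the observed row state and store results to an outcome.
// storeErr covers the has-active lookup and the revoke step; issueErr is
// the result of the mint that follows a self-heal or a first connect.
func decide(row rowState, storeErr, issueErr error) (outcome, AuthState) {
	if storeErr != nil {
		return outcomeRejectTOFU, "" // fail closed
	}
	if row == rowUsed {
		return outcomeRejectTOFU, "" // strict TOFU reject
	}
	switch {
	case issueErr == nil:
		return outcomeIssued, AuthSelfMinted
	case errors.Is(issueErr, ErrActiveTokenAlreadyExists):
		return outcomeSkip, AuthBearerlessDuplicate // lost the mint race
	default:
		return outcomeRejectTOFU, AuthMintFailed // fail closed on mint failure
	}
}
```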

Four distinct log events for operator alerting:
- fr_c9_4_self_heal
- fr_c9_4_tofu_reject
- fr_c9_4_race_loss_admit_quarantined
- fr_c9_1_mint_failed

## fix-pass-5 (codex three-auditor pass on the composition)

Audit verdicts:
- code-reviewer: BLOCK MERGE (0/3/2/1)
- security-reviewer: BLOCK MERGE (0/2/1/0)
- architect: merge-ready after small in-PR cleanup (0/2/3/0)

Convergent findings closed in this commit:

A. SPEC-003 FR-C9.4(b) (line 582) rewritten: was pre-composition PR #69
   prose; now points back to the v0.8.4 table with the composed flow.

B. SPEC-003 changelog cross-reference corrected from "Entry 68 (this
   composition note)" to "Entry 71", with fix-pass-5 details added.

C. Eviction-defense test renamed
   TestSelfServeProvisionalTokenEvictionDefenseProtectsRoutableSession
   to TestSelfServeProvisionalTokenStrictRejectPreservesRoutableSession
   with comment explaining the actual branch exercised under composition
   (strict-reject closes the attacker BEFORE Register). Two new direct
   registry tests added in internal/pool that exercise the actual
   Register-layer defenses:
   - TestRegistryRefusesBearerlessDuplicateReplacement (PR #69 MAJOR-1)
   - TestRegistryRefusesNonBearerReplacementOfRoutableBearerValidated
     (fix-pass-5 MAJOR-2)

D. HasActiveTokenForProvider doc-comment in tokens.go:246 refreshed for
   v0.8.4 disambiguation role.

E. pool/provider.go:257 comment refreshed for broader eviction-defense
   semantics.

F. MAJOR (code-reviewer): atomic ValidateAndMarkTokenUsed closes the
   TOCTOU window between ValidateToken at WS upgrade and MarkTokenUsed
   in prepareProviderAdmission. Pre-fix, a concurrent tokenless self-
   heal during the window could revoke the legitimate provider's still-
   NULL row and mint a fresh bearer for an attacker. The atomic UPDATE-
   RETURNING stamps last_used_at in the same DB statement that validates
   the token, so RevokeUnusedTokenForProvider never sees a NULL row for
   a bearer that has just validated.

G. MAJOR (security-reviewer): AuthMintFailed enum + fail-closed on
   transient IssueToken DB errors. Pre-fix, the empty AuthState returned
   on DB-error path was treated as routable, amplifying a token-store
   outage into a routing-admission DoS where attackers could be admitted
   as fully-routable empty-state sessions. fix-pass-5 G surfaces the
   failure distinctly in /poolz observability (AuthMintFailed) AND
   closes the connection (RejectTOFU). Binary recovers cleanly on next
   reconnect since no row was written.

H. MAJOR (security-reviewer + architect Steelman): extended Register
   eviction defense to refuse non-Bearer-validated replacement of an
   existing routable Bearer-validated session. Pre-fix-pass-5, a
   concurrent tokenless self-heal during a victim's first-mint window
   would mint a fresh bearer and then register as AuthSelfMinted, last-
   writer-wins evicting the victim's validated session. New rule:
   AuthSelfMinted / AuthMintFailed / empty incoming MUST NOT replace an
   existing routable AuthBearerValidated session. Bearer-validated
   incoming connects always succeed (proof-of-bearer is independently
   strong).

## Deferred to tracking issue #82

- Phase 5 gateway /poolz aggregation decode + auth_state exclusion
  (architect MAJOR-2; SPEC-002 contract surface)
- v2 strict-reject mirror test (code MINOR-2)
- coordinator-cli pre-flip-audit CLI (architect Followup 3)
- explorer auth_state exposure (architect Followup 5)

## Test coverage

- internal/auth: all green
- internal/pool: all green (incl. 2 new direct registry-layer tests)
- internal/ws: all green (incl. updated mismatch-token test +
  renamed strict-reject test + new fail-closed mint-failure test)
- internal/buyer: all green (incl. 3 fix-pass-4 routing tests)
- Full coordinator suite: 14 packages green under -count=1 + race
  detector clean on auth/pool/ws/buyer

## SPEC-003 to v0.8.4

Composed normative body with FR-C9.4 table covering all five cases.
Storage primitive ValidateAndMarkTokenUsed documented. Pool primitive
extended eviction defense documented. AuthMintFailed enum documented.
Operator flag-flip runbook updated for AuthBearerlessDuplicate +
unproven AuthSelfMinted gates.

DECISION_CRITERIA Entries 67 (PR #78), 69 (was 67), 70 (was 68), 71
(composition) document the full arc.

Pearl operational state unchanged (still on v1.3.0-24 rollback;
2 orphan tokens still in DB awaiting future revoke-before-redeploy).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@antfleet-ops


fix-pass-5 pushed in 9c78355. Composed v0.8.4 merge with three-codex-auditor closure.

Convergent A-E (doc/test cleanup):

  • SPEC-003 §FR-C9.4(b) rewrite + changelog Entry 68 → 71 cross-ref fix
  • TokenIssuer interface doc-comment refresh
  • Eviction-defense test renamed to reflect actual fired branch under composition (strict-reject path); added two direct registry-level tests in internal/pool for the actual Register-layer defenses

Code MAJOR-1 (F): atomic ValidateAndMarkTokenUsed.
Closes the TOCTOU window between ValidateToken at WS upgrade and MarkTokenUsed in prepareProviderAdmission where a concurrent tokenless self-heal could revoke the legitimate provider's still-NULL row and mint a fresh bearer for an attacker. Single UPDATE-RETURNING in auth.Store.ValidateAndMarkTokenUsed; called at upgrade time. prepareProviderAdmission MarkTokenUsed call removed (with explanatory comment).
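
To illustrate the ordering guarantee, here is an in-memory sketch in which a mutex stands in for the single UPDATE-RETURNING statement. `Store`, `NewStore`, and `Add` are hypothetical; the real method lives on auth.Store and runs against the database.

```go
package main

import (
	"sync"
	"time"
)

type tokenRow struct {
	providerID string
	revoked    bool
	lastUsedAt *time.Time // nil models last_used_at IS NULL
}

// Store is an in-memory stand-in for the token table.
type Store struct {
	mu   sync.Mutex
	rows map[string]*tokenRow // keyed by token
}

func NewStore() *Store { return &Store{rows: map[string]*tokenRow{}} }

// Add inserts an unused active row (test helper for this sketch).
func (s *Store) Add(token, providerID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rows[token] = &tokenRow{providerID: providerID}
}

// ValidateAndMarkTokenUsed validates the token and stamps last_used_at in
// one critical section, so no other operation can observe the row as
// "valid but never used" after a successful validation.
func (s *Store) ValidateAndMarkTokenUsed(token string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	row, ok := s.rows[token]
	if !ok || row.revoked {
		return "", false
	}
	now := time.Now()
	row.lastUsedAt = &now
	return row.providerID, true
}

// RevokeUnusedTokenForProvider revokes an active row only if it was never used.
func (s *Store) RevokeUnusedTokenForProvider(providerID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, row := range s.rows {
		if row.providerID == providerID && !row.revoked && row.lastUsedAt == nil {
			row.revoked = true
			return true
		}
	}
	return false
}
```

Once validation has succeeded, a later self-heal attempt finds no NULL row to revoke. If the revoke runs first, the validation simply fails, so either order leaves a consistent state.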

Security MAJOR-1 (G): AuthMintFailed enum + fail-closed.
pool.AuthState gains AuthMintFailed (4th value). Transient IssueToken DB errors now return provisionalTokenRejectTOFU + AuthMintFailed instead of Skip + "". Closes the DoS-amplification vector where empty AuthState was treated as routable.

Security MAJOR-2 + Architect Steelman (H): extended Register eviction defense.
pool.Registry.Register now refuses non-Bearer-validated replacement of an existing routable Bearer-validated session. Bearer-validated reconnects always succeed (proof-of-bearer is independently strong).
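
The extended rule can be written as a pure predicate. In this sketch `Session`, its `Routable` flag (standing in for RoutingEligible()), and `mayReplace` are hypothetical names. Letting a self-minted session replace another non-validated session is an assumption that keeps the prior last-writer-wins behavior for cases the rule does not name.

```go
package main

type AuthState string

const (
	AuthBearerValidated     AuthState = "bearer_validated"
	AuthSelfMinted          AuthState = "self_minted"
	AuthBearerlessDuplicate AuthState = "bearerless_duplicate"
	AuthMintFailed          AuthState = "mint_failed"
)

// Session is a cut-down view of an existing registry entry.
type Session struct {
	AuthState AuthState
	Routable  bool
}

// mayReplace reports whether an incoming connect with the given auth state
// may take over the slot currently held by existing (nil means empty slot).
func mayReplace(existing *Session, incoming AuthState) bool {
	if existing == nil {
		return true // nothing to evict
	}
	if incoming == AuthBearerValidated {
		return true // proof-of-bearer always wins
	}
	if incoming == AuthBearerlessDuplicate {
		return false // never evicts an existing session
	}
	// AuthSelfMinted, AuthMintFailed, or empty incoming state.
	if existing.AuthState == AuthBearerValidated && existing.Routable {
		return false
	}
	return true
}
```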

Test coverage:

  • internal/auth/pool/ws/buyer: all green under -count=1
  • Race detector clean on auth/pool/ws/buyer
  • 2 new direct registry tests: TestRegistryRefusesBearerlessDuplicateReplacement + TestRegistryRefusesNonBearerReplacementOfRoutableBearerValidated
  • 1 new fail-closed test: TestSelfServeProvisionalTokenMintFailureFailsClosed (replaces the pre-fix-pass-5 tolerated variant)
  • Renamed: TestSelfServeProvisionalTokenStrictRejectPreservesRoutableSession

Deferred to tracking issue #82:

  • Phase 5 gateway /poolz aggregation decode + auth_state exclusion (architect MAJOR-2)
  • v2 strict-reject mirror test (code MINOR-2)
  • coordinator-cli pre-flip-audit CLI (architect Followup 3)
  • explorer auth_state exposure (architect Followup 5)

PR state: GitHub now shows CONFLICTING because more PRs landed on main during fix-pass-5 (M3 batch). Files touched on both sides: DECISION_CRITERIA, auth/tokens.go, buyer/server.go, pool/provider.go, ws/server.go, SPEC-003. Another merge resolution will be needed before merge.

Second merge of origin/main into feat/tofu-settling-gate. While fix-pass-5
was in flight, 7 PRs landed on main: M3-8b (dead Swift legacy branch),
M3-8c (capacity_tier JSON serializer), M3-8d (tier2 catalog DI), M3-1
(batched retention DELETEs), M3-2 (operator-key split), Swift CI bump,
HFClient redirect guard + model fit + browse subcommand.

Composition conflicts resolved:

- specs/SPEC-003-open-onboarding.md: main bumped to v0.9.1 (FR-D2.1
  installer hardening + MoE NxMB parser). Our branch had v0.8.4
  (FR-C9.4 composed contract + fix-pass-5). Bumped to v0.9.2 with
  changelog entry tying the two normative threads together; both are
  preserved in normative body (FR-D2 and FR-C9 don't overlap).

- beta/DECISION_CRITERIA.md: main claimed Entries 69, 70, 71, 73 with
  M3 work; also has two Entry 71s already (M3-1 and M3-8d collision
  on main). Our entries renumbered through 67->69->74 (fix-pass-3),
  68->70->75 (fix-pass-4), 71->76 (composition + fix-pass-5 inline
  description). Renumber note explicit in each entry body.

- phase4-coordinator/internal/{auth/tokens.go, buyer/server.go,
  pool/provider.go, ws/server.go}: auto-merged cleanly. No semantic
  conflicts because M3 work was on orthogonal surfaces (Swift code,
  capacity_tier serializer, tier2 DI, operator-key plumbing) vs the
  composed v0.9.2 FR-C9.4 path we changed.

Full coordinator test suite (14 packages) green under -count=1.

Pearl operational state unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Augustas11 merged commit 10c56b8 into main on Jun 12, 2026
3 checks passed


2 participants