Skip to content

feat(agent): shared-memory recovery — REPLACE apply + ACL gate + endpoint#1173

Closed
branarakic wants to merge 2 commits into
feat/facet-open-servefrom
feat/shared-memory-recovery
Closed

feat(agent): shared-memory recovery — REPLACE apply + ACL gate + endpoint#1173
branarakic wants to merge 2 commits into
feat/facet-open-servefrom
feat/shared-memory-recovery

Conversation

@branarakic

@branarakic branarakic commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Part 5/5 · base: feat/facet-open-serve.

Shared-memory recovery driver — a member repairs a stale or corrupt private shared-memory copy by pulling the current state from one chosen authoritative peer.

  • REPLACE apply (applySwmRecovery) — per-KA-graph clear + reinsert, fixing the blind-union corruption where re-syncing status=v2 into a store holding v1 left both ({v1, v2}) permanently.
  • All-or-nothing — never REPLACE over an incomplete fetch (row-based pagination can cut a root mid-stream); on a partial fetch it mutates nothing and drops the checkpoint so the retry re-fetches the complete state.
  • Own checkpoint namespace — recovery can't clobber the background incremental-sync cursor. (The members-only auth gate it triggers lives in the facet PR this is based on.)
  • Explicit endpoint POST /api/context-graph/recover-shared-memory {contextGraphId, remotePeerId} — pulls from one chosen authoritative peer, never a blanket on-connect sweep.
  • scripts/devnet-test-swm-recovery.sh — multi-node regression proving REPLACE (v1 → v2, no {v1,v2} union).

recovery tests green. Please do not merge before review / out of order.

🤖 Generated with Claude Code

Depends on #1175 — please merge that first.

remotePeerId,
contextGraphId,
deadline: this.createContextGraphSyncDeadline(1),
fetchSyncPages: this.fetchSyncPages.bind(this),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this recovery path still goes through the normal sync transport, but nothing in the request envelope/responder differentiates it from ordinary SWM sync. The new isMemberRecoveryAuthorized helper is never consulted, so recovery is still authorized by authorizePrivateSyncRequest's participant/peer fallback instead of the intended hard-deny members-only gate. Please plumb an explicit recovery auth mode through the request/response path and enforce the fresh agent gate on the responder before serving plaintext SWM.

const graphManager = new GraphManager(this.store);
await graphManager.ensureContextGraph(cgId);
},
setCheckpoint: (key, offset) => this.syncCheckpoints.set(key, offset),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: recovery is reusing the same syncCheckpoints namespace as background SWM catch-up (remotePeerId|contextGraphId|swm|phase). If a normal sync already has saved progress for this peer/context graph, running recovery here can overwrite or delete that cursor and force the incremental sync to restart from offset 0. Give recovery its own checkpoint/session scope instead of mutating the shared incremental-sync state.

@branarakic branarakic force-pushed the feat/public-context-graph-projection branch from 49e0c6e to 3423171 Compare June 15, 2026 01:02
@branarakic branarakic force-pushed the feat/shared-memory-recovery branch from 3927262 to 70f45ad Compare June 15, 2026 01:02
@branarakic branarakic changed the base branch from feat/public-context-graph-projection to feat/facet-open-serve June 15, 2026 01:03

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review produced 4 comment(s) but all targeted lines outside the diff and were dropped. Check the workflow logs for details.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@branarakic branarakic force-pushed the feat/facet-open-serve branch from 76fcd31 to ccd5c2f Compare June 15, 2026 08:02
@branarakic branarakic force-pushed the feat/shared-memory-recovery branch from 70f45ad to 479a34b Compare June 15, 2026 08:02
// has no cross-invocation accumulator, so the retry must restart from
// offset 0 (which makes the responder re-read from the start of its row
// list) and rebuild the COMPLETE state before the apply gate can pass.
if (lastCheckpointKey !== undefined) deps.deleteCheckpoint(lastCheckpointKey);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: Dropping only the persisted checkpoint here does not actually force a clean retry. fetchSyncPages also keeps the unfinished responder session in its module-local cache keyed by the same checkpointKey, so a retry before that TTL expires will reuse the old syncSessionId with offset=0 and the responder will keep serving the cached pre-timeout row list instead of re-reading current state. Recovery can then converge to a stale snapshot. When abandoning a partial recovery, clear/rotate the responder-session state as well, not just the checkpoint offset.

const applied = await applySwmRecovery({
store: deps.store,
verifiedData: processed.verifiedData,
roots: processed.entityCreators,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: processed.entityCreators is not the full set of roots that need REPLACE semantics here. The verifier only adds an entry when it can recover a creator from meta, but verifiedData can still contain valid rows for roots whose op is missing dkg:publisherPeerId / literal prov:wasAttributedTo. Those roots will be inserted without first deleting the stale local rows, which reintroduces the {v1,v2} union corruption this recovery path is meant to fix. Build the roots list from all validated rootEntity subjects (or return that separately from the verifier), and use creators only for ownership-cache hydration.

const ownershipKey = sharedMemoryOwnershipKeyFromGraph(deps.contextGraphId, dataGraph);
if (!ownershipKey) continue;
const ownedMap = deps.ensureOwnedMap(ownershipKey);
if (!ownedMap.has(entity)) ownedMap.set(entity, creator);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: Recovery is supposed to repair stale local state, but this only hydrates missing ownership-cache entries. If the local cache already contains the wrong owner for a recovered root, the stale mapping survives and the next contended write is still arbitrated incorrectly even though the triples were repaired. Overwrite the cached owner for recovered roots (or clear the key before hydration) so the cache matches the authoritative source you just applied.

)

A partial recovery fetch deletes its checkpoint to restart from offset 0, but the
requester remembered the responder session under the recovery checkpointKey — so
the retry reused the old syncSessionId and the responder re-served its cached
pre-timeout row list, converging to a STALE snapshot (up to the session TTL old)
instead of current state. R10 already scopes recovery's responder session
separately; this completes the intent: a recovery fetch never persists a responder
session (timeout + error paths), so a retry mints a fresh id and the responder
re-reads. Bounded before (self-heals after TTL); now correct on the retry.

Adds the contrasting unit test (non-recovery reuses; recovery mints fresh).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
verifiedData: processed.verifiedData,
roots: processed.entityCreators,
});
if (processed.verifiedMeta.length > 0) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: Unlike runSharedMemorySync, this path never calls syncPublicSnapshotsForMeta before returning completed: true. Any recovered meta rows that reference dkg:publicSnapshotRef will point at snapshot payloads that were never fetched into publicSnapshotStore, so recovery can report success while leaving snapshot-backed SWM data unreadable/incomplete. Mirror the normal snapshot fetch/validation step here before treating recovery as complete.

const ownershipKey = sharedMemoryOwnershipKeyFromGraph(deps.contextGraphId, dataGraph);
if (!ownershipKey) continue;
const ownedMap = deps.ensureOwnedMap(ownershipKey);
if (!ownedMap.has(entity)) ownedMap.set(entity, creator);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Recovery only hydrates ownership entries when they are missing. If the local ownership cache already contains a stale creator for one of these roots, the recovered triples are correct but the next contended write is still arbitrated with the wrong owner. Since this endpoint is explicitly repairing stale/corrupt local SWM state, it should overwrite recovered ownership entries (or clear them first) rather than leaving existing values in place.

@branarakic

Copy link
Copy Markdown
Contributor Author

Closing as redundant — superseded by #1203 (integration/rfc49-full) + #1201, which landed the RFC-49 SWM/agent work on main.

Verified against current main: the stack tip #1193 has 0 residual (every file it touches is byte-identical in main), and this branch adds nothing not already in main. Any per-file delta is a superseded intermediate — e.g. a relocated contextGraphCatalogUri, or a pre-curator-ack dkg-publisher that main has since advanced past (main carries the confirmBeforeCommit curator-ack gate this branch lacks).

No unique unmerged work. Reopen if I've missed something.

@branarakic branarakic closed this Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant