feat(agent): shared-memory recovery — REPLACE apply + ACL gate + endpoint#1173
feat(agent): shared-memory recovery — REPLACE apply + ACL gate + endpoint#1173branarakic wants to merge 2 commits into
Conversation
| remotePeerId, | ||
| contextGraphId, | ||
| deadline: this.createContextGraphSyncDeadline(1), | ||
| fetchSyncPages: this.fetchSyncPages.bind(this), |
There was a problem hiding this comment.
🔴 Bug: this recovery path still goes through the normal sync transport, but nothing in the request envelope/responder differentiates it from ordinary SWM sync. The new isMemberRecoveryAuthorized helper is never consulted, so recovery is still authorized by authorizePrivateSyncRequest's participant/peer fallback instead of the intended hard-deny members-only gate. Please plumb an explicit recovery auth mode through the request/response path and enforce the fresh agent gate on the responder before serving plaintext SWM.
| const graphManager = new GraphManager(this.store); | ||
| await graphManager.ensureContextGraph(cgId); | ||
| }, | ||
| setCheckpoint: (key, offset) => this.syncCheckpoints.set(key, offset), |
There was a problem hiding this comment.
🔴 Bug: recovery is reusing the same syncCheckpoints namespace as background SWM catch-up (remotePeerId|contextGraphId|swm|phase). If a normal sync already has saved progress for this peer/context graph, running recovery here can overwrite or delete that cursor and force the incremental sync to restart from offset 0. Give recovery its own checkpoint/session scope instead of mutating the shared incremental-sync state.
49e0c6e to
3423171
Compare
3927262 to
70f45ad
Compare
There was a problem hiding this comment.
Codex review produced 4 comment(s) but all targeted lines outside the diff and were dropped. Check the workflow logs for details.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
76fcd31 to
ccd5c2f
Compare
70f45ad to
479a34b
Compare
| // has no cross-invocation accumulator, so the retry must restart from | ||
| // offset 0 (which makes the responder re-read from the start of its row | ||
| // list) and rebuild the COMPLETE state before the apply gate can pass. | ||
| if (lastCheckpointKey !== undefined) deps.deleteCheckpoint(lastCheckpointKey); |
There was a problem hiding this comment.
🔴 Bug: Dropping only the persisted checkpoint here does not actually force a clean retry. fetchSyncPages also keeps the unfinished responder session in its module-local cache keyed by the same checkpointKey, so a retry before that TTL expires will reuse the old syncSessionId with offset=0 and the responder will keep serving the cached pre-timeout row list instead of re-reading current state. Recovery can then converge to a stale snapshot. When abandoning a partial recovery, clear/rotate the responder-session state as well, not just the checkpoint offset.
| const applied = await applySwmRecovery({ | ||
| store: deps.store, | ||
| verifiedData: processed.verifiedData, | ||
| roots: processed.entityCreators, |
There was a problem hiding this comment.
🔴 Bug: processed.entityCreators is not the full set of roots that need REPLACE semantics here. The verifier only adds an entry when it can recover a creator from meta, but verifiedData can still contain valid rows for roots whose op is missing dkg:publisherPeerId / literal prov:wasAttributedTo. Those roots will be inserted without first deleting the stale local rows, which reintroduces the {v1,v2} union corruption this recovery path is meant to fix. Build the roots list from all validated rootEntity subjects (or return that separately from the verifier), and use creators only for ownership-cache hydration.
| const ownershipKey = sharedMemoryOwnershipKeyFromGraph(deps.contextGraphId, dataGraph); | ||
| if (!ownershipKey) continue; | ||
| const ownedMap = deps.ensureOwnedMap(ownershipKey); | ||
| if (!ownedMap.has(entity)) ownedMap.set(entity, creator); |
There was a problem hiding this comment.
🔴 Bug: Recovery is supposed to repair stale local state, but this only hydrates missing ownership-cache entries. If the local cache already contains the wrong owner for a recovered root, the stale mapping survives and the next contended write is still arbitrated incorrectly even though the triples were repaired. Overwrite the cached owner for recovered roots (or clear the key before hydration) so the cache matches the authoritative source you just applied.
) A partial recovery fetch deletes its checkpoint to restart from offset 0, but the requester remembered the responder session under the recovery checkpointKey — so the retry reused the old syncSessionId and the responder re-served its cached pre-timeout row list, converging to a STALE snapshot (up to the session TTL old) instead of current state. R10 already scopes recovery's responder session separately; this completes the intent: a recovery fetch never persists a responder session (timeout + error paths), so a retry mints a fresh id and the responder re-reads. Bounded before (self-heals after TTL); now correct on the retry. Adds the contrasting unit test (non-recovery reuses; recovery mints fresh). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| verifiedData: processed.verifiedData, | ||
| roots: processed.entityCreators, | ||
| }); | ||
| if (processed.verifiedMeta.length > 0) { |
There was a problem hiding this comment.
🔴 Bug: Unlike runSharedMemorySync, this path never calls syncPublicSnapshotsForMeta before returning completed: true. Any recovered meta rows that reference dkg:publicSnapshotRef will point at snapshot payloads that were never fetched into publicSnapshotStore, so recovery can report success while leaving snapshot-backed SWM data unreadable/incomplete. Mirror the normal snapshot fetch/validation step here before treating recovery as complete.
| const ownershipKey = sharedMemoryOwnershipKeyFromGraph(deps.contextGraphId, dataGraph); | ||
| if (!ownershipKey) continue; | ||
| const ownedMap = deps.ensureOwnedMap(ownershipKey); | ||
| if (!ownedMap.has(entity)) ownedMap.set(entity, creator); |
There was a problem hiding this comment.
🟡 Issue: Recovery only hydrates ownership entries when they are missing. If the local ownership cache already contains a stale creator for one of these roots, the recovered triples are correct but the next contended write is still arbitrated with the wrong owner. Since this endpoint is explicitly repairing stale/corrupt local SWM state, it should overwrite recovered ownership entries (or clear them first) rather than leaving existing values in place.
|
Closing as redundant — superseded by #1203 ( Verified against current No unique unmerged work. Reopen if I've missed something. |
Part 5/5 · base:
feat/facet-open-serve.Shared-memory recovery driver — a member repairs a stale or corrupt private shared-memory copy by pulling the current state from one chosen authoritative peer.
applySwmRecovery) — per-KA-graph clear + reinsert, fixing the blind-union corruption where re-syncingstatus=v2into a store holdingv1left both ({v1, v2}) permanently.POST /api/context-graph/recover-shared-memory {contextGraphId, remotePeerId}— pulls from one chosen authoritative peer, never a blanket on-connect sweep.scripts/devnet-test-swm-recovery.sh— multi-node regression proving REPLACE (v1 → v2, no{v1,v2}union).recovery tests green. Please do not merge before review / out of order.
🤖 Generated with Claude Code
Depends on #1175 — please merge that first.