fix(agent): harden chain-driven VM reconcile robustness (audit follow-ups)#934
fix(agent): harden chain-driven VM reconcile robustness (audit follow-ups)#934branarakic wants to merge 1 commit into
Conversation
| // so different nodes could otherwise pick DIFFERENT ops and diverge the | ||
| // promoted snapshot's provenance. Sorting candidate ops by their (stable, | ||
| // content-independent) op URI makes every node converge on the same one (#15). | ||
| const opsSorted = [...rootsByOp.entries()].sort( |
There was a problem hiding this comment.
🔴 Bug: sorting only the workspace-operation URI does not make this path deterministic for multi-root KCs. The roots array still preserves SPARQL binding order, and later in promoteSharedMemoryToCanonical those roots are consumed positionally to assign tokenIds. Two replicas can therefore pick the same op here but still materialize different rootEntity→tokenId mappings. Please sort the per-op roots list as well (or derive a canonical KA order) before returning it.
| // their `dkg:authoredBy` triples (#14). Fall back to a deterministic id | ||
| // derived from the KC merkle root, which every node sweeping this KC | ||
| // converges on identically. | ||
| const publicationId = txHash || `chain:${ethers.hexlify(merkleRoot)}`; |
There was a problem hiding this comment.
🔴 Bug: falling back to chain:${merkleRoot} collapses distinct publish events that happen to carry identical content into one dkg:Publication subject. A re-publish of the same KC would then merge/overwrite provenance (dkg:authoredBy, context graph, linked KAs) instead of representing a separate publication event. Use a chain-unique identifier on this path (kaId/batchId, or thread the actual tx hash through reconcile) rather than a content hash.
| let headUnavailable = false; | ||
| try { | ||
| headBlock = await deps.getHeadBlock(); | ||
| } catch { |
There was a problem hiding this comment.
🟡 Issue: this catch now hides every getHeadBlock() failure without any diagnostic. In the degraded mode introduced here the sweep can keep reconciling ordinals while the watermark never advances, but operators have no signal that the root cause is an unobservable chain head. Please log the exception or include an explicit headUnavailable reason in the held-watermark log.
|
Thanks — addressed in 54bb061:
|
There was a problem hiding this comment.
Codex review produced 3 comment(s) but all targeted lines outside the diff and were dropped. Check the workflow logs for details.
…-ups) Follow-up to the #927 squash merge, closing the correctness + coverage findings from the original stacked PRs (#906/#908/#910/#911/#912/#914) that weren't carried into #927. Rebased onto main after #941 split the DKGAgent god class into subsystem mixin holders; the dkg-agent.ts hunks are relocated into their new homes (dkg-agent-swm-host.ts: #35/#13 + trackCoreHostRecording; dkg-agent-cg-resolve.ts: #26 parser delegate; dkg-agent-base.ts: coreHostRecordings field; dkg-agent.ts: #30 stop() flush; dkg-agent-lifecycle.ts: tracked ACK pre-sign call site). Correctness: - #13 reorg-safety: getHeadBlock lets a transient RPC error throw instead of masquerading as "no chain"; reconciler holds the contiguous watermark until a real head is observed (+ head-fetch diagnostics). - #35 access-policy existence: recordCoreHostedPublicCg resolves the policy via readLiveOnChainAccessPolicy (existence-gated) so an UNKNOWN CG no longer looks public(0). Fails closed. - #30 coreHosted persistence is tracked + flushed in stop() instead of fire-and-forget. - #14 sweep provenance: derive a chain-unique publication id from the KC UAL when txHash is empty on the chain-driven path. - #15 snapshot discriminator: sort WorkspaceOperation candidates by URI so every node converges on the same op (per-root tokenId order tracked in #936). - #8 warm-core unpin: only count successful unpins; retain peer on failure. Test coverage: - #26 parsePipeDelimitedSyncRequest extracted to request-build.ts as a pure parser with round-trip + misparse-guard tests. - #27 requester-side sinceBatchId threading (sync-checkpoint-key + durable-sync-since-threading suites). - #37 typed boolean keepRootCopyOnLabel recovery in chain-reconcile-e2e. Co-authored-by: Cursor <cursoragent@cursor.com>
54bb061 to
f41519e
Compare
| // Tracked (not bare `void`) so a rejection is logged and a | ||
| // graceful stop() flushes the persist rather than losing the | ||
| // host-only `coreHosted` flag as an unhandled promise (#30). | ||
| this.trackCoreHostRecording(this.recordCoreHostedPublicCg(cgId, swmGraphId)); |
There was a problem hiding this comment.
🔴 Bug: Tracking recordCoreHostedPublicCg() here still does not make the coreHosted persist flushable on shutdown. recordCoreHostedPublicCg() returns after setContextGraphSubscription(), but that method persists via void contextGraphSubscriptionStore.save(...), so the promise you track can settle before the actual save finishes. stop() can therefore await coreHostRecordings and still exit before the row is on disk. Please thread an awaitable persistence promise through this path (or add a dedicated awaited save) before relying on it for restart safety.
| // host-only `coreHosted` flag is persisted before we tear down, rather | ||
| // than being lost to a fire-and-forget launched from the ACK pre-sign hook. | ||
| if (this.coreHostRecordings.size > 0) { | ||
| await Promise.allSettled([...this.coreHostRecordings]); |
There was a problem hiding this comment.
🔴 Bug: This only awaits a snapshot of the current set. New StorageACKs can still arrive and call trackCoreHostRecording() until node.stop() happens later in stop(), so late recordings are not included in this Promise.allSettled(...) and can still lose the coreHosted flag during shutdown. Please quiesce ACK handling before draining, or loop until the set stays empty after transport teardown.
| // `dkg:Publication` subject and overwrite each other's provenance — the | ||
| // failure mode of a content-addressable merkle-root id) AND convergent | ||
| // across replicas (every node sweeping this KC sees the same UAL). | ||
| const publicationId = txHash || `chain:${ual}`; |
There was a problem hiding this comment.
🔴 Bug: ual is an asset identifier, not a publication identifier. Using chain:${ual} here collapses every sweep-reconciled publish/update of the same KA onto one dkg:Publication subject, so later author/context/merkleRoot triples overwrite earlier publication provenance. This changes the contract from "one subject per publication" to "one subject per asset". Please use a true publication discriminator (for example batch/tx/log identity) or skip emitting the dkg:Publication block when reconcile cannot recover one.
|
Coordination note (rc.16): this PR and #980 (rc.16 |
|
Superseded by #1144 (A4 on I folded the still-relevant #934 hardening into A4: transient head-fetch watermark hold, existence-gated core-host recording, tracked core-host persistence on shutdown, deterministic finalization op selection, failed warm-core unpin retry, typed keep-root coverage, and sync checkpoint/since threading tests. Two #934 pieces were intentionally not folded: the sync parser extraction would regress the A3 session/since parser shape, and the publication-provenance fallback is superseded by the #1155 per-KA metadata trim already in integration. |
Summary
Follow-up to the squash merge #927 (
integrate/core-sync-vm→main). The audit of the six original stacked PRs (#906/#908/#910/#911/#912/#914) surfaced correctness and test-coverage findings that weren't carried into #927. The cheap input-parsing fixes (#20/#21) already shipped in #932. This PR closes the remaining correctness + coverage findings, scoped entirely topackages/agent/*.Correctness fixes
getHeadBlocknow lets a transient RPC error throw instead of returningundefined(which the reconciler reads as "no chain"). On a head blip the reconciler still promotes the ordinal but holds the contiguous watermark until a real head is observed, so a reorg can't bury an ordinal we already absorbed. (chain-reconciler.ts, dkg-agent.ts)recordCoreHostedPublicCgresolves the CG policy viareadLiveOnChainAccessPolicy(existence-gated onisContextGraphActiveOnChain). Solidity returns0(public) for UNKNOWN ids, so a never-registered CG would otherwise be persisted ascoreHosted. Fails closed. (dkg-agent.ts)recordCoreHostedPublicCgpromises are tracked and flushed instop()rather than fire-and-forget. (dkg-agent.ts)publishOperationId(chain:<merkleRoot>) whentxHashis empty on the chain-driven path, sodkg:authoredByprovenance is still emitted. (finalization-handler.ts)WorkspaceOperationcandidates by URI before merkle-matching so every node deterministically converges on the same op when content is identical. (finalization-handler.ts)unpin(); on failure keep the peer in thewarmedset so the next pass retries (addsunpinFailedto the log line). (warm-core-connections.ts)Test-coverage gaps closed
parsePipeDelimitedSyncRequestintorequest-build.tsas a pure, round-trippable parser; added encode/parse round-trip + misparse-guard tests (workspace:prefix,snapshotsuffix, CG literally named "since").sinceBatchIdthreading: newsync-checkpoint-key.test.ts(checkpoint key scoped bysinceBatchId) anddurable-sync-since-threading.test.ts(sinceflows only to the DATA fetch, not META).chain-reconcile-e2enow also recovers a typed booleankeepRootCopyOnLabelliteral ("true"^^xsd:boolean) and dual-writes.Plus regression tests for #13 (transient-head watermark hold), #8 (failed unpin retained), and #35 (UNKNOWN CG not marked).
Test plan
chain-reconciler,warm-core-connections,core-fills-gap,chain-reconcile-e2e,sync-envelope-cursor,sync-checkpoint-key,durable-sync-since-threading.Scoped to
packages/agent/*; no overlap with #932 (node-ui).Made with Cursor