Skip to content

fix(agent): harden chain-driven VM reconcile robustness (audit follow-ups)#934

Closed
branarakic wants to merge 1 commit into
mainfrom
fix/core-sync-vm-reconcile-robustness
Closed

fix(agent): harden chain-driven VM reconcile robustness (audit follow-ups)#934
branarakic wants to merge 1 commit into
mainfrom
fix/core-sync-vm-reconcile-robustness

Conversation

@branarakic

Copy link
Copy Markdown
Contributor

Summary

Follow-up to the squash merge #927 (integrate/core-sync-vmmain). The audit of the six original stacked PRs (#906/#908/#910/#911/#912/#914) surfaced correctness and test-coverage findings that weren't carried into #927. The cheap input-parsing fixes (#20/#21) already shipped in #932. This PR closes the remaining correctness + coverage findings, scoped entirely to packages/agent/*.

Correctness fixes

Test-coverage gaps closed

Plus regression tests for #13 (transient-head watermark hold), #8 (failed unpin retained), and #35 (UNKNOWN CG not marked).

Test plan

  • CI: agent unit suite (local execution was blocked by a host CPU storm — Cursor's workspace indexer pegging all cores — so CI is the validation gate here; lint is clean locally).
  • Targeted: chain-reconciler, warm-core-connections, core-fills-gap, chain-reconcile-e2e, sync-envelope-cursor, sync-checkpoint-key, durable-sync-since-threading.

Scoped to packages/agent/*; no overlap with #932 (node-ui).

Made with Cursor

// so different nodes could otherwise pick DIFFERENT ops and diverge the
// promoted snapshot's provenance. Sorting candidate ops by their (stable,
// content-independent) op URI makes every node converge on the same one (#15).
const opsSorted = [...rootsByOp.entries()].sort(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: sorting only the workspace-operation URI does not make this path deterministic for multi-root KCs. The roots array still preserves SPARQL binding order, and later in promoteSharedMemoryToCanonical those roots are consumed positionally to assign tokenIds. Two replicas can therefore pick the same op here but still materialize different rootEntity→tokenId mappings. Please sort the per-op roots list as well (or derive a canonical KA order) before returning it.

// their `dkg:authoredBy` triples (#14). Fall back to a deterministic id
// derived from the KC merkle root, which every node sweeping this KC
// converges on identically.
const publicationId = txHash || `chain:${ethers.hexlify(merkleRoot)}`;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: falling back to chain:${merkleRoot} collapses distinct publish events that happen to carry identical content into one dkg:Publication subject. A re-publish of the same KC would then merge/overwrite provenance (dkg:authoredBy, context graph, linked KAs) instead of representing a separate publication event. Use a chain-unique identifier on this path (kaId/batchId, or thread the actual tx hash through reconcile) rather than a content hash.

Comment thread packages/agent/src/chain-reconciler.ts Outdated
let headUnavailable = false;
try {
headBlock = await deps.getHeadBlock();
} catch {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: this catch now hides every getHeadBlock() failure without any diagnostic. In the degraded mode introduced here the sweep can keep reconciling ordinals while the watermark never advances, but operators have no signal that the root cause is an unobservable chain head. Please log the exception or include an explicit headUnavailable reason in the held-watermark log.

@branarakic

Copy link
Copy Markdown
Contributor Author

Thanks — addressed in 54bb061:

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review produced 3 comment(s) but all targeted lines outside the diff and were dropped. Check the workflow logs for details.

…-ups)

Follow-up to the #927 squash merge, closing the correctness + coverage
findings from the original stacked PRs (#906/#908/#910/#911/#912/#914)
that weren't carried into #927. Rebased onto main after #941 split the
DKGAgent god class into subsystem mixin holders; the dkg-agent.ts hunks
are relocated into their new homes (dkg-agent-swm-host.ts: #35/#13 +
trackCoreHostRecording; dkg-agent-cg-resolve.ts: #26 parser delegate;
dkg-agent-base.ts: coreHostRecordings field; dkg-agent.ts: #30 stop()
flush; dkg-agent-lifecycle.ts: tracked ACK pre-sign call site).

Correctness:
- #13 reorg-safety: getHeadBlock lets a transient RPC error throw instead
  of masquerading as "no chain"; reconciler holds the contiguous watermark
  until a real head is observed (+ head-fetch diagnostics).
- #35 access-policy existence: recordCoreHostedPublicCg resolves the policy
  via readLiveOnChainAccessPolicy (existence-gated) so an UNKNOWN CG no
  longer looks public(0). Fails closed.
- #30 coreHosted persistence is tracked + flushed in stop() instead of
  fire-and-forget.
- #14 sweep provenance: derive a chain-unique publication id from the KC UAL
  when txHash is empty on the chain-driven path.
- #15 snapshot discriminator: sort WorkspaceOperation candidates by URI so
  every node converges on the same op (per-root tokenId order tracked in #936).
- #8 warm-core unpin: only count successful unpins; retain peer on failure.

Test coverage:
- #26 parsePipeDelimitedSyncRequest extracted to request-build.ts as a pure
  parser with round-trip + misparse-guard tests.
- #27 requester-side sinceBatchId threading (sync-checkpoint-key +
  durable-sync-since-threading suites).
- #37 typed boolean keepRootCopyOnLabel recovery in chain-reconcile-e2e.

Co-authored-by: Cursor <cursoragent@cursor.com>
@branarakic branarakic force-pushed the fix/core-sync-vm-reconcile-robustness branch from 54bb061 to f41519e Compare June 3, 2026 11:54
// Tracked (not bare `void`) so a rejection is logged and a
// graceful stop() flushes the persist rather than losing the
// host-only `coreHosted` flag as an unhandled promise (#30).
this.trackCoreHostRecording(this.recordCoreHostedPublicCg(cgId, swmGraphId));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: Tracking recordCoreHostedPublicCg() here still does not make the coreHosted persist flushable on shutdown. recordCoreHostedPublicCg() returns after setContextGraphSubscription(), but that method persists via void contextGraphSubscriptionStore.save(...), so the promise you track can settle before the actual save finishes. stop() can therefore await coreHostRecordings and still exit before the row is on disk. Please thread an awaitable persistence promise through this path (or add a dedicated awaited save) before relying on it for restart safety.

// host-only `coreHosted` flag is persisted before we tear down, rather
// than being lost to a fire-and-forget launched from the ACK pre-sign hook.
if (this.coreHostRecordings.size > 0) {
await Promise.allSettled([...this.coreHostRecordings]);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This only awaits a snapshot of the current set. New StorageACKs can still arrive and call trackCoreHostRecording() until node.stop() happens later in stop(), so late recordings are not included in this Promise.allSettled(...) and can still lose the coreHosted flag during shutdown. Please quiesce ACK handling before draining, or loop until the set stays empty after transport teardown.

// `dkg:Publication` subject and overwrite each other's provenance — the
// failure mode of a content-addressable merkle-root id) AND convergent
// across replicas (every node sweeping this KC sees the same UAL).
const publicationId = txHash || `chain:${ual}`;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: ual is an asset identifier, not a publication identifier. Using chain:${ual} here collapses every sweep-reconciled publish/update of the same KA onto one dkg:Publication subject, so later author/context/merkleRoot triples overwrite earlier publication provenance. This changes the contract from "one subject per publication" to "one subject per asset". Please use a true publication discriminator (for example batch/tx/log identity) or skip emitting the dkg:Publication block when reconcile cannot recover one.

@branarakic

Copy link
Copy Markdown
Contributor Author

Coordination note (rc.16): this PR and #980 (rc.16 integration/v10-devnet → main) both target main and both touch packages/agent/* (chain-reconciler.ts, dkg-agent.ts, finalization-handler.ts, dkg-agent-base.ts, dkg-agent-lifecycle.ts). Integration already carries the #927 core-sync-vm base (chain-reconciler.ts is present, same blob), so there's no missing-base issue — but the two will conflict on merge to main. Suggest landing #980 first, then rebasing this onto the rc.16 agent refactor and re-running the agent suite. Since this closes audit findings, it may also belong in the same review/audit pass as #980's R5/R3 gate.

@Jurij89

Jurij89 commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Superseded by #1144 (A4 on integration/issue-1138).

I folded the still-relevant #934 hardening into A4: transient head-fetch watermark hold, existence-gated core-host recording, tracked core-host persistence on shutdown, deterministic finalization op selection, failed warm-core unpin retry, typed keep-root coverage, and sync checkpoint/since threading tests.

Two #934 pieces were intentionally not folded: the sync parser extraction would regress the A3 session/since parser shape, and the publication-provenance fallback is superseded by the #1155 per-KA metadata trim already in integration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants