fix: resolve all 14 issues from rc.17 fresh-main QA validation (#1093–#1106)#1107
Conversation
Full-surface QA on a real multi-node devnet surfaced 14 issues across publish, sync, query, lifecycle and the HTTP API. Every fix below was re-verified on a fresh devnet with real data after the change: - #1093 ACKCollector: partial peer discovery permanently bricked core publishes (QuorumUnmetError) — pool now refreshes from known core peers - #1094 KA edit loop: pull-from couldn't find VM quads and stale seals wedged re-finalize — broadened VM graph lookup + clear seal on pull-from - #1095 lifecycle descriptor: contradictory state + missing published event - #1096 memory/search could never match Verifiable Memory - #1097 documented one-shot create+publish flow returned 500 — publish now auto-promotes sealed WM drafts to SWM - #1098 VM divergence on replicas + rs kc-not-synced warn loop — read SWM across bucket and per-KA graphs, backoff for not-synced KCs - #1099 SWM clear-after-publish didn't propagate — drain per-KA SWM graphs on publisher, finalization replicas and update replicas - #1100 agent-profile gossip mis-decoded as publish requests (log spam) - #1101 wm/import-file silently skipped extraction for Markdown uploads - #1102 CG routes: accept id alias for contextGraphId consistently - #1103 sign-join: make sign-only contract explicit (forwarded:false + next hint pointing at request-join) - #1104 surface publishedUal in the lifecycle descriptor - #1105 query-remote denied everything by default, ignoring the CG's public on-chain accessPolicy - #1106 SKILL.md drift bundle: chat body aliases, WM query default agent wallet fallback, update semantics, sub-graph name rules, import-file form field docs Co-authored-by: Cursor <cursoragent@cursor.com>
- Rate-limit BEFORE the access check and cache isContextGraphPublic verdicts (60s TTL, size-capped) so unauthenticated traffic cannot burn RPC quota through the #1105 on-chain public-CG resolver (review 🔴 1). - ENTITY_BY_UAL now enforces the RESOLVED context graph's access policy post-resolution — on-chain-public CGs become reachable by UAL on default-deny configs, and the old leak (UAL reads passing through an explicitly denied CG whenever any other public CG existed) is closed (review 🔴 2). - Unscoped working-memory reads for the node default agent span BOTH the legacy peerId-keyed namespace and the rc.17+ wallet-keyed namespace via new same-identity agentAddressAliases, so neither era of WM drafts is stranded (review 🟡). Co-authored-by: Cursor <cursoragent@cursor.com>
Per manager request — every high-priority issue gets a test that reproduces it (it.fails / it.skip-with-recipe), so a fix can flip it green and stay in CI. Runnable it.fails repros (14): unit — #11 (op-wallets plaintext), #1121 #1122 (async-lift encryption + canonicalization), plus existing #184 #675 #757 devnet — #886 #1093 #1094 #1095 #1096 #1097 #1098 #1104 (devnet/issue-liveness/high-issues.test.ts; 11 pass = bugs live) Documented it.skip stubs with exact repro recipes (11) — where a faithful test needs a fixture/design/topology that doesn't exist yet (a wrong test is worse): #1091 #614 contract/design (grindable seed, billing-window sweep) #1124 host-mode sharded topology (devnet cores are all CG members) #1099 gossip-retention timing (repros on testnet, not fast local devnet) #1013 #936 publisher-runtime / 2-replica-reconcile harness #999 #1008 load-dependent store saturation (verified live on testnet) #723 emergent network-wide RS metric #462 MessageHandler ACL harness (skill_request has no authz) #1078 layer-scoped private-store API The 9 fix-in-flight highs (#886, #1093-#1099, #1104) are fixed on PR #1107 — when it merges their it.fails repros start passing → unwrap them. Full map in docs/testing/ISSUE_LIVENESS_TESTS.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ntentType) ACK candidate pool tracks the chain's runtime quorum (#1093 follow-up): - getACKCandidatePeers gated its confirmed-core shortcut on the hard-coded DEFAULT_REQUIRED_ACKS (3) while ACKCollector enforces the chain's runtime requiredACKs. On networks configured above 3 signatures, a 3-strong confirmed-core subset is still below quorum — returning only it re-introduced pool_below_quorum. - The V10 ACK provider now caches requiredACKs (lastKnownRequiredACKs) right after each getMinimumRequiredSignatures() read — BEFORE collector.collect() runs the getConnectedCorePeers callback — so the pool decision uses the real quorum even on the first publish. - New ack-candidate-pool.test.ts: default quorum, runtime quorum above (5/4) and below (2) the default, and self-exclusion. Import filename inference respects the opaque-blob escape hatch (#1101 follow-up): - An EXPLICIT contentType=application/octet-stream form field no longer gets overridden by filename-extension inference (notes.md was treated as Markdown anyway, removing the deliberate "store as opaque blob" path). Only the implicit default (no override; generic/absent part header) is eligible for inference. - The test harness (import-file-orchestration.shared.ts) now mirrors the daemon's full detection incl. #1101 inference; new regression tests for both the explicit-octet-stream escape hatch and the preserved implicit inference. Verified live on a clean 6-node devnet from this branch: - bootstrap completes ALL 10 publishes across core1+core2 confirmed (the #1093 fix area, with the quorum threading in place); - import-file with explicit octet-stream + notes.md → stored as blob, extraction skipped; same file with implicit octet-stream part header → text/markdown pipeline runs. Unit: import-file parts 01-09 92/92, ack-candidate-pool 5/5. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… red-while-live convention Adds single-process / single-Hardhat-node reproducing tests (run in the normal CI lanes) for five high/pre-mainnet issues that were previously only documented `it.skip` stubs, and switches the whole liveness suite to the standard "red while the bug is live, green when fixed" convention (plain failing `it()` instead of the inverted `it.fails`, which was green-while-broken). New CI tests (each authored against a known-fixed build and confirmed to flip green there, so a red is a genuine live bug, not a broken test): - #462 agent/issue-462-skill-acl.test.ts — an unauthorized (but signed) peer's skill_request is rejected and the handler does not run. Today there is no ACL → handler runs → RED. - #936 agent/issue-936-tokenid-determinism.test.ts — two replicas reconciling the same multi-root KC from chain (divergent oxigraph insertion orders) agree on the rootEntity→tokenId mapping. Today positional assignment over a store-dependent order makes them disagree → RED. - #1013 publisher/issue-1013-async-finalization-honesty.test.ts — a private publish that never reached chain (no storage ACKs) must NOT map to a finalized lift job. Today the mapper returns finalized/local → RED. - #1078 storage/issue-1078-private-layer-scope.test.ts — a root hydrates only the authoritative private slice, not a superseded commitment for the same root. Today the CG-level _private graph commingles both → RED. - #1091 random-sampling/e2e-hardhat-chain.test.ts — a node cannot predict its own RS challenge from public block data. Today the seed is reconstructed from block.difficulty/blockhash/sender and previewChallengeForSeed predicts the exact on-chain draw → RED. Convention flip (it.fails → it()) for the existing high-issue repros (#11, #184, #675, #757, #1121, #1122 + the devnet multi-node tier) so the suite is uniformly RED while bugs are live and GREEN once fixed — matching how the fix PRs (#1107, #1132) turn individual tests green as they merge. Doc rewritten (docs/testing/ISSUE_LIVENESS_TESTS.md): all 25 high issues mapped to a test across three tiers — 11 CI unit/integration, 8 devnet multi-node, and 6 honest pending-fixture/emergent stubs (#614 #1099 #1124 fixture-needed; #723 #999 #1008 emergent/load — a deterministic CI assertion there would be a false positive). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Heads-up from #1155 (per-KA metadata trim) — these are exactly the pre-mainnet fixes that should land, and several (#1098 replica VM convergence, #1099 SWM drain) harden the same SWM/per-KA-graph surfaces #1155 restructures, so folding them together strengthens both. One cross-track interaction to coordinate, on #1155 changes two things this PR's SWM logic touches:
#1107's #1098 |
# Conflicts: # packages/cli/skills/dkg-node/SKILL.md # packages/publisher/src/dkg-publisher.ts # packages/random-sampling/src/prover.ts
The merge of main into this branch auto-merged 26 of 29 files textually, but three PR fixes were built on lifecycle/metadata APIs that main has since reworked (RFC ka-metadata-trim, OT-RFC-49). This commit reconciles those auto-merged-but-stale changes so the merged tree builds and the full suite (incl. main's red-while-live repro tests) is green. dkg-agent-publish.ts: - #1095/#1104: main deleted `generateAssertionPublishedMetadata` (RFC ka-metadata-trim Phase 0 — its only caller was dead on main). Dropped the PR's `published` prov:Activity event minting that depended on it; main already stamps `dkg:state="published"` on this path (deriveStatus → vm-confirmed), so #1095's lifecycle-STATE fix is satisfied. Kept #1104's `publishedUal` stamp (a plain raw-quad write, independent of the removed function). - #1097: dropped the PR's "auto-promote a sealed-but-unstaged assertion before publish" block. main solved the same issue the opposite way — a finalized-but-unshared publish is an explicit caller precondition that surfaces "No quads in shared memory" and the vm/publish route maps to a clean 409 VM_PUBLISH_PRECONDITION. Auto-promoting defeats that precondition (regressed it to a 500), so main's explicit share→publish contract wins. knowledge-assets-route.test.ts: - #1094: flipped the red-while-live repro `pull-from {layer} 500s "No sealed entity list"` to its positive seeding leg, exactly as the test's own comment instructed ("when #1094 ships, this flips and must be upgraded"). PR #1107 is #1094's fix, so pull-from now re-opens the WM draft and re-seeds the sealed entities (200). Conflict resolutions (in the merge commit): prover.ts took main's version (the PR's #1098 log-dedup was built on the ciphertext-chunks model main removed under OT-RFC-49); dkg-publisher.ts kept BOTH main's `{ provenanceEvents }` arg and the PR's #1095 VM-version discard filter; SKILL.md took main's (more complete) two-fork publish-route docs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…test The merge of main into this branch produced 2 conflicts (resolved in the merge commit) plus one auto-merge artifact this commit fixes. Conflict resolutions (in the merge commit): - packages/query/src/dkg-query-engine.ts (working-memory view): combined main's same-identity alias span (PR #1107 review 🟡 — one prefix per agentAddressAlias) with #1132's sub-graph scoping (#184/#675 — the `${sg}` suffix), so the WM prefixes are `…${sg}/_working_memory/${addr}/` per alias. - packages/cli/test/notifications-route.test.ts: took main's version. main rewrote it from a mock-based unit test into a real-daemon integration test (sign-join→request-join→curator reject-join) that fully exercises the #757 curator gate; the PR's #757 route change (thread callerAddress into listPendingJoinRequests) auto-merged into notifications.ts and is covered. This commit: - packages/agent/test/messaging-chat-acl.test.ts: add `vi` to the vitest import. main's copy of this file (chat-ACL tests only) and the PR's net-new `skill_request ACL (GH #462)` describe block (which uses `vi.fn` via echoSkill) auto-merged, but the surviving import line was main's `{ describe, it, expect }` — so the #462 skill-ACL tests threw `ReferenceError: vi is not defined`. The #462 feature itself (MessageHandler.setSkillAcl + default-deny enforcement) is intact in the merged messaging.ts; only the test import needed reconciling. 15/15 pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Full-surface QA of fresh
main(rc.17) on a real 5-node devnet with real data surfaced 14 issues (#1093–#1106). This PR fixes all of them. Every fix was re-verified end-to-end on a clean devnet — reproduction first confirmed broken onmain, then confirmed fixed on this branch. All unit suites pass (agent 1225, cli 2018, publisher 1162).Fixes #1093, Fixes #1094, Fixes #1095, Fixes #1096, Fixes #1097, Fixes #1098, Fixes #1099, Fixes #1100, Fixes #1101, Fixes #1102, Fixes #1103, Fixes #1104, Fixes #1105, Fixes #1106
#1093 — P0: ACKCollector partial peer discovery permanently bricks core publishes
Repro: boot a multi-node devnet, publish from a core node shortly after startup →
QuorumUnmetError/pool_below_quorum, and the node never recovers — every subsequent publish fails too.What was wrong: the ACK candidate pool was built once from whatever peers happened to be discovered at that moment and never refreshed, so a slow-to-connect topology poisoned the pool forever.
Fix: the pool now re-resolves from
knownCorePeerIds(confirmed core nodes) on each collection attempt instead of caching a partial snapshot (packages/publisher/src/ack-collector.ts,packages/agent/src/dkg-agent.ts,sync-on-connect.ts).Verified: fresh devnet bootstrap + immediate publishes from multiple nodes — all
confirmed, zeroQuorumUnmetError/pool_below_quorumin any node's logs.#1094 — P0: no working path to update a published Knowledge Asset
Repro: publish a KA, then
wm/pull-from {layer:"vm"}→No VM quads found; after working around that,wm/finalize→already finalized with a different merkleRoot. The documented edit loop was completely broken.What was wrong: (1) the VM pull query only looked in the primary data graph, but published quads live in per-KA VM graphs (
…/_verifiable_memory/{author}/{number}) and context partitions; (2)pull-fromnever cleared the old assertion seal, so re-finalizing always collided with the stale merkleRoot.Fix: broadened the VM lookup to scan all VM graph locations, and
assertionPullFromnow deletes all seal predicates (canonical + legacy subjects) when re-opening the draft (packages/publisher/src/dkg-publisher.ts).Verified: full edit loop on devnet — pull-from VM → write new quads → finalize → share →
vm/publishreturnsstatus: confirmedas an UPDATE, and the updated content is readable from VM on publisher and replicas.#1095 — P0: lifecycle descriptor reports contradictory state, never records
publishedRepro: publish a KA, fetch its lifecycle descriptor → no
publishedevent; discard an unrelated WM draft with the same name → descriptor flips todiscardedwhile the KA is live on-chain.Fix: the descriptor now records the
publishedevent on confirmed publish and discard no longer clobbers post-publish state (packages/agent/src/dkg-agent-lifecycle.ts).Verified: descriptor shows
publishedevent with tx details after publish; discarding a WM draft no longer corrupts the state of the published KA.#1096 — P1:
/api/memory/searchcan never match Verifiable MemoryRepro: publish a KA,
POST /api/memory/searchscoped to VM → always 0 results even though/api/queryover VM returns rows.What was wrong: the search built graph prefixes that predate the uniform per-KA VM layout, and matched only
http://schema.orgwhile data useshttps://schema.org.Fix: corrected the VM graph prefix matching and normalized both schema.org schemes (
packages/cli/src/daemon/routes/memory.ts).Verified:
memory/searchover VM returns the published entities on devnet.#1097 — P1: documented one-shot publish flow returns 500
Repro:
wm/create(finalize) →vm/publish {assertionName}per SKILL.md → 500, because publish expected the data to already be in SWM while create only sealed it in WM.Fix:
publishFromFinalizedAssertionauto-promotes a sealed-but-unstaged WM draft to SWM before publishing (packages/agent/src/dkg-agent-publish.ts); SKILL.md updated to match.Verified: one-shot create + publish by assertion name →
status: confirmedon devnet.#1098 — P1: VM divergence on replicas +
rskc-not-syncedwarn loopRepro: publish on node1 with node2 subscribed → node2's VM never materializes the KA; meanwhile the Random Sampling prover floods logs with
kc-not-syncedretries every tick.What was wrong: the finalization handler read SWM quads only from the bare bucket graph, but gossiped SWM copies land in per-KA graphs — merkle verification saw 0 quads and promotion silently no-opped. The RS prover retried not-yet-synced KCs with no backoff.
Fix:
getSharedMemoryQuadsForRootsnow uses aCONSTRUCTspanning bucket + per-KA graphs viasharedMemoryReadBothFilter(preserving literal fidelity for the merkle recompute); the prover got exponential backoff forkc-not-synced(packages/agent/src/finalization-handler.ts,packages/random-sampling/src/prover.ts).Verified: late-joining nodes converge on the published VM data; no
kc-not-syncedspam in logs.#1099 — P1: SWM clear-after-publish does not propagate
Repro: publish from SWM → publisher clears its local SWM, but every replica keeps the stale copy forever and re-serves it to late subscribers via PROTOCOL_SYNC — even re-poisoning the publisher's own SWM on resync. The update path never cleared SWM at all, even locally.
Fix: extracted
clearPublishedSwmRoots(drains bucket + all per-KA SWM graphs + ownership/meta rows) and wired it into: the mint path, the update path on the publisher (dkg-agent-publish.ts), finalization replicas (finalization-handler.ts), and update replicas (update-handler.ts, including detaching roots fromWorkspaceOperationrows so the sync responder stops serving them).Verified: after publish + update, SWM row count for the roots is 0 on publisher and replicas, and a late subscriber receives no stale SWM content.
#1100 — P1: agent-profile gossip mis-decoded as publish requests (log spam)
Repro: run any 2+ node network → persistent WARN spam
…does not match topicas profile gossip was fed to the publish decoder.Fix: gossip handler dispatches by topic before decoding (
packages/agent/src/gossip-publish-handler.ts).Verified: clean devnet logs — zero mis-decode warnings across a full test session.
#1101 — P1:
wm/import-filesilently skips extraction for MarkdownRepro:
curl -F file=@notes.md …/api/knowledge-assets/import-file→ upload accepted but extraction silently skipped (curl sendsapplication/octet-stream).Fix: content-type detection falls back to the file extension, and skipped extractions now report an explicit skip reason (
packages/cli/src/daemon/routes/knowledge-assets-import.ts).Verified: Markdown upload reports
detectedContentType: text/markdown,extraction: completed, and extracted entities are queryable.#1102 — P2: inconsistent CG route body params (
idvscontextGraphId)Repro:
context-graph/renameand/subscriberejected the param name that sibling routes accepted.Fix: all CG routes accept
idas an alias forcontextGraphId(packages/cli/src/daemon/routes/context-graph.ts); SKILL.md documents the alias.Verified: both spellings work on rename and subscribe.
#1103 — P2:
sign-joinreturns 200 but the join request never reaches the curatorRepro:
POST …/sign-join→ 200, then wait — the curator's/join-requestsstays empty forever. The route is sign-only by design, but nothing said so.Fix: the response now carries
forwarded: falseplus anexthint with the exact/request-joincall needed to deliver the delegation; SKILL.md documents the two-step flow.Verified: sign-join → request-join (with
curatorPeerId) → request visible in curator's/join-requests→approve-joinsucceeds, end-to-end on devnet.#1104 — P1: UAL duality — published UAL missing from descriptor
Repro: after
vm/publishreturns the mintedual, the lifecycle descriptor still only exposedreservedUal.Fix: descriptor now surfaces
publishedUal(packages/agent/src/dkg-agent-lifecycle.ts).Verified:
publishedUalpresent and matching the publish response.#1105 — P1:
query-remotedenies everything by defaultRepro: create a CG with a public on-chain
accessPolicy, query it remotely from another node → denied, because the QueryHandler only consulted the localqueryAccess.defaultPolicy(deny) and never the CG's own policy.Fix: the QueryHandler resolves the target CG's on-chain access policy; public CGs are remotely queryable without local allowlist entries (
packages/query/src/query-handler.ts).Verified: remote SPARQL from node5 against node1's public CG returns the published rows (
status: OK, resultCount: 1); auto-scoping guards (no explicitGRAPHclauses) still enforced.#1106 — P2: SKILL.md drift bundle
Five doc/behavior mismatches fixed together:
/api/chatnow accepts the intuitivepeerId/messagealiases forto/text(agent-chat.ts) — verified working.POST /api/updatedocs corrected: data is inlinequads, it does not read SWM; edit loop recommended.agentAddressnow default to the node's primary agent wallet instead of the bare peerId namespace (dkg-agent-query.ts) — verified returning rows./.import-filetakescontextGraphIdas a multipart form field, not a query param.Test plan
dkg-agent(1225 passed),dkgCLI (2018 passed),dkg-publisher(1162 passed)Made with Cursor