Skip to content

fix: resolve all 14 issues from rc.17 fresh-main QA validation (#1093–#1106)#1107

Merged
branarakic merged 6 commits into
mainfrom
fix/qa-fresh-main-issues
Jun 17, 2026
Merged

fix: resolve all 14 issues from rc.17 fresh-main QA validation (#1093–#1106)#1107
branarakic merged 6 commits into
mainfrom
fix/qa-fresh-main-issues

Conversation

@Bojan131

Copy link
Copy Markdown
Contributor

Summary

Full-surface QA of fresh main (rc.17) on a real 5-node devnet with real data surfaced 14 issues (#1093#1106). This PR fixes all of them. Every fix was re-verified end-to-end on a clean devnet — reproduction first confirmed broken on main, then confirmed fixed on this branch. All unit suites pass (agent 1225, cli 2018, publisher 1162).

Fixes #1093, Fixes #1094, Fixes #1095, Fixes #1096, Fixes #1097, Fixes #1098, Fixes #1099, Fixes #1100, Fixes #1101, Fixes #1102, Fixes #1103, Fixes #1104, Fixes #1105, Fixes #1106


#1093 — P0: ACKCollector partial peer discovery permanently bricks core publishes

Repro: boot a multi-node devnet, publish from a core node shortly after startup → QuorumUnmetError / pool_below_quorum, and the node never recovers — every subsequent publish fails too.
What was wrong: the ACK candidate pool was built once from whatever peers happened to be discovered at that moment and never refreshed, so a slow-to-connect topology poisoned the pool forever.
Fix: the pool now re-resolves from knownCorePeerIds (confirmed core nodes) on each collection attempt instead of caching a partial snapshot (packages/publisher/src/ack-collector.ts, packages/agent/src/dkg-agent.ts, sync-on-connect.ts).
Verified: fresh devnet bootstrap + immediate publishes from multiple nodes — all confirmed, zero QuorumUnmetError/pool_below_quorum in any node's logs.

#1094 — P0: no working path to update a published Knowledge Asset

Repro: publish a KA, then wm/pull-from {layer:"vm"}No VM quads found; after working around that, wm/finalizealready finalized with a different merkleRoot. The documented edit loop was completely broken.
What was wrong: (1) the VM pull query only looked in the primary data graph, but published quads live in per-KA VM graphs (…/_verifiable_memory/{author}/{number}) and context partitions; (2) pull-from never cleared the old assertion seal, so re-finalizing always collided with the stale merkleRoot.
Fix: broadened the VM lookup to scan all VM graph locations, and assertionPullFrom now deletes all seal predicates (canonical + legacy subjects) when re-opening the draft (packages/publisher/src/dkg-publisher.ts).
Verified: full edit loop on devnet — pull-from VM → write new quads → finalize → share → vm/publish returns status: confirmed as an UPDATE, and the updated content is readable from VM on publisher and replicas.

#1095 — P0: lifecycle descriptor reports contradictory state, never records published

Repro: publish a KA, fetch its lifecycle descriptor → no published event; discard an unrelated WM draft with the same name → descriptor flips to discarded while the KA is live on-chain.
Fix: the descriptor now records the published event on confirmed publish and discard no longer clobbers post-publish state (packages/agent/src/dkg-agent-lifecycle.ts).
Verified: descriptor shows published event with tx details after publish; discarding a WM draft no longer corrupts the state of the published KA.

#1096 — P1: /api/memory/search can never match Verifiable Memory

Repro: publish a KA, POST /api/memory/search scoped to VM → always 0 results even though /api/query over VM returns rows.
What was wrong: the search built graph prefixes that predate the uniform per-KA VM layout, and matched only http://schema.org while data uses https://schema.org.
Fix: corrected the VM graph prefix matching and normalized both schema.org schemes (packages/cli/src/daemon/routes/memory.ts).
Verified: memory/search over VM returns the published entities on devnet.

#1097 — P1: documented one-shot publish flow returns 500

Repro: wm/create (finalize) → vm/publish {assertionName} per SKILL.md → 500, because publish expected the data to already be in SWM while create only sealed it in WM.
Fix: publishFromFinalizedAssertion auto-promotes a sealed-but-unstaged WM draft to SWM before publishing (packages/agent/src/dkg-agent-publish.ts); SKILL.md updated to match.
Verified: one-shot create + publish by assertion name → status: confirmed on devnet.

#1098 — P1: VM divergence on replicas + rs kc-not-synced warn loop

Repro: publish on node1 with node2 subscribed → node2's VM never materializes the KA; meanwhile the Random Sampling prover floods logs with kc-not-synced retries every tick.
What was wrong: the finalization handler read SWM quads only from the bare bucket graph, but gossiped SWM copies land in per-KA graphs — merkle verification saw 0 quads and promotion silently no-opped. The RS prover retried not-yet-synced KCs with no backoff.
Fix: getSharedMemoryQuadsForRoots now uses a CONSTRUCT spanning bucket + per-KA graphs via sharedMemoryReadBothFilter (preserving literal fidelity for the merkle recompute); the prover got exponential backoff for kc-not-synced (packages/agent/src/finalization-handler.ts, packages/random-sampling/src/prover.ts).
Verified: late-joining nodes converge on the published VM data; no kc-not-synced spam in logs.

#1099 — P1: SWM clear-after-publish does not propagate

Repro: publish from SWM → publisher clears its local SWM, but every replica keeps the stale copy forever and re-serves it to late subscribers via PROTOCOL_SYNC — even re-poisoning the publisher's own SWM on resync. The update path never cleared SWM at all, even locally.
Fix: extracted clearPublishedSwmRoots (drains bucket + all per-KA SWM graphs + ownership/meta rows) and wired it into: the mint path, the update path on the publisher (dkg-agent-publish.ts), finalization replicas (finalization-handler.ts), and update replicas (update-handler.ts, including detaching roots from WorkspaceOperation rows so the sync responder stops serving them).
Verified: after publish + update, SWM row count for the roots is 0 on publisher and replicas, and a late subscriber receives no stale SWM content.

#1100 — P1: agent-profile gossip mis-decoded as publish requests (log spam)

Repro: run any 2+ node network → persistent WARN spam …does not match topic as profile gossip was fed to the publish decoder.
Fix: gossip handler dispatches by topic before decoding (packages/agent/src/gossip-publish-handler.ts).
Verified: clean devnet logs — zero mis-decode warnings across a full test session.

#1101 — P1: wm/import-file silently skips extraction for Markdown

Repro: curl -F file=@notes.md …/api/knowledge-assets/import-file → upload accepted but extraction silently skipped (curl sends application/octet-stream).
Fix: content-type detection falls back to the file extension, and skipped extractions now report an explicit skip reason (packages/cli/src/daemon/routes/knowledge-assets-import.ts).
Verified: Markdown upload reports detectedContentType: text/markdown, extraction: completed, and extracted entities are queryable.

#1102 — P2: inconsistent CG route body params (id vs contextGraphId)

Repro: context-graph/rename and /subscribe rejected the param name that sibling routes accepted.
Fix: all CG routes accept id as an alias for contextGraphId (packages/cli/src/daemon/routes/context-graph.ts); SKILL.md documents the alias.
Verified: both spellings work on rename and subscribe.

#1103 — P2: sign-join returns 200 but the join request never reaches the curator

Repro: POST …/sign-join → 200, then wait — the curator's /join-requests stays empty forever. The route is sign-only by design, but nothing said so.
Fix: the response now carries forwarded: false plus a next hint with the exact /request-join call needed to deliver the delegation; SKILL.md documents the two-step flow.
Verified: sign-join → request-join (with curatorPeerId) → request visible in curator's /join-requestsapprove-join succeeds, end-to-end on devnet.

#1104 — P1: UAL duality — published UAL missing from descriptor

Repro: after vm/publish returns the minted ual, the lifecycle descriptor still only exposed reservedUal.
Fix: descriptor now surfaces publishedUal (packages/agent/src/dkg-agent-lifecycle.ts).
Verified: publishedUal present and matching the publish response.

#1105 — P1: query-remote denies everything by default

Repro: create a CG with a public on-chain accessPolicy, query it remotely from another node → denied, because the QueryHandler only consulted the local queryAccess.defaultPolicy (deny) and never the CG's own policy.
Fix: the QueryHandler resolves the target CG's on-chain access policy; public CGs are remotely queryable without local allowlist entries (packages/query/src/query-handler.ts).
Verified: remote SPARQL from node5 against node1's public CG returns the published rows (status: OK, resultCount: 1); auto-scoping guards (no explicit GRAPH clauses) still enforced.

#1106 — P2: SKILL.md drift bundle

Five doc/behavior mismatches fixed together:

  1. /api/chat now accepts the intuitive peerId/message aliases for to/text (agent-chat.ts) — verified working.
  2. POST /api/update docs corrected: data is inline quads, it does not read SWM; edit loop recommended.
  3. WM queries without agentAddress now default to the node's primary agent wallet instead of the bare peerId namespace (dkg-agent-query.ts) — verified returning rows.
  4. Documented that sub-graph names cannot contain /.
  5. Documented that import-file takes contextGraphId as a multipart form field, not a query param.

Test plan

Made with Cursor

Full-surface QA on a real multi-node devnet surfaced 14 issues across
publish, sync, query, lifecycle and the HTTP API. Every fix below was
re-verified on a fresh devnet with real data after the change:

- #1093 ACKCollector: partial peer discovery permanently bricked core
  publishes (QuorumUnmetError) — pool now refreshes from known core peers
- #1094 KA edit loop: pull-from couldn't find VM quads and stale seals
  wedged re-finalize — broadened VM graph lookup + clear seal on pull-from
- #1095 lifecycle descriptor: contradictory state + missing published event
- #1096 memory/search could never match Verifiable Memory
- #1097 documented one-shot create+publish flow returned 500 — publish now
  auto-promotes sealed WM drafts to SWM
- #1098 VM divergence on replicas + rs kc-not-synced warn loop — read SWM
  across bucket and per-KA graphs, backoff for not-synced KCs
- #1099 SWM clear-after-publish didn't propagate — drain per-KA SWM graphs
  on publisher, finalization replicas and update replicas
- #1100 agent-profile gossip mis-decoded as publish requests (log spam)
- #1101 wm/import-file silently skipped extraction for Markdown uploads
- #1102 CG routes: accept id alias for contextGraphId consistently
- #1103 sign-join: make sign-only contract explicit (forwarded:false + next
  hint pointing at request-join)
- #1104 surface publishedUal in the lifecycle descriptor
- #1105 query-remote denied everything by default, ignoring the CG's public
  on-chain accessPolicy
- #1106 SKILL.md drift bundle: chat body aliases, WM query default agent
  wallet fallback, update semantics, sub-graph name rules, import-file form
  field docs

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/query/src/query-handler.ts
Comment thread packages/query/src/query-handler.ts Outdated
Comment thread packages/agent/src/dkg-agent-query.ts Outdated
- Rate-limit BEFORE the access check and cache isContextGraphPublic
  verdicts (60s TTL, size-capped) so unauthenticated traffic cannot burn
  RPC quota through the #1105 on-chain public-CG resolver (review 🔴 1).
- ENTITY_BY_UAL now enforces the RESOLVED context graph's access policy
  post-resolution — on-chain-public CGs become reachable by UAL on
  default-deny configs, and the old leak (UAL reads passing through an
  explicitly denied CG whenever any other public CG existed) is closed
  (review 🔴 2).
- Unscoped working-memory reads for the node default agent span BOTH the
  legacy peerId-keyed namespace and the rc.17+ wallet-keyed namespace via
  new same-identity agentAddressAliases, so neither era of WM drafts is
  stranded (review 🟡).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/agent/src/dkg-agent.ts Outdated
Comment thread packages/cli/src/daemon/routes/knowledge-assets-import.ts Outdated
Bojan131 added a commit that referenced this pull request Jun 11, 2026
Per manager request — every high-priority issue gets a test that reproduces it
(it.fails / it.skip-with-recipe), so a fix can flip it green and stay in CI.

Runnable it.fails repros (14):
  unit     — #11 (op-wallets plaintext), #1121 #1122 (async-lift encryption +
             canonicalization), plus existing #184 #675 #757
  devnet   — #886 #1093 #1094 #1095 #1096 #1097 #1098 #1104
             (devnet/issue-liveness/high-issues.test.ts; 11 pass = bugs live)

Documented it.skip stubs with exact repro recipes (11) — where a faithful test
needs a fixture/design/topology that doesn't exist yet (a wrong test is worse):
  #1091 #614  contract/design (grindable seed, billing-window sweep)
  #1124       host-mode sharded topology (devnet cores are all CG members)
  #1099       gossip-retention timing (repros on testnet, not fast local devnet)
  #1013 #936  publisher-runtime / 2-replica-reconcile harness
  #999 #1008  load-dependent store saturation (verified live on testnet)
  #723        emergent network-wide RS metric
  #462        MessageHandler ACL harness (skill_request has no authz)
  #1078       layer-scoped private-store API

The 9 fix-in-flight highs (#886, #1093-#1099, #1104) are fixed on PR #1107 —
when it merges their it.fails repros start passing → unwrap them.

Full map in docs/testing/ISSUE_LIVENESS_TESTS.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ntentType)

ACK candidate pool tracks the chain's runtime quorum (#1093 follow-up):
- getACKCandidatePeers gated its confirmed-core shortcut on the hard-coded
  DEFAULT_REQUIRED_ACKS (3) while ACKCollector enforces the chain's runtime
  requiredACKs. On networks configured above 3 signatures, a 3-strong
  confirmed-core subset is still below quorum — returning only it
  re-introduced pool_below_quorum.
- The V10 ACK provider now caches requiredACKs (lastKnownRequiredACKs)
  right after each getMinimumRequiredSignatures() read — BEFORE
  collector.collect() runs the getConnectedCorePeers callback — so the
  pool decision uses the real quorum even on the first publish.
- New ack-candidate-pool.test.ts: default quorum, runtime quorum above
  (5/4) and below (2) the default, and self-exclusion.

Import filename inference respects the opaque-blob escape hatch (#1101 follow-up):
- An EXPLICIT contentType=application/octet-stream form field no longer
  gets overridden by filename-extension inference (notes.md was treated
  as Markdown anyway, removing the deliberate "store as opaque blob"
  path). Only the implicit default (no override; generic/absent part
  header) is eligible for inference.
- The test harness (import-file-orchestration.shared.ts) now mirrors the
  daemon's full detection incl. #1101 inference; new regression tests for
  both the explicit-octet-stream escape hatch and the preserved implicit
  inference.

Verified live on a clean 6-node devnet from this branch:
- bootstrap completes ALL 10 publishes across core1+core2 confirmed (the
  #1093 fix area, with the quorum threading in place);
- import-file with explicit octet-stream + notes.md → stored as blob,
  extraction skipped; same file with implicit octet-stream part header →
  text/markdown pipeline runs.

Unit: import-file parts 01-09 92/92, ack-candidate-pool 5/5.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review skipped: filtered diff is 23982 lines (cap: 5,000). Please consider splitting this into smaller PRs for reviewability.

Bojan131 added a commit that referenced this pull request Jun 12, 2026
… red-while-live convention

Adds single-process / single-Hardhat-node reproducing tests (run in the normal
CI lanes) for five high/pre-mainnet issues that were previously only documented
`it.skip` stubs, and switches the whole liveness suite to the standard
"red while the bug is live, green when fixed" convention (plain failing `it()`
instead of the inverted `it.fails`, which was green-while-broken).

New CI tests (each authored against a known-fixed build and confirmed to flip
green there, so a red is a genuine live bug, not a broken test):
- #462  agent/issue-462-skill-acl.test.ts — an unauthorized (but signed) peer's
  skill_request is rejected and the handler does not run. Today there is no ACL
  → handler runs → RED.
- #936  agent/issue-936-tokenid-determinism.test.ts — two replicas reconciling
  the same multi-root KC from chain (divergent oxigraph insertion orders) agree
  on the rootEntity→tokenId mapping. Today positional assignment over a
  store-dependent order makes them disagree → RED.
- #1013 publisher/issue-1013-async-finalization-honesty.test.ts — a private
  publish that never reached chain (no storage ACKs) must NOT map to a finalized
  lift job. Today the mapper returns finalized/local → RED.
- #1078 storage/issue-1078-private-layer-scope.test.ts — a root hydrates only the
  authoritative private slice, not a superseded commitment for the same root.
  Today the CG-level _private graph commingles both → RED.
- #1091 random-sampling/e2e-hardhat-chain.test.ts — a node cannot predict its own
  RS challenge from public block data. Today the seed is reconstructed from
  block.difficulty/blockhash/sender and previewChallengeForSeed predicts the
  exact on-chain draw → RED.

Convention flip (it.fails → it()) for the existing high-issue repros
(#11, #184, #675, #757, #1121, #1122 + the devnet multi-node tier) so the suite
is uniformly RED while bugs are live and GREEN once fixed — matching how the fix
PRs (#1107, #1132) turn individual tests green as they merge.

Doc rewritten (docs/testing/ISSUE_LIVENESS_TESTS.md): all 25 high issues mapped
to a test across three tiers — 11 CI unit/integration, 8 devnet multi-node, and
6 honest pending-fixture/emergent stubs (#614 #1099 #1124 fixture-needed; #723
#999 #1008 emergent/load — a deterministic CI assertion there would be a false
positive).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@branarakic

Copy link
Copy Markdown
Contributor

Heads-up from #1155 (per-KA metadata trim) — these are exactly the pre-mainnet fixes that should land, and several (#1098 replica VM convergence, #1099 SWM drain) harden the same SWM/per-KA-graph surfaces #1155 restructures, so folding them together strengthens both. One cross-track interaction to coordinate, on finalization-handler.ts + the SWM materialization path:

#1155 changes two things this PR's SWM logic touches:

  • The minimal per-cgId partition no longer carries dkg:status — it moved to the label _meta graph, so isAlreadyConfirmed now reads both (partition ‖ label _meta).
  • KAs are collapsed to a single bare-UAL subject; sync still Merkle-verifies them via a self-map in sync-verify-worker-impl.ts + dkg-agent-utils.ts.

#1107's #1098 getSharedMemoryQuadsForRoots and #1099 clearPublishedSwmRoots assume the pre-trim partition shape. Different methods, same finalization/SWM mate — so whoever rebases second should (a) reconcile which graph carries status/SWM materialization, and (b) re-run the #1098 SWM merkle-recompute + late-joiner convergence against the collapsed bare-UAL self-map. No regression originates from the trim (every cross-reader's _meta dependency is preserved), but the SWM convergence path is worth an explicit cross-check. (Separately: #1107 and #1132 overlap each other in resolveViewGraphs WM-prefix block — independent of #1155.)

Bojan131 and others added 3 commits June 17, 2026 16:03
# Conflicts:
#	packages/cli/skills/dkg-node/SKILL.md
#	packages/publisher/src/dkg-publisher.ts
#	packages/random-sampling/src/prover.ts
The merge of main into this branch auto-merged 26 of 29 files textually,
but three PR fixes were built on lifecycle/metadata APIs that main has
since reworked (RFC ka-metadata-trim, OT-RFC-49). This commit reconciles
those auto-merged-but-stale changes so the merged tree builds and the full
suite (incl. main's red-while-live repro tests) is green.

dkg-agent-publish.ts:
- #1095/#1104: main deleted `generateAssertionPublishedMetadata` (RFC
  ka-metadata-trim Phase 0 — its only caller was dead on main). Dropped the
  PR's `published` prov:Activity event minting that depended on it; main
  already stamps `dkg:state="published"` on this path (deriveStatus →
  vm-confirmed), so #1095's lifecycle-STATE fix is satisfied. Kept #1104's
  `publishedUal` stamp (a plain raw-quad write, independent of the removed
  function).
- #1097: dropped the PR's "auto-promote a sealed-but-unstaged assertion
  before publish" block. main solved the same issue the opposite way — a
  finalized-but-unshared publish is an explicit caller precondition that
  surfaces "No quads in shared memory" and the vm/publish route maps to a
  clean 409 VM_PUBLISH_PRECONDITION. Auto-promoting defeats that precondition
  (regressed it to a 500), so main's explicit share→publish contract wins.

knowledge-assets-route.test.ts:
- #1094: flipped the red-while-live repro `pull-from {layer} 500s "No sealed
  entity list"` to its positive seeding leg, exactly as the test's own
  comment instructed ("when #1094 ships, this flips and must be upgraded").
  PR #1107 is #1094's fix, so pull-from now re-opens the WM draft and
  re-seeds the sealed entities (200).

Conflict resolutions (in the merge commit): prover.ts took main's version
(the PR's #1098 log-dedup was built on the ciphertext-chunks model main
removed under OT-RFC-49); dkg-publisher.ts kept BOTH main's
`{ provenanceEvents }` arg and the PR's #1095 VM-version discard filter;
SKILL.md took main's (more complete) two-fork publish-route docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@branarakic branarakic merged commit 7c83648 into main Jun 17, 2026
40 of 41 checks passed
Bojan131 added a commit that referenced this pull request Jun 18, 2026
…test

The merge of main into this branch produced 2 conflicts (resolved in the
merge commit) plus one auto-merge artifact this commit fixes.

Conflict resolutions (in the merge commit):
- packages/query/src/dkg-query-engine.ts (working-memory view): combined
  main's same-identity alias span (PR #1107 review 🟡 — one prefix per
  agentAddressAlias) with #1132's sub-graph scoping (#184/#675 — the `${sg}`
  suffix), so the WM prefixes are `…${sg}/_working_memory/${addr}/` per alias.
- packages/cli/test/notifications-route.test.ts: took main's version. main
  rewrote it from a mock-based unit test into a real-daemon integration test
  (sign-join→request-join→curator reject-join) that fully exercises the #757
  curator gate; the PR's #757 route change (thread callerAddress into
  listPendingJoinRequests) auto-merged into notifications.ts and is covered.

This commit:
- packages/agent/test/messaging-chat-acl.test.ts: add `vi` to the vitest
  import. main's copy of this file (chat-ACL tests only) and the PR's net-new
  `skill_request ACL (GH #462)` describe block (which uses `vi.fn` via
  echoSkill) auto-merged, but the surviving import line was main's
  `{ describe, it, expect }` — so the #462 skill-ACL tests threw
  `ReferenceError: vi is not defined`. The #462 feature itself
  (MessageHandler.setSkillAcl + default-deny enforcement) is intact in the
  merged messaging.ts; only the test import needed reconciling. 15/15 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: documented one-shot publish flow (dkg_publish / SKILL.md §4a) returns 500 — publish-by-assertionName requires an undocumented promote:true on create bug: /api/memory/search can never match Verifiable Memory — VM graph filter still uses pre-rc.16 "_verified" prefix (+ http-only schema.org matching) bug: KA lifecycle descriptor reports contradictory state (state=discarded + status=vm-confirmed) and never records a published event bug: no working path to update a published Knowledge Asset — wm/pull-from 500s, finalize wedges, /api/update demands an attestation no API can produce bug: core nodes permanently unable to publish — partial StorageACK protocol discovery poisons the ACK peer pool (pool_below_quorum)

2 participants