
Integration: data-store & query rework (C-track #1150–53 + A1 #1141 + #1155)#1156

Closed. branarakic wants to merge 27 commits into main from integration/data-store-and-query-rework


@branarakic


What

Integration branch combining the data-store volume work (#1155) with the query/scan + sync half of the #1138 scalability wave, assembled so the two halves can be tested, flag-flipped, and measured together — and so RFC Part 2 (Proposal A) can be built on a single stable base instead of four moving branches.

Merged onto main in dependency order:

| Stream | PR | Role |
| --- | --- | --- |
| C1 | #1150 | GraphSetIndexStore write-through graph-name index; eliminates the steady-state listGraphs DISTINCT ?g scan |
| C2/C3/C4 | #1151/#1152/#1153 | ContextGraphMetaProjection (event-driven per-CG meta, Part-2 Proposal B) · projection-backed listContextGraphs · host-sweep bound + slow-query tagging |
| A1 | #1141 | sync responder bounded-graph page serving (the ~95% sync win) |
| #1155 | feat/ka-metadata-trim | per-KA metadata trim, ~134 → ~50 quads/KA |
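
To make the C1 idea concrete, here is a minimal sketch of a write-through graph-name index, assuming an in-memory stand-in for the store. The names `QuadLike`, `QuadStore`, `InMemoryStore`, and `GraphNameIndex` are hypothetical and are not the actual GraphSetIndexStore API; the sketch only shows how one seed scan plus write-through updates can replace a repeated DISTINCT ?g scan.

```typescript
// Hypothetical sketch, not the real GraphSetIndexStore.
interface QuadLike {
  graph: string;
}

interface QuadStore {
  insert(quads: QuadLike[]): void;
  scanGraphNames(): string[]; // stands in for SELECT DISTINCT ?g
}

// Toy backing store that counts how often the expensive scan runs.
class InMemoryStore implements QuadStore {
  scans = 0;
  private quads: QuadLike[] = [];

  insert(quads: QuadLike[]): void {
    for (const q of quads) this.quads.push(q);
  }

  scanGraphNames(): string[] {
    this.scans += 1;
    return Array.from(new Set(this.quads.map((q) => q.graph)));
  }
}

// Wrapper that seeds the name set once, then keeps it current on every insert.
class GraphNameIndex {
  private names: Set<string> | null = null;
  private readonly inner: QuadStore;

  constructor(inner: QuadStore) {
    this.inner = inner;
  }

  insert(quads: QuadLike[]): void {
    this.inner.insert(quads);
    // Write-through: update the index in the same call as the store write.
    if (this.names) for (const q of quads) this.names.add(q.graph);
  }

  listGraphs(): string[] {
    // A single seed scan on first use; later calls read the index only.
    if (!this.names) this.names = new Set(this.inner.scanGraphNames());
    return Array.from(this.names).sort();
  }
}
```

Graph deletion and writes that bypass the wrapper are deliberately left out; a real index would need to handle both.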

Conflicts resolved (each compiled + tested)

  1. storage/.../sparql-http.ts (C1 ↔ C4) — kept C1's in-adapter listGraphs cache removal (GraphSetIndexStore owns graph-list indexing now) and C4's slow-query telemetry. These were entangled: C4's maybeEmitSlowQuery shares this.now, so the monotonic clock stays for timing while the cache machinery (graphListCache, scanGraphs, in-flight coalescing, LIST_GRAPHS_CACHE_TTL_MS) is dropped. listGraphs() is now a direct scan carrying a source tag.
  2. agent/.../dkg-agent-lifecycle.ts (A1 ↔ C-track) — both appended distinct helpers at the same location (listGraphFamily/listGraphsByPrefix vs getSharedMemorySubGraphAdmission/isKnownContextGraphUri); kept both.
  3. agent/.../sync/responder/sync-handler.ts (#1155 ↔ A1): A1's bounded-page-serving structure won (#1155's only hunk here was the read-both delta arm). Re-ported #1155's read-both collapsed-UAL arm into A1's durableDeltaWhereClauseForGraphs (graph-plan.ts): legacy partOf token-row arm + collapsed ?ualc dkg:rootEntity ; dkg:batchId arm. Without it, collapsed-shape KAs bind no ?deltaBatch → are re-sent on every delta once DKG_SYNC_DELTA is enabled (correct, but defeats the optimization).
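
The read-both arm in item 3 can be illustrated with a small sketch that builds a two-arm UNION where both arms bind the same batch variable. The function name `buildReadBothBatchArm` is hypothetical, and the exact triple patterns (in particular the legacy arm's shape) are assumptions made for illustration; only the predicate names and `?deltaBatch` come from the description above.

```typescript
// Hypothetical sketch of a "read-both" WHERE fragment. The real
// durableDeltaWhereClauseForGraphs may differ; the patterns are assumptions.
function buildReadBothBatchArm(batchVar: string = "?deltaBatch"): string {
  // Legacy shape (assumed): a token row points at its UAL via partOf,
  // and the UAL node carries the batch id.
  const legacyArm = [
    "?token dkg:partOf ?ual .",
    `?ual dkg:batchId ${batchVar} .`,
  ].join(" ");

  // Collapsed shape: one node carries both rootEntity and batchId.
  const collapsedArm = `?ualc dkg:rootEntity ?root ; dkg:batchId ${batchVar} .`;

  // Either shape binds batchVar, so a KA in either shape can be matched
  // against the delta cursor.
  return `{ ${legacyArm} } UNION { ${collapsedArm} }`;
}
```

With only the legacy arm, a collapsed-shape KA would leave the batch variable unbound, which is the re-send behavior described in item 3.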

Two test fixes — pre-existing on their source branches, not regressions

Both were proven failing on their own branch in isolation (built + run on A1 and C2 in a throwaway worktree); both are test-only and should be upstreamed to the source PRs:

  • sync-responder-concurrent-interleaving.test.ts (A1 #1141): the multi-graph page-query probe regex omitted DISTINCT, but readSwmMetaRowsPage correctly emits SELECT DISTINCT ?g … for the multi-graph VALUES join. The single-graph arm already tolerated DISTINCT; gave the multi-graph arm parity.
  • agent.part-13.test.ts curator-authority (C2 #1151): C2 repointed getContextGraphCreator at the ContextGraphMetaProjection cache. The test forges foreign-creator state via raw store.insert that bypasses markDirty, leaving the projection stale → registration rejected with the wrong message. Real sync paths call markDirtyFromQuads after insert, so the code is correct; the test now invalidates the projection after its out-of-band write.
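
For the first fix, a probe that tolerates an optional DISTINCT might look like the sketch below. The regex and the helper name `isPageQuery` are illustrative assumptions, not the pattern used in the actual test file.

```typescript
// Hypothetical probe: matches "SELECT ?g" and "SELECT DISTINCT ?g" alike.
// The trailing \b keeps it from matching longer variable names such as ?graph.
const pageQueryProbe = /SELECT\s+(?:DISTINCT\s+)?\?g\b/i;

function isPageQuery(sparql: string): boolean {
  return pageQueryProbe.test(sparql);
}
```

Making DISTINCT optional rather than required keeps the probe valid for both the single-graph and the multi-graph arm.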

Verification

  • Full build green (turbo, 20/20 packages).
  • ~7,500 tests green across every touched package: storage 209 · query 261 · publisher 1162 · agent 1282 · cli 2021 · kafka-plugin 169 · node-ui 1423.

⚠️ Flag state + remaining work

  • DKG_LIST_CONTEXT_GRAPHS_PROJECTION defaults OFF — C3's projection-backed listContextGraphs is dead code until enabled. Merging this branch does not by itself reduce the idle-node enumeration cost; the flag must be flipped + measured.
  • The listDeclaredContextGraphIds() enumeration scan still survives (live, uncached SELECT DISTINCT ?ctxGraph + STRSTARTS arm), as do the 15-min SWM-cleanup double scan, the 30s metrics COUNTs, and the STRSTARTS graph-name sites. These are the RFC Part 2 / Proposal A residual — a typed graph registry in oxigraph (CG roster + per-CG cg → hasGraph → g membership) — to be built on top of this branch.
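
The "dead code until enabled" point can be pictured as a simple flag gate. This is a sketch under assumptions: the helper names are hypothetical, and the accepted values are a guess ("on" appears in the reproduce commands in this thread; "true" and "1" are assumptions).

```typescript
// Hypothetical flag gate; the real parsing rules may differ.
type Env = Record<string, string | undefined>;

function isProjectionEnabled(env: Env): boolean {
  const raw = (env["DKG_LIST_CONTEXT_GRAPHS_PROJECTION"] ?? "").trim().toLowerCase();
  // Unset or unrecognized values fall through to OFF, matching the default.
  return raw === "on" || raw === "true" || raw === "1";
}

function listContextGraphsGated(
  env: Env,
  fromProjection: () => string[],
  fromScan: () => string[],
): string[] {
  // With the flag off, the projection path is never called.
  return isProjectionEnabled(env) ? fromProjection() : fromScan();
}
```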

Next

  1. Flip DKG_LIST_CONTEXT_GRAPHS_PROJECTION + DKG_SYNC_DELTA on a test node, measure idle CPU, confirm the surviving scans match the Proposal-A residual.
  2. Build the Proposal-A registry (additive on C2's dirty-set plumbing).

🤖 Generated with Claude Code

Jurij Skornik and others added 27 commits June 12, 2026 06:30
One 1-triple KA publish leaves ~134 resident quads (live-measured); ~97% is
bookkeeping, ~30 quads are copies of five values. Combined with hot-path
graph-name scans (STRSTARTS(STR(?g)), SELECT DISTINCT ?g — the adapter's own
"dominant idle-node CPU cost"), this drives the rc.17 idle-CPU saturation.

The RFC specifies: Phase 0 dead code, Phase 1 zero-reader drops (~-24/KA),
Phase 2 dedupe via small reader migrations (~-25), Phase 3 aggressive decision
points (UAL/token collapse, URN merge, provenance-events flag, ~45-50/KA), and
the query-side fixes (graph registry to kill name scans, event-driven
reconcilers). Every verdict cites its writers and readers by file:line.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Implements docs/rfcs/ka-metadata-trim.md: one 1-triple KA publish drops from
~134 resident quads to ~50 (~40 with metadata.provenanceEvents=false), with
every removed/relocated triple justified by a writer+reader audit and every
migrated reader reading both old and new shapes.

- Phase 0: dead AssertionPublished writer+gate; orphan kcUal read; dangling
  partition bnode.
- Phase 1: zero-reader drops (kaCount, blockTimestamp, publisherAddress,
  chainId, blockNumber, tokenId, publicTripleCount, AuthorshipProof block,
  Publication node, URN type/contextGraph/wasGeneratedBy rows). publishedAt
  KEPT after adversarial review found the kafka-plugin discovery reader.
- Phase 2: KC/KA type rows -> predicate-based counters; single rootEntity
  (entity alias dropped outside the signed seal); wm/swm pointers written
  only on divergence; fromLayer/toLayer + wasAssociatedWith derived/optional;
  publicSnapshotRef collapsed; WM marker updated (not deleted) at VM flip.
- Phase 3: UAL+<ual>/<n> collapsed to one node (read-both in resolveKA,
  access-handler incl. <ual>/<n> fallback for old clients, RS prover, sync,
  kafka discovery, async-lift, EPCIS, counters); metadata.provenanceEvents
  config (default true) gating all four lifecycle event writers; ShareTransition
  dropped (node-ui receipt reads the seal subject, legacy fallback retained);
  partition CONSTRUCT-copy -> documented minimal shape (REMAP keeps wholesale
  move). Lifecycle-URN->seal merge DEFERRED with a worked plan
  (TODO(rfc-ka-trim) at assertionLifecycleUri) after the audit surfaced
  signed-material collision + identity double-allocation hazards.
- Adversarial-review fixes: kafka read-both queries; provenanceEvents gating
  on discard/update; multi-root private access attestation always matches the
  served triples (computePrivateRoot fallback); stale re-promote stays a no-op;
  isAlreadyConfirmed read-both vs the minimal partition shape.

Author-signed seal material untouched. Suites green: core 1039, query 261,
random-sampling 62, publisher 1150, agent 1226, cli 2031, kafka-plugin 169,
node-ui 1456 (+tsc/builds).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- sync-verify: collapsed-shape KCs now Merkle-verified on sync (self-map in
  both duplicated impls, multi-map roots, dual-shape dedupe guard) — a wrong
  merkle root on a collapsed KC is now rejected instead of accepted on trust.
- node-ui receipt: replaced the never-binding lifecycle join with the
  member-entity join pinned by the URN/URI tail correspondence; the identical
  latent bug in the legacy hop fixed too; real executed-query tests added.
- multi-root access: conditional collapse — multi-root KAs re-emit per-token
  pairing rows (manifest order) alongside the collapsed shape, so legacy
  <ual>/<n> resolves exactly root N with matching attestation; single-root
  publishes (dominant case) keep the full collapse. Handler serves the first
  root that has a private bag; F3 recompute guard retained.
- graph-viz wasGeneratedBy: documentation-only correction (generic matcher,
  no stranded feature) — RFC + inline comment amended.

Tests: agent unit + 9 sync-path hardhat suites green; publisher full 1232;
node-ui full 1423 (+ new executed-query tests); graph-viz 140; builds green.
Full agent lane delegated to CI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…gration/data-store-and-query-rework

# Conflicts:
#	packages/storage/src/adapters/sparql-http.ts
#	packages/storage/test/sparql-http.test.ts

…ion/data-store-and-query-rework

# Conflicts:
#	packages/agent/src/dkg-agent-lifecycle.ts

…n/data-store-and-query-rework

# Conflicts:
#	packages/agent/src/sync/responder/sync-handler.ts

Both fail on their own source branches in isolation; neither is a behavior
regression. Fixed as part of assembling integration/data-store-and-query-rework
so the combined suite is green. Should be upstreamed to the source PRs.

- sync-responder-concurrent-interleaving (A1 #1141): the multi-graph page-query
  probe regex omitted DISTINCT, but readSwmMetaRowsPage correctly emits
  SELECT DISTINCT ?g ... for the multi-graph VALUES join. The single-graph arm
  already tolerated DISTINCT; gave the multi-graph arm parity.
- agent.part-13 curator-authority (C2 #1151): C2 repointed getContextGraphCreator
  at the ContextGraphMetaProjection cache; the test forges foreign-creator state
  via raw store.insert that bypasses markDirty, leaving the projection stale.
  Real sync paths call markDirtyFromQuads after insert, so the code is correct;
  the test now invalidates the projection after its out-of-band write.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@branarakic


Live idle-CPU A/B measurement (handoff context)

Ran the projection-flag A/B on a real data-rich node to validate what this branch does and does not fix for idle-node CPU. Posting full results since the work is being handed off.

Setup: HariSeldon testnet node, 1.1 GB store, native oxigraph-server backend, booted on this integration branch, idle. Sampled oxigraph-server %CPU 15×8s per phase after 50s stabilization. DKG_SYNC_DELTA is not flippable here — only A1 (#1141) is merged, not A5 (delta activation) — so this is projection-only (DKG_LIST_CONTEXT_GRAPHS_PROJECTION).

Result

| DKG_LIST_CONTEXT_GRAPHS_PROJECTION | mean oxigraph CPU | median | range |
| --- | --- | --- | --- |
| off | 454% | 445% | 12.8 – 894% |
| on | 426% | 204% | 1.1 – 981% |

The projection flag does NOT fix idle CPU. Both phases saturate ~4.5–9 cores; the means are within run noise (the boot-time C1 GraphSetIndexStore seed scan + testnet sync confound the first ~7 samples each phase). There is a real secondary signal: the on phase reached a sustained genuine-idle window (4 consecutive samples @ 1–6%) that off never touched, and median dropped 445%→204% — consistent with the listContextGraphs enumeration being relieved by the projection. But that is not the bottleneck.
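
As a side note on why the mean and the median can disagree like this, here is a small sketch that summarizes %CPU samples. The sample arrays are synthetic and purely illustrative; they are not the measured data from this run, and `summarize` is a hypothetical helper.

```typescript
interface SampleSummary {
  mean: number;
  median: number;
  min: number;
  max: number;
}

// Summarize a list of %CPU samples.
function summarize(samples: number[]): SampleSummary {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  return { mean, median, min: sorted[0], max: sorted[sorted.length - 1] };
}

// Synthetic data (NOT the measurement): a steadily busy phase versus a phase
// that alternates between saturation bursts and genuine idle.
const steadyPhase = summarize([400, 400, 400, 400, 400, 400, 400]);
const burstyPhase = summarize([900, 900, 900, 5, 5, 5, 5]);
```

In this synthetic case the means are close (400 versus roughly 389) while the medians are far apart (400 versus 5), which is the same shape of signal as a similar mean with a lower median.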

The actual bottleneck (identical in both phases)

The C4 (#1153) slow-query telemetry caught the dominant cost:

SELECT (COUNT(DISTINCT ?kc) AS ?c) WHERE { GRAPH ?g { ?kc … } }
   source=unknown  operation=select  elapsedMs≈30001  thresholdMs=10000   (~13×/phase)

This is the 30-second metrics sweep: a COUNT(DISTINCT) over all named graphs (metrics-queries.ts + lifecycle.ts:1871-1900, the getTotalKAs/KCs/triples + getContextGraphCount getters). On a 1.1 GB store it TIMES OUT at the 30s SPARQL HTTP limit every 30 seconds. A wildcard GRAPH ?g COUNT(DISTINCT) over a large RocksDB store drives heavy multi-core read amplification + compaction; it is the #1 idle driver, and nothing in the C-track touches it (it is not behind the projection flag).
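
For readers unfamiliar with the telemetry shape, the following is a minimal sketch of slow-query timing with a source tag and an injectable clock. `SlowQueryMonitor` and its methods are hypothetical and synchronous for brevity; this is not C4's maybeEmitSlowQuery, only an illustration of how an untagged caller ends up recorded as source=unknown.

```typescript
interface SlowQueryRecord {
  source: string;
  operation: string;
  elapsedMs: number;
  thresholdMs: number;
}

// Hypothetical monitor: times a query and records it when it exceeds the threshold.
class SlowQueryMonitor {
  readonly records: SlowQueryRecord[] = [];
  private readonly now: () => number;
  private readonly thresholdMs: number;

  constructor(now: () => number, thresholdMs: number) {
    this.now = now; // injected clock, so tests can control elapsed time
    this.thresholdMs = thresholdMs;
  }

  run<T>(operation: string, source: string | undefined, query: () => T): T {
    const started = this.now();
    try {
      return query();
    } finally {
      const elapsedMs = this.now() - started;
      if (elapsedMs >= this.thresholdMs) {
        this.records.push({
          // Callers that pass no tag are attributed to "unknown".
          source: source ?? "unknown",
          operation,
          elapsedMs,
          thresholdMs: this.thresholdMs,
        });
      }
    }
  }
}
```

Passing a tag such as "metrics" at the call site is all it takes for the record to carry a usable attribution.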

Two actionable findings

  1. The dominant query is emitted as source=unknown — C4's slow-query tagging does not tag the metrics COUNT, so its own telemetry mis-attributes the single biggest cost. Tag the metrics getters (and any other untagged reconciler reads) so the telemetry is usable.
  2. Priority-1 for the Proposal-A follow-up is the metrics COUNT, not the CG enumeration. A maintained counter (or registry-backed count) for KC/KA/triple totals + CG count kills the 30s timeout sweep. The enumeration/listContextGraphs work the projection already half-relieves is secondary. Per the adversarial review on the design, do NOT hand-maintain these as naive integers without an invalidation contract — they are operator-visible (dashboards) and writes arrive from gossip/sync/migration outside a single chokepoint; back them on the same dirty-set discipline C2's ContextGraphMetaProjection already establishes (markDirty/markDirtyFromQuads).
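
One way to picture finding 2 is a per-graph count cache that is only recomputed for graphs marked dirty. This is a sketch under stated assumptions: `DirtyTrackedCounts` and its `countGraph` callback are hypothetical, and it borrows only the markDirty naming from C2, not its implementation.

```typescript
// Hypothetical maintained counter with an explicit invalidation contract.
class DirtyTrackedCounts {
  recounts = 0; // how many bounded per-graph recounts have run
  private readonly counts = new Map<string, number>();
  private readonly dirty = new Set<string>();
  private readonly countGraph: (graph: string) => number;

  constructor(countGraph: (graph: string) => number) {
    this.countGraph = countGraph; // bounded, single-graph count
  }

  // Every write path (publish, gossip, sync, migration) must call this.
  markDirty(graph: string): void {
    this.dirty.add(graph);
  }

  total(): number {
    // Recount only the graphs touched since the last read.
    this.dirty.forEach((graph) => {
      this.counts.set(graph, this.countGraph(graph));
      this.recounts += 1;
    });
    this.dirty.clear();

    let sum = 0;
    this.counts.forEach((n) => {
      sum += n;
    });
    return sum;
  }
}
```

A write that skips markDirty leaves the total stale, which is the same failure mode the curator-authority test hit with its raw store.insert.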

Net

Merging this branch is necessary but not sufficient for idle CPU — empirically confirmed. When Proposal A (the typed graph registry: CG roster + per-CG cg → hasGraph → g membership, in an oxigraph system graph) is built on top of this branch, it should back the metrics counters first, then repoint the enumeration / STRSTARTS sites.
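
The registry shape described above (CG roster plus per-CG graph membership) could be sketched in memory as follows. `GraphRegistry` and its method names are hypothetical, and an in-memory map stands in for the oxigraph system graph; the point is only that roster, membership, and CG count become lookups instead of name scans.

```typescript
// Hypothetical in-memory stand-in for the typed graph registry.
class GraphRegistry {
  // cg -> set of member graph IRIs (the cg -> hasGraph -> g relation)
  private readonly membership = new Map<string, Set<string>>();

  registerContextGraph(cg: string): void {
    if (!this.membership.has(cg)) this.membership.set(cg, new Set<string>());
  }

  addGraph(cg: string, graph: string): void {
    this.registerContextGraph(cg);
    (this.membership.get(cg) as Set<string>).add(graph);
  }

  // Roster read: replaces an enumeration scan over all named graphs.
  listContextGraphs(): string[] {
    return Array.from(this.membership.keys()).sort();
  }

  // Membership read: replaces a STRSTARTS filter over graph names.
  graphsOf(cg: string): string[] {
    const graphs = this.membership.get(cg);
    return graphs ? Array.from(graphs).sort() : [];
  }

  // Count read: a candidate backing for a CG-count metric.
  contextGraphCount(): number {
    return this.membership.size;
  }
}
```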

Reproduce

# integration build is packages/cli/dist (already built on this branch)
export DKG_HOME=~/.dkg
DKG_LIST_CONTEXT_GRAPHS_PROJECTION=on  node packages/cli/dist/cli.js start   # vs unset for baseline
# sample: ps -o %cpu= -p $(pgrep -f oxigraph-v0)   ; grep "slow query" ~/.dkg/daemon.log
node packages/cli/dist/cli.js stop

Daemon lifecycle is dkg start / stop / status (NOT dkg daemon …). Raw 30-sample run is at /tmp/measure-projection.out on the dev box.

Flag state reminder for the takeover

  • DKG_LIST_CONTEXT_GRAPHS_PROJECTION defaults OFF — C3's projection-backed listContextGraphs is dead code until enabled; merging alone changes nothing for the enumeration.
  • The read-both collapsed-UAL delta arm was re-ported into A1's durableDeltaWhereClauseForGraphs (graph-plan.ts) but is inert until A5 lands the DKG_SYNC_DELTA activation — correct and ready, just not exercisable on this branch.

🤖 Generated with Claude Code

@branarakic


Closing as redundant — Track-C reconciled into main via #1201 (integration/track-c-merge).

Verified at content level (not just patch-id): this branch is 169 commits behind main; 6/7 of its distinctive SWM-admission identifiers exist in main (createResponderSubGraphRegistrationMemo, createSubGraphNameMemo, filterSharedMemoryMetaQuads, isSharedMemoryBucketDescendantDataGraph, effectiveRegisteredSubGraphNames, subGraphRegistrationMemo), and the 7th — the "honor child CG collisions in SWM admission" fix (3795aec11) — is present in main under renamed identifiers (graph-plan.ts:643: childCgUri + isKnownContextGraph, the same child-CG-collision skip logic). No unique unmerged work found. Reopen if I've missed something.

@branarakic closed this on Jun 17, 2026