fix(agent): bound sync responder page-build cost (#1221 R-7) + memoize admission ASK (R-6) #1225

Closed
Jurij89 wants to merge 1 commit into main from fix/1221-responder-page-build-cost

Jurij89 commented Jun 18, 2026

Summary

Residual ~9–10-core oxigraph CPU peg on rc.18 under live multi-peer sync (#1221), after the #1164 (A1–A4) un-peg gate. Root-caused to the sync responder page-build path: serving sync pages against an operation-heavy context graph materializes far more of the store than the page actually needs, and oxigraph v0.5.8 (per-query single-threaded MVCC, no global lock, no query-cancel on client abort) saturates one core per concurrent heavy page. Under the requester retry/flap storm those heavy page-builds stack and peg ~9 cores.

This PR bounds the per-page cost — the body of the responder work. It is the responder/server half of the fix; the requester/round-count half is #1224 (complementary — see below).

R-7 — bound page-build query cost (headline)

  • swm-data: push the SWM-root membership filter into the SPARQL snapshot read (readRowsAcrossGraphsForRoots) instead of materializing every row in the candidate graphs and filtering in JS. The store now returns only the root + its /.well-known/genid/ descendants. Byte-identical to the prior roots.has(s) || rootPrefixes.some(p => s.startsWith(p)) predicate.
  • durable-meta: replace the O(rows × assertionNames) nested .some() scan with an O(rows) trailing-segment lookup (matchesAssertionName). Byte-equivalent because assertion names cannot contain /.
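
The durable-meta change can be sketched as follows. This is a minimal illustration, not the PR's code: the real `matchesAssertionName` signature is not shown here, and the sketch assumes the nested scan matched a subject that ends in `/<assertionName>`. The function names `matchesByScan` and `matchesByTrailingSegment` are hypothetical.

```typescript
// Hypothetical sketch. Assumes a subject IRI matches when it ends in
// "/<assertionName>" and that assertion names never contain "/".

/** Nested scan: every row is compared against every assertion name. */
function matchesByScan(subject: string, names: string[]): boolean {
  return names.some((name) => subject.endsWith("/" + name));
}

/** Trailing-segment lookup: one Set probe per row. */
function matchesByTrailingSegment(subject: string, names: Set<string>): boolean {
  const slash = subject.lastIndexOf("/");
  if (slash < 0) return false; // no path segment to compare
  return names.has(subject.slice(slash + 1));
}
```

Because a name cannot contain `/`, a subject that ends in `/<name>` has exactly `<name>` as its last path segment, which is why the two functions agree under the stated assumption.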

The pushed filter is placed outside the GRAPH ?g { ?s ?p ?o } block, so the frozen-snapshot read contract (watchUnorderedSnapshotRead: no ORDER BY / OFFSET / LIMIT) is preserved — verified by the 23 snapshot-stability contracts in sync-responder-concurrent-interleaving.test.ts. The load-bearing post-merge union sort(compareRows) is preserved.
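
The query shape described above might look roughly like the sketch below. It is an assumption-heavy illustration: `buildRootFilteredQuery`, `iri`, `literal`, and `hasOrderingOrPaging` are hypothetical names, the use of `VALUES ?g` to select candidate graphs is assumed, and the genid suffix is passed in by the caller rather than hard-coded. The real `readRowsAcrossGraphsForRoots` is not shown in this PR page.

```typescript
// Hypothetical sketch of the query shape only; not the PR's implementation.

/** Wrap an IRI in angle brackets, rejecting characters that would break the query text. */
function iri(value: string): string {
  if (/[<>"{}|\\^`\s]/.test(value)) throw new Error(`unsafe IRI: ${value}`);
  return `<${value}>`;
}

/** Escape a plain string for use inside a double-quoted SPARQL literal. */
function literal(value: string): string {
  return `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
}

/** Build an unordered read whose root filter sits after the GRAPH block. */
function buildRootFilteredQuery(graphs: string[], roots: string[], genidSuffix: string): string {
  return [
    "SELECT ?g ?s ?p ?o WHERE {",
    `  VALUES ?g { ${graphs.map(iri).join(" ")} }`,
    `  VALUES ?swmRoot { ${roots.map(iri).join(" ")} }`,
    "  GRAPH ?g { ?s ?p ?o }",
    // The FILTER is a sibling of the GRAPH block, not nested inside it.
    `  FILTER(sameTerm(?s, ?swmRoot) || STRSTARTS(STR(?s), CONCAT(STR(?swmRoot), ${literal(genidSuffix)})))`,
    "}",
  ].join("\n");
}

/** Shape check: true if the query text contains ORDER BY, OFFSET, or LIMIT. */
function hasOrderingOrPaging(query: string): boolean {
  return /\b(ORDER\s+BY|OFFSET|LIMIT)\b/i.test(query);
}
```

The sketch only illustrates where the filter sits and how a shape check could be written; it says nothing about how the store evaluates the query or what it costs.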

R-6 — memoize admission ASK (adjunct)

Per-request Map<string, Promise<boolean>> memo for the isKnownContextGraph admission ASK, threaded through readDurableDataPage and the SWM-twin readAdmittedSwmSubGraphNames. Collapses the offset-0 per-graph / per-segment re-ASK storm to one ASK per unique URI.
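
A per-request promise memo of this kind can be sketched as below. The wrapper name `memoizeAsk` and the `AskFn` type are hypothetical; only the `Map<string, Promise<boolean>>` shape comes from the description above, and how the memo is threaded through `readDurableDataPage` is not shown.

```typescript
// Hypothetical sketch of a per-request ASK memo; names are illustrative.

type AskFn = (uri: string) => Promise<boolean>;

/**
 * Wrap an ASK function so each unique URI is asked at most once.
 * Create one wrapper per request so the memo is dropped with the request.
 */
function memoizeAsk(ask: AskFn): AskFn {
  const memo = new Map<string, Promise<boolean>>();
  return (uri: string): Promise<boolean> => {
    let pending = memo.get(uri);
    if (pending === undefined) {
      // Store the promise itself, so concurrent callers share one in-flight ASK.
      // Note: a rejected promise is cached too for the lifetime of this memo.
      pending = ask(uri);
      memo.set(uri, pending);
    }
    return pending;
  };
}
```

Caching the promise rather than the resolved boolean is what collapses concurrent duplicates: a second caller for the same URI receives the first caller's pending promise instead of issuing a new ASK.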

Invariant for both: the emitted row-set / admitted-graph set is byte-identical to pre-fix behaviour — only how the rows are computed changes, never which.

Tests

  • New test/sync-responder-r7-r6-pushdown.test.ts — byte-invariance for the swm-data root filter (genid-lookalike negative), the FILTER-present / no-ORDER BY/OFFSET/LIMIT shape, the durable-meta near-miss drop, the R-6 ASK-count collapse + SWM-twin admission.
  • test/sync-responder-concurrent-interleaving.test.ts — 23 snapshot-stability contracts green.
  • Full @origintrail-official/dkg-agent suite: 1616 pass (1 unrelated P2P-timing flake in e2e-join.test.ts).
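
A row-set invariance check with a genid-lookalike negative could be sketched like this. It is not the contents of the new test file: `filterInJs` restates the predicate quoted in the summary, `filterPerRoot` is a hypothetical in-memory stand-in for the per-root `sameTerm || STRSTARTS` logic, and the genid suffix is supplied by the caller.

```typescript
// Hypothetical sketch of an invariance check; not the PR's test code.

/** Baseline: roots.has(s) || rootPrefixes.some(p => s.startsWith(p)). */
function filterInJs(subjects: string[], roots: string[], genidSuffix: string): string[] {
  const rootSet = new Set(roots);
  const rootPrefixes = roots.map((root) => root + genidSuffix);
  return subjects.filter((s) => rootSet.has(s) || rootPrefixes.some((p) => s.startsWith(p)));
}

/** Stand-in for the pushed-down filter: exact match or genid prefix, per root. */
function filterPerRoot(subjects: string[], roots: string[], genidSuffix: string): string[] {
  return subjects.filter((s) =>
    roots.some((root) => s === root || s.startsWith(root + genidSuffix)),
  );
}

/** Compare two row-sets irrespective of order. */
function sameRowSet(a: string[], b: string[]): boolean {
  return JSON.stringify([...a].sort()) === JSON.stringify([...b].sort());
}
```

The useful fixtures are the negatives: a subject that merely shares a string prefix with a root (for example `urn:root10` against `urn:root1`) and a subject whose path only resembles the genid segment must be rejected by both implementations.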

Validation status

  • ✅ Unit-verified (byte-invariance + snapshot contracts + full suite).
  • Live before/after pending — per VALIDATION-RUNBOOK-1221.md, measuring per-source responder page-build totalMs on the real operation-heavy store. Draft until that lands.

Relationship to #1224

#1224 (sync responder backpressure defaults + requester-side in-flight coalescing) is the complementary half: it bounds the number of page requests / flap fanout; this PR bounds the cost per page. Together they break the responder feedback loop at both ends.

Refs #1221.

🤖 Generated with Claude Code

…ission ASK (R-6)

#1221's residual oxigraph peg on rc.18 is the sync RESPONDER page-build serving
cost under multi-peer sync (offset-0 retry storm), NOT VM-reconcile -- which never
armed on the medical node (CG was SWM-only / never on-chain registered). Forensics:
node-ui.db on_chain_id NULL across all CGs, empty replication_events, zero reconcile
arm. The page-0 timeout is the durable-data/swm page-build query=, not admission.

R-7 (page-build bounding, the headline):
- swm-data: push the SWM root-membership filter into SPARQL (new readRowsAcrossGraphsForRoots,
  reusing the proven durable-delta genid filter) instead of full per-graph materialize +
  JS filter; per-page cost O(graph) -> O(matched). Frozen-snapshot semantics preserved
  (filter runs inside the same cached page-0 materialize), the load-bearing post-merge
  multi-graph union sort preserved, and the FILTER is placed outside the GRAPH block so
  the unordered-snapshot-read contracts still hold.
- durable-meta: replace the O(rows x assertionNames) nested .some() match with an O(rows)
  last-path-segment lookup. Byte-equivalent because assertion names cannot contain '/'.

R-6 (admission-ASK tail): per-request memo over the un-memoized isKnownContextGraph
admission ASK (durable-data page admission + the readAdmittedSwmSubGraphNames SWM-twin),
collapsing ~O(graphs x segments) redundant ASKs to ~O(unique-URIs). Pure dedup over the
identical deterministic predicate; admitted graph set provably unchanged.

Tests (sync-responder-r7-r6-pushdown): row-set byte-invariance with genid-lookalike +
durable-meta near-miss negatives, ASK-count collapse, SWM-twin admission; existing
sync-responder-concurrent-interleaving snapshot-stability contracts stay green.

Refs #1221. The VM-reconcile size-bail/ops>0 neg-cache fix ships separately as a
test-validated latent bug (never executed on this node).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01G6SgdmcAV3AfpWQ3mYiFx6

Jurij89 commented Jun 18, 2026

Closing — R-7 is a measured performance regression on real data. Superseding with the #1224 (requester-side) direction.

We validated R-7 (swm-data SPARQL root-filter pushdown) against the real operation-heavy store (oxigraph v0.5.8 serve-read-only, 158,187 quads), reproducing exactly how readFreshSwmDataRows calls it — per bucket, each bucket's candidate graphs × that bucket's fresh roots:

| swm-data page-build (6,547 rows, 1,098 fresh roots) | BEFORE (main: materialize + JS filter) | AFTER (R-7 pushdown) |
| --- | --- | --- |
| total | 59 ms / 0.05 core-s | 6,448 ms / 6.27 core-s |

R-7 costs ~134× more oxigraph CPU and ~108× more wall time for the identical row-set, and it scales O(rows × roots), so it is worst exactly on the operation-heavy CGs #1221 targets (all-buckets cut: 1 root → 33 ms, 100 → 2.9 s, 500 → 14 s, 1,098 → 30 s). The VALUES ?swmRoot { … } FILTER(sameTerm(?s,?swmRoot) || STRSTARTS(STR(?s), CONCAT(STR(?swmRoot), "…/genid/"))) pattern runs as a nested-loop cross-join in oxigraph; the || STRSTARTS branch defeats the hash join (sameTerm-only at 1,098 roots = 0.75 s; with the prefix branch = 30 s). The unit fixtures, which used only a few roots, hid it entirely.
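
As a quick arithmetic check of the root-count scaling, the all-buckets figures quoted above can be divided through by the number of roots. The sketch uses only those four rounded data points (2.9 s, 14 s, and 30 s taken as 2,900, 14,000, and 30,000 ms); the variable names are illustrative.

```typescript
// Worked arithmetic on the rounded figures quoted in this comment.
const reported = [
  { roots: 1, ms: 33 },
  { roots: 100, ms: 2900 },
  { roots: 500, ms: 14000 },
  { roots: 1098, ms: 30000 },
];

// Cost per root for each data point: 33, 29, 28, and about 27.3 ms.
const msPerRoot = reported.map(({ roots, ms }) => ms / roots);

// A near-constant per-root cost is what linear growth in the root count looks like.
const spread = Math.max(...msPerRoot) - Math.min(...msPerRoot);
```

The per-root cost stays within a band of roughly 27 to 33 ms across three orders of magnitude of root count, which is consistent with the quoted cost growing linearly in the number of roots for a fixed set of rows.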

Root cause of the wrong turn: R-7's premise — "the swm-data page-build is intrinsically expensive and must be bounded" — was the rc.17 cost model. #1164 already bounded candidateGraphsFor, so on current main the swm-data materialize is already cheap (59 ms). There is nothing to bound here: zero upside, large downside.

Broader finding (real store, current code, single query, no concurrency): no responder page is intrinsically expensive — worst ≈ 0.81 core-s (agents durable-data) / ≈ 0.6–1.2 core-s (112k agents/_meta materialize), all sub-1.5 s. The field's 15–40 s page times can only arise from concurrency / queue-wait under the offset-0 retry storm (oxigraph is single-thread-per-query). ⇒ the real lever is the requester-side fix in #1224 (in-flight coalescing + backpressure), not bounding page-build cost. R-6 (admission-ASK memo) and the durable-meta JS micro-opt are dropped with this.

Validation harness + full numbers retained internally; the live before/after (A0 baseline → A1 with #1224) moves to a fresh registered networked node.

Jurij89 closed this Jun 18, 2026
Jurij89 deleted the fix/1221-responder-page-build-cost branch June 18, 2026 14:24