fix(agent): bound sync responder page-build cost (#1221 R-7) + memoize admission ASK (R-6)#1225
Jurij89 wants to merge 1 commit into
…ission ASK (R-6)

#1221's residual oxigraph peg on rc.18 is the sync RESPONDER page-build serving cost under multi-peer sync (offset-0 retry storm), NOT VM-reconcile -- which never armed on the medical node (CG was SWM-only / never on-chain registered). Forensics: node-ui.db on_chain_id NULL across all CGs, empty replication_events, zero reconcile arm. The page-0 timeout is the durable-data/swm page-build query=, not admission.

R-7 (page-build bounding, the headline):
- swm-data: push the SWM root-membership filter into SPARQL (new readRowsAcrossGraphsForRoots, reusing the proven durable-delta genid filter) instead of full per-graph materialize + JS filter; per-page cost O(graph) -> O(matched). Frozen-snapshot semantics preserved (filter runs inside the same cached page-0 materialize), the load-bearing post-merge multi-graph union sort preserved, and the FILTER is placed outside the GRAPH block so the unordered-snapshot-read contracts still hold.
- durable-meta: replace the O(rows x assertionNames) nested .some() match with an O(rows) last-path-segment lookup. Byte-equivalent because assertion names cannot contain '/'.

R-6 (admission-ASK tail): per-request memo over the un-memoized isKnownContextGraph admission ASK (durable-data page admission + the readAdmittedSwmSubGraphNames SWM-twin), collapsing ~O(graphs x segments) redundant ASKs to ~O(unique-URIs). Pure dedup over the identical deterministic predicate; admitted graph set provably unchanged.

Tests (sync-responder-r7-r6-pushdown): row-set byte-invariance with genid-lookalike + durable-meta near-miss negatives, ASK-count collapse, SWM-twin admission; existing sync-responder-concurrent-interleaving snapshot-stability contracts stay green.

Refs #1221. The VM-reconcile size-bail/ops>0 neg-cache fix ships separately as a test-validated latent bug (never executed on this node).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01G6SgdmcAV3AfpWQ3mYiFx6
Closing: R-7 is a measured performance regression on real data. Superseding with the #1224 (requester-side) direction.

We validated R-7 (swm-data SPARQL root-filter pushdown) against the real operation-heavy store (oxigraph v0.5.8

R-7 is ~134× more oxigraph CPU / ~108× wall, for the identical row-set, and it scales O(rows × roots), so it is worst exactly on the operation-heavy CGs #1221 targets (1→33 ms, 100→2.9 s, 500→14 s, 1098→30 s in the all-buckets cut). The

Root cause of the wrong turn: R-7's premise, "the swm-data page-build is intrinsically expensive and must be bounded", was the rc.17 cost model. #1164 already bounded

Broader finding (real store, current code, single query, no concurrency): no responder page is intrinsically expensive; worst ≈ 0.81 core-s (

Validation harness + full numbers retained internally; the live before/after (A0 baseline → A1 with #1224) moves to a fresh registered networked node.
Summary
Residual ~9–10-core oxigraph CPU peg on rc.18 under live multi-peer sync (#1221), after the #1164 (A1–A4) un-peg gate. Root-caused to the sync responder page-build path: serving sync pages against an operation-heavy context graph materializes far more of the store than the page actually needs, and oxigraph v0.5.8 (per-query single-threaded MVCC, no global lock, no query-cancel on client abort) saturates one core per concurrent heavy page. Under the requester retry/flap storm those heavy page-builds stack and peg ~9 cores.
This PR bounds the per-page cost — the body of the responder work. It is the responder/server half of the fix; the requester/round-count half is #1224 (complementary — see below).
R-7: bound page-build query cost (headline)
- swm-data: push the SWM root-membership filter into SPARQL (new readRowsAcrossGraphsForRoots) instead of materializing every row in the candidate graphs and filtering in JS. The store now returns only the root + its /.well-known/genid/ descendants. Byte-identical to the prior roots.has(s) || rootPrefixes.some(p => s.startsWith(p)) predicate.
- durable-meta: replace the O(rows × assertionNames) nested .some() scan with an O(rows) trailing-segment lookup (matchesAssertionName). Byte-equivalent because assertion names cannot contain /.

The pushed filter is placed outside the GRAPH ?g { ?s ?p ?o } block, so the frozen-snapshot read contract (watchUnorderedSnapshotRead: no ORDER BY/OFFSET/LIMIT) is preserved, as verified by the 23 snapshot-stability contracts in sync-responder-concurrent-interleaving.test.ts. The load-bearing post-merge union sort(compareRows) is preserved.
R-6: memoize admission ASK (adjunct)
Per-request Map<string, Promise<boolean>> memo for the isKnownContextGraph admission ASK, threaded through readDurableDataPage and the SWM-twin readAdmittedSwmSubGraphNames. Collapses the offset-0 per-graph / per-segment re-ASK storm to one ASK per unique URI.

Invariant for both: the emitted row-set / admitted-graph set is byte-identical to pre-fix behaviour; only how the rows are computed changes, never which.
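To make the "same result, cheaper computation" claim concrete, here is a minimal TypeScript sketch of the two dedup patterns described above: a per-request Promise memo around an async admission check, and a trailing-segment set lookup replacing a nested scan. It is not the repository's code. The names withRequestMemo, matchesBySuffixScan and matchesByTrailingSegment are hypothetical, and the shape of the old nested match (a suffix test against "/" + name) is an assumption made for illustration.

```typescript
// Sketch only: every name below is a hypothetical stand-in, not the repository's code.

type AdmissionAsk = (graphUri: string) => Promise<boolean>;

// R-6 pattern: wrap an async admission check in a per-request memo.
// The Promise itself is cached, so repeated and concurrent lookups for the
// same URI share a single underlying ASK.
function withRequestMemo(ask: AdmissionAsk): AdmissionAsk {
  const memo = new Map<string, Promise<boolean>>();
  return (graphUri: string): Promise<boolean> => {
    let pending = memo.get(graphUri);
    if (pending === undefined) {
      pending = ask(graphUri);
      memo.set(graphUri, pending);
    }
    return pending;
  };
}

// durable-meta pattern, old shape (assumed): every name is tested against
// every subject, so the work grows as rows x names.
function matchesBySuffixScan(subject: string, names: readonly string[]): boolean {
  return names.some((name) => subject.endsWith("/" + name));
}

// durable-meta pattern, new shape: one set lookup on the trailing path segment.
// Equivalent to the scan above only because no name contains "/".
function matchesByTrailingSegment(subject: string, names: ReadonlySet<string>): boolean {
  const lastSlash = subject.lastIndexOf("/");
  if (lastSlash < 0) return false;
  return names.has(subject.slice(lastSlash + 1));
}
```

One design consequence of caching the Promise rather than the resolved boolean: a rejected ASK is also shared for the lifetime of the memo, so the memo should be scoped to a single request, as the text above describes.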
Tests
- test/sync-responder-r7-r6-pushdown.test.ts: byte-invariance for the swm-data root filter (genid-lookalike negative), the FILTER-present / no-ORDER BY/OFFSET/LIMIT shape, the durable-meta near-miss drop, the R-6 ASK-count collapse + SWM-twin admission.
- test/sync-responder-concurrent-interleaving.test.ts: 23 snapshot-stability contracts green.
- @origintrail-official/dkg-agent suite: 1616 pass (1 unrelated P2P-timing flake in e2e-join.test.ts).
Validation status
VALIDATION-RUNBOOK-1221.md, measuring per-source responder page-build totalMs on the real operation-heavy store. Draft until that lands.
Relationship to #1224
#1224 (sync responder backpressure defaults + requester-side in-flight coalescing) is the complementary half: it bounds the number of page requests / flap fanout; this PR bounds the cost per page. Together they break the responder feedback loop at both ends.
- #1224's graph-plan.ts edits are in the memo/snapshot-limit region (~L41–204); this PR's are in the read functions (L427+). The two regions are disjoint, so there is no merge conflict.
- #1224 edits sync-responder-concurrent-interleaving.test.ts; this PR adds a separate test file. They can land in either order.

Refs #1221.
🤖 Generated with Claude Code