Issue #1138 — un-peg sync-responder CPU: A1–A4 + B0–B2 (§7 checkpoint PASS) #1164

Merged
Jurij89 merged 8 commits into main from integration/issue-1138 on Jun 14, 2026

@Jurij89 (Contributor) commented Jun 14, 2026

Summary

Lands the validated un-peg gate for the sync-responder / oxigraph CPU peg (#1127 / #1136), tracked in #1138 — the set of fixes that takes a data-rich node from a machine-pegging, non-converging sync storm to a converging, sub-core steady state.

Included (waves — each independently reviewed + validated)

§7 field-repro checkpoint — PASS

Controlled rig (real oxigraph-server 0.5.8 / RocksDB, field-shaped ~150k-quad CG, N concurrent peers), same-rig before/after:

| | peers | oxigraph cores | converges? |
|---|---|---|---|
| BEFORE | 4 / 16 | 10.23 / 15.21 (machine pegged) | ❌ never |
| AFTER | 1 / 4 / 16 | 0.31 / 0.38 / 0.38 | |

~40× reduction at 16 peers (15.21 → 0.38 cores); every AFTER run stays at ≤0.38 cores, well below the < 1 core target (spec predicted ~9.6 → "well under 0.5"). Mechanisms confirmed live: A1 bounded+cached, A2 cap sheds floods (peer queue full), A1 one-session-per-CG (session superseded), A3 converges.
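
To make two of the mechanisms above concrete, here is a minimal sketch of a per-peer cap that sheds excess requests and a registry that keeps at most one live session per context graph. It is an illustration only, not the code in this PR: `PeerQueue`, `SessionRegistry`, `Session`, and all method names are hypothetical, and keying sessions by context graph alone is an assumption. Only the phrases "peer queue full" and "session superseded" come from the text above.

```typescript
// Hypothetical bounded queue for one peer's pending sync requests.
class PeerQueue {
  private pending: string[] = [];

  constructor(private readonly cap: number) {}

  // Returns false (the request is shed) once `cap` requests are queued.
  offer(requestId: string): boolean {
    if (this.pending.length >= this.cap) return false; // "peer queue full"
    this.pending.push(requestId);
    return true;
  }

  // Removes and returns the oldest queued request, if any.
  take(): string | undefined {
    return this.pending.shift();
  }
}

// Hypothetical session handle; a worker would check `superseded` and stop.
interface Session {
  id: number;
  superseded: boolean;
}

// Keeps at most one live session per context graph (assumed key).
class SessionRegistry {
  private nextId = 1;
  private active = new Map<string, Session>();

  // Opening a session marks any earlier session for the same graph as
  // superseded, so stale work can be abandoned instead of piling up.
  open(contextGraph: string): Session {
    const prev = this.active.get(contextGraph);
    if (prev) prev.superseded = true; // "session superseded"
    const session: Session = { id: this.nextId++, superseded: false };
    this.active.set(contextGraph, session);
    return session;
  }
}
```

Both structures bound the work a flood of requests can create: the queue rejects early instead of growing, and the registry guarantees that a newer session replaces an older one rather than running alongside it.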

Validation

Each wave: Codex review + multi-lens adversarial review + server-backed (oxigraph-server, not embedded) perf validation. The integration branch was not CI-gated (workflows pinned to main/rc), so this PR is the first CI run on the stack. The known pre-existing #1161 e2e-publish flake may surface; it passes in isolation and was not introduced here.

Not included (follow-ups, direct to main)

Merge note

Please use a merge commit (not squash) to preserve the per-wave history (A1…B2).

🤖 Generated with Claude Code

Jurij89 and others added 8 commits June 13, 2026 19:00
A1 validated server-backed (oxigraph-server/RocksDB): durable-data warm <500ms / durable-meta 102ms / SWM warm pages cheap — all perf phases under budget; D-SEC admission-parity 12/12; full agent suite 1283 green. CPU peg resolved. Merging to integration branch.
A2 validated: asAbortError fix verified (core 1060/0-errors), p2p-messenger delegation tests updated (3/3), A2 suites (72) + storage (207) green, server-backed perf 3/3 (caps don't regress A1 snapshot perf). Full-suite e2e reds (e2e-sub-graphs, e2e-publish-protocol) are pre-existing parallel-load flakes — pass 100% in isolation, not A2-caused. Merging to integration.
…1138

Brings the data-volume half (#1155) under the A1+A2 access-pattern work.
Single conflict: sync-handler.ts durable-data else branch — took A1's
structure (inline delta query was re-homed to graph-plan.readDurableDataPage);
#1155's read-both collapsed-KA arm is already present in A1's
durableDeltaWhereClauseForGraphs (legacy partOf arm + collapsed
rootEntity+batchId arm), so the coordination re-port is satisfied.
#1155 preserves the sync _meta predicates (memoryLayer/assertionGraph),
so A1 durable-meta admission + D-SEC are unaffected. Builds green (tsc 8/8).
…hygiene (#1143)

Squash-merge of codex/issue-1138-a3-sync-scheduler-progress (15 commits)
onto integration/issue-1138 (A1+A2+#1155), plus one stale-test fix.

A3 makes the sync requester/scheduler honest about failure:
 - separates peer-reachability failures (failedPeers) from phase failures
   (failedPhases); the lifecycle success-gate requires both zero, so a
   failed/timed-out/denied round no longer stamps success → backoff engages
 - per-(peer, contextGraph) progress accounting (defines a 'clean round')
 - freshness-aware durable checkpoints; metadata-only freshness is not
   counted as data progress
 - flap hygiene: cooldowns/backoff survive connection:close
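
The success gate and backoff behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation in this commit: only the counter names `failedPeers` and `failedPhases` come from the message; `RoundResult`, `isCleanRound`, `nextDelayMs`, `SyncBackoff`, and the exponential schedule with its constants are hypothetical.

```typescript
// Hypothetical outcome of one sync round.
interface RoundResult {
  failedPeers: number;  // peers that could not be reached
  failedPhases: number; // phases that failed, timed out, or were denied
}

// The gate: a round only counts as a success when both counters are zero.
function isCleanRound(r: RoundResult): boolean {
  return r.failedPeers === 0 && r.failedPhases === 0;
}

// Assumed schedule: delay doubles per consecutive failed round, with a cap.
function nextDelayMs(consecutiveFailures: number, baseMs = 1000, capMs = 60000): number {
  if (consecutiveFailures <= 0) return baseMs;
  return Math.min(capMs, baseMs * 2 ** consecutiveFailures);
}

// Tracks consecutive failures per (peer, contextGraph). The map is not tied
// to any connection object, so closing a connection does not clear it.
class SyncBackoff {
  private failures = new Map<string, number>();

  // Records a round and returns the delay before the next attempt.
  record(peerId: string, contextGraph: string, r: RoundResult): number {
    const key = `${peerId}\u0000${contextGraph}`;
    const n = isCleanRound(r) ? 0 : (this.failures.get(key) ?? 0) + 1;
    this.failures.set(key, n);
    return nextDelayMs(n);
  }
}
```

With this shape, a round where every peer was reachable but one phase failed (`failedPeers: 0, failedPhases: 1`) is not clean, so the delay grows instead of resetting.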

Validation (worktree dkg-a1-verify, integration is not CI-gated):
 - build + tsc clean (agent/cli/node-ui)
 - agent suite 1361 passed / 0 failed; node-ui 1431/0
 - cli 82 fails are pre-existing Windows-host env (identical on base) —
   A3's own new cli tests pass
 - server-backed responder perf guard 3/3 (A3 doesn't touch the responder)
 - 4-lens adversarial review: 0 confirmed issues

Stale-test fix: swm-snapshot-sync 'remote snapshots unavailable' is a phase
failure, not a peer failure → assert failedPeers=0 + failedPhases=1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
)

* fix(agent): damp VM reconcile retries

* fix(sync): harden VM reconcile cache keys

* fix(sync): harden VM reconcile negative cache

* fix(sync): refresh VM reconcile cache inputs

* fix(sync): preserve VM reconcile peer context

* fix(sync): bound VM reconcile replay state

* fix(sync): stabilize VM reconcile state cleanup

* fix(sync): keep VM reconcile probes cheap

* fix(sync): reprime VM reconcile peers before defer

* fix(sync): recheck VM reconcile fallbacks

* fix(agent): fold reconcile robustness into A4

* fix(agent): address A4 review edge cases

* fix(agent): harden VM reconcile damping retries

* fix(agent): scope VM reconcile cache by context graph

* fix(agent): avoid unsafe reconcile and warm-core trims

* fix(agent): retry VM reconcile after SWM fetch progress

* fix(agent): clear VM reconcile state on CG resets

* fix(agent): fully reset VM reconcile state on CG rebind

* fix(agent): drain core host recordings on stop

* fix(agent): harden VM reconcile SWM generation cache

* fix(agent): complete VM reconcile peer rotation pass

* fix(agent): bound core host ACK recording paths

* fix(agent): isolate warm-core unpin retries

* fix(agent): finish A4 review edge cases

* fix(agent): settle A4 recording and peer metrics

* fix(agent): close A4 stale recording races

* fix(agent): settle A4 liveness and peer metrics

* fix(agent): settle A4 liveness follow-ups

---------

Co-authored-by: Jurij Skornik <jurij.skornik@gmail.com>
…#1145)

* fix(cli): use cheap context graph metrics count

* fix(cli): preserve known context graph metrics

* fix(cli): refine context graph metric candidates

* fix(cli): bound context graph metric metadata reads

* fix(cli): narrow context graph declaration metrics

* fix(cli): accept nested wallet context graph metrics

* fix(cli): exclude system graphs from CG metrics

* fix(cli): prefer canonical wallet metric candidates

* fix(cli): canonicalize context graph metric aliases

* fix(cli): suppress proven metric shadow aliases

* fix(cli): prefer wire identity for CG metrics

* fix(cli): constrain CG declaration metric source

* fix(cli): skip graph-derived metric shadows

* fix(cli): require backing for slash CG metrics

* fix(cli): allow slash CG declaration candidates

* fix(cli): keep context graph metric cheap

---------

Co-authored-by: Jurij Skornik <jurij.skornik@gmail.com>
* fix(cli): add exact context graph write preflight

* fix: tighten B1 preflight authorization paths

* fix: constrain B1 exact preflight fast path

* fix(cli): require public policy for tokenless preflight

* fix(cli): preserve preflight failure context

---------

Co-authored-by: Jurij Skornik <jurij.skornik@gmail.com>
* fix(chain): bound context graph registry scans

* fix: surface partial B2 registry scans

* fix: finalize B2 partial scan pages before watermark

* fix: preserve B2 registry scan resume bounds

* fix: keep B2 chain discovery opt-in incremental

* fix: reuse shared B2 partial scan guard

* fix: allow degraded B2 registry scans past budget

* fix: keep B2 daemon chain full resync path

* fix(chain): keep full registry scans unbounded

* fix(cli): seed registry scan watermark after full scans

* fix(cli): retry full registry scans until seeded

* fix(cli): surface failed registry seed scans

* fix(cli): derive registry scan mode from adapter watermark

* fix(agent): rethrow strict partial registry scans

---------

Co-authored-by: Jurij Skornik <jurij.skornik@gmail.com>

@github-actions (bot) left a comment

Codex review skipped: filtered diff is 21817 lines (cap: 5,000). Please consider splitting this into smaller PRs for reviewability.

@Jurij89 Jurij89 merged commit 244c6ab into main Jun 14, 2026
42 checks passed