fix(agent): stop unbounded agents/_meta growth from profile heartbeats (#1233) #1234

Open
Jurij89 wants to merge 1 commit into main from fix/agents-meta-bloat
Conversation

Jurij89 commented Jun 18, 2026

Contributor

Summary

  • The agents system context graph never confirms on-chain, so every agent-profile publish (a 5-minute heartbeat, fanned out across peers via gossip) appended a tentative _meta tracking record under a fresh, never-repeated UAL, and nothing ever pruned superseded ones. agents/_meta grew without bound (112k quads / ~386 records per agent on a long-lived node), making the CG expensive to serve and stalling offset-0 sync. This is the agents-CG-only slice of #1221 ("Residual ~9–10-core oxigraph peg in rc.18 despite #1164 (A1–A4), under live multi-peer sync on an operation-heavy CG").
  • The per-publish _meta record has no consumer for the agents registry: agent facts are served from the data graph and the CG never confirms. An adversarial consumer audit confirmed the agents data graph itself IS load-bearing (cross-peer SWM encryption keys, peer discovery, curator/membership sync, the agent directory) — so only the _meta record is safe to touch, not the CG.
  • This PR skips persisting that record at the two highest-volume, side-effect-free paths, gated on a single shared dkg-core predicate isAgentRegistryContextGraph:
    • the gossip publish receiver — the dominant per-peer-heartbeat source (the offset-0 storm driver);
    • the local publish terminal.
  • Scoped deliberately. The update-restatement and direct-protocol receive paths are not touched here: there the _meta is load-bearing (it is the prior-root source for data-graph cleanup on rootEntity change, and it drives the tentative-expiry lifecycle), so skipping it would regress those. Bounding/pruning the per-agent record on those paths — plus a one-time prune of existing stores — is the follow-up.
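
A minimal sketch of the gating described in the bullets above, not the actual dkg-core or publisher code. It assumes the agents registry is identified by the context graph id `agents` and that the data graph URI is the `_meta` URI without its suffix; `InMemoryStore`, `persistPublish`, and the graph-URI helpers are hypothetical stand-ins for illustration.

```typescript
// Hypothetical stand-in for the shared dkg-core predicate. The real helper
// lives in packages/core/src/genesis.ts; the "agents" id is an assumption.
const AGENTS_CONTEXT_GRAPH_ID = "agents";

function isAgentRegistryContextGraph(contextGraphId: string): boolean {
  return contextGraphId === AGENTS_CONTEXT_GRAPH_ID;
}

// Toy quad store keyed by graph URI, for illustration only.
class InMemoryStore {
  private graphs = new Map<string, string[]>();

  insert(graph: string, quads: string[]): void {
    const existing = this.graphs.get(graph) ?? [];
    this.graphs.set(graph, existing.concat(quads));
  }

  count(graph: string): number {
    return this.graphs.get(graph)?.length ?? 0;
  }
}

// Assumed URI layout: data graph plus a "/_meta" sibling.
const dataGraphOf = (cg: string): string => `did:dkg:context-graph:${cg}`;
const metaGraphOf = (cg: string): string => `${dataGraphOf(cg)}/_meta`;

// The data graph is always written; the tentative `_meta` record is skipped
// only for the agents registry context graph.
function persistPublish(
  store: InMemoryStore,
  cg: string,
  dataQuads: string[],
  metaQuads: string[],
): void {
  store.insert(dataGraphOf(cg), dataQuads);
  if (!isAgentRegistryContextGraph(cg)) {
    store.insert(metaGraphOf(cg), metaQuads);
  }
}

const store = new InMemoryStore();
persistPublish(store, "agents", ["profile-quad"], ["tentative-meta-quad"]);
persistPublish(store, "my-project", ["data-quad"], ["tentative-meta-quad"]);
```

In this toy setup the agents `_meta` graph stays empty while both data graphs and the normal context graph's `_meta` are written, which is the contract the PR's tests assert.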

Related

Files changed

| File | What |
| --- | --- |
| packages/core/src/genesis.ts + index.ts | shared isAgentRegistryContextGraph(cgId) helper (single source of truth) |
| packages/agent/src/gossip-publish-handler.ts | gossip receiver skips the tentative _meta insert for the agents registry CG (the dominant source) |
| packages/publisher/src/dkg-publisher.ts | the local publish terminal skips it |
| packages/agent/test/gossip-publish-handler.test.ts | +tests: agents-CG skips _meta; non-registry CG keeps it |
| packages/agent/test/profile-manager-meta-skip.test.ts | +test: ProfileManager.publishProfile() populates the data graph but leaves agents/_meta empty |
| packages/publisher/test/dkg-publisher-meta-optout.test.ts | +tests: AGENTS publish writes the data graph but not _meta; a normal CG keeps _meta |

Test plan

  • vitest run gossip-publish-handler (agent) — 13/13
  • vitest run profile-manager-meta-skip (agent) — 1/1
  • vitest run dkg-publisher-meta-optout (publisher) — 2/2
  • publisher regression (update/restate/metadata/publish-handler reverted to base) — 145 pass
  • CI — the only failing jobs across runs are pre-existing flakes (Tornado: agent swm-ack-quorum/sync-responder; Bura: cli daemon-http-behavior-extra SIGINT timing), green on main and untouched by this PR
  • Behavioural: a fresh rc.18 node accumulated 570 status="tentative" stubs in 2h33m pre-fix; the gossip-receiver guard removes the dominant peer-fanned source.

🤖 Generated with Claude Code

Comment thread packages/publisher/src/publisher.ts Outdated
* minutes under a fresh UAL would otherwise grow `agents/_meta` without
* bound and stall offset-0 sync (the agents slice of #1221).
*/
persistPublishMeta?: boolean;

🟡 Issue: This makes an agents-registry-specific storage policy a public, caller-controlled PublishOptions switch. That pushes the invariant out to each caller (ProfileManager has to remember persistPublishMeta: false) while GossipPublishHandler hard-codes the same exception separately, so future registry publish paths can drift and arbitrary callers can also suppress _meta for normal local publishes. Consider centralizing this as internal policy, e.g. a shared shouldPersistTentativePublishMeta(contextGraphId)/system-graph helper used by both publisher and gossip paths, or keep the opt-out out of the public options surface.

await this.store.deleteBySubjectPrefix(dataGraph, prefix);
}

const options: PublishOptions = {

🟡 Issue: The added validation never exercises the production ProfileManager.publishProfile() path that supplies persistPublishMeta: false; the publisher test calls publisher.publish() directly with the option, and the gossip test covers remote receipt. If this call site is later dropped or miswired, the new tests would still pass while local agent heartbeat publishes resume filling agents/_meta. Add a ProfileManager-level regression that publishes a profile through publishProfile() and asserts did:dkg:context-graph:agents/_meta stays empty while the agents data graph is populated.

Jurij89 force-pushed the fix/agents-meta-bloat branch from 369f94f to 5831eef (June 19, 2026 00:07)

const result = await publisher.publish({ contextGraphId: cg, quads: profileQuads(cg) });

// The publish still succeeds and accounts the entity...
expect(result.kaManifest[0]?.rootEntity).toBe(ENTITY);

🟡 Issue: The publisher regression test proves the agents path returns a manifest and leaves agents/_meta empty, but it never verifies that the agent profile quads are still persisted. A regression that accidentally skips both the data write and the metadata write for SYSTEM_CONTEXT_GRAPHS.AGENTS would still pass this test. Add the same data-graph count/query assertion used in the gossip test so the validation covers the full intended contract: data graph is written, _meta is not.

Jurij89 force-pushed the fix/agents-meta-bloat branch from 5831eef to 5fbe5fa (June 19, 2026 00:23)

Jurij89 commented Jun 19, 2026

Contributor Author

Pushed 5fbe5fad1 addressing the review.

1. Centralized the policy (publisher.ts thread). Done exactly as suggested — removed the public PublishOptions.persistPublishMeta switch and added a shared isAgentRegistryContextGraph(contextGraphId) helper in dkg-core. Both _meta write sites (the publisher's chainless terminal and the gossip receiver) gate on it, so the rule is single-sourced, can't drift, and is no longer a caller-controlled option. publisher.ts and profile-manager.ts are back to their original form (no longer in the diff).

2. ProfileManager regression (profile-manager.ts thread). Added packages/agent/test/profile-manager-meta-skip.test.ts — it publishes through ProfileManager.publishProfile() and asserts the agents DATA graph is populated while did:dkg:context-graph:agents/_meta stays empty.

3. Data-write assertion (publisher test thread). The AGENTS case now also asserts the data graph is written, so a regression that skipped both the data and _meta writes would fail.

Tests: gossip handler 13/13, publisher 2/2, profile-manager 1/1.

CI note. The failing Tornado: agent [5/9/10] and Bura: cli jobs are pre-existing flakes unrelated to this change: swm-ack-quorum/sync-responder (green on main; this PR touches none of those files) and daemon-http-behavior-extra > SIGINT exit code (AssertionError: expected 'SIGABRT' to be 'SIGINT' — a shutdown-signal timing flake). This PR's diff is confined to core (one helper + its export), dkg-publisher.ts, gossip-publish-handler.ts, and tests.

// per-publish tentative `_meta` record has no consumer (agent facts are
// served from the DATA graph); persisting one per heartbeat would grow
// `agents/_meta` without bound and stall offset-0 sync. Skip it there.
if (!isAgentRegistryContextGraph(contextGraphId)) {

🟡 Issue: This guard only covers the local publish() terminal branch, but ProfileManager.publishProfile() switches to publisher.update() after the first successful profile publish (currentKcId is set). The local-only update path still calls storeUpdatedQuads(), which restates label metadata into did:dkg:context-graph:agents/_meta, so the production heartbeat can still create agents _meta rows even though the new integration test only exercises the first publish. Please either apply the same agents-registry predicate to the update/restatement path or adjust the profile heartbeat flow, and extend the ProfileManager test to call publishProfile() twice so this cannot drift again.

});
const profileManager = new ProfileManager(publisher, store);

await profileManager.publishProfile({

🟡 Issue: This end-to-end guard only exercises the first profile write. In production, ProfileManager.publishProfile sets currentKcId after that call and subsequent heartbeats go through publisher.update, whose local update path still restamps metadata into the context graph _meta via restateLabelGraphForUpdate. So these tests can pass while the actual repeated heartbeat path resumes writing to agents/_meta. Add a second publishProfile call/assertion here, or a direct publisher.update coverage case, so the validation covers the real heartbeat/re-publish path.

🟡 Issue: This test does not actually exercise the publisher.update() / restateLabelGraphForUpdate(skipMeta) path it is meant to guard. With the NoChainAdapter setup, DKGPublisher.publish() returns kaId: 0n, and ProfileManager only calls update() when currentKcId is truthy, so the second publishProfile() call goes through publish() again. That leaves the new agents update-path behavior effectively untested; use a publisher spy/mock that returns a nonzero kaId, or add a direct update() regression test for the agents CG.

This test does not actually exercise the publisher.update() path it claims to cover. With NoChainAdapter, the first DKGPublisher.publish() returns kaId: 0n, and ProfileManager only calls update() when currentKcId is truthy, so the second publishProfile() call falls back to another publish(). A regression in the new update/restatement skipMeta plumbing would still pass this test. Force a non-zero KC id or directly drive DKGPublisher.update() / the update handler and assert agents/_meta stays empty.
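
The reason the second call never reaches update() is that `0n` is falsy in JavaScript. A minimal sketch of that branch, assuming the guard is a plain truthiness check on the stored id as the review describes; `choosePath` and the non-zero id are hypothetical and only illustrate the behaviour.

```typescript
// Hypothetical reduction of the ProfileManager branch described in the review:
// update() is chosen only when the stored KC id is truthy.
type PublishPath = "publish" | "update";

function choosePath(currentKcId: bigint | undefined): PublishPath {
  return currentKcId ? "update" : "publish";
}

// Per the review, a chainless publish returns kaId: 0n, which is falsy.
const chainlessKcId = 0n;
// Hypothetical non-zero id, e.g. supplied by a publisher spy or mock.
const forcedKcId = 42n;

const secondCallChainless = choosePath(chainlessKcId); // stays on publish()
const secondCallForced = choosePath(forcedKcId); // reaches update()
console.log(secondCallChainless, secondCallForced);
```

This is why the review asks for either a forced non-zero id or a direct update() test: with the chainless adapter alone, the update branch is unreachable.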

Jurij89 force-pushed the fix/agents-meta-bloat branch from 5fbe5fa to 8b10a2e (June 19, 2026 00:46)

Jurij89 commented Jun 19, 2026

Contributor Author

Pushed 8b10a2e57 — addressed the update()-path gap.

You're right: ProfileManager.publishProfile() goes through publisher.update() on every heartbeat after the first, and its local restatement (restateLabelGraphForUpdate) re-stamps label metadata into agents/_meta. My earlier guard only covered the first-publish terminal.

Fix — the same agents-registry predicate now gates every _meta write path:

  • restateLabelGraphForUpdate gains a skipMeta mode that restates the data graph only (no label re-stamp, no merkleRoot, no materialized-version write). skipMeta defaults false, so every non-agents update is byte-for-byte unchanged.
  • Both update call sites — DKGPublisher.update() (local heartbeat) and the gossip UpdateHandler (received peer updates) — pass skipMeta: isAgentRegistryContextGraph(contextGraphId).

So all four _meta writers for the agents CG are covered: local publish, local update, gossip publish, gossip update.

Test — profile-manager-meta-skip.test.ts now calls publishProfile() twice: the first through publish(), the second through the update() restatement path, asserting agents/_meta stays empty after both while the data graph stays populated. Also ran the publisher update/restate/metadata suites (137 tests) — no regression on the normal path.

Comment thread packages/publisher/src/metadata.ts Outdated
* profile heartbeat goes through this update() path and would otherwise keep
* filling `agents/_meta` without bound.
*/
skipMeta?: boolean;

🟡 Issue: restateLabelGraphForUpdate is exported from the publisher package, so this generic skipMeta?: boolean turns an agents-only storage rule into caller-controlled behavior. A future non-agents caller can accidentally skip root/privateRoot/merkleRoot and materialized-version metadata, leaving the label graph hard to query and the version guard unadvanced. Prefer deriving the agents exception inside the helper from the context/meta graph, or keep the flag behind an internal agents-specific wrapper.

Comment thread packages/core/src/genesis.ts Outdated
* per-publish KC/KA `_meta` tracking record has no consumer — agent facts are
* served from the DATA graph. Persisting that `_meta` on every heartbeat
* (locally, and once per gossiped peer heartbeat) grows `agents/_meta` without
* bound (#1233). Both `_meta` write sites — the publisher's local path and the

🟡 Issue: This comment says there are only two _meta write sites, but packages/publisher/src/publish-handler.ts still calls generateTentativeMetadata unconditionally for incoming /dkg/10.0.0/publish requests. Even if the heartbeat path currently uses gossip, the new central predicate documentation is already incomplete, which makes future drift likely. Either include/gate that handler too or narrow the comment to the paths this PR actually covers.

Jurij89 force-pushed the fix/agents-meta-bloat branch from 8b10a2e to cfe9d78 (June 19, 2026 01:09)

Jurij89 commented Jun 19, 2026

Contributor Author

Pushed cfe9d782d — addressed all three.

  1. Test now drives update() for real. You're right — under NoChainAdapter, publish() returns kaId: 0n, so ProfileManager's currentKcId guard never flips and the second publishProfile() was just another publish(). The publisher suite now drives DKGPublisher.update() directly against the agents CG and asserts agents/_meta stays empty while the data graph is populated — plus a normal-CG update() control that asserts _meta IS still written, so the skip is provably agents-specific. The ProfileManager test is now an honest single-publish integration. 147 publisher tests pass.

  2. No generic skipMeta knob. Removed it — restateLabelGraphForUpdate now takes the contextGraphId and derives the agents exception internally via isAgentRegistryContextGraph, so a non-agents caller can no longer accidentally suppress label/root/merkleRoot/version metadata.

  3. The missed 5th site. Guarded publish-handler.ts (the direct /dkg/10.0.0/publish protocol's generateTentativeMetadata insert) and corrected the genesis.ts comment. Every agents/_meta writer is now gated on the one predicate: local publish, local update, direct-protocol publish, gossip publish, gossip update.

Comment thread packages/publisher/src/metadata.ts Outdated
* filling `agents/_meta` without bound. The exception is derived internally
* from this id, so it is never a caller-controlled meta-suppression knob.
*/
contextGraphId: string;

🔴 Bug: restateLabelGraphForUpdate is exported publicly from packages/publisher/src/index.ts, but this PR adds a required contextGraphId option without updating existing direct callers or preserving compatibility. Several in-repo tests still call it without this field (for example packages/publisher/test/multi-root-token-rows.test.ts:211, packages/publisher/test/materialization-lock.test.ts:159, and packages/random-sampling/test/ka-extractor.test.ts:799), so type-checking consumers will break; plain JS callers will pass undefined, causing the agents-registry skip to be bypassed and reintroducing _meta writes. Make this optional with a safe derivation/default, or update every caller and treat the API break deliberately.

restateLabelGraphForUpdate is exported from the publisher package, and this change makes contextGraphId a new required option. Existing direct callers still use the old shape (for example the publisher/random-sampling test suites), and external TypeScript consumers will now fail to compile even though dataGraph/metaGraph already identify the context graph. Preserve the existing contract by deriving the agents-registry case from the graph URI, or make the new field optional with a default non-agents path and update call sites separately.
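
One way to read the two suggestions in this thread is sketched below: keep the new field optional and, when it is omitted, fall back to deriving the agents case from the meta graph URI. This is an illustration only. `RestateOptions`, `isAgentsRestatement`, and the `agents` id are hypothetical, and the real option type of restateLabelGraphForUpdate may have a different shape.

```typescript
// The agents meta graph URI as it appears in the review thread.
const AGENTS_META_GRAPH = "did:dkg:context-graph:agents/_meta";

// Hypothetical option shape: dataGraph/metaGraph are the pre-existing fields,
// contextGraphId is the new field, kept optional for backward compatibility.
interface RestateOptions {
  dataGraph: string;
  metaGraph: string;
  contextGraphId?: string;
}

// Assumed id of the agents registry context graph.
const AGENTS_CONTEXT_GRAPH_ID = "agents";

function isAgentsRestatement(opts: RestateOptions): boolean {
  // Prefer the explicit id when the caller supplies it...
  if (opts.contextGraphId !== undefined) {
    return opts.contextGraphId === AGENTS_CONTEXT_GRAPH_ID;
  }
  // ...otherwise derive the answer from the graph URI the caller already passes.
  return opts.metaGraph === AGENTS_META_GRAPH;
}

// An old-shape caller (no contextGraphId) targeting a normal context graph.
const legacyNormal = isAgentsRestatement({
  dataGraph: "did:dkg:context-graph:my-project",
  metaGraph: "did:dkg:context-graph:my-project/_meta",
});

// An old-shape caller that happens to target the agents registry.
const legacyAgents = isAgentsRestatement({
  dataGraph: "did:dkg:context-graph:agents",
  metaGraph: AGENTS_META_GRAPH,
});
```

With a derivation like this, callers using the old option shape still compile, and an omitted id does not silently bypass the agents rule.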

// #1233 — the agents registry CG never persists `_meta` (no consumer; it
// is served from the data graph). `metadataQuads` is still generated for
// the expire-timeout cleanup signature below, but skip the actual write.
if (!isAgentRegistryContextGraph(contextGraphId)) {

🟡 Issue: The added validation does not exercise this modified receive-side PublishHandler path, nor the analogous UpdateHandler restatement path at packages/publisher/src/update-handler.ts:270. The new tests cover local DKGPublisher publish/update, ProfileManager.publishProfile, and GossipPublishHandler, so a regression in stream receive handling or contextGraphId plumbing for agents-registry publish/update messages would still pass. Add targeted handler tests that assert agents _meta stays empty and a normal CG still writes _meta for both receive publish and receive update.

Jurij89 force-pushed the fix/agents-meta-bloat branch from cfe9d78 to a0ac7cb (June 19, 2026 01:33)

Jurij89 commented Jun 19, 2026

Contributor Author

Pushed a0ac7cb3d — both addressed.

  1. 🔴 No longer a breaking change. restateLabelGraphForUpdate's contextGraphId is now optional — omitting it runs the standard _meta-writing restatement, so the existing in-repo callers (multi-root-token-rows, materialization-lock) and the external random-sampling consumer compile and behave exactly as before; my two agents-reachable update sites pass it. Verified: those flagged suites + random-sampling/ka-extractor (26 tests) all green.

  2. Receive-side handler tests added. New packages/publisher/test/handler-meta-skip.test.ts drives both receive paths directly — PublishHandler (the direct /dkg/.../publish protocol) and UpdateHandler (gossip receive/restatement). Each asserts agents _meta stays empty AND a normal CG still writes _meta (the control proves the skip is agents-specific). 163 publisher tests pass.

Comment thread packages/publisher/src/metadata.ts Outdated
// `_meta` re-stamp + merkleRoot + materialized-version writes, so the profile
// heartbeat's update() path doesn't resume filling agents/_meta. Callers that
// omit `contextGraphId` (external consumers) keep the standard write path.
if (contextGraphId && isAgentRegistryContextGraph(contextGraphId)) return true;

🔴 Bug: Skipping the agents _meta restamp leaves restateLabelGraphForUpdate without a current source of prior root entities. The cleanup above still discovers old roots only from metaGraph, so after an agents update that changes rootEntity (wallet/key rotation or legacy peer-id form to address form), replicas can leave the previous profile subject in the agents data graph and registry discovery may show stale/duplicate agents. The local ProfileManager has a peerId data-graph cleanup, but receive-side updates do not. Add an agents-specific data-graph cleanup/index before this return, or keep a bounded per-agent root mapping instead of per-publish _meta.

}

// ── Tentative lifecycle timeout ──
const timeout = setTimeout(

🔴 Bug: The agents-registry branch skips writing tentative _meta, but this direct receive path still registers the publish as pending and schedules the normal tentative timeout. Because agents publishes never confirm on-chain, expireTentativePublish will not find a confirmed status and will delete the just-stored profile data after TENTATIVE_TIMEOUT_MS; restored journal entries have the same cleanup behavior. Skip the pending/timeout lifecycle for isAgentRegistryContextGraph(contextGraphId) or make expiry a no-op for that CG, and add a timer/journal regression test.
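
The hazard can be reproduced with a toy model of the tentative lifecycle. This is a sketch under stated assumptions, not the PublishHandler implementation: `TentativeLifecycle`, its methods, and the synchronous `expireAll()` (standing in for the TENTATIVE_TIMEOUT_MS timer) are all hypothetical.

```typescript
type Status = "tentative" | "confirmed";

// Toy model: received data is stored immediately, then deleted at expiry
// unless the publish was confirmed in the meantime.
class TentativeLifecycle {
  private pending = new Map<string, Status>();
  readonly data = new Map<string, string[]>();
  private skipLifecycleFor: (cg: string) => boolean;

  constructor(skipLifecycleFor: (cg: string) => boolean) {
    this.skipLifecycleFor = skipLifecycleFor;
  }

  receive(cg: string, ual: string, quads: string[]): void {
    this.data.set(ual, quads);
    // Graphs that never confirm must not enter the pending/timeout lifecycle.
    if (this.skipLifecycleFor(cg)) return;
    this.pending.set(ual, "tentative");
  }

  confirm(ual: string): void {
    if (this.pending.has(ual)) this.pending.set(ual, "confirmed");
  }

  // Stands in for the timeout firing: unconfirmed publishes lose their data.
  expireAll(): void {
    this.pending.forEach((status, ual) => {
      if (status !== "confirmed") this.data.delete(ual);
    });
    this.pending.clear();
  }
}

// Without a guard, an agents publish (which never confirms) is expired.
const unguarded = new TentativeLifecycle(() => false);
unguarded.receive("agents", "ual-1", ["profile-quad"]);
unguarded.expireAll();

// With the guard suggested in the review, the profile data survives expiry.
const guarded = new TentativeLifecycle((cg) => cg === "agents");
guarded.receive("agents", "ual-1", ["profile-quad"]);
guarded.expireAll();
```

The model only shows why skipping the `_meta` write alone is not enough on this path: the expiry step also has to know that the agents registry never confirms.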

const pub = await publisher.publish({ contextGraphId: cg, quads: profileQuads(cg) });
await publisher.update(pub.kaId, { contextGraphId: cg, quads: profileQuads(cg) });

expect(

🟡 Issue: This normal-CG update control is not isolating the update path. The preceding publish() already writes _meta, so countQuads(metaOf(cg)) > 0 would still pass if the update/restatement path stopped writing normal-CG metadata entirely. That leaves the regression guard for the new contextGraphId propagation ineffective. Capture the meta count after publish and assert it changes in an update-specific way, or use an update payload/metadata assertion that can only be produced by restateLabelGraphForUpdate.
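
The difference between the two assertions can be shown with a toy store. This sketch assumes an update adds at least one metadata quad, which may not match how the real restatement rewrites `_meta`; `CountingStore`, `fakePublish`, and `fakeUpdate` are hypothetical helpers, not the PR's test code.

```typescript
// Toy store that only tracks quad counts per graph (illustration only).
class CountingStore {
  private counts = new Map<string, number>();

  add(graph: string, n: number): void {
    this.counts.set(graph, (this.counts.get(graph) ?? 0) + n);
  }

  count(graph: string): number {
    return this.counts.get(graph) ?? 0;
  }
}

const META = "did:dkg:context-graph:normal-cg/_meta";

// Hypothetical publish: always writes some tentative metadata.
function fakePublish(store: CountingStore): void {
  store.add(META, 3);
}

// Hypothetical update whose metadata write can be disabled to simulate a regression.
function fakeUpdate(store: CountingStore, writesMeta: boolean): void {
  if (writesMeta) store.add(META, 1);
}

function runChecks(updateWritesMeta: boolean): { weak: boolean; isolated: boolean } {
  const store = new CountingStore();
  fakePublish(store);
  const afterPublish = store.count(META); // baseline captured before update
  fakeUpdate(store, updateWritesMeta);
  const afterUpdate = store.count(META);
  return {
    weak: afterUpdate > 0, // passes even if update writes nothing
    isolated: afterUpdate > afterPublish, // passes only if update itself wrote
  };
}

const healthy = runChecks(true);
const regressed = runChecks(false);
```

In the regressed run the weak check still passes because of the earlier publish, while the isolated check fails, which is the gap the comment points out.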

…eats (#1233)

The agents system context graph never confirms on-chain, so every agent-profile
publish — a 5-minute heartbeat, fanned out across peers via gossip — wrote a
tentative `_meta` tracking record under a fresh, never-repeated UAL, and nothing
ever pruned superseded ones. agents/_meta grew without bound (112k quads / ~386
records per agent on a long-lived node), making the CG expensive to serve and
stalling offset-0 sync — the agents-CG-only slice of #1221.

The per-publish `_meta` record has no consumer for the agents registry: agent
facts are served from the DATA graph (ContextGraphMetaProjection never
enumerates the per-KA `_meta` index) and the CG never confirms, so the record is
write-only. A consumer audit confirmed the DATA graph itself IS load-bearing
(cross-peer SWM encryption keys, peer discovery, curator/membership sync, the
agent directory) and must stay; only the `_meta` record is removable.

Skip persisting it at the two highest-volume, side-effect-free paths, gated on a
single shared `dkg-core` predicate `isAgentRegistryContextGraph`:
- the gossip publish receiver — the DOMINANT source, re-inserting a stub per
  received peer heartbeat (the offset-0 storm);
- the local publish terminal.

The data graph is always written; every other context graph is unaffected.

Scoped deliberately: the update-restatement and direct-protocol receive paths
are NOT skipped here — there the `_meta` is load-bearing (it is the source for
prior-root cleanup on rootEntity change, and it drives the tentative-expiry
lifecycle), so simply skipping it would regress those. Bounding/pruning the
record on those paths (and a one-time prune of existing stores) is tracked as a
follow-up in #1233.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01G6SgdmcAV3AfpWQ3mYiFx6

Jurij89 force-pushed the fix/agents-meta-bloat branch from a0ac7cb to 41ce51f (June 19, 2026 01:54)

Jurij89 commented Jun 19, 2026

Contributor Author

Reworked to address the load-bearing-_meta issues you found (a0ac7cb3d → 41ce51fcb).

You're right that on the update-restatement and direct-protocol receive paths the _meta is load-bearing — it's the prior-root source for the data-graph cleanup, and it drives the tentative-expiry lifecycle — so skipping it there regresses both (your two 🔴s). Rather than bolt agent-specific cleanup + lifecycle special-casing into those generic publisher paths, I've descoped this PR to the dominant, side-effect-free source:

Kept (gated on isAgentRegistryContextGraph):

  • the gossip publish receiver (gossip-publish-handler.ts) — the dominant per-peer-heartbeat source and the actual offset-0 storm driver; the agent serves agent facts from the data graph, so its tentative stub has no consumer and no lifecycle/prior-root tie.
  • the local publish terminal (dkg-publisher.ts).

Reverted to base (was the source of both 🔴s): the update-restatement (metadata.ts / DKGPublisher.update), the gossip UpdateHandler, and the direct-protocol PublishHandler. The genesis comment is narrowed accordingly and the update/handler tests are dropped.

Follow-up (#1233): bound/prune the per-agent record on those remaining paths (rather than skip it, so prior-root cleanup + lifecycle keep working), plus a one-time prune of existing stores.

Tests: gossip handler 13/13, publisher opt-out 2/2, ProfileManager 1/1; the reverted suites are back to base, all green.

@otReviewAgent otReviewAgent left a comment

Review Agent completed this review and found no issues.


2 participants