fix(weka): subagents share parent trace's hash_id scope #7

Merged
cquil11 merged 1 commit into SemiAnalysisAI:cjq/agentx-v0.3 from ajcasagrande:acasagrande/weka-subagent-shared-hash-scope on Jun 11, 2026

ajcasagrande commented Jun 11, 2026

What

A weka trace declares hash_id_scope: "local", i.e. one hash_id namespace per trace FILE. The reconstruction loops scoped subagent (spawn-child) decode on the child session_id instead of the parent trace_id, so a hash_id reused across the parent and a subagent (or across sibling subagents) decoded to different synthetic tokens. That destroyed the cross-agent shared prefix the real workload serves from KV cache, deflating replayed prefix-cache hit rates.

The fix is two lines — scope child decode on parent_trace_id in both reconstruction paths:

  • serial: weka_trace.py _reconstruct_serial
  • parallel: weka_parallel_convert.py _process_task

Plus MANIFEST_VERSION 5 → 6 (see "Cache" below).

Bug present since the original AgentX/Weka feature commit (91b3cde89, 2026-05-01); not a later regression.
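
To make the failure mode concrete, here is a minimal sketch of scope-keyed decoding. It is not the loader's code: `decode_block`, the seeding scheme, and the IDs are all hypothetical stand-ins, chosen only to show that a decoder keyed on (scope, hash_id) diverges under per-child scope and agrees under parent scope.

```python
import hashlib
import random


def decode_block(scope_id: str, hash_id: int, n_tokens: int = 4) -> list:
    """Hypothetical decoder: tokens are a pure function of (scope_id, hash_id)."""
    seed = hashlib.sha256(f"{scope_id}:{hash_id}".encode()).digest()
    rng = random.Random(int.from_bytes(seed[:8], "big"))
    return [rng.randrange(32_000) for _ in range(n_tokens)]


# Illustrative IDs only; none of these come from the corpus.
parent_trace_id = "trace-A"
child_session_ids = ["sub-1", "sub-2"]
shared_hash_id = 10

# Buggy scoping: each child decodes under its own session_id.
buggy = [decode_block(sid, shared_hash_id) for sid in child_session_ids]

# Fixed scoping: every child decodes under the parent's trace_id.
fixed = [decode_block(parent_trace_id, shared_hash_id) for _ in child_session_ids]
parent = decode_block(parent_trace_id, shared_hash_id)
```

Under the buggy scoping the two siblings produce unrelated token lists for the same block; under the fixed scoping both match the parent, which is what a prefix cache needs in order to register a hit.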

Why this is correct — empirical proof from the 060826 corpus

Verified against semianalysisai/cc-traces-weka-with-subagents-060826 (391 traces, 107 with subagents, 615 subagents total). The question is whether the same hash_id is genuinely one shared block across parent + subagents (→ shared scope is correct), or whether each conversation renumbers from 0 (→ per-conversation scope would be correct). The data settles it:

1. Declared contract — uniform "local". All 391/391 rows declare hash_id_scope: "local" (the schema permits nothing else).

2. Single shared first-appearance sequence (positive proof). Walking each trace's requests in document order (parent requests + subagent inner requests inlined) with a running counter, every first-appearance hash_id must equal the counter (0,1,2,…,N) under one shared namespace; a per-conversation reset would surface an unseen ordinal below the counter (a "fresh-below" event — the reset fingerprint).

traces with a PERFECT 0,1,2,..,N first-appearance sequence : 391/391
  'fresh-below' events (RESET fingerprint)                  : 0
  'gap' events (out-of-order assignment, not a reset)       : 0

Zero fresh-below events anywhere → the hash_id space is one global counter shared by parent and subagents. (Zero gaps → assignment order is exactly document order.)
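
The analysis script itself is not part of this PR, so the following is only one possible reading of the walk described above. The function name and the exact event definitions are assumptions: the counter tracks the highest ordinal seen so far plus one, an unseen id equal to the counter is in order, an unseen id above it is a gap, and an unseen id below it is a fresh-below event.

```python
from collections.abc import Iterable


def first_appearance_events(requests: Iterable[Iterable[int]]) -> dict:
    """Classify each first-appearance hash_id in document order.

    `requests` is the trace flattened in document order (parent requests
    with subagent inner requests inlined), each item a list of hash_ids.
    """
    seen = set()
    next_expected = 0  # assumption: highest ordinal seen so far, plus one
    events = {"in_order": 0, "fresh_below": 0, "gap": 0}
    for hash_ids in requests:
        for hid in hash_ids:
            if hid in seen:
                continue  # reuse of a known block, not a first appearance
            if hid == next_expected:
                events["in_order"] += 1
            elif hid < next_expected:
                events["fresh_below"] += 1  # unseen ordinal below the counter
            else:
                events["gap"] += 1  # ordinal skipped ahead of the counter
            seen.add(hid)
            next_expected = max(next_expected, hid + 1)
    return events


# Toy traces, not corpus data.
perfect = first_appearance_events([[0, 1, 2], [0, 1, 3], [2, 4]])
disordered = first_appearance_events([[0, 1, 5], [2]])
```

A trace counts as "perfect" in this sketch when both `fresh_below` and `gap` are zero, which matches the 391/391 result reported above.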

3. A reset namespace must begin at 0 (disproof of reset). A subagent's first model call has no prior context to evict, so under a from-zero renumbering its first request would start at ordinal 0. Observed:

subagents: 615
  first-call first-block == 0                          : 4/615
  first-call first-block min / median / p90 / max      : 0 / 1824 / 25361 / 93068
  subagents whose own set == {0..max} (reset signature): 0/615

The median subagent's first call starts at block 1824 — impossible in a private namespace (there is no block 0..1823 in it yet). It can only reference 1824 because that block already exists in the shared trace namespace (inherited parent/sibling context). The reset signature (dense-from-0 own set) is entirely absent.
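
The two checks in this step are simple enough to sketch. The helper names below are hypothetical, and the inputs are toy values rather than corpus rows; the sketch assumes each subagent is available as a list of calls, each call a list of hash_ids.

```python
from typing import Optional


def has_reset_signature(own_block_ids: set) -> bool:
    """True if the subagent's own blocks are exactly {0, 1, ..., max}."""
    if not own_block_ids:
        return False
    return own_block_ids == set(range(max(own_block_ids) + 1))


def first_call_first_block(calls: list) -> Optional[int]:
    """First hash_id of the subagent's first model call, if there is one."""
    if calls and calls[0]:
        return calls[0][0]
    return None


# Toy subagent that starts deep inside an existing namespace.
inherited_calls = [[1824, 1825, 1900], [1824, 1825, 1900, 1901]]
inherited_blocks = {hid for call in inherited_calls for hid in call}

# Toy subagent that would look like a from-zero renumbering.
reset_calls = [[0, 1, 2], [0, 1, 2, 3]]
reset_blocks = {hid for call in reset_calls for hid in call}
```

Only the second toy subagent carries the reset signature; the corpus contains no subagent of that shape (0/615).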

4. Blast radius — sibling sharing dominates. Fraction of each subagent's blocks that are shared (and that the per-child-scope bug decoded to different tokens):

| Sharing | Traces affected | Per-subagent share |
|---|---|---|
| subagent ↔ parent | 24/107 | mean 8%, median 0% |
| subagent ↔ sibling | 74/107 | mean 30%, median 25% |

Parallel subagents spawned from the same turn inherit the same context blocks, so cross-sibling sharing (69% of subagent traces) is the dominant case the bug corrupts. The fix puts the parent and all siblings on one scope, repairing both rows at once.
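
For reference, the per-subagent share can be computed as a plain set overlap. This is a sketch under the assumption that each agent's blocks are available as a set of hash_ids; `shared_fraction` and the toy sets are illustrative, not taken from the analysis.

```python
def shared_fraction(own: set, others: set) -> float:
    """Fraction of a subagent's blocks that also appear in `others`."""
    if not own:
        return 0.0
    return len(own & others) / len(own)


# Toy block sets, not corpus data.
parent_blocks = set(range(10))   # blocks 0..9
sub_a = {8, 9, 10, 11}
sub_b = {10, 11, 12, 13}

a_vs_parent = shared_fraction(sub_a, parent_blocks)   # blocks 8 and 9
a_vs_sibling = shared_fraction(sub_a, sub_b)          # blocks 10 and 11
b_vs_parent = shared_fraction(sub_b, parent_blocks)   # no overlap
```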

5. tool_tokens / system_tokens are absent at the trace level (default 0), so the setup prefix is derived empirically, not from the header — consistent with the shared-namespace model.

Cache: why MANIFEST_VERSION had to bump

The mmap dataset cache key is (input bytes, tokenizer, prompt/input settings, MANIFEST_VERSION) with no source-code component. This fix changes the decoded content produced for an unchanged key, so without a bump a warm cache built by the buggy loader keeps serving the old (wrong) dataset even after the fix lands. 5 → 6 invalidates all pre-fix entries (treated as a MISS; a miss still produces byte-identical mmap files to a non-cached run). The version-bump convention comment was widened to call out content-semantics changes, not just on-disk layout/schema.
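
A minimal sketch of why the bump is needed, assuming a key built by hashing the four listed components. The `cache_key` function, its argument names, and the hashing scheme are hypothetical and do not reproduce the real cache implementation; the point is only that nothing in the key changes when loader code changes, so the version is the sole lever.

```python
import hashlib
import json


def cache_key(input_bytes: bytes, tokenizer: str, settings: dict,
              manifest_version: int) -> str:
    """Hypothetical cache key: note there is no source-code component."""
    h = hashlib.sha256()
    h.update(input_bytes)
    h.update(tokenizer.encode())
    h.update(json.dumps(settings, sort_keys=True).encode())
    h.update(str(manifest_version).encode())
    return h.hexdigest()


# Placeholder inputs for illustration.
trace = b'{"hash_id_scope": "local"}'
settings = {"prompt_len": 128}

key_before_fix = cache_key(trace, "toy-tokenizer", settings, manifest_version=5)
key_same_inputs = cache_key(trace, "toy-tokenizer", settings, manifest_version=5)
key_after_bump = cache_key(trace, "toy-tokenizer", settings, manifest_version=6)
```

With the version held at 5, the fixed loader would compute the same key as the buggy one and hit the stale entry; at 6 the key differs and the lookup misses.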

Tests

  • Unit (tests/unit/dataset/loader/test_weka_trace.py): parent↔child and sibling↔sibling shared-block scope. Wired with the real scope-sensitive HashIdRandomGenerator — the existing stub_hash_id_corpus_rng ignores set_trace_id, so a stubbed RNG decodes identically under any scope and cannot catch this class of bug. Confirmed RED on the buggy code (same block → different tokens per scope), GREEN after the fix.
  • Integration (tests/integration/test_weka_hash_id_scope.py): a crafted scope-stressing weka trace (parent + two sibling subagents all referencing [10,11,12]) run end-to-end through aiperf profile against the mock server with --export-level raw. Asserts the actual on-the-wire payloads of the two siblings (and parent vs child) are byte-identical. Pins the mmap cache to a per-test dir so a stale cross-version cached dataset can't mask the regression — this was a real false-green during development (the cache replayed a fixed dataset under the buggy code).
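
The stub problem noted in the unit-test bullet can be shown with two toy generators. Both classes are hypothetical stand-ins, not the real `HashIdRandomGenerator` or the conftest stub; they only borrow the `set_trace_id` method name, and `tokens_for` is an invented method for illustration.

```python
import random


class ToyStubRng:
    """Ignores the scope, like a stub that drops set_trace_id."""

    def set_trace_id(self, trace_id: str) -> None:
        pass

    def tokens_for(self, hash_id: int) -> list:
        return [hash_id] * 3


class ToyScopedRng:
    """Scope-sensitive: tokens depend on (trace_id, hash_id)."""

    def __init__(self) -> None:
        self._scope = ""

    def set_trace_id(self, trace_id: str) -> None:
        self._scope = trace_id

    def tokens_for(self, hash_id: int) -> list:
        rng = random.Random(f"{self._scope}:{hash_id}")
        return [rng.randrange(32_000) for _ in range(3)]


def siblings_match(rng, scopes: list, hash_id: int = 10) -> bool:
    """Decode one shared block once per sibling scope and compare results."""
    decoded = []
    for scope in scopes:
        rng.set_trace_id(scope)
        decoded.append(rng.tokens_for(hash_id))
    return all(d == decoded[0] for d in decoded)


buggy_scopes = ["sub-1", "sub-2"]      # per-child session_id
fixed_scopes = ["trace-A", "trace-A"]  # shared parent trace_id

stub_on_buggy = siblings_match(ToyStubRng(), buggy_scopes)
scoped_on_buggy = siblings_match(ToyScopedRng(), buggy_scopes)
scoped_on_fixed = siblings_match(ToyScopedRng(), fixed_scopes)
```

The stub reports a match even under buggy scoping, so a test built on it stays green either way; only the scope-sensitive generator distinguishes the two cases.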

Verification: full loader unit suite green (incl. all serial↔parallel byte-identity + reproducibility tests), mmap-cache tests green, integration green and confirmed RED without the fix; ruff + all pre-commit hooks clean.

🤖 Generated with Claude Code


Note

Medium Risk
Changes decoded prompts for Weka traces with subagents (prefix-cache replay semantics) and invalidates mmap cache v5 entries; scope is narrow to Weka loader paths with strong regression tests.

Overview
Fixes Weka hash_id_scope: "local" for subagents so reused hash_id blocks decode to the same synthetic tokens on the parent and every spawn child (including siblings), matching one namespace per trace file and restoring cross-agent shared-prefix replay for prefix-cache metrics.

Serial reconstruction in weka_trace.py and parallel workers in weka_parallel_convert.py now call set_trace_id / _make_scope_helpers with parent_trace_id instead of the child session_id. MANIFEST_VERSION is bumped to 6 so mmap dataset caches miss on old wrong decodes (cache keys omit source version).

Tests: unit cases with the real scope-sensitive HashIdRandomGenerator (parent↔child and sibling sharing), plus an integration run through aiperf profile that asserts byte-identical on-wire prompts for shared blocks.

Reviewed by Cursor Bugbot for commit 3e0cdec.

A weka trace declares ``hash_id_scope: "local"`` -- one hash_id namespace
per trace FILE. The reconstruction loops scoped subagent (spawn-child)
decode on the child ``session_id`` instead of the parent ``trace_id``, so a
hash_id reused across the parent and a subagent (or across sibling
subagents) decoded to DIFFERENT synthetic tokens. That destroyed the
cross-agent shared prefix the real workload serves from KV cache, deflating
replayed prefix-cache hit rates.

Empirically (060826 corpus, 391 traces): all rows declare scope "local" and
every subagent re-enters the parent's contiguous 0..N hash_id sequence;
cross-sibling block sharing is the dominant reuse mode (74/107 subagent
traces, ~25-30% of blocks). Bug present since the original feature commit.

Fix: scope child decode on ``parent_trace_id`` in both the serial
(_reconstruct_serial) and parallel (weka_parallel_convert._process_task)
paths so shared blocks decode identically.

Bump MANIFEST_VERSION 5 -> 6: the mmap dataset cache key has no source-code
component, so without this a warm cache built by the buggy loader keeps
serving the old (wrong) dataset for the same input bytes/tokenizer/settings
even after the fix. Bumping invalidates pre-fix entries (treated as a MISS).

Tests:
- unit: parent<->child and sibling<->sibling shared-block scope, wired with
  the REAL scope-sensitive HashIdRandomGenerator (the conftest stub ignores
  set_trace_id and cannot catch this).
- integration: a crafted scope-stressing weka trace run end-to-end against
  the mock server; asserts the actual on-the-wire payloads of the two
  sibling subagents (and parent) are identical. Pins the mmap dataset cache
  to a per-test dir so a stale cross-version cached dataset can't mask the
  regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
@github-actions

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3e0cdecbab3ce47ac32bd12215b04848a005043d

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3e0cdecbab3ce47ac32bd12215b04848a005043d

Last updated for commit: 3e0cdec

@cquil11 cquil11 merged commit 3c19e05 into SemiAnalysisAI:cjq/agentx-v0.3 Jun 11, 2026
2 of 5 checks passed