fix(weka): subagents share parent trace's hash_id scope by ajcasagrande · Pull Request #7 · SemiAnalysisAI/aiperf

ajcasagrande · 2026-06-11T07:16:04Z

What

A weka trace declares hash_id_scope: "local" — one hash_id namespace per trace FILE. The reconstruction loops scoped subagent (spawn-child) decode on the child session_id instead of the parent trace_id, so a hash_id reused across the parent and a subagent (or across sibling subagents) decoded to different synthetic tokens. That destroyed the cross-agent shared prefix the real workload serves from KV cache, deflating replayed prefix-cache hit rates.

The fix is two lines — scope child decode on parent_trace_id in both reconstruction paths:

serial: weka_trace.py _reconstruct_serial
parallel: weka_parallel_convert.py _process_task

Plus MANIFEST_VERSION 5 → 6 (see "Cache" below).

Bug present since the original AgentX/Weka feature commit (91b3cde89, 2026-05-01); not a later regression.

Why this is correct — empirical proof from the 060826 corpus

Verified against semianalysisai/cc-traces-weka-with-subagents-060826 (391 traces, 107 with subagents, 615 subagents total). The question is whether the same hash_id is genuinely one shared block across parent + subagents (→ shared scope is correct), or whether each conversation renumbers from 0 (→ per-conversation scope would be correct). The data settles it:

1. Declared contract — uniform "local". All 391/391 rows declare hash_id_scope: "local" (the schema permits nothing else).

2. Single shared first-appearance sequence (positive proof). Walking each trace's requests in document order (parent requests + subagent inner requests inlined) with a running counter, every first-appearance hash_id must equal the counter (0,1,2,…,N) under one shared namespace; a per-conversation reset would surface an unseen ordinal below the counter (a "fresh-below" event — the reset fingerprint).

traces with a PERFECT 0,1,2,..,N first-appearance sequence : 391/391
  'fresh-below' events (RESET fingerprint)                  : 0
  'gap' events (out-of-order assignment, not a reset)       : 0

Zero fresh-below events anywhere → the hash_id space is one global counter shared by parent and subagents. (Zero gaps → assignment order is exactly document order.)

3. A reset namespace must begin at 0 (disproof of reset). A subagent's first model call has no prior context to evict, so under a from-zero renumbering its first request would start at ordinal 0. Observed:

subagents: 615
  first-call first-block == 0                          : 4/615
  first-call first-block min / median / p90 / max      : 0 / 1824 / 25361 / 93068
  subagents whose own set == {0..max} (reset signature): 0/615

The median subagent's first call starts at block 1824 — impossible in a private namespace (there is no block 0..1823 in it yet). It can only reference 1824 because that block already exists in the shared trace namespace (inherited parent/sibling context). The reset signature (dense-from-0 own set) is entirely absent.

4. Blast radius — sibling sharing dominates. Fraction of each subagent's blocks that are shared (and that the per-child-scope bug decoded to different tokens):

Sharing	Traces affected	Per-subagent share
subagent ↔ parent	24/107	mean 8%, median 0%
subagent ↔ sibling	74/107	mean 30%, median 25%

Parallel subagents spawned from the same turn inherit the same context blocks, so cross-sibling sharing (69% of subagent traces) is the dominant case the bug corrupts. The fix puts the parent and all siblings on one scope, repairing both rows at once.

5. tool_tokens / system_tokens are absent at the trace level (default 0), so the setup prefix is derived empirically, not from the header — consistent with the shared-namespace model.

Cache: why `MANIFEST_VERSION` had to bump

The mmap dataset cache key is (input bytes, tokenizer, prompt/input settings, MANIFEST_VERSION) with no source-code component. This fix changes the decoded content produced for an unchanged key, so without a bump a warm cache built by the buggy loader keeps serving the old (wrong) dataset even after the fix lands. 5 → 6 invalidates all pre-fix entries (treated as a MISS; a miss still produces byte-identical mmap files to a non-cached run). The version-bump convention comment was widened to call out content-semantics changes, not just on-disk layout/schema.

Tests

Unit (tests/unit/dataset/loader/test_weka_trace.py): parent↔child and sibling↔sibling shared-block scope. Wired with the real scope-sensitive HashIdRandomGenerator — the existing stub_hash_id_corpus_rng ignores set_trace_id, so a stubbed RNG decodes identically under any scope and cannot catch this class of bug. Confirmed RED on the buggy code (same block → different tokens per scope), GREEN after the fix.
Integration (tests/integration/test_weka_hash_id_scope.py): a crafted scope-stressing weka trace (parent + two sibling subagents all referencing [10,11,12]) run end-to-end through aiperf profile against the mock server with --export-level raw. Asserts the actual on-the-wire payloads of the two siblings (and parent vs child) are byte-identical. Pins the mmap cache to a per-test dir so a stale cross-version cached dataset can't mask the regression — this was a real false-green during development (the cache replayed a fixed dataset under the buggy code).

Verification: full loader unit suite green (incl. all serial↔parallel byte-identity + reproducibility tests), mmap-cache tests green, integration green and confirmed RED without the fix; ruff + all pre-commit hooks clean.

🤖 Generated with Claude Code

Note

Medium Risk
Changes decoded prompts for Weka traces with subagents (prefix-cache replay semantics) and invalidates mmap cache v5 entries; scope is narrow to Weka loader paths with strong regression tests.

Overview
Fixes Weka hash_id_scope: "local" for subagents so reused hash_id blocks decode to the same synthetic tokens on the parent and every spawn child (including siblings), matching one namespace per trace file and restoring cross-agent shared-prefix replay for prefix-cache metrics.

Serial reconstruction in weka_trace.py and parallel workers in weka_parallel_convert.py now call set_trace_id / _make_scope_helpers with parent_trace_id instead of the child session_id. MANIFEST_VERSION is bumped to 6 so mmap dataset caches miss on old wrong decodes (cache keys omit source version).

Tests: unit cases with the real scope-sensitive HashIdRandomGenerator (parent↔child and sibling sharing), plus an integration run through aiperf profile that asserts byte-identical on-wire prompts for shared blocks.

^{Reviewed by Cursor Bugbot for commit 3e0cdec. Bugbot is set up for automated code reviews on this repo. Configure here.}

A weka trace declares ``hash_id_scope: "local"`` -- one hash_id namespace per trace FILE. The reconstruction loops scoped subagent (spawn-child) decode on the child ``session_id`` instead of the parent ``trace_id``, so a hash_id reused across the parent and a subagent (or across sibling subagents) decoded to DIFFERENT synthetic tokens. That destroyed the cross-agent shared prefix the real workload serves from KV cache, deflating replayed prefix-cache hit rates. Empirically (060826 corpus, 391 traces): all rows declare scope "local" and every subagent re-enters the parent's contiguous 0..N hash_id sequence; cross-sibling block sharing is the dominant reuse mode (74/107 subagent traces, ~25-30% of blocks). Bug present since the original feature commit. Fix: scope child decode on ``parent_trace_id`` in both the serial (_reconstruct_serial) and parallel (weka_parallel_convert._process_task) paths so shared blocks decode identically. Bump MANIFEST_VERSION 5 -> 6: the mmap dataset cache key has no source-code component, so without this a warm cache built by the buggy loader keeps serving the old (wrong) dataset for the same input bytes/tokenizer/settings even after the fix. Bumping invalidates pre-fix entries (treated as a MISS). Tests: - unit: parent<->child and sibling<->sibling shared-block scope, wired with the REAL scope-sensitive HashIdRandomGenerator (the conftest stub ignores set_trace_id and cannot catch this). - integration: a crafted scope-stressing weka trace run end-to-end against the mock server; asserts the actual on-the-wire payloads of the two sibling subagents (and parent) are identical. Pins the mmap dataset cache to a per-test dir so a stale cross-version cached dataset can't mask the regression. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

github-actions · 2026-06-11T07:16:14Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3e0cdecbab3ce47ac32bd12215b04848a005043d

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3e0cdecbab3ce47ac32bd12215b04848a005043d

Last updated for commit: 3e0cdec • Browse code

cquil11 merged commit 3c19e05 into SemiAnalysisAI:cjq/agentx-v0.3 Jun 11, 2026
2 of 5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(weka): subagents share parent trace's hash_id scope#7

fix(weka): subagents share parent trace's hash_id scope#7
cquil11 merged 1 commit into
SemiAnalysisAI:cjq/agentx-v0.3from
ajcasagrande:acasagrande/weka-subagent-shared-hash-scope

ajcasagrande commented Jun 11, 2026 •

edited by cursor Bot

Loading

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ajcasagrande commented Jun 11, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why this is correct — empirical proof from the 060826 corpus

Cache: why MANIFEST_VERSION had to bump

Tests

Uh oh!

github-actions Bot commented Jun 11, 2026

Try out this PR

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ajcasagrande commented Jun 11, 2026 •

edited by cursor Bot

Loading

Cache: why `MANIFEST_VERSION` had to bump