fix(weka): subagents share parent trace's hash_id scope#7
Merged
cquil11 merged 1 commit intoJun 11, 2026
Conversation
A weka trace declares ``hash_id_scope: "local"`` -- one hash_id namespace per trace FILE. The reconstruction loops scoped subagent (spawn-child) decode on the child ``session_id`` instead of the parent ``trace_id``, so a hash_id reused across the parent and a subagent (or across sibling subagents) decoded to DIFFERENT synthetic tokens. That destroyed the cross-agent shared prefix the real workload serves from KV cache, deflating replayed prefix-cache hit rates. Empirically (060826 corpus, 391 traces): all rows declare scope "local" and every subagent re-enters the parent's contiguous 0..N hash_id sequence; cross-sibling block sharing is the dominant reuse mode (74/107 subagent traces, ~25-30% of blocks). Bug present since the original feature commit. Fix: scope child decode on ``parent_trace_id`` in both the serial (_reconstruct_serial) and parallel (weka_parallel_convert._process_task) paths so shared blocks decode identically. Bump MANIFEST_VERSION 5 -> 6: the mmap dataset cache key has no source-code component, so without this a warm cache built by the buggy loader keeps serving the old (wrong) dataset for the same input bytes/tokenizer/settings even after the fix. Bumping invalidates pre-fix entries (treated as a MISS). Tests: - unit: parent<->child and sibling<->sibling shared-block scope, wired with the REAL scope-sensitive HashIdRandomGenerator (the conftest stub ignores set_trace_id and cannot catch this). - integration: a crafted scope-stressing weka trace run end-to-end against the mock server; asserts the actual on-the-wire payloads of the two sibling subagents (and parent) are identical. Pins the mmap dataset cache to a per-test dir so a stale cross-version cached dataset can't mask the regression. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3e0cdecbab3ce47ac32bd12215b04848a005043dRecommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3e0cdecbab3ce47ac32bd12215b04848a005043dLast updated for commit: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
A weka trace declares
hash_id_scope: "local"— one hash_id namespace per trace FILE. The reconstruction loops scoped subagent (spawn-child) decode on the childsession_idinstead of the parenttrace_id, so a hash_id reused across the parent and a subagent (or across sibling subagents) decoded to different synthetic tokens. That destroyed the cross-agent shared prefix the real workload serves from KV cache, deflating replayed prefix-cache hit rates.The fix is two lines — scope child decode on
parent_trace_idin both reconstruction paths:weka_trace.py_reconstruct_serialweka_parallel_convert.py_process_taskPlus
MANIFEST_VERSION 5 → 6(see "Cache" below).Bug present since the original AgentX/Weka feature commit (
91b3cde89, 2026-05-01); not a later regression.Why this is correct — empirical proof from the 060826 corpus
Verified against
semianalysisai/cc-traces-weka-with-subagents-060826(391 traces, 107 with subagents, 615 subagents total). The question is whether the same hash_id is genuinely one shared block across parent + subagents (→ shared scope is correct), or whether each conversation renumbers from 0 (→ per-conversation scope would be correct). The data settles it:1. Declared contract — uniform
"local". All 391/391 rows declarehash_id_scope: "local"(the schema permits nothing else).2. Single shared first-appearance sequence (positive proof). Walking each trace's requests in document order (parent requests + subagent inner requests inlined) with a running counter, every first-appearance hash_id must equal the counter (
0,1,2,…,N) under one shared namespace; a per-conversation reset would surface an unseen ordinal below the counter (a "fresh-below" event — the reset fingerprint).Zero fresh-below events anywhere → the hash_id space is one global counter shared by parent and subagents. (Zero gaps → assignment order is exactly document order.)
3. A reset namespace must begin at 0 (disproof of reset). A subagent's first model call has no prior context to evict, so under a from-zero renumbering its first request would start at ordinal 0. Observed:
The median subagent's first call starts at block 1824 — impossible in a private namespace (there is no block 0..1823 in it yet). It can only reference 1824 because that block already exists in the shared trace namespace (inherited parent/sibling context). The reset signature (dense-from-0 own set) is entirely absent.
4. Blast radius — sibling sharing dominates. Fraction of each subagent's blocks that are shared (and that the per-child-scope bug decoded to different tokens):
Parallel subagents spawned from the same turn inherit the same context blocks, so cross-sibling sharing (69% of subagent traces) is the dominant case the bug corrupts. The fix puts the parent and all siblings on one scope, repairing both rows at once.
5.
tool_tokens/system_tokensare absent at the trace level (default 0), so the setup prefix is derived empirically, not from the header — consistent with the shared-namespace model.Cache: why
MANIFEST_VERSIONhad to bumpThe mmap dataset cache key is
(input bytes, tokenizer, prompt/input settings, MANIFEST_VERSION)with no source-code component. This fix changes the decoded content produced for an unchanged key, so without a bump a warm cache built by the buggy loader keeps serving the old (wrong) dataset even after the fix lands.5 → 6invalidates all pre-fix entries (treated as a MISS; a miss still produces byte-identical mmap files to a non-cached run). The version-bump convention comment was widened to call out content-semantics changes, not just on-disk layout/schema.Tests
tests/unit/dataset/loader/test_weka_trace.py): parent↔child and sibling↔sibling shared-block scope. Wired with the real scope-sensitiveHashIdRandomGenerator— the existingstub_hash_id_corpus_rngignoresset_trace_id, so a stubbed RNG decodes identically under any scope and cannot catch this class of bug. Confirmed RED on the buggy code (same block → different tokens per scope), GREEN after the fix.tests/integration/test_weka_hash_id_scope.py): a crafted scope-stressing weka trace (parent + two sibling subagents all referencing[10,11,12]) run end-to-end throughaiperf profileagainst the mock server with--export-level raw. Asserts the actual on-the-wire payloads of the two siblings (and parent vs child) are byte-identical. Pins the mmap cache to a per-test dir so a stale cross-version cached dataset can't mask the regression — this was a real false-green during development (the cache replayed a fixed dataset under the buggy code).Verification: full loader unit suite green (incl. all serial↔parallel byte-identity + reproducibility tests), mmap-cache tests green, integration green and confirmed RED without the fix; ruff + all pre-commit hooks clean.
🤖 Generated with Claude Code
Note
Medium Risk
Changes decoded prompts for Weka traces with subagents (prefix-cache replay semantics) and invalidates mmap cache v5 entries; scope is narrow to Weka loader paths with strong regression tests.
Overview
Fixes Weka
hash_id_scope: "local"for subagents so reusedhash_idblocks decode to the same synthetic tokens on the parent and every spawn child (including siblings), matching one namespace per trace file and restoring cross-agent shared-prefix replay for prefix-cache metrics.Serial reconstruction in
weka_trace.pyand parallel workers inweka_parallel_convert.pynow callset_trace_id/_make_scope_helperswithparent_trace_idinstead of the childsession_id.MANIFEST_VERSIONis bumped to 6 so mmap dataset caches miss on old wrong decodes (cache keys omit source version).Tests: unit cases with the real scope-sensitive
HashIdRandomGenerator(parent↔child and sibling sharing), plus an integration run throughaiperf profilethat asserts byte-identical on-wire prompts for shared blocks.Reviewed by Cursor Bugbot for commit 3e0cdec. Bugbot is set up for automated code reviews on this repo. Configure here.