fix(agentic): dispatch all profiling trajectories concurrently at startup by ajcasagrande · Pull Request #9 · SemiAnalysisAI/aiperf

ajcasagrande · 2026-06-12T17:53:55Z

A quick glossary first, since the table leans on a few terms:

AIPerf replays recorded AI-agent conversations against an inference server to measure its performance under load. Each replayed conversation slot is a lane; running at "concurrency 256" means 256 lanes are having conversations with the server at once.
A credit is the internal "go ahead, send the next message" token. When a conversation finishes, its slot is recycled: the finished conversation goes back into a pool and a fresh one is drawn so the load level stays constant.
A run has two phases: warmup (get the server's caches into a realistic steady state — not measured) and profiling (the actual measurement). Some conversations continue straight from warmup into profiling.
A cache-bust marker is a small unique tag injected into each conversation's text so the server can't cheat by reusing cached work from an identical earlier conversation. The same conversation must keep the same tag for its whole life.
Some replayed conversations spawn sub-agents (helper conversations) and then wait at a gate until all helpers finish before continuing — like a manager who can't proceed until every report comes back.

#	Bug	What went wrong	Why it matters
1	Slow-motion start (`ea1b4e9`, the original PR #9 fix)	At the start of measurement, the code launched the 256 lanes one at a time, fully waiting for each lane's first message to be accepted before even starting the next — like boarding a plane through a single door, one passenger at a time.	The benchmark claims to measure the server under 256 simultaneous conversations, but for the first ~54 seconds the load slowly trickled up from 1. Part of the measurement window described a load that wasn't actually there, skewing the results.
2	Same conversation played twice at once (`5e2d20b`)	Each lane is supposed to stamp "I'm using conversation X" into a ledger so no other lane picks X up. But lanes only stamped the ledger as they launched — and with the now-simultaneous start, a fast lane could finish its conversation and draw a replacement from the pool before a slower lane had stamped its claim. The fast lane could draw a conversation another lane was already about to run.	Two lanes replaying the same conversation simultaneously is something the system promises never happens. It distorts the load (the server sees duplicate traffic patterns) and corrupts the cache-related bookkeeping for that conversation. Fixed by stamping every lane's claim into the ledger up front, before anything launches.
3	The anti-cheat tag changed mid-conversation (`3dfa3b4`)	A conversation continuing from warmup into profiling must keep its exact cache-bust tag, so the cache the server warmed up stays usable. The code didn't store the tag across the boundary — it recomputed all tags from scratch in the new phase and relied on both phases computing them in the exact same order. But the two phases process slightly different lists, so in some setups the ordering shifted by one and a continuing conversation silently got a different tag.	The whole point of warmup is that the server enters measurement with realistic warm caches. A changed tag makes the conversation look brand-new to the server, so the warmed cache is never reused — and the run quietly under-reports the server's cache performance, with no error or warning. Fixed by remembering each conversation's tag rather than recomputing it.
4	Waiting forever for a report that already came in (`9479d72`, the worst one)	Benchmarks can resume from a mid-run snapshot. A manager-conversation waiting on two helper groups gets a checklist: "group A done? group B done?" The snapshot only records helpers still running at the snapshot moment — so a group that had already finished before the snapshot left no record, and nothing could ever check its box. The checklist could never complete.	That manager lane silently freezes for the entire run — no error, no crash. A 256-lane benchmark quietly becomes a 255-lane benchmark (or worse, with several frozen lanes), and at the end the run stalls for extra minutes waiting on the frozen lanes before force-quitting. Fixed by reasoning it out at resume time: if a group's start point is in the past and none of its helpers are alive, it must have finished — check its box.
5	Two departments sending the same message (`04318d6`)	When all of a manager's helpers fail to even start, someone must un-stick the manager. The gate-keeping code did it itself — it sent the manager's next message — but then also told the normal message-sending machinery "I didn't handle this," so the normal path sent the very same message a second time.	The server receives a duplicate request (wasted work, inflated request counts), and the duplication snowballs: each duplicated reply triggers two next-messages, and so on, until the run crashes on an internal "this conversation finished twice" safety check. Fixed by giving the dispatch a single owner: the gatekeeper just removes the dead gate and lets the normal machinery send the message once.

The connecting theme: bugs 2–5 were all hiding in code that resumes or coordinates work "mid-story" — they only show themselves under particular timing (a snapshot taken between two events, lanes starting simultaneously, all helpers failing at once), which is why they survived until a careful review rather than failing in everyday runs.

🤖 Generated with Claude Code

…rtup _execute_profiling was serially awaiting each trajectory's first credit, blocking trajectory K+1 until K's dispatch completed. With N=256 and slow per-trajectory work (snapshot materialization, tokenization), this caused the full concurrency target to take ~54s to reach on a real cluster instead of bursting at t=0. Replace the serial for-loop with asyncio.gather over a new _dispatch_one_profiling_trajectory helper so all trajectories begin their initial credit issuance concurrently. Regression test verifies that N terminal-root snapshot trajectories all reach issue_credit simultaneously rather than one-at-a-time. Root cause introduced in commit f47bd55 (Cam Quilici, 2026-06-03). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

github-actions · 2026-06-12T17:54:05Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@92026e5028800ccb33ae3819440cbe549478e335

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@92026e5028800ccb33ae3819440cbe549478e335

Last updated for commit: 92026e5 • Browse code

… gather A bare asyncio.gather re-raises the first lane's exception while the remaining dispatch coroutines keep running detached - issuing credits and mutating strategy state in a phase that is already failing, and unreachable by the phase runner's cancellation. Sibling exceptions were also silently swallowed. Gather with return_exceptions=True, log every failed lane with its trace_id, and re-raise the first failure only after all lanes settle. Matches the established fan-out pattern in branch_orchestrator. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

…patch A lane that recycles immediately at PROFILING startup (terminal-root snapshot or k_i at the last turn) pops the recycle queue before later lanes have registered themselves in _active_traces, so the pop's duplicate-session guard sees count 0 for a still-live trace and spawns a second concurrent session of it - also stealing that trace's pass=0 cache-bust digest from the continuing warmup session. Pre-register every lane's trace in _active_traces during setup_phase (exactly the _lanes_per_trace multiset) instead of lane-by-lane during dispatch, so the guard sees final counts from t=0 regardless of dispatch interleaving. Tests that hand-simulated the old per-lane increments drop them accordingly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

…across phases Marker coherence across the WARMUP -> PROFILING boundary relied on both phases re-minting in identical positional order from per-phase counters. The phases mint different sets: WARMUP skips waiting_on_children states while PROFILING mints every continuation state, so under wrap-fill with mixed ready/waiting roots on a shared trace the continuing session's re-mint lands on a different pass number - its marker rotates mid-session at the phase boundary and the warmed KV prefix becomes unreachable for the measured turns. Move _recycle_pass/_session_marker to a CacheBustLedger on the shared TrajectorySource and reuse an already-minted marker for the same x_correlation_id. Continuing sessions now keep their WARMUP marker by identity rather than by mint-order replay, and the never-restarting pass counter guarantees fresh sessions cannot collide with warmed digests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

_ensure_seeded_join pre-seeds every prereq key declared on a gated turn as unregistered, and seed_snapshot only registers keys for children still alive at t*. When two branch groups share one join turn and t* falls between their completions, the already-finished group's key could never register (PrereqState.is_done requires registered=True), so the gate was permanently unsatisfiable: the parent lane silently wedged for the entire phase, surfacing only as a misleading outstanding=0 abandoned-join warning at cleanup while has_pending_branch_work kept the phase-end drain waiting until the forced-completion timeout. Record each prereq's spawning turn in the index; snapshot seeding marks keys whose spawning turn fired before the parent's resume position as registered (expected=0 -> satisfied unless live children re-register). Keys whose spawning turn replays after t* stay unregistered so a gate can still not fire before its branch has spawned. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

When every child of a gated branch failed to start, the drained-gate path dispatched the gated turn via _release_blocked_join, but _maybe_suspend_parent then returned False (the gate was already popped), so the callback handler fell through to the strategy's normal continuation and dispatched the identical turn again - a duplicate request whose doubled returns propagate per-turn and end in the double-recycle RuntimeError at the trajectory's final turn. Pop drained gates silently instead: with the gate gone, intercept returns False and the strategy's continuation owns the single dispatch. No hang is reintroduced - the original Phase 0 deadlock came from leaving an unsatisfiable join registered, and the pop still prevents that. Tests that encoded the False+dispatched combination now assert single ownership. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Commit 932b4bc switched the inferencex-agentx-mvp scenario from inter_turn_delay_cap_seconds to trace_idle_gap_cap_seconds but the e2e log assertion still expected the old auto-set line, failing the test on every run since. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

- Assert the N-at-the-gate count is produced by exactly N distinct turn-0 sessions (await_count + distinct correlation ids), so the assertion cannot be satisfied by a double-dispatching lane masking a lost one. - Bound task completion with wait_for so a wedge regression fails in 5s instead of hanging to the global timeout. - Add a companion test covering the snapshot-less k_i+1 resume branch, which previously had no concurrency coverage (verified to fail against a serial dispatch loop). - Rebuild on the file's _make_strategy/_build_real_trajectory_source helpers, dropping the inline TrajectorySource.__new__ copy, the dead max_in_flight/seed/target-size setup, and the wrong queue-depth comment (N traces suffice for N concurrent pops; depth is conserved). - Match _dispatch_one_profiling_trajectory's parameter order to its sibling (trajectory, lane), drop the now-pointless default-arg capture ceremony in the debug lambda, and add the missing docstring. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

_export_and_load_sync used asyncio.get_event_loop().run_until_complete, which raises 'no current event loop' on Python 3.12 whenever the xdist worker previously ran an in-process CLI test (e.g. the agentic-replay e2e) that consumed the main thread's loop. The failure only appeared when test scheduling co-located the two files, making the suite flaky under -n auto. asyncio.run creates a fresh loop per call. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

ajcasagrande marked this pull request as draft June 12, 2026 17:54

ajcasagrande and others added 8 commits June 12, 2026 11:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agentic): dispatch all profiling trajectories concurrently at startup#9

fix(agentic): dispatch all profiling trajectories concurrently at startup#9
ajcasagrande wants to merge 9 commits into
SemiAnalysisAI:cjq/agentx-v0.3from
ajcasagrande:ajc/fix-terminal-roots-serial-startup

ajcasagrande commented Jun 12, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ajcasagrande commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Try out this PR

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ajcasagrande commented Jun 12, 2026 •

edited

Loading

github-actions Bot commented Jun 12, 2026 •

edited

Loading