Skip to content

fix(agentic): dispatch all profiling trajectories concurrently at startup#9

Draft
ajcasagrande wants to merge 9 commits into
SemiAnalysisAI:cjq/agentx-v0.3from
ajcasagrande:ajc/fix-terminal-roots-serial-startup
Draft

fix(agentic): dispatch all profiling trajectories concurrently at startup#9
ajcasagrande wants to merge 9 commits into
SemiAnalysisAI:cjq/agentx-v0.3from
ajcasagrande:ajc/fix-terminal-roots-serial-startup

Conversation

@ajcasagrande

@ajcasagrande ajcasagrande commented Jun 12, 2026

Copy link
Copy Markdown

A quick glossary first, since the table leans on a few terms:

  • AIPerf replays recorded AI-agent conversations against an inference server to measure its performance under load. Each replayed conversation slot is a lane; running at "concurrency 256" means 256 lanes are having conversations with the server at once.
  • A credit is the internal "go ahead, send the next message" token. When a conversation finishes, its slot is recycled: the finished conversation goes back into a pool and a fresh one is drawn so the load level stays constant.
  • A run has two phases: warmup (get the server's caches into a realistic steady state — not measured) and profiling (the actual measurement). Some conversations continue straight from warmup into profiling.
  • A cache-bust marker is a small unique tag injected into each conversation's text so the server can't cheat by reusing cached work from an identical earlier conversation. The same conversation must keep the same tag for its whole life.
  • Some replayed conversations spawn sub-agents (helper conversations) and then wait at a gate until all helpers finish before continuing — like a manager who can't proceed until every report comes back.
# Bug What went wrong Why it matters
1 Slow-motion start (ea1b4e9, the original PR #9 fix) At the start of measurement, the code launched the 256 lanes one at a time, fully waiting for each lane's first message to be accepted before even starting the next — like boarding a plane through a single door, one passenger at a time. The benchmark claims to measure the server under 256 simultaneous conversations, but for the first ~54 seconds the load slowly trickled up from 1. Part of the measurement window described a load that wasn't actually there, skewing the results.
2 Same conversation played twice at once (5e2d20b) Each lane is supposed to stamp "I'm using conversation X" into a ledger so no other lane picks X up. But lanes only stamped the ledger as they launched — and with the now-simultaneous start, a fast lane could finish its conversation and draw a replacement from the pool before a slower lane had stamped its claim. The fast lane could draw a conversation another lane was already about to run. Two lanes replaying the same conversation simultaneously is something the system promises never happens. It distorts the load (the server sees duplicate traffic patterns) and corrupts the cache-related bookkeeping for that conversation. Fixed by stamping every lane's claim into the ledger up front, before anything launches.
3 The anti-cheat tag changed mid-conversation (3dfa3b4) A conversation continuing from warmup into profiling must keep its exact cache-bust tag, so the cache the server warmed up stays usable. The code didn't store the tag across the boundary — it recomputed all tags from scratch in the new phase and relied on both phases computing them in the exact same order. But the two phases process slightly different lists, so in some setups the ordering shifted by one and a continuing conversation silently got a different tag. The whole point of warmup is that the server enters measurement with realistic warm caches. A changed tag makes the conversation look brand-new to the server, so the warmed cache is never reused — and the run quietly under-reports the server's cache performance, with no error or warning. Fixed by remembering each conversation's tag rather than recomputing it.
4 Waiting forever for a report that already came in (9479d72, the worst one) Benchmarks can resume from a mid-run snapshot. A manager-conversation waiting on two helper groups gets a checklist: "group A done? group B done?" The snapshot only records helpers still running at the snapshot moment — so a group that had already finished before the snapshot left no record, and nothing could ever check its box. The checklist could never complete. That manager lane silently freezes for the entire run — no error, no crash. A 256-lane benchmark quietly becomes a 255-lane benchmark (or worse, with several frozen lanes), and at the end the run stalls for extra minutes waiting on the frozen lanes before force-quitting. Fixed by reasoning it out at resume time: if a group's start point is in the past and none of its helpers are alive, it must have finished — check its box.
5 Two departments sending the same message (04318d6) When all of a manager's helpers fail to even start, someone must un-stick the manager. The gate-keeping code did it itself — it sent the manager's next message — but then also told the normal message-sending machinery "I didn't handle this," so the normal path sent the very same message a second time. The server receives a duplicate request (wasted work, inflated request counts), and the duplication snowballs: each duplicated reply triggers two next-messages, and so on, until the run crashes on an internal "this conversation finished twice" safety check. Fixed by giving the dispatch a single owner: the gatekeeper just removes the dead gate and lets the normal machinery send the message once.

The connecting theme: bugs 2–5 were all hiding in code that resumes or coordinates work "mid-story" — they only show themselves under particular timing (a snapshot taken between two events, lanes starting simultaneously, all helpers failing at once), which is why they survived until a careful review rather than failing in everyday runs.

🤖 Generated with Claude Code

…rtup

_execute_profiling was serially awaiting each trajectory's first credit,
blocking trajectory K+1 until K's dispatch completed. With N=256 and
slow per-trajectory work (snapshot materialization, tokenization), this
caused the full concurrency target to take ~54s to reach on a real cluster
instead of bursting at t=0.

Replace the serial for-loop with asyncio.gather over a new
_dispatch_one_profiling_trajectory helper so all trajectories begin their
initial credit issuance concurrently.

Regression test verifies that N terminal-root snapshot trajectories all
reach issue_credit simultaneously rather than one-at-a-time.

Root cause introduced in commit f47bd55 (Cam Quilici, 2026-06-03).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@92026e5028800ccb33ae3819440cbe549478e335

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@92026e5028800ccb33ae3819440cbe549478e335

Last updated for commit: 92026e5Browse code

@ajcasagrande ajcasagrande marked this pull request as draft June 12, 2026 17:54
ajcasagrande and others added 8 commits June 12, 2026 11:50
… gather

A bare asyncio.gather re-raises the first lane's exception while the
remaining dispatch coroutines keep running detached - issuing credits
and mutating strategy state in a phase that is already failing, and
unreachable by the phase runner's cancellation. Sibling exceptions were
also silently swallowed.

Gather with return_exceptions=True, log every failed lane with its
trace_id, and re-raise the first failure only after all lanes settle.
Matches the established fan-out pattern in branch_orchestrator.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…patch

A lane that recycles immediately at PROFILING startup (terminal-root
snapshot or k_i at the last turn) pops the recycle queue before later
lanes have registered themselves in _active_traces, so the pop's
duplicate-session guard sees count 0 for a still-live trace and spawns
a second concurrent session of it - also stealing that trace's pass=0
cache-bust digest from the continuing warmup session.

Pre-register every lane's trace in _active_traces during setup_phase
(exactly the _lanes_per_trace multiset) instead of lane-by-lane during
dispatch, so the guard sees final counts from t=0 regardless of
dispatch interleaving. Tests that hand-simulated the old per-lane
increments drop them accordingly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…across phases

Marker coherence across the WARMUP -> PROFILING boundary relied on both
phases re-minting in identical positional order from per-phase counters.
The phases mint different sets: WARMUP skips waiting_on_children states
while PROFILING mints every continuation state, so under wrap-fill with
mixed ready/waiting roots on a shared trace the continuing session's
re-mint lands on a different pass number - its marker rotates mid-session
at the phase boundary and the warmed KV prefix becomes unreachable for
the measured turns.

Move _recycle_pass/_session_marker to a CacheBustLedger on the shared
TrajectorySource and reuse an already-minted marker for the same
x_correlation_id. Continuing sessions now keep their WARMUP marker by
identity rather than by mint-order replay, and the never-restarting pass
counter guarantees fresh sessions cannot collide with warmed digests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
_ensure_seeded_join pre-seeds every prereq key declared on a gated turn
as unregistered, and seed_snapshot only registers keys for children
still alive at t*. When two branch groups share one join turn and t*
falls between their completions, the already-finished group's key could
never register (PrereqState.is_done requires registered=True), so the
gate was permanently unsatisfiable: the parent lane silently wedged for
the entire phase, surfacing only as a misleading outstanding=0
abandoned-join warning at cleanup while has_pending_branch_work kept
the phase-end drain waiting until the forced-completion timeout.

Record each prereq's spawning turn in the index; snapshot seeding marks
keys whose spawning turn fired before the parent's resume position as
registered (expected=0 -> satisfied unless live children re-register).
Keys whose spawning turn replays after t* stay unregistered so a gate
can still not fire before its branch has spawned.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
When every child of a gated branch failed to start, the drained-gate
path dispatched the gated turn via _release_blocked_join, but
_maybe_suspend_parent then returned False (the gate was already
popped), so the callback handler fell through to the strategy's normal
continuation and dispatched the identical turn again - a duplicate
request whose doubled returns propagate per-turn and end in the
double-recycle RuntimeError at the trajectory's final turn.

Pop drained gates silently instead: with the gate gone, intercept
returns False and the strategy's continuation owns the single dispatch.
No hang is reintroduced - the original Phase 0 deadlock came from
leaving an unsatisfiable join registered, and the pop still prevents
that. Tests that encoded the False+dispatched combination now assert
single ownership.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Commit 932b4bc switched the inferencex-agentx-mvp scenario from
inter_turn_delay_cap_seconds to trace_idle_gap_cap_seconds but the e2e
log assertion still expected the old auto-set line, failing the test on
every run since.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
- Assert the N-at-the-gate count is produced by exactly N distinct
  turn-0 sessions (await_count + distinct correlation ids), so the
  assertion cannot be satisfied by a double-dispatching lane masking a
  lost one.
- Bound task completion with wait_for so a wedge regression fails in
  5s instead of hanging to the global timeout.
- Add a companion test covering the snapshot-less k_i+1 resume branch,
  which previously had no concurrency coverage (verified to fail
  against a serial dispatch loop).
- Rebuild on the file's _make_strategy/_build_real_trajectory_source
  helpers, dropping the inline TrajectorySource.__new__ copy, the dead
  max_in_flight/seed/target-size setup, and the wrong queue-depth
  comment (N traces suffice for N concurrent pops; depth is conserved).
- Match _dispatch_one_profiling_trajectory's parameter order to its
  sibling (trajectory, lane), drop the now-pointless default-arg
  capture ceremony in the debug lambda, and add the missing docstring.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
_export_and_load_sync used asyncio.get_event_loop().run_until_complete,
which raises 'no current event loop' on Python 3.12 whenever the xdist
worker previously ran an in-process CLI test (e.g. the agentic-replay
e2e) that consumed the main thread's loop. The failure only appeared
when test scheduling co-located the two files, making the suite flaky
under -n auto. asyncio.run creates a fresh loop per call.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant