fix(agentic): dispatch all profiling trajectories concurrently at startup#9
Draft
ajcasagrande wants to merge 9 commits into
Draft
Conversation
…rtup _execute_profiling was serially awaiting each trajectory's first credit, blocking trajectory K+1 until K's dispatch completed. With N=256 and slow per-trajectory work (snapshot materialization, tokenization), this caused the full concurrency target to take ~54s to reach on a real cluster instead of bursting at t=0. Replace the serial for-loop with asyncio.gather over a new _dispatch_one_profiling_trajectory helper so all trajectories begin their initial credit issuance concurrently. Regression test verifies that N terminal-root snapshot trajectories all reach issue_credit simultaneously rather than one-at-a-time. Root cause introduced in commit f47bd55 (Cam Quilici, 2026-06-03). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@92026e5028800ccb33ae3819440cbe549478e335Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@92026e5028800ccb33ae3819440cbe549478e335Last updated for commit: |
… gather A bare asyncio.gather re-raises the first lane's exception while the remaining dispatch coroutines keep running detached - issuing credits and mutating strategy state in a phase that is already failing, and unreachable by the phase runner's cancellation. Sibling exceptions were also silently swallowed. Gather with return_exceptions=True, log every failed lane with its trace_id, and re-raise the first failure only after all lanes settle. Matches the established fan-out pattern in branch_orchestrator. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…patch A lane that recycles immediately at PROFILING startup (terminal-root snapshot or k_i at the last turn) pops the recycle queue before later lanes have registered themselves in _active_traces, so the pop's duplicate-session guard sees count 0 for a still-live trace and spawns a second concurrent session of it - also stealing that trace's pass=0 cache-bust digest from the continuing warmup session. Pre-register every lane's trace in _active_traces during setup_phase (exactly the _lanes_per_trace multiset) instead of lane-by-lane during dispatch, so the guard sees final counts from t=0 regardless of dispatch interleaving. Tests that hand-simulated the old per-lane increments drop them accordingly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…across phases Marker coherence across the WARMUP -> PROFILING boundary relied on both phases re-minting in identical positional order from per-phase counters. The phases mint different sets: WARMUP skips waiting_on_children states while PROFILING mints every continuation state, so under wrap-fill with mixed ready/waiting roots on a shared trace the continuing session's re-mint lands on a different pass number - its marker rotates mid-session at the phase boundary and the warmed KV prefix becomes unreachable for the measured turns. Move _recycle_pass/_session_marker to a CacheBustLedger on the shared TrajectorySource and reuse an already-minted marker for the same x_correlation_id. Continuing sessions now keep their WARMUP marker by identity rather than by mint-order replay, and the never-restarting pass counter guarantees fresh sessions cannot collide with warmed digests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
_ensure_seeded_join pre-seeds every prereq key declared on a gated turn as unregistered, and seed_snapshot only registers keys for children still alive at t*. When two branch groups share one join turn and t* falls between their completions, the already-finished group's key could never register (PrereqState.is_done requires registered=True), so the gate was permanently unsatisfiable: the parent lane silently wedged for the entire phase, surfacing only as a misleading outstanding=0 abandoned-join warning at cleanup while has_pending_branch_work kept the phase-end drain waiting until the forced-completion timeout. Record each prereq's spawning turn in the index; snapshot seeding marks keys whose spawning turn fired before the parent's resume position as registered (expected=0 -> satisfied unless live children re-register). Keys whose spawning turn replays after t* stay unregistered so a gate can still not fire before its branch has spawned. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
When every child of a gated branch failed to start, the drained-gate path dispatched the gated turn via _release_blocked_join, but _maybe_suspend_parent then returned False (the gate was already popped), so the callback handler fell through to the strategy's normal continuation and dispatched the identical turn again - a duplicate request whose doubled returns propagate per-turn and end in the double-recycle RuntimeError at the trajectory's final turn. Pop drained gates silently instead: with the gate gone, intercept returns False and the strategy's continuation owns the single dispatch. No hang is reintroduced - the original Phase 0 deadlock came from leaving an unsatisfiable join registered, and the pop still prevents that. Tests that encoded the False+dispatched combination now assert single ownership. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Commit 932b4bc switched the inferencex-agentx-mvp scenario from inter_turn_delay_cap_seconds to trace_idle_gap_cap_seconds but the e2e log assertion still expected the old auto-set line, failing the test on every run since. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
- Assert the N-at-the-gate count is produced by exactly N distinct turn-0 sessions (await_count + distinct correlation ids), so the assertion cannot be satisfied by a double-dispatching lane masking a lost one. - Bound task completion with wait_for so a wedge regression fails in 5s instead of hanging to the global timeout. - Add a companion test covering the snapshot-less k_i+1 resume branch, which previously had no concurrency coverage (verified to fail against a serial dispatch loop). - Rebuild on the file's _make_strategy/_build_real_trajectory_source helpers, dropping the inline TrajectorySource.__new__ copy, the dead max_in_flight/seed/target-size setup, and the wrong queue-depth comment (N traces suffice for N concurrent pops; depth is conserved). - Match _dispatch_one_profiling_trajectory's parameter order to its sibling (trajectory, lane), drop the now-pointless default-arg capture ceremony in the debug lambda, and add the missing docstring. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
_export_and_load_sync used asyncio.get_event_loop().run_until_complete, which raises 'no current event loop' on Python 3.12 whenever the xdist worker previously ran an in-process CLI test (e.g. the agentic-replay e2e) that consumed the main thread's loop. The failure only appeared when test scheduling co-located the two files, making the suite flaky under -n auto. asyncio.run creates a fresh loop per call. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A quick glossary first, since the table leans on a few terms:
ea1b4e9, the original PR #9 fix)5e2d20b)3dfa3b4)9479d72, the worst one)04318d6)The connecting theme: bugs 2–5 were all hiding in code that resumes or coordinates work "mid-story" — they only show themselves under particular timing (a snapshot taken between two events, lanes starting simultaneously, all helpers failing at once), which is why they survived until a careful review rather than failing in everyday runs.
🤖 Generated with Claude Code