dsr1 disagg 8k1k mtp: nightly 20260609 + conc-64 dispatch-bug validation#1696
Oseltamivir wants to merge 4 commits
… conc-64
Bump dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp to SGLang ROCm nightly v0.5.12.post1-rocm720-mi35x-20260609 and collapse the 8k1k MTP search space to the single conc-64 DEP8 + 1xDEP8 (MTP3) point. With max(CONC_LIST)=64, the decode server sizes the MoRI per-rank dispatch buffer at 64/8*(MTP+1)=32 (<256), the regime that silently corrupted output (gsm8k=0) on -20260529. This validates the upstream fix (sgl-project/sglang#27194, ROCm/mori#356) reported in this nightly. Harness and env-var settings are left unchanged so the result is an honest test.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
…g validation
Expand the conc-64 point into separate DEP8+MTP3 entries for conc 64, 32, 16, 8 so each launches its own server and exercises a distinct sub-256 dispatch size (32/16/8/4). conc<=4 omitted (floor(conc/8)*4=0).
To get a gsm8k eval at every point (the harness otherwise evals only the highest-conc entry per topology group, and ignores conc<16):
- mark_eval_entries: eval every eligible multinode entry per group, each at its own concurrency, instead of just max-conc.
- MIN_EVAL_CONC 16 -> 8 so conc-8 (dispatch=4) is eval-eligible.
Both are validation-only; revert with the full search-space restore.
Verified locally: generator emits 4 eval entries (conc 64/32/16/8, each run-eval=true, eval-conc = own conc) and 4 benchmark entries.
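The eval-marking change described in this commit can be sketched roughly as follows. This is an illustrative re-implementation, not the harness code: the entry layout and the `topology` key are assumptions, the `eval_all` flag is a hypothetical switch added only to contrast the old and new behavior, and only the `run-eval` / `eval-conc` field names and the `MIN_EVAL_CONC` constant come from the commit message.

```python
from collections import defaultdict

MIN_EVAL_CONC = 8  # lowered from 16 for this validation, per the commit message


def mark_eval_entries(entries, eval_all=True):
    """Hypothetical sketch: mark which entries in each topology group run gsm8k.

    eval_all=True  -> every eligible entry is evaluated at its own concurrency.
    eval_all=False -> only the highest-concurrency eligible entry is evaluated.
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["topology"]].append(entry)  # "topology" key is an assumption

    for group in groups.values():
        eligible = [e for e in group if e["conc"] >= MIN_EVAL_CONC]
        if not eval_all and eligible:
            eligible = [max(eligible, key=lambda e: e["conc"])]
        for entry in eligible:
            entry["run-eval"] = True
            entry["eval-conc"] = entry["conc"]
    return entries


# Four single-concurrency entries in one group, as in this commit.
new_style = mark_eval_entries(
    [{"topology": "DEP8+1xDEP8-MTP3", "conc": c} for c in (64, 32, 16, 8)]
)
old_style = mark_eval_entries(
    [{"topology": "DEP8+1xDEP8-MTP3", "conc": c} for c in (64, 32, 16, 8)],
    eval_all=False,
)
```

Under these assumptions, `new_style` carries an eval flag on all four entries, while `old_style` flags only the conc-64 entry.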
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27224150488
The test hardcoded conc [8,16,32]/[8] and expected eval-conc=32, which broke when MIN_EVAL_CONC was lowered 16->8 (eligible median shifted to 16). Rebuild the conc lists from MIN_EVAL_CONC (below/at/above) so the test asserts the floor behavior for any value of the constant -- passes under both 8 and 16.
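A test that derives its inputs from the constant, rather than hardcoding them, might look like the sketch below. The `pick_eval_conc` helper is hypothetical: it assumes the eval concurrency is the upper median of the eligible values, which is inferred from the numbers in this commit message (32 under a floor of 16, 16 under a floor of 8) and not confirmed against the harness.

```python
def pick_eval_conc(concs, floor):
    """Hypothetical picker: upper median of the concurrencies at or above floor."""
    eligible = sorted(c for c in concs if c >= floor)
    return eligible[len(eligible) // 2] if eligible else None


def conc_lists_for(floor):
    """Build test inputs relative to the floor: below, at, and above it."""
    below, at, above = floor // 2, floor, floor * 2
    return [below, at, above], [below]


# The same assertions hold whether the constant is 8 or 16.
for floor in (8, 16):
    mixed, only_below = conc_lists_for(floor)
    picked = pick_eval_conc(mixed, floor)
    assert picked is not None and picked >= floor  # never picks below the floor
    assert pick_eval_conc(only_below, floor) is None  # nothing eligible

# Reproduces the breakage described above with the old hardcoded list.
old_expectation = pick_eval_conc([8, 16, 32], floor=16)
after_lowering = pick_eval_conc([8, 16, 32], floor=8)
```

Because the lists move with the floor, the test asserts the floor behavior itself, not one specific concurrency value.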
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27224813462
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27225030090
Summary
Bump `dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp` to SGLang ROCm nightly `v0.5.12.post1-rocm720-mi35x-20260609` and narrow its 8k1k MTP search space to the single conc-64 DEP8 + 1×DEP8 (MTP3) point.
Diff is intentionally just two things: the image tag and the search-space narrowing. No harness / env-var changes.
Why
At low concurrency the MoRI EP per-rank dispatch buffer `num_max_dispatch_tokens_per_rank` is sized `max(CONC_LIST)/TP*(MTP+1)`. At conc-64/TP8/MTP3 that collapses to `64/8*4 = 32` (< 256), which silently corrupted decode output on `-20260529`: output decodes fine and acceptance length stays high, but gsm8k → 0. Collapsing the sweep to conc-64 forces `max(CONC_LIST)=64`, so the eval runs squarely in that regime.
This nightly is reported to carry the upstream fix (sgl-project/sglang#27194, ROCm/mori#356). Because the dispatch formula is left at its `main` value (no env clamp, no patcher), a green conc-64 gsm8k here demonstrates the nightly itself fixes the kernel; a red one means it does not.
Pre-fix MI355X reference: dispatch=32 → 0.00, 64 → 0.00, ≥256 → 0.94.
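The sizing arithmetic above can be checked with a few lines of Python. This is a sketch of the formula as stated in this PR, not the SGLang or MoRI code: the function name is hypothetical, the integer floor follows the `floor(conc/8)*4` form used in the commit messages, and 256 is the threshold quoted in this description.

```python
BUG_THRESHOLD = 256  # sizes below this were reported to corrupt output on -20260529


def dispatch_tokens_per_rank(conc_list, tp=8, mtp=3):
    """Hypothetical helper: floor(max(CONC_LIST) / TP) * (MTP + 1)."""
    return (max(conc_list) // tp) * (mtp + 1)


# One concurrency per server launch, as in the narrowed sweep.
sizes = {conc: dispatch_tokens_per_rank([conc]) for conc in (64, 32, 16, 8, 4)}
for conc, size in sizes.items():
    print(f"conc={conc:>3} dispatch={size:>3} below_threshold={size < BUG_THRESHOLD}")
```

At TP8/MTP3 the formula only reaches 256 at conc-512, so every point in a sweep capped at conc-64 sits below the threshold, and conc-4 yields a size of 0.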
Note
Search space is narrowed for this validation; restore the full sweep once confirmed.
Note
Medium Risk
Temporary global changes to multinode eval marking (`MIN_EVAL_CONC` and eval-all-entries-in-group) can affect gsm8k coverage for other disagg configs until reverted; benchmark matrix scope is intentionally reduced for one config key.
Overview
Updates `dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp` to SGLang ROCm nightly `20260609` (MoRI EP dispatch-buffer fix) and replaces the broad 8k1k MTP disagg sweep with four DEP8 + 1×DEP8 (MTP3) points at conc 64, 32, 16, and 8: one concurrency per matrix entry, so per-rank dispatch sizes 32→4 stay in the sub-256 bug regime.
Temporary validation harness (documented as revert when the full search space returns): `MIN_EVAL_CONC` 16→8 so conc-8 is eval-eligible; `mark_eval_entries` runs gsm8k on every eligible multinode entry in a group instead of only the highest-concurrency one. `perf-changelog.yaml` records the image bump, narrowed sweep, and harness deltas.
Reviewed by Cursor Bugbot for commit 38370f4.