
dsr1 disagg 8k1k mtp: nightly 20260609 + conc-64 dispatch-bug validation #1696

Open
Oseltamivir wants to merge 4 commits into main from dsr1-fp4-disagg-8k1k-mtp-nightly-conc64
Oseltamivir commented Jun 9, 2026

Summary

Bump dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp to SGLang ROCm nightly v0.5.12.post1-rocm720-mi35x-20260609 and narrow its 8k1k MTP search space to the single conc-64 DEP8 + 1×DEP8 (MTP3) point.

The diff is intentionally limited to two things: the image tag and the search-space narrowing. No harness or env-var changes.

Why

At low concurrency the MoRI EP per-rank dispatch buffer num_max_dispatch_tokens_per_rank is sized as max(CONC_LIST)/TP*(MTP+1). At conc-64/TP8/MTP3 that collapses to 64/8*4 = 32 (< 256), which silently corrupted decode output on -20260529: decoding completes without errors and acceptance length stays high, but gsm8k drops to 0. Collapsing the sweep to conc-64 forces max(CONC_LIST)=64, so the eval runs squarely in that regime.

This nightly is reported to carry the upstream fix (sgl-project/sglang#27194, ROCm/mori#356). Because the dispatch formula is left at its main value (no env clamp, no patcher), a green conc-64 gsm8k here demonstrates the nightly itself fixes the kernel; a red one means it does not.

Pre-fix MI355X reference (gsm8k score by dispatch size): dispatch=32 → 0.00, 64 → 0.00, ≥256 → 0.94.
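For reviewers who want to check the arithmetic, the sizing formula above can be sketched in a few lines of Python. This is an illustration only: the function and constant names are hypothetical rather than taken from the harness or SGLang, and floor division is assumed because a later commit message in this PR writes the same formula as floor(conc/8)*4.

```python
# Sketch only: names are hypothetical, not the harness's or SGLang's.
KNOWN_GOOD_DISPATCH = 256  # pre-fix runs at >= 256 scored 0.94 per the PR text


def dispatch_tokens_per_rank(conc_list, tp, mtp):
    """max(CONC_LIST) / TP * (MTP + 1), with floor division assumed."""
    return max(conc_list) // tp * (mtp + 1)


def in_bug_regime(conc_list, tp, mtp):
    """True when the computed buffer size falls below the known-good size."""
    return dispatch_tokens_per_rank(conc_list, tp, mtp) < KNOWN_GOOD_DISPATCH


for conc in (64, 32, 16, 8, 4):
    size = dispatch_tokens_per_rank([conc], tp=8, mtp=3)
    print(f"conc={conc:>2} -> dispatch={size:>2}, sub-256={in_bug_regime([conc], 8, 3)}")
```

Because the formula takes max(CONC_LIST), a sweep that also contains a high concurrency would size the buffer from that value instead, which is why the search space has to be narrowed to reach the small-buffer regime.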

Note

Search space is narrowed for this validation; restore the full sweep once confirmed.


Note

Medium Risk
Temporary global changes to multinode eval marking (MIN_EVAL_CONC and eval-all-entries-in-group) can affect gsm8k coverage for other disagg configs until reverted; benchmark matrix scope is intentionally reduced for one config key.

Overview
Updates dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp to SGLang ROCm nightly 20260609 (MoRI EP dispatch-buffer fix) and replaces the broad 8k1k MTP disagg sweep with four DEP8 + 1×DEP8 (MTP3) points at conc 64, 32, 16, and 8. Each matrix entry has a single concurrency, so the per-rank dispatch sizes (32, 16, 8, and 4) all stay in the sub-256 bug regime.

Temporary validation harness changes (documented as to be reverted when the full search space returns): MIN_EVAL_CONC is lowered from 16 to 8 so conc-8 is eval-eligible, and mark_eval_entries runs gsm8k on every eligible multinode entry in a group instead of only the highest-concurrency one. perf-changelog.yaml records the image bump, narrowed sweep, and harness deltas.

Reviewed by Cursor Bugbot for commit 38370f4.

… conc-64

Bump dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp to SGLang ROCm nightly
v0.5.12.post1-rocm720-mi35x-20260609 and collapse the 8k1k MTP search
space to the single conc-64 DEP8 + 1xDEP8 (MTP3) point so
max(CONC_LIST)=64 -> the decode server sizes the MoRI per-rank dispatch
buffer at 64/8*(MTP+1)=32 (<256), the regime that silently corrupted
output (gsm8k=0) on -20260529. Validates the upstream fix
(sgl-project/sglang#27194, ROCm/mori#356) reported in this nightly.
Harness/env-var settings left unchanged so the result is an honest test.

Oseltamivir requested a review from a team June 9, 2026 17:30
Oseltamivir added the non-canary-full-sweep-enabled label ("Run the full sweep without the canary gate (full search space, no trim)") Jun 9, 2026
github-actions Bot commented Jun 9, 2026


Thanks for the contribution! For vLLM and SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first, before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from their respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

…g validation

Expand the conc-64 point into separate DEP8+MTP3 entries for conc 64, 32,
16, 8 so each launches its own server and exercises a distinct sub-256
dispatch size (32/16/8/4). conc<=4 omitted (floor(conc/8)*4=0).

To get a gsm8k eval at every point (the harness otherwise evals only the
highest-conc entry per topology group, and ignores conc<16):
- mark_eval_entries: eval every eligible multinode entry per group, each
  at its own concurrency, instead of just max-conc.
- MIN_EVAL_CONC 16 -> 8 so conc-8 (dispatch=4) is eval-eligible.
Both are validation-only; revert with the full search-space restore.

Verified locally: generator emits 4 eval entries (conc 64/32/16/8, each
run-eval=true, eval-conc = own conc) and 4 benchmark entries.
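
The marking change described in this commit message can be pictured with the sketch below. It is not the repository's mark_eval_entries: the entry shape ({"group", "conc"}), the eval_all switch, and the function name are assumptions made for illustration. Only the run-eval and eval-conc key names and the MIN_EVAL_CONC values come from the text above.

```python
from collections import defaultdict

MIN_EVAL_CONC = 8  # lowered from 16 for this validation, per the commit message


def mark_eval_entries_sketch(entries, eval_all=True, min_conc=MIN_EVAL_CONC):
    """Return copies of entries with run-eval / eval-conc set.

    Hypothetical entry shape: {"group": str, "conc": int}.
    eval_all=True marks every eligible entry in a group (validation mode);
    eval_all=False marks only the highest-concurrency eligible entry.
    """
    # Collect the indices of eligible entries per topology group.
    by_group = defaultdict(list)
    for index, entry in enumerate(entries):
        if entry["conc"] >= min_conc:
            by_group[entry["group"]].append(index)

    chosen = set()
    for indices in by_group.values():
        if eval_all:
            chosen.update(indices)
        else:
            chosen.add(max(indices, key=lambda i: entries[i]["conc"]))

    marked = []
    for index, entry in enumerate(entries):
        run_eval = index in chosen
        marked.append({
            **entry,
            "run-eval": run_eval,
            "eval-conc": entry["conc"] if run_eval else None,
        })
    return marked


entries = [{"group": "dep8-mtp3", "conc": c} for c in (64, 32, 16, 8)]
evals = [e["eval-conc"] for e in mark_eval_entries_sketch(entries) if e["run-eval"]]
print(evals)  # [64, 32, 16, 8]
```

With eval_all=False and min_conc=16 the same input yields a single eval at conc 64, which corresponds to the behavior the commit message describes as the default.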

The test hardcoded conc [8,16,32]/[8] and expected eval-conc=32, which
broke when MIN_EVAL_CONC was lowered 16->8 (eligible median shifted to
16). Rebuild the conc lists from MIN_EVAL_CONC (below/at/above) so the
test asserts the floor behavior for any value of the constant -- passes
under both 8 and 16.
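
The idea of deriving test inputs from the constant, rather than hardcoding them, might look like the sketch below. The real test and its assertions are not shown in this PR page, so eligible_concs and check_floor_behavior are hypothetical stand-ins that only illustrate the below/at/above construction.

```python
def eligible_concs(conc_list, floor):
    """Keep only concurrencies at or above the eval floor (hypothetical helper)."""
    return [c for c in conc_list if c >= floor]


def check_floor_behavior(floor):
    """Build inputs relative to the floor so the check holds for any floor value."""
    below, at, above = floor // 2, floor, floor * 2
    # A value below the floor is dropped; values at and above it are kept.
    assert eligible_concs([below, at, above], floor) == [at, above]
    # A list containing only a below-floor value has nothing eligible.
    assert eligible_concs([below], floor) == []


# The same check runs under both values of the constant mentioned above.
for min_eval_conc in (8, 16):
    check_floor_behavior(min_eval_conc)
```

Hardcoded lists such as [8, 16, 32] encode an assumption about the floor, so a change to the constant silently changes which values are eligible. Inputs built from the constant keep the assertion about the floor itself.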