
[Klaud Cold] dsv4-fp4-mi355x-sglang-disagg: DeepSeek-V4-Pro SGLang disagg (8k1k conc=1 smoke test)#1708

Open: functionstackx wants to merge 6 commits into main from dsv4-fp4-mi355x-sglang-disagg

functionstackx (Collaborator) commented Jun 11, 2026

Summary

Adds dsv4-fp4-mi355x-sglang-disagg — a DeepSeek-V4-Pro FP4 prefill/decode-disaggregated benchmark on MI355X via SGLang + MoRI. It combines the two references this work was scoped against:

  • the validated single-node DSv4 SGLang recipe (dsv4-fp4-mi355x-sglang and its MTP variant), and
  • the SGLang-disagg framework (MoRI KV transfer + sglang_router) introduced for the dsr1 / qwen3.5 / glm5 MI355X recipes (#1570, #1572, #1579).

Files

| File | Change |
| --- | --- |
| benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh | New model-agnostic launcher (same shape as the qwen3.5/glm5 sglang-disagg wrappers, with NODE_LIST for local smoke). launch_mi355x-amds.sh resolves dsv4+fp4+sglang-disagg → this filename. |
| benchmarks/multi_node/amd_utils/models.yaml | New DeepSeek-V4-Pro entry (base/dp flags + prefill/decode profiles). |
| benchmarks/multi_node/amd_utils/env.sh | DSv4 FP4-experts SGLANG_* env block + deep_gemm-absence fallback, gated on MODEL_NAME. |
| benchmarks/multi_node/amd_utils/setup_deps.sh | Idempotent, atomic config.json model_type patch, gated on MODEL_NAME. |
| .github/configs/amd-master.yaml | New dsv4-fp4-mi355x-sglang-disagg block. |
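
As a rough illustration of the filename resolution mentioned for launch_mi355x-amds.sh, the sketch below assumes the launcher name is simply the recipe key parts joined with underscores. That pattern is inferred from the one mapping given above (dsv4 + fp4 + sglang-disagg); the function name and the `hardware` parameter are hypothetical and not taken from the actual script.

```python
def launcher_filename(model: str, precision: str, framework: str,
                      hardware: str = "mi355x") -> str:
    """Build a launcher script name from recipe key parts (assumed pattern)."""
    return f"{model}_{precision}_{hardware}_{framework}.sh"


# The one mapping stated in the Files table.
print(launcher_filename("dsv4", "fp4", "sglang-disagg"))
```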

Serving config (models.yaml)

base_flags mirror the validated single-node DSv4 SGLang recipe so the per-worker engine config matches the known-good aggregated run:

  • --attention-backend dsv4, --swa-full-tokens-ratio 0.15, --page-size 256, --disable-shared-experts-fusion
  • --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4, the DSv4 thinking chat template
  • disagg essentials: --disaggregation-transfer-backend mori --load-balance-method round_robin --watchdog-timeout 3600
  • --context-length 9472 is pinned (the model default is very long → would over-reserve KV); covers the 8k/1k smoke point.
  • --kv-cache-dtype left at the model default (the single-node DSv4 recipe sets none), unlike the fp8_e4m3 DeepSeek-R1 disagg entries.
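
The pinned --context-length can be sanity-checked with a little arithmetic. The sketch below assumes the value is the 8k/1k request length rounded up to whole KV pages plus one page of headroom; the PR does not say how 9472 was derived, so the headroom rule is an assumption that happens to reproduce it.

```python
import math


def pinned_context_length(isl: int, osl: int, page_size: int,
                          headroom_pages: int = 1) -> int:
    """Round isl + osl up to whole pages, then add headroom pages (assumed rule)."""
    pages = math.ceil((isl + osl) / page_size) + headroom_pages
    return pages * page_size


# 8192 + 1024 = 9216 tokens = 36 pages of 256; one extra page gives 37 * 256.
assert pinned_context_length(8192, 1024, 256) == 9472
```

Expanding the sweep to longer ISL/OSL points would need a larger pinned value under the same reasoning.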

Env / flags track the validated 0610 single-node recipe (PR #1701, "[AMD][MI35X] 0610 DSV4", successful run): mainline …-20260610 image, --attention-backend dsv4 (not compressed), unified_kv_triton FlashMLA, the aiter indexer, the mainline fp8 wo_a / topk-v2 fallbacks hardcoded (so no deep_gemm-presence detect), and the branch-only SGLANG_DSV4_FP4_EXPERTS / SGLANG_FORCE_TRITON_MOE_FP8 flags dropped. The prefill delayer (--enable-prefill-delayer) is intentionally not used.

The DSv4 SGLANG_* env block (unified_kv_triton FlashMLA, aiter indexer, hardcoded fp8 wo_a / topk-v2 fallbacks, multi-stream overlap off, …) is copied from the single-node recipe (PR #1701) into env.sh, gated on MODEL_NAME.

Image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610

This is the mainline ROCm nightly the DSv4 MTP single-node recipe (dsv4-fp4-mi355x-sglang-mtp) already runs on. It is the right image for disagg because it carries both:

  • DSv4 model support — sgl#26383 ([AMD][DSV4], merged to mainline 2026-05-27), and
  • the MoRI disaggregation transfer backend — it's on the same …-mi35x-… image line as the dsr1/qwen3.5/glm5 disagg recipes.

The aggregated dsv4-fp4-mi355x-sglang entry uses rocm/sgl-dev:*-DSv4, cut from the amd/deepseek_v4 branch, which lacks #26383 and has unverified MoRI support — so it's not suitable for disagg. Mainline omits deep_gemm; env.sh detects that and routes the DSv4 fp8 wo_a / topk paths to torch fallbacks (same logic as the MTP single-node recipe), so it runs on both image lines. The v0.5.12.post1 tag also auto-applies the MoRI conn.py overlay (job.slurm) that fixes the KV wire format for hybrid/sparse-attention models.

Topology

1P1D, TP8/EP1, dp-attn false — the same conservative starting point the qwen3.5 and glm5 sglang-disagg recipes launched with.

Scope: smoke test first

Per request, this starts minimal: a single ISL/OSL (8k/1k) at conc=1, to validate end-to-end that DSv4 + MoRI disaggregation comes up and transfers KV at all on this image, before expanding to the full conc sweep (and DEP / 1P2D). At conc=1 the generator emits one config and skips eval.

Validated locally:

$ generate_sweep_configs.py full-sweep --config-files .github/configs/amd-master.yaml \
    --framework sglang-disagg --runner-type mi355x-disagg
# 1 dsv4_8k1k config: isl=8192 osl=1024 conc=[1], 1P1D TP8/EP1, image …-20260601
$ validate_master_config(amd-master.yaml)   # all 75 entries valid
# config.json patch unit-tested: deepseek_v4 -> deepseek_v3, architectures preserved, idempotent
# bash -n on launcher / env.sh / setup_deps.sh: clean
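
For readers unfamiliar with the pattern, here is a minimal sketch of an idempotent, atomic config.json patch of the kind described above. It is not the helper from setup_deps.sh: the function name and temp-file naming are hypothetical, and it only shows the general technique (rewrite through a temp file in the same directory, then rename over the original).

```python
import json
import os
import tempfile


def patch_model_type(config_path: str, old: str = "deepseek_v4",
                     new: str = "deepseek_v3") -> bool:
    """Return True if the file was rewritten, False if nothing needed changing."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    if config.get("model_type") != old:
        return False  # already patched, or not a DSv4 config: safe no-op
    config["model_type"] = new  # all other keys, e.g. "architectures", stay as-is

    # Write to a temp file in the same directory so the final rename stays on
    # one filesystem; os.replace then swaps it in as a single rename.
    directory = os.path.dirname(os.path.abspath(config_path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".config.json.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(config, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, config_path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    return True
```

Calling it twice on the same file rewrites once and then returns False, which is what makes re-runs harmless. Client-side caching on shared NFS is not addressed by this sketch.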

Test plan

  • Apply the sweep label so run-sweep.yml exercises dsv4-fp4-mi355x-sglang-disagg at 8k1k/conc=1.
  • Confirm …-mi35x-20260610 imports on a fresh MI355X-disagg runner and supports --disaggregation-mode + --disaggregation-transfer-backend mori with the DSv4 model class.
  • Confirm DeepSeek-V4-Pro is staged at $MODEL_DIR/DeepSeek-V4-Pro; verify the setup_deps.sh config.json patch fires (and is a safe no-op on re-runs / concurrent nodes).
  • Verify --attention-backend dsv4 + --page-size 256 interoperate with MoRI KV transfer (the highest-risk unknown — DSv4 sparse MLA vs the MLA shapes MoRI was validated against).
  • On green: expand conc, then enable mori-EP decode (which will require carrying the sglang#27855 aiter fix), and evaluate DEP / 1P2D and MTP.

Risks / open questions

  • Highest risk: DSv4's compressed/sparse-MLA attention + --page-size 256 over the MoRI KV transport is an unvalidated combination (MoRI disagg was validated against DeepSeek-R1 MLA and Qwen3.5/GLM-5). The smoke test exists to surface exactly this.
  • Decode keeps radix cache enabled (the framework only disables it on prefill); harmless for random-token throughput, revisit if it affects correctness.

🤖 Generated with Claude Code


Note

Medium Risk
Touches multi-node disagg launch paths and mutates shared NFS config.json for DSv4; main unknown is DSv4 sparse MLA KV over MoRI at the chosen page size.

Overview
Adds dsv4-fp4-mi355x-sglang-disagg, a DeepSeek-V4-Pro FP4 prefill/decode-disaggregated MI355X benchmark on SGLang + MoRI, wired like the existing dsr1/qwen3.5/glm5 disagg recipes.

amd-master.yaml registers the recipe on lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610 with a 1P1D TP8/EP1 smoke point at 8k/1k, conc=1 (dp-attn off). benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh is a thin launcher that maps topology from the master config into amd_utils/submit.sh.

amd_utils/models.yaml gains DeepSeek-V4-Pro with MoRI disagg flags aligned to the validated single-node DSv4 recipe (dsv4 attention, SWA, page-size 256, parsers/template, pinned context-length, no forced kv-cache-dtype). env.sh adds a MODEL_NAME-gated SGLANG_* block (FlashMLA/indexer/fp8 fallbacks from PR #1701). setup_deps.sh adds an idempotent atomic config.json model_type patch (deepseek_v4 → deepseek_v3) on shared NFS weights.

perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 316dd21. Bugbot is set up for automated code reviews on this repo. Configure here.

…sagg (8k1k conc=1 smoke test)

Adds a DeepSeek-V4-Pro FP4 prefill/decode-disaggregated recipe on MI355X via
SGLang + MoRI, combining the validated single-node DSv4 SGLang recipe with the
sglang-disagg framework used by the dsr1 / qwen3.5 / glm5 mi355x recipes
(#1570, #1572, #1579).

- benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh: model-agnostic launcher
  (same shape as the qwen3.5/glm5 wrappers, with NODE_LIST support).
- amd_utils/models.yaml: DeepSeek-V4-Pro entry. Serving flags mirror the
  single-node recipe (compressed attention, SWA, page-size 256, deepseekv4/
  deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off);
  context-length pinned; kv-cache-dtype left at model default.
- amd_utils/env.sh: DSv4 FP4-experts SGLANG_* env block + deep_gemm-absence
  fallback, gated on MODEL_NAME.
- amd_utils/setup_deps.sh: idempotent, atomic config.json model_type patch
  (deepseek_v4 -> deepseek_v3, architectures preserved), gated on MODEL_NAME.
- amd-master.yaml: dsv4-fp4-mi355x-sglang-disagg, 1P1D TP8/EP1 dp-attn false,
  image v0.5.12.post1-rocm720-mi35x-20260601 (mainline w/ DSv4 #26383 + MoRI
  disagg; auto-applies the MoRI conn.py overlay).

Starts at a single ISL/OSL (8k/1k) conc=1 to smoke-test that DSv4 + MoRI disagg
comes up and transfers KV on this image before expanding the sweep.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
github-actions (Contributor) commented

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…sh fix)

DeepSeek-V4-Pro + MoRI expert-parallel aborts at warmup with
"dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2" on the
clamped-SwiGLU/INTERLEAVE path. sgl-project/sglang#27855 fixes it in
moe_runner/aiter.py:_pre_permute_deepep_to_aiter (W4A4 + FP4-dispatch branch
that dequants the FP4 activation to BF16 via upscale_mxfp4) but is unmerged and
absent from the pinned image.

setup_deps.sh now source-patches aiter.py at container start, gated on
MODEL_NAME == DeepSeek-V4-Pro: idempotent, atomic write, warn+skip if the
image's aiter.py predates the anchored structure. Verified byte-identical to
the PR head against current sglang main.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ention backend

Realigns the DSv4 sglang-disagg recipe with the validated 0610 single-node recipe
(PR #1701, "[AMD][MI35X] 0610 DSV4", successful run):

- image -> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610
- env.sh DSv4 block replaced with #1701's: unified_kv_triton FlashMLA, aiter
  indexer (not tilelang), mainline fp8 wo_a / topk-v2 fallbacks hardcoded
  (SGLANG_OPT_FP8_WO_A_GEMM=false, SGLANG_OPT_USE_TOPK_V2=false) instead of the
  deep_gemm-presence detect; SGLANG_DEFAULT_THINKING / SGLANG_DSV4_REASONING_EFFORT;
  multi-stream overlap off. Branch-only SGLANG_DSV4_FP4_EXPERTS /
  SGLANG_FORCE_TRITON_MOE_FP8 dropped (DSv4 main no longer needs them).
- models.yaml base_flags: --attention-backend compressed -> dsv4; dp_flags add
  --enable-prefill-delayer --prefill-delayer-max-delay-ms 5000 (the #1701 DP path).

Still a v0.5.12.post1 tag, so the MoRI conn.py overlay auto-applies; the #27855
aiter monkey-patch is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Per request, do not use --enable-prefill-delayer / --prefill-delayer-max-delay-ms
in the DSv4 sglang-disagg recipe.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…, no EP)

The #27855 fix only matters on the DSv4 + MoRI expert-parallel path. This recipe
is TP8/EP1 for the smoke test, so that crash isn't reachable. Remove the
patch_aiter_dsv4_fp4_swiglu source-patch from setup_deps.sh; a comment in
amd-master.yaml records that it's needed only when EP/DEP decode is enabled.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

# multi-stream
export SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=false
export SGLANG_ROCM_USE_MULTI_STREAM=false
fi


Missing mainline deep_gemm JIT off

Medium Severity

The DeepSeek-V4-Pro block hardcodes SGLANG_OPT_FP8_WO_A_GEMM and SGLANG_OPT_USE_TOPK_V2 for the mainline …-20260610 image, but omits SGLANG_ENABLE_JIT_DEEPGEMM=0 (and SGLANG_TOPK_TRANSFORM_512_TORCH=1) that the in-repo mainline DSv4 recipe sets when deep_gemm is absent. That image line has no deep_gemm, so startup can still hit JIT or top-k paths that expect it.
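
To make the reported gap concrete, the sketch below collects the four variables named in this thread into one set of overrides applied when deep_gemm is not importable. The function name is hypothetical, env.sh itself is a shell script, and whether the last two variables are actually required on this image is Bugbot's claim rather than something verified here.

```python
import importlib.util
from typing import Dict, Optional


def dsv4_env_overrides(deep_gemm_available: Optional[bool] = None) -> Dict[str, str]:
    """Return env overrides for an image without deep_gemm (illustrative only)."""
    if deep_gemm_available is None:
        # Probe the current interpreter for a top-level deep_gemm module.
        deep_gemm_available = importlib.util.find_spec("deep_gemm") is not None
    if deep_gemm_available:
        return {}
    return {
        # Already hardcoded by the recipe for the mainline image.
        "SGLANG_OPT_FP8_WO_A_GEMM": "false",
        "SGLANG_OPT_USE_TOPK_V2": "false",
        # The two settings the review says are missing.
        "SGLANG_ENABLE_JIT_DEEPGEMM": "0",
        "SGLANG_TOPK_TRANSFORM_512_TORCH": "1",
    }


for name, value in sorted(dsv4_env_overrides(deep_gemm_available=False).items()):
    print(f"export {name}={value}")
```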


os.path.exists(tmp) and os.remove(tmp)
raise
PYEOF
_SETUP_INSTALLED+=("dsv4-config-model-type")


Setup logs false config patch

Low Severity

patch_dsv4_config always appends dsv4-config-model-type to _SETUP_INSTALLED after the Python helper returns, including when the helper exits early because model_type is already deepseek_v3. Setup summary then reports an install/patch that did not run.
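
One possible shape for a fix, sketched under assumptions: have the helper report whether it changed anything, and record the entry only in that case. The names `apply_model_type_patch`, `run_setup`, `installed`, and `skipped` are hypothetical; the real code tracks this in the shell array _SETUP_INSTALLED.

```python
from typing import Dict, List


def apply_model_type_patch(config: Dict[str, object]) -> bool:
    """Mutate config in place; return True only if a change was made."""
    if config.get("model_type") != "deepseek_v4":
        return False  # early exit: already deepseek_v3 (or unrelated model)
    config["model_type"] = "deepseek_v3"
    return True


def run_setup(config: Dict[str, object], installed: List[str],
              skipped: List[str]) -> None:
    """Record the patch as installed only when it actually ran."""
    if apply_model_type_patch(config):
        installed.append("dsv4-config-model-type")
    else:
        skipped.append("dsv4-config-model-type")


installed: List[str] = []
skipped: List[str] = []
cfg: Dict[str, object] = {"model_type": "deepseek_v4"}
run_setup(cfg, installed, skipped)  # first run patches
run_setup(cfg, installed, skipped)  # second run is a no-op and is logged as skipped
```

In the shell wrapper the same idea could be expressed through the helper's exit status.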

