[Klaud Cold] dsv4-fp4-mi355x-sglang-disagg: DeepSeek-V4-Pro SGLang disagg (8k1k conc=1 smoke test)#1708
functionstackx wants to merge 6 commits
…sagg (8k1k conc=1 smoke test)

Adds a DeepSeek-V4-Pro FP4 prefill/decode-disaggregated recipe on MI355X via SGLang + MoRI, combining the validated single-node DSv4 SGLang recipe with the sglang-disagg framework used by the dsr1 / qwen3.5 / glm5 mi355x recipes (#1570, #1572, #1579).

- benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh: model-agnostic launcher (same shape as the qwen3.5/glm5 wrappers, with NODE_LIST support).
- amd_utils/models.yaml: DeepSeek-V4-Pro entry. Serving flags mirror the single-node recipe (compressed attention, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off); context-length pinned; kv-cache-dtype left at model default.
- amd_utils/env.sh: DSv4 FP4-experts SGLANG_* env block + deep_gemm-absence fallback, gated on MODEL_NAME.
- amd_utils/setup_deps.sh: idempotent, atomic config.json model_type patch (deepseek_v4 -> deepseek_v3, architectures preserved), gated on MODEL_NAME.
- amd-master.yaml: dsv4-fp4-mi355x-sglang-disagg, 1P1D TP8/EP1 dp-attn false, image v0.5.12.post1-rocm720-mi35x-20260601 (mainline w/ DSv4 #26383 + MoRI disagg; auto-applies the MoRI conn.py overlay).

Starts at a single ISL/OSL (8k/1k) conc=1 to smoke-test that DSv4 + MoRI disagg comes up and transfers KV on this image before expanding the sweep.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sh fix)

DeepSeek-V4-Pro + MoRI expert-parallel aborts at warmup with "dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2" on the clamped-SwiGLU/INTERLEAVE path. sgl-project/sglang#27855 fixes it in moe_runner/aiter.py:_pre_permute_deepep_to_aiter (W4A4 + FP4-dispatch branch that dequants the FP4 activation to BF16 via upscale_mxfp4) but is unmerged and absent from the pinned image.

setup_deps.sh now source-patches aiter.py at container start, gated on MODEL_NAME == DeepSeek-V4-Pro: idempotent, atomic write, warn+skip if the image's aiter.py predates the anchored structure. Verified byte-identical to the PR head against current sglang main.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
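The "idempotent, atomic write, warn+skip" behaviour described in this commit message can be sketched roughly as below. This is a minimal illustration and not the code in setup_deps.sh: the function name `apply_anchored_patch`, the marker convention, and the returned status strings are all hypothetical, and the real patch targets a specific anchor inside aiter.py that is not reproduced here.

```python
import os
import tempfile
import warnings


def apply_anchored_patch(path: str, anchor: str, replacement: str, marker: str) -> str:
    """Replace `anchor` with `replacement` once; return a status string.

    Hypothetical helper. `marker` is a string that appears only in the
    patched text, which is what makes a second run a no-op.
    """
    if marker not in replacement:
        raise ValueError("replacement must contain the marker to stay idempotent")

    with open(path, encoding="utf-8") as f:
        source = f.read()

    if marker in source:
        return "already-patched"  # idempotent: nothing to do on re-runs

    if anchor not in source:
        # The file does not have the structure the patch expects:
        # warn and leave it untouched rather than guess.
        warnings.warn(f"{path}: anchor not found, leaving file untouched")
        return "skipped"

    patched = source.replace(anchor, replacement, 1)

    # Write to a temp file in the same directory, then rename over the
    # original so readers never observe a half-written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(patched)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    return "patched"
```

Returning a status instead of printing lets the caller decide what to log, which also makes the three outcomes easy to check in a test.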
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27324471640
…ention backend

Realigns the DSv4 sglang-disagg recipe with the validated 0610 single-node recipe (PR #1701, "[AMD][MI35X] 0610 DSV4", successful run):

- image -> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610
- env.sh DSv4 block replaced with #1701's: unified_kv_triton FlashMLA, aiter indexer (not tilelang), mainline fp8 wo_a / topk-v2 fallbacks hardcoded (SGLANG_OPT_FP8_WO_A_GEMM=false, SGLANG_OPT_USE_TOPK_V2=false) instead of the deep_gemm-presence detect; SGLANG_DEFAULT_THINKING / SGLANG_DSV4_REASONING_EFFORT; multi-stream overlap off. Branch-only SGLANG_DSV4_FP4_EXPERTS / SGLANG_FORCE_TRITON_MOE_FP8 dropped (DSv4 main no longer needs them).
- models.yaml base_flags: --attention-backend compressed -> dsv4; dp_flags add --enable-prefill-delayer --prefill-delayer-max-delay-ms 5000 (the #1701 DP path).

Still a v0.5.12.post1 tag, so the MoRI conn.py overlay auto-applies; the #27855 aiter monkey-patch is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27326105979
Per request, do not use --enable-prefill-delayer / --prefill-delayer-max-delay-ms in the DSv4 sglang-disagg recipe.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27326175156
…, no EP)

The #27855 fix only matters on the DSv4 + MoRI expert-parallel path. This recipe is TP8/EP1 for the smoke test, so that crash isn't reachable. Remove the patch_aiter_dsv4_fp4_swiglu source-patch from setup_deps.sh; a comment in amd-master.yaml records that it's needed only when EP/DEP decode is enabled.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27326242579
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 316dd21.
# multi-stream
export SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=false
export SGLANG_ROCM_USE_MULTI_STREAM=false
fi
Missing mainline deep_gemm JIT off
Medium Severity
The DeepSeek-V4-Pro block hardcodes SGLANG_OPT_FP8_WO_A_GEMM and SGLANG_OPT_USE_TOPK_V2 for the mainline …-20260610 image, but omits SGLANG_ENABLE_JIT_DEEPGEMM=0 (and SGLANG_TOPK_TRANSFORM_512_TORCH=1) that the in-repo mainline DSv4 recipe sets when deep_gemm is absent. That image line has no deep_gemm, so startup can still hit JIT or top-k paths that expect it.
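What the reviewer is asking for can be sketched as a small function that builds the DSv4 environment overrides. This is only an illustration of the suggestion, written in Python rather than the shell used by env.sh; the function names are hypothetical, the variable names and values are the ones quoted in this thread, and whether the two extra variables are actually required on this image is the reviewer's claim, not something verified here.

```python
import importlib.util


def deep_gemm_available() -> bool:
    """True if a top-level deep_gemm module can be found in this environment."""
    return importlib.util.find_spec("deep_gemm") is not None


def dsv4_fallback_env(model_name: str, has_deep_gemm: bool) -> dict[str, str]:
    """Return env overrides for a DSv4 worker; empty for any other model.

    Hypothetical helper mirroring the MODEL_NAME gate described for env.sh.
    """
    if model_name != "DeepSeek-V4-Pro":
        return {}

    # Hardcoded in the current block, per the commit message.
    env = {
        "SGLANG_OPT_FP8_WO_A_GEMM": "false",
        "SGLANG_OPT_USE_TOPK_V2": "false",
    }

    # The two settings the review says are missing when deep_gemm is absent.
    if not has_deep_gemm:
        env["SGLANG_ENABLE_JIT_DEEPGEMM"] = "0"
        env["SGLANG_TOPK_TRANSFORM_512_TORCH"] = "1"
    return env


if __name__ == "__main__":
    for key, value in dsv4_fallback_env("DeepSeek-V4-Pro", deep_gemm_available()).items():
        print(f"export {key}={value}")
```

Passing `has_deep_gemm` as an argument, rather than probing inside the function, keeps both branches testable on a machine where the module's presence cannot be changed.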
os.path.exists(tmp) and os.remove(tmp)
raise
PYEOF
_SETUP_INSTALLED+=("dsv4-config-model-type")
Setup logs false config patch
Low Severity
patch_dsv4_config always appends dsv4-config-model-type to _SETUP_INSTALLED after the Python helper returns, including when the helper exits early because model_type is already deepseek_v3. Setup summary then reports an install/patch that did not run.
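One way to address this is to have the helper report whether it changed anything, and to record the install only in that case. The sketch below is an assumption about how that could look, not the helper in setup_deps.sh: `patch_model_type` and `record_if_patched` are hypothetical names, and the Python list stands in for the `_SETUP_INSTALLED` shell array.

```python
import json
import os
import tempfile


def patch_model_type(config_path: str, old: str = "deepseek_v4",
                     new: str = "deepseek_v3") -> bool:
    """Rewrite model_type in config.json; return True only if the file changed."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

    if config.get("model_type") != old:
        return False  # already patched or not a DSv4 config: no-op

    config["model_type"] = new  # "architectures" and all other keys are kept

    # Temp file in the same directory, then an atomic rename, so concurrent
    # nodes reading the shared file never see a partial write.
    directory = os.path.dirname(os.path.abspath(config_path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(config, f, indent=2)
        os.replace(tmp, config_path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    return True


installed: list[str] = []  # stand-in for the _SETUP_INSTALLED shell array


def record_if_patched(config_path: str) -> None:
    """Append to the summary only when the patch actually ran."""
    if patch_model_type(config_path):
        installed.append("dsv4-config-model-type")
```

In the shell version, the same effect could be had by giving the Python heredoc a distinct exit code for the early-return case and appending to the array only on the "changed" code.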
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27326392188


Summary

Adds dsv4-fp4-mi355x-sglang-disagg, a DeepSeek-V4-Pro FP4 prefill/decode-disaggregated benchmark on MI355X via SGLang + MoRI. It combines the two references this work was scoped against:

- the validated single-node DSv4 SGLang recipe (dsv4-fp4-mi355x-sglang and its MTP variant), and
- the sglang-disagg framework (… sglang_router) introduced for the dsr1 / qwen3.5 / glm5 MI355X recipes (#1570, #1572, #1579).

Files
- benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh: model-agnostic launcher (same shape as the qwen3.5/glm5 sglang-disagg wrappers, with NODE_LIST for local smoke). launch_mi355x-amds.sh resolves dsv4 + fp4 + sglang-disagg → this filename.
- benchmarks/multi_node/amd_utils/models.yaml: DeepSeek-V4-Pro entry (base/dp flags + prefill/decode profiles).
- benchmarks/multi_node/amd_utils/env.sh: DSv4 SGLANG_* env block + deep_gemm-absence fallback, gated on MODEL_NAME.
- benchmarks/multi_node/amd_utils/setup_deps.sh: idempotent, atomic config.json model_type patch, gated on MODEL_NAME.
- .github/configs/amd-master.yaml: dsv4-fp4-mi355x-sglang-disagg block.

Serving config (models.yaml)

base_flags mirror the validated single-node DSv4 SGLang recipe so the per-worker engine config matches the known-good aggregated run:

- --attention-backend dsv4, --swa-full-tokens-ratio 0.15, --page-size 256, --disable-shared-experts-fusion
- --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4, the DSv4 thinking chat template
- --disaggregation-transfer-backend mori --load-balance-method round_robin --watchdog-timeout 3600
- --context-length 9472 is pinned (the model default is very long and would over-reserve KV); covers the 8k/1k smoke point.
- --kv-cache-dtype left at the model default (the single-node DSv4 recipe sets none), unlike the fp8_e4m3 DeepSeek-R1 disagg entries.

Env / flags track the validated 0610 single-node recipe (PR #1701, "[AMD][MI35X] 0610 DSV4", successful run): mainline …-20260610 image, --attention-backend dsv4 (not compressed), unified_kv_triton FlashMLA, the aiter indexer, the mainline fp8 wo_a / topk-v2 fallbacks hardcoded (so no deep_gemm-presence detect), and the branch-only SGLANG_DSV4_FP4_EXPERTS / SGLANG_FORCE_TRITON_MOE_FP8 flags dropped. The prefill delayer (--enable-prefill-delayer) is intentionally not used.

The DSv4 SGLANG_* env block (SGLANG_DSV4_FP4_EXPERTS=True, SGLANG_FORCE_TRITON_MOE_FP8=0, aiter MHC, tilelang indexer, triton FlashMLA, …) is copied verbatim from the single-node recipe into env.sh, gated on MODEL_NAME.

Image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610

This is the mainline ROCm nightly the DSv4 MTP single-node recipe (dsv4-fp4-mi355x-sglang-mtp) already runs on. It is the right image for disagg because it carries both:

- mainline DSv4 support (#26383), and
- MoRI disagg, on the same …-mi35x-… image line as the dsr1/qwen3.5/glm5 disagg recipes.

The aggregated dsv4-fp4-mi355x-sglang entry uses rocm/sgl-dev:*-DSv4, cut from the amd/deepseek_v4 branch, which lacks #26383 and has unverified MoRI support, so it's not suitable for disagg. Mainline omits deep_gemm; env.sh detects that and routes the DSv4 fp8 wo_a / topk paths to torch fallbacks (same logic as the MTP single-node recipe), so it runs on both image lines. The v0.5.12.post1 tag also auto-applies the MoRI conn.py overlay (job.slurm) that fixes the KV wire format for hybrid/sparse-attention models.

Topology
1P1D, TP8/EP1, dp-attn false: the same conservative starting point the qwen3.5 and glm5 sglang-disagg recipes launched with.

Scope: smoke test first

Per request, this starts minimal: a single ISL/OSL (8k/1k) at conc=1, to validate end-to-end that DSv4 + MoRI disaggregation comes up and transfers KV at all on this image, before expanding to the full conc sweep (and DEP / 1P2D). At conc=1 the generator emits one config and skips eval.
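The claim that the pinned --context-length 9472 covers the 8k/1k smoke point can be checked with simple arithmetic. The sketch below assumes that "8k" and "1k" mean 8192 and 1024 tokens; that interpretation, and the reading of the leftover as exactly one KV page, are assumptions and are not stated in the PR.

```python
ISL = 8 * 1024         # assumed meaning of "8k" input tokens
OSL = 1 * 1024         # assumed meaning of "1k" output tokens
PAGE_SIZE = 256        # --page-size 256
CONTEXT_LENGTH = 9472  # --context-length 9472

required = ISL + OSL                  # tokens one request can occupy
headroom = CONTEXT_LENGTH - required  # slack beyond prompt + generation

print(required)                    # 9216
print(headroom)                    # 256
print(CONTEXT_LENGTH % PAGE_SIZE)  # 0
```

Under those assumptions the pinned length is page-aligned and leaves one page of headroom over the 9216 tokens a full 8k/1k request needs.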
Validated locally:
Test plan
- run-sweep.yml exercises dsv4-fp4-mi355x-sglang-disagg at 8k1k/conc=1.
- …-mi35x-20260601 imports on a fresh MI355X-disagg runner and supports --disaggregation-mode + --disaggregation-transfer-backend mori with the DSv4 model class.
- $MODEL_DIR/DeepSeek-V4-Pro; verify the setup_deps.sh config.json patch fires (and is a safe no-op on re-runs / concurrent nodes).
- --attention-backend dsv4 + --page-size 256 interoperate with MoRI KV transfer (the highest-risk unknown: DSv4 sparse MLA vs the MLA shapes MoRI was validated against).

Risks / open questions

--page-size 256 over the MoRI KV transport is an unvalidated combination (MoRI disagg was validated against DeepSeek-R1 MLA and Qwen3.5/GLM-5). The smoke test exists to surface exactly this.

🤖 Generated with Claude Code
Note

Medium Risk

Touches multi-node disagg launch paths and mutates shared NFS config.json for DSv4; main unknown is DSv4 sparse MLA KV over MoRI at the chosen page size.

Overview

Adds dsv4-fp4-mi355x-sglang-disagg, a DeepSeek-V4-Pro FP4 prefill/decode-disaggregated MI355X benchmark on SGLang + MoRI, wired like the existing dsr1/qwen3.5/glm5 disagg recipes.

amd-master.yaml registers the recipe on lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610 with a 1P1D TP8/EP1 smoke point at 8k/1k, conc=1 (dp-attn off). benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh is a thin launcher that maps topology from the master config into amd_utils/submit.sh.

amd_utils/models.yaml gains DeepSeek-V4-Pro with MoRI disagg flags aligned to the validated single-node DSv4 recipe (dsv4 attention, SWA, page-size 256, parsers/template, pinned context-length, no forced kv-cache-dtype). env.sh adds a MODEL_NAME-gated SGLANG_* block (FlashMLA/indexer/fp8 fallbacks from PR #1701). setup_deps.sh adds an idempotent atomic config.json model_type patch (deepseek_v4 → deepseek_v3) on shared NFS weights. perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 316dd21.