[Klaud Cold] dsv4-fp4-mi355x-vllm-disagg: DeepSeek-V4-Pro vLLM disagg (8k1k conc=1 smoke test) #1707
functionstackx wants to merge 14 commits
…m image Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
… (1k1k conc=1 smoke test)
Adds a DeepSeek-V4-Pro disaggregated prefill/decode recipe on MI355X via vLLM + MoRI-IO, combining the validated single-node DSv4 vLLM serving recipe (dsv4-fp4-mi355x-vllm, vllm-project/recipes#433) with the vLLM-disagg framework introduced for the kimi / minimax mi355x recipes (#1141, #1569).
- benchmarks/multi_node/dsv4_fp4_mi355x_vllm-disagg.sh: model-agnostic launcher (identical in shape to the kimi/minimax wrappers).
- amd_utils/models_vllm.yaml: DeepSeek-V4-Pro entry. Per-node serving flags reuse the aggregated recipe verbatim (--moe-backend triton_unfused required for the FP4 expert format, deepseek_v4 tokenizer/reasoning parser, fp8 KV, --enforce-eager); only the MoRIIO kv-transfer role is added by the framework.
- amd-master.yaml: dsv4-fp4-mi355x-vllm-disagg, 1P1D (TP8/EP1 prefill+decode), image v0.22.0 (carries both DeepseekV4ForCausalLM and the MoRIIO connector, and stays pullable unlike the GC'd nightly tags).
Starts with a single ISL/OSL (1k/1k) at conc=1 to smoke-test the path end-to-end before expanding to the full 1k1k + 8k1k, conc 8-512 sweep.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27323783591
…ch-free nightly
Brings the vLLM-disagg infra onto the upstream-MoRIIO nightly so the large setup_deps.sh runtime patches are dropped (vllm#40344), and migrates the new dsv4-fp4-mi355x-vllm-disagg recipe to match:
- image -> vllm/vllm-openai-rocm:nightly-3f0a91bb (carries #40344 + DeepseekV4); not available in v0.22.0/v0.22.1 release tags
- drop VLLM_MORIIO_CONNECTOR_READ_MODE env setting (read_mode now set via kv_connector_extra_config in server_vllm.sh)
- dsv4 is TP8/EP1 so no all2all backend / mori_low_latency rename needed
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27324475132
Remove the kimik2.5/minimaxm2.5 vllm-disagg changelog entry (that change is documented in #1585) and scrub kimi/minimax references from the dsv4-fp4-mi355x-vllm-disagg entry descriptions. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27395381813
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27395515912
Summary
Adds dsv4-fp4-mi355x-vllm-disagg, a DeepSeek-V4-Pro disaggregated prefill/decode benchmark on MI355X via vLLM + MoRI-IO. It combines the two pieces this work was scoped against:
- the validated single-node DSv4 vLLM serving recipe (dsv4-fp4-mi355x-vllm, from vllm-project/recipes#433, landed here in #1374), and
- the vLLM-disagg framework introduced for the kimi/minimax MI355X recipes (#1141, #1569).
Disaggregation only adds the MoRIIO kv-transfer role to each worker; the per-node engine config is otherwise identical to the known-good aggregated run.
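The split between shared engine flags and the per-role kv-transfer setting can be sketched as follows. This is a minimal illustration, not the framework's actual code: the function name build_worker_args and the abbreviated flag list are hypothetical, and the JSON keys (kv_connector, kv_role), the connector name MoRIIOConnector, and the role names kv_producer / kv_consumer are assumptions based on vLLM's --kv-transfer-config option rather than anything shown in this PR.

```python
import json

# Shared per-node serving flags. Abbreviated for illustration; the values
# are ones named in this PR's description.
BASE_FLAGS = [
    "--moe-backend", "triton_unfused",
    "--kv-cache-dtype", "fp8",
    "--enforce-eager",
]

# Assumption: prefill workers produce KV cache, decode workers consume it.
KV_ROLES = {"prefill": "kv_producer", "decode": "kv_consumer"}


def build_worker_args(role: str) -> list[str]:
    """Return serving flags for one worker; only the kv-transfer JSON differs by role."""
    if role not in KV_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    # Assumption: connector name as registered in vLLM.
    kv_config = {"kv_connector": "MoRIIOConnector", "kv_role": KV_ROLES[role]}
    return BASE_FLAGS + ["--kv-transfer-config", json.dumps(kv_config)]


if __name__ == "__main__":
    for role in ("prefill", "decode"):
        print(role, " ".join(build_worker_args(role)))
```

Both workers end up with the same flags except for the final JSON argument, which mirrors the claim that disaggregation leaves the engine config untouched.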
Files
- benchmarks/multi_node/dsv4_fp4_mi355x_vllm-disagg.sh: model-agnostic launcher, identical in shape to the kimi/minimax vllm-disagg wrappers (launch_mi355x-amds.sh resolves dsv4 + fp4 + vllm-disagg → this filename).
- benchmarks/multi_node/amd_utils/models_vllm.yaml: DeepSeek-V4-Pro entry, with prefill/decode flags + env keyed on MODEL_NAME.
- .github/configs/amd-master.yaml: dsv4-fp4-mi355x-vllm-disagg block.
Serving config (models_vllm.yaml)
Per-node flags reuse the aggregated recipe verbatim, so the engine config matches the known-good single-node run:
- --moe-backend triton_unfused: required for the FP4 MoE expert weight format (the auto backend doesn't register the FP4 scale params → safetensors KeyError).
- --tokenizer-mode deepseek_v4 --reasoning-parser deepseek_v4 --kv-cache-dtype fp8 --no-enable-prefix-caching --distributed-executor-backend mp --gpu-memory-utilization 0.9 --max-num-batched-tokens 8192.
- --enforce-eager: no CUDA graphs, to keep the first disagg recipe robust against cudagraph/MoRIIO-hook interactions (FULL/PIECEWISE capture is a follow-up).
- --async-scheduling: intentionally omitted (not used by the kimi/minimax vllm-disagg recipes).
- Env: VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600.
Image: vllm/vllm-openai-rocm:nightly-3f0a91bb… (patch-free, via #1585)
This PR also folds in #1585 ("Remove MoRI-IO patches from vLLM Disagg benchmarks"), so all three vllm-disagg recipes (kimi, minimax, dsv4) run patch-free on the same nightly:
- setup_deps.sh drops ~557 lines of runtime MoRIIO Python patches; they were upstreamed in vllm#40344 (merged 2026-05-28).
- The kimi/minimax decode all2all backend is renamed mori → mori_low_latency; read-mode is now set via read_mode: true in kv_connector_extra_config (server_vllm.sh) instead of the VLLM_MORIIO_CONNECTOR_READ_MODE env var.
Why the nightly and not a release tag: vllm#40344 is not in v0.22.0 or v0.22.1. It landed ~1 day before the v0.22.0 cut and wasn't backported (both release trees contain zero occurrences of read_mode), so the patch-free path requires the nightly. nightly-3f0a91bb96f8d72e0498b95c166e817deae14d62 (2026-06-03) carries #40344, DeepseekV4ForCausalLM (vllm#40871), and the MoRIIO connector (vllm#29304); it's confirmed live on Docker Hub. (Note: the GC'd-tag risk I'd flagged for the old disagg nightlies applies here too, but this is the maintained image the kimi/minimax recipes now share, so the disagg cluster caches it.)
Topology
1P1D: 1 prefill node + 1 decode node (2 nodes total), each a full TP=8, EP=1 worker. This matches the aggregated recipe, which runs DSv4 on TP=8 without expert parallelism (--moe-backend triton_unfused handles the FP4 sharding at TP=8, so EP is not required to load; at EP=1 there is no all2all backend, so the mori_low_latency rename from #1585 doesn't touch this recipe). DEP decode and multi-node 1P2D are follow-ups once the base path validates.
Scope: smoke test first
Per request, this starts minimal: a single ISL/OSL (8k/1k) at conc=1, to validate the path end-to-end (image pull, MoRIIO transport, serving flags, model staging on the disagg cluster) before expanding to the full 1k1k + 8k1k, conc 8-512 sweep the kimi/minimax recipes run. At conc=1 the generator emits one config and skips eval.
Validated locally:
Test plan
- run-sweep.yml exercises dsv4-fp4-mi355x-vllm-disagg at 8k1k/conc=1.
- nightly-3f0a91bb… imports on a fresh MI355X-disagg runner and exposes DeepseekV4ForCausalLM, the MoRIIO connector, and the native read_mode flag (no setup_deps patches).
- Model is staged on the disagg cluster (models--deepseek-ai--DeepSeek-V4-Pro or DeepSeek-V4-Pro under MODEL_DIR).
🤖 Generated with Claude Code
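The model-staging check in the test plan could be automated with a small helper like the one below. It is a sketch under assumptions: the helper name find_staged_model is hypothetical, and it only checks that one of the two directory names from the test plan exists under MODEL_DIR. It does not validate the weights themselves.

```python
import os
from pathlib import Path
from typing import Optional

# The two directory layouts named in the test plan: a Hugging Face hub
# cache entry and a plain model directory.
CANDIDATE_NAMES = ("models--deepseek-ai--DeepSeek-V4-Pro", "DeepSeek-V4-Pro")


def find_staged_model(model_dir: str) -> Optional[Path]:
    """Return the first candidate directory that exists under model_dir, else None."""
    root = Path(model_dir)
    for name in CANDIDATE_NAMES:
        candidate = root / name
        if candidate.is_dir():
            return candidate
    return None


if __name__ == "__main__":
    # MODEL_DIR is read from the environment; defaulting to "." is an
    # assumption made for this example only.
    found = find_staged_model(os.environ.get("MODEL_DIR", "."))
    print(found if found is not None else "model not staged")
```

Running this on the runner before the sweep starts would surface a missing model early, instead of after the image pull and server launch.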
Note
Medium Risk
Touches shared multi-node vLLM-disagg plumbing (images, MoRI-IO config, large setup_deps removal) for kimi/minimax as well as the new DSv4 path; benchmark/infra risk rather than app auth or data handling.
Overview
Adds dsv4-fp4-mi355x-vllm-disagg, a DeepSeek-V4-Pro disaggregated prefill/decode benchmark on MI355X (vLLM + MoRI-IO), with a new launcher script, DeepSeek-V4-Pro serving flags in models_vllm.yaml (aligned with the single-node DSv4 recipe), and an amd-master.yaml entry scoped as an 8k/1k, conc=1 smoke test on 1P1D TP8/EP1.
The MoRI-IO patch-free path (#1585) is folded in for all MI355X vllm-disagg recipes: kimi/minimax move to nightly-3f0a91bb…, setup_deps.sh drops ~550 lines of runtime MoRIIO Python patches, VLLM_MORIIO_CONNECTOR_READ_MODE is removed from Slurm/submit/docker env, and read_mode: true is set in server_vllm.sh kv_connector_extra_config. The Kimi/MiniMax decode all2all backend is renamed mori → mori_low_latency; the vLLM router default image is bumped. perf-changelog.yaml documents the new config key.
Reviewed by Cursor Bugbot for commit 46ffe59.
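As an illustration of the read-mode change summarized above, the sketch below moves the setting out of the environment and into kv_connector_extra_config. The function name migrate_read_mode and the shape of the input dictionaries are assumptions made for the example; the actual change lives in server_vllm.sh and the Slurm/submit/docker env files, which are not reproduced here.

```python
import json

OLD_ENV_VAR = "VLLM_MORIIO_CONNECTOR_READ_MODE"


def migrate_read_mode(env: dict, kv_transfer_config: dict) -> tuple:
    """Drop the old env var and set read_mode in kv_connector_extra_config.

    Returns new (env, config) dictionaries; the inputs are left unmodified.
    """
    new_env = {k: v for k, v in env.items() if k != OLD_ENV_VAR}
    new_config = dict(kv_transfer_config)
    extra = dict(new_config.get("kv_connector_extra_config", {}))
    extra["read_mode"] = True  # mirrors "read_mode: true" from the PR text
    new_config["kv_connector_extra_config"] = extra
    return new_env, new_config


if __name__ == "__main__":
    # Hypothetical starting point: connector and role names are assumptions.
    env = {OLD_ENV_VAR: "1", "VLLM_ROCM_USE_AITER": "1"}
    config = {"kv_connector": "MoRIIOConnector", "kv_role": "kv_consumer"}
    new_env, new_config = migrate_read_mode(env, config)
    print(sorted(new_env))
    print(json.dumps(new_config, sort_keys=True))
```

Keeping the setting inside the connector config means it travels with the serving command rather than depending on every launcher exporting the same variable.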