From a0ad648ea153b94b5a0cc955791fbc0f63e0cca5 Mon Sep 17 00:00:00 2001 From: functionstackx <47992694+functionstackx@users.noreply.github.com> Date: Thu, 11 Jun 2026 00:38:51 -0400 Subject: [PATCH 1/6] [Klaud Cold] dsv4-fp4-mi355x-sglang-disagg: DeepSeek-V4-Pro SGLang disagg (8k1k conc=1 smoke test) Adds a DeepSeek-V4-Pro FP4 prefill/decode-disaggregated recipe on MI355X via SGLang + MoRI, combining the validated single-node DSv4 SGLang recipe with the sglang-disagg framework used by the dsr1 / qwen3.5 / glm5 mi355x recipes (#1570, #1572, #1579). - benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh: model-agnostic launcher (same shape as the qwen3.5/glm5 wrappers, with NODE_LIST support). - amd_utils/models.yaml: DeepSeek-V4-Pro entry. Serving flags mirror the single-node recipe (compressed attention, SWA, page-size 256, deepseekv4/ deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off); context-length pinned; kv-cache-dtype left at model default. - amd_utils/env.sh: DSv4 FP4-experts SGLANG_* env block + deep_gemm-absence fallback, gated on MODEL_NAME. - amd_utils/setup_deps.sh: idempotent, atomic config.json model_type patch (deepseek_v4 -> deepseek_v3, architectures preserved), gated on MODEL_NAME. - amd-master.yaml: dsv4-fp4-mi355x-sglang-disagg, 1P1D TP8/EP1 dp-attn false, image v0.5.12.post1-rocm720-mi35x-20260601 (mainline w/ DSv4 #26383 + MoRI disagg; auto-applies the MoRI conn.py overlay). Starts at a single ISL/OSL (8k/1k) conc=1 to smoke-test that DSv4 + MoRI disagg comes up and transfers KV on this image before expanding the sweep. 
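The idempotent, atomic config.json patch described above can be sanity-checked locally against a throwaway file before a multi-node run. The following is a minimal sketch under stated assumptions, not part of this patch: `patch_model_type` and `demo` are hypothetical helpers that mirror the heredoc logic in setup_deps.sh, and the sample config contents are illustrative only.

```python
import json
import os
import tempfile


def patch_model_type(cfg_path):
    """Hypothetical helper mirroring the setup_deps.sh heredoc.

    Returns True if the file was rewritten, False if it was left alone.
    """
    with open(cfg_path) as f:
        config = json.load(f)
    if config.get("model_type") != "deepseek_v4":
        return False  # already patched (or a different model): no-op
    config["model_type"] = "deepseek_v3"
    # Write a temp file in the same directory, then rename it over the
    # original so a concurrent reader never sees a partial file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(cfg_path), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f, indent=2)
        os.replace(tmp, cfg_path)
    except Exception:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    return True


def demo():
    """Patch an illustrative config twice and report what happened."""
    with tempfile.TemporaryDirectory() as d:
        cfg = os.path.join(d, "config.json")
        with open(cfg, "w") as f:
            json.dump({"model_type": "deepseek_v4",
                       "architectures": ["DeepseekV4ForCausalLM"]}, f)
        first = patch_model_type(cfg)    # rewrites the file
        second = patch_model_type(cfg)   # should be a no-op
        with open(cfg) as f:
            final = json.load(f)
        leftovers = [n for n in os.listdir(d) if n != "config.json"]
    return first, second, final, leftovers
```

Calling `demo()` exercises the two properties the patch relies on: the second call changes nothing, and `architectures` is left untouched so SGLang can still dispatch to its DSv4 model class. No temp file should be left behind in the directory.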
Co-Authored-By: Claude Fable 5 --- .github/configs/amd-master.yaml | 56 ++++++++++++ benchmarks/multi_node/amd_utils/env.sh | 55 +++++++++++ benchmarks/multi_node/amd_utils/models.yaml | 42 +++++++++ benchmarks/multi_node/amd_utils/setup_deps.sh | 49 ++++++++++ .../dsv4_fp4_mi355x_sglang-disagg.sh | 91 +++++++++++++++++++ 5 files changed, 293 insertions(+) create mode 100755 benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml index 0e548e611..422e1a64e 100644 --- a/.github/configs/amd-master.yaml +++ b/.github/configs/amd-master.yaml @@ -2136,6 +2136,62 @@ dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp: - "DECODE_MTP_SIZE=1" +# DeepSeek-V4-Pro FP4 PD-disaggregation on MI355X via SGLang + MoRI. Combines the +# validated single-node DSv4 SGLang recipe (dsv4-fp4-mi355x-sglang below) with the +# SGLang-disagg framework used by the dsr1 / qwen3.5 / glm5 mi355x recipes. Routes +# to benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh; per-node serving flags +# live in the DeepSeek-V4-Pro entry of amd_utils/models.yaml, the DSv4 FP4-experts +# SGLANG_* env block in amd_utils/env.sh, and the config.json model_type patch in +# amd_utils/setup_deps.sh (all gated on MODEL_NAME). +# +# Image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601 — the mainline +# ROCm nightly the DSv4 MTP single-node recipe (dsv4-fp4-mi355x-sglang-mtp) runs +# on. It carries DSv4 support (sgl#26383, mainline 2026-05-27) AND is on the same +# -mi35x- image line as the dsr1/qwen3.5/glm5 disagg recipes, so it has the MoRI +# disaggregation transfer backend — unlike the rocm/sgl-dev:*-DSv4 branch image +# the aggregated dsv4-fp4-mi355x-sglang entry uses, which is cut from +# amd/deepseek_v4 (no #26383, MoRI support unverified). Mainline omits deep_gemm; +# env.sh detects that and routes the DSv4 fp8 wo_a / topk paths to torch +# fallbacks. 
The v0.5.12.post1 tag also auto-applies the MoRI conn.py overlay +# (job.slurm) that fixes the wire format for hybrid/sparse-attention models. +# +# Topology 1P1D, TP8/EP1, dp-attn false — the same conservative starting point the +# qwen3.5 / glm5 sglang-disagg recipes launched with. Starts at a single ISL/OSL +# (8k/1k) conc=1 as an end-to-end smoke test (does DSv4 + MoRI disagg come up and +# transfer KV at all on this image) before expanding to the full conc / DEP sweep. +dsv4-fp4-mi355x-sglang-disagg: + image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601 + model: deepseek-ai/DeepSeek-V4-Pro + model-prefix: dsv4 + runner: mi355x-disagg + precision: fp4 + framework: sglang-disagg + multinode: true + disagg: true + scenarios: + fixed-seq-len: + - isl: 8192 + osl: 1024 + search-space: + # 1P1D TP8/EP1: 1 prefill node (co-located with router) + 1 decode node + - spec-decoding: "none" + conc-list: [ 1 ] + prefill: + num-worker: 1 + tp: 8 + ep: 1 + dp-attn: false + additional-settings: + - "PREFILL_NODES=1" + decode: + num-worker: 1 + tp: 8 + ep: 1 + dp-attn: false + additional-settings: + - "DECODE_NODES=1" + - "DECODE_MTP_SIZE=0" + # DSv4-Pro FP4 on MI355X via SGLang. Uses a rocm720 mi35x image built off the # amd/deepseek_v4 branch in sgl-project/sglang; the SHA is encoded in the # image tag, so bumping sglang is just an image tag bump here. Sweeps diff --git a/benchmarks/multi_node/amd_utils/env.sh b/benchmarks/multi_node/amd_utils/env.sh index 71d2653bd..e1fd17c28 100755 --- a/benchmarks/multi_node/amd_utils/env.sh +++ b/benchmarks/multi_node/amd_utils/env.sh @@ -149,6 +149,61 @@ else export SAFETENSORS_FAST_GPU=1 fi + # DeepSeek-V4-Pro (FP4 experts): the DSv4 SGLANG_* env block, copied verbatim + # from the single-node recipe (benchmarks/single_node/dsv4_fp4_mi355x_sglang*.sh), + # which tracks python/run_dsv4.sh on the amd/deepseek_v4 branch with the two + # FP4 overrides (SGLANG_DSV4_FP4_EXPERTS=True, SGLANG_FORCE_TRITON_MOE_FP8=0). 
+ # Routes experts through the FP4 kernels + aiter MoE / MHC, tilelang indexer, + # triton SWA-prepare and FlashMLA. The deep_gemm-absence fallback at the end + # mirrors the single-node MTP recipe so this also runs on a mainline ROCm + # nightly (which omits deep_gemm). All gated on MODEL_NAME so other models are + # unaffected. + if [[ "$MODEL_NAME" == "DeepSeek-V4-Pro" ]]; then + export SGLANG_REASONING_EFFORT=max + export SGLANG_OPT_USE_FUSED_COMPRESS=true + export SGLANG_OPT_USE_OLD_COMPRESSOR=false + export SGLANG_OPT_USE_TILELANG_SWA_PREPARE=false + export SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK=false + export SGLANG_OPT_USE_FUSED_HASH_TOPK=true + export SGLANG_OPT_DEEPGEMM_HC_PRENORM=false + export SGLANG_OPT_USE_TILELANG_MHC_PRE=false + export SGLANG_OPT_USE_TILELANG_MHC_POST=false + export SGLANG_OPT_USE_AITER_MHC_PRE=true + export SGLANG_OPT_USE_AITER_MHC_POST=true + export SGLANG_ENABLE_THINKING=1 + export SGLANG_USE_AITER=1 + export SGLANG_USE_ROCM700A=1 + export SGLANG_TOPK_TRANSFORM_512_TORCH=0 + export SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1 + export SGLANG_DSV4_FP4_EXPERTS=True + export SGLANG_OPT_DPSK_V4_RADIX=1 + export SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false + export SGLANG_OPT_USE_FUSED_STORE_CACHE=true + export SGLANG_FORCE_TRITON_MOE_FP8=0 + export SGLANG_HACK_FLASHMLA_BACKEND=triton + export SGLANG_OPT_USE_TILELANG_INDEXER=true + export SGLANG_OPT_USE_TRITON_SWA_PREPARE=true + export AITER_BF16_FP8_MOE_BOUND=0 + export SGLANG_OPT_FUSE_WQA_WKV=true + export SGLANG_OPT_USE_FUSED_PAGED_COMPRESS=true + export SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=0 + + # Mainline ROCm nightlies carry DSv4 support but omit deep_gemm (only + # rocm/sgl-dev:*-DSv4 builds bundle it). DSv4-Pro's default fp8 wo_a path + # imports deep_gemm at weight load; detect its absence and route the + # deep_gemm-touching paths to their torch fallbacks. No-op on a + # deep_gemm-bearing image, so this works on both image lines. 
+ if python3 -c "import deep_gemm" >/dev/null 2>&1; then + echo "[env.sh] deep_gemm present -> DSv4 fp8 wo_a / deep_gemm perf path" + else + echo "[env.sh] deep_gemm absent -> routing DSv4 fp8 wo_a / topk around it" + export SGLANG_OPT_FP8_WO_A_GEMM=0 + export SGLANG_TOPK_TRANSFORM_512_TORCH=1 + export SGLANG_OPT_USE_TOPK_V2=0 + export SGLANG_ENABLE_JIT_DEEPGEMM=0 + fi + fi + # Disable allocating memory in one pass export MORI_SHMEM_MODE=ISOLATION diff --git a/benchmarks/multi_node/amd_utils/models.yaml b/benchmarks/multi_node/amd_utils/models.yaml index 605a377be..2ccbcd4b7 100644 --- a/benchmarks/multi_node/amd_utils/models.yaml +++ b/benchmarks/multi_node/amd_utils/models.yaml @@ -349,3 +349,45 @@ DeepSeek-R1-0528-MXFP4-v2: max_running_requests: 128 chunked_prefill_size: 262144 cuda_graph_bs_range: "1-128" + +# DeepSeek-V4-Pro (FP4 experts) PD-disaggregation. Serving flags mirror the +# validated single-node SGLang recipe (benchmarks/single_node/dsv4_fp4_mi355x_sglang.sh +# + the MTP variant): compressed attention backend, SWA, page-size 256, the +# deepseekv4 tool-call / deepseek-v4 reasoning parsers, the DSv4 thinking chat +# template, and shared-experts-fusion disabled. The DSv4 FP4-experts SGLANG_* +# env block + deep_gemm-absence fallback live in env.sh, and the config.json +# model_type patch (deepseek_v4 -> deepseek_v3) in setup_deps.sh, both gated on +# MODEL_NAME == DeepSeek-V4-Pro. --context-length is pinned (model default is +# very long; would over-reserve KV); 9472 covers the 8k/1k smoke point. +# kv-cache-dtype is intentionally left at the model default (the single-node DSv4 +# recipe sets none), unlike the fp8_e4m3 DeepSeek-R1 disagg entries. 
+DeepSeek-V4-Pro: + base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --load-balance-method round_robin --disaggregation-transfer-backend mori --attention-backend compressed --swa-full-tokens-ratio 0.15 --page-size 256 --disable-shared-experts-fusion --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --context-length 9472 --chat-template /workspace/benchmarks/single_node/chat_templates/deepseek_v4_thinking.jinja" + mtp_flags: "" + dp_flags: "--moe-a2a-backend mori --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head" + prefill: + mem_fraction_static: 0.8 + disable_radix_cache: true + dp: + max_running_requests: 24 + chunked_prefill_size: "MORI_MAX_DISPATCH_TOKENS_PREFILL * PREFILL_TP_SIZE" + cuda_graph_bs: "1 2 3" + no_dp: + max_running_requests: 32 + chunked_prefill_size: 8192 + cuda_graph_bs_range: "1-32" + decode: + mem_fraction_static: 0.85 + prefill_round_robin_balance: true + dp: + max_running_requests: 4096 + chunked_prefill_size: "MORI_MAX_DISPATCH_TOKENS_DECODE * DECODE_TP_SIZE" + cuda_graph_bs_range: "1-160" + ep_only: + max_running_requests: 256 + chunked_prefill_size: 262144 + cuda_graph_bs_range: "1-256" + no_dp: + max_running_requests: 64 + chunked_prefill_size: 8192 + cuda_graph_bs_range: "1-64" diff --git a/benchmarks/multi_node/amd_utils/setup_deps.sh b/benchmarks/multi_node/amd_utils/setup_deps.sh index add2e3fa5..13714aa77 100644 --- a/benchmarks/multi_node/amd_utils/setup_deps.sh +++ b/benchmarks/multi_node/amd_utils/setup_deps.sh @@ -735,6 +735,54 @@ install_transformers_glm5() { _SETUP_INSTALLED+=("transformers-glm5") } +# --------------------------------------------------------------------------- +# SGLang: DeepSeek-V4-Pro config.json model_type patch. +# +# Transformers in these images doesn't recognize the `deepseek_v4` model_type, +# so AutoConfig.from_pretrained crashes before SGLang can dispatch. 
The +# single-node DSv4 recipes patch the HF-cache config.json directly; for disagg +# the weights live on shared NFS at $MODEL_DIR/$MODEL_NAME, so patch that +# config.json instead. Set model_type -> deepseek_v3 (so AutoConfig succeeds) +# while keeping architectures=['DeepseekV4ForCausalLM'] so SGLang still +# dispatches to its native DSv4 model class. +# +# Idempotent (no-op once model_type is deepseek_v3) and crash-safe under the +# concurrent multi-node start: writes a temp file in the same dir and os.replace() +# (atomic same-filesystem rename), so a reader never sees a half-written config. +# Only runs for MODEL_NAME == DeepSeek-V4-Pro. +# --------------------------------------------------------------------------- +patch_dsv4_config() { + if [[ "$MODEL_NAME" != "DeepSeek-V4-Pro" ]]; then + return 0 + fi + local cfg="${MODEL_DIR}/${MODEL_NAME}/config.json" + if [[ ! -f "$cfg" ]]; then + echo "[SETUP] WARN: DSv4 config.json not found at $cfg; skipping model_type patch" + return 0 + fi + python3 - "$cfg" <<'PYEOF' +import json, os, sys, tempfile +cfg = sys.argv[1] +with open(cfg) as f: + config = json.load(f) +if config.get("model_type") != "deepseek_v4": + print(f"[SETUP] DSv4 config.json already patched (model_type={config.get('model_type')!r})") + sys.exit(0) +config["model_type"] = "deepseek_v3" +d = os.path.dirname(cfg) +fd, tmp = tempfile.mkstemp(dir=d, prefix=".config.json.", suffix=".tmp") +try: + with os.fdopen(fd, "w") as f: + json.dump(config, f, indent=2) + os.replace(tmp, cfg) + print(f"[SETUP] Patched {cfg}: model_type deepseek_v4 -> deepseek_v3") +except Exception: + os.path.exists(tmp) and os.remove(tmp) + raise +PYEOF + _SETUP_INSTALLED+=("dsv4-config-model-type") +} + # ============================================================================= # Run installers (engine-gated) # ============================================================================= @@ -759,6 +807,7 @@ if [[ "$ENGINE" == "vllm-disagg" ]]; then else 
patch_gluon_pa_mqa_logits_instr_shape install_transformers_glm5 + patch_dsv4_config fi _SETUP_END=$(date +%s) diff --git a/benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh b/benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh new file mode 100755 index 000000000..e55559519 --- /dev/null +++ b/benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh @@ -0,0 +1,91 @@ +#!/usr/bin/env bash + +# DeepSeek-V4-Pro FP4 disaggregated prefill/decode on MI355X via SGLang + MoRI. +# Thin, model-agnostic launcher (same shape as the dsr1 / qwen3.5 / glm5 +# sglang-disagg wrappers): all serving flags live in the DeepSeek-V4-Pro entry +# of amd_utils/models.yaml, DSv4-specific env + the config.json model_type patch +# live in amd_utils/env.sh + setup_deps.sh, and topology (P/D node counts, TP/EP) +# comes from amd-master.yaml. + +source "$(dirname "$0")/../benchmark_lib.sh" + +check_env_vars \ + CONC_LIST \ + ISL \ + OSL \ + IMAGE \ + SPEC_DECODING \ + MODEL_PATH \ + PREFILL_NUM_WORKERS \ + PREFILL_TP \ + PREFILL_EP \ + PREFILL_DP_ATTN \ + DECODE_NUM_WORKERS \ + DECODE_TP \ + DECODE_EP \ + DECODE_DP_ATTN \ + PREFILL_NODES \ + DECODE_NODES \ + RANDOM_RANGE_RATIO \ + FRAMEWORK + +if [[ -n "$SLURM_JOB_ID" ]]; then + echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" +fi + +set -x + +# Use upstreamed multi_node scripts (no external clone needed) +cd "$GITHUB_WORKSPACE/benchmarks/multi_node/amd_utils" || exit 1 + +# Set up SGL launch script-specific environment variables +export TIME_LIMIT="08:00:00" +export MODEL_PATH=$MODEL_PATH +export MODEL_NAME=$MODEL_NAME +export CONTAINER_IMAGE=$IMAGE + +if [[ "${PREFILL_EP:-1}" -eq 1 ]]; then +export PREFILL_ENABLE_EP=false +else +export PREFILL_ENABLE_EP=true +fi + +if [[ "$PREFILL_DP_ATTN" == "true" ]]; then +export PREFILL_ENABLE_DP=true +else +export PREFILL_ENABLE_DP=false +fi + +if [[ "${DECODE_EP:-1}" -eq 1 ]]; then +export DECODE_ENABLE_EP=false +else +export DECODE_ENABLE_EP=true +fi + +if [[ "$DECODE_DP_ATTN" == "true" ]]; 
then +export DECODE_ENABLE_DP=true +else +export DECODE_ENABLE_DP=false +fi + +# Launch jobs based on ISL/OSL +# Replace ' ' in CONC_LIST with 'x' such that the concurrency list is represented +# by a list of numbers delimited by 'x'. This is because of how the underlying launch script +# expects the concurrencies. +JOB_ID=$(bash ./submit.sh $PREFILL_NODES \ + $PREFILL_NUM_WORKERS \ + $DECODE_NODES \ + $DECODE_NUM_WORKERS \ + $ISL $OSL "${CONC_LIST// /x}" inf \ + ${PREFILL_ENABLE_EP} ${PREFILL_ENABLE_DP} \ + ${DECODE_ENABLE_EP} ${DECODE_ENABLE_DP} \ + ${PREFILL_TP} ${DECODE_TP} \ + ${RANDOM_RANGE_RATIO} \ + ${NODE_LIST:-}) + +if [[ $? -ne 0 ]]; then + echo "Failed to submit job" >&2 + exit 1 +fi + +echo "$JOB_ID" From 42c97e6e61e9b079f986028928d8030d8fef367e Mon Sep 17 00:00:00 2001 From: functionstackx <47992694+functionstackx@users.noreply.github.com> Date: Thu, 11 Jun 2026 00:39:21 -0400 Subject: [PATCH 2/6] perf-changelog: add dsv4-fp4-mi355x-sglang-disagg entry (#1708) Co-Authored-By: Claude Fable 5 --- perf-changelog.yaml | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/perf-changelog.yaml b/perf-changelog.yaml index e3fb6e94f..56725c88d 100644 --- a/perf-changelog.yaml +++ b/perf-changelog.yaml @@ -3547,3 +3547,13 @@ - "MI355x DSR1-FP4: Include TP4 configurations for 8k1k" - "Expand the TP sweep (included TP=4) for 8k/1k configuration for conc=4 to 64" pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1692 + +- config-keys: + - dsv4-fp4-mi355x-sglang-disagg + description: + - "New recipe: DeepSeek-V4-Pro FP4 prefill/decode-disaggregated on MI355X via SGLang + MoRI. 
Combines the validated single-node DSv4 SGLang recipe with the sglang-disagg framework from the dsr1/qwen3.5/glm5 mi355x recipes" + - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml (compressed attention, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default)" + - "DSv4 FP4-experts SGLANG_* env block + deep_gemm-absence fallback added to amd_utils/env.sh; idempotent atomic config.json model_type patch (deepseek_v4->deepseek_v3) added to amd_utils/setup_deps.sh, both gated on MODEL_NAME" + - "Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601 (mainline, carries DSv4 sgl#26383 + MoRI disagg backend, auto-applies the MoRI conn.py overlay; rocm/sgl-dev:*-DSv4 branch image lacks both)" + - "1P1D TP8/EP1 dp-attn false. Starts at a single ISL/OSL (8k/1k) conc=1 as an end-to-end smoke test before expanding the conc / DEP sweep" + pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1708 From d4f6c89885ec8aa9cdf2e7a9748ea5737c066e36 Mon Sep 17 00:00:00 2001 From: functionstackx <47992694+functionstackx@users.noreply.github.com> Date: Thu, 11 Jun 2026 01:32:45 -0400 Subject: [PATCH 3/6] dsv4 sglang-disagg: monkey-patch sglang#27855 (MoRI-EP FP4 swiglu crash fix) DeepSeek-V4-Pro + MoRI expert-parallel aborts at warmup with "dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2" on the clamped-SwiGLU/INTERLEAVE path. sgl-project/sglang#27855 fixes it in moe_runner/aiter.py:_pre_permute_deepep_to_aiter (W4A4 + FP4-dispatch branch that dequants the FP4 activation to BF16 via upscale_mxfp4) but is unmerged and absent from the pinned image. setup_deps.sh now source-patches aiter.py at container start, gated on MODEL_NAME == DeepSeek-V4-Pro: idempotent, atomic write, warn+skip if the image's aiter.py predates the anchored structure. 
Verified byte-identical to the PR head against current sglang main. Co-Authored-By: Claude Fable 5 --- benchmarks/multi_node/amd_utils/setup_deps.sh | 109 ++++++++++++++++++ perf-changelog.yaml | 1 + 2 files changed, 110 insertions(+) diff --git a/benchmarks/multi_node/amd_utils/setup_deps.sh b/benchmarks/multi_node/amd_utils/setup_deps.sh index 13714aa77..795f1cb70 100644 --- a/benchmarks/multi_node/amd_utils/setup_deps.sh +++ b/benchmarks/multi_node/amd_utils/setup_deps.sh @@ -783,6 +783,114 @@ PYEOF _SETUP_INSTALLED+=("dsv4-config-model-type") } +# --------------------------------------------------------------------------- +# SGLang: DeepSeek-V4-Pro + MoRI-EP AITER MoE FP4 dispatch crash fix. +# +# Monkey-patches sgl-project/sglang#27855 ("[AMD] fix moriep quant kernel not +# implemented issue"), which is not yet merged upstream and so is absent from +# the pinned mainline image. Without it, DSv4 + MoRI expert-parallel aborts at +# warmup with: +# dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2 +# on the clamped-SwiGLU / INTERLEAVE path. The fix, in +# moe_runner/aiter.py:_pre_permute_deepep_to_aiter, adds a W4A4 + FP4-dispatch +# branch that dequantizes the FP4 activation to BF16 (upscale_mxfp4) and lets +# fused_moe re-quantize internally, mirroring the existing W4A4+FP8 and +# FP8-weight+FP4 dequant branches. +# +# Only the MoRI-EP decode path triggers it, so this is gated on +# MODEL_NAME == DeepSeek-V4-Pro. Idempotent (skips once swiglu_interleave is +# present), atomic write, and warn+skip if the image's aiter.py predates the +# anchored structure (then an image bump carrying #27855 is needed). Drop this +# patch once a pinned image already includes #27855. 
+# --------------------------------------------------------------------------- +patch_aiter_dsv4_fp4_swiglu() { + if [[ "$MODEL_NAME" != "DeepSeek-V4-Pro" ]]; then + return 0 + fi + local target + target=$(python3 -c "import sglang.srt.layers.moe.moe_runner.aiter as m; print(m.__file__)" 2>/dev/null) + if [[ -z "$target" || ! -f "$target" ]]; then + echo "[SETUP] WARN: aiter.py not found; skipping DSv4 FP4 swiglu patch (#27855)" + return 0 + fi + python3 - "$target" <<'PYEOF' +import os, sys, tempfile +target = sys.argv[1] +src = open(target).read() + +if "swiglu_interleave" in src: + print("[SETUP] DSv4 aiter FP4 swiglu patch (#27855) already applied") + sys.exit(0) + +# Edit A: import get_bool_env_var alongside get_int_env_var. +import_anchor = "from sglang.srt.utils import get_int_env_var\n" +if "get_bool_env_var" not in src: + if import_anchor not in src: + print("[SETUP] WARN: #27855 import anchor not found; skipping (image aiter.py differs)") + sys.exit(0) + src = src.replace( + import_anchor, + "from sglang.srt.utils import get_bool_env_var, get_int_env_var\n", + 1, + ) + +# Edit B: compute swiglu_interleave right after is_fp4_dispatch. +b_anchor = ( + " is_fp4_dispatch = hidden_states.dtype == torch.float4_e2m1fn_x2\n" + "\n" + " if is_w4a4 and a1_scale is not None and not is_fp4_dispatch:\n" +) +b_new = ( + " is_fp4_dispatch = hidden_states.dtype == torch.float4_e2m1fn_x2\n" + "\n" + " # AITER fused_moe Clamped-SwiGLU is dispatched with\n" + " # gate_mode=INTERLEAVE, for which AITER picks a bf16/fp8 `q_dtype_a`\n" + " # Refer to https://github.com/ROCm/aiter/blob/a2617c366dc7271a1662ecda2023d19f6ccefcec/aiter/fused_moe.py#L406-L412\n" + " swiglu_interleave = quant_info.swiglu_limit > 0 and get_bool_env_var(\n" + ' "SGLANG_USE_AITER_MOE_GU_ITLV", "true"\n' + " )\n" + "\n" + " if is_w4a4 and a1_scale is not None and not is_fp4_dispatch:\n" +) + +# Edit C: insert the W4A4 + FP4-dispatch + INTERLEAVE dequant branch. 
+c_anchor = ( + " a1_scale = None\n" + " elif is_fp8_quant and is_fp4_dispatch and a1_scale is not None:\n" +) +c_new = ( + " a1_scale = None\n" + " elif is_w4a4 and is_fp4_dispatch and a1_scale is not None and swiglu_interleave:\n" + " # W4A4 weights + FP4 dispatch on the clamped-SwiGLU/INTERLEAVE\n" + " # path: AITER expects a bf16/fp8 activation here, not fp4x2.\n" + " # Dequant FP4->BF16 and let fused_moe re-quantize internally.\n" + " hidden_states = upscale_mxfp4(\n" + " hidden_states, a1_scale, num_local_tokens, output_dtype\n" + " )\n" + " a1_scale = None\n" + " elif is_fp8_quant and is_fp4_dispatch and a1_scale is not None:\n" +) + +if b_anchor not in src or c_anchor not in src: + print("[SETUP] WARN: #27855 body anchors not found; skipping (image aiter.py predates the W4A4 branch)") + sys.exit(0) + +src = src.replace(b_anchor, b_new, 1).replace(c_anchor, c_new, 1) + +d = os.path.dirname(target) +fd, tmp = tempfile.mkstemp(dir=d, prefix=".aiter.py.", suffix=".tmp") +try: + with os.fdopen(fd, "w") as f: + f.write(src) + os.replace(tmp, target) + print(f"[SETUP] Patched {target}: DSv4 W4A4+FP4 swiglu-interleave dequant (#27855)") +except Exception: + os.path.exists(tmp) and os.remove(tmp) + raise +PYEOF + _SETUP_INSTALLED+=("dsv4-aiter-fp4-swiglu-27855") +} + # ============================================================================= # Run installers (engine-gated) # ============================================================================= @@ -808,6 +916,7 @@ else patch_gluon_pa_mqa_logits_instr_shape install_transformers_glm5 patch_dsv4_config + patch_aiter_dsv4_fp4_swiglu fi _SETUP_END=$(date +%s) diff --git a/perf-changelog.yaml b/perf-changelog.yaml index 56725c88d..56481891b 100644 --- a/perf-changelog.yaml +++ b/perf-changelog.yaml @@ -3554,6 +3554,7 @@ - "New recipe: DeepSeek-V4-Pro FP4 prefill/decode-disaggregated on MI355X via SGLang + MoRI. 
Combines the validated single-node DSv4 SGLang recipe with the sglang-disagg framework from the dsr1/qwen3.5/glm5 mi355x recipes" - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml (compressed attention, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default)" - "DSv4 FP4-experts SGLANG_* env block + deep_gemm-absence fallback added to amd_utils/env.sh; idempotent atomic config.json model_type patch (deepseek_v4->deepseek_v3) added to amd_utils/setup_deps.sh, both gated on MODEL_NAME" + - "setup_deps.sh monkey-patches sgl-project/sglang#27855 into the container's moe_runner/aiter.py (gated on MODEL_NAME): adds the W4A4 + FP4-dispatch clamped-SwiGLU/INTERLEAVE dequant branch that fixes the DSv4 + MoRI-EP warmup crash (dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2). Unmerged upstream / absent from the pinned image; idempotent + warn-skip if the image already carries it. Drop once a pinned image includes #27855" - "Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601 (mainline, carries DSv4 sgl#26383 + MoRI disagg backend, auto-applies the MoRI conn.py overlay; rocm/sgl-dev:*-DSv4 branch image lacks both)" - "1P1D TP8/EP1 dp-attn false. 
Starts at a single ISL/OSL (8k/1k) conc=1 as an end-to-end smoke test before expanding the conc / DEP sweep" pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1708 From a1015d0ba6ed9680fe93a88f565bcf5eba35a1c3 Mon Sep 17 00:00:00 2001 From: functionstackx <47992694+functionstackx@users.noreply.github.com> Date: Thu, 11 Jun 2026 01:34:33 -0400 Subject: [PATCH 4/6] dsv4 sglang-disagg: adopt PR #1701 (0610 DSv4) image + env + dsv4 attention backend Realigns the DSv4 sglang-disagg recipe with the validated 0610 single-node recipe (PR #1701, "[AMD][MI35X] 0610 DSV4", successful run): - image -> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610 - env.sh DSv4 block replaced with #1701's: unified_kv_triton FlashMLA, aiter indexer (not tilelang), mainline fp8 wo_a / topk-v2 fallbacks hardcoded (SGLANG_OPT_FP8_WO_A_GEMM=false, SGLANG_OPT_USE_TOPK_V2=false) instead of the deep_gemm-presence detect; SGLANG_DEFAULT_THINKING / SGLANG_DSV4_REASONING_EFFORT; multi-stream overlap off. Branch-only SGLANG_DSV4_FP4_EXPERTS / SGLANG_FORCE_TRITON_MOE_FP8 dropped (DSv4 main no longer needs them). - models.yaml base_flags: --attention-backend compressed -> dsv4; dp_flags add --enable-prefill-delayer --prefill-delayer-max-delay-ms 5000 (the #1701 DP path). Still a v0.5.12.post1 tag, so the MoRI conn.py overlay auto-applies; the #27855 aiter monkey-patch is unchanged. 
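Because the #27855 monkey-patch carries over to the new image unchanged, its outcome on that image depends on whether the anchored source text is still present. The sketch below illustrates the anchor-and-replace pattern in isolation. It is an assumption-labeled illustration, not the code in setup_deps.sh: `apply_anchored_patch` is a hypothetical helper, and the marker and anchor strings in any call are placeholders.

```python
def apply_anchored_patch(src, marker, edits):
    """Hypothetical helper illustrating the anchor-and-replace pattern.

    src    -- full text of the file to patch
    marker -- a string that only exists once the patch has been applied
    edits  -- list of (anchor, replacement) pairs

    Returns (new_src, status) with status in
    {"already", "skipped", "patched"}.
    """
    if marker in src:
        # Idempotent: a second run leaves the text untouched.
        return src, "already"
    # Check every anchor up front so a partial patch is never produced.
    if any(anchor not in src for anchor, _ in edits):
        return src, "skipped"
    for anchor, replacement in edits:
        # Replace only the first occurrence of each anchor.
        src = src.replace(anchor, replacement, 1)
    return src, "patched"
```

A "skipped" status corresponds to the warn+skip case in the commit message: the file differs from the structure the anchors expect, so it is left alone. For the idempotence check to work, the marker has to appear in at least one replacement.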
Co-Authored-By: Claude Fable 5 --- .github/configs/amd-master.yaml | 24 +++---- benchmarks/multi_node/amd_utils/env.sh | 72 ++++++++------------- benchmarks/multi_node/amd_utils/models.yaml | 25 +++---- perf-changelog.yaml | 6 +- 4 files changed, 55 insertions(+), 72 deletions(-) diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml index 422e1a64e..d4b7a4756 100644 --- a/.github/configs/amd-master.yaml +++ b/.github/configs/amd-master.yaml @@ -2144,23 +2144,25 @@ dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp: # SGLANG_* env block in amd_utils/env.sh, and the config.json model_type patch in # amd_utils/setup_deps.sh (all gated on MODEL_NAME). # -# Image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601 — the mainline -# ROCm nightly the DSv4 MTP single-node recipe (dsv4-fp4-mi355x-sglang-mtp) runs -# on. It carries DSv4 support (sgl#26383, mainline 2026-05-27) AND is on the same -# -mi35x- image line as the dsr1/qwen3.5/glm5 disagg recipes, so it has the MoRI -# disaggregation transfer backend — unlike the rocm/sgl-dev:*-DSv4 branch image -# the aggregated dsv4-fp4-mi355x-sglang entry uses, which is cut from -# amd/deepseek_v4 (no #26383, MoRI support unverified). Mainline omits deep_gemm; -# env.sh detects that and routes the DSv4 fp8 wo_a / topk paths to torch -# fallbacks. The v0.5.12.post1 tag also auto-applies the MoRI conn.py overlay -# (job.slurm) that fixes the wire format for hybrid/sparse-attention models. +# Image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610 — the mainline +# ROCm nightly the validated 0610 single-node DSv4 recipe (PR #1701) runs on. It +# carries DSv4 support (now on sglang main) AND is on the same -mi35x- image line +# as the dsr1/qwen3.5/glm5 disagg recipes, so it has the MoRI disaggregation +# transfer backend — unlike the rocm/sgl-dev:*-DSv4 branch image the aggregated +# dsv4-fp4-mi355x-sglang entry uses (cut from amd/deepseek_v4, MoRI support +# unverified). 
Mainline omits deep_gemm; env.sh hardcodes the #1701 fp8 wo_a /
+# topk-v2 fallbacks (matching that validated run) rather than a runtime detect.
+# The v0.5.12.post1 tag also auto-applies the MoRI conn.py overlay (job.slurm)
+# that fixes the wire format for hybrid/sparse-attention models. setup_deps.sh
+# additionally monkey-patches the unmerged sglang#27855 (MoRI-EP FP4 swiglu
+# crash fix) into the image's aiter.py.
 #
 # Topology 1P1D, TP8/EP1, dp-attn false — the same conservative starting point the
 # qwen3.5 / glm5 sglang-disagg recipes launched with. Starts at a single ISL/OSL
 # (8k/1k) conc=1 as an end-to-end smoke test (does DSv4 + MoRI disagg come up and
 # transfer KV at all on this image) before expanding to the full conc / DEP sweep.
 dsv4-fp4-mi355x-sglang-disagg:
-  image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601
+  image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610
   model: deepseek-ai/DeepSeek-V4-Pro
   model-prefix: dsv4
   runner: mi355x-disagg
diff --git a/benchmarks/multi_node/amd_utils/env.sh b/benchmarks/multi_node/amd_utils/env.sh
index e1fd17c28..dbe2fa4db 100755
--- a/benchmarks/multi_node/amd_utils/env.sh
+++ b/benchmarks/multi_node/amd_utils/env.sh
@@ -150,58 +150,38 @@ else
     fi
 
     # DeepSeek-V4-Pro (FP4 experts): the DSv4 SGLANG_* env block, copied verbatim
-    # from the single-node recipe (benchmarks/single_node/dsv4_fp4_mi355x_sglang*.sh),
-    # which tracks python/run_dsv4.sh on the amd/deepseek_v4 branch with the two
-    # FP4 overrides (SGLANG_DSV4_FP4_EXPERTS=True, SGLANG_FORCE_TRITON_MOE_FP8=0).
-    # Routes experts through the FP4 kernels + aiter MoE / MHC, tilelang indexer,
-    # triton SWA-prepare and FlashMLA. The deep_gemm-absence fallback at the end
-    # mirrors the single-node MTP recipe so this also runs on a mainline ROCm
-    # nightly (which omits deep_gemm). All gated on MODEL_NAME so other models are
-    # unaffected.
+    # from the validated 0610 single-node recipe (PR #1701, benchmarks/single_node/
+    # fixed_seq_len/dsv4_fp4_mi355x_sglang.sh). That PR realigned DSv4 to the
+    # mainline ...mi35x-20260610 image (now that DSv4 support is on sglang main):
+    # the dsv4 attention backend, unified_kv_triton FlashMLA, the aiter indexer
+    # (not tilelang), and the mainline fp8 wo_a / topk-v2 fallbacks hardcoded
+    # (SGLANG_OPT_FP8_WO_A_GEMM=false, SGLANG_OPT_USE_TOPK_V2=false) instead of a
+    # deep_gemm-presence detect. Branch-only FP4 MoE flags (SGLANG_DSV4_FP4_EXPERTS,
+    # SGLANG_FORCE_TRITON_MOE_FP8) are dropped — DSv4 main no longer needs them.
+    # Gated on MODEL_NAME so other models are unaffected.
     if [[ "$MODEL_NAME" == "DeepSeek-V4-Pro" ]]; then
-        export SGLANG_REASONING_EFFORT=max
-        export SGLANG_OPT_USE_FUSED_COMPRESS=true
-        export SGLANG_OPT_USE_OLD_COMPRESSOR=false
-        export SGLANG_OPT_USE_TILELANG_SWA_PREPARE=false
-        export SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK=false
-        export SGLANG_OPT_USE_FUSED_HASH_TOPK=true
+        export SGLANG_DEFAULT_THINKING=1
+        export SGLANG_DSV4_REASONING_EFFORT=max
         export SGLANG_OPT_DEEPGEMM_HC_PRENORM=false
+        export SGLANG_USE_AITER=1
+        export SGLANG_USE_ROCM700A=0
+        export SGLANG_OPT_USE_FUSED_COMPRESS=true
+        export SGLANG_HACK_FLASHMLA_BACKEND=unified_kv_triton
+        export SGLANG_OPT_FP8_WO_A_GEMM=false
+        export SGLANG_OPT_USE_JIT_INDEXER_METADATA=false
+        export SGLANG_OPT_USE_TOPK_V2=false
+        export SGLANG_OPT_USE_AITER_INDEXER=true
+        export SGLANG_OPT_USE_TILELANG_INDEXER=false
         export SGLANG_OPT_USE_TILELANG_MHC_PRE=false
         export SGLANG_OPT_USE_TILELANG_MHC_POST=false
-        export SGLANG_OPT_USE_AITER_MHC_PRE=true
-        export SGLANG_OPT_USE_AITER_MHC_POST=true
-        export SGLANG_ENABLE_THINKING=1
-        export SGLANG_USE_AITER=1
-        export SGLANG_USE_ROCM700A=1
-        export SGLANG_TOPK_TRANSFORM_512_TORCH=0
         export SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1
-        export SGLANG_DSV4_FP4_EXPERTS=True
-        export SGLANG_OPT_DPSK_V4_RADIX=1
-        export SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false
-        export SGLANG_OPT_USE_FUSED_STORE_CACHE=true
-        export SGLANG_FORCE_TRITON_MOE_FP8=0
-        export SGLANG_HACK_FLASHMLA_BACKEND=triton
-        export SGLANG_OPT_USE_TILELANG_INDEXER=true
-        export SGLANG_OPT_USE_TRITON_SWA_PREPARE=true
+        export SGLANG_OPT_USE_FUSED_COMPRESS_TRITON=true
         export AITER_BF16_FP8_MOE_BOUND=0
-        export SGLANG_OPT_FUSE_WQA_WKV=true
-        export SGLANG_OPT_USE_FUSED_PAGED_COMPRESS=true
-        export SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=0
-
-        # Mainline ROCm nightlies carry DSv4 support but omit deep_gemm (only
-        # rocm/sgl-dev:*-DSv4 builds bundle it). DSv4-Pro's default fp8 wo_a path
-        # imports deep_gemm at weight load; detect its absence and route the
-        # deep_gemm-touching paths to their torch fallbacks. No-op on a
-        # deep_gemm-bearing image, so this works on both image lines.
-        if python3 -c "import deep_gemm" >/dev/null 2>&1; then
-            echo "[env.sh] deep_gemm present -> DSv4 fp8 wo_a / deep_gemm perf path"
-        else
-            echo "[env.sh] deep_gemm absent -> routing DSv4 fp8 wo_a / topk around it"
-            export SGLANG_OPT_FP8_WO_A_GEMM=0
-            export SGLANG_TOPK_TRANSFORM_512_TORCH=1
-            export SGLANG_OPT_USE_TOPK_V2=0
-            export SGLANG_ENABLE_JIT_DEEPGEMM=0
-        fi
+        export SGLANG_EAGER_INPUT_NO_COPY=true
+
+        # multi-stream
+        export SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=false
+        export SGLANG_ROCM_USE_MULTI_STREAM=false
     fi
 
     # Disable allocating memory in one pass
diff --git a/benchmarks/multi_node/amd_utils/models.yaml b/benchmarks/multi_node/amd_utils/models.yaml
index 2ccbcd4b7..98a1d58d9 100644
--- a/benchmarks/multi_node/amd_utils/models.yaml
+++ b/benchmarks/multi_node/amd_utils/models.yaml
@@ -351,20 +351,21 @@ DeepSeek-R1-0528-MXFP4-v2:
     cuda_graph_bs_range: "1-128"
 
 # DeepSeek-V4-Pro (FP4 experts) PD-disaggregation. Serving flags mirror the
-# validated single-node SGLang recipe (benchmarks/single_node/dsv4_fp4_mi355x_sglang.sh
-# + the MTP variant): compressed attention backend, SWA, page-size 256, the
-# deepseekv4 tool-call / deepseek-v4 reasoning parsers, the DSv4 thinking chat
-# template, and shared-experts-fusion disabled. The DSv4 FP4-experts SGLANG_*
-# env block + deep_gemm-absence fallback live in env.sh, and the config.json
-# model_type patch (deepseek_v4 -> deepseek_v3) in setup_deps.sh, both gated on
-# MODEL_NAME == DeepSeek-V4-Pro. --context-length is pinned (model default is
-# very long; would over-reserve KV); 9472 covers the 8k/1k smoke point.
-# kv-cache-dtype is intentionally left at the model default (the single-node DSv4
-# recipe sets none), unlike the fp8_e4m3 DeepSeek-R1 disagg entries.
+# validated 0610 single-node SGLang recipe (PR #1701, dsv4_fp4_mi355x_sglang.sh):
+# the dsv4 attention backend, SWA, page-size 256, the deepseekv4 tool-call /
+# deepseek-v4 reasoning parsers, the DSv4 thinking chat template, and
+# shared-experts-fusion disabled. The matching DSv4 SGLANG_* env block (#1701)
+# lives in env.sh, and the config.json model_type patch (deepseek_v4 ->
+# deepseek_v3) in setup_deps.sh, both gated on MODEL_NAME == DeepSeek-V4-Pro.
+# --context-length is pinned (model default is very long; would over-reserve KV);
+# 9472 covers the 8k/1k smoke point. kv-cache-dtype is left at the model default
+# (the #1701 single-node recipe sets none), unlike the fp8_e4m3 DeepSeek-R1
+# disagg entries. dp_flags add --enable-prefill-delayer + max-delay-ms 5000 (the
+# #1701 DP path) for the future DP-attention expansion.
 DeepSeek-V4-Pro:
-  base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --load-balance-method round_robin --disaggregation-transfer-backend mori --attention-backend compressed --swa-full-tokens-ratio 0.15 --page-size 256 --disable-shared-experts-fusion --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --context-length 9472 --chat-template /workspace/benchmarks/single_node/chat_templates/deepseek_v4_thinking.jinja"
+  base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --load-balance-method round_robin --disaggregation-transfer-backend mori --attention-backend dsv4 --swa-full-tokens-ratio 0.15 --page-size 256 --disable-shared-experts-fusion --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --context-length 9472 --chat-template /workspace/benchmarks/single_node/chat_templates/deepseek_v4_thinking.jinja"
   mtp_flags: ""
-  dp_flags: "--moe-a2a-backend mori --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head"
+  dp_flags: "--moe-a2a-backend mori --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head --enable-prefill-delayer --prefill-delayer-max-delay-ms 5000"
   prefill:
     mem_fraction_static: 0.8
     disable_radix_cache: true
diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 56481891b..f6a3dffe1 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3552,9 +3552,9 @@
     - dsv4-fp4-mi355x-sglang-disagg
   description:
     - "New recipe: DeepSeek-V4-Pro FP4 prefill/decode-disaggregated on MI355X via SGLang + MoRI. Combines the validated single-node DSv4 SGLang recipe with the sglang-disagg framework from the dsr1/qwen3.5/glm5 mi355x recipes"
-    - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml (compressed attention, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default)"
-    - "DSv4 FP4-experts SGLANG_* env block + deep_gemm-absence fallback added to amd_utils/env.sh; idempotent atomic config.json model_type patch (deepseek_v4->deepseek_v3) added to amd_utils/setup_deps.sh, both gated on MODEL_NAME"
+    - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml. Serving flags + env mirror the validated 0610 single-node recipe (PR #1701): dsv4 attention backend, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default; dp_flags add --enable-prefill-delayer + max-delay-ms 5000"
+    - "DSv4 SGLANG_* env block from PR #1701 added to amd_utils/env.sh (unified_kv_triton FlashMLA, aiter indexer, mainline fp8 wo_a / topk-v2 fallbacks hardcoded, multi-stream off; branch-only SGLANG_DSV4_FP4_EXPERTS/FORCE_TRITON_MOE_FP8 dropped); idempotent atomic config.json model_type patch (deepseek_v4->deepseek_v3) added to amd_utils/setup_deps.sh, both gated on MODEL_NAME"
     - "setup_deps.sh monkey-patches sgl-project/sglang#27855 into the container's moe_runner/aiter.py (gated on MODEL_NAME): adds the W4A4 + FP4-dispatch clamped-SwiGLU/INTERLEAVE dequant branch that fixes the DSv4 + MoRI-EP warmup crash (dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2). Unmerged upstream / absent from the pinned image; idempotent + warn-skip if the image already carries it. Drop once a pinned image includes #27855"
-    - "Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601 (mainline, carries DSv4 sgl#26383 + MoRI disagg backend, auto-applies the MoRI conn.py overlay; rocm/sgl-dev:*-DSv4 branch image lacks both)"
+    - "Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610 (the validated PR #1701 0610 image; mainline, carries DSv4 support + MoRI disagg backend, auto-applies the MoRI conn.py overlay; rocm/sgl-dev:*-DSv4 branch image lacks both)"
     - "1P1D TP8/EP1 dp-attn false. Starts at a single ISL/OSL (8k/1k) conc=1 as an end-to-end smoke test before expanding the conc / DEP sweep"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1708

From 9648a8129e81f7775c1d35eb43df002cfef1f475 Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Thu, 11 Jun 2026 01:36:18 -0400
Subject: [PATCH 5/6] dsv4 sglang-disagg: drop the prefill delayer from dp_flags

Per request, do not use --enable-prefill-delayer /
--prefill-delayer-max-delay-ms in the DSv4 sglang-disagg recipe.

Co-Authored-By: Claude Fable 5
---
 benchmarks/multi_node/amd_utils/models.yaml | 6 +++---
 perf-changelog.yaml                         | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/benchmarks/multi_node/amd_utils/models.yaml b/benchmarks/multi_node/amd_utils/models.yaml
index 98a1d58d9..bd549dd54 100644
--- a/benchmarks/multi_node/amd_utils/models.yaml
+++ b/benchmarks/multi_node/amd_utils/models.yaml
@@ -360,12 +360,12 @@ DeepSeek-R1-0528-MXFP4-v2:
 # --context-length is pinned (model default is very long; would over-reserve KV);
 # 9472 covers the 8k/1k smoke point. kv-cache-dtype is left at the model default
 # (the #1701 single-node recipe sets none), unlike the fp8_e4m3 DeepSeek-R1
-# disagg entries. dp_flags add --enable-prefill-delayer + max-delay-ms 5000 (the
-# #1701 DP path) for the future DP-attention expansion.
+# disagg entries. The prefill delayer (--enable-prefill-delayer) is intentionally
+# not used here.
 DeepSeek-V4-Pro:
   base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --load-balance-method round_robin --disaggregation-transfer-backend mori --attention-backend dsv4 --swa-full-tokens-ratio 0.15 --page-size 256 --disable-shared-experts-fusion --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --context-length 9472 --chat-template /workspace/benchmarks/single_node/chat_templates/deepseek_v4_thinking.jinja"
   mtp_flags: ""
-  dp_flags: "--moe-a2a-backend mori --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head --enable-prefill-delayer --prefill-delayer-max-delay-ms 5000"
+  dp_flags: "--moe-a2a-backend mori --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head"
   prefill:
     mem_fraction_static: 0.8
     disable_radix_cache: true
diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index f6a3dffe1..2ef5d33e6 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3552,7 +3552,7 @@
     - dsv4-fp4-mi355x-sglang-disagg
   description:
     - "New recipe: DeepSeek-V4-Pro FP4 prefill/decode-disaggregated on MI355X via SGLang + MoRI. Combines the validated single-node DSv4 SGLang recipe with the sglang-disagg framework from the dsr1/qwen3.5/glm5 mi355x recipes"
-    - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml. Serving flags + env mirror the validated 0610 single-node recipe (PR #1701): dsv4 attention backend, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default; dp_flags add --enable-prefill-delayer + max-delay-ms 5000"
+    - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml. Serving flags + env mirror the validated 0610 single-node recipe (PR #1701): dsv4 attention backend, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default; prefill delayer not used"
     - "DSv4 SGLANG_* env block from PR #1701 added to amd_utils/env.sh (unified_kv_triton FlashMLA, aiter indexer, mainline fp8 wo_a / topk-v2 fallbacks hardcoded, multi-stream off; branch-only SGLANG_DSV4_FP4_EXPERTS/FORCE_TRITON_MOE_FP8 dropped); idempotent atomic config.json model_type patch (deepseek_v4->deepseek_v3) added to amd_utils/setup_deps.sh, both gated on MODEL_NAME"
     - "setup_deps.sh monkey-patches sgl-project/sglang#27855 into the container's moe_runner/aiter.py (gated on MODEL_NAME): adds the W4A4 + FP4-dispatch clamped-SwiGLU/INTERLEAVE dequant branch that fixes the DSv4 + MoRI-EP warmup crash (dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2). Unmerged upstream / absent from the pinned image; idempotent + warn-skip if the image already carries it. Drop once a pinned image includes #27855"
     - "Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610 (the validated PR #1701 0610 image; mainline, carries DSv4 support + MoRI disagg backend, auto-applies the MoRI conn.py overlay; rocm/sgl-dev:*-DSv4 branch image lacks both)"

From 316dd215c7706eefeb093a3ce4f0310593dc00af Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Thu, 11 Jun 2026 01:40:24 -0400
Subject: [PATCH 6/6] dsv4 sglang-disagg: drop the sglang#27855 monkey-patch (TP-only smoke, no EP)

The #27855 fix only matters on the DSv4 + MoRI expert-parallel path. This
recipe is TP8/EP1 for the smoke test, so that crash isn't reachable. Remove
the patch_aiter_dsv4_fp4_swiglu source-patch from setup_deps.sh; a comment in
amd-master.yaml records that it's needed only when EP/DEP decode is enabled.

Co-Authored-By: Claude Fable 5
---
 .github/configs/amd-master.yaml               |   8 +-
 benchmarks/multi_node/amd_utils/setup_deps.sh | 109 ------------------
 perf-changelog.yaml                           |   1 -
 3 files changed, 5 insertions(+), 113 deletions(-)

diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
index d4b7a4756..f3e0c6b81 100644
--- a/.github/configs/amd-master.yaml
+++ b/.github/configs/amd-master.yaml
@@ -2153,9 +2153,11 @@ dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp:
 # unverified). Mainline omits deep_gemm; env.sh hardcodes the #1701 fp8 wo_a /
 # topk-v2 fallbacks (matching that validated run) rather than a runtime detect.
 # The v0.5.12.post1 tag also auto-applies the MoRI conn.py overlay (job.slurm)
-# that fixes the wire format for hybrid/sparse-attention models. setup_deps.sh
-# additionally monkey-patches the unmerged sglang#27855 (MoRI-EP FP4 swiglu
-# crash fix) into the image's aiter.py.
+# that fixes the wire format for hybrid/sparse-attention models.
+#
+# This smoke test is TP-only (EP1), so the DSv4 + MoRI-EP FP4 swiglu crash that
+# sglang#27855 fixes is not on the path; that monkey-patch is intentionally not
+# carried here and would be needed only when EP/DEP decode is enabled.
 #
 # Topology 1P1D, TP8/EP1, dp-attn false — the same conservative starting point the
 # qwen3.5 / glm5 sglang-disagg recipes launched with. Starts at a single ISL/OSL
diff --git a/benchmarks/multi_node/amd_utils/setup_deps.sh b/benchmarks/multi_node/amd_utils/setup_deps.sh
index 795f1cb70..13714aa77 100644
--- a/benchmarks/multi_node/amd_utils/setup_deps.sh
+++ b/benchmarks/multi_node/amd_utils/setup_deps.sh
@@ -783,114 +783,6 @@ PYEOF
     _SETUP_INSTALLED+=("dsv4-config-model-type")
 }
 
-# ---------------------------------------------------------------------------
-# SGLang: DeepSeek-V4-Pro + MoRI-EP AITER MoE FP4 dispatch crash fix.
-#
-# Monkey-patches sgl-project/sglang#27855 ("[AMD] fix moriep quant kernel not
-# implemented issue"), which is not yet merged upstream and so is absent from
-# the pinned mainline image. Without it, DSv4 + MoRI expert-parallel aborts at
-# warmup with:
-# dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2
-# on the clamped-SwiGLU / INTERLEAVE path. The fix, in
-# moe_runner/aiter.py:_pre_permute_deepep_to_aiter, adds a W4A4 + FP4-dispatch
-# branch that dequantizes the FP4 activation to BF16 (upscale_mxfp4) and lets
-# fused_moe re-quantize internally, mirroring the existing W4A4+FP8 and
-# FP8-weight+FP4 dequant branches.
-#
-# Only the MoRI-EP decode path triggers it, so this is gated on
-# MODEL_NAME == DeepSeek-V4-Pro. Idempotent (skips once swiglu_interleave is
-# present), atomic write, and warn+skip if the image's aiter.py predates the
-# anchored structure (then an image bump carrying #27855 is needed). Drop this
-# patch once a pinned image already includes #27855.
-# ---------------------------------------------------------------------------
-patch_aiter_dsv4_fp4_swiglu() {
-    if [[ "$MODEL_NAME" != "DeepSeek-V4-Pro" ]]; then
-        return 0
-    fi
-    local target
-    target=$(python3 -c "import sglang.srt.layers.moe.moe_runner.aiter as m; print(m.__file__)" 2>/dev/null)
-    if [[ -z "$target" || ! -f "$target" ]]; then
-        echo "[SETUP] WARN: aiter.py not found; skipping DSv4 FP4 swiglu patch (#27855)"
-        return 0
-    fi
-    python3 - "$target" <<'PYEOF'
-import os, sys, tempfile
-target = sys.argv[1]
-src = open(target).read()
-
-if "swiglu_interleave" in src:
-    print("[SETUP] DSv4 aiter FP4 swiglu patch (#27855) already applied")
-    sys.exit(0)
-
-# Edit A: import get_bool_env_var alongside get_int_env_var.
-import_anchor = "from sglang.srt.utils import get_int_env_var\n"
-if "get_bool_env_var" not in src:
-    if import_anchor not in src:
-        print("[SETUP] WARN: #27855 import anchor not found; skipping (image aiter.py differs)")
-        sys.exit(0)
-    src = src.replace(
-        import_anchor,
-        "from sglang.srt.utils import get_bool_env_var, get_int_env_var\n",
-        1,
-    )
-
-# Edit B: compute swiglu_interleave right after is_fp4_dispatch.
-b_anchor = (
-    " is_fp4_dispatch = hidden_states.dtype == torch.float4_e2m1fn_x2\n"
-    "\n"
-    " if is_w4a4 and a1_scale is not None and not is_fp4_dispatch:\n"
-)
-b_new = (
-    " is_fp4_dispatch = hidden_states.dtype == torch.float4_e2m1fn_x2\n"
-    "\n"
-    " # AITER fused_moe Clamped-SwiGLU is dispatched with\n"
-    " # gate_mode=INTERLEAVE, for which AITER picks a bf16/fp8 `q_dtype_a`\n"
-    " # Refer to https://github.com/ROCm/aiter/blob/a2617c366dc7271a1662ecda2023d19f6ccefcec/aiter/fused_moe.py#L406-L412\n"
-    " swiglu_interleave = quant_info.swiglu_limit > 0 and get_bool_env_var(\n"
-    ' "SGLANG_USE_AITER_MOE_GU_ITLV", "true"\n'
-    " )\n"
-    "\n"
-    " if is_w4a4 and a1_scale is not None and not is_fp4_dispatch:\n"
-)
-
-# Edit C: insert the W4A4 + FP4-dispatch + INTERLEAVE dequant branch.
-c_anchor = (
-    " a1_scale = None\n"
-    " elif is_fp8_quant and is_fp4_dispatch and a1_scale is not None:\n"
-)
-c_new = (
-    " a1_scale = None\n"
-    " elif is_w4a4 and is_fp4_dispatch and a1_scale is not None and swiglu_interleave:\n"
-    " # W4A4 weights + FP4 dispatch on the clamped-SwiGLU/INTERLEAVE\n"
-    " # path: AITER expects a bf16/fp8 activation here, not fp4x2.\n"
-    " # Dequant FP4->BF16 and let fused_moe re-quantize internally.\n"
-    " hidden_states = upscale_mxfp4(\n"
-    " hidden_states, a1_scale, num_local_tokens, output_dtype\n"
-    " )\n"
-    " a1_scale = None\n"
-    " elif is_fp8_quant and is_fp4_dispatch and a1_scale is not None:\n"
-)
-
-if b_anchor not in src or c_anchor not in src:
-    print("[SETUP] WARN: #27855 body anchors not found; skipping (image aiter.py predates the W4A4 branch)")
-    sys.exit(0)
-
-src = src.replace(b_anchor, b_new, 1).replace(c_anchor, c_new, 1)
-
-d = os.path.dirname(target)
-fd, tmp = tempfile.mkstemp(dir=d, prefix=".aiter.py.", suffix=".tmp")
-try:
-    with os.fdopen(fd, "w") as f:
-        f.write(src)
-    os.replace(tmp, target)
-    print(f"[SETUP] Patched {target}: DSv4 W4A4+FP4 swiglu-interleave dequant (#27855)")
-except Exception:
-    os.path.exists(tmp) and os.remove(tmp)
-    raise
-PYEOF
-    _SETUP_INSTALLED+=("dsv4-aiter-fp4-swiglu-27855")
-}
-
 # =============================================================================
 # Run installers (engine-gated)
 # =============================================================================
@@ -916,7 +808,6 @@ else
     patch_gluon_pa_mqa_logits_instr_shape
     install_transformers_glm5
     patch_dsv4_config
-    patch_aiter_dsv4_fp4_swiglu
 fi
 
 _SETUP_END=$(date +%s)
diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 2ef5d33e6..4cc6b5e7c 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3554,7 +3554,6 @@
     - "New recipe: DeepSeek-V4-Pro FP4 prefill/decode-disaggregated on MI355X via SGLang + MoRI. Combines the validated single-node DSv4 SGLang recipe with the sglang-disagg framework from the dsr1/qwen3.5/glm5 mi355x recipes"
     - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml. Serving flags + env mirror the validated 0610 single-node recipe (PR #1701): dsv4 attention backend, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default; prefill delayer not used"
     - "DSv4 SGLANG_* env block from PR #1701 added to amd_utils/env.sh (unified_kv_triton FlashMLA, aiter indexer, mainline fp8 wo_a / topk-v2 fallbacks hardcoded, multi-stream off; branch-only SGLANG_DSV4_FP4_EXPERTS/FORCE_TRITON_MOE_FP8 dropped); idempotent atomic config.json model_type patch (deepseek_v4->deepseek_v3) added to amd_utils/setup_deps.sh, both gated on MODEL_NAME"
-    - "setup_deps.sh monkey-patches sgl-project/sglang#27855 into the container's moe_runner/aiter.py (gated on MODEL_NAME): adds the W4A4 + FP4-dispatch clamped-SwiGLU/INTERLEAVE dequant branch that fixes the DSv4 + MoRI-EP warmup crash (dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2). Unmerged upstream / absent from the pinned image; idempotent + warn-skip if the image already carries it. Drop once a pinned image includes #27855"
     - "Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610 (the validated PR #1701 0610 image; mainline, carries DSv4 support + MoRI disagg backend, auto-applies the MoRI conn.py overlay; rocm/sgl-dev:*-DSv4 branch image lacks both)"
     - "1P1D TP8/EP1 dp-attn false. Starts at a single ISL/OSL (8k/1k) conc=1 as an end-to-end smoke test before expanding the conc / DEP sweep"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1708