From a0ad648ea153b94b5a0cc955791fbc0f63e0cca5 Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Thu, 11 Jun 2026 00:38:51 -0400
Subject: [PATCH 1/6] [Klaud Cold] dsv4-fp4-mi355x-sglang-disagg:
 DeepSeek-V4-Pro SGLang disagg (8k1k conc=1 smoke test)

Adds a DeepSeek-V4-Pro FP4 prefill/decode-disaggregated recipe on MI355X via
SGLang + MoRI, combining the validated single-node DSv4 SGLang recipe with the
sglang-disagg framework used by the dsr1 / qwen3.5 / glm5 mi355x recipes
(#1570, #1572, #1579).

- benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh: model-agnostic launcher
  (same shape as the qwen3.5/glm5 wrappers, with NODE_LIST support).
- amd_utils/models.yaml: DeepSeek-V4-Pro entry. Serving flags mirror the
  single-node recipe (compressed attention, SWA, page-size 256, deepseekv4/
  deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off);
  context-length pinned; kv-cache-dtype left at model default.
- amd_utils/env.sh: DSv4 FP4-experts SGLANG_* env block + deep_gemm-absence
  fallback, gated on MODEL_NAME.
- amd_utils/setup_deps.sh: idempotent, atomic config.json model_type patch
  (deepseek_v4 -> deepseek_v3, architectures preserved), gated on MODEL_NAME.
- amd-master.yaml: dsv4-fp4-mi355x-sglang-disagg, 1P1D TP8/EP1 dp-attn false,
  image v0.5.12.post1-rocm720-mi35x-20260601 (mainline w/ DSv4 #26383 + MoRI
  disagg; auto-applies the MoRI conn.py overlay).

Starts at a single ISL/OSL (8k/1k) conc=1 to smoke-test that DSv4 + MoRI disagg
comes up and transfers KV on this image before expanding the sweep.
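
For reference, the pinned --context-length can be sanity-checked against the
smoke point with a few lines of Python. This is only a sketch:
context_headroom is a hypothetical helper, not part of the recipe, and it
assumes ISL and OSL account for every token of a request. The numbers are
the ones used in this patch (isl 8192, osl 1024, --context-length 9472).

```python
# Hypothetical helper for illustration; it does not exist in the repo.
def context_headroom(isl: int, osl: int, context_length: int) -> int:
    """Return the tokens left after one request of isl prompt + osl output tokens."""
    needed = isl + osl
    if needed > context_length:
        raise ValueError(
            f"request needs {needed} tokens but context length is {context_length}"
        )
    return context_length - needed


# Values from this recipe: 8k/1k smoke point, --context-length 9472.
headroom = context_headroom(isl=8192, osl=1024, context_length=9472)
print(headroom)  # 256
```

The 256 tokens of slack equal one KV page at --page-size 256. Whether that
is the intended margin is an assumption here, not something the recipe states.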

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 .github/configs/amd-master.yaml               | 56 ++++++++++++
 benchmarks/multi_node/amd_utils/env.sh        | 55 +++++++++++
 benchmarks/multi_node/amd_utils/models.yaml   | 42 +++++++++
 benchmarks/multi_node/amd_utils/setup_deps.sh | 49 ++++++++++
 .../dsv4_fp4_mi355x_sglang-disagg.sh          | 91 +++++++++++++++++++
 5 files changed, 293 insertions(+)
 create mode 100755 benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh

diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
index 0e548e611..422e1a64e 100644
--- a/.github/configs/amd-master.yaml
+++ b/.github/configs/amd-master.yaml
@@ -2136,6 +2136,62 @@ dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp:
           - "DECODE_MTP_SIZE=1"
 
 
+# DeepSeek-V4-Pro FP4 PD-disaggregation on MI355X via SGLang + MoRI. Combines the
+# validated single-node DSv4 SGLang recipe (dsv4-fp4-mi355x-sglang below) with the
+# SGLang-disagg framework used by the dsr1 / qwen3.5 / glm5 mi355x recipes. Routes
+# to benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh; per-node serving flags
+# live in the DeepSeek-V4-Pro entry of amd_utils/models.yaml, the DSv4 FP4-experts
+# SGLANG_* env block in amd_utils/env.sh, and the config.json model_type patch in
+# amd_utils/setup_deps.sh (all gated on MODEL_NAME).
+#
+# Image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601 — the mainline
+# ROCm nightly the DSv4 MTP single-node recipe (dsv4-fp4-mi355x-sglang-mtp) runs
+# on. It carries DSv4 support (sgl#26383, mainline 2026-05-27) AND is on the same
+# -mi35x- image line as the dsr1/qwen3.5/glm5 disagg recipes, so it has the MoRI
+# disaggregation transfer backend — unlike the rocm/sgl-dev:*-DSv4 branch image
+# the aggregated dsv4-fp4-mi355x-sglang entry uses, which is cut from
+# amd/deepseek_v4 (no #26383, MoRI support unverified). Mainline omits deep_gemm;
+# env.sh detects that and routes the DSv4 fp8 wo_a / topk paths to torch
+# fallbacks. The v0.5.12.post1 tag also auto-applies the MoRI conn.py overlay
+# (job.slurm) that fixes the wire format for hybrid/sparse-attention models.
+#
+# Topology 1P1D, TP8/EP1, dp-attn false — the same conservative starting point the
+# qwen3.5 / glm5 sglang-disagg recipes launched with. Starts at a single ISL/OSL
+# (8k/1k) conc=1 as an end-to-end smoke test (does DSv4 + MoRI disagg come up and
+# transfer KV at all on this image) before expanding to the full conc / DEP sweep.
+dsv4-fp4-mi355x-sglang-disagg:
+  image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601
+  model: deepseek-ai/DeepSeek-V4-Pro
+  model-prefix: dsv4
+  runner: mi355x-disagg
+  precision: fp4
+  framework: sglang-disagg
+  multinode: true
+  disagg: true
+  scenarios:
+    fixed-seq-len:
+    - isl: 8192
+      osl: 1024
+      search-space:
+      # 1P1D TP8/EP1: 1 prefill node (co-located with router) + 1 decode node
+      - spec-decoding: "none"
+        conc-list: [ 1 ]
+        prefill:
+          num-worker: 1
+          tp: 8
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "PREFILL_NODES=1"
+        decode:
+          num-worker: 1
+          tp: 8
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "DECODE_NODES=1"
+          - "DECODE_MTP_SIZE=0"
+
 # DSv4-Pro FP4 on MI355X via SGLang. Uses a rocm720 mi35x image built off the
 # amd/deepseek_v4 branch in sgl-project/sglang; the SHA is encoded in the
 # image tag, so bumping sglang is just an image tag bump here. Sweeps
diff --git a/benchmarks/multi_node/amd_utils/env.sh b/benchmarks/multi_node/amd_utils/env.sh
index 71d2653bd..e1fd17c28 100755
--- a/benchmarks/multi_node/amd_utils/env.sh
+++ b/benchmarks/multi_node/amd_utils/env.sh
@@ -149,6 +149,61 @@ else
         export SAFETENSORS_FAST_GPU=1
     fi
 
+    # DeepSeek-V4-Pro (FP4 experts): the DSv4 SGLANG_* env block, copied verbatim
+    # from the single-node recipe (benchmarks/single_node/dsv4_fp4_mi355x_sglang*.sh),
+    # which tracks python/run_dsv4.sh on the amd/deepseek_v4 branch with the two
+    # FP4 overrides (SGLANG_DSV4_FP4_EXPERTS=True, SGLANG_FORCE_TRITON_MOE_FP8=0).
+    # Routes experts through the FP4 kernels + aiter MoE / MHC, tilelang indexer,
+    # triton SWA-prepare and FlashMLA. The deep_gemm-absence fallback at the end
+    # mirrors the single-node MTP recipe so this also runs on a mainline ROCm
+    # nightly (which omits deep_gemm). All gated on MODEL_NAME so other models are
+    # unaffected.
+    if [[ "$MODEL_NAME" == "DeepSeek-V4-Pro" ]]; then
+        export SGLANG_REASONING_EFFORT=max
+        export SGLANG_OPT_USE_FUSED_COMPRESS=true
+        export SGLANG_OPT_USE_OLD_COMPRESSOR=false
+        export SGLANG_OPT_USE_TILELANG_SWA_PREPARE=false
+        export SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK=false
+        export SGLANG_OPT_USE_FUSED_HASH_TOPK=true
+        export SGLANG_OPT_DEEPGEMM_HC_PRENORM=false
+        export SGLANG_OPT_USE_TILELANG_MHC_PRE=false
+        export SGLANG_OPT_USE_TILELANG_MHC_POST=false
+        export SGLANG_OPT_USE_AITER_MHC_PRE=true
+        export SGLANG_OPT_USE_AITER_MHC_POST=true
+        export SGLANG_ENABLE_THINKING=1
+        export SGLANG_USE_AITER=1
+        export SGLANG_USE_ROCM700A=1
+        export SGLANG_TOPK_TRANSFORM_512_TORCH=0
+        export SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1
+        export SGLANG_DSV4_FP4_EXPERTS=True
+        export SGLANG_OPT_DPSK_V4_RADIX=1
+        export SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false
+        export SGLANG_OPT_USE_FUSED_STORE_CACHE=true
+        export SGLANG_FORCE_TRITON_MOE_FP8=0
+        export SGLANG_HACK_FLASHMLA_BACKEND=triton
+        export SGLANG_OPT_USE_TILELANG_INDEXER=true
+        export SGLANG_OPT_USE_TRITON_SWA_PREPARE=true
+        export AITER_BF16_FP8_MOE_BOUND=0
+        export SGLANG_OPT_FUSE_WQA_WKV=true
+        export SGLANG_OPT_USE_FUSED_PAGED_COMPRESS=true
+        export SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=0
+
+        # Mainline ROCm nightlies carry DSv4 support but omit deep_gemm (only
+        # rocm/sgl-dev:*-DSv4 builds bundle it). DSv4-Pro's default fp8 wo_a path
+        # imports deep_gemm at weight load; detect its absence and route the
+        # deep_gemm-touching paths to their torch fallbacks. No-op on a
+        # deep_gemm-bearing image, so this works on both image lines.
+        if python3 -c "import deep_gemm" >/dev/null 2>&1; then
+            echo "[env.sh] deep_gemm present -> DSv4 fp8 wo_a / deep_gemm perf path"
+        else
+            echo "[env.sh] deep_gemm absent -> routing DSv4 fp8 wo_a / topk around it"
+            export SGLANG_OPT_FP8_WO_A_GEMM=0
+            export SGLANG_TOPK_TRANSFORM_512_TORCH=1
+            export SGLANG_OPT_USE_TOPK_V2=0
+            export SGLANG_ENABLE_JIT_DEEPGEMM=0
+        fi
+    fi
+
     # Disable allocating memory in one pass
     export MORI_SHMEM_MODE=ISOLATION
 
diff --git a/benchmarks/multi_node/amd_utils/models.yaml b/benchmarks/multi_node/amd_utils/models.yaml
index 605a377be..2ccbcd4b7 100644
--- a/benchmarks/multi_node/amd_utils/models.yaml
+++ b/benchmarks/multi_node/amd_utils/models.yaml
@@ -349,3 +349,45 @@ DeepSeek-R1-0528-MXFP4-v2:
       max_running_requests: 128
       chunked_prefill_size: 262144
       cuda_graph_bs_range: "1-128"
+
+# DeepSeek-V4-Pro (FP4 experts) PD-disaggregation. Serving flags mirror the
+# validated single-node SGLang recipe (benchmarks/single_node/dsv4_fp4_mi355x_sglang.sh
+# + the MTP variant): compressed attention backend, SWA, page-size 256, the
+# deepseekv4 tool-call / deepseek-v4 reasoning parsers, the DSv4 thinking chat
+# template, and shared-experts-fusion disabled. The DSv4 FP4-experts SGLANG_*
+# env block + deep_gemm-absence fallback live in env.sh, and the config.json
+# model_type patch (deepseek_v4 -> deepseek_v3) in setup_deps.sh, both gated on
+# MODEL_NAME == DeepSeek-V4-Pro. --context-length is pinned (model default is
+# very long; would over-reserve KV); 9472 covers the 8k/1k smoke point.
+# kv-cache-dtype is intentionally left at the model default (the single-node DSv4
+# recipe sets none), unlike the fp8_e4m3 DeepSeek-R1 disagg entries.
+DeepSeek-V4-Pro:
+  base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --load-balance-method round_robin --disaggregation-transfer-backend mori --attention-backend compressed --swa-full-tokens-ratio 0.15 --page-size 256 --disable-shared-experts-fusion --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --context-length 9472 --chat-template /workspace/benchmarks/single_node/chat_templates/deepseek_v4_thinking.jinja"
+  mtp_flags: ""
+  dp_flags: "--moe-a2a-backend mori --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head"
+  prefill:
+    mem_fraction_static: 0.8
+    disable_radix_cache: true
+    dp:
+      max_running_requests: 24
+      chunked_prefill_size: "MORI_MAX_DISPATCH_TOKENS_PREFILL * PREFILL_TP_SIZE"
+      cuda_graph_bs: "1 2 3"
+    no_dp:
+      max_running_requests: 32
+      chunked_prefill_size: 8192
+      cuda_graph_bs_range: "1-32"
+  decode:
+    mem_fraction_static: 0.85
+    prefill_round_robin_balance: true
+    dp:
+      max_running_requests: 4096
+      chunked_prefill_size: "MORI_MAX_DISPATCH_TOKENS_DECODE * DECODE_TP_SIZE"
+      cuda_graph_bs_range: "1-160"
+    ep_only:
+      max_running_requests: 256
+      chunked_prefill_size: 262144
+      cuda_graph_bs_range: "1-256"
+    no_dp:
+      max_running_requests: 64
+      chunked_prefill_size: 8192
+      cuda_graph_bs_range: "1-64"
diff --git a/benchmarks/multi_node/amd_utils/setup_deps.sh b/benchmarks/multi_node/amd_utils/setup_deps.sh
index add2e3fa5..13714aa77 100644
--- a/benchmarks/multi_node/amd_utils/setup_deps.sh
+++ b/benchmarks/multi_node/amd_utils/setup_deps.sh
@@ -735,6 +735,54 @@ install_transformers_glm5() {
     _SETUP_INSTALLED+=("transformers-glm5")
 }
 
+# ---------------------------------------------------------------------------
+# SGLang: DeepSeek-V4-Pro config.json model_type patch.
+#
+# Transformers in these images doesn't recognize the `deepseek_v4` model_type,
+# so AutoConfig.from_pretrained crashes before SGLang can dispatch. The
+# single-node DSv4 recipes patch the HF-cache config.json directly; for disagg
+# the weights live on shared NFS at $MODEL_DIR/$MODEL_NAME, so patch that
+# config.json instead. Set model_type -> deepseek_v3 (so AutoConfig succeeds)
+# while keeping architectures=['DeepseekV4ForCausalLM'] so SGLang still
+# dispatches to its native DSv4 model class.
+#
+# Idempotent (no-op unless model_type is still deepseek_v4) and crash-safe under
+# the concurrent multi-node start: writes a temp file in the same dir, then calls
+# os.replace() (atomic same-filesystem rename) so readers never see a partial config.
+# Only runs for MODEL_NAME == DeepSeek-V4-Pro.
+# ---------------------------------------------------------------------------
+patch_dsv4_config() {
+    if [[ "$MODEL_NAME" != "DeepSeek-V4-Pro" ]]; then
+        return 0
+    fi
+    local cfg="${MODEL_DIR}/${MODEL_NAME}/config.json"
+    if [[ ! -f "$cfg" ]]; then
+        echo "[SETUP] WARN: DSv4 config.json not found at $cfg; skipping model_type patch"
+        return 0
+    fi
+    python3 - "$cfg" <<'PYEOF'
+import json, os, sys, tempfile
+cfg = sys.argv[1]
+with open(cfg) as f:
+    config = json.load(f)
+if config.get("model_type") != "deepseek_v4":
+    print(f"[SETUP] DSv4 config.json already patched (model_type={config.get('model_type')!r})")
+    sys.exit(0)
+config["model_type"] = "deepseek_v3"
+d = os.path.dirname(cfg)
+fd, tmp = tempfile.mkstemp(dir=d, prefix=".config.json.", suffix=".tmp")
+try:
+    with os.fdopen(fd, "w") as f:
+        json.dump(config, f, indent=2)
+    os.replace(tmp, cfg)
+    print(f"[SETUP] Patched {cfg}: model_type deepseek_v4 -> deepseek_v3")
+except Exception:
+    os.path.exists(tmp) and os.remove(tmp)
+    raise
+PYEOF
+    _SETUP_INSTALLED+=("dsv4-config-model-type")
+}
+
 # =============================================================================
 # Run installers (engine-gated)
 # =============================================================================
@@ -759,6 +807,7 @@ if [[ "$ENGINE" == "vllm-disagg" ]]; then
 else
     patch_gluon_pa_mqa_logits_instr_shape
     install_transformers_glm5
+    patch_dsv4_config
 fi
 
 _SETUP_END=$(date +%s)
diff --git a/benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh b/benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh
new file mode 100755
index 000000000..e55559519
--- /dev/null
+++ b/benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh
@@ -0,0 +1,91 @@
+#!/usr/bin/env bash
+
+# DeepSeek-V4-Pro FP4 disaggregated prefill/decode on MI355X via SGLang + MoRI.
+# Thin, model-agnostic launcher (same shape as the dsr1 / qwen3.5 / glm5
+# sglang-disagg wrappers): all serving flags live in the DeepSeek-V4-Pro entry
+# of amd_utils/models.yaml, DSv4-specific env + the config.json model_type patch
+# live in amd_utils/env.sh + setup_deps.sh, and topology (P/D node counts, TP/EP)
+# comes from amd-master.yaml.
+
+source "$(dirname "$0")/../benchmark_lib.sh"
+
+check_env_vars \
+    CONC_LIST \
+    ISL \
+    OSL \
+    IMAGE \
+    SPEC_DECODING \
+    MODEL_PATH \
+    PREFILL_NUM_WORKERS \
+    PREFILL_TP \
+    PREFILL_EP \
+    PREFILL_DP_ATTN \
+    DECODE_NUM_WORKERS \
+    DECODE_TP \
+    DECODE_EP \
+    DECODE_DP_ATTN \
+    PREFILL_NODES \
+    DECODE_NODES \
+    RANDOM_RANGE_RATIO \
+    FRAMEWORK
+
+if [[ -n "$SLURM_JOB_ID" ]]; then
+  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
+fi
+
+set -x
+
+# Use upstreamed multi_node scripts (no external clone needed)
+cd "$GITHUB_WORKSPACE/benchmarks/multi_node/amd_utils" || exit 1
+
+# Set up SGL launch script-specific environment variables
+export TIME_LIMIT="08:00:00"
+export MODEL_PATH=$MODEL_PATH
+export MODEL_NAME=$MODEL_NAME
+export CONTAINER_IMAGE=$IMAGE
+
+if [[ "${PREFILL_EP:-1}" -eq 1 ]]; then
+export PREFILL_ENABLE_EP=false
+else
+export PREFILL_ENABLE_EP=true
+fi
+
+if [[ "$PREFILL_DP_ATTN" == "true" ]]; then
+export PREFILL_ENABLE_DP=true
+else
+export PREFILL_ENABLE_DP=false
+fi
+
+if [[ "${DECODE_EP:-1}" -eq 1 ]]; then
+export DECODE_ENABLE_EP=false
+else
+export DECODE_ENABLE_EP=true
+fi
+
+if [[ "$DECODE_DP_ATTN" == "true" ]]; then
+export DECODE_ENABLE_DP=true
+else
+export DECODE_ENABLE_DP=false
+fi
+
+# Launch jobs based on ISL/OSL
+# Replace ' ' in CONC_LIST with 'x' such that the concurrency list is represented
+# by a list of numbers delimited by 'x'. This is because of how the underlying launch script
+# expects the concurrencies.
+JOB_ID=$(bash ./submit.sh $PREFILL_NODES \
+    $PREFILL_NUM_WORKERS \
+    $DECODE_NODES \
+    $DECODE_NUM_WORKERS \
+    $ISL $OSL "${CONC_LIST// /x}" inf \
+    ${PREFILL_ENABLE_EP} ${PREFILL_ENABLE_DP} \
+    ${DECODE_ENABLE_EP} ${DECODE_ENABLE_DP} \
+    ${PREFILL_TP} ${DECODE_TP} \
+    ${RANDOM_RANGE_RATIO} \
+    ${NODE_LIST:-})
+
+if [[ $? -ne 0 ]]; then
+    echo "Failed to submit job" >&2
+    exit 1
+fi
+
+echo "$JOB_ID"

From 42c97e6e61e9b079f986028928d8030d8fef367e Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Thu, 11 Jun 2026 00:39:21 -0400
Subject: [PATCH 2/6] perf-changelog: add dsv4-fp4-mi355x-sglang-disagg entry
 (#1708)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 perf-changelog.yaml | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index e3fb6e94f..56725c88d 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3547,3 +3547,13 @@
     - "MI355x DSR1-FP4: Include TP4 configurations for 8k1k"
     - "Expand the TP sweep (included TP=4) for 8k/1k configuration for conc=4 to 64"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1692
+
+- config-keys:
+    - dsv4-fp4-mi355x-sglang-disagg
+  description:
+    - "New recipe: DeepSeek-V4-Pro FP4 prefill/decode-disaggregated on MI355X via SGLang + MoRI. Combines the validated single-node DSv4 SGLang recipe with the sglang-disagg framework from the dsr1/qwen3.5/glm5 mi355x recipes"
+    - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml (compressed attention, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default)"
+    - "DSv4 FP4-experts SGLANG_* env block + deep_gemm-absence fallback added to amd_utils/env.sh; idempotent atomic config.json model_type patch (deepseek_v4->deepseek_v3) added to amd_utils/setup_deps.sh, both gated on MODEL_NAME"
+    - "Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601 (mainline, carries DSv4 sgl#26383 + MoRI disagg backend, auto-applies the MoRI conn.py overlay; rocm/sgl-dev:*-DSv4 branch image lacks both)"
+    - "1P1D TP8/EP1 dp-attn false. Starts at a single ISL/OSL (8k/1k) conc=1 as an end-to-end smoke test before expanding the conc / DEP sweep"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1708

From d4f6c89885ec8aa9cdf2e7a9748ea5737c066e36 Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Thu, 11 Jun 2026 01:32:45 -0400
Subject: [PATCH 3/6] dsv4 sglang-disagg: monkey-patch sglang#27855 (MoRI-EP
 FP4 swiglu crash fix)

DeepSeek-V4-Pro + MoRI expert-parallel aborts at warmup with
"dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2" on the
clamped-SwiGLU/INTERLEAVE path. sgl-project/sglang#27855 fixes it in
moe_runner/aiter.py:_pre_permute_deepep_to_aiter by adding a W4A4 + FP4-dispatch
branch that dequantizes the FP4 activation to BF16 via upscale_mxfp4, but the PR
is unmerged and therefore absent from the pinned image.

setup_deps.sh now source-patches aiter.py at container start, gated on
MODEL_NAME == DeepSeek-V4-Pro. The patch is idempotent, writes atomically, and
warns and skips if the image's aiter.py predates the anchored structure. Applied
to aiter.py from current sglang main, it was verified to produce a file
byte-identical to the PR head.
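
The idempotent / anchored / atomic pattern described above can be sketched in
isolation as follows. This is a minimal illustration, not the code added to
setup_deps.sh: apply_anchored_patch and its status strings are hypothetical
names. It also copies the original file mode onto the temp file, which the
patch in this series does not do (tempfile.mkstemp creates files readable and
writable only by the creating user).

```python
import os
import shutil
import tempfile


def apply_anchored_patch(path, marker, anchor, replacement):
    """Replace the first occurrence of `anchor` in the file at `path`.

    Returns "already-applied" if `marker` is present, "anchor-missing" if
    the anchor text is absent, and "patched" after a successful write.
    """
    with open(path) as f:
        src = f.read()

    # Idempotency check: the marker only exists once the patch has landed.
    if marker in src:
        return "already-applied"
    # Warn-and-skip case: the file does not have the expected structure.
    if anchor not in src:
        return "anchor-missing"

    new_src = src.replace(anchor, replacement, 1)

    # Write to a temp file in the same directory, then rename over the
    # original so a concurrent reader sees the old or the new file in full.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(new_src)
        shutil.copymode(path, tmp)  # keep the original permission bits
        os.replace(tmp, path)
    except Exception:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    return "patched"
```

Calling the function twice with a replacement that contains the marker returns
"patched" and then "already-applied"; a file without the anchor is left
untouched and reported as "anchor-missing".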

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 benchmarks/multi_node/amd_utils/setup_deps.sh | 109 ++++++++++++++++++
 perf-changelog.yaml                           |   1 +
 2 files changed, 110 insertions(+)

diff --git a/benchmarks/multi_node/amd_utils/setup_deps.sh b/benchmarks/multi_node/amd_utils/setup_deps.sh
index 13714aa77..795f1cb70 100644
--- a/benchmarks/multi_node/amd_utils/setup_deps.sh
+++ b/benchmarks/multi_node/amd_utils/setup_deps.sh
@@ -783,6 +783,114 @@ PYEOF
     _SETUP_INSTALLED+=("dsv4-config-model-type")
 }
 
+# ---------------------------------------------------------------------------
+# SGLang: DeepSeek-V4-Pro + MoRI-EP AITER MoE FP4 dispatch crash fix.
+#
+# Monkey-patches sgl-project/sglang#27855 ("[AMD] fix moriep quant kernel not
+# implemented issue"), which is not yet merged upstream and so is absent from
+# the pinned mainline image. Without it, DSv4 + MoRI expert-parallel aborts at
+# warmup with:
+#     dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2
+# on the clamped-SwiGLU / INTERLEAVE path. The fix, in
+# moe_runner/aiter.py:_pre_permute_deepep_to_aiter, adds a W4A4 + FP4-dispatch
+# branch that dequantizes the FP4 activation to BF16 (upscale_mxfp4) and lets
+# fused_moe re-quantize internally, mirroring the existing W4A4+FP8 and
+# FP8-weight+FP4 dequant branches.
+#
+# Only the MoRI-EP decode path triggers it, so this is gated on
+# MODEL_NAME == DeepSeek-V4-Pro. Idempotent (skips once swiglu_interleave is
+# present), atomic write, and warn+skip if the image's aiter.py predates the
+# anchored structure (then an image bump carrying #27855 is needed). Drop this
+# patch once a pinned image already includes #27855.
+# ---------------------------------------------------------------------------
+patch_aiter_dsv4_fp4_swiglu() {
+    if [[ "$MODEL_NAME" != "DeepSeek-V4-Pro" ]]; then
+        return 0
+    fi
+    local target
+    target=$(python3 -c "import sglang.srt.layers.moe.moe_runner.aiter as m; print(m.__file__)" 2>/dev/null)
+    if [[ -z "$target" || ! -f "$target" ]]; then
+        echo "[SETUP] WARN: aiter.py not found; skipping DSv4 FP4 swiglu patch (#27855)"
+        return 0
+    fi
+    python3 - "$target" <<'PYEOF'
+import os, sys, tempfile
+target = sys.argv[1]
+src = open(target).read()
+
+if "swiglu_interleave" in src:
+    print("[SETUP] DSv4 aiter FP4 swiglu patch (#27855) already applied")
+    sys.exit(0)
+
+# Edit A: import get_bool_env_var alongside get_int_env_var.
+import_anchor = "from sglang.srt.utils import get_int_env_var\n"
+if "get_bool_env_var" not in src:
+    if import_anchor not in src:
+        print("[SETUP] WARN: #27855 import anchor not found; skipping (image aiter.py differs)")
+        sys.exit(0)
+    src = src.replace(
+        import_anchor,
+        "from sglang.srt.utils import get_bool_env_var, get_int_env_var\n",
+        1,
+    )
+
+# Edit B: compute swiglu_interleave right after is_fp4_dispatch.
+b_anchor = (
+    "        is_fp4_dispatch = hidden_states.dtype == torch.float4_e2m1fn_x2\n"
+    "\n"
+    "        if is_w4a4 and a1_scale is not None and not is_fp4_dispatch:\n"
+)
+b_new = (
+    "        is_fp4_dispatch = hidden_states.dtype == torch.float4_e2m1fn_x2\n"
+    "\n"
+    "        # AITER fused_moe Clamped-SwiGLU is dispatched with\n"
+    "        # gate_mode=INTERLEAVE, for which AITER picks a bf16/fp8 `q_dtype_a`\n"
+    "        # Refer to https://github.com/ROCm/aiter/blob/a2617c366dc7271a1662ecda2023d19f6ccefcec/aiter/fused_moe.py#L406-L412\n"
+    "        swiglu_interleave = quant_info.swiglu_limit > 0 and get_bool_env_var(\n"
+    '            "SGLANG_USE_AITER_MOE_GU_ITLV", "true"\n'
+    "        )\n"
+    "\n"
+    "        if is_w4a4 and a1_scale is not None and not is_fp4_dispatch:\n"
+)
+
+# Edit C: insert the W4A4 + FP4-dispatch + INTERLEAVE dequant branch.
+c_anchor = (
+    "            a1_scale = None\n"
+    "        elif is_fp8_quant and is_fp4_dispatch and a1_scale is not None:\n"
+)
+c_new = (
+    "            a1_scale = None\n"
+    "        elif is_w4a4 and is_fp4_dispatch and a1_scale is not None and swiglu_interleave:\n"
+    "            # W4A4 weights + FP4 dispatch on the clamped-SwiGLU/INTERLEAVE\n"
+    "            # path: AITER expects a bf16/fp8 activation here, not fp4x2.\n"
+    "            # Dequant FP4->BF16 and let fused_moe re-quantize internally.\n"
+    "            hidden_states = upscale_mxfp4(\n"
+    "                hidden_states, a1_scale, num_local_tokens, output_dtype\n"
+    "            )\n"
+    "            a1_scale = None\n"
+    "        elif is_fp8_quant and is_fp4_dispatch and a1_scale is not None:\n"
+)
+
+if b_anchor not in src or c_anchor not in src:
+    print("[SETUP] WARN: #27855 body anchors not found; skipping (image aiter.py predates the W4A4 branch)")
+    sys.exit(0)
+
+src = src.replace(b_anchor, b_new, 1).replace(c_anchor, c_new, 1)
+
+d = os.path.dirname(target)
+fd, tmp = tempfile.mkstemp(dir=d, prefix=".aiter.py.", suffix=".tmp")
+try:
+    with os.fdopen(fd, "w") as f:
+        f.write(src)
+    os.replace(tmp, target)
+    print(f"[SETUP] Patched {target}: DSv4 W4A4+FP4 swiglu-interleave dequant (#27855)")
+except Exception:
+    os.path.exists(tmp) and os.remove(tmp)
+    raise
+PYEOF
+    _SETUP_INSTALLED+=("dsv4-aiter-fp4-swiglu-27855")
+}
+
 # =============================================================================
 # Run installers (engine-gated)
 # =============================================================================
@@ -808,6 +916,7 @@ else
     patch_gluon_pa_mqa_logits_instr_shape
     install_transformers_glm5
     patch_dsv4_config
+    patch_aiter_dsv4_fp4_swiglu
 fi
 
 _SETUP_END=$(date +%s)
diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 56725c88d..56481891b 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3554,6 +3554,7 @@
     - "New recipe: DeepSeek-V4-Pro FP4 prefill/decode-disaggregated on MI355X via SGLang + MoRI. Combines the validated single-node DSv4 SGLang recipe with the sglang-disagg framework from the dsr1/qwen3.5/glm5 mi355x recipes"
     - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml (compressed attention, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default)"
     - "DSv4 FP4-experts SGLANG_* env block + deep_gemm-absence fallback added to amd_utils/env.sh; idempotent atomic config.json model_type patch (deepseek_v4->deepseek_v3) added to amd_utils/setup_deps.sh, both gated on MODEL_NAME"
+    - "setup_deps.sh monkey-patches sgl-project/sglang#27855 into the container's moe_runner/aiter.py (gated on MODEL_NAME): adds the W4A4 + FP4-dispatch clamped-SwiGLU/INTERLEAVE dequant branch that fixes the DSv4 + MoRI-EP warmup crash (dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2). Unmerged upstream / absent from the pinned image; idempotent + warn-skip if the image already carries it. Drop once a pinned image includes #27855"
     - "Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601 (mainline, carries DSv4 sgl#26383 + MoRI disagg backend, auto-applies the MoRI conn.py overlay; rocm/sgl-dev:*-DSv4 branch image lacks both)"
     - "1P1D TP8/EP1 dp-attn false. Starts at a single ISL/OSL (8k/1k) conc=1 as an end-to-end smoke test before expanding the conc / DEP sweep"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1708

From a1015d0ba6ed9680fe93a88f565bcf5eba35a1c3 Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Thu, 11 Jun 2026 01:34:33 -0400
Subject: [PATCH 4/6] dsv4 sglang-disagg: adopt PR #1701 (0610 DSv4) image +
 env + dsv4 attention backend

Realigns the DSv4 sglang-disagg recipe with the validated 0610 single-node recipe
(PR #1701, "[AMD][MI35X] 0610 DSV4", successful run):

- image -> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610
- env.sh DSv4 block replaced with #1701's: unified_kv_triton FlashMLA, aiter
  indexer (not tilelang), and the mainline fp8 wo_a / topk-v2 fallbacks hardcoded
  (SGLANG_OPT_FP8_WO_A_GEMM=false, SGLANG_OPT_USE_TOPK_V2=false) instead of a
  runtime deep_gemm-presence check; SGLANG_DEFAULT_THINKING / SGLANG_DSV4_REASONING_EFFORT;
  multi-stream overlap off. Branch-only SGLANG_DSV4_FP4_EXPERTS /
  SGLANG_FORCE_TRITON_MOE_FP8 dropped (DSv4 main no longer needs them).
- models.yaml base_flags: --attention-backend compressed -> dsv4; dp_flags adds
  --enable-prefill-delayer --prefill-delayer-max-delay-ms 5000 (the #1701 DP path).

Still a v0.5.12.post1 tag, so the MoRI conn.py overlay auto-applies; the #27855
aiter monkey-patch is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 .github/configs/amd-master.yaml             | 24 +++----
 benchmarks/multi_node/amd_utils/env.sh      | 72 ++++++++-------------
 benchmarks/multi_node/amd_utils/models.yaml | 25 +++----
 perf-changelog.yaml                         |  6 +-
 4 files changed, 55 insertions(+), 72 deletions(-)

diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
index 422e1a64e..d4b7a4756 100644
--- a/.github/configs/amd-master.yaml
+++ b/.github/configs/amd-master.yaml
@@ -2144,23 +2144,25 @@ dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp:
 # SGLANG_* env block in amd_utils/env.sh, and the config.json model_type patch in
 # amd_utils/setup_deps.sh (all gated on MODEL_NAME).
 #
-# Image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601 — the mainline
-# ROCm nightly the DSv4 MTP single-node recipe (dsv4-fp4-mi355x-sglang-mtp) runs
-# on. It carries DSv4 support (sgl#26383, mainline 2026-05-27) AND is on the same
-# -mi35x- image line as the dsr1/qwen3.5/glm5 disagg recipes, so it has the MoRI
-# disaggregation transfer backend — unlike the rocm/sgl-dev:*-DSv4 branch image
-# the aggregated dsv4-fp4-mi355x-sglang entry uses, which is cut from
-# amd/deepseek_v4 (no #26383, MoRI support unverified). Mainline omits deep_gemm;
-# env.sh detects that and routes the DSv4 fp8 wo_a / topk paths to torch
-# fallbacks. The v0.5.12.post1 tag also auto-applies the MoRI conn.py overlay
-# (job.slurm) that fixes the wire format for hybrid/sparse-attention models.
+# Image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610 — the mainline
+# ROCm nightly the validated 0610 single-node DSv4 recipe (PR #1701) runs on. It
+# carries DSv4 support (now on sglang main) AND is on the same -mi35x- image line
+# as the dsr1/qwen3.5/glm5 disagg recipes, so it has the MoRI disaggregation
+# transfer backend — unlike the rocm/sgl-dev:*-DSv4 branch image the aggregated
+# dsv4-fp4-mi355x-sglang entry uses (cut from amd/deepseek_v4, MoRI support
+# unverified). Mainline omits deep_gemm; env.sh hardcodes the #1701 fp8 wo_a /
+# topk-v2 fallbacks (matching that validated run) rather than a runtime detect.
+# The v0.5.12.post1 tag also auto-applies the MoRI conn.py overlay (job.slurm)
+# that fixes the wire format for hybrid/sparse-attention models. setup_deps.sh
+# additionally monkey-patches the unmerged sglang#27855 (MoRI-EP FP4 swiglu
+# crash fix) into the image's aiter.py.
 #
 # Topology 1P1D, TP8/EP1, dp-attn false — the same conservative starting point the
 # qwen3.5 / glm5 sglang-disagg recipes launched with. Starts at a single ISL/OSL
 # (8k/1k) conc=1 as an end-to-end smoke test (does DSv4 + MoRI disagg come up and
 # transfer KV at all on this image) before expanding to the full conc / DEP sweep.
 dsv4-fp4-mi355x-sglang-disagg:
-  image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601
+  image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610
   model: deepseek-ai/DeepSeek-V4-Pro
   model-prefix: dsv4
   runner: mi355x-disagg
diff --git a/benchmarks/multi_node/amd_utils/env.sh b/benchmarks/multi_node/amd_utils/env.sh
index e1fd17c28..dbe2fa4db 100755
--- a/benchmarks/multi_node/amd_utils/env.sh
+++ b/benchmarks/multi_node/amd_utils/env.sh
@@ -150,58 +150,38 @@ else
     fi
 
     # DeepSeek-V4-Pro (FP4 experts): the DSv4 SGLANG_* env block, copied verbatim
-    # from the single-node recipe (benchmarks/single_node/dsv4_fp4_mi355x_sglang*.sh),
-    # which tracks python/run_dsv4.sh on the amd/deepseek_v4 branch with the two
-    # FP4 overrides (SGLANG_DSV4_FP4_EXPERTS=True, SGLANG_FORCE_TRITON_MOE_FP8=0).
-    # Routes experts through the FP4 kernels + aiter MoE / MHC, tilelang indexer,
-    # triton SWA-prepare and FlashMLA. The deep_gemm-absence fallback at the end
-    # mirrors the single-node MTP recipe so this also runs on a mainline ROCm
-    # nightly (which omits deep_gemm). All gated on MODEL_NAME so other models are
-    # unaffected.
+    # from the validated 0610 single-node recipe (PR #1701, benchmarks/single_node/
+    # fixed_seq_len/dsv4_fp4_mi355x_sglang.sh). That PR realigned DSv4 to the
+    # mainline ...mi35x-20260610 image (now that DSv4 support is on sglang main):
+    # the dsv4 attention backend, unified_kv_triton FlashMLA, the aiter indexer
+    # (not tilelang), and the mainline fp8 wo_a / topk-v2 fallbacks hardcoded
+    # (SGLANG_OPT_FP8_WO_A_GEMM=false, SGLANG_OPT_USE_TOPK_V2=false) instead of a
+    # deep_gemm-presence detect. Branch-only FP4 MoE flags (SGLANG_DSV4_FP4_EXPERTS,
+    # SGLANG_FORCE_TRITON_MOE_FP8) are dropped — DSv4 main no longer needs them.
+    # Gated on MODEL_NAME so other models are unaffected.
     if [[ "$MODEL_NAME" == "DeepSeek-V4-Pro" ]]; then
-        export SGLANG_REASONING_EFFORT=max
-        export SGLANG_OPT_USE_FUSED_COMPRESS=true
-        export SGLANG_OPT_USE_OLD_COMPRESSOR=false
-        export SGLANG_OPT_USE_TILELANG_SWA_PREPARE=false
-        export SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK=false
-        export SGLANG_OPT_USE_FUSED_HASH_TOPK=true
+        export SGLANG_DEFAULT_THINKING=1
+        export SGLANG_DSV4_REASONING_EFFORT=max
         export SGLANG_OPT_DEEPGEMM_HC_PRENORM=false
+        export SGLANG_USE_AITER=1
+        export SGLANG_USE_ROCM700A=0
+        export SGLANG_OPT_USE_FUSED_COMPRESS=true
+        export SGLANG_HACK_FLASHMLA_BACKEND=unified_kv_triton
+        export SGLANG_OPT_FP8_WO_A_GEMM=false
+        export SGLANG_OPT_USE_JIT_INDEXER_METADATA=false
+        export SGLANG_OPT_USE_TOPK_V2=false
+        export SGLANG_OPT_USE_AITER_INDEXER=true
+        export SGLANG_OPT_USE_TILELANG_INDEXER=false
         export SGLANG_OPT_USE_TILELANG_MHC_PRE=false
         export SGLANG_OPT_USE_TILELANG_MHC_POST=false
-        export SGLANG_OPT_USE_AITER_MHC_PRE=true
-        export SGLANG_OPT_USE_AITER_MHC_POST=true
-        export SGLANG_ENABLE_THINKING=1
-        export SGLANG_USE_AITER=1
-        export SGLANG_USE_ROCM700A=1
-        export SGLANG_TOPK_TRANSFORM_512_TORCH=0
         export SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1
-        export SGLANG_DSV4_FP4_EXPERTS=True
-        export SGLANG_OPT_DPSK_V4_RADIX=1
-        export SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false
-        export SGLANG_OPT_USE_FUSED_STORE_CACHE=true
-        export SGLANG_FORCE_TRITON_MOE_FP8=0
-        export SGLANG_HACK_FLASHMLA_BACKEND=triton
-        export SGLANG_OPT_USE_TILELANG_INDEXER=true
-        export SGLANG_OPT_USE_TRITON_SWA_PREPARE=true
+        export SGLANG_OPT_USE_FUSED_COMPRESS_TRITON=true
         export AITER_BF16_FP8_MOE_BOUND=0
-        export SGLANG_OPT_FUSE_WQA_WKV=true
-        export SGLANG_OPT_USE_FUSED_PAGED_COMPRESS=true
-        export SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=0
-
-        # Mainline ROCm nightlies carry DSv4 support but omit deep_gemm (only
-        # rocm/sgl-dev:*-DSv4 builds bundle it). DSv4-Pro's default fp8 wo_a path
-        # imports deep_gemm at weight load; detect its absence and route the
-        # deep_gemm-touching paths to their torch fallbacks. No-op on a
-        # deep_gemm-bearing image, so this works on both image lines.
-        if python3 -c "import deep_gemm" >/dev/null 2>&1; then
-            echo "[env.sh] deep_gemm present -> DSv4 fp8 wo_a / deep_gemm perf path"
-        else
-            echo "[env.sh] deep_gemm absent -> routing DSv4 fp8 wo_a / topk around it"
-            export SGLANG_OPT_FP8_WO_A_GEMM=0
-            export SGLANG_TOPK_TRANSFORM_512_TORCH=1
-            export SGLANG_OPT_USE_TOPK_V2=0
-            export SGLANG_ENABLE_JIT_DEEPGEMM=0
-        fi
+        export SGLANG_EAGER_INPUT_NO_COPY=true
+
+        # multi-stream
+        export SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=false
+        export SGLANG_ROCM_USE_MULTI_STREAM=false
     fi
 
     # Disable allocating memory in one pass
diff --git a/benchmarks/multi_node/amd_utils/models.yaml b/benchmarks/multi_node/amd_utils/models.yaml
index 2ccbcd4b7..98a1d58d9 100644
--- a/benchmarks/multi_node/amd_utils/models.yaml
+++ b/benchmarks/multi_node/amd_utils/models.yaml
@@ -351,20 +351,21 @@ DeepSeek-R1-0528-MXFP4-v2:
       cuda_graph_bs_range: "1-128"
 
 # DeepSeek-V4-Pro (FP4 experts) PD-disaggregation. Serving flags mirror the
-# validated single-node SGLang recipe (benchmarks/single_node/dsv4_fp4_mi355x_sglang.sh
-# + the MTP variant): compressed attention backend, SWA, page-size 256, the
-# deepseekv4 tool-call / deepseek-v4 reasoning parsers, the DSv4 thinking chat
-# template, and shared-experts-fusion disabled. The DSv4 FP4-experts SGLANG_*
-# env block + deep_gemm-absence fallback live in env.sh, and the config.json
-# model_type patch (deepseek_v4 -> deepseek_v3) in setup_deps.sh, both gated on
-# MODEL_NAME == DeepSeek-V4-Pro. --context-length is pinned (model default is
-# very long; would over-reserve KV); 9472 covers the 8k/1k smoke point.
-# kv-cache-dtype is intentionally left at the model default (the single-node DSv4
-# recipe sets none), unlike the fp8_e4m3 DeepSeek-R1 disagg entries.
+# validated 0610 single-node SGLang recipe (PR #1701, dsv4_fp4_mi355x_sglang.sh):
+# the dsv4 attention backend, SWA, page-size 256, the deepseekv4 tool-call /
+# deepseek-v4 reasoning parsers, the DSv4 thinking chat template, and
+# shared-experts-fusion disabled. The matching DSv4 SGLANG_* env block (#1701)
+# lives in env.sh, and the config.json model_type patch (deepseek_v4 ->
+# deepseek_v3) in setup_deps.sh, both gated on MODEL_NAME == DeepSeek-V4-Pro.
+# --context-length is pinned (model default is very long; would over-reserve KV);
+# 9472 covers the 8k/1k smoke point. kv-cache-dtype is left at the model default
+# (the #1701 single-node recipe sets none), unlike the fp8_e4m3 DeepSeek-R1
+# disagg entries. dp_flags add --enable-prefill-delayer + max-delay-ms 5000 (the
+# #1701 DP path) for the future DP-attention expansion.
 DeepSeek-V4-Pro:
-  base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --load-balance-method round_robin --disaggregation-transfer-backend mori --attention-backend compressed --swa-full-tokens-ratio 0.15 --page-size 256 --disable-shared-experts-fusion --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --context-length 9472 --chat-template /workspace/benchmarks/single_node/chat_templates/deepseek_v4_thinking.jinja"
+  base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --load-balance-method round_robin --disaggregation-transfer-backend mori --attention-backend dsv4 --swa-full-tokens-ratio 0.15 --page-size 256 --disable-shared-experts-fusion --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --context-length 9472 --chat-template /workspace/benchmarks/single_node/chat_templates/deepseek_v4_thinking.jinja"
   mtp_flags: ""
-  dp_flags: "--moe-a2a-backend mori --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head"
+  dp_flags: "--moe-a2a-backend mori --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head --enable-prefill-delayer --prefill-delayer-max-delay-ms 5000"
   prefill:
     mem_fraction_static: 0.8
     disable_radix_cache: true
diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 56481891b..f6a3dffe1 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3552,9 +3552,9 @@
     - dsv4-fp4-mi355x-sglang-disagg
   description:
     - "New recipe: DeepSeek-V4-Pro FP4 prefill/decode-disaggregated on MI355X via SGLang + MoRI. Combines the validated single-node DSv4 SGLang recipe with the sglang-disagg framework from the dsr1/qwen3.5/glm5 mi355x recipes"
-    - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml (compressed attention, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default)"
-    - "DSv4 FP4-experts SGLANG_* env block + deep_gemm-absence fallback added to amd_utils/env.sh; idempotent atomic config.json model_type patch (deepseek_v4->deepseek_v3) added to amd_utils/setup_deps.sh, both gated on MODEL_NAME"
+    - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml. Serving flags + env mirror the validated 0610 single-node recipe (PR #1701): dsv4 attention backend, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default; dp_flags add --enable-prefill-delayer + max-delay-ms 5000"
+    - "DSv4 SGLANG_* env block from PR #1701 added to amd_utils/env.sh (unified_kv_triton FlashMLA, aiter indexer, mainline fp8 wo_a / topk-v2 fallbacks hardcoded, multi-stream off; branch-only SGLANG_DSV4_FP4_EXPERTS/FORCE_TRITON_MOE_FP8 dropped); idempotent atomic config.json model_type patch (deepseek_v4->deepseek_v3) added to amd_utils/setup_deps.sh, both gated on MODEL_NAME"
     - "setup_deps.sh monkey-patches sgl-project/sglang#27855 into the container's moe_runner/aiter.py (gated on MODEL_NAME): adds the W4A4 + FP4-dispatch clamped-SwiGLU/INTERLEAVE dequant branch that fixes the DSv4 + MoRI-EP warmup crash (dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2). Unmerged upstream / absent from the pinned image; idempotent + warn-skip if the image already carries it. Drop once a pinned image includes #27855"
-    - "Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601 (mainline, carries DSv4 sgl#26383 + MoRI disagg backend, auto-applies the MoRI conn.py overlay; rocm/sgl-dev:*-DSv4 branch image lacks both)"
+    - "Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610 (the validated PR #1701 0610 image; mainline, carries DSv4 support + MoRI disagg backend, auto-applies the MoRI conn.py overlay; rocm/sgl-dev:*-DSv4 branch image lacks both)"
     - "1P1D TP8/EP1 dp-attn false. Starts at a single ISL/OSL (8k/1k) conc=1 as an end-to-end smoke test before expanding the conc / DEP sweep"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1708

From 9648a8129e81f7775c1d35eb43df002cfef1f475 Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Thu, 11 Jun 2026 01:36:18 -0400
Subject: [PATCH 5/6] dsv4 sglang-disagg: drop the prefill delayer from
 dp_flags

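Per request, remove --enable-prefill-delayer and
--prefill-delayer-max-delay-ms 5000 from the DeepSeek-V4-Pro dp_flags in
models.yaml. The models.yaml comment and the perf-changelog entry are updated
to say the prefill delayer is not used.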

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 benchmarks/multi_node/amd_utils/models.yaml | 6 +++---
 perf-changelog.yaml                         | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/benchmarks/multi_node/amd_utils/models.yaml b/benchmarks/multi_node/amd_utils/models.yaml
index 98a1d58d9..bd549dd54 100644
--- a/benchmarks/multi_node/amd_utils/models.yaml
+++ b/benchmarks/multi_node/amd_utils/models.yaml
@@ -360,12 +360,12 @@ DeepSeek-R1-0528-MXFP4-v2:
 # --context-length is pinned (model default is very long; would over-reserve KV);
 # 9472 covers the 8k/1k smoke point. kv-cache-dtype is left at the model default
 # (the #1701 single-node recipe sets none), unlike the fp8_e4m3 DeepSeek-R1
-# disagg entries. dp_flags add --enable-prefill-delayer + max-delay-ms 5000 (the
-# #1701 DP path) for the future DP-attention expansion.
+# disagg entries. The prefill delayer (--enable-prefill-delayer) is intentionally
+# not used here.
 DeepSeek-V4-Pro:
   base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --load-balance-method round_robin --disaggregation-transfer-backend mori --attention-backend dsv4 --swa-full-tokens-ratio 0.15 --page-size 256 --disable-shared-experts-fusion --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --context-length 9472 --chat-template /workspace/benchmarks/single_node/chat_templates/deepseek_v4_thinking.jinja"
   mtp_flags: ""
-  dp_flags: "--moe-a2a-backend mori --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head --enable-prefill-delayer --prefill-delayer-max-delay-ms 5000"
+  dp_flags: "--moe-a2a-backend mori --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head"
   prefill:
     mem_fraction_static: 0.8
     disable_radix_cache: true
diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index f6a3dffe1..2ef5d33e6 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3552,7 +3552,7 @@
     - dsv4-fp4-mi355x-sglang-disagg
   description:
     - "New recipe: DeepSeek-V4-Pro FP4 prefill/decode-disaggregated on MI355X via SGLang + MoRI. Combines the validated single-node DSv4 SGLang recipe with the sglang-disagg framework from the dsr1/qwen3.5/glm5 mi355x recipes"
-    - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml. Serving flags + env mirror the validated 0610 single-node recipe (PR #1701): dsv4 attention backend, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default; dp_flags add --enable-prefill-delayer + max-delay-ms 5000"
+    - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml. Serving flags + env mirror the validated 0610 single-node recipe (PR #1701): dsv4 attention backend, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default; prefill delayer not used"
     - "DSv4 SGLANG_* env block from PR #1701 added to amd_utils/env.sh (unified_kv_triton FlashMLA, aiter indexer, mainline fp8 wo_a / topk-v2 fallbacks hardcoded, multi-stream off; branch-only SGLANG_DSV4_FP4_EXPERTS/FORCE_TRITON_MOE_FP8 dropped); idempotent atomic config.json model_type patch (deepseek_v4->deepseek_v3) added to amd_utils/setup_deps.sh, both gated on MODEL_NAME"
     - "setup_deps.sh monkey-patches sgl-project/sglang#27855 into the container's moe_runner/aiter.py (gated on MODEL_NAME): adds the W4A4 + FP4-dispatch clamped-SwiGLU/INTERLEAVE dequant branch that fixes the DSv4 + MoRI-EP warmup crash (dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2). Unmerged upstream / absent from the pinned image; idempotent + warn-skip if the image already carries it. Drop once a pinned image includes #27855"
     - "Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610 (the validated PR #1701 0610 image; mainline, carries DSv4 support + MoRI disagg backend, auto-applies the MoRI conn.py overlay; rocm/sgl-dev:*-DSv4 branch image lacks both)"

From 316dd215c7706eefeb093a3ce4f0310593dc00af Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Thu, 11 Jun 2026 01:40:24 -0400
Subject: [PATCH 6/6] dsv4 sglang-disagg: drop the sglang#27855 monkey-patch
 (TP-only smoke, no EP)

The sgl-project/sglang#27855 fix only matters on the DSv4 + MoRI expert-parallel
path. This recipe runs TP8/EP1 for the smoke test, so that crash is not
reachable. Remove the patch_aiter_dsv4_fp4_swiglu source patch and its call from
setup_deps.sh, and drop its perf-changelog entry; a comment in amd-master.yaml
records that the patch is needed only when EP/DEP decode is enabled.
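
If EP/DEP decode is enabled later, the decision to carry the fix could be
expressed as a small guard. This is only a sketch: the function name and the
ep_size argument are hypothetical and do not exist in setup_deps.sh, and it
assumes "EP enabled" means an expert-parallel size greater than 1.

```shell
#!/usr/bin/env bash
# Hypothetical guard, not part of this repo: the function name and the
# ep_size argument are illustrative assumptions.
# Returns 0 (true) only for DeepSeek-V4-Pro with expert parallelism enabled
# (EP size > 1), the path where the sglang#27855 crash is reachable.
needs_moriep_fp4_swiglu_fix() {
    local model_name="$1"
    local ep_size="${2:-1}"   # defaults to EP1 when no size is given
    [[ "$model_name" == "DeepSeek-V4-Pro" ]] && (( ep_size > 1 ))
}

# The smoke test in this recipe is TP8/EP1, so the guard reports "not needed".
if needs_moriep_fp4_swiglu_fix "DeepSeek-V4-Pro" 1; then
    echo "EP enabled: carry the #27855 fix or pin an image that includes it"
else
    echo "TP-only (EP1): #27855 fix not needed"
fi
```

Keeping the check in one place would make it easier to drop once a pinned
image already includes #27855.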

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 .github/configs/amd-master.yaml               |   8 +-
 benchmarks/multi_node/amd_utils/setup_deps.sh | 109 ------------------
 perf-changelog.yaml                           |   1 -
 3 files changed, 5 insertions(+), 113 deletions(-)

diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
index d4b7a4756..f3e0c6b81 100644
--- a/.github/configs/amd-master.yaml
+++ b/.github/configs/amd-master.yaml
@@ -2153,9 +2153,11 @@ dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp:
 # unverified). Mainline omits deep_gemm; env.sh hardcodes the #1701 fp8 wo_a /
 # topk-v2 fallbacks (matching that validated run) rather than a runtime detect.
 # The v0.5.12.post1 tag also auto-applies the MoRI conn.py overlay (job.slurm)
-# that fixes the wire format for hybrid/sparse-attention models. setup_deps.sh
-# additionally monkey-patches the unmerged sglang#27855 (MoRI-EP FP4 swiglu
-# crash fix) into the image's aiter.py.
+# that fixes the wire format for hybrid/sparse-attention models.
+#
+# This smoke test is TP-only (EP1), so the DSv4 + MoRI-EP FP4 swiglu crash that
+# sglang#27855 fixes is not on the path; that monkey-patch is intentionally not
+# carried here and would be needed only when EP/DEP decode is enabled.
 #
 # Topology 1P1D, TP8/EP1, dp-attn false — the same conservative starting point the
 # qwen3.5 / glm5 sglang-disagg recipes launched with. Starts at a single ISL/OSL
diff --git a/benchmarks/multi_node/amd_utils/setup_deps.sh b/benchmarks/multi_node/amd_utils/setup_deps.sh
index 795f1cb70..13714aa77 100644
--- a/benchmarks/multi_node/amd_utils/setup_deps.sh
+++ b/benchmarks/multi_node/amd_utils/setup_deps.sh
@@ -783,114 +783,6 @@ PYEOF
     _SETUP_INSTALLED+=("dsv4-config-model-type")
 }
 
-# ---------------------------------------------------------------------------
-# SGLang: DeepSeek-V4-Pro + MoRI-EP AITER MoE FP4 dispatch crash fix.
-#
-# Monkey-patches sgl-project/sglang#27855 ("[AMD] fix moriep quant kernel not
-# implemented issue"), which is not yet merged upstream and so is absent from
-# the pinned mainline image. Without it, DSv4 + MoRI expert-parallel aborts at
-# warmup with:
-#     dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2
-# on the clamped-SwiGLU / INTERLEAVE path. The fix, in
-# moe_runner/aiter.py:_pre_permute_deepep_to_aiter, adds a W4A4 + FP4-dispatch
-# branch that dequantizes the FP4 activation to BF16 (upscale_mxfp4) and lets
-# fused_moe re-quantize internally, mirroring the existing W4A4+FP8 and
-# FP8-weight+FP4 dequant branches.
-#
-# Only the MoRI-EP decode path triggers it, so this is gated on
-# MODEL_NAME == DeepSeek-V4-Pro. Idempotent (skips once swiglu_interleave is
-# present), atomic write, and warn+skip if the image's aiter.py predates the
-# anchored structure (then an image bump carrying #27855 is needed). Drop this
-# patch once a pinned image already includes #27855.
-# ---------------------------------------------------------------------------
-patch_aiter_dsv4_fp4_swiglu() {
-    if [[ "$MODEL_NAME" != "DeepSeek-V4-Pro" ]]; then
-        return 0
-    fi
-    local target
-    target=$(python3 -c "import sglang.srt.layers.moe.moe_runner.aiter as m; print(m.__file__)" 2>/dev/null)
-    if [[ -z "$target" || ! -f "$target" ]]; then
-        echo "[SETUP] WARN: aiter.py not found; skipping DSv4 FP4 swiglu patch (#27855)"
-        return 0
-    fi
-    python3 - "$target" <<'PYEOF'
-import os, sys, tempfile
-target = sys.argv[1]
-src = open(target).read()
-
-if "swiglu_interleave" in src:
-    print("[SETUP] DSv4 aiter FP4 swiglu patch (#27855) already applied")
-    sys.exit(0)
-
-# Edit A: import get_bool_env_var alongside get_int_env_var.
-import_anchor = "from sglang.srt.utils import get_int_env_var\n"
-if "get_bool_env_var" not in src:
-    if import_anchor not in src:
-        print("[SETUP] WARN: #27855 import anchor not found; skipping (image aiter.py differs)")
-        sys.exit(0)
-    src = src.replace(
-        import_anchor,
-        "from sglang.srt.utils import get_bool_env_var, get_int_env_var\n",
-        1,
-    )
-
-# Edit B: compute swiglu_interleave right after is_fp4_dispatch.
-b_anchor = (
-    "        is_fp4_dispatch = hidden_states.dtype == torch.float4_e2m1fn_x2\n"
-    "\n"
-    "        if is_w4a4 and a1_scale is not None and not is_fp4_dispatch:\n"
-)
-b_new = (
-    "        is_fp4_dispatch = hidden_states.dtype == torch.float4_e2m1fn_x2\n"
-    "\n"
-    "        # AITER fused_moe Clamped-SwiGLU is dispatched with\n"
-    "        # gate_mode=INTERLEAVE, for which AITER picks a bf16/fp8 `q_dtype_a`\n"
-    "        # Refer to https://github.com/ROCm/aiter/blob/a2617c366dc7271a1662ecda2023d19f6ccefcec/aiter/fused_moe.py#L406-L412\n"
-    "        swiglu_interleave = quant_info.swiglu_limit > 0 and get_bool_env_var(\n"
-    '            "SGLANG_USE_AITER_MOE_GU_ITLV", "true"\n'
-    "        )\n"
-    "\n"
-    "        if is_w4a4 and a1_scale is not None and not is_fp4_dispatch:\n"
-)
-
-# Edit C: insert the W4A4 + FP4-dispatch + INTERLEAVE dequant branch.
-c_anchor = (
-    "            a1_scale = None\n"
-    "        elif is_fp8_quant and is_fp4_dispatch and a1_scale is not None:\n"
-)
-c_new = (
-    "            a1_scale = None\n"
-    "        elif is_w4a4 and is_fp4_dispatch and a1_scale is not None and swiglu_interleave:\n"
-    "            # W4A4 weights + FP4 dispatch on the clamped-SwiGLU/INTERLEAVE\n"
-    "            # path: AITER expects a bf16/fp8 activation here, not fp4x2.\n"
-    "            # Dequant FP4->BF16 and let fused_moe re-quantize internally.\n"
-    "            hidden_states = upscale_mxfp4(\n"
-    "                hidden_states, a1_scale, num_local_tokens, output_dtype\n"
-    "            )\n"
-    "            a1_scale = None\n"
-    "        elif is_fp8_quant and is_fp4_dispatch and a1_scale is not None:\n"
-)
-
-if b_anchor not in src or c_anchor not in src:
-    print("[SETUP] WARN: #27855 body anchors not found; skipping (image aiter.py predates the W4A4 branch)")
-    sys.exit(0)
-
-src = src.replace(b_anchor, b_new, 1).replace(c_anchor, c_new, 1)
-
-d = os.path.dirname(target)
-fd, tmp = tempfile.mkstemp(dir=d, prefix=".aiter.py.", suffix=".tmp")
-try:
-    with os.fdopen(fd, "w") as f:
-        f.write(src)
-    os.replace(tmp, target)
-    print(f"[SETUP] Patched {target}: DSv4 W4A4+FP4 swiglu-interleave dequant (#27855)")
-except Exception:
-    os.path.exists(tmp) and os.remove(tmp)
-    raise
-PYEOF
-    _SETUP_INSTALLED+=("dsv4-aiter-fp4-swiglu-27855")
-}
-
 # =============================================================================
 # Run installers (engine-gated)
 # =============================================================================
@@ -916,7 +808,6 @@ else
     patch_gluon_pa_mqa_logits_instr_shape
     install_transformers_glm5
     patch_dsv4_config
-    patch_aiter_dsv4_fp4_swiglu
 fi
 
 _SETUP_END=$(date +%s)
diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 2ef5d33e6..4cc6b5e7c 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3554,7 +3554,6 @@
     - "New recipe: DeepSeek-V4-Pro FP4 prefill/decode-disaggregated on MI355X via SGLang + MoRI. Combines the validated single-node DSv4 SGLang recipe with the sglang-disagg framework from the dsr1/qwen3.5/glm5 mi355x recipes"
     - "New benchmarks/multi_node/dsv4_fp4_mi355x_sglang-disagg.sh launcher + DeepSeek-V4-Pro entry in amd_utils/models.yaml. Serving flags + env mirror the validated 0610 single-node recipe (PR #1701): dsv4 attention backend, SWA, page-size 256, deepseekv4/deepseek-v4 parsers, DSv4 thinking chat template, shared-experts-fusion off, context-length pinned, kv-cache-dtype at model default; prefill delayer not used"
     - "DSv4 SGLANG_* env block from PR #1701 added to amd_utils/env.sh (unified_kv_triton FlashMLA, aiter indexer, mainline fp8 wo_a / topk-v2 fallbacks hardcoded, multi-stream off; branch-only SGLANG_DSV4_FP4_EXPERTS/FORCE_TRITON_MOE_FP8 dropped); idempotent atomic config.json model_type patch (deepseek_v4->deepseek_v3) added to amd_utils/setup_deps.sh, both gated on MODEL_NAME"
-    - "setup_deps.sh monkey-patches sgl-project/sglang#27855 into the container's moe_runner/aiter.py (gated on MODEL_NAME): adds the W4A4 + FP4-dispatch clamped-SwiGLU/INTERLEAVE dequant branch that fixes the DSv4 + MoRI-EP warmup crash (dynamic_per_group_scaled_quant_kernel not implemented for dtype fp4x2). Unmerged upstream / absent from the pinned image; idempotent + warn-skip if the image already carries it. Drop once a pinned image includes #27855"
     - "Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260610 (the validated PR #1701 0610 image; mainline, carries DSv4 support + MoRI disagg backend, auto-applies the MoRI conn.py overlay; rocm/sgl-dev:*-DSv4 branch image lacks both)"
     - "1P1D TP8/EP1 dp-attn false. Starts at a single ISL/OSL (8k/1k) conc=1 as an end-to-end smoke test before expanding the conc / DEP sweep"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1708