From 6c9c79ff7520c6c09629482eb325cd8399e919bc Mon Sep 17 00:00:00 2001
From: "cliu1004@amd.com" <cliu1004@amd.com@mia1-p01-g18.mia.tensorwave.lan>
Date: Fri, 5 Jun 2026 20:48:03 +0000
Subject: [PATCH 1/2] [AMD][MI355X] Bump qwen3.5-bf16 single-node SGLang image
 to v0.5.12.post1

Pin both qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp
to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x (was
lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517) so the e2e matrix
runs on the image where we already measured the MTP EAGLE acceleration.

Measured on a single MI355X node (mia1-p01-g09), Qwen/Qwen3.5-397B-A17B,
1k/1k, TP=8, EP=1, no DP-attn, --attention-backend triton, EAGLE
num_steps=3 / eagle_topk=1 / num_draft_tokens=4. MTP delivers
+34..69% total token throughput and -28..42% median TPOT over non-MTP
for conc 1..32. The conc=64 row shows a smaller tok/s gain (+6.9%)
because EAGLE silently caps max_running_requests at 48, so 16 of the 64
requests queue; the TPOT speedup there remains 1.39x.
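
The arithmetic behind these figures can be sketched as follows. This is
only an illustration: the helper names tpot_speedup and queued_requests
are hypothetical and are not part of SGLang or the benchmark harness.
The sketch assumes that a TPOT reduction of r percent corresponds to a
speedup factor of 1 / (1 - r/100), and that any requests above the
running cap wait in the queue.

```python
def tpot_speedup(tpot_reduction_pct: float) -> float:
    """Convert a percentage reduction in TPOT into a speedup factor."""
    return 1.0 / (1.0 - tpot_reduction_pct / 100.0)


def queued_requests(concurrency: int, max_running_requests: int) -> int:
    """Requests that must wait when concurrency exceeds the running cap."""
    return max(0, concurrency - max_running_requests)


# A 28% TPOT reduction is roughly a 1.39x speedup; 42% is roughly 1.72x.
print(round(tpot_speedup(28), 2))
print(round(tpot_speedup(42), 2))
# With conc=64 and a cap of 48 running requests, 16 requests queue.
print(queued_requests(64, 48))
```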

Co-authored-by: Cursor <cursoragent@cursor.com>
---
 .github/configs/amd-master.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
index 78fdffa9a..0e79ad150 100644
--- a/.github/configs/amd-master.yaml
+++ b/.github/configs/amd-master.yaml
@@ -166,7 +166,7 @@ dsr1-fp8-mi355x-sglang-mtp:
       - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
 
 qwen3.5-bf16-mi355x-sglang:
-  image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
+  image: lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x
   model: Qwen/Qwen3.5-397B-A17B
   model-prefix: qwen3.5
   runner: mi355x
@@ -185,7 +185,7 @@ qwen3.5-bf16-mi355x-sglang:
       - { tp: 8, ep: 1, conc-start: 4, conc-end: 256 }
 
 qwen3.5-bf16-mi355x-sglang-mtp:
-  image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
+  image: lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x
   model: Qwen/Qwen3.5-397B-A17B
   model-prefix: qwen3.5
   runner: mi355x

From 5206b219cf7fa9a8972c03e222ef89676e1bda04 Mon Sep 17 00:00:00 2001
From: "cliu1004@amd.com" <cliu1004@amd.com@mia1-p01-g18.mia.tensorwave.lan>
Date: Tue, 9 Jun 2026 11:35:49 +0000
Subject: [PATCH 2/2] Add perf-changelog entry for
 qwen3.5-bf16-mi355x-sglang(-mtp) image bump

Required by PR review (chunfangamd requested changes: "Needs to update
the perf-changelog.yaml").

Root cause hypothesis: the old v0.5.12-rocm720-mi35x-20260517 image
shipped an AITER build that lacked Qwen3.5-397B TP8 MoE kernel tuning
and kernel-selection coverage. EAGLE/MTP ran functionally, but the
verify and MoE decode paths selected suboptimal AITER kernels, erasing
the expected speculative-decoding speedup on MI355X. The newer
v0.5.12.post1 image includes later AITER changes (candidate PRs:
aiter#3453, aiter#3341) that restore proper MoE kernel coverage,
bringing the MTP throughput gain over non-MTP back to +34..69%
(conc 1..32).

Ref: https://github.com/SemiAnalysisAI/InferenceX/pull/1673
Co-authored-by: Cursor <cursoragent@cursor.com>
---
 perf-changelog.yaml | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 47cfcebc1..dacf79bd6 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3502,3 +3502,12 @@
     - "Update GPT-OSS model for MI355X vLLM from amd/gpt-oss-120b-w-mxfp4-a-fp8 to openai/gpt-oss-120b"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1670
 
+- config-keys:
+    - qwen3.5-bf16-mi355x-sglang
+    - qwen3.5-bf16-mi355x-sglang-mtp
+  description:
+    - "Bump image from lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x to fix the MTP performance drop."
+    - "Suspected root cause: the old image shipped an AITER build that lacked Qwen3.5-397B TP8 MoE kernel tuning and kernel-selection coverage. EAGLE/MTP ran functionally, but the verify and MoE decode paths selected suboptimal AITER kernels, erasing the expected speculative-decoding speedup on MI355X. The newer v0.5.12.post1 image includes later AITER changes (candidate PRs: aiter#3453, aiter#3341) that add proper Qwen3.5-397B TP8 MoE kernel coverage and retuning, restoring the MTP throughput gain over non-MTP to +34..69% (conc 1..32)."
+    - "On the new image, EAGLE-MTP delivers +34..69% total token throughput and -28..42% median TPOT over non-MTP at conc 1..32 (1k/1k, TP=8, EP=1, triton attention). At conc=64 the throughput gain is limited to +6.9% because EAGLE silently caps max_running_requests at 48, while the TPOT speedup stays at 1.39x."
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1673
+