From 6c9c79ff7520c6c09629482eb325cd8399e919bc Mon Sep 17 00:00:00 2001
From: "cliu1004@amd.com"
Date: Fri, 5 Jun 2026 20:48:03 +0000
Subject: [PATCH 1/2] [AMD][MI355X] Bump qwen3.5-bf16 single-node SGLang image
 to v0.5.12.post1

Pin both qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp to
lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x (was
lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517) so the e2e matrix
runs on the image where we already measured the MTP EAGLE acceleration.

Measured on a single MI355X node (mia1-p01-g09), Qwen/Qwen3.5-397B-A17B,
1k/1k, TP=8, EP=1, no DP-attn, --attention-backend triton, EAGLE
num_steps=3 / eagle_topk=1 / num_draft_tokens=4. MTP delivers +34..69%
total token throughput and -28..42% median TPOT over non-MTP for conc
1..32; the conc=64 row is depressed on tok/s (+6.9%) because EAGLE
silently caps max_running_requests=48 and 16 of 64 requests queue (TPOT
speedup unchanged at 1.39x).

Co-authored-by: Cursor
---
 .github/configs/amd-master.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
index 78fdffa9a..0e79ad150 100644
--- a/.github/configs/amd-master.yaml
+++ b/.github/configs/amd-master.yaml
@@ -166,7 +166,7 @@ dsr1-fp8-mi355x-sglang-mtp:
     - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
 
 qwen3.5-bf16-mi355x-sglang:
-  image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
+  image: lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x
   model: Qwen/Qwen3.5-397B-A17B
   model-prefix: qwen3.5
   runner: mi355x
@@ -185,7 +185,7 @@ qwen3.5-bf16-mi355x-sglang:
     - { tp: 8, ep: 1, conc-start: 4, conc-end: 256 }
 
 qwen3.5-bf16-mi355x-sglang-mtp:
-  image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
+  image: lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x
   model: Qwen/Qwen3.5-397B-A17B
   model-prefix: qwen3.5
   runner: mi355x

From 5206b219cf7fa9a8972c03e222ef89676e1bda04 Mon Sep 17 00:00:00 2001
From: "cliu1004@amd.com"
Date: Tue, 9 Jun 2026 11:35:49 +0000
Subject: [PATCH 2/2] Add perf-changelog entry for
 qwen3.5-bf16-mi355x-sglang(-mtp) image bump

Required by PR review (chunfangamd requested changes: "Needs to update
the perf-changelog.yaml").

Root cause hypothesis: the old v0.5.12-rocm720-mi35x-20260517 image
shipped an AITER build that lacked Qwen3.5-397B TP8 MoE kernel tuning
and kernel-selection coverage. EAGLE/MTP ran functionally, but the
verify and MoE decode paths selected suboptimal AITER kernels, erasing
the expected speculative-decoding speedup on MI355X. The newer
v0.5.12.post1 image includes later AITER changes (candidate PRs:
aiter#3453, aiter#3341) that restore proper MoE kernel coverage,
bringing the MTP throughput gap back to +34..69% (conc 1..32).

Ref: https://github.com/SemiAnalysisAI/InferenceX/pull/1673

Co-authored-by: Cursor
---
 perf-changelog.yaml | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 47cfcebc1..dacf79bd6 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3502,3 +3502,12 @@
     - "Update GPT-OSS model for MI355X vLLM from amd/gpt-oss-120b-w-mxfp4-a-fp8 to openai/gpt-oss-120b"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1670
 
+- config-keys:
+    - qwen3.5-bf16-mi355x-sglang
+    - qwen3.5-bf16-mi355x-sglang-mtp
+  description:
+    - "Bump image from lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x to fix MTP performance drop."
+    - "Root cause hypothesis: the old image shipped an AITER build that lacked Qwen3.5-397B TP8 MoE kernel tuning and kernel-selection coverage. EAGLE/MTP ran functionally, but the verify and MoE decode paths selected suboptimal AITER kernels, erasing the expected speculative-decoding speedup on MI355X. The newer v0.5.12.post1 image includes later AITER changes (candidate PRs: aiter#3453, aiter#3341) that add proper Qwen3.5-397B TP8 MoE kernel coverage and retuning, restoring the MTP throughput gap to +34..69% (conc 1..32)."
+    - "On the new image, EAGLE-MTP delivers +34..69% total token throughput and -28..42% median TPOT over non-MTP at conc 1..32 (1k/1k, TP=8, EP=1, triton attention). At conc=64 the throughput speedup is limited to +6.9% by EAGLE's silent max_running_requests=48 cap, while the TPOT speedup stays at 1.39x."
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1673
+
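
Reviewer note: the conc=64 figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the patch. The function names are hypothetical, and it assumes a simplified model in which the server runs at most `max_running_requests` requests at once and queues the rest. It also assumes a TPOT speedup factor is the ratio of old to new TPOT, so a fractional reduction `r` maps to `1 / (1 - r)`.

```python
def split_running_and_queued(concurrency: int, max_running_requests: int) -> tuple[int, int]:
    """Return (running, queued) under a simple cap model (an assumption, not SGLang code)."""
    running = min(concurrency, max_running_requests)
    return running, concurrency - running


def tpot_reduction_to_speedup(reduction: float) -> float:
    """Convert a fractional TPOT reduction (0.28 for -28%) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # conc=64 against the cap of 48 reported in the commit message
    print(split_running_and_queued(64, 48))           # (48, 16)
    # the two ends of the reported -28..42% median TPOT range
    print(round(tpot_reduction_to_speedup(0.28), 2))  # 1.39
    print(round(tpot_reduction_to_speedup(0.42), 2))  # 1.72
```

Under these assumptions the 16 queued requests match the commit message, and a 28% TPOT reduction corresponds to roughly the 1.39x speedup quoted for conc=64.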