4 changes: 2 additions & 2 deletions .github/configs/amd-master.yaml
@@ -166,7 +166,7 @@ dsr1-fp8-mi355x-sglang-mtp:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }

qwen3.5-bf16-mi355x-sglang:
-  image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
+  image: lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x
model: Qwen/Qwen3.5-397B-A17B
model-prefix: qwen3.5
runner: mi355x
@@ -185,7 +185,7 @@ qwen3.5-bf16-mi355x-sglang:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 256 }

qwen3.5-bf16-mi355x-sglang-mtp:
-  image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
+  image: lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x
model: Qwen/Qwen3.5-397B-A17B
model-prefix: qwen3.5
runner: mi355x
9 changes: 9 additions & 0 deletions perf-changelog.yaml
@@ -3531,3 +3531,12 @@
- "The Rust frontend replaces only the Python serving/API layer (HTTP, tokenization, scheduling glue, detokenization) and spawns the same Python EngineCore, so GPU kernels/attention/MoE GEMM/KV cache are untouched"
- "A/B sweep (28 single-node points, 1k1k + 8k1k, TP 1/2/4) vs the Python-frontend baseline (run 26696260751): throughput Pareto-neutral (peak tok/s/GPU within <1.5%, frontiers coincident) and TPOT flat (+-0.5%); TTFT improves ~8% at 1k1k and ~22% at 8k1k (every point), the expected signature of lower frontend CPU latency before first token, scaling with input length"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1634

- config-keys:
- qwen3.5-bf16-mi355x-sglang
- qwen3.5-bf16-mi355x-sglang-mtp
description:
- "Bump image from lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x to fix the MTP performance drop."
- "Root cause: the old image shipped an AITER build that lacked Qwen3.5-397B TP8 MoE kernel tuning and kernel-selection coverage. EAGLE/MTP ran correctly, but the verify and MoE decode paths selected suboptimal AITER kernels, erasing the expected speculative-decoding speedup on MI355X. The newer v0.5.12.post1 image includes later AITER changes (candidate PRs: aiter#3453, aiter#3341) that add Qwen3.5-397B TP8 MoE kernel coverage and retuning, restoring the MTP throughput gain over non-MTP to +34..69% (conc 1..32)."
- "On the new image, EAGLE-MTP delivers +34..69% total token throughput and -28..42% median TPOT relative to non-MTP at conc 1..32 (1k/1k, TP=8, EP=1, triton attention). At conc=64, the throughput speedup is limited to +6.9% by EAGLE's silent max_running_requests=48 cap, while the TPOT speedup stays at 1.39x."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1673
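
The changelog entry above quotes TPOT both as a percentage reduction (-28..42%) and as a speedup factor (1.39x), and attributes the conc=64 throughput limit to a max_running_requests=48 cap. The following is a minimal sketch of how those figures relate arithmetically. The function names are hypothetical and are not part of InferenceX or SGLang. It assumes the speedup is defined as baseline TPOT divided by MTP TPOT, which the entry does not state explicitly.

```python
def tpot_reduction_to_speedup(reduction: float) -> float:
    """Convert a fractional TPOT reduction (0.28 for -28%) to a speedup factor.

    Assumes speedup = baseline_tpot / new_tpot, so a reduction r gives 1 / (1 - r).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def effective_concurrency(requested: int, max_running_requests: int) -> int:
    """Number of requests that can actually run at once under a server-side cap."""
    return min(requested, max_running_requests)


if __name__ == "__main__":
    # The two ends of the quoted -28..42% median TPOT range.
    for reduction in (0.28, 0.42):
        print(f"-{reduction:.0%} TPOT is about {tpot_reduction_to_speedup(reduction):.2f}x")

    # At a requested concurrency of 64 with a cap of 48, only 48 requests run together.
    print(effective_concurrency(64, 48))
```

Under that definition, a 28% TPOT reduction works out to roughly 1.39x and a 42% reduction to roughly 1.72x. The entry does not say whether the 1.39x figure at conc=64 is the same measurement as the -28% end of the range, so the match in magnitude should be read as a consistency check only.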