
[AMD] Performance Improvements for MI300X with GEMM and FP8 Enhancements #811

Open
chunfangamd wants to merge 15 commits into main from chun_zhentao/dsr1_fp8_mi300x_20260225

Conversation

@chunfangamd
Collaborator

@chunfangamd chunfangamd commented Feb 26, 2026

Upgrade the DSR1 FP8 MI300X/MI325X images to lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi30x-20260414:

Compared to the previous image, lmsysorg/sglang:v0.5.9-rocm700-mi30x, this image includes the following changes:

  1. Include configuration files for three GEMM operations: ROCm/aiter#2024, "[gfx942] Add new GEMM configuration files for DSKR1"
  2. Improve TPOT by using FP8 BMM in MLA on MI300X for DSR1/V3: sgl-project/sglang#18624, "[AMD] DSR1/V3 use fp8 bmm in MLA for MI300X"
  3. Broaden the optimized paths to all HIP platforms and add tuned FP8 GEMM configs: sgl-project/sglang#18242, "[ROCm] Optimize Deepseek R1 on MI300X"
  4. Use a newer aiter version, 0.1.12.post1

e2e Tests:

Co-authored with @zhentaocc

- Pin aiter and sgl-kernel to the specific commits required by the v0.5.8-rocm700-mi30x image.
- This patch is only expected to work with the image lmsysorg/sglang:v0.5.8-rocm700-mi30x.
- Joint work with Zhentao Chen.

The previous aiter ref (9046b6f) changed get_mla_metadata_v1 to expect
a Tensor for kv_last_page_lens, but the image's sglang still passed an
int, crashing during cuda graph capture.

Fix by fresh-cloning aiter at d2ca5a89, pinning sgl-kernel to 8bd6447
(now at sglang/sgl-kernel), and uninstalling stale packages before
rebuilding to avoid leftover C extension conflicts.
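
The re-pin described in this commit message can be laid out as an ordered command plan. The following is a minimal dry-run sketch, not the script used in this PR: the helper name `repin_plan`, the clone destination, and the pip package names `aiter` and `sgl-kernel` are assumptions. The rebuild step and the sgl-kernel checkout at 8bd6447 are left out because the commit message does not give the build commands or the checkout path.

```python
from typing import List

AITER_REPO = "https://github.com/ROCm/aiter.git"  # repository behind ROCm/aiter#2024
AITER_COMMIT = "d2ca5a89"  # aiter commit named in the commit message


def repin_plan(dest: str = "/tmp/aiter") -> List[List[str]]:
    """Return the commands in order; nothing is executed here."""
    return [
        # Remove stale installs first so leftover C extensions cannot conflict.
        ["pip", "uninstall", "-y", "aiter", "sgl-kernel"],  # package names assumed
        # Fresh clone instead of reusing a checkout baked into the image.
        ["git", "clone", AITER_REPO, dest],
        ["git", "-C", dest, "checkout", AITER_COMMIT],
    ]


if __name__ == "__main__":
    for cmd in repin_plan():
        print(" ".join(cmd))
```

Keeping the plan as data makes the ordering easy to review before anything touches the image; a real script would pass each entry to `subprocess.run(cmd, check=True)`.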
Contributor

@functionstackx functionstackx left a comment


Can y'all update sglang with the latest aiter and we wait until the next sgl release image before updating inferenceMax? Ideally we wanna track actual images and not patchwork

@cquil11
Collaborator

cquil11 commented Feb 26, 2026

@chunfangamd So the idea is that this will land in SGLang 0.10.0?

as it contains the three previously mentioned improvements
@chunfangamd chunfangamd force-pushed the chun_zhentao/dsr1_fp8_mi300x_20260225 branch from e1c3d94 to 9c490e8 Compare March 11, 2026 11:41
@functionstackx
Contributor

functionstackx commented Mar 11, 2026

@chunfangamd So the idea is that this will land in SGLang 0.10.0?

yes generally for less new models, plz use the release builds when possible

+viz @cquil11 @Oseltamivir

@cquil11
Collaborator

cquil11 commented Mar 23, 2026

@chunfangamd did this enhancement make it into SGLang 0.5.9?

@cquil11
Collaborator

cquil11 commented Apr 8, 2026

@claude Can you check to see if the PRs / commits mentioned in this PR description are in SGLang 0.5.10 (https://github.com/sgl-project/sglang/releases/tag/v0.5.10)? This may involve searching the commit history on the SGLang repo.

@Klaud-Cold
Collaborator

Klaud-Cold commented Apr 8, 2026

Claude finished @cquil11's task in 2m 14s


Checking if PR-referenced changes are in SGLang 0.5.10


Results

Both SGLang PRs referenced in this PR description were merged after SGLang v0.5.9 (released Feb 24, 2026 at 01:14 UTC) and before SGLang v0.5.10 (released April 6, 2026). The aiter PR was merged earlier, on Feb 12, 2026, in a separate repository.

| PR | Title | Merged | In v0.5.9? | In v0.5.10? |
| --- | --- | --- | --- | --- |
| sgl-project/sglang#18624 | [AMD] DSR1/V3 use fp8 bmm in MLA for MI300X | Feb 24, 2026 23:33 UTC | ❌ No (merged after release) | Yes: merge commit c193a52 is an ancestor of v0.5.10 |
| sgl-project/sglang#18242 | [ROCm] Optimize Deepseek R1 on MI300X | Feb 25, 2026 01:01 UTC | ❌ No (merged after release) | Yes: explicitly listed in v0.5.10 release notes under New Contributors (@zhentaocc) |
| ROCm/aiter#2024 | [gfx942] Add new GEMM configuration files for DSKR1 | Feb 12, 2026 | N/A (aiter repo) | ⚠️ Indirectly: merged into aiter on Feb 12; inclusion in SGLang depends on which aiter version SGLang 0.5.10 pins. The v0.5.10 release notes show sgl-kernel was upgraded to sglang-kernel 0.4.1, but don't explicitly mention an aiter version bump. Any aiter build after Feb 12 would include this change. |

Summary: Both SGLang PRs (#18624 and #18242) are confirmed to be in v0.5.10. The aiter GEMM config PR (#2024) was merged into the aiter repo before either SGLang PR, so it would be included if SGLang 0.5.10 uses an aiter version built after Feb 12, 2026.
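
The date comparison behind this summary can be sketched in a few lines. This is only an illustration of the reasoning, with the timestamps copied from the comment above; the function name `release_window` is hypothetical, and v0.5.10 is given only as a date, so midnight UTC is assumed. A merge time inside the window is a heuristic, since a release can be cut from a branch, so the ancestor check on the merge commit remains the authoritative test.

```python
from datetime import datetime, timezone

UTC = timezone.utc
V059_RELEASED = datetime(2026, 2, 24, 1, 14, tzinfo=UTC)
V0510_RELEASED = datetime(2026, 4, 6, tzinfo=UTC)  # time of day assumed

MERGED_AT = {
    "sgl-project/sglang#18624": datetime(2026, 2, 24, 23, 33, tzinfo=UTC),
    "sgl-project/sglang#18242": datetime(2026, 2, 25, 1, 1, tzinfo=UTC),
}


def release_window(merged_at: datetime) -> str:
    """Classify a merge time against the two release times."""
    if merged_at <= V059_RELEASED:
        return "by v0.5.9"
    if merged_at <= V0510_RELEASED:
        return "between v0.5.9 and v0.5.10"
    return "after v0.5.10"


for pr, merged in MERGED_AT.items():
    print(pr, "->", release_window(merged))
```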


@cquil11
Collaborator

cquil11 commented Apr 8, 2026

@claude referring to your previous comment and investigation:

Summary: Both SGLang PRs (#18624 and #18242) are confirmed to be in v0.5.10. The aiter GEMM config PR (#2024) was merged into the aiter repo before either SGLang PR, so it would be included if SGLang 0.5.10 uses an aiter version built after Feb 12, 2026.

You need to confirm if SGLang 0.5.10 uses an aiter version built after Feb 12, 2026
Lazy clanker

@Klaud-Cold
Collaborator

Klaud-Cold commented Apr 8, 2026

Claude finished @cquil11's task in 1m 28s


Confirming aiter version in SGLang 0.5.10


Results

Yes, SGLang v0.5.10 uses an aiter version built well after Feb 12, 2026.

SGLang v0.5.10's docker/rocm.Dockerfile pins:

AITER_COMMIT_DEFAULT="v0.1.11.post1"
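
A pin like this can be read out of the Dockerfile text mechanically. The sketch below is an assumption-laden illustration: the helper name `aiter_pin` is hypothetical, and the regular expression only assumes the `AITER_COMMIT_DEFAULT="..."` form quoted above, not the surrounding Dockerfile syntax.

```python
import re
from typing import Optional

# Matches the quoted form shown above, with or without the double quotes.
PIN_RE = re.compile(r'AITER_COMMIT_DEFAULT="?([^"\s]+)"?')


def aiter_pin(dockerfile_text: str) -> Optional[str]:
    """Return the pinned aiter ref, or None if no pin is found."""
    match = PIN_RE.search(dockerfile_text)
    return match.group(1) if match else None


sample = 'FROM base\nAITER_COMMIT_DEFAULT="v0.1.11.post1"\n'  # illustrative text only
print(aiter_pin(sample))
```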

The aiter v0.1.11.post1 tag points to commit 417de6d (dated March 5, 2026), which is 81 commits ahead of the ROCm/aiter#2024 merge commit 89015e2 (merged Feb 12, 2026) and 0 commits behind — so the GEMM config changes are fully included.

| Item | Details |
| --- | --- |
| SGLang v0.5.10 aiter pin | v0.1.11.post1 (commit 417de6d, March 5, 2026) |
| ROCm/aiter#2024 merge commit | 89015e2, Feb 12, 2026 |
| Included? | Yes: tag is 81 commits ahead of the PR merge |
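
The "ahead / behind" counts used here come from comparing the sets of commits reachable from each ref: a tag that is N ahead and 0 behind contains the merge commit as an ancestor. The sketch below shows the idea on a toy history, not the real aiter graph; the names `ancestors` and `ahead_behind` are hypothetical. Against a real clone, `git rev-list --left-right --count <base>...<tag>` reports the same pair of counts.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[str]]  # commit -> parent commits


def ancestors(graph: Graph, tip: str) -> Set[str]:
    """All commits reachable from tip, including tip itself."""
    seen: Set[str] = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(graph.get(commit, []))
    return seen


def ahead_behind(graph: Graph, tag: str, base: str) -> Tuple[int, int]:
    """Commits only in tag's history, and commits only in base's history."""
    reach_tag = ancestors(graph, tag)
    reach_base = ancestors(graph, base)
    return len(reach_tag - reach_base), len(reach_base - reach_tag)


# Toy linear history: root <- merge <- c1 <- c2 <- tag
toy = {"root": [], "merge": ["root"], "c1": ["merge"], "c2": ["c1"], "tag": ["c2"]}
ahead, behind = ahead_behind(toy, "tag", "merge")
print(ahead, behind)  # 3 0
```

A result of zero behind is what makes the inclusion claim safe: every commit in the merge's history is also in the tag's history.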

Updated summary: All three PRs referenced in this PR description are confirmed to be in SGLang v0.5.10:

  1. sgl-project/sglang#18624 — in v0.5.10
  2. sgl-project/sglang#18242 — in v0.5.10
  3. ROCm/aiter#2024 — included via aiter v0.1.11.post1

@cquil11 changed the title from "Performance Improvements for MI300X with GEMM and FP8 Enhancements" to "[AMD] Performance Improvements for MI300X with GEMM and FP8 Enhancements" Apr 8, 2026
@chunfangamd
Collaborator Author

@claude, show me one sweep run link. Are we facing a hang issue?

@Klaud-Cold
Collaborator

Klaud-Cold commented Apr 8, 2026

Claude finished @chunfangamd's task in 1m 42s


Sweep runs for this PR branch

  • Find recent e2e test / sweep runs for this PR branch
  • Check run status for hang issues
  • Summarize findings

There are 2 sweep runs on the chun_zhentao/dsr1_fp8_mi300x_20260225 branch. Both completed successfully with no hang issues:

| # | Run | Date | Status | Duration |
| --- | --- | --- | --- | --- |
| 1 | e2e Test - DSR1-FP8-MI300X-SGLang -- 20260225 | Feb 25, 2026 | ✅ Success | ~45 min |
| 2 | e2e Test - DSR1-FP8-MI300X-SGLang v0.5.9 (daily image 20260310) -- 20260311 | Mar 11, 2026 | ✅ Success | ~2h 18m |

No hang issues detected. In the most recent run (#22951504643), all 16 benchmark jobs, plus collect-evals, collect-results, and calc-success-rate completed successfully. The longest individual job was dsr1_1k8k fp8 mi300x sglang conc-64 which took ~72 minutes — within normal range for a high-concurrency long-output benchmark.

There are also no in-progress runs currently on this branch or repo-wide for the e2e-tests workflow.
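
The hang check above amounts to looking for jobs that are still running well past their expected duration. A minimal sketch of that rule follows; the `Job` record, the `possible_hangs` helper, and the 120-minute limit are all assumptions for illustration, and only the conc-64 duration is taken from the comment above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    status: str     # "success", "failure", or "in_progress"
    minutes: float  # elapsed wall-clock time


def possible_hangs(jobs: List[Job], limit_minutes: float) -> List[str]:
    """Names of jobs that are still running past the limit."""
    return [j.name for j in jobs
            if j.status == "in_progress" and j.minutes > limit_minutes]


jobs = [
    # Duration reported in the comment above.
    Job("dsr1_1k8k fp8 mi300x sglang conc-64", "success", 72),
    # Hypothetical entry, only to show what a flagged job looks like.
    Job("example-stuck-job", "in_progress", 240),
]
print(possible_hangs(jobs, limit_minutes=120))
```

A completed job is never flagged regardless of how long it took, which matches the reading above that a 72-minute successful run is slow but not hung.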



@cquil11
Collaborator

cquil11 commented Apr 9, 2026

@claude update perf changelog to be up to date

@Klaud-Cold
Collaborator

Klaud-Cold commented Apr 9, 2026

Claude finished @cquil11's task in 59s


Updating perf-changelog.yaml

  • Read current perf-changelog.yaml and PR changes
  • Determine what updates are needed
  • Update perf-changelog.yaml
  • Commit and push changes

Updated the perf-changelog entry for this PR to reflect the actual image transition: the old entry referenced a stale daily build (v0.5.8 → lmsysorg/sglang-daily:v0.5.9-rocm720-mi30x-20260310), but the actual change is from the official v0.5.9-rocm700-mi30x to v0.5.10-rocm720-mi30x release image. The three enhancement descriptions (GEMM configs, FP8 BMM in MLA, HIP platform optimization) remain unchanged as they're still accurate.


@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

5 participants