[CUDA]: Split-K2 QMoE SwiGLU GEMV kernel by tianleiwu · Pull Request #29167 · microsoft/onnxruntime

tianleiwu · 2026-06-19T19:28:55Z

Description

Add a CUDA QMoE Split-K2 two-pass FC1 interleaved-SwiGLU GEMV implementation for supported fp16 INT4 decode-shaped workloads. The first pass computes two K-split partials into QMoE workspace using the selected GEMV accumulator type, and the second pass reduces the partials in fp32, applies optional bias, and writes the SwiGLU output for FC2. FC2 stays on the existing moe_gemv_kernel path.
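The two-pass structure described above can be sketched in plain Python. This is a minimal numerical model, not the CUDA kernel: the function names are hypothetical, the split point `k // 2`, the interleaved layout (even rows are gate, odd rows are linear), and the plain `silu(gate) * linear` activation are all assumptions made for illustration, and quantization is ignored entirely.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def dot(row, vec, start, stop):
    # Dot product restricted to the K range [start, stop).
    return sum(row[k] * vec[k] for k in range(start, stop))

def swiglu_interleaved(acc, bias=None):
    # Assumed layout: row 2j is the gate, row 2j+1 is the linear branch.
    if bias is not None:
        acc = [a + b for a, b in zip(acc, bias)]
    return [silu(acc[2 * j]) * acc[2 * j + 1] for j in range(len(acc) // 2)]

def fc1_single_pass(weights, x, bias=None):
    # Reference route: one full-K dot product per output row, then SwiGLU.
    acc = [dot(row, x, 0, len(x)) for row in weights]
    return swiglu_interleaved(acc, bias)

def splitk2_partials(weights, x):
    # Pass 1: two K-split partial sums per output row (the "workspace").
    k = len(x)
    mid = k // 2  # assumed split point
    return [(dot(row, x, 0, mid), dot(row, x, mid, k)) for row in weights]

def splitk2_reduce_swiglu(partials, bias=None):
    # Pass 2: reduce the two partials, add optional bias, apply SwiGLU.
    acc = [p0 + p1 for p0, p1 in partials]
    return swiglu_interleaved(acc, bias)

# Tiny example: 4 interleaved rows (2 SwiGLU outputs), K = 6.
weights = [
    [0.5, -1.0, 2.0, 0.25, 1.5, -0.75],
    [1.0, 0.5, -0.5, 2.0, -1.0, 0.25],
    [-0.25, 0.75, 1.25, -1.5, 0.5, 1.0],
    [2.0, -0.5, 0.25, 0.5, -0.25, 1.5],
]
x = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6]
bias = [0.01, -0.02, 0.03, -0.04]

single = fc1_single_pass(weights, x, bias)
split = splitk2_reduce_swiglu(splitk2_partials(weights, x), bias)
```

Both routes compute the same sums in a different order, so `single` and `split` should agree up to floating-point rounding. In the real kernels the benefit comes from the two K ranges running in parallel, which this sequential model does not show.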

The route now uses two binary environment controls. ORT_MOE_GEMV_FP32_ACCUM=1 enables fp32 accumulation, and ORT_MOE_GEMV_SPLITK2_SWIGLU=1 enables Split-K2. Both default to 0.

| Accumulation control | Split-K2 control | Route |
| --- | --- | --- |
| unset or `0` | unset or `0` | fp16 accumulation, single-kernel FC1 SwiGLU |
| unset or `0` | `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` | fp16 accumulation, Split-K2 FC1 SwiGLU |
| `ORT_MOE_GEMV_FP32_ACCUM=1` | unset or `0` | fp32 accumulation, single-kernel FC1 SwiGLU |
| `ORT_MOE_GEMV_FP32_ACCUM=1` | `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` | fp32 accumulation, Split-K2 FC1 SwiGLU |
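The routing table can be read as two independent binary switches. The sketch below models that decision in Python for illustration only; the real selection happens in the CUDA provider, the helper names are hypothetical, and treating any value other than the exact string `1` as disabled is an assumption.

```python
import os

def env_flag(env, name):
    # Assumption: only the exact string "1" enables a control;
    # unset, "0", and anything else count as disabled.
    return env.get(name, "0") == "1"

def select_fc1_route(env):
    # Returns (accumulation type, FC1 SwiGLU route) per the table above.
    accum = "fp32" if env_flag(env, "ORT_MOE_GEMV_FP32_ACCUM") else "fp16"
    if env_flag(env, "ORT_MOE_GEMV_SPLITK2_SWIGLU"):
        fc1 = "split-k2"
    else:
        fc1 = "single-kernel"
    return accum, fc1

# Route that the current process environment would select under this model.
current_route = select_fc1_route(os.environ)
```

Passing a plain dictionary instead of `os.environ` makes it easy to enumerate all four rows of the table in a test.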

This PR also:

- keeps Split-K2 narrowly gated to fp16 INT4 interleaved-SwiGLU GEMV with activation/bias scale type matching the activation type;
- adds QMoE workspace plumbing for the Split-K2 partials;
- updates the focused QMoE profiler and Nsight wrapper with matching --fp32-accum and --splitk2-swiglu controls;
- adds focused benchmark coverage that explicitly forces the Split-K2 route under the fp16-default policy;
- documents the routing policy, measurements, binary knobs, and future autotune direction.

Motivation and Context

GPT-OSS-20B single-token decode spends a noticeable share of its time in the QMoE FC1 interleaved-SwiGLU GEMV path. Split-K2 improves FC1 parallelism by splitting the K dimension in two and reducing the partials in a lightweight second pass.

Under the fp32-accumulation route, Split-K2 reduced FC1 kernel work from about 21.42 us to 17.59 + 2.39 = 19.98 us in Nsight, and repeated CUDA-graph GPT-OSS decode pairs showed about +0.9% to +1.6% throughput improvement. A later 3-pair CUDA-graph run averaged 332.099536 tok/s for Split-K2 versus 327.857928 tok/s with Split-K2 disabled (+1.29% throughput, -1.28% latency), with no MMLU smoke regression signal.
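The percentages quoted above follow directly from the reported numbers. The short check below only redoes that arithmetic; the variable names are illustrative, and the latency change is derived from throughput under the assumption that latency is the reciprocal of tokens per second.

```python
# Nsight kernel times in microseconds, as reported above.
single_kernel_us = 21.42
partials_us = 17.59
reduce_us = 2.39
splitk2_us = partials_us + reduce_us
kernel_saving_pct = 100.0 * (single_kernel_us - splitk2_us) / single_kernel_us

# CUDA-graph decode throughput in tokens per second, as reported above.
splitk2_tps = 332.099536
baseline_tps = 327.857928
throughput_gain_pct = 100.0 * (splitk2_tps / baseline_tps - 1.0)
# Assumption: per-token latency is 1 / throughput.
latency_change_pct = 100.0 * (baseline_tps / splitk2_tps - 1.0)

print(f"FC1 kernel time: {splitk2_us:.2f} us ({kernel_saving_pct:.2f}% less)")
print(f"throughput: {throughput_gain_pct:+.2f}%  latency: {latency_change_pct:+.2f}%")
```

The kernel-level saving is larger than the end-to-end gain because FC1 is only one part of each decode step.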

After the normal fp16 QMoE GEMV path changed to fp16 accumulation by default, the single-kernel fp16 route became faster on the focused GPT-OSS, Qwen3.6-35B-A3B, and Gemma4-26B-A4B helper configurations. The fp16 Split-K2 variant is still kept because it is faster than the fp32 Split-K2 route in those focused runs and may be selected by future per-shape autotuning.
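The ordering claimed here can be checked against the helper latencies listed in the Validation section. The sketch below copies those figures into a dictionary (the key names are illustrative) and computes how much slower each Split-K2 variant is than the fp16 single-kernel route.

```python
# Helper latencies in milliseconds, copied from the Validation section.
latency_ms = {
    "GPT-OSS-20B": {
        "fp16_single": 0.062995, "fp16_splitk2": 0.063945, "fp32_splitk2": 0.071726,
    },
    "Qwen3.6-35B-A3B": {
        "fp16_single": 0.047403, "fp16_splitk2": 0.049207, "fp32_splitk2": 0.052055,
    },
    "Gemma4-26B-A4B": {
        "fp16_single": 0.050732, "fp16_splitk2": 0.053503, "fp32_splitk2": 0.059571,
    },
}

def slowdown_pct(candidate_ms, reference_ms):
    # Positive result means the candidate is slower than the reference.
    return 100.0 * (candidate_ms / reference_ms - 1.0)

report = {}
for model, t in latency_ms.items():
    report[model] = {
        "fp16_splitk2_vs_single": slowdown_pct(t["fp16_splitk2"], t["fp16_single"]),
        "fp32_splitk2_vs_single": slowdown_pct(t["fp32_splitk2"], t["fp16_single"]),
    }

for model, r in report.items():
    print(model, {k: f"{v:+.2f}%" for k, v in r.items()})
```

In all three configurations the fp16 single-kernel route is fastest, fp16 Split-K2 is next, and fp32 Split-K2 is slowest, which matches the reasoning for keeping fp16 as the default while retaining Split-K2 as an opt-in.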

Validation

Built and synced CUDA provider:
- cmake --build /home/tianlei/onnxruntime/build/cu130/Release --target onnxruntime_providers_cuda --parallel $(nproc)
Lint/format:
- lintrunner -a ...
- git diff --check
Focused CUDA test:
- ORT_QMOE_GEMV_BENCHMARK=1 pytest -q onnxruntime/test/python/transformers/test_qmoe_cuda.py::TestQMoEGemvBenchmark::test_splitk2_swiglu_decode_latency
- result: 1 passed
Focused GPT-OSS helper route checks:
- default fp16: valid output, ORT_MOE_GEMV_SPLITK2_SWIGLU=0, latency 0.062995 ms
- fp16 accumulation with Split-K2: valid output, ORT_MOE_GEMV_SPLITK2_SWIGLU=1, latency 0.063945 ms
- fp32 accumulation without Split-K2: valid output, ORT_MOE_GEMV_SPLITK2_SWIGLU=0, latency 0.071311 ms
- fp32 accumulation with Split-K2: valid output, ORT_MOE_GEMV_SPLITK2_SWIGLU=1, latency 0.071726 ms
Nsight route verification:
- default fp16 dispatched moe_gemv_interleaved_swiglu_kernel and moe_gemv_kernel
- fp16 accumulation with Split-K2 dispatched moe_gemv_splitk_partials_kernel, moe_gemv_splitk_reduce_swiglu_kernel, and moe_gemv_kernel
- fp32 accumulation without Split-K2 dispatched the single FC1 SwiGLU kernel and FC2
- fp32 accumulation with Split-K2 enabled dispatched Split-K2 partial/reduce kernels and FC2
Additional focused helper checks, all with valid output:
- Qwen3.6-35B-A3B: fp16 Split-K2 0.049207 ms, fp16 single-kernel 0.047403 ms, fp32 Split-K2 0.052055 ms
- Gemma4-26B-A4B: fp16 Split-K2 0.053503 ms, fp16 single-kernel 0.050732 ms, fp32 Split-K2 0.059571 ms
1000-sample match_mmlu smoke on GPT-OSS-20B INT4 QMoE:
- Split-K2 route: 0.8380
- Split-K2 disabled: 0.8350
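To judge whether the 0.8380 versus 0.8350 gap is meaningful at 1000 samples, a rough binomial standard-error estimate helps. The sketch below treats the two runs as independent samples, which is an assumption; both runs presumably score the same questions, so a paired analysis would be more appropriate if per-question results were available.

```python
import math

n = 1000  # samples per run, as stated above
acc_splitk2 = 0.8380
acc_disabled = 0.8350

diff = acc_splitk2 - acc_disabled
questions_apart = round(diff * n)  # difference in correctly answered questions

def binomial_se(p, samples):
    # Standard error of an accuracy estimate from `samples` Bernoulli trials.
    return math.sqrt(p * (1.0 - p) / samples)

# Assumption: independent runs, so the variances add.
se_diff = math.sqrt(binomial_se(acc_splitk2, n) ** 2 + binomial_se(acc_disabled, n) ** 2)

print(f"difference: {diff:+.4f} ({questions_apart} questions)")
print(f"standard error of the difference: {se_diff:.4f}")
```

The observed gap is a small fraction of one standard error under this model, which is consistent with reading the result as "no regression signal" rather than as an accuracy improvement.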

Copilot

Pull request overview

Enables the CUDA QMoE FC1 interleaved-SwiGLU GEMV Split-K2 two-pass path by default (for the narrowly supported FP16 + INT4 configuration), adds an opt-out environment knob, and updates profiling/benchmarking utilities and documentation to reflect the new default route.

Changes:

- Adds Split-K2 two-pass SwiGLU GEMV support (partials kernel + reduction/activation kernel) and wires it into the existing GEMV dispatch with an ORT_DISABLE_MOE_GEMV_SPLITK2_SWIGLU=1 opt-out.
- Extends QMoE GEMV profiling scripts and benchmark output metadata to include Split-K2 enable/disable control.
- Adds focused benchmark coverage and documents the experiment results, fallback knob, and future autotuning direction.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/python/transformers/test_qmoe_cuda.py | Adds a focused benchmark test case and reports Split-K2 disable state in JSON results. |
| onnxruntime/test/python/transformers/profile_qmoe_gemv.sh | Adds CLI flag to disable Split-K2 for profiling A/B comparisons. |
| onnxruntime/test/python/transformers/profile_qmoe_gemv.py | Adds profiling CLI flags to set/unset Split-K2 disable env var (and a deprecated compatibility flag). |
| onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_kernels.h | Extends GEMM1 plumbing to pass Split-K partials workspace pointer. |
| onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_kernels.cu | Adds env gating, workspace sizing/plumbing, and routes GEMV to pass partials workspace when Split-K2 is enabled. |
| onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.h | Extends the interleaved-SwiGLU GEMV launcher signature to accept Split-K partials workspace. |
| onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu | Implements Split-K2 partials + reduction/activation kernels and dispatch path. |
| docs/contrib_ops/cuda/qmoe_gemv_experiments.md | Documents Split-K2 experiment methodology, measurements, and results. |
| docs/contrib_ops/cuda/moe_qmoe.md | Documents the new default Split-K2 route and opt-out knob. |

Split-K2 SwiGLU GEMV

3de24c6

tianleiwu changed the title ~~Perf(cuda): enable Split-K2 QMoE SwiGLU GEMV~~ [CUDA]: Enable Split-K2 QMoE SwiGLU GEMV Jun 19, 2026

tianleiwu mentioned this pull request Jun 19, 2026

[CUDA] GPT-OSS-20B Throughput Optimization #29160

Open

tianleiwu requested a review from Copilot June 19, 2026 20:39

Copilot started reviewing on behalf of tianleiwu June 19, 2026 20:39

Copilot AI reviewed Jun 19, 2026

Comment thread onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu Outdated

Comment thread onnxruntime/test/python/transformers/profile_qmoe_gemv.sh

tianleiwu marked this pull request as draft June 19, 2026 22:09

tianleiwu added 2 commits June 19, 2026 16:09

experiment of fp16 accumulation for split-k

bc4712b

Merge remote-tracking branch 'origin/main' into tlwu/20260619/qmoe_sp…

4836d23

…litk_gemv

tianleiwu changed the title ~~[CUDA]: Enable Split-K2 QMoE SwiGLU GEMV~~ [CUDA]: Split-K2 QMoE SwiGLU GEMV kernel Jun 19, 2026

env vars for fp32_accum and splik routing

64db1ce

tianleiwu marked this pull request as ready for review June 20, 2026 00:00

tianleiwu wants to merge 4 commits into main from tlwu/20260619/qmoe_splitk_gemv


2 participants
