[CUDA]: Split-K2 QMoE SwiGLU GEMV kernel #29167
Open
tianleiwu wants to merge 4 commits into
Pull request overview
Enables the CUDA QMoE FC1 interleaved-SwiGLU GEMV Split-K2 two-pass path by default (for the narrowly supported FP16 + INT4 configuration), adds an opt-out environment knob, and updates profiling/benchmarking utilities and documentation to reflect the new default route.
Changes:
- Adds Split-K2 two-pass SwiGLU GEMV support (partials kernel + reduction/activation kernel) and wires it into the existing GEMV dispatch with an `ORT_DISABLE_MOE_GEMV_SPLITK2_SWIGLU=1` opt-out.
- Extends QMoE GEMV profiling scripts and benchmark output metadata to include Split-K2 enable/disable control.
- Adds focused benchmark coverage and documents the experiment results, fallback knob, and future autotuning direction.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/test/python/transformers/test_qmoe_cuda.py | Adds a focused benchmark test case and reports Split-K2 disable state in JSON results. |
| onnxruntime/test/python/transformers/profile_qmoe_gemv.sh | Adds CLI flag to disable Split-K2 for profiling A/B comparisons. |
| onnxruntime/test/python/transformers/profile_qmoe_gemv.py | Adds profiling CLI flags to set/unset Split-K2 disable env var (and a deprecated compatibility flag). |
| onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_kernels.h | Extends GEMM1 plumbing to pass Split-K partials workspace pointer. |
| onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_kernels.cu | Adds env gating, workspace sizing/plumbing, and routes GEMV to pass partials workspace when Split-K2 is enabled. |
| onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.h | Extends the interleaved-SwiGLU GEMV launcher signature to accept Split-K partials workspace. |
| onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu | Implements Split-K2 partials + reduction/activation kernels and dispatch path. |
| docs/contrib_ops/cuda/qmoe_gemv_experiments.md | Documents Split-K2 experiment methodology, measurements, and results. |
| docs/contrib_ops/cuda/moe_qmoe.md | Documents the new default Split-K2 route and opt-out knob. |
Description
Add a CUDA QMoE Split-K2 two-pass FC1 interleaved-SwiGLU GEMV implementation for supported fp16 INT4 decode-shaped workloads. The first pass computes two K-split partials into QMoE workspace using the selected GEMV accumulator type, and the second pass reduces the partials in fp32, applies optional bias, and writes the SwiGLU output for FC2. FC2 stays on the existing `moe_gemv_kernel` path.

The route now uses two binary environment controls. `ORT_MOE_GEMV_FP32_ACCUM=1` enables fp32 accumulation, and `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` enables Split-K2. Both default to `0`.

| `ORT_MOE_GEMV_FP32_ACCUM` | `ORT_MOE_GEMV_SPLITK2_SWIGLU` |
|---|---|
| `0` | `0` |
| `0` | `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` |
| `ORT_MOE_GEMV_FP32_ACCUM=1` | `0` |
| `ORT_MOE_GEMV_FP32_ACCUM=1` | `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` |

This PR also:
- `--fp32-accum` and `--splitk2-swiglu` controls;

Motivation and Context
GPT-OSS-20B single-token decode spends visible time in the QMoE FC1 interleaved-SwiGLU GEMV path. Split-K2 improves FC1 parallelism by splitting the K dimension and reducing the partials in a lightweight second pass.
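The two-pass scheme can be illustrated with a small pure-Python reference. This is a minimal sketch, not the CUDA kernels from this PR: the function names are hypothetical, the interleaved layout is assumed to be (gate, up) on consecutive output rows, the activation is assumed to be plain `SiLU(gate) * up` without any clamping or scaling the real kernel may apply, and Python floats stand in for the fp16/fp32 accumulator types.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def splitk2_partials(weights, x):
    """Pass 1: for every output row, compute one partial dot product per K half."""
    half = len(x) // 2
    partials = []
    for row in weights:
        lo = sum(w * v for w, v in zip(row[:half], x[:half]))
        hi = sum(w * v for w, v in zip(row[half:], x[half:]))
        partials.append((lo, hi))
    return partials


def reduce_swiglu(partials, bias=None):
    """Pass 2: sum the two partials, add optional bias, apply SwiGLU per row pair."""
    rows = [lo + hi for lo, hi in partials]
    if bias is not None:
        rows = [r + b for r, b in zip(rows, bias)]
    # Assumed layout: even rows hold the gate projection, odd rows the up projection.
    return [silu(rows[i]) * rows[i + 1] for i in range(0, len(rows), 2)]


def single_pass_swiglu(weights, x, bias=None):
    """Reference single-kernel route: full-K dot product, then bias and SwiGLU."""
    rows = [sum(w * v for w, v in zip(row, x)) for row in weights]
    if bias is not None:
        rows = [r + b for r, b in zip(rows, bias)]
    return [silu(rows[i]) * rows[i + 1] for i in range(0, len(rows), 2)]
```

Both routes compute the same values up to rounding. The split route exposes twice as many independent dot products in the first pass, at the cost of a workspace for the partials and a second, lightweight pass.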
Under the fp32-accumulation route, Split-K2 reduced FC1 kernel work from about `21.42 us` to `17.59 + 2.39 = 19.98 us` in Nsight, and repeated CUDA-graph GPT-OSS decode pairs showed about `+0.9%` to `+1.6%` throughput improvement. A later 3-pair CUDA-graph run averaged `332.099536 tok/s` for Split-K2 versus `327.857928 tok/s` with Split-K2 disabled (`+1.29%` throughput, `-1.28%` latency), with no MMLU smoke regression signal.

After the normal fp16 QMoE GEMV path changed to fp16 accumulation by default, the single-kernel fp16 route became faster on the focused GPT-OSS, Qwen3.6-35B-A3B, and Gemma4-26B-A4B helper configurations. The fp16 Split-K2 variant is still kept because it is faster than the fp32 Split-K2 route in those focused runs and may be selected by future per-shape autotuning.
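The percentages quoted for the fp32-accumulation route can be re-derived from the raw figures. The snippet below is a small arithmetic check, assuming per-token latency is the reciprocal of throughput; the helper name `pct_change` is illustrative only.

```python
def pct_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Nsight FC1 kernel time in microseconds (fp32-accumulation route).
fc1_single_us = 21.42
fc1_splitk2_us = 17.59 + 2.39  # partials kernel + reduce/activation kernel

# Averaged CUDA-graph decode throughput from the 3-pair run.
tok_s_splitk2 = 332.099536
tok_s_disabled = 327.857928

kernel_delta = pct_change(fc1_splitk2_us, fc1_single_us)
throughput_delta = pct_change(tok_s_splitk2, tok_s_disabled)
# Assumption: latency per token is 1 / throughput.
latency_delta = pct_change(1.0 / tok_s_splitk2, 1.0 / tok_s_disabled)

print(f"FC1 kernel time: {kernel_delta:+.2f}%")
print(f"throughput:      {throughput_delta:+.2f}%")
print(f"latency:         {latency_delta:+.2f}%")
```

The FC1 kernel time drops by roughly 6.7%, while end-to-end throughput moves by a little over 1%, which is consistent with FC1 being only one part of the decode step.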
Validation
- `cmake --build /home/tianlei/onnxruntime/build/cu130/Release --target onnxruntime_providers_cuda --parallel $(nproc)`
- `lintrunner -a ...`
- `git diff --check`
- `ORT_QMOE_GEMV_BENCHMARK=1 pytest -q onnxruntime/test/python/transformers/test_qmoe_cuda.py::TestQMoEGemvBenchmark::test_splitk2_swiglu_decode_latency`
- `1 passed`
- `ORT_MOE_GEMV_SPLITK2_SWIGLU=0`, latency `0.062995 ms`
- `ORT_MOE_GEMV_SPLITK2_SWIGLU=1`, latency `0.063945 ms`
- `ORT_MOE_GEMV_SPLITK2_SWIGLU=0`, latency `0.071311 ms`
- `ORT_MOE_GEMV_SPLITK2_SWIGLU=1`, latency `0.071726 ms`
- `moe_gemv_interleaved_swiglu_kernel` and `moe_gemv_kernel`
- `moe_gemv_splitk_partials_kernel`, `moe_gemv_splitk_reduce_swiglu_kernel`, and `moe_gemv_kernel`
- `0.049207 ms`, fp16 single-kernel `0.047403 ms`, fp32 Split-K2 `0.052055 ms`
- `0.053503 ms`, fp16 single-kernel `0.050732 ms`, fp32 Split-K2 `0.059571 ms`
- `match_mmlu` smoke on GPT-OSS-20B INT4 QMoE:
  - `0.8380`
  - `0.8350`
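To repeat the focused benchmark across all four combinations of the two environment controls, a small driver along these lines could be used. It is a sketch only: `benchmark_envs` and `benchmark_command` are hypothetical helpers, and it assumes the benchmark test picks up `ORT_MOE_GEMV_FP32_ACCUM` and `ORT_MOE_GEMV_SPLITK2_SWIGLU` from the process environment rather than setting them itself.

```python
import itertools
import os

FP32_ACCUM = "ORT_MOE_GEMV_FP32_ACCUM"
SPLITK2 = "ORT_MOE_GEMV_SPLITK2_SWIGLU"
BENCH_TEST = (
    "onnxruntime/test/python/transformers/test_qmoe_cuda.py"
    "::TestQMoEGemvBenchmark::test_splitk2_swiglu_decode_latency"
)


def benchmark_envs(base_env=None):
    """Return one environment dict per (fp32 accum, Split-K2) combination."""
    base = dict(os.environ if base_env is None else base_env)
    envs = []
    for fp32, splitk2 in itertools.product("01", repeat=2):
        env = dict(base)
        env["ORT_QMOE_GEMV_BENCHMARK"] = "1"  # enables the benchmark test
        env[FP32_ACCUM] = fp32
        env[SPLITK2] = splitk2
        envs.append(env)
    return envs


def benchmark_command():
    """pytest invocation matching the Validation entry above."""
    return ["pytest", "-q", BENCH_TEST]
```

Each environment can then be passed to `subprocess.run(benchmark_command(), env=env, check=True)` from the repository root on a machine with a CUDA build of ONNX Runtime.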