
[CUDA]: Split-K2 QMoE SwiGLU GEMV kernel #29167

Open
tianleiwu wants to merge 4 commits into main from tlwu/20260619/qmoe_splitk_gemv



tianleiwu commented on Jun 19, 2026

Contributor

Description

Add a CUDA QMoE Split-K2 two-pass FC1 interleaved-SwiGLU GEMV implementation for supported fp16 INT4 decode-shaped workloads. The first pass computes two K-split partials into QMoE workspace using the selected GEMV accumulator type, and the second pass reduces the partials in fp32, applies optional bias, and writes the SwiGLU output for FC2. FC2 stays on the existing moe_gemv_kernel path.
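The two-pass structure described above can be sketched in plain Python. This is only an illustration of the arithmetic, not the CUDA kernels in this PR: the function names are hypothetical, the row layout (gate on even rows, up projection on odd rows) and the plain `silu(gate) * up` form of SwiGLU are assumptions, and quantization, scales, and accumulator precision are not modeled.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def splitk2_partials(weights, x):
    """Pass 1 (sketch): per output row, one partial dot product per half of K."""
    half = len(x) // 2
    partials = []
    for row in weights:
        lo = sum(w * a for w, a in zip(row[:half], x[:half]))
        hi = sum(w * a for w, a in zip(row[half:], x[half:]))
        partials.append((lo, hi))
    return partials


def reduce_swiglu(partials, bias=None):
    """Pass 2 (sketch): sum the two partials, add optional bias, apply SwiGLU."""
    rows = [lo + hi for lo, hi in partials]
    if bias is not None:
        rows = [r + b for r, b in zip(rows, bias)]
    # ASSUMPTION: interleaved layout with gate on even rows and up on odd rows.
    return [silu(rows[i]) * rows[i + 1] for i in range(0, len(rows), 2)]


def single_pass_swiglu(weights, x, bias=None):
    """Reference: full-K dot product per row followed by the same SwiGLU."""
    full = [(sum(w * a for w, a in zip(row, x)), 0.0) for row in weights]
    return reduce_swiglu(full, bias)


# Small INT4-range weights and exactly representable activations (illustrative only).
weights = [[1, -2, 3, 4], [0, 1, -1, 2], [2, 2, -3, 1], [-1, 0, 1, 1]]
x = [0.5, -1.0, 2.0, 0.25]
bias = [0.5, 0.0, -0.25, 1.0]

partials = splitk2_partials(weights, x)
two_pass = reduce_swiglu(partials, bias)
reference = single_pass_swiglu(weights, x, bias)
```

With these inputs the two-pass result matches the single-pass reference; on real hardware the two routes can differ slightly because the summation order and accumulator type differ.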

The route now uses two binary environment controls. ORT_MOE_GEMV_FP32_ACCUM=1 enables fp32 accumulation, and ORT_MOE_GEMV_SPLITK2_SWIGLU=1 enables Split-K2. Both default to 0.

| Accumulation control | Split-K2 control | Route |
| --- | --- | --- |
| unset or 0 | unset or 0 | fp16 accumulation, single-kernel FC1 SwiGLU |
| unset or 0 | ORT_MOE_GEMV_SPLITK2_SWIGLU=1 | fp16 accumulation, Split-K2 FC1 SwiGLU |
| ORT_MOE_GEMV_FP32_ACCUM=1 | unset or 0 | fp32 accumulation, single-kernel FC1 SwiGLU |
| ORT_MOE_GEMV_FP32_ACCUM=1 | ORT_MOE_GEMV_SPLITK2_SWIGLU=1 | fp32 accumulation, Split-K2 FC1 SwiGLU |
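
The routing table above amounts to two independent binary switches. A minimal Python sketch of that decision follows; `env_flag` and `select_fc1_route` are hypothetical names, and treating only the exact string "1" as enabled is an assumption about parsing, not the provider's actual code. The additional gating on fp16 INT4 interleaved-SwiGLU shapes is not modeled here.

```python
def env_flag(env: dict, name: str) -> bool:
    """Return True only when the variable is set to "1".

    ASSUMPTION: unset or "0" means disabled, as in the table; other values
    are treated as disabled here, which may differ from the real parser.
    """
    return env.get(name, "0") == "1"


def select_fc1_route(env: dict) -> tuple:
    """Map the two environment controls to (accumulation type, FC1 route)."""
    accum = "fp32" if env_flag(env, "ORT_MOE_GEMV_FP32_ACCUM") else "fp16"
    fc1 = "Split-K2" if env_flag(env, "ORT_MOE_GEMV_SPLITK2_SWIGLU") else "single-kernel"
    return accum, fc1


default_route = select_fc1_route({})
forced_route = select_fc1_route(
    {"ORT_MOE_GEMV_FP32_ACCUM": "1", "ORT_MOE_GEMV_SPLITK2_SWIGLU": "1"}
)
```

In a real run the dictionary would be `os.environ`; a plain dict is used here so the four rows can be checked without touching the process environment.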

This PR also:

  • keeps Split-K2 narrowly gated to fp16 INT4 interleaved-SwiGLU GEMV with activation/bias scale type matching the activation type;
  • adds QMoE workspace plumbing for the Split-K2 partials;
  • updates the focused QMoE profiler and Nsight wrapper with matching --fp32-accum and --splitk2-swiglu controls;
  • adds focused benchmark coverage that explicitly forces the Split-K2 route under the fp16-default policy;
  • documents the routing policy, measurements, binary knobs, and future autotune direction.

Motivation and Context

GPT-OSS-20B single-token decode spends visible time in the QMoE FC1 interleaved-SwiGLU GEMV path. Split-K2 improves FC1 parallelism by splitting the K dimension and reducing the partials in a lightweight second pass.

Under the fp32-accumulation route, Split-K2 reduced FC1 kernel work from about 21.42 us to 17.59 + 2.39 = 19.98 us in Nsight, and repeated CUDA-graph GPT-OSS decode pairs showed about +0.9% to +1.6% throughput improvement. A later 3-pair CUDA-graph run averaged 332.099536 tok/s for Split-K2 versus 327.857928 tok/s with Split-K2 disabled (+1.29% throughput, -1.28% latency), with no MMLU smoke regression signal.
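
For readers who want to check the percentages, the arithmetic behind the figures in the paragraph above can be reproduced as follows. The variable names are illustrative, and the latency change assumes per-token latency is the reciprocal of throughput.

```python
# Nsight FC1 kernel times from the paragraph above, in microseconds.
fc1_single_us = 21.42
splitk_pass_us = (17.59, 2.39)  # the two Split-K2 passes

splitk_total_us = sum(splitk_pass_us)
fc1_saving_pct = 100.0 * (fc1_single_us - splitk_total_us) / fc1_single_us

# 3-pair CUDA-graph decode averages, in tokens per second.
tok_s_splitk2 = 332.099536
tok_s_disabled = 327.857928

throughput_gain_pct = 100.0 * (tok_s_splitk2 / tok_s_disabled - 1.0)
# ASSUMPTION: latency per token = 1 / throughput, so the ratio inverts.
latency_change_pct = 100.0 * (tok_s_disabled / tok_s_splitk2 - 1.0)

print(f"FC1 kernel time: {splitk_total_us:.2f} us ({fc1_saving_pct:.1f}% lower)")
print(f"throughput: {throughput_gain_pct:+.2f}%, latency: {latency_change_pct:+.2f}%")
```

The FC1 kernel saving works out to roughly 6.7%, which is larger than the end-to-end gain because FC1 is only one part of each decode step.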

After the normal fp16 QMoE GEMV path changed to fp16 accumulation by default, the single-kernel fp16 route became faster on the focused GPT-OSS, Qwen3.6-35B-A3B, and Gemma4-26B-A4B helper configurations. The fp16 Split-K2 variant is still kept because it is faster than the fp32 Split-K2 route in those focused runs and may be selected by future per-shape autotuning.

Validation

  • Built and synced CUDA provider:
    • cmake --build /home/tianlei/onnxruntime/build/cu130/Release --target onnxruntime_providers_cuda --parallel $(nproc)
  • Lint/format:
    • lintrunner -a ...
    • git diff --check
  • Focused CUDA test:
    • ORT_QMOE_GEMV_BENCHMARK=1 pytest -q onnxruntime/test/python/transformers/test_qmoe_cuda.py::TestQMoEGemvBenchmark::test_splitk2_swiglu_decode_latency
    • result: 1 passed
  • Focused GPT-OSS helper route checks:
    • default fp16: valid output, ORT_MOE_GEMV_SPLITK2_SWIGLU=0, latency 0.062995 ms
    • fp16 accumulation with Split-K2: valid output, ORT_MOE_GEMV_SPLITK2_SWIGLU=1, latency 0.063945 ms
    • fp32 accumulation without Split-K2: valid output, ORT_MOE_GEMV_SPLITK2_SWIGLU=0, latency 0.071311 ms
    • fp32 accumulation with Split-K2: valid output, ORT_MOE_GEMV_SPLITK2_SWIGLU=1, latency 0.071726 ms
  • Nsight route verification:
    • default fp16 dispatched moe_gemv_interleaved_swiglu_kernel and moe_gemv_kernel
    • fp16 accumulation with Split-K2 dispatched moe_gemv_splitk_partials_kernel, moe_gemv_splitk_reduce_swiglu_kernel, and moe_gemv_kernel
    • fp32 accumulation without Split-K2 dispatched the single FC1 SwiGLU kernel and FC2
    • fp32 accumulation with Split-K2 enabled dispatched Split-K2 partial/reduce kernels and FC2
  • Additional focused helper checks, all with valid output:
    • Qwen3.6-35B-A3B: fp16 Split-K2 0.049207 ms, fp16 single-kernel 0.047403 ms, fp32 Split-K2 0.052055 ms
    • Gemma4-26B-A4B: fp16 Split-K2 0.053503 ms, fp16 single-kernel 0.050732 ms, fp32 Split-K2 0.059571 ms
  • 1000-sample match_mmlu smoke on GPT-OSS-20B INT4 QMoE:
    • Split-K2 route: 0.8380
    • Split-K2 disabled: 0.8350
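
The helper latencies and MMLU scores listed above can be compared with a few lines of Python. This sketch only restates the reported numbers; the dictionary layout and names are illustrative, and the standard-error estimate assumes the two MMLU runs behave like independent binomial samples of 1000 questions, which is a simplification.

```python
import math

# Focused helper latencies in milliseconds, copied from the validation list.
latency_ms = {
    "GPT-OSS": {"fp16 single": 0.062995, "fp16 Split-K2": 0.063945,
                "fp32 single": 0.071311, "fp32 Split-K2": 0.071726},
    "Qwen3.6-35B-A3B": {"fp16 single": 0.047403, "fp16 Split-K2": 0.049207,
                        "fp32 Split-K2": 0.052055},
    "Gemma4-26B-A4B": {"fp16 single": 0.050732, "fp16 Split-K2": 0.053503,
                       "fp32 Split-K2": 0.059571},
}

# Relative cost of fp16 Split-K2 versus the fp16 single-kernel route.
fp16_splitk2_overhead_pct = {
    model: 100.0 * (r["fp16 Split-K2"] / r["fp16 single"] - 1.0)
    for model, r in latency_ms.items()
}
fastest_route = {model: min(r, key=r.get) for model, r in latency_ms.items()}

# MMLU smoke scores on 1000 samples.
n = 1000
acc_splitk2, acc_disabled = 0.8380, 0.8350
diff = acc_splitk2 - acc_disabled
# ASSUMPTION: independent binomial samples; a paired analysis would differ.
std_err = math.sqrt(acc_splitk2 * (1 - acc_splitk2) / n
                    + acc_disabled * (1 - acc_disabled) / n)

for model, pct in fp16_splitk2_overhead_pct.items():
    print(f"{model}: fp16 Split-K2 is {pct:+.1f}% vs fp16 single-kernel")
print(f"MMLU difference {diff:+.4f}, approx. standard error {std_err:.4f}")
```

Under that assumption the 0.003 accuracy gap is well inside one standard error, consistent with reading the smoke test as showing no regression signal rather than an improvement.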

tianleiwu changed the title from "Perf(cuda): enable Split-K2 QMoE SwiGLU GEMV" to "[CUDA]: Enable Split-K2 QMoE SwiGLU GEMV" on Jun 19, 2026
tianleiwu requested a review from Copilot on June 19, 2026 20:39

Copilot AI left a comment

Contributor


Pull request overview

Enables the CUDA QMoE FC1 interleaved-SwiGLU GEMV Split-K2 two-pass path by default (for the narrowly supported FP16 + INT4 configuration), adds an opt-out environment knob, and updates profiling/benchmarking utilities and documentation to reflect the new default route.

Changes:

  • Adds Split-K2 two-pass SwiGLU GEMV support (partials kernel + reduction/activation kernel) and wires it into the existing GEMV dispatch with an ORT_DISABLE_MOE_GEMV_SPLITK2_SWIGLU=1 opt-out.
  • Extends QMoE GEMV profiling scripts and benchmark output metadata to include Split-K2 enable/disable control.
  • Adds focused benchmark coverage and documents the experiment results, fallback knob, and future autotuning direction.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/python/transformers/test_qmoe_cuda.py | Adds a focused benchmark test case and reports Split-K2 disable state in JSON results. |
| onnxruntime/test/python/transformers/profile_qmoe_gemv.sh | Adds CLI flag to disable Split-K2 for profiling A/B comparisons. |
| onnxruntime/test/python/transformers/profile_qmoe_gemv.py | Adds profiling CLI flags to set/unset Split-K2 disable env var (and a deprecated compatibility flag). |
| onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_kernels.h | Extends GEMM1 plumbing to pass Split-K partials workspace pointer. |
| onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_kernels.cu | Adds env gating, workspace sizing/plumbing, and routes GEMV to pass partials workspace when Split-K2 is enabled. |
| onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.h | Extends the interleaved-SwiGLU GEMV launcher signature to accept Split-K partials workspace. |
| onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu | Implements Split-K2 partials + reduction/activation kernels and dispatch path. |
| docs/contrib_ops/cuda/qmoe_gemv_experiments.md | Documents Split-K2 experiment methodology, measurements, and results. |
| docs/contrib_ops/cuda/moe_qmoe.md | Documents the new default Split-K2 route and opt-out knob. |

Comment thread onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu Outdated
Comment thread onnxruntime/test/python/transformers/profile_qmoe_gemv.sh
tianleiwu marked this pull request as draft on June 19, 2026 22:09
tianleiwu changed the title from "[CUDA]: Enable Split-K2 QMoE SwiGLU GEMV" to "[CUDA]: Split-K2 QMoE SwiGLU GEMV kernel" on Jun 19, 2026
tianleiwu marked this pull request as ready for review on June 20, 2026 00:00
2 participants