microsoft · tianleiwu · Jun 19, 2026
diff --git a/docs/contrib_ops/cuda/moe_qmoe.md b/docs/contrib_ops/cuda/moe_qmoe.md
@@ -1003,6 +1003,36 @@ CUDA-graph decode run, default fp16 accumulation reached 386.26 tok/s versus
 353.70 tok/s with the fp32 fallback. A 1000-sample MMLU smoke test matched pooled
 accuracy at 0.8260 for both modes.
 
+#### Split-K2 SwiGLU GEMV route
+
+The fp16 INT4 interleaved-SwiGLU GEMV path can use a two-pass Split-K2 FC1 kernel
+for supported decode shapes. `ORT_MOE_GEMV_FP32_ACCUM=1` enables fp32
+accumulation, and `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` enables Split-K2. Both default
+to `0`, so the default route is fp16 accumulation with the single-kernel FC1
+SwiGLU path.
+
+The first pass writes two K-split partials to the QMoE workspace using the same
+accumulator type as the normal GEMV path: fp16 partials when the
+fp16-accumulation route is selected, and fp32 partials with the fp32 fallback.
+The second pass reduces those partials in fp32, adds the optional bias, and
+applies the interleaved SwiGLU epilogue. FC2 stays on the regular
+`moe_gemv_kernel` path.
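+
+The two passes can be illustrated with a small pure-Python model. This is a
+sketch under stated assumptions, not the CUDA kernels: `to_fp16`, `silu`, and
+`fc1_splitk2_swiglu` are hypothetical names, the activation is plain
+`silu(gate) * linear`, the even/odd gate/linear column interleave is assumed,
+INT4 dequantization is omitted, and Python floats stand in for the fp32
+reduction.
+
+```python
+import math
+import struct
+
+
+def to_fp16(x: float) -> float:
+    """Round a Python float to the nearest IEEE fp16 value (stdlib only)."""
+    return struct.unpack("e", struct.pack("e", x))[0]
+
+
+def silu(x: float) -> float:
+    return x / (1.0 + math.exp(-x))
+
+
+def fc1_splitk2_swiglu(x, w, bias=None, fp16_partials=True):
+    """Two-pass FC1 model: x has K values, w is K x 2N with interleaved columns.
+
+    Assumed layout (not confirmed by the source): column 2j is the gate and
+    column 2j+1 is the linear value for output j.
+    """
+    k, cols = len(x), len(w[0])
+    half = k // 2
+    # Pass 1: each half of K accumulates its own partial per column.
+    partials = []
+    for lo, hi in ((0, half), (half, k)):
+        row = []
+        for c in range(cols):
+            acc = 0.0
+            for i in range(lo, hi):
+                acc += x[i] * w[i][c]
+                if fp16_partials:
+                    acc = to_fp16(acc)  # emulate an fp16 accumulator
+            row.append(acc)
+        partials.append(row)
+    # Pass 2: reduce the two partials in wider precision, add the optional
+    # bias, then apply the SwiGLU epilogue to each gate/linear pair.
+    out = []
+    for j in range(cols // 2):
+        gate = partials[0][2 * j] + partials[1][2 * j]
+        linear = partials[0][2 * j + 1] + partials[1][2 * j + 1]
+        if bias is not None:
+            gate += bias[2 * j]
+            linear += bias[2 * j + 1]
+        out.append(silu(gate) * linear)
+    return out
+
+
+if __name__ == "__main__":
+    x = [0.5, -0.25, 0.75, 1.0]
+    w = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6], [0.7, 0.8]]
+    print(fc1_splitk2_swiglu(x, w, fp16_partials=True))
+    print(fc1_splitk2_swiglu(x, w, fp16_partials=False))
+```
+
+Passing `fp16_partials=False` models the fp32 fallback, which makes it easy to
+compare the two accumulation policies on the same inputs.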
+
+Set the two binary knobs before process start for A/B benchmarking or bisecting
+numerical differences. The focused profiler exposes the same controls as
+`--fp32-accum` and `--splitk2-swiglu`. On GPT-OSS-20B with fp32 accumulation,
+Split-K2 cut FC1 kernel time from about 21.42 us to 19.98 us and raised repeated
+CUDA-graph decode throughput by about 0.9% to 1.6%, with valid focused-helper
+output. With default fp16 accumulation the single-kernel path was faster on that
+shape. A 1000-sample MMLU smoke test matched the non-Split-K fallback within
+noise. A future autotuner can replace this routing with per-shape selection.
+
+```bash
+python onnxruntime/test/python/transformers/profile_qmoe_gemv.py \
+  --case gpt_oss_20b_m1_top4_fp16_2880x2880_e32 \
+  --fp32-accum --splitk2-swiglu --warmup 5 --repeat 100 --nvtx
+```
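+
+For A/B runs that cover both knobs, the four environment combinations can be
+generated up front. The following is a minimal sketch: the knob names come from
+this document, while `KNOBS` and `ab_environments` are hypothetical helper names
+introduced only for illustration.
+
+```python
+import itertools
+import os
+
+# Env knobs documented above; each is a binary "0"/"1" switch.
+KNOBS = ("ORT_MOE_GEMV_FP32_ACCUM", "ORT_MOE_GEMV_SPLITK2_SWIGLU")
+
+
+def ab_environments(base_env=None):
+    """Yield (label, env) for all four combinations of the two knobs."""
+    base = dict(os.environ if base_env is None else base_env)
+    for values in itertools.product("01", repeat=len(KNOBS)):
+        env = dict(base)
+        env.update(zip(KNOBS, values))
+        label = ",".join(f"{k}={v}" for k, v in zip(KNOBS, values))
+        yield label, env
+
+
+if __name__ == "__main__":
+    for label, _env in ab_environments():
+        print(label)
+```
+
+Each `env` can be passed to `subprocess.run(..., env=env)` so that the knobs are
+set before the measured process starts.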
+
 #### Experiments rejected after profiling
 
 | Experiment | Why it was rejected |

diff --git a/docs/contrib_ops/cuda/qmoe_gemv_experiments.md b/docs/contrib_ops/cuda/qmoe_gemv_experiments.md
@@ -979,6 +979,224 @@ Every case reported `has_invalid_output=false`.
 - Per-column INT8 W8A16 decode shapes route to GEMV for both FP16 and BF16 and
   beat the grouped-GEMM fallback at every profiled shape.
 
+## 2026-06-19: Split-K2 Two-Pass SwiGLU GEMV Experiment
+
+### Change Under Test
+
+- Code commit: `f1d6718be719c1237be392c0389874b6a8926a3c`
+  (`Experiment QMoE split-K SwiGLU GEMV`).
+- Added Split-K2 route with opt-in env knob:
+  `ORT_MOE_GEMV_SPLITK2_SWIGLU=1`.
+- Scope: FP16 INT4/interleaved-SwiGLU FC1 GEMV path for decode-shaped QMoE.
+- Implementation:
+  - First pass launches `moe_gemv_splitk_partials_kernel` with `SplitK=2` and
+    writes accumulator-typed partials into QMoE workspace. This follows the
+    normal GEMV accumulation policy: fp16 partials for fp16 accumulation, fp32
+    partials for the fp32 fallback.
+  - Second pass launches `moe_gemv_splitk_reduce_swiglu_kernel` to reduce the
+    partials in fp32, add optional bias, and apply SwiGLU.
+  - FC2 remains on the existing `moe_gemv_kernel`.
+  - Scratch is allocated only for the supported Split-K2 route. Leaving
+    `ORT_MOE_GEMV_SPLITK2_SWIGLU` unset or setting it to `0` keeps the previous single-kernel
+    FC1 SwiGLU GEMV path.
+
+### Repro Notes
+
+- Build: `cmake --build build/cu130/Release --target onnxruntime_providers_cuda --parallel $(nproc)`.
+- Important provider sync: Python tests importing from
+  `build/cu130/Release/onnxruntime` load
+  `build/cu130/Release/onnxruntime/capi/libonnxruntime_providers_cuda.so`, rather
+  than the top-level `build/cu130/Release/libonnxruntime_providers_cuda.so` or
+  the venv copy. Sync all relevant copies before measuring:
+
+  ```bash
+  cp build/cu130/Release/libonnxruntime_providers_cuda.so \
+     build/cu130/Release/onnxruntime/capi/libonnxruntime_providers_cuda.so
+  cp build/cu130/Release/libonnxruntime_providers_cuda.so \
+     .venv_cu130/lib/python3.14/site-packages/onnxruntime/capi/libonnxruntime_providers_cuda.so
+  ```
+
+- Focused QMoE helper:
+
+  ```bash
+  cd ~
+  CUDA_VISIBLE_DEVICES=1 \
+  LD_LIBRARY_PATH=~/onnxruntime/build/cu130/Release:~/cuda13.0/lib64:~/cudnn9.19_cuda13/lib:~/cudnn9.19_cuda13/lib64:${LD_LIBRARY_PATH:-} \
+  PYTHONPATH=~/onnxruntime/build/cu130/Release:~/onnxruntime/onnxruntime/test/python/transformers \
+  ~/onnxruntime/.venv_cu130/bin/python \
+  ~/onnxruntime/onnxruntime/test/python/transformers/profile_qmoe_gemv.py \
+    --case gpt_oss_20b_m1_top4_fp16_2880x2880_e32 --warmup 3 --repeat 20
+  ```
+
+### Focused QMoE Smoke
+
+Both modes reported `has_invalid_output=false`.
+
+| Mode | Env | Latency ms |
+|------|-----|------------|
+| Baseline | unset | 0.072344 |
+| Split-K2 | `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` | 0.073816 |
+
+The short helper was slightly slower with Split-K2, so Nsight Systems was used
+to confirm route selection and isolate kernel time.
+
+### Nsight Systems Kernel Results
+
+Artifacts:
+
+- Baseline: `/tmp/qmoe_gptoss_baseline_final.{nsys-rep,sqlite}`
+- Split-K2: `/tmp/qmoe_gptoss_splitk_final.{nsys-rep,sqlite}`
+
+Command shape:
+
+```bash
+~/cuda13.0/bin/nsys profile -t cuda,nvtx --force-overwrite true \
+  -o /tmp/qmoe_gptoss_splitk_final --export=sqlite \
+  ~/onnxruntime/.venv_cu130/bin/python \
+  ~/onnxruntime/onnxruntime/test/python/transformers/profile_qmoe_gemv.py \
+    --case gpt_oss_20b_m1_top4_fp16_2880x2880_e32 --warmup 3 --repeat 30 --nvtx --splitk2-swiglu
+```
+
+Parsed with `parse_nsys.py --nvtx-range benchmark --pattern '%'`.
+
+| Mode | Kernel | Calls | Avg us |
+|------|--------|-------|--------|
+| Baseline | `moe_gemv_interleaved_swiglu_kernel` | 30 | 21.42 |
+| Baseline | `moe_gemv_kernel` | 30 | 12.13 |
+| Split-K2 | `moe_gemv_splitk_partials_kernel` | 30 | 17.59 |
+| Split-K2 | `moe_gemv_splitk_reduce_swiglu_kernel` | 30 | 2.39 |
+| Split-K2 | `moe_gemv_kernel` | 30 | 12.22 |
+
+Split-K2 reduced FC1 kernel time from about `21.42 us` to `17.59 + 2.39 =
+19.98 us`, a net FC1 reduction of about `1.44 us` per QMoE invocation. End-to-end
+helper latency under Nsight was effectively tied:
+
+| Mode | Helper latency ms |
+|------|-------------------|
+| Baseline | 0.079855 |
+| Split-K2 | 0.079728 |
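+
+The kernel arithmetic above can be rechecked directly from the table values.
+This is a small sketch; the dictionary and function names are illustrative,
+and the numbers are the per-kernel averages reported by Nsight in this section.
+
+```python
+# Average kernel times in microseconds, copied from the Nsight table above.
+BASELINE_US = {
+    "moe_gemv_interleaved_swiglu_kernel": 21.42,  # FC1
+    "moe_gemv_kernel": 12.13,  # FC2
+}
+SPLITK2_US = {
+    "moe_gemv_splitk_partials_kernel": 17.59,  # FC1 pass 1
+    "moe_gemv_splitk_reduce_swiglu_kernel": 2.39,  # FC1 pass 2
+    "moe_gemv_kernel": 12.22,  # FC2
+}
+FC2_KERNEL = "moe_gemv_kernel"
+
+
+def fc1_us(kernels):
+    """Sum every kernel except FC2, i.e. the FC1 work for that route."""
+    return sum(t for name, t in kernels.items() if name != FC2_KERNEL)
+
+
+fc1_saving_us = fc1_us(BASELINE_US) - fc1_us(SPLITK2_US)
+total_saving_us = sum(BASELINE_US.values()) - sum(SPLITK2_US.values())
+fc1_saving_pct = 100.0 * fc1_saving_us / fc1_us(BASELINE_US)
+
+print(f"FC1 saving: {fc1_saving_us:.2f} us ({fc1_saving_pct:.1f}%)")
+print(f"FC1 + FC2 saving: {total_saving_us:.2f} us")
+```
+
+The FC1 + FC2 figure is slightly smaller than the FC1 figure because the FC2
+average differs by `0.09 us` between the two profiles.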
+
+### Model-Level Decode Benchmark With CUDA Graph
+
+Model-level decode was then measured on the GPT-OSS-20B INT4 QMoE model
+package. Both runs used CUDA graph enabled, XQA enabled, and deterministic MoE
+tactic selection:
+
+```bash
+MODEL=models/gpt-oss-20b/variants/cuda_int4_int4_qmoe_rtn_matmul_only \
+GPU=0 PROMPT_LEN=512 GEN_LEN=128 REPS=10 WARMUP=3 CUDA_GRAPH=1 XQA=1 SYNC_LIB=1 \
+ORT_FORCE_DETERMINISTIC_MOE=1 \
+bash scripts/bench_gpt_oss_ort_decode.sh
+```
+
+Baseline left `ORT_MOE_GEMV_SPLITK2_SWIGLU` unset.
+
+| Run | Mode | Decode latency ms/token | Decode throughput tok/s |
+|-----|------|-------------------------|-------------------------|
+| R1, `REPS=5`, `WARMUP=2` | Baseline | 2.869450 | 348.498901 |
+| R1, `REPS=5`, `WARMUP=2` | Split-K2 | 2.823800 | 354.132707 |
+| R2, `REPS=10`, `WARMUP=3` | Baseline | 2.865840 | 348.937861 |
+| R2, `REPS=10`, `WARMUP=3` | Split-K2 | 2.839335 | 352.195107 |
+
+The longer CUDA-graph pair showed about `+0.9%` decode throughput. The shorter
+pair showed about `+1.6%`. Since the focused helper reported valid output and
+the model-level gain repeated in the same direction, even this modest gain is
+worth enabling for GPT-OSS-20B decode while keeping an opt-out for A/B checks.
+
+After testing Split-K2 as the selected route, three more paired
+CUDA-graph model runs were collected with `REPS=10`, `WARMUP=3`, prompt length
+512, and generation length 128:
+
+| Run | Mode | Decode latency ms/token | Decode throughput tok/s |
+|-----|------|-------------------------|-------------------------|
+| R3 | Default Split-K2 | 3.017252 | 331.427448 |
+| R3 | Split-K2 disabled | 3.055736 | 327.253380 |
+| R4 | Default Split-K2 | 3.006739 | 332.586260 |
+| R4 | Split-K2 disabled | 3.047570 | 328.130314 |
+| R5 | Default Split-K2 | 3.009466 | 332.284898 |
+| R5 | Split-K2 disabled | 3.047015 | 328.190090 |
+| Average | Default Split-K2 | 3.011152 | 332.099536 |
+| Average | Split-K2 disabled | 3.050107 | 327.857928 |
+
+The default Split-K2 route was faster in all three pairs, averaging `+1.29%`
+decode throughput and `-1.28%` decode latency versus the opt-out fallback.
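+
+The averages and percentages can be reproduced from the per-run latencies. The
+snippet below is a sketch with illustrative variable names; it assumes
+throughput is `1000 / latency_ms`, which matches the table rows.
+
+```python
+from statistics import mean
+
+# (Split-K2 ms/token, Split-K2-disabled ms/token) for R3, R4, R5 above.
+PAIRS = [
+    (3.017252, 3.055736),
+    (3.006739, 3.047570),
+    (3.009466, 3.047015),
+]
+
+
+def tok_per_s(ms_per_token):
+    """Convert decode latency in ms/token to throughput in tok/s."""
+    return 1000.0 / ms_per_token
+
+
+splitk2_ms = mean(p[0] for p in PAIRS)
+disabled_ms = mean(p[1] for p in PAIRS)
+splitk2_tps = mean(tok_per_s(p[0]) for p in PAIRS)
+disabled_tps = mean(tok_per_s(p[1]) for p in PAIRS)
+
+latency_change_pct = 100.0 * (splitk2_ms / disabled_ms - 1.0)
+throughput_change_pct = 100.0 * (splitk2_tps / disabled_tps - 1.0)
+
+print(f"latency change: {latency_change_pct:+.2f}%")
+print(f"throughput change: {throughput_change_pct:+.2f}%")
+```
+
+Averaging per-run throughput, rather than inverting the average latency,
+matches how the `Average` rows in the table were formed.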
+
+### FP16 Accumulation Follow-Up
+
+After the normal QMoE GEMV path changed to use fp16 accumulation by default, the
+Split-K2 route was rechecked on the same GPT-OSS-20B decode shape. Sequential
+focused-helper runs with `--repeat 100` showed Split-K2 behind the single-kernel
+path:
+
+| Run | Mode | Latency ms/inference |
+|-----|------|----------------------|
+| R1 | Split-K2 | 0.061761 |
+| R1 | Split-K2 disabled | 0.060108 |
+| R2 | Split-K2 | 0.062862 |
+| R2 | Split-K2 disabled | 0.060989 |
+| R3 | Split-K2 | 0.064595 |
+| R3 | Split-K2 disabled | 0.060464 |
+| Average | Split-K2 | 0.063073 |
+| Average | Split-K2 disabled | 0.060520 |
+
+A short CUDA-graph model-level pair with `REPS=5`, `WARMUP=2`, prompt length
+512, and generation length 128 showed the same direction:
+
+| Mode | Decode latency ms/token | Decode throughput tok/s |
+|------|-------------------------|-------------------------|
+| Split-K2 | 2.848148 | 351.105318 |
+| Split-K2 disabled | 2.816800 | 355.012723 |
+
+Although the single-kernel path was faster for this GPT-OSS focused helper,
+Split-K2 with default fp16 accumulation was still faster than the
+`ORT_MOE_GEMV_FP32_ACCUM=1` Split-K2 route. The fp16 Split-K2 variant is kept so
+a future autotuner can choose it for shapes where the extra K parallelism wins.
+
+The same focused profiler check was run for Qwen3.6-35B-A3B and Gemma4-26B-A4B
+decode-shaped configs with `--repeat 100`:
+
+| Case | Mode | Latency ms/inference |
+|------|------|----------------------|
+| Qwen3.6-35B-A3B | fp16 Split-K2 | 0.049207 |
+| Qwen3.6-35B-A3B | fp16 Split-K2 disabled | 0.047403 |
+| Qwen3.6-35B-A3B | fp32 Split-K2 | 0.052055 |
+| Gemma4-26B-A4B | fp16 Split-K2 | 0.053503 |
+| Gemma4-26B-A4B | fp16 Split-K2 disabled | 0.050732 |
+| Gemma4-26B-A4B | fp32 Split-K2 | 0.059571 |
+
+Both additional shapes produced valid output. In these focused helper runs,
+fp16 Split-K2 again sat between the fp16 single-kernel path and the
+`ORT_MOE_GEMV_FP32_ACCUM=1` Split-K2 path.
+
+### Accuracy Smoke
+
+A 1000-sample `match_mmlu` smoke test was run with the local parallel eval
+harness on all eight H200 GPUs, using the same GPT-OSS-20B INT4 QMoE model
+package and the current ORT build package. The Split-K2 run scored `0.8380`
+pooled accuracy; the non-Split-K fallback with `ORT_MOE_GEMV_SPLITK2_SWIGLU`
+unset scored `0.8350`. The `0.0030` gap (three samples) is within smoke-test
+noise, so enabling Split-K2 shows no accuracy regression signal.
+
+### Decision
+
+- Keep Split-K2 available as an opt-in route (`ORT_MOE_GEMV_SPLITK2_SWIGLU=1`)
+  for its supported fp16 INT4 interleaved-SwiGLU GEMV scope.
+- Keep the fp16-accumulation Split-K2 variant available. It is slower than the
+  single-kernel fp16-accumulation path on the GPT-OSS shape, but faster than the
+  fp32-accumulation Split-K2 route and may be selected by future per-shape
+  autotuning.
+- Use two binary route controls: `ORT_MOE_GEMV_FP32_ACCUM=1` enables fp32
+  accumulation, and `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` enables Split-K2. Both
+  default to `0`.
+- The 1000-sample MMLU smoke test matched the non-Split-K fallback within
+  noise, so the Split-K2 route has an accuracy sanity check in addition to
+  focused-helper valid output.
+- Future work:
+  - Add per-shape autotune so route selection is data-driven instead of a fixed
+    default.
+  - Try a launch-fused reduction strategy or cooperative approach to keep the
+    FC1 parallelism benefit without the extra reduce launch.
+
 ## 2026-06-19 FP16 Accumulation Default: SM90, GPT-OSS Decode Shape
 
 ### Setup