30 changes: 30 additions & 0 deletions docs/contrib_ops/cuda/moe_qmoe.md
@@ -1003,6 +1003,36 @@ CUDA-graph decode run, default fp16 accumulation reached 386.26 tok/s versus
353.70 tok/s with the fp32 fallback. A 1000-sample MMLU smoke test matched pooled
accuracy at 0.8260 for both modes.

#### Split-K2 SwiGLU GEMV route

The fp16 INT4 interleaved-SwiGLU GEMV path can use a two-pass Split-K2 FC1 kernel
for supported decode shapes. `ORT_MOE_GEMV_FP32_ACCUM=1` enables fp32
accumulation, and `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` enables Split-K2. Both default
to `0`, so the default route is fp16 accumulation with the single-kernel FC1
SwiGLU path.
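
The two knobs give four route combinations. A minimal sketch of that mapping follows. The `resolve_route` helper and its labels are hypothetical, and it assumes a knob counts as enabled only when its value is exactly `1`; the parsing inside ONNX Runtime may differ.

```python
def resolve_route(env):
    """Map the two documented env knobs to an (accumulation, FC1 path) pair.

    Assumption: a knob is enabled only when its value is exactly "1".
    This is an illustration, not the ORT implementation.
    """
    fp32_accum = env.get("ORT_MOE_GEMV_FP32_ACCUM", "0") == "1"
    splitk2 = env.get("ORT_MOE_GEMV_SPLITK2_SWIGLU", "0") == "1"
    accumulation = "fp32" if fp32_accum else "fp16"
    fc1_path = "split-k2 two-pass" if splitk2 else "single-kernel"
    return accumulation, fc1_path
```

Calling `resolve_route(os.environ)` in a launcher script would report which combination a benchmark process is about to use.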

The first pass computes two K-split
partials into QMoE workspace using the same accumulator type as the normal GEMV
path: fp16 activations use fp16 partials when the fp16-accumulation route is
selected, and the fp32 fallback uses fp32 partials. The second pass reduces those
partials in fp32, adds optional bias, and applies the interleaved SwiGLU
epilogue. FC2 stays on the regular `moe_gemv_kernel` path.
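
As a rough illustration of the two passes, the sketch below uses plain Python floats in place of dequantized INT4 weights and emulates fp16 rounding with `struct`. The function names, the adjacent `(gate, up)` column layout, and the plain `gate * sigmoid(gate) * up` epilogue are assumptions for illustration; the actual kernels may differ in all three.

```python
import math
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def splitk2_partials(x, w, fp16_accum=True):
    """Pass 1: two K-split partial dot products per output column.

    x is a length-K activation vector, w is a K x N weight matrix.
    Returns partials[2][N], one row per K half.
    """
    k, n = len(x), len(w[0])
    half = k // 2
    partials = []
    for lo, hi in ((0, half), (half, k)):
        row = []
        for col in range(n):
            acc = 0.0
            for i in range(lo, hi):
                acc += x[i] * w[i][col]
                if fp16_accum:
                    acc = to_fp16(acc)  # emulate an fp16 accumulator
            row.append(acc)
        partials.append(row)
    return partials


def reduce_swiglu(partials, bias=None):
    """Pass 2: reduce in wider precision, add optional bias, apply SwiGLU.

    Assumes columns are interleaved as (gate, up) pairs.
    """
    n = len(partials[0])
    summed = [
        partials[0][c] + partials[1][c] + (bias[c] if bias else 0.0)
        for c in range(n)
    ]
    out = []
    for j in range(0, n, 2):
        gate, up = summed[j], summed[j + 1]
        out.append(gate / (1.0 + math.exp(-gate)) * up)
    return out
```

With `fp16_accum=False` the partials stay in full precision, which mirrors the fp32 fallback only loosely because Python floats are 64-bit.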

Set the two binary knobs before process start for A/B benchmarking or bisecting
numerical differences. The focused profiler exposes the same controls as
`--fp32-accum` and `--splitk2-swiglu`. On GPT-OSS-20B, Split-K2
reduced FC1 kernel work from about 21.42 us to 19.98 us in the fp32-accumulation
route and improved repeated CUDA-graph decode throughput by about 0.9% to 1.6%,
and the focused helper reported valid output. A 1000-sample MMLU smoke test
matched the non-Split-K fallback within noise. A future autotuner can replace
this hand-selected routing with per-shape route selection.

```bash
onnxruntime/test/python/transformers/profile_qmoe_gemv.py \
--case gpt_oss_20b_m1_top4_fp16_2880x2880_e32 \
--fp32-accum --splitk2-swiglu --warmup 5 --repeat 100 --nvtx
```
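
For A/B runs it can help to generate all four flag combinations from one place. The sketch below only builds the argument lists; the `ab_commands` helper, its labels, and launching through `sys.executable` are assumptions, while the script path, case name, and flags are the ones shown above.

```python
import itertools
import sys

PROFILER = "onnxruntime/test/python/transformers/profile_qmoe_gemv.py"
CASE = "gpt_oss_20b_m1_top4_fp16_2880x2880_e32"


def ab_commands(warmup=5, repeat=100):
    """Return {label: argv} for the accumulation x Split-K2 combinations."""
    commands = {}
    for fp32, splitk2 in itertools.product((False, True), (False, True)):
        argv = [
            sys.executable, PROFILER,
            "--case", CASE,
            "--warmup", str(warmup),
            "--repeat", str(repeat),
        ]
        if fp32:
            argv.append("--fp32-accum")
        if splitk2:
            argv.append("--splitk2-swiglu")
        label = ("fp32" if fp32 else "fp16") + ("+splitk2" if splitk2 else "")
        commands[label] = argv
    return commands
```

Each `argv` can be passed to `subprocess.run` in a fresh process, so every combination starts from a clean state.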

#### Experiments rejected after profiling

| Experiment | Why it was rejected |
218 changes: 218 additions & 0 deletions docs/contrib_ops/cuda/qmoe_gemv_experiments.md
@@ -979,6 +979,224 @@ Every case reported `has_invalid_output=false`.
- Per-column INT8 W8A16 decode shapes route to GEMV for both FP16 and BF16 and
beat the grouped-GEMM fallback at every profiled shape.

## 2026-06-19: Split-K2 Two-Pass SwiGLU GEMV Experiment

### Change Under Test

- Code commit: `f1d6718be719c1237be392c0389874b6a8926a3c`
(`Experiment QMoE split-K SwiGLU GEMV`).
- Added Split-K2 route with opt-in env knob:
`ORT_MOE_GEMV_SPLITK2_SWIGLU=1`.
- Scope: FP16 INT4/interleaved-SwiGLU FC1 GEMV path for decode-shaped QMoE.
- Implementation:
  - First pass launches `moe_gemv_splitk_partials_kernel` with `SplitK=2` and
    writes accumulator-typed partials into QMoE workspace. This follows the
    normal GEMV accumulation policy: fp16 partials for fp16 accumulation, fp32
    partials for the fp32 fallback.
  - Second pass launches `moe_gemv_splitk_reduce_swiglu_kernel` to reduce the
    partials in fp32, add optional bias, and apply SwiGLU.
  - FC2 remains on the existing `moe_gemv_kernel`.
- Scratch is allocated only for the supported Split-K2 route. Leaving
`ORT_MOE_GEMV_SPLITK2_SWIGLU` unset or setting it to `0` keeps the previous single-kernel
FC1 SwiGLU GEMV path.
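
To get a feel for the scratch cost of accumulator-typed partials, here is a back-of-the-envelope sizing sketch. It assumes one partial per `(split, row, FC1 output column)` with no padding or alignment; the helper name and that layout are hypothetical and not taken from the implementation.

```python
def splitk_scratch_bytes(rows, fc1_columns, fp32_accum=False, split_k=2):
    """Estimate bytes for split_k partial buffers of shape [rows, fc1_columns].

    Assumption: fp16 partials take 2 bytes and fp32 partials take 4 bytes,
    with no extra padding. The real workspace layout may differ.
    """
    bytes_per_partial = 4 if fp32_accum else 2
    return split_k * rows * fc1_columns * bytes_per_partial
```

Under this assumption the fp32 fallback writes and re-reads twice as many partial bytes as the fp16 route for the same shape.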

### Repro Notes

- Build: `cmake --build build/cu130/Release --target onnxruntime_providers_cuda --parallel $(nproc)`.
- Important provider sync: Python tests importing from
`build/cu130/Release/onnxruntime` load
`build/cu130/Release/onnxruntime/capi/libonnxruntime_providers_cuda.so`, not
only the top-level `build/cu130/Release/libonnxruntime_providers_cuda.so` or
the venv copy. Sync all relevant copies before measuring:

```bash
cp build/cu130/Release/libonnxruntime_providers_cuda.so \
build/cu130/Release/onnxruntime/capi/libonnxruntime_providers_cuda.so
cp build/cu130/Release/libonnxruntime_providers_cuda.so \
.venv_cu130/lib/python3.14/site-packages/onnxruntime/capi/libonnxruntime_providers_cuda.so
```

- Focused QMoE helper:

```bash
cd ~
CUDA_VISIBLE_DEVICES=1 \
LD_LIBRARY_PATH=~/onnxruntime/build/cu130/Release:~/cuda13.0/lib64:~/cudnn9.19_cuda13/lib:~/cudnn9.19_cuda13/lib64:${LD_LIBRARY_PATH:-} \
PYTHONPATH=~/onnxruntime/build/cu130/Release:~/onnxruntime/onnxruntime/test/python/transformers \
~/onnxruntime/.venv_cu130/bin/python \
~/onnxruntime/onnxruntime/test/python/transformers/profile_qmoe_gemv.py \
--case gpt_oss_20b_m1_top4_fp16_2880x2880_e32 --warmup 3 --repeat 20
```
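
Because a stale provider library silently measures the old code, the sync step above can be checked by comparing file hashes before a run. This is a small sketch; the `stale_copies` helper is hypothetical and takes the paths to compare as arguments.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def stale_copies(reference, copies):
    """Return the copies that are missing or differ from the reference build."""
    reference_digest = sha256_of(reference)
    return [
        str(copy)
        for copy in copies
        if not Path(copy).is_file() or sha256_of(copy) != reference_digest
    ]
```

An empty result means every listed copy matches the freshly built `libonnxruntime_providers_cuda.so`.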

### Focused QMoE Smoke

Both modes reported `has_invalid_output=false`.

| Mode | Env | Latency ms |
|------|-----|------------|
| Baseline | unset | 0.072344 |
| Split-K2 | `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` | 0.073816 |

The short helper run was slightly slower with Split-K2, so Nsight was required to
confirm route selection and isolate kernel time.

### Nsight Systems Kernel Results

Artifacts:

- Baseline: `/tmp/qmoe_gptoss_baseline_final.{nsys-rep,sqlite}`
- Split-K2: `/tmp/qmoe_gptoss_splitk_final.{nsys-rep,sqlite}`

Command shape:

```bash
~/cuda13.0/bin/nsys profile -t cuda,nvtx --force-overwrite true \
-o /tmp/qmoe_gptoss_splitk_final --export=sqlite \
~/onnxruntime/.venv_cu130/bin/python \
~/onnxruntime/onnxruntime/test/python/transformers/profile_qmoe_gemv.py \
--case gpt_oss_20b_m1_top4_fp16_2880x2880_e32 --warmup 3 --repeat 30 --nvtx --splitk2-swiglu
```

Parsed with `parse_nsys.py --nvtx-range benchmark --pattern '%'`.

| Mode | Kernel | Calls | Avg us |
|------|--------|-------|--------|
| Baseline | `moe_gemv_interleaved_swiglu_kernel` | 30 | 21.42 |
| Baseline | `moe_gemv_kernel` | 30 | 12.13 |
| Split-K2 | `moe_gemv_splitk_partials_kernel` | 30 | 17.59 |
| Split-K2 | `moe_gemv_splitk_reduce_swiglu_kernel` | 30 | 2.39 |
| Split-K2 | `moe_gemv_kernel` | 30 | 12.22 |
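
Per-kernel averages like the ones above can be pulled straight from the exported SQLite file. The sketch below is not `parse_nsys.py`; it is a minimal stand-in that assumes the usual Nsight Systems export tables `CUPTI_ACTIVITY_KIND_KERNEL` and `StringIds`, treats timestamps as nanoseconds, and does not filter by NVTX range.

```python
import sqlite3
from contextlib import closing

QUERY = """
SELECT s.value AS kernel,
       COUNT(*) AS calls,
       AVG(k."end" - k."start") / 1000.0 AS avg_us
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON s.id = k.shortName
WHERE s.value LIKE ?
GROUP BY s.value
ORDER BY avg_us DESC
"""


def kernel_averages(sqlite_path, pattern="%"):
    """Return (kernel name, call count, average microseconds) rows.

    pattern is a SQL LIKE pattern matched against the short kernel name.
    """
    with closing(sqlite3.connect(sqlite_path)) as conn:
        return conn.execute(QUERY, (pattern,)).fetchall()
```

Without the NVTX filter, warmup iterations are included, so call counts and averages will not match a range-restricted parse.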

Split-K2 reduced FC1 kernel work from about `21.42 us` to `17.59 + 2.39 =
19.98 us`, a net FC1 reduction of about `1.44 us` per QMoE invocation. End-to-end
under Nsight was effectively tied:

| Mode | Helper latency ms |
|------|-------------------|
| Baseline | 0.079855 |
| Split-K2 | 0.079728 |

### Model-Level Decode Benchmark With CUDA Graph

The user requested model-level measurement assuming CUDA graph. Both runs used
the GPT-OSS-20B INT4 QMoE model package, CUDA graph enabled, XQA enabled, and
deterministic MoE tactic selection:

```bash
MODEL=models/gpt-oss-20b/variants/cuda_int4_int4_qmoe_rtn_matmul_only \
GPU=0 PROMPT_LEN=512 GEN_LEN=128 REPS=10 WARMUP=3 CUDA_GRAPH=1 XQA=1 SYNC_LIB=1 \
ORT_FORCE_DETERMINISTIC_MOE=1 \
bash scripts/bench_gpt_oss_ort_decode.sh
```

Baseline left `ORT_MOE_GEMV_SPLITK2_SWIGLU` unset.

| Run | Mode | Decode latency ms/token | Decode throughput tok/s |
|-----|------|-------------------------|-------------------------|
| R1, `REPS=5`, `WARMUP=2` | Baseline | 2.869450 | 348.498901 |
| R1, `REPS=5`, `WARMUP=2` | Split-K2 | 2.823800 | 354.132707 |
| R2, `REPS=10`, `WARMUP=3` | Baseline | 2.865840 | 348.937861 |
| R2, `REPS=10`, `WARMUP=3` | Split-K2 | 2.839335 | 352.195107 |

The longer CUDA-graph pair showed about `+0.9%` decode throughput. The shorter
pair showed about `+1.6%`. Since the focused helper reported valid output and
the model-level gain repeated in the same direction, even this modest gain is
worth enabling for GPT-OSS-20B decode while keeping an opt-out for A/B checks.

After testing Split-K2 as the selected route, three more paired
CUDA-graph model runs were collected with `REPS=10`, `WARMUP=3`, prompt length
512, and generation length 128:

| Run | Mode | Decode latency ms/token | Decode throughput tok/s |
|-----|------|-------------------------|-------------------------|
| R3 | Default Split-K2 | 3.017252 | 331.427448 |
| R3 | Split-K2 disabled | 3.055736 | 327.253380 |
| R4 | Default Split-K2 | 3.006739 | 332.586260 |
| R4 | Split-K2 disabled | 3.047570 | 328.130314 |
| R5 | Default Split-K2 | 3.009466 | 332.284898 |
| R5 | Split-K2 disabled | 3.047015 | 328.190090 |
| Average | Default Split-K2 | 3.011152 | 332.099536 |
| Average | Split-K2 disabled | 3.050107 | 327.857928 |

The default Split-K2 route was faster in all three pairs, averaging `+1.29%`
decode throughput and `-1.28%` decode latency versus the opt-out fallback.
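
The averaged percentages can be reproduced from the per-run latencies in the table. The sketch below derives throughput as `1000 / ms_per_token` and averages per-run values; the helper names are illustrative.

```python
from statistics import mean

# (run, default Split-K2 ms/token, Split-K2 disabled ms/token) for R3 to R5.
PAIRS = [
    ("R3", 3.017252, 3.055736),
    ("R4", 3.006739, 3.047570),
    ("R5", 3.009466, 3.047015),
]


def tok_per_s(ms_per_token):
    """Convert decode latency in ms/token to throughput in tok/s."""
    return 1000.0 / ms_per_token


def summarize(pairs):
    """Return percentage changes of Split-K2 relative to the disabled route."""
    splitk_ms = mean(p[1] for p in pairs)
    base_ms = mean(p[2] for p in pairs)
    splitk_tps = mean(tok_per_s(p[1]) for p in pairs)
    base_tps = mean(tok_per_s(p[2]) for p in pairs)
    return {
        "latency_change_pct": 100.0 * (splitk_ms / base_ms - 1.0),
        "throughput_change_pct": 100.0 * (splitk_tps / base_tps - 1.0),
    }
```

The two percentages differ slightly in magnitude because throughput is the reciprocal of latency, not a linear rescaling of it.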

### FP16 Accumulation Follow-Up

After the normal QMoE GEMV path changed to use fp16 accumulation by default, the
Split-K2 route was rechecked on the same GPT-OSS-20B decode shape. Sequential
focused-helper runs with `--repeat 100` showed Split-K2 behind the single-kernel
path:

| Run | Mode | Latency ms/inference |
|-----|------|----------------------|
| R1 | Split-K2 | 0.061761 |
| R1 | Split-K2 disabled | 0.060108 |
| R2 | Split-K2 | 0.062862 |
| R2 | Split-K2 disabled | 0.060989 |
| R3 | Split-K2 | 0.064595 |
| R3 | Split-K2 disabled | 0.060464 |
| Average | Split-K2 | 0.063073 |
| Average | Split-K2 disabled | 0.060520 |

A short CUDA-graph model-level pair with `REPS=5`, `WARMUP=2`, prompt length
512, and generation length 128 showed the same direction:

| Mode | Decode latency ms/token | Decode throughput tok/s |
|------|-------------------------|-------------------------|
| Split-K2 | 2.848148 | 351.105318 |
| Split-K2 disabled | 2.816800 | 355.012723 |

Although the single-kernel path was faster for this GPT-OSS focused helper,
Split-K2 with default fp16 accumulation was still faster than the
`ORT_MOE_GEMV_FP32_ACCUM=1` Split-K2 route. The fp16 Split-K2 variant is kept so
a future autotuner can choose it for shapes where the extra K parallelism wins.

The same focused profiler check was run for Qwen3.6-35B-A3B and Gemma4-26B-A4B
decode-shaped configs with `--repeat 100`:

| Case | Mode | Latency ms/inference |
|------|------|----------------------|
| Qwen3.6-35B-A3B | fp16 Split-K2 | 0.049207 |
| Qwen3.6-35B-A3B | fp16 Split-K2 disabled | 0.047403 |
| Qwen3.6-35B-A3B | fp32 Split-K2 | 0.052055 |
| Gemma4-26B-A4B | fp16 Split-K2 | 0.053503 |
| Gemma4-26B-A4B | fp16 Split-K2 disabled | 0.050732 |
| Gemma4-26B-A4B | fp32 Split-K2 | 0.059571 |

Both additional shapes produced valid output. In these focused helper runs,
fp16 Split-K2 again sat between the fp16 single-kernel path and the
`ORT_MOE_GEMV_FP32_ACCUM=1` Split-K2 path.
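
Per-shape route selection reduces, in its simplest form, to picking the lowest measured latency for each case. The sketch below applies that to the focused-helper numbers in the table; `pick_routes` is a hypothetical helper, and a real autotuner would need to measure on the target GPU and cache its choice.

```python
# Latency in ms/inference, copied from the focused-helper table.
MEASUREMENTS = {
    "Qwen3.6-35B-A3B": {
        "fp16 Split-K2": 0.049207,
        "fp16 Split-K2 disabled": 0.047403,
        "fp32 Split-K2": 0.052055,
    },
    "Gemma4-26B-A4B": {
        "fp16 Split-K2": 0.053503,
        "fp16 Split-K2 disabled": 0.050732,
        "fp32 Split-K2": 0.059571,
    },
}


def pick_routes(measurements):
    """Return {case: route with the lowest measured latency}."""
    return {
        case: min(routes, key=routes.get)
        for case, routes in measurements.items()
    }
```

On these two shapes the selection lands on the single-kernel fp16 route, consistent with keeping Split-K2 opt-in.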

### Accuracy Smoke

A 1000-sample `match_mmlu` smoke was run with the local parallel eval harness on
all eight H200 GPUs, using the same GPT-OSS-20B INT4 QMoE model package and the
current ORT build package. The default Split-K2 run scored `0.8380` pooled
accuracy; the non-Split-K fallback with `ORT_MOE_GEMV_SPLITK2_SWIGLU` unset
scored `0.8350`. The small positive difference is within smoke-test noise, and
there is no accuracy regression signal from enabling Split-K2.
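
A rough way to size "smoke-test noise" is the binomial standard error of each accuracy. The sketch below treats the two runs as independent samples of 1000 questions, which is a simplifying assumption: both runs scored the same questions, so a paired comparison would be more appropriate for anything beyond a sanity check.

```python
import math


def accuracy_gap_z(acc_a, acc_b, n):
    """Return the gap between two accuracies in units of its standard error.

    Assumes independent binomial samples of size n for each run.
    """
    var_a = acc_a * (1.0 - acc_a) / n
    var_b = acc_b * (1.0 - acc_b) / n
    return (acc_a - acc_b) / math.sqrt(var_a + var_b)


z = accuracy_gap_z(0.8380, 0.8350, 1000)
```

Here the gap is a small fraction of one standard error, which supports reading the `0.0030` difference as noise.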

### Decision

- Keep Split-K2 available for its supported fp16 INT4 interleaved-SwiGLU GEMV
scope when `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` enables it.
- Keep the fp16-accumulation Split-K2 variant available. It is slower than the
single-kernel fp16-accumulation path on the GPT-OSS shape, but faster than the
fp32-accumulation Split-K2 route and may be selected by future per-shape
autotuning.
- Use two binary route controls: `ORT_MOE_GEMV_FP32_ACCUM=1` enables fp32
accumulation, and `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` enables Split-K2. Both
default to `0`.
- The 1000-sample MMLU smoke matched the non-Split-K fallback within noise, so
enabling Split-K2 has an accuracy sanity check in addition to focused-helper
valid output.
- Future work:
  - Add per-shape autotune so route selection is data-driven instead of a fixed
    default.
  - Try a launch-fused reduction strategy or cooperative approach to keep the
    FC1 parallelism benefit without the extra reduce launch.

## 2026-06-19 FP16 Accumulation Default: SM90, GPT-OSS Decode Shape

### Setup