This issue tracks the progress of GPT-OSS-20B throughput optimization.
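Related PRs:

* olive-recipes
  * Add GPT-OSS 20B recipes: https://github.com/microsoft/olive-recipes/pull/507
  * Experiments of recipes: https://github.com/tianleiwu/olive-recipes/blob/tlwu/gpt-oss-20b/gpt-oss-20b/gpt_oss_20b_experiments.md
* onnxruntime-genai
  * Update model builder for gpt-oss: https://github.com/microsoft/onnxruntime-genai/pull/2234
* CUDA kernel improvements
  * [CUDA] QMoE GEMV fast path for batch-1 decode: https://github.com/microsoft/onnxruntime/pull/29038
  * [CUDA] Optimize FlashDecode split planning for local-window GQA: https://github.com/microsoft/onnxruntime/pull/29161
  * [CUDA] Enable XQA decode for GroupQueryAttention with attention sink: https://github.com/microsoft/onnxruntime/pull/29162
  * [CUDA] Default QMoE GEMV fp16 accumulation for fp16 activations: https://github.com/microsoft/onnxruntime/pull/29166
  * [CUDA] Add sliding-window support to non-quantized XQA decode: https://github.com/microsoft/onnxruntime/pull/29177
  * [CUDA]: Split-K2 QMoE SwiGLU GEMV kernel (Experiment): https://github.com/microsoft/onnxruntime/pull/29167
* Fusion
  * [CUDA] Enable CUDA GQA QK-Norm and XQA decode: https://github.com/microsoft/onnxruntime/pull/29186
  * [CUDA] Fuse MoE router bias into MatMulNBits GEMV: https://github.com/microsoft/onnxruntime/pull/29170
  * QMoE router Fusion (Experiment)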