Wave RDNA4 MoE kernel #1090

Open
nirmie wants to merge 83 commits into iree-org:main from nirmie:MoE-clean

nirmie (Contributor) commented Mar 9, 2026

Wave MoE kernel, built on @panditsa's initial work.

  • Targets RDNA4, which runs 32 threads per wavefront.
  • Currently split into multiple microkernels. Fusing them is in progress: block_align and the actual GEMMs have different hardware constraints, which causes issues when they are combined.
  • The gather/scatter kernels can be combined with the GEMMs (planned soon).
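
For readers unfamiliar with the block-alignment step, here is a minimal plain-Python sketch of what such a step typically produces. It is an illustration only, not the code in this PR: the function name `align_tokens_to_blocks`, the `-1` padding sentinel, and top-1 routing (one expert per token) are all assumptions. Token IDs are grouped by expert and each group is padded to a multiple of the block size, so that every block maps to exactly one expert.

```python
def align_tokens_to_blocks(expert_of_token, num_experts, block_size, pad_id=-1):
    """Group token indices by expert; pad each group to a multiple of block_size.

    Hypothetical helper for illustration. Returns (sorted_ids, block_expert_ids),
    where block_expert_ids[b] is the expert that owns block b.
    """
    sorted_ids = []
    block_expert_ids = []
    for expert in range(num_experts):
        group = [tok for tok, e in enumerate(expert_of_token) if e == expert]
        if not group:
            continue  # experts with no tokens get no blocks
        num_blocks = -(-len(group) // block_size)  # ceiling division
        group += [pad_id] * (num_blocks * block_size - len(group))
        sorted_ids.extend(group)
        block_expert_ids.extend([expert] * num_blocks)
    return sorted_ids, block_expert_ids


# Five tokens routed to three experts, blocks of two slots each.
sorted_ids, block_expert_ids = align_tokens_to_blocks(
    [1, 0, 1, 2, 1], num_experts=3, block_size=2
)
print(sorted_ids)        # [1, -1, 0, 2, 4, -1, 3, -1]
print(block_expert_ids)  # [0, 1, 1, 2]
```

Because each block holds tokens for a single expert, a GEMM tile working on that block only needs one expert's weights. The block size that suits this bookkeeping need not match the tile shape the GEMM prefers, which may be one source of the constraint mismatch mentioned above.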

nirmie force-pushed the MoE-clean branch 3 times, most recently from 96ff51a to fc631d5 on March 12, 2026 23:12
panditsa and others added 27 commits March 16, 2026 15:58
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
panditsa and others added 24 commits March 16, 2026 15:58
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
…/expert IDs

Signed-off-by: Nirmal Senthilkumar <nirmalsent@gmail.com>
Signed-off-by: Nirmal Senthilkumar <nirmalsent@gmail.com>
…pert IDs fill using Wave

Signed-off-by: Nirmal Senthilkumar <nirmalsent@gmail.com>
Signed-off-by: Nirmal Senthilkumar <nirmalsent@gmail.com>
…crokernels architecture

Signed-off-by: Nirmal Senthilkumar <nirmalsent@gmail.com>
Signed-off-by: Nirmal Senthilkumar <nirmalsent@gmail.com>
Signed-off-by: Nirmal Senthilkumar <nirmalsent@gmail.com>
nirmie and others added 2 commits March 23, 2026 20:58
Since scatter and gather use the same sorted_ids, the token-space
round-trip cancels. SiLU can be applied directly in block/slot space,
eliminating 3 kernel launches and 2 intermediate buffers.
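
The cancellation can be checked with a small plain-Python sketch. This is an illustration under stated assumptions, not the PR's kernels: the helper names `scatter_to_tokens` and `gather_to_slots`, the `-1` padding sentinel, zero-valued padding slots, and each token appearing in exactly one slot are all hypothetical choices made for the example.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def scatter_to_tokens(slot_vals, sorted_ids, num_tokens):
    """Slot space -> token space (hypothetical helper). Padding slots are skipped."""
    out = [0.0] * num_tokens
    for slot, tok in enumerate(sorted_ids):
        if tok >= 0:
            out[tok] = slot_vals[slot]
    return out


def gather_to_slots(token_vals, sorted_ids):
    """Token space -> slot space (hypothetical helper). Padding slots read as 0."""
    return [token_vals[tok] if tok >= 0 else 0.0 for tok in sorted_ids]


sorted_ids = [1, -1, 0, 2]         # slot -> token id, -1 marks padding
gemm1_out = [0.5, 0.0, -1.0, 2.0]  # values in slot space

# Long path: scatter to token space, apply SiLU, gather back to slot space.
token_vals = scatter_to_tokens(gemm1_out, sorted_ids, num_tokens=3)
round_trip = gather_to_slots([silu(v) for v in token_vals], sorted_ids)

# Short path: apply SiLU directly in slot space.
direct = [silu(v) for v in gemm1_out]

assert round_trip == direct
```

SiLU is elementwise, and the gather reads each value back from the position the scatter wrote it to, so the permutation has no effect on the result. The padding slot also agrees here because SiLU(0) is 0.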

---------

Signed-off-by: Nirmal Senthilkumar <nirmalsent@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
BLOCK_N=min(n,128) gives ~1.7x speedup on GEMM1 (RDNA4) by increasing
N-tile reuse per wave. Unit tests cover each kernel individually from
smallest to largest, with full pipeline integration tests last.
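
A rough way to see why a wider N tile increases reuse is to count tiles along N. The sketch below is illustrative arithmetic only: `pick_block_n` and `num_n_tiles` are hypothetical helper names, and the comparison block size of 32 is an assumed baseline, not a value taken from this PR.

```python
def pick_block_n(n, cap=128):
    """Tile width along N, following the BLOCK_N = min(n, 128) rule."""
    return min(n, cap)


def num_n_tiles(n, block_n):
    """Number of tiles needed to cover N (ceiling division)."""
    return -(-n // block_n)


for n in (64, 512):
    block_n = pick_block_n(n)
    print(n, block_n, num_n_tiles(n, block_n))
# 64 64 1
# 512 128 4

# With an assumed narrower tile of 32, N = 512 would need 16 tiles instead of 4.
print(num_n_tiles(512, 32))  # 16
```

Fewer N tiles means the same rows of the activation operand are reloaded fewer times, which is consistent with the reuse argument above. The tile count alone does not predict the measured speedup.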

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Nirmal Senthilkumar <nirmalsent@gmail.com>

2 participants