LLVM dialect -> WaveASM GEMM e2e#1079

Open
Hardcode84 wants to merge 28 commits into iree-org:main from Hardcode84:llvm-asm-backend-gemm

Conversation

@Hardcode84
Contributor

Add GEMM op handlers to the LLVM→WaveASM translation and fix a register allocator bug.

  • Implement ~15 new LLVM dialect op handlers needed to translate a basic GEMM kernel (64x64x64, F16→F32, MFMA 16x16x16) end-to-end through the water+waveasm pipeline: LDS addressing (addressof, multi-index GEP, ds_read/ds_write), integer division (sdiv/srem for power-of-2 divisors), barriers, MFMA intrinsics, scf.for → waveasm.loop structured control flow, dense vector constants, and shufflevector element extraction.
  • Extend water.py pipeline: preserve structured control flow (remove convert-scf-to-cf), add LDS alloca-to-global transform, add cleanup passes.
  • Fix register allocator loop init-arg coalescing: when an init arg has post-loop uses, skip coalescing with the block arg to prevent the loop body from corrupting the shared register. Emit entry copies for non-coalesced init args.

Depends on #1078

Hardcode84 and others added 28 commits March 8, 2026 18:06
Add `water_waveasm_lowering_pipeline` that lowers high-level MLIR to
LLVM dialect via water-opt (memref decomposition, lower-affine,
int-range-optimizations, gpu/amdgpu→rocdl, vector→llvm), producing
IR suitable for waveasm-translate consumption.

Hook this into compile.py under `use_water_backend + backend=asm`
so it takes priority over the legacy ASM flow.  Relax the
wave_runtime validation for this combination.

Step 2 (waveasm-translate consuming LLVM dialect) is a TODO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Tests the copy kernel through the new water_waveasm_lowering_pipeline
path (use_water_backend=True, backend="asm").  Currently compile_to_mlir
only — verifies the lowering to LLVM dialect succeeds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Introduce --waveasm-translate-from-llvm pass that creates a ProgramOp
from llvm.func kernels and strictly rejects any unhandled op. This is
the entry point for incremental LLVM dialect support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Handle the 12 LLVM/ROCDL ops needed for the copy kernel:
- llvm.mlir.constant → v_mov_b32
- rocdl.workitem.id.x → precolored v0
- llvm.sext/zext/trunc (i32↔i64) → identity on 32-bit GPU
- llvm.icmp → v_cmp_* (VCC implicit)
- llvm.select → v_cndmask_b32
- llvm.mul → v_mul_lo_u32
- rocdl.make.buffer.rsrc → map to SRD from prologue
- llvm.getelementptr (ptr<7>) → decompose into (SRD, voffset)
- llvm.load/store (ptr<7>) → buffer_load_ushort/buffer_store_short

The lit test exercises the full copy kernel path end-to-end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
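The ptr<7> GEP decomposition above can be modeled in a few lines of Python (an illustrative sketch, not the actual C++ pass; `BufferPtr` and `gep` are hypothetical names): a fat buffer pointer is tracked as an (SRD, voffset) pair, and each GEP folds its byte offset into voffset while the SRD propagates unchanged. The same accumulation handles GEP-of-GEP chains introduced in later commits.

```python
from dataclasses import dataclass

@dataclass
class BufferPtr:
    srd: str        # symbolic name of the 128-bit resource descriptor
    voffset: int    # accumulated byte offset (a VGPR at runtime)

def gep(base: BufferPtr, index: int, elem_bytes: int) -> BufferPtr:
    """Decompose a GEP on ptr<7>: same SRD, voffset += index * element size."""
    return BufferPtr(base.srd, base.voffset + index * elem_bytes)

# Two chained GEPs over f16 (2-byte) elements: offsets add, SRD is shared.
p = BufferPtr("srd0", 0)
q = gep(gep(p, 4, 2), 3, 2)
```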
Implement Step 2 of water_waveasm_lowering_pipeline: invoke
waveasm-translate with --waveasm-translate-from-llvm, optimization
passes, regalloc, and --emit-assembly. Step 3 assembles and links
to HSACO via clang (skipped when compile_to_mlir=True).

Also add use_buffer_ops=True to test_copy_water_waveasm (required
for the buffer ops path).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
End-to-end pass that translates LLVM dialect gpu.module to gpu.binary:
runs the full WaveASM pipeline (translation, optimization, regalloc,
waitcnt, hazard mitigation), emits AMDGCN assembly, assembles and links
to HSACO via ROCDL utilities, and embeds the binary inline in the IR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
… final step

- TranslateFromLLVM now inserts waveasm.program inside the gpu.module
  that contains the kernel, instead of at module top-level. The gpu.module
  is preserved (not erased).
- GPUModuleToBinary is now a pure final step: expects already-optimized
  waveasm.program ops inside gpu.module, emits assembly, assembles + links
  to HSACO, and replaces each gpu.module with gpu.binary.
- The full pipeline is now externally composable as individual passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…gram

Keep the original llvm.func in gpu.module so gpu.launch_func verification
passes, while placing waveasm.program alongside it with a mangled name
(e.g. test__waveasm). The kernel_name attribute preserves the original
name for assembly emission. GPUModuleToBinary erases everything when
replacing gpu.module with gpu.binary. Also wires up the 3-step
water-opt → waveasm-translate → water-opt pipeline and restricts the
waveasm e2e test to CDNA4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Map workgroup ID intrinsics to precolored system SGPRs, mirroring the
existing gpu.block_id handler from the MLIR-level translation path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Materialize poison values as zero immediates since they represent
undefined values with no meaningful content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When a GEP's base is another GEP result (not a direct buffer resource),
add the offsets together using v_add_u32 and propagate the SRD.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When GEPs operate on bare pointers (!llvm.ptr) before make.buffer.rsrc,
propagate the mapper entry and accumulate the byte offset. The offset
is then added to the voffset when creating buffer GEPs on the resulting
ptr<7>.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Add handleAdd (v_add_u32) and extract handlePoison into a proper
function matching the handleMul/handleConstant pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Use properly sized vreg types (e.g. vreg<2,2> for dwordx2) so the
assembly emitter produces register pairs like v[2:3] instead of
single v2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The SRD prologue uses s_mov_b64 to copy base pointers into SRD[0:1],
which corrupts word 1 bits [31:16] (stride/swizzle) with pointer bits.
Additionally, SRD[3] was hardcoded to 0x20000 instead of using the
actual flags from make.buffer.rsrc. Now handleMakeBufferRsrc patches
the SRD: clears stride bits in word 1 and overwrites flags in word 3.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
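The SRD patch-up described above amounts to two word-level bit operations (a sketch assuming the bit layout stated in the commit: stride/swizzle in word 1 bits [31:16], flags in word 3; `patch_srd` is a hypothetical name, not the real C++ helper):

```python
def patch_srd(srd: list[int], flags: int) -> list[int]:
    """After s_mov_b64 copies the 64-bit base pointer into words 0-1,
    pointer bits [63:48] land in word 1 [31:16], where stride/swizzle
    live. Clear those bits, and write the real flags from
    make.buffer.rsrc into word 3 instead of a hardcoded 0x20000."""
    word1 = srd[1] & 0x0000FFFF   # keep base[47:32], zero stride/swizzle
    return [srd[0], word1, srd[2], flags]
```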
AMDGPU only allocates system SGPRs for enabled workgroup IDs, so if
only workgroup_id_y is used (not x), the hardware puts y at the first
system SGPR slot — not the second. Our getWorkgroupIdSgprIndex was
unconditionally using base+dimension, reading garbage from the wrong
SGPR. Fix by always enabling all three workgroup IDs when any is used,
matching the real LLVM backend behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
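The slot-compaction behavior behind this bug can be sketched as follows (illustrative Python, hypothetical function name): the hardware packs system SGPRs only for enabled workgroup-ID dimensions, so the slot for a dimension is base plus the count of enabled lower dimensions, not base plus the dimension index. Enabling all three dimensions, as the fix does, makes the two formulas coincide.

```python
def workgroup_id_sgpr_index(base: int, dim: int, enabled: tuple) -> int:
    """Slot of workgroup_id[dim]: base + number of enabled dims below it."""
    assert enabled[dim], "dimension must be enabled to be read"
    return base + sum(1 for d in range(dim) if enabled[d])
```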
This shape generates vector constants for bounds checking which
require a scalarization pass not yet implemented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Non-pointer kernel arguments (e.g., dynamic dimension sizes) are now
mapped to their preloaded SGPR positions instead of being treated as
buffer pointers. Adds i64→i32 truncation for 32-bit VALU ops and
constant bus violation handling (SGPR→VGPR moves). Also adds
test_dynamic_copy_water_waveasm e2e test with proper xfail markers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Documents the planned approach for type legalization (i64→i32),
register placement (SGPR/VGPR), and constant bus enforcement
using generic pseudo-ops and dedicated post-translation passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Introduce register-file-agnostic pseudo-ops (arith.add, arith.mul,
arith.cmp, arith.select, arith.trunc, arith.sext, arith.zext) that
defer SGPR/VGPR and width decisions to legalization passes. A single
arith.cmp op with a CmpPredicate enum replaces per-predicate ops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Lowers generic arith pseudo-ops to concrete SALU/VALU machine ops:
- Demand-driven SGPR/VGPR assignment based on operand types
- Constant bus enforcement (v_mov_b32 for extra SGPRs in VALU)
- Wide SGPR narrowing to low sub-register (i64 truncation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
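The demand-driven assignment and constant bus rules above can be sketched roughly as (an illustrative model under the assumption of a single-SGPR read budget per VALU op; both functions are hypothetical, not the pass's real API):

```python
def pick_register_file(operand_files: list) -> str:
    """A pseudo arith op lowers to a VALU op (VGPR result) if any
    operand is in a VGPR; otherwise it can stay on the SALU."""
    return "VGPR" if "VGPR" in operand_files else "SGPR"

def enforce_constant_bus(operand_files: list, max_sgpr_reads: int = 1):
    """VALU ops may read a limited number of SGPRs; extra SGPR operands
    are first copied to VGPRs via v_mov_b32."""
    moves, files, sgprs = [], [], 0
    for f in operand_files:
        if f == "SGPR":
            sgprs += 1
            if sgprs > max_sgpr_reads:
                moves.append("v_mov_b32")
                f = "VGPR"
        files.append(f)
    return moves, files
```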
Replace direct SALU/VALU emission in TranslateFromLLVM with generic
arith pseudo-ops (arith.add, arith.mul, arith.cmp, arith.select,
arith.trunc, arith.sext, arith.zext). The arith legalization pass
handles register file selection, constant bus enforcement, and i64
narrowing as a separate pipeline stage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Implement the missing LLVM dialect op handlers needed to translate a
basic GEMM kernel (64×64×64, F16→F32, MFMA 16×16×16) through the
water+waveasm pipeline. The kernel now compiles end-to-end to a GPU
binary (runtime correctness is a follow-up).

New handlers:
- addressof: LDS global → VGPR zero base offset, records LDS size
- sdiv/srem: power-of-2 via v_ashrrev / v_and_b32
- fence: no-op (implicit via barrier/waitcnt)
- barrier: both rocdl.barrier and rocdl.s.barrier → s_barrier
- shufflevector: single-element extract via waveasm.extract
- mfma_f32_16x16x16_f16: emit v_mfma with 4-wide VReg accumulator
- scf.for/yield: translate to waveasm.loop + condition (do-while)
- LDS load/store: ds_read/ds_write dispatch by access width
- dense vector constant: splat init for MFMA accumulators

GEP handling extended for LDS (ptr<3>): multi-index GEPs with
constant indices, constant-attr byte offsets, all-zero passthrough.

Pipeline changes (water.py):
- Remove convert-scf-to-cf (preserve structured control flow)
- Add water-alloc-to-alloca + alloca_to_global transform for LDS
- Add water-drop-transform-ops + symbol-dce cleanup

Includes a pytest e2e test (test_gemm_waveasm) that exercises the
full pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
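The power-of-2 sdiv/srem lowering above reduces to a shift and a mask. Note the caveat: an arithmetic shift computes floor division, which matches llvm.sdiv (truncating division) only for non-negative operands, as in the thread-index arithmetic this handler targets; negative inputs would need a bias correction.

```python
def sdiv_pow2(x: int, k: int) -> int:
    """x sdiv 2**k via arithmetic shift (v_ashrrev); valid for x >= 0."""
    return x >> k

def srem_pow2(x: int, k: int) -> int:
    """x srem 2**k via mask (v_and_b32); again assumes x >= 0."""
    return x & ((1 << k) - 1)
```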
Cover the new op handlers added for GEMM translation:
- LDS: addressof, multi-index GEP, ds_read_b32, ds_write_b32
- sdiv/srem: power-of-2 via v_ashrrev_i32 / v_and_b32
- barrier: rocdl.barrier, rocdl.s.barrier, llvm.fence (no-op)
- MFMA: dense vector constant, v_mfma_f32_16x16x16_f16, extract
- SCF for: waveasm.loop + condition with IV increment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The register allocator's tied-operand coalescing for loop boundaries
would coalesce an init arg with its block arg even when the init arg
had uses after the loop. The loop body modifies the block arg (same
physical register), corrupting the value for post-loop users.

Skip coalescing when the init arg's live range extends past the loop
operation, forcing separate physical registers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When the liveness analysis skips init-arg-to-block-arg coalescing
(because the init arg has post-loop uses), the init arg and block arg
get different physical registers. The assembly emitter must emit a
copy from the init arg register to the block arg register before the
loop label, otherwise the block arg starts uninitialized.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
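The two fixes above fit together as one rule plus one compensation, sketched here in Python (an illustrative model; the allocator's real data structures differ, and the names are hypothetical): coalesce an init arg with its block arg only when the init arg's live range ends inside the loop, and for every non-coalesced pair emit an entry copy before the loop label.

```python
def should_coalesce(init_arg_live_end: int, loop_end: int) -> bool:
    """Safe to share a physical register only if the init arg dies
    inside the loop; otherwise the body would clobber a value that
    post-loop users still need."""
    return init_arg_live_end <= loop_end

def entry_copies(init_regs: list, block_regs: list) -> list:
    """For non-coalesced pairs, copy the init value into the block arg's
    register before the loop label so it does not start uninitialized."""
    return [f"v_mov_b32 {dst}, {src}"
            for src, dst in zip(init_regs, block_regs) if src != dst]
```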
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
@Hardcode84 Hardcode84 requested a review from harsh-nod March 8, 2026 21:27