LLVM dialect -> WaveASM GEMM e2e#1079

Open
Hardcode84 wants to merge 28 commits into iree-org:main from Hardcode84:llvm-asm-backend-gemm

Conversation

@Hardcode84
Contributor

Add GEMM op handlers to the LLVM→WaveASM translation and fix a register allocator bug.

  • Implement ~15 new LLVM dialect op handlers needed to translate a basic GEMM kernel (64x64x64, F16→F32, MFMA 16x16x16) end-to-end through the water+waveasm pipeline: LDS addressing (addressof, multi-index GEP, ds_read/ds_write), integer division (sdiv/srem for power-of-2 divisors), barriers, MFMA intrinsics, scf.for → waveasm.loop structured control flow, dense vector constants, and shufflevector element extraction.
  • Extend water.py pipeline: preserve structured control flow (remove convert-scf-to-cf), add LDS alloca-to-global transform, add cleanup passes.
  • Fix register allocator loop init-arg coalescing: when an init arg has post-loop uses, skip coalescing with the block arg to prevent the loop body from corrupting the shared register. Emit entry copies for non-coalesced init args.

Depends on #1078

Hardcode84 and others added 28 commits March 8, 2026 18:06
Add `water_waveasm_lowering_pipeline` that lowers high-level MLIR to
LLVM dialect via water-opt (memref decomposition, lower-affine,
int-range-optimizations, gpu/amdgpu→rocdl, vector→llvm), producing
IR suitable for waveasm-translate consumption.

Hook this into compile.py under `use_water_backend + backend=asm`
so it takes priority over the legacy ASM flow.  Relax the
wave_runtime validation for this combination.

Step 2 (waveasm-translate consuming LLVM dialect) is a TODO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Tests the copy kernel through the new water_waveasm_lowering_pipeline
path (use_water_backend=True, backend="asm").  Currently compile_to_mlir
only — verifies the lowering to LLVM dialect succeeds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Introduce --waveasm-translate-from-llvm pass that creates a ProgramOp
from llvm.func kernels and strictly rejects any unhandled op. This is
the entry point for incremental LLVM dialect support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Handle the 12 LLVM/ROCDL ops needed for the copy kernel:
- llvm.mlir.constant → v_mov_b32
- rocdl.workitem.id.x → precolored v0
- llvm.sext/zext/trunc (i32↔i64) → identity on 32-bit GPU
- llvm.icmp → v_cmp_* (VCC implicit)
- llvm.select → v_cndmask_b32
- llvm.mul → v_mul_lo_u32
- rocdl.make.buffer.rsrc → map to SRD from prologue
- llvm.getelementptr (ptr<7>) → decompose into (SRD, voffset)
- llvm.load/store (ptr<7>) → buffer_load_ushort/buffer_store_short

The lit test exercises the full copy kernel path end-to-end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
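The ptr<7> GEP decomposition above can be modeled in a few lines of Python (an illustrative sketch, not the actual C++ pass; `BufferPtr` and `gep` are hypothetical names): a fat buffer pointer is tracked as an (SRD, voffset) pair, and each GEP folds its byte offset into voffset while the SRD propagates unchanged. The same accumulation handles GEP-of-GEP chains introduced in later commits.

```python
from dataclasses import dataclass

@dataclass
class BufferPtr:
    srd: str        # symbolic name of the 128-bit resource descriptor
    voffset: int    # accumulated byte offset (a VGPR at runtime)

def gep(base: BufferPtr, index: int, elem_bytes: int) -> BufferPtr:
    """Decompose a GEP on ptr<7>: same SRD, voffset += index * element size."""
    return BufferPtr(base.srd, base.voffset + index * elem_bytes)

# Two chained GEPs over f16 (2-byte) elements: offsets add, SRD is shared.
p = BufferPtr("srd0", 0)
q = gep(gep(p, 4, 2), 3, 2)
```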
Implement Step 2 of water_waveasm_lowering_pipeline: invoke
waveasm-translate with --waveasm-translate-from-llvm, optimization
passes, regalloc, and --emit-assembly. Step 3 assembles and links
to HSACO via clang (skipped when compile_to_mlir=True).

Also add use_buffer_ops=True to test_copy_water_waveasm (required
for the buffer ops path).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
End-to-end pass that translates LLVM dialect gpu.module to gpu.binary:
runs the full WaveASM pipeline (translation, optimization, regalloc,
waitcnt, hazard mitigation), emits AMDGCN assembly, assembles and links
to HSACO via ROCDL utilities, and embeds the binary inline in the IR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
… final step

- TranslateFromLLVM now inserts waveasm.program inside the gpu.module
  that contains the kernel, instead of at module top-level. The gpu.module
  is preserved (not erased).
- GPUModuleToBinary is now a pure final step: expects already-optimized
  waveasm.program ops inside gpu.module, emits assembly, assembles + links
  to HSACO, and replaces each gpu.module with gpu.binary.
- The full pipeline is now externally composable as individual passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…gram

Keep the original llvm.func in gpu.module so gpu.launch_func verification
passes, while placing waveasm.program alongside it with a mangled name
(e.g. test__waveasm). The kernel_name attribute preserves the original
name for assembly emission. GPUModuleToBinary erases everything when
replacing gpu.module with gpu.binary. Also wires up the 3-step
water-opt → waveasm-translate → water-opt pipeline and restricts the
waveasm e2e test to CDNA4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Map workgroup ID intrinsics to precolored system SGPRs, mirroring the
existing gpu.block_id handler from the MLIR-level translation path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Materialize poison values as zero immediates since they represent
undefined values with no meaningful content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When a GEP's base is another GEP result (not a direct buffer resource),
add the offsets together using v_add_u32 and propagate the SRD.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When GEPs operate on bare pointers (!llvm.ptr) before make.buffer.rsrc,
propagate the mapper entry and accumulate the byte offset. The offset
is then added to the voffset when creating buffer GEPs on the resulting
ptr<7>.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Add handleAdd (v_add_u32) and extract handlePoison into a proper
function matching the handleMul/handleConstant pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Use properly sized vreg types (e.g. vreg<2,2> for dwordx2) so the
assembly emitter produces register pairs like v[2:3] instead of
single v2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The SRD prologue uses s_mov_b64 to copy base pointers into SRD[0:1],
which corrupts word 1 bits [31:16] (stride/swizzle) with pointer bits.
Additionally, SRD[3] was hardcoded to 0x20000 instead of using the
actual flags from make.buffer.rsrc. Now handleMakeBufferRsrc patches
the SRD: clears stride bits in word 1 and overwrites flags in word 3.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
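The SRD patch-up described above amounts to two word-level bit operations (a sketch assuming the bit layout stated in the commit: stride/swizzle in word 1 bits [31:16], flags in word 3; `patch_srd` is a hypothetical name, not the real C++ helper):

```python
def patch_srd(srd: list[int], flags: int) -> list[int]:
    """After s_mov_b64 copies the 64-bit base pointer into words 0-1,
    pointer bits [63:48] land in word 1 [31:16], where stride/swizzle
    live. Clear those bits, and write the real flags from
    make.buffer.rsrc into word 3 instead of a hardcoded 0x20000."""
    word1 = srd[1] & 0x0000FFFF   # keep base[47:32], zero stride/swizzle
    return [srd[0], word1, srd[2], flags]
```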
AMDGPU only allocates system SGPRs for enabled workgroup IDs, so if
only workgroup_id_y is used (not x), the hardware puts y at the first
system SGPR slot — not the second. Our getWorkgroupIdSgprIndex was
unconditionally using base+dimension, reading garbage from the wrong
SGPR. Fix by always enabling all three workgroup IDs when any is used,
matching the real LLVM backend behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
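The slot-compaction behavior behind this bug can be sketched as follows (illustrative Python, hypothetical function name): the hardware packs system SGPRs only for enabled workgroup-ID dimensions, so the slot for a dimension is base plus the count of enabled lower dimensions, not base plus the dimension index. Enabling all three dimensions, as the fix does, makes the two formulas coincide.

```python
def workgroup_id_sgpr_index(base: int, dim: int, enabled: tuple) -> int:
    """Slot of workgroup_id[dim]: base + number of enabled dims below it."""
    assert enabled[dim], "dimension must be enabled to be read"
    return base + sum(1 for d in range(dim) if enabled[d])
```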
This shape generates vector constants for bounds checking which
require a scalarization pass not yet implemented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Non-pointer kernel arguments (e.g., dynamic dimension sizes) are now
mapped to their preloaded SGPR positions instead of being treated as
buffer pointers. Adds i64→i32 truncation for 32-bit VALU ops and
constant bus violation handling (SGPR→VGPR moves). Also adds
test_dynamic_copy_water_waveasm e2e test with proper xfail markers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Documents the planned approach for type legalization (i64→i32),
register placement (SGPR/VGPR), and constant bus enforcement
using generic pseudo-ops and dedicated post-translation passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Introduce register-file-agnostic pseudo-ops (arith.add, arith.mul,
arith.cmp, arith.select, arith.trunc, arith.sext, arith.zext) that
defer SGPR/VGPR and width decisions to legalization passes. A single
arith.cmp op with a CmpPredicate enum replaces per-predicate ops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Lowers generic arith pseudo-ops to concrete SALU/VALU machine ops:
- Demand-driven SGPR/VGPR assignment based on operand types
- Constant bus enforcement (v_mov_b32 for extra SGPRs in VALU)
- Wide SGPR narrowing to low sub-register (i64 truncation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
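The demand-driven assignment and constant bus rules above can be sketched roughly as (an illustrative model under the assumption of a single-SGPR read budget per VALU op; both functions are hypothetical, not the pass's real API):

```python
def pick_register_file(operand_files: list) -> str:
    """A pseudo arith op lowers to a VALU op (VGPR result) if any
    operand is in a VGPR; otherwise it can stay on the SALU."""
    return "VGPR" if "VGPR" in operand_files else "SGPR"

def enforce_constant_bus(operand_files: list, max_sgpr_reads: int = 1):
    """VALU ops may read a limited number of SGPRs; extra SGPR operands
    are first copied to VGPRs via v_mov_b32."""
    moves, files, sgprs = [], [], 0
    for f in operand_files:
        if f == "SGPR":
            sgprs += 1
            if sgprs > max_sgpr_reads:
                moves.append("v_mov_b32")
                f = "VGPR"
        files.append(f)
    return moves, files
```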
Replace direct SALU/VALU emission in TranslateFromLLVM with generic
arith pseudo-ops (arith.add, arith.mul, arith.cmp, arith.select,
arith.trunc, arith.sext, arith.zext). The arith legalization pass
handles register file selection, constant bus enforcement, and i64
narrowing as a separate pipeline stage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Implement the missing LLVM dialect op handlers needed to translate a
basic GEMM kernel (64×64×64, F16→F32, MFMA 16×16×16) through the
water+waveasm pipeline. The kernel now compiles end-to-end to a GPU
binary (runtime correctness is a follow-up).

New handlers:
- addressof: LDS global → VGPR zero base offset, records LDS size
- sdiv/srem: power-of-2 via v_ashrrev / v_and_b32
- fence: no-op (implicit via barrier/waitcnt)
- barrier: both rocdl.barrier and rocdl.s.barrier → s_barrier
- shufflevector: single-element extract via waveasm.extract
- mfma_f32_16x16x16_f16: emit v_mfma with 4-wide VReg accumulator
- scf.for/yield: translate to waveasm.loop + condition (do-while)
- LDS load/store: ds_read/ds_write dispatch by access width
- dense vector constant: splat init for MFMA accumulators

GEP handling extended for LDS (ptr<3>): multi-index GEPs with
constant indices, constant-attr byte offsets, all-zero passthrough.

Pipeline changes (water.py):
- Remove convert-scf-to-cf (preserve structured control flow)
- Add water-alloc-to-alloca + alloca_to_global transform for LDS
- Add water-drop-transform-ops + symbol-dce cleanup

Includes a pytest e2e test (test_gemm_waveasm) that exercises the
full pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
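The power-of-2 sdiv/srem lowering above reduces to a shift and a mask. Note the caveat: an arithmetic shift computes floor division, which matches llvm.sdiv (truncating division) only for non-negative operands, as in the thread-index arithmetic this handler targets; negative inputs would need a bias correction.

```python
def sdiv_pow2(x: int, k: int) -> int:
    """x sdiv 2**k via arithmetic shift (v_ashrrev); valid for x >= 0."""
    return x >> k

def srem_pow2(x: int, k: int) -> int:
    """x srem 2**k via mask (v_and_b32); again assumes x >= 0."""
    return x & ((1 << k) - 1)
```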
Cover the new op handlers added for GEMM translation:
- LDS: addressof, multi-index GEP, ds_read_b32, ds_write_b32
- sdiv/srem: power-of-2 via v_ashrrev_i32 / v_and_b32
- barrier: rocdl.barrier, rocdl.s.barrier, llvm.fence (no-op)
- MFMA: dense vector constant, v_mfma_f32_16x16x16_f16, extract
- SCF for: waveasm.loop + condition with IV increment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The register allocator's tied-operand coalescing for loop boundaries
would coalesce an init arg with its block arg even when the init arg
had uses after the loop. The loop body modifies the block arg (same
physical register), corrupting the value for post-loop users.

Skip coalescing when the init arg's live range extends past the loop
operation, forcing separate physical registers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When the liveness analysis skips init-arg-to-block-arg coalescing
(because the init arg has post-loop uses), the init arg and block arg
get different physical registers. The assembly emitter must emit a
copy from the init arg register to the block arg register before the
loop label, otherwise the block arg starts uninitialized.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
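The two fixes above fit together as one rule plus one compensation, sketched here in Python (an illustrative model; the allocator's real data structures differ, and the names are hypothetical): coalesce an init arg with its block arg only when the init arg's live range ends inside the loop, and for every non-coalesced pair emit an entry copy before the loop label.

```python
def should_coalesce(init_arg_live_end: int, loop_end: int) -> bool:
    """Safe to share a physical register only if the init arg dies
    inside the loop; otherwise the body would clobber a value that
    post-loop users still need."""
    return init_arg_live_end <= loop_end

def entry_copies(init_regs: list, block_regs: list) -> list:
    """For non-coalesced pairs, copy the init value into the block arg's
    register before the loop label so it does not start uninitialized."""
    return [f"v_mov_b32 {dst}, {src}"
            for src, dst in zip(init_regs, block_regs) if src != dst]
```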
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
@Hardcode84 Hardcode84 requested a review from harsh-nod March 8, 2026 21:27