Dynamic stride support through waveasm#1091
Draft
suryajasper wants to merge 9 commits into iree-org:main from
b51b0b3 to
2bb545c
Compare
panditsa reviewed Mar 10, 2026
wave_lang/kernel/wave/compile.py
Outdated
    "--waveasm-loop-address-promotion",
    "--waveasm-linear-scan=max-vgprs=512 max-agprs=512",
    "--waveasm-insert-waitcnt=ticketed-waitcnt=false",
    "--waveasm-insert-waitcnt=ticketed-waitcnt=true",
Contributor
This should be controlled through a compile option from the test itself.
136a9f0 to
432dcaa
Compare
Signed-off-by: Surya Jasper <45545431+suryajasper@users.noreply.github.com>
…ast to output buffer Signed-off-by: Surya Jasper <45545431+suryajasper@users.noreply.github.com>
47f2ef2 to
1995206
Compare
suryajasper added a commit to suryajasper/wave that referenced this pull request Mar 25, 2026
Squashed cherry-pick of suryajasper/dynamic-strides-waveasm onto 4waveasm-256x192x256. Merges partial kernel argument preloading, extract_strided_metadata handler, and dynamic stride test updates.

Commits included:
- Handle memref.extract_strided_metadata in waveasm backend
- Update dynamic strides test & compile options to include waveasm
- xfail waveasm dynamic strides tests w/ dynamic dims or buffer ops
- Fix dynamic strides + dynamic dims through waveasm & accumulator bitcast
- Fixed dynamic strides with bufops w/ waveasm
- Fix mxfp waveasm example to use (2,2) wave shape
- Fixed waveasm dynamic strides to use partial kernel argument preloading

Made-with: Cursor
This PR adds support for dynamic strides through the waveasm backend. Four main cases need to be addressed for complete support.
[FIXED] The existing dynamic stride logic in waveasm handles loads correctly but not stores. Stores go to a flat memref with a static stride of [1], and the MLIR pipeline emits a memref.extract_strided_metadata op to compute the linearized store index from the dynamic strides. The ASM backend did not handle this op, so I added a handler that loads the strides and performs the linearized index computation.
[IN PROGRESS] Passing the buffer addresses, dynamic dims, and dynamic strides together overflows the gfx950 limit on preloaded kernel arguments. For example, a simple GEMM with buffer arguments A = MxK, B = NxK, and C = MxN produces 9 preloaded arguments (3 buffer pointers, 3 dynamic dims, 3 leading strides), which map to 9 * 2 = 18 preloaded SGPRs, exceeding the limit of 15. I'm working on a fix that preloads only the buffer args and loads the scalar args explicitly through s_load_dword. This fixes the waveasm compilation issues but causes GPU faults, which I am debugging.
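For the [FIXED] store case, the linearized index that the strides feed into can be sketched in plain Python. This is only an illustration of the arithmetic, not the handler added in this PR; the function name `linearize_index` and the example shape are hypothetical.

```python
from typing import Sequence


def linearize_index(
    indices: Sequence[int], strides: Sequence[int], offset: int = 0
) -> int:
    """Map a multi-dimensional index to a flat element offset.

    Hypothetical helper: computes offset + sum(index_i * stride_i), which is
    the usual strided-layout formula. Strides are runtime values here, as
    they would be for a dynamically strided memref.
    """
    if len(indices) != len(strides):
        raise ValueError("indices and strides must have the same rank")
    return offset + sum(i * s for i, s in zip(indices, strides))


# Example: element (2, 3) of a 2-D view whose leading stride is 10
# (for instance, rows padded to 10 elements) and whose inner stride is 1.
flat = linearize_index((2, 3), (10, 1))
print(flat)  # 2 * 10 + 3 * 1 = 23
```

With a static stride of [1] on the flat destination, this flat offset is the only index the store needs, which is why the strides have to be available at that point.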
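For the [IN PROGRESS] case, the SGPR budget arithmetic can be checked with a few lines of Python. The constants come from the numbers quoted above (2 SGPRs per preloaded argument, a limit of 15); the helper names are hypothetical and this does not model how waveasm actually assigns registers.

```python
# Numbers taken from the description above; treat them as assumptions.
SGPRS_PER_ARG = 2        # each preloaded argument is counted as 2 SGPRs
PRELOAD_SGPR_LIMIT = 15  # gfx950 preload limit as quoted in this PR


def preload_sgprs(num_args: int, sgprs_per_arg: int = SGPRS_PER_ARG) -> int:
    """Number of SGPRs consumed if num_args arguments are preloaded."""
    return num_args * sgprs_per_arg


def fits_preload(num_args: int, limit: int = PRELOAD_SGPR_LIMIT) -> bool:
    """True if preloading num_args arguments stays within the limit."""
    return preload_sgprs(num_args) <= limit


# GEMM example: 3 buffer pointers + 3 dynamic dims + 3 leading strides.
all_args = 3 + 3 + 3
buffers_only = 3

print(preload_sgprs(all_args), fits_preload(all_args))          # 18 False
print(preload_sgprs(buffers_only), fits_preload(buffers_only))  # 6 True
```

Preloading only the three buffer pointers uses 6 SGPRs under these assumptions, which is why the partial-preload approach leaves the six scalar arguments to be fetched explicitly.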