[wip] Coalesce epilogue store by xintin · Pull Request #1194 · iree-org/wave

xintin · 2026-03-25T20:55:42Z

In MXFP4 GEMM kernels with bf16 output, the MMA accumulator layout (F32_16x16x128_F8F6F4) distributes results across threads in a non-contiguous pattern. Without a transposed output layout, each thread issues narrow, scattered global stores (buffer_store_short, one bf16 each), which underutilize memory bandwidth. A transposed output layout (C^T [N, M]) makes each lane's MMA accumulator elements contiguous in the fast (M) dimension, widening the stores to buffer_store_dwordx2 (4 bf16 per thread). The compiler pass in this PR doubles the width again to buffer_store_dwordx4 (8 bf16 per thread).
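The store widths quoted above follow from simple bit arithmetic. The sketch below is only an illustration of that arithmetic; the helper names (`bf16_per_store`, `stores_needed`) are hypothetical and not part of this PR.

```python
# Illustrative arithmetic only; helper names are hypothetical, not from the PR.
BF16_BITS = 16

# Payload width in bits of each store instruction named in the description.
STORE_BITS = {
    "buffer_store_short": 16,     # one 16-bit value
    "buffer_store_dwordx2": 64,   # two 32-bit dwords
    "buffer_store_dwordx4": 128,  # four 32-bit dwords
}


def bf16_per_store(instr: str) -> int:
    """Number of bf16 elements carried by one store instruction."""
    return STORE_BITS[instr] // BF16_BITS


def stores_needed(num_bf16: int, instr: str) -> int:
    """Stores needed to write num_bf16 contiguous elements (ceiling division)."""
    per_store = bf16_per_store(instr)
    return (num_bf16 + per_store - 1) // per_store


if __name__ == "__main__":
    for name in STORE_BITS:
        print(name, bf16_per_store(name), "bf16 per store")
```

For the 4 accumulator elements a thread owns, this gives four `buffer_store_short` instructions versus a single `buffer_store_dwordx2`.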

The coalesce_epilogue_stores pass marks epilogue bf16 global writes for permlane packing. At codegen time, v_permlane16_swap_b32 exchanges each thread's 4 bf16 values (2 packed dwords) with a partner lane 16 positions apart. Since the MMA layout groups lanes by 16 (lanes 0-15 own M=0-3, lanes 16-31 own M=4-7, etc.), this produces 8 consecutive M values per lane, written as a single buffer_store_dwordx4.

Both lane halves write identical data to the same global address (benign duplicate), avoiding divergent control flow:

Lower half (lanes 0-15 in each 32-lane group): data = [own, partner], address = thread's original M index
Upper half (lanes 16-31): data = [partner, own], address = original M index - 4

No LDS staging, barriers, or additional shared memory budget is required. Out-of-bounds stores at tile edges are suppressed by the buffer descriptor's valid_bytes field.
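The lane exchange and duplicate-store scheme described above can be modeled in a few lines of Python. This is a sketch of the net effect as the description states it, not the pass or the codegen itself. It assumes a 64-lane wave, one 16x16 output tile, and that lane `l` owns column `N = l % 16`; these are assumptions made for illustration, and all function names are hypothetical. The exact register-level semantics of `v_permlane16_swap_b32` are not modeled.

```python
# Model of the net effect only; layout details beyond the PR text are assumptions.
WAVE_SIZE = 64   # assumed wave width
M_PER_LANE = 4   # each lane owns 4 consecutive M values (from the description)


def accumulator_fragment(lane):
    """Return (n, m_base, values) for one lane; values are tagged (n, m)."""
    n = lane % 16                        # assumption: N index is lane mod 16
    m_base = M_PER_LANE * (lane // 16)   # lanes 0-15 -> M=0-3, 16-31 -> M=4-7, ...
    return n, m_base, [(n, m_base + i) for i in range(M_PER_LANE)]


def coalesced_store(lane):
    """Return (n, m_start, data) that the lane writes after the exchange."""
    n, m_base, own = accumulator_fragment(lane)
    _, _, partner = accumulator_fragment(lane ^ 16)  # partner is 16 lanes away
    if lane % 32 < 16:
        # Lower half: [own, partner], stored at the lane's original M index.
        return n, m_base, own + partner
    # Upper half: [partner, own], stored at the original M index - 4.
    return n, m_base - M_PER_LANE, partner + own


def simulate(m_total=16):
    """Apply every lane's 8-element store to a flat C^T [N, M] buffer."""
    memory = {}
    for lane in range(WAVE_SIZE):
        n, m_start, data = coalesced_store(lane)
        for i, value in enumerate(data):
            addr = n * m_total + m_start + i
            # Duplicate writes from the two lane halves must carry the same value.
            assert memory.get(addr, value) == value
            memory[addr] = value
    return memory


if __name__ == "__main__":
    tile = simulate()
    print(len(tile), "elements written")
```

In this model lane 0 and lane 16 produce the same eight values at the same address, which is the benign duplicate the description refers to, and the 64 lanes together cover the whole tile.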

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

Signed-off-by: xintin <gaurav.verma@amd.com>


adedespirlet and others added 21 commits March 26, 2026 00:19

add race free optimization wo double barrier

d0bfc20

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

enable minimize shared allocs when conditional

8742c40

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

optimize schedule

11150f0

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

cleaning

d411992

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

Revert unintended compile.py changes from 063261e

60e804a

Signed-off-by: xintin <gaurav.verma@amd.com>

cleaning

bc6447a

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

disable wave runtime

1380708

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

adding assumption to prevent masking in gathertoshared

9c463c2

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

unroll to prevent early counters + magic number logic for dynamic kernel

04b8fec

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

cleaning

0a89744

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

optimization for 256x192

3625242

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

detects when denominator is the same when adding fractions

59e5122

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

opt

96a9cb8

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

cast to bf16

ad3cac4

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com> Signed-off-by: xintin <gaurav.verma@amd.com>

initial working commit

1b19eea

Signed-off-by: xintin <gaurav.verma@amd.com>

updated to a pass

7327fce

Signed-off-by: xintin <gaurav.verma@amd.com>

pass

df5cda0

Signed-off-by: xintin <gaurav.verma@amd.com>

milestone: achieved ds_write_b64

dfba65f

Signed-off-by: xintin <gaurav.verma@amd.com>

milestone: achieved ds_write_b128

159b207

Signed-off-by: xintin <gaurav.verma@amd.com>

fix emit permlane

db23dbb

Signed-off-by: xintin <gaurav.verma@amd.com>

removed LDS staging while storing C

d5eacf6

Signed-off-by: xintin <gaurav.verma@amd.com>

xintin force-pushed the xintin/coalesce_epilogue_store branch from 4f4d867 to d5eacf6 on March 26, 2026 00:19

xintin changed the title ~~[wip] [intermittent failures need debugging] Coalesce epilogue store~~ [wip] Coalesce epilogue store Mar 26, 2026


xintin wants to merge 21 commits into main from xintin/coalesce_epilogue_store
