[wip] Coalesce epilogue store#1194

Draft
xintin wants to merge 21 commits into main from xintin/coalesce_epilogue_store

@xintin xintin commented Mar 25, 2026

In MXFP4 GEMM kernels with bf16 output, the MMA accumulator layout (F32_16x16x128_F8F6F4) distributes results across threads in a non-contiguous pattern. Without a transposed output layout, each thread issues narrow, scattered global stores (buffer_store_short), which underutilizes memory bandwidth. A transposed output layout (C^T [N, M]) makes per-lane MMA accumulator elements contiguous in the fast (M) dimension, improving store width to buffer_store_dwordx2 (4 bf16 per thread). The compiler pass in this PR doubles it further to buffer_store_dwordx4 (8 bf16).
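The store widths above follow from simple byte arithmetic. The sketch below is only an illustration: it assumes 2 bytes per bf16 element and the standard sizes of the three store instructions named in the description, and the helper names are hypothetical, not part of this PR.

```python
BF16_BYTES = 2  # bf16 is a 16-bit format

# Bytes written per lane by each store instruction named in the description.
STORE_BYTES = {
    "buffer_store_short": 2,     # one 16-bit value
    "buffer_store_dwordx2": 8,   # two 32-bit dwords
    "buffer_store_dwordx4": 16,  # four 32-bit dwords
}


def bf16_per_store(instr: str) -> int:
    """Number of bf16 elements one lane writes with a single store."""
    return STORE_BYTES[instr] // BF16_BYTES


def stores_needed(instr: str, num_bf16: int) -> int:
    """Stores one lane must issue to write num_bf16 elements (ceiling division)."""
    per_store = bf16_per_store(instr)
    return -(-num_bf16 // per_store)


if __name__ == "__main__":
    for name in STORE_BYTES:
        print(name, bf16_per_store(name), stores_needed(name, 8))
```

For 8 bf16 outputs, this model gives 8 stores with buffer_store_short, 2 with buffer_store_dwordx2, and 1 with buffer_store_dwordx4.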

The coalesce_epilogue_stores pass marks epilogue bf16 global writes for permlane packing. At codegen time, v_permlane16_swap_b32 exchanges each thread's 4 bf16 values (2 packed dwords) with a partner lane 16 positions apart. Since the MMA layout groups lanes by 16 (lanes 0-15 own M=0-3, lanes 16-31 own M=4-7, etc.), this produces 8 consecutive M values per lane, written as a single buffer_store_dwordx4.
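The lane pairing can be sketched in a few lines of Python. This is a simplified model of the outcome described above, not of the register-level semantics of v_permlane16_swap_b32. It assumes a 64-lane wave and that the "etc." in the lane grouping continues as lanes 32-47 owning M=8-11 and lanes 48-63 owning M=12-15. All function names are hypothetical.

```python
ROW = 16             # lanes per group in the MMA layout, per the description
VALUES_PER_LANE = 4  # bf16 accumulator values each lane owns before packing


def own_m(lane: int) -> list[int]:
    """M indices a lane owns before the exchange (assumed layout)."""
    start = (lane // ROW) * VALUES_PER_LANE
    return list(range(start, start + VALUES_PER_LANE))


def partner(lane: int) -> int:
    """Partner lane 16 positions away, inside the same 32-lane group."""
    return lane ^ ROW


def packed_m(lane: int) -> list[int]:
    """M indices a lane holds once it also has its partner's values."""
    return sorted(own_m(lane) + own_m(partner(lane)))


if __name__ == "__main__":
    for lane in (0, 16, 32, 48):
        print(lane, partner(lane), packed_m(lane))
```

Under these assumptions, lanes 0 and 16 both end up holding M=0-7, and lanes 32 and 48 both hold M=8-15, which is what allows a single 8-element store per lane.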

Both lane halves write identical data to the same global address (a benign duplicate), which avoids divergent control flow:

  • Lower half (lanes 0-15 in each 32-lane group): data = [own, partner], address = thread's original M index
  • Upper half (lanes 16-31 in each 32-lane group): data = [partner, own], address = original M index - 4

No LDS staging, barriers, or additional shared memory budget is required. Out-of-bounds stores at tile edges are suppressed by the buffer descriptor's valid_bytes field.
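The duplicate-write property can be checked with a small simulation. This is a sketch under stated assumptions, not the pass's implementation: it assumes that lane % 16 selects the N column, that each lane's original M index is (lane // 16) * 4, and that the wave has 64 lanes. The names accumulator and packed_store are hypothetical, and the Python branch stands in for whatever branch-free select the generated code actually uses.

```python
ROW = 16             # lanes per group in the MMA layout
VALUES_PER_LANE = 4  # bf16 values each lane owns before packing


def accumulator(lane: int) -> list[tuple[int, int]]:
    """Label each value a lane owns by its (n, m) coordinate (assumed layout)."""
    n = lane % ROW
    m0 = (lane // ROW) * VALUES_PER_LANE
    return [(n, m0 + i) for i in range(VALUES_PER_LANE)]


def packed_store(lane: int) -> tuple[int, list[tuple[int, int]]]:
    """Return (starting M index, 8 values) for a lane's single wide store."""
    own = accumulator(lane)
    other = accumulator(lane ^ ROW)        # partner lane, 16 positions away
    m_index = (lane // ROW) * VALUES_PER_LANE
    is_upper = (lane // ROW) % 2 == 1      # lanes 16-31 of each 32-lane group
    if is_upper:
        # Upper half: [partner, own], address moved back by 4 elements.
        return m_index - VALUES_PER_LANE, other + own
    # Lower half: [own, partner], address unchanged.
    return m_index, own + other


if __name__ == "__main__":
    # Every lane and its partner produce the same address and the same data.
    for lane in range(64):
        assert packed_store(lane) == packed_store(lane ^ ROW)
    print(packed_store(0))
    print(packed_store(16))
```

Because both lanes of a pair compute the same address and payload, no lane needs to be masked off, at the cost of writing each 8-element segment twice.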

adedespirlet and others added 21 commits March 26, 2026 00:19
Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
@xintin xintin force-pushed the xintin/coalesce_epilogue_store branch from 4f4d867 to d5eacf6 Compare March 26, 2026 00:19
@xintin xintin changed the title from "[wip] [intermittent failures need debugging] Coalesce epilogue store" to "[wip] Coalesce epilogue store" Mar 26, 2026
2 participants