
Conversation

@samremes commented Feb 6, 2026

Proposed changes

Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered


@ThomasNing left a comment


Thank you for the PR @samremes! Could you share an estimate of when we could have mxfp4 and mxfp8 working?
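For context (my addition, not part of the review): mxfp4 and mxfp8 are OCP Microscaling formats in which a block of 32 elements shares one e8m0 scale, a pure power-of-two exponent with bias 127. A minimal sketch of decoding such a scale (the helper name is mine; 0xFF, the NaN encoding, is deliberately not handled here):

```cpp
#include <cmath>
#include <cstdint>

// Decode an e8m0 shared scale to its value 2^(e - 127).
// Illustrative only -- not the function used in the PR.
inline double decode_e8m0(std::uint8_t e)
{
    return std::ldexp(1.0, static_cast<int>(e) - 127);
}
```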

(std::is_same<T, pk_fp6x16_t>::value && (N == 1)),
(N == 1 || N == 2 || N == 4 || N == 8 || N == 16 || N == 32) ||
(std::is_same<T, pk_fp4_t>::value &&
(N == 1 || N == 2 || N == 4 || N == 8 || N == 16 || N == 32))),

The largest granularity is b128, so I don't think we need N == 32.
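To spell out the arithmetic behind this remark (my sketch, with a hypothetical helper name): at 4 bits per fp4 element, a vector of 32 elements already fills a full 128-bit (b128) access, so larger vector sizes would exceed the widest load granularity mentioned above.

```cpp
#include <cstddef>

// Register footprint, in bits, of a vector of n packed fp4 elements.
// Hypothetical helper for illustration -- not from the PR.
constexpr std::size_t fp4_vector_bits(std::size_t n) { return n * 4u; }

// 32 fp4 elements span exactly one b128 access.
static_assert(fp4_vector_bits(32) == 128, "32 x fp4 == one b128 load");
```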

merge_sequences(sequence<1, 1>{}, b_warp_y_lengths));

// get B scale for this N-K tile using get_y_sliced_thread_data
auto scale_b_slice = scale_b_tensor.get_y_sliced_thread_data(

For the packed scale data, it would be better to use the load_tile_offset + get_thread_buffer() approach.

// warp GEMM with MX scaling
// Cast e8m0_t to int32_t, use OpSel=0 (least significant byte)
constexpr index_t kOpSel = 0; // Always use OpSel=0
WarpGemm{}.template operator()<kOpSel, kOpSel>(

We could load 32 bits at once and select the byte based on the iteration.
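A minimal sketch of this suggestion (the helper name is hypothetical, not from the PR): load four e8m0 scale bytes as a single 32-bit word, then select the byte for iteration k in 0..3, mirroring what the hardware OpSel operand does on a scalar register.

```cpp
#include <cstdint>

// Select byte k (0 = least significant) from a packed 32-bit scale word.
// Illustrative only -- the real kernel would pass k via the OpSel template
// argument rather than shifting in software.
inline std::uint8_t select_scale_byte(std::uint32_t packed, unsigned k)
{
    return static_cast<std::uint8_t>((packed >> (8u * k)) & 0xFFu);
}
```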

});
}

// C += A * B with MX scaling

We could migrate this to the gemm_mx block folder.

// C Distributed tensor: register
// MX scaling support with OpSel
template <typename Problem>
struct BaseMXGemmPipelineAgBgCrCompAsync

If needed, we could first limit this to the non-async version :)
