MXFP4 Cast Transpose Triton [WIP] #422
base: dev
Conversation
wangye805 left a comment
    import numpy as np
    import os

    os.environ["USE_TRITON_FUSED_CAST_TRANSPOSE"] = "1"
We already defined the env var NVTE_USE_CAST_TRANSPOSE_TRITON.
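A minimal sketch of gating on that existing switch instead of introducing a second flag; the default value and parsing shown here are assumptions, not TE's actual logic:

    import os

    def use_cast_transpose_triton() -> bool:
        # Reuse the existing NVTE_USE_CAST_TRANSPOSE_TRITON switch rather than
        # adding a second flag such as USE_TRITON_FUSED_CAST_TRANSPOSE.
        return os.environ.get("NVTE_USE_CAST_TRANSPOSE_TRITON", "0").lower() in ("1", "true")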
    def test_quantize_mxfp4(shape, in_dtype, rowwise, columnwise, shuffle_B_matrix):
        """Test MXFP4 quantization for rowwise/columnwise modes with/without FP4 shuffle.

        Note: FP4 data shuffle (shuffle_B_matrix_for_aiter) is not yet supported in Triton kernel.
If the FP4 data shuffle is not yet supported in the Triton kernel, why do we need to add it here?
This is kept to ensure API consistency between Triton and the upcoming HIP kernel, for which I'll create a separate PR. In the HIP kernel we were able to fuse the shuffle.
HIP vs Triton flow:
Input: BF16 [M, N]
↓
MXFP4Quantizer.update_quantized()
↓
tex.cast_transpose_mxfp4_fused_shuffle() [Single HIP kernel]
↓
├─→ Rowwise FP4 [M, K/2] (MFMA shuffled)
├─→ Rowwise Scale [M_pad, K/32_pad] (shuffled)
├─→ Colwise FP4 [N, M/2] (MFMA shuffled)
└─→ Colwise Scale [N_pad, M/32_pad] (shuffled)
↓
AITER gemm_a4w4 (zero-copy)
vs
Input: BF16 [M, N]
↓
MXFP4Quantizer.update_quantized()
↓
te_cast_transpose_mxfp4_triton() [Triton JIT kernel]
↓
├─→ Rowwise FP4 [M, K/2] (linear layout)
├─→ Rowwise Scale [M_pad, K/32_pad] (shuffled)
├─→ Colwise FP4 [N, M/2] (linear layout)
└─→ Colwise Scale [N_pad, M/32_pad] (shuffled)
↓
aiter.ops.shuffle.shuffle_weight() [External call]
↓
FP4 data → MFMA layout
↓
AITER gemm_a4w4
    (32768, 160),
    (4096, 1632),
    (8, 32, 1024),
    (16, 8, 4, 512),
Can we add some shapes with prime dimensions, like in TransformerEngine/tests/cpp/operator/test_cast_transpose.cu, lines 90 to 92 (9d6b0e5):

    {1, 3221},     // Prime 3221
    {2333, 1},     // Prime 2333
    {1481, 677}};  // Primes 1481, 677
MXFP4 requires dimensions divisible by 32 for per-block scaling compatibility with AITER gemm_a4w4. I have added shapes that should raise the expected assertion error.
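A sketch of the alignment precondition behind those negative cases; the helper name is hypothetical and the real check lives inside the quantizer/kernel:

    MXFP4_BLOCK_SCALING_SIZE = 32  # one E8M0 scale per 32-element block

    def check_mxfp4_shape(M: int, N: int) -> None:
        # Both dims must be divisible by 32 so every scale block is full and the
        # output stays compatible with AITER gemm_a4w4.
        assert M % MXFP4_BLOCK_SCALING_SIZE == 0 and N % MXFP4_BLOCK_SCALING_SIZE == 0, (
            f"MXFP4 requires dims divisible by {MXFP4_BLOCK_SCALING_SIZE}, got ({M}, {N})"
        )

    check_mxfp4_shape(4096, 1632)   # OK: both multiples of 32
    # check_mxfp4_shape(1481, 677)  # would raise: prime dims are not multiples of 32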
    data_atol = 20.0 if in_dtype != torch.float32 else 16.0
    scale_atol = 2.0 if in_dtype != torch.float32 else 1.0
The data tolerance seems quite large. You can follow our MXFP8 scale and data adjustment scheme in TransformerEngine/tests/cpp/test_common.cu, line 730 (9d6b0e5):

    void adjust_ref_for_e8m0_scale_error(const std::string &name,
@wangye805 I closely followed the example and updated the pytest.
        use_torch_semantics=True
    )

    # Compare only valid (non-padded) region - no shuffle extraction needed
What is fp4 shuffle?
The FP4 shuffle rearranges the [M, K/2] linear layout into the MFMA instruction layout (16×16 tiles).
The current training workflow when the TE MXFP4 quantization kernel is used is:
TE Triton kernel → linear FP4 [N, K/2] → aiter.ops.shuffle_weight() → MFMA FP4 → aiter.gemm_a4w4()
You can find the shuffle code in aiter/aiter/ops/shuffle.py.
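For illustration only, a sketch of what a tiled layout rearrangement looks like on the packed byte matrix. It shows the general idea of regrouping data into contiguous 16×16 tiles and is not AITER's actual shuffle_weight permutation (see aiter/aiter/ops/shuffle.py for that):

    import numpy as np

    def tile_16x16(packed: np.ndarray) -> np.ndarray:
        # packed: [N, K/2] uint8 in linear (row-major) layout; both dims assumed
        # divisible by 16. Regroup into 16x16 tiles stored contiguously so a
        # matrix-core instruction can read one tile as a contiguous chunk.
        rows, cols = packed.shape
        tiles = packed.reshape(rows // 16, 16, cols // 16, 16)
        tiles = tiles.transpose(0, 2, 1, 3)                  # (row-tile, col-tile, 16, 16)
        return np.ascontiguousarray(tiles).reshape(rows, cols)

    b_linear = (np.arange(64 * 32) % 256).astype(np.uint8).reshape(64, 32)
    b_tiled = tile_16x16(b_linear)                           # same bytes, tile-major order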
| .value("kFloat8E4M3", transformer_engine::DType::kFloat8E4M3) \ | ||
| .value("kFloat8E5M2", transformer_engine::DType::kFloat8E5M2); \ | ||
| .value("kFloat8E5M2", transformer_engine::DType::kFloat8E5M2) \ | ||
| .value("kFloat4E2M1", transformer_engine::DType::kFloat4E2M1); \ |
If we are going to enable kFloat4E2M1, there are other related changes needed; see https://github.com/search?q=repo%3AROCm%2FTransformerEngine%20kFloat4E2M1&type=code for more details.
    - Data: [M, K/2] uint8 tensor (2 FP4 values packed per byte)
    - Scale: [M, K/32] uint8 tensor (E8M0 format, one scale per 32-element block)
Are there alignment/padding requirements for M and K?
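For context, a minimal decode sketch for the format quoted above. The E2M1 value table and the E8M0 bias of 127 follow the OCP MX spec; the low-nibble-first packing order is an assumption, not something taken from this kernel:

    import numpy as np

    E2M1_LUT = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                         -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0], dtype=np.float32)

    def dequantize_mxfp4(data: np.ndarray, scale: np.ndarray) -> np.ndarray:
        # data:  [M, K/2] uint8, two FP4 (E2M1) codes per byte
        # scale: [M, K/32] uint8, one E8M0 scale per 32-element block
        lo = data & 0x0F                   # assumed: even elements in the low nibble
        hi = data >> 4                     # assumed: odd elements in the high nibble
        codes = np.stack([lo, hi], axis=-1).reshape(data.shape[0], -1)   # [M, K]
        vals = E2M1_LUT[codes]
        scales = np.exp2(scale.astype(np.float32) - 127.0)               # E8M0: 2**(e - 127)
        return vals * np.repeat(scales, 32, axis=1)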
| if inp.ndim < 2: | ||
| return False |
TE currently supports 2D matrices flattened from higher-dimensional tensors; see TransformerEngine/transformer_engine/common/common.h, lines 238 to 262 (9d6b0e5):
    size_t flat_first_dim() const {
      const auto &full_shape = shape();
      size_t ret = 1;
      if (!full_shape.empty()) {
        for (size_t i = 0; i < full_shape.size() - 1; i++) {
          ret *= full_shape[i];
        }
      }
      return ret;
    }
    /*! Matrix width after tensor is flattened to 2D
     *
     * If a tensor has dimensions (D1, D2, ..., Dn), it is reinterpreted
     * as a (D1*D2*...*D(n-1), Dn) matrix.
     */
    size_t flat_last_dim() const {
      const auto &full_shape = shape();
      if (full_shape.empty()) {
        return 1;
      } else {
        return full_shape.back();
      }
    }
  };
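The same convention in Python terms, as a sketch; the shape (16, 8, 4, 512) is one of the test shapes listed in this PR:

    import torch

    def flatten_to_2d(t: torch.Tensor) -> torch.Tensor:
        # (D1, D2, ..., Dn) is reinterpreted as (D1*...*D(n-1), Dn),
        # mirroring flat_first_dim()/flat_last_dim() above.
        return t.reshape(-1, t.shape[-1])

    x = torch.empty(16, 8, 4, 512, dtype=torch.bfloat16)
    assert flatten_to_2d(x).shape == (16 * 8 * 4, 512)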
|
|
    # Allocate PADDED scale tensors for shuffle compatibility
    rowwise_scale_N = K // MXFP4_BLOCK_SCALING_SIZE
    rowwise_scale_M_pad = cdiv(M, 256) * 256
I presume this 256 is from some alignment/padding requirement?
The 256 alignment is required by AITER's CK-based MXFP4 GEMM kernels for the scale tensor swizzle/shuffle layout: 256 = ScaleBlockSize (32) × 8 waves.
See aiter/aiter/utility/fp4_utils.py:398 and gemm_a4w4_blockscale_common.cuh:66.
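A small sketch of that padding arithmetic, using shape values from the test list above; the pad multiple for the K/32 dimension is not shown in the quoted snippet:

    MXFP4_BLOCK_SCALING_SIZE = 32   # elements covered by one E8M0 scale

    def cdiv(a: int, b: int) -> int:
        return (a + b - 1) // b

    M, K = 4096, 1632
    rowwise_scale_N = K // MXFP4_BLOCK_SCALING_SIZE   # 51 scale columns
    rowwise_scale_M_pad = cdiv(M, 256) * 256          # rows padded to 256 = 32-elem block x 8 waves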
    @@ -0,0 +1,178 @@
    # Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
You will need to add this pytest to our CI script, somewhere near TransformerEngine/ci/pytorch.sh, line 74 (9d6b0e5):

    run_default_fa 1 triton_kernels/test_norms.py
done
Description
- Implements the rowwise and columnwise FP32/BF16 -> MXFP4 fused quantization + cast kernel
- Verifies tolerances and adds functional unit tests
The Triton te_cast_transpose_mxfp4_triton kernel currently outputs FP4 data in linear layout [M, N/2] with contiguous byte packing. AITER's gemm_a4w4 requires the B matrix in MFMA shuffle layout for the tensor cores. This layout shuffle can be fused into the Triton kernel in the future.
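A hedged sketch of the interim workflow this implies: run AITER's external shuffle on the Triton kernel's linear-layout output before the GEMM. shuffle_weight is the routine named in the review thread (aiter/aiter/ops/shuffle.py); calling it with a single packed-uint8 tensor is an assumption about its signature.

    import torch
    from aiter.ops.shuffle import shuffle_weight   # external FP4 shuffle (per review thread)

    N, K = 4096, 4096
    # Stand-in for the [N, K/2] uint8 output of te_cast_transpose_mxfp4_triton.
    b_fp4_linear = torch.zeros(N, K // 2, dtype=torch.uint8, device="cuda")
    b_fp4_mfma = shuffle_weight(b_fp4_linear)      # linear layout -> MFMA tile layout for gemm_a4w4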