
Conversation

@alextmagro
Contributor

Removes padding for the scale vectors that are used mainly for MXFP8.

if (params.m % 16 || params.n % 16) {
  GTEST_SKIP() << "MXFP8 requires M & N to be multiples of 16";
}
if (params.k % 128) {

Collaborator

Is it a hipBLASLt limitation?

Contributor Author

Yes, these are the values the hipBLASLt team provided to us. I tested just in case, but nothing smaller than 128 works for K.
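
For reference, a minimal sketch of the constraint being tested, assuming the limits quoted above from the hipBLASLt team (M and N multiples of 16, K a multiple of 128); the helper name is hypothetical:

```python
def mxfp8_gemm_shape_supported(m: int, n: int, k: int) -> bool:
    """Hypothetical check mirroring the limits above: M, N % 16 == 0 and K % 128 == 0."""
    return m % 16 == 0 and n % 16 == 0 and k % 128 == 0


assert mxfp8_gemm_shape_supported(m=32, n=32, k=128)      # the 32x128x32 test config
assert mxfp8_gemm_shape_supported(m=16, n=16, k=128)      # the 16x128x16 test config
assert not mxfp8_gemm_shape_supported(m=32, n=32, k=64)   # K smaller than 128 is rejected
```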

Collaborator

Is the 32x128x32 config needed with 16x128x16 then?

Contributor Author

I would say it makes sense to keep. It allows us to test a TE-acceptable size with 32 while also ensuring that unpadding and hipBLASLt work with 16.

Collaborator

In this case I'd change 32x128x32 to 32x128x16 to test that they work together.

NVTE_DIM_CHECK(chunk_height > 0 && chunk_width > 0, "Attempted to get empty tensor chunk");
NVTE_DIM_CHECK(chunk_height <= height && chunk_width <= width,
               "Attempted to get out-of-bounds tensor chunk");
#ifndef __HIP_PLATFORM_AMD__

Collaborator

I think this file is not currently compiled for ROCm; it is for UB.

Contributor Author

Yes, I can move it to the UB PR if you prefer?

Collaborator

Yes, better to move it to the UB PR, because this file would require more changes than those ifdefs, right?

Contributor Author

Yes, makes sense.

@alextmagro alextmagro requested a review from ipanfilo February 6, 2026 23:41

@pytest.mark.skipif(not mxfp8_available, reason=reason_for_no_mxfp8)
@pytest.mark.parametrize("N", [32])
@pytest.mark.parametrize("K", [128])
@pytest.mark.parametrize("M", [32])

Collaborator

Better to use a non-multiple of 32 to test that this path is unpadding.

Contributor Author

We require block sizes of 32 at the Python level, so it is not possible to use a non-multiple. We are padding scales, so we will see a rowwise scale of (1, 4) padded to (128, 4) and a colwise scale of (4, 1) padded to (4, 128).
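
For illustration, a minimal sketch of the shapes described above; the padded layouts and the slicing are illustrative assumptions about the checkpoint format, not the exact TE code:

```python
import torch


def unpad_scale_sketch(padded: torch.Tensor, logical_shape: tuple) -> torch.Tensor:
    """Hypothetical helper: slice a padded scale-inv tensor back to its logical shape."""
    rows, cols = logical_shape
    return padded[:rows, :cols]


# Rowwise scale of logical shape (1, 4) stored padded to (128, 4);
# columnwise scale of logical shape (4, 1) stored padded to (4, 128).
rowwise_padded = torch.zeros(128, 4, dtype=torch.uint8)
colwise_padded = torch.zeros(4, 128, dtype=torch.uint8)

assert unpad_scale_sketch(rowwise_padded, (1, 4)).shape == (1, 4)
assert unpad_scale_sketch(colwise_padded, (4, 1)).shape == (4, 1)
```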

return 0.0


def unpad_scales(tensor: torch.Tensor, transpose: bool) -> torch.Tensor:

Collaborator

I wonder if it can be called once when tensors are created?

Contributor Author

When a user creates a tensor for the first time, we don't do padding to begin with; this logic is for loading NV checkpoints only. I was thinking that when we load a PyTorch checkpoint, the tensors are filled without calling the initializers, potentially missing this logic. Is there a way to guarantee the unpadding function is called when a tensor is loaded?

Collaborator

I think init should be called when the checkpoint is loading. Or loading can be intercepted by overriding load_from_state_dict.
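
For illustration, a minimal sketch of that interception idea using PyTorch's `_load_from_state_dict` hook; the module class, the `_scale_inv` key suffix, and the `unpad_scales` stub are assumptions, not the actual TE implementation:

```python
import torch


def unpad_scales(tensor: torch.Tensor, transpose: bool) -> torch.Tensor:
    """Stand-in for the test helper above; the real version would slice off the padding."""
    return tensor


class MXFP8Module(torch.nn.Module):  # hypothetical module, for illustration only
    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                              missing_keys, unexpected_keys, error_msgs):
        # Unpad any padded scale-inv entries before the default copy logic runs,
        # so loading an NV checkpoint cannot bypass the unpadding step.
        for key in list(state_dict.keys()):
            if key.startswith(prefix) and key.endswith("_scale_inv"):
                state_dict[key] = unpad_scales(state_dict[key], transpose=False)
        super()._load_from_state_dict(state_dict, prefix, local_metadata, strict,
                                      missing_keys, unexpected_keys, error_msgs)
```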
