Remove padding from scales for hipBLASlt calls #442
Conversation
if (params.m % 16 || params.n % 16) {
  GTEST_SKIP() << "MXFP8 requires M & N to be multiples of 16";
}
if (params.k % 128) {
Is this a hipBLASLt limitation?
Yes, these are the values that the hipBLASLt team provided to us. I tested just in case, but nothing smaller than 128 works for K.
Is the 32x128x32 config still needed alongside 16x128x16 then?
I would say it makes sense to keep it. This allows us to test a TE-acceptable size with 32 while also ensuring that unpadding and hipBLASLt work with 16.
In that case I'd change 32x128x32 to 32x128x16 to test that they work together.
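For reference, a minimal Python sketch of the shape constraints discussed in this thread, assuming an M x K x N ordering for the configs; the helper name and config list are illustrative only, not the actual test harness:

# hipBLASLt MXFP8 GEMM constraints from this thread: M and N must be
# multiples of 16, and K must be a multiple of 128.
def is_valid_mxfp8_gemm(m: int, k: int, n: int) -> bool:
    return m % 16 == 0 and n % 16 == 0 and k % 128 == 0

# Illustrative configs: the smallest accepted shape plus the 32x128x16
# variant suggested above.
configs = [(16, 128, 16), (32, 128, 16)]
assert all(is_valid_mxfp8_gemm(m, k, n) for m, k, n in configs)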
NVTE_DIM_CHECK(chunk_height > 0 && chunk_width > 0, "Attempted to get empty tensor chunk");
NVTE_DIM_CHECK(chunk_height <= height && chunk_width <= width,
               "Attempted to get out-of-bounds tensor chunk");
#ifndef __HIP_PLATFORM_AMD__
I think this file is not currently compiled for ROCm - it is for UB
Yes, I can move it to the UB PR if you prefer?
Yes, better to move it to the UB PR, because this file would require more changes than just those ifdefs, right?
Yes, makes sense.
tests/pytorch/test_sanity.py (outdated)
@pytest.mark.skipif(not mxfp8_available, reason=reason_for_no_mxfp8)
@pytest.mark.parametrize("N", [32])
@pytest.mark.parametrize("K", [128])
@pytest.mark.parametrize("M", [32])
Better to use a non-multiple of 32 to test that this path actually exercises unpadding.
We require block sizes of 32 at the Python level, so it's not possible to use a non-multiple. We are padding scales, so we will see a rowwise scale of (1, 4) padded to (128, 4), and a colwise scale of (4, 1) padded to (4, 128).
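For illustration, a minimal PyTorch sketch of the padding described above; the dtype and the slicing are assumptions for this example, not the TE implementation. The padded scales are simply sliced back to their logical shapes:

import torch

# Rowwise scale of logical shape (1, 4) padded to (128, 4), and colwise
# scale of logical shape (4, 1) padded to (4, 128), as described above.
rowwise_scale = torch.zeros(128, 4, dtype=torch.uint8)
colwise_scale = torch.zeros(4, 128, dtype=torch.uint8)

# Unpadding is a slice back to the logical shape.
rowwise_unpadded = rowwise_scale[:1, :4]
colwise_unpadded = colwise_scale[:4, :1]
assert rowwise_unpadded.shape == (1, 4) and colwise_unpadded.shape == (4, 1)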
return 0.0


def unpad_scales(tensor: torch.Tensor, transpose: bool) -> torch.Tensor:
I wonder if it can be called once when tensors are created?
When a user creates a tensor for the first time, we don't do padding to begin with -- this logic is only for loading NV checkpoints. I was thinking that when we load a PyTorch checkpoint, the tensors are filled without calling the initializers, potentially missing this logic. Is there a way to guarantee the unpadding function is called when a tensor is loaded?
I think init should be called when a checkpoint is loaded. Alternatively, loading can be intercepted by overriding load_from_state_dict.
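A hedged sketch of that interception idea, assuming PyTorch's _load_from_state_dict hook; the key name "weight_scale_inv" and the unpad_scales body below are placeholders for illustration, not the actual TE code:

import torch

def unpad_scales(tensor: torch.Tensor, transpose: bool) -> torch.Tensor:
    # Placeholder: slice a padded scale back to an assumed logical shape,
    # e.g. (128, 4) -> (1, 4) rowwise, (4, 128) -> (4, 1) colwise.
    return tensor[:, :1] if transpose else tensor[:1, :]

class ScaledModule(torch.nn.Module):
    def _load_from_state_dict(self, state_dict, prefix, *args, **kwargs):
        # Unpad any padded scale entries before the regular load path runs.
        key = prefix + "weight_scale_inv"  # hypothetical key name
        if key in state_dict:
            state_dict[key] = unpad_scales(state_dict[key], transpose=False)
        super()._load_from_state_dict(state_dict, prefix, *args, **kwargs)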
Removes padding for scale vectors that are used mainly for MXFP8.