Add schedule for 256x224x256 macro tile #1129

Open
willghatch wants to merge 2 commits into main from users/willghatch/mt-256x224x256

@willghatch
Contributor

This PR adds a no_unroll schedule that keeps the 256x224x256 macro tile under the register budget, which makes the macro tile functional with the waveasm backend.

For the 7.1 example, it adds

  • --wave_shape flag -- Previously (1, 4) was hard-coded, but the 256x224x256 tile needed (2, 2) because the N dimension was not divisible by 4 after pipelining (I think that was the reason we chose it).
  • --no_unroll flag to access the new no_unroll schedule.

The particular 7.1 example target for this work was python examples/python/7.1_schedule.py --block 256,224,256 --shape 1024,896,8192 --wave_shape 2,2 --no-unroll --test test_dbuf_4wave_mxfp_preshuffle_b_gemm_cpp
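
As a rough illustration of the flags above, the following sketch parses comma-separated tuples with `argparse` and splits the block across the wave grid. It is not the code in `7.1_schedule.py`: the helper names `int_tuple`, `build_parser`, and `per_wave_tile` are hypothetical, accepting both `--no_unroll` and `--no-unroll` is an assumption made because the description and the command line spell the flag differently, and dividing the M and N block sizes evenly by the wave shape is likewise assumed.

```python
import argparse


def int_tuple(text):
    """Parse a comma-separated string such as "2,2" into a tuple of ints."""
    return tuple(int(part) for part in text.split(","))


def build_parser():
    """Hypothetical parser; the real example script may define its flags differently."""
    parser = argparse.ArgumentParser(description="Illustrative flag parsing only")
    parser.add_argument("--block", type=int_tuple, default=(256, 224, 256))
    parser.add_argument("--shape", type=int_tuple, default=(1024, 896, 8192))
    # (1, 4) was the previously hard-coded value, so it is used as the default.
    parser.add_argument("--wave_shape", type=int_tuple, default=(1, 4))
    # Assumption: accept both spellings that appear in the PR description.
    parser.add_argument(
        "--no_unroll", "--no-unroll", dest="no_unroll", action="store_true"
    )
    return parser


def per_wave_tile(block, wave_shape):
    """Split the M and N block sizes evenly across the wave grid (assumed layout)."""
    block_m, block_n, _ = block
    waves_m, waves_n = wave_shape
    if block_m % waves_m or block_n % waves_n:
        raise ValueError(f"block {block} does not divide across waves {wave_shape}")
    return block_m // waves_m, block_n // waves_n


args = build_parser().parse_args(
    "--block 256,224,256 --shape 1024,896,8192 --wave_shape 2,2 --no-unroll".split()
)
print(per_wave_tile(args.block, args.wave_shape))  # (128, 112)
print(per_wave_tile(args.block, (1, 4)))  # (256, 56)
```

Under these assumptions, (1, 4) leaves each wave 56 columns of N while (2, 2) leaves 112. If the MMA tile is 16 wide in N (an assumption; the PR does not say), only 112 is a whole multiple, which may be the divisibility problem the description alludes to.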

This also adds an e2e waveasm test.

At this stage no real effort has been made to make the schedule performant, just to get it working.

@willghatch willghatch force-pushed the users/willghatch/mt-256x224x256 branch from 314dad9 to 1c33bc7 Compare March 13, 2026 22:44
@willghatch willghatch requested a review from harsh-nod March 13, 2026 22:45
schedule = get_mxfp4_asymmetric_schedule(
    eliminate_epilogue=eliminate_epilogue, is_bscale_shuffled=True
)
if no_unroll:
Collaborator

Do we need a new schedule for this? Can it be an option in get_mxfp4_asymmetric_schedule ?

Contributor Author

The new schedule is necessary. Without the new schedule it blows the register budget (even if skipping the unrolling for the schedule). The new schedule moves ops within the schedule to reduce memory pressure.

Contributor

Can this not be an option in get_mxfp4_asymmetric_schedule? If the only change in the asymmetric schedule is disabling the unroll factor, you can do something like this in the asymmetric schedule:

if not no_unroll:  # passed as an option, default False
    tkw.unroll ...
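
The suggestion above amounts to gating one step of the schedule on an option. A minimal sketch of that pattern follows, using plain strings as stand-ins for real scheduling calls; `build_schedule_steps`, the step names, and the `unroll_factor=2` default are all hypothetical and do not reflect the actual body of `get_mxfp4_asymmetric_schedule`.

```python
def build_schedule_steps(no_unroll=False, unroll_factor=2):
    """Return schedule steps as strings; stand-ins for real scheduling calls."""
    steps = ["pipeline loop", "reorder clusters"]
    if not no_unroll:  # option defaults to False, so unrolling stays the default
        steps.append(f"unroll by {unroll_factor}")
    return steps


print(build_schedule_steps())
# ['pipeline loop', 'reorder clusters', 'unroll by 2']
print(build_schedule_steps(no_unroll=True))
# ['pipeline loop', 'reorder clusters']
```

A single gate like this only covers the case where unrolling is the sole difference between the two paths.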

Contributor Author

I've pushed a commit that now “uses the same schedule”, but with conditions inside the schedule based on whether it unrolls or not. I don't think it's really an improvement, but it may make the differences easier to see. The new schedule has 3 clusters instead of 2 and uses different interleaving, so it is a fairly different schedule. The original schedule blows the register budget, even with no unrolling.

It is a no_unroll schedule to get under the register budget.
This gets the macro tile functional with the waveasm backend.

For the 7.1 example, it adds
- `--wave_shape` flag -- Previously (1, 4) was hard-coded, but the 256x224x256 tile needed (2, 2) because the N dimension was not divisible by 4 after pipelining (I think that was the reason we chose it).
- `--no_unroll` flag to access the new no_unroll schedule.

The particular 7.1 example target for this work was
`python examples/python/7.1_schedule.py --block 256,224,256 --shape 1024,896,8192 --wave_shape 2,2 --no-unroll --test test_dbuf_4wave_mxfp_preshuffle_b_gemm_cpp`

This also adds an e2e waveasm test.

At this stage no real effort has been made to make the schedule performant, just to get it working.

Signed-off-by: William G Hatch <william@hatch.uno>
…c schedule

The no-unroll path needs a different kernel interleaving strategy than
the unrolled path: 2-group interleaving (shared A loads interleaved
with MMA) with B loads and G2S prefetches in a separate third cluster,
rather than 4-group interleaving that folds B loads and G2S directly
into the two MMA clusters.  The 4-group pattern was designed for the
unrolled kernel where the larger loop body can absorb the extra live
values; with unroll_factor=1 the tighter loop needs the third cluster
to keep VGPR pressure in check.
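
The two layouts described in this commit message can be pictured with a toy model that treats ops as strings. This is only a sketch of the grouping: the op names, the even split of each op list across the two MMA clusters, the round-robin order, and the ordering inside the third cluster are all assumptions, and nothing here models VGPR pressure or the real scheduling API.

```python
from itertools import chain, zip_longest


def interleave(*groups):
    """Round-robin merge of several op lists, skipping exhausted groups."""
    merged = chain.from_iterable(zip_longest(*groups))
    return [op for op in merged if op is not None]


def split_half(ops):
    """Split a list into two halves, giving the first half any odd element."""
    mid = (len(ops) + 1) // 2
    return ops[:mid], ops[mid:]


def build_clusters(mma, a_loads, b_loads, g2s, no_unroll):
    """Return the per-cluster op order for the unrolled or no-unroll layout."""
    mma0, mma1 = split_half(mma)
    a0, a1 = split_half(a_loads)
    if no_unroll:
        # 2-group interleaving (MMA + shared A loads), plus a third cluster
        # that holds the B loads and the G2S prefetches.
        return [interleave(mma0, a0), interleave(mma1, a1), b_loads + g2s]
    # 4-group interleaving: B loads and G2S are folded into both MMA clusters.
    b0, b1 = split_half(b_loads)
    g0, g1 = split_half(g2s)
    return [interleave(mma0, a0, b0, g0), interleave(mma1, a1, b1, g1)]


ops = dict(
    mma=["mma0", "mma1", "mma2", "mma3"],
    a_loads=["ldA0", "ldA1"],
    b_loads=["ldB0", "ldB1"],
    g2s=["g2s0", "g2s1"],
)
for cluster in build_clusters(**ops, no_unroll=True):
    print(cluster)
# ['mma0', 'ldA0', 'mma1']
# ['mma2', 'ldA1', 'mma3']
# ['ldB0', 'ldB1', 'g2s0', 'g2s1']
```

With `no_unroll=False` the same inputs yield two clusters, each carrying its share of all four op groups.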
@willghatch willghatch force-pushed the users/willghatch/mt-256x224x256 branch from 1c33bc7 to b782e12 Compare March 16, 2026 23:54