314dad9 to
1c33bc7
examples/python/7.1_schedule.py
schedule = get_mxfp4_asymmetric_schedule(
    eliminate_epilogue=eliminate_epilogue, is_bscale_shuffled=True
)
if no_unroll:
Do we need a new schedule for this? Can it be an option in get_mxfp4_asymmetric_schedule ?
The new schedule is necessary. Without it, the kernel blows the register budget, even when unrolling is skipped. The new schedule moves ops within the schedule to reduce memory pressure.
Can this not be an option in get_mxfp4_asymmetric_schedule? If the only change in the asymmetric schedule is disabling the unroll factor, you can do something like this in the asymmetric schedule:
if not no_unroll:  # passed as an option, default False
    tkw.unroll ...
I've pushed a commit that now “uses the same schedule”, but with conditions inside the schedule based on whether unrolling is used. I don't think it's really an improvement, though it may make the differences easier to see. The new schedule has 3 clusters instead of 2 and uses different interleaving; it is a fairly different schedule. The original schedule blows the register budget, even with no unrolling.
It is a no_unroll schedule to get under the register budget. This gets the macro tile functional with the waveasm backend.
For the 7.1 example, it adds
- `--wave_shape` flag -- Previously (1,4) was hard-coded, but the 256x224x256 tile needed (2, 2) because the N dimension was not divisible by 4 after pipelining... I think was the reason we chose that.
- `--no_unroll` flag to access the new no_unroll schedule.
The particular 7.1 example target for this work was
`python examples/python/7.1_schedule.py --block 256,224,256 --shape 1024,896,8192 --wave_shape 2,2 --no-unroll --test test_dbuf_4wave_mxfp_preshuffle_b_gemm_cpp`
This also adds an e2e waveasm test.
At this stage no real effort has been made to make the schedule performant, just to get it working.
Signed-off-by: William G Hatch <william@hatch.uno>
…c schedule
The no-unroll path needs a different kernel interleaving strategy than the unrolled path: 2-group interleaving (shared A loads interleaved with MMA) with B loads and G2S prefetches in a separate third cluster, rather than 4-group interleaving that folds B loads and G2S directly into the two MMA clusters.
The 4-group pattern was designed for the unrolled kernel where the larger loop body can absorb the extra live values; with unroll_factor=1 the tighter loop needs the third cluster to keep VGPR pressure in check.
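The difference between the two interleaving strategies can be pictured with a toy model in plain Python. This is only an illustrative sketch, not the Wave scheduling API: `interleave`, `halves`, `build_clusters`, the op names, and the round-robin ordering within a cluster are all hypothetical. It only shows how one builder with an `unrolled` switch yields two clusters in one case and three in the other.

```python
from itertools import zip_longest


def interleave(*groups):
    """Round-robin merge of several op lists, skipping exhausted groups."""
    merged = []
    for bundle in zip_longest(*groups):
        merged.extend(op for op in bundle if op is not None)
    return merged


def halves(ops):
    """Split a list of ops into a first and second half."""
    mid = (len(ops) + 1) // 2
    return ops[:mid], ops[mid:]


def build_clusters(mma, a_loads, b_loads, g2s, unrolled):
    """Return a list of clusters, each an ordered list of op names (toy model)."""
    mma0, mma1 = halves(mma)
    a0, a1 = halves(a_loads)
    if unrolled:
        # 4-group interleaving: B loads and G2S are folded into the two MMA clusters.
        b0, b1 = halves(b_loads)
        g0, g1 = halves(g2s)
        return [interleave(mma0, a0, b0, g0), interleave(mma1, a1, b1, g1)]
    # 2-group interleaving: only A loads share a cluster with MMA;
    # B loads and G2S prefetches go into a separate third cluster.
    return [interleave(mma0, a0), interleave(mma1, a1), b_loads + g2s]


ops = dict(
    mma=["mma0", "mma1", "mma2", "mma3"],
    a_loads=["a0", "a1"],
    b_loads=["b0", "b1"],
    g2s=["g2s0", "g2s1"],
)
for unrolled in (True, False):
    print(unrolled, build_clusters(unrolled=unrolled, **ops))
```

In this model the no-unroll branch keeps fewer kinds of ops in flight inside each MMA cluster, which is the intuition the commit message gives for lower VGPR pressure; the real schedule's ordering and cluster contents may differ.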
1c33bc7 to
b782e12
It is a no_unroll schedule to get under the register budget. This gets the macro tile functional with the waveasm backend.
For the 7.1 example, it adds
- `--wave_shape` flag -- Previously (1,4) was hard-coded, but the 256x224x256 tile needed (2, 2) because the N dimension was not divisible by 4 after pipelining... I think was the reason we chose that.
- `--no_unroll` flag to access the new no_unroll schedule.
The particular 7.1 example target for this work was
`python examples/python/7.1_schedule.py --block 256,224,256 --shape 1024,896,8192 --wave_shape 2,2 --no-unroll --test test_dbuf_4wave_mxfp_preshuffle_b_gemm_cpp`
This also adds an e2e waveasm test.
At this stage no real effort has been made to make the schedule performant, just to get it working.
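For readers who want to see how flags like these are typically wired up, here is a minimal `argparse` sketch. It is not the actual code from `7.1_schedule.py`: `int_tuple` and `build_parser` are hypothetical names, the `(1, 4)` default is assumed from the previously hard-coded value, and registering both `--no_unroll` and `--no-unroll` is an assumption made because the description uses both spellings.

```python
import argparse


def int_tuple(text):
    """Parse a comma-separated string such as "2,2" into a tuple of ints."""
    return tuple(int(part) for part in text.split(","))


def build_parser():
    """Hypothetical parser mirroring the flags named in the PR description."""
    parser = argparse.ArgumentParser(description="sketch of the 7.1 example flags")
    parser.add_argument("--block", type=int_tuple, default=None)
    parser.add_argument("--shape", type=int_tuple, default=None)
    # Assumed default: the wave shape that used to be hard-coded.
    parser.add_argument("--wave_shape", type=int_tuple, default=(1, 4))
    # Both spellings are accepted here; the real script may accept only one.
    parser.add_argument("--no_unroll", "--no-unroll", action="store_true")
    parser.add_argument("--test", default=None)
    return parser


args = build_parser().parse_args(
    [
        "--block", "256,224,256",
        "--shape", "1024,896,8192",
        "--wave_shape", "2,2",
        "--no-unroll",
        "--test", "test_dbuf_4wave_mxfp_preshuffle_b_gemm_cpp",
    ]
)
print(args.block, args.shape, args.wave_shape, args.no_unroll, args.test)
```

Parsing the shapes into tuples at the command-line boundary keeps the rest of the script free of string handling, and a `store_true` flag leaves the unrolled schedule as the default.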