Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>
This PR runs the merge_contiguous_reads pass earlier in the pipeline (before manual scheduling) so that the schedule can be built from merged read counts instead of only pre-merge reads. The pass also stays in its original late position, so it now runs twice: for some kernels it cannot prove contiguity that early in the pipeline, and the index expressions only become simple enough for the proof after simplify_indices has run. This is the case for the test in scaled_gemm.py (test_dynamic_preshuffle_b_scale_coalescing).
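The two-position idea can be illustrated with a toy model. This is only a sketch under stated assumptions: the dict-based reads, the `simplified` flag, and the bodies of both functions are hypothetical stand-ins that borrow the pass names, not the real implementation.

```python
def merge_contiguous_reads(reads):
    """Toy pass: merge a read into its predecessor when both have
    simplified indices and are back-to-back in memory (hypothetical model)."""
    merged = []
    for r in reads:
        prev = merged[-1] if merged else None
        if (prev is not None and prev["simplified"] and r["simplified"]
                and prev["offset"] + prev["width"] == r["offset"]):
            prev["width"] += r["width"]  # contiguity proven: widen the previous read
        else:
            merged.append(dict(r))  # cannot prove contiguity: keep as a separate read
    return merged


def simplify_indices(reads):
    """Toy pass: mark every index expression as simplified."""
    return [dict(r, simplified=True) for r in reads]


reads = [
    {"offset": 0, "width": 4, "simplified": True},
    {"offset": 4, "width": 4, "simplified": True},
    {"offset": 16, "width": 4, "simplified": False},  # index not yet provable
    {"offset": 20, "width": 4, "simplified": False},
]

early = merge_contiguous_reads(reads)   # early position, before scheduling
late = merge_contiguous_reads(simplify_indices(early))  # original late position
```

In this model the early run merges only the first pair, and the late run picks up the second pair once its indices are simplified, which is why keeping both positions is useful.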
Moving the pass before manual scheduling required two fixes:
Pipelining bug in the 4-wave asymmetric schedule:
After merge_contiguous_reads, ExtractSlice nodes are created between Reads and their consumer Bitcasts. The manual schedule's set_stage only assigns scheduling_parameters to the nodes it is given explicitly, and the ExtractSlice nodes were not among them. This PR adds propagate_scheduling_parameters_to_extract_slices, which copies the scheduling parameters from each source Read to its ExtractSlice nodes.
With this fix, liveness_analysis in construct_pipelined_loop sees the stage gap between each ExtractSlice and its Bitcast, and therefore creates rotating registers to carry the value across pipeline iterations.
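The propagation step described above can be sketched as follows. The `Node` class, the `kind` strings, and the function body are hypothetical and simplified for illustration; they are not the real graph types or the actual body of propagate_scheduling_parameters_to_extract_slices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a graph node."""
    name: str
    kind: str  # "read", "extract_slice", or "bitcast"
    source: Optional["Node"] = None
    scheduling_parameters: Optional[dict] = None


def propagate_to_extract_slices(nodes):
    """Copy scheduling parameters from the source Read to each
    extract-slice node that has none of its own."""
    for node in nodes:
        if node.kind != "extract_slice" or node.scheduling_parameters is not None:
            continue
        src = node.source
        while src is not None and src.kind != "read":
            src = src.source  # walk up until the originating Read
        if src is not None and src.scheduling_parameters is not None:
            node.scheduling_parameters = dict(src.scheduling_parameters)


# The manual schedule assigned stages only to the Read and the Bitcast.
read = Node("read_b", "read", scheduling_parameters={"stage": 0})
slice_0 = Node("slice_0", "extract_slice", source=read)
bitcast = Node("bitcast_0", "bitcast", source=slice_0,
               scheduling_parameters={"stage": 1})

propagate_to_extract_slices([read, slice_0, bitcast])
stage_gap = (bitcast.scheduling_parameters["stage"]
             - slice_0.scheduling_parameters["stage"])
```

Once the slice carries its source Read's stage, the non-zero `stage_gap` between slice and Bitcast is visible, which is the signal a liveness analysis would need to allocate a rotating register.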
The other failure happened in scaled_gemm. The auto-scheduler's create_scheduling_edges skips nodes in ignore_nodes. ExtractSlice was unknown to get_custom_operation_type (it returned None), so it landed in ignore_nodes. This broke the dependency chain: Read → ExtractSlice edges were created, but ExtractSlice → Bitcast edges were not, because ExtractSlice was skipped as a source. Without that edge, Bitcast lost its ordering constraint relative to Read, which broke the stage-transition validation. The fix makes get_custom_operation_type resolve an ExtractSlice recursively to the operation type of its source Read, which keeps it out of ignore_nodes and preserves the full dependency chain.
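A minimal sketch of why the recursive resolution restores the missing edge. Everything here is a hypothetical model: the `Node` class, the type labels in `KNOWN_TYPES`, and the `resolve_slices` switch exist only to contrast the behavior before and after the fix, and do not mirror the real signatures of get_custom_operation_type or create_scheduling_edges.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a graph node."""
    name: str
    kind: str  # "read", "extract_slice", or "bitcast"
    source: Optional["Node"] = None


# Hypothetical operation-type labels for the kinds the scheduler knows.
KNOWN_TYPES = {"read": "READ", "bitcast": "BITCAST"}


def op_type(node, resolve_slices):
    """Return the operation type, or None if the node is unknown.
    With resolve_slices=True an extract-slice inherits its source's type."""
    if node.kind == "extract_slice":
        if not resolve_slices or node.source is None:
            return None
        return op_type(node.source, resolve_slices)
    return KNOWN_TYPES.get(node.kind)


def scheduling_edges(nodes, resolve_slices):
    """Create (source, consumer) edges, skipping ignored nodes as sources."""
    ignore = {n.name for n in nodes if op_type(n, resolve_slices) is None}
    return [(n.source.name, n.name) for n in nodes
            if n.source is not None and n.source.name not in ignore]


read = Node("read_b", "read")
slice_0 = Node("slice_0", "extract_slice", source=read)
bitcast = Node("bitcast_0", "bitcast", source=slice_0)
graph = [read, slice_0, bitcast]

before = scheduling_edges(graph, resolve_slices=False)
after = scheduling_edges(graph, resolve_slices=True)
```

In this model, `before` is missing the slice-to-Bitcast edge because the slice is ignored as a source, while `after` contains the full Read → ExtractSlice → Bitcast chain.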