Enhance CUDA flash attention kernel selection for DKQ=512 with low gqa_ratio #6
Open
Ooooze wants to merge 1 commit into
Conversation
This update modifies the kernel selection logic in the CUDA flash attention implementation. Specifically, when the K/Q head dimension (DKQ) is 512 and the gqa_ratio is less than 3, the code now routes to the TILE kernel instead of falling through to an abort. This avoids the crash and improves compatibility for hardware and model configurations that previously had no supported kernel path, such as Gemma 4 E4B.
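A minimal sketch of the dispatch change described above, for reference. The identifier names (`select_fattn_kernel`, `FATTN_KERNEL_TILE`, etc.) are illustrative assumptions, not the actual symbols in the CUDA source; only the DKQ=512 / gqa_ratio<3 condition comes from this PR.

```cpp
// Hedged sketch of the kernel selection change: names and surrounding
// structure are assumptions, not the real ggml-cuda identifiers.
enum fattn_kernel_type {
    FATTN_KERNEL_NONE, // no supported kernel -> caller aborts
    FATTN_KERNEL_TILE, // generic tile-based kernel
    FATTN_KERNEL_MMA,  // tensor-core (MMA) kernel
};

static fattn_kernel_type select_fattn_kernel(int DKQ, int gqa_ratio) {
    if (DKQ == 512) {
        // Before this change, a low gqa_ratio at DKQ == 512 fell through
        // to the abort condition; now it routes to the TILE kernel.
        if (gqa_ratio < 3) {
            return FATTN_KERNEL_TILE;
        }
        return FATTN_KERNEL_MMA;
    }
    // ... selection for the other supported head sizes ...
    return FATTN_KERNEL_NONE;
}
```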
"Confirming this still reproduces on 31B (Q4_K_XL target, Q8_0 assistant, RTX 3090 sm_8.6)" |