
Enhance CUDA flash attention kernel selection for DKQ=512 with low gqa_ratio #6

Open
Ooooze wants to merge 1 commit into feature/turboquant-kv-cache from fix/cuda-mma-dkq512-fallback

Conversation


Ooooze commented May 8, 2026

Enhance CUDA flash attention kernel selection for DKQ=512 with low gqa_ratio

This update modifies the kernel selection logic in the CUDA flash attention implementation. When the K/Q head size (DKQ) is 512 and the gqa_ratio is less than 3, dispatch now routes to the TILE kernel instead of falling through to an abort. This restores support for configurations that previously crashed on this path, notably models like Gemma 4 E4B.
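A minimal sketch of the dispatch change, assuming a selection function in the style of a CUDA flash attention backend. All names here (`select_fattn_kernel`, the `fattn_kernel` enum, `mma_supported`) are hypothetical stand-ins for the repository's actual code; only the `gqa_ratio < 3` branch reflects what this PR adds:

```cpp
#include <cstdlib>

// Hypothetical kernel IDs; the real code dispatches to concrete launch functions.
enum fattn_kernel { FATTN_MMA, FATTN_TILE, FATTN_VEC };

static fattn_kernel select_fattn_kernel(const int dkq, const int gqa_ratio, const bool mma_supported) {
    if (dkq == 512) {
        if (gqa_ratio < 3) {
            // New in this PR: low grouped-query ratios cannot take the MMA
            // path here, so fall back to the TILE kernel instead of aborting.
            return FATTN_TILE;
        }
        if (!mma_supported) {
            // Unchanged: the remaining unsupported DKQ=512 cases still abort.
            abort();
        }
    }
    return FATTN_MMA;
}
```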


@danganbenpa

"Confirming this still reproduces on 31B (Q4_K_XL target, Q8_0 assistant, RTX 3090 sm_8.6)"
"PR #6's gating doesn't help: 31B's global layers have gqa_ratio=8, condition gqa_ratio < 3 doesn't fire, hits the same abort"

