Repro: MTP path on CUDA aborts at fattn.cu:109 (DKQ=512) for Gemma 4 — Blackwell sm_120 + Ampere sm_86 #5
Hi @jameseiten — thanks a lot for the detailed reproduction and the clean root-cause writeup; that made the diagnosis trivial.

I went with your suggested option #2 (route DKQ=512 to TILE), since `fattn-tile` already has a full DKQ=DV=512 path with ncols2 ∈ {1, 2} fallback after 425db5b, and the NVIDIA fp16/fp32 kernel-config tables were extended in that same commit, so the path is ready to use. The dispatcher guard is a single line:

`if (Q->ne[0] == 512 && gqa_ratio < 3) {`

I don't have an NVIDIA box at hand to verify against your repro. Since you're already set up with both Blackwell sm_120 and Ampere sm_86, plus the exact GGUFs and the throughput preset script — would you mind giving #6 a spin against the same reproduction? I'm especially curious whether `-ctk turbo3 -ctv turbo3` works through the TILE path on NVIDIA the way it does on AMD gfx1150 (where it was tested for 425db5b).

Thanks again.
MTP path on CUDA aborts at fattn.cu:109 (DKQ=512) for Gemma 4 — reproducible on Blackwell sm_120 and Ampere sm_86
Summary
The MTP speculative-decoding worker crashes with a `GGML_ABORT("fatal error")` in `ggml/src/ggml-cuda/fattn.cu:109` whenever Gemma 4's global-attention layer (head_dim = 512) is processed via the FA-MMA kernel. It happens on both Blackwell (RTX 5060 Ti, sm_120) and Ampere (RTX 3060, sm_86), so it's not architecture-specific. Non-MTP inference of the same target is unaffected.

The bug appears specific to the CUDA `fattn-mma` path; the recent `fattn-tile` DKQ=DV=512 fix (425db5b) doesn't cover it. The Apple Metal benchmarks in the README (M4 Max, TurboFlash kernel) use a different code path and are unaffected.

Reproduction
Pinned commit: `2e81dc5` (feature/turboquant-kv-cache, today's HEAD).

Build

(Builds cleanly. `version: 72 (2e81dc5)`.)

Models
- `unsloth/gemma-4-E4B-it-GGUF` Q8_K_XL (`gemma-4-E4B-it-UD-Q8_K_XL.gguf`)
- `AtomicChat/gemma-4-E4B-it-assistant-GGUF` Q4_K_M (verified clean by `scripts/verify-gemma4-assistant-gguf.py`)

Serve command (matches `scripts/run-gemma4-e4b-mtp-server.sh` `throughput` preset)

Server starts cleanly:
Trigger
Any chat-completion request with non-trivial output is enough. Example payload:
```json
{
  "model": "gemma-4-E4B-it-UD-Q8_K_XL.gguf",
  "messages": [
    {"role": "system", "content": "Output strict JSON: {severity, escalate, confidence, rationale}."},
    {"role": "user", "content": "Finding: prompt-injection attempt. Bot refused. No leak."}
  ],
  "max_tokens": 128,
  "temperature": 0.0
}
```

Crash
The non-MTP path (same target, no `--mtp-head` / `--spec-type`) responds correctly, so the model itself loads and runs fine; only the MTP worker hits the abort.

Root cause analysis
`fattn.cu:109` is in `ggml_cuda_flash_attn_ext_mma_f16_switch_ncols2`:

Gemma 4 has heterogeneous head dimensions:

- `head_dim = 256` → DKQ = 256 → fine
- `global_head_dim = 512` → DKQ = 512 → hits the `else` and aborts

The `fattn-tile` DKQ=DV=512 path was fixed in 425db5b, but the MTP scheduler appears to dispatch to `fattn-mma` (this code path), which has no DKQ=512 implementation and aborts unconditionally. vLLM 0.20.2rc1 handles the same model by detecting heterogeneous head dims at startup and forcing the TRITON_ATTN backend (log line: "Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence."). atomic appears to have no equivalent dispatch fallback.
Workarounds attempted (all fail with the same abort)

- `-fa` omitted (default `auto`)
- `-fa on` explicit
- `-fa off`
- `CUDA_VISIBLE_DEVICES=1` (3060, sm_86) instead of the 5060 Ti (sm_120)
- disabling the `turbo3` KV cache, `--parallel 1`, `--draft-block-size 2 --draft-max 6`
- `Q4_K_M` draft from your published GGUF (verified by `scripts/verify-gemma4-assistant-gguf.py`)

Non-MTP serve of the same target on the same hardware works correctly, so this is purely an MTP-path FA-MMA dispatch issue.
Environment
Suggested fix direction
Either:
- add `ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<DKQ=512, DV=512, ...>` template instances (the same way `fattn-tile` was extended in 425db5b), or
- dispatch to `fattn-tile` (which already has DKQ=512 support) instead of `fattn-mma` for Gemma 4's global-attention layers.

Happy to test patches against this repro.
(Filed while benchmarking atomic + vLLM + llama-swap for the dev.to Gemma 4 Challenge contest. Will publish full A/B numbers once atomic CUDA MTP is healthy.)