Repro: MTP path on CUDA aborts at fattn.cu:109 (DKQ=512) for Gemma 4 — Blackwell sm_120 + Ampere sm_86 #5

Draft

jameseiten wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from jameseiten:bug/mtp-cuda-fattn-DKQ-512-abort


Conversation

@jameseiten

MTP path on CUDA aborts at fattn.cu:109 (DKQ=512) for Gemma 4 — reproducible on Blackwell sm_120 and Ampere sm_86

Summary

The MTP speculative-decoding worker crashes with a GGML_ABORT("fatal error") at ggml/src/ggml-cuda/fattn.cu:109 whenever Gemma 4's global-attention layer (head_dim = 512) is processed via the FA-MMA kernel. It happens on both Blackwell (RTX 5060 Ti, sm_120) and Ampere (RTX 3060, sm_86), so it is not architecture-specific. Non-MTP inference of the same target is unaffected.

The bug appears specific to the CUDA fattn-mma path; the recent fattn-tile DKQ=DV=512 fix (425db5b) doesn't cover it. The Apple Metal benchmarks in the README (M4 Max, TurboFlash kernel) use a different code path and are unaffected.

Reproduction

Pinned commit: 2e81dc5 (feature/turboquant-kv-cache, today's HEAD).

Build

export PATH=/usr/local/cuda-12.8/bin:$PATH
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86;120" \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.8 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc) --target llama-server

(Builds cleanly. version: 72 (2e81dc5).)

Models

  • Target: unsloth/gemma-4-E4B-it-GGUF Q8_K_XL (gemma-4-E4B-it-UD-Q8_K_XL.gguf)
  • Drafter: AtomicChat/gemma-4-E4B-it-assistant-GGUF Q4_K_M (verified clean by scripts/verify-gemma4-assistant-gguf.py)

Serve command (matches scripts/run-gemma4-e4b-mtp-server.sh throughput preset)

build/bin/llama-server \
  -m /path/to/gemma-4-E4B-it-UD-Q8_K_XL.gguf \
  --mtp-head /path/to/gemma-4-E4B-it-assistant.Q4_K_M.gguf \
  --spec-type mtp \
  --draft-block-size 2 --draft-max 6 --draft-min 0 \
  -c 4096 \
  -ngl 99 -ngld 99 \
  -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \
  -fa on \
  --parallel 1 -np 1 --cont-batching \
  --host 0.0.0.0 --port 8002

Server starts cleanly:

srv    load_model: MTP assistant path '...gemma-4-E4B-it-assistant.Q4_K_M.gguf' (loaded into target model)
slot   load_model: id  0 | task -1 | speculative decoding context initialized
main: model loaded
main: server is listening on http://0.0.0.0:8002

Trigger

Any chat-completion request with non-trivial output is enough. Example payload:

{
  "model": "gemma-4-E4B-it-UD-Q8_K_XL.gguf",
  "messages": [
    {"role": "system", "content": "Output strict JSON: {severity, escalate, confidence, rationale}."},
    {"role": "user", "content": "Finding: prompt-injection attempt. Bot refused. No leak."}
  ],
  "max_tokens": 128,
  "temperature": 0.0
}
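Sent, for example, with curl (llama-server's standard OpenAI-compatible endpoint; payload.json here is just a name for the body above):

curl -s http://localhost:8002/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d @payload.json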

Crash

slot update_slots: id  3 | task 0 | prompt processing done, n_tokens = 109, batch.n_tokens = 4
slot update_slots: id  3 | task 0 | created context checkpoint 1 of 32 ...
init: embeddings required but some input tokens were not marked as outputs -> overriding
ggml/src/ggml-cuda/fattn.cu:109: fatal error
[backtrace]
  ggml_cuda_flash_attn_ext
  llama_context::graph_compute_mtp
  llama_context::process_ubatch_mtp
  llama_context::decode_mtp_run
  llama_context::mtp_worker_loop
systemd: Main process exited, code=killed, status=6/ABRT

The non-MTP path (same target, no --mtp-head/--spec-type) responds correctly, so the model itself loads and runs fine; only the MTP worker hits the abort.

Root cause analysis

fattn.cu:109 is in ggml_cuda_flash_attn_ext_mma_f16_switch_ncols2:

if constexpr (DKQ <= 256) {
    if (use_gqa_opt && gqa_ratio > 1) {
        ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<DKQ, DV, 2>(ctx, dst);
        return;
    }
    ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<DKQ, DV, 1>(ctx, dst);
} else {
    GGML_ABORT("fatal error");   // ← line 109
}

Gemma 4 has heterogeneous head dimensions:

  • Most layers: head_dim = 256 → DKQ = 256 → fine
  • Global-attention layers: global_head_dim = 512 → DKQ = 512 → hits the else and aborts

The fattn-tile DKQ=DV=512 path was fixed in 425db5b, but the MTP scheduler appears to dispatch to fattn-mma (this code path), which has no DKQ=512 implementation and aborts unconditionally.

vLLM 0.20.2rc1 handles the same model by detecting heterogeneous head dims at startup and forcing the TRITON_ATTN backend (log line: "Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence."). atomic appears to have no equivalent dispatch fallback.
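To make the idea concrete, here is a standalone sketch of what an equivalent load-time check could look like. Everything below is hypothetical illustration, not atomic's actual API; only the head-dim values are Gemma 4's.

#include <cstdio>
#include <set>
#include <vector>

enum class FattnBackend { MMA, TILE };

// Hypothetical: head_dims would come from the model's per-layer metadata.
FattnBackend pick_backend_for_model(const std::vector<int> &head_dims) {
    std::set<int> dims(head_dims.begin(), head_dims.end());
    const int max_dim = *dims.rbegin();
    if (dims.size() > 1 || max_dim > 256) {
        // MMA has no DKQ > 256 instance; forcing one backend for every layer
        // also avoids mixed-backend numerical divergence.
        std::printf("heterogeneous head dims (max %d): forcing TILE for all layers\n", max_dim);
        return FattnBackend::TILE;
    }
    return FattnBackend::MMA;
}

int main() {
    // Gemma 4-like layout: mostly head_dim = 256, global-attention layers 512.
    pick_backend_for_model({256, 256, 256, 512, 256, 256, 256, 512});
}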

Workarounds attempted (all fail with the same abort)

  • -fa omitted (default auto) → abort
  • -fa on (explicit) → abort
  • -fa off → abort (FA is still required by the MTP graph)
  • CUDA_VISIBLE_DEVICES=1 (RTX 3060, sm_86) instead of the 5060 Ti (sm_120) → abort; not arch-specific
  • Reference script flags exactly (turbo3 KV cache, --parallel 1, --draft-block-size 2 --draft-max 6) → abort
  • Drafter Q4_K_M from your published GGUF (verified by scripts/verify-gemma4-assistant-gguf.py) → abort

Non-MTP serve of the same target on the same hardware works correctly, so this is purely an MTP-path FA-MMA dispatch issue.

Environment

atomic-llama-cpp-turboquant: 2e81dc5 (feature/turboquant-kv-cache, today's HEAD)
CUDA: 12.8.93
GCC: 12.2.0 (Debian 12)
GPUs: NVIDIA GeForce RTX 5060 Ti (sm_120, 16 GiB) + NVIDIA GeForce RTX 3060 (sm_86, 12 GiB)
Driver: 580.126.18
OS: Linux 6.x (Proxmox LXC, Debian 12 userspace)

Suggested fix direction

Either:

  1. Extend the ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<DKQ=512, DV=512, ...> template instances (the same way fattn-tile was extended in 425db5b), or
  2. Have the MTP graph builder dispatch to fattn-tile (which already has DKQ=512 support) instead of fattn-mma for Gemma 4's global-attention layers, or
  3. Fall back gracefully (CPU FA, or the non-FA path) when DKQ > 256 is detected at dispatch time; a rough sketch of this decision follows below.
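A standalone model of what that dispatch-time decision could look like (hypothetical names throughout, not ggml's actual API; only the DKQ/DV values and the 425db5b TILE coverage come from this report):

#include <cassert>

// Hypothetical sketch: pick a kernel per attention op instead of aborting.
enum class FattnKernel { MMA, TILE, NONE /* non-FA / CPU fallback */ };

static FattnKernel select_fattn_kernel(int dkq, int dv) {
    if (dkq <= 256)              return FattnKernel::MMA;  // existing fast path
    if (dkq == 512 && dv == 512) return FattnKernel::TILE; // covered since 425db5b
    return FattnKernel::NONE;                              // graceful fallback, no GGML_ABORT
}

int main() {
    assert(select_fattn_kernel(256, 256) == FattnKernel::MMA);  // Gemma 4 local layers
    assert(select_fattn_kernel(512, 512) == FattnKernel::TILE); // Gemma 4 global layers
    assert(select_fattn_kernel(640, 512) == FattnKernel::NONE); // anything else: no abort
}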

Happy to test patches against this repro.


(Filed while benchmarking atomic + vLLM + llama-swap for the dev.to Gemma 4 Challenge contest. Will publish full A/B numbers once atomic CUDA MTP is healthy.)

github-actions bot added the documentation (Improvements or additions to documentation) label May 8, 2026
@Ooooze

Ooooze commented May 8, 2026

Hi @jameseiten — thanks a lot for the detailed reproduction and the clean root-cause writeup; it made the diagnosis trivial.

I went with your suggested option 2 (route DKQ=512 to TILE), since fattn-tile already has full DKQ=DV=512 support with the ncols2 ∈ {1, 2} fallback after 425db5b, and the NVIDIA fp16/fp32 kernel-config tables were extended in that same commit, so the path is ready to use. Single-line dispatcher guard:

if (Q->ne[0] == 512 && gqa_ratio < 3) {
    return BEST_FATTN_KERNEL_TILE;
}

Fix is up in #6.

I don't have an NVIDIA box at hand to verify against your repro. Since you're already set up with both Blackwell sm_120 and Ampere sm_86 + the exact GGUFs and the throughput preset script — would you mind giving #6 a spin against the same reproduction? Especially curious whether -ctk turbo3 -ctv turbo3 works through the TILE path on NVIDIA the way it does on AMD gfx1150 (where it was tested for #425db5b).
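If it helps, the branch can be fetched via GitHub's standard pull/N/head refs (adjust the remote name to wherever AtomicBot-ai is configured):

git fetch origin pull/6/head:pr-6
git checkout pr-6
cmake --build build --config Release -j$(nproc) --target llama-server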

Thanks again.
