Repro: MTP path on CUDA aborts at fattn.cu:109 (DKQ=512) for Gemma 4 — Blackwell sm_120 + Ampere sm_86 #5

Draft

jameseiten wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from jameseiten:bug/mtp-cuda-fattn-DKQ-512-abort


Conversation

@jameseiten

MTP path on CUDA aborts at fattn.cu:109 (DKQ=512) for Gemma 4 — reproducible on Blackwell sm_120 and Ampere sm_86

Summary

The MTP speculative-decoding worker crashes with a GGML_ABORT("fatal error") at ggml/src/ggml-cuda/fattn.cu:109 whenever Gemma 4's global-attention layer (head_dim = 512) is processed via the FA-MMA kernel. It happens on both Blackwell (RTX 5060 Ti, sm_120) and Ampere (RTX 3060, sm_86), so it is not architecture-specific. Non-MTP inference of the same target is unaffected.

The bug appears specific to the CUDA fattn-mma path; the recent fattn-tile DKQ=DV=512 fix (425db5b) doesn't cover it. The Apple Metal benchmarks in the README (M4 Max, TurboFlash kernel) use a different code path and are unaffected.

Reproduction

Pinned commit: 2e81dc5 (feature/turboquant-kv-cache, today's HEAD).

Build

export PATH=/usr/local/cuda-12.8/bin:$PATH
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86;120" \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.8 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc) --target llama-server

(Builds cleanly. version: 72 (2e81dc5).)

Models

  • Target: unsloth/gemma-4-E4B-it-GGUF Q8_K_XL (gemma-4-E4B-it-UD-Q8_K_XL.gguf)
  • Drafter: AtomicChat/gemma-4-E4B-it-assistant-GGUF Q4_K_M (verified clean by scripts/verify-gemma4-assistant-gguf.py)

Serve command (matches scripts/run-gemma4-e4b-mtp-server.sh throughput preset)

build/bin/llama-server \
  -m /path/to/gemma-4-E4B-it-UD-Q8_K_XL.gguf \
  --mtp-head /path/to/gemma-4-E4B-it-assistant.Q4_K_M.gguf \
  --spec-type mtp \
  --draft-block-size 2 --draft-max 6 --draft-min 0 \
  -c 4096 \
  -ngl 99 -ngld 99 \
  -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \
  -fa on \
  --parallel 1 -np 1 --cont-batching \
  --host 0.0.0.0 --port 8002

Server starts cleanly:

srv    load_model: MTP assistant path '...gemma-4-E4B-it-assistant.Q4_K_M.gguf' (loaded into target model)
slot   load_model: id  0 | task -1 | speculative decoding context initialized
main: model loaded
main: server is listening on http://0.0.0.0:8002

Trigger

Any chat-completion request with non-trivial output is enough. Example payload:

{
  "model": "gemma-4-E4B-it-UD-Q8_K_XL.gguf",
  "messages": [
    {"role": "system", "content": "Output strict JSON: {severity, escalate, confidence, rationale}."},
    {"role": "user", "content": "Finding: prompt-injection attempt. Bot refused. No leak."}
  ],
  "max_tokens": 128,
  "temperature": 0.0
}
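Sent, for example, with curl (llama-server's standard OpenAI-compatible endpoint; payload.json here is just a name for the body above):

curl -s http://localhost:8002/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d @payload.json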

Crash

slot update_slots: id  3 | task 0 | prompt processing done, n_tokens = 109, batch.n_tokens = 4
slot update_slots: id  3 | task 0 | created context checkpoint 1 of 32 ...
init: embeddings required but some input tokens were not marked as outputs -> overriding
ggml/src/ggml-cuda/fattn.cu:109: fatal error
[backtrace]
  ggml_cuda_flash_attn_ext
  llama_context::graph_compute_mtp
  llama_context::process_ubatch_mtp
  llama_context::decode_mtp_run
  llama_context::mtp_worker_loop
systemd: Main process exited, code=killed, status=6/ABRT

The non-MTP path (same target, no --mtp-head/--spec-type) responds correctly, so the model itself loads and runs fine; only the MTP worker hits the abort.

Root cause analysis

fattn.cu:109 is in ggml_cuda_flash_attn_ext_mma_f16_switch_ncols2:

if constexpr (DKQ <= 256) {
    if (use_gqa_opt && gqa_ratio > 1) {
        ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<DKQ, DV, 2>(ctx, dst);
        return;
    }
    ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<DKQ, DV, 1>(ctx, dst);
} else {
    GGML_ABORT("fatal error");   // ← line 109
}

Gemma 4 has heterogeneous head dimensions:

  • Most layers: head_dim = 256 → DKQ = 256 → fine
  • Global-attention layers: global_head_dim = 512 → DKQ = 512 → hits the else and aborts

The fattn-tile DKQ=DV=512 path was fixed in 425db5b, but the MTP scheduler appears to dispatch to fattn-mma (this code path), which has no DKQ=512 implementation and aborts unconditionally.

vLLM 0.20.2rc1 handles the same model by detecting heterogeneous head dims at startup and forcing the TRITON_ATTN backend (log line: "Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence."). atomic appears to have no equivalent dispatch fallback.
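To make the idea concrete, here is a standalone sketch of what an equivalent load-time check could look like. Everything below is hypothetical illustration, not atomic's actual API; only the head-dim values are Gemma 4's.

#include <cstdio>
#include <set>
#include <vector>

enum class FattnBackend { MMA, TILE };

// Hypothetical: head_dims would come from the model's per-layer metadata.
FattnBackend pick_backend_for_model(const std::vector<int> &head_dims) {
    std::set<int> dims(head_dims.begin(), head_dims.end());
    const int max_dim = *dims.rbegin();
    if (dims.size() > 1 || max_dim > 256) {
        // MMA has no DKQ > 256 instance; forcing one backend for every layer
        // also avoids mixed-backend numerical divergence.
        std::printf("heterogeneous head dims (max %d): forcing TILE for all layers\n", max_dim);
        return FattnBackend::TILE;
    }
    return FattnBackend::MMA;
}

int main() {
    // Gemma 4-like layout: mostly head_dim = 256, global-attention layers 512.
    pick_backend_for_model({256, 256, 256, 512, 256, 256, 256, 512});
}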

Workarounds attempted (all fail with the same abort)

  • -fa omitted (default auto) → abort
  • -fa on (explicit) → abort
  • -fa off → abort (FA is still required by the MTP graph)
  • CUDA_VISIBLE_DEVICES=1 (RTX 3060, sm_86) instead of the 5060 Ti (sm_120) → abort; not arch-specific
  • Reference script flags exactly (turbo3 KV cache, --parallel 1, --draft-block-size 2 --draft-max 6) → abort
  • Drafter Q4_K_M from your published GGUF (verified by scripts/verify-gemma4-assistant-gguf.py) → abort

Non-MTP serve of the same target on the same hardware works correctly, so this is purely an MTP-path FA-MMA dispatch issue.

Environment

atomic-llama-cpp-turboquant: 2e81dc5 (feature/turboquant-kv-cache, today's HEAD)
CUDA: 12.8.93
GCC: 12.2.0 (Debian 12)
GPUs: NVIDIA GeForce RTX 5060 Ti (sm_120, 16 GiB) + NVIDIA GeForce RTX 3060 (sm_86, 12 GiB)
Driver: 580.126.18
OS: Linux 6.x (Proxmox LXC, Debian 12 userspace)

Suggested fix direction

Either:

  1. Extend the ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<DKQ=512, DV=512, ...> template instances (the same way fattn-tile was extended in 425db5b), or
  2. Have the MTP graph builder dispatch to fattn-tile (which already has DKQ=512 support) instead of fattn-mma for Gemma 4's global-attention layers, or
  3. Fall back gracefully (CPU FA, or the non-FA path) when DKQ > 256 is detected at dispatch time; a rough sketch of this decision follows below.
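A standalone model of what that dispatch-time decision could look like (hypothetical names throughout, not ggml's actual API; only the DKQ/DV values and the 425db5b TILE coverage come from this report):

#include <cassert>

// Hypothetical sketch: pick a kernel per attention op instead of aborting.
enum class FattnKernel { MMA, TILE, NONE /* non-FA / CPU fallback */ };

static FattnKernel select_fattn_kernel(int dkq, int dv) {
    if (dkq <= 256)              return FattnKernel::MMA;  // existing fast path
    if (dkq == 512 && dv == 512) return FattnKernel::TILE; // covered since 425db5b
    return FattnKernel::NONE;                              // graceful fallback, no GGML_ABORT
}

int main() {
    assert(select_fattn_kernel(256, 256) == FattnKernel::MMA);  // Gemma 4 local layers
    assert(select_fattn_kernel(512, 512) == FattnKernel::TILE); // Gemma 4 global layers
    assert(select_fattn_kernel(640, 512) == FattnKernel::NONE); // anything else: no abort
}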

Happy to test patches against this repro.


(Filed while benchmarking atomic + vLLM + llama-swap for the dev.to Gemma 4 Challenge contest. Will publish full A/B numbers once atomic CUDA MTP is healthy.)

github-actions bot added the documentation (Improvements or additions to documentation) label May 8, 2026
@Ooooze

Ooooze commented May 8, 2026

Hi @jameseiten — thanks a lot for the detailed reproduction and the clean root-cause writeup; it made the diagnosis trivial.

I went with your suggested option 2 (route DKQ=512 to TILE), since fattn-tile already has full DKQ=DV=512 support with the ncols2 ∈ {1, 2} fallback after 425db5b, and the NVIDIA fp16/fp32 kernel-config tables were extended in that same commit, so the path is ready to use. Single-line dispatcher guard:

if (Q->ne[0] == 512 && gqa_ratio < 3) {
    return BEST_FATTN_KERNEL_TILE;
}

Fix is up in #6.

I don't have an NVIDIA box at hand to verify against your repro. Since you're already set up with both Blackwell sm_120 and Ampere sm_86 + the exact GGUFs and the throughput preset script — would you mind giving #6 a spin against the same reproduction? Especially curious whether -ctk turbo3 -ctv turbo3 works through the TILE path on NVIDIA the way it does on AMD gfx1150 (where it was tested for #425db5b).
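If it helps, the branch can be fetched via GitHub's standard pull/N/head refs (adjust the remote name to wherever AtomicBot-ai is configured):

git fetch origin pull/6/head:pr-6
git checkout pr-6
cmake --build build --config Release -j$(nproc) --target llama-server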

Thanks again.
