[Bug] iso3 / planar3 KV kernels: super-linear prefill slowdown and garbage output on RTX 5090 (sm_120 / CUDA 12.8)
## Summary

On a 2× RTX 5090 (Blackwell, sm_120a, CUDA 12.8 toolkit, driver 580.126.09) box, the rotorquant KV-cache types `iso3` and `planar3` — as exposed by the `johndpope/llama-cpp-turboquant` consumer fork at `feature/planarquant-kv-cache@fc3d1b6` — show:

- Super-linear prefill slowdown that scales with context length — 2× slower at 512 tokens, 7.5× at 4 K, ~24× at 16 K compared to `f16/f16` on the same fork binary
- Incoherent output at real long-context workloads — a 37 K-token prompt produces `I?????????`-style garbage plus 0.3 t/s decode
The fork's own `f16/f16` baseline works correctly at every context size (156.8 t/s decode on Llama 3.1 8B, coherent English), so the CUDA / PTX / driver / CMake path is fine — the regression is localised to the rotorquant kernel path on sm_120.

Filing on this repo because the arithmetic authoring (Givens rotations, quaternion rotations, isometric projections) originates here and the README headlines RTX 5090 results. (Attempted to file a companion report on `johndpope/llama-cpp-turboquant`, where the CUDA kernel code lives, but that repo has issues + discussions both disabled — happy to coordinate with either maintainer.)
## Environment

| Component | Value |
|---|---|
| GPUs | 2× NVIDIA GeForce RTX 5090 (32 GiB each) |
| Compute capability | 12.0 (sm_120a) |
| Driver | 580.126.09 |
| CUDA toolkit | 12.8.1 (side toolkit at `/opt/cuda-12.8/`) |
| OS | Ubuntu 24.04, Linux 6.17.0-20-generic |
| GCC | 13.3 |
| Consumer fork | `johndpope/llama-cpp-turboquant` @ `feature/planarquant-kv-cache` (fc3d1b6, tip as of 2026-04-20) |
| ggml version reported | `0.9.8-dirty` |
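For reference, the environment rows above can be re-captured with the following (a sketch; the `compute_cap` query field needs a reasonably recent driver, which 580.x is):

```sh
# GPU name, VRAM, compute capability, driver version — matches the table above
nvidia-smi --query-gpu=name,memory.total,compute_cap,driver_version --format=csv
# Side-toolkit version
/opt/cuda-12.8/bin/nvcc --version
```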
## Reproducer A — super-linear prefill slowdown (deterministic, via `llama-bench`)

Built from `johndpope/llama-cpp-turboquant` @ `feature/planarquant-kv-cache` with:

```sh
cmake -B build \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DGGML_CUDA=ON -DGGML_CUDA_MMQ=ON -DGGML_FLASH_ATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```
(Needs a one-line patch at `ggml/include/ggml.h:184` for GCC 13.3 — filing separately on the fork.)
Run:

```sh
MODEL=llama-3.1-8b-instruct.Q4_K_M.gguf
./build/bin/llama-bench -m $MODEL -ngl 99 -p 512 -n 128 -ctk f16 -ctv f16
./build/bin/llama-bench -m $MODEL -ngl 99 -p 4096 -n 64 -ctk f16 -ctv f16
./build/bin/llama-bench -m $MODEL -ngl 99 -p 512 -n 128 -ctk iso3 -ctv iso3
./build/bin/llama-bench -m $MODEL -ngl 99 -p 4096 -n 64 -ctk iso3 -ctv iso3
./build/bin/llama-bench -m $MODEL -ngl 99 -p 16384 -n 64 -ctk iso3 -ctv iso3
./build/bin/llama-bench -m $MODEL -ngl 99 -p 512 -n 128 -ctk planar3 -ctv planar3
./build/bin/llama-bench -m $MODEL -ngl 99 -p 512 -n 128 -ctk planar3 -ctv f16
```
Results (tensor-split across both 5090s, default `-sm layer`):

| Config | pp512 (t/s) | pp4096 (t/s) | pp16384 (t/s) | tg128 / tg64 (t/s) |
|---|---|---|---|---|
| `f16/f16` (baseline) | 12 533 ± 1 035 | 16 263 ± 4 | n/a | 234 / 244 |
| `iso3/iso3` | 5 771 ± 298 (−54 %) | 2 166 ± 2 (−87 %) | 691 ± 1 | 179 / 186 |
| `planar3/planar3` | 7 586 ± 140 (−40 %) | — | — | 184 |
| `planar3/f16` | 14 850 ± 534 (+18 %) | — | — | 218 |
`iso3` prefill throughput collapses from ~46 % of baseline at 512 tokens to ~13 % at 4 K to ~4 % at 16 K. Decode (tg) throughput is essentially unaffected, so the regression is on the prefill / batch-eval path, not the single-token attention kernel.

`planar3/f16` (K-only compression) is the only rotorquant config that avoids the collapse — it is actually faster than baseline at pp512, consistent with it skipping work (V stays `f16`) rather than hitting the slow path.
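For concreteness, the percentages above fall straight out of the table (the pp16384 figure uses the pp4096 baseline as a proxy, since no `f16` pp16384 run was captured):

```sh
# Relative iso3/iso3 prefill throughput vs the f16/f16 baseline, numbers from the table
awk 'BEGIN {
  printf "pp512:   %.0f%% of baseline\n", 100 * 5771 / 12533   # ~46%
  printf "pp4096:  %.0f%% of baseline\n", 100 * 2166 / 16263   # ~13%
  printf "pp16384: %.0f%% of baseline\n", 100 * 691  / 16263   # ~4% (pp4096 f16 as proxy)
}'
```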
## Reproducer B — garbage output at ~37 K context (`llama-server`)

With Llama 3.1 8B Q4_K_M and a ~37 K-token Punjabi scripture prompt (a real workload — long-form Gurmukhi extraction):

```sh
./build/bin/llama-server -m $MODEL -ngl 99 --jinja \
  --host 127.0.0.1 --port 8090 \
  -c 65536 -t 16 -ts 1,1 \
  -ctk iso3 -ctv iso3 -fa on
```
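Any OpenAI-compatible client against the server reproduces it; a minimal sketch (here `prompt-37k.txt` is a stand-in for our Gurmukhi prompt file, and `jq` ≥ 1.6 is assumed for `--rawfile`):

```sh
# max_tokens kept small since decode runs at ~0.3–0.6 t/s in the failing configs
curl -s http://127.0.0.1:8090/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$(jq -n --rawfile p prompt-37k.txt \
        '{messages: [{role: "user", content: $p}], max_tokens: 128}')"
```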
| Config | Decode (t/s) | Output |
|---|---|---|
| `-ctk iso3 -ctv iso3 -fa on` | 0.6 | garbage: `I?????????` |
| `-ctk planar3 -ctv f16 -fa on` | 0.4 | garbage: `I?????????` |
| `-ctk iso3 -ctv iso3` (no `-fa`) | 0.3 | garbage |
| `-ctk f16 -ctv f16 -fa on` (fork baseline) | 156.8 | coherent English |
Raw data is captured at `~/local_llm/.recon/install-gate-5-v2-rotorquant.json` in our local repo — happy to upload if useful.

Given the prefill scaling in Reproducer A, my guess is both symptoms share a root cause: a prefill kernel that scales wrongly on sm_120 and that, past some context threshold, produces corrupt output as well as slow output.
## What I've ruled out

- Not a broken toolchain. The fork's own `f16/f16` baseline runs at 156.8 t/s decode / 12.5 K t/s prefill on the same build, with coherent output.
- Not a caller-side `-fa` issue. Garbage reproduces with and without flash attention.
- Not sm_120a vs sm_120. `CMAKE_CUDA_ARCHITECTURES=120` targets base sm_120 PTX; kernels using a-family accelerated ops still run.
- Not a stale CUDA version in the build cache. Built cleanly with the `/opt/cuda-12.8/` toolkit; `libggml-cuda.so` links `libcudart.so.12`. (Both binary-level checks are sketched below.)
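Both binary-level points can be double-checked directly (a sketch; paths follow this build tree, and `cuobjdump` ships with the CUDA toolkit):

```sh
# Count sm_120 code objects in the fatbinary — non-zero confirms real sm_120 cubins
/opt/cuda-12.8/bin/cuobjdump --list-elf build/bin/libggml-cuda.so | grep -c sm_120
# Confirm the linked CUDA runtime is the 12.x one
ldd build/bin/libggml-cuda.so | grep libcudart
```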
## Hypothesis

The rotorquant CUDA kernels landed in the consumer fork via:

- `1ed0453` — "Add CUDA set_rows kernels for planar3/iso3/planar4/iso4"
- `26c90d6` — "Add CUDA F16→quantized conversion kernels for planar3/4 and iso3/4"
- `a75b16f` — "Add CUDA flash attention dequantize for planar3/iso3/planar4/iso4"
- `9d4ece5` — "COMPRESSION WORKS: 5.1x K-cache + 200 tok/s decode on CUDA"
None of those commits mention `cu128`, WMMA, `mma.sync`, `__CUDA_ARCH__ >= 1200`, or any architecture gating. The "200 tok/s decode on CUDA" in `9d4ece5` was almost certainly measured on sm_89 Ada (RTX 4090 / A6000), since none of the commits adjust block/warp sizing or add Blackwell-specific codepaths.
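That absence is quick to verify against the fork checkout (a sketch; the `ggml/src/ggml-cuda/` layout matches upstream llama.cpp, and the rotorquant source file names are guesses):

```sh
# Any architecture gating or tensor-core paths in the rotorquant CUDA sources would show up here
grep -rniE '__CUDA_ARCH__|mma\.sync|wmma' ggml/src/ggml-cuda/ | grep -iE 'iso|planar|rotor'
```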
Most-likely candidates for the sm_120 path going bad (a first sanitizer pass is sketched after this list):

- Givens-rotation prefill kernel uses a block shape or shared-memory layout valid on sm_89 (128 KB dynamic smem) but overflowing / mis-aligning on sm_120 (228 KB dynamic smem with a different bank layout)
- WMMA / `mma.sync` instruction chosen by template expansion compiles on sm_120 but produces wrong output for Givens-rotated tiles
- Constant-memory read pattern for the rotation coefficients behaves differently on Blackwell (CC 12.0 changed constant-cache behaviour)
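A first triage pass for the first two candidates could look like this (a sketch, reusing `$MODEL` from Reproducer A; `memcheck` and `initcheck` are standard `compute-sanitizer` tools and should flag the smem-overflow / misalignment case and uninitialised reads respectively):

```sh
# Smallest config that already shows the collapse; one decode token to keep runs short
compute-sanitizer --tool memcheck \
  ./build/bin/llama-bench -m $MODEL -ngl 99 -p 4096 -n 1 -ctk iso3 -ctv iso3
compute-sanitizer --tool initcheck \
  ./build/bin/llama-bench -m $MODEL -ngl 99 -p 4096 -n 1 -ctk iso3 -ctv iso3
```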
Cross-reference: ggml-org/llama.cpp#21915 (gibberish with quantised KV cache on sm_120, open as of 2026-04-14) — the same symptom family for non-rotorquant KV quantisation, which suggests the sm_120 + quantised-KV path is fragile upstream too.
## Ask

- Which compute capabilities were the `iso3` / `planar3` kernels validated on, in CI or manual testing, before the RTX 5090 numbers landed in the README?
- Is there a suspected kernel (Givens rotation in prefill? FA dequantize?) where the sm_89 → sm_120 jump would plausibly regress? I can run `compute-sanitizer` / `nsys profile` and share output if it would help narrow this down — concrete commands are sketched after this list.
- If this is a known soft spot on Blackwell, would it be worth a note in the README until the kernel paths are updated?
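Concretely, the profiling offer above would be something along these lines (a sketch; output names are placeholders):

```sh
# Kernel-level timeline of the collapsing iso3 prefill vs the healthy baseline
nsys profile -t cuda -o rotorquant-iso3-pp16384 \
  ./build/bin/llama-bench -m $MODEL -ngl 99 -p 16384 -n 1 -ctk iso3 -ctv iso3
nsys profile -t cuda -o baseline-f16-pp16384 \
  ./build/bin/llama-bench -m $MODEL -ngl 99 -p 16384 -n 1 -ctk f16 -ctv f16
```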
Happy to coordinate a debug session — this is blocking a real-world Gurmukhi scripture extraction workload for us (30 K–50 K-token chunks), so a fix here would be directly useful. The algorithm itself is genuinely encouraging — `planar3/f16` showing K-only compression faster than baseline at pp512 is a nice result.
Filed 2026-04-20. The consumer fork `johndpope/llama-cpp-turboquant` has issues + discussions disabled, so there's no parallel ticket there — this is the single issue for this bug. Machine-readable repro data is at our local `.recon/install-gate-5-v2-rotorquant.json`.