[Bug] iso3/planar3 KV prefill scales super-linearly and produces garbage on RTX 5090 (sm_120 / CUDA 12.8) #10

@TanveerKahlon

Description

Summary

On a 2× RTX 5090 (Blackwell, sm_120a, CUDA 12.8 toolkit, driver 580.126.09) box, the rotorquant KV-cache types iso3 and planar3 — as exposed by the johndpope/llama-cpp-turboquant consumer fork at feature/planarquant-kv-cache@fc3d1b6 — show:

  • Super-linear prefill slowdown that scales with context length — 2× slower at 512 tokens, 7.5× at 4 K, ~24× at 16 K compared to f16/f16 on the same fork binary
  • Incoherent output at real long-context workloads — 37 K-token prompt produces I?????????-style garbage + 0.3 t/s decode

The fork's own f16/f16 baseline works correctly at every context size (156.8 t/s decode on Llama 3.1 8B, coherent English), so the CUDA / PTX / driver / CMake path is fine — the regression is localised to the rotorquant kernel path on sm_120.

Filing on this repo because the arithmetic authoring (Givens rotations, quaternion rotations, isometric projections) originates here and the README headlines RTX 5090 results. (Attempted to file a companion report on johndpope/llama-cpp-turboquant where the CUDA kernel code lives, but that repo has issues + discussions both disabled — happy to coordinate with either maintainer.)

Environment

GPUs 2× NVIDIA GeForce RTX 5090 (32 GiB each)
Compute capability 12.0 (sm_120a)
Driver 580.126.09
CUDA toolkit 12.8.1 (side-toolkit at /opt/cuda-12.8/)
OS Ubuntu 24.04, Linux 6.17.0-20-generic
GCC 13.3
Consumer fork johndpope/llama-cpp-turboquant @ feature/planarquant-kv-cache (fc3d1b6, tip as of 2026-04-20)
ggml version reported 0.9.8-dirty

Reproducer A — super-linear prefill slowdown (deterministic, via llama-bench)

Built from johndpope/llama-cpp-turboquant @ feature/planarquant-kv-cache with:

cmake -B build \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DGGML_CUDA=ON -DGGML_CUDA_MMQ=ON -DGGML_FLASH_ATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

(Needs a one-line patch to ggml/include/ggml.h:184 for GCC 13.3; being reported separately.)

Run:

MODEL=llama-3.1-8b-instruct.Q4_K_M.gguf
./build/bin/llama-bench -m $MODEL -ngl 99 -p   512 -n 128 -ctk f16     -ctv f16
./build/bin/llama-bench -m $MODEL -ngl 99 -p  4096 -n  64 -ctk f16     -ctv f16
./build/bin/llama-bench -m $MODEL -ngl 99 -p   512 -n 128 -ctk iso3    -ctv iso3
./build/bin/llama-bench -m $MODEL -ngl 99 -p  4096 -n  64 -ctk iso3    -ctv iso3
./build/bin/llama-bench -m $MODEL -ngl 99 -p 16384 -n  64 -ctk iso3    -ctv iso3
./build/bin/llama-bench -m $MODEL -ngl 99 -p   512 -n 128 -ctk planar3 -ctv planar3
./build/bin/llama-bench -m $MODEL -ngl 99 -p   512 -n 128 -ctk planar3 -ctv f16

Results (tensor-split across both 5090s, default -sm layer):

Config              pp512 (t/s)           pp4096 (t/s)        pp16384 (t/s)  tg128 / tg64 (t/s)
f16/f16 (baseline)  12 533 ± 1 035        16 263 ± 4          n/a            234 / 244
iso3/iso3            5 771 ± 298 (−54 %)   2 166 ± 2 (−87 %)  691 ± 1        179 / 186
planar3/planar3      7 586 ± 140 (−40 %)  n/a                 n/a            184 / n/a
planar3/f16         14 850 ± 534 (+18 %)  n/a                 n/a            218 / n/a

(n/a = not run; the planar3 configs were only benchmarked at pp512, and no f16 pp16384 run was taken.)

iso3 prefill throughput collapses from ~46 % of baseline at 512 tokens to ~13 % at 4 K to ~4 % at 16 K. Decode (tg) throughput is basically fine, so the regression is on the prefill / batch-eval path, not the single-token attention kernel.
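
To put a number on "super-linear": treating prefill wall time as tokens / throughput and fitting T(n) ~ n^k between adjacent measurements gives k ≈ 1.5 rising to ≈ 1.8 for iso3, versus k < 1 for the f16 baseline. A back-of-envelope check (plain host-side C++, nothing fork-specific):

// Effective prefill scaling exponent, from the llama-bench table above.
// Build: g++ -O2 scaling_check.cpp -o scaling_check
#include <cmath>
#include <cstdio>

int main() {
    // prefill wall time = tokens / throughput (t/s)
    double iso_512   = 512.0   / 5771.0;    // ~0.089 s
    double iso_4096  = 4096.0  / 2166.0;    // ~1.89 s
    double iso_16384 = 16384.0 / 691.0;     // ~23.7 s
    double f16_512   = 512.0   / 12533.0;
    double f16_4096  = 4096.0  / 16263.0;

    // If T(n) ~ n^k, then k = log(T2/T1) / log(n2/n1)
    printf("iso3 k, 512 -> 4096:   %.2f\n", std::log(iso_4096 / iso_512)   / std::log(8.0)); // ~1.47
    printf("iso3 k, 4096 -> 16384: %.2f\n", std::log(iso_16384 / iso_4096) / std::log(4.0)); // ~1.82
    printf("f16  k, 512 -> 4096:   %.2f\n", std::log(f16_4096 / f16_512)   / std::log(8.0)); // ~0.87
    return 0;
}

An exponent drifting toward 2 as context grows is what you'd expect if some per-tile step in the rotorquant prefill path re-touches the whole KV cache instead of only the new tile.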

planar3/f16 (K-only compression) is the only rotorquant config that avoids the collapse — it's actually faster than baseline at pp512. Consistent with it skipping more work rather than hitting the slow path.

Reproducer B — garbage output at ~37 K context (llama-server)

With Llama 3.1 8B Q4_K_M and a ~37 K-token Punjabi scripture prompt (real workload — long-form Gurmukhi extraction):

./build/bin/llama-server -m $MODEL -ngl 99 --jinja \
  --host 127.0.0.1 --port 8090 \
  -c 65536 -t 16 -ts 1,1 \
  -ctk iso3 -ctv iso3 -fa on

Config                                    Decode (t/s)  Output
-ctk iso3 -ctv iso3 -fa on                0.6           garbage: I?????????
-ctk planar3 -ctv f16 -fa on              0.4           garbage: I?????????
-ctk iso3 -ctv iso3 (no -fa)              0.3           garbage
-ctk f16 -ctv f16 -fa on (fork baseline)  156.8         coherent English

Raw data captured at ~/local_llm/.recon/install-gate-5-v2-rotorquant.json in our local repo — can upload if useful.

Given the prefill scaling in Reproducer A, my guess is that both symptoms share a single root cause: a prefill kernel that scales badly on sm_120 and, past some context threshold, produces output that is corrupt as well as slow.

What I've ruled out

  • Not a broken toolchain. Fork's own f16/f16 baseline runs at 156.8 t/s decode / 12.5 K t/s prefill on the same build, coherent output.
  • Not a caller-side -fa issue. Garbage reproduces with and without flash-attention.
  • Not an sm_120a vs sm_120 mismatch. Building with CMAKE_CUDA_ARCHITECTURES=120 emits base sm_120 code, which the sm_120a devices run without complaint; the device query sketched after this list confirms what the runtime reports.
  • Not CUDA version cached wrong. Built cleanly with /opt/cuda-12.8/ toolkit; libggml-cuda.so links libcudart.so.12.
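
A minimal device query along these lines confirms the reported compute capability and shared-memory ceilings (generic CUDA runtime code, not fork code):

// Device sanity check. Build: nvcc -arch=native devquery.cu -o devquery
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp p{};
        cudaGetDeviceProperties(&p, d);
        int smem_optin = 0;  // per-block dynamic smem ceiling after opt-in
        cudaDeviceGetAttribute(&smem_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, d);
        printf("GPU %d: %s  sm_%d%d  smem/block opt-in: %d KiB  smem/SM: %zu KiB\n",
               d, p.name, p.major, p.minor,
               smem_optin / 1024, p.sharedMemPerMultiprocessor / 1024);
    }
    return 0;
}

Both 5090s here report compute capability 12.0, consistent with the environment table above.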

Hypothesis

The rotorquant CUDA kernels landed in the consumer fork via:

  • 1ed0453 — "Add CUDA set_rows kernels for planar3/iso3/planar4/iso4"
  • 26c90d6 — "Add CUDA F16→quantized conversion kernels for planar3/4 and iso3/4"
  • a75b16f — "Add CUDA flash attention dequantize for planar3/iso3/planar4/iso4"
  • 9d4ece5 — "COMPRESSION WORKS: 5.1x K-cache + 200 tok/s decode on CUDA"

None of those commits mention cu128, WMMA, mma.sync, __CUDA_ARCH__ >= 1200, or any architecture gating. The "200 tok/s decode on CUDA" in 9d4ece5 was almost certainly measured on sm_89 Ada hardware (e.g. an RTX 4090), since none of the commits adjust block/warp sizing or add Blackwell-specific codepaths.

Most-likely candidates for the sm_120 path going bad:

  1. Givens-rotation prefill kernel uses a block shape or shared-memory layout tuned for sm_89 that overflows or mis-aligns under sm_120's different shared-memory limits and bank behaviour (see the opt-in sketch after this list)
  2. WMMA / mma.sync instruction chosen by template expansion compiles on sm_120 but produces wrong output for Givens-rotated tiles
  3. Constant-memory read pattern for rotation coefficients behaves differently on Blackwell, if CC 12.0 changed constant-cache behaviour
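
On candidate 1 specifically: any per-block dynamic shared-memory request above the 48 KiB default needs an explicit opt-in, and the opt-in ceiling is an architecture-dependent device attribute, so a tile size tuned on sm_89 is exactly the kind of thing that breaks quietly on a different arch. A minimal sketch of the pattern follows; the kernel name and the 64 KiB tile are hypothetical stand-ins, not the fork's actual code:

// Illustrative only: how an arch-tuned dynamic-smem tile can go wrong elsewhere.
// Build: nvcc -arch=native smem_optin.cu -o smem_optin
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a Givens-rotation prefill tile kernel.
__global__ void rotorquant_prefill_tile(const float* src, float* dst, int n) {
    extern __shared__ unsigned char tile[];  // dynamic smem staging buffer
    // ... rotation + quantization of one KV tile would live here ...
    (void)src; (void)dst; (void)n; (void)tile;
}

int main() {
    int dev = 0, smem_max = 0;
    cudaDeviceGetAttribute(&smem_max, cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);

    size_t tile_bytes = 64 * 1024;  // hypothetical tile sized on the tuning arch
    if (tile_bytes > (size_t)smem_max) {
        // A tile that fits on the tuning arch can exceed another arch's ceiling.
        fprintf(stderr, "tile wants %zu B, device allows %d B\n", tile_bytes, smem_max);
        return 1;
    }
    // Without this opt-in, any launch asking for > 48 KiB dynamic smem fails
    // outright; with it, a layout tuned for one arch can still mis-use banks
    // on another even though the launch succeeds.
    cudaFuncSetAttribute(rotorquant_prefill_tile,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, (int)tile_bytes);
    rotorquant_prefill_tile<<<1, 256, tile_bytes>>>(nullptr, nullptr, 0);
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();
    return 0;
}

If compute-sanitizer flags out-of-bounds shared-memory accesses in the prefill kernels at long context, candidate 1 becomes much more likely; a clean run would shift suspicion to candidate 2.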

Cross-reference: ggml-org/llama.cpp#21915 (gibberish with quantised KV cache on sm_120, open as of 2026-04-14) shows the same symptom family for non-rotorquant KV quant, which suggests the sm_120 + quant-KV path is fragile upstream too.

Ask

  1. Which compute capabilities were the iso3 / planar3 kernels validated on in CI or manual testing before the RTX 5090 numbers landed in the README?
  2. Is there a suspected kernel (Givens rotation in prefill? FA dequantize?) where the sm_89 → sm_120 jump would plausibly regress? I can run compute-sanitizer / nsys profile and share output if it would help narrow this down.
  3. If this is a known-soft spot on Blackwell, would it be worth a note in the README until kernel paths are updated?

Happy to coordinate a debug session — this is blocking a real-world Gurmukhi scripture extraction workload for us (30 K–50 K-token chunks), so a fix here would be directly useful. The algorithm itself is genuinely encouraging — planar3/f16 showing K-only compression faster than baseline at pp512 is a nice result.


Filed 2026-04-20. The consumer fork johndpope/llama-cpp-turboquant has issues + discussions disabled, so there's no parallel ticket there — this is the single issue for this bug. Machine-readable repro data at our local .recon/install-gate-5-v2-rotorquant.json.
