[Bug] iso3 / planar3 KV kernels: super-linear prefill slowdown and garbage output on RTX 5090 (sm_120 / CUDA 12.8)
## Summary

On a 2× RTX 5090 (Blackwell, sm_120a, CUDA 12.8 toolkit, driver 580.126.09) box, the rotorquant KV-cache types `iso3` and `planar3` — as exposed by the `johndpope/llama-cpp-turboquant` consumer fork at `feature/planarquant-kv-cache@fc3d1b6` — show:

- Super-linear prefill slowdown that scales with context length — 2× slower at 512 tokens, 7.5× at 4 K, ~24× at 16 K compared to `f16/f16` on the same fork binary
- Incoherent output at real long-context workloads — a 37 K-token prompt produces `I?????????`-style garbage plus 0.3 t/s decode
The fork's own `f16/f16` baseline works correctly at every context size (156.8 t/s decode on Llama 3.1 8B, coherent English), so the CUDA / PTX / driver / CMake path is fine — the regression is localised to the rotorquant kernel path on sm_120.

Filing on this repo because the arithmetic authoring (Givens rotations, quaternion rotations, isometric projections) originates here and the README headlines RTX 5090 results. (Attempted to file a companion report on `johndpope/llama-cpp-turboquant`, where the CUDA kernel code lives, but that repo has issues + discussions both disabled — happy to coordinate with either maintainer.)
## Environment

| Component | Value |
|---|---|
| GPUs | 2× NVIDIA GeForce RTX 5090 (32 GiB each) |
| Compute capability | 12.0 (sm_120a) |
| Driver | 580.126.09 |
| CUDA toolkit | 12.8.1 (side toolkit at `/opt/cuda-12.8/`) |
| OS | Ubuntu 24.04, Linux 6.17.0-20-generic |
| GCC | 13.3 |
| Consumer fork | `johndpope/llama-cpp-turboquant` @ `feature/planarquant-kv-cache` (fc3d1b6, tip as of 2026-04-20) |
| ggml version reported | `0.9.8-dirty` |
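For reference, the environment rows above can be re-captured with the following (a sketch; the `compute_cap` query field needs a reasonably recent driver, which 580.x is):

```sh
# GPU name, VRAM, compute capability, driver version — matches the table above
nvidia-smi --query-gpu=name,memory.total,compute_cap,driver_version --format=csv
# Side-toolkit version
/opt/cuda-12.8/bin/nvcc --version
```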
## Reproducer A — super-linear prefill slowdown (deterministic, via `llama-bench`)

Built from `johndpope/llama-cpp-turboquant` @ `feature/planarquant-kv-cache` with:

```sh
cmake -B build \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DGGML_CUDA=ON -DGGML_CUDA_MMQ=ON -DGGML_FLASH_ATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```
(Needs a one-line patch at `ggml/include/ggml.h:184` for GCC 13.3 — filing separately on the fork.)
Run:

```sh
MODEL=llama-3.1-8b-instruct.Q4_K_M.gguf
./build/bin/llama-bench -m $MODEL -ngl 99 -p 512 -n 128 -ctk f16 -ctv f16
./build/bin/llama-bench -m $MODEL -ngl 99 -p 4096 -n 64 -ctk f16 -ctv f16
./build/bin/llama-bench -m $MODEL -ngl 99 -p 512 -n 128 -ctk iso3 -ctv iso3
./build/bin/llama-bench -m $MODEL -ngl 99 -p 4096 -n 64 -ctk iso3 -ctv iso3
./build/bin/llama-bench -m $MODEL -ngl 99 -p 16384 -n 64 -ctk iso3 -ctv iso3
./build/bin/llama-bench -m $MODEL -ngl 99 -p 512 -n 128 -ctk planar3 -ctv planar3
./build/bin/llama-bench -m $MODEL -ngl 99 -p 512 -n 128 -ctk planar3 -ctv f16
```
Results (tensor-split across both 5090s, default `-sm layer`):

| Config | pp512 (t/s) | pp4096 (t/s) | pp16384 (t/s) | tg128 / tg64 (t/s) |
|---|---|---|---|---|
| `f16/f16` (baseline) | 12 533 ± 1 035 | 16 263 ± 4 | n/a | 234 / 244 |
| `iso3/iso3` | 5 771 ± 298 (−54 %) | 2 166 ± 2 (−87 %) | 691 ± 1 | 179 / 186 |
| `planar3/planar3` | 7 586 ± 140 (−40 %) | — | — | 184 |
| `planar3/f16` | 14 850 ± 534 (+18 %) | — | — | 218 |
`iso3` prefill throughput collapses from ~46 % of baseline at 512 tokens to ~13 % at 4 K to ~4 % at 16 K. Decode (tg) throughput is essentially unaffected, so the regression is on the prefill / batch-eval path, not the single-token attention kernel.

`planar3/f16` (K-only compression) is the only rotorquant config that avoids the collapse — it is actually faster than baseline at pp512, consistent with it skipping work (V stays `f16`) rather than hitting the slow path.
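For concreteness, the percentages above fall straight out of the table (the pp16384 figure uses the pp4096 baseline as a proxy, since no `f16` pp16384 run was captured):

```sh
# Relative iso3/iso3 prefill throughput vs the f16/f16 baseline, numbers from the table
awk 'BEGIN {
  printf "pp512:   %.0f%% of baseline\n", 100 * 5771 / 12533   # ~46%
  printf "pp4096:  %.0f%% of baseline\n", 100 * 2166 / 16263   # ~13%
  printf "pp16384: %.0f%% of baseline\n", 100 * 691  / 16263   # ~4% (pp4096 f16 as proxy)
}'
```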
## Reproducer B — garbage output at ~37 K context (`llama-server`)

With Llama 3.1 8B Q4_K_M and a ~37 K-token Punjabi scripture prompt (a real workload — long-form Gurmukhi extraction):

```sh
./build/bin/llama-server -m $MODEL -ngl 99 --jinja \
  --host 127.0.0.1 --port 8090 \
  -c 65536 -t 16 -ts 1,1 \
  -ctk iso3 -ctv iso3 -fa on
```
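Any OpenAI-compatible client against the server reproduces it; a minimal sketch (here `prompt-37k.txt` is a stand-in for our Gurmukhi prompt file, and `jq` ≥ 1.6 is assumed for `--rawfile`):

```sh
# max_tokens kept small since decode runs at ~0.3–0.6 t/s in the failing configs
curl -s http://127.0.0.1:8090/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$(jq -n --rawfile p prompt-37k.txt \
        '{messages: [{role: "user", content: $p}], max_tokens: 128}')"
```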
| Config | Decode (t/s) | Output |
|---|---|---|
| `-ctk iso3 -ctv iso3 -fa on` | 0.6 | garbage: `I?????????` |
| `-ctk planar3 -ctv f16 -fa on` | 0.4 | garbage: `I?????????` |
| `-ctk iso3 -ctv iso3` (no `-fa`) | 0.3 | garbage |
| `-ctk f16 -ctv f16 -fa on` (fork baseline) | 156.8 | coherent English |
Raw data is captured at `~/local_llm/.recon/install-gate-5-v2-rotorquant.json` in our local repo — happy to upload if useful.

Given the prefill scaling in Reproducer A, my guess is both symptoms share a root cause: a prefill kernel that scales wrongly on sm_120 and that, past some context threshold, produces corrupt output as well as slow output.
## What I've ruled out

- Not a broken toolchain. The fork's own `f16/f16` baseline runs at 156.8 t/s decode / 12.5 K t/s prefill on the same build, with coherent output.
- Not a caller-side `-fa` issue. Garbage reproduces with and without flash attention.
- Not sm_120a vs sm_120. `CMAKE_CUDA_ARCHITECTURES=120` targets base sm_120 PTX; kernels using a-family accelerated ops still run.
- Not a stale CUDA version in the build cache. Built cleanly with the `/opt/cuda-12.8/` toolkit; `libggml-cuda.so` links `libcudart.so.12`. (Both binary-level checks are sketched below.)
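Both binary-level points can be double-checked directly (a sketch; paths follow this build tree, and `cuobjdump` ships with the CUDA toolkit):

```sh
# Count sm_120 code objects in the fatbinary — non-zero confirms real sm_120 cubins
/opt/cuda-12.8/bin/cuobjdump --list-elf build/bin/libggml-cuda.so | grep -c sm_120
# Confirm the linked CUDA runtime is the 12.x one
ldd build/bin/libggml-cuda.so | grep libcudart
```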
## Hypothesis

The rotorquant CUDA kernels landed in the consumer fork via:

- `1ed0453` — "Add CUDA set_rows kernels for planar3/iso3/planar4/iso4"
- `26c90d6` — "Add CUDA F16→quantized conversion kernels for planar3/4 and iso3/4"
- `a75b16f` — "Add CUDA flash attention dequantize for planar3/iso3/planar4/iso4"
- `9d4ece5` — "COMPRESSION WORKS: 5.1x K-cache + 200 tok/s decode on CUDA"
None of those commits mention `cu128`, WMMA, `mma.sync`, `__CUDA_ARCH__ >= 1200`, or any architecture gating. The "200 tok/s decode on CUDA" in `9d4ece5` was almost certainly measured on sm_89 Ada (RTX 4090 / A6000), since none of the commits adjust block/warp sizing or add Blackwell-specific codepaths.
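That absence is quick to verify against the fork checkout (a sketch; the `ggml/src/ggml-cuda/` layout matches upstream llama.cpp, and the rotorquant source file names are guesses):

```sh
# Any architecture gating or tensor-core paths in the rotorquant CUDA sources would show up here
grep -rniE '__CUDA_ARCH__|mma\.sync|wmma' ggml/src/ggml-cuda/ | grep -iE 'iso|planar|rotor'
```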
Most-likely candidates for the sm_120 path going bad (a first sanitizer pass is sketched after this list):

- Givens-rotation prefill kernel uses a block shape or shared-memory layout valid on sm_89 (128 KB dynamic smem) but overflowing / mis-aligning on sm_120 (228 KB dynamic smem with a different bank layout)
- WMMA / `mma.sync` instruction chosen by template expansion compiles on sm_120 but produces wrong output for Givens-rotated tiles
- Constant-memory read pattern for the rotation coefficients behaves differently on Blackwell (CC 12.0 changed constant-cache behaviour)
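A first triage pass for the first two candidates could look like this (a sketch, reusing `$MODEL` from Reproducer A; `memcheck` and `initcheck` are standard `compute-sanitizer` tools and should flag the smem-overflow / misalignment case and uninitialised reads respectively):

```sh
# Smallest config that already shows the collapse; one decode token to keep runs short
compute-sanitizer --tool memcheck \
  ./build/bin/llama-bench -m $MODEL -ngl 99 -p 4096 -n 1 -ctk iso3 -ctv iso3
compute-sanitizer --tool initcheck \
  ./build/bin/llama-bench -m $MODEL -ngl 99 -p 4096 -n 1 -ctk iso3 -ctv iso3
```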
Cross-reference: ggml-org/llama.cpp#21915 (gibberish with quantised KV cache on sm_120, open as of 2026-04-14) — the same symptom family for non-rotorquant KV quantisation, which suggests the sm_120 + quantised-KV path is fragile upstream too.
## Ask

- Which compute capabilities were the `iso3` / `planar3` kernels validated on, in CI or manual testing, before the RTX 5090 numbers landed in the README?
- Is there a suspected kernel (Givens rotation in prefill? FA dequantize?) where the sm_89 → sm_120 jump would plausibly regress? I can run `compute-sanitizer` / `nsys profile` and share output if it would help narrow this down — concrete commands are sketched after this list.
- If this is a known soft spot on Blackwell, would it be worth a note in the README until the kernel paths are updated?
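Concretely, the profiling offer above would be something along these lines (a sketch; output names are placeholders):

```sh
# Kernel-level timeline of the collapsing iso3 prefill vs the healthy baseline
nsys profile -t cuda -o rotorquant-iso3-pp16384 \
  ./build/bin/llama-bench -m $MODEL -ngl 99 -p 16384 -n 1 -ctk iso3 -ctv iso3
nsys profile -t cuda -o baseline-f16-pp16384 \
  ./build/bin/llama-bench -m $MODEL -ngl 99 -p 16384 -n 1 -ctk f16 -ctv f16
```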
Happy to coordinate a debug session — this is blocking a real-world Gurmukhi scripture extraction workload for us (30 K–50 K-token chunks), so a fix here would be directly useful. The algorithm itself is genuinely encouraging — `planar3/f16` showing K-only compression faster than baseline at pp512 is a nice result.
Filed 2026-04-20. The consumer fork `johndpope/llama-cpp-turboquant` has issues + discussions disabled, so there's no parallel ticket there — this is the single issue for this bug. Machine-readable repro data is at our local `.recon/install-gate-5-v2-rotorquant.json`.