
fix: Gemma 4 + TurboQuant KV no longer crashes on second prompt when --cache-reuse enabled #10

Open

sujitvasanth wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from sujitvasanth:fix/turbo-rope-shift-gemma4


Conversation


sujitvasanth commented May 11, 2026

Overview

The previous cache bug (#9) prevented the discovery of a knock-on problem in the RoPE implementation. This fix is needed for TurboQuant to function properly with cache reuse on Gemma 4.

TurboQuant (turbo2/3/4) uses kernel-level WHT rotation, which is position-invariant -- WHT preserves inner products so no RoPE correction is needed after a KV position shift.
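For context, here is a minimal standalone illustration (not project code) of why no correction is needed: a WHT-style rotation is orthogonal, so it preserves inner products, and the attention score q.k comes out the same whether or not the rotation is applied. The 4x4 matrix and test vectors below are made up for the demo.

```cpp
// Toy demo: an orthogonal (normalized Hadamard) rotation preserves q.k,
// so attention scores are unaffected by where the rotation happens.
#include <array>
#include <cstdio>

int main() {
    // 4x4 normalized Hadamard matrix H (H * H^T = I).
    const float s = 0.5f; // 1/sqrt(4)
    const float H[4][4] = {
        { s,  s,  s,  s},
        { s, -s,  s, -s},
        { s,  s, -s, -s},
        { s, -s, -s,  s},
    };

    const std::array<float, 4> q = {0.3f, -1.2f, 0.7f, 2.0f};
    const std::array<float, 4> k = {1.5f,  0.4f, -0.9f, 0.1f};

    auto rotate = [&](const std::array<float, 4> & v) {
        std::array<float, 4> out{};
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                out[i] += H[i][j] * v[j];
        return out;
    };

    auto dot = [](const std::array<float, 4> & a, const std::array<float, 4> & b) {
        float acc = 0.0f;
        for (int i = 0; i < 4; ++i) acc += a[i] * b[i];
        return acc;
    };

    const auto qr = rotate(q);
    const auto kr = rotate(k);

    // Both lines print the same value (up to float rounding).
    std::printf("q.k (plain)   = %f\n", dot(q, k));
    std::printf("q.k (rotated) = %f\n", dot(qr, kr));
    return 0;
}
```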

build_graph_shift() assumed standard quantized tensors with upstream rotation, but TurboQuant sets attn_rot_k=0 and handles rotation at the kernel level. Building the shift graph with turbo-padded tensors causes a null-buffer assert and a segfault on the second prompt.

Fix: skip build_graph_shift() layers and get_has_shift() entirely for turbo KV types. Position tracking via seq_add() still works correctly -- only the broken RoPE re-rotation kernel is skipped.
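As a rough sketch of the shape of the guard (hypothetical names throughout: kv_cache, kv_type_is_turbo, build_graph_shift_if_needed stand in for the fork's real internals, and are not its actual API):

```cpp
// Hypothetical sketch, not the fork's real code. The idea: when the KV cache
// uses a turbo2/3/4 type, report "no shift pending" and never build the RoPE
// re-rotation graph; position bookkeeping (seq_add) is left untouched.
#include <cstdio>

enum class kv_type { f16, q8_0, turbo2, turbo3, turbo4 };

struct kv_cache {
    kv_type type        = kv_type::f16;
    bool    shift_dirty = false;   // set when a sliding-window shift happened

    bool kv_type_is_turbo() const {
        return type == kv_type::turbo2 || type == kv_type::turbo3 || type == kv_type::turbo4;
    }

    // Analogue of get_has_shift(): turbo caches never request re-rotation,
    // because the kernel-level WHT rotation is position-invariant.
    bool get_has_shift() const {
        if (kv_type_is_turbo()) return false;
        return shift_dirty;
    }
};

// Analogue of the graph-build path: only non-turbo caches get a shift graph.
void build_graph_shift_if_needed(const kv_cache & kv) {
    if (!kv.get_has_shift()) {
        std::puts("no shift graph needed");
        return;
    }
    std::puts("building RoPE re-rotation graph for standard KV tensors");
}

int main() {
    kv_cache standard; standard.type = kv_type::q8_0;   standard.shift_dirty = true;
    kv_cache turbo;    turbo.type    = kv_type::turbo3; turbo.shift_dirty    = true;

    build_graph_shift_if_needed(standard); // builds the shift graph
    build_graph_shift_if_needed(turbo);    // skipped: avoids the null-buffer assert
    return 0;
}
```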

Additional information

Combined with the previous PR that recognises caching in Gemma 4, this gives near-instantaneous chat conversations in the llama-server web GUI, where previously there was a reprocessing lag of 7+ seconds and a crash on any prompt that caused a sliding-window shift.
I have tested up to around 6k of the available 250k context and it now works flawlessly.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, co-authored with Claude. I have built and tested it and confirm it is fully functional on my RTX 3060 + GTX 1660 setup on Ubuntu 20.04.

…--cache-reuse enabled

TurboQuant (turbo2/3/4) uses kernel-level WHT rotation which is
position-invariant -- WHT preserves inner products so no RoPE correction
is needed after a KV position shift.

build_graph_shift() assumed standard quantized tensors with upstream
rotation, but TurboQuant sets attn_rot_k=0 and handles rotation at kernel
level. Building the shift graph with turbo-padded tensors causes a null
buffer assert and segfault on the second prompt.

Fix: skip build_graph_shift() layers and get_has_shift() entirely for
turbo KV types. Position tracking via seq_add() still works correctly --
only the broken RoPE re-rotation kernel is skipped.
