fix: Gemma 4 + TurboQuant KV no longer crashes on second prompt when --cache-reuse enabled #10
Open
sujitvasanth wants to merge 1 commit into
Overview
The previous cache bug #9 masked a knock-on problem in the RoPE shift implementation. This fix is necessary for TurboQuant to function properly with cache reuse on Gemma 4.
TurboQuant (turbo2/3/4) uses kernel-level WHT rotation, which is position-invariant: the WHT preserves inner products, so no RoPE correction is needed after a KV position shift.
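To make the invariance concrete, here is a small standalone demo (illustration only, not project code) showing that the unnormalized Walsh-Hadamard transform scales every inner product by the same factor n, so attention scores are preserved regardless of where entries sit in the sequence:

```cpp
#include <cstdio>
#include <vector>

// In-place fast Walsh-Hadamard transform; size must be a power of two.
static void fwht(std::vector<double> & a) {
    for (size_t len = 1; len < a.size(); len <<= 1) {
        for (size_t i = 0; i < a.size(); i += len << 1) {
            for (size_t j = i; j < i + len; ++j) {
                const double u = a[j], v = a[j + len];
                a[j]       = u + v;
                a[j + len] = u - v;
            }
        }
    }
}

static double dot(const std::vector<double> & x, const std::vector<double> & y) {
    double s = 0.0;
    for (size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

int main() {
    std::vector<double> q = { 0.3, -1.2,  0.7, 2.0, -0.5,  1.1, 0.0, -0.9 };
    std::vector<double> k = { 1.0,  0.4, -0.6, 0.2,  0.8, -1.5, 0.3,  0.7 };
    const double before = dot(q, k);
    fwht(q);
    fwht(k);
    // H * H^T = n * I for the +/-1 Hadamard matrix, so <Hq, Hk> = n * <q, k>:
    // scores are preserved up to a fixed scale, independent of position.
    printf("dot(q, k) = %f, dot(Hq, Hk) / n = %f\n", before, dot(q, k) / (double) q.size());
}
```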
build_graph_shift() assumed standard quantized tensors with upstream rotation, but TurboQuant sets attn_rot_k=0 and handles rotation at the kernel level. Building the shift graph with turbo-padded tensors triggers a null-buffer assert and a segfault on the second prompt.
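For background on why a shift graph exists at all: RoPE rotations compose additively, so a cached K entry rotated for position p can be corrected to position p + delta with one extra rotation by delta -- that re-rotation is what build_graph_shift() normally emits. A minimal standalone sketch (not project code):

```cpp
#include <cmath>
#include <cstdio>

// Rotate one (x, y) channel pair by pos * theta -- the per-pair RoPE operation.
static void rope_pair(double & x, double & y, double pos, double theta) {
    const double c = std::cos(pos * theta);
    const double s = std::sin(pos * theta);
    const double nx = x * c - y * s;
    const double ny = x * s + y * c;
    x = nx; y = ny;
}

int main() {
    const double theta = 0.1;
    double ax = 1.0, ay = 2.0;  // K pair rotated fresh at position 7
    double bx = 1.0, by = 2.0;  // K pair rotated at position 12, then shifted by -5
    rope_pair(ax, ay,  7.0, theta);
    rope_pair(bx, by, 12.0, theta);
    rope_pair(bx, by, -5.0, theta);  // the correction a shift graph applies
    // Both pairs match: rot(12*theta) followed by rot(-5*theta) == rot(7*theta).
    printf("fresh: (%f, %f)  corrected: (%f, %f)\n", ax, ay, bx, by);
}
```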
Fix: skip the per-layer build_graph_shift() work and bypass get_has_shift() entirely for turbo KV types. Position tracking via seq_add() still works correctly -- only the broken RoPE re-rotation kernel is skipped.
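For reviewers, a hedged sketch of the shape of the guard; the enum values and helper name below are placeholders made up for illustration, not this repo's actual identifiers:

```cpp
#include <cstdio>

// Placeholder enum -- the fork's real KV type identifiers will differ.
enum kv_type { KV_Q8_0, KV_TURBO2, KV_TURBO3, KV_TURBO4 };

// turbo2/3/4 handle rotation in-kernel via WHT, so their cached entries
// are position-invariant and never need a RoPE shift correction.
static bool is_turbo_kv_type(kv_type t) {
    return t == KV_TURBO2 || t == KV_TURBO3 || t == KV_TURBO4;
}

// Sketch of the guard: report "no shift pending" for turbo KV types so the
// shift graph is never built over turbo-padded tensors. seq_add() position
// bookkeeping is unaffected -- it runs independently of this check.
static bool get_has_shift(kv_type type_k, bool has_pending_shift) {
    if (is_turbo_kv_type(type_k)) {
        return false;
    }
    return has_pending_shift;
}

int main() {
    printf("q8_0   -> has_shift = %d\n", get_has_shift(KV_Q8_0,   true)); // 1: shift graph runs
    printf("turbo3 -> has_shift = %d\n", get_has_shift(KV_TURBO3, true)); // 0: shift graph skipped
}
```

Reporting "no shift pending" up front, rather than tolerating the shift graph, keeps the fix minimal: WHT-rotated entries never need correction, so there is nothing for the graph to do.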
Additional information
Combined with the previous PR that recognises caching in Gemma 4, this gives near-instantaneous chat turns in the llama-server web GUI, where previously there was a reprocessing lag of 7+ seconds and a crash on any prompt that triggered a sliding-window shift.
I have tested up to around 6k of the available 250k context and it now works flawlessly.
Requirements