
fix: Gemma 4 time-to-first-token drops from 8-12s to <1s by unblocking cache reuse#9

Open
sujitvasanth wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from sujitvasanth:fix/iswa-get-can-shift-gemma4

Conversation

@sujitvasanth

@sujitvasanth sujitvasanth commented May 11, 2026

Overview

The atomic implementation of Gemma 4 is currently less efficient than it could be due to broken cache reuse. This is a known problem even in the main llama.cpp branch, where it blocks cache reuse during slot swapping on the server, but it is even more deleterious in the atomic implementation: the faulty code is pinned in the turboquant implementation, so it affects every single prompt, crippling time-to-first-token to around 7 seconds while the cache is rebuilt. After this fix it drops to less than 1 second. (For Gemma with turboquant, some additional RoPE changes are also needed; see the next PR. Those issues were previously hidden because this bug bypassed the affected code paths entirely.)

Hopefully this will be a very useful unlock for Gemma 4 inference on this branch and in llama.cpp.

The kv_base->get_size() == kv_swa->get_size() condition in get_can_shift() was introduced in llama.cpp main branch PR ggml-org#15467 before Gemma 4 existed. Gemma 4 has 10 global layers vs 50 SWA layers by design, so this check always returns false, permanently blocking cache reuse for all Gemma 4 users.

The individual get_can_shift() calls on each sub-cache already guard shift safety independently. Removing the size equality check is safe for all existing models.

Additional information

I also submitted the fix to llama.cpp mainline for a known issue:
ggml-org#21831 (comment)

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, co-authored with Claude. I have checked that the code makes sense, then built and tested it. All working smoothly on a dual-GPU RTX 3060 + GTX 1660 setup on Ubuntu 20.04.

…g cache reuse

The kv_base->get_size() == kv_swa->get_size() condition in get_can_shift()
was introduced in PR ggml-org#15467 before Gemma 4 existed. Gemma 4 has 10 global
layers vs 50 SWA layers by design, so this check always returns false,
permanently blocking cache reuse for all Gemma 4 users.

The individual get_can_shift() calls on each sub-cache already guard shift
safety independently. Removing the size equality check is safe for all
existing models.

Fix also submitted to llama.cpp mainline:
ggml-org#21831 (comment)