fix: Gemma 4 time-to-first-token drops from 8-12s to <1s by unblocking cache reuse #9
Open
sujitvasanth wants to merge 1 commit into
Conversation
Overview
The atomic implementation of Gemma 4 is currently not as efficient as it could be due to broken cache reuse. This is a known problem even on the main llama.cpp branch, where it blocks cache reuse when the server swaps slots, but it is even more deleterious in the atomic implementation: the faulty code is pinned into the turboquant implementation, so it affects the atomic implementation at every single prompt, crippling time to first token to around 7 seconds because the cache is rebuilt each time. After the fix it drops to less than 1 second. (For Gemma with turboquant, the fix needs some additional changes in RoPE; see the next PR. Those RoPE issues were hidden until now because this bug stopped the shift code from ever running.)
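To make the failure mode concrete, here is a minimal sketch of how a failing can-shift check forces a full prompt rebuild instead of cache reuse. It uses the public llama.cpp memory API (llama_memory_can_shift, llama_memory_seq_add, llama_memory_seq_rm); the helper itself and its parameters are hypothetical, and the real server logic differs in detail:

```cpp
#include "llama.h"

// Hypothetical helper (not the actual llama.cpp server code): if the memory
// module reports it cannot shift, the cached tokens are discarded and the
// whole prompt is re-ingested, which is the multi-second TTFT path.
static void reuse_or_rebuild(llama_memory_t mem, llama_seq_id seq,
                             llama_pos keep_end, llama_pos delta, int & n_past) {
    if (llama_memory_can_shift(mem)) {
        // Shift the retained tokens into place and keep the cache.
        llama_memory_seq_add(mem, seq, keep_end, -1, delta);
        n_past += delta;
    } else {
        // get_can_shift() returned false: drop the cache and start over
        // from position 0 at every prompt.
        llama_memory_seq_rm(mem, seq, -1, -1);
        n_past = 0;
    }
}
```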
Hopefully this will be a very useful unlock for Gemma 4 inference on this branch and in llama.cpp.
The kv_base->get_size() == kv_swa->get_size() condition in get_can_shift() was introduced on the llama.cpp main branch in PR ggml-org#15467, before Gemma 4 existed. Gemma 4 has 10 global layers vs 50 SWA layers by design, so this check always returns false, permanently blocking cache reuse for all Gemma 4 users.
The individual get_can_shift() calls on each sub-cache already guard shift safety independently. Removing the size equality check is safe for all existing models.
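For reference, a sketch of the change as described above. The class name (llama_kv_cache_unified_iswa) and exact method bodies are assumptions based on the upstream iSWA cache; the actual diff may differ slightly:

```cpp
// Before (from ggml-org#15467): refuses to shift whenever the base and SWA
// caches differ in size, which is always the case for Gemma 4.
bool llama_kv_cache_unified_iswa::get_can_shift() const {
    return kv_base->get_size() == kv_swa->get_size();
}

// After: defer to the per-cache checks, which already guard shift safety.
bool llama_kv_cache_unified_iswa::get_can_shift() const {
    return kv_base->get_can_shift() && kv_swa->get_can_shift();
}
```

Because each sub-cache still performs its own get_can_shift() check, the size-equality comparison adds no protection that the per-cache checks do not already provide.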
Additional information
I also submitted the fix to llama.cpp mainline for a known issue:
ggml-org#21831 (comment)
Requirements