Background / Problem
Today, multi-turn reuse works by having the caller send only the new turn, and the engine reuses the persisted KV cache:
- prepare_input_for_model() switches to incremental_prompt when h.kv_len > 0 (api/quick_dot_ai_api.cpp:537-557).
- incremental_prompt extracts only the latest user content (QNN: src/models/qnn/gauss-3.6-qnn/gauss3_6_qnn.cpp:650-654).
- Position persists across turns via h.kv_len (quick_dot_ai_api.cpp:95-96, 424-432); read_kv_len callback in api/model_callbacks.h:28-33.
This contract is fragile:
- The caller must know exactly what was already cached and send only the new suffix.
- It breaks under conversation editing / regeneration / branching (cache no longer matches the intended prefix).
- nntrainer-side read_kv_len returns 0 (not implemented) → this path is effectively QNN-only.
Requested behavior: let the caller pass the full accumulated context every turn; the engine records how much it has already processed (a recorded context length / processed token sequence)
and only computes the unprocessed delta, reusing cached KV for the matching prefix.
Goal
Make the API robust and caller-friendly: full context in, minimal compute out, correct on divergence.
Proposed scope
Acceptance criteria
- Sending the full context each turn yields the same output as the current incremental path, with compute proportional to the new tokens only.
- Editing an earlier turn correctly invalidates and recomputes from the divergence point.
- Works on both QNN and nntrainer backends (or nntrainer gap documented).
Risks / notes
- Tokenizer-level prefix matching must be exact; whitespace/template re-rendering can shift token boundaries — match on token IDs, not raw strings.
- Tight coupling with Issue 4 (KV truncation requires position rollback) and Issue 2 (per-session recorded length).
- Defines a cleaner contract than the current "send only the new turn" approach and removes the chat-template-dependent extract_latest_* heuristics.
Background / Problem
Today, multi-turn reuse works by having the caller send only the new turn, and the engine reuses the persisted KV cache:
This contract is fragile:
Requested behavior: let the caller pass the full accumulated context every turn; the engine records how much it has already processed (a recorded context length / processed token sequence)
and only computes the unprocessed delta, reusing cached KV for the matching prefix.
Goal
Make the API robust and caller-friendly: full context in, minimal compute out, correct on divergence.
Proposed scope
Acceptance criteria
Risks / notes