[Feat(API)] Accept full accumulated context and compute only the unprocessed delta via recorded context length #34

Description

@dlwlzzero

Background / Problem

Today, multi-turn reuse works by having the caller send only the new turn, and the engine reuses the persisted KV cache:

  • prepare_input_for_model() switches to incremental_prompt when h.kv_len > 0 (api/quick_dot_ai_api.cpp:537-557).
  • incremental_prompt extracts only the latest user content (QNN: src/models/qnn/gauss-3.6-qnn/gauss3_6_qnn.cpp:650-654).
  • Position persists across turns via h.kv_len (quick_dot_ai_api.cpp:95-96, 424-432); read_kv_len callback in api/model_callbacks.h:28-33.

This contract is fragile:

  • The caller must know exactly what was already cached and send only the new suffix.
  • It breaks under conversation editing / regeneration / branching (cache no longer matches the intended prefix).
  • The nntrainer-side read_kv_len returns 0 (not implemented), so this path is effectively QNN-only.

Requested behavior: let the caller pass the full accumulated context every turn; the engine records how much it has already processed (a recorded context length / processed token sequence) and only computes the unprocessed delta, reusing cached KV for the matching prefix.

Goal

Make the API robust and caller-friendly: full context in, minimal compute out, correct on divergence.

Proposed scope

  • Tokenize the full incoming context and compute the longest common prefix against the recorded processed-token sequence for the session.
  • Reuse KV for the matched prefix; prefill/decode only the new suffix.
  • On divergence (edit/regeneration/branch): truncate KV + recorded length to the divergence point, then process the remainder.
  • Persist the processed token sequence (or a robust signature) alongside kv_len in the handle/session (CausalLmModel, quick_dot_ai_api.cpp:87-97).
  • Replace/augment the brittle incremental_prompt "extract latest turn" heuristic with token-level prefix matching.
  • Implement the nntrainer-side read_kv_len/position path so this is not QNN-only.
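
The prefix-matching and truncation steps above could look roughly like the sketch below. It is only an illustration under stated assumptions: `SessionState`, `PrefillPlan`, `plan_prefill`, and `commit` are hypothetical names that do not exist in the codebase, and the real handle (`CausalLmModel`) and the backend KV truncation call are not modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using TokenId = std::int32_t;

// Hypothetical per-session state; the real handle may store this differently.
struct SessionState {
  std::vector<TokenId> processed;  // token IDs whose KV entries are cached
  std::size_t kv_len = 0;          // number of valid KV positions
};

// Hypothetical result of comparing the incoming context with the session.
struct PrefillPlan {
  std::size_t reuse_len;        // KV positions that can be reused as-is
  bool truncated;               // true if cached positions were dropped
  std::vector<TokenId> suffix;  // tokens that still have to be computed
};

// Length of the longest common prefix of two token-ID sequences.
inline std::size_t common_prefix_len(const std::vector<TokenId>& a,
                                     const std::vector<TokenId>& b) {
  const std::size_t n = std::min(a.size(), b.size());
  const auto ends = std::mismatch(
      a.begin(), a.begin() + static_cast<std::ptrdiff_t>(n), b.begin());
  return static_cast<std::size_t>(ends.first - a.begin());
}

// Roll the session back to the divergence point and return the work left.
// A real implementation would also truncate the backend KV cache here.
inline PrefillPlan plan_prefill(SessionState& s,
                                const std::vector<TokenId>& full) {
  assert(s.kv_len == s.processed.size());
  const std::size_t keep = common_prefix_len(s.processed, full);
  const bool truncated = keep < s.processed.size();
  s.processed.resize(keep);
  s.kv_len = keep;
  std::vector<TokenId> suffix(
      full.begin() + static_cast<std::ptrdiff_t>(keep), full.end());
  return PrefillPlan{keep, truncated, suffix};
}

// Record tokens once the backend has actually computed KV for them.
inline void commit(SessionState& s, const std::vector<TokenId>& computed) {
  s.processed.insert(s.processed.end(), computed.begin(), computed.end());
  s.kv_len = s.processed.size();
}
```

In this sketch the caller would run `plan_prefill` on every request, feed `suffix` to the model starting at position `reuse_len`, and then call `commit` with the tokens that were processed (including generated tokens, if they should count as cached context). Keeping `kv_len` equal to `processed.size()` is a simplification; a signature-based variant would need a different comparison.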

Acceptance criteria

  • Sending the full context each turn yields the same output as the current incremental path, with compute proportional to the new tokens only.
  • Editing an earlier turn correctly invalidates and recomputes from the divergence point.
  • Works on both QNN and nntrainer backends (or nntrainer gap documented).

Risks / notes

  • Tokenizer-level prefix matching must be exact; whitespace/template re-rendering can shift token boundaries — match on token IDs, not raw strings.
  • Tight coupling with Issue 4 (KV truncation requires position rollback) and Issue 2 (per-session recorded length).
  • Defines a cleaner contract than the current "send only the new turn" approach and removes the chat-template-dependent extract_latest_* heuristics.
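
To illustrate the first risk, here is a toy example of why a raw-string prefix does not imply a token-ID prefix. The tokenizer below is a deliberately simplified greedy longest-match over a made-up four-entry vocabulary; `toy_tokenize` and its vocabulary are hypothetical and unrelated to the tokenizer the engine actually uses.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy greedy longest-match tokenizer. Token ID = index into the vocabulary.
// Characters not covered by the vocabulary are skipped in this sketch.
inline std::vector<int> toy_tokenize(const std::string& text) {
  static const std::vector<std::string> vocab = {"a", "b", "ab", "c"};
  std::vector<int> ids;
  std::size_t pos = 0;
  while (pos < text.size()) {
    int best = -1;
    std::size_t best_len = 0;
    for (std::size_t i = 0; i < vocab.size(); ++i) {
      const std::string& piece = vocab[i];
      // Prefer the longest vocabulary entry that matches at this position.
      if (piece.size() > best_len &&
          text.compare(pos, piece.size(), piece) == 0) {
        best = static_cast<int>(i);
        best_len = piece.size();
      }
    }
    if (best < 0) {
      ++pos;  // unknown character: skip it
      continue;
    }
    ids.push_back(best);
    pos += best_len;
  }
  return ids;
}
```

With this vocabulary, `"a"` tokenizes to `{0}` while `"ab"` tokenizes to `{2}`: the string `"a"` is a prefix of `"ab"`, yet the two token sequences share no prefix, so KV cached for `"a"` could not be reused. Comparing token IDs catches this case, whereas comparing raw strings would not.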
