[Feat(API)] Accept full accumulated context and compute only the unprocessed delta via recorded context length #34

## Background / Problem

  Today, multi-turn reuse works by having the caller send only the new turn, and the engine reuses the persisted KV cache:

  - prepare_input_for_model() switches to incremental_prompt when h.kv_len > 0 (api/quick_dot_ai_api.cpp:537-557).
  - incremental_prompt extracts only the latest user content (QNN: src/models/qnn/gauss-3.6-qnn/gauss3_6_qnn.cpp:650-654).
  - Position persists across turns via h.kv_len (quick_dot_ai_api.cpp:95-96, 424-432); the read_kv_len callback is in api/model_callbacks.h:28-33.

  This contract is fragile:
  - The caller must know exactly what was already cached and send only the new suffix.
  - It breaks under conversation editing / regeneration / branching (cache no longer matches the intended prefix).
  - On the nntrainer side, read_kv_len is not implemented and returns 0, so this path is effectively QNN-only.

  Requested behavior: let the caller pass the full accumulated context every turn. The engine records what it has already processed (a recorded context length, or more robustly the processed token sequence), computes only the unprocessed delta, and reuses the cached KV for the matching prefix.
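A minimal sketch of the core computation, assuming the session keeps the processed token IDs and that token IDs fit in `int32_t`. `TokenId`, `DeltaPlan`, `common_prefix_len`, and `plan_delta` are hypothetical names for illustration, not existing functions in this repo.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Assumption: token IDs fit in int32_t.
using TokenId = std::int32_t;

// What the engine must do for one call (hypothetical type).
struct DeltaPlan {
  std::size_t reuse_len = 0;   // leading cached tokens whose KV is kept
  std::size_t drop_len = 0;    // cached tokens past reuse_len to truncate
  std::vector<TokenId> delta;  // tokens to prefill, starting at reuse_len
};

// Length of the longest common prefix of two token sequences.
std::size_t common_prefix_len(const std::vector<TokenId>& a,
                              const std::vector<TokenId>& b) {
  auto mm = std::mismatch(a.begin(), a.end(), b.begin(), b.end());
  return static_cast<std::size_t>(std::distance(a.begin(), mm.first));
}

// Compare the recorded processed tokens with the full incoming context.
DeltaPlan plan_delta(const std::vector<TokenId>& processed,
                     const std::vector<TokenId>& incoming) {
  std::size_t lcp = common_prefix_len(processed, incoming);
  // Fully cached context: re-feed the last token so the forward pass still
  // yields logits for the next-token step (assumption about the backends).
  if (lcp == incoming.size() && lcp > 0) {
    --lcp;
  }
  assert(lcp <= processed.size());
  DeltaPlan plan;
  plan.reuse_len = lcp;
  plan.drop_len = processed.size() - lcp;
  plan.delta.assign(incoming.begin() + static_cast<std::ptrdiff_t>(lcp),
                    incoming.end());
  return plan;
}
```

One design choice worth flagging: when the incoming context is already fully cached, the sketch backs off by one token so the model still has an input to run. Whether the QNN and nntrainer paths actually need this depends on whether they retain the last logits between calls.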

## Goal

  Make the API robust and caller-friendly: full context in, minimal compute out, correct on divergence.

## Proposed scope

  - [ ] Tokenize the full incoming context and compute the longest common prefix against the recorded processed-token sequence for the session.
  - [ ] Reuse KV for the matched prefix; prefill/decode only the new suffix.
  - [ ] On divergence (edit/regeneration/branch): truncate KV + recorded length to the divergence point, then process the remainder.
  - [ ] Persist the processed token sequence (or a robust signature) alongside kv_len in the handle/session (CausalLmModel, quick_dot_ai_api.cpp:87-97).
  - [ ] Replace/augment the brittle incremental_prompt "extract latest turn" heuristic with token-level prefix matching.
  - [ ] Implement the nntrainer-side read_kv_len/position path so this is not QNN-only.
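A sketch of how these scope items could fit together in a per-session object, assuming each backend can expose a truncate/prefill pair. `KvBackend`, `Session`, `sync`, and `record_generated` are hypothetical; the real state lives in `CausalLmModel` / the handle in quick_dot_ai_api.cpp, and the existing read_kv_len callback is not modeled. The idea shown is that the recorded token sequence is the source of truth and `kv_len` is derived from its size, so KV truncation and position rollback cannot drift apart.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using TokenId = std::int32_t;

// Hypothetical backend interface: QNN and nntrainer would each need an
// equivalent of these two operations.
class KvBackend {
 public:
  virtual ~KvBackend() = default;
  // Discard KV entries at positions >= new_len and roll the position back.
  virtual void truncate(std::size_t new_len) = 0;
  // Prefill `tokens`, writing KV entries starting at `start_pos`.
  virtual void prefill(const std::vector<TokenId>& tokens,
                       std::size_t start_pos) = 0;
};

// Per-session state: the processed token sequence replaces a bare kv_len.
class Session {
 public:
  explicit Session(KvBackend& backend) : backend_(backend) {}

  std::size_t kv_len() const { return processed_.size(); }

  // Accept the full accumulated context and prefill only the unprocessed
  // delta. Returns the number of tokens that were prefilled.
  std::size_t sync(const std::vector<TokenId>& incoming) {
    std::size_t lcp = 0;
    while (lcp < processed_.size() && lcp < incoming.size() &&
           processed_[lcp] == incoming[lcp]) {
      ++lcp;
    }
    // Keep at least one token to run when the context is fully cached.
    if (lcp == incoming.size() && lcp > 0) --lcp;
    if (lcp < processed_.size()) {
      // Divergence (edit / regeneration / branch) or re-feed: roll back
      // both the backend KV and the recorded sequence to the same point.
      backend_.truncate(lcp);
      processed_.resize(lcp);
    }
    std::vector<TokenId> delta(
        incoming.begin() + static_cast<std::ptrdiff_t>(lcp), incoming.end());
    if (!delta.empty()) {
      backend_.prefill(delta, lcp);
      processed_.insert(processed_.end(), delta.begin(), delta.end());
    }
    assert(processed_ == incoming);
    return delta.size();
  }

  // Record a token whose KV entry the backend has already written, e.g. a
  // sampled token after it was fed back for the next decode step.
  void record_generated(TokenId t) { processed_.push_back(t); }

 private:
  KvBackend& backend_;
  std::vector<TokenId> processed_;
};
```

Recording decoded tokens matters for the "compute proportional to the new tokens" criterion: if the reply's token IDs are not recorded, the next turn's full context diverges at the start of the assistant reply and recomputes it. If the chat template re-tokenizes the reply differently, the prefix match treats that as an ordinary divergence.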

## Acceptance criteria

  - Sending the full context each turn yields the same output as the current incremental path, with compute proportional to the new tokens only.
  - Editing an earlier turn correctly invalidates and recomputes from the divergence point.
  - Works on both QNN and nntrainer backends (or the nntrainer gap is documented).
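The first two criteria can be checked with a toy harness that runs both contracts on the same conversation. `ToyModel`, `incremental_turn`, and `full_context_turn` are hypothetical test scaffolding, and "KV" is reduced to the list of tokens run; a real test would compare generated output and prefill token counts on the actual backends.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using TokenId = std::int32_t;

// Toy stand-in for a model with a KV cache: `kv` is the token sequence it
// has run, and `computed` counts tokens that went through a forward pass.
struct ToyModel {
  std::vector<TokenId> kv;
  std::size_t computed = 0;
  void run(const std::vector<TokenId>& tokens) {
    kv.insert(kv.end(), tokens.begin(), tokens.end());
    computed += tokens.size();
  }
  void truncate(std::size_t n) { kv.resize(n); }
};

// Current contract: the caller sends only the new suffix.
void incremental_turn(ToyModel& m, const std::vector<TokenId>& new_tokens) {
  m.run(new_tokens);
}

// Proposed contract: the caller sends the full context and the engine
// prefix-matches (this toy does not re-feed a token on a full match).
void full_context_turn(ToyModel& m, const std::vector<TokenId>& context) {
  std::size_t lcp = 0;
  while (lcp < m.kv.size() && lcp < context.size() &&
         m.kv[lcp] == context[lcp]) {
    ++lcp;
  }
  m.truncate(lcp);
  m.run(std::vector<TokenId>(
      context.begin() + static_cast<std::ptrdiff_t>(lcp), context.end()));
  assert(m.kv == context);
}
```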

## Risks / notes

  - Tokenizer-level prefix matching must be exact. Whitespace normalization or re-rendering the chat template can shift token boundaries, so match on token IDs, not raw strings.
  - Tight coupling with Issue 4 (KV truncation requires position rollback) and Issue 2 (per-session recorded length).
  - Defines a cleaner contract than the current "send only the new turn" approach and removes the chat-template-dependent extract_latest_* heuristics.
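The token-boundary risk can be illustrated with a toy greedy tokenizer over the vocabulary `{0: "a", 1: "b", 2: "ab"}`. This tokenizer and the helper names `toy_tokenize` and `naive_string_reuse` are hypothetical; real BPE/SentencePiece tokenizers are more complex but can likewise merge across an append boundary. The string "a" is a prefix of "ab", yet its tokens `{0}` are not a prefix of `{2}`. Reusing the cache on a string match and tokenizing only the suffix feeds `{0, 1}`, a sequence the tokenizer never produces for "ab"; a token-ID prefix match finds a common prefix of 0 and correctly recomputes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using TokenId = std::int32_t;

// Toy greedy longest-match tokenizer: 0 -> "a", 1 -> "b", 2 -> "ab".
std::vector<TokenId> toy_tokenize(const std::string& text) {
  std::vector<TokenId> out;
  std::size_t i = 0;
  while (i < text.size()) {
    if (text.compare(i, 2, "ab") == 0) {
      out.push_back(2);
      i += 2;
    } else if (text[i] == 'a') {
      out.push_back(0);
      ++i;
    } else if (text[i] == 'b') {
      out.push_back(1);
      ++i;
    } else {
      ++i;  // ignore characters outside the toy vocabulary
    }
  }
  return out;
}

// Naive string-level reuse: keep the cached tokens for `cached_text` and
// tokenize only the appended suffix. Shown to illustrate the bug.
std::vector<TokenId> naive_string_reuse(const std::string& cached_text,
                                        const std::string& full_text) {
  assert(full_text.compare(0, cached_text.size(), cached_text) == 0);
  std::vector<TokenId> out = toy_tokenize(cached_text);
  std::vector<TokenId> suffix =
      toy_tokenize(full_text.substr(cached_text.size()));
  out.insert(out.end(), suffix.begin(), suffix.end());
  return out;
}
```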
