[Feasibility] Unify the nntrainer KV-cache manager with the QNN KV-cache manager #33

## Background

  Quick.AI has two independent KV-cache managers:
  - causallm::KVCacheManager — nntrainer/Applications/CausalLM/kv_cache_manager.{h,cpp}. CPU-resident, backed by nntrainer::Tensor per layer (key_cache/value_cache), single cache for
  prefill+generation, position via cache_pos_, tensor-view access (getKeyCacheWriteView/ReadView).
  - causallm::QnnKvCacheManager — src/models/qnn/qnn_kv_cache_manager.{h,cpp}. Raw uint8_t* buffers in RPC/ION memory, separate prefill and generation caches synced per turn
  (syncGenerationToPrefill, .cpp:141-165), position via kv_len_, explicit QnnKvOutputBinding index mapping.
  
  This issue tracks whether unifying them is worthwhile, and proposes a path if so.

  Feasibility conclusion (medium-to-high difficulty: proceed with a thin shared abstraction, not a full merge)

  What can be shared cheaply (low risk):
  - Position/length tracking — both expose the same concept: advance(step), get/set length, reset(). (KVCacheManager::getPosition/setPosition/advance vs
  QnnKvCacheManager::length/setLength/advance.) This is a natural common interface. 
  - Multi-turn lifecycle semantics (advance on prefill+decode, reset on new session).

  What is genuinely hard (do NOT force into one impl):
  1. Memory model is fundamentally different — nntrainer::Tensor (CPU, view semantics) vs raw RPC/ION byte buffers (pointer arithmetic + memcpy). Unifying storage would require a
  buffer-abstraction adapter on both sides.
  2. QNN dual-cache (prefill vs generation) exists because the QNN prefill and generation graphs have different KV I/O shapes; nntrainer uses a single cache. A unified manager must either
  model both patterns or push changes into the graphs.
  3. Layout / row-length encoding — QNN stores data-type-aware raw bytes with explicit per-layer row_length; nntrainer uses Tensor stride semantics.
  4. Output append/binding — QNN uses explicit QnnKvOutputBinding index maps; nntrainer uses implicit shared-data tensor views.
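
  To make points 1 and 3 concrete, here is a minimal sketch of appending to a raw-byte cache. Every name in it (`RawLayerCache`, `appendRows`) is hypothetical and not taken from `qnn_kv_cache_manager.{h,cpp}`, and a `std::vector<uint8_t>` stands in for the RPC/ION buffer so the code runs on a host CPU. It only illustrates the byte-offset bookkeeping that Tensor strides hide on the nntrainer side.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for one layer of a raw-byte KV cache.
// Assumption: the real QNN buffer lives in RPC/ION memory; a vector is used
// here only so the sketch is self-contained.
struct RawLayerCache {
  std::vector<std::uint8_t> bytes; // capacity = max_tokens * row_length
  std::size_t row_length;          // bytes per cached token for this layer
  std::size_t max_tokens;          // maximum number of cached tokens

  RawLayerCache(std::size_t row_len, std::size_t max_tok)
      : bytes(row_len * max_tok, 0), row_length(row_len), max_tokens(max_tok) {}
};

// Copy n_tokens rows from src into the cache, starting at token index kv_len.
// Returns the new length in tokens. The destination offset is computed in
// bytes from the per-layer row_length.
inline std::size_t appendRows(RawLayerCache &cache, std::size_t kv_len,
                              const std::uint8_t *src, std::size_t n_tokens) {
  assert(kv_len + n_tokens <= cache.max_tokens);
  std::memcpy(cache.bytes.data() + kv_len * cache.row_length, src,
              n_tokens * cache.row_length);
  return kv_len + n_tokens;
}
```

  A Tensor-backed cache would express the same write as a view at a position, with the stride arithmetic handled by the Tensor type. That is why a shared storage layer would need an adapter on both sides.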

  Verdict: A full single-implementation merge is not worth it — the memory models diverge too much. A shared interface + per-backend implementations (position tracking, save/load contract,
  appendOutputs/reset semantics) is feasible and valuable, mainly for a consistent multi-session/multi-turn contract (ties into Issues 2 & 5). Persistence formats can stay separate.

 ## Proposed scope (if accepted)

  - [ ] Extract a minimal IKvCacheManager interface: advance, length/setLength, reset, save/load contract — implemented by both managers, no storage unification.
  - [ ] Align position semantics and naming (cache_pos_ vs kv_len_).
  - [ ] Defer/decline storage-buffer unification; document the RPC/ION vs Tensor boundary as intentional.
  - [ ] Note: nntrainer-side read_kv_len callback currently returns 0 / is not implemented (api/model_callbacks.h:28-33) — multi-turn reuse today is QNN-only.
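
  A rough sketch of what the thin interface could look like, assuming the method names listed above. The two stub classes (`TensorBackedStub`, `QnnBackedStub`) and the `runTurn` helper are hypothetical placeholders, not the real managers. The `save`/`load` signatures are also an assumption, and the stubs simply return `false`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical interface: lifecycle and position only, no storage access.
class IKvCacheManager {
public:
  virtual ~IKvCacheManager() = default;
  virtual std::size_t length() const = 0;
  virtual void setLength(std::size_t len) = 0;
  virtual void advance(std::size_t step) = 0;
  virtual void reset() = 0;
  // Persistence contract only; each backend keeps its own format.
  virtual bool save(const std::string &path) const = 0;
  virtual bool load(const std::string &path) = 0;
};

// Stub for the nntrainer side: the position lives in cache_pos_.
class TensorBackedStub final : public IKvCacheManager {
public:
  std::size_t length() const override { return cache_pos_; }
  void setLength(std::size_t len) override { cache_pos_ = len; }
  void advance(std::size_t step) override { cache_pos_ += step; }
  void reset() override { cache_pos_ = 0; }
  bool save(const std::string &) const override { return false; } // not sketched
  bool load(const std::string &) override { return false; }       // not sketched

private:
  std::size_t cache_pos_ = 0;
};

// Stub for the QNN side: the position lives in kv_len_.
class QnnBackedStub final : public IKvCacheManager {
public:
  std::size_t length() const override { return kv_len_; }
  void setLength(std::size_t len) override { kv_len_ = len; }
  void advance(std::size_t step) override { kv_len_ += step; }
  void reset() override { kv_len_ = 0; }
  bool save(const std::string &) const override { return false; } // not sketched
  bool load(const std::string &) override { return false; }       // not sketched

private:
  std::size_t kv_len_ = 0;
};

// Backend-agnostic turn bookkeeping: advance once for prefill, once for decode.
inline std::size_t runTurn(IKvCacheManager &kv, std::size_t prompt_tokens,
                           std::size_t generated_tokens) {
  kv.advance(prompt_tokens);
  kv.advance(generated_tokens);
  return kv.length();
}
```

  Session-level code written against `IKvCacheManager` would then behave the same on either backend, while append and binding logic stays on the concrete classes.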

 ## Risks / notes

  - Over-abstracting risks adding indirection to the QNN hot path (memcpy/append). Keep the interface thin.
  - Depends on the Issue 3 move (QNN context location) — coordinate.
