[Feasibility] Unify the nntrainer KV-cache manager with the QNN KV-cache manager #33

Description

@dlwlzzero

Background

Quick.AI has two independent KV-cache managers:

  • causallm::KVCacheManager — nntrainer/Applications/CausalLM/kv_cache_manager.{h,cpp}. CPU-resident, backed by nntrainer::Tensor per layer (key_cache/value_cache), single cache for
    prefill+generation, position via cache_pos_, tensor-view access (getKeyCacheWriteView/ReadView).
  • causallm::QnnKvCacheManager — src/models/qnn/qnn_kv_cache_manager.{h,cpp}. Raw uint8_t* buffers in RPC/ION memory, separate prefill and generation caches synced per turn
    (syncGenerationToPrefill, .cpp:141-165), position via kv_len_, explicit QnnKvOutputBinding index mapping.

This issue tracks whether unifying them is worthwhile, and proposes a path if so.

Feasibility conclusion (medium-to-high difficulty: proceed with a thin shared abstraction, not a full merge)

What can be shared cheaply (low risk):

  • Position/length tracking — both expose the same concept: advance(step), get/set length, reset(). (KVCacheManager::getPosition/setPosition/advance vs
    QnnKvCacheManager::length/setLength/advance.) This is a natural common interface.
  • Multi-turn lifecycle semantics (advance on prefill+decode, reset on new session).
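The shared position/length concept could look roughly like the sketch below. KvPosition and runTurn are hypothetical names invented for illustration, not code from either manager, and the overflow policy (refuse and leave the position unchanged) is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of either existing manager): tracks how many
// token rows of a fixed-capacity KV cache are currently valid.
class KvPosition {
public:
  explicit KvPosition(std::size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  // Move the position forward by `step` tokens.
  // Returns false and changes nothing if the cache would overflow.
  bool advance(std::size_t step) {
    if (step > capacity_ - length_) return false;
    length_ += step;
    return true;
  }

  std::size_t length() const { return length_; }

  // Restore a known position, e.g. after loading a saved cache.
  bool setLength(std::size_t length) {
    if (length > capacity_) return false;
    length_ = length;
    return true;
  }

  // New session: every cached row is logically discarded.
  void reset() { length_ = 0; }

private:
  std::size_t capacity_;
  std::size_t length_ = 0;
};

// One turn: advance once for the whole prompt (prefill), then once per
// generated token (decode). Stops at the first step that does not fit.
inline bool runTurn(KvPosition &pos, std::size_t prompt_tokens,
                    std::size_t decode_tokens) {
  if (!pos.advance(prompt_tokens)) return false;
  for (std::size_t i = 0; i < decode_tokens; ++i) {
    if (!pos.advance(1)) return false;
  }
  return true;
}
```

With a capacity of 16, two turns of (5 prompt + 3 decode) and (4 prompt + 2 decode) leave the position at 14; reset() returns it to 0 for the next session.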

What is genuinely hard (do NOT force into one impl):

  1. Memory model is fundamentally different — nntrainer::Tensor (CPU, view semantics) vs raw RPC/ION byte buffers (pointer arithmetic + memcpy). Unifying storage would require a
    buffer-abstraction adapter on both sides.
  2. QNN dual-cache (prefill vs generation) exists because the QNN prefill and generation graphs have different KV I/O shapes; nntrainer uses a single cache. A unified manager must either
    model both patterns or push changes into the graphs.
  3. Layout / row-length encoding — QNN stores data-type-aware raw bytes with explicit per-layer row_length; nntrainer uses Tensor stride semantics.
  4. Output append/binding — QNN uses explicit QnnKvOutputBinding index maps; nntrainer uses implicit shared-data tensor views.
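To make points 1 to 3 concrete, here is a rough sketch of what raw-byte row addressing and a generation-to-prefill copy involve. RawLayerCache, appendRow and syncRows are hypothetical, std::vector stands in for RPC/ION memory so the example runs anywhere, and the sketch assumes both caches use the same row_length, which may not hold for the real graphs. It is not the logic of syncGenerationToPrefill.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for one layer's raw KV buffer. A real QNN buffer
// would live in RPC/ION memory; std::vector is used only for illustration.
struct RawLayerCache {
  std::vector<std::uint8_t> bytes;
  std::size_t row_length; // bytes per cached token row (elements * dtype size)
  std::size_t max_rows;   // capacity in token rows

  RawLayerCache(std::size_t row_len, std::size_t rows)
      : bytes(row_len * rows), row_length(row_len), max_rows(rows) {
    assert(row_len > 0);
  }
};

// Write one token row at index kv_len using explicit pointer arithmetic.
inline bool appendRow(RawLayerCache &cache, std::size_t kv_len,
                      const std::uint8_t *src) {
  if (kv_len >= cache.max_rows) return false;
  std::memcpy(cache.bytes.data() + kv_len * cache.row_length, src,
              cache.row_length);
  return true;
}

// Copy rows [first, first + count) from the generation cache into the same
// row indices of the prefill cache. Assumes identical row_length on both.
inline bool syncRows(const RawLayerCache &gen, RawLayerCache &prefill,
                     std::size_t first, std::size_t count) {
  if (gen.row_length != prefill.row_length) return false;
  if (first > gen.max_rows || count > gen.max_rows - first) return false;
  if (first > prefill.max_rows || count > prefill.max_rows - first) return false;
  const std::size_t offset = first * gen.row_length;
  std::memcpy(prefill.bytes.data() + offset, gen.bytes.data() + offset,
              count * gen.row_length);
  return true;
}
```

None of this maps onto Tensor views: the offsets, bounds checks and copies are all explicit here, which is why storage unification would need an adapter on both sides.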

Verdict: A full single-implementation merge is not worth it — the memory models diverge too much. A shared interface + per-backend implementations (position tracking, save/load contract,
appendOutputs/reset semantics) is feasible and valuable, mainly for a consistent multi-session/multi-turn contract (ties into Issues 2 & 5). Persistence formats can stay separate.

Proposed scope (if accepted)

  • Extract a minimal IKvCacheManager interface: advance, length/setLength, reset, save/load contract — implemented by both managers, no storage unification.
  • Align position semantics and naming (cache_pos_ vs kv_len_).
  • Defer/decline storage-buffer unification; document the RPC/ION vs Tensor boundary as intentional.
  • Note: nntrainer-side read_kv_len callback currently returns 0 / is not implemented (api/model_callbacks.h:28-33) — multi-turn reuse today is QNN-only.
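A minimal sketch of the proposed interface follows, assuming stream-based save/load so that each backend keeps its own persistence format. The method signatures, ToyKvCacheManager and resumeOrReset are illustrative assumptions, not existing Quick.AI code; the real managers would implement the interface on top of their own storage.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream> // only needed by callers that use string streams

// Hypothetical thin interface: position tracking plus a save/load contract.
// Storage (Tensor vs raw RPC/ION buffers) stays inside each backend.
class IKvCacheManager {
public:
  virtual ~IKvCacheManager() = default;
  virtual void advance(std::size_t step) = 0;
  virtual std::size_t length() const = 0;
  virtual void setLength(std::size_t length) = 0;
  virtual void reset() = 0;
  virtual bool save(std::ostream &out) const = 0;
  virtual bool load(std::istream &in) = 0;
};

// Toy backend that persists only the length, to show the contract in use.
class ToyKvCacheManager final : public IKvCacheManager {
public:
  explicit ToyKvCacheManager(std::size_t capacity) : capacity_(capacity) {}

  void advance(std::size_t step) override {
    assert(step <= capacity_ - length_);
    length_ += step;
  }
  std::size_t length() const override { return length_; }
  void setLength(std::size_t length) override {
    assert(length <= capacity_);
    length_ = length;
  }
  void reset() override { length_ = 0; }

  bool save(std::ostream &out) const override {
    out << length_ << '\n';
    return out.good();
  }
  // Rejects unreadable input and lengths that exceed this cache's capacity.
  bool load(std::istream &in) override {
    std::size_t n = 0;
    if (!(in >> n) || n > capacity_) return false;
    length_ = n;
    return true;
  }

private:
  std::size_t capacity_;
  std::size_t length_ = 0;
};

// Session code depends only on the interface: resume from saved state if
// possible, otherwise start a fresh session.
inline bool resumeOrReset(IKvCacheManager &cache, std::istream &saved) {
  if (cache.load(saved)) return true;
  cache.reset();
  return false;
}
```

Only session-level calls go through the virtual interface in this sketch; per-token append and memcpy paths would stay on the concrete backend types.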

Risks / notes

  • Over-abstracting risks adding indirection to the QNN hot path (memcpy/append). Keep the interface thin.
  • Depends on the Issue 3 move (QNN context location) — coordinate.
