Background
Quick.AI has two independent KV-cache managers:
- causallm::KVCacheManager — nntrainer/Applications/CausalLM/kv_cache_manager.{h,cpp}. CPU-resident, backed by nntrainer::Tensor per layer (key_cache/value_cache), single cache for
prefill+generation, position via cache_pos_, tensor-view access (getKeyCacheWriteView/ReadView).
- causallm::QnnKvCacheManager — src/models/qnn/qnn_kv_cache_manager.{h,cpp}. Raw uint8_t* buffers in RPC/ION memory, separate prefill and generation caches synced per turn
(syncGenerationToPrefill, .cpp:141-165), position via kv_len_, explicit QnnKvOutputBinding index mapping.
This issue tracks whether unifying them is worthwhile, and proposes a path if so.
Feasibility conclusion (medium-to-high difficulty: proceed with a thin shared abstraction, not a full merge)
What can be shared cheaply (low risk):
- Position/length tracking — both expose the same concept: advance(step), get/set length, reset(). (KVCacheManager::getPosition/setPosition/advance vs
QnnKvCacheManager::length/setLength/advance.) This is a natural common interface.
- Multi-turn lifecycle semantics (advance on prefill+decode, reset on new session).
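The shared position/length concept could be expressed as a small interface. The following is a minimal sketch under stated assumptions, not the project's actual API: KvPositionTracker, CpuPositionAdapter, QnnPositionAdapter, and runTurn are hypothetical names, and each adapter holds a plain counter standing in for cache_pos_ or kv_len_ rather than wrapping the real manager.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical common interface for position/length tracking.
class KvPositionTracker {
public:
  virtual ~KvPositionTracker() = default;
  virtual std::size_t length() const = 0;
  virtual void setLength(std::size_t len) = 0;
  virtual void advance(std::size_t step) = 0;
  virtual void reset() = 0;
};

// Stand-in for the CPU manager's cache_pos_ (assumption: a plain counter).
class CpuPositionAdapter final : public KvPositionTracker {
public:
  std::size_t length() const override { return cache_pos_; }
  void setLength(std::size_t len) override { cache_pos_ = len; }
  void advance(std::size_t step) override { cache_pos_ += step; }
  void reset() override { cache_pos_ = 0; }

private:
  std::size_t cache_pos_ = 0;
};

// Stand-in for the QNN manager's kv_len_ (assumption: a plain counter).
class QnnPositionAdapter final : public KvPositionTracker {
public:
  std::size_t length() const override { return kv_len_; }
  void setLength(std::size_t len) override { kv_len_ = len; }
  void advance(std::size_t step) override { kv_len_ += step; }
  void reset() override { kv_len_ = 0; }

private:
  std::size_t kv_len_ = 0;
};

// Backend-agnostic turn: advance once for prefill, then once per decoded token.
inline std::size_t runTurn(KvPositionTracker &tracker,
                           std::size_t prompt_tokens,
                           std::size_t generated_tokens) {
  tracker.advance(prompt_tokens);
  for (std::size_t i = 0; i < generated_tokens; ++i) {
    tracker.advance(1);
  }
  return tracker.length();
}

// Two turns in one session, then a reset for a new session.
inline void demoMultiTurn(KvPositionTracker &tracker) {
  tracker.reset();
  assert(runTurn(tracker, 5, 3) == 8);
  assert(runTurn(tracker, 2, 1) == 11);
  tracker.reset();
  assert(tracker.length() == 0);
}
```

The same demoMultiTurn call works against either adapter, which is the point of sharing only this part: the lifecycle contract is checked once, while storage stays backend-specific.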
What is genuinely hard (do NOT force into one impl):
- Memory model is fundamentally different — nntrainer::Tensor (CPU, view semantics) vs raw RPC/ION byte buffers (pointer arithmetic + memcpy). Unifying storage would require a
buffer-abstraction adapter on both sides.
- QNN dual-cache (prefill vs generation) exists because the QNN prefill and generation graphs have different KV I/O shapes; nntrainer uses a single cache. A unified manager must either
model both patterns or push changes into the graphs.
- Layout / row-length encoding — QNN stores data-type-aware raw bytes with explicit per-layer row_length; nntrainer uses Tensor stride semantics.
- Output append/binding — QNN uses explicit QnnKvOutputBinding index maps; nntrainer uses implicit shared-data tensor views.
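To make the layout difference concrete, the sketch below contrasts byte-offset addressing with an explicit per-layer row length against element-stride addressing. It is an illustration only: RawLayerBuffer and StridedView are hypothetical types, a std::vector stands in for RPC/ION memory, and neither mirrors the real QnnKvCacheManager or nntrainer::Tensor.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Raw-byte layer: the caller must know the row length in bytes,
// which depends on the element data type (assumption for illustration).
struct RawLayerBuffer {
  std::vector<std::uint8_t> bytes; // stand-in for an RPC/ION allocation
  std::size_t row_length;          // bytes per cached position

  RawLayerBuffer(std::size_t max_rows, std::size_t row_len)
      : bytes(max_rows * row_len), row_length(row_len) {}

  // Append = pointer arithmetic + memcpy at a byte offset.
  void appendRow(std::size_t row, const void *src) {
    std::memcpy(bytes.data() + row * row_length, src, row_length);
  }
};

// Typed strided view: addressing is in elements, not bytes,
// and the view shares the underlying storage instead of copying.
struct StridedView {
  float *data;
  std::size_t row_stride; // elements per row

  float &at(std::size_t row, std::size_t col) {
    return data[row * row_stride + col];
  }
};
```

A unified storage layer would need an adapter that translates between these two addressing schemes on every access, which is the cost the list above warns about.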
Verdict: A full single-implementation merge is not worth it — the memory models diverge too much. A shared interface + per-backend implementations (position tracking, save/load contract,
appendOutputs/reset semantics) is feasible and valuable, mainly for a consistent multi-session/multi-turn contract (ties into Issues 2 & 5). Persistence formats can stay separate.
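One way to read "shared contract, separate persistence formats" is sketched here. Everything in it is an assumption made for illustration: KvCacheSession, TextFormatSession, BinaryFormatSession, and roundTrip are hypothetical, and the two toy formats persist only the length, not any cache data.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical shared contract: lifecycle plus save/load, no storage details.
class KvCacheSession {
public:
  virtual ~KvCacheSession() = default;
  virtual std::size_t length() const = 0;
  virtual void advance(std::size_t step) = 0;
  virtual void reset() = 0;
  virtual void save(std::ostream &out) const = 0;
  virtual bool load(std::istream &in) = 0;
};

// Toy backend with a text format: "A <length>".
class TextFormatSession final : public KvCacheSession {
public:
  std::size_t length() const override { return len_; }
  void advance(std::size_t step) override { len_ += step; }
  void reset() override { len_ = 0; }
  void save(std::ostream &out) const override { out << "A " << len_; }
  bool load(std::istream &in) override {
    std::string tag;
    std::size_t len = 0;
    if (!(in >> tag >> len) || tag != "A") return false;
    len_ = len;
    return true;
  }

private:
  std::size_t len_ = 0;
};

// Toy backend with a binary format: one raw 64-bit length.
class BinaryFormatSession final : public KvCacheSession {
public:
  std::size_t length() const override { return len_; }
  void advance(std::size_t step) override { len_ += step; }
  void reset() override { len_ = 0; }
  void save(std::ostream &out) const override {
    const std::uint64_t len = len_;
    out.write(reinterpret_cast<const char *>(&len), sizeof(len));
  }
  bool load(std::istream &in) override {
    std::uint64_t len = 0;
    if (!in.read(reinterpret_cast<char *>(&len), sizeof(len))) return false;
    len_ = static_cast<std::size_t>(len);
    return true;
  }

private:
  std::size_t len_ = 0;
};

// The contract only promises that a backend can restore its own saved state.
inline bool roundTrip(const KvCacheSession &src, KvCacheSession &dst) {
  std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
  src.save(buf);
  return dst.load(buf);
}
```

The interface says nothing about the on-disk layout, so each backend keeps its own format while callers rely on one save/load/reset contract across sessions.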
Proposed scope (if accepted)
Risks / notes
- Over-abstracting risks adding indirection to the QNN hot path (memcpy/append). Keep the interface thin.
- Depends on the Issue 3 move (QNN context location) — coordinate.