docs: add ADR for response token cache #65
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
@franciscojavierarceo thank you for the comprehensive ADR. I wanted to clarify my understanding of the overall flow proposed.
Turn 1: no cache yet, full path
Turn 2: cache path
Currently we cannot interface with this design; we would need your feature branch from the vLLM fork to land in vLLM upstream. Are you planning to open a PR on vLLM soon? I was also wondering how the benchmarking figure mentioned in this ADR was generated, like prototyping with
@maralbahari the overall shape is close, with two important corrections.

first, i would not want production turn 2 to do a fresh full render on every request. that is useful for prototype validation, CI, canary sampling, and fallback/reseed, but if it happens on every hot-path request we give back much of the render/tokenize win. the production path should validate the stored prefix metadata/handle, then render only the marginal suffix through a renderer-owned incremental path or a stored safe-boundary segment.

second, for agentic-api storage, the hot path should store compact prefix metadata: prefix hash, token count, model/tokenizer/renderer/template fingerprints, safe-boundary proof, and eventually router-visible block hashes for llm-d. raw token ids are useful as an optional cold-path diagnostic/reseed artifact, but i do not think we should require reading/writing full token arrays synchronously every turn. same for marginal input ids: they are execution material for vLLM, but not something the DB hot path should need to persist as raw ids unless we explicitly choose to keep a debug/checkpoint span.

on vLLM: yes, this design depends on a vLLM primitive that is not upstream today. the benchmark used my fork/branch with the

on the figure: it was generated by a standalone benchmark harness against the forked vLLM server on the DGX with GPT-OSS-20B and APC enabled. it was not llm-d, not OGX rehydration, and not simply upstream prefix caching on/off. the key comparison was full-history streaming request versus minimal prompt-cache-ref replay. the important result was that APC was already hitting, so the measured win was mostly reduced prompt reconstruction / render-tokenize / request-size overhead, not recovering a missed KV prefill cache.

i updated the ADR to make that provenance and storage split clearer.
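To make the storage split above concrete, here is a minimal Python sketch of a compact prefix record and the hot-path check it would enable. Every name in it (`PrefixRecord`, `make_record`, `is_reusable`, the fingerprint strings) is hypothetical and not taken from the ADR or from vLLM, and the "safe-boundary proof" is reduced to a boolean purely for illustration.

```python
import hashlib
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefixRecord:
    """Hypothetical compact metadata stored per turn instead of raw token ids."""
    prefix_hash: str
    token_count: int
    model_fp: str
    tokenizer_fp: str
    renderer_fp: str
    template_fp: str
    safe_boundary: bool  # stand-in for a real safe-boundary proof


def hash_token_ids(token_ids):
    """Hash token ids as little-endian uint32 values with SHA-256."""
    h = hashlib.sha256()
    for t in token_ids:
        h.update(struct.pack("<I", t))
    return h.hexdigest()


def make_record(token_ids, *, model_fp, tokenizer_fp, renderer_fp,
                template_fp, safe_boundary):
    """Build the record once, at the point where the full render is available."""
    return PrefixRecord(
        prefix_hash=hash_token_ids(token_ids),
        token_count=len(token_ids),
        model_fp=model_fp,
        tokenizer_fp=tokenizer_fp,
        renderer_fp=renderer_fp,
        template_fp=template_fp,
        safe_boundary=safe_boundary,
    )


def is_reusable(record, *, model_fp, tokenizer_fp, renderer_fp, template_fp):
    """Hot-path check: compare fingerprints only, never read a token array."""
    current = (model_fp, tokenizer_fp, renderer_fp, template_fp)
    stored = (record.model_fp, record.tokenizer_fp,
              record.renderer_fp, record.template_fp)
    return record.safe_boundary and stored == current


fps = dict(model_fp="m1", tokenizer_fp="t1", renderer_fp="r1", template_fp="c1")
record = make_record([11, 22, 33], safe_boundary=True, **fps)
same_env = is_reusable(record, **fps)
changed_template = is_reusable(record, **{**fps, "template_fp": "c2"})
```

In this sketch a changed chat template invalidates the record without touching any stored token ids, which is the property the hot path is meant to have; the fallback in that case would be the full render/reseed path described above.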
Something important to point out here is that in vLLM, subsequent turns of a conversation are not strictly append-only, where new token ids are simply added at the end of an existing list of token ids. Reasoning handling is a good example: depending on where we are in the turns, we conditionally drop some older reasoning that would distract the model from its current task. Other examples include handling of multiple system messages for specific models, where some models can handle system messages that arrive in later turns and some cannot, or toggling thinking on or off in subsequent turns, changing reasoning effort between turns, etc.

This is not an exhaustive list, and these are model-specific things that may or may not invalidate our prefix caching for any given request. The point is, vLLM is the thing that knows how to properly turn a request into token ids and has model-dependent logic to do that. In many happy-path cases that can look append-only, but in the real world it often is not: for many popular models, every user turn backtracks across at least one turn boundary because of how reasoning trimming is done.

Basically, the full vLLM render pipeline has to run for a given request to get the correct token ids for that turn's input to the model. There may be some things that could be cached internally, in vLLM, for happy-path situations specific to each model that are append-only for subsequent turns. But any logic to do that would need to live in vLLM itself, right next to all the logic that decides which inputs to keep, drop, munge, or otherwise change as it constructs the next set of input tokens.
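A toy illustration of the non-append-only case, as a hedged Python sketch: the token ids below are made up, and `common_prefix_len` / `check_append_only` are hypothetical helpers, not vLLM functions. It only shows how a cached turn-1 token list stops being a prefix of the turn-2 render once older reasoning tokens are trimmed.

```python
def common_prefix_len(cached, rendered):
    """Number of leading token ids shared by the two sequences."""
    n = 0
    for a, b in zip(cached, rendered):
        if a != b:
            break
        n += 1
    return n


def check_append_only(cached, rendered):
    """Return (shared_len, append_only).

    append_only is True only when the whole cached sequence is a prefix
    of the freshly rendered sequence.
    """
    shared = common_prefix_len(cached, rendered)
    return shared, shared == len(cached)


# Made-up ids: 1-3 = prompt, 90-91 = reasoning, 4 = answer, 5-6 = next user turn.
turn1_tokens = [1, 2, 3, 90, 91, 4]

# Happy path: the next render keeps everything and appends.
appended = turn1_tokens + [5, 6]

# Reasoning trimming: the older reasoning (90, 91) is dropped on re-render.
trimmed = [1, 2, 3, 4, 5, 6]

happy = check_append_only(turn1_tokens, appended)
backtracked = check_append_only(turn1_tokens, trimmed)
```

Here `happy` reports that all six cached ids are reusable, while `backtracked` reports that only the first three are, so the valid prefix ends before the previous turn boundary. Deciding where that point falls for a real model is the model-specific logic that lives in vLLM's render pipeline.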
Summary
prompt_cache_ref + append_token_ids benchmark conclusions in the ADR without including prototype code or raw CSV benchmark artifacts.
Test Plan
uv run --with-requirements docs/requirements.txt mkdocs build --strict
Notes: MkDocs build passes with the existing Material-for-MkDocs warning and existing unnaved-page warnings for ADR/design pages.