Skip to content

docs: add ADR for response token cache#65

Draft
franciscojavierarceo wants to merge 7 commits into
mainfrom
feat/adr04-token-cache-phase2
Draft

docs: add ADR for response token cache#65
franciscojavierarceo wants to merge 7 commits into
mainfrom
feat/adr04-token-cache-phase2

Conversation

@franciscojavierarceo

@franciscojavierarceo franciscojavierarceo commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Add ADR-04 for cached rendered token IDs in Responses/Conversation continuation.
  • Capture the measured vLLM prompt_cache_ref + append_token_ids benchmark conclusions in the ADR without including prototype code or raw CSV benchmark artifacts.
  • Include one summarized SVG graph for the dense prefix-handle TTFT comparison, with 10 prompt-token points and 95% confidence intervals.
  • Document how this should fit a future llm-d deployment: DB token spans provide deterministic prefix identity and fallback material, while llm-d precise-prefix routing/tiered KV cache owns fleet-level KV locality.
  • Add the ADR to the MkDocs navigation.

Test Plan

  • uv run --with-requirements docs/requirements.txt mkdocs build --strict

Notes: MkDocs build passes with the existing Material-for-MkDocs warning and existing unnaved-page warnings for ADR/design pages.

@franciscojavierarceo franciscojavierarceo force-pushed the feat/adr04-token-cache-phase2 branch from 6f7afdd to 315b730 Compare June 18, 2026 14:29
@franciscojavierarceo franciscojavierarceo changed the title feat: prototype response token cache replay docs: add ADR for response token cache Jun 18, 2026
@franciscojavierarceo franciscojavierarceo force-pushed the feat/adr04-token-cache-phase2 branch 5 times, most recently from fcb0254 to e2483a1 Compare June 18, 2026 16:09
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
@franciscojavierarceo franciscojavierarceo force-pushed the feat/adr04-token-cache-phase2 branch from e2483a1 to 02ce0eb Compare June 18, 2026 17:04
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
@maralbahari

Copy link
Copy Markdown
Collaborator

@franciscojavierarceo thank you for the comprehensive ADR. I wanted to clarify my understanding of the overall follow proposed.

Turn 1: no cache yet full path

  1. rehydrate items from DB (ordered)
  2. send full prompt to vLLM
  3. vLLM renders + tokenizes → generates output
  4. vLLM returns response + prompt_token_ids = [t0...t_N]
  5. agentic-api stores:
    • response items in DB
    • prefix_hash = sha256([t0...t_N])
    • prefix_token_count = N
    • tokenizer_fingerprint, renderer_fingerprint, template_fingerprint
      there would not be any raw token ids stored in agentic-api storage right?

Turn 2: cache path

  1. rehydrate items from DB (ordered)
  2. fresh full render locally → [t0, t1, ... t_N, t_N+1, t_N+2]
  3. strict-prefix check:
    sha256(fresh_render[:N]) == stored prefix_hash?
    → YES: proceed with replay
    → NO: fall back to full prompt (back to turn 1 path)
  4. marginal suffix = fresh_render[N:] = [t_N+1, t_N+2]
  5. send to vLLM:
    {
    prompt_cache_ref: {
    handle: "vllm_prefix_...",
    prefix_hash: "sha256:...",
    prefix_token_count: N,
    tokenizer_fingerprint: "...",
    renderer_fingerprint: "..."
    },
    append_token_ids: [t_N+1, t_N+2]
    }
  6. vLLM validates handle, skips render+tokenize for prefix
  7. APC kicks in, skips prefill for cached blocks
  8. only append_token_ids go through prefill
  9. decode starts
  10. agentic-api stores:
    • new response items in DB
    • prefix_hash = sha256(prompt_token_ids returned by vLLM)
    • prefix_token_count = N + len(append_token_ids) + len(output_token_ids)
      here there would not be raw token_id storage by agentic-api for the marginal input either right we always store the hash?

currently we cannot interface this design we would need your feature branch from vllm fork to land into vllm upstream. are you planning to open PR soon on vllm?

I was also wondering the benchmarking figure mentioned in this ADR how is it generated? like prototyping with llm-d ogx rehydration or using ur forked vllm feature branch with agentic-api ? or is it just simple benchmark result on vllm upstream with prefix caching enabled vs disabled?

Copy link
Copy Markdown
Collaborator Author

@maralbahari the overall shape is close, with two important corrections.

first, i would not want production turn 2 to do a fresh full render on every request. that is useful for prototype validation, CI, canary sampling, and fallback/reseed, but if it happens on every hot-path request we give back much of the render/tokenize win. the production path should validate the stored prefix metadata/handle, then render only the marginal suffix through a renderer-owned incremental path or a stored safe-boundary segment.

second, for agentic-api storage, the hot path should store compact prefix metadata: prefix hash, token count, model/tokenizer/renderer/template fingerprints, safe-boundary proof, and eventually router-visible block hashes for llm-d. raw token ids are useful as an optional cold-path diagnostic/reseed artifact, but i do not think we should require reading/writing full token arrays synchronously every turn. same for marginal input ids: they are execution material for vLLM, but not something the DB hot path should need to persist as raw ids unless we explicitly choose to keep a debug/checkpoint span.

on vLLM: yes, this design depends on a vLLM primitive that is not upstream today. the benchmark used my fork/branch with the prompt_cache_ref + append_token_ids prototype, not stock upstream vLLM. i think the vLLM work should probably land as one or more focused PRs: diagnostic token-id visibility first, then the replay primitive/handle fallback path.

on the figure: it was generated by a standalone benchmark harness against the forked vLLM server on the DGX with GPT-OSS-20B and APC enabled. it was not llm-d, not OGX rehydration, and not simply upstream prefix caching on/off. the key comparison was full-history streaming request versus minimal prompt-cache-ref replay. the important result was that APC was already hitting, so the measured win was mostly reduced prompt reconstruction / render-tokenize / request-size overhead, not recovering a missed KV prefill cache. i updated the ADR to make that provenance and storage split clearer.

@bbrowning

Copy link
Copy Markdown
Collaborator

Something important to point out here is that in vLLM, subsequent turns of a conversation are not a strict append-only situation by adding new token ids at the end of an existing list of token ids. Reasoning handling is a good example, where depending on where we are in the turns we conditionally drop some older reasoning that would distract the model from its current task at hand. Other examples are things like handling of multiple system messages for specific models, where some models can handle system messages that arrive in later turns and some can not. Or toggling thinking on or off in subsequent turns, changing reasoning effort between turns, etc. This is not an exhaustive list, and these are model-specific things that may or may not invalidate our prefix caching for any given request.

The point is, vLLM is the thing that knows how to properly turn a request into token ids and has model-dependent logic to do that. In many happy path cases that can look append-only, but in the real world it often is not with backtracking across at least one turn boundary for many popular models for every user turn due to how reasoning trimming is done.

Basically, the full vLLM render pipeline has to run for a given request to get the correct token ids for that turn's input to the model. There may be some things that could be cached internally, in vLLM, for happy path situations specific to each model that are append-only for subsequent turns. But, any logic to do that would need to live in vLLM itself right next to all the logic that decides which inputs to keep, drop, munge, or otherwise change as it's constructing the next set of input tokens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants