docs: add ADR for response token cache by franciscojavierarceo · Pull Request #65 · vllm-project/agentic-api

franciscojavierarceo · 2026-06-18T14:18:31Z

Summary

Add ADR-04 for cached rendered token IDs in Responses/Conversation continuation.
Capture the measured vLLM prompt_cache_ref + append_token_ids benchmark conclusions in the ADR without including prototype code or raw CSV benchmark artifacts.
Include one summarized SVG graph for the dense prefix-handle TTFT comparison, with 10 prompt-token points and 95% confidence intervals.
Document how this should fit a future llm-d deployment: DB token spans provide deterministic prefix identity and fallback material, while llm-d precise-prefix routing/tiered KV cache owns fleet-level KV locality.
Add the ADR to the MkDocs navigation.

Test Plan

uv run --with-requirements docs/requirements.txt mkdocs build --strict

Notes: MkDocs build passes with the existing Material-for-MkDocs warning and existing unnaved-page warnings for ADR/design pages.

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

maralbahari · 2026-06-22T10:44:07Z

@franciscojavierarceo thank you for the comprehensive ADR. I wanted to clarify my understanding of the overall follow proposed.

Turn 1: no cache yet full path

rehydrate items from DB (ordered)
send full prompt to vLLM
vLLM renders + tokenizes → generates output
vLLM returns response + prompt_token_ids = [t0...t_N]
agentic-api stores:
- response items in DB
- prefix_hash = sha256([t0...t_N])
- prefix_token_count = N
- tokenizer_fingerprint, renderer_fingerprint, template_fingerprint
  there would not be any raw token ids stored in agentic-api storage right?

Turn 2: cache path

rehydrate items from DB (ordered)
fresh full render locally → [t0, t1, ... t_N, t_N+1, t_N+2]
strict-prefix check:
sha256(fresh_render[:N]) == stored prefix_hash?
→ YES: proceed with replay
→ NO: fall back to full prompt (back to turn 1 path)
marginal suffix = fresh_render[N:] = [t_N+1, t_N+2]
send to vLLM:
{
prompt_cache_ref: {
handle: "vllm_prefix_...",
prefix_hash: "sha256:...",
prefix_token_count: N,
tokenizer_fingerprint: "...",
renderer_fingerprint: "..."
},
append_token_ids: [t_N+1, t_N+2]
}
vLLM validates handle, skips render+tokenize for prefix
APC kicks in, skips prefill for cached blocks
only append_token_ids go through prefill
decode starts
agentic-api stores:
- new response items in DB
- prefix_hash = sha256(prompt_token_ids returned by vLLM)
- prefix_token_count = N + len(append_token_ids) + len(output_token_ids)
  here there would not be raw token_id storage by agentic-api for the marginal input either right we always store the hash?

currently we cannot interface this design we would need your feature branch from vllm fork to land into vllm upstream. are you planning to open PR soon on vllm?

I was also wondering the benchmarking figure mentioned in this ADR how is it generated? like prototyping with llm-d ogx rehydration or using ur forked vllm feature branch with agentic-api ? or is it just simple benchmark result on vllm upstream with prefix caching enabled vs disabled?

franciscojavierarceo · 2026-06-22T20:32:01Z

@maralbahari the overall shape is close, with two important corrections.

first, i would not want production turn 2 to do a fresh full render on every request. that is useful for prototype validation, CI, canary sampling, and fallback/reseed, but if it happens on every hot-path request we give back much of the render/tokenize win. the production path should validate the stored prefix metadata/handle, then render only the marginal suffix through a renderer-owned incremental path or a stored safe-boundary segment.

second, for agentic-api storage, the hot path should store compact prefix metadata: prefix hash, token count, model/tokenizer/renderer/template fingerprints, safe-boundary proof, and eventually router-visible block hashes for llm-d. raw token ids are useful as an optional cold-path diagnostic/reseed artifact, but i do not think we should require reading/writing full token arrays synchronously every turn. same for marginal input ids: they are execution material for vLLM, but not something the DB hot path should need to persist as raw ids unless we explicitly choose to keep a debug/checkpoint span.

on vLLM: yes, this design depends on a vLLM primitive that is not upstream today. the benchmark used my fork/branch with the prompt_cache_ref + append_token_ids prototype, not stock upstream vLLM. i think the vLLM work should probably land as one or more focused PRs: diagnostic token-id visibility first, then the replay primitive/handle fallback path.

on the figure: it was generated by a standalone benchmark harness against the forked vLLM server on the DGX with GPT-OSS-20B and APC enabled. it was not llm-d, not OGX rehydration, and not simply upstream prefix caching on/off. the key comparison was full-history streaming request versus minimal prompt-cache-ref replay. the important result was that APC was already hitting, so the measured win was mostly reduced prompt reconstruction / render-tokenize / request-size overhead, not recovering a missed KV prefill cache. i updated the ADR to make that provenance and storage split clearer.

bbrowning · 2026-06-23T12:52:35Z

Something important to point out here is that in vLLM, subsequent turns of a conversation are not a strict append-only situation by adding new token ids at the end of an existing list of token ids. Reasoning handling is a good example, where depending on where we are in the turns we conditionally drop some older reasoning that would distract the model from its current task at hand. Other examples are things like handling of multiple system messages for specific models, where some models can handle system messages that arrive in later turns and some can not. Or toggling thinking on or off in subsequent turns, changing reasoning effort between turns, etc. This is not an exhaustive list, and these are model-specific things that may or may not invalidate our prefix caching for any given request.

The point is, vLLM is the thing that knows how to properly turn a request into token ids and has model-dependent logic to do that. In many happy path cases that can look append-only, but in the real world it often is not with backtracking across at least one turn boundary for many popular models for every user turn due to how reasoning trimming is done.

Basically, the full vLLM render pipeline has to run for a given request to get the correct token ids for that turn's input to the model. There may be some things that could be cached internally, in vLLM, for happy path situations specific to each model that are append-only for subsequent turns. But, any logic to do that would need to live in vLLM itself right next to all the logic that decides which inputs to keep, drop, munge, or otherwise change as it's constructing the next set of input tokens.

franciscojavierarceo force-pushed the feat/adr04-token-cache-phase2 branch from 6f7afdd to 315b730 Compare June 18, 2026 14:29

franciscojavierarceo changed the title ~~feat: prototype response token cache replay~~ docs: add ADR for response token cache Jun 18, 2026

franciscojavierarceo force-pushed the feat/adr04-token-cache-phase2 branch 5 times, most recently from fcb0254 to e2483a1 Compare June 18, 2026 16:09

docs: add ADR for response token cache

02ce0eb

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

franciscojavierarceo force-pushed the feat/adr04-token-cache-phase2 branch from e2483a1 to 02ce0eb Compare June 18, 2026 17:04

franciscojavierarceo added 6 commits June 18, 2026 21:50

docs: add codex websocket requirement to token cache ADR

7f8e68a

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

docs: focus token cache ADR on outcome

5678830

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

docs: refine token cache ADR framing

5f3e825

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

docs: make token cache ADR more approachable

92c7016

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

docs: remove token cache ADR status footer

15e83fd

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

docs: clarify token cache ADR structure

5fd10af

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

docs: add ADR for response token cache#65

docs: add ADR for response token cache#65
franciscojavierarceo wants to merge 7 commits into
mainfrom
feat/adr04-token-cache-phase2

franciscojavierarceo commented Jun 18, 2026 •

edited

Loading

Uh oh!

maralbahari commented Jun 22, 2026

Uh oh!

franciscojavierarceo commented Jun 22, 2026

Uh oh!

bbrowning commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

franciscojavierarceo commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Plan

Uh oh!

maralbahari commented Jun 22, 2026

Uh oh!

franciscojavierarceo commented Jun 22, 2026

Uh oh!

bbrowning commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

franciscojavierarceo commented Jun 18, 2026 •

edited

Loading