Prototype cached prefix replay for Responses continuations #69

Description

@franciscojavierarceo

Summary

Track implementation follow-up from ADR-04 for cached prompt-token prefixes in long Responses and Conversation continuations.

The goal is to reduce time to first token (TTFT) for long agentic loops whose automatic prefix cache (APC) is already hot. agentic-api would persist enough prefix metadata to prove that a prior model-visible prompt prefix is still valid, then ask vLLM to continue from a compact replay request carrying prompt_cache_ref and append_token_ids.
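As a rough illustration of the idea, the sketch below checks that a stored prefix is a strict prefix of the newly rendered prompt and, if so, builds a replay payload. Only the field names prompt_cache_ref and append_token_ids come from this issue; PrefixRecord, digest, and build_replay_request are hypothetical names, and the persisted schema is an assumption. The sketch validates against a fully rendered token list for clarity; a real implementation would presumably validate from persisted metadata to avoid re-rendering.

```python
from dataclasses import dataclass
from hashlib import sha256
from typing import Optional, Sequence


def digest(token_ids: Sequence[int]) -> str:
    """Hash a token-ID sequence so a stored prefix can be compared cheaply."""
    h = sha256()
    for t in token_ids:
        h.update(t.to_bytes(8, "little", signed=False))
    return h.hexdigest()


@dataclass(frozen=True)
class PrefixRecord:
    # Hypothetical persisted metadata; the real schema is not defined in this issue.
    prompt_cache_ref: str   # handle the server is assumed to understand
    prefix_len: int         # number of token IDs covered by the handle
    prefix_digest: str      # digest of those token IDs


def build_replay_request(record: PrefixRecord,
                         rendered_ids: Sequence[int]) -> Optional[dict]:
    """Return a replay payload, or None if the stored prefix is not provably valid."""
    n = record.prefix_len
    # Strict prefix: there must be at least one new token to append.
    if n >= len(rendered_ids):
        return None
    # The first n rendered IDs must match what was stored.
    if digest(rendered_ids[:n]) != record.prefix_digest:
        return None
    return {
        "prompt_cache_ref": record.prompt_cache_ref,
        "append_token_ids": list(rendered_ids[n:]),
    }
```

Returning None rather than raising keeps the caller's decision simple: any failed check means sending the full prompt as before.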

Design context: #65

Why

The ADR measurements showed that automatic prefix caching was already effective on the measured DGX GPT-OSS-20B server, and rendered prompt IDs were stable. The useful production win is therefore not primarily recovering missed GPU prefill; it is avoiding repeated prompt reconstruction, repeated rendering/tokenization, and large request bodies as conversations grow.

In the measured Codex-session fixture, the handle path became clearly useful around 24k prompt tokens, with a fitted TTFT improvement of about 20.4 ms per additional 10k prompt tokens.

Acceptance criteria

  • Strict-prefix validation exists before replay is enabled.
  • Replay only appends at renderer/template-safe boundaries.
  • vLLM handle miss and restart fallback behavior is defined.
  • Codex WebSocket continuation can reach the same replay path.
  • Long-context benchmarks show lower TTFT without changing model-visible token IDs.
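One possible shape for the handle-miss fallback and the token-ID invariant is sketched below. The fallback behavior is not yet defined in this issue, so this is only an assumption: HandleMiss, send_with_fallback, effective_token_ids, and the two transport callables are hypothetical names, not vLLM or agentic-api APIs.

```python
from typing import Any, Callable, List, Sequence, Tuple


class HandleMiss(Exception):
    """Hypothetical signal that the server no longer holds the referenced prefix,
    for example after eviction or a restart."""


def effective_token_ids(prefix_ids: Sequence[int],
                        append_token_ids: Sequence[int]) -> List[int]:
    """Token IDs the model would see on the replay path."""
    return list(prefix_ids) + list(append_token_ids)


def send_with_fallback(replay_request: dict,
                       full_token_ids: Sequence[int],
                       send_replay: Callable[[dict], Any],
                       send_full: Callable[[List[int]], Any]) -> Tuple[str, Any]:
    """Try the compact replay request first; resend the full prompt on a miss."""
    try:
        return "replay", send_replay(replay_request)
    except HandleMiss:
        # Assumed policy: fall back to the ordinary full-prompt request.
        return "full", send_full(list(full_token_ids))
```

A benchmark harness could assert that effective_token_ids(prefix, append) equals the fully rendered prompt before comparing TTFT, so that a latency win is never reported for a request whose model-visible token IDs changed.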
