Prototype cached prefix replay for Responses continuations #69

## Summary

Tracks the implementation follow-up to ADR-04 for cached prompt-token prefixes in long Responses and Conversation continuations.

The goal is to reduce time to first token (TTFT) for long APC-hot agentic loops by letting `agentic-api` persist enough prefix metadata to prove a prior model-visible prompt prefix is still valid, then ask vLLM to continue from a compact `prompt_cache_ref + append_token_ids` replay request.
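As a rough illustration of the flow described above, the sketch below checks that a stored prefix is a strict prefix of the newly rendered prompt and, only then, builds a compact replay body. The field names `prompt_cache_ref` and `append_token_ids` come from this issue; `PrefixRecord`, `build_replay_request`, and the choice to store raw token IDs (rather than, say, a digest) are hypothetical and not taken from ADR-04 or the `agentic-api` code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class PrefixRecord:
    """Hypothetical persisted state for one prior model-visible prompt."""
    prompt_cache_ref: str        # opaque handle for the cached prefix
    token_ids: Tuple[int, ...]   # token IDs the model saw on the prior turn


def build_replay_request(
    record: PrefixRecord, new_prompt_ids: Sequence[int]
) -> Optional[Dict[str, object]]:
    """Return a compact replay body, or None if the prefix cannot be proven valid."""
    n = len(record.token_ids)
    # Strict prefix: the new prompt must be longer than the stored one...
    if len(new_prompt_ids) <= n:
        return None
    # ...and must begin with exactly the stored token IDs.
    if tuple(new_prompt_ids[:n]) != record.token_ids:
        return None
    return {
        "prompt_cache_ref": record.prompt_cache_ref,
        "append_token_ids": list(new_prompt_ids[n:]),
    }
```

Returning `None` rather than raising keeps the caller free to send the full prompt whenever validation fails. The check for renderer/template-safe append boundaries is deliberately left out of this sketch.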

Design context: #65

## Why

The ADR measurements showed that automatic prefix caching was already effective on the measured DGX GPT-OSS-20B server, and rendered prompt IDs were stable. The useful production win is therefore not primarily recovering missed GPU prefill; it is avoiding repeated prompt reconstruction, repeated rendering/tokenization, and large request bodies as conversations grow.

In the measured Codex-session fixture, the handle path became clearly useful around 24k prompt tokens, with a fitted TTFT improvement of about 20.4 ms per additional 10k prompt tokens.
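To make the fitted slope concrete, the helper below converts it into the additional TTFT improvement expected between two prompt sizes. It assumes the relationship stays linear across the range, which the issue does not state beyond the measured fixture; the function and constant names are illustrative only.

```python
FITTED_MS_PER_10K_TOKENS = 20.4  # slope quoted from the ADR measurements


def marginal_ttft_gain_ms(from_tokens: int, to_tokens: int) -> float:
    """Extra TTFT improvement implied by the fitted slope between two prompt sizes.

    Assumes the fit is linear over the whole range (an extrapolation).
    """
    return FITTED_MS_PER_10K_TOKENS * (to_tokens - from_tokens) / 10_000
```

Under that assumption, growing a prompt from 24k to 64k tokens corresponds to roughly 4 × 20.4 ≈ 81.6 ms of additional improvement.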

## Subissues

- #71 - Persist and validate prompt-prefix state in `agentic-api`
- #70 - Harden vLLM prompt-cache replay execution
- #72 - Add Responses WebSocket support for Codex continuations
- #73 - Expose cached-prefix identity for llm-d routing
- #74 - Benchmark cached-prefix replay end to end

## Acceptance criteria

- Strict-prefix validation exists before replay is enabled.
- Replay only appends at renderer/template-safe boundaries.
- Fallback behavior for vLLM handle misses and restarts is defined.
- Codex WebSocket continuation can reach the same replay path.
- Long-context benchmarks show lower TTFT without changing model-visible token IDs.
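One possible shape for the handle-miss fallback named in the criteria above is sketched below: try the compact replay first and resend the full prompt if the backend no longer holds the handle. This is an assumption about how the behavior might be defined, not a decision recorded in this issue; `HandleMiss`, `generate_with_fallback`, and the two sender callables are hypothetical and do not correspond to any existing vLLM API.

```python
from typing import Callable, Dict, List, Optional, Sequence


class HandleMiss(Exception):
    """Hypothetical error: the backend no longer holds the referenced prefix."""


def generate_with_fallback(
    full_prompt_ids: Sequence[int],
    replay_body: Optional[Dict[str, object]],
    send_replay: Callable[[Dict[str, object]], str],
    send_full: Callable[[List[int]], str],
) -> str:
    """Prefer the compact replay path; fall back to the full prompt on a miss."""
    if replay_body is not None:
        try:
            return send_replay(replay_body)
        except HandleMiss:
            # Handle evicted or server restarted: fall through to the full prompt.
            pass
    # Either validation produced no replay body or the handle was missing.
    return send_full(list(full_prompt_ids))
```

Because the fallback sends the same token IDs the replay would have reconstructed, the model-visible prompt is intended to be identical on both paths; only the request size and the server-side work differ.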

