Expose cached-prefix identity for llm-d routing #73

Description

@franciscojavierarceo

Parent tracker: #69
Design context: #65

Summary

Make cached-prefix replay compatible with scaled llm-d deployments, where agentic-api database state and vLLM KV cache residency are separate layers.

Scope

  • Make the replay prefix router-visible without sending the full token array.
  • Define a compact prefix identity: prefix hash, token count, block size, and eventually a block-hash chain compatible with llm-d precise-prefix routing.
  • Ensure vLLM KV events are emitted for the Responses replay path.
  • Align the replay-plan token stream, block size, and prefix/block hash with llm-d routing.
  • Test active-active EPP, tiered KV offload, wrong-pod routing, pod restart, AllBlocksCleared, and shared-storage reload scenarios.
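
To make the "compact prefix identity" item concrete, here is a minimal sketch of one possible shape for it. Everything in it is an assumption for illustration: the `PrefixIdentity` type, the `prefix_identity` helper, and the SHA-256 chaining over little-endian token ids are hypothetical, and they are not the hashing scheme vLLM or llm-d actually use. A real implementation would need to reproduce the engine's own block hashing exactly for the hints to match KV events.

```python
import hashlib
import struct
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class PrefixIdentity:
    """Hypothetical compact, router-visible description of a replay prefix."""
    prefix_hash: str               # hash of the last full block in the chain
    token_count: int               # tokens covered by full blocks only
    block_size: int                # must match the engine's KV block size
    block_hashes: Tuple[str, ...]  # one chained hash per full block


def _hash_block(parent: bytes, tokens: Sequence[int]) -> bytes:
    # Chain each block onto its parent so a block hash identifies the
    # whole prefix up to and including that block (assumed scheme).
    payload = parent + struct.pack(f"<{len(tokens)}I", *tokens)
    return hashlib.sha256(payload).digest()


def prefix_identity(tokens: Sequence[int], block_size: int) -> PrefixIdentity:
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full_blocks = len(tokens) // block_size  # a trailing partial block is ignored
    parent = b""
    chain = []
    for i in range(full_blocks):
        block = tokens[i * block_size:(i + 1) * block_size]
        parent = _hash_block(parent, block)
        chain.append(parent.hex())
    return PrefixIdentity(
        prefix_hash=chain[-1] if chain else "",
        token_count=full_blocks * block_size,
        block_size=block_size,
        block_hashes=tuple(chain),
    )
```

The identity stays small regardless of prompt length relative to the token array itself: one hash per full block plus three scalars. Two prompts that share their first N full blocks share the first N entries of `block_hashes`, which is the property a prefix-aware router needs.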

Acceptance criteria

  • Router-visible prefix hints are derived from the same token stream and block size vLLM uses for KV events.
  • Wrong-pod, restart, and cache-clear cases fall back safely.
  • The design does not treat a process-local vLLM handle as durable global database state.
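
The fallback criteria above can be sketched as a router-side index that is treated as a hint, never as durable state. This is an illustrative assumption, not the llm-d EPP API: `PrefixIndex` and its method names are hypothetical, and the event handlers only stand in for whatever consumes vLLM KV events such as `AllBlocksCleared`.

```python
from typing import Dict, Iterable, Optional, Sequence, Set, Tuple


class PrefixIndex:
    """Hypothetical per-pod view of cached block hashes, built from KV events."""

    def __init__(self) -> None:
        self._blocks: Dict[str, Set[str]] = {}

    def on_blocks_stored(self, pod: str, block_hashes: Iterable[str]) -> None:
        self._blocks.setdefault(pod, set()).update(block_hashes)

    def on_all_blocks_cleared(self, pod: str) -> None:
        # Also the right reaction to a pod restart: forget everything.
        self._blocks.pop(pod, None)

    def matched_blocks(self, pod: str, chain: Sequence[str]) -> int:
        # Count leading blocks of the chain that the pod is believed to hold.
        held = self._blocks.get(pod, set())
        matched = 0
        for block_hash in chain:
            if block_hash not in held:
                break
            matched += 1
        return matched

    def choose(self, chain: Sequence[str], pods: Iterable[str]) -> Tuple[Optional[str], int]:
        # Returns (pod, matched_blocks); (None, 0) means "no cache hit known".
        best_pod, best = None, 0
        for pod in pods:
            matched = self.matched_blocks(pod, chain)
            if matched > best:
                best_pod, best = pod, matched
        return best_pod, best
```

When `choose` returns `(None, 0)`, the caller would fall back to default scheduling and a full prefill from the replay-plan token stream. The same fallback applies if the chosen pod turns out not to hold the blocks (wrong-pod routing), since the index reflects process-local cache residency that can change at any time.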
