From 8a87de6b14f3423d2f7e4b6a33ab2cd82a5e5b04 Mon Sep 17 00:00:00 2001
From: zhengkezhou1
Date: Thu, 17 Jul 2025 21:51:40 +0800
Subject: [PATCH 1/3] [proposal]: Support Context Cache for Improved
 Conversation Efficiency

Signed-off-by: zhengkezhou1
---
 design/ep-1248.md | 252 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 252 insertions(+)
 create mode 100644 design/ep-1248.md

diff --git a/design/ep-1248.md b/design/ep-1248.md
new file mode 100644
index 000000000..324497883
--- /dev/null
+++ b/design/ep-1248.md
@@ -0,0 +1,252 @@
# EP: Support Context Cache for Improved Conversation Efficiency

## Background

In multi-turn or session-based Large Language Model (LLM) inference scenarios, the current practice involves sending the entire conversation history with each new query. This leads to redundant computation of past prompts' Key-Value (KV) Caches, resulting in significant performance bottlenecks and increased computational costs (especially for longer contexts). To address this, efficiently reusing and managing the KV Cache for conversational history becomes critical. Many leading LLM providers have already adopted context caching functionality to mitigate these challenges.

## Goal

Based on the capabilities of the already implemented KV Cache Sidecar, this proposal aims to build and integrate an **optional** context caching feature for the Aibrix system. This feature will allow users to efficiently reuse the model inference's KV Cache via a session ID, thereby significantly reducing redundant computation overhead and optimizing overall performance and resource consumption in multi-turn or conversational LLM interactions.

## Implementation

### Request Flow

We will introduce a new endpoint: `/v1/context` to manage context caches. The following fields will be used:

 - `session_id`: A unique identifier for each context cache, created upon the first request and used in subsequent requests.
 - `ttl`: The time-to-live for the cache, after which it will be automatically cleared.

#### Creating a Cache for a Session

```mermaid
sequenceDiagram
    participant C as Client
    participant E as Envoy
    participant G as Gateway Plugin
    participant R as Context Cache Manager
    participant IP as InferencePod
    participant V as vLLM Main Container
    participant S as KV Cache Sidecar

    C->>+E: POST /v1/context (prompt, model, ttl)
    E->>+G: Forward Request
    G->>+R: 1. Request Session ID & Metadata Creation
    R->>-G: Return Session ID
    G->>+V: 2. Submit Prompt for Initial Inference
    V->>V: 2.1. Compute KV Cache for Prompt
    V-->>V: 2.2. Generate Completion (if needed)
    V->>+S: 3. Export & Store KV Cache (via Sidecar API/IPC)
    S->>S: 3.1. Persist KV Cache Data
    S->>-V: Confirmation of Storage & Sidecar Cache ID
    V->>-G: 4. Return Initial Inference Response (incl. Sidecar Cache ID)
    G->>+R: 5. Register Session Metadata (session_id, Sidecar Cache ID, TTL)
    R->>-G: Confirmation of Registration
    G->>-E: Pipe back Response (with session_id, usage)
    E->>-C: Complete Response
    Note over R,S: Context Cache Manager manages metadata. Sidecar handles actual KV Cache data.
```

Before using context caching, users first need to create it. Here, we create a context cache with a `ttl` of one hour.
+ +```shell +curl -X POST http://localhost:8000/v1/context \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer test-key-1234567890" \ + -d '{ + "model": "facebook-opt-125m", + "prompt": "Say this is a test", + "ttl": 3600, + }' +``` + +In the response, we can obtain the unique identifier for the created session, `session_id`. + +```json +{ + "id": "cmpl-de1f99972bd34149968489cb100b2c88", + "object": "text_completion", + "created": 1752594611, + "model": "facebook-opt-125m", + "session_id": "session-01" + ... + "usage": { + "prompt_tokens": 6, + "total_tokens": 93, + "completion_tokens": 87, + "prompt_tokens_details": null + } +} +``` + +#### Using Context Cache with `session_id` + +We can use the context cache by populating the obtained `session_id` into the request body. + +```mermaid +sequenceDiagram + participant C as Client + participant E as Envoy + participant G as Gateway Plugin + participant R as Context Cache Manager + participant IP as InferencePod + participant V as vLLM Main Container + participant S as KV Cache Sidecar + + C->>+E: POST /v1/completions (session_id="session-01", prompt="Next turn...") + E->>+G: Forward Request + G->>+R: 1. Lookup KV Cache Metadata (session_id="session-01") + R->>R: 1.1. Check TTL & validity + R->>-G: Return KV Cache Reference/ID (from Sidecar) + G->>+S: 2. Load KV Cache Data (using reference/ID) + S->>S: 2.1. Read KV Cache from persistent storage + S->>-G: Return KV Cache Data + G->>+V: 3. Submit Request with Loaded KV Cache & New Prompt + V-->>V: Generate new completion tokens + V->>-G: 4. Return Response (generated_text, adjusted usage_info) + G->>-E: Pipe back Response + E->>-C: Complete streaming +``` + +```shell +curl -X POST http://localhost:8000/v1/completions \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer test-key-1234567890" \ + -d '{ + "session_id": "session-01" + "model": "facebook-opt-125m", + "prompt": "Say this is a test", + }' +``` + +The expected effect is that when we use context caching, token consumption in multi-turn conversations will be reduced. + +```json +{ + ... + "usage": { + "prompt_tokens": 1, + "total_tokens": 50, + "completion_tokens": 49, + "prompt_tokens_details": null + } + ... +} +``` + +#### Clearing Context Cache + +When the TTL expires, the cache will be cleared. Manual early clearing is also provided. + +```shell +curl -X DELETE http://localhost:8000/v1/context/$session_id \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer test-key-1234567890" \ +``` + +### Runtime Container Changes + +In our context caching solution, the Runtime Container hosts the core **Context Cache Manager**. It is an independent logical unit whose primary responsibility is to act as the **registration center and lifecycle manager for context cache session metadata** within the entire system. Unlike the `KV Cache Sidecar` which directly handles KV Cache data, the `Context Cache Manager` is not responsible for the physical storage, serialization, or injection of KV Cache, but focuses solely on **logical-level management**. + +#### Core Responsibilities: + +##### Session Metadata Management: + + - **Registration and Mapping:** When a client first creates a context cache, the Context Cache Manager generates a unique `session_id` and associates it with a **unique reference (`kv_cache_sidecar_ref`)** returned by the KV Cache Sidecar, pointing to the actual KV Cache data. This mapping, along with model ID, TTL, and other information, is stored as session metadata. 
+ + - **Query and Validation:** In subsequent requests, the Gateway Plugin queries the Context Cache Manager to obtain the `kv_cache_sidecar_ref` corresponding to a given `session_id`. The Manager will also validate the session's validity, including checking for expiration (TTL). + + - **Deregistration and Deletion:** When a user manually requests cache deletion, or when a cache needs to be cleared due to TTL expiration, the Context Cache Manager is responsible for removing the corresponding session metadata from its storage. + +##### Lifecycle Management (TTL): + +The Context Cache Manager stores the expiration time (`expires_at`) for each session. It will provide mechanisms (e.g., internal background tasks or external calls) to periodically check and clean up expired session metadata, ensuring timely release of cached resources. + +##### System Coordination Layer: + +The Context Cache Manager provides clear API interfaces (e.g., `register_session_cache`, `get_session_metadata`, `unregister_session_cache`) to the Gateway Plugin, enabling it to smoothly complete the creation, usage, and deletion processes for context caches. It **does not directly interact with the `vLLM Main Container` or the `KV Cache Sidecar` for data transfer**, but rather, through the passing of metadata, it guides the Gateway Plugin to coordinate with the KV Cache Sidecar for the loading and storage of KV Cache data. + +```python +from typing import Union, Optional +from pydantic import BaseModel # Assuming pydantic for request/response models + +class CreateContextCacheRequest(BaseModel): + model: str + prompt: str + ttl: int = 3600 # seconds + +class CreateCacheResponse(BaseModel): + id: str # ID of the initial inference + session_id: str + model: str + created: int + usage: dict # Contains prompt_tokens, total_tokens, etc. + +class DeleteCacheRequest(BaseModel): + session_id: str + +class DeleteCacheResponse(BaseModel): + session_id: str + status: str = "success" + +class ErrorResponse(BaseModel): + detail: str +``` + +```python +class CacheSessionMetadata(BaseModel): + """Session metadata stored in the ContextCacheManager""" + session_id: str + model_id: str # The model this cache is for + kv_cache_sidecar_ref: str # Reference/ID used by the KV Cache Sidecar to identify the actual KV cache data + expires_at: int # Unix timestamp for TTL expiry + +class ContextCacheManager: + """ + Context Cache Manager, running in the Runtime Container. + Main responsibilities: + 1. Manage context cache session metadata (session_id, TTL, KV Cache Sidecar reference). + 2. Provide API for Gateway Plugin to register, query, and delete session metadata. + 3. Handle TTL expiration checks and cleanup of sessions (potentially via background tasks). + """ + + def __init__(self): + # In a production environment, Redis, a distributed key-value store, or a database would typically be used for metadata storage. + self.session_metadata_store: dict[str, CacheSessionMetadata] = {} + # Note: ContextCacheManager does not directly interact with KV Cache Sidecar for data. + # It only stores the reference provided by the Sidecar. Actual data interaction with the Sidecar is coordinated by the Gateway Plugin. 
        pass

    async def register_session_cache(
        self,
        request: CreateContextCacheRequest, # Metadata from the original create request
        initial_inference_response: CreateCacheResponse, # Response obtained from initial vLLM inference
        kv_cache_sidecar_ref: str # Unique reference to the internal KV Cache returned by the KV Cache Sidecar
    ) -> Union[ErrorResponse, CreateCacheResponse]:
        # Implementation details will go here.
        pass

    async def unregister_session_cache(
        self,
        session_id: str,
    ) -> Union[ErrorResponse, DeleteCacheResponse]:
        """
        Deletes the metadata for the specified context cache session.
        Note: This method only deletes metadata and **does not directly trigger** the KV Cache Sidecar to delete actual data.
        **Actual KV Cache data deletion should be coordinated by the Gateway Plugin, or handled by the Sidecar itself based on TTL periodic cleanup.**
        """
        # Implementation details will go here.
        pass

    async def get_session_metadata(
        self,
        session_id: str,
    ) -> Optional[CacheSessionMetadata]:
        """
        Retrieves cache metadata for the specified session.
        To be used by the Gateway Plugin in subsequent requests.
        Also performs TTL check.
        """
        # Implementation details will go here.
        pass
```

From 5d10c26211826a712811e7907095771dc3cb6138 Mon Sep 17 00:00:00 2001
From: Zhengke Zhou
Date: Thu, 17 Jul 2025 22:03:27 +0800
Subject: [PATCH 2/3] Update design/ep-1248.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Zhengke Zhou
---
 design/ep-1248.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/design/ep-1248.md b/design/ep-1248.md
index 324497883..3ed04f6f7 100644
--- a/design/ep-1248.md
+++ b/design/ep-1248.md
@@ -113,9 +113,9 @@ curl -X POST http://localhost:8000/v1/completions \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer test-key-1234567890" \
   -d '{
-    "session_id": "session-01"
+    "session_id": "session-01",
     "model": "facebook-opt-125m",
-    "prompt": "Say this is a test",
+    "prompt": "Say this is a test"
   }'
 ```

From 7c47f463384ae3dd190981c39a6d3731d7f3b40d Mon Sep 17 00:00:00 2001
From: zhengkezhou1
Date: Mon, 25 Aug 2025 19:35:43 +0800
Subject: [PATCH 3/3] refactor

Signed-off-by: zhengkezhou1
---
 design/ep-1248.md | 253 ++++++++++++++++------------------------------
 1 file changed, 86 insertions(+), 167 deletions(-)

diff --git a/design/ep-1248.md b/design/ep-1248.md
index 3ed04f6f7..7ad878c47 100644
--- a/design/ep-1248.md
+++ b/design/ep-1248.md
@@ -2,11 +2,16 @@

## Background

-In multi-turn or session-based Large Language Model (LLM) inference scenarios, the current practice involves sending the entire conversation history with each new query. This leads to redundant computation of past prompts' Key-Value (KV) Caches, resulting in significant performance bottlenecks and increased computational costs (especially for longer contexts). To address this, efficiently reusing and managing the KV Cache for conversational history becomes critical. Many leading LLM providers have already adopted context caching functionalities to mitigate these challenges.
+Many LLM providers (e.g., OpenAI, Anthropic) offer prompt caching, which reduces first-token inference latency by caching prompts. We want to introduce similar functionality in our current system: Context Cache.
+Context Cache will be integrated with the existing KV Cache. From a higher-level perspective, it can be viewed as a KV Cache Manager that operates during the prefill stage. When Context Cache is enabled, if the prompt received by the inference engine already exists in the current KV Cache, we can skip the computation phase and instead load the cached KV states from cache (disk, remote storage) into GPU memory.
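+
+To make the intended prefill behavior concrete, here is a minimal, runnable sketch of the cache-hit decision. All names in it (`KVStore`, `prefill`, the `kv://` reference format) are illustrative assumptions for this proposal, not existing vLLM or AIBrix APIs:
+
+```python
+from dataclasses import dataclass, field
+
+@dataclass
+class KVStore:
+    """Stands in for the persistent KV cache store (disk or remote storage)."""
+    entries: dict = field(default_factory=dict)  # prompt tokens -> cache reference
+
+def prefill(tokens: tuple, store: KVStore) -> str:
+    ref = store.entries.get(tokens)
+    if ref is not None:
+        # Cache hit: prefill turns data-intensive -- load the precomputed
+        # attention states from the store instead of recomputing them.
+        return f"loaded KV states via {ref}"
+    # Cache miss: compute the attention states during prefill, then offload
+    # them so later requests with the same prompt can reuse them.
+    store.entries[tokens] = "kv://example-ref"
+    return "computed KV states and offloaded them to the store"
+
+store = KVStore()
+print(prefill(("Say", "this", "is", "a", "test"), store))  # first call: miss
+print(prefill(("Say", "this", "is", "a", "test"), store))  # second call: hit
+```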

## Goal

-Based on the capabilities of the already implemented KV Cache Sidecar, this proposal aims to build and integrate an **optional** context caching feature for the Aibrix system. This feature will allow users to efficiently reuse the model inference's KV Cache via a session ID, thereby significantly reducing redundant computation overhead and optimizing overall performance and resource consumption in multi-turn or conversational LLM interactions.
+Provide prompt-caching-like functionality for the current system. It is optional for users, and the usage is as follows:
+
+1. Users pass a TTL (time-to-live) and the initial prompt to a dedicated endpoint to initialize the cache, and receive the corresponding response along with a unique identifier for the session (session-id).
+2. Users then send OpenAI-compatible requests (chat/completions, responses) carrying that session-id. When a cache hit occurs, the response returns faster than it would without the cache.
+3. The cache is deleted automatically once the user-specified TTL expires, or users can delete it early with an explicit request to the endpoint.

## Implementation

### Request Flow

We will introduce a new endpoint: `/v1/context` to manage context caches. The following fields will be used:

- - `session_id`: A unique identifier for each context cache, created upon the first request and used in subsequent requests.
- - `ttl`: The time-to-live for the cache, after which it will be automatically cleared.
+- `x-session-id`: A unique identifier for each context cache, created upon the first request and used in subsequent requests.
+- `x-session-ttl`: The time-to-live for the cache, after which it will be automatically cleared.
+
+Placing these two fields in HTTP headers ensures that all requests remain compatible with the OpenAI API.

#### Creating a Cache for a Session

```mermaid
sequenceDiagram
    participant C as Client
    participant E as Envoy
    participant G as Gateway Plugin
    participant R as Context Cache Manager
-    participant IP as InferencePod
-    participant V as vLLM Main Container
-    participant S as KV Cache Sidecar
-
-    C->>+E: POST /v1/context (prompt, model, ttl)
-    E->>+G: Forward Request
-    G->>+R: 1. Request Session ID & Metadata Creation
-    R->>-G: Return Session ID
-    G->>+V: 2. Submit Prompt for Initial Inference
-    V->>V: 2.1. Compute KV Cache for Prompt
-    V-->>V: 2.2. Generate Completion (if needed)
-    V->>+S: 3. Export & Store KV Cache (via Sidecar API/IPC)
-    S->>S: 3.1. Persist KV Cache Data
-    S->>-V: Confirmation of Storage & Sidecar Cache ID
-    V->>-G: 4. Return Initial Inference Response (incl. Sidecar Cache ID)
-    G->>+R: 5. Register Session Metadata (session_id, Sidecar Cache ID, TTL)
-    R->>-G: Confirmation of Registration
-    G->>-E: Pipe back Response (with session_id, usage)
-    E->>-C: Complete Response
-    Note over R,S: Context Cache Manager manages metadata. Sidecar handles actual KV Cache data.
+    participant T as Router
+    participant V as vLLM Engine
+    participant S as Persistent KV Cache Store
+
+    Note over C,S: Creating a new Context Cache Session
+
+    C->>+E: 1. POST /v1/context<br>(x-session-ttl, prompt, model)
+    E->>+G: 2. Forward Request
+    G->>+R: 3. Generate x-session-id
+    R->>+T: 4. Make routing decision (based on routing algorithm)
+    T->>+V: 5. Inference request
+
+    Note over V: Execute inference (Prefill + Decode)
+    V->>V: 6. Execute inference
+    V->>S: 7. Offload cache to KV Cache storage
+    V-->>-T: 8. Return output tokens
+    T->>-R: 9. Create mapping between session-id and prompt cache<br>e.g.: session-01 -> input tokens + output tokens
+    R-->>-G: 10. Return response
+    G-->>-E: 11. Pipe back response<br>(with x-session-id)
+    E-->>-C: 12. Complete Response
```

-Before using context caching, users first need to create it. Here, we create a context cache with a `ttl` of one hour.
+Before using context caching, users first need to create it. Here, we create a context cache with an `x-session-ttl` of one hour.

```shell
curl -X POST http://localhost:8000/v1/context \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key-1234567890" \
+  -H "x-session-ttl: 3600" \
  -d '{
    "model": "facebook-opt-125m",
-    "prompt": "Say this is a test",
-    "ttl": 3600,
+    "prompt": "Say this is a test"
  }'
```

-In the response, we can obtain the unique identifier for the created session, `session_id`.
+In the response, we can obtain the unique identifier for the created session, `x-session-id`, from the HTTP response headers.
+
+```
+x-session-id: session-01
+```

```json
{
  "id": "cmpl-de1f99972bd34149968489cb100b2c88",
  "object": "text_completion",
  "created": 1752594611,
  "model": "facebook-opt-125m",
-  "session_id": "session-01"
  ...
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 93,
    "completion_tokens": 87,
    "prompt_tokens_details": null
  }
}
```
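+
+For completeness, a sketch of the same creation call from Python. It assumes only what is shown above (the `/v1/context` endpoint, the `x-session-ttl` request header, and the `x-session-id` response header); the `requests` library is used purely for illustration:
+
+```python
+import requests
+
+resp = requests.post(
+    "http://localhost:8000/v1/context",
+    headers={
+        "Authorization": "Bearer test-key-1234567890",
+        "x-session-ttl": "3600",  # cache lifetime in seconds
+    },
+    json={"model": "facebook-opt-125m", "prompt": "Say this is a test"},
+)
+resp.raise_for_status()
+session_id = resp.headers["x-session-id"]  # e.g. "session-01"
+print(session_id, resp.json()["usage"])
+```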

-#### Using Context Cache with `session_id`
-
-We can use the context cache by populating the obtained `session_id` into the request body.
+#### Using Context Cache with `x-session-id`

```mermaid
sequenceDiagram
    participant C as Client
    participant E as Envoy
    participant G as Gateway Plugin
    participant R as Context Cache Manager
-    participant IP as InferencePod
-    participant V as vLLM Main Container
-    participant S as KV Cache Sidecar
-
-    C->>+E: POST /v1/completions (session_id="session-01", prompt="Next turn...")
-    E->>+G: Forward Request
-    G->>+R: 1. Lookup KV Cache Metadata (session_id="session-01")
-    R->>R: 1.1. Check TTL & validity
-    R->>-G: Return KV Cache Reference/ID (from Sidecar)
-    G->>+S: 2. Load KV Cache Data (using reference/ID)
-    S->>S: 2.1. Read KV Cache from persistent storage
-    S->>-G: Return KV Cache Data
-    G->>+V: 3. Submit Request with Loaded KV Cache & New Prompt
-    V-->>V: Generate new completion tokens
-    V->>-G: 4. Return Response (generated_text, adjusted usage_info)
-    G->>-E: Pipe back Response
-    E->>-C: Complete streaming
+    participant T as Router
+    participant V as vLLM Engine
+    participant S as Persistent KV Cache Store
+
+    Note over C,S: Using existing session-id for inference request
+
+    C->>+E: 1. POST /v1/completions<br>(x-session-id, prompt, model...)
+    E->>+G: 2. Forward Request
+    G->>+R: 3. Look up cache corresponding to x-session-id
+    R->>+T: 4. Make routing decision (based on routing algorithm)
+
+    alt Cache hit
+        Note over R,S: ✅ Cache hit path: Use existing cache
+        T->>R: 5a. Return inference engine pod metadata
+        R->>+S: 6a. Load prompt cache from persistent storage<br>to inference engine (pod)
+        S-->>-V: 7a. Return prompt cache
+        T->>+V: 8a. Inference request<br>(inference engine will use prompt cache during prefill)
+    else Cache miss
+        Note over T,V: ⚠️ Cache miss path: Execute full inference
+        T->>+V: 5b. Inference request<br>(inference engine will use entire prompt during prefill)
+        Note over V,S: Generate new cache for subsequent use
+        V->>S: 6b. Offload newly generated cache to KV Cache storage
+    end
+
+    Note over V: 🔄 Execute inference (Prefill + Decode)
+    V-->>-T: 7. Return Output Tokens
+    T->>-R: 8. Update mapping between session-id and prompt cache
+    R-->>-G: 9. Return response
+    G-->>-E: 10. Pipe back response<br>(with x-session-id)
+    E-->>-C: 11. Complete Response
```

```shell
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key-1234567890" \
+  -H "x-session-id: session-01" \
  -d '{
-    "session_id": "session-01",
-    "model": "facebook-opt-125m",
-    "prompt": "Say this is a test"
+    "model": "facebook-opt-125m",
+    "prompt": "Say this is a test"
  }'
```
-
-The expected effect is that when we use context caching, token consumption in multi-turn conversations will be reduced.
-
-```json
-{
-  ...
-  "usage": {
-    "prompt_tokens": 1,
-    "total_tokens": 50,
-    "completion_tokens": 49,
-    "prompt_tokens_details": null
-  }
-  ...
-}
-```
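+
+Because the request body is unchanged, any OpenAI-compatible client can opt in purely via headers. A sketch using the official `openai` Python SDK's per-request `extra_headers` option (the gateway URL and key are the placeholder values used above):
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="test-key-1234567890")
+
+completion = client.completions.create(
+    model="facebook-opt-125m",
+    prompt="Say this is a test",
+    # Attach the session created earlier; the body stays OpenAI-compatible.
+    extra_headers={"x-session-id": "session-01"},
+)
+print(completion.choices[0].text)
+```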

#### Clearing Context Cache

When the TTL expires, the cache will be cleared. Manual early clearing is also provided.

```shell
curl -X DELETE http://localhost:8000/v1/context/$session_id \
  -H "Content-Type: application/json" \
-  -H "Authorization: Bearer test-key-1234567890" \
+  -H "Authorization: Bearer test-key-1234567890"
```
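+
+The corresponding early deletion from Python, under the same assumptions as the creation sketch above:
+
+```python
+import requests
+
+resp = requests.delete(
+    "http://localhost:8000/v1/context/session-01",
+    headers={"Authorization": "Bearer test-key-1234567890"},
+)
+print(resp.status_code)  # expect 2xx once the session metadata is gone
+```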

-### Runtime Container Changes
-
-In our context caching solution, the Runtime Container hosts the core **Context Cache Manager**. It is an independent logical unit whose primary responsibility is to act as the **registration center and lifecycle manager for context cache session metadata** within the entire system. Unlike the `KV Cache Sidecar` which directly handles KV Cache data, the `Context Cache Manager` is not responsible for the physical storage, serialization, or injection of KV Cache, but focuses solely on **logical-level management**.
-
-#### Core Responsibilities:
-
-##### Session Metadata Management:
-
- - **Registration and Mapping:** When a client first creates a context cache, the Context Cache Manager generates a unique `session_id` and associates it with a **unique reference (`kv_cache_sidecar_ref`)** returned by the KV Cache Sidecar, pointing to the actual KV Cache data. This mapping, along with model ID, TTL, and other information, is stored as session metadata.
-
- - **Query and Validation:** In subsequent requests, the Gateway Plugin queries the Context Cache Manager to obtain the `kv_cache_sidecar_ref` corresponding to a given `session_id`. The Manager will also validate the session's validity, including checking for expiration (TTL).
-
- - **Deregistration and Deletion:** When a user manually requests cache deletion, or when a cache needs to be cleared due to TTL expiration, the Context Cache Manager is responsible for removing the corresponding session metadata from its storage.
-
-##### Lifecycle Management (TTL):
-
-The Context Cache Manager stores the expiration time (`expires_at`) for each session. It will provide mechanisms (e.g., internal background tasks or external calls) to periodically check and clean up expired session metadata, ensuring timely release of cached resources.
-
-##### System Coordination Layer:
-
-The Context Cache Manager provides clear API interfaces (e.g., `register_session_cache`, `get_session_metadata`, `unregister_session_cache`) to the Gateway Plugin, enabling it to smoothly complete the creation, usage, and deletion processes for context caches. It **does not directly interact with the `vLLM Main Container` or the `KV Cache Sidecar` for data transfer**, but rather, through the passing of metadata, it guides the Gateway Plugin to coordinate with the KV Cache Sidecar for the loading and storage of KV Cache data.
-
-```python
-from typing import Union, Optional
-from pydantic import BaseModel # Assuming pydantic for request/response models
-
-class CreateContextCacheRequest(BaseModel):
-    model: str
-    prompt: str
-    ttl: int = 3600 # seconds
-
-class CreateCacheResponse(BaseModel):
-    id: str # ID of the initial inference
-    session_id: str
-    model: str
-    created: int
-    usage: dict # Contains prompt_tokens, total_tokens, etc.
-
-class DeleteCacheRequest(BaseModel):
-    session_id: str
-
-class DeleteCacheResponse(BaseModel):
-    session_id: str
-    status: str = "success"
-
-class ErrorResponse(BaseModel):
-    detail: str
-```
-
-```python
-class CacheSessionMetadata(BaseModel):
-    """Session metadata stored in the ContextCacheManager"""
-    session_id: str
-    model_id: str # The model this cache is for
-    kv_cache_sidecar_ref: str # Reference/ID used by the KV Cache Sidecar to identify the actual KV cache data
-    expires_at: int # Unix timestamp for TTL expiry
-
-class ContextCacheManager:
-    """
-    Context Cache Manager, running in the Runtime Container.
-    Main responsibilities:
-    1. Manage context cache session metadata (session_id, TTL, KV Cache Sidecar reference).
-    2. Provide API for Gateway Plugin to register, query, and delete session metadata.
-    3. Handle TTL expiration checks and cleanup of sessions (potentially via background tasks).
-    """
-
-    def __init__(self):
-        # In a production environment, Redis, a distributed key-value store, or a database would typically be used for metadata storage.
-        self.session_metadata_store: dict[str, CacheSessionMetadata] = {}
-        # Note: ContextCacheManager does not directly interact with KV Cache Sidecar for data.
-        # It only stores the reference provided by the Sidecar. Actual data interaction with the Sidecar is coordinated by the Gateway Plugin.
-        pass
-
-    async def register_session_cache(
-        self,
-        request: CreateContextCacheRequest, # Metadata from the original create request
-        initial_inference_response: CreateCacheResponse, # Response obtained from initial vLLM inference
-        kv_cache_sidecar_ref: str # Unique reference to the internal KV Cache returned by the KV Cache Sidecar
-    ) -> Union[ErrorResponse, CreateCacheResponse]:
-        # Implementation details will go here.
-        pass
-
-    async def unregister_session_cache(
-        self,
-        session_id: str,
-    ) -> Union[ErrorResponse, DeleteCacheResponse]:
-        """
-        Deletes the metadata for the specified context cache session.
-        Note: This method only deletes metadata and **does not directly trigger** the KV Cache Sidecar to delete actual data.
-        **Actual KV Cache data deletion should be coordinated by the Gateway Plugin, or handled by the Sidecar itself based on TTL periodic cleanup.**
-        """
-        # Implementation details will go here.
-        pass
-
-    async def get_session_metadata(
-        self,
-        session_id: str,
-    ) -> Optional[CacheSessionMetadata]:
-        """
-        Retrieves cache metadata for the specified session.
-        To be used by the Gateway Plugin in subsequent requests.
-        Also performs TTL check.
-        """
-        # Implementation details will go here.
-        pass
-```
+### Data Plane Change
+
+#### Introduce new plugin: Context Cache Manager
+
+We need to add a new plugin, the Context Cache Manager, at the gateway layer to manage prompt caches across sessions. The traditional KV Cache of a single inference session initializes attention states during prefill, applies them during the decode phase, and then evicts them from memory after inference completes. The Context Cache Manager lifts this limitation by pre-computing frequently reused prompt segments (such as system messages, document context, etc.) as a prompt cache and storing them persistently for the specified TTL. On a cache hit, the prefill phase changes from recomputing attention states to loading pre-computed attention states from storage, a conversion from compute-intensive to data-intensive work that reduces time-to-first-token latency.
+
+#### Interaction between Context Cache Manager and existing components
+
+The Context Cache Manager (CCM) serves as the core coordination component and interacts primarily with the following components (a brief metadata sketch follows this list):
+
+1. **Bidirectional interaction with the Router**:
+   - CCM sends routing decision requests to the Router
+   - The Router selects a target inference engine based on the routing algorithm and returns pod metadata to CCM
+   - On a cache hit, CCM loads the prompt cache into the GPU memory of the pod selected by the Router
+
+2. **Interaction with the Persistent KV Cache Store**:
+   - CCM manages the cache lifecycle
+   - CCM coordinates loading and offloading of caches between persistent storage and GPU memory
+   - CCM maintains the mapping between session-id and cache data
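+
+As a closing illustration, a small self-contained sketch of the per-session metadata the CCM could keep. Field names such as `cache_ref` and `expires_at` are assumptions of this proposal, not a settled API:
+
+```python
+import time
+from dataclasses import dataclass
+
+@dataclass
+class SessionEntry:
+    session_id: str
+    model: str
+    cache_ref: str     # reference into the persistent KV cache store
+    expires_at: float  # unix timestamp derived from x-session-ttl
+
+class ContextCacheManager:
+    def __init__(self):
+        self.sessions: dict[str, SessionEntry] = {}  # session-id -> metadata
+
+    def register(self, session_id: str, model: str, cache_ref: str, ttl: int) -> None:
+        self.sessions[session_id] = SessionEntry(session_id, model, cache_ref, time.time() + ttl)
+
+    def lookup(self, session_id: str) -> SessionEntry | None:
+        entry = self.sessions.get(session_id)
+        if entry is None or entry.expires_at < time.time():
+            self.sessions.pop(session_id, None)  # expired entries are dropped lazily
+            return None
+        return entry
+
+ccm = ContextCacheManager()
+ccm.register("session-01", "facebook-opt-125m", "kv://example-ref", ttl=3600)
+print(ccm.lookup("session-01"))  # hit within TTL
+print(ccm.lookup("session-02"))  # unknown session -> None
+```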