
Conversation


@juncgu-google juncgu-google commented Nov 24, 2025

Description

TL;DR: add support for offloading KV cache to a host CPU buffer (similar to the native CPU offloading in vLLM).

This PR allows offloading computed KV cache blocks / pages (for prompt tokens and, optionally, generated tokens) to a host CPU buffer and bringing them back to TPU HBM on cache hits, avoiding re-computation.

Implementation

Following the general KV connector interfaces and the logic of the native CPU offloading in vLLM, this PR introduces a TPUOffloadConnector, the central component that manages and executes the offloading logic. Within the TPUOffloadConnector there are (a simplified sketch of the scheduler-side helpers follows this list):

  1. TPUOffloadConnectorScheduler: colocates with the vLLM scheduler, makes KV cache swap decisions, and keeps track of running requests and their in-flight swap operations.
    • OffloadManager: manages the logical KV cache space in the host CPU buffer and applies an LRU eviction policy.
    • StagingBufferManager: manages a logical staging buffer in TPU HBM, used as a temporary buffer for data in transit between the TPU and the CPU. Without this control, uncapped temporary HBM usage can easily lead to TPU OOM.
  2. TPUOffloadConnectorWorker: executes the swap decisions (embedded in the SchedulerOutput) from the TPUOffloadConnectorScheduler.
    • LocalCPUBackend: a simple key-value map in host CPU memory that holds the offloaded KV cache blocks / pages.
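
The following is a minimal, illustrative sketch of the two scheduler-side helpers, assuming a plain OrderedDict-based LRU and a simple counter for staging blocks; the actual classes in this PR track more state (e.g., in-flight swaps and per-request bookkeeping):

    from collections import OrderedDict

    class OffloadManager:
        """Sketch: tracks which KV blocks live in the host CPU buffer (LRU)."""

        def __init__(self, num_cpu_chunks: int):
            self.free_chunks = list(range(num_cpu_chunks))
            self.chunks = OrderedDict()  # block hash -> CPU chunk id

        def lookup(self, block_hash: str):
            chunk = self.chunks.get(block_hash)
            if chunk is not None:
                self.chunks.move_to_end(block_hash)  # mark as most recently used
            return chunk

        def allocate(self, block_hash: str) -> int:
            if not self.free_chunks:
                _, evicted = self.chunks.popitem(last=False)  # evict the LRU chunk
                self.free_chunks.append(evicted)
            chunk = self.free_chunks.pop()
            self.chunks[block_hash] = chunk
            return chunk

    class StagingBufferManager:
        """Sketch: caps how many HBM staging blocks can be in flight at once."""

        def __init__(self, num_staging_blocks: int):
            self.free = num_staging_blocks

        def try_reserve(self, num_blocks: int) -> bool:
            if num_blocks > self.free:
                return False  # not enough staging space; defer the swap
            self.free -= num_blocks
            return True

        def release(self, num_blocks: int) -> None:
            self.free += num_blocks

In practice, chunks with in-flight swaps would also need to be protected from eviction; the sketch omits that.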

The TPU-CPU swap operations are the core of this PR. We provide two approaches to move KV cache data (a minimal sketch of the first approach follows this list):

  1. using jax.device_put
  2. using a Pallas DMA kernel
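
As a rough illustration of the first approach, a single KV cache block can be copied between TPU HBM and host memory with jax.device_put; the block shape and dtype below are made up for illustration and do not reflect the actual cache layout:

    import jax
    import jax.numpy as jnp

    cpu = jax.devices("cpu")[0]
    tpu = jax.devices()[0]  # first accelerator device (a TPU chip when running on TPU)

    # One KV cache block; shape and dtype are only illustrative.
    kv_block = jnp.zeros((2, 16, 8, 128), dtype=jnp.bfloat16)

    # Swap out: TPU HBM -> host CPU buffer.
    cpu_copy = jax.device_put(kv_block, cpu)

    # Swap in: host CPU buffer -> TPU HBM.
    hbm_copy = jax.device_put(cpu_copy, tpu)
    jax.block_until_ready(hbm_copy)  # ensure the transfer has completed

The Pallas path replaces these copies with a custom DMA kernel; it is exercised by tests/kernels/host_dma_test.py below.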

Usage

This feature can be enabled by setting kv_transfer_config on the vLLM engine:
--kv-transfer-config '{"kv_connector":"TPUOffloadConnector","kv_connector_module_path":"tpu_inference.distributed.offload.tpu_offload_connector","kv_role":"kv_both"}'
example: examples/offload/gke/benchmarks/deploy-cpu-offload.yaml
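
For offline inference, the equivalent configuration can be passed programmatically; this sketch assumes the KVTransferConfig fields mirror the JSON keys above, and the model name is only a placeholder:

    from vllm import LLM
    from vllm.config import KVTransferConfig

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        kv_transfer_config=KVTransferConfig(
            kv_connector="TPUOffloadConnector",
            kv_connector_module_path="tpu_inference.distributed.offload.tpu_offload_connector",
            kv_role="kv_both",
        ),
    )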

The feature can additionally be configured through the following environment variables (we will move these into KVTransferConfig.kv_connector_extra_config):

  1. TPU_OFFLOAD_SKIP_JAX_PRECOMPILE: skip pre-compiling the swap functions, default=0. We suggest keeping pre-compilation enabled. All swap operations are applied at block granularity; when pre-compilation is enabled, a swap request is broken into multiple swap operations following the predefined bucket-size list (1, 2, 4, 8, or 16 blocks) to avoid re-compilation (thanks to @saikat-royc); see the illustrative sketch after this list.
  2. TPU_OFFLOAD_SWAP_OP_TYPE: jax (default), or pallas.
  3. TPU_OFFLOAD_NUM_CPU_CHUNKS: host CPU buffer capacity in terms of number of chunks (equivalent to kv cache blocks / pages), default=1024.
  4. TPU_OFFLOAD_NUM_STAGING_BLOCKS: the number of staging blocks, i.e., the size of the staging buffer in TPU HBM, default=128.
  5. TPU_OFFLOAD_DECODE_SAVE: save the KV cache of generated (decode) tokens, default=False.
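
To illustrate the bucketing mentioned for TPU_OFFLOAD_SKIP_JAX_PRECOMPILE, a swap request can be greedily decomposed into the pre-compiled bucket sizes; the helper below is hypothetical and only shows the idea, not the PR's actual decomposition code:

    # Pre-compiled swap sizes, in blocks (from the bucket list above).
    SWAP_BUCKETS = (16, 8, 4, 2, 1)

    def split_into_buckets(num_blocks: int) -> list[int]:
        """Greedily break a swap of num_blocks blocks into bucket-sized ops."""
        ops, remaining = [], num_blocks
        for bucket in SWAP_BUCKETS:
            while remaining >= bucket:
                ops.append(bucket)
                remaining -= bucket
        return ops

    # An 11-block swap becomes one 8-block, one 2-block, and one 1-block op,
    # each matching a swap shape that was pre-compiled at startup.
    assert split_into_buckets(11) == [8, 2, 1]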

Tests

pytest -s -v tests/distributed/offload/
pytest -s -v tests/kernels/host_dma_test.py

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

@juncgu-google juncgu-google changed the title [Feat][TPU Offload] KV cache offload to local cpu buffer [Feat][WIP][TPU Offload] KV cache offload to local cpu buffer Nov 24, 2025
@juncgu-google juncgu-google changed the title [Feat][WIP][TPU Offload] KV cache offload to local cpu buffer [Feat][TPU Offload] KV cache offload to local cpu buffer Nov 25, 2025
@juncgu-google juncgu-google changed the title [Feat][TPU Offload] KV cache offload to local cpu buffer [Feat][WIP][TPU Offload] KV cache offload to local cpu buffer Nov 25, 2025
@juncgu-google juncgu-google force-pushed the cpu-offloading/dev-merge branch 2 times, most recently from 48d2e3d to eeb1ed8 Compare December 1, 2025 22:58
@juncgu-google juncgu-google changed the title [Feat][WIP][TPU Offload] KV cache offload to local cpu buffer [Feat][TPU Offload] KV cache offload to local cpu buffer Dec 1, 2025
@juncgu-google juncgu-google added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 1, 2025

kyuyeunk commented Dec 2, 2025

This is a really big PR and it's difficult to review.

Are you planning to split this PR into multiple smaller ones to make it easier for reviewers?

@juncgu-google juncgu-google removed the ready ONLY add when PR is ready to merge/full CI is needed label Dec 3, 2025
@juncgu-google juncgu-google force-pushed the cpu-offloading/dev-merge branch 2 times, most recently from 54785e1 to cc97559 Compare December 4, 2025 00:40
@juncgu-google juncgu-google added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 4, 2025
@juncgu-google juncgu-google force-pushed the cpu-offloading/dev-merge branch from 5dfe210 to e780427 Compare December 5, 2025 02:48
@dannawang0221
Collaborator

/lgtm

@juncgu-google juncgu-google force-pushed the cpu-offloading/dev-merge branch from e780427 to 87db579 Compare December 5, 2025 18:29
@juncgu-google
Collaborator Author

This is a really big PR and it's difficult to review.

Are you planning to split this PR into multiple smaller ones to make it easier for reviewers?

That's true. The core implementation is in the distributed/offload folder (~2400 lines); the rest is tests. However, the feature is modularized and makes negligible changes to the core of tpu-inference.
I will discuss with the reviewers how to re-organize the PR.

juncgu-google and others added 24 commits December 6, 2025 04:50
@juncgu-google juncgu-google force-pushed the cpu-offloading/dev-merge branch from 6113fb4 to 9ac152e Compare December 6, 2025 05:05