
Conversation


@juncgu-google juncgu-google commented Nov 24, 2025

Description

TL;DR: add support for offloading KV cache to a host CPU buffer (similar to the native CPU offloading in vLLM).

This PR allows offloading computed KV cache blocks / pages (for prompt tokens and, optionally, generated tokens) to a host CPU buffer and bringing them back to TPU HBM on cache hits, avoiding re-computation.

Implementation

Following the general KV connector interfaces and the logic of the native CPU offloading in vLLM, this PR introduces a TPUOffloadConnector, the central component that manages and executes the offloading logic. Within the TPUOffloadConnector there are (a simplified sketch of the scheduler-side helpers follows this list):

  1. TPUOffloadConnectorScheduler: colocates with the vLLM scheduler, makes KV cache swap decisions, and keeps track of running requests and their in-flight swap operations.
    • OffloadManager: manages the logical KV cache space in the host CPU buffer and applies an LRU eviction policy.
    • StagingBufferManager: manages a logical staging buffer in TPU HBM, used as a temporary buffer for data in transit between the TPU and the CPU. Without this control, uncapped temporary HBM usage can easily lead to TPU OOM.
  2. TPUOffloadConnectorWorker: executes the swap decisions (embedded in the SchedulerOutput) from the TPUOffloadConnectorScheduler.
    • LocalCPUBackend: a simple key-value map in host CPU memory that holds the offloaded KV cache blocks / pages.
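
The following is a minimal, illustrative sketch of the two scheduler-side helpers, assuming a plain OrderedDict-based LRU and a simple counter for staging blocks; the actual classes in this PR track more state (e.g., in-flight swaps and per-request bookkeeping):

    from collections import OrderedDict

    class OffloadManager:
        """Sketch: tracks which KV blocks live in the host CPU buffer (LRU)."""

        def __init__(self, num_cpu_chunks: int):
            self.free_chunks = list(range(num_cpu_chunks))
            self.chunks = OrderedDict()  # block hash -> CPU chunk id

        def lookup(self, block_hash: str):
            chunk = self.chunks.get(block_hash)
            if chunk is not None:
                self.chunks.move_to_end(block_hash)  # mark as most recently used
            return chunk

        def allocate(self, block_hash: str) -> int:
            if not self.free_chunks:
                _, evicted = self.chunks.popitem(last=False)  # evict the LRU chunk
                self.free_chunks.append(evicted)
            chunk = self.free_chunks.pop()
            self.chunks[block_hash] = chunk
            return chunk

    class StagingBufferManager:
        """Sketch: caps how many HBM staging blocks can be in flight at once."""

        def __init__(self, num_staging_blocks: int):
            self.free = num_staging_blocks

        def try_reserve(self, num_blocks: int) -> bool:
            if num_blocks > self.free:
                return False  # not enough staging space; defer the swap
            self.free -= num_blocks
            return True

        def release(self, num_blocks: int) -> None:
            self.free += num_blocks

In practice, chunks with in-flight swaps would also need to be protected from eviction; the sketch omits that.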

The TPU-CPU swap operations are the core of this PR. We provide two approaches to move KV cache data (a minimal sketch of the first approach follows this list):

  1. using jax.device_put
  2. using a Pallas DMA kernel
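
As a rough illustration of the first approach, a single KV cache block can be copied between TPU HBM and host memory with jax.device_put; the block shape and dtype below are made up for illustration and do not reflect the actual cache layout:

    import jax
    import jax.numpy as jnp

    cpu = jax.devices("cpu")[0]
    tpu = jax.devices()[0]  # first accelerator device (a TPU chip when running on TPU)

    # One KV cache block; shape and dtype are only illustrative.
    kv_block = jnp.zeros((2, 16, 8, 128), dtype=jnp.bfloat16)

    # Swap out: TPU HBM -> host CPU buffer.
    cpu_copy = jax.device_put(kv_block, cpu)

    # Swap in: host CPU buffer -> TPU HBM.
    hbm_copy = jax.device_put(cpu_copy, tpu)
    jax.block_until_ready(hbm_copy)  # ensure the transfer has completed

The Pallas path replaces these copies with a custom DMA kernel; it is exercised by tests/kernels/host_dma_test.py below.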

Usage

This feature can be enabled by setting kv_transfer_config on the vLLM engine:
--kv-transfer-config '{"kv_connector":"TPUOffloadConnector","kv_connector_module_path":"tpu_inference.distributed.offload.tpu_offload_connector","kv_role":"kv_both"}'
example: examples/offload/gke/benchmarks/deploy-cpu-offload.yaml
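
For offline inference, the equivalent configuration can be passed programmatically; this sketch assumes the KVTransferConfig fields mirror the JSON keys above, and the model name is only a placeholder:

    from vllm import LLM
    from vllm.config import KVTransferConfig

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        kv_transfer_config=KVTransferConfig(
            kv_connector="TPUOffloadConnector",
            kv_connector_module_path="tpu_inference.distributed.offload.tpu_offload_connector",
            kv_role="kv_both",
        ),
    )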

The feature can additionally be configured through the following environment variables (we will move these into KVTransferConfig.kv_connector_extra_config):

  1. TPU_OFFLOAD_SKIP_JAX_PRECOMPILE: skip pre-compiling the swap functions, default=0. We suggest keeping pre-compilation enabled. All swap operations are applied at block granularity; when pre-compilation is enabled, a swap request is broken into multiple swap operations following the predefined bucket-size list (1, 2, 4, 8, or 16 blocks) to avoid re-compilation (thanks to @saikat-royc); see the illustrative sketch after this list.
  2. TPU_OFFLOAD_SWAP_OP_TYPE: jax (default), or pallas.
  3. TPU_OFFLOAD_NUM_CPU_CHUNKS: host CPU buffer capacity in terms of number of chunks (equivalent to kv cache blocks / pages), default=1024.
  4. TPU_OFFLOAD_NUM_STAGING_BLOCKS: the number of staging blocks, i.e., the size of the staging buffer in TPU HBM, default=128.
  5. TPU_OFFLOAD_DECODE_SAVE: save the KV cache of generated (decode) tokens, default=False.
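
To illustrate the bucketing mentioned for TPU_OFFLOAD_SKIP_JAX_PRECOMPILE, a swap request can be greedily decomposed into the pre-compiled bucket sizes; the helper below is hypothetical and only shows the idea, not the PR's actual decomposition code:

    # Pre-compiled swap sizes, in blocks (from the bucket list above).
    SWAP_BUCKETS = (16, 8, 4, 2, 1)

    def split_into_buckets(num_blocks: int) -> list[int]:
        """Greedily break a swap of num_blocks blocks into bucket-sized ops."""
        ops, remaining = [], num_blocks
        for bucket in SWAP_BUCKETS:
            while remaining >= bucket:
                ops.append(bucket)
                remaining -= bucket
        return ops

    # An 11-block swap becomes one 8-block, one 2-block, and one 1-block op,
    # each matching a swap shape that was pre-compiled at startup.
    assert split_into_buckets(11) == [8, 2, 1]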

Tests

pytest -s -v tests/distributed/offload/
pytest -s -v tests/kernels/host_dma_test.py

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

@juncgu-google juncgu-google changed the title [Feat][TPU Offload] KV cache offload to local cpu buffer [Feat][WIP][TPU Offload] KV cache offload to local cpu buffer Nov 24, 2025
@juncgu-google juncgu-google changed the title [Feat][WIP][TPU Offload] KV cache offload to local cpu buffer [Feat][TPU Offload] KV cache offload to local cpu buffer Nov 25, 2025
@juncgu-google juncgu-google changed the title [Feat][TPU Offload] KV cache offload to local cpu buffer [Feat][WIP][TPU Offload] KV cache offload to local cpu buffer Nov 25, 2025
@juncgu-google juncgu-google force-pushed the cpu-offloading/dev-merge branch 2 times, most recently from 48d2e3d to eeb1ed8 Compare December 1, 2025 22:58
@juncgu-google juncgu-google changed the title [Feat][WIP][TPU Offload] KV cache offload to local cpu buffer [Feat][TPU Offload] KV cache offload to local cpu buffer Dec 1, 2025
@juncgu-google juncgu-google added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 1, 2025

kyuyeunk commented Dec 2, 2025

This is a really big PR and it's difficult to review.

Are you planning to split this PR into multiple smaller ones to make it easier for reviewers?

@juncgu-google juncgu-google removed the ready ONLY add when PR is ready to merge/full CI is needed label Dec 3, 2025
@juncgu-google juncgu-google force-pushed the cpu-offloading/dev-merge branch 2 times, most recently from 54785e1 to cc97559 Compare December 4, 2025 00:40
@juncgu-google juncgu-google added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 4, 2025
@juncgu-google juncgu-google force-pushed the cpu-offloading/dev-merge branch from 5dfe210 to e780427 Compare December 5, 2025 02:48
@dannawang0221
Collaborator

/lgtm

@juncgu-google juncgu-google force-pushed the cpu-offloading/dev-merge branch from e780427 to 87db579 Compare December 5, 2025 18:29
@juncgu-google
Collaborator Author

This is a really big PR and it's difficult to review.

Are you planning to split this PR into multiple smaller ones to make it easier for reviewers?

That's true. The core implementation is in the distributed/offload folder (~2400 lines); the rest is tests. However, the feature is modularized and makes negligible changes to the core of tpu-inference.
I will discuss with the reviewers how to re-organize the PR.

juncgu-google and others added 24 commits December 6, 2025 04:50
@juncgu-google juncgu-google force-pushed the cpu-offloading/dev-merge branch from 6113fb4 to 9ac152e Compare December 6, 2025 05:05