This release brings several new features and improvements to vLLM TPU Inference.
## Highlights
**Async Scheduler**: Enabled the async scheduler in tpu-inference for improved performance on smaller models.
**Spec Decoding (EAGLE-3)**: Added support for the EAGLE-3 variant, with verified performance on Llama 3.1-8B.
**Out-of-Tree Model Support**: Load custom JAX models as plugins, enabling users to serve custom model architectures without forking or modifying vLLM internals.
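Out-of-tree models typically hook in through vLLM's plugin mechanism. A minimal sketch of what such a registration can look like, assuming the upstream `vllm.general_plugins` entry-point group and `ModelRegistry.register_model` API apply to tpu-inference as well; the package, module, and class names below are hypothetical:

```python
# my_tpu_plugin/__init__.py -- hypothetical plugin package (sketch only).
#
# The package would advertise this function to vLLM via an entry point,
# e.g. in pyproject.toml:
#
#   [project.entry-points."vllm.general_plugins"]
#   register_my_model = "my_tpu_plugin:register"

def register():
    # Imported lazily so the plugin package itself stays importable
    # without vLLM installed.
    from vllm import ModelRegistry

    # Map an architecture name (as it appears in the model's config.json)
    # to the module path of the custom implementation.
    ModelRegistry.register_model(
        "MyJaxModelForCausalLM",
        "my_tpu_plugin.model:MyJaxModelForCausalLM",
    )
```

Once such a package is installed, the custom architecture can be served by name without touching vLLM or tpu-inference source.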
**Automated CI/CD and Pre-merge Checks**: Improved the testing and validation pipeline with automated CI/CD and pre-merge checks to enhance stability and accelerate iteration. More improvements to come.
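The scheduler and spec-decoding highlights can be exercised through engine arguments. A rough sketch using the offline `LLM` API, assuming the upstream vLLM argument names (`async_scheduling`, `speculative_config`) carry over to tpu-inference; the draft-model path is a placeholder:

```python
# Sketch only: argument names follow upstream vLLM conventions and may
# differ in tpu-inference. Requires a TPU environment with vLLM installed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    # Async scheduler (assumed engine arg, mirrors the --async-scheduling flag).
    async_scheduling=True,
    # EAGLE-3 speculative decoding (assumed config schema).
    speculative_config={
        "method": "eagle3",
        "model": "<path-to-eagle3-draft-model>",
        "num_speculative_tokens": 4,
    },
)

outputs = llm.generate(["Hello, TPU!"], SamplingParams(max_tokens=16))
for out in outputs:
    print(out.outputs[0].text)
```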
## What's Changed
- [Bug Fix] Fix small bug in server-based profiling init by @jrplatin in #872
- [Disagg][Bugfix] add check for global devices in profiler start by @sixiang-google in #874
- [CI] Fix imports to catchup vLLM's recent update. by @hfan in #876
- [Kernel] Added a RPA V3 kernel variant optimized for head_dim=64 by @yaochengji in #875
- Update README.md by @bvrockwell in #880
- Remove convert_list_to_device_array to reduce latency between model forward pass by @Lumosis in #879
- fix mock device error in profile enabling by @sixiang-google in #878
- [RPA] Reduce VREG spill by optimizing masking by @kyuyeunk in #818
- [Doc] Fixed the docker path for the quick start guide by @hosseinsarshar in #885
- Fix/docs links by @RobMulla in #873
- Light rewording of jax model development readme by @gpolovets1 in #871
- Revert "[CI] Fix imports to catchup vLLM's recent update." by @hfan in #887
- docs: Clarify support matrix messaging by @RobMulla in #886
- Docs rename reco page by @RobMulla in #888
- Update README.md by @bvrockwell in #897
- [Profiling] Pull Over the TPU Profiler from vLLM + add profiling docs by @jrplatin in #882
- [Misc] Fix various vLLM import issues by @jrplatin in #900
- Revert "[Misc] Fix various vLLM import issues" by @hfan in #902
- [Misc] Fix failing phased-based profiling test by @jrplatin in #905
- Added the docker login instructions by @hosseinsarshar in #891
- Unpin upstream vllm version by @jcyang43 in #904
- [Bug fix] Fix v7 HBM limit by @wenxindongwork in #903
- Enable spmd on lora by @vanbasten23 in #829
- Support --enforce-eager by @kyuyeunk in #907
- [CI] Fixes to catchup with vllm changes by @hfan in #912
- [Docker] Add V7X requirements and update Docker to accept option to build using it by @jrplatin in #916
- Fix the jax device ordering. by @wang2yn84 in #915
- Update the disagg multi host sh file to setup the disagg inference in… by @mrjunwan-lang in #922
- [Llama4/JAX] Refactor RoPE Scaling, QK Norm, and No-RoPE Layer Config Handling for Maverick by @sierraisland in #923
- [Bug fix + Qwix] Add JAX quantization YAMLs to WHL build + add fp8 quantization configs by @jrplatin in #929
- Enable multi-host P/D and adopt the vllm distributed executor changes by @mrjunwan-lang in #932
- fix the unit test to adopt vllm API changes by @mrjunwan-lang in #933
- [CI] Fix Qwen2.5 VL get_mrope_input_positions after vLLM change. by @kwang3939 in #934
- [Disagg] Use pathways resharding api to handle transfer by @sixiang-google in #935
- [Misc] Report TPU usage by @hfan in #925
- [CI] Use real vLLM ModelConfig object in init_device test by @hfan in #937
- update the ports to make the ports consistent in single host and multihost by @mrjunwan-lang in #938
- [Spec Decoding] Merge jitted helpers for eagle3 by @Lumosis in #920
- [GPT-OSS] JAX implementation of GPT-OSS by @bzgoogle in #861
- [Bug fixes] Update vLLM imports by @jrplatin in #947
- [Misc] Move numba installation to requirements.txt by @py4 in #948
- [Multi-host] Fix bugs in the deployment script by @Lumosis in #940
- Fix issues when running multiple LoRA tests on the v6e-8 machine. by @vanbasten23 in #926
- [Bug fixes] Fix a few more vLLM imports + Dockerfile typo by @jrplatin in #953
- Add the bgmv tests by @vanbasten23 in #942
- [MMLU] Add chat-template support for MMLU by @bzgoogle in #952
- [RPA] Add attention sink support to 64 dim variant of RPA kernel by @kyuyeunk in #958
- Revert "Add the bgmv tests" by @vanbasten23 in #963
- fix the vllm import issue for round_down by @mrjunwan-lang in #965
- Update docs to include installation guide with building from source. by @RobMulla in #949
- Reduce the host overhead for LoRA by @vanbasten23 in #930
- [GPT-OSS] uncomment sink related changes as the kernel_hd64.py was merged by @bzgoogle in #966
- Add bgmv test by @vanbasten23 in #964
- [CI] Skip build if only docs/icons changed by @boe20211 in #908
- [Spec Decoding] Fix precompilation by @Lumosis in #960
- fix the bug in kv transfer params is None by @mrjunwan-lang in #969
- [GPT-OSS] fix unstable sparse sum among different by @bzgoogle in #968
- fused Moe by @bythew3i in #973
- fix readme links to the docs by @RobMulla in #974
- [Feature] Code implementation of Async Scheduler by @cychiuak in #924
- [Misc] Fix observability config to prevent error from upstream by @py4 in #979
- add unit test for tpu_connector.py by @mrjunwan-lang in #980
- [Model] Add vision encoder and input embeddings merger warmup for Qwen2.5 VL model by @kwang3939 in #972
- Fix the test of multimodal manager by @kwang3939 in #986
- Fix the test of tpu_jax_runner by @kwang3939 in #989
- [Misc] Attempt to fix hash mismatch in CI if it's because of incomplete download by @py4 in #994
- [RPA] Update attention_sink to use prepare_inputs by @kyuyeunk in #993
- [Misc] Only run JAX unit tests and few e2e tests for each PR in CI. by @py4 in #995
- [Misc] Remove unused interfaces by @py4 in #990
- [Misc] Fix buildkite yaml format. by @py4 in #997
- Update README.md by @bvrockwell in #998
- Fix kv cache shape for head_dim=64 by @yaochengji in #976
- Add precommit hook for detecting missing init.py files by @jcyang43 in #1001
- Fix grid size calculation in qwen2.5-vl vision encoder warmup by @kwang3939 in #1004
- [Runner] Separate execute_model and sample_tokens to adapt upstream change. by @py4 in #1003
- [Misc] Change buildkite pipeline to run all steps but skip some through command by @py4 in #1005
- Fix .gitignore rule for support_matrices by @boe20211 in #1008
- Fix the wrong KVAggregator finished count cause dead loop, adopt vllm changes by @mrjunwan-lang in #1009
- [Docs] Update CONTRIBUTING.MD with testing info by @jrplatin in #1007
- Bump jax version to 0.8.0 by @py4 in #1006
- Introduce cpu agents for general-purpose tasks that don't require TPU by @jcyang43 in #1012
- Revert "Bump jax version to 0.8.0" by @py4 in #1013
- Integrate MoE kernel for torchax path by @lsy323 in #996
- Update yml config for speculative decoding ngram by @jcyang43 in #1020
- async scheduler fix _substitute_placeholder_token_fn bug by @cychiuak in #991
- Add async scheduler test to CI by @jcyang43 in #1010
- Fix default value for USE_MOE_EP_KERNEL by @kyuyeunk in #1014
- Data Parallelism support by @wenxindongwork in #865
- Revert "Data Parallelism support " by @wenxindongwork in #1028
- use sharded device put to handle torchax on multi host by @Chenyaaang in #1031
- [Disagg] update execute_model use according to upstream by @sixiang-google in #1032
- Support Data Parallelism by @wenxindongwork in #1035
- [Torchax] Add attention sink support in torchax by @kyuyeunk in #1038
- [Disagg] Fix unit test for execute_model update by @sixiang-google in #1046
- [Bugfix] Fix sinks related argument error by @kyuyeunk in #1044
- [Bugfix] Fix attention interface unit test by @kyuyeunk in #1048
- Fix API usage of jax.make_mesh by @Lumosis in #1051
- Set VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 by default for TPU inference by @xingliu14 in #1021
- [Kernel] implement 1st version of data-movement friendly MLA kernel with no kv update fused by @yaochengji in #1022
- Add lora layer tests by @vanbasten23 in #981
- [Refactor] Fix jax layers tests directory by @kyuyeunk in #1050
- [Llama4/JAX] Llama4 FP8 Quantized Weight Loading and Sharding by @sierraisland in #962
- [Bug fix] vLLM upstream compatibility. Fix DP scheduler by @wenxindongwork in #1057
- Update execute_model to support async scheduling in vllm by @sixiang-google in #1047
- [Torchax] Fix sink parameter initialization by @kyuyeunk in #1056
- [Torchax] Support bias and swiglu in MoE by @kyuyeunk in #1040
- Add instruction to set up pre-commit hooks. by @vanbasten23 in #1029
- initial commit on compressed-tensors quantization support for fp8 by @qihqi in #1011
- Fix Torchax backend on Pathways by @richardsliu in #1052
- Multi-slice mesh creation support by @wenxindongwork in #1060
- Add nightly releases for vllm/vllm-tpu docker by @jcyang43 in #1062
- Add kv quantization for gpt-oss by @rupeng-liu in #1063
- Fix lora layers by @richardsliu in #1068
- [KV] Change padding logic when head_dim is 64 by @kyuyeunk in #1064
- [Refactor] Remove jax suffix from file names by @kyuyeunk in #1070
- Remove JAX_RANDOM_WEIGHTS by @kyuyeunk in #1017
- [Llama4-Maverick/Optimization] Refactor: Standardize Sharding and Parallelism Configs in FFW/MoE Layers by @sierraisland in #1067
- [Disagg] Fixes for vllm model impl disagg support by @sixiang-google in #1066
- Ignore MoE test for now. by @qihqi in #1065
- [Bug fix] Pathways HBM calculation with DP enabled + Async scheduler precompilation by @wenxindongwork in #1072
- Fix typo in VLLM_TPU_USING_PATHWAYS flag by @richardsliu in #1074
- [DeepSeek] Optimize RoPE Cache to remove re-layout by @bzgoogle in #1073
- [CI] Introduce a default features to pre-set 'pass' status in the support matrix by @boe20211 in #1026
- [BUG FIX] Fix device order enforcement by @wenxindongwork in #1076
- [Feature] Add automated PyPI publishing workflow by @ylangtsou in #985
- Centralized environment variables by @xingliu14 in #1058
- Remove use of VLLM_USE_V1. by @QiliangCui in #1083
- [CI] fix bug from RoPE by @bzgoogle in #1084
- Fix tensor type check in convert_to_torchax_and_shard by @richardsliu in #1085
- Create Utils functions for PP by @Chenyaaang in #1042
- Add buildkite-test-collector for testsuite setup by @pv97 in #1033
- Add email alerts to buildkite pipelines by @pv97 in #1030
- [GPT-OSS] Load MXFP4 and BF16 weights directly and enable online requantization by @amishacorns in #992
- Upgrade jax version to 0.8.0 by @vanbasten23 in #1087
- Refactor(nnx): Remove Redundant FFW Call in SharedExpertsTransformerBlock by @sierraisland in #1089
- Add more lora wrapper unit tests by @vanbasten23 in #1036
- [CI] fix unit test by @bzgoogle in #1086
- Revert "Upgrade jax version to 0.8.0" by @vanbasten23 in #1094
- Update Pathways HBM calculation by @wenxindongwork in #1097
- [BugFix] Remove redundant device_put. by @Lumosis in #1099
- Add email alerts for nightly and torch pipelines by @pv97 in #1095
- Enable Pipeline Parallelism on torchax path by @Chenyaaang in #1055
- Enable Pipeline Parallelism to use mp as distributed backend on Jax TPU platform by @Chenyaaang in #1054
- [Refactor] Move shared files to common by @kyuyeunk in #1092
- [Torchax] Add initial support for loading mxfp4 by @kyuyeunk in #1080
- Add test_envs.py by @xingliu14 in #1079
- Use Qwen 1.5b for e2e DP test by @wenxindongwork in #1105
- Added tuned kernel block size for customer's model by @cychiuak in #1071
- [Bugfix] Fix attention backend signature by @kyuyeunk in #1103
- [CI] Fix head dim check by @kyuyeunk in #1091
- Consolidate quant method names into a single file by @kyuyeunk in #1101
- Upgrade jax version to 0.8.0 by @vanbasten23 in #1107
- Change email to create buganizer ticket for cmcs oncall by @pv97 in #1108
- Fix import error from pr 1103 by @QiliangCui in #1109
- [RPA] sliding window optimization by @kyuyeunk in #1110
- [Bug fix] Return correct log probability for DP by @wenxindongwork in #1106
- [Disagg] Misc fix for vllm model loading in disagg by @sixiang-google in #1098
- [CI] Fix mxfp4 test by @kyuyeunk in #1111
- Fix sharding mismatch caused recompilation in Qwen2.5-vl-7b integration test by @kwang3939 in #1117
- Enable Pipeline Parallelism on Ray by @Chenyaaang in #1078
- [Misc] Change default device for vllm_get_model by @sixiang-google in #1116
- Change from email to slack notifications on vllm#tpu-ci-notifications by @pv97 in #1118
- Separate model and feature support matrices by category by @boe20211 in #1100
- [Bugfix] Fix error where vLLM expects numpy sampled token ids by @kyuyeunk in #1119
- Update ray test with new sharding config by @Chenyaaang in #1123
- [CI] Fix tpu worker test by @kyuyeunk in #1124
- Add google chat notifications by @pv97 in #1125
- Split quantized features into separate YAML files, and add MoE and MLA to the feature support matrix by @boe20211 in #1120
- Centralizes environment variable access by routing variables reads through the envs.py module. by @xingliu14 in #1102
- Revert vllm weight loading to fix OOM by @kyuyeunk in #1127
- [Misc] Add CODEOWNERS to the project. by @py4 in #988
- [Bug fix] Fix log probabilities handling by @wenxindongwork in #1114
- [Misc] Update codeowners. by @py4 in #1129
- Enable Pipeline Parallelism on jax worker by @Chenyaaang in #1043
- [Misc] Update code owners file path by @kyuyeunk in #1130
- add uses_sampler to ray executor by @Chenyaaang in #1131
- [Spec Decoding] Fix API error caused by upstream. by @py4 in #1133
- feat(ci): Record vLLM and tpu-inference commit hashes by @dennisYehCienet in #795
- [Llama4 Guard] Add JAX Llama-Guard-4-12B Text Portion by @JiriesKaileh in #1090
- Rename model support matrix CSV files to include _model in the filename and update MLA/MoE references to kernel support matrix by @boe20211 in #1134
- [Llama4/Test] Add unit test for _get_expert_num in Llama4WeightLoader by @sierraisland in #1138
- [wip] update torchvision to 0.24.0 so that it uses torch 2.9 by @QiliangCui in #1135
- [Misc] Move llama guard warning from top-level to class-level by @py4 in #1140
- Skip size calculation during async copy wait=True by @rupengliu-meta in #1126
- [MISC] Resolving misleading logging for torchax fallback in model loading by @JiriesKaileh in #1144
- fix the tpu_worker failed in Ray with wrong devices set by @mrjunwan-lang in #1146
- Extend feature support matrix to track out-of-tree and sampling features by @boe20211 in #1145
- Revert "Skip size calculation during async copy wait=True" by @kyuyeunk in #1148
- Remove unnecessary mock objects by @mailvijayasingh in #1149
- [Spec Decoding][Bugfix] Use draft_config properly to support other models by @py4 in #1142
- remove wip models from model_loader by @mailvijayasingh in #1143
- [Fixed] Skip size calculation during async copy by @rupengliu-meta in #1152
- Revert previous API changes due to upstream change. by @Lumosis in #1155
- Eager resolution for vllm.current_platform when running Pathways by @richardsliu in #1150
- Fix lora test by removing LoRA extra vocab by @vanbasten23 in #1156
- Fix numerical issue on hybrid kv cache allocation by @Chenyaaang in #1139
- Use FP8_e5m2 automatically when using quantized kv cache FP8 on trillium by @zixi-qi in #1136
- [FIX] Add dummy get_input_embeddings to fix vLLM model type check by @kuafou in #971
- [FusedMoE] Support sub-channel quantization: FP4, FP8, INT8, ... by @bythew3i in #1158
- [DP] Functional DP for GPT-OSS by @wenxindongwork in #1137
- Use Qwen for DP correctness test by @wenxindongwork in #1160
- Fix the unit test failure by @mrjunwan-lang in #1162
- Implement runai model streamer for MODEL_IMPL_TYPE=flax_nnx by @amacaskill in #955
- Fix lora e2e tests by removing LoRA extra vocab. by @vanbasten23 in #1164
- [Spec Decoding][Eagle3] Fix bug of eagle-3 not being compatible with non-8b models. by @py4 in #1165
- [Kernel][FusedMoE] Add support for bias by @kyuyeunk in #1167
- Fix load_context to use nullcontext() if running under Pathways by @richardsliu in #1172
- [Model] Add vision encoder padding and warmup for Qwen2.5 VL model by @kwang3939 in #1151
- Add vLLM commit hash to the Docker tag by @dennisYehCienet in #1179
- [Bug Fix] Fix `AttributeError: 'str' object has no attribute 'page_size_bytes'` issue by @jrplatin in #1183
- Make DP test soft-fail temporarily as feature is still WIP by @jcyang43 in #1184
- Skip DP test until feature is ready by @jcyang43 in #1187
- [Bugfix] Fix attention not found error by @kyuyeunk in #1186
- [Buildkite] Separate eagle3 and ngram for the matrix. by @py4 in #1178
- Reduce throughput threshold for gemma3 to tolerate performance fluctuation. by @Lumosis in #1176
- [Buildkite] Add E2E test for structured decoding to nightly by @py4 in #1188
- [RPA][Kernel] Update hd64 variant sliding window code by @kyuyeunk in #1180
- [Sampling][Bugfix] Use different rng per step + add e2e tests by @py4 in #1189
- [CI] Lower baseline threshold for qwen3-30b-a3b by @kyuyeunk in #1191
- Add Lora torch to feature matrix. by @vanbasten23 in #1173
- [RPA] Pipeline flash attention in hd64 kernel by @yuyanpeng-google in #1194
- Remove GPT-OSS from registry by @kyuyeunk in #1193
- Add recommended TPU generations column in quantization support matrix by @boe20211 in #1181
- Add feature support matrices by @boe20211 in #1177
- Update nightly docker image tag by @dennisYehCienet in #1182
- Add e2e test for model registration (out of tree plugins) by @karan in #1171
- Centralizes environment variable access by routing variables reads through the envs.py module by @xingliu14 in #1147
- [MISC] Removed problematic local path for CONFTEST_DIR by @JiriesKaileh in #1141
- [ONCALL] fix NEW_MODEL_DESIGN flag values from True to 1 by @bzgoogle in #1204
- [CI] Improve the procedure of waiting vllm serve by @dennisYehCienet in #1196
- Separate the docker setup and docker execution by @mrjunwan-lang in #1205
- improved non-continuous block insert impl for disagg perf by @sixiang-google in #1202
- [Misc] Fix model dtype not being configured correctly by @kyuyeunk in #1093
- Add env_with_choices function and apply it by @xingliu14 in #1200
- Revert "Remove GPT-OSS from registry" by @kyuyeunk in #1208
- Enable Pipeline Parallelism on Jax runner by @Chenyaaang in #1053
- [Bugfix] Fix error when using trust remote code by @kyuyeunk in #1198
- Enable expert parallelism in mxfp4 path by @kyuyeunk in #1195
- Fix lora e2e test due to upstream change by @vanbasten23 in #1210
- Increase threshold of perf test sensitivity to 6% by @karan in #1218
- Fix dp sharding for compute_logits_func by @kyuyeunk in #1212
- [Bug fix] Fix E2E DP test by @wenxindongwork in #1206
- [Buildkite] Move DP e2e test to DP.yml and run as nightly by @py4 in #1221
- [Buildkite] Merge lora tests pipelines by @py4 in #1222
- [Buildkite] Fix pipeline_jax.yml by @py4 in #1223
- [Buildkite] Move multi chip lora to nightly, keep single chip for pre merge. by @py4 in #1224
- [Buildkite] Roll back lora changes in buildkite by @py4 in #1227
- [Test] Fix broken tests due to upstream change. by @py4 in #1228
- Update the disagg multi_host script to auto-launch the proxy by @mrjunwan-lang in #1229
- Remove deprecated arg in vllm serve command by @dennisYehCienet in #1230
- fix unit test for tpu_connect update by @mrjunwan-lang in #1233
- [Spec][Eagle3] Improve perf and compilation time by @py4 in #1192
- [Misc] Update Attention backend registry by @kyuyeunk in #1215
## New Contributors
- @hosseinsarshar made their first contribution in #885
- @cychiuak made their first contribution in #924
- @richardsliu made their first contribution in #1052
- @rupeng-liu made their first contribution in #1063
- @pv97 made their first contribution in #1033
- @amishacorns made their first contribution in #992
- @mailvijayasingh made their first contribution in #1149
- @zixi-qi made their first contribution in #1136
- @kuafou made their first contribution in #971
- @amacaskill made their first contribution in #955
- @yuyanpeng-google made their first contribution in #1194
**Full Changelog**: v0.11.1...v0.12.0