This release brings several new features and improvements to vLLM TPU Inference.
## Highlights
**Async Scheduler**: Enabled the async scheduler in tpu-inference for improved performance on smaller models.
**Spec Decoding (EAGLE-3)**: Added support for the EAGLE-3 variant, with verified performance on Llama 3.1-8B.
**Out-of-Tree Model Support**: Load custom JAX models as plugins, enabling users to serve custom model architectures without forking or modifying vLLM internals.
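Out-of-tree models typically hook in through vLLM's plugin mechanism. A minimal sketch of what such a registration can look like, assuming the upstream `vllm.general_plugins` entry-point group and `ModelRegistry.register_model` API apply to tpu-inference as well; the package, module, and class names below are hypothetical:

```python
# my_tpu_plugin/__init__.py -- hypothetical plugin package (sketch only).
#
# The package would advertise this function to vLLM via an entry point,
# e.g. in pyproject.toml:
#
#   [project.entry-points."vllm.general_plugins"]
#   register_my_model = "my_tpu_plugin:register"

def register():
    # Imported lazily so the plugin package itself stays importable
    # without vLLM installed.
    from vllm import ModelRegistry

    # Map an architecture name (as it appears in the model's config.json)
    # to the module path of the custom implementation.
    ModelRegistry.register_model(
        "MyJaxModelForCausalLM",
        "my_tpu_plugin.model:MyJaxModelForCausalLM",
    )
```

Once such a package is installed, the custom architecture can be served by name without touching vLLM or tpu-inference source.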
**Automated CI/CD and Pre-merge Checks**: Improved the testing and validation pipeline with automated CI/CD and pre-merge checks to enhance stability and accelerate iteration. More improvements to come.
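The scheduler and spec-decoding highlights can be exercised through engine arguments. A rough sketch using the offline `LLM` API, assuming the upstream vLLM argument names (`async_scheduling`, `speculative_config`) carry over to tpu-inference; the draft-model path is a placeholder:

```python
# Sketch only: argument names follow upstream vLLM conventions and may
# differ in tpu-inference. Requires a TPU environment with vLLM installed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    # Async scheduler (assumed engine arg, mirrors the --async-scheduling flag).
    async_scheduling=True,
    # EAGLE-3 speculative decoding (assumed config schema).
    speculative_config={
        "method": "eagle3",
        "model": "<path-to-eagle3-draft-model>",
        "num_speculative_tokens": 4,
    },
)

outputs = llm.generate(["Hello, TPU!"], SamplingParams(max_tokens=16))
for out in outputs:
    print(out.outputs[0].text)
```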
## What's Changed
- [Bug Fix] Fix small bug in server-based profiling init by @jrplatin in #872
- [Disagg][Bugfix] add check for global devices in profiler start by @sixiang-google in #874
- [CI] Fix imports to catchup vLLM's recent update. by @hfan in #876
- [Kernel] Added a RPA V3 kernel variant optimized for head_dim=64 by @yaochengji in #875
- Update README.md by @bvrockwell in #880
- Remove convert_list_to_device_array to reduce latency between model forward pass by @Lumosis in #879
- fix mock device error in profile enabling by @sixiang-google in #878
- [RPA] Reduce VREG spill by optimizing masking by @kyuyeunk in #818
- [Doc] Fixed the docker path for the quick start guide by @hosseinsarshar in #885
- Fix/docs links by @RobMulla in #873
- Light rewording of jax model development readme by @gpolovets1 in #871
- Revert "[CI] Fix imports to catchup vLLM's recent update." by @hfan in #887
- docs: Clarify support matrix messaging by @RobMulla in #886
- Docs rename reco page by @RobMulla in #888
- Update README.md by @bvrockwell in #897
- [Profiling] Pull Over the TPU Profiler from vLLM + add profiling docs by @jrplatin in #882
- [Misc] Fix various vLLM import issues by @jrplatin in #900
- Revert "[Misc] Fix various vLLM import issues" by @hfan in #902
- [Misc] Fix failing phased-based profiling test by @jrplatin in #905
- Added the docker login instructions by @hosseinsarshar in #891
- Unpin upstream vllm version by @jcyang43 in #904
- [Bug fix] Fix v7 HBM limit by @wenxindongwork in #903
- Enable spmd on lora by @vanbasten23 in #829
- Support --enforce-eager by @kyuyeunk in #907
- [CI] Fixes to catchup with vllm changes by @hfan in #912
- [Docker] Add V7X requirements and update Docker to accept option to build using it by @jrplatin in #916
- Fix the jax device ordering. by @wang2yn84 in #915
- Update the disagg multi host sh file to setup the disagg inference in… by @mrjunwan-lang in #922
- [Llama4/JAX] Refactor RoPE Scaling, QK Norm, and No-RoPE Layer Config Handling for Maverick by @sierraisland in #923
- [Bug fix + Qwix] Add JAX quantization YAMLs to WHL build + add fp8 quantization configs by @jrplatin in #929
- Enable multi-host P/D and adopt the vllm distributed executor changes by @mrjunwan-lang in #932
- fix the unit test to adopt vllm API changes by @mrjunwan-lang in #933
- [CI] Fix Qwen2.5 VL get_mrope_input_positions after vLLM change. by @kwang3939 in #934
- [Disagg] Use pathways resharding api to handle transfer by @sixiang-google in #935
- [Misc] Report TPU usage by @hfan in #925
- [CI] Use real vLLM ModelConfig object in init_device test by @hfan in #937
- update the ports to make the ports consistent in single host and multihost by @mrjunwan-lang in #938
- [Spec Decoding] Merge jitted helpers for eagle3 by @Lumosis in #920
- [GPT-OSS] JAX implementation of GPT-OSS by @bzgoogle in #861
- [Bug fixes] Update vLLM imports by @jrplatin in #947
- [Misc] Move numba installation to requirements.txt by @py4 in #948
- [Multi-host] Fix bugs in the deployment script by @Lumosis in #940
- Fix issues when running multiple LoRA tests on the v6e-8 machine. by @vanbasten23 in #926
- [Bug fixes] Fix a few more vLLM imports + Dockerfile typo by @jrplatin in #953
- Add the bgmv tests by @vanbasten23 in #942
- [MMLU] Add chat-template support for MMLU by @bzgoogle in #952
- [RPA] Add attention sink support to 64 dim variant of RPA kernel by @kyuyeunk in #958
- Revert "Add the bgmv tests" by @vanbasten23 in #963
- fix the vllm import issue for round_down by @mrjunwan-lang in #965
- Update docs to include installation guide with building from source. by @RobMulla in #949
- Reduce the host overhead for LoRA by @vanbasten23 in #930
- [GPT-OSS] uncomment sink related changes as the kernel_hd64.py was merged by @bzgoogle in #966
- Add bgmv test by @vanbasten23 in #964
- [CI] Skip build if only docs/icons changed by @boe20211 in #908
- [Spec Decoding] Fix precompilation by @Lumosis in #960
- fix the bug in kv transfer params is None by @mrjunwan-lang in #969
- [GPT-OSS] fix unstable sparse sum among different by @bzgoogle in #968
- fused Moe by @bythew3i in #973
- fix readme links to the docs by @RobMulla in #974
- [Feature] Code implementation of Async Scheduler by @cychiuak in #924
- [Misc] Fix observability config to prevent error from upstream by @py4 in #979
- add unit test for tpu_connector.py by @mrjunwan-lang in #980
- [Model] Add vision encoder and input embeddings merger warmup for Qwen2.5 VL model by @kwang3939 in #972
- Fix the test of multimodal manager by @kwang3939 in #986
- Fix the test of tpu_jax_runner by @kwang3939 in #989
- [Misc] Attempt to fix hash mismatch in CI if it's because of incomplete download by @py4 in #994
- [RPA] Update attention_sink to use prepare_inputs by @kyuyeunk in #993
- [Misc] Only run JAX unit tests and few e2e tests for each PR in CI. by @py4 in #995
- [Misc] Remove unused interfaces by @py4 in #990
- [Misc] Fix buildkite yaml format. by @py4 in #997
- Update README.md by @bvrockwell in #998
- Fix kv cache shape for head_dim=64 by @yaochengji in #976
- Add precommit hook for detecting missing init.py files by @jcyang43 in #1001
- Fix grid size calculation in qwen2.5-vl vision encoder warmup by @kwang3939 in #1004
- [Runner] Separate execute_model and sample_tokens to adapt upstream change. by @py4 in #1003
- [Misc] Change buildkite pipeline to run all steps but skip some through command by @py4 in #1005
- Fix .gitignore rule for support_matrices by @boe20211 in #1008
- Fix the wrong KVAggregator finished count cause dead loop, adopt vllm changes by @mrjunwan-lang in #1009
- [Docs] Update CONTRIBUTING.MD with testing info by @jrplatin in #1007
- Bump jax version to 0.8.0 by @py4 in #1006
- Introduce cpu agents for general-purpose tasks that don't require TPU by @jcyang43 in #1012
- Revert "Bump jax version to 0.8.0" by @py4 in #1013
- Integrate MoE kernel for torchax path by @lsy323 in #996
- Update yml config for speculative decoding ngram by @jcyang43 in #1020
- async scheduler fix _substitute_placeholder_token_fn bug by @cychiuak in #991
- Add async scheduler test to CI by @jcyang43 in #1010
- Fix default value for USE_MOE_EP_KERNEL by @kyuyeunk in #1014
- Data Parallelism support by @wenxindongwork in #865
- Revert "Data Parallelism support " by @wenxindongwork in #1028
- use sharded device put to handle torchax on multi host by @Chenyaaang in #1031
- [Disagg] update execute_model use according to upstream by @sixiang-google in #1032
- Support Data Parallelism by @wenxindongwork in #1035
- [Torchax] Add attention sink support in torchax by @kyuyeunk in #1038
- [Disagg] Fix unit test for execute_model update by @sixiang-google in #1046
- [Bugfix] Fix sinks related argument error by @kyuyeunk in #1044
- [Bugfix] Fix attention interface unit test by @kyuyeunk in #1048
- Fix API usage of jax.make_mesh by @Lumosis in #1051
- Set VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 by default for TPU inference by @xingliu14 in #1021
- [Kernel] implement 1st version of data-movement friendly MLA kernel with no kv update fused by @yaochengji in #1022
- Add lora layer tests by @vanbasten23 in #981
- [Refactor] Fix jax layers tests directory by @kyuyeunk in #1050
- [Llama4/JAX] Llama4 FP8 Quantized Weight Loading and Sharding by @sierraisland in #962
- [Bug fix] vLLM upstream compatibility. Fix DP scheduler by @wenxindongwork in #1057
- Update execute_model to support async scheduling in vllm by @sixiang-google in #1047
- [Torchax] Fix sink parameter initialization by @kyuyeunk in #1056
- [Torchax] Support bias and swiglu in MoE by @kyuyeunk in #1040
- Add instruction to set up pre-commit hooks. by @vanbasten23 in #1029
- initial commit on compressed-tensors quantization support for fp8 by @qihqi in #1011
- Fix Torchax backend on Pathways by @richardsliu in #1052
- Multi-slice mesh creation support by @wenxindongwork in #1060
- Add nightly releases for vllm/vllm-tpu docker by @jcyang43 in #1062
- Add kv quantization for gpt-oss by @rupeng-liu in #1063
- Fix lora layers by @richardsliu in #1068
- [KV] Change padding logic when head_dim is 64 by @kyuyeunk in #1064
- [Refactor] Remove jax suffix from file names by @kyuyeunk in #1070
- Remove JAX_RANDOM_WEIGHTS by @kyuyeunk in #1017
- [Llama4-Maverick/Optimization] Refactor: Standardize Sharding and Parallelism Configs in FFW/MoE Layers by @sierraisland in #1067
- [Disagg] Fixes for vllm model impl disagg support by @sixiang-google in #1066
- Ignore MoE test for now. by @qihqi in #1065
- [Bug fix] Pathways HBM calculation with DP enabled + Async scheduler precompilation by @wenxindongwork in #1072
- Fix typo in VLLM_TPU_USING_PATHWAYS flag by @richardsliu in #1074
- [DeepSeek] Optimize RoPE Cache to remove re-layout by @bzgoogle in #1073
- [CI] Introduce a default features to pre-set 'pass' status in the support matrix by @boe20211 in #1026
- [BUG FIX] Fix device order enforcement by @wenxindongwork in #1076
- [Feature] Add automated PyPI publishing workflow by @ylangtsou in #985
- Centralized environment variables by @xingliu14 in #1058
- Remove use of VLLM_USE_V1. by @QiliangCui in #1083
- [CI] fix bug from RoPE by @bzgoogle in #1084
- Fix tensor type check in convert_to_torchax_and_shard by @richardsliu in #1085
- Create Utils functions for PP by @Chenyaaang in #1042
- Add buildkite-test-collector for testsuite setup by @pv97 in #1033
- Add email alerts to buildkite pipelines by @pv97 in #1030
- [GPT-OSS] Load MXFP4 and BF16 weights directly and enable online requantization by @amishacorns in #992
- Upgrade jax version to 0.8.0 by @vanbasten23 in #1087
- Refactor(nnx): Remove Redundant FFW Call in SharedExpertsTransformerBlock by @sierraisland in #1089
- Add more lora wrapper unit tests by @vanbasten23 in #1036
- [CI] fix unit test by @bzgoogle in #1086
- Revert "Upgrade jax version to 0.8.0" by @vanbasten23 in #1094
- Update Pathways HBM calculation by @wenxindongwork in #1097
- [BugFix] Remove redundant device_put. by @Lumosis in #1099
- Add email alerts for nightly and torch pipelines by @pv97 in #1095
- Enable Pipeline Parallelism on torchax path by @Chenyaaang in #1055
- Enable Pipeline Parallelism to use mp as distributed backend on Jax TPU platform by @Chenyaaang in #1054
- [Refactor] Move shared files to common by @kyuyeunk in #1092
- [Torchax] Add initial support for loading mxfp4 by @kyuyeunk in #1080
- Add test_envs.py by @xingliu14 in #1079
- Use Qwen 1.5b for e2e DP test by @wenxindongwork in #1105
- Added tuned kernel block size for customer's model by @cychiuak in #1071
- [Bugfix] Fix attention backend signature by @kyuyeunk in #1103
- [CI] Fix head dim check by @kyuyeunk in #1091
- Consolidate quant method names into a single file by @kyuyeunk in #1101
- Upgrade jax version to 0.8.0 by @vanbasten23 in #1107
- Change email to create buganizer ticket for cmcs oncall by @pv97 in #1108
- Fix import error from pr 1103 by @QiliangCui in #1109
- [RPA] sliding window optimization by @kyuyeunk in #1110
- [Bug fix] Return correct log probability for DP by @wenxindongwork in #1106
- [Disagg] Misc fix for vllm model loading in disagg by @sixiang-google in #1098
- [CI] Fix mxfp4 test by @kyuyeunk in #1111
- Fix sharding mismatch caused recompilation in Qwen2.5-vl-7b integration test by @kwang3939 in #1117
- Enable Pipeline Parallelism on Ray by @Chenyaaang in #1078
- [Misc] Change default device for vllm_get_model by @sixiang-google in #1116
- Change from email to slack notifications on vllm#tpu-ci-notifications by @pv97 in #1118
- Separate model and feature support matrices by category by @boe20211 in #1100
- [Bugfix] Fix error where vLLM expects numpy sampled token ids by @kyuyeunk in #1119
- Update ray test with new sharding config by @Chenyaaang in #1123
- [CI] Fix tpu worker test by @kyuyeunk in #1124
- Add google chat notifications by @pv97 in #1125
- Split quantized features into separate YAML files, and add MoE and MLA to the feature support matrix by @boe20211 in #1120
- Centralizes environment variable access by routing variables reads through the envs.py module. by @xingliu14 in #1102
- Revert vllm weight loading to fix OOM by @kyuyeunk in #1127
- [Misc] Add CODEOWNERS to the project. by @py4 in #988
- [Bug fix] Fix log probabilities handling by @wenxindongwork in #1114
- [Misc] Update codeowners. by @py4 in #1129
- Enable Pipeline Parallelism on jax worker by @Chenyaaang in #1043
- [Misc] Update code owners file path by @kyuyeunk in #1130
- add uses_sampler to ray executor by @Chenyaaang in #1131
- [Spec Decoding] Fix API error caused by upstream. by @py4 in #1133
- feat(ci): Record vLLM and tpu-inference commit hashes by @dennisYehCienet in #795
- [Llama4 Guard] Add JAX Llama-Guard-4-12B Text Portion by @JiriesKaileh in #1090
- Rename model support matrix CSV files to include _model in the filename and update MLA/MoE references to kernel support matrix by @boe20211 in #1134
- [Llama4/Test] Add unit test for _get_expert_num in Llama4WeightLoader by @sierraisland in #1138
- [wip] update torchvision to 0.24.0 so that it uses torch 2.9 by @QiliangCui in #1135
- [Misc] Move llama guard warning from top-level to class-level by @py4 in #1140
- Skip size calculation during async copy wait=True by @rupengliu-meta in #1126
- [MISC] Resolving misleading logging for torchax fallback in model loading by @JiriesKaileh in #1144
- fix the tpu_worker failed in Ray with wrong devices set by @mrjunwan-lang in #1146
- Extend feature support matrix to track out-of-tree and sampling features by @boe20211 in #1145
- Revert "Skip size calculation during async copy wait=True" by @kyuyeunk in #1148
- Remove unnecessary mock objects by @mailvijayasingh in #1149
- [Spec Decoding][Bugfix] Use draft_config properly to support other models by @py4 in #1142
- remove wip models from model_loader by @mailvijayasingh in #1143
- [Fixed] Skip size calculation during async copy by @rupengliu-meta in #1152
- Revert previous API changes due to upstream change. by @Lumosis in #1155
- Eager resolution for vllm.current_platform when running Pathways by @richardsliu in #1150
- Fix lora test by removing LoRA extra vocab by @vanbasten23 in #1156
- Fix numerical issue on hybrid kv cache allocation by @Chenyaaang in #1139
- Use FP8_e5m2 automatically when using quantized kv cache FP8 on trillium by @zixi-qi in #1136
- [FIX] Add dummy get_input_embeddings to fix vLLM model type check by @kuafou in #971
- [FusedMoE] Support sub-channel quantization: FP4, FP8, INT8, ... by @bythew3i in #1158
- [DP] Functional DP for GPT-OSS by @wenxindongwork in #1137
- Use Qwen for DP correctness test by @wenxindongwork in #1160
- Fix the unit test failure by @mrjunwan-lang in #1162
- Implement runai model streamer for MODEL_IMPL_TYPE=flax_nnx by @amacaskill in #955
- Fix lora e2e tests by removing LoRA extra vocab. by @vanbasten23 in #1164
- [Spec Decoding][Eagle3] Fix bug of eagle-3 not being compatible with non-8b models. by @py4 in #1165
- [Kernel][FusedMoE] Add support for bias by @kyuyeunk in #1167
- Fix load_context to use nullcontext() if running under Pathways by @richardsliu in #1172
- [Model] Add vision encoder padding and warmup for Qwen2.5 VL model by @kwang3939 in #1151
- Add vLLM commit hash to the Docker tag by @dennisYehCienet in #1179
- [Bug Fix] Fix `AttributeError: 'str' object has no attribute 'page_size_bytes'` issue by @jrplatin in #1183
- Make DP test soft-fail temporarily as feature is still WIP by @jcyang43 in #1184
- Skip DP test until feature is ready by @jcyang43 in #1187
- [Bugfix] Fix attention not found error by @kyuyeunk in #1186
- [Buildkite] Separate eagle3 and ngram for the matrix. by @py4 in #1178
- Reduce throughput threshold for gemma3 to tolerate performance fluctuation. by @Lumosis in #1176
- [Buildkite] Add E2E test for structured decoding to nightly by @py4 in #1188
- [RPA][Kernel] Update hd64 variant sliding window code by @kyuyeunk in #1180
- [Sampling][Bugfix] Use different rng per step + add e2e tests by @py4 in #1189
- [CI] Lower baseline threshold for qwen3-30b-a3b by @kyuyeunk in #1191
- Add Lora torch to feature matrix. by @vanbasten23 in #1173
- [RPA] Pipeline flash attention in hd64 kernel by @yuyanpeng-google in #1194
- Remove GPT-OSS from registry by @kyuyeunk in #1193
- Add recommended TPU generations column in quantization support matrix by @boe20211 in #1181
- Add feature support matrices by @boe20211 in #1177
- Update nightly docker image tag by @dennisYehCienet in #1182
- Add e2e test for model registration (out of tree plugins) by @karan in #1171
- Centralizes environment variable access by routing variables reads through the envs.py module by @xingliu14 in #1147
- [MISC] Removed problematic local path for CONFTEST_DIR by @JiriesKaileh in #1141
- [ONCALL] fix NEW_MODEL_DESIGN flag values from True to 1 by @bzgoogle in #1204
- [CI] Improve the procedure of waiting vllm serve by @dennisYehCienet in #1196
- Separate the docker setup and docker execution by @mrjunwan-lang in #1205
- improved non-continuous block insert impl for disagg perf by @sixiang-google in #1202
- [Misc] Fix model dtype not being configured correctly by @kyuyeunk in #1093
- Add env_with_choices function and apply it by @xingliu14 in #1200
- Revert "Remove GPT-OSS from registry" by @kyuyeunk in #1208
- Enable Pipeline Parallelism on Jax runner by @Chenyaaang in #1053
- [Bugfix] Fix error when using trust remote code by @kyuyeunk in #1198
- Enable expert parallelism in mxfp4 path by @kyuyeunk in #1195
- Fix lora e2e test due to upstream change by @vanbasten23 in #1210
- Increase threshold of perf test sensitivity to 6% by @karan in #1218
- Fix dp sharding for compute_logits_func by @kyuyeunk in #1212
- [Bug fix] Fix E2E DP test by @wenxindongwork in #1206
- [Buildkite] Move DP e2e test to DP.yml and run as nightly by @py4 in #1221
- [Buildkite] Merge lora tests pipelines by @py4 in #1222
- [Buildkite] Fix pipeline_jax.yml by @py4 in #1223
- [Buildkite] Move multi chip lora to nightly, keep single chip for pre merge. by @py4 in #1224
- [Buildkite] Roll back lora changes in buildkite by @py4 in #1227
- [Test] Fix broken tests due to upstream change. by @py4 in #1228
- Update the disagg multi_host script to auto-launch the proxy by @mrjunwan-lang in #1229
- Remove deprecated arg in vllm serve command by @dennisYehCienet in #1230
- fix unit test for tpu_connect update by @mrjunwan-lang in #1233
- [Spec][Eagle3] Improve perf and compilation time by @py4 in #1192
- [Misc] Update Attention backend registry by @kyuyeunk in #1215
## New Contributors
- @hosseinsarshar made their first contribution in #885
- @cychiuak made their first contribution in #924
- @richardsliu made their first contribution in #1052
- @rupeng-liu made their first contribution in #1063
- @pv97 made their first contribution in #1033
- @amishacorns made their first contribution in #992
- @mailvijayasingh made their first contribution in #1149
- @zixi-qi made their first contribution in #1136
- @kuafou made their first contribution in #971
- @amacaskill made their first contribution in #955
- @yuyanpeng-google made their first contribution in #1194
**Full Changelog**: v0.11.1...v0.12.0