[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building #28579
Conversation
Documentation preview: https://vllm--28579.org.readthedocs.build/en/28579/
Code Review
This pull request refactors the CUDA graph padding logic, moving it from individual attention backends into the gpu_model_runner. This centralization is a good improvement for maintainability. The BatchDescriptor has also been updated to be more descriptive. While the overall direction is positive, I've identified two critical bugs in the implementation within gpu_model_runner.py that could lead to incorrect behavior or prevent CUDA graph optimizations from being applied. Please see the detailed comments for fixes.
vllm/v1/worker/gpu_model_runner.py (outdated)

```python
        )
        uniform_decode = (
            (max_num_scheduled_tokens == self.uniform_decode_query_len)
            and (num_reqs == max_num_scheduled_tokens)
```
The condition for `uniform_decode` seems incorrect. `num_reqs == max_num_scheduled_tokens` will only be true in very specific cases (e.g., a single decode request when `uniform_decode_query_len` is 1), preventing most uniform decode batches from being correctly identified. This will likely disable CUDA graph optimizations for decode paths.
The condition should probably check whether the total number of tokens equals the number of requests multiplied by the query length, similar to the previous implementation.
Suggested change:

```diff
-            and (num_reqs == max_num_scheduled_tokens)
+            and (num_tokens_unpadded == num_reqs * max_num_scheduled_tokens)
```
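For illustration, a minimal, self-contained sketch of how the suggested condition differs from the original one; this is not vLLM's actual implementation, and the variable names are only assumed from the snippet above:

```python
# Minimal sketch, assuming the variable names from the snippet above; this is
# not vLLM's actual implementation, just an illustration of the two checks.
def is_uniform_decode(
    num_tokens_unpadded: int,
    num_reqs: int,
    max_num_scheduled_tokens: int,
    uniform_decode_query_len: int,
) -> bool:
    # A batch is a uniform decode when every request is scheduled with exactly
    # the decode query length, i.e. total tokens == num_reqs * query_len.
    return (
        max_num_scheduled_tokens == uniform_decode_query_len
        and num_tokens_unpadded == num_reqs * max_num_scheduled_tokens
    )


# Example: 8 decode requests of 1 token each (uniform_decode_query_len == 1).
# The original check `num_reqs == max_num_scheduled_tokens` evaluates 8 == 1,
# i.e. False, while the suggested check evaluates 8 == 8 * 1, i.e. True.
assert is_uniform_decode(num_tokens_unpadded=8, num_reqs=8,
                         max_num_scheduled_tokens=1, uniform_decode_query_len=1)
```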
vllm/v1/worker/gpu_model_runner.py (outdated)

```diff
         attn_metadata, spec_decode_common_attn_metadata = (
             self._build_attention_metadata(
-                total_num_scheduled_tokens=total_num_scheduled_tokens,
+                total_num_scheduled_tokens=num_reqs_padded,
```
The `total_num_scheduled_tokens` argument for `_build_attention_metadata` is being passed `num_reqs_padded`, which is the number of requests. It should be `num_tokens_padded`, the total number of tokens. This will likely lead to incorrect attention metadata and could cause errors or incorrect model outputs.
Suggested change:

```diff
-                total_num_scheduled_tokens=num_reqs_padded,
+                total_num_scheduled_tokens=num_tokens_padded,
```
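To make the distinction concrete, here is a small example with hypothetical numbers (not taken from the PR) showing why the padded request count and the padded token count are different quantities:

```python
# Hypothetical numbers purely for illustration; not taken from the PR.
num_reqs = 4                # requests in the batch
query_len = 3               # tokens scheduled per request (e.g. spec decode)
num_reqs_padded = 4         # request count after padding
num_tokens_padded = 16      # token count after padding to a CUDA graph size

# Attention metadata must be sized for the number of *tokens* the model will
# actually run, so passing the request count (4) where the padded token count
# (16) is expected would build metadata for far too few tokens.
assert num_tokens_padded >= num_reqs * query_len
assert num_tokens_padded != num_reqs_padded
```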
💡 Codex Review
Here are some automated review suggestions for this pull request.
SageMoore left a comment:
Looks like a good cleanup @LucasWilkinson. Thanks for the contribution.
Force-pushed 8d3975f to dd6ad9e
ProExpertProg left a comment:
LGTM overall, just a few questions above.
Force-pushed bb224a0 to 22ab0f9
SageMoore left a comment:
This all looks good to me, Lucas.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed a7a04ba to 2b58a28
Force-pushed 38cac6d to bf3731f
It seems this PR adds some runtime overhead: #29760
…n metadata building (vllm-project#28579) Signed-off-by: Benjamin Feuer <penfever@gmail.com>
…n metadata building (vllm-project#28579)

1. fix vllm-project/vllm#28542: the model structure modifications we are involved in are Qwen2.5-VL (some patches still exist), Qwen2-VL, Qwen2, the DeepSeek series, and the Qwen-moe series
2. fix vllm-project/vllm#29121: the output token type is now `list[list[int]]` instead of np
3. fix vllm-project/vllm#29262: the `xformers` backend for multimodal has been deprecated
4. fix vllm-project/vllm#29342
5. fix vllm-project/vllm#28579
6. fix vllm-project/vllm#28718
7. fix vllm-project/vllm#28665
8. fix vllm-project/vllm#26847: vLLM introduced the `optimization-level`, some default config has been changed, and the param `--enforce-eager` has been deprecated
9. fix http://github.com/vllm-project/vllm/pull/29223: it returns a tuple for the sampler
10. fix vllm-project/vllm#29471: we'll remove the related patch to avoid this kind of error

vLLM version: v0.11.2
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
```python
        )

        ubatch_slices, num_tokens_across_dp = coordinate_batch_across_dp(
            num_tokens_unpadded=num_tokens_padded,
```
Should this be `num_tokens_unpadded=num_tokens`?
@LucasWilkinson I'm hitting this issue: I couldn't do a clean revert, so I'm not sure if it is caused by this PR yet. Checking. EDIT: reverting this and a couple of follow-up fixes helped.
…attention metadata building (vllm-project#28579)" This reverts commit 56539cd.
Do you have repro instructions? Happy to help debug.
Thanks @LucasWilkinson! One node with the following command; this is tested on 2 nodes of 4xGB200. I'll try to minimize the config, but so far it looks like nixl is necessary to reproduce the issue. bench:
…n metadata building (vllm-project#28579) Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
@minosfuture's error can also be reproduced with the above. Only 4xGB200/B200 is required.
Thank you! I do not have access to a GB200, so this helps; what torch and CUDA version are you using? I'm getting NCCL crashes when I run this command.
We caught this from our CI, which uses the vLLM nightly image. I was also able to reproduce locally on 4xGB200 with torch 2.9.0 cu129 from a base image.
Okay, I was able to reproduce the error on 4xB200 with a slightly older commit (c719c40). The allreduce fusion pass error I saw is probably a separate issue on latest main (most likely #24252). The commands are the same as I posted before. The assertion failure happens soon after the first few requests complete. collect_env.py output:
Thanks! I was able to repro; this should fix it: #30173 (please comment on that PR if it fixes it for you 👍)
FIX #23789
The goal of this PR is to:
- update to the latest FA3 ("FA3 variable length attention sort/swizzle", flash-attention#82)
- remove hacks like "vLLM Easier cudagraph integration" (FlashMLA#3)
- remove `pad_for_cudagraphs` from attention backends; this is done for FlashInfer but will be done for `GDNAttentionBackend`, `Mamba1AttentionBackend`, `Mamba2AttentionMetadata`, and `ShortConvAttentionBackend` in future PRs

`pad_for_cudagraphs` was called multiple times inside `execute_model` before the forward pass, making it challenging to reason about the padding order. This PR starts to make the padding order in `gpu_model_runner` clearer, but more work still needs to be done (see the sketch after the list below).

Future related work that will be based off this PR:

- remove `pad_for_cudagraphs` from attention backends (see 1)
- remove `pad_for_cudagraphs` from config, transferring ownership to `CUDAGraphDispatcher`; this will make it easier to have separate cudagraph sizes for FULL and PIECEWISE, which is important for a more robust, long-term solution to "[Bug]: CUDA Graph Capture Issue: Unexpected Prefill Branches in Uniform Decode Graphs when MTP=2" (#28207)
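As referenced above, a minimal sketch of the ordering this PR moves toward: choose the CUDA graph padded size up front so attention metadata is built once at that size. The helper name and structure here are assumptions for illustration only, not vLLM's actual API.

```python
# Minimal sketch, not vLLM's actual code: pick the padded token count before
# building attention metadata, instead of each backend re-padding internally.
def pick_padded_num_tokens(num_tokens_unpadded: int,
                           cudagraph_sizes: list[int]) -> int:
    """Return the smallest captured CUDA graph size that fits the batch,
    or the unpadded count if no captured graph is large enough (eager fallback)."""
    for size in sorted(cudagraph_sizes):
        if size >= num_tokens_unpadded:
            return size
    return num_tokens_unpadded


# Example: with captured sizes [8, 16, 32], a 13-token batch pads to 16 tokens,
# and attention metadata would then be built once for 16 tokens.
assert pick_padded_num_tokens(13, [8, 16, 32]) == 16
assert pick_padded_num_tokens(40, [8, 16, 32]) == 40  # no graph fits -> eager
```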
Shout-out to @ayushsatyam146 for the preliminary work in #24002
Co-authored-by: ayushsatyam146 <ayushsatyam146@gmail.com>