Conversation

@LucasWilkinson
Collaborator

@LucasWilkinson LucasWilkinson commented Nov 12, 2025

FIX #23789

The goal of this PR is to:

  1. Pad for cudagraphs before building attention metadata; this will allow us to
    - update to the latest FA3 (flash-attention#82: FA3 variable-length attention sort/swizzle)
    - remove hacks like FlashMLA#3 (vLLM: easier cudagraph integration)
    - remove pad_for_cudagraphs from the attention backends; this PR does it for FlashInfer, and GDNAttentionBackend, Mamba1AttentionBackend, Mamba2AttentionMetadata, and ShortConvAttentionBackend will follow in future PRs
  2. Pad for cudagraphs in fewer places; prior to this PR, pad_for_cudagraphs was called multiple times inside execute_model before the forward pass, making it hard to reason about the padding order. This PR starts to make the padding order in gpu_model_runner clearer, but more work remains.
  3. Generally make the padding logic more isolated and easier to reason about (see the sketch below)
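
A minimal sketch of the intended ordering (hypothetical helper names and capture sizes, not the actual gpu_model_runner API): the padded size is decided once, up front, and attention metadata is then built a single time at that already-padded size, so backends no longer need their own padding hooks.

# Sketch only -- hypothetical names illustrating the reordering this PR moves toward.
CUDAGRAPH_CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64]  # assumed capture sizes

def pad_for_cudagraphs(num_tokens: int) -> int:
    """Round up to the nearest captured graph size (no padding if too large)."""
    for size in CUDAGRAPH_CAPTURE_SIZES:
        if num_tokens <= size:
            return size
    return num_tokens

def run_one_step(num_reqs: int, num_tokens_unpadded: int) -> None:
    # 1. Decide the padded size once, before any metadata is built.
    num_tokens_padded = pad_for_cudagraphs(num_tokens_unpadded)
    # 2. Build attention metadata a single time, already aware of the padded size.
    attn_metadata = {
        "num_reqs": num_reqs,
        "num_tokens": num_tokens_unpadded,
        "num_tokens_padded": num_tokens_padded,
    }
    # 3. The forward pass (and any captured graph replay) runs at the padded size.
    print(f"forward at {num_tokens_padded} tokens: {attn_metadata}")

run_one_step(num_reqs=3, num_tokens_unpadded=3)  # a decode batch of 3 pads up to 4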

Future related work that will be based on this PR:

Shout-out to @ayushsatyam146 for the preliminary work in #24002
Co-authored-by: ayushsatyam146 ayushsatyam146@gmail.com

@mergify

mergify bot commented Nov 12, 2025

Documentation preview: https://vllm--28579.org.readthedocs.build/en/28579/

@mergify mergify bot added the documentation (Improvements or additions to documentation), nvidia, and v1 labels Nov 12, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the CUDA graph padding logic, moving it from individual attention backends into the gpu_model_runner. This centralization is a good improvement for maintainability. The BatchDescriptor has also been updated to be more descriptive. While the overall direction is positive, I've identified two critical bugs in the implementation within gpu_model_runner.py that could lead to incorrect behavior or prevent CUDA graph optimizations from being applied. Please see the detailed comments for fixes.

)
uniform_decode = (
    (max_num_scheduled_tokens == self.uniform_decode_query_len)
    and (num_reqs == max_num_scheduled_tokens)

critical

The condition for uniform_decode seems incorrect. num_reqs == max_num_scheduled_tokens will only be true in very specific cases (e.g., a single decode request when uniform_decode_query_len is 1), preventing most uniform decode batches from being correctly identified. This will likely disable CUDA graph optimizations for decode paths.

The condition should probably check if the total number of tokens is equal to the number of requests multiplied by the query length, similar to the previous implementation.

Suggested change
-    and (num_reqs == max_num_scheduled_tokens)
+    and (num_tokens_unpadded == num_reqs * max_num_scheduled_tokens)
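
To make the difference concrete, a worked example with assumed numbers (a pure decode batch of 8 requests, one new token each):

# Assumed numbers, for illustration only.
num_reqs = 8
max_num_scheduled_tokens = 1          # one new token per request
uniform_decode_query_len = 1
num_tokens_unpadded = 8               # 8 requests x 1 token

# Current check: num_reqs == max_num_scheduled_tokens -> 8 == 1 -> False,
# so the batch is not recognized as uniform decode.
old_check = (max_num_scheduled_tokens == uniform_decode_query_len
             and num_reqs == max_num_scheduled_tokens)

# Suggested check: total tokens == num_reqs * query_len -> 8 == 8 * 1 -> True.
new_check = (max_num_scheduled_tokens == uniform_decode_query_len
             and num_tokens_unpadded == num_reqs * max_num_scheduled_tokens)

assert not old_check and new_check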

attn_metadata, spec_decode_common_attn_metadata = (
    self._build_attention_metadata(
-       total_num_scheduled_tokens=total_num_scheduled_tokens,
+       total_num_scheduled_tokens=num_reqs_padded,

critical

The total_num_scheduled_tokens argument for _build_attention_metadata is being passed num_reqs_padded, which is the number of requests. It should be num_tokens_padded, the total number of tokens. This will likely lead to incorrect attention metadata and could cause errors or incorrect model outputs.

Suggested change
-       total_num_scheduled_tokens=num_reqs_padded,
+       total_num_scheduled_tokens=num_tokens_padded,
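
For illustration (assumed numbers): the two quantities only coincide for single-token decode batches, which is why passing the request count here under-sizes the metadata.

# Assumed numbers: 4 requests each scheduling 3 tokens (e.g. speculative decode),
# padded up to a hypothetical cudagraph capture size of 16.
num_reqs_padded = 4
num_tokens_scheduled = 4 * 3   # 12 tokens actually in the batch
num_tokens_padded = 16         # capture size the batch is padded to

# The metadata builder needs the token count; passing num_reqs_padded (4) would
# size the attention metadata for 4 tokens instead of 16.
build_kwargs = {"total_num_scheduled_tokens": num_tokens_padded}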


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@LucasWilkinson LucasWilkinson changed the title from "[Core] Pad for CUDA graphs before attention metadata building and refactor padding logic" to "[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building and" on Nov 13, 2025
@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 13, 2025
@LucasWilkinson LucasWilkinson changed the title from "[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building and" to "[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building" on Nov 13, 2025
Contributor

@SageMoore SageMoore left a comment


Looks like a good cleanup @LucasWilkinson. Thanks for the contribution.

@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/padding-refactor branch 3 times, most recently from 8d3975f to dd6ad9e Compare November 17, 2025 20:21
Collaborator

@ProExpertProg ProExpertProg left a comment


LGTM overall, just a few questions above

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 17, 2025
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/padding-refactor branch from bb224a0 to 22ab0f9 Compare November 18, 2025 15:34
Contributor

@SageMoore SageMoore left a comment


This all looks good to me, Lucas.

@mergify

mergify bot commented Nov 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 19, 2025
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/padding-refactor branch 2 times, most recently from a7a04ba to 2b58a28 Compare November 21, 2025 04:55
@mergify mergify bot removed the needs-rebase label Nov 21, 2025
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/padding-refactor branch from 38cac6d to bf3731f Compare November 22, 2025 06:00
@BoyuanFeng
Contributor

It seems this PR adds some runtime overhead: #29760

penfever pushed a commit to mlfoundations/vllm that referenced this pull request Dec 1, 2025
…n metadata building (vllm-project#28579)

Signed-off-by: Benjamin Feuer <penfever@gmail.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Dec 2, 2025
1. fix vllm-project/vllm#28542
The model structure modifications we are involved in are:
     - Qwen2.5-VL (some patches still exist)
     - Qwen2-VL
     - Qwen2
     - DeepSeek series
     - Qwen-moe series
2. fix vllm-project/vllm#29121
   the output token type has changed from numpy to `list[list[int]]`

3. fix vllm-project/vllm#29262
    the `xformers` backend for multimodal has now been deprecated
4. fix vllm-project/vllm#29342

5. fix vllm-project/vllm#28579
6. fix vllm-project/vllm#28718
7. fix vllm-project/vllm#28665
8. fix vllm-project/vllm#26847
vLLM introduced the `optimization-level` option; some default configs have been
changed, and the `--enforce-eager` param has been deprecated
9. fix http://github.com/vllm-project/vllm/pull/29223: it returns a tuple
for the sampler.
10. fix vllm-project/vllm#29471: we'll remove the
related patch to avoid this kind of error.

Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangli <wangli858794774@gmail.com>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
)

ubatch_slices, num_tokens_across_dp = coordinate_batch_across_dp(
    num_tokens_unpadded=num_tokens_padded,

num_tokens_unpadded = num_tokens?

ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
@minosfuture
Copy link
Contributor

minosfuture commented Dec 4, 2025

@LucasWilkinson I'm hitting this issue:

(EngineCore_DP4 pid=858652) Traceback (most recent call last):
(EngineCore_DP4 pid=858652)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP4 pid=858652)     self.run()
(EngineCore_DP4 pid=858652)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP4 pid=858652)     self._target(*self._args, **self._kwargs)
(EngineCore_DP4 pid=858652)   File "/data/nfs01/ming/vllm/vllm/v1/engine/core.py", line 850, in run_engine_core
(EngineCore_DP4 pid=858652)     raise e
(EngineCore_DP4 pid=858652)   File "/data/nfs01/ming/vllm/vllm/v1/engine/core.py", line 839, in run_engine_core
(EngineCore_DP4 pid=858652)     engine_core.run_busy_loop()
(EngineCore_DP4 pid=858652)   File "/data/nfs01/ming/vllm/vllm/v1/engine/core.py", line 1228, in run_busy_loop
(EngineCore_DP4 pid=858652)     self.execute_dummy_batch()
(EngineCore_DP4 pid=858652)   File "/data/nfs01/ming/vllm/vllm/v1/engine/core.py", line 499, in execute_dummy_batch
(EngineCore_DP4 pid=858652)     self.model_executor.execute_dummy_batch()
(EngineCore_DP4 pid=858652)   File "/data/nfs01/ming/vllm/vllm/v1/executor/abstract.py", line 229, in execute_dummy_batch
(EngineCore_DP4 pid=858652)     self.collective_rpc("execute_dummy_batch")
(EngineCore_DP4 pid=858652)   File "/data/nfs01/ming/vllm/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP4 pid=858652)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP4 pid=858652)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP4 pid=858652)   File "/data/nfs01/ming/vllm/vllm/v1/serial_utils.py", line 479, in run_method
(EngineCore_DP4 pid=858652)     return func(*args, **kwargs)
(EngineCore_DP4 pid=858652)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP4 pid=858652)   File "/data/nfs01/ming/vllm/vllm/v1/worker/gpu_worker.py", line 632, in execute_dummy_batch
(EngineCore_DP4 pid=858652)     self.model_runner._dummy_run(1, uniform_decode=True)
(EngineCore_DP4 pid=858652)   File "/data/nfs01/ming/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP4 pid=858652)     return func(*args, **kwargs)
(EngineCore_DP4 pid=858652)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP4 pid=858652)   File "/data/nfs01/ming/vllm/vllm/v1/worker/gpu_model_runner.py", line 3947, in _dummy_run
(EngineCore_DP4 pid=858652)     self._determine_batch_execution_and_padding(
(EngineCore_DP4 pid=858652)   File "/data/nfs01/ming/vllm/vllm/v1/worker/gpu_model_runner.py", line 2825, in _determine_batch_execution_and_padding
(EngineCore_DP4 pid=858652)     assert batch_descriptor.num_tokens == num_tokens_padded
(EngineCore_DP4 pid=858652)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP4 pid=858652) AssertionError

couldn't do a clean revert so I'm not sure if it is caused by this PR yet. Checking.

EDIT: reverting this and a couple follow-up fixes helped.

minosfuture added a commit to minosfuture/vllm that referenced this pull request Dec 4, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
@LucasWilkinson
Collaborator Author

@LucasWilkinson I'm hitting this issue: (traceback quoted above)

couldn't do a clean revert so I'm not sure if it is caused by this PR yet. Checking.

EDIT: reverting this and a couple follow-up fixes helped.

Do you have repro instructions? happy to help debug

@minosfuture
Contributor

Do you have repro instructions? happy to help debug

thanks @LucasWilkinson !

Run one node with the following command, and another node with the same command but --data-parallel-start-rank=0 (as the leader).

This was tested on 2 nodes of 4xGB200.

I'll try to minimize the config, but so far it looks like nixl is necessary to reproduce the issue.

CUDA_HOME=/usr/local/cuda \
LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
NCCL_CUMEM_ENABLE=1 \
NCCL_MNNVL_ENABLE=1 \
NCCL_NVLS_ENABLE=1 \
NVSHMEM_IB_ENABLE_IBGDA=1 \
PATH=/usr/local/cuda/bin:$PATH \
UCX_TLS=all \
VLLM_ATTENTION_BACKEND=FLASHINFER_MLA \
VLLM_DISABLE_FLASHINFER_PREFILL=0 \
VLLM_FORCE_TORCH_ALLREDUCE=1 \
VLLM_LOGGING_LEVEL=INFO \
VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300 \
VLLM_NIXL_SIDE_CHANNEL_HOST=`hostname -i` \
VLLM_NIXL_SIDE_CHANNEL_PORT=5700 \
VLLM_TORCH_PROFILER_DIR=./profile/ \
VLLM_USE_DEEP_GEMM=0 \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_USE_FLASHINFER_MOE_FP8=1 \
VLLM_USE_FLASHINFER_SAMPLER=1 \
VLLM_USE_TRTLLM_RAGGED_DEEPSEEK_PREFILL=1 \
VLLM_V1_OUTPUT_PROC_CHUNK_SIZE=2048 \
VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1 \
VLLM_MOE_DP_CHUNK_SIZE=1024 \
 \
VLLM_USE_STANDALONE_COMPILE=0 \
VLLM_DEEPEP_BUFFER_SIZE_MB=0 \
VLLM_DEEPEP_LOW_LATENCY_ALLOW_NVLINK=1 \
VLLM_DEEPEP_LOW_LATENCY_USE_MNNVL=1 \
VLLM_FLASHINFER_MOE_BACKEND=latency \
 vllm serve nvidia/DeepSeek-R1-0528-FP4-v2 --async-scheduling \
--disable_custom_all_reduce \
--disable-uvicorn-access-log \
--disable_nccl_for_dp_synchronization \
--enable-expert-parallel \
--kv-cache-dtype fp8 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
--tensor-parallel-size 1 \
--trust-remote-code \
 \
--all2all-backend deepep_low_latency \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY","max_cudagraph_capture_size":2048}' \
--data-parallel-hybrid-lb \
--data-parallel-size 8 \
--data-parallel-size-local 4 \
--gpu-memory-utilization 0.9 \
--max-model-len 4096 \
--max-num-batched-tokens 16384 \
--max-num-seqs 2048 \
 --data-parallel-address 192.168.5.50 --data-parallel-start-rank=4 2>&1 | tee decode.log

bench:

vllm bench serve --model nvidia/DeepSeek-R1-0528-FP4-v2 --port 8000 --dataset-name random --ignore-eos --num-prompts 1024 --request-rate inf --random-input-len 2 --random-output-len 1024 --max-concurrency 2048 --trust_remote_code --seed $RANDOM --ready-check-timeout-sec 0

charlotte12l pushed a commit to charlotte12l/vllm that referenced this pull request Dec 5, 2025
…n metadata building (vllm-project#28579)

Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
Meihan-chen pushed a commit to Meihan-chen/vllm-ascend that referenced this pull request Dec 5, 2025
@hjjq
Contributor

hjjq commented Dec 5, 2025

export VLLM_ATTENTION_BACKEND=FLASHINFER_MLA
export VLLM_FLASHINFER_MOE_BACKEND=latency
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_NCCL_SYMM_MEM=1
export NCCL_NVLS_ENABLE=1
export NCCL_CUMEM_ENABLE=1
export VLLM_USE_TRTLLM_RAGGED_DEEPSEEK_PREFILL=1

python3 -m vllm.entrypoints.openai.api_server --model nvidia/DeepSeek-R1-0528-FP4 --tokenizer nvidia/DeepSeek-R1-0528-FP4 --dtype auto --kv-cache-dtype fp8 --tensor-parallel-size 1 --pipeline-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel --swap-space 16 --max-num-seqs 1024 --trust-remote-code --max-model-len 10240 --gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 --no-enable-prefix-caching --async-scheduling --compilation_config.pass_config.fuse_allreduce_rms true --compilation_config.pass_config.fuse_attn_quant true --compilation_config.pass_config.eliminate_noops true --compilation_config.custom_ops+=+quant_fp8,+rms_norm --max-cudagraph-capture-size 2048 --compilation_config.cudagraph_mode FULL_DECODE_ONLY --stream-interval=20 --api-server-count=20

vllm bench serve --dataset-name random --ignore-eos --num-prompts 10240 --max-concurrency 2048 --random-range-ratio 0.8 --random-input-len 2048 --random-output-len 1024 --model nvidia/DeepSeek-R1-0528-FP4 --base-url http://0.0.0.0:8000/ --ready-check-timeout-sec 0

@minosfuture's error can also be reproduced with the above. Only 4xGB200/B200 is required.

@LucasWilkinson
Collaborator Author

Thank you! I do not have access to a GB200, so this helps; what torch and CUDA versions are you using? I'm getting NCCL crashes when I run this command.

@hjjq
Copy link
Contributor

hjjq commented Dec 5, 2025

We caught this from our CI, which uses the vLLM nightly image. I was also able to reproduce it locally on 4xGB200 with torch 2.9.0 cu129, using the base image nvcr.io/nvidia/pytorch:25.06-py3 and vLLM built from source. I was just trying to run on 4xB200 with latest main, but I hit an allreduce fusion pass error. I'll get back once I can reproduce on B200.

@hjjq
Copy link
Contributor

hjjq commented Dec 6, 2025

Okay, I was able to reproduce the error on 4xB200 with a slightly older commit (c719c40). The allreduce fusion pass error I saw is probably a separate issue on latest main (most likely #24252).

The commands are the same as I posted before. The assertion failure happens soon after the first few requests complete.
For B200 I am also using nvcr.io/nvidia/pytorch:25.06-py3. See env below.

collect_env.py output
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.2 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : Could not collect
CMake version                : version 4.2.0
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.5.0-45-generic-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : 
GPU 0: NVIDIA B200
GPU 1: NVIDIA B200
GPU 2: NVIDIA B200
GPU 3: NVIDIA B200

Nvidia driver version        : 580.82.07
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.10.2
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             112
On-line CPU(s) list:                0-111
Vendor ID:                          GenuineIntel
Model name:                         INTEL(R) XEON(R) PLATINUM 8570
CPU family:                         6
Model:                              207
Thread(s) per core:                 1
Core(s) per socket:                 56
Socket(s):                          2
Stepping:                           2
Frequency boost:                    enabled
CPU(s) scaling MHz:                 111%
CPU max MHz:                        2101.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4200.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          5.3 MiB (112 instances)
L1i cache:                          3.5 MiB (112 instances)
L2 cache:                           224 MiB (112 instances)
L3 cache:                           600 MiB (2 instances)
NUMA node(s):                       4
NUMA node0 CPU(s):                  0-27
NUMA node1 CPU(s):                  28-55
NUMA node2 CPU(s):                  56-83
NUMA node3 CPU(s):                  84-111
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.5.3
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.16.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.3.1
[pip3] nvidia-ml-py==13.580.82
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0
[pip3] torchaudio==2.9.0
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.3
[pip3] triton==3.5.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.11.2.dev606+g7b5575fa7 (git sha: 7b5575fa7)
vLLM Build Flags:
  CUDA Archs: 7.5 8.0 8.6 9.0 10.0 12.0+PTX; ROCm: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10      NIC11   NIC12   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NODE    NODE    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    0-23    0               N/A
GPU1    NV18     X      NV18    NV18    NODE    NODE    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS     NODE    0-23    0               N/A
GPU2    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     SYS     PIX     NODE    SYS     SYS     SYS     SYS     SYS     28-51   1               N/A
GPU3    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     NODE    PIX     SYS     SYS     SYS     SYS     SYS     28-51   1               N/A
NIC0    NODE    NODE    SYS     SYS      X      PIX     PIX     PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE
NIC1    NODE    NODE    SYS     SYS     PIX      X      PIX     PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE
NIC2    NODE    NODE    SYS     SYS     PIX     PIX      X      PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE
NIC3    NODE    NODE    SYS     SYS     PIX     PIX     PIX      X      NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE
NIC4    PIX     NODE    SYS     SYS     NODE    NODE    NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE
NIC5    NODE    PIX     SYS     SYS     NODE    NODE    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     SYS     SYS     NODE
NIC6    SYS     SYS     PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     SYS     SYS     SYS
NIC7    SYS     SYS     NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     SYS     SYS     SYS
NIC8    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS
NIC9    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
NIC10   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
NIC11   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS
NIC12   NODE    NODE    SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_7
  NIC6: mlx5_8
  NIC7: mlx5_9
  NIC8: mlx5_10
  NIC9: mlx5_11
  NIC10: mlx5_12
  NIC11: mlx5_13
  NIC12: mlx5_bond_0

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
LOCAL_RANK=0
CUBLAS_VERSION=12.9.1.4
NVIDIA_REQUIRE_CUDA=cuda>=9.0
TORCH_CUDA_ARCH_LIST=7.5 8.0 8.6 9.0 10.0 12.0+PTX
NCCL_VERSION=2.27.3
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
TORCH_NCCL_USE_COMM_NONBLOCKING=0
CUDA_ARCH_LIST=7.5 8.0 8.6 9.0 10.0 12.0
NCCL_IB_HCA=mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_4,mlx5_7,mlx5_8,mlx5_9
NVIDIA_PRODUCT_NAME=PyTorch
CUDA_VERSION=12.9.1.010
PYTORCH_VERSION=2.8.0a0+5228986
PYTORCH_BUILD_NUMBER=0
CUBLASMP_VERSION=0.4.0.789
CUDA_VISIBLE_DEVICES=0,1,2,3
CUDA_VISIBLE_DEVICES=0,1,2,3
CUDNN_FRONTEND_VERSION=1.12.0
CUDNN_VERSION=9.10.2.21
PYTORCH_HOME=/opt/pytorch/pytorch
LD_LIBRARY_PATH=/usr/local/cuda/lib64
NVIDIA_BUILD_ID=177567386
CUDA_DRIVER_VERSION=575.57.08
PYTORCH_BUILD_VERSION=2.8.0a0+5228986
CUDA_HOME=/usr/local/cuda
CUDA_HOME=/usr/local/cuda
CUDA_MODULE_LOADING=LAZY
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=
NVIDIA_PYTORCH_VERSION=25.06
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

@LucasWilkinson
Collaborator Author

Thanks! I was able to repro; this should fix it: #30173 (please comment on that PR if it fixes it for you 👍)

Zhathw pushed a commit to Zhathw/vllm that referenced this pull request Dec 6, 2025

Labels

documentation (Improvements or additions to documentation), nvidia, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Attention]: Pad for cudagraphs before constructing attention metadata

8 participants