Refactor: Move CUDA graph dispatch logic earlier #27382
Conversation
Code Review
This pull request correctly identifies and fixes a potential bug where dummy attention metadata might not be created when cudagraph_runtime_mode is set to FULL. By deferring the metadata creation until after cudagraph_runtime_mode is determined, the change ensures correctness. The implementation is a straightforward move of a code block, and it looks correct. I've added one suggestion to improve maintainability by refactoring a large block of duplicated code.
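For readers skimming the thread, here is a minimal sketch of the intended ordering: resolve the CUDA graph runtime mode first, then decide whether dummy attention metadata is needed. The helper names (`dispatch`, `build_dummy_attn_metadata`) and thresholds are hypothetical placeholders, not the actual `gpu_model_runner` code.

```python
# Minimal sketch of "dispatch first, then decide on dummy attention metadata".
# All names below are placeholders for illustration only.
from enum import Enum


class CUDAGraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2


def dispatch(num_tokens: int) -> CUDAGraphMode:
    # Placeholder dispatcher: pretend small batches replay a FULL graph.
    return CUDAGraphMode.FULL if num_tokens <= 512 else CUDAGraphMode.NONE


def build_dummy_attn_metadata(num_tokens: int) -> dict:
    # Placeholder for per-backend dummy attention metadata.
    return {"num_tokens": num_tokens}


def dummy_run(num_tokens_after_padding: int, force_attention: bool = False):
    # 1) Resolve the CUDA graph runtime mode before anything depends on it.
    cudagraph_runtime_mode = dispatch(num_tokens_after_padding)

    # 2) Only now decide whether dummy attention metadata must be built,
    #    so a FULL-mode replay never runs without metadata.
    attn_metadata = None
    if force_attention or cudagraph_runtime_mode == CUDAGraphMode.FULL:
        attn_metadata = build_dummy_attn_metadata(num_tokens_after_padding)
    return cudagraph_runtime_mode, attn_metadata
```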
💡 Codex Review
vllm/vllm/v1/worker/gpu_model_runner.py
Lines 3312 to 3315 in 4b696d1
```python
# Make sure padding doesn't exceed max_num_tokens
assert num_tokens_after_padding <= self.max_num_tokens
model_kwargs = self._init_model_kwargs(num_tokens_after_padding)
if self.supports_mm_inputs and not self.model_config.is_encoder_decoder:
```
`_dummy_run` now invokes `_init_model_kwargs` before `seq_lens` is filled for the current dummy batch; the lengths are only written later, when `force_attention` is true or the dispatcher returns `CUDAGraphMode.FULL`. `_init_model_kwargs` reads `self.seq_lens` to build `token_type_ids` for pooling models, so a dummy run that captures attention metadata on a pooling model will use stale lengths from a previous call, yielding `token_type_ids` whose size no longer matches the current `num_tokens_after_padding` and causing incorrect inputs during CUDA graph warmup. Ensure the `seq_lens` tensor is populated before calling `_init_model_kwargs`, or defer that call until after the lengths are updated.
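A rough, self-contained illustration of the hazard described above, assuming `token_type_ids` is derived from a persistent `seq_lens` buffer roughly as sketched here (buffer shape and helper are made up for the example; this is not the actual runner code):

```python
# Illustration only: token_type_ids built from a stale seq_lens buffer no
# longer matches the current padded token count.
import torch

MAX_NUM_REQS = 4
seq_lens = torch.zeros(MAX_NUM_REQS, dtype=torch.int32)  # persistent buffer


def init_model_kwargs(num_tokens_after_padding: int) -> dict:
    # Builds token_type_ids from whatever is currently in seq_lens.
    ids = torch.cat(
        [torch.full((int(n),), i % 2) for i, n in enumerate(seq_lens)]
    )
    return {"token_type_ids": ids[:num_tokens_after_padding]}


# A previous dummy run wrote lengths for 2 requests of 8 tokens -> 16 tokens.
seq_lens[:2] = 8

# The current dummy run pads to 32 tokens but calls init_model_kwargs
# *before* refreshing seq_lens, so token_type_ids covers only 16 tokens.
kwargs = init_model_kwargs(num_tokens_after_padding=32)
print(kwargs["token_type_ids"].numel())  # 16, not 32 -> stale lengths
```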
@WoosukKwon Could you please take a look at this? Thanks!
Yep, it is a good catch. The FlashInfer backend may hang if it runs a full CUDA graph without preparing attn_metadata. Could you please move the attn_metadata building code back out of the
OK, will do.
Moves the CUDA graph dispatch logic to execute before the attention metadata is calculated within the dummy run. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
fhl2000
left a comment
Sorry for this late review. I think there is no harm in moving this logic earlier in dummy_run.
cc @LucasWilkinson: this is also closer to your idea, in that the padding logic (and CUDA graph mode) is verified by the cudagraph dispatcher before attention metadata building.
ProExpertProg
left a comment
LGTM, @LucasWilkinson is this ok with you?
LucasWilkinson
left a comment
LGTM; this will be fixed by #28579, but we can take this in the interim.
Pull request overview
This PR refactors the _dummy_run method in the GPU model runner by moving the CUDA graph dispatch logic to execute earlier in the function flow. The change ensures that cudagraph_runtime_mode is determined before the attention metadata creation decision, addressing a potential edge case where attention metadata might not be created when replaying a CUDA graph in FULL mode.
Key Changes
- CUDA graph dispatch logic is moved from inside the LoRA context (after intermediate tensor setup) to immediately after the `num_tokens_after_padding` calculation
- This ensures `cudagraph_runtime_mode` is set before the condition `if force_attention or cudagraph_runtime_mode == CUDAGraphMode.FULL:` is evaluated for attention metadata creation (see the sketch below)
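A condensed, hypothetical before/after comparison of that move (reusing the `CUDAGraphMode` enum and the placeholder `dispatch` / `build_dummy_attn_metadata` helpers from the sketch earlier in this thread; these are not the real method bodies):

```python
def dummy_run_before(num_tokens_after_padding: int, force_attention: bool = False):
    # Before: the mode is still unset when the metadata condition runs,
    # so a later FULL-mode replay proceeds without dummy attention metadata.
    cudagraph_runtime_mode = None
    attn_metadata = None
    if force_attention or cudagraph_runtime_mode == CUDAGraphMode.FULL:
        attn_metadata = build_dummy_attn_metadata(num_tokens_after_padding)
    # ... intermediate tensors, LoRA context, etc. ...
    cudagraph_runtime_mode = dispatch(num_tokens_after_padding)
    return cudagraph_runtime_mode, attn_metadata


def dummy_run_after(num_tokens_after_padding: int, force_attention: bool = False):
    # After: dispatch happens immediately after padding, so the condition
    # sees the real mode and metadata is built whenever FULL is replayed.
    cudagraph_runtime_mode = dispatch(num_tokens_after_padding)
    attn_metadata = None
    if force_attention or cudagraph_runtime_mode == CUDAGraphMode.FULL:
        attn_metadata = build_dummy_attn_metadata(num_tokens_after_padding)
    return cudagraph_runtime_mode, attn_metadata
```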
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
Purpose
After carefully reading the code, I found a potential edge case: when `execute_dummy_batch` runs, dummy attention metadata isn't created even if `cudagraph_runtime_mode` is later set to `CUDAGraphMode.FULL`. That's odd, because attention normally requires metadata; otherwise it may raise an error or produce incorrect output. The only explanation I can think of is that we're skipping metadata creation for dummy batches to save a bit of performance, since we don't care about their output. Can anyone elaborate on this? Thanks.
I also propose a potential fix by moving the CUDA graph dispatch logic earlier; this ensures metadata is built when replaying a CUDA graph, and the performance impact should be negligible.
Test Plan
None.
Test Result
None.