Refactor: Move CUDA graph dispatch logic earlier #27382
Conversation
Code Review
This pull request correctly identifies and fixes a potential bug where dummy attention metadata might not be created when cudagraph_runtime_mode is set to FULL. By deferring the metadata creation until after cudagraph_runtime_mode is determined, the change ensures correctness. The implementation is a straightforward move of a code block, and it looks correct. I've added one suggestion to improve maintainability by refactoring a large block of duplicated code.
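For readers skimming the thread, here is a minimal sketch of the intended ordering: resolve the CUDA graph runtime mode first, then decide whether dummy attention metadata is needed. The helper names (`dispatch`, `build_dummy_attn_metadata`) and thresholds are hypothetical placeholders, not the actual `gpu_model_runner` code.

```python
# Minimal sketch of "dispatch first, then decide on dummy attention metadata".
# All names below are placeholders for illustration only.
from enum import Enum


class CUDAGraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2


def dispatch(num_tokens: int) -> CUDAGraphMode:
    # Placeholder dispatcher: pretend small batches replay a FULL graph.
    return CUDAGraphMode.FULL if num_tokens <= 512 else CUDAGraphMode.NONE


def build_dummy_attn_metadata(num_tokens: int) -> dict:
    # Placeholder for per-backend dummy attention metadata.
    return {"num_tokens": num_tokens}


def dummy_run(num_tokens_after_padding: int, force_attention: bool = False):
    # 1) Resolve the CUDA graph runtime mode before anything depends on it.
    cudagraph_runtime_mode = dispatch(num_tokens_after_padding)

    # 2) Only now decide whether dummy attention metadata must be built,
    #    so a FULL-mode replay never runs without metadata.
    attn_metadata = None
    if force_attention or cudagraph_runtime_mode == CUDAGraphMode.FULL:
        attn_metadata = build_dummy_attn_metadata(num_tokens_after_padding)
    return cudagraph_runtime_mode, attn_metadata
```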
💡 Codex Review
vllm/vllm/v1/worker/gpu_model_runner.py
Lines 3312 to 3315 in 4b696d1
```python
# Make sure padding doesn't exceed max_num_tokens
assert num_tokens_after_padding <= self.max_num_tokens
model_kwargs = self._init_model_kwargs(num_tokens_after_padding)
if self.supports_mm_inputs and not self.model_config.is_encoder_decoder:
```
`_dummy_run` now invokes `_init_model_kwargs` before `seq_lens` is filled for the current dummy batch; the lengths are only written later, when `force_attention` is true or the dispatcher returns `CUDAGraphMode.FULL`. `_init_model_kwargs` reads `self.seq_lens` to build `token_type_ids` for pooling models, so a dummy run that captures attention metadata on a pooling model will use stale lengths from a previous call, yielding `token_type_ids` whose size no longer matches the current `num_tokens_after_padding` and causing incorrect inputs during CUDA graph warmup. Ensure the `seq_lens` tensor is populated before calling `_init_model_kwargs`, or defer that call until after the lengths are updated.
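A rough, self-contained illustration of the hazard described above, assuming `token_type_ids` is derived from a persistent `seq_lens` buffer roughly as sketched here (buffer shape and helper are made up for the example; this is not the actual runner code):

```python
# Illustration only: token_type_ids built from a stale seq_lens buffer no
# longer matches the current padded token count.
import torch

MAX_NUM_REQS = 4
seq_lens = torch.zeros(MAX_NUM_REQS, dtype=torch.int32)  # persistent buffer


def init_model_kwargs(num_tokens_after_padding: int) -> dict:
    # Builds token_type_ids from whatever is currently in seq_lens.
    ids = torch.cat(
        [torch.full((int(n),), i % 2) for i, n in enumerate(seq_lens)]
    )
    return {"token_type_ids": ids[:num_tokens_after_padding]}


# A previous dummy run wrote lengths for 2 requests of 8 tokens -> 16 tokens.
seq_lens[:2] = 8

# The current dummy run pads to 32 tokens but calls init_model_kwargs
# *before* refreshing seq_lens, so token_type_ids covers only 16 tokens.
kwargs = init_model_kwargs(num_tokens_after_padding=32)
print(kwargs["token_type_ids"].numel())  # 16, not 32 -> stale lengths
```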
@WoosukKwon Could you please take a look at this? Thanks!
Yep, it is a good catch. The FlashInfer backend may hang if it runs a full CUDA graph without preparing attn_metadata. Could you please move the attn_metadata building code back out of the
OK, will do.
Moves the CUDA graph dispatch logic to execute before the attention metadata is calculated within the dummy run. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
fhl2000
left a comment
Sorry for this late review. I think there is no harm in moving this logic earlier in dummy_run.
cc @LucasWilkinson: this is also closer to your idea, in that the padding logic (and CUDA graph mode) is verified by the cudagraph dispatcher before attention metadata building.
ProExpertProg
left a comment
LGTM, @LucasWilkinson is this ok with you?
LucasWilkinson
left a comment
LGTM; this will be fixed by #28579, but we can take this in the interim.
Pull request overview
This PR refactors the _dummy_run method in the GPU model runner by moving the CUDA graph dispatch logic to execute earlier in the function flow. The change ensures that cudagraph_runtime_mode is determined before the attention metadata creation decision, addressing a potential edge case where attention metadata might not be created when replaying a CUDA graph in FULL mode.
Key Changes
- CUDA graph dispatch logic is moved from inside the LoRA context (after intermediate tensor setup) to immediately after the `num_tokens_after_padding` calculation
- This ensures `cudagraph_runtime_mode` is set before the condition `if force_attention or cudagraph_runtime_mode == CUDAGraphMode.FULL:` is evaluated for attention metadata creation (see the sketch below)
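A condensed, hypothetical before/after comparison of that move (reusing the `CUDAGraphMode` enum and the placeholder `dispatch` / `build_dummy_attn_metadata` helpers from the sketch earlier in this thread; these are not the real method bodies):

```python
def dummy_run_before(num_tokens_after_padding: int, force_attention: bool = False):
    # Before: the mode is still unset when the metadata condition runs,
    # so a later FULL-mode replay proceeds without dummy attention metadata.
    cudagraph_runtime_mode = None
    attn_metadata = None
    if force_attention or cudagraph_runtime_mode == CUDAGraphMode.FULL:
        attn_metadata = build_dummy_attn_metadata(num_tokens_after_padding)
    # ... intermediate tensors, LoRA context, etc. ...
    cudagraph_runtime_mode = dispatch(num_tokens_after_padding)
    return cudagraph_runtime_mode, attn_metadata


def dummy_run_after(num_tokens_after_padding: int, force_attention: bool = False):
    # After: dispatch happens immediately after padding, so the condition
    # sees the real mode and metadata is built whenever FULL is replayed.
    cudagraph_runtime_mode = dispatch(num_tokens_after_padding)
    attn_metadata = None
    if force_attention or cudagraph_runtime_mode == CUDAGraphMode.FULL:
        attn_metadata = build_dummy_attn_metadata(num_tokens_after_padding)
    return cudagraph_runtime_mode, attn_metadata
```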
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
Purpose
After carefully reading the code, I found a potential edge case: when `execute_dummy_batch` runs, dummy attention metadata isn't created even if `cudagraph_runtime_mode` is later set to `CUDAGraphMode.FULL`. That's odd, because attention normally requires metadata; otherwise it may raise an error or produce incorrect output. The only explanation I can think of is that we're skipping metadata creation for dummy batches to save a bit of performance, since we don't care about their output. Can anyone elaborate on this? Thanks.
I also propose a potential fix by moving the CUDA graph dispatch logic earlier; this ensures metadata is built when replaying a CUDA graph, and the performance impact should be negligible.
Test Plan
None.
Test Result
None.