12 changes: 9 additions & 3 deletions vllm/v1/worker/gpu_model_runner.py
@@ -4104,10 +4104,16 @@ def _dummy_run(

 if self.speculative_config and self.speculative_config.use_eagle():
     assert isinstance(self.drafter, EagleProposer)
     # Eagle currently only supports PIECEWISE cudagraphs.
     # Therefore only use cudagraphs if the main model uses PIECEWISE
     # NOTE(lucas): this is a hack, need to clean up.
     use_cudagraphs = (
-        cudagraph_runtime_mode.has_mode(CUDAGraphMode.PIECEWISE)
-        and not self.speculative_config.enforce_eager
-    )
+        (
+            is_graph_capturing
+            and cudagraph_runtime_mode == CUDAGraphMode.PIECEWISE
+        )
+        or (cudagraph_runtime_mode != CUDAGraphMode.NONE)
Comment on lines +4111 to +4115
Member

@LucasWilkinson could you double check this logic?

Collaborator Author

The basic idea here is that during a DP dummy run we may end up using FULL CGs for the target model. Previously, the drafter only used PIECEWISE CGs when the target model was also using PIECEWISE CGs (mostly for CG capture), so if the target model used FULL CGs the drafter would revert to eager. This didn't match the behavior of non-dummy runs, where the drafter still uses PIECEWISE CGs if the main model uses FULL CGs.
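To make the behavior change easier to check, here is a minimal standalone sketch that compares the old and new conditions outside a graph-capture pass. It is not vLLM code: the `CUDAGraphMode` enum below is a simplified three-value stand-in, the old `has_mode(CUDAGraphMode.PIECEWISE)` check is approximated by an equality test, and the function names are hypothetical.

```python
from enum import Enum


class CUDAGraphMode(Enum):
    """Simplified stand-in for vLLM's CUDAGraphMode (assumption: three runtime modes)."""

    NONE = 0
    PIECEWISE = 1
    FULL = 2


def drafter_uses_cudagraphs_old(mode, enforce_eager):
    # Old condition. `mode == PIECEWISE` approximates has_mode(PIECEWISE)
    # for the simple runtime modes modeled here (assumption).
    return mode == CUDAGraphMode.PIECEWISE and not enforce_eager


def drafter_uses_cudagraphs_new(mode, is_graph_capturing, enforce_eager):
    # New condition, transcribed from the diff above.
    return (
        (is_graph_capturing and mode == CUDAGraphMode.PIECEWISE)
        or (mode != CUDAGraphMode.NONE)
    ) and not enforce_eager


def compare():
    """Return (mode name, old result, new result) for a non-capturing dummy run."""
    rows = []
    for mode in CUDAGraphMode:
        old = drafter_uses_cudagraphs_old(mode, enforce_eager=False)
        new = drafter_uses_cudagraphs_new(
            mode, is_graph_capturing=False, enforce_eager=False
        )
        rows.append((mode.name, old, new))
    return rows


if __name__ == "__main__":
    for name, old, new in compare():
        print(f"{name:<10} old={old!s:<6} new={new}")
```

Under these assumptions, the only row that differs is FULL: the old condition leaves the drafter in eager mode, while the new one enables cudagraphs for it. Note also that, as transcribed, the `is_graph_capturing` clause is implied by the `!= CUDAGraphMode.NONE` clause, so the result does not depend on `is_graph_capturing`; that may be worth confirming as intended.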

+    ) and not self.speculative_config.enforce_eager

# Note(gnovack) - We need to disable cudagraphs for one of the two
# lora cases when cudagraph_specialize_lora is enabled. This is a