
Commit 16722b9

remove files
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

Squashed follow-up commit messages (each also Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> unless noted):

- cleanup
- cleanup
- review comment
- remove dead code
- cleanup
- fix doc error
- cleanup
- wip
- clean-up
- cleanup
- cleanup
- wip
- pad ubatches
- test fixes
- fix CPU backend
- fix typo
- Update vllm/v1/worker/gpu_model_runner.py (Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>; Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>)
- format
1 parent 02ccc8d commit 16722b9

File tree

12 files changed: +257 additions, −5792 deletions


docs/design/cuda_graphs.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -89,7 +89,7 @@ class BatchDescriptor(NamedTuple):
     has_lora: bool = False
 ```
 
-where `num_tokens` can be the padded token length, and `uniform` indicates if all the requests have the same query lengths. Many attention backends only support full cudagraphs when the batches are uniform; pure decode batches are uniform but may not be query length 1 (i.e. `num_tokens == num_reqs`), this occurs in the validation pass of spec-decode where "decode" batches will have a query length of `1+num_spec_tokens`.
+where `num_tokens` can be the padded token length, and `uniform` indicates if all the requests have the same query lengths. Many attention backends only support full cudagraphs when the batches are uniform; pure decode batches are uniform but may not be query length 1 (i.e. `num_tokens == num_reqs`), this occurs in the validation pass of spec-decode where "decode" batches will have a query length of `1+num_spec_tokens`.
 
 The goal of this structure is to uniquely identify a (padded) batch with minimal possible items corresponding to a CUDA Graphs item.
 
````
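For reference, the `BatchDescriptor` touched by this hunk can be sketched as a hashable `NamedTuple` used to key captured CUDA graphs. This is a minimal sketch based only on the excerpt in the diff above: the `num_tokens` and `uniform` semantics come from the quoted doc text, `has_lora` from the quoted code, and the field order and defaults are assumptions rather than the actual vLLM definition.

```python
from typing import NamedTuple


class BatchDescriptor(NamedTuple):
    """Sketch of the descriptor from docs/design/cuda_graphs.md.

    Field order and defaults are assumptions; only the field names and
    their meanings are taken from the excerpt above.
    """
    num_tokens: int        # padded token count for the batch
    uniform: bool = False  # all requests share the same query length
    has_lora: bool = False


# A pure-decode batch during the spec-decode validation pass: every
# request has query length 1 + num_spec_tokens, so the batch is uniform
# even though num_tokens != num_reqs.
num_reqs, num_spec_tokens = 8, 2
desc = BatchDescriptor(num_tokens=num_reqs * (1 + num_spec_tokens), uniform=True)
print(desc.num_tokens)  # 24

# Because NamedTuples hash by value, the descriptor can key a cache of
# captured graphs: equal batch shapes map to the same entry.
captured_graphs = {desc: "graph-handle"}
print(BatchDescriptor(24, True) in captured_graphs)  # True
```

A value-hashable key like this is what lets "uniquely identify a (padded) batch with minimal possible items" work as a dictionary lookup: two batches with the same padded size, uniformity, and LoRA status reuse the same captured graph.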

gsm8k-results-pr/llama3-8b-pad-before-metadata-flashinfer/meta-llama__Meta-Llama-3-8B-Instruct/results_2025-11-12T05-15-05.443105.json

Lines changed: 0 additions & 160 deletions
This file was deleted.
