
Commit 16722b9

remove files
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

Squashed follow-up commit messages (each also Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> unless noted):

- cleanup
- cleanup
- review comment
- remove dead code
- cleanup
- fix doc error
- cleanup
- wip
- clean-up
- cleanup
- cleanup
- wip
- pad ubatches
- test fixes
- fix CPU backend
- fix typo
- Update vllm/v1/worker/gpu_model_runner.py (Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>; Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>)
- format
1 parent 02ccc8d commit 16722b9

File tree

12 files changed: +257 additions, −5792 deletions


docs/design/cuda_graphs.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -89,7 +89,7 @@ class BatchDescriptor(NamedTuple):
     has_lora: bool = False
 ```
 
-where `num_tokens` can be the padded token length, and `uniform` indicates if all the requests have the same query lengths. Many attention backends only support full cudagraphs when the batches are uniform; pure decode batches are uniform but may not be query length 1 (i.e. `num_tokens == num_reqs`), this occurs in the validation pass of spec-decode where "decode" batches will have a query length of `1+num_spec_tokens`.
+where `num_tokens` can be the padded token length, and `uniform` indicates if all the requests have the same query lengths. Many attention backends only support full cudagraphs when the batches are uniform; pure decode batches are uniform but may not be query length 1 (i.e. `num_tokens == num_reqs`), this occurs in the validation pass of spec-decode where "decode" batches will have a query length of `1+num_spec_tokens`.
 
 The goal of this structure is to uniquely identify a (padded) batch with minimal possible items corresponding to a CUDA Graphs item.
 
````
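For reference, the `BatchDescriptor` touched by this hunk can be sketched as a hashable `NamedTuple` used to key captured CUDA graphs. This is a minimal sketch based only on the excerpt in the diff above: the `num_tokens` and `uniform` semantics come from the quoted doc text, `has_lora` from the quoted code, and the field order and defaults are assumptions rather than the actual vLLM definition.

```python
from typing import NamedTuple


class BatchDescriptor(NamedTuple):
    """Sketch of the descriptor from docs/design/cuda_graphs.md.

    Field order and defaults are assumptions; only the field names and
    their meanings are taken from the excerpt above.
    """
    num_tokens: int        # padded token count for the batch
    uniform: bool = False  # all requests share the same query length
    has_lora: bool = False


# A pure-decode batch during the spec-decode validation pass: every
# request has query length 1 + num_spec_tokens, so the batch is uniform
# even though num_tokens != num_reqs.
num_reqs, num_spec_tokens = 8, 2
desc = BatchDescriptor(num_tokens=num_reqs * (1 + num_spec_tokens), uniform=True)
print(desc.num_tokens)  # 24

# Because NamedTuples hash by value, the descriptor can key a cache of
# captured graphs: equal batch shapes map to the same entry.
captured_graphs = {desc: "graph-handle"}
print(BatchDescriptor(24, True) in captured_graphs)  # True
```

A value-hashable key like this is what lets "uniquely identify a (padded) batch with minimal possible items" work as a dictionary lookup: two batches with the same padded size, uniformity, and LoRA status reuse the same captured graph.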

gsm8k-results-pr/llama3-8b-pad-before-metadata-flashinfer/meta-llama__Meta-Llama-3-8B-Instruct/results_2025-11-12T05-15-05.443105.json

Lines changed: 0 additions & 160 deletions
This file was deleted.
