Hi guys,
I am trying to run the Qwen 3B Instruct model on a GPU with 24 GB of VRAM, but vLLM runs out of memory while it is capturing CUDA graphs. It looks like setting vLLM's gpu_memory_utilization config to around 0.7 would free up enough GPU memory. Is there a way to pass this flag during backend initialization?

Another interesting detail: this only happens when I run on Databricks with a 24 GB GPU; on my local machine with an RTX 3090 it runs fine. I am not sure what the cause is. Thank you.
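For reference, this is the kind of setting I have in mind. It's a minimal sketch using vLLM's own LLM constructor; the model id is just illustrative, and I'm assuming the backend could forward these keyword arguments through to the engine:

```python
from vllm import LLM

# Sketch of constructing the vLLM engine directly with the two knobs that
# seem relevant here:
#   - gpu_memory_utilization caps the fraction of VRAM vLLM pre-allocates
#     (the default is 0.9).
#   - enforce_eager=True skips CUDA graph capture entirely, which is the
#     step that OOMs on the 24 GB card.
llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",  # illustrative model id
    gpu_memory_utilization=0.7,
    enforce_eager=True,
)
```

If the backend currently only takes a model name, exposing a way to pass extra engine kwargs down to this constructor would cover this case.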