Hi guys,
I am trying to run the Qwen 3B Instruct model on a GPU with 24 GB of VRAM, but vLLM runs out of memory while it is capturing CUDA graphs. It looks like setting vLLM's gpu_memory_utilization config to around 0.7 would free up enough GPU memory. Is there a way to pass this flag during backend initialization?

Another interesting detail: this only happens when I run on Databricks with a 24 GB GPU; on my local machine with an RTX 3090 it runs fine. I am not sure what the cause is. Thank you.
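For reference, this is the kind of setting I have in mind. It's a minimal sketch using vLLM's own LLM constructor; the model id is just illustrative, and I'm assuming the backend could forward these keyword arguments through to the engine:

```python
from vllm import LLM

# Sketch of constructing the vLLM engine directly with the two knobs that
# seem relevant here:
#   - gpu_memory_utilization caps the fraction of VRAM vLLM pre-allocates
#     (the default is 0.9).
#   - enforce_eager=True skips CUDA graph capture entirely, which is the
#     step that OOMs on the 24 GB card.
llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",  # illustrative model id
    gpu_memory_utilization=0.7,
    enforce_eager=True,
)
```

If the backend currently only takes a model name, exposing a way to pass extra engine kwargs down to this constructor would cover this case.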