Summary
vLLM 0.12.0 V1 engine fails to load a 14GB NVFP4-quantized model on RTX 5080 16GB due to overly aggressive pre-flight memory validation. The same model loads successfully with llama.cpp (GGUF Q4_K_M format), proving the model fits in VRAM.
Environment
- GPU: NVIDIA GeForce RTX 5080 (16GB VRAM, SM120/Blackwell)
- Driver: 591.44
- CUDA: 13.0 (required for SM120 support)
- vLLM: 0.12.0
- OS: Ubuntu 22.04 (WSL2)
- Model: Qwen3-Coder-REAP-25B-NVFP4-v2 (14GB, compressed-tensors quantization)
Issue Description
What happens:
- Model weights load successfully (14.0 GiB, ~90 seconds)
- V1 engine performs memory validation check
- Check fails because `gpu_memory_utilization * total_memory` exceeds free memory
- Server crashes before attempting to allocate the KV cache
Error 1 - Initial startup check (gpu_memory_utilization=0.92):
ValueError: Free memory on device (14.55/15.92 GiB) on startup is less than desired GPU memory utilization (0.92, 14.65 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
Error 2 - After model loads (gpu_memory_utilization=0.95):
INFO: Model loading took 14.0036 GiB memory and 84.552094 seconds
INFO: Available KV cache memory: -1.09 GiB
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
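For clarity, here is the back-of-envelope arithmetic behind both errors, using the numbers from the logs above. This is a simplification of the V1 engine's accounting, not its actual code; the remaining gap to the logged -1.09 GiB is presumably profiling/activation overhead not modeled here.

```python
# Rough arithmetic behind both errors, using values from the logs above.
# Simplified illustration, not the V1 engine's actual accounting.
total_gib = 15.92            # total VRAM reported for the RTX 5080
free_at_startup_gib = 14.55  # free VRAM before loading (display/WSL2 hold the rest)
model_gib = 14.0036          # "Model loading took 14.0036 GiB"

# Error 1: pre-flight check with --gpu-memory-utilization 0.92
requested = 0.92 * total_gib                      # ~14.65 GiB
print(requested > free_at_startup_gib)            # True -> startup aborts

# Error 2: KV cache budget with --gpu-memory-utilization 0.95
requested = 0.95 * total_gib                      # ~15.12 GiB
other_usage = total_gib - free_at_startup_gib     # ~1.37 GiB (display + other processes)
kv_budget = requested - other_usage - model_gib   # ~ -0.25 GiB, already negative;
print(round(kv_budget, 2))                        # profiling overhead pushes the logged
                                                  # value further down to -1.09 GiB
```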
Reproduction
# CUDA 13.0 required for SM120
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH
vllm serve /path/to/Qwen3-Coder-REAP-25B-NVFP4-v2 \
--quantization compressed-tensors \
--gpu-memory-utilization 0.95 \
--max-model-len 2048 \
--max-num-seqs 1 \
--enforce-eager \
--trust-remote-code \
--port 8000
Root Cause Analysis
The V1 engine's memory validation in gpu_worker.py:239 and kv_cache_utils.py:686 is too conservative:
- Pre-flight check fails even when `gpu_memory_utilization * total_memory` would technically fit after model loading (see the sketch after this list)
- No consideration for lazy allocation: validation assumes all memory must be available upfront
- Display overhead not accounted for - RTX 5080 uses ~1GB for display, leaving ~15GB usable
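For illustration, a minimal sketch of the kind of upfront check described above. The function name and signature are hypothetical, not vLLM's actual code in gpu_worker.py; the point is that the requested budget is a fraction of total memory, so the check can fail even when the model itself would fit into free memory.

```python
def preflight_check(free_bytes: int, total_bytes: int,
                    gpu_memory_utilization: float) -> None:
    """Simplified sketch of an upfront memory check of this shape
    (hypothetical, not vLLM's actual implementation)."""
    # Budget is derived from *total* memory, not from what is actually free.
    requested = gpu_memory_utilization * total_bytes
    if requested > free_bytes:
        raise ValueError(
            f"Free memory on device ({free_bytes / 2**30:.2f} GiB) is less than "
            f"desired GPU memory utilization ({gpu_memory_utilization}, "
            f"{requested / 2**30:.2f} GiB)."
        )
```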
Why llama.cpp Works
The same model (GGUF Q4_K_M format) loads and runs successfully with llama.cpp because:
- Lazy KV cache allocation (allocates on-demand, not upfront)
- No pre-flight memory reservation check
- More granular GPU memory management
Suggested Fix
Consider one of:
- Lazy validation: Check KV cache memory AFTER model loads, not before
- Relaxed pre-flight: Allow startup if `model_size + min_kv_cache < available_memory` (see the sketch after this list)
- Override flag: Add `--skip-memory-check` for users who know their model fits
- Dynamic allocation: Like llama.cpp, allocate KV cache on-demand
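A minimal sketch of the "relaxed pre-flight" option, assuming estimates of the weight size and a minimum viable KV cache are available before loading. The function and its name are hypothetical, not an existing vLLM API.

```python
GIB = 2**30

def relaxed_preflight_check(model_size_bytes: int,
                            min_kv_cache_bytes: int,
                            free_bytes: int) -> None:
    """Sketch of the 'relaxed pre-flight' idea above (hypothetical): allow
    startup as long as the weights plus a minimum KV cache fit in currently
    free memory, instead of comparing gpu_memory_utilization * total_memory
    against free memory."""
    required = model_size_bytes + min_kv_cache_bytes
    if required > free_bytes:
        raise ValueError(
            f"Model ({model_size_bytes / GIB:.2f} GiB) plus minimum KV cache "
            f"({min_kv_cache_bytes / GIB:.2f} GiB) exceeds free memory "
            f"({free_bytes / GIB:.2f} GiB)."
        )
```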
Related Issues
- [Doc]: Steps to run vLLM on your RTX5080 or 5090! (#14452)
- [Bug]: w8a8 quantization not supporting sm120 (#21097)
- [NVFP4] Enable MOE support for SM_120 (RTX 5090) (#24968)
Additional Context
- RTX 5080/5090 (Blackwell/SM120) is bleeding-edge hardware
- CUDA 13.0+ required (12.5 gives `nvcc fatal: Unsupported gpu architecture 'compute_120a'`)
- FlashInfer CUTLASS kernels compile successfully with CUDA 13.0
- The 14GB model + 2GB KV cache SHOULD fit in 16GB, but validation prevents attempting it