[Bug]: RTX 5080 (SM120) + NVFP4 model fails pre-flight memory check despite model fitting in VRAM #30707

@Platano78

Description

Summary

The vLLM 0.12.0 V1 engine fails to load a 14GB NVFP4-quantized model on a 16GB RTX 5080 because of overly aggressive pre-flight memory validation. The same model (in GGUF Q4_K_M form) loads and runs with llama.cpp, indicating that the weights themselves fit in VRAM.

Environment

  • GPU: NVIDIA GeForce RTX 5080 (16GB VRAM, SM120/Blackwell)
  • Driver: 591.44
  • CUDA: 13.0 (required for SM120 support)
  • vLLM: 0.12.0
  • OS: Ubuntu 22.04 (WSL2)
  • Model: Qwen3-Coder-REAP-25B-NVFP4-v2 (14GB, compressed-tensors quantization)

Issue Description

What happens:

  1. Model weights load successfully (14.0 GiB in ~85 seconds)
  2. V1 engine performs memory validation check
  3. Check fails because gpu_memory_utilization * total_memory exceeds free memory
  4. Server crashes before attempting to allocate KV cache

Error 1 - Initial startup check (gpu_memory_utilization=0.92):

ValueError: Free memory on device (14.55/15.92 GiB) on startup is less than desired GPU memory utilization (0.92, 14.65 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
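The numbers in this message follow from a direct comparison of the requested budget against free memory at startup. A minimal sketch of the arithmetic (Python with illustrative names, not vLLM's actual code):

# Sketch of the startup check implied by Error 1 (illustrative names, not vLLM internals)
total_gib = 15.92                    # total device memory reported in the error
free_gib = 14.55                     # free at startup; display/WSL2 already hold ~1.4 GiB
gpu_memory_utilization = 0.92

requested_gib = gpu_memory_utilization * total_gib    # 0.92 * 15.92 ~= 14.65 GiB
if requested_gib > free_gib:                          # 14.65 > 14.55, so startup is refused
    raise ValueError("Free memory on device is less than desired GPU memory utilization")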

Error 2 - After model loads (gpu_memory_utilization=0.95):

INFO: Model loading took 14.0036 GiB memory and 84.552094 seconds
INFO: Available KV cache memory: -1.09 GiB
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
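This second failure is the same budget applied after the weights are resident: the KV cache gets whatever remains of gpu_memory_utilization * total_memory once the weights and everything else on the device are subtracted, and that remainder is negative. A rough reconstruction of the -1.09 GiB figure, where the ~2.2 GiB of non-model usage is inferred from the reported numbers rather than measured:

# Rough reconstruction of the KV cache budget in Error 2 (illustrative, not vLLM code)
total_gib = 15.92
gpu_memory_utilization = 0.95
budget_gib = gpu_memory_utilization * total_gib        # ~15.12 GiB

model_weights_gib = 14.0036                            # "Model loading took 14.0036 GiB"
non_model_gib = 2.21                                   # assumed display + non-torch usage,
                                                       # chosen so the result matches the log

kv_cache_gib = budget_gib - model_weights_gib - non_model_gib
print(f"Available KV cache memory: {kv_cache_gib:.2f} GiB")    # -1.09 GiB -> ValueError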

Reproduction

# CUDA 13.0 required for SM120
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH

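# Error 1 was observed with --gpu-memory-utilization 0.92; Error 2 with the 0.95 used below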
vllm serve /path/to/Qwen3-Coder-REAP-25B-NVFP4-v2 \
  --quantization compressed-tensors \
  --gpu-memory-utilization 0.95 \
  --max-model-len 2048 \
  --max-num-seqs 1 \
  --enforce-eager \
  --trust-remote-code \
  --port 8000

Root Cause Analysis

The V1 engine's memory validation in gpu_worker.py:239 and kv_cache_utils.py:686 is too conservative:

  1. Pre-flight check fails even when gpu_memory_utilization * total_memory would technically fit after model loading
  2. No consideration for lazy allocation - validation assumes all memory must be available upfront
  3. Display overhead not accounted for - the RTX 5080 uses ~1GB for display, leaving ~15GB usable (see the snippet after this list)
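Point 3 is easy to check independently: the free/total pair the startup check compares against comes straight from the CUDA runtime, and on a GPU that also drives a display (or runs under WSL2) the free figure is already reduced before vLLM allocates anything. A quick way to see the same numbers, using a standard PyTorch call (reading the gap as display overhead is this report's interpretation):

# Print the free/total memory the startup check is working from
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()   # (free, total) for the current device
gib = 1024 ** 3
print(f"free={free_bytes / gib:.2f} GiB, total={total_bytes / gib:.2f} GiB")
# On this RTX 5080 the error reports 14.55 / 15.92 GiB at startup, i.e. ~1.4 GiB is
# already in use by the display / WSL2 before the model loads.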

Why llama.cpp Works

The same model (GGUF Q4_K_M format) loads and runs successfully with llama.cpp because:

  • Lazy KV cache allocation (allocates on-demand, not upfront)
  • No pre-flight memory reservation check
  • More granular GPU memory management
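To make the contrast concrete, the toy sketch below shows the two strategies described above: reserving the full KV budget before serving versus growing the cache block by block as requests arrive. It is purely illustrative Python, not code from either project:

# Toy contrast of the two KV cache strategies described above (illustrative only)
def upfront_reservation(budget_gib: float, model_gib: float, kv_gib: float) -> None:
    # What this report observes in the V1 engine: the full KV budget must fit
    # before serving starts, otherwise startup is aborted.
    if model_gib + kv_gib > budget_gib:
        raise MemoryError("refuse to start: full KV cache does not fit in the budget")

def on_demand_allocation(budget_gib: float, model_gib: float, block_gib: float) -> int:
    # What this report attributes to llama.cpp: start with the weights only and
    # add KV blocks while memory remains; any failure happens during generation.
    used, blocks = model_gib, 0
    while used + block_gib <= budget_gib:
        used += block_gib
        blocks += 1
    return blocks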

Suggested Fix

Consider one of:

  1. Lazy validation: Check KV cache memory AFTER model loads, not before
  2. Relaxed pre-flight: Allow startup if model_size + min_kv_cache < available_memory (sketched after this list)
  3. Override flag: Add --skip-memory-check for users who know their model fits
  4. Dynamic allocation: Like llama.cpp, allocate KV cache on-demand
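For option 2, a minimal sketch of what the relaxed condition could look like; relaxed_preflight_ok is a hypothetical helper, not an existing vLLM function, and min_kv_cache would have to be derived from max_model_len, the layer/head counts, and the KV dtype:

# Hypothetical relaxed pre-flight check for option 2 (not existing vLLM code)
def relaxed_preflight_ok(model_size_gib: float, min_kv_cache_gib: float, free_gib: float) -> bool:
    """Allow startup if the weights plus a minimal KV cache fit in free memory."""
    return model_size_gib + min_kv_cache_gib <= free_gib

# With this report's numbers (14.0 GiB weights, 14.55 GiB free), startup would be allowed
# as long as the minimal KV cache for --max-model-len 2048 / --max-num-seqs 1 stays
# under ~0.55 GiB, e.g. relaxed_preflight_ok(14.0, 0.4, 14.55) -> True (0.4 GiB assumed).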

Additional Context

  • The RTX 5080/5090 (Blackwell/SM120) are bleeding-edge hardware
  • CUDA 13.0+ is required (CUDA 12.5 fails with nvcc fatal: Unsupported gpu architecture 'compute_120a')
  • FlashInfer CUTLASS kernels compile successfully with CUDA 13.0
  • The 14GB model + 2GB KV cache SHOULD fit in 16GB, but validation prevents attempting it
