Summary
vLLM 0.12.0 V1 engine fails to load a 14GB NVFP4-quantized model on RTX 5080 16GB due to overly aggressive pre-flight memory validation. The same model loads successfully with llama.cpp (GGUF Q4_K_M format), proving the model fits in VRAM.
Environment
- GPU: NVIDIA GeForce RTX 5080 (16GB VRAM, SM120/Blackwell)
- Driver: 591.44
- CUDA: 13.0 (required for SM120 support)
- vLLM: 0.12.0
- OS: Ubuntu 22.04 (WSL2)
- Model: Qwen3-Coder-REAP-25B-NVFP4-v2 (14GB, compressed-tensors quantization)
Issue Description
What happens:
- Model weights load successfully (14.0 GiB, ~90 seconds)
- V1 engine performs memory validation check
- Check fails because `gpu_memory_utilization * total_memory` exceeds free memory
- Server crashes before attempting to allocate the KV cache
Error 1 - Initial startup check (gpu_memory_utilization=0.92):
ValueError: Free memory on device (14.55/15.92 GiB) on startup is less than desired GPU memory utilization (0.92, 14.65 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
Error 2 - After model loads (gpu_memory_utilization=0.95):
INFO: Model loading took 14.0036 GiB memory and 84.552094 seconds
INFO: Available KV cache memory: -1.09 GiB
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
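For clarity, here is the back-of-envelope arithmetic behind both errors, using the numbers from the logs above. This is a simplification of the V1 engine's accounting, not its actual code; the remaining gap to the logged -1.09 GiB is presumably profiling/activation overhead not modeled here.

```python
# Rough arithmetic behind both errors, using values from the logs above.
# Simplified illustration, not the V1 engine's actual accounting.
total_gib = 15.92            # total VRAM reported for the RTX 5080
free_at_startup_gib = 14.55  # free VRAM before loading (display/WSL2 hold the rest)
model_gib = 14.0036          # "Model loading took 14.0036 GiB"

# Error 1: pre-flight check with --gpu-memory-utilization 0.92
requested = 0.92 * total_gib                      # ~14.65 GiB
print(requested > free_at_startup_gib)            # True -> startup aborts

# Error 2: KV cache budget with --gpu-memory-utilization 0.95
requested = 0.95 * total_gib                      # ~15.12 GiB
other_usage = total_gib - free_at_startup_gib     # ~1.37 GiB (display + other processes)
kv_budget = requested - other_usage - model_gib   # ~ -0.25 GiB, already negative;
print(round(kv_budget, 2))                        # profiling overhead pushes the logged
                                                  # value further down to -1.09 GiB
```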
Reproduction
# CUDA 13.0 required for SM120
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH
vllm serve /path/to/Qwen3-Coder-REAP-25B-NVFP4-v2 \
--quantization compressed-tensors \
--gpu-memory-utilization 0.95 \
--max-model-len 2048 \
--max-num-seqs 1 \
--enforce-eager \
--trust-remote-code \
--port 8000
Root Cause Analysis
The V1 engine's memory validation in gpu_worker.py:239 and kv_cache_utils.py:686 is too conservative:
- Pre-flight check fails even when `gpu_memory_utilization * total_memory` would technically fit after model loading (see the sketch after this list)
- No consideration for lazy allocation: validation assumes all memory must be available upfront
- Display overhead not accounted for - RTX 5080 uses ~1GB for display, leaving ~15GB usable
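For illustration, a minimal sketch of the kind of upfront check described above. The function name and signature are hypothetical, not vLLM's actual code in gpu_worker.py; the point is that the requested budget is a fraction of total memory, so the check can fail even when the model itself would fit into free memory.

```python
def preflight_check(free_bytes: int, total_bytes: int,
                    gpu_memory_utilization: float) -> None:
    """Simplified sketch of an upfront memory check of this shape
    (hypothetical, not vLLM's actual implementation)."""
    # Budget is derived from *total* memory, not from what is actually free.
    requested = gpu_memory_utilization * total_bytes
    if requested > free_bytes:
        raise ValueError(
            f"Free memory on device ({free_bytes / 2**30:.2f} GiB) is less than "
            f"desired GPU memory utilization ({gpu_memory_utilization}, "
            f"{requested / 2**30:.2f} GiB)."
        )
```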
Why llama.cpp Works
The same model (GGUF Q4_K_M format) loads and runs successfully with llama.cpp because:
- Lazy KV cache allocation (allocates on-demand, not upfront)
- No pre-flight memory reservation check
- More granular GPU memory management
Suggested Fix
Consider one of:
- Lazy validation: Check KV cache memory AFTER model loads, not before
- Relaxed pre-flight: Allow startup if `model_size + min_kv_cache < available_memory` (see the sketch after this list)
- Override flag: Add `--skip-memory-check` for users who know their model fits
- Dynamic allocation: Like llama.cpp, allocate KV cache on-demand
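A minimal sketch of the "relaxed pre-flight" option, assuming estimates of the weight size and a minimum viable KV cache are available before loading. The function and its name are hypothetical, not an existing vLLM API.

```python
GIB = 2**30

def relaxed_preflight_check(model_size_bytes: int,
                            min_kv_cache_bytes: int,
                            free_bytes: int) -> None:
    """Sketch of the 'relaxed pre-flight' idea above (hypothetical): allow
    startup as long as the weights plus a minimum KV cache fit in currently
    free memory, instead of comparing gpu_memory_utilization * total_memory
    against free memory."""
    required = model_size_bytes + min_kv_cache_bytes
    if required > free_bytes:
        raise ValueError(
            f"Model ({model_size_bytes / GIB:.2f} GiB) plus minimum KV cache "
            f"({min_kv_cache_bytes / GIB:.2f} GiB) exceeds free memory "
            f"({free_bytes / GIB:.2f} GiB)."
        )
```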
Related Issues
- [Doc]: Steps to run vLLM on your RTX5080 or 5090! (#14452)
- [Bug]: w8a8 quantization not supporting sm120 (#21097)
- [NVFP4] Enable MOE support for SM_120 (RTX 5090) (#24968)
Additional Context
- RTX 5080/5090 (Blackwell/SM120) is bleeding-edge hardware
- CUDA 13.0+ required (12.5 gives `nvcc fatal: Unsupported gpu architecture 'compute_120a'`)
- FlashInfer CUTLASS kernels compile successfully with CUDA 13.0
- The 14GB model + 2GB KV cache SHOULD fit in 16GB, but validation prevents attempting it