# vLLM LoRA Feature Test Suite

This test suite validates vLLM's LoRA adapter memory management, including LRU eviction, CPU-GPU swapping, pinning, and API behaviors.

## Prerequisites

- Python 3.10+
- vLLM installed with LoRA support
- GPU with sufficient memory (recommended: 24GB+ for Qwen3-8B)
- Required packages: `pytest`, `pytest-asyncio`, `torch`, `safetensors`, `transformers`, `vllm`

```bash
pip install pytest pytest-asyncio torch safetensors transformers vllm
```

## Quick Start

```bash
# 1. Generate test LoRA adapters
python create_test_loras.py --base-model Qwen/Qwen3-8B --num-loras 10 --ranks 8,16,32

# 2. Run all tests
python run_all_tests.py
```

## Test Coverage

The test suite validates:
- LRU eviction behavior
- CPU-GPU swap latency
- LoRA pinning functionality
- Memory isolation between LoRA and KV cache
- Memory budget and pre-allocation
- Batching constraints with `max_loras`
- Concurrent request handling
- API idempotency

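All of these behaviors are driven through vLLM's LoRA-enabled engine. As a minimal sketch of the API the tests build on — the adapter name, path, and engine parameters below are illustrative, not the suite's exact settings:

```python
# Minimal sketch of serving a LoRA adapter with vLLM's offline API.
# Paths and parameters are illustrative; the suite's real settings
# live in test_config.py.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen3-8B",
    enable_lora=True,   # reserve LoRA slots alongside the base weights
    max_loras=2,        # adapters resident on GPU at once; extras are LRU-evicted
    max_lora_rank=32,   # must be >= the rank of any adapter loaded
)

# A LoRARequest is (name, unique integer id, path); requests with different
# ids compete for the max_loras GPU slots, which is what the eviction,
# swap, and pinning tests exercise.
outputs = llm.generate(
    ["Hello, how are you today?"],
    SamplingParams(max_tokens=50),
    lora_request=LoRARequest("lora_rank8_0", 1, "/tmp/test_loras/lora_rank8_0"),
)
print(outputs[0].outputs[0].text)
```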

## Artifact Preparation

### Option 1: Generate Dummy LoRAs (Recommended for Testing)

Generate dummy LoRA adapters that match your base model's architecture:

```bash
# Basic usage - creates 10 LoRAs with rank 8
python create_test_loras.py --base-model Qwen/Qwen3-8B

# Multiple ranks for swap latency testing
python create_test_loras.py --base-model Qwen/Qwen3-8B --num-loras 10 --ranks 8,16,32

# Include MLP modules for larger LoRAs
python create_test_loras.py --base-model Qwen/Qwen3-8B \
    --target-modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj

# Custom output directory
python create_test_loras.py --base-model Qwen/Qwen3-8B \
    --output-dir /path/to/loras
```
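
Each adapter should land on disk as a standard PEFT-style directory (`adapter_config.json` plus `adapter_model.safetensors`). A quick way to inspect one — the directory name and the PEFT layout are assumptions here, not the script's documented output:

```python
# Inspect one generated adapter; assumes the standard PEFT layout of
# adapter_config.json + adapter_model.safetensors.
import json
from safetensors import safe_open

lora_dir = "/tmp/test_loras/lora_rank8_0"  # hypothetical directory name

with open(f"{lora_dir}/adapter_config.json") as f:
    cfg = json.load(f)
print("rank:", cfg["r"], "targets:", cfg["target_modules"])

with safe_open(f"{lora_dir}/adapter_model.safetensors", framework="pt") as f:
    for name in list(f.keys())[:4]:  # peek at the first few tensors
        print(name, f.get_slice(name).get_shape())
```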

#### Estimate LoRA Sizes

```bash
python create_test_loras.py --base-model Qwen/Qwen3-8B --ranks 8,16,32,64 --estimate
```
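
The estimate is also easy to sanity-check by hand: per target module, a rank-r adapter stores an A matrix of shape (r, d_in) and a B matrix of shape (d_out, r), i.e. r * (d_in + d_out) parameters. A rough sketch with illustrative dimensions (read the real ones from the model's `config.json`):

```python
# Back-of-the-envelope LoRA checkpoint size (bf16 = 2 bytes per parameter).
# Layer count and projection shapes are illustrative placeholders, not
# Qwen3-8B's verified config.
BYTES_PER_PARAM = 2
num_layers = 36
modules = {  # module -> (d_in, d_out)
    "q_proj": (4096, 4096), "k_proj": (4096, 1024),
    "v_proj": (4096, 1024), "o_proj": (4096, 4096),
}

for rank in (8, 16, 32, 64):
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in modules.values())
    total_bytes = per_layer * num_layers * BYTES_PER_PARAM
    print(f"rank {rank:>2}: ~{total_bytes / 2**20:.1f} MiB")
```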

#### Verify Generated Adapters

```bash
python create_test_loras.py --verify --output-dir /tmp/test_loras
```

### Option 2: Use Real LoRA from HuggingFace

Edit `test_config.py` to use a real LoRA adapter:

```python
# In test_config.py
LORA_MODEL = "your-org/your-lora-adapter"
```

The test suite automatically falls back to `LORA_MODEL` if generated LoRAs are not found.
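
The fallback amounts to a small resolver; a hypothetical sketch of the idea (the function and its check are illustrative, not the suite's actual code):

```python
# Hypothetical resolver illustrating the fallback order; the suite's
# real logic may differ.
from pathlib import Path

LORA_BASE_PATH = "/tmp/test_loras"
LORA_MODEL = "your-org/your-lora-adapter"

def resolve_lora_source() -> str:
    """Prefer locally generated adapters; otherwise fall back to LORA_MODEL."""
    base = Path(LORA_BASE_PATH)
    if base.is_dir() and any(base.glob("*/adapter_config.json")):
        return str(base)
    return LORA_MODEL  # HuggingFace Hub id used as fallback
```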

## Running Tests

### Run All Tests

```bash
python run_all_tests.py
```

Options:
- `--timeout 600` - Timeout per test file in seconds (default: 600)
- `--quiet` - Reduce output verbosity

### Run Specific Tests with pytest

```bash
# All tests in a file
pytest test_lru_swap_pinning.py -v -s

# Specific test class
pytest test_memory.py::TestMemoryBudget -v -s

# Specific test method
pytest test_api.py::TestDynamicLoading::test_reload_same_lora -v -s

# Run tests matching a keyword
pytest -v -s -k "eviction"
```

## Configuration

Edit `test_config.py` to customize:

```python
# Base model
BASE_MODEL = "Qwen/Qwen3-8B"

# Generated LoRA directory
LORA_BASE_PATH = "/tmp/test_loras"

# Fallback LoRA (if generated not found)
LORA_MODEL = "mtzig/qwen3-8b-tfdark-lora2"

# Test parameters
TEST_PROMPT = "Hello, how are you today?"
MAX_TOKENS = 50
```

## Troubleshooting

### LoRA Loading Fails

1. Verify the LoRA adapters exist: `python create_test_loras.py --verify`
2. Check that the adapter targets the same base model architecture
3. Ensure `max_lora_rank` is >= the actual LoRA rank (see the sketch below)
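
On the last point, the rank cap is fixed at engine construction time; a minimal sketch (values illustrative):

```python
# max_lora_rank caps the rank of any adapter the engine will accept;
# a rank-32 adapter cannot load into an engine built with max_lora_rank=16.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",
    enable_lora=True,
    max_lora_rank=32,  # >= the largest generated rank (8, 16, 32)
)
```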