# Embeddings Benchmarking

GuideLLM supports benchmarking embedding models through the `/v1/embeddings` endpoint. This guide covers how to set up and run benchmarks for text embedding models, which are commonly used for semantic search, clustering, and other ML tasks.

## Overview

Embedding models convert text into dense vector representations that capture semantic meaning. Benchmarking these models helps you:

- Measure throughput and latency for embedding generation
- Test performance under different load conditions
- Compare different embedding model deployments
- Optimize your embedding service configuration

## Supported Backends

### vLLM

vLLM can serve embedding models through its OpenAI-compatible server. To serve an embedding model with vLLM:

```bash
vllm serve "BAAI/bge-small-en-v1.5"
```
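
Depending on your vLLM version, you may also need to tell the server to run the model in embedding (pooling) mode rather than text generation. The exact flag has changed across releases, so treat the following as a sketch and confirm the option name with `vllm serve --help`:

```bash
# Explicitly request embedding/pooling mode; the flag name varies by vLLM release
# (older versions used `--task embedding`), so verify with `vllm serve --help`.
vllm serve "BAAI/bge-small-en-v1.5" --task embed --port 8000
```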

Popular embedding models supported by vLLM:

- **BAAI/bge-small-en-v1.5**: Lightweight English embedding model (384 dimensions)
- **BAAI/bge-base-en-v1.5**: Base English embedding model (768 dimensions)
- **BAAI/bge-large-en-v1.5**: Large English embedding model (1024 dimensions)
- **sentence-transformers/all-MiniLM-L6-v2**: Compact English sentence-embedding model (384 dimensions)
- **intfloat/e5-large-v2**: High-performance English model (1024 dimensions)

For the latest list of supported models, see the [vLLM documentation](https://docs.vllm.ai/en/latest/models/supported_models.html).

### OpenAI API

GuideLLM can also benchmark OpenAI's embedding endpoints:

```bash
guidellm benchmark \
  --target "https://api.openai.com" \
  --request-type embeddings \
  --model "text-embedding-3-small" \
  --rate 5 \
  --max-requests 50 \
  --data "prompt_tokens=256,output_tokens=1" \
  --processor "gpt2"
```

Note: You'll need to set your OpenAI API key as an environment variable or in the request headers.
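
For example, a minimal setup might export the conventional `OPENAI_API_KEY` variable before running the benchmark. Whether GuideLLM reads this exact variable, or expects the key through its own configuration or custom request headers, depends on your GuideLLM version, so check its configuration docs:

```bash
# Assumption: the conventional OPENAI_API_KEY variable is picked up by the client;
# your GuideLLM version may instead expect the key via its own settings or headers.
export OPENAI_API_KEY="sk-your-key-here"
```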

## Basic Benchmarking

### Simple Concurrent Benchmark (Recommended)

For embeddings, concurrent testing is the most relevant approach. To run a basic concurrent benchmark with synthetic data:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile concurrent \
  --rate 32 \
  --max-requests 100 \
  --data "prompt_tokens=256,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5"
```

This command:

- Tests with 32 concurrent requests (parallel processing)
- Stops after 100 total requests
- Uses synthetic text with ~256 tokens per request
- Uses the bge-small tokenizer for token counting
- **Note**: `output_tokens=1` is required when using synthetic data, even though embeddings don't generate output. This is a current limitation of the synthetic data generator.
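
For reference, each request issued by the command above is an OpenAI-style embeddings call. A minimal equivalent with `curl` looks like the following; the exact payload GuideLLM sends may include additional fields, so treat this as a sketch of the request shape:

```bash
# Minimal OpenAI-compatible /v1/embeddings request against the local server.
curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "BAAI/bge-small-en-v1.5",
        "input": "Embedding models map text into dense vectors."
      }'
```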

## Benchmark Profiles for Embeddings

Different benchmark profiles serve different purposes when testing embedding models:

- **Concurrent** (Recommended): Tests parallel request handling - the most common production pattern for embeddings
- **Throughput**: Finds maximum sustainable request rate - useful for capacity planning
- **Synchronous**: Sequential baseline testing - useful for measuring per-request latency without concurrency effects
- **Constant**: Fixed-rate testing - less relevant for embeddings since they have predictable processing times
- **Sweep**: Not recommended for embeddings (designed for optimizing generative model parameters)

For most embedding benchmarks, use **concurrent** or **throughput** profiles.

## Advanced Usage

### Variable Input Lengths

Test performance across different input lengths:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --rate 10 \
  --max-requests 200 \
  --data "prompt_tokens=256,prompt_tokens_min=128,prompt_tokens_max=500,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5"
```

This creates requests with uniformly distributed lengths between 128 and 500 tokens.

### Using Real Data

Benchmark with actual text data from a file or Hugging Face dataset:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --rate 10 \
  --max-requests 100 \
  --data "path/to/your/data.jsonl" \
  --data-args '{"prompt_column": "text"}' \
  --processor "BAAI/bge-small-en-v1.5"
```
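
The file passed to `--data` is expected to contain one JSON object per line, with the text to embed stored under the column named in `--data-args` (assumed here to be `text`, matching the command above). A minimal sketch of creating such a file with a shell heredoc:

```bash
# Each line is one JSON object; the "text" field matches the prompt_column above.
cat > data.jsonl <<'EOF'
{"text": "Semantic search matches a query against the most relevant documents."}
{"text": "Embedding models map sentences to dense vectors."}
{"text": "Clustering groups similar texts by vector distance."}
EOF
```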

Or from Hugging Face:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --rate 10 \
  --max-requests 100 \
  --data "sentence-transformers/stsb" \
  --data-args '{"prompt_column": "sentence1", "split": "test"}' \
  --processor "BAAI/bge-small-en-v1.5"
```

### Load Testing Scenarios

#### Testing Concurrent Request Handling (Recommended)

The concurrent profile is the most relevant for embeddings, as it simulates how production systems typically use embedding models (parallel batch processing):

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile concurrent \
  --rate 32 \
  --max-requests 500 \
  --data "prompt_tokens=256,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5"
```

The `--rate` parameter specifies the number of concurrent streams (e.g., 32 parallel requests).

#### Finding Maximum Throughput

Use the throughput profile to find the maximum sustainable request rate for capacity planning:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile throughput \
  --max-requests 500 \
  --data "prompt_tokens=256,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5"
```

## Metrics and Analysis

When benchmarking embeddings, GuideLLM tracks:

- **Request Latency**: Time from request start to completion
- **Time to First Token (TTFT)**: Since embeddings return no streamed tokens, this is effectively the total processing time
- **Throughput**: Requests processed per second
- **Token Throughput**: Input tokens processed per second
- **Success Rate**: Percentage of successful requests
- **Error Rate**: Percentage of failed requests

### Example Output

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --rate 10 \
  --max-requests 100 \
  --data "prompt_tokens=256,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5" \
  --output-path embeddings_report.json
```

The JSON report will include:

- Per-request timing and token counts
- Aggregate statistics (mean, median, percentiles)
- Request success/failure breakdown
- Overall benchmark metadata
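
The exact report schema can vary between GuideLLM versions, so a simple first step is to pretty-print the JSON and scan the top-level keys before digging into specific statistics:

```bash
# Pretty-print the report and page through it to locate the aggregate statistics.
python3 -m json.tool embeddings_report.json | less

# Or, if jq is installed, list the top-level keys first.
jq 'keys' embeddings_report.json
```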

## Best Practices

1. **Match the Processor**: Use the same tokenizer as your embedding model for accurate token counting

2. **Account for Model Context Length**:

   - **Check your model's limit**: Query the models endpoint to find `max_model_len`:

     ```bash
     curl -s http://localhost:8000/v1/models | python3 -m json.tool | grep "max_model_len"
     ```

     This will show something like: `"max_model_len": 512`

   - **Synthetic data overhead**: The generator adds 2-5 tokens per request to ensure uniqueness

   - **Leave headroom**: Subtract ~10 tokens from `max_model_len` for safety

   - **Examples**:

     - 512-token model → use `prompt_tokens=500` or `prompt_tokens_max=500`
     - 8192-token model → use up to `prompt_tokens=8180`

   - **Error symptom**: "maximum context length exceeded" errors mean your tokens + prefix > model limit

3. **Start with Low Rates**: Begin with conservative request rates and gradually increase

4. **Use Realistic Data**: Test with data similar to your production workload

5. **Test Multiple Scenarios**: Vary input lengths, batch sizes, and request patterns

6. **Monitor System Resources**: Watch CPU, memory, and GPU utilization during benchmarks

7. **Run Multiple Iterations**: Execute benchmarks several times to account for variance (see the sketch below)

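As a sketch of practice 7, a small shell loop can repeat the same benchmark and write each run to its own report so results can later be compared for variance (adjust the flags to match your deployment):

```bash
# Repeat the same concurrent benchmark three times, saving one report per run.
for i in 1 2 3; do
  guidellm benchmark \
    --target "http://localhost:8000" \
    --request-type embeddings \
    --profile concurrent \
    --rate 32 \
    --max-requests 500 \
    --data "prompt_tokens=256,output_tokens=1" \
    --processor "BAAI/bge-small-en-v1.5" \
    --output-path "run_${i}.json"
done
```
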
## Examples

### Short Context Embeddings (128-512 tokens)

Typical for BERT-style models with concurrent processing:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile concurrent \
  --rate 32 \
  --max-requests 500 \
  --data "prompt_tokens=256,prompt_tokens_min=128,prompt_tokens_max=500,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5"
```

This tests with 32 concurrent streams, which matches common production patterns. Using `prompt_tokens_max=500` instead of 512 leaves headroom for the synthetic data generator's unique request prefix.

### Long Context Embeddings (8k-32k tokens)

For newer long-context embedding models (lower concurrency due to larger context):

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile concurrent \
  --rate 8 \
  --max-requests 100 \
  --data "prompt_tokens=16384,prompt_tokens_min=8192,prompt_tokens_max=32768,output_tokens=1" \
  --processor "jinaai/jina-embeddings-v3"
```

### Production Simulation

Simulate a realistic production workload with variable input lengths:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile concurrent \
  --rate 16 \
  --max-requests 1000 \
  --data "prompt_tokens=256,prompt_tokens_stdev=100,output_tokens=1,samples=1000" \
  --data-sampler random \
  --processor "BAAI/bge-base-en-v1.5" \
  --output-path production_simulation.json
```

This runs a comprehensive benchmark with 1000 requests and variable-length inputs (sampled with a standard deviation of 100 tokens around the 256-token mean), closely mimicking real-world usage patterns.