Commit 02ad329

Add embeddings endpoint support
Adds support for benchmarking the /v1/embeddings endpoint, enabling performance testing of text embedding models.

- Add embeddings request type to schemas
- Implement EmbeddingsResponseHandler for processing embedding responses
- Add EmbeddingsRequestFormatter for request preparation
- Implement mock server handler with synthetic embedding generation
- Add e2e and unit tests for embeddings benchmarking
- Add embeddings guide documentation

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
1 parent 0adaff9 commit 02ad329

23 files changed: +1341 -129 lines changed

docs/guides/datasets.md

Lines changed: 15 additions & 0 deletions
@@ -59,6 +59,21 @@ guidellm benchmark \
  --data '{"prompt_tokens": 256, "output_tokens": 128}'
```

For embeddings endpoints, you need to specify `output_tokens=1` (a current limitation of the synthetic data generator):

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile concurrent \
  --rate 32 \
  --max-requests 500 \
  --data "prompt_tokens=256,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5"
```

For more details on embeddings benchmarking, see the [Embeddings Guide](./embeddings.md).

#### Configuration Options

- `prompt_tokens`: Average number of tokens in prompts. If nothing else is specified, all requests will have this number of tokens.

docs/guides/embeddings.md

Lines changed: 284 additions & 0 deletions
@@ -0,0 +1,284 @@
# Embeddings Benchmarking

GuideLLM supports benchmarking embedding models through the `/v1/embeddings` endpoint. This guide covers how to set up and run benchmarks for text embedding models, which are commonly used for semantic search, clustering, and other ML tasks.

## Overview

Embedding models convert text into dense vector representations that capture semantic meaning. Benchmarking these models helps you:

- Measure throughput and latency for embedding generation
- Test performance under different load conditions
- Compare different embedding model deployments
- Optimize your embedding service configuration

## Supported Backends

### vLLM

vLLM supports embedding models starting from version 0.4.0. To serve an embedding model with vLLM:

```bash
vllm serve "BAAI/bge-small-en-v1.5"
```

Popular embedding models supported by vLLM:

- **BAAI/bge-small-en-v1.5**: Lightweight English embedding model (384 dimensions)
- **BAAI/bge-base-en-v1.5**: Base English embedding model (768 dimensions)
- **BAAI/bge-large-en-v1.5**: Large English embedding model (1024 dimensions)
- **sentence-transformers/all-MiniLM-L6-v2**: Compact general-purpose English model (384 dimensions)
- **intfloat/e5-large-v2**: High-performance English model (1024 dimensions)

For the latest list of supported models, see the [vLLM documentation](https://docs.vllm.ai/en/latest/models/supported_models.html).
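
Once the server is running, you can confirm the model is being served before starting a benchmark. A minimal check, assuming the default local port of 8000 and that `jq` is installed:

```bash
# List the model IDs exposed by the OpenAI-compatible /v1/models route
curl -s http://localhost:8000/v1/models | jq '.data[].id'
```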

### OpenAI API

GuideLLM can also benchmark OpenAI's embedding endpoints:

```bash
guidellm benchmark \
  --target "https://api.openai.com" \
  --request-type embeddings \
  --model "text-embedding-3-small" \
  --rate 5 \
  --max-requests 50 \
  --data "prompt_tokens=256,output_tokens=1" \
  --processor "gpt2"
```

Note: You'll need to set your OpenAI API key as an environment variable or in the request headers.
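
For example, a common pattern is to export the key and smoke-test the endpoint before starting the benchmark. This is a minimal sketch: the exact variable or header mechanism GuideLLM reads is not shown here, so treat the variable name as an assumption and check the backend configuration options for your setup.

```bash
# Assumption: the standard OpenAI-style environment variable; the name GuideLLM
# expects may differ, so verify against the backend configuration docs.
export OPENAI_API_KEY="sk-..."

# Smoke-test the embeddings endpoint directly before benchmarking
curl -s https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer ${OPENAI_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"model": "text-embedding-3-small", "input": "hello world"}'
```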

## Basic Benchmarking

### Simple Concurrent Benchmark (Recommended)

For embeddings, concurrent testing is the most relevant approach. To run a basic concurrent benchmark with synthetic data:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile concurrent \
  --rate 32 \
  --max-requests 100 \
  --data "prompt_tokens=256,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5"
```

This command:

- Tests with 32 concurrent requests (parallel processing)
- Stops after 100 total requests
- Uses synthetic text with ~256 tokens per request
- Uses the bge-small tokenizer for token counting
- **Note**: `output_tokens=1` is required when using synthetic data, even though embeddings don't generate output. This is a current limitation of the synthetic data generator.

## Benchmark Profiles for Embeddings

Different benchmark profiles serve different purposes when testing embedding models:

- **Concurrent** (Recommended): Tests parallel request handling - the most common production pattern for embeddings
- **Throughput**: Finds the maximum sustainable request rate - useful for capacity planning
- **Synchronous**: Sequential baseline testing - useful for measuring per-request latency without concurrency effects
- **Constant**: Fixed-rate testing - less relevant for embeddings since they have predictable processing times
- **Sweep**: Not recommended for embeddings (designed for optimizing generative model parameters)

For most embedding benchmarks, use **concurrent** or **throughput** profiles.

## Advanced Usage

### Variable Input Lengths

Test performance across different input lengths:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --rate 10 \
  --max-requests 200 \
  --data "prompt_tokens=256,prompt_tokens_min=128,prompt_tokens_max=500,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5"
```

This creates requests with uniformly distributed lengths between 128 and 500 tokens.

### Using Real Data

Benchmark with actual text data from a file or Hugging Face dataset:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --rate 10 \
  --max-requests 100 \
  --data "path/to/your/data.jsonl" \
  --data-args '{"prompt_column": "text"}' \
  --processor "BAAI/bge-small-en-v1.5"
```

Or from Hugging Face:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --rate 10 \
  --max-requests 100 \
  --data "sentence-transformers/stsb" \
  --data-args '{"prompt_column": "sentence1", "split": "test"}' \
  --processor "BAAI/bge-small-en-v1.5"
```

### Load Testing Scenarios

#### Testing Concurrent Request Handling (Recommended)

The concurrent profile is the most relevant for embeddings, as it simulates how production systems typically use embedding models (parallel batch processing):

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile concurrent \
  --rate 32 \
  --max-requests 500 \
  --data "prompt_tokens=256,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5"
```

The `--rate` parameter specifies the number of concurrent streams (e.g., 32 parallel requests).

#### Finding Maximum Throughput

Use the throughput profile to find the maximum sustainable request rate for capacity planning:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile throughput \
  --max-requests 500 \
  --data "prompt_tokens=256,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5"
```

## Metrics and Analysis

When benchmarking embeddings, GuideLLM tracks:

- **Request Latency**: Time from request start to completion
- **Time to First Token (TTFT)**: For embeddings, this is effectively the processing time
- **Throughput**: Requests processed per second
- **Token Throughput**: Input tokens processed per second
- **Success Rate**: Percentage of successful requests
- **Error Rate**: Percentage of failed requests

### Example Output

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --rate 10 \
  --max-requests 100 \
  --data "prompt_tokens=256,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5" \
  --output-path embeddings_report.json
```

The JSON report will include:

- Per-request timing and token counts
- Aggregate statistics (mean, median, percentiles)
- Request success/failure breakdown
- Overall benchmark metadata
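
For a quick look at the report without writing any analysis code, you can inspect the JSON directly. A minimal sketch, assuming `jq` is installed; the exact key layout is not reproduced here, so start from the top-level keys and drill down:

```bash
# List the top-level sections of the benchmark report
jq 'keys' embeddings_report.json

# Pretty-print the whole report for a closer read
jq '.' embeddings_report.json | less
```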

## Best Practices

1. **Match the Processor**: Use the same tokenizer as your embedding model for accurate token counting

2. **Account for Model Context Length** (a scripted version of this check is shown after this list):

   - **Check your model's limit**: Query the models endpoint to find `max_model_len`:

     ```bash
     curl -s http://localhost:8000/v1/models | python3 -m json.tool | grep "max_model_len"
     ```

     This will show something like: `"max_model_len": 512`

   - **Synthetic data overhead**: The generator adds 2-5 tokens per request to ensure uniqueness

   - **Leave headroom**: Subtract ~10 tokens from `max_model_len` for safety

   - **Examples**:

     - 512-token model → use `prompt_tokens=500` or `prompt_tokens_max=500`
     - 8192-token model → use up to `prompt_tokens=8180`

   - **Error symptom**: "maximum context length exceeded" errors mean your tokens + prefix > model limit

3. **Start with Low Rates**: Begin with conservative request rates and gradually increase

4. **Use Realistic Data**: Test with data similar to your production workload

5. **Test Multiple Scenarios**: Vary input lengths, batch sizes, and request patterns

6. **Monitor System Resources**: Watch CPU, memory, and GPU utilization during benchmarks

7. **Run Multiple Iterations**: Execute benchmarks several times to account for variance
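
The context-length check from practice 2 can be scripted. A minimal sketch, assuming a local vLLM-style server that reports `max_model_len` in its `/v1/models` response and that `jq` is installed:

```bash
# Read the model's context limit and leave ~10 tokens of headroom for the
# synthetic data generator's unique request prefix.
MAX_LEN=$(curl -s http://localhost:8000/v1/models | jq '.data[0].max_model_len')
SAFE_PROMPT_TOKENS=$((MAX_LEN - 10))
echo "Use --data \"prompt_tokens=${SAFE_PROMPT_TOKENS},output_tokens=1\" or lower"
```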

## Examples

### Short Context Embeddings (128-512 tokens)

Typical for BERT-style models with concurrent processing:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile concurrent \
  --rate 32 \
  --max-requests 500 \
  --data "prompt_tokens=256,prompt_tokens_min=128,prompt_tokens_max=500,output_tokens=1" \
  --processor "BAAI/bge-small-en-v1.5"
```

This tests with 32 concurrent streams, which matches common production patterns. Using `prompt_tokens_max=500` instead of 512 leaves headroom for the synthetic data generator's unique request prefix.

### Long Context Embeddings (8k-32k tokens)

For newer long-context embedding models (lower concurrency due to the larger context):

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile concurrent \
  --rate 8 \
  --max-requests 100 \
  --data "prompt_tokens=16384,prompt_tokens_min=8192,prompt_tokens_max=32768,output_tokens=1" \
  --processor "jinaai/jina-embeddings-v3"
```

### Production Simulation

Simulate a realistic production workload with variable input lengths:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --request-type embeddings \
  --profile concurrent \
  --rate 16 \
  --max-requests 1000 \
  --data "prompt_tokens=256,prompt_tokens_stdev=100,output_tokens=1,samples=1000" \
  --data-sampler random \
  --processor "BAAI/bge-base-en-v1.5" \
  --output-path production_simulation.json
```

This runs a comprehensive benchmark with 1000 requests and variable-length inputs (using a standard deviation on prompt length), closely mimicking real-world usage patterns.

src/guidellm/backends/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,7 @@
from .response_handlers import (
    AudioResponseHandler,
    ChatCompletionsResponseHandler,
    EmbeddingsResponseHandler,
    GenerationResponseHandler,
    GenerationResponseHandlerFactory,
    TextCompletionsResponseHandler,
@@ -26,6 +27,7 @@
    "Backend",
    "BackendType",
    "ChatCompletionsResponseHandler",
    "EmbeddingsResponseHandler",
    "GenerationResponseHandler",
    "GenerationResponseHandlerFactory",
    "OpenAIHTTPBackend",

src/guidellm/backends/openai.py

Lines changed: 1 addition & 0 deletions
@@ -87,6 +87,7 @@ def __init__(
            "chat_completions": "v1/chat/completions",
            "audio_transcriptions": "v1/audio/transcriptions",
            "audio_translations": "v1/audio/translations",
            "embeddings": "v1/embeddings",
        }
        self.response_handlers = response_handlers
        self.timeout = timeout
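
The new route mapping means `--request-type embeddings` sends requests to the server's `v1/embeddings` path. As a quick sanity check that the route is reachable before benchmarking, a minimal sketch assuming the local vLLM deployment used in the guide above:

```bash
curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-small-en-v1.5", "input": "hello world"}'
```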
