Commit 1b9b79c
feat(isb1): add KV cache stress benchmark with multi-turn synthetic traces
Add ISB-1 (Inference Stress Benchmark) — a multi-turn, long-context KV cache
stress testing dataset for InferenceX V3.
## What this adds
**35 synthetic multi-turn traces** across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60-95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts
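
The prefix-overlap figure above can be illustrated with a minimal sketch. It assumes overlap is measured as the fraction of a turn's prompt tokens that match the previous turn's prompt from the start; the commit does not state how the 60-95% range was computed, and the function names and token IDs here are hypothetical.

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_overlap(prev_prompt: list[int], next_prompt: list[int]) -> float:
    """Fraction of next_prompt covered by a prefix shared with prev_prompt."""
    if not next_prompt:
        return 0.0
    return shared_prefix_len(prev_prompt, next_prompt) / len(next_prompt)


# Hypothetical token IDs: turn 2 resends turn 1's context and appends 2 tokens.
turn1 = [1, 2, 3, 4, 5, 6, 7, 8]
turn2 = turn1 + [9, 10]
print(f"{prefix_overlap(turn1, turn2):.0%}")  # prints 80%
```

A prefix cache can only reuse KV blocks for the shared leading span, which is why overlap is counted from the first token rather than as a set intersection.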
**Export bundles** for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test
**10 KV stress sweep configs** (isb1-kv-stress-pr993.yaml):
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s
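
As a rough sketch of the sweep dimensions for a single config, the grid can be enumerated as below. The doubling steps between 2 and 256 are an assumption (the commit gives only the range), and the dictionary keys are hypothetical rather than the schema of isb1-kv-stress-pr993.yaml.

```python
from itertools import product

# Assumption: concurrency doubles at each step; only the 2→256 range is stated.
CONCURRENCY = [2 ** k for k in range(1, 9)]  # 2, 4, 8, ..., 256
OFFLOAD_MODES = ["on", "off", "noprefix"]    # modes named in the commit
DURATION_S = 1800                            # per-point run time in seconds


def sweep_points() -> list[dict]:
    """Enumerate every (concurrency, offload mode) pair for one config."""
    return [
        {"concurrency": c, "offload": mode, "duration_s": DURATION_S}
        for c, mode in product(CONCURRENCY, OFFLOAD_MODES)
    ]


points = sweep_points()
print(len(points))  # prints 24
```

Under these assumptions each config covers 8 concurrency levels × 3 offload modes.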
## Coexistence with kv-cache-tester
This dataset complements PR #993's kv-cache-tester (522 real Claude Code traces):
- kv-cache-tester: real workload distribution, natural performance profile
- ISB1: controlled KV stress patterns that force offload cliffs and cache pressure
No files in experimental/multiturn/ are modified. Separate config files, separate
data directory (datasets/isb1/), shared replay infrastructure.
## Benchmark infrastructure
- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level
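
The Pareto frontier step can be sketched as follows. This is an illustration under stated assumptions, not the logic in process_result_isb1.py: the `Point` type, field names, and sample numbers are hypothetical, with higher throughput and lower p99 TTFT treated as better.

```python
from typing import NamedTuple


class Point(NamedTuple):
    concurrency: int
    throughput: float  # tokens/s, higher is better
    p99_ttft: float    # seconds, lower is better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep points that no other point beats on both throughput and p99 TTFT."""
    frontier = []
    best_ttft = float("inf")
    # Walk from highest throughput down; a point survives only if it
    # improves on the lowest TTFT seen among faster points.
    for p in sorted(points, key=lambda p: (-p.throughput, p.p99_ttft)):
        if p.p99_ttft < best_ttft:
            frontier.append(p)
            best_ttft = p.p99_ttft
    return frontier


# Hypothetical results, one per concurrency level.
results = [
    Point(2, 100.0, 0.2),
    Point(8, 300.0, 0.5),
    Point(32, 280.0, 0.9),   # dominated by concurrency 8
    Point(128, 500.0, 2.0),
]
print([p.concurrency for p in pareto_frontier(results)])  # prints [128, 8, 2]
```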
## Why this matters (from GTC 2026)
> "Right now the benchmarks are kind of showing the worst the chips will
> actually perform... for V3 we want to add agentic benchmarks like really
> good representative multi-turn QA chat benchmarks where there are a ton
> of client sessions each with multiple turns and we'll enable prefix caching."
> — Cameron Quilici
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 6cb8291 commit 1b9b79c
162 files changed
Lines changed: 27657 additions & 99 deletions
File tree
- .github
  - configs
  - workflows
- benchmarks
  - single_node
- datasets/isb1
  - exports
    - core
      - sglang
      - vllm
    - extension_131k
      - sglang
      - vllm
    - extension_32k
      - sglang
      - vllm
    - extension_64k
      - sglang
      - vllm
    - preview
      - long_context_1m
      - long_context_500k
  - scripts
- experimental
  - multiturn
    - vllm_benchmark
      - aiperf_traces
      - kv-cache-tester
        - traces
      - launch
      - scripts
- runners
- utils
  - bench_serving
  - matrix_logic