[experimental] add multi-turn KV cache stress benchmark traces #1032
Open
OCWC22 wants to merge 2 commits into SemiAnalysisAI:main from
Pull request overview
Adds an ISB1 “KV cache stress / multi-turn replay” benchmarking surface (data + configs + runners + analysis utilities) to enable realistic long-context, high-prefix-overlap replay and offload-mode sweeps, while keeping it isolated from the existing experimental multiturn/kv-cache-tester lane.
Changes:
- Add committed ISB1 export bundles (including preview 500K/1M lanes) and supporting ISB1 dataset documentation.
- Add ISB1 KV-stress sweep workflow/config plus result summarization + gating utilities and tests.
- Add/extend runner + single-node benchmark scripts (vLLM/SGLang + TriAttention variants) and GMI helper scripts for running/collecting sweeps.
Reviewed changes
Copilot reviewed 147 out of 150 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| utils/verify_producer_sync.py | New utility to compare producer vs consumer export trees for selected ISB1 subtrees. |
| utils/test_verify_producer_sync.py | Tests for verify_producer_sync utility (pass + content mismatch). |
| utils/test_summarize_isb1.py | Tests for ISB1 operator summary output formatting/sections. |
| utils/test_process_result.py | Adds guards/tests ensuring ISB1 replay-style results don’t go through throughput processor. |
| utils/test_gate_isb1.py | Tests for ISB1 gating logic and strict failure behavior. |
| utils/process_result.py | Adds “fail fast” guards for ISB1 replay env/payload in throughput result processor. |
| runners/lib_single_node_script.sh | New helper to resolve benchmark script paths (runtime-aware for ISB1 replay). |
| runners/launch_h200-nb.sh | Uses new script resolver; executes resolved benchmark script. |
| runners/launch_h200-dgxc-slurm.sh | Uses new script resolver; executes resolved benchmark script. |
| runners/launch_h200-cw.sh | Uses new script resolver; executes resolved benchmark script. |
| runners/launch_h100-dgxc-slurm.sh | Uses new script resolver; executes resolved benchmark script. |
| runners/launch_h100-cw.sh | Uses new script resolver; executes resolved benchmark script. |
| runners/launch_h100-cr.sh | Uses new script resolver; expands env passthrough for ISB1 replay/kv-stress. |
| runners/launch_b200-nb.sh | Uses new script resolver; executes resolved benchmark script. |
| runners/launch_b200-dgxc.sh | Uses new script resolver; expands env passthrough for ISB1 replay/kv-stress. |
| runners/launch_b200-dgxc-slurm.sh | Uses new script resolver; executes resolved benchmark script; ensures cleanup. |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_h200_vllm.sh | Adds experimental trace-replay runner script (vLLM). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_h200_sglang.sh | Adds experimental trace-replay runner script (SGLang). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_b200_vllm.sh | Adds experimental trace-replay runner script (vLLM). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_b200_sglang.sh | Adds experimental trace-replay runner script (SGLang). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_h200_vllm.sh | Adds experimental trace-replay runner script (vLLM). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_h200_sglang.sh | Adds experimental trace-replay runner script (SGLang). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_b200_vllm.sh | Adds experimental trace-replay runner script (vLLM). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_b200_sglang.sh | Adds experimental trace-replay runner script (SGLang). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_dsr1_fp8_h200_vllm.sh | Adds experimental trace-replay runner script (vLLM). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_dsr1_fp8_b200_vllm.sh | Adds experimental trace-replay runner script (vLLM). |
| experimental/multiturn/vllm_benchmark/launch/lmcache_vllm_h200.sh | Adds experimental LMCache-enabled vLLM launcher (H200). |
| experimental/multiturn/vllm_benchmark/launch/lmcache_vllm_b200.sh | Adds experimental LMCache-enabled vLLM launcher (B200). |
| experimental/multiturn/vllm_benchmark/launch/README.md | Docs for experimental LMCache launch helpers. |
| experimental/multiturn/vllm_benchmark/kv-cache-tester/traces/.gitkeep | Placeholder for external trace assets directory. |
| experimental/multiturn/vllm_benchmark/kv-cache-tester/README.md | Placeholder README describing expected kv-cache-tester population. |
| experimental/multiturn/vllm_benchmark/aiperf_traces/generate_aiperf_traces.py | Script to generate synthetic AIPerf-style sessions for replay. |
| experimental/multiturn/vllm_benchmark/README.md | Docs describing experimental parity surface and links to ISB1 scripts. |
| experimental/multiturn/vllm_benchmark/.gitignore | Ignores generated artifacts in experimental multiturn bench area. |
| experimental/multiturn/README.md | Replaces older notes with scoped “experimental notes” guidance and pointers to ISB1 ground truth. |
| experimental/README.md | Updates experimental directory warning + pointers to ISB1 ground truth docs. |
| datasets/isb1/scripts/plot_pareto.py | Adds Pareto frontier computation + optional plotting (TTFT p99 vs throughput). |
| datasets/isb1/scripts/gpu_profile_collector.sh | Adds nvidia-smi polling helper for GPU utilization/power logging. |
| datasets/isb1/scripts/gmi_test_matrix.sh | Adds a curated “matrix” driver for running portable benchmarks. |
| datasets/isb1/scripts/gmi_kv_sweep.sh | Adds concurrency × offload-mode sweep driver for portable benchmarks. |
| datasets/isb1/scripts/gmi_full_suite.sh | Adds full-suite portable runner across models/engines/bands (with skips). |
| datasets/isb1/scripts/generate_qwen35_low_band_exports.py | Generates Qwen3.5-specific low-band export bundles by rewriting filtered cells. |
| datasets/isb1/scripts/collect_sweep_results.py | Aggregates sweep results from DB or JSON dir; computes cliffs/benefits. |
| datasets/isb1/scripts/analyze_benchmark_distributions.py | Analyzes token/turn distributions for ISB1 exports or kv-cache traces. |
| datasets/isb1/scripts/adapt_trace_replay_result.py | Adapts kv-cache trace replay outputs into ISB1 replay JSON schema. |
| datasets/isb1/exports/preview/long_context_500k/manifest_qwen3.5.json | Adds preview 500k manifest (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/manifest.json | Adds preview 500k manifest (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_qwen3.5_xlc2_500k_preview_v1__vllm.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_qwen3.5_xlc2_500k_preview_v1__sglang.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_gptoss_xlc2_500k_preview_v1__vllm.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_gptoss_xlc2_500k_preview_v1__sglang.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_qwen3.5_xlc2_500k_preview_v1__vllm.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_qwen3.5_xlc2_500k_preview_v1__sglang.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_gptoss_xlc2_500k_preview_v1__vllm.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_gptoss_xlc2_500k_preview_v1__sglang.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/README.md | Documents bounded 500k-class preview lanes and claim boundary. |
| datasets/isb1/exports/preview/long_context_1m/manifest.json | Adds preview 1m manifest (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__coding_qwen3.5_ulc2_1m_preview_v1__vllm.json | Adds preview 1m export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__coding_qwen3.5_ulc2_1m_preview_v1__sglang.json | Adds preview 1m export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__chat_qwen3.5_ulc2_1m_preview_v1__vllm.json | Adds preview 1m export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__chat_qwen3.5_ulc2_1m_preview_v1__sglang.json | Adds preview 1m export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_1m/README.md | Documents gated 1M preview lane and manual config boundary. |
| datasets/isb1/exports/extension_64k/vllm/code_64k1k_qwen3.5.json | Adds extension 64k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/vllm/code_64k1k.json | Adds extension 64k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/vllm/chat_64k1k_qwen3.5.json | Adds extension 64k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/vllm/chat_64k1k.json | Adds extension 64k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/sglang/code_64k1k_qwen3.5.json | Adds extension 64k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/sglang/code_64k1k.json | Adds extension 64k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/sglang/chat_64k1k_qwen3.5.json | Adds extension 64k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/sglang/chat_64k1k.json | Adds extension 64k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/vllm/code_32k1k_qwen3.5.json | Adds extension 32k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/vllm/code_32k1k.json | Adds extension 32k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/vllm/chat_32k1k_qwen3.5.json | Adds extension 32k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/vllm/chat_32k1k.json | Adds extension 32k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/sglang/code_32k1k_qwen3.5.json | Adds extension 32k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/sglang/code_32k1k.json | Adds extension 32k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/sglang/chat_32k1k_qwen3.5.json | Adds extension 32k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/sglang/chat_32k1k.json | Adds extension 32k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/vllm/code_131k1k_qwen3.5.json | Adds/updates extension 131k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/vllm/code_131k1k.json | Adds/updates extension 131k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/vllm/chat_131k1k_qwen3.5.json | Adds/updates extension 131k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/vllm/chat_131k1k_dsr1.json | Adds/updates extension 131k DSR1 bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/vllm/chat_131k1k.json | Adds/updates extension 131k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/sglang/code_131k1k_qwen3.5.json | Adds/updates extension 131k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/sglang/code_131k1k.json | Adds/updates extension 131k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/sglang/chat_131k1k_qwen3.5.json | Adds/updates extension 131k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/sglang/chat_131k1k_dsr1.json | Adds/updates extension 131k DSR1 bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/sglang/chat_131k1k.json | Adds/updates extension 131k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/core/vllm/code_8k1k_qwen3.5.json | Adds core 8k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/core/vllm/code_8k1k.json | Adds core 8k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/core/vllm/chat_8k1k_qwen3.5.json | Adds core 8k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/core/vllm/chat_8k1k.json | Adds core 8k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/core/sglang/code_8k1k_qwen3.5.json | Adds core 8k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/core/sglang/code_8k1k.json | Adds core 8k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/core/sglang/chat_8k1k_qwen3.5.json | Adds core 8k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/core/sglang/chat_8k1k.json | Adds core 8k generic bundle (Git LFS pointer). |
| datasets/isb1/README.md | Adds ISB1 consumer-package README with coverage inventory and claim boundary. |
| datasets/isb1/GMI_EXECUTION_PLAN.md | Adds execution plan/runbook for external GMI KV-stress benchmarking. |
| datasets/isb1/COEXISTENCE_WITH_KV_CACHE_TESTER.md | Adds coexistence plan doc for ISB1 vs kv-cache-tester surfaces. |
| datasets/isb1/.gitattributes | Adds attributes for exports (linguist + EOL handling). |
| benchmarks/single_node/qwen3.5triattn_fp8_h200_vllm.sh | Adds TriAttention vLLM benchmark script (H200). |
| benchmarks/single_node/qwen3.5triattn_fp8_h100_vllm.sh | Adds TriAttention vLLM benchmark script (H100). |
| benchmarks/single_node/qwen3.5_fp8_h200_vllm.sh | Adds/updates Qwen3.5 vLLM script (H200) with ISB1-aware prefix/offload behavior. |
| benchmarks/single_node/qwen3.5_fp8_h200_sglang.sh | Adds Qwen3.5 SGLang script (H200) with ISB1-aware radix/offload behavior. |
| benchmarks/single_node/qwen3.5_fp8_h100_vllm.sh | Adds Qwen3.5 vLLM script (H100). |
| benchmarks/single_node/qwen3.5_fp8_h100_sglang.sh | Adds Qwen3.5 SGLang script (H100). |
| benchmarks/single_node/qwen3.5_fp8_b200_vllm.sh | Adds Qwen3.5 vLLM script (B200). |
| benchmarks/single_node/qwen3.5_fp8_b200_sglang.sh | Adds Qwen3.5 SGLang script (B200). |
| benchmarks/single_node/gptosstriattn_fp4_h200_vllm.sh | Adds TriAttention vLLM benchmark script for GPT-OSS (H200). |
| benchmarks/single_node/gptosstriattn_fp4_h100_vllm.sh | Adds TriAttention vLLM benchmark script for GPT-OSS (H100). |
| benchmarks/single_node/gptoss_fp4_h200_sglang.sh | Adds GPT-OSS SGLang script (H200). |
| benchmarks/single_node/gptoss_fp4_h200.sh | Updates GPT-OSS H200 script to be ISB1-aware and align to run_single_node_benchmark. |
| benchmarks/single_node/gptoss_fp4_h100_sglang.sh | Adds GPT-OSS SGLang script (H100). |
| benchmarks/single_node/gptoss_fp4_h100.sh | Updates GPT-OSS H100 script to be ISB1-aware and align to run_single_node_benchmark. |
| benchmarks/single_node/gptoss_fp4_b200_sglang.sh | Adds GPT-OSS SGLang script (B200). |
| benchmarks/single_node/gptoss_fp4_b200.sh | Updates GPT-OSS B200 script to be ISB1-aware and align to run_single_node_benchmark. |
| benchmarks/single_node/dsr1triattn_fp8_h200_vllm.sh | Adds TriAttention vLLM benchmark script for DSR1 (H200). |
| benchmarks/single_node/dsr1triattn_fp8_h100_vllm.sh | Adds TriAttention vLLM benchmark script for DSR1 (H100). |
| benchmarks/single_node/dsr1_fp8_h200_vllm.sh | Adds DSR1 vLLM script (H200). |
| benchmarks/single_node/dsr1_fp8_h200.sh | Updates DSR1 H200 SGLang script to be ISB1-aware and align to run_single_node_benchmark. |
| benchmarks/single_node/dsr1_fp8_b200_vllm.sh | Adds DSR1 vLLM script (B200). |
| benchmarks/single_node/dsr1_fp8_b200.sh | Updates DSR1 B200 SGLang script to be ISB1-aware and align to run_single_node_benchmark. |
| benchmarks/single_node/dsr1_fp4_b200.sh | Updates DSR1 FP4 B200 SGLang script to be ISB1-aware and align to run_single_node_benchmark. |
| .gitignore | Adds ignores for macOS metadata + local prompt exports + .claude. |
| .github/workflows/run-isb1-kv-stress-sweep.yml | Adds workflow_dispatch sweep driver for ISB1 KV-stress matrix runs. |
| .github/workflows/collect-results.yml | Adds ISB1-specific summary + gating report generation and uploads. |
| .github/configs/isb1-qwen-1m-preview.yaml | Adds a manual-only gated config for 1M Qwen preview runs. |
| .github/configs/isb1-kv-stress.yaml | Adds dedicated KV-stress sweep config (separate from isb1-master). |
| .gitattributes | Tracks ISB1 export JSON under Git LFS. |
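As a rough illustration of the producer vs consumer comparison that the table attributes to `utils/verify_producer_sync.py`, the sketch below hashes every file under two directory trees and reports differences. It is not the utility's actual code: the function names `tree_digests` and `compare_trees`, the SHA-256 choice, and the report keys are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(producer: Path, consumer: Path) -> dict[str, list[str]]:
    """Report files that are missing, extra, or different in the consumer tree."""
    a, b = tree_digests(producer), tree_digests(consumer)
    return {
        "missing_in_consumer": sorted(a.keys() - b.keys()),
        "extra_in_consumer": sorted(b.keys() - a.keys()),
        # Same relative path on both sides but different content.
        "content_mismatch": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

A caller could treat any non-empty list in the report as a sync failure, which would match the "pass + content mismatch" cases the test file is described as covering.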
af64122 to 1b9b79c
1b9b79c to ef90b64
…races

Add ISB-1 (Inference Stress Benchmark), a multi-turn, long-context KV cache stress testing dataset for InferenceX V3.

## What this adds

**35 synthetic multi-turn traces** across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60-95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts

**Export bundles** for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test

**10 KV stress sweep configs** (isb1-kv-stress-pr993.yaml):
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s

## Coexistence with kv-cache-tester

This dataset complements PR SemiAnalysisAI#993's kv-cache-tester (522 real Claude Code traces):
- kv-cache-tester: real workload distribution, natural performance profile
- ISB1: controlled KV stress patterns that force offload cliffs and cache pressure

No files in experimental/multiturn/ are modified. Separate config files, separate data directory (datasets/isb1/), shared replay infrastructure.

## Benchmark infrastructure

- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level

## Why this matters (from GTC 2026)

> "Right now the benchmarks are kind of showing the worst the chips will
> actually perform... for V3 we want to add agentic benchmarks like really
> good representative multi-turn QA chat benchmarks where there are a ton
> of client sessions each with multiple turns and we'll enable prefix caching."
>
> Cameron Quilici

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
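To make the sweep dimensions in the commit message concrete, here is a minimal sketch that enumerates one config's matrix rows. The commit message only states the endpoints (2 and 256 concurrent users), the three offload modes, and the 1800s duration; the doubling schedule between the endpoints, the function name `sweep_rows`, and the row field names are assumptions for illustration, not the format used by the sweep generator.

```python
from itertools import product

# Assumption: concurrency doubles from 2 up to 256; the source gives only the endpoints.
CONCURRENCY_LEVELS = [2 ** k for k in range(1, 9)]  # 2, 4, 8, ..., 256
OFFLOAD_MODES = ["on", "off", "noprefix"]           # modes named in the commit message
DURATION_S = 1800                                   # per-point run time from the commit message


def sweep_rows(model: str, gpu: str, engine: str) -> list[dict]:
    """Enumerate every (concurrency, offload mode) point for one config."""
    return [
        {
            "model": model,
            "gpu": gpu,
            "engine": engine,
            "concurrency": concurrency,
            "offload_mode": mode,
            "duration_s": DURATION_S,
        }
        for concurrency, mode in product(CONCURRENCY_LEVELS, OFFLOAD_MODES)
    ]


rows = sweep_rows("qwen3.5", "h200", "vllm")
print(len(rows))  # 8 concurrency levels x 3 offload modes
```

Under these assumptions a single config expands to 24 points, so at 1800s per point the wall-clock cost of one config is already substantial, which is worth keeping in mind when choosing the concurrency schedule.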
ef90b64 to fbe9f79
- Keep only configs whose (runtime, hardware, model) triples exist in the export files; this eliminates sweep generator failures
- Fix canonical-model-id to match export metadata (e.g., gpt_oss_120b not gptoss)
- Fix support-status to match export tiers (reviewed_preview vs unsupported)
- Remove configs for engines/GPUs not yet in exports (SGLang, Dynamo, TRT, Atom, AMD); these need export metadata updates before they can be added back
- Add workload-type field required by sweep generator schema
- Remove disagg/multinode fields not in KV stress schema

Sweep generator now passes: exit code 0, produces valid matrix rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
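The first fix in this commit, keeping only configs whose (runtime, hardware, model) triple appears in the exports, can be sketched as a set-membership filter. The dictionary keys `runtime`, `hardware`, and `canonical_model_id` and both function names are hypothetical; the real export metadata and config schema may use different field names.

```python
def supported_triples(export_cells: list[dict]) -> set[tuple[str, str, str]]:
    """Collect the (runtime, hardware, model) triples present in export metadata."""
    return {
        (cell["runtime"], cell["hardware"], cell["canonical_model_id"])
        for cell in export_cells
    }


def filter_configs(configs: list[dict], export_cells: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split configs into those backed by an export cell and those that are not."""
    allowed = supported_triples(export_cells)
    kept, dropped = [], []
    for cfg in configs:
        key = (cfg["runtime"], cfg["hardware"], cfg["canonical_model_id"])
        (kept if key in allowed else dropped).append(cfg)
    return kept, dropped
```

Returning the dropped configs alongside the kept ones makes a mismatch such as `gptoss` vs `gpt_oss_120b` visible instead of silently shrinking the matrix.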
Collaborator
Some good stuff in here. Will collab async on this one and take some stuff from this PR into experimental/agentic-benchmark MVP.
Summary
Add multi-turn, long-context KV cache stress testing traces for realistic inference benchmarking.
Why this matters
Current benchmarks use random data — no prefix caching, no multi-turn, no KV cache reuse. This adds realistic multi-turn traces that:
Sweep configuration
Each config produces a throughput vs p99 TTFT Pareto frontier across concurrency levels and offload modes.
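For readers unfamiliar with the term, the sketch below shows one common way to extract a throughput vs p99 TTFT Pareto frontier from sweep points. It is an illustration only and not the code in `datasets/isb1/scripts/plot_pareto.py`; the `SweepPoint` type, its field names, and the units are assumptions.

```python
from typing import NamedTuple


class SweepPoint(NamedTuple):
    concurrency: int
    offload_mode: str
    throughput_tok_s: float  # higher is better
    ttft_p99_ms: float       # lower is better


def pareto_frontier(points: list[SweepPoint]) -> list[SweepPoint]:
    """Return the points that no other point beats on both throughput and p99 TTFT."""
    # Visit points from highest throughput to lowest; ties go to the lower TTFT.
    ordered = sorted(points, key=lambda p: (-p.throughput_tok_s, p.ttft_p99_ms))
    frontier: list[SweepPoint] = []
    best_ttft = float("inf")
    for p in ordered:
        # Every point seen so far has throughput >= p's, so p survives
        # only if its TTFT is strictly lower than all of theirs.
        if p.ttft_p99_ms < best_ttft:
            frontier.append(p)
            best_ttft = p.ttft_p99_ms
    return frontier
```

The result is ordered from highest throughput to lowest latency, so plotting it directly traces the trade-off curve across concurrency levels and offload modes.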
Context bands
Coexistence with kv-cache-tester (PR #993)
This complements kv-cache-tester's 522 real Claude Code traces:
No files in experimental/multiturn/ are modified. Separate directory (datasets/isb1/), separate configs.
Test plan
generate_sweep_configs.py dry-run resolves all configs
benchmark_export_replay.py