Skip to content

[WIP][experimental] add agentic trace replay benchmark infrastructure#993

Draft
cquil11 wants to merge 68 commits intomainfrom
experimental/agentic-benchmark
Draft

[WIP][experimental] add agentic trace replay benchmark infrastructure#993
cquil11 wants to merge 68 commits intomainfrom
experimental/agentic-benchmark

Conversation

@cquil11
Copy link
Copy Markdown
Collaborator

@cquil11 cquil11 commented Apr 1, 2026

Trace replay benchmarking for agentic coding workloads using real Claude Code traces. Includes:

  • Trace replay scripts for H200, MI355X, B200 (vLLM-based)
  • kv-cache-tester submodule (trace replayer + 522 anonymized traces)
  • AIPerf submodule (alternative synthetic benchmarking)
  • Pareto frontier plotting and sweep aggregation
  • Metrics collector (prometheus scraper + visualization)
  • Workload distribution analysis
  • GitHub Actions workflow with per-TP sweep configs
  • MI355X runner SCRIPT_SUFFIX support

Trace replay benchmarking for agentic coding workloads using real
Claude Code traces. Includes:

- Trace replay scripts for H200, MI355X, B200 (vLLM-based)
- kv-cache-tester submodule (trace replayer + 522 anonymized traces)
- AIPerf submodule (alternative synthetic benchmarking)
- Pareto frontier plotting and sweep aggregation
- Metrics collector (prometheus scraper + visualization)
- Workload distribution analysis
- GitHub Actions workflow with per-TP sweep configs
- MI355X runner SCRIPT_SUFFIX support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

Comment on lines +97 to +171
runs-on: ubuntu-latest
outputs:
matrix: ${{ steps.gen.outputs.matrix }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
if: ${{ inputs.config_file != '' }}
with:
token: ${{ secrets.REPO_PAT }}
fetch-depth: 1
ref: ${{ inputs.ref || github.ref }}
sparse-checkout: ${{ inputs.config_file }}

- id: gen
run: |
pip install -q pyyaml
python3 << 'PYEOF'
import json, os, sys

config_file = "${{ inputs.config_file }}".strip()

if config_file:
import yaml
with open(config_file) as f:
full_config = yaml.safe_load(f)

config_key = "${{ inputs.config_key }}".strip()

# If config_key specified, use that section; otherwise auto-detect
if config_key and config_key in full_config:
config = full_config[config_key]
elif config_key:
print(f"ERROR: config_key '{config_key}' not found. Available: {list(full_config.keys())}")
sys.exit(1)
elif len(full_config) == 1:
config = next(iter(full_config.values()))
else:
# Check if top-level keys look like tp entries (tp2, tp4, etc.)
if all(k.startswith("tp") for k in full_config):
config = full_config
else:
print(f"ERROR: Multiple entries in config, specify --config_key. Available: {list(full_config.keys())}")
sys.exit(1)

includes = []
for key, settings in config.items():
tp = int(key.replace("tp", ""))
users = settings.get("users", [])
offloads = settings.get("offload", ["on", "off"])
ep = settings.get("ep", 0)
for u in users:
for o in offloads:
entry = {"tp": tp, "users": u, "offload": o}
if ep > 0:
entry["ep"] = ep
includes.append(entry)
else:
tp_values = json.loads('${{ inputs.tp_values }}')
user_values = json.loads('${{ inputs.user_values }}')
offload_values = json.loads('${{ inputs.offload_values }}')
includes = []
for tp in tp_values:
for u in user_values:
for o in offload_values:
includes.append({"tp": tp, "users": u, "offload": o})

matrix = {"include": includes}
print(f"Generated {len(includes)} matrix entries")
with open(os.environ["GITHUB_OUTPUT"], "a") as f:
f.write(f"matrix={json.dumps(matrix)}\n")
PYEOF

# ---------------------------------------------------------------------------
# Matrix benchmark jobs — each cell calls the multiturn template
# ---------------------------------------------------------------------------
sweep:

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
Comment on lines +172 to +198
needs: generate-matrix
uses: ./.github/workflows/benchmark-multiturn-tmpl.yml
name: sweep /
strategy:
fail-fast: false
matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix) }}
secrets: inherit
with:
runner: ${{ inputs.runner }}
image: ${{ inputs.image }}
model: ${{ inputs.model }}
precision: ${{ inputs.precision }}
exp-name: "multiturn_tp${{ matrix.tp }}_users${{ matrix.users }}_offload${{ matrix.offload }}"
tp: "${{ matrix.tp }}"
users: "${{ matrix.users }}"
offload-mode: ${{ matrix.offload }}
duration: ${{ inputs.duration }}
request-rate: ${{ inputs.request_rate }}
total-cpu-dram-gb: ${{ inputs.total_cpu_dram_gb }}
script-suffix: ${{ inputs.script_suffix }}
ep: "${{ matrix.ep || inputs.ep }}"
ref: ${{ inputs.ref }}

# ---------------------------------------------------------------------------
# Collect & aggregate results
# ---------------------------------------------------------------------------
collect:

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {}

Copilot Autofix

AI about 1 hour ago

Add an explicit top-level permissions block in .github/workflows/multiturn-sweep.yml so every job in this workflow gets restricted defaults. The safest non-breaking baseline is:

  • contents: read (needed for repository checkout/read)
  • actions: read (needed for artifact download/upload interactions in many workflows)

This preserves current behavior while documenting and enforcing least privilege. Place it near the top of the workflow (after run-name and before on:), so it applies uniformly to generate-matrix, sweep, and collect unless overridden later.

Suggested changeset 1
.github/workflows/multiturn-sweep.yml

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/multiturn-sweep.yml b/.github/workflows/multiturn-sweep.yml
--- a/.github/workflows/multiturn-sweep.yml
+++ b/.github/workflows/multiturn-sweep.yml
@@ -1,6 +1,10 @@
 name: Multi-Turn Benchmark Sweep
 run-name: "${{ inputs.run_name || format('Multi-Turn Sweep - tp={0} users={1} offload={2}', inputs.tp_values, inputs.user_values, inputs.offload_values) }}"
 
+permissions:
+  contents: read
+  actions: read
+
 on:
   # push:
   #   branches:
EOF
@@ -1,6 +1,10 @@
name: Multi-Turn Benchmark Sweep
run-name: "${{ inputs.run_name || format('Multi-Turn Sweep - tp={0} users={1} offload={2}', inputs.tp_values, inputs.user_values, inputs.offload_values) }}"

permissions:
contents: read
actions: read

on:
# push:
# branches:
Copilot is powered by AI and may make mistakes. Always verify output.
Comment on lines +199 to +231
runs-on: ubuntu-latest
needs: sweep
if: always()
name: Collect results
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
token: ${{ secrets.REPO_PAT }}
fetch-depth: 1
ref: ${{ inputs.ref || github.ref }}

- uses: actions/setup-python@v5
with:
python-version: '3.11'

- name: Install dependencies
run: pip install pandas matplotlib numpy

- name: Download all artifacts
uses: actions/download-artifact@v4
with:
pattern: 'multiturn_*'
path: results/

- name: Run aggregation
run: |
python experimental/multiturn/vllm_benchmark/scripts/collect_sweep_results.py results/ aggregated/

- name: Upload aggregated results
uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
with:
name: multiturn_aggregated
path: aggregated/

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
cquil11 and others added 25 commits April 1, 2026 15:27
Replaced by vLLM's native kv_offload metrics. Removes subprocess/threading
imports and ~100 lines of dead code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add VLLMMetricsParser and SGLangMetricsParser with shared MetricsSnapshot.
Backend is auto-detected from metrics prefix (vllm: vs sglang:) on first poll.

sglang metrics mapped:
- token_usage / num_used_tokens → kv_cache_usage
- num_running_reqs → num_requests_running
- num_queue_reqs → num_requests_waiting
- cache_hit_rate × prompt_tokens → prefix_cache_hits/queries
- num_retracted_reqs → num_preemptions
- realtime_tokens_total mode=prefill_compute/prefill_cache → token source

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replays SWE-bench/GAIA/WildClaw traces from sammshen/lmcache-agentic-traces
via AIPerf with mooncake_trace format. Downloads and converts traces at
runtime. Supports concurrency sweep with offload on/off.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --fixed-schedule to replay at exact trace timestamps
- Remove --extra-inputs ignore_eos:true (let model stop naturally)
- Remove unused REQUEST_RATE logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cessing

Drops ~18GB per artifact by excluding inputs.json, conversations.jsonl,
responses.json, GPU telemetry, raw records, and full aiperf_artifacts/.
Only uploads the specific files used by collect_sweep_results.py and
plot_pareto.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The profile_export.jsonl with 233K records was ~10GB per artifact.
Switch collect_sweep_results.py and plot_pareto.py to read from the
pre-computed profile_export_aiperf.csv (~4KB) instead. Remove the JSONL
from the artifact upload. Existing client CSV and trace_replay paths
are unchanged.

Also exclude low-FreeMem H100 nodes (1, 7, 18) to avoid
cudaMallocHost/mlock failures during vLLM CPU KV cache allocation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vLLM v0.18.0 follows the newer OpenAI API spec where the 'system'
message role was renamed to 'developer'. The LMCache traces use
'system', causing 100% 400 Bad Request errors. Also drop the 15GB
profile_export_aiperf.json from artifact uploads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The LMCache traces include explicit null values for optional fields
(tool_calls, tool_call_id, name) on every message. vLLM's strict
Pydantic validation rejects these, causing 100% HTTP 400 errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines +1073 to +1079
# for tp in sorted(df["tp"].unique()):
# tp_data = df[df["tp"] == tp]
# ax.scatter(tp_data[x_col], tp_data[y_col],
# c=tp_colors.get(tp, "purple"),
# marker=tp_markers.get(tp, "x"),
# s=40, alpha=0.15, linewidths=0.3,
# edgecolors="gray")
from pathlib import Path

import pandas as pd
import numpy as np
import sys
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
with open(metadata_file) as f:
metadata = json.load(f)
total_time_sec = metadata.get("benchmark_runtime_sec")
except Exception:
with open(metadata_file) as f:
metadata = json.load(f)
total_time_sec = metadata.get("benchmark_runtime_sec")
except Exception:
self._task.cancel()
try:
await self._task
except asyncio.CancelledError:
with open(metadata_file) as f:
metadata = json.load(f)
total_time_sec = metadata.get("benchmark_runtime_sec")
except Exception:
cquil11 and others added 9 commits April 14, 2026 11:00
The 14GB LMCache dataset mmap takes >5 minutes on some nodes,
exceeding the default 300s PROFILE_CONFIGURE_TIMEOUT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace native offloading with SimpleCPUOffloadConnector
  (VLLM_USE_SIMPLE_KV_OFFLOAD=1 + --no-disable-hybrid-kv-cache-manager)
  for ~10% better throughput and TPOT per vllm-project/vllm#37160
- Remove local_cache_hit and scheduler.py monkey-patches (fixed in
  vLLM 0.19.0+), replace with version check warning
- Add AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT=1800 to H200 and B200
  (H100 already had it)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same changes as the aiperf scripts: replace native offloading with
SimpleCPUOffloadConnector, remove monkey-patches fixed in vLLM 0.19.0+.

Applies to: B200 trace replay, H200 trace replay, MI355X trace replay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Passes ignore_eos=true to vLLM via extra_body when IGNORE_EOS=true,
forcing exact output token count from traces. Plumbed through:
- kv-cache-tester: --ignore-eos CLI flag
- trace replay scripts: conditional on IGNORE_EOS env var
- GH Actions: ignore_eos workflow dispatch input

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use github.sha instead of github.ref so in-flight sweep jobs
don't pick up new commits pushed to the branch mid-run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cquil11 cquil11 force-pushed the experimental/agentic-benchmark branch from f35c1fe to ea1013d Compare April 14, 2026 20:50
OCWC22 added a commit to OCWC22/InferenceX that referenced this pull request Apr 15, 2026
…races

Add ISB-1 (Inference Stress Benchmark) — a multi-turn, long-context KV cache
stress testing dataset for InferenceX V3.

## What this adds

**35 synthetic multi-turn traces** across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60-95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts

**Export bundles** for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test

**10 KV stress sweep configs** (isb1-kv-stress-pr993.yaml):
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s

## Coexistence with kv-cache-tester

This dataset complements PR SemiAnalysisAI#993's kv-cache-tester (522 real Claude Code traces):
- kv-cache-tester: real workload distribution, natural performance profile
- ISB1: controlled KV stress patterns that force offload cliffs and cache pressure

No files in experimental/multiturn/ are modified. Separate config files, separate
data directory (datasets/isb1/), shared replay infrastructure.

## Benchmark infrastructure
- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level

## Why this matters (from GTC 2026)
> "Right now the benchmarks are kind of showing the worst the chips will
> actually perform... for V3 we want to add agentic benchmarks like really
> good representative multi-turn QA chat benchmarks where there are a ton
> of client sessions each with multiple turns and we'll enable prefix caching."
> — Cameron Quilici

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCWC22 added a commit to OCWC22/InferenceX that referenced this pull request Apr 15, 2026
…races

Add ISB-1 (Inference Stress Benchmark) — a multi-turn, long-context KV cache
stress testing dataset for InferenceX V3.

## What this adds

**35 synthetic multi-turn traces** across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60-95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts

**Export bundles** for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test

**10 KV stress sweep configs** (isb1-kv-stress-pr993.yaml):
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s

## Coexistence with kv-cache-tester

This dataset complements PR SemiAnalysisAI#993's kv-cache-tester (522 real Claude Code traces):
- kv-cache-tester: real workload distribution, natural performance profile
- ISB1: controlled KV stress patterns that force offload cliffs and cache pressure

No files in experimental/multiturn/ are modified. Separate config files, separate
data directory (datasets/isb1/), shared replay infrastructure.

## Benchmark infrastructure
- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level

## Why this matters (from GTC 2026)
> "Right now the benchmarks are kind of showing the worst the chips will
> actually perform... for V3 we want to add agentic benchmarks like really
> good representative multi-turn QA chat benchmarks where there are a ton
> of client sessions each with multiple turns and we'll enable prefix caching."
> — Cameron Quilici

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCWC22 added a commit to OCWC22/InferenceX that referenced this pull request Apr 15, 2026
…races

Add ISB-1 (Inference Stress Benchmark) — a multi-turn, long-context KV cache
stress testing dataset for InferenceX V3.

## What this adds

**35 synthetic multi-turn traces** across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60-95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts

**Export bundles** for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test

**10 KV stress sweep configs** (isb1-kv-stress-pr993.yaml):
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s

## Coexistence with kv-cache-tester

This dataset complements PR SemiAnalysisAI#993's kv-cache-tester (522 real Claude Code traces):
- kv-cache-tester: real workload distribution, natural performance profile
- ISB1: controlled KV stress patterns that force offload cliffs and cache pressure

No files in experimental/multiturn/ are modified. Separate config files, separate
data directory (datasets/isb1/), shared replay infrastructure.

## Benchmark infrastructure
- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level

## Why this matters (from GTC 2026)
> "Right now the benchmarks are kind of showing the worst the chips will
> actually perform... for V3 we want to add agentic benchmarks like really
> good representative multi-turn QA chat benchmarks where there are a ton
> of client sessions each with multiple turns and we'll enable prefix caching."
> — Cameron Quilici

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCWC22 added a commit to OCWC22/InferenceX that referenced this pull request Apr 15, 2026
…races

Add ISB-1 (Inference Stress Benchmark) — a multi-turn, long-context KV cache
stress testing dataset for InferenceX V3.

## What this adds

**35 synthetic multi-turn traces** across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60-95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts

**Export bundles** for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test

**10 KV stress sweep configs** (isb1-kv-stress-pr993.yaml):
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s

## Coexistence with kv-cache-tester

This dataset complements PR SemiAnalysisAI#993's kv-cache-tester (522 real Claude Code traces):
- kv-cache-tester: real workload distribution, natural performance profile
- ISB1: controlled KV stress patterns that force offload cliffs and cache pressure

No files in experimental/multiturn/ are modified. Separate config files, separate
data directory (datasets/isb1/), shared replay infrastructure.

## Benchmark infrastructure
- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level

## Why this matters (from GTC 2026)
> "Right now the benchmarks are kind of showing the worst the chips will
> actually perform... for V3 we want to add agentic benchmarks like really
> good representative multi-turn QA chat benchmarks where there are a ton
> of client sessions each with multiple turns and we'll enable prefix caching."
> — Cameron Quilici

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCWC22 added a commit to OCWC22/InferenceX that referenced this pull request Apr 15, 2026
…races

Add ISB-1 (Inference Stress Benchmark) — a multi-turn, long-context KV cache
stress testing dataset for InferenceX V3.

## What this adds

**35 synthetic multi-turn traces** across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60-95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts

**Export bundles** for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test

**10 KV stress sweep configs** (isb1-kv-stress-pr993.yaml):
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s

## Coexistence with kv-cache-tester

This dataset complements PR SemiAnalysisAI#993's kv-cache-tester (522 real Claude Code traces):
- kv-cache-tester: real workload distribution, natural performance profile
- ISB1: controlled KV stress patterns that force offload cliffs and cache pressure

No files in experimental/multiturn/ are modified. Separate config files, separate
data directory (datasets/isb1/), shared replay infrastructure.

## Benchmark infrastructure
- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level

## Why this matters (from GTC 2026)
> "Right now the benchmarks are kind of showing the worst the chips will
> actually perform... for V3 we want to add agentic benchmarks like really
> good representative multi-turn QA chat benchmarks where there are a ton
> of client sessions each with multiple turns and we'll enable prefix caching."
> — Cameron Quilici

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCWC22 added a commit to OCWC22/InferenceX that referenced this pull request Apr 15, 2026
…races

Add ISB-1 (Inference Stress Benchmark) — a multi-turn, long-context KV cache
stress testing dataset for InferenceX V3.

## What this adds

**35 synthetic multi-turn traces** across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60-95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts

**Export bundles** for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test

**10 KV stress sweep configs** (isb1-kv-stress-pr993.yaml):
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s

## Coexistence with kv-cache-tester

This dataset complements PR SemiAnalysisAI#993's kv-cache-tester (522 real Claude Code traces):
- kv-cache-tester: real workload distribution, natural performance profile
- ISB1: controlled KV stress patterns that force offload cliffs and cache pressure

No files in experimental/multiturn/ are modified. Separate config files, separate
data directory (datasets/isb1/), shared replay infrastructure.

## Benchmark infrastructure
- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level

## Why this matters (from GTC 2026)
> "Right now the benchmarks are kind of showing the worst the chips will
> actually perform... for V3 we want to add agentic benchmarks like really
> good representative multi-turn QA chat benchmarks where there are a ton
> of client sessions each with multiple turns and we'll enable prefix caching."
> — Cameron Quilici

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cquil11 and others added 12 commits April 15, 2026 09:15
Based on B200 FP4 trace replay, adapted for MI355X (ROCm):
- rocm-smi fallback for GPU detection
- No CUDA arch or NVIDIA-specific compilation config
- Simple KV offloading, version warning, ignore-eos support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AITER ck_moe_stage1 kernel crashes with MXFP4 + expert-parallel on
MI355X (vllm-project/vllm#35637). Disable AITER MoE while keeping
AITER attention, and add MEC firmware scratch reclaim guard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
VLLM_USE_SIMPLE_KV_OFFLOAD=1 routes to SimpleCPUOffloadConnector which
imports cuda.bindings (NVIDIA-only, PR vllm-project/vllm#37160). Remove
it from MI355X scripts so native offloading uses the ROCm-safe
OffloadingConnector. Also update H200 trace dir to use traces_neon with
env-var override to match the other trace replay scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Brings in curated v8 trace set, rate limiting metrics (goodput,
effective TTFT, SLO tracking), and updated trace data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Nodes define GRES with GPU subtypes (gpu:h100:8, gpu:h200:8), so salloc
must request gpu:h100:N / gpu:h200:N instead of generic gpu:N.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Plumbs TRACE_DIR through sweep workflow → template → benchmark script.
Accepts relative dir name (e.g. 'traces') or absolute path.
Defaults to traces_neon when empty.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only pulled trace data files (curated v8 set), no code changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SimpleCPUOffloadConnector uses cuda.bindings (NVIDIA-only).
MI355X must use --disable-hybrid-kv-cache-manager with the
native OffloadingConnector.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants