Paired A/B benchmark suite for Clarté
Paired A/B harness that measures how Clarté affects AI coding agent cost, speed and correctness. The harness gives an agent the same tools as a real coding assistant (read, write, edit, glob, grep, bash), runs it against fixture repos under controlled conditions, and compares the results with statistical testing.
Built with TypeScript, the Anthropic SDK and Vitest. Runs against Claude Sonnet 4.6 and Haiku 4.5.
5/5 real-world tasks completed vs 3/5 without Clarté · 28-66% fewer turns on shared completions · 700+ controlled sessions · 32 experiments, 7 studies
Clarté went through two major iterations. Understanding the difference matters for interpreting the results below.
v0 (context file): A static context file injected into the system prompt - structural analysis, key files, chokepoints, coupling patterns. This cut turns by 28% and tokens by 60%, but a minimal 2-line placebo (language + test framework) performed nearly as well. The content didn't matter; the file's existence changed agent behavior.
v1 (pre-flight): Instead of injecting information, Clarté now predicts which files need editing (BM25F + LSA + Katz centrality + logistic fusion) and tells the agent where to start. A pre-flight agent reads each target once and returns exact edit locations. The agent's first action becomes an edit, not a file read.
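The fusion step can be sketched as a small logistic combination of the three retrieval signals. This is a minimal illustration assuming per-file BM25F, LSA and Katz centrality scores are already computed; the weights and type names are made up for the sketch, not Clarté's trained parameters:

```typescript
// Per-file feature scores from the three signals (hypothetical shape).
type FileFeatures = { bm25f: number; lsa: number; katz: number };

function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

// Logistic fusion: weighted sum of signals squashed to an edit probability.
// Weights here are illustrative placeholders.
function fuseScores(
  f: FileFeatures,
  w = { bm25f: 1.2, lsa: 0.8, katz: 0.5, bias: -2.0 },
): number {
  const z = w.bias + w.bm25f * f.bm25f + w.lsa * f.lsa + w.katz * f.katz;
  return sigmoid(z);
}

// Rank candidate files by predicted edit probability, keep the top K.
function rankTargets(files: Map<string, FileFeatures>, topK = 3): string[] {
  return [...files.entries()]
    .sort((a, b) => fuseScores(b[1]) - fuseScores(a[1]))
    .slice(0, topK)
    .map(([path]) => path);
}
```

The ranked list is what the pre-flight agent then verifies by reading each target once.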
The fixture benchmarks below tested v0. The real-world A/B tests tested v1.
Real bug fixes in open-source repos. Opaque prompts, Claude Sonnet, `claude -p`:
| Task | Repo | Without Clarté | With Clarté | n |
|---|---|---|---|---|
| JSX async context loss | Hono | wrong file, did not finish | correct file, 2 min to first edit | 2+2 |
| Form validator prototype pollution | Hono | did not finish | completed (18 turns) | 1+1 |
| SQLite simple-enum array | TypeORM | 47.7 turns / $1.47 | 16.3 turns / $0.43 (-71%) | 3+3 |
| WebSocket adapter shutdown | NestJS | 53 turns / $2.70 | 38 turns / $2.17 (-20%) | 7+7 |
| URL fragment stripping | Hono | $0.34 avg | $0.28 avg (-17%) | 8+8 |
Clarté completed 5/5. Without it, the agent completed 3/5 within the same budget.
These are the controlled, reproducible runs. The development process included hundreds of additional sessions across more tasks and repos, iterating on parameters and approaches. The 32 experiment writeups and 7 studies document the full research arc, from content injection failures through the placebo discovery to pre-flight targeting.
Controlled benchmarks isolating the static context file alone (no hooks, no pre-flight). Same tasks, same model, randomized condition ordering. Statistical testing with Wilcoxon signed-rank, bootstrap CIs, Benjamini-Hochberg FDR correction and Cliff's delta effect sizes.
| Metric | Without Context | With Context | Delta | p-adj | Effect |
|---|---|---|---|---|---|
| Cost (median) | $1.08 | $0.45 | -58.5% | <0.001 | medium |
| Input tokens (median) | 272K | 108K | -60.4% | <0.001 | large |
| Turns (median) | 16 | 11.5 | -28.1% | <0.001 | medium |
| Duration (median) | 130s | 98s | -24.8% | <0.001 | small |
Aggregate statistics are from the 6-task batch (90 sessions), which was run as a single report with full statistical testing. The 3 ecommerce tasks were run in a separate batch and are included in the per-task breakdown below.
Per-task breakdown
| Task | Repo | Difficulty | Cost Δ | Turns (base → treat) |
|---|---|---|---|---|
| test-date-utils | large-ts-project | easy | -58.2% | 16 → 10 |
| add-pagination | medium-ts-api | medium | -48.3% | 17 → 12 |
| add-product-search | ecommerce-api | medium | -38.5% | 25 → 20 |
| test-posts-resource | medium-ts-api | medium | -37.0% | 12 → 9 |
| fix-order-tax | ecommerce-api | medium | -32.7% | 9 → 7 |
| multi-file-error-refactor | medium-ts-api | hard | -30.5% | 23 → 19 |
| fix-relative-time | large-ts-project | medium | -26.5% | 20 → 17 |
| fix-task-transition | large-ts-project | medium | -13.0% | 8 → 7 |
| test-payment-service | ecommerce-api | medium | +7.6% | 11 → 13 |
8 of 9 tasks show cost reduction with context. The effect is largest on tasks requiring broad codebase exploration (test-date-utils, add-product-search) and smallest on tasks where the fix location is quickly discoverable (fix-task-transition). The one outlier (test-payment-service, +7.6%) is a test-writing task on a small codebase where the agent finds the target file quickly regardless of context.
By category
| Category | Cost Δ | p-adj | Effect |
|---|---|---|---|
| test-writing | -51.8% | 0.044 | large |
| add-feature | -48.3% | n<10 | large |
| refactor | -30.5% | n<10 | large |
| bug-fix | -23.8% | 0.044 | large |
Claude Haiku 4.5 - 3 tasks, 7 repetitions (127 sessions)
| Metric | Without Context | With Context | Delta | Significance |
|---|---|---|---|---|
| Pass rate | 86% | 95% | +9pp | - |
| Turns (median) | 19 | 14 | -26% | p<0.001, medium effect |
| Cost (median) | $0.35 | $0.29 | -15% | n.s. |
The effect holds across model tiers. Sonnet shows larger absolute savings because its higher per-token pricing ($3/$15 per million input/output tokens, vs $1/$5 for Haiku) amplifies the token reduction.
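The tier comparison follows directly from the pricing arithmetic. A quick sketch of per-session cost at per-million-token prices, using the tier prices quoted above (token counts are examples, not measured values):

```typescript
// Per-session API cost from token counts and per-MTok pricing.
function sessionCost(
  inputTokens: number,
  outputTokens: number,
  price: { input: number; output: number }, // USD per million tokens
): number {
  return (inputTokens / 1e6) * price.input + (outputTokens / 1e6) * price.output;
}

const sonnet = { input: 3, output: 15 };
const haiku = { input: 1, output: 5 };

// The same 164K-token input reduction is worth 3x more at Sonnet prices.
const savedAtSonnet = sessionCost(164_000, 0, sonnet);
const savedAtHaiku = sessionCost(164_000, 0, haiku);
```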
Reports are written to `results/` (gitignored). Full session-level data in JSON, human-readable summaries in Markdown.
The benchmark agent has the same tool set as a real AI coding assistant:
| Tool | Description |
|---|---|
| `read` | Read file contents |
| `write` | Create or overwrite files |
| `edit` | String replacement edits |
| `glob` | Find files by pattern |
| `grep` | Search file contents |
| `bash` | Execute shell commands (sandboxed) |
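Each of these is declared to the model as a JSON-Schema tool. A minimal sketch of one declaration in the shape the Anthropic Messages API expects (`name`, `description`, `input_schema`); the local type alias and description text are illustrative, not the harness's actual code:

```typescript
// Local stand-in for the SDK's tool type (illustrative).
type ToolSpec = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, unknown>;
    required: string[];
  };
};

// Declaration for the `read` tool: one required string parameter.
const readTool: ToolSpec = {
  name: "read",
  description: "Read file contents from the workspace",
  input_schema: {
    type: "object",
    properties: {
      path: { type: "string", description: "Workspace-relative file path" },
    },
    required: ["path"],
  },
};
```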
For each task, the harness:
- Clones the fixture repository to a temporary workspace
- Runs `npm install`
- Gives the agent the task prompt under the selected condition (see below)
- Lets the agent work for up to 25 turns (tool calls execute in parallel)
- Evaluates the result against task-specific criteria (content match, test pass, function behavior, typecheck)
- Records tokens, cost, duration, tool call patterns and pass/fail
- Computes statistical significance and effect sizes
Each combination is repeated multiple times (default: 7) with randomized condition ordering to eliminate temporal bias. Optional warmup reps can be discarded. Results are aggregated with median and BCa bootstrap confidence intervals, and deltas are tested for statistical significance.
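Seeded, randomized run ordering can be sketched as a Fisher–Yates shuffle driven by a small deterministic PRNG. This illustrates the idea behind reproducible ordering via `--seed`; the LCG here is an assumption, not the harness's actual generator:

```typescript
// Tiny linear congruential generator: same seed, same sequence.
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (s * 1664525 + 1013904223) >>> 0;
    return s / 2 ** 32;
  };
}

// Fisher–Yates shuffle using the seeded PRNG; input is not mutated.
function shuffled<T>(items: T[], seed: number): T[] {
  const rand = lcg(seed);
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```

Rerunning with the same seed reproduces the same condition order, which is what makes temporal drift (API latency, cache state) balance out across conditions.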
Conditions
Default runs use three conditions. Use `--all-conditions` for the full set of 26 (ablations, delivery mechanisms, MCP, hooks, framing variants and more).
| Condition | What the agent gets |
|---|---|
| `without-context` | Task prompt only (baseline) |
| `with-context` | Runs `clarte`, injects the generated context into the system prompt |
| `placebo-minimal` | Injects a minimal stub (language + test framework only, no analysis) |
The placebo condition tests whether improvements come from graph analysis or just from having extra text in the prompt. Additional conditions (`with-brief-context`, `placebo-random`, `stale-context`, `misleading-context`, `ablation-*`, `delivery-*`, `mcp-*`, etc.) are available via `--condition` or `--all-conditions`.
```bash
npm install
```

Requires the `ANTHROPIC_API_KEY` environment variable.

```bash
npm run bench:dry                 # estimate costs without API calls
npm run bench -- --list-tasks     # list available tasks
npm run bench -- --task medium-ts-api:bug-fix-retry --repetitions 1 -v
npm run bench                     # full default run
npm run bench -- --repetitions 10 --warmup-reps 1 --seed 42
npm run bench:compare -- results/benchmark-v1.0.0-2026-02-15.json results/benchmark-v1.1.0-2026-02-20.json
npm run bench -- --resume         # resume from checkpoint
```

Options
```
--dry-run                Estimate costs without making API calls
--repo <id>              Filter to specific repo(s) (repeatable)
--task <id>              Filter to specific task(s) (repeatable)
--all-tasks              Run all tasks (default: core tasks only)
--condition <c>          Filter by condition (see Conditions table)
--all-conditions         Run all 26 conditions (default: 3)
--prompt-style <s>       Filter: detailed | opaque (default: both)
--repetitions <n>        Repetitions per combination (default: 7)
--warmup-reps <n>        Warmup reps to discard (default: 0)
--seed <n>               Random seed for run order (default: random)
--max-turns <n>          Max turns per task (default: 25)
--max-tokens <n>         Max output tokens per turn (default: 16384)
--max-cost <usd>         Maximum cumulative cost before stopping
--model <m>              Model ID (default: claude-sonnet-4-6)
--output-dir <d>         Output directory (default: results/)
--early-stop-pairs <n>   Min completed pairs before futility checks (default: 4)
--resume                 Resume from last checkpoint
--parallel               Run conditions in parallel
--cache-cooldown <ms>    Delay between runs to avoid cache hits (default: 0)
--list-tasks             List all available tasks and exit
-v, --verbose            Verbose output
```
Tasks
7 repos (5 local fixtures, 2 pinned open-source). Each task has a detailed variant (names files, functions, symptoms) and an opaque variant (the agent must explore). Use `--prompt-style` to filter.
| Task | Repo | Category | Difficulty | What It Tests |
|---|---|---|---|---|
| `bug-fix-retry` | medium-ts-api | bug-fix | medium | Off-by-one in exponential backoff retry logic |
| `add-pagination` | medium-ts-api | add-feature | medium | Cursor-based pagination for users resource |
| `add-request-timeout` | medium-ts-api | add-feature | medium | Configurable request timeout with error type |
| `multi-file-error-refactor` | medium-ts-api | refactor | hard | Centralize scattered error handling |
| `add-request-batching` | medium-ts-api | add-feature | hard | Concurrent request batching with error handling |
| `test-posts-resource` | medium-ts-api | test-writing | medium | Tests for untested API resource module |
| `fix-relative-time` | large-ts-project | bug-fix | medium | Off-by-one in date utility at day boundaries |
| `fix-task-transition` | large-ts-project | bug-fix | medium | Task service status transition validation |
| `test-notification-service` | large-ts-project | test-writing | medium | Tests for notification service |
| `test-date-utils` | large-ts-project | test-writing | easy | Tests for date utility module |
| `fix-order-tax` | ecommerce-api | bug-fix | medium | Tax calculated on subtotal instead of discounted amount |
| `test-payment-service` | ecommerce-api | test-writing | medium | Tests for payment service |
| `add-product-search` | ecommerce-api | add-feature | medium | Full-text search for product service |
| `add-order-history` | ecommerce-api | add-feature | hard | Order history tracking with status transitions |
| `test-middleware-validation` | ecommerce-api | test-writing | easy | Tests for validation middleware |
| `add-inventory-check` | ecommerce-api | add-feature | hard | Inventory checking in order creation flow |
| `fix-auth-validation` | nestjs-api | bug-fix | medium | Bearer token validation logic |
| `fix-user-not-found` | nestjs-api | bug-fix | easy | Missing NotFoundError for absent users |
| `add-user-search` | nestjs-api | add-feature | medium | Search method for UsersService |
| `add-semver-validator` | zod | add-feature | medium | `.semver()` string format validator |
| `cancel-refund` | ecommerce-api-quality | quality | medium | Auto-refund payments on order cancellation |
| `fix-tax-quality` | ecommerce-api-quality | quality | medium | Tax calculation bug with hidden behavioral tests |
| `bulk-discount` | ecommerce-api-quality | quality | medium | Bulk quantity discount (10% for 5+ items) |
Default runs use 9 core task pairs (18 tasks) covering all repos, categories and difficulties. Use `--all-tasks` for the full set.
Each task has multiple evaluators (file existence, content matching, test execution, test count, function behavior, typecheck) to prevent false positives.
Metrics
| Metric | Description |
|---|---|
| Input tokens | Total input tokens consumed (prompt + context + tool results) |
| Output tokens | Total output tokens generated by the model |
| Cost | API cost in USD based on model pricing |
| Duration | Wall-clock time from first API call to task completion |
| Turns | Number of model invocations (each turn = one API call) |
| Exploration calls | Read-only tool calls: read, glob, grep, exploratory bash |
| Editing calls | Mutating tool calls: write, edit, file-modifying bash |
| Execution calls | Test/build tool calls: bash running tests, type-checking, installing |
| Unique files read | Distinct files read during the session |
| Files read, never edited | Files read but never subsequently written to or edited |
| Redundant file reads | Number of times a file was re-read (same path read more than once) |
| Cache hit rate | Fraction of input tokens served from cache: cacheRead / (cacheRead + input) |
| Pass rate | Percentage of runs that pass all evaluators |
| Score | Partial credit (0.0-1.0) based on fraction of evaluators passed |
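Two of the derived metrics above are simple to compute from a session's tool-call log. A sketch with hypothetical inputs, following the definitions in the table (redundant reads count repeat reads of the same path; cache hit rate is cacheRead / (cacheRead + input)):

```typescript
// Count how many read calls hit a path that was already read.
function redundantReads(readPaths: string[]): number {
  const seen = new Set<string>();
  let redundant = 0;
  for (const p of readPaths) {
    if (seen.has(p)) redundant++;
    else seen.add(p);
  }
  return redundant;
}

// Fraction of input tokens served from cache, guarding division by zero.
function cacheHitRate(cacheRead: number, input: number): number {
  const total = cacheRead + input;
  return total === 0 ? 0 : cacheRead / total;
}
```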
Evaluators and statistical methods
| Type | Description |
|---|---|
| `file-exists` | Check that a file exists (or doesn't) |
| `content-match` | Regex pattern match against file contents |
| `test-pass` | Run a test command and check exit code |
| `test-count` | Verify minimum number of passing tests |
| `function-behavior` | Execute a function with known inputs and verify output |
| `typecheck` | Run `tsc --noEmit` and verify no type errors |
| `diff-check` | Verify expected files were modified via `git diff` |
| `coverage-minimum` | Validate test coverage meets a minimum threshold |
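Evaluator types like these typically share a common result shape so they can be composed per task. A hypothetical sketch of such an interface and one evaluator; the names are illustrative, not the harness's actual API:

```typescript
// Shared result shape for all evaluator types (hypothetical).
type EvalResult = { pass: boolean; detail: string };

// A content-match-style check: regex against file text.
function contentMatch(pattern: RegExp) {
  return (fileText: string): EvalResult => {
    const matched = pattern.test(fileText);
    return {
      pass: matched,
      detail: `pattern ${pattern} ${matched ? "matched" : "did not match"}`,
    };
  };
}

// A task passes only if every configured evaluator passes,
// which is how multiple evaluators prevent false positives.
function allPass(results: EvalResult[]): boolean {
  return results.every((r) => r.pass);
}
```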
Reports are saved to `results/` as versioned files:

- `benchmark-v{clarteVersion}-{date}.json` - Full machine-readable data
- `benchmark-v{clarteVersion}-{date}.md` - Human-readable markdown with tables
Each report includes system metadata (Node version, OS, CPU, memory), the Clarté version and git SHA, confidence intervals, p-values and effect sizes for all comparisons. Results are stratified by difficulty and category when multiple strata exist.
| Method | Purpose |
|---|---|
| Wilcoxon signed-rank | Primary paired test for metric differences (sessions matched by taskId + repetition) |
| Mann-Whitney U | Fallback unpaired test when < 5 pairs can be matched |
| Fisher's exact | Test significance of pass rate differences |
| BCa bootstrap CI | 95% confidence intervals for medians (10,000 resamples) |
| Cliff's delta | Non-parametric effect size (negligible/small/medium/large) |
| Benjamini-Hochberg | FDR correction across all tests within a report |
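The FDR correction in the last row is the Benjamini-Hochberg step-up procedure: sort the p-values, find the largest rank k with p(k) ≤ (k/m)·q, and reject hypotheses 1..k. A simplified sketch of the procedure itself, not the harness's implementation:

```typescript
// Benjamini-Hochberg step-up at FDR level q.
// Returns, per input p-value, whether that hypothesis is rejected.
function benjaminiHochberg(pValues: number[], q = 0.05): boolean[] {
  const m = pValues.length;
  const order = pValues
    .map((p, i) => ({ p, i }))
    .sort((a, b) => a.p - b.p);
  // Largest rank (0-based) whose p-value clears its step-up threshold.
  let cutoff = -1;
  order.forEach(({ p }, rank) => {
    if (p <= ((rank + 1) / m) * q) cutoff = rank;
  });
  // Reject everything at or below the cutoff rank, in original order.
  const reject = new Array<boolean>(m).fill(false);
  for (let r = 0; r <= cutoff; r++) reject[order[r].i] = true;
  return reject;
}
```

Note the step-up behavior: a p-value that fails its own threshold is still rejected if a larger p-value later clears its threshold.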
```bash
npm test           # Unit tests
npm run typecheck  # Type checking
```