clarte-benchmark

Paired A/B benchmark suite for Clarté


Paired A/B harness that measures how Clarté affects AI coding agent cost, speed and correctness. The harness gives an agent the same tools as a real coding assistant (read, write, edit, glob, grep, bash), runs it against fixture repos under controlled conditions, and compares the results with statistical testing.

Built with TypeScript, the Anthropic SDK and Vitest. Runs against Claude Sonnet 4.6 and Haiku 4.5.

5/5 real-world tasks completed vs 3/5 without Clarté · 28-66% fewer turns on shared completions · 700+ controlled sessions · 32 experiments, 7 studies

Background

Clarté went through two major iterations. Understanding the difference matters for interpreting the results below.

v0 (context file): A static context file injected into the system prompt - structural analysis, key files, chokepoints, coupling patterns. This cut turns by 28% and tokens by 60%, but a minimal 2-line placebo (language + test framework) performed nearly as well. The content didn't matter; the file's existence changed agent behavior.

v1 (pre-flight): Instead of injecting information, Clarté now predicts which files need editing (BM25F + LSA + Katz centrality + logistic fusion) and tells the agent where to start. A pre-flight agent reads each target once and returns exact edit locations. The agent's first action becomes an edit, not a file read.
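
The fusion step described above can be sketched as follows. This is an illustrative model only: the feature weights, bias, and function names here are hypothetical, not Clarté's actual parameters.

```typescript
// Hypothetical sketch of logistic score fusion: combine normalized lexical,
// semantic and graph signals into one edit-likelihood score per file.
interface FileSignals {
  bm25f: number; // lexical relevance of the task prompt to the file
  lsa: number;   // latent-semantic similarity
  katz: number;  // Katz centrality in the dependency graph
}

const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

// Illustrative weights; a real model would learn these from labeled edits.
function editLikelihood(s: FileSignals): number {
  const z = -2.0 + 2.5 * s.bm25f + 1.5 * s.lsa + 1.0 * s.katz;
  return sigmoid(z);
}

// Rank candidate files and keep the top K as pre-flight targets.
function rankCandidates(files: Map<string, FileSignals>, topK = 3): string[] {
  return [...files.entries()]
    .sort((a, b) => editLikelihood(b[1]) - editLikelihood(a[1]))
    .slice(0, topK)
    .map(([path]) => path);
}
```

The point of fusing signals this way is that no single signal is reliable alone: a lexically relevant file may be a test, and a central file may be untouched boilerplate.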

The fixture benchmarks below tested v0. The real-world A/B tests tested v1.

Results

Real-world A/B tests (v1, pre-flight)

Real bug fixes in open-source repos. Opaque prompts, Claude Sonnet, claude -p:

| Task | Repo | Without Clarté | With Clarté | n |
|---|---|---|---|---|
| JSX async context loss | Hono | wrong file, did not finish | correct file, 2 min to first edit | 2+2 |
| Form validator prototype pollution | Hono | did not finish | completed (18 turns) | 1+1 |
| SQLite simple-enum array | TypeORM | 47.7 turns / $1.47 | 16.3 turns / $0.43 (-71%) | 3+3 |
| WebSocket adapter shutdown | NestJS | 53 turns / $2.70 | 38 turns / $2.17 (-20%) | 7+7 |
| URL fragment stripping | Hono | $0.34 avg | $0.28 avg (-17%) | 8+8 |

With Clarté the agent completed 5/5 tasks; without it, 3/5 within the same budget.

These are the controlled, reproducible runs. The development process included hundreds of additional sessions across more tasks and repos, iterating on parameters and approaches. The 32 experiment writeups and 7 studies document the full research arc, from content injection failures through the placebo discovery to pre-flight targeting.

Fixture benchmarks (v0, context file only)

Controlled benchmarks isolating the static context file alone (no hooks, no pre-flight). Same tasks, same model, randomized condition ordering. Statistical testing with Wilcoxon signed-rank, bootstrap CIs, Benjamini-Hochberg FDR correction and Cliff's delta effect sizes.

Claude Sonnet 4.6 - 9 opaque tasks, 3 conditions, 5 repetitions (135 sessions)

| Metric | Without Context | With Context | Delta | p-adj | Effect |
|---|---|---|---|---|---|
| Cost (median) | $1.08 | $0.45 | -58.5% | <0.001 | medium |
| Input tokens (median) | 272K | 108K | -60.4% | <0.001 | large |
| Turns (median) | 16 | 11.5 | -28.1% | <0.001 | medium |
| Duration (median) | 130s | 98s | -24.8% | <0.001 | small |

Aggregate statistics are from the 6-task batch (90 sessions), which was run as a single report with full statistical testing. The 3 ecommerce tasks were run in a separate batch and are included in the per-task breakdown below.

Per-task breakdown

| Task | Repo | Difficulty | Cost Δ | Turns (base → treat) |
|---|---|---|---|---|
| test-date-utils | large-ts-project | easy | -58.2% | 16 → 10 |
| add-pagination | medium-ts-api | medium | -48.3% | 17 → 12 |
| add-product-search | ecommerce-api | medium | -38.5% | 25 → 20 |
| test-posts-resource | medium-ts-api | medium | -37.0% | 12 → 9 |
| fix-order-tax | ecommerce-api | medium | -32.7% | 9 → 7 |
| multi-file-error-refactor | medium-ts-api | hard | -30.5% | 23 → 19 |
| fix-relative-time | large-ts-project | medium | -26.5% | 20 → 17 |
| fix-task-transition | large-ts-project | medium | -13.0% | 8 → 7 |
| test-payment-service | ecommerce-api | medium | +7.6% | 11 → 13 |

8 of 9 tasks show cost reduction with context. The effect is largest on tasks requiring broad codebase exploration (test-date-utils, add-product-search) and smallest on tasks where the fix location is quickly discoverable (fix-task-transition). The one outlier (test-payment-service, +7.6%) is a test-writing task on a small codebase where the agent finds the target file quickly regardless of context.

By category

| Category | Cost Δ | p-adj | Effect |
|---|---|---|---|
| test-writing | -51.8% | 0.044 | large |
| add-feature | -48.3% | n<10 | large |
| refactor | -30.5% | n<10 | large |
| bug-fix | -23.8% | 0.044 | large |
Claude Haiku 4.5 - 3 tasks, 7 repetitions (127 sessions)

| Metric | Without Context | With Context | Delta | Significance |
|---|---|---|---|---|
| Pass rate | 86% | 95% | +9pp | - |
| Turns (median) | 19 | 14 | -26% | p<0.001, medium effect |
| Cost (median) | $0.35 | $0.29 | -15% | n.s. |

The effect holds across model tiers. Sonnet shows larger absolute savings because its higher per-token cost ($3/$15 vs $1/$5) amplifies the token reduction.
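
The pricing arithmetic behind that comparison can be made concrete. Using the per-million-token prices quoted above ($3/$15 for Sonnet, $1/$5 for Haiku), the same 60% input-token cut from the Sonnet table (272K → 108K) saves three times as many dollars on Sonnet as it would on Haiku:

```typescript
// Cost arithmetic for the model-tier comparison. Prices are per million
// tokens ($ input / $ output), as quoted in the text above.
const PRICING = {
  sonnet: { input: 3, output: 15 },
  haiku: { input: 1, output: 5 },
} as const;

function sessionCost(
  model: keyof typeof PRICING,
  inputTokens: number,
  outputTokens: number
): number {
  const p = PRICING[model];
  return (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;
}

// Input-side saving from the 272K -> 108K reduction, per session.
const sonnetSaving = sessionCost("sonnet", 272_000, 0) - sessionCost("sonnet", 108_000, 0); // $0.492
const haikuSaving = sessionCost("haiku", 272_000, 0) - sessionCost("haiku", 108_000, 0);    // $0.164
```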

Reports are written to results/ (gitignored). Full session-level data in JSON, human-readable summaries in Markdown.

How It Works

The benchmark agent has the same tool set as a real AI coding assistant:

| Tool | Description |
|---|---|
| read | Read file contents |
| write | Create or overwrite files |
| edit | String replacement edits |
| glob | Find files by pattern |
| grep | Search file contents |
| bash | Execute shell commands (sandboxed) |

For each task, the harness:

  1. Clones the fixture repository to a temporary workspace
  2. Runs npm install
  3. Gives the agent the task prompt under the selected condition (see below)
  4. Lets the agent work for up to 25 turns (tool calls execute in parallel)
  5. Evaluates the result against task-specific criteria (content match, test pass, function behavior, typecheck)
  6. Records tokens, cost, duration, tool call patterns and pass/fail
  7. Computes statistical significance and effect sizes

Each combination is repeated multiple times (default: 7) with randomized condition ordering to eliminate temporal bias. Optional warmup reps can be discarded. Results are aggregated with median and BCa bootstrap confidence intervals, and deltas are tested for statistical significance.
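
The seeded, randomized run ordering can be sketched like this. The harness's actual RNG is not documented here, so this is an assumed approach: a small seeded PRNG (mulberry32) drives a Fisher-Yates shuffle, making the same --seed reproduce the same ordering.

```typescript
// Deterministic run-order randomization sketch (assumed approach; the
// harness's real implementation may differ).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Fisher-Yates shuffle driven by the seeded PRNG, leaving the input intact.
function shuffled<T>(items: T[], seed: number): T[] {
  const rng = mulberry32(seed);
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```

Randomizing the order in which conditions run guards against temporal bias such as API latency drift or cache warm-up favoring whichever condition runs second.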

Conditions

Default runs use three conditions. Use --all-conditions for the full set of 26 (ablations, delivery mechanisms, MCP, hooks, framing variants and more).

| Condition | What the agent gets |
|---|---|
| without-context | Task prompt only (baseline) |
| with-context | Runs clarte, injects the generated context into the system prompt |
| placebo-minimal | Injects a minimal stub (language + test framework only, no analysis) |

The placebo condition tests whether improvements come from graph analysis or just from having extra text in the prompt. Additional conditions (with-brief-context, placebo-random, stale-context, misleading-context, ablation-*, delivery-*, mcp-*, etc.) are available via --condition or --all-conditions.

Setup

npm install

Requires ANTHROPIC_API_KEY environment variable.

Usage

Dry run (cost estimate)

npm run bench:dry

List available tasks

npm run bench -- --list-tasks

Single task

npm run bench -- --task medium-ts-api:bug-fix-retry --repetitions 1 -v

Full benchmark

npm run bench

Full benchmark with warmup and deterministic seed

npm run bench -- --repetitions 10 --warmup-reps 1 --seed 42

Compare two reports

npm run bench:compare -- results/benchmark-v1.0.0-2026-02-15.json results/benchmark-v1.1.0-2026-02-20.json

Resume after crash

npm run bench -- --resume
Options

```
--dry-run              Estimate costs without making API calls
--repo <id>            Filter to specific repo(s) (repeatable)
--task <id>            Filter to specific task(s) (repeatable)
--all-tasks            Run all tasks (default: core tasks only)
--condition <c>        Filter by condition (see Conditions table)
--all-conditions       Run all 26 conditions (default: 3)
--prompt-style <s>     Filter: detailed | opaque (default: both)
--repetitions <n>      Repetitions per combination (default: 7)
--warmup-reps <n>      Warmup reps to discard (default: 0)
--seed <n>             Random seed for run order (default: random)
--max-turns <n>        Max turns per task (default: 25)
--max-tokens <n>       Max output tokens per turn (default: 16384)
--max-cost <usd>       Maximum cumulative cost before stopping
--model <m>            Model ID (default: claude-sonnet-4-6)
--output-dir <d>       Output directory (default: results/)
--early-stop-pairs <n> Min completed pairs before futility checks (default: 4)
--resume               Resume from last checkpoint
--parallel             Run conditions in parallel
--cache-cooldown <ms>  Delay between runs to avoid cache hits (default: 0)
--list-tasks           List all available tasks and exit
-v, --verbose          Verbose output
```
Tasks

7 repos (5 local fixtures, 2 pinned open-source). Each task has a detailed variant (names files, functions, symptoms) and an opaque variant (the agent must explore). Use --prompt-style to filter.

| Task | Repo | Category | Difficulty | What It Tests |
|---|---|---|---|---|
| bug-fix-retry | medium-ts-api | bug-fix | medium | Off-by-one in exponential backoff retry logic |
| add-pagination | medium-ts-api | add-feature | medium | Cursor-based pagination for users resource |
| add-request-timeout | medium-ts-api | add-feature | medium | Configurable request timeout with error type |
| multi-file-error-refactor | medium-ts-api | refactor | hard | Centralize scattered error handling |
| add-request-batching | medium-ts-api | add-feature | hard | Concurrent request batching with error handling |
| test-posts-resource | medium-ts-api | test-writing | medium | Tests for untested API resource module |
| fix-relative-time | large-ts-project | bug-fix | medium | Off-by-one in date utility at day boundaries |
| fix-task-transition | large-ts-project | bug-fix | medium | Task service status transition validation |
| test-notification-service | large-ts-project | test-writing | medium | Tests for notification service |
| test-date-utils | large-ts-project | test-writing | easy | Tests for date utility module |
| fix-order-tax | ecommerce-api | bug-fix | medium | Tax calculated on subtotal instead of discounted amount |
| test-payment-service | ecommerce-api | test-writing | medium | Tests for payment service |
| add-product-search | ecommerce-api | add-feature | medium | Full-text search for product service |
| add-order-history | ecommerce-api | add-feature | hard | Order history tracking with status transitions |
| test-middleware-validation | ecommerce-api | test-writing | easy | Tests for validation middleware |
| add-inventory-check | ecommerce-api | add-feature | hard | Inventory checking in order creation flow |
| fix-auth-validation | nestjs-api | bug-fix | medium | Bearer token validation logic |
| fix-user-not-found | nestjs-api | bug-fix | easy | Missing NotFoundError for absent users |
| add-user-search | nestjs-api | add-feature | medium | Search method for UsersService |
| add-semver-validator | zod | add-feature | medium | .semver() string format validator |
| cancel-refund | ecommerce-api-quality | quality | medium | Auto-refund payments on order cancellation |
| fix-tax-quality | ecommerce-api-quality | quality | medium | Tax calculation bug with hidden behavioral tests |
| bulk-discount | ecommerce-api-quality | quality | medium | Bulk quantity discount (10% for 5+ items) |

Default runs use 9 core task pairs (18 tasks) covering all repos, categories and difficulties. Use --all-tasks for the full set.

Each task has multiple evaluators (file existence, content matching, test execution, test count, function behavior, typecheck) to prevent false positives.

Metrics
| Metric | Description |
|---|---|
| Input tokens | Total input tokens consumed (prompt + context + tool results) |
| Output tokens | Total output tokens generated by the model |
| Cost | API cost in USD based on model pricing |
| Duration | Wall-clock time from first API call to task completion |
| Turns | Number of model invocations (each turn = one API call) |
| Exploration calls | Read-only tool calls: read, glob, grep, exploratory bash |
| Editing calls | Mutating tool calls: write, edit, file-modifying bash |
| Execution calls | Test/build tool calls: bash running tests, type-checking, installing |
| Unique files read | Distinct files read during the session |
| Files read, never edited | Files read but never subsequently written to or edited |
| Redundant file reads | Number of times a file was re-read (same path read more than once) |
| Cache hit rate | Fraction of input tokens served from cache: cacheRead / (cacheRead + input) |
| Pass rate | Percentage of runs that pass all evaluators |
| Score | Partial credit (0.0-1.0) based on fraction of evaluators passed |
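
The two derived metrics above are simple ratios; a minimal sketch (not the harness's actual code) makes the definitions precise:

```typescript
// Cache hit rate: fraction of input tokens served from the prompt cache,
// where `uncachedInputTokens` counts input tokens billed at the full rate.
function cacheHitRate(cacheReadTokens: number, uncachedInputTokens: number): number {
  const total = cacheReadTokens + uncachedInputTokens;
  return total === 0 ? 0 : cacheReadTokens / total;
}

// Score: partial credit as the fraction of evaluators that passed.
function score(evaluatorResults: boolean[]): number {
  if (evaluatorResults.length === 0) return 0;
  return evaluatorResults.filter(Boolean).length / evaluatorResults.length;
}
```

For example, 90K tokens served from cache against 30K uncached gives a cache hit rate of 0.75, and passing 3 of 4 evaluators gives a score of 0.75.
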
Evaluators and statistical methods

| Type | Description |
|---|---|
| file-exists | Check that a file exists (or doesn't) |
| content-match | Regex pattern match against file contents |
| test-pass | Run a test command and check exit code |
| test-count | Verify minimum number of passing tests |
| function-behavior | Execute a function with known inputs and verify output |
| typecheck | Run tsc --noEmit and verify no type errors |
| diff-check | Verify expected files were modified via git diff |
| coverage-minimum | Validate test coverage meets a minimum threshold |

Reports are saved to results/ as versioned files:

  • benchmark-v{clarteVersion}-{date}.json - Full machine-readable data
  • benchmark-v{clarteVersion}-{date}.md - Human-readable markdown with tables

Each report includes system metadata (Node version, OS, CPU, memory), the Clarté version and git SHA, confidence intervals, p-values and effect sizes for all comparisons. Results are stratified by difficulty and category when multiple strata exist.

Statistical methods

| Method | Purpose |
|---|---|
| Wilcoxon signed-rank | Primary paired test for metric differences (sessions matched by taskId + repetition) |
| Mann-Whitney U | Fallback unpaired test when < 5 pairs can be matched |
| Fisher's exact | Test significance of pass rate differences |
| BCa bootstrap CI | 95% confidence intervals for medians (10,000 resamples) |
| Cliff's delta | Non-parametric effect size (negligible/small/medium/large) |
| Benjamini-Hochberg | FDR correction across all tests within a report |
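
Cliff's delta, the effect size used throughout the report, is small enough to show in full: it is the probability that a value drawn from one sample exceeds one from the other, minus the reverse probability. The qualitative labels follow the common convention of |d| < 0.147 negligible, < 0.33 small, < 0.474 medium, else large (a widely used threshold set, assumed here to match the harness's).

```typescript
// Cliff's delta: (#pairs where a > b minus #pairs where a < b) / total pairs.
// Ranges from -1 (all of `a` below `b`) to +1 (all of `a` above `b`).
function cliffsDelta(a: number[], b: number[]): number {
  let more = 0;
  let less = 0;
  for (const x of a) {
    for (const y of b) {
      if (x > y) more++;
      else if (x < y) less++;
    }
  }
  return (more - less) / (a.length * b.length);
}
```

Being rank-based, it needs no normality assumption, which suits cost and turn-count distributions that are typically skewed.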

Development

```
npm test          # Unit tests
npm run typecheck # Type checking
```
