clarte-benchmark

Paired A/B benchmark suite for Clarté


Paired A/B harness that measures how Clarté affects AI coding agent cost, speed and correctness. The harness gives an agent the same tools as a real coding assistant (read, write, edit, glob, grep, bash), runs it against fixture repos under controlled conditions, and compares the results with statistical testing.

Built with TypeScript, the Anthropic SDK and Vitest. Runs against Claude Sonnet 4.6 and Haiku 4.5.

5/5 real-world tasks completed vs 3/5 without Clarté · 28-66% fewer turns on shared completions · 700+ controlled sessions · 32 experiments, 7 studies

Background

Clarté went through two major iterations. Understanding the difference matters for interpreting the results below.

v0 (context file): A static context file injected into the system prompt - structural analysis, key files, chokepoints, coupling patterns. This cut turns by 28% and tokens by 60%, but a minimal 2-line placebo (language + test framework) performed nearly as well. The content didn't matter; the file's existence changed agent behavior.

v1 (pre-flight): Instead of injecting information, Clarté now predicts which files need editing (BM25F + LSA + Katz centrality + logistic fusion) and tells the agent where to start. A pre-flight agent reads each target once and returns exact edit locations. The agent's first action becomes an edit, not a file read.
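
The fusion step described above can be sketched as follows. This is an illustrative model only: the feature weights, bias, and function names here are hypothetical, not Clarté's actual parameters.

```typescript
// Hypothetical sketch of logistic score fusion: combine normalized lexical,
// semantic and graph signals into one edit-likelihood score per file.
interface FileSignals {
  bm25f: number; // lexical relevance of the task prompt to the file
  lsa: number;   // latent-semantic similarity
  katz: number;  // Katz centrality in the dependency graph
}

const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

// Illustrative weights; a real model would learn these from labeled edits.
function editLikelihood(s: FileSignals): number {
  const z = -2.0 + 2.5 * s.bm25f + 1.5 * s.lsa + 1.0 * s.katz;
  return sigmoid(z);
}

// Rank candidate files and keep the top K as pre-flight targets.
function rankCandidates(files: Map<string, FileSignals>, topK = 3): string[] {
  return [...files.entries()]
    .sort((a, b) => editLikelihood(b[1]) - editLikelihood(a[1]))
    .slice(0, topK)
    .map(([path]) => path);
}
```

The point of fusing signals this way is that no single signal is reliable alone: a lexically relevant file may be a test, and a central file may be untouched boilerplate.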

The fixture benchmarks below tested v0. The real-world A/B tests tested v1.

Results

Real-world A/B tests (v1, pre-flight)

Real bug fixes in open-source repos. Opaque prompts, Claude Sonnet, claude -p:

| Task | Repo | Without Clarté | With Clarté | n |
|---|---|---|---|---|
| JSX async context loss | Hono | wrong file, did not finish | correct file, 2 min to first edit | 2+2 |
| Form validator prototype pollution | Hono | did not finish | completed (18 turns) | 1+1 |
| SQLite simple-enum array | TypeORM | 47.7 turns / $1.47 | 16.3 turns / $0.43 (-71%) | 3+3 |
| WebSocket adapter shutdown | NestJS | 53 turns / $2.70 | 38 turns / $2.17 (-20%) | 7+7 |
| URL fragment stripping | Hono | $0.34 avg | $0.28 avg (-17%) | 8+8 |

With Clarté the agent completed 5/5 tasks; without it, 3/5 within the same budget.

These are the controlled, reproducible runs. The development process included hundreds of additional sessions across more tasks and repos, iterating on parameters and approaches. The 32 experiment writeups and 7 studies document the full research arc, from content injection failures through the placebo discovery to pre-flight targeting.

Fixture benchmarks (v0, context file only)

Controlled benchmarks isolating the static context file alone (no hooks, no pre-flight). Same tasks, same model, randomized condition ordering. Statistical testing with Wilcoxon signed-rank, bootstrap CIs, Benjamini-Hochberg FDR correction and Cliff's delta effect sizes.

Claude Sonnet 4.6 - 9 opaque tasks, 3 conditions, 5 repetitions (135 sessions)

| Metric | Without Context | With Context | Delta | p-adj | Effect |
|---|---|---|---|---|---|
| Cost (median) | $1.08 | $0.45 | -58.5% | <0.001 | medium |
| Input tokens (median) | 272K | 108K | -60.4% | <0.001 | large |
| Turns (median) | 16 | 11.5 | -28.1% | <0.001 | medium |
| Duration (median) | 130s | 98s | -24.8% | <0.001 | small |

Aggregate statistics are from the 6-task batch (90 sessions), which was run as a single report with full statistical testing. The 3 ecommerce tasks were run in a separate batch and are included in the per-task breakdown below.

Per-task breakdown

| Task | Repo | Difficulty | Cost Δ | Turns (base → treat) |
|---|---|---|---|---|
| test-date-utils | large-ts-project | easy | -58.2% | 16 → 10 |
| add-pagination | medium-ts-api | medium | -48.3% | 17 → 12 |
| add-product-search | ecommerce-api | medium | -38.5% | 25 → 20 |
| test-posts-resource | medium-ts-api | medium | -37.0% | 12 → 9 |
| fix-order-tax | ecommerce-api | medium | -32.7% | 9 → 7 |
| multi-file-error-refactor | medium-ts-api | hard | -30.5% | 23 → 19 |
| fix-relative-time | large-ts-project | medium | -26.5% | 20 → 17 |
| fix-task-transition | large-ts-project | medium | -13.0% | 8 → 7 |
| test-payment-service | ecommerce-api | medium | +7.6% | 11 → 13 |

8 of 9 tasks show cost reduction with context. The effect is largest on tasks requiring broad codebase exploration (test-date-utils, add-product-search) and smallest on tasks where the fix location is quickly discoverable (fix-task-transition). The one outlier (test-payment-service, +7.6%) is a test-writing task on a small codebase where the agent finds the target file quickly regardless of context.

By category

| Category | Cost Δ | p-adj | Effect |
|---|---|---|---|
| test-writing | -51.8% | 0.044 | large |
| add-feature | -48.3% | n<10 | large |
| refactor | -30.5% | n<10 | large |
| bug-fix | -23.8% | 0.044 | large |
Claude Haiku 4.5 - 3 tasks, 7 repetitions (127 sessions)

| Metric | Without Context | With Context | Delta | Significance |
|---|---|---|---|---|
| Pass rate | 86% | 95% | +9pp | - |
| Turns (median) | 19 | 14 | -26% | p<0.001, medium effect |
| Cost (median) | $0.35 | $0.29 | -15% | n.s. |

The effect holds across model tiers. Sonnet shows larger absolute savings because its higher per-token cost ($3/$15 vs $1/$5) amplifies the token reduction.
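
The pricing arithmetic behind that comparison can be made concrete. Using the per-million-token prices quoted above ($3/$15 for Sonnet, $1/$5 for Haiku), the same 60% input-token cut from the Sonnet table (272K → 108K) saves three times as many dollars on Sonnet as it would on Haiku:

```typescript
// Cost arithmetic for the model-tier comparison. Prices are per million
// tokens ($ input / $ output), as quoted in the text above.
const PRICING = {
  sonnet: { input: 3, output: 15 },
  haiku: { input: 1, output: 5 },
} as const;

function sessionCost(
  model: keyof typeof PRICING,
  inputTokens: number,
  outputTokens: number
): number {
  const p = PRICING[model];
  return (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;
}

// Input-side saving from the 272K -> 108K reduction, per session.
const sonnetSaving = sessionCost("sonnet", 272_000, 0) - sessionCost("sonnet", 108_000, 0); // $0.492
const haikuSaving = sessionCost("haiku", 272_000, 0) - sessionCost("haiku", 108_000, 0);    // $0.164
```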

Reports are written to results/ (gitignored). Full session-level data in JSON, human-readable summaries in Markdown.

How It Works

The benchmark agent has the same tool set as a real AI coding assistant:

| Tool | Description |
|---|---|
| read | Read file contents |
| write | Create or overwrite files |
| edit | String replacement edits |
| glob | Find files by pattern |
| grep | Search file contents |
| bash | Execute shell commands (sandboxed) |

For each task, the harness:

  1. Clones the fixture repository to a temporary workspace
  2. Runs npm install
  3. Gives the agent the task prompt under the selected condition (see below)
  4. Lets the agent work for up to 25 turns (tool calls execute in parallel)
  5. Evaluates the result against task-specific criteria (content match, test pass, function behavior, typecheck)
  6. Records tokens, cost, duration, tool call patterns and pass/fail
  7. Computes statistical significance and effect sizes

Each combination is repeated multiple times (default: 7) with randomized condition ordering to eliminate temporal bias. Optional warmup reps can be discarded. Results are aggregated with median and BCa bootstrap confidence intervals, and deltas are tested for statistical significance.
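
The seeded, randomized run ordering can be sketched like this. The harness's actual RNG is not documented here, so this is an assumed approach: a small seeded PRNG (mulberry32) drives a Fisher-Yates shuffle, making the same --seed reproduce the same ordering.

```typescript
// Deterministic run-order randomization sketch (assumed approach; the
// harness's real implementation may differ).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Fisher-Yates shuffle driven by the seeded PRNG, leaving the input intact.
function shuffled<T>(items: T[], seed: number): T[] {
  const rng = mulberry32(seed);
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```

Randomizing the order in which conditions run guards against temporal bias such as API latency drift or cache warm-up favoring whichever condition runs second.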

Conditions

Default runs use three conditions. Use --all-conditions for the full set of 26 (ablations, delivery mechanisms, MCP, hooks, framing variants and more).

| Condition | What the agent gets |
|---|---|
| without-context | Task prompt only (baseline) |
| with-context | Runs clarte, injects the generated context into the system prompt |
| placebo-minimal | Injects a minimal stub (language + test framework only, no analysis) |

The placebo condition tests whether improvements come from graph analysis or just from having extra text in the prompt. Additional conditions (with-brief-context, placebo-random, stale-context, misleading-context, ablation-*, delivery-*, mcp-*, etc.) are available via --condition or --all-conditions.

Setup

npm install

Requires ANTHROPIC_API_KEY environment variable.

Usage

Dry run (cost estimate)

npm run bench:dry

List available tasks

npm run bench -- --list-tasks

Single task

npm run bench -- --task medium-ts-api:bug-fix-retry --repetitions 1 -v

Full benchmark

npm run bench

Full benchmark with warmup and deterministic seed

npm run bench -- --repetitions 10 --warmup-reps 1 --seed 42

Compare two reports

npm run bench:compare -- results/benchmark-v1.0.0-2026-02-15.json results/benchmark-v1.1.0-2026-02-20.json

Resume after crash

npm run bench -- --resume
Options

```
--dry-run              Estimate costs without making API calls
--repo <id>            Filter to specific repo(s) (repeatable)
--task <id>            Filter to specific task(s) (repeatable)
--all-tasks            Run all tasks (default: core tasks only)
--condition <c>        Filter by condition (see Conditions table)
--all-conditions       Run all 26 conditions (default: 3)
--prompt-style <s>     Filter: detailed | opaque (default: both)
--repetitions <n>      Repetitions per combination (default: 7)
--warmup-reps <n>      Warmup reps to discard (default: 0)
--seed <n>             Random seed for run order (default: random)
--max-turns <n>        Max turns per task (default: 25)
--max-tokens <n>       Max output tokens per turn (default: 16384)
--max-cost <usd>       Maximum cumulative cost before stopping
--model <m>            Model ID (default: claude-sonnet-4-6)
--output-dir <d>       Output directory (default: results/)
--early-stop-pairs <n> Min completed pairs before futility checks (default: 4)
--resume               Resume from last checkpoint
--parallel             Run conditions in parallel
--cache-cooldown <ms>  Delay between runs to avoid cache hits (default: 0)
--list-tasks           List all available tasks and exit
-v, --verbose          Verbose output
```
Tasks

7 repos (5 local fixtures, 2 pinned open-source). Each task has a detailed variant (names files, functions, symptoms) and an opaque variant (the agent must explore). Use --prompt-style to filter.

| Task | Repo | Category | Difficulty | What It Tests |
|---|---|---|---|---|
| bug-fix-retry | medium-ts-api | bug-fix | medium | Off-by-one in exponential backoff retry logic |
| add-pagination | medium-ts-api | add-feature | medium | Cursor-based pagination for users resource |
| add-request-timeout | medium-ts-api | add-feature | medium | Configurable request timeout with error type |
| multi-file-error-refactor | medium-ts-api | refactor | hard | Centralize scattered error handling |
| add-request-batching | medium-ts-api | add-feature | hard | Concurrent request batching with error handling |
| test-posts-resource | medium-ts-api | test-writing | medium | Tests for untested API resource module |
| fix-relative-time | large-ts-project | bug-fix | medium | Off-by-one in date utility at day boundaries |
| fix-task-transition | large-ts-project | bug-fix | medium | Task service status transition validation |
| test-notification-service | large-ts-project | test-writing | medium | Tests for notification service |
| test-date-utils | large-ts-project | test-writing | easy | Tests for date utility module |
| fix-order-tax | ecommerce-api | bug-fix | medium | Tax calculated on subtotal instead of discounted amount |
| test-payment-service | ecommerce-api | test-writing | medium | Tests for payment service |
| add-product-search | ecommerce-api | add-feature | medium | Full-text search for product service |
| add-order-history | ecommerce-api | add-feature | hard | Order history tracking with status transitions |
| test-middleware-validation | ecommerce-api | test-writing | easy | Tests for validation middleware |
| add-inventory-check | ecommerce-api | add-feature | hard | Inventory checking in order creation flow |
| fix-auth-validation | nestjs-api | bug-fix | medium | Bearer token validation logic |
| fix-user-not-found | nestjs-api | bug-fix | easy | Missing NotFoundError for absent users |
| add-user-search | nestjs-api | add-feature | medium | Search method for UsersService |
| add-semver-validator | zod | add-feature | medium | .semver() string format validator |
| cancel-refund | ecommerce-api-quality | quality | medium | Auto-refund payments on order cancellation |
| fix-tax-quality | ecommerce-api-quality | quality | medium | Tax calculation bug with hidden behavioral tests |
| bulk-discount | ecommerce-api-quality | quality | medium | Bulk quantity discount (10% for 5+ items) |

Default runs use 9 core task pairs (18 tasks) covering all repos, categories and difficulties. Use --all-tasks for the full set.

Each task has multiple evaluators (file existence, content matching, test execution, test count, function behavior, typecheck) to prevent false positives.

Metrics
| Metric | Description |
|---|---|
| Input tokens | Total input tokens consumed (prompt + context + tool results) |
| Output tokens | Total output tokens generated by the model |
| Cost | API cost in USD based on model pricing |
| Duration | Wall-clock time from first API call to task completion |
| Turns | Number of model invocations (each turn = one API call) |
| Exploration calls | Read-only tool calls: read, glob, grep, exploratory bash |
| Editing calls | Mutating tool calls: write, edit, file-modifying bash |
| Execution calls | Test/build tool calls: bash running tests, type-checking, installing |
| Unique files read | Distinct files read during the session |
| Files read, never edited | Files read but never subsequently written to or edited |
| Redundant file reads | Number of times a file was re-read (same path read more than once) |
| Cache hit rate | Fraction of input tokens served from cache: cacheRead / (cacheRead + input) |
| Pass rate | Percentage of runs that pass all evaluators |
| Score | Partial credit (0.0-1.0) based on fraction of evaluators passed |
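
The two derived metrics above are simple ratios; a minimal sketch (not the harness's actual code) makes the definitions precise:

```typescript
// Cache hit rate: fraction of input tokens served from the prompt cache,
// where `uncachedInputTokens` counts input tokens billed at the full rate.
function cacheHitRate(cacheReadTokens: number, uncachedInputTokens: number): number {
  const total = cacheReadTokens + uncachedInputTokens;
  return total === 0 ? 0 : cacheReadTokens / total;
}

// Score: partial credit as the fraction of evaluators that passed.
function score(evaluatorResults: boolean[]): number {
  if (evaluatorResults.length === 0) return 0;
  return evaluatorResults.filter(Boolean).length / evaluatorResults.length;
}
```

For example, 90K tokens served from cache against 30K uncached gives a cache hit rate of 0.75, and passing 3 of 4 evaluators gives a score of 0.75.
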
Evaluators and statistical methods

| Type | Description |
|---|---|
| file-exists | Check that a file exists (or doesn't) |
| content-match | Regex pattern match against file contents |
| test-pass | Run a test command and check exit code |
| test-count | Verify minimum number of passing tests |
| function-behavior | Execute a function with known inputs and verify output |
| typecheck | Run tsc --noEmit and verify no type errors |
| diff-check | Verify expected files were modified via git diff |
| coverage-minimum | Validate test coverage meets a minimum threshold |

Reports are saved to results/ as versioned files:

  • benchmark-v{clarteVersion}-{date}.json - Full machine-readable data
  • benchmark-v{clarteVersion}-{date}.md - Human-readable markdown with tables

Each report includes system metadata (Node version, OS, CPU, memory), the Clarté version and git SHA, confidence intervals, p-values and effect sizes for all comparisons. Results are stratified by difficulty and category when multiple strata exist.

Statistical methods

| Method | Purpose |
|---|---|
| Wilcoxon signed-rank | Primary paired test for metric differences (sessions matched by taskId + repetition) |
| Mann-Whitney U | Fallback unpaired test when < 5 pairs can be matched |
| Fisher's exact | Test significance of pass rate differences |
| BCa bootstrap CI | 95% confidence intervals for medians (10,000 resamples) |
| Cliff's delta | Non-parametric effect size (negligible/small/medium/large) |
| Benjamini-Hochberg | FDR correction across all tests within a report |
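
Cliff's delta, the effect size used throughout the report, is small enough to show in full: it is the probability that a value drawn from one sample exceeds one from the other, minus the reverse probability. The qualitative labels follow the common convention of |d| < 0.147 negligible, < 0.33 small, < 0.474 medium, else large (a widely used threshold set, assumed here to match the harness's).

```typescript
// Cliff's delta: (#pairs where a > b minus #pairs where a < b) / total pairs.
// Ranges from -1 (all of `a` below `b`) to +1 (all of `a` above `b`).
function cliffsDelta(a: number[], b: number[]): number {
  let more = 0;
  let less = 0;
  for (const x of a) {
    for (const y of b) {
      if (x > y) more++;
      else if (x < y) less++;
    }
  }
  return (more - less) / (a.length * b.length);
}
```

Being rank-based, it needs no normality assumption, which suits cost and turn-count distributions that are typically skewed.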

Development

```
npm test          # Unit tests
npm run typecheck # Type checking
```
