This repository contains a Python harness for benchmarking agentic coding CLI tools such as Codex and Cursor. The harness focuses on stressing context retrieval by running the agents against tasks backed by workspaces of different sizes—including real-world open source repositories—and verifying the resulting code quality and performance.
- Command abstraction – Each agent is described declaratively, allowing you to plug in CLIs like `codex` and `cursor-agent`. Sample shims live in `agents_benchmark/scripts/` to demonstrate how to adapt bespoke interfaces; by default they expect the binaries specified in `$CODEX_AGENT_BIN` and `$CURSOR_AGENT_BIN` (or an explicit `--agent-bin` argument).
- Workspace isolation – Tasks are executed in per-run copies of the source workspace, enabling repeatable and parallel-friendly benchmarking.
- Validation hooks – Run project-specific checks (e.g. `pytest`, bespoke scripts) to gauge code quality after an agent finishes.
- Structured reporting – Collect per-run metrics including durations, success rates, validation outcomes, and log file locations.
- Verbose telemetry – Run with `-v`/`-vv` to stream live progress and debug-level diagnostics during benchmark execution.
- Sample scenarios – Two example tasks clone `pallets/click` (small) and `django/django` (large) to highlight context retrieval stress at different scales.
```bash
# Create and activate a virtual environment (Python 3.14+)
python3 -m venv .venv
source .venv/bin/activate

# Install the package and its dependencies
pip install -e .
pip install pytest  # needed for validation in the sample scenarios

# Run with the default config (stub agents, fast)
benchmark --summary -v

# Run with pretty formatted output
agents-benchmark-cli

# Run with real agents
benchmark --config scenarios/sample.json --summary -v
```

Important: always activate the venv (`source .venv/bin/activate`) before running benchmarks to ensure Python 3.14+ is used.
The CLI copies each task workspace into the run directory, invokes both agents with the configured commands, executes validation suites, and stores a structured report under `reports/<timestamp>/report.json`.
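Conceptually, the per-run isolation step can be sketched as follows. This is an illustration only, not the harness's actual implementation; the `make_run_workspace` helper and its directory layout are assumptions inferred from the workspace paths that appear in the report.

```python
import shutil
from pathlib import Path

def make_run_workspace(source_dir: str, run_dir: str, agent: str,
                       task_id: str, run_index: int) -> Path:
    """Copy the task workspace into an isolated per-run directory.

    The hypothetical layout mirrors the workspace paths seen in the
    report: <run_dir>/<agent>/<task_id>-run<N>.
    """
    dest = Path(run_dir) / agent / f"{task_id}-run{run_index}"
    # copytree creates intermediate directories, so each agent/run pair
    # gets a fresh, independent copy of the source tree.
    shutil.copytree(source_dir, dest)
    return dest
```

Because every run works on its own copy, runs can be repeated or parallelized without agents stepping on each other's edits.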
```bash
benchmark [--config PATH] [OPTIONS]
```

For interactive development, use the Rich-formatted wrapper:

```bash
agents-benchmark-cli  # auto-runs with --summary -v and colorful tables
```

All arguments are optional with defaults:
- `--config PATH` – Path to the benchmark configuration file (JSON or TOML)
  - Default: `scenarios/sample_stub.json` (fast stub agents for testing)
  - Supports: `.json` or `.toml` files
  - Use `scenarios/sample.json` for real agents
Optional Flags (all default to off/false):
- `--dry-run` – Validate the configuration and exit without running benchmarks
  - Default: off (runs benchmarks normally)
  - Combine with `--print-config` to inspect the parsed configuration
- `--print-config` – Print the parsed configuration for inspection
  - Default: off (no output)
  - Shows: agent commands, task sources, resolved file paths
-
--summary- Print compact JSON summary after the run to stdout- Default: off (prints "Benchmark completed. Report saved to ")
- With flag: outputs full JSON report (can redirect:
--summary > results.json) - Tip: Use
agents-benchmark-cliinstead for prettier formatted output
- `-v`, `--verbose` – Increase logging verbosity (repeatable)
  - Default: WARNING level (errors and warnings only)
  - `-v`: INFO level (progress messages, task completions, validation results)
  - `-vv`: DEBUG level (detailed execution traces, commands, timeouts, environment)
Validate configuration:

```bash
# Check the config is valid without running
benchmark --config scenarios/sample.json --dry-run

# Validate and print the parsed config
benchmark --config scenarios/sample.json --dry-run --print-config
```

Run benchmarks:
```bash
# Run with the default config (stub agents, fast for testing)
benchmark

# Run with a JSON summary and progress logging
benchmark --summary -v

# Run with a specific config
benchmark --config scenarios/sample.json --summary -v

# Run with detailed debug logging
benchmark --config scenarios/sample.json --summary -vv
```

Output:
- Without `--summary`: prints "Benchmark completed. Report saved to <path>"
- With `--summary`: prints the full JSON report to stdout (can be redirected: `agents-benchmark --summary > results.json`)
- Report files: always written to `reports/<timestamp>/report.json` regardless of the `--summary` flag
- Logs: written to `reports/<timestamp>/` with per-agent workspaces in subdirectories
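Because the report path embeds a timestamp, a small helper can locate the newest run programmatically. This is a sketch, not part of the harness; it assumes run directory names sort lexicographically by age, which the `YYYYMMDD-HHMMSS` naming shown above guarantees.

```python
import json
from pathlib import Path

def latest_report(reports_dir: str = "reports") -> dict:
    """Load report.json from the most recent timestamped run directory."""
    runs = sorted(p for p in Path(reports_dir).iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no run directories under {reports_dir}")
    # Directory names like 20251009-221918 sort lexicographically in
    # chronological order, so the last entry is the newest run.
    return json.loads((runs[-1] / "report.json").read_text())
```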
The `--summary` flag outputs a JSON report with aggregated metrics and per-run details:

```json
{
  "generated_at": "2025-10-09T22:19:19+00:00",
  "run_directory": "/absolute/path/to/reports/20251009-221918",
  "agents": {
    "codex": {
      "total_runs": 4,
      "agent_success_rate": 0.75,
      "validation_success_rate": 0.5,
      "average_duration_seconds": 125.3
    },
    "cursor-agent": {
      "total_runs": 4,
      "agent_success_rate": 1.0,
      "validation_success_rate": 1.0,
      "average_duration_seconds": 98.7
    }
  },
  "tasks": {
    "click-human-bytes": {
      "description": "Implement a human-readable bytes helper in Click",
      "workspace_size_bytes": 1048576,
      "agent_runs": [
        {
          "agent": "codex",
          "run_index": 1,
          "agent_success": true,
          "agent_duration_seconds": 120.5,
          "validation_success": true,
          "workspace": "/path/to/reports/.../codex/click-human-bytes-run1",
          "stdout_log": "/path/to/.../logs/codex/stdout-run1.log",
          "stderr_log": "/path/to/.../logs/codex/stderr-run1.log",
          "validation_stdout": "/path/to/.../validation/run1/stdout.log",
          "validation_stderr": "/path/to/.../validation/run1/stderr.log"
        }
      ]
    }
  }
}
```

Field descriptions:
Top level:
- `generated_at` – ISO 8601 timestamp when the report was generated
- `run_directory` – Absolute path to the run directory containing all artifacts
agents object (keyed by agent name):
- `total_runs` – Number of times this agent ran (tasks × runs_per_task)
- `agent_success_rate` – Fraction of runs where the agent exited successfully (0.0–1.0)
- `validation_success_rate` – Fraction of runs where validation passed (0.0–1.0)
- `average_duration_seconds` – Mean execution time across all runs
tasks object (keyed by task ID):
- `description` – Human-readable task description
- `workspace_size_bytes` – Size of the source workspace in bytes
- `agent_runs` – Array of individual run records
Each `agent_run` record:

- `agent` – Name of the agent that ran this task
- `run_index` – Run number (1-based)
- `agent_success` – Whether the agent completed successfully (exit code 0)
- `agent_duration_seconds` – How long the agent took to run
- `validation_success` – Whether validation passed (`null` if no validation)
- `workspace` – Path to the isolated workspace for this run
- `stdout_log` / `stderr_log` – Agent output logs
- `validation_stdout` / `validation_stderr` – Validation output logs (`null` if no validation)
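As a sketch of consuming these records in Python (the `failed_runs` helper below is illustrative, not part of the package), you could flag every run where the agent or its validation failed, treating `null` validation as neither pass nor fail:

```python
def failed_runs(report: dict) -> list[tuple[str, str, int]]:
    """Return (task_id, agent, run_index) for runs where the agent exited
    non-zero or validation explicitly failed. A validation_success of
    None (JSON null) means no validation was configured and is skipped."""
    failures = []
    for task_id, task in report["tasks"].items():
        for run in task["agent_runs"]:
            if not run["agent_success"] or run["validation_success"] is False:
                failures.append((task_id, run["agent"], run["run_index"]))
    return failures
```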
Processing the summary:

```bash
# Save to a file
agents-benchmark --summary > benchmark-results.json

# Extract specific metrics
agents-benchmark --summary | jq '.agents.codex.agent_success_rate'

# Check which tasks failed
agents-benchmark --summary | jq '.tasks | to_entries[] |
  select(.value.agent_runs[].agent_success == false) | .key'

# Get average duration by agent
agents-benchmark --summary | jq '.agents | to_entries[] |
  {agent: .key, avg_duration: .value.average_duration_seconds}'
```

- Click (`pallets/click`) – Implement a human-readable bytes helper and satisfy `tests/test_benchmark_utils.py`.
- Django (`django/django`) – Enhance `slugify` with an optional dash preservation flag validated by `benchmarks/agent_task_check.py`.
The repositories are cloned into `scenarios/repos/` using `git clone --depth 1`. Refresh them whenever you want newer upstream changes.
Benchmark runs are configured via JSON or TOML. The sample configuration at `scenarios/sample.json` illustrates the schema:

- `agents`: list of agent definitions with `name`, `command`, optional `env`, `timeout_seconds`, and `warmup_command`.
- `tasks`: list of tasks referencing a `source_dir`, optional `instructions`, and an optional `validation` command. The sample configuration points to shallow clones stored under `scenarios/repos/` and uses shim commands under `agents_benchmark.scripts.*` as placeholders for the real agent CLIs. The placeholder `{python}` expands to the interpreter running the benchmark, making it easy to reuse the active virtual environment inside task workspaces.
- `runs_per_task`: number of repeat runs per task, for stability.
- `output_dir`: location for benchmark artifacts and reports.
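A minimal configuration fragment combining these fields might look like the following. This is a hedged sketch based only on the field names described above (the exact value shapes, such as whether `command` and `validation` are argument lists, are assumptions); consult `scenarios/sample.json` for the authoritative schema.

```json
{
  "runs_per_task": 2,
  "output_dir": "reports",
  "agents": [
    {
      "name": "codex",
      "command": ["{python}", "-m", "agents_benchmark.scripts.codex_wrapper"],
      "env": {"CODEX_AGENT_BIN": "/usr/local/bin/codex"},
      "timeout_seconds": 600
    }
  ],
  "tasks": [
    {
      "source_dir": "scenarios/repos/click",
      "instructions": "Implement a human-readable bytes helper.",
      "validation": ["{python}", "-m", "pytest", "tests/test_benchmark_utils.py"]
    }
  ]
}
```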
You can extend the harness with additional agents by subclassing the command runner or by implementing custom `AgentRunner` classes if a CLI needs special handling (for instance, to stream prompts or capture intermediate telemetry).
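For instance, a bare-bones runner built on `subprocess` might look like this. The class name, constructor, and `run` signature here are hypothetical; the actual base class lives in `agents_benchmark/agents/base.py` and may differ.

```python
import subprocess
import time

class SubprocessAgentRunner:
    """Illustrative runner: invokes an agent CLI inside a workspace and
    returns metrics shaped like the report's agent_run records."""

    def __init__(self, command: list[str], timeout_seconds: int = 600):
        self.command = command
        self.timeout_seconds = timeout_seconds

    def run(self, workspace: str, instructions: str) -> dict:
        start = time.monotonic()
        # Pass the task instructions as the final CLI argument and run
        # the agent with the isolated workspace as its working directory.
        proc = subprocess.Popen(
            self.command + [instructions],
            cwd=workspace,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        stdout, stderr = proc.communicate(timeout=self.timeout_seconds)
        return {
            "success": proc.returncode == 0,
            "duration_seconds": time.monotonic() - start,
            "stdout": stdout,
            "stderr": stderr,
        }
```

A custom subclass could override `run` to stream prompts interactively or tap intermediate telemetry before returning the same metrics dictionary.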
- Integrate real Codex and Cursor agent invocations in `scenarios/sample.json` by replacing the shim commands or updating them to call the actual CLIs.
- Implement deeper code quality scoring (coverage, linting, runtime metrics) by extending the validation hooks.
- Capture agent interaction transcripts for qualitative analysis.
- Add concurrency controls if you plan to benchmark many tasks in parallel.
- `agents_benchmark/` – Python package (cli.py, cli_wrapper.py, runner.py, config.py)
- `agents_benchmark/agents/` – Agent adapters (base.py, command.py)
- `agents_benchmark/scripts/` – Agent wrappers (codex_wrapper.py, cursor_wrapper.py, stub_agent.py)
- `agents_benchmark/tests/` – Unit tests