Agents Benchmark Harness

This repository contains a Python harness for benchmarking agentic coding CLI tools such as Codex and Cursor. The harness focuses on stressing context retrieval by running the agents against tasks backed by workspaces of different sizes—including real-world open source repositories—and verifying the resulting code quality and performance.

Features

  • Command abstraction – Each agent is described declaratively, allowing you to plug in CLIs like codex and cursor-agent. Sample shims live in agents_benchmark/scripts/ to demonstrate how to adapt bespoke interfaces; by default they expect the binaries specified in $CODEX_AGENT_BIN and $CURSOR_AGENT_BIN (or an explicit --agent-bin argument). See the export example after this list.
  • Workspace isolation – Tasks are executed in per-run copies of the source workspace, enabling repeatable and parallel-friendly benchmarking.
  • Validation hooks – Run project-specific checks (e.g. pytest, bespoke scripts) to gauge code quality after an agent finishes.
  • Structured reporting – Collect per-run metrics including durations, success rates, validation outcomes, and log file locations.
  • Verbose telemetry – Run with -v/-vv to stream live progress and debug-level diagnostics during benchmark execution.
  • Sample scenarios – Two example tasks clone pallets/click (small) and django/django (large) to highlight context retrieval stress at different scales.
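
To point the bundled shims at real binaries, export the environment variables they read before running the benchmark; the paths below are placeholders for wherever the CLIs are installed:

export CODEX_AGENT_BIN=/path/to/codex
export CURSOR_AGENT_BIN=/path/to/cursor-agent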

Quick Start

# Create and activate virtual environment (Python 3.14+)
python3 -m venv .venv
source .venv/bin/activate

# Install package and dependencies
pip install -e .
pip install pytest  # Needed for validation in sample scenarios

# Run with default config (stub agents, fast)
benchmark --summary -v

# Run with pretty formatted output
agents-benchmark-cli

# Run with real agents
benchmark --config scenarios/sample.json --summary -v

Important: Always activate the venv (source .venv/bin/activate) before running benchmarks to ensure Python 3.14+ is used.

The CLI copies each task workspace into the run directory, invokes both agents with the configured commands, executes validation suites, and stores a structured report under reports/<timestamp>/report.json.

Command Line Reference

Main CLI

benchmark [--config PATH] [OPTIONS]

Pretty Wrapper

For interactive development, use the Rich-formatted wrapper:

agents-benchmark-cli  # Auto-runs with --summary -v and colorful tables

All arguments are optional with defaults:

  • --config PATH - Path to benchmark configuration file (JSON or TOML)
    • Default: scenarios/sample_stub.json (fast stub agents for testing)
    • Supports: .json or .toml files
    • Use scenarios/sample.json for real agents

Optional Flags (all default to off/false):

  • --dry-run - Validate configuration and exit without running benchmarks
    • Default: off (runs benchmarks normally)
    • Combine with --print-config to inspect parsed configuration
  • --print-config - Print the parsed configuration for inspection
    • Default: off (no output)
    • Shows: agent commands, task sources, resolved file paths
  • --summary - Print compact JSON summary after the run to stdout
    • Default: off (prints "Benchmark completed. Report saved to <path>")
    • With flag: outputs full JSON report (can redirect: --summary > results.json)
    • Tip: Use agents-benchmark-cli instead for prettier formatted output
  • -v, --verbose - Increase logging verbosity (repeatable)
    • Default: WARNING level (errors and warnings only)
    • -v: INFO level (progress messages, task completions, validation results)
    • -vv: DEBUG level (detailed execution traces, commands, timeouts, environment)

Usage Examples

Validate configuration:

# Check config is valid without running
benchmark --config scenarios/sample.json --dry-run

# Validate and print parsed config
benchmark --config scenarios/sample.json --dry-run --print-config

Run benchmarks:

# Run with default config (stub agents, fast for testing)
benchmark

# Run with JSON summary and progress logging
benchmark --summary -v

# Run with specific config
benchmark --config scenarios/sample.json --summary -v

# Run with detailed debug logging
benchmark --config scenarios/sample.json --summary -vv

Output:

  • Without --summary: Prints "Benchmark completed. Report saved to <path>"
  • With --summary: Prints full JSON report to stdout (can redirect: benchmark --summary > results.json)
  • Report files: Always written to reports/<timestamp>/report.json regardless of --summary flag
  • Logs: Written to reports/<timestamp>/ with per-agent workspaces in subdirectories

Summary Output Format

The --summary flag outputs a JSON report with aggregated metrics and per-run details:

{
  "generated_at": "2025-10-09T22:19:19+00:00",
  "run_directory": "/absolute/path/to/reports/20251009-221918",
  
  "agents": {
    "codex": {
      "total_runs": 4,
      "agent_success_rate": 0.75,
      "validation_success_rate": 0.5,
      "average_duration_seconds": 125.3
    },
    "cursor-agent": {
      "total_runs": 4,
      "agent_success_rate": 1.0,
      "validation_success_rate": 1.0,
      "average_duration_seconds": 98.7
    }
  },
  
  "tasks": {
    "click-human-bytes": {
      "description": "Implement a human-readable bytes helper in Click",
      "workspace_size_bytes": 1048576,
      "agent_runs": [
        {
          "agent": "codex",
          "run_index": 1,
          "agent_success": true,
          "agent_duration_seconds": 120.5,
          "validation_success": true,
          "workspace": "/path/to/reports/.../codex/click-human-bytes-run1",
          "stdout_log": "/path/to/.../logs/codex/stdout-run1.log",
          "stderr_log": "/path/to/.../logs/codex/stderr-run1.log",
          "validation_stdout": "/path/to/.../validation/run1/stdout.log",
          "validation_stderr": "/path/to/.../validation/run1/stderr.log"
        }
      ]
    }
  }
}

Field descriptions:

Top level:

  • generated_at - ISO 8601 timestamp when report was generated
  • run_directory - Absolute path to the run directory containing all artifacts

agents object (keyed by agent name):

  • total_runs - Number of times this agent ran (tasks × runs_per_task)
  • agent_success_rate - Fraction of runs where agent exited successfully (0.0-1.0)
  • validation_success_rate - Fraction of runs where validation passed (0.0-1.0)
  • average_duration_seconds - Mean execution time across all runs

tasks object (keyed by task ID):

  • description - Human-readable task description
  • workspace_size_bytes - Size of the source workspace in bytes
  • agent_runs - Array of individual run records

Each agent_run record:

  • agent - Name of the agent that ran this task
  • run_index - Run number (1-based)
  • agent_success - Whether agent completed successfully (exit code 0)
  • agent_duration_seconds - How long the agent took to run
  • validation_success - Whether validation passed (null if no validation)
  • workspace - Path to the isolated workspace for this run
  • stdout_log / stderr_log - Agent output logs
  • validation_stdout / validation_stderr - Validation output logs (null if no validation)

Processing the summary:

# Save to file
benchmark --summary > benchmark-results.json

# Extract specific metrics
benchmark --summary | jq '.agents.codex.agent_success_rate'

# Check which tasks failed
benchmark --summary | jq '.tasks | to_entries[] |
  select(.value.agent_runs[].agent_success == false) | .key'

# Get average duration by agent
benchmark --summary | jq '.agents | to_entries[] |
  {agent: .key, avg_duration: .value.average_duration_seconds}'
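
If jq is not available, the same report can be processed with a few lines of Python. This sketch reads a saved report.json (path passed as the first argument) and uses only the fields documented above:

import json
import sys

# Load a previously saved report, e.g. reports/<timestamp>/report.json
with open(sys.argv[1]) as fh:
    report = json.load(fh)

# Per-agent aggregates come straight from the "agents" object.
for name, stats in report["agents"].items():
    print(
        f"{name}: agent {stats['agent_success_rate']:.0%}, "
        f"validation {stats['validation_success_rate']:.0%}, "
        f"avg {stats['average_duration_seconds']:.1f}s"
    )

# List task/agent runs whose validation failed (None means no validation ran).
for task_id, task in report["tasks"].items():
    for run in task["agent_runs"]:
        if run["validation_success"] is False:
            print(f"validation failed: {task_id} ({run['agent']}, run {run['run_index']})")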

Sample Tasks

  • Click (pallets/click) – Implement a human-readable bytes helper and satisfy tests/test_benchmark_utils.py.
  • Django (django/django) – Enhance slugify with an optional dash preservation flag validated by benchmarks/agent_task_check.py.

The repositories are cloned into scenarios/repos/ using git clone --depth 1. Refresh them whenever you want newer upstream changes.
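
One way to refresh them is simply to recreate the shallow clones. The target directory names below are assumptions about the local layout, so adjust them to match the contents of scenarios/repos/:

rm -rf scenarios/repos/click scenarios/repos/django
git clone --depth 1 https://github.com/pallets/click scenarios/repos/click
git clone --depth 1 https://github.com/django/django scenarios/repos/django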

Configuration

Benchmark runs are configured via JSON or TOML. The sample configuration at scenarios/sample.json illustrates the schema:

  • agents: list of agent definitions with name, command, optional env, timeout_seconds, and warmup_command.
  • tasks: list of tasks referencing a source_dir, optional instructions, and an optional validation command.
  • runs_per_task: number of times each task is repeated, to smooth out run-to-run variance.
  • output_dir: location for benchmark artifacts and reports.

The sample configuration points to shallow clones stored under scenarios/repos/ and uses shim commands under agents_benchmark.scripts.* as placeholders for the real agent CLIs. The placeholder {python} expands to the interpreter running the benchmark, making it easy to reuse the active virtual environment inside task workspaces.
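
As a rough illustration, a configuration following that schema might look like the sketch below. The field names match the list above, but the exact value shapes (for example, whether command is a single string or an argument list), the paths, and the numbers are placeholders, so consult scenarios/sample.json for the authoritative format:

{
  "agents": [
    {
      "name": "codex",
      "command": ["{python}", "-m", "agents_benchmark.scripts.codex_wrapper"],
      "env": {"CODEX_AGENT_BIN": "/path/to/codex"},
      "timeout_seconds": 600
    }
  ],
  "tasks": [
    {
      "source_dir": "scenarios/repos/click",
      "instructions": "Implement a human-readable bytes helper.",
      "validation": ["{python}", "-m", "pytest", "tests/test_benchmark_utils.py"]
    }
  ],
  "runs_per_task": 2,
  "output_dir": "reports"
}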

You can extend the harness with additional agents by subclassing the command runner or by implementing custom AgentRunner classes if a CLI needs special handling (for instance, to stream prompts or capture intermediate telemetry).
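
For a CLI that needs that kind of special handling, a custom runner might take roughly the shape sketched below. This is purely hypothetical: the class name, method name, and signature are illustrative only, and the real base-class interface lives in agents_benchmark/agents/base.py, so adapt the sketch to whatever that module actually expects.

# Hypothetical sketch only; the real runner interface is defined in
# agents_benchmark/agents/base.py and may differ from this shape.
import subprocess
from pathlib import Path


class StreamingAgentRunner:
    """Illustrative runner that feeds the task prompt to a CLI on stdin."""

    def __init__(self, binary: str, timeout_seconds: int = 600) -> None:
        self.binary = binary
        self.timeout_seconds = timeout_seconds

    def run(self, workspace: Path, instructions: str) -> bool:
        # Some CLIs prefer long prompts on stdin rather than as an argument;
        # capture output so it can be written to the usual log files.
        result = subprocess.run(
            [self.binary],
            cwd=workspace,
            input=instructions,
            text=True,
            capture_output=True,
            timeout=self.timeout_seconds,
        )
        return result.returncode == 0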

Next Steps

  • Integrate real Codex and Cursor agent invocations in scenarios/sample.json by replacing the shim commands or updating them to call the actual CLIs.
  • Implement deeper code quality scoring (coverage, linting, runtime metrics) by extending validation hooks.
  • Capture agent interaction transcripts for qualitative analysis.
  • Add concurrency controls if you plan to benchmark many tasks in parallel.

Project Structure

  • agents_benchmark/ - Python package (cli.py, cli_wrapper.py, runner.py, config.py)
  • agents_benchmark/agents/ - Agent adapters (base.py, command.py)
  • agents_benchmark/scripts/ - Agent wrappers (codex_wrapper.py, cursor_wrapper.py, stub_agent.py)
  • agents_benchmark/tests/ - Unit tests
