OpenAdapt Evals

Evaluation infrastructure for GUI agent benchmarks.

Overview

openadapt-evals provides a unified framework for evaluating GUI automation agents across standardized benchmarks like Windows Agent Arena (WAA), OSWorld, WebArena, and others.

Recent Improvements

We've made significant improvements to reliability, cost-efficiency, and observability:

Azure Reliability (v0.2.0 - January 2026)

95%+ Success Rate Target: Fixed nested virtualization issues that caused 0% task completion
VM Configuration: Upgraded to Standard_D4s_v5 with proper nested virtualization support
Health Monitoring: Automatic detection and retry of stuck jobs
Fast Failure Detection: 10-minute timeout instead of 8+ hour hangs
See PR #11 for details

Cost Optimization (v0.2.0 - January 2026)

67% Cost Reduction: From $7.68 to $2.50 per full evaluation (154 tasks)
Tiered VM Sizing: Automatic VM size selection based on task complexity (37% savings)
Spot Instance Support: 70-80% discount on compute costs (64% savings with tiered VMs)
Azure Container Registry: 10x faster image pulls (1-2 min vs 8-12 min)
Real-time Cost Tracking: Monitor costs during evaluation
See COST_OPTIMIZATION.md and PR #13 for details

Screenshot Validation & Viewer (v0.2.0 - January 2026)

Real Benchmark Screenshots: Viewer now displays actual WAA evaluation screenshots
Auto-Screenshot Tool: Automated screenshot generation with Playwright
Screenshot Validation: Manifest-based validation ensuring correctness
Execution Logs: Step-by-step logs with search and filtering
Live Monitoring: Real-time Azure ML job monitoring with auto-refresh
See PR #6 for details

Installation

pip install openadapt-evals

Or with uv:

uv add openadapt-evals

Quick Start

Note: Examples use real WAA evaluation data. For testing without a Windows VM, see the Mock Adapter section below.

from openadapt_evals import (
    WAALiveAdapter,
    WAALiveConfig,
    ApiAgent,
    evaluate_agent_on_benchmark,
    compute_metrics,
)

# Configure connection to WAA server (real Windows VM)
config = WAALiveConfig(
    server_url="http://vm-ip:5000",
    a11y_backend="uia",
    max_steps=15,
)

# Create adapter for live WAA evaluation
adapter = WAALiveAdapter(config)

# Create API-based agent (Claude or GPT)
agent = ApiAgent(provider="anthropic")  # or "openai" for GPT-5.1

# Run evaluation
results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_1"])

# Compute metrics
metrics = compute_metrics(results)
print(f"Success rate: {metrics['success_rate']:.1%}")

Mock Adapter for Testing

For testing without a Windows VM, use the mock adapter:

from openadapt_evals import WAAMockAdapter, SmartMockAgent

# Create mock adapter (testing only, not for production use)
adapter = WAAMockAdapter(num_tasks=10)
agent = SmartMockAgent()

# Run mock evaluation
results = evaluate_agent_on_benchmark(agent, adapter, max_steps=15)

Warning: Mock adapter uses synthetic data and is only for testing infrastructure. Always use real WAA data for actual evaluations.

Core Concepts

BenchmarkAdapter

Abstract interface for benchmark integration. Implementations:

WAAAdapter - Windows Agent Arena (requires WAA repository)
WAAMockAdapter - Mock adapter for testing without Windows

BenchmarkAgent

Abstract interface for agents to be evaluated. Implementations:

ScriptedAgent - Follows predefined action sequence
RandomAgent - Takes random actions (baseline)
SmartMockAgent - Designed to pass mock adapter tests

Data Classes

BenchmarkTask - Task definition (instruction, domain, etc.)
BenchmarkObservation - Screenshot, accessibility tree, context
BenchmarkAction - Click, type, scroll, key actions
BenchmarkResult - Success/failure, score, trajectory

Benchmark Viewer

Generate an HTML viewer for benchmark results:

from openadapt_evals import generate_benchmark_viewer
from pathlib import Path

# Run evaluation with trace collection
from openadapt_evals import EvaluationConfig

config = EvaluationConfig(
    save_execution_traces=True,
    output_dir="benchmark_results",
    run_name="my_eval_run",
)

results = evaluate_agent_on_benchmark(agent, adapter, config=config)

# Generate viewer
generate_benchmark_viewer(
    benchmark_dir=Path("benchmark_results/my_eval_run"),
    output_path=Path("benchmark_results/my_eval_run/viewer.html"),
)

Demo: Benchmark Viewer in Action

Animation shows real WAA evaluation results from waa-live_eval_20260116_200004

The viewer provides:

Summary statistics (success rate, per-domain breakdown)
Task list with pass/fail status
Step-by-step replay with screenshots
Action and reasoning display
Playback controls (play/pause, speed, seek)
Execution logs with filtering and search

Viewer Screenshots

Overview Panel

Desktop view showing summary statistics and domain breakdown:

Task Detail View

Step-by-step task execution with screenshot replay:

Execution Logs

Detailed execution logs with filtering and search capabilities:

Responsive Design

The viewer works on all devices:

Desktop (1920x1080)	Tablet (768x1024)	Mobile (375x667)

Generating Viewer Screenshots

Automatically capture screenshots of the viewer in multiple viewports with built-in validation:

# Install Playwright (required for screenshots)
pip install playwright
playwright install chromium

# Generate screenshots with automatic validation
python -m openadapt_evals.benchmarks.auto_screenshot \
    --html-path benchmark_results/my_eval_run/viewer.html \
    --output-dir screenshots \
    --viewports desktop tablet mobile \
    --states overview task_detail log_expanded log_collapsed

The auto-screenshot tool includes:

Automatic Validation: Ensures screenshots match expected dimensions and content
Manifest Generation: Creates manifest.json with screenshot metadata
Multiple Viewports: Desktop (1920x1080), Tablet (768x1024), Mobile (375x667)
Multiple States: Overview, task detail, log expanded, log collapsed

Or programmatically:

from openadapt_evals.benchmarks.auto_screenshot import generate_screenshots

screenshots = generate_screenshots(
    html_path="benchmark_results/my_eval_run/viewer.html",
    output_dir="screenshots",
    viewports=["desktop", "tablet", "mobile"],
    states=["overview", "task_detail", "log_expanded", "log_collapsed"],
)

Custom Agents

Implement the BenchmarkAgent interface:

from openadapt_evals import BenchmarkAgent, BenchmarkAction, BenchmarkObservation, BenchmarkTask

class MyAgent(BenchmarkAgent):
    def act(
        self,
        observation: BenchmarkObservation,
        task: BenchmarkTask,
        history: list[tuple[BenchmarkObservation, BenchmarkAction]] | None = None,
    ) -> BenchmarkAction:
        # Your agent logic here
        return BenchmarkAction(type="click", x=0.5, y=0.5)

    def reset(self) -> None:
        # Reset agent state between tasks
        pass

Windows Agent Arena Integration

Command Line Interface

The package provides a CLI for running WAA evaluations:

# Check if WAA server is ready
python -m openadapt_evals.benchmarks.cli probe --server http://vm-ip:5000

# Run live evaluation against a WAA server
python -m openadapt_evals.benchmarks.cli live --server http://vm-ip:5000 --task-ids notepad_1,notepad_2

# Generate HTML viewer for results
python -m openadapt_evals.benchmarks.cli view --run-name my_eval_run

# Estimate Azure costs (with optimization options)
python -m openadapt_evals.benchmarks.cli estimate --tasks 154 --workers 10 --enable-tiered-vms --use-spot

# Run mock evaluation for testing (no Windows VM required - testing only!)
python -m openadapt_evals.benchmarks.cli mock --tasks 10

Note: Mock mode is for testing infrastructure only. Always use live or Azure mode for actual evaluations.

Live WAA Adapter

Connect to a WAA Flask server running inside a Windows VM:

from openadapt_evals import WAALiveAdapter, WAALiveConfig

# Configure connection to WAA server
config = WAALiveConfig(
    server_url="http://vm-ip:5000",
    a11y_backend="uia",  # or "win32"
    max_steps=15,
)

# Create adapter
adapter = WAALiveAdapter(config)

# Check connection
if not adapter.check_connection():
    print("WAA server not ready")

# Run evaluation
results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_1"])

Local WAA Evaluation

For real WAA evaluation with local WAA repository:

from openadapt_evals import WAAAdapter

adapter = WAAAdapter(waa_repo_path="/path/to/WindowsAgentArena")
tasks = adapter.list_tasks(domain="notepad")

results = evaluate_agent_on_benchmark(agent, adapter, task_ids=[t.task_id for t in tasks[:5]])

Azure-based Parallel Evaluation

Run WAA at scale using Azure ML compute with optimized costs:

# Install Azure dependencies
pip install openadapt-evals[azure]

# Set environment variables
export AZURE_SUBSCRIPTION_ID="your-subscription-id"
export AZURE_ML_RESOURCE_GROUP="your-resource-group"
export AZURE_ML_WORKSPACE_NAME="your-workspace"

# Enable cost optimizations (recommended)
export AZURE_ENABLE_TIERED_VMS=true
export AZURE_ENVIRONMENT=development  # Enables spot instances

# Run evaluation with multiple workers
python -m openadapt_evals.benchmarks.cli azure \
    --waa-path /path/to/WindowsAgentArena \
    --workers 10 \
    --timeout-hours 4

Cost Optimization: With tiered VMs and spot instances enabled, a full 154-task evaluation costs $2.50-4.00 instead of $7.68. See COST_OPTIMIZATION.md for details.

Or programmatically:

from openadapt_evals.benchmarks.azure import AzureConfig, AzureWAAOrchestrator

config = AzureConfig.from_env()
orchestrator = AzureWAAOrchestrator(
    config=config,
    waa_repo_path="/path/to/WindowsAgentArena",
)

results = orchestrator.run_evaluation(
    agent=my_agent,
    num_workers=40,  # 40 parallel VMs
    cleanup_on_complete=True,
)

Azure Reliability: The orchestrator now uses Standard_D4s_v5 VMs with proper nested virtualization support and automatic health monitoring, achieving 95%+ success rates.

Live Monitoring

Monitor Azure ML jobs in real-time with auto-refreshing viewer:

# Install viewer dependencies
pip install openadapt-evals[viewer]

# Start an Azure evaluation (in terminal 1)
python -m openadapt_evals.benchmarks.cli azure \
    --workers 1 \
    --task-ids notepad_1,browser_1 \
    --waa-path /path/to/WAA

# Monitor job logs in real-time (in terminal 2)
python -m openadapt_evals.benchmarks.cli azure-monitor \
    --job-name waa-waa3718w0-1768743963-20a88242 \
    --output benchmark_live.json

# Start live viewer API (in terminal 3)
python -m openadapt_evals.benchmarks.live_api \
    --live-file benchmark_live.json \
    --port 5001

# Open http://localhost:5001 in browser to see live progress!

Features:

Real-time log streaming from Azure ML jobs
Auto-refreshing viewer with "LIVE" indicator
Task/step progress tracking
Real-time cost tracking
No need to wait for job completion

See LIVE_MONITORING.md for full documentation.

API Reference

Evaluation Functions

evaluate_agent_on_benchmark(agent, adapter, ...) - Run evaluation
compute_metrics(results) - Aggregate metrics (success_rate, avg_score, etc.)
compute_domain_metrics(results, tasks) - Per-domain metrics

Data Collection

ExecutionTraceCollector - Collect execution traces during evaluation
save_execution_trace(task, result, trajectory, ...) - Save single trace

Utilities

action_to_string(action) - Convert action to readable string
format_accessibility_tree(tree) - Format a11y tree for display
parse_action_response(response) - Parse VLM response to action

Documentation

COST_OPTIMIZATION.md - Azure cost optimization guide (67% savings)
LIVE_MONITORING.md - Real-time Azure ML job monitoring
CLAUDE.md - Development guide and best practices
CHANGELOG.md - Version history and changes

WAA Benchmark Results

⚠️ PLACEHOLDER: The results below are placeholders. Actual benchmark results will be added once the full evaluation completes.

Baseline Reproduction

We run the full WAA benchmark using the same methodology as the original paper to establish baseline performance.

WAA Baseline Results (GPT-4o):

Metric	Paper Reported	Our Reproduction	Status
Success Rate	~19.5%	`[PLACEHOLDER]`	`[PENDING]`
Tasks Evaluated	154	`[PLACEHOLDER]`	`[PENDING]`
Avg Steps/Task	N/A	`[PLACEHOLDER]`	`[PENDING]`
Avg Time/Task	N/A	`[PLACEHOLDER]`	`[PENDING]`

Model Comparison

Performance of different agents on WAA:

Agent	Success Rate	Avg Steps	Notes
GPT-4o (baseline)	`[PLACEHOLDER]`	`[PLACEHOLDER]`	Zero-shot
Claude Sonnet 4.5	`[PLACEHOLDER]`	`[PLACEHOLDER]`	Zero-shot

Domain Breakdown

Success rates by Windows application domain:

Domain	Tasks	Success Rate
Notepad	`[PLACEHOLDER]`	`[PLACEHOLDER]`
Chrome	`[PLACEHOLDER]`	`[PLACEHOLDER]`
File Explorer	`[PLACEHOLDER]`	`[PLACEHOLDER]`
Settings	`[PLACEHOLDER]`	`[PLACEHOLDER]`
...	...	...

Note: Full domain breakdown will be added when benchmark completes.

License

MIT

Related Projects

openadapt-ml - Training and policy runtime
openadapt-grounding - UI element localization
openadapt-capture - Screen recording

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
.beads		.beads
.github/workflows		.github/workflows
animations		animations
benchmark_results		benchmark_results
demo_library		demo_library
docs		docs
openadapt_evals		openadapt_evals
screenshots		screenshots
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
demo_cost_tracking.py		demo_cost_tracking.py
pyproject.toml		pyproject.toml
refresh_vm_dashboard.py		refresh_vm_dashboard.py
uv.lock		uv.lock
vm_dashboard.html		vm_dashboard.html

License

OpenAdaptAI/openadapt-evals

Folders and files

Latest commit

History

Repository files navigation

OpenAdapt Evals

Overview

Recent Improvements

Azure Reliability (v0.2.0 - January 2026)

Cost Optimization (v0.2.0 - January 2026)

Screenshot Validation & Viewer (v0.2.0 - January 2026)

Installation

Quick Start

Mock Adapter for Testing

Core Concepts

BenchmarkAdapter

BenchmarkAgent

Data Classes

Benchmark Viewer

Demo: Benchmark Viewer in Action

Viewer Screenshots

Generating Viewer Screenshots

Custom Agents

Windows Agent Arena Integration

Command Line Interface

Live WAA Adapter

Local WAA Evaluation

Azure-based Parallel Evaluation

Live Monitoring

API Reference

Evaluation Functions

Data Collection

Utilities

Documentation

WAA Benchmark Results

Baseline Reproduction

Model Comparison

Domain Breakdown

License

Related Projects

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 2

Uh oh!

Languages

Packages