Merged
21 commits
a83ad7b
docs: add verified repo consolidation plan
abrichr Jan 28, 2026
1ee037c
feat: add openadapt-evals as optional dependency
abrichr Jan 28, 2026
0da88cb
docs(cli): clarify serve vs dashboard command naming
abrichr Jan 28, 2026
f7c9fe4
refactor(benchmarks): consolidate to re-export from openadapt-evals
abrichr Jan 28, 2026
fdebe76
refactor(benchmarks): delete deprecation stubs, import from openadapt…
abrichr Jan 28, 2026
e6a09d2
docs(readme): add WAA benchmark results section with placeholders
abrichr Jan 28, 2026
23beca3
docs(readme): move WAA benchmark results to openadapt-evals
abrichr Jan 28, 2026
90c5454
feat(cli): add VNC auto-launch and --fast VM option
abrichr Jan 28, 2026
ab2414d
docs: add WAA speedup options documentation
abrichr Jan 28, 2026
988d207
docs(readme): add benchmark execution logs section
abrichr Jan 28, 2026
0b6b206
docs(readme): clarify --run flag for benchmark execution logs
abrichr Jan 28, 2026
1e82176
docs(readme): add example output for logs commands
abrichr Jan 28, 2026
0c699ed
feat(cli): add --progress flag for benchmark ETA
abrichr Jan 28, 2026
06dd117
docs(research): add cua.ai vs openadapt-ml WAA comparison
abrichr Jan 28, 2026
d9acbcf
feat(cli): add parallelization support with --worker-id and --num-wor…
abrichr Jan 28, 2026
075a35d
docs(research): add market positioning and strategic differentiation
abrichr Jan 28, 2026
6022772
docs(waa): add parallelization and scalable benchmark design docs
abrichr Jan 29, 2026
0fe26aa
style: fix ruff formatting
abrichr Jan 29, 2026
4336e81
fix(imports): update internal code to import from openadapt-evals
abrichr Jan 29, 2026
20e9078
fix(imports): add missing EvaluationConfig import
abrichr Jan 29, 2026
0c0ce72
fix(deps): require openadapt-evals>=0.1.1
abrichr Jan 29, 2026
105 changes: 104 additions & 1 deletion README.md
@@ -825,9 +825,112 @@ uv run python -m openadapt_ml.benchmarks.cli vm monitor --mock
uv run python -m openadapt_ml.benchmarks.cli vm monitor --auto-shutdown-hours 2
```

### 13.5 Benchmark Execution Logs

View benchmark execution progress and logs:

```bash
# View WAA container status and Docker logs
uv run python -m openadapt_ml.benchmarks.cli logs

# View WAA benchmark execution logs (task progress, agent actions)
uv run python -m openadapt_ml.benchmarks.cli logs --run

# Stream execution logs live
uv run python -m openadapt_ml.benchmarks.cli logs --run -f

# Show last N lines of execution logs
uv run python -m openadapt_ml.benchmarks.cli logs --run --tail 100

# Show benchmark progress and ETA
uv run python -m openadapt_ml.benchmarks.cli logs --progress
```

**Example: Container status (`logs`)**
```
WAA Status (20.12.180.208)
============================================================

[Docker Images]
REPOSITORY TAG SIZE
waa-auto latest 25.4GB
windowsarena/winarena latest 25.8GB

[Container]
Status: Up 49 minutes

[Storage]
Total: 21G
Disk image: 64G

[QEMU VM]
Status: Running (PID 1471)
CPU: 176%, MEM: 51.6%, Uptime: 47:28

[WAA Server]
"status": "Probe successful"
(READY)
```

**Example: Benchmark execution logs (`logs --run -f`)**
```
Run log: /home/azureuser/cli_logs/run_20260128_175507.log
------------------------------------------------------------
Streaming log (Ctrl+C to stop)...

[2026-01-28 23:05:10,303 INFO agent/401-MainProcess] Thinking...
[2026-01-28 23:05:17,318 INFO python/62-MainProcess] Updated computer successfully
[2026-01-28 23:05:17,318 INFO lib_run_single/56-MainProcess] Step 9: computer.window_manager.switch_to_application("Summer Trip - File Explorer")
```
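The streamed lines above follow a regular `[timestamp LEVEL source] message` shape, so they can be post-processed with a small parser. The following is a minimal sketch, not part of the `openadapt_ml` CLI: the names `LOG_LINE`, `LogRecord`, and `parse_line` are hypothetical, and the pattern is inferred only from the three sample lines shown, so other log lines may not match.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Pattern inferred from the sample output only (an assumption, not a documented format):
#   [2026-01-28 23:05:17,318 INFO lib_run_single/56-MainProcess] Step 9: ...
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<ms>\d{3}) "
    r"(?P<level>[A-Z]+) (?P<source>[^\]]+)\] (?P<message>.*)$"
)
STEP = re.compile(r"^Step (?P<n>\d+): ")


class LogRecord(NamedTuple):
    timestamp: datetime
    level: str
    source: str          # e.g. "lib_run_single/56-MainProcess"
    message: str
    step: Optional[int]  # step number if the message starts with "Step N:"


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one log line; return None for banners and other non-matching lines."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    timestamp = datetime.strptime(f"{m['ts']},{m['ms']}", "%Y-%m-%d %H:%M:%S,%f")
    step_match = STEP.match(m["message"])
    step = int(step_match["n"]) if step_match else None
    return LogRecord(timestamp, m["level"], m["source"], m["message"], step)


sample = (
    "[2026-01-28 23:05:17,318 INFO lib_run_single/56-MainProcess] "
    'Step 9: computer.window_manager.switch_to_application("Summer Trip - File Explorer")'
)
record = parse_line(sample)
print(record.step, record.level, record.timestamp.isoformat())
```

Lines such as `Streaming log (Ctrl+C to stop)...` do not match the pattern and return `None`, which lets a caller skip banners without special-casing them.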

**Example: Benchmark progress (`logs --progress`)**
```
=== WAA Benchmark Progress ===

Log: /home/azureuser/cli_logs/run_20260128_175507.log
Started: 2026-01-28 22:55:14
Latest: 2026-01-28 23:28:37

Tasks completed: 1 / 154
Elapsed: 33 minutes

Avg time per task: ~33 min
Remaining tasks: 153
Estimated remaining: ~84h 9m

Progress: 0% [1/154]
```
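The estimate in the example output is consistent with a simple linear extrapolation: average time per completed task multiplied by the number of remaining tasks. The sketch below reproduces the sample figures under the assumption that elapsed time is truncated to whole minutes before averaging; the actual CLI implementation may compute this differently, and the function names here are hypothetical.

```python
from datetime import datetime
from typing import Optional


def estimate_remaining_minutes(
    started: datetime, latest: datetime, completed: int, total: int
) -> Optional[int]:
    """Linear ETA in minutes: average minutes per finished task times tasks left."""
    if completed <= 0:
        return None  # nothing finished yet, so there is no average to extrapolate
    # Assumption: elapsed time is truncated to whole minutes (33m 23s -> 33 min).
    elapsed_min = int((latest - started).total_seconds() // 60)
    avg_min_per_task = elapsed_min / completed
    return round(avg_min_per_task * (total - completed))


def format_eta(minutes: int) -> str:
    """Render minutes in the same '~84h 9m' style as the sample output."""
    hours, mins = divmod(minutes, 60)
    return f"~{hours}h {mins}m"


fmt = "%Y-%m-%d %H:%M:%S"
eta = estimate_remaining_minutes(
    datetime.strptime("2026-01-28 22:55:14", fmt),
    datetime.strptime("2026-01-28 23:28:37", fmt),
    completed=1,
    total=154,
)
print(format_eta(eta))  # ~84h 9m
```

With only one task completed, the average rests on a single sample, so an estimate like this should be treated as a rough guide until more tasks finish.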

**Other useful commands:**
```bash
# Check WAA server status (probe endpoint)
uv run python -m openadapt_ml.benchmarks.cli probe

# Check VM/Azure status
uv run python -m openadapt_ml.benchmarks.cli status

# Download benchmark results from VM
uv run python -m openadapt_ml.benchmarks.cli download

# Analyze downloaded results
uv run python -m openadapt_ml.benchmarks.cli analyze
```

**Running benchmarks:**
```bash
# Run full benchmark (154 tasks)
uv run python -m openadapt_ml.benchmarks.cli run --num-tasks 154

# Run specific domain
uv run python -m openadapt_ml.benchmarks.cli run --domain notepad --num-tasks 5

# Run single task
uv run python -m openadapt_ml.benchmarks.cli run --task notepad_1
```

For complete VM management commands and Azure setup instructions, see [`CLAUDE.md`](CLAUDE.md) and [`docs/azure_waa_setup.md`](docs/azure_waa_setup.md).

### 13.6 Screenshot Capture Tool

Capture screenshots of dashboards and VMs for documentation and PR purposes:
