diff --git a/.github/workflows/trigger-docs-update.yml b/.github/workflows/trigger-docs-update.yml new file mode 100644 index 0000000..15e6539 --- /dev/null +++ b/.github/workflows/trigger-docs-update.yml @@ -0,0 +1,23 @@ +name: Trigger Docs Update + +on: + push: + branches: + - main + paths: + - 'README.md' + - 'docs/**' + - 'CHANGELOG.md' + workflow_dispatch: + +jobs: + trigger: + runs-on: ubuntu-latest + steps: + - name: Trigger docs sync in main OpenAdapt repo + uses: peter-evans/repository-dispatch@v3 + with: + token: ${{ secrets.DOCS_UPDATE_TOKEN }} + repository: OpenAdaptAI/OpenAdapt + event-type: sync-package-docs + client-payload: '{"package": "openadapt-ml", "ref": "${{ github.sha }}"}' diff --git a/BEADS.md b/BEADS.md new file mode 100644 index 0000000..0816b61 --- /dev/null +++ b/BEADS.md @@ -0,0 +1,831 @@ +# BEADS - Build Evolution And Decision Summary + +This document tracks significant project decisions, architecture choices, and key learnings that persist across conversation compactions. Each "bead" represents a notable change worth remembering. + +--- + +## 2026-01-24: Windows Product Key Prompt Fix (CRITICAL) + +**Category**: bugfix, docker, regression + +**Summary**: Fixed Windows asking for product key during installation. The Dockerfile was REPLACING dockurr/windows's autounattend.xml with windowsarena's version, which broke the OOBE flow. + +**Root Cause**: Dockerfile line `COPY --from=windowsarena/winarena:latest /run/assets/win11x64-enterprise-eval.xml /run/assets/win11x64.xml` replaced dockurr/windows's native autounattend.xml which handles OOBE properly. + +**The Fix** (from commits 914513e and 6b9f744): +- DO NOT replace dockurr/windows's autounattend.xml +- Instead, PATCH it to add InstallFrom element (prevents "Select OS" dialog) +- VERSION="11e" is CORRECT - it downloads Enterprise Evaluation with built-in GVLK key + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/waa_deploy/Dockerfile` (lines 62-70) + +**Key Learning**: Never replace dockurr/windows's autounattend.xml - only patch it. The base image's XML handles OOBE properly for the editions it downloads. + +--- + +## 2026-01-24: Automatic Disk Space Management + +**Category**: feature, reliability + +**Summary**: Added automatic disk space check and cleanup to `vm run-waa` command. Before starting Windows, the CLI now checks if /mnt has at least 15GB free. If not, it automatically runs Docker cleanup (`docker system prune -af --volumes`) to reclaim space. This prevents the recurring "Not enough free space in /storage" errors during Windows extraction. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/cli.py` + - Added `ensure_sufficient_disk_space()` function (lines 246-305) + - Called from `run-waa` action before starting container (lines 3130-3135) + +**Why It Matters**: Windows extraction requires ~9GB, and Docker images/containers accumulate over time. Previously, users would hit disk space errors mid-setup, requiring manual intervention. Now the CLI proactively ensures space is available. + +**Behavior**: +1. Check /mnt free space +2. If < 15GB, run `docker system prune -af --volumes` +3. Re-check space +4. 
If still < 15GB, provide manual cleanup instructions and exit + +--- + +## 2026-01-24: Dockerfile CMD Fix - Missing /copy-oem.sh Script + +**Category**: bugfix, docker + +**Summary**: The waa-auto Dockerfile had a CMD that referenced a non-existent `/copy-oem.sh` script, causing containers to crash immediately on startup with "No such file or directory". Fixed by removing the reference since OEM files are already copied via modified `samba.sh`. + +**Root Cause**: The Dockerfile CMD was `["/copy-oem.sh /entry.sh --start-client false"]` but no script at `/copy-oem.sh` was ever created. The OEM file copying was actually handled by a `sed` modification to `/run/samba.sh` (line ~59) that injects `cp -r /oem/* /tmp/smb/` before the return statement. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/waa_deploy/Dockerfile` (line 230) + +**Fix**: Changed CMD from `["/copy-oem.sh /entry.sh --start-client false"]` to `["/entry.sh --start-client false"]` + +**Workaround**: To start an existing (broken) image, override the command: +```bash +docker run ... waa-auto:latest /entry.sh --start-client false +``` + +**Why It Matters**: Containers would crash immediately after creation, making it impossible to run the WAA benchmark without manual intervention. + +--- + +## 2026-01-24: Windows CMD UNC Path Limitation + +**Category**: learning, windows + +**Summary**: Windows CMD.EXE does not support UNC paths (like `\\host.lan\Data`) as the current directory. When running batch scripts from a network share, `cd /d \\host.lan\Data\server` silently fails and the current directory remains unchanged. + +**Symptoms**: +- Script runs from `\\host.lan\Data\script.bat` via Win+R +- `cd /d \\host.lan\Data\server` appears to succeed +- But `%CD%` shows `C:\Windows` (unchanged) +- Python tries to run `main.py` from wrong directory + +**Fix**: Map the network share to a drive letter first: +```batch +net use Z: \\host.lan\Data /persistent:no +cd /d Z:\server +``` + +**Key Learning**: Always use `net use` to map UNC paths to drive letters before using `cd` in batch scripts that need to change to a network directory. + +--- + +## 2026-01-24: VNC View Only Mode Toggle + +**Category**: feature, ux + +**Summary**: Added "View Only" toggle switch to the VNC controls in the Azure Operations dashboard. When enabled, an overlay blocks keyboard and mouse events from reaching the embedded VNC iframe, allowing users to watch the VM without accidentally sending input. The toggle state persists in localStorage. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/training_output/azure_ops.html` - Added CSS for toggle switch and input blocker overlay, HTML toggle control, JavaScript `toggleViewOnly()` function with localStorage persistence +- `/Users/abrichr/oa/src/openadapt-ml/training_output/current/azure_ops.html` - Copy for served version + +**Implementation Details**: +1. CSS toggle switch styled to match dashboard theme (accent color when active) +2. Transparent overlay div positioned over the VNC iframe to intercept all pointer/keyboard events +3. Visual indicator in top-right corner when view-only mode is active ("View Only Mode - Input Disabled") +4. Preference persisted in localStorage (`vnc_view_only_mode`) and restored on page load + +**Why It Matters**: Users need to monitor Windows VM activity during benchmark runs without accidentally interfering. This was previously listed as BUG-006 in the dashboard bugs document - the toggle was mentioned but never implemented. 
+ +--- + +## 2026-01-24: Auto-Shutdown to Prevent Runaway VM Costs + +**Category**: feature, cost-optimization + +**Summary**: Implemented default auto-shutdown for `vm monitor` command to prevent forgotten VMs from accumulating costs. The default is now 2 hours (was disabled). Users receive warnings at 15 minutes and 5 minutes before shutdown. Can be extended with `--auto-shutdown-hours N` or disabled with `--auto-shutdown-hours 0`. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/cli.py` - Updated default and added warnings +- `/Users/abrichr/oa/src/openadapt-ml/docs/auto_shutdown_design.md` - Design doc for future Azure Automation +- `/Users/abrichr/oa/src/openadapt-ml/CLAUDE.md` - Documentation update + +**Why It Matters**: A forgotten VM costs ~$0.42/hr ($10/day, $70/week). This change saves money by default while still allowing users to extend or disable as needed. Future Phase 2/3 will add server-side Azure Automation for shutdown even when client disconnects. + +**Cost Reference**: +- Standard_D4ds_v5: $0.422/hr +- 4 hours idle (common oversight): $1.69 +- 24 hours idle (weekend forgotten): $10.13 +- 1 week idle: $70.90 + +--- + +## 2026-01-24: Client-Side Evaluation Module (openadapt-evals) + +**Category**: feature, architecture + +**Summary**: Created client-side evaluation infrastructure in the `openadapt-evals` repository. The `EvaluatorClient` runs WAA evaluators locally by making HTTP calls to the WAA server's `/execute` endpoint. Includes `VMIPDiscovery` for auto-detecting VM IP from multiple sources (Azure CLI, SSH config, environment variables). + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-evals/openadapt_evals/evaluation/client.py` - EvaluatorClient +- `/Users/abrichr/oa/src/openadapt-evals/openadapt_evals/evaluation/ip_discovery.py` - VMIPDiscovery +- `/Users/abrichr/oa/src/openadapt-evals/openadapt_evals/evaluation/__init__.py` + +**Why It Matters**: Eliminates complexity of running evaluation as a separate service inside Docker. Evaluators just make HTTP calls and can run from anywhere with network access. Follows WAA's own design pattern (`run.py` uses client-side evaluation). + +--- + +## 2026-01-24: Install Script Progress Indicators + +**Category**: feature, ux + +**Summary**: Created `install.bat` with step-by-step progress indicators for Windows installation inside the WAA Docker container. Shows window title updates for 14 installation steps (Python, ChromeDriver, LibreOffice, VSCode, WAA server, etc.). The script is copied into the Dockerfile to replace the original silent install.bat. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/waa_deploy/install.bat` +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/waa_deploy/Dockerfile` + +**Why It Matters**: Previously, Windows installation appeared to hang with no progress feedback. Now VNC viewers can see exactly which step is running via the window title (e.g., "Step 3/14: Installing ChromeDriver..."). + +--- + +## 2026-01-24: Dashboard Consolidation (benchmark.html to ops.html) + +**Category**: refactor, ux + +**Summary**: Renamed `benchmark.html` to `ops.html` and updated the shared header from "Benchmarks" to "Operations" to better reflect the dashboard's broader purpose (VM management, Azure ops, not just benchmarks). Converted legacy `viewer.py` to a deprecation shim that re-exports from shared_ui.py. 
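
As a rough illustration of the shim pattern (a sketch only, not the actual module contents; the real shim lives in `openadapt_ml/benchmarks/viewer.py`):

```python
# viewer.py - deprecation shim (illustrative sketch)
import warnings

# Re-export the public API from its new home so old imports keep working.
from openadapt_ml.training.shared_ui import *  # noqa: F401,F403

warnings.warn(
    "openadapt_ml.benchmarks.viewer is deprecated; "
    "import from openadapt_ml.training.shared_ui instead.",
    DeprecationWarning,
    stacklevel=2,
)
```
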
+ +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/training/shared_ui.py` - Updated header text +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/cloud/local.py` - Updated filename references +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/benchmark_viewer.py` - Updated references +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/cli.py` - Updated CLI output messages +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/viewer.py` - Now deprecation shim + +**Why It Matters**: Clearer naming reflects actual usage. The dashboard manages VM operations, Azure resources, and SSH tunnels - not just benchmarks. + +--- + +## 2026-01-24: Azure Ops Status API Enhancement + +**Category**: feature, api + +**Summary**: Enhanced the `/api/azure-ops-status` endpoint with benchmark detection, log fetching from VM, and better phase/operation state detection. The API now returns richer status information for the Azure operations dashboard. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/cloud/local.py` + +**Why It Matters**: Dashboard can now show more detailed status including whether benchmarks are running, container logs, and accurate phase detection (setup, ready, benchmark, error states). + +--- + +## 2026-01-24: VNC Iframe Flicker Fix + +**Category**: bug-fix, ux + +**Summary**: Fixed VNC iframe flickering on the Azure operations dashboard (azure_ops.html). Added state tracking variables (`vncIframeLoaded`, `lastKnownVmState`) to prevent unnecessary iframe reloads. The VNC iframe now only reacts to actual state transitions (offline to online, or vice versa) rather than reloading on every status poll. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/training_output/azure_ops.html` + +**Why It Matters**: VNC was flickering every few seconds during status polling, making it unusable for monitoring. Now maintains stable connection while still updating status displays. + +--- + +## 2026-01-16: Unified Baseline Adapters for VLM Comparison + +**Category**: feature, architecture + +**Summary**: Implemented a comprehensive baseline adapter system for comparing VLM providers (Claude, GPT, Gemini) across three evaluation tracks: direct coordinate prediction (Track A), ReAct-style reasoning (Track B), and Set-of-Mark element selection (Track C). + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/baselines/__init__.py` +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/baselines/adapter.py` +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/baselines/config.py` +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/baselines/parser.py` +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/baselines/prompts.py` +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/models/providers/` + +**Why It Matters**: Enables systematic comparison of off-the-shelf VLMs before fine-tuning. Based on SOTA patterns from Claude Computer Use, Microsoft UFO/UFO2, OSWorld benchmark, and Agent-S/Agent-S2. + +--- + +## 2026-01-16: Benchmark Migration to openadapt-evals Package + +**Category**: architecture, refactor + +**Summary**: Consolidated benchmark code into a separate `openadapt-evals` package. The `openadapt_ml/benchmarks/` directory now contains deprecation stubs that re-export from `openadapt-evals`. 
+ +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/__init__.py` +- `/Users/abrichr/oa/src/openadapt-ml/CLAUDE.md` (updated documentation) + +**Why It Matters**: Separates benchmark infrastructure from training code, enabling cleaner dependency management and allowing benchmarks to be used independently. + +--- + +## 2026-01-16: Safety Gate and Perception Integration + +**Category**: feature + +**Summary**: Added a safety gate for runtime action validation and perception integration module for enhanced UI understanding. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/runtime/safety_gate.py` +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/perception/integration.py` +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/trace_export.py` + +**Why It Matters**: Safety gate prevents potentially dangerous actions from executing. Perception integration provides richer UI understanding for agents. + +--- + +## 2026-01-16: Representation Shootout Experiment Framework + +**Category**: feature, experiment + +**Summary**: Created a framework for systematically comparing different UI representation approaches (coordinates vs marks vs hybrid). + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/experiments/representation_shootout/` +- `/Users/abrichr/oa/src/openadapt-ml/docs/experiments/representation_shootout_design.md` + +**Why It Matters**: Enables data-driven decisions about which UI representation approach works best for different scenarios. + +--- + +## 2026-01-09: TRL + Unsloth Training Integration + +**Category**: feature, architecture + +**Summary**: Replaced custom training implementation with TRL (Transformer Reinforcement Learning) and Unsloth for faster, more efficient fine-tuning. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/training/trl_trainer.py` +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/scripts/train.py` +- `/Users/abrichr/oa/src/openadapt-ml/tests/test_trl_trainer.py` + +**Why It Matters**: TRL provides battle-tested training loops. Unsloth provides 2-4x speedup for LoRA fine-tuning. Version bumped to 0.2.0 for PyPI release. + +--- + +## 2026-01-09: Enhanced VM CLI and WAA Deployment + +**Category**: feature + +**Summary**: Significantly enhanced the VM CLI with new commands for WAA deployment, diagnostics, and monitoring. Added custom `waa-auto` Docker image that auto-downloads Windows 11. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/cli.py` +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/waa_deploy/Dockerfile` +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/waa_deploy/api_agent.py` +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/cloud/ssh_tunnel.py` + +**New CLI Commands**: +- `vm monitor` - Dashboard with auto-SSH tunnels +- `vm setup-waa` - Full VM setup with Docker +- `vm run-waa` - Run benchmark with agent options +- `vm diag` - Check disk, Docker, containers +- `vm logs` - View container logs +- `vm probe` - Check WAA server status +- `vm exec` - Run command in container +- `vm docker-prune` - Clean Docker resources +- `vm deallocate` / `vm start` - VM power management + +**Why It Matters**: Eliminates need for manual SSH commands. All VM operations now have CLI wrappers for consistency and documentation. 
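
A minimal sketch of the wrapper pattern (hypothetical helper; the actual `cli.py` implementation differs in its details):

```python
# Sketch: every VM operation goes through a named, documented CLI action
# instead of an ad-hoc SSH one-liner.
import subprocess

def vm_exec(host: str, command: str) -> str:
    """Run a command on the VM over SSH and return its stdout."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Because every operation has a name and a code path, it can be tested, logged, and re-run identically by users or agents.
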
+ +--- + +## 2026-01-04: VM Monitor Dashboard with Auto-Shutdown + +**Category**: feature + +**Summary**: Added `vm monitor` command that opens a real-time dashboard in the browser, automatically manages SSH tunnels for VNC access, and supports auto-shutdown to prevent runaway costs. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/cli.py` +- `/Users/abrichr/oa/src/openadapt-ml/CLAUDE.md` (mandatory dashboard note moved to TOP) + +**Why It Matters**: The dashboard is the single entry point for VM operations. Auto-shutdown prevents billing surprises when VMs are left running. + +--- + +## 2026-01-03: Demo Retrieval System and WAA Live Adapter + +**Category**: feature + +**Summary**: Implemented demo retrieval system for automatically selecting relevant demonstrations from a library based on task similarity. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/retrieval/` (if exists) +- `/Users/abrichr/oa/src/openadapt-ml/docs/experiments/demo_conditioned_prompting_results.md` + +**Why It Matters**: Validated that demo-conditioned prompting improves action accuracy from 33% (zero-shot) to 100% (with demo). + +--- + +## 2026-01-03: WAA Docker Image Fixes + +**Category**: bug-fix + +**Summary**: Fixed multiple issues with WAA Docker image: Python 3.13 compatibility, missing client dependencies (pydrive, openpyxl, docx), OEM files for Windows installation. + +**Key Commits**: +- `e5b3dc0` - Copy Python env from official image to avoid 3.13 compat issues +- `02e5e2f` - Add remaining WAA client dependencies +- `ebdc4f6` - Add missing pydrive and other client dependencies + +**Why It Matters**: The official `windowsarena/winarena:latest` image is broken. Custom `waa-auto` image is required for automated Windows setup. + +--- + +## 2026-01-02: Schema Consolidation + +**Category**: architecture, refactor + +**Summary**: Consolidated multiple schema variants into a single Pydantic-based Episode module with converters for external formats. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/schemas/` +- `/Users/abrichr/oa/src/openadapt-ml/docs/schema_consolidation_plan.md` + +**Why It Matters**: Maintains schema purity - external systems adapt to the schema, not vice versa. Simplifies data flow through the system. + +--- + +## 2026-01-02: WAA Demo-Conditioned Experiment + +**Category**: feature, experiment + +**Summary**: Added demo-conditioned experiment infrastructure with 7 manual demos recorded on Windows. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/agent.py` (DemoConditionedAgent) +- `/Users/abrichr/oa/src/openadapt-ml/docs/waa_demo_recording_guide.md` + +**Why It Matters**: Proves the core value proposition - trajectory-conditioned disambiguation of UI affordances improves accuracy significantly. + +--- + +## Uncommitted Changes (as of 2026-01-24) + +### Session Tracker +**Category**: feature (uncommitted) + +**File**: `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/session_tracker.py` + +**Summary**: Persists VM runtime and cost across page refreshes. Tracks session start time, elapsed time, accumulated seconds (for pause/resume), and computes cost based on VM hourly rates. + +**Why It Matters**: Dashboard shows consistent cost/time values even after page refresh. Handles VM deallocate/start cycles correctly. 
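
A sketch of the accounting logic described above (class and attribute names are illustrative; see `session_tracker.py` for the real implementation):

```python
import time

class SessionCost:
    """Accumulate billable seconds across pause/resume (deallocate/start) cycles."""

    def __init__(self, hourly_rate: float = 0.422):  # Standard_D4ds_v5 rate from above
        self.hourly_rate = hourly_rate
        self.accumulated_seconds = 0.0  # time from completed run segments
        self.segment_start: float | None = None

    def start(self) -> None:
        self.segment_start = time.time()

    def pause(self) -> None:  # called when the VM is deallocated
        if self.segment_start is not None:
            self.accumulated_seconds += time.time() - self.segment_start
            self.segment_start = None

    @property
    def cost_usd(self) -> float:
        elapsed = self.accumulated_seconds
        if self.segment_start is not None:
            elapsed += time.time() - self.segment_start
        return elapsed / 3600.0 * self.hourly_rate
```
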

---

### Disk Manager
**Category**: feature (uncommitted)

**File**: `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/disk_manager.py`

**Summary**: Automatic disk space monitoring and cleanup for the Azure VM. Proactively cleans Docker resources when disk space drops below thresholds.

**Thresholds**:
- Warning: < 20GB free on /mnt
- Critical: < 10GB free on /mnt
- Auto-cleanup trigger: < 15GB free

**Cleanup Priority**:
1. Docker build cache
2. Unused Docker images
3. Stopped containers
4. Old Windows storage files
5. Old benchmark results

**Why It Matters**: Prevents disk full errors during Windows VM operations, which can corrupt the Windows image.

---

### Dashboard Bug Inventory
**Category**: documentation (uncommitted)

**File**: `/Users/abrichr/oa/src/openadapt-ml/docs/DASHBOARD_BUGS.md`

**Summary**: Documented 10 known bugs in the Azure Ops dashboard:

| Bug ID | Description | Priority | Status |
|--------|-------------|----------|--------|
| BUG-001 | IP Address Flickering | P1 | Open |
| BUG-002 | Activity Detection Wrong State | P1 | Open |
| BUG-003 | SSE Connection Memory Leaks | P2 | **Fixed** |
| BUG-004 | Azure Jobs Polling Pauses | P2 | **Fixed** |
| BUG-005 | Session Tracker State Corruption | P2 | Open |
| BUG-006 | VNC Input Toggle Missing | P3 | **Fixed** |
| BUG-007 | Panel States Reset on Polling | P3 | Open |
| BUG-008 | Debug Console Logs | P3 | **Fixed** |
| BUG-009 | Hardcoded Azure Values | P4 | Open |
| BUG-010 | SSH Commands Not Using CLI | P4 | Open |

**Why It Matters**: Provides systematic tracking of dashboard issues with root cause analysis and fix approaches.

---

### Ingest Module
**Category**: feature (uncommitted)

**Files**:
- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/ingest/__main__.py`
- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/ingest/base.py`
- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/ingest/external_example.py`

**Summary**: Data ingestion infrastructure for importing external data sources into the Episode schema.

---

## Key Architecture Decisions

### CLI-First Development
All VM operations should use CLI commands, not raw SSH. This ensures:
- Commands are documented
- Commands are tested
- Commands persist across context compactions
- Commands can be run by users or agents consistently

### Dashboard-First for VM Operations
Always start `vm monitor` before any VM operations.
The dashboard: +- Auto-manages SSH tunnels (VNC at localhost:8006) +- Shows real-time VM status +- Provides all diagnostic information in one place + +### Schema Purity +The schema must remain domain-agnostic: +- External systems adapt TO the schema +- Never add fields for specific integrations +- Use `raw` and `metadata` dicts for integration-specific data +- Data transformation belongs in importers/exporters + +### Test in Container First +Before rebuilding Docker images (~30 min), test fixes inside running containers (~seconds): +```bash +docker run -d --name test-fix --entrypoint /bin/bash waa-auto:latest -c "sleep 3600" +docker exec test-fix sed -i 's/old/new/' /some/file.sh +docker exec test-fix /some/script.sh && ls /expected/output +docker rm -f test-fix +``` + +--- + +## Key Architectural Decision: Client-Side Evaluation (2026-01-24) + +**Category**: architecture, decision + +**File**: `/Users/abrichr/oa/src/openadapt-ml/docs/EVALUATION_ARCHITECTURE.md` + +**Summary**: After analyzing three approaches (sidecar service, client-side, volume mount), client-side evaluation was chosen as the recommended architecture for benchmark evaluation. + +**Decision**: Evaluators run client-side by importing from WAA vendor submodule and making HTTP calls to the WAA server's `/execute` endpoint. + +**Rationale**: +1. Follows WAA's own design (`run.py` uses client-side evaluation) +2. Simplest architecture (no extra services or ports) +3. Already proven working (`StandaloneEvaluator` in `evaluate_endpoint.py`) +4. Generalizes to other benchmarks (WebArena, OSWorld use same pattern) + +**Why It Matters**: Eliminates complexity of running evaluation as a separate service inside the Docker container. The key insight is that evaluators just make HTTP calls - they can run from anywhere with network access. + +--- + +## 2026-01-24: Synthetic Task ID Validation (a9c8bae) + +**Category**: bug-fix, architecture + +**Summary**: Fixed misleading evaluation results caused by synthetic task IDs. Mock task IDs like `notepad_1` were being accepted by the live adapter, producing fake evaluation scores that appeared legitimate. + +**Solution**: +- Added `is_real_waa_task_id()` validation function that checks task IDs against the real WAA task database +- Renamed mock task IDs from `notepad_1`, `chrome_1` to `mock_notepad_1`, `mock_chrome_1` prefix +- Live adapter now rejects synthetic IDs with a helpful error message pointing to the mock adapter +- Clear separation between mock testing and live evaluation + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-evals/openadapt_evals/adapters/waa_live.py` - Added ID validation +- `/Users/abrichr/oa/src/openadapt-evals/openadapt_evals/adapters/waa.py` - Updated mock ID prefix +- `/Users/abrichr/oa/src/openadapt-evals/openadapt_evals/benchmarks/cli.py` - Updated CLI references + +**Why It Matters**: Prevents false confidence from fake evaluation scores. Users now get clear feedback when using mock vs live evaluation. + +--- + +## 2026-01-24: Unified Dashboard Command (a0eea20) + +**Category**: feature, refactor + +**Summary**: Replaced the crash-prone `vm monitor` command with a new `vm dashboard` command that uses subprocess-based SSH tunnels and generates a single HTML page with an embedded VNC iframe. + +**Problem**: The original `vm monitor` crashed due to signal handling in threads, and multiple stale dashboard instances would accumulate. 
+ +**Solution**: +- Created new `dashboard.py` module (637 lines) with: + - `SubprocessTunnel` class for reliable SSH tunnel management + - Single-page HTML dashboard with VNC iframe + - Real-time status polling + - Clean process management (no orphan processes) +- Updated CLI to use new dashboard command +- Updated CLAUDE.md with new command documentation + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/dashboard.py` - New 637-line module +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/cli.py` - Added dashboard command +- `/Users/abrichr/oa/src/openadapt-ml/CLAUDE.md` - Updated documentation + +**Why It Matters**: Reliable dashboard is essential for VM monitoring. The subprocess-based approach avoids Python threading issues with signal handling. + +--- + +## 2026-01-24: Safe Process Management (ae8d78e) + +**Category**: documentation, safety + +**Summary**: Added comprehensive safe process management guidelines to CLAUDE.md after broad `pkill` patterns accidentally killed unrelated applications (Windsurf, Chrome tabs, Signal). + +**Problem**: Commands like `pkill -9 -f "openadapt"` matched too broadly and killed important user applications. + +**Solution**: Enhanced CLAUDE.md with: +- Banned patterns section (what NOT to do) +- Safe alternatives using port-based or PID-based killing +- Decision checklist: always run `pgrep -f "pattern" -l` first to see matches +- Examples of specific vs broad patterns + +**Key File**: +- `/Users/abrichr/oa/src/openadapt-ml/CLAUDE.md` - Added "Safe Process Management" section + +**Why It Matters**: Prevents accidental termination of user applications. Establishes pattern of checking before killing. + +--- + +## 2026-01-24: WAA Server Diagnosis (a9eaa03) + +**Category**: diagnosis, bug-fix + +**Summary**: Diagnosed why WAA `/probe` endpoint was not responding. The status dashboard was showing incorrect information. + +**Root Cause**: Container was running `dockurr/windows:latest` instead of `waa-auto:latest`. The base dockurr image doesn't have WAA server installed. + +**Solution**: Identified issue and recommended rebuilding with `--rebuild` flag to ensure correct image is used. + +**Diagnostic Steps**: +1. `vm diag` showed container running but probe failing +2. `vm exec --cmd "docker images"` revealed wrong image +3. `vm run-waa --rebuild` forces image rebuild + +**Why It Matters**: Demonstrates importance of checking actual container image, not just container status. + +--- + +## 2026-01-24: Disk Manager SSH Warning Fix (adf86b7) + +**Category**: bug-fix + +**Summary**: Fixed parsing errors in disk_manager.py caused by SSH warning messages like "Permanently added host to known hosts". + +**Problem**: SSH outputs warning messages to stderr which were being mixed with stdout, causing JSON/number parsing to fail. + +**Solution**: Updated disk_manager.py to filter out SSH warning messages before parsing command output. + +**Key File**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/disk_manager.py` - Fixed SSH warning filtering + +**Why It Matters**: Ensures reliable disk space monitoring even when SSH adds hosts to known_hosts file. + +--- + +## 2026-01-24: Client-Side Evaluation Module (a63112b) + +**Category**: feature, architecture + +**Summary**: Added comprehensive client-side evaluation infrastructure to openadapt-evals, including VM IP auto-discovery, config management, and proper WAA task loading. 
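
The core pattern is small enough to sketch (endpoint name from the summary above; the request payload shape is an assumption, not the exact wire format):

```python
import requests

def execute_on_vm(vm_ip: str, command: str, timeout: float = 30.0) -> dict:
    """Ask the WAA server inside the Windows VM to run a command via /execute."""
    resp = requests.post(
        f"http://{vm_ip}:5000/execute",
        json={"command": command},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()
```

Because this is plain HTTP, the evaluator can run on a laptop, in CI, or anywhere else with network access to the VM.
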
+ +**Key Components**: +- `EvaluatorClient` - Runs WAA evaluators locally via HTTP calls to WAA server `/execute` endpoint +- `VMIPDiscovery` - Auto-detects VM IP from Azure CLI, SSH config, or environment variables +- Config management - Centralized benchmark configuration + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-evals/openadapt_evals/evaluation/client.py` - EvaluatorClient +- `/Users/abrichr/oa/src/openadapt-evals/openadapt_evals/evaluation/discovery.py` - VMIPDiscovery +- `/Users/abrichr/oa/src/openadapt-evals/openadapt_evals/benchmarks/config.py` - Config management + +**Why It Matters**: Enables running benchmark evaluations from the client machine without needing to deploy evaluator code inside the Docker container. + +--- + +## 2026-01-24: Dockerfile Python Ordering Fix + +**Category**: bug-fix + +**Summary**: Fixed Docker build failure "python3: not found" at step 16/32. The Dockerfile had Python used (for api_agent patching) before it was installed. + +**Problem**: +- Python installation was at lines 169-182 +- Python used (python3 -c) at lines 95-100 for api_agent patching +- Build failed with "python3: not found" + +**Solution**: Moved Python installation section to lines 84-127, before api_agent patching. + +**Key File**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/waa_deploy/Dockerfile` + +**Why It Matters**: Without this fix, the waa-auto image cannot be built. + +--- + +## 2026-01-24: Auto-Shutdown Design for Azure VMs + +**Category**: feature, design + +**Summary**: Designed auto-shutdown mechanism to prevent wasted VM idle time (~$0.42/hr). + +**Problem**: VM was running for 4+ hours idle costing $1.74 because there was no automatic shutdown when not in use. + +**Recommended Solution**: Azure Automation + CPU Alert +- Server-side execution (works even if client disconnects) +- Monitors CPU < 5% for 30 min, then deallocates +- Costs ~$0.10/month +- See `docs/auto_shutdown_design.md` for full design + +**Quick Win**: Make `--auto-shutdown-hours 2` the default instead of opt-in. + +**Key File**: +- `/Users/abrichr/oa/src/openadapt-ml/docs/auto_shutdown_design.md` + +**Why It Matters**: Prevents billing surprises when VMs are forgotten. + +--- + +## 2026-01-24: Disk Space Threshold Adjustment + +**Category**: bug-fix + +**Summary**: Lowered disk space requirement from 50GB to 35GB for WAA benchmark runs. + +**Problem**: Benchmark was failing with "Insufficient disk space: 43GB (need 50GB)" even though Windows only needs ~30GB. + +**Solution**: Changed `MIN_DISK_GB` from 50 to 35 in cli.py. + +**Key File**: +- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/cli.py` (lines 3155-3168) + +**Why It Matters**: Allows benchmarks to run on VMs with smaller disks. + +--- + +## 2026-01-24: Viewer Screenshot Embedding Default + +**Category**: feature, ux + +**Summary**: Changed viewer generation to embed screenshots as base64 by default. + +**Problem**: viewer.html files referenced relative screenshot paths that broke when copied elsewhere. + +**Solution**: Changed `--embed-screenshots` to `--no-embed-screenshots` (inverted default). Viewers are now ~20MB but fully portable. + +**Key Files**: +- `/Users/abrichr/oa/src/openadapt-evals/openadapt_evals/benchmarks/cli.py` (cmd_view function) +- `/Users/abrichr/oa/src/openadapt-evals/openadapt_evals/benchmarks/viewer.py` (embed_screenshots parameter) + +**Why It Matters**: Viewers can now be copied/shared without breaking screenshot references. 
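
The embedding itself is straightforward; a sketch of the idea (hypothetical helper name):

```python
import base64
from pathlib import Path

def embed_screenshot(path: Path) -> str:
    """Inline a PNG as a base64 data URI so viewer.html has no file dependencies."""
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f'<img src="data:image/png;base64,{encoded}" alt="{path.name}">'
```

This is the trade-off noted above: base64 adds roughly a third to each screenshot's size, which is why embedded viewers grow to ~20MB.
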

---

## Key Learnings from 2026-01-24 Session

1. **Always check Dockerfile ordering** when adding new build steps that use tools
2. **Auto-shutdown should be default**, not opt-in, for cloud VMs
3. **Conservative disk checks can block valid operations** - 35GB is sufficient for WAA
4. **Viewer portability matters** - embedding screenshots by default prevents broken references
5. **Multiple concurrent Docker builds cause issues** - ensure only one build runs at a time

---

## 2026-01-24: WAA Server Startup Script Fix

**Category**: bug-fix

**Summary**: Fixed WAA server not starting after install.bat completes. Root cause was Python not being on PATH (`PrependPath=0`) and `start_waa_server.bat` using just `python` instead of the full path.

**Problem**:
1. install.bat used `PrependPath=0` when installing Python
2. `start_waa_server.bat` called `python main.py`, which failed (python not found)
3. WAA server never started, and the container showed "Waiting for response from windows server" forever

**Solution**:
1. Updated `install.bat` to use `PrependPath=1` (line 92)
2. Updated `start_waa_server.bat` to use the full Python path: `C:\Users\Docker\AppData\Local\Programs\Python\Python310\python.exe`

**Key Files**:
- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/waa_deploy/install.bat`
- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/benchmarks/waa_deploy/start_waa_server.bat`

**Why It Matters**: Critical for the WAA benchmark to run end-to-end. Without this fix, manual VNC intervention is required.

**Important Note**: Docker builds use files as they exist at build START. If you modify files after the build starts, you must rebuild with `--rebuild`.

---

## 2026-01-24: Dashboard Liveness Indicators

**Category**: feature, ux

**Summary**: Added a heartbeat indicator, current action display, and log freshness tracking to the Azure Ops dashboard.

**New Features**:
- Green pulsing dot (●) next to "Live Logs" shows the connection is active
- "Current action" display shows what's happening (Downloading, Installing, etc.)
- "Log updated Xs ago" shows freshness of log data
- API returns `server_time`, `dockerfile_step`, `current_action`, `log_file_mtime`

**Key Files**:
- `/Users/abrichr/oa/src/openadapt-ml/openadapt_ml/cloud/local.py` - `_detect_docker_build_on_vm()` enhanced
- `/Users/abrichr/oa/src/openadapt-ml/training_output/azure_ops.html` - UI additions

**Known Issue**: Stale Docker build data causes a contradictory display ("Dockerfile step 25/31" + "Log stale 36m ago"). Background agent ae154ac launched to fix this.

---

## 2026-01-24: Training Output File Serving

**Category**: bug-fix

**Summary**: The dashboard server serves from `training_output/current/`, not `training_output/`. Changes to azure_ops.html must be copied to the current directory.

**Problem**: Edits to `/training_output/azure_ops.html` weren't reflected in the served dashboard.

**Solution**: Copy updated files: `cp training_output/azure_ops.html training_output/current/azure_ops.html`

**Why It Matters**: Prevents confusion when dashboard changes don't appear.

---

## Key Learnings from 2026-01-24 Evening Session

1. **Docker builds capture files at START** - modifying files during the build has no effect
2. **Dashboard serves from training_output/current/** - must copy changes there
3. **PrependPath=1 is essential** for Python to be findable
4. **SSH timeouts cause stale detection** - the API may not detect running containers if SSH is slow
5.
**Contradictory UI states confuse users** - need to clear stale data when not applicable + +*Last updated: 2026-01-25* + +--- + +## 2026-01-25: Standalone WAA Build with Dev Mode - VERIFIED WORKING + +**Category**: architecture, verification + +**Summary**: After multiple failed approaches, found a **verified working** solution using vanilla WAA's "dev mode" with a standalone Dockerfile. The key insight: use Samba share (`\\host.lan\Data`) instead of OEM folder patching. + +**What Works**: +1. Standalone Dockerfile copies from LOCAL vendor submodule (no COPY --from circular dependency) +2. Uses vanilla WAA's `dev_win11x64-enterprise-eval.xml` unattend (expects files at `\\host.lan\Data`) +3. Injects file copy into `samba.sh` to populate `/tmp/smb/` at container startup +4. Only patches IP addresses (20.20.20.21 → 172.30.0.2 for modern dockurr/windows) +5. Image size: **1.3GB** (vs 45GB official) + +**Verification Results** (2026-01-25): +- [x] Windows ISO downloads automatically (VERSION=11e) +- [x] Windows installs unattended (no license key prompt) +- [x] FirstLogonCommands find scripts at `\\host.lan\Data` +- [x] setup.ps1 installs Python, Git, dependencies +- [x] WAA server starts on port 5000 +- [x] `/probe` endpoint returns 200 + +**Key Files**: +- `/openadapt_ml/benchmarks/waa_deploy/Dockerfile` - Standalone build (159 lines) +- `/openadapt_ml/benchmarks/cli.py` - Updated to copy vendor files and build +- `/docs/waa_automation_checklist.md` - End-to-end verification checklist + +**Why Previous Approaches Failed**: +1. **3-step vanilla build**: Required 50GB+ disk, downloads 15GB models +2. **COPY --from official image**: Circular dependency, old dockurr/windows base +3. **OEM folder with script patching**: Scripts expected `\\host.lan\Data`, not `C:\oem` + +**The Fix**: Use vanilla WAA's dev mode as-is. The scripts already expect `\\host.lan\Data` (Samba share). Just copy files to `/tmp/smb/` when container starts. + +**CLI Commands**: +```bash +# Full setup (builds 1.3GB image, ~5 min) +uv run python -m openadapt_ml.benchmarks.cli vm setup-waa + +# Run benchmark (first run downloads Windows, ~15 min) +uv run python -m openadapt_ml.benchmarks.cli vm run-waa --num-tasks 5 +``` + +--- diff --git a/docs/auto-docs-update.md b/docs/auto-docs-update.md new file mode 100644 index 0000000..92cc452 --- /dev/null +++ b/docs/auto-docs-update.md @@ -0,0 +1,249 @@ +# Auto-Update Documentation System + +This document describes the automatic documentation update system for OpenAdapt. + +## Overview + +When code is committed to OpenAdapt sub-packages (openadapt-ml, openadapt-evals, etc.), the main documentation site at https://docs.openadapt.ai/ is automatically updated. + +**Current Status**: docs.openadapt.ai is hosted on **GitHub Pages** using **MkDocs Material**, deployed from the main [OpenAdaptAI/OpenAdapt](https://github.com/OpenAdaptAI/OpenAdapt) repository. + +## Architecture + +``` +┌─────────────────────┐ repository_dispatch ┌─────────────────────┐ +│ openadapt-ml │ ─────────────────────────▶ │ OpenAdapt (main) │ +│ push to main │ │ │ +└─────────────────────┘ │ sync-package-docs │ + │ workflow runs │ +┌─────────────────────┐ repository_dispatch │ │ +│ openadapt-evals │ ─────────────────────────▶ │ Updates docs/ │ +│ push to main │ │ packages/*.md │ +└─────────────────────┘ │ │ + │ Triggers docs.yml │ + ... other repos ... │ (deploys to GH │ + │ Pages) │ + └─────────────────────┘ +``` + +## Components + +### 1. 
Trigger Workflow (in sub-packages) + +Located at `.github/workflows/trigger-docs-update.yml` in each sub-package. + +When README.md or docs/ changes are pushed to main, this workflow sends a `repository_dispatch` event to the main OpenAdapt repo. + +**Required secret**: `DOCS_UPDATE_TOKEN` - A GitHub PAT with `contents:write` permission on OpenAdaptAI/OpenAdapt. + +### 2. Sync Script (in main OpenAdapt repo) + +Located at `docs/_scripts/sync_package_docs.py` in the main repo. + +This Python script: +- Fetches README.md from each sub-package via GitHub API +- Transforms content to MkDocs-compatible format +- Updates `docs/packages/*.md` files +- Handles badges, relative links, and formatting + +### 3. Sync Workflow (in main OpenAdapt repo) + +Located at `.github/workflows/sync-package-docs.yml` in the main repo. + +Triggers on: +- `repository_dispatch` events from sub-packages +- Manual `workflow_dispatch` +- Daily schedule at 6 AM UTC + +Creates a PR with documentation changes and optionally auto-merges. + +### 4. Deploy Workflow (existing) + +The existing `docs.yml` workflow in the main repo deploys to GitHub Pages when docs are updated. + +## Setup Instructions + +### Step 1: Set Up the Main OpenAdapt Repo + +1. **Copy sync script**: + ```bash + # In the main OpenAdapt repo + cp path/to/openadapt-ml/scripts/sync_package_docs.py docs/_scripts/ + ``` + +2. **Copy sync workflow**: + ```bash + cp path/to/openadapt-ml/docs/workflows/sync-package-docs.yml .github/workflows/ + ``` + +3. **Configure repository settings**: + - Go to Settings > Actions > General + - Enable "Allow GitHub Actions to create and approve pull requests" + +### Step 2: Set Up Each Sub-Package + +1. **Create a Personal Access Token (PAT)**: + - Go to GitHub Settings > Developer settings > Personal access tokens > Fine-grained tokens + - Create token with: + - Repository access: `OpenAdaptAI/OpenAdapt` + - Permissions: `contents: write` + - Save the token securely + +2. **Add as organization secret** (recommended) or repo secret: + - Organization: Settings > Secrets and variables > Actions > New organization secret + - Name: `DOCS_UPDATE_TOKEN` + - Value: The PAT you created + - Repository access: Select repos that need it + +3. 
**Add the trigger workflow** to each sub-package: + - Copy `.github/workflows/trigger-docs-update.yml` + - Update `client-payload.package` to match the package name + +### Packages Currently Configured + +| Package | Trigger Workflow | Status | +|---------|-----------------|--------| +| openadapt-ml | `.github/workflows/trigger-docs-update.yml` | Ready | +| openadapt-evals | Needs to be added | Pending | +| openadapt-capture | Needs to be added | Pending | +| openadapt-viewer | Needs to be added | Pending | +| openadapt-grounding | Needs to be added | Pending | +| openadapt-retrieval | Needs to be added | Pending | +| openadapt-privacy | Needs to be added | Pending | + +## Manual Operations + +### Trigger a sync manually + +```bash +# From any sub-package repo (requires workflow file) +gh workflow run trigger-docs-update.yml + +# From the main OpenAdapt repo +gh workflow run sync-package-docs.yml + +# Sync a specific package +gh workflow run sync-package-docs.yml -f package=openadapt-ml +``` + +### Run sync script locally + +```bash +# In the main OpenAdapt repo +cd /path/to/OpenAdapt + +# Dry run (preview changes) +python docs/_scripts/sync_package_docs.py --dry-run + +# Sync all packages +python docs/_scripts/sync_package_docs.py + +# Sync specific package +python docs/_scripts/sync_package_docs.py --package openadapt-ml +``` + +## Troubleshooting + +### Workflow not triggering + +1. Check that paths filter matches your changes (README.md, docs/**, CHANGELOG.md) +2. Verify `DOCS_UPDATE_TOKEN` secret is set and not expired +3. Check Actions tab for workflow run logs +4. Ensure the workflow file is on the default branch + +### Sync script failing + +1. Check GitHub API rate limits (authenticated requests get 5000/hour) +2. Verify the package README exists and is accessible +3. Review error messages in workflow logs +4. Test locally with `--dry-run` + +### Docs not deploying + +1. Check that `docs.yml` workflow is enabled +2. Verify GitHub Pages is configured (Settings > Pages) +3. Check MkDocs build output for errors +4. Ensure `mkdocs.yml` nav includes the package pages + +### Permission errors + +1. Verify PAT has correct permissions (`contents:write` on main repo) +2. Check PAT hasn't expired +3. Ensure secret is accessible to the workflow + +## How It Works (Technical Details) + +### Transform Process + +The sync script applies these transformations to README content: + +1. **Title normalization**: Replaces `# PackageName` with standardized `# package-name` +2. **Badge removal**: Strips shields.io badges that don't render well in MkDocs +3. **Link fixing**: Converts relative links (`[link](docs/file.md)`) to absolute GitHub URLs +4. **Metadata addition**: Adds repository link at the top +5. **Footer addition**: Adds auto-generation notice with source link + +### Package Configuration + +Packages are configured in `sync_package_docs.py`: + +```python +PACKAGES = { + "openadapt-ml": { + "repo": "OpenAdaptAI/openadapt-ml", + "doc_file": "docs/packages/ml.md", + "title": "openadapt-ml", + "description": "Policy learning, training, and inference for GUI automation agents.", + }, + # ... more packages +} +``` + +To add a new package, add an entry to this dict and add the trigger workflow to the package repo. + +## Best Practices + +### README Format + +For best results, structure package READMEs as: + +```markdown +# Package Name + +Brief description (1-2 sentences). + +## Installation + +\`\`\`bash +pip install package-name +\`\`\` + +## Quick Start + +... + +## API Reference + +... 
+``` + +### What to Include + +- Installation instructions +- Basic usage examples +- API documentation or links +- Configuration options + +### What NOT to Include + +- CI badges (handled automatically) +- Development setup (belongs in CONTRIBUTING.md) +- Internal implementation details + +## Future Improvements + +- [ ] Add support for CHANGELOG.md syncing +- [ ] Add API reference auto-generation from docstrings +- [ ] Add versioned documentation support +- [ ] Add link validation in CI +- [ ] Add preview deployments for PRs diff --git a/docs/waa_automation_checklist.md b/docs/waa_automation_checklist.md new file mode 100644 index 0000000..246c003 --- /dev/null +++ b/docs/waa_automation_checklist.md @@ -0,0 +1,122 @@ +# WAA End-to-End Automation Checklist + +**Purpose**: Track all steps required for fully automated WAA deployment. If any step fails, automation is broken. + +--- + +## Pre-flight + +| Step | What | How to Verify | +|------|------|---------------| +| 1 | Azure credentials valid | `az account show` succeeds | +| 2 | SSH key exists | `~/.ssh/id_rsa.pub` exists | +| 3 | Nested virt VM size available | D4ds_v4 has quota | + +--- + +## VM Setup Phase + +| Step | What | How to Verify | Automated By | +|------|------|---------------|--------------| +| 4 | VM creates | `az vm show` succeeds | `vm create` | +| 5 | VM has nested virt | KVM device exists | VM size selection | +| 6 | Docker installs | `docker --version` | `vm setup-waa` | +| 7 | Docker uses /mnt storage | `docker info` shows /mnt | daemon.json | + +--- + +## Image Build Phase + +| Step | What | How to Verify | Automated By | +|------|------|---------------|--------------| +| 8 | Vendor files copied to VM | Files exist in ~/waa-build | `vm setup-waa` | +| 9 | Image builds | `docker images` shows image | `docker build` | +| 10 | Image size reasonable | < 2GB (not 45GB) | Standalone approach | + +--- + +## Container Start Phase + +| Step | What | How to Verify | Automated By | +|------|------|---------------|--------------| +| 11 | Container starts | `docker ps` shows running | `docker run` | +| 12 | VNC accessible | Port 8006 responds | Container ports | +| 13 | Samba share created | /tmp/smb/ exists | dockurr/windows | +| 14 | WAA files in Samba | /tmp/smb/install.bat exists | samba.sh injection | + +--- + +## Windows Install Phase + +| Step | What | How to Verify | Automated By | +|------|------|---------------|--------------| +| 15 | Windows ISO downloads | VNC shows "Downloading" | VERSION=11e | +| 16 | No ISO selection prompt | VNC shows progress bar | Enterprise Eval | +| 17 | Windows installs | VNC shows setup screens | Unattend XML | +| 18 | No license key prompt | Setup continues | Enterprise GVLK | +| 19 | Windows boots to desktop | VNC shows desktop | Unattend XML | + +--- + +## WAA Setup Phase + +| Step | What | How to Verify | Automated By | +|------|------|---------------|--------------| +| 20 | FirstLogonCommands run | Log file created | Unattend XML | +| 21 | install.bat finds scripts | No "file not found" in log | Dev mode + samba.sh | +| 22 | setup.ps1 runs | Log shows "Running setup.ps1" | install.bat | +| 23 | Python installs | Log shows Python version | setup.ps1 | +| 24 | Dependencies install | Log shows pip install | setup.ps1 | +| 25 | WAA server starts | Port 5000 responds | on-logon.ps1 | + +--- + +## Connectivity Phase + +| Step | What | How to Verify | Automated By | +|------|------|---------------|--------------| +| 26 | Port 5000 forwarded | Container:5000 → VM:5000 | port_forward.sh | +| 27 
| Probe returns 200 | `curl localhost:5000/probe` | WAA server | +| 28 | SSH tunnels work | Localhost ports accessible | ssh_tunnel.py | + +--- + +## Benchmark Phase + +| Step | What | How to Verify | Automated By | +|------|------|---------------|--------------| +| 29 | Task executes | Client sends action | start_client.sh | +| 30 | Screenshot captured | Response includes image | WAA server | +| 31 | Task completes | Result logged | Evaluation | + +--- + +## Current Status + +**Last tested**: (date) +**Result**: (pass/fail) +**Failed at step**: (if applicable) + +--- + +## Approach Summary + +``` + Container Windows VM + ┌──────────────────────────────────────────────────────┐ + │ │ + Build time: │ /waa-setup/ ──────────────────────────────────────│ + │ (staging) │ + │ │ + Runtime: │ samba.sh ──────┐ │ + │ copies to │ │ + │ ▼ │ + │ /tmp/smb/ ◄──────┬───────────► \\host.lan\Data │ + │ (Samba) │ │ + │ │ │ + │ port 5000 ◄───────┼───────────► WAA Server :5000 │ + │ (forwarder) │ │ + └─────────────────────┴────────────────────────────────┘ +``` + +**Key insight**: Use vanilla WAA's "dev mode" which expects files at `\\host.lan\Data` (Samba share). No script patching required except IP address fix for modern dockurr/windows. diff --git a/docs/waa_definitive_approach.md b/docs/waa_definitive_approach.md new file mode 100644 index 0000000..086bb93 --- /dev/null +++ b/docs/waa_definitive_approach.md @@ -0,0 +1,267 @@ +# WAA Definitive Approach - Stop the Flip-Flopping + +**Date**: January 25, 2026 +**Status**: DESIGN DOCUMENT - READ BEFORE ANY MORE CHANGES + +--- + +## The Problem: We Keep Flip-Flopping + +We've been switching between approaches without fully committing to or testing any of them: + +| Date | Commit | Change | Result | +|------|--------|--------|--------| +| Jan 9 | 1689ab4 | Created custom `waa-auto` Dockerfile | Worked initially | +| Jan 21 | 8fe7e6f | Switched to vanilla WAA build | Claimed to work | +| Jan 21 | 6b9f744 | Fixed unattended installation | Claimed to work | +| Jan 24 | 9a37bb4 | Merged "Vanilla WAA bootstrap" PR | NOT VERIFIED | +| Jan 25 | Today | Back to custom Dockerfile | Failed - disk space | +| Jan 25 | Today | Pull official image | Failed - uses old dockurr | + +**Every switch was made without fully verifying the previous approach worked.** + +--- + +## Root Cause Analysis + +### Why Custom Dockerfile Fails + +Our `waa_deploy/Dockerfile` uses: +```dockerfile +COPY --from=windowsarena/winarena:latest /entry.sh /entry.sh +COPY --from=windowsarena/winarena:latest /client /client +COPY --from=windowsarena/winarena:latest /oem /oem +``` + +This requires `windowsarena/winarena:latest` to exist. Options: +1. **Pull from Docker Hub**: Image uses OLD dockurr/windows v0.00 (doesn't auto-download Windows) +2. **Build locally**: Needs 50GB+ disk space (winarena-base is 15GB) + +**Circular dependency**: We need the image to build the image. + +### Why Vanilla Build Fails + +The 3-step vanilla build: +```bash +docker build -t windowsarena/windows-local docker/windows-local/ +docker build -t windowsarena/winarena-base ... # 15GB +docker build -t windowsarena/winarena ... 
+``` + +Fails because: +- winarena-base downloads 15GB of ML models +- Total disk needed: ~50GB +- D4s_v3 temp disk: 32GB +- D4ds_v4 temp disk: 150GB (but we didn't finish testing it) + +### Why Runtime Override Fails + +Even if we pull the official image, `VERSION=11e` may not work because: +- The base image (old dockurr/windows) may have VERSION baked in +- Environment variables may be ignored if the base handles Windows download at build time + +--- + +## The Three Actual Options + +### Option A: Build Vanilla with Big Disk + +**Approach**: Use D4ds_v4 (150GB temp disk), build official 3-step WAA + +**Pros**: +- Uses upstream code exactly +- Any fixes from Microsoft come automatically +- Already partially implemented in `vm setup-waa` + +**Cons**: +- Slow (30+ min build) +- Higher VM cost ($0.38/hr vs $0.19/hr) +- 15GB model download every time + +**Unattended?**: Need to verify vendor submodule uses modern dockurr/windows that auto-downloads + +### Option B: Standalone Custom Build (NO COPY --from) + +**Approach**: Build from scratch without depending on any pre-existing image + +```dockerfile +FROM dockurr/windows:latest + +# Copy files from LOCAL vendor submodule (not from Docker image) +COPY vendor/WindowsAgentArena/src/win-arena-container/entry.sh /entry.sh +COPY vendor/WindowsAgentArena/src/win-arena-container/client /client +COPY vendor/WindowsAgentArena/src/win-arena-container/vm/setup /oem + +# Set VERSION for auto Windows download +ENV VERSION="11e" +``` + +**Pros**: +- No circular dependency +- Small image (no 15GB models baked in) +- Fast build +- Completely controlled + +**Cons**: +- Need to maintain parity with upstream +- Models loaded at runtime (slower first run) +- May miss upstream fixes + +**Unattended?**: Yes, `VERSION=11e` on modern dockurr/windows auto-downloads Windows 11 Enterprise Eval + +### Option C: Pre-built Golden Image + +**Approach**: Build once, snapshot, reuse + +1. Build winarena image once (any method) +2. Run container, let Windows install complete +3. Snapshot the Windows disk (`data.img`) +4. Store in Azure Blob or ship with repo +5. Future runs use pre-built snapshot + +**Pros**: +- Fast startup (Windows already installed) +- No download wait +- Consistent environment + +**Cons**: +- 30GB snapshot to store/transfer +- Maintenance burden when Windows updates needed +- May violate Windows licensing for redistribution + +**Unattended?**: Yes (after initial setup) + +--- + +## Recommendation: Option B (Standalone Custom Build) + +**Why**: It's the only approach that: +1. Doesn't require pulling 15GB image +2. Doesn't require 50GB disk space +3. Uses modern dockurr/windows for auto-download +4. Is fully unattended + +### Implementation Plan + +1. **Create new Dockerfile** that copies from LOCAL vendor submodule +2. **Test on D4s_v3** (32GB is enough without winarena-base) +3. **Verify unattended**: Windows downloads, installs, WAA server starts +4. **Add to CLI**: New command or modify existing +5. **Document**: Update CLAUDE.md with ONE way to do things +6. **Remove alternatives**: Delete Dockerfile.simplified, Dockerfile.backup, etc. 
+ +### Required Files from Vendor Submodule + +``` +vendor/WindowsAgentArena/ +├── src/win-arena-container/ +│ ├── entry.sh → /entry.sh +│ ├── entry_setup.sh → /entry_setup.sh +│ ├── start_client.sh → /start_client.sh +│ ├── client/ → /client/ +│ └── vm/ +│ └── setup/ → /oem/ +│ ├── install.bat +│ ├── setup.ps1 +│ └── on-logon.ps1 +``` + +### Key Configuration + +```dockerfile +# Modern base that auto-downloads Windows +FROM dockurr/windows:latest + +# Environment for unattended setup +ENV VERSION="11e" # Enterprise Eval (no license key) +ENV RAM_SIZE="8G" +ENV DISK_SIZE="30G" +ENV XRES="1440" +ENV YRES="900" +``` + +--- + +## What NOT To Do + +1. **Don't switch approaches again** without fully testing current one +2. **Don't use COPY --from** windowsarena/winarena (circular dependency) +3. **Don't assume "it works"** without end-to-end verification: + - Windows boots automatically (no manual ISO) + - No license key prompt + - WAA server starts (port 5000 responds) + - Benchmark task completes +4. **Don't run one-off SSH commands** - use CLI +5. **Don't have multiple Dockerfiles** - ONE definitive approach + +--- + +## Verification Checklist + +Before declaring "it works", verify ALL of these: + +- [ ] VM created successfully +- [ ] Docker installed with /mnt storage +- [ ] Docker image built successfully +- [ ] Container started without manual intervention +- [ ] Windows ISO downloaded automatically (check VNC) +- [ ] Windows installed without product key prompt +- [ ] Windows booted to desktop +- [ ] WAA server started (curl localhost:5000/probe returns 200) +- [ ] SSH tunnel to 8006 shows Windows desktop +- [ ] SSH tunnel to 5000 forwards to WAA server +- [ ] At least ONE benchmark task completes successfully + +--- + +## CLI Commands (Final State) + +```bash +# Create VM (use D4ds_v4 if need big disk, D4s_v3 if using Option B) +uv run python -m openadapt_ml.benchmarks.cli vm create --size Standard_D4s_v3 + +# Setup everything (Docker, image build, container start) +uv run python -m openadapt_ml.benchmarks.cli vm setup-waa + +# Monitor (dashboard with VNC) +uv run python -m openadapt_ml.benchmarks.cli vm dashboard + +# Run benchmark +uv run python -m openadapt_ml.benchmarks.cli vm run-waa --num-tasks 5 + +# Clean up +uv run python -m openadapt_ml.benchmarks.cli vm delete -y +``` + +--- + +## Action Items + +1. [ ] Review this document with user +2. [ ] Agree on Option B (standalone custom build) +3. [ ] Create new Dockerfile.standalone (copies from vendor, no COPY --from) +4. [ ] Update cli.py to use new Dockerfile +5. [ ] Delete old Dockerfiles (simplified, backup, complex) +6. [ ] Test end-to-end with verification checklist +7. [ ] Update CLAUDE.md with final approach +8. 
[ ] Commit and document in BEADS.md + +--- + +## Files to Clean Up After Decision + +``` +DELETE: +- openadapt_ml/benchmarks/waa_deploy/Dockerfile.simplified +- openadapt_ml/benchmarks/waa_deploy/Dockerfile.backup +- openadapt_ml/benchmarks/waa_deploy/Dockerfile.complex + +KEEP (after rewrite): +- openadapt_ml/benchmarks/waa_deploy/Dockerfile (the ONE true Dockerfile) +- openadapt_ml/benchmarks/waa_deploy/start_waa_server.bat +- openadapt_ml/benchmarks/waa_deploy/api_agent.py +``` + +--- + +*This document must be reviewed and approved before any more WAA changes are made.* diff --git a/docs/workflows/sync-package-docs.yml b/docs/workflows/sync-package-docs.yml new file mode 100644 index 0000000..27d1336 --- /dev/null +++ b/docs/workflows/sync-package-docs.yml @@ -0,0 +1,113 @@ +# This workflow should be placed in the main OpenAdapt repo at: +# .github/workflows/sync-package-docs.yml +# +# It receives repository_dispatch events from sub-packages and syncs their +# documentation to the main docs site. + +name: Sync Package Docs + +on: + # Triggered by sub-packages when they push to main + repository_dispatch: + types: [sync-package-docs] + + # Can also be triggered manually + workflow_dispatch: + inputs: + package: + description: 'Package to sync (leave empty for all)' + required: false + default: '' + + # Also run on a schedule to catch any missed updates + schedule: + - cron: '0 6 * * *' # Daily at 6 AM UTC + +permissions: + contents: write + pull-requests: write + +jobs: + sync: + runs-on: ubuntu-latest + steps: + - name: Checkout + uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.12' + + - name: Sync package documentation + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + # Determine which package to sync (if specified) + PACKAGE="" + if [ "${{ github.event_name }}" = "repository_dispatch" ]; then + PACKAGE="${{ github.event.client_payload.package }}" + echo "Triggered by package: $PACKAGE" + elif [ "${{ github.event_name }}" = "workflow_dispatch" ]; then + PACKAGE="${{ github.event.inputs.package }}" + echo "Manual trigger for package: $PACKAGE" + else + echo "Scheduled run: syncing all packages" + fi + + # Run the sync script + if [ -n "$PACKAGE" ]; then + python docs/_scripts/sync_package_docs.py --package "$PACKAGE" + else + python docs/_scripts/sync_package_docs.py + fi + + - name: Check for changes + id: changes + run: | + if git diff --quiet docs/packages/; then + echo "changes=false" >> $GITHUB_OUTPUT + else + echo "changes=true" >> $GITHUB_OUTPUT + echo "Changed files:" + git diff --name-only docs/packages/ + fi + + - name: Create Pull Request + if: steps.changes.outputs.changes == 'true' + uses: peter-evans/create-pull-request@v6 + with: + token: ${{ secrets.GITHUB_TOKEN }} + commit-message: 'docs: sync package documentation' + title: 'docs: sync package documentation' + body: | + This PR was automatically created by the docs sync workflow. + + **Trigger**: ${{ github.event_name }} + ${{ github.event_name == 'repository_dispatch' && format('**Package**: {0}', github.event.client_payload.package) || '' }} + + **Changes**: + - Updated package documentation from sub-repository READMEs + + --- + *This is an automated PR. 
Review the changes and merge if they look correct.* + branch: docs/sync-package-docs + delete-branch: true + labels: | + documentation + automated + + - name: Auto-merge if no conflicts + if: steps.changes.outputs.changes == 'true' + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + # Wait for PR to be created + sleep 5 + + # Try to enable auto-merge + PR_NUMBER=$(gh pr list --head docs/sync-package-docs --json number -q '.[0].number') + if [ -n "$PR_NUMBER" ]; then + echo "Enabling auto-merge for PR #$PR_NUMBER" + gh pr merge "$PR_NUMBER" --auto --squash || echo "Could not enable auto-merge" + fi diff --git a/openadapt_ml/baselines/__init__.py b/openadapt_ml/baselines/__init__.py index 467a8dd..d5e5520 100644 --- a/openadapt_ml/baselines/__init__.py +++ b/openadapt_ml/baselines/__init__.py @@ -7,6 +7,12 @@ - Track B: ReAct-style reasoning with coordinates - Track C: Set-of-Mark element selection +Based on SOTA patterns from: +- Claude Computer Use (Anthropic) +- Microsoft UFO/UFO2 +- OSWorld benchmark +- Agent-S/Agent-S2 (Simular AI) + Usage: from openadapt_ml.baselines import UnifiedBaselineAdapter, BaselineConfig, TrackConfig @@ -21,35 +27,95 @@ track=TrackConfig.track_c(), ) adapter = UnifiedBaselineAdapter(config) + + # OSWorld-compatible configuration + config = BaselineConfig( + provider="openai", + model="gpt-5.2", + track=TrackConfig.osworld_compatible(), + ) + + # Parse responses directly + from openadapt_ml.baselines import UnifiedResponseParser, ElementRegistry + + parser = UnifiedResponseParser() + action = parser.parse('{"action": "CLICK", "x": 0.5, "y": 0.3}') + + # With element ID to coordinate conversion + registry = ElementRegistry.from_a11y_tree(tree) + parser = UnifiedResponseParser(element_registry=registry) + action = parser.parse_and_resolve('{"action": "CLICK", "element_id": 17}') """ from openadapt_ml.baselines.adapter import UnifiedBaselineAdapter from openadapt_ml.baselines.config import ( + # Enums + ActionOutputFormat, + CoordinateSystem, + TrackType, + # Config dataclasses BaselineConfig, ModelSpec, + ReActConfig, + ScreenConfig, + SoMConfig, TrackConfig, - TrackType, + # Registry MODELS, - get_model_spec, + # Helper functions get_default_model, + get_model_spec, +) +from openadapt_ml.baselines.parser import ( + ElementRegistry, + ParsedAction, + UIElement, + UnifiedResponseParser, +) +from openadapt_ml.baselines.prompts import ( + # System prompts + FORMAT_PROMPTS, + SYSTEM_PROMPT_OSWORLD, + SYSTEM_PROMPT_TRACK_A, + SYSTEM_PROMPT_TRACK_B, + SYSTEM_PROMPT_TRACK_C, + SYSTEM_PROMPT_UFO, + SYSTEM_PROMPTS, + # Builder class + PromptBuilder, ) -from openadapt_ml.baselines.parser import ParsedAction, UnifiedResponseParser -from openadapt_ml.baselines.prompts import PromptBuilder __all__ = [ # Main adapter "UnifiedBaselineAdapter", - # Configuration - "BaselineConfig", - "TrackConfig", + # Configuration - Enums + "ActionOutputFormat", + "CoordinateSystem", "TrackType", + # Configuration - Dataclasses + "BaselineConfig", "ModelSpec", + "ReActConfig", + "ScreenConfig", + "SoMConfig", + "TrackConfig", + # Configuration - Registry "MODELS", - "get_model_spec", + # Configuration - Functions "get_default_model", + "get_model_spec", # Parsing + "ElementRegistry", "ParsedAction", + "UIElement", "UnifiedResponseParser", # Prompts + "FORMAT_PROMPTS", "PromptBuilder", + "SYSTEM_PROMPT_OSWORLD", + "SYSTEM_PROMPT_TRACK_A", + "SYSTEM_PROMPT_TRACK_B", + "SYSTEM_PROMPT_TRACK_C", + "SYSTEM_PROMPT_UFO", + "SYSTEM_PROMPTS", ] diff --git a/openadapt_ml/baselines/config.py 
b/openadapt_ml/baselines/config.py index cbcab37..5950a33 100644 --- a/openadapt_ml/baselines/config.py +++ b/openadapt_ml/baselines/config.py @@ -1,6 +1,11 @@ """Configuration for baseline adapters. Defines track types, model registry, and configuration dataclasses. +Based on SOTA patterns from: +- Claude Computer Use API +- Microsoft UFO/UFO2 +- OSWorld benchmark +- Agent-S/Agent-S2 """ from __future__ import annotations @@ -23,6 +28,113 @@ class TrackType(str, Enum): TRACK_C = "set_of_mark" +class CoordinateSystem(str, Enum): + """Coordinate system for action output. + + NORMALIZED: Coordinates in 0.0-1.0 range (relative to screen) + PIXEL: Absolute pixel coordinates + PERCENTAGE: Coordinates as percentages (0-100) + """ + + NORMALIZED = "normalized" + PIXEL = "pixel" + PERCENTAGE = "percentage" + + +class ActionOutputFormat(str, Enum): + """Output format style for model responses. + + JSON: Structured JSON object + FUNCTION_CALL: Function-style like CLICK(x, y) + PYAUTOGUI: PyAutoGUI-style Python code (OSWorld compatible) + """ + + JSON = "json" + FUNCTION_CALL = "function_call" + PYAUTOGUI = "pyautogui" + + +@dataclass +class SoMConfig: + """Configuration for Set-of-Mark (SoM) overlay. + + Controls how UI elements are labeled and displayed. + Based on patterns from SoM paper and OMNI-parser. + + Attributes: + overlay_enabled: Whether to draw element overlays on screenshot. + label_format: Format for element labels ("[{id}]", "{id}", "e{id}"). + font_size: Font size for labels in pixels. + label_background_color: RGBA tuple for label background. + label_text_color: RGB tuple for label text. + max_elements: Maximum elements to include (0=unlimited). + include_roles: Element roles to include (None=all). + exclude_roles: Element roles to exclude. + min_element_area: Minimum element area in pixels to include. + include_invisible: Whether to include non-visible elements. + """ + + overlay_enabled: bool = True + label_format: str = "[{id}]" # "[1]", "1", "e1" + font_size: int = 12 + label_background_color: tuple[int, int, int, int] = (0, 120, 255, 200) # Blue + label_text_color: tuple[int, int, int] = (255, 255, 255) # White + max_elements: int = 100 + include_roles: list[str] | None = None # None = include all + exclude_roles: list[str] = field( + default_factory=lambda: ["group", "generic", "static_text", "separator"] + ) + min_element_area: int = 100 # Minimum bbox area in pixels + include_invisible: bool = False + + +@dataclass +class ReActConfig: + """Configuration for ReAct-style reasoning. + + Controls the observation-thought-action cycle used in Track B. + Based on ReAct paper and UFO's Observation->Thought->Action pattern. + + Attributes: + require_observation: Whether to require explicit observation. + require_thought: Whether to require reasoning explanation. + require_plan: Whether to require multi-step plan. + max_plan_steps: Maximum steps in plan output. + thinking_budget: Token budget for thinking (Claude extended thinking). + """ + + require_observation: bool = True + require_thought: bool = True + require_plan: bool = False + max_plan_steps: int = 5 + thinking_budget: int | None = None # For Claude extended thinking + + +@dataclass +class ScreenConfig: + """Screen/display configuration for coordinate handling. + + Attributes: + width: Display width in pixels. + height: Display height in pixels. + coordinate_system: How coordinates are represented. + scale_factor: DPI scale factor (1.0 = standard, 2.0 = retina). 
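+
+    Example (illustrative, using the defaults below)::
+
+        cfg = ScreenConfig()               # 1920x1080, normalized coords
+        cfg.normalize_coords(960, 540)     # -> (0.5, 0.5)
+        cfg.denormalize_coords(0.5, 0.5)   # -> (960, 540)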
+ """ + + width: int = 1920 + height: int = 1080 + coordinate_system: CoordinateSystem = CoordinateSystem.NORMALIZED + scale_factor: float = 1.0 + + def normalize_coords(self, x: float, y: float) -> tuple[float, float]: + """Convert pixel coordinates to normalized (0-1).""" + return (x / self.width, y / self.height) + + def denormalize_coords(self, x: float, y: float) -> tuple[int, int]: + """Convert normalized coordinates to pixels.""" + return (int(x * self.width), int(y * self.height)) + + @dataclass class TrackConfig: """Configuration for a specific evaluation track. @@ -30,54 +142,125 @@ class TrackConfig: Attributes: track_type: The track type (A, B, or C). output_format: Expected output format string. + action_format: Style of action output (JSON, function, pyautogui). use_som: Whether to use Set-of-Mark overlay. + som_config: Configuration for SoM (Track C). use_a11y_tree: Whether to include accessibility tree. max_a11y_elements: Max elements in a11y tree (truncation). include_reasoning: Whether to request reasoning steps. + react_config: Configuration for ReAct (Track B). include_history: Whether to include action history. max_history_steps: Max history steps to include. + screen_config: Screen/coordinate configuration. + verify_after_action: Request screenshot verification after actions. """ track_type: TrackType output_format: str + action_format: ActionOutputFormat = ActionOutputFormat.JSON use_som: bool = False + som_config: SoMConfig | None = None use_a11y_tree: bool = True max_a11y_elements: int = 50 include_reasoning: bool = False + react_config: ReActConfig | None = None include_history: bool = True max_history_steps: int = 5 + screen_config: ScreenConfig = field(default_factory=ScreenConfig) + verify_after_action: bool = False # Claude computer use best practice @classmethod - def track_a(cls) -> "TrackConfig": - """Create Track A (Direct Coordinates) config.""" + def track_a(cls, **kwargs: Any) -> "TrackConfig": + """Create Track A (Direct Coordinates) config. + + Simplest track: screenshot + goal -> coordinates. + No reasoning or element IDs. + """ return cls( track_type=TrackType.TRACK_A, output_format='{"action": "CLICK", "x": float, "y": float}', + action_format=ActionOutputFormat.JSON, use_som=False, use_a11y_tree=True, include_reasoning=False, + **kwargs, ) @classmethod - def track_b(cls) -> "TrackConfig": - """Create Track B (ReAct with Coordinates) config.""" + def track_b(cls, **kwargs: Any) -> "TrackConfig": + """Create Track B (ReAct with Coordinates) config. + + Includes observation->thought->action cycle. + Based on ReAct, UFO, and Claude thinking patterns. + """ + react_config = kwargs.pop("react_config", None) or ReActConfig() return cls( track_type=TrackType.TRACK_B, - output_format='{"thought": str, "action": "CLICK", "x": float, "y": float}', + output_format='{"observation": str, "thought": str, "action": "CLICK", "x": float, "y": float}', + action_format=ActionOutputFormat.JSON, use_som=False, use_a11y_tree=True, include_reasoning=True, + react_config=react_config, + **kwargs, ) @classmethod - def track_c(cls) -> "TrackConfig": - """Create Track C (Set-of-Mark) config.""" + def track_c(cls, **kwargs: Any) -> "TrackConfig": + """Create Track C (Set-of-Mark) config. + + Uses numbered element labels instead of coordinates. + Based on SoM paper and OMNI-parser patterns. 
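+
+        Example (illustrative; extra kwargs pass through to TrackConfig)::
+
+            config = TrackConfig.track_c(max_a11y_elements=100)
+            # use_som is True and som_config is populated by default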
+ """ + som_config = kwargs.pop("som_config", None) or SoMConfig() return cls( track_type=TrackType.TRACK_C, output_format='{"action": "CLICK", "element_id": int}', + action_format=ActionOutputFormat.JSON, use_som=True, + som_config=som_config, use_a11y_tree=True, include_reasoning=False, + **kwargs, + ) + + @classmethod + def osworld_compatible(cls, **kwargs: Any) -> "TrackConfig": + """Create OSWorld-compatible config. + + Uses PyAutoGUI-style action format for OSWorld benchmark. + """ + return cls( + track_type=TrackType.TRACK_A, + output_format="pyautogui.click(x, y)", + action_format=ActionOutputFormat.PYAUTOGUI, + use_som=False, + use_a11y_tree=True, + include_reasoning=False, + **kwargs, + ) + + @classmethod + def ufo_compatible(cls, **kwargs: Any) -> "TrackConfig": + """Create UFO-compatible config. + + Uses UFO's AppAgent output format with observation/thought/plan. + """ + react_config = kwargs.pop("react_config", None) or ReActConfig( + require_observation=True, + require_thought=True, + require_plan=True, + ) + return cls( + track_type=TrackType.TRACK_B, + output_format='{"Observation": str, "Thought": str, "ControlLabel": int, "Function": str, "Args": list}', + action_format=ActionOutputFormat.JSON, + use_som=True, + som_config=SoMConfig(), + use_a11y_tree=True, + include_reasoning=True, + react_config=react_config, + **kwargs, ) diff --git a/openadapt_ml/baselines/parser.py b/openadapt_ml/baselines/parser.py index c73ca95..0139070 100644 --- a/openadapt_ml/baselines/parser.py +++ b/openadapt_ml/baselines/parser.py @@ -1,14 +1,167 @@ """Response parsing for baseline adapters. -Extracts structured actions from VLM responses. +Extracts structured actions from VLM responses with support for: +- JSON format extraction +- Function-call syntax (CLICK(x, y)) +- PyAutoGUI format (OSWorld compatible) +- UFO format (Observation/Thought/ControlLabel) +- Element ID to coordinate normalization +- Robust fallback parsing + +Based on patterns from: +- Claude Computer Use +- OSWorld benchmark +- Microsoft UFO +- Agent-S """ from __future__ import annotations import json +import logging import re from dataclasses import dataclass, field -from typing import Any +from typing import TYPE_CHECKING, Any + +if TYPE_CHECKING: + from openadapt_ml.baselines.config import ScreenConfig + +logger = logging.getLogger(__name__) + + +@dataclass +class UIElement: + """UI element with bounding box for coordinate conversion. + + Used to convert element_id actions to coordinate actions. + """ + + element_id: int | str + role: str = "" + name: str = "" + bbox: tuple[float, float, float, float] | None = None # (x1, y1, x2, y2) + + @property + def center(self) -> tuple[float, float] | None: + """Get center point of element.""" + if self.bbox is None: + return None + x1, y1, x2, y2 = self.bbox + return ((x1 + x2) / 2, (y1 + y2) / 2) + + +@dataclass +class ElementRegistry: + """Registry of UI elements for element_id to coordinate conversion. + + Used by the parser to convert Track C (SoM) element IDs to + Track A coordinates when needed. 
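+
+    Example (illustrative, with a normalized bbox)::
+
+        registry = ElementRegistry()
+        registry.add_element(17, bbox=(0.4, 0.2, 0.6, 0.3))
+        registry.get_center_coords(17)  # -> (0.5, 0.25)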
+ """ + + elements: dict[int, UIElement] = field(default_factory=dict) + screen_width: int = 1920 + screen_height: int = 1080 + is_normalized: bool = True # Whether bbox coordinates are normalized (0-1) + + def add_element( + self, + element_id: int | str, + bbox: tuple[float, float, float, float], + role: str = "", + name: str = "", + ) -> None: + """Add an element to the registry.""" + eid = int(element_id) if isinstance(element_id, str) else element_id + self.elements[eid] = UIElement( + element_id=eid, role=role, name=name, bbox=bbox + ) + + def get_element(self, element_id: int) -> UIElement | None: + """Get element by ID.""" + return self.elements.get(element_id) + + def get_center_coords( + self, element_id: int, normalize: bool = True + ) -> tuple[float, float] | None: + """Get center coordinates for an element. + + Args: + element_id: Element ID to look up. + normalize: Whether to return normalized (0-1) coordinates. + + Returns: + (x, y) center coordinates, or None if element not found. + """ + element = self.get_element(element_id) + if element is None or element.bbox is None: + return None + + center = element.center + if center is None: + return None + + x, y = center + + # Handle normalization + if self.is_normalized and not normalize: + # Convert from normalized to pixels + x = x * self.screen_width + y = y * self.screen_height + elif not self.is_normalized and normalize: + # Convert from pixels to normalized + x = x / self.screen_width + y = y / self.screen_height + + return (x, y) + + @classmethod + def from_a11y_tree( + cls, + tree: dict[str, Any] | list[dict[str, Any]], + screen_width: int = 1920, + screen_height: int = 1080, + ) -> "ElementRegistry": + """Build registry from accessibility tree. + + Args: + tree: Accessibility tree as dict or list of element dicts. + screen_width: Screen width for coordinate conversion. + screen_height: Screen height for coordinate conversion. + + Returns: + ElementRegistry with all elements from tree. + """ + registry = cls(screen_width=screen_width, screen_height=screen_height) + + def process_node(node: dict[str, Any]) -> None: + node_id = node.get("id", node.get("node_id", node.get("element_id"))) + if node_id is not None: + try: + eid = int(str(node_id).replace("e", "").replace("[", "").replace("]", "")) + bbox = node.get("bbox", node.get("bounds")) + if bbox and len(bbox) >= 4: + registry.add_element( + element_id=eid, + bbox=tuple(bbox[:4]), + role=node.get("role", ""), + name=node.get("name", ""), + ) + except (ValueError, TypeError): + pass + + # Process children + for child in node.get("children", []): + if isinstance(child, dict): + process_node(child) + + if isinstance(tree, dict): + process_node(tree) + elif isinstance(tree, list): + for node in tree: + if isinstance(node, dict): + process_node(node) + + return registry @dataclass @@ -16,16 +169,22 @@ class ParsedAction: """Parsed action from model response. Attributes: - action_type: Action type (click, type, key, scroll, done, unknown). + action_type: Action type (click, type, key, scroll, done, wait, fail, unknown). x: X coordinate (normalized 0-1) for coordinate actions. y: Y coordinate (normalized 0-1) for coordinate actions. element_id: Element ID for SoM actions. text: Text content for type actions. key: Key name for key actions. + modifiers: Key modifiers (ctrl, shift, alt) for key/hotkey actions. direction: Scroll direction for scroll actions. - thought: Reasoning text (for ReAct track). + amount: Scroll amount for scroll actions. 
+ observation: Observed state description (ReAct/UFO format). + thought: Reasoning text (ReAct/UFO format). + plan: Multi-step plan (UFO format). + status: Execution status (UFO format: CONTINUE, FINISH, ERROR). raw_response: Original model response. parse_error: Error message if parsing failed. + confidence: Parser confidence score (0-1). metadata: Additional parsed data. """ @@ -35,10 +194,16 @@ class ParsedAction: element_id: int | None = None text: str | None = None key: str | None = None + modifiers: list[str] | None = None direction: str | None = None + amount: int | None = None + observation: str | None = None thought: str | None = None + plan: list[str] | None = None + status: str | None = None raw_response: str | None = None parse_error: str | None = None + confidence: float = 1.0 metadata: dict[str, Any] = field(default_factory=dict) @property @@ -46,6 +211,21 @@ def is_valid(self) -> bool: """Check if the action was successfully parsed.""" return self.parse_error is None and self.action_type != "unknown" + @property + def is_terminal(self) -> bool: + """Check if this action terminates the episode.""" + return self.action_type in ("done", "fail") + + @property + def has_coordinates(self) -> bool: + """Check if action has coordinate data.""" + return self.x is not None and self.y is not None + + @property + def has_element_id(self) -> bool: + """Check if action has element ID.""" + return self.element_id is not None + def to_dict(self) -> dict[str, Any]: """Convert to action dictionary for benchmark integration.""" result: dict[str, Any] = {"type": self.action_type} @@ -60,35 +240,158 @@ def to_dict(self) -> dict[str, Any]: result["text"] = self.text if self.key is not None: result["key"] = self.key + if self.modifiers: + result["modifiers"] = self.modifiers if self.direction is not None: result["direction"] = self.direction + if self.amount is not None: + result["amount"] = self.amount + if self.observation is not None: + result["observation"] = self.observation if self.thought is not None: result["thought"] = self.thought + if self.plan: + result["plan"] = self.plan + if self.status is not None: + result["status"] = self.status return result + def to_pyautogui( + self, + screen_width: int = 1920, + screen_height: int = 1080, + ) -> str: + """Convert to PyAutoGUI code string. + + Args: + screen_width: Screen width for coordinate conversion. + screen_height: Screen height for coordinate conversion. + + Returns: + PyAutoGUI code string. 
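+
+        Example (illustrative, default 1920x1080 screen)::
+
+            action = ParsedAction(action_type="click", x=0.5, y=0.5)
+            action.to_pyautogui()  # -> "pyautogui.click(960, 540)"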
+ """ + if self.action_type == "click": + if self.x is not None and self.y is not None: + px = int(self.x * screen_width) + py = int(self.y * screen_height) + return f"pyautogui.click({px}, {py})" + elif self.element_id is not None: + return f"# CLICK element {self.element_id} (needs coordinate conversion)" + elif self.action_type == "type": + text = self.text or "" + return f"pyautogui.write('{text}')" + elif self.action_type == "key": + key = self.key or "" + if self.modifiers: + keys = ", ".join([f"'{k}'" for k in self.modifiers + [key]]) + return f"pyautogui.hotkey({keys})" + return f"pyautogui.press('{key}')" + elif self.action_type == "scroll": + direction = self.direction or "down" + amount = self.amount or 3 + clicks = -amount if direction == "down" else amount + return f"pyautogui.scroll({clicks})" + elif self.action_type == "done": + return "DONE" + elif self.action_type == "wait": + return "WAIT" + elif self.action_type == "fail": + return "FAIL" + + return f"# Unknown action: {self.action_type}" + + def with_coordinates( + self, + x: float, + y: float, + source: str = "conversion", + ) -> "ParsedAction": + """Create a copy with coordinates added. + + Useful for converting element_id actions to coordinate actions. + + Args: + x: X coordinate (normalized 0-1). + y: Y coordinate (normalized 0-1). + source: Source of coordinates for metadata. + + Returns: + New ParsedAction with coordinates. + """ + return ParsedAction( + action_type=self.action_type, + x=x, + y=y, + element_id=self.element_id, + text=self.text, + key=self.key, + modifiers=self.modifiers, + direction=self.direction, + amount=self.amount, + observation=self.observation, + thought=self.thought, + plan=self.plan, + status=self.status, + raw_response=self.raw_response, + parse_error=self.parse_error, + confidence=self.confidence, + metadata={**self.metadata, "coord_source": source}, + ) + class UnifiedResponseParser: - """Parser for VLM responses across all tracks. + """Parser for VLM responses across all tracks and formats. Supports: - JSON format: {"action": "CLICK", "x": 0.5, "y": 0.3} - Function format: CLICK(0.5, 0.3) or CLICK([17]) - - Mixed format: Thought + action + - PyAutoGUI format: pyautogui.click(960, 540) + - UFO format: {"Observation": ..., "Thought": ..., "ControlLabel": 17} + - Mixed format: ReAct-style with thought + action Example: parser = UnifiedResponseParser() - action = parser.parse("{"action": "CLICK", "x": 0.5, "y": 0.3}") + action = parser.parse('{"action": "CLICK", "x": 0.5, "y": 0.3}') print(action.x, action.y) # 0.5, 0.3 + + # With element registry for ID->coordinate conversion + registry = ElementRegistry.from_a11y_tree(tree) + parser = UnifiedResponseParser(element_registry=registry) + action = parser.parse('{"action": "CLICK", "element_id": 17}') + action = parser.resolve_element_id(action) + print(action.x, action.y) # Converted coordinates """ + def __init__( + self, + element_registry: ElementRegistry | None = None, + screen_config: "ScreenConfig | None" = None, + normalize_coordinates: bool = True, + ): + """Initialize parser. + + Args: + element_registry: Optional registry for element_id conversion. + screen_config: Optional screen configuration for coordinate handling. + normalize_coordinates: Whether to normalize coordinates to 0-1. 
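+
+        Note:
+            With normalize_coordinates=True (the default), values that look
+            like pixels (x or y greater than 1.5) are rescaled by the screen
+            size and clamped, so parsed coordinates are always in 0-1.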
+ """ + self.element_registry = element_registry + self.screen_config = screen_config + self.normalize_coordinates = normalize_coordinates + + # Default screen dimensions + self._screen_width = screen_config.width if screen_config else 1920 + self._screen_height = screen_config.height if screen_config else 1080 + def parse(self, response: str) -> ParsedAction: """Parse model response into structured action. - Tries multiple parsing strategies: - 1. JSON extraction - 2. Regex patterns for function-style actions - 3. Fallback text patterns + Tries multiple parsing strategies in order: + 1. JSON extraction (most reliable) + 2. PyAutoGUI code patterns + 3. Function-style patterns (CLICK, TYPE, etc.) + 4. Special keywords (DONE, WAIT, FAIL) Args: response: Raw model response string. @@ -96,123 +399,339 @@ def parse(self, response: str) -> ParsedAction: Returns: ParsedAction with extracted fields. """ + if not response: + return ParsedAction( + action_type="unknown", + raw_response=response, + parse_error="Empty response", + ) + response = response.strip() - # Try JSON first + # Try JSON first (most structured) action = self._try_json_parse(response) if action.is_valid: action.raw_response = response return action - # Try regex patterns + # Try PyAutoGUI format + action = self._try_pyautogui_parse(response) + if action.is_valid: + action.raw_response = response + return action + + # Try function-call patterns action = self._try_regex_parse(response) if action.is_valid: action.raw_response = response return action + # Try special keywords + action = self._try_keyword_parse(response) + if action.is_valid: + action.raw_response = response + return action + # Return unknown action with error return ParsedAction( action_type="unknown", raw_response=response, parse_error="No action pattern found in response", + confidence=0.0, ) def _try_json_parse(self, response: str) -> ParsedAction: """Try to extract and parse JSON from response.""" - # Find JSON in response - json_match = re.search(r'\{[^{}]*\}', response) - if not json_match: - return ParsedAction(action_type="unknown", parse_error="No JSON found") - - try: - data = json.loads(json_match.group()) - except json.JSONDecodeError as e: - return ParsedAction(action_type="unknown", parse_error=f"JSON parse error: {e}") - - return self._dict_to_action(data) + # Try to find JSON object in response + json_patterns = [ + r'```json\s*(\{[^`]*\})\s*```', # Markdown code block + r'```\s*(\{[^`]*\})\s*```', # Plain code block + r'(\{[^{}]*\})', # Simple JSON object + r'(\{[^{}]*\{[^{}]*\}[^{}]*\})', # Nested JSON (max 1 level) + ] + + for pattern in json_patterns: + matches = re.findall(pattern, response, re.DOTALL) + for match in matches: + try: + data = json.loads(match) + action = self._dict_to_action(data) + if action.is_valid: + return action + except json.JSONDecodeError: + continue + + return ParsedAction(action_type="unknown", parse_error="No valid JSON found") def _dict_to_action(self, data: dict[str, Any]) -> ParsedAction: - """Convert parsed dict to ParsedAction.""" - action_type = data.get("action", data.get("type", "")).lower() - thought = data.get("thought") + """Convert parsed dict to ParsedAction. 
+ + Handles multiple formats: + - Standard: {"action": "CLICK", "x": 0.5, "y": 0.3} + - UFO: {"Observation": ..., "Thought": ..., "ControlLabel": 17} + - ReAct: {"observation": ..., "thought": ..., "action": "CLICK"} + """ + # Extract ReAct/UFO fields first + observation = data.get("observation", data.get("Observation")) + thought = data.get("thought", data.get("Thought")) + plan = data.get("plan", data.get("Plan")) + status = data.get("status", data.get("Status")) + + # Get action type (handle various key names) + action_type = ( + data.get("action", "") + or data.get("type", "") + or data.get("Function", "") # UFO format + ).lower() + + # Handle UFO ControlLabel as element click + control_label = data.get("ControlLabel", data.get("control_label")) + if control_label is not None and not action_type: + action_type = "click" if action_type == "click": - # Check for element_id (SoM) vs coordinates - if "element_id" in data: - element_id = data["element_id"] - if isinstance(element_id, str): - # Extract number from "e17" or "[17]" format - match = re.search(r'\d+', element_id) - element_id = int(match.group()) if match else None + # Check for element_id first (SoM/UFO) + element_id = data.get("element_id", data.get("ControlLabel")) + if element_id is not None: return ParsedAction( action_type="click", - element_id=element_id, + element_id=self._normalize_element_id(element_id), + observation=observation, thought=thought, + plan=plan, + status=status, ) - elif "x" in data and "y" in data: + + # Then check for coordinates + if "x" in data and "y" in data: + x, y = self._normalize_coords(float(data["x"]), float(data["y"])) return ParsedAction( action_type="click", - x=float(data["x"]), - y=float(data["y"]), + x=x, + y=y, + observation=observation, thought=thought, + plan=plan, + status=status, ) - else: - return ParsedAction( - action_type="click", - parse_error="CLICK missing coordinates or element_id", - ) - elif action_type == "type": + # Check coordinate array format + if "coordinate" in data: + coords = data["coordinate"] + if isinstance(coords, (list, tuple)) and len(coords) >= 2: + x, y = self._normalize_coords(float(coords[0]), float(coords[1])) + return ParsedAction( + action_type="click", + x=x, + y=y, + observation=observation, + thought=thought, + ) + + return ParsedAction( + action_type="click", + parse_error="CLICK missing coordinates or element_id", + observation=observation, + thought=thought, + ) + + elif action_type in ("type", "input_text", "write"): + text = data.get("text", "") + # Handle UFO Args format + args = data.get("Args", data.get("args", [])) + if not text and args: + text = args[0] if args else "" return ParsedAction( action_type="type", - text=data.get("text", ""), + text=text, + observation=observation, thought=thought, ) - elif action_type == "key": + elif action_type in ("key", "press", "hotkey"): + key = data.get("key", "") + modifiers = data.get("modifiers", []) + + # Handle UFO Args format for hotkey + args = data.get("Args", data.get("args", [])) + if args and not key: + if len(args) == 1: + key = args[0] + else: + modifiers = args[:-1] + key = args[-1] + return ParsedAction( action_type="key", - key=data.get("key", ""), + key=key, + modifiers=modifiers if modifiers else None, + observation=observation, thought=thought, ) elif action_type == "scroll": + direction = data.get("direction", data.get("scroll_direction", "down")) + amount = data.get("amount", data.get("scroll_amount", 3)) + + # Handle UFO Args format + args = data.get("Args", data.get("args", 
[])) + if args and not direction: + direction = args[0] if args else "down" + return ParsedAction( action_type="scroll", - direction=data.get("direction", "down"), + direction=direction, + amount=amount, + observation=observation, + thought=thought, + ) + + elif action_type in ("done", "finish", "complete"): + return ParsedAction( + action_type="done", + status="FINISH", + observation=observation, + thought=thought, + ) + + elif action_type in ("wait", "pause"): + return ParsedAction( + action_type="wait", + observation=observation, thought=thought, ) - elif action_type == "done": - return ParsedAction(action_type="done", thought=thought) + elif action_type in ("fail", "error", "impossible"): + return ParsedAction( + action_type="fail", + status="ERROR", + observation=observation, + thought=thought, + ) else: return ParsedAction( action_type="unknown", parse_error=f"Unknown action type: {action_type}", + observation=observation, + thought=thought, ) - def _try_regex_parse(self, response: str) -> ParsedAction: - """Try regex patterns for function-style actions.""" - # CLICK(x, y) pattern - click_coords = re.search( - r'CLICK\s*\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)', + def _try_pyautogui_parse(self, response: str) -> ParsedAction: + """Try to parse PyAutoGUI-style code.""" + # pyautogui.click(x, y) + click_match = re.search( + r'pyautogui\.click\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)', + response, + re.IGNORECASE, + ) + if click_match: + x = int(click_match.group(1)) + y = int(click_match.group(2)) + x, y = self._normalize_coords(x, y) + return ParsedAction(action_type="click", x=x, y=y) + + # pyautogui.doubleClick(x, y) + dclick_match = re.search( + r'pyautogui\.doubleClick\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)', + response, + re.IGNORECASE, + ) + if dclick_match: + x = int(dclick_match.group(1)) + y = int(dclick_match.group(2)) + x, y = self._normalize_coords(x, y) + return ParsedAction( + action_type="click", + x=x, + y=y, + metadata={"double_click": True}, + ) + + # pyautogui.write('text') + write_match = re.search( + r'pyautogui\.write\s*\(\s*[\'"](.+?)[\'"]\s*\)', + response, + re.IGNORECASE, + ) + if write_match: + return ParsedAction(action_type="type", text=write_match.group(1)) + + # pyautogui.press('key') + press_match = re.search( + r'pyautogui\.press\s*\(\s*[\'"](.+?)[\'"]\s*\)', + response, + re.IGNORECASE, + ) + if press_match: + return ParsedAction(action_type="key", key=press_match.group(1)) + + # pyautogui.hotkey('key1', 'key2') + hotkey_match = re.search( + r'pyautogui\.hotkey\s*\(\s*(.+?)\s*\)', response, re.IGNORECASE, ) - if click_coords: - try: + if hotkey_match: + keys_str = hotkey_match.group(1) + # Extract keys from quotes + keys = re.findall(r'[\'"]([^\'"]+)[\'"]', keys_str) + if keys: + modifiers = keys[:-1] if len(keys) > 1 else None + key = keys[-1] return ParsedAction( - action_type="click", - x=float(click_coords.group(1)), - y=float(click_coords.group(2)), + action_type="key", + key=key, + modifiers=modifiers, ) - except ValueError: - pass - # CLICK([id]) pattern for SoM + # pyautogui.scroll(amount) + scroll_match = re.search( + r'pyautogui\.scroll\s*\(\s*(-?\d+)\s*\)', + response, + re.IGNORECASE, + ) + if scroll_match: + clicks = int(scroll_match.group(1)) + direction = "up" if clicks > 0 else "down" + return ParsedAction( + action_type="scroll", + direction=direction, + amount=abs(clicks), + ) + + return ParsedAction(action_type="unknown", parse_error="No PyAutoGUI pattern matched") + + def _try_regex_parse(self, response: str) -> ParsedAction: + """Try regex patterns for 
function-style actions.""" + # CLICK(x, y) - normalized coordinates + click_norm = re.search( + r'CLICK\s*\(\s*(0?\.\d+)\s*,\s*(0?\.\d+)\s*\)', + response, + re.IGNORECASE, + ) + if click_norm: + return ParsedAction( + action_type="click", + x=float(click_norm.group(1)), + y=float(click_norm.group(2)), + ) + + # CLICK(x, y) - larger numbers (pixels) + click_pixel = re.search( + r'CLICK\s*\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)', + response, + re.IGNORECASE, + ) + if click_pixel: + x = float(click_pixel.group(1)) + y = float(click_pixel.group(2)) + x, y = self._normalize_coords(x, y) + return ParsedAction(action_type="click", x=x, y=y) + + # CLICK([id]) - element ID click_element = re.search( - r'CLICK\s*\(\s*\[?\s*(\d+)\s*\]?\s*\)', + r'CLICK\s*\(\s*\[\s*(\d+)\s*\]\s*\)', response, re.IGNORECASE, ) @@ -222,59 +741,163 @@ def _try_regex_parse(self, response: str) -> ParsedAction: element_id=int(click_element.group(1)), ) - # TYPE("text") pattern + # CLICK(id) without brackets + click_id = re.search( + r'CLICK\s*\(\s*(\d+)\s*\)', + response, + re.IGNORECASE, + ) + if click_id: + # Check if it's likely an element ID (small number) vs coordinate + val = int(click_id.group(1)) + if val < 1000: # Likely element ID + return ParsedAction(action_type="click", element_id=val) + + # TYPE("text") or TYPE('text') type_match = re.search( r'TYPE\s*\(\s*["\'](.+?)["\']\s*\)', response, re.IGNORECASE, ) if type_match: - return ParsedAction( - action_type="type", - text=type_match.group(1), - ) + return ParsedAction(action_type="type", text=type_match.group(1)) - # KEY(key) pattern + # KEY(key) or KEY(mod+key) key_match = re.search( - r'KEY\s*\(\s*([a-zA-Z_]+)\s*\)', + r'KEY\s*\(\s*([a-zA-Z0-9_+]+)\s*\)', response, re.IGNORECASE, ) if key_match: - return ParsedAction( - action_type="key", - key=key_match.group(1).lower(), - ) - - # SCROLL(direction) pattern + key_str = key_match.group(1).lower() + if '+' in key_str: + parts = key_str.split('+') + modifiers = parts[:-1] + key = parts[-1] + return ParsedAction(action_type="key", key=key, modifiers=modifiers) + return ParsedAction(action_type="key", key=key_str) + + # SCROLL(direction) or SCROLL(direction, amount) scroll_match = re.search( - r'SCROLL\s*\(\s*([a-zA-Z]+)\s*\)', + r'SCROLL\s*\(\s*([a-zA-Z]+)(?:\s*,\s*(\d+))?\s*\)', response, re.IGNORECASE, ) if scroll_match: + direction = scroll_match.group(1).lower() + amount = int(scroll_match.group(2)) if scroll_match.group(2) else 3 + return ParsedAction(action_type="scroll", direction=direction, amount=amount) + + return ParsedAction(action_type="unknown", parse_error="No regex pattern matched") + + def _try_keyword_parse(self, response: str) -> ParsedAction: + """Try special keywords.""" + response_upper = response.upper().strip() + + # DONE() or just DONE + if re.search(r'\bDONE\s*\(\s*\)\s*$', response, re.IGNORECASE) or response_upper == "DONE": + return ParsedAction(action_type="done") + + # WAIT() or WAIT + if re.search(r'\bWAIT\s*\(\s*\)\s*$', response, re.IGNORECASE) or response_upper == "WAIT": + return ParsedAction(action_type="wait") + + # FAIL() or FAIL + if re.search(r'\bFAIL\s*\(\s*\)\s*$', response, re.IGNORECASE) or response_upper == "FAIL": + return ParsedAction(action_type="fail") + + # Look for "task is complete" or similar phrases + if re.search(r'task\s+(?:is\s+)?(?:complete|done|finished)', response, re.IGNORECASE): return ParsedAction( - action_type="scroll", - direction=scroll_match.group(1).lower(), + action_type="done", + confidence=0.7, + metadata={"inferred": True}, ) - # 
DONE() pattern - if re.search(r'DONE\s*\(\s*\)', response, re.IGNORECASE): - return ParsedAction(action_type="done") + return ParsedAction(action_type="unknown", parse_error="No keyword matched") - return ParsedAction(action_type="unknown", parse_error="No regex pattern matched") + def _normalize_coords(self, x: float, y: float) -> tuple[float, float]: + """Normalize coordinates to 0-1 range if needed.""" + if not self.normalize_coordinates: + return (x, y) - def normalize_element_id(self, action: ParsedAction) -> ParsedAction: - """Normalize element_id to integer format. + # If coordinates are large, assume they're pixels + if x > 1.5 or y > 1.5: + x = x / self._screen_width + y = y / self._screen_height + + # Clamp to valid range + x = max(0.0, min(1.0, x)) + y = max(0.0, min(1.0, y)) + + return (x, y) + + def _normalize_element_id(self, element_id: Any) -> int | None: + """Normalize element_id to integer format.""" + if element_id is None: + return None + + if isinstance(element_id, int): + return element_id + + if isinstance(element_id, str): + # Extract number from "e17", "[17]", "element_17" etc. + match = re.search(r'\d+', element_id) + if match: + return int(match.group()) + + try: + return int(element_id) + except (ValueError, TypeError): + return None + + def resolve_element_id( + self, + action: ParsedAction, + registry: ElementRegistry | None = None, + ) -> ParsedAction: + """Convert element_id to coordinates if registry available. Args: - action: Action with possibly string element_id. + action: ParsedAction with element_id. + registry: Element registry (uses self.element_registry if None). Returns: - Action with normalized integer element_id. + ParsedAction with coordinates added if conversion succeeded, + original action otherwise. """ - if action.element_id is not None and isinstance(action.element_id, str): - match = re.search(r'\d+', str(action.element_id)) - if match: - action.element_id = int(match.group()) + if not action.has_element_id or action.has_coordinates: + return action + + reg = registry or self.element_registry + if reg is None: + return action + + coords = reg.get_center_coords(action.element_id, normalize=True) + if coords is not None: + return action.with_coordinates( + x=coords[0], + y=coords[1], + source=f"element_{action.element_id}", + ) + return action + + def parse_and_resolve( + self, + response: str, + registry: ElementRegistry | None = None, + ) -> ParsedAction: + """Parse response and resolve element_id to coordinates. + + Convenience method that combines parse() and resolve_element_id(). + + Args: + response: Raw model response. + registry: Optional element registry for ID conversion. + + Returns: + ParsedAction with coordinates if available. + """ + action = self.parse(response) + return self.resolve_element_id(action, registry) diff --git a/openadapt_ml/baselines/prompts.py b/openadapt_ml/baselines/prompts.py index 7619b4e..82383de 100644 --- a/openadapt_ml/baselines/prompts.py +++ b/openadapt_ml/baselines/prompts.py @@ -1,103 +1,409 @@ """Prompt templates for baseline adapters. Provides track-specific system prompts and user content builders. +Based on SOTA patterns from: +- Claude Computer Use (Anthropic) +- UFO/UFO2 (Microsoft) +- OSWorld benchmark +- Agent-S/Agent-S2 (Simular AI) + +Key design principles: +1. Structured observation -> thought -> action flow (ReAct) +2. Clear action format specification with examples +3. Explicit coordinate system definition +4. Screen verification after action (Claude best practice) +5. 
Error handling guidance """ from __future__ import annotations +import textwrap from typing import TYPE_CHECKING, Any -from openadapt_ml.baselines.config import TrackConfig, TrackType +from openadapt_ml.baselines.config import ( + ActionOutputFormat, + TrackConfig, + TrackType, +) if TYPE_CHECKING: from PIL import Image -# System prompts for each track -SYSTEM_PROMPTS = { - TrackType.TRACK_A: """You are a GUI automation agent. Your task is to interact with graphical user interfaces by analyzing screenshots and determining the next action. - -CAPABILITIES: -- CLICK(x, y): Click at normalized coordinates (0.0-1.0) where (0,0) is top-left -- TYPE("text"): Type the specified text -- KEY(key): Press a key (e.g., KEY(enter), KEY(escape)) -- SCROLL(direction): Scroll up or down -- DONE(): Mark task as complete - -RULES: -1. Analyze the screenshot carefully before acting -2. Use normalized coordinates (0.0-1.0) for all clicks -3. Return ONLY the action in JSON format -4. If the goal is achieved, use DONE() - -OUTPUT FORMAT: +# ============================================================================= +# TRACK A: Direct Coordinate Prediction +# ============================================================================= + +SYSTEM_PROMPT_TRACK_A = """You are a GUI automation agent that controls computer interfaces by analyzing screenshots. + +## YOUR CAPABILITIES + +You can perform these actions: +- **CLICK**: Click at specific screen coordinates +- **TYPE**: Enter text at the current cursor position +- **KEY**: Press keyboard keys or key combinations +- **SCROLL**: Scroll in a direction +- **DONE**: Mark task as complete when the goal is achieved + +## COORDINATE SYSTEM + +- Coordinates are **normalized** between 0.0 and 1.0 +- (0.0, 0.0) is the **top-left** corner of the screen +- (1.0, 1.0) is the **bottom-right** corner +- For example, the center of the screen is (0.5, 0.5) + +## OUTPUT FORMAT + +Respond with a single JSON object containing your action: + +```json {"action": "CLICK", "x": 0.5, "y": 0.3} -or +``` + +```json {"action": "TYPE", "text": "hello world"} -or +``` + +```json {"action": "KEY", "key": "enter"} -or -{"action": "SCROLL", "direction": "down"} -or -{"action": "DONE"}""", - TrackType.TRACK_B: """You are a GUI automation agent using ReAct (Reason + Act) to interact with interfaces. - -CAPABILITIES: -- CLICK(x, y): Click at normalized coordinates (0.0-1.0) where (0,0) is top-left -- TYPE("text"): Type the specified text -- KEY(key): Press a key (e.g., KEY(enter), KEY(escape)) -- SCROLL(direction): Scroll up or down -- DONE(): Mark task as complete - -PROCESS: -1. OBSERVE: Describe what you see in the screenshot -2. THINK: Reason about what action to take and why -3. ACT: Execute the chosen action - -RULES: -1. Always explain your reasoning before acting -2. Use normalized coordinates (0.0-1.0) for all clicks -3. Return both thought and action in JSON format -4. If the goal is achieved, use DONE() - -OUTPUT FORMAT: -{"thought": "I can see a login form. The username field is at the top. I should click it first to enter credentials.", "action": "CLICK", "x": 0.5, "y": 0.3} -or -{"thought": "The username field is now focused. I should type the username.", "action": "TYPE", "text": "user@example.com"} -or -{"thought": "The task is complete - I can see the success message.", "action": "DONE"}""", - TrackType.TRACK_C: """You are a GUI automation agent. UI elements are marked with numbered labels [1], [2], etc. 
- -CAPABILITIES: -- CLICK([id]): Click the element with the given ID -- TYPE("text"): Type the specified text -- KEY(key): Press a key (e.g., KEY(enter), KEY(escape)) -- SCROLL(direction): Scroll up or down -- DONE(): Mark task as complete - -RULES: -1. Use element IDs from the labels, NOT coordinates -2. Each element has a number in brackets like [1], [17], [42] -3. Return ONLY the action in JSON format -4. If the goal is achieved, use DONE() - -OUTPUT FORMAT: +``` + +```json +{"action": "SCROLL", "direction": "down", "amount": 3} +``` + +```json +{"action": "DONE"} +``` + +## RULES + +1. **Analyze carefully**: Study the screenshot to identify UI elements +2. **Be precise**: Aim for the center of clickable elements +3. **One action at a time**: Return exactly one action per response +4. **Validate coordinates**: Ensure x and y are between 0.0 and 1.0 +5. **Complete the task**: Use DONE only when the goal is fully achieved +6. **Handle errors**: If an action fails, try an alternative approach + +## IMPORTANT + +- Return ONLY the JSON object, no additional text +- If you cannot determine the correct action, explain in a "reason" field and still provide your best guess""" + + +# ============================================================================= +# TRACK B: ReAct-style Reasoning with Coordinates +# ============================================================================= + +SYSTEM_PROMPT_TRACK_B = """You are a GUI automation agent using ReAct (Reasoning + Acting) to complete tasks. + +## YOUR CAPABILITIES + +You can perform these actions: +- **CLICK**: Click at specific screen coordinates +- **TYPE**: Enter text at the current cursor position +- **KEY**: Press keyboard keys or key combinations +- **SCROLL**: Scroll in a direction +- **DONE**: Mark task as complete + +## COORDINATE SYSTEM + +- Coordinates are **normalized** between 0.0 and 1.0 +- (0.0, 0.0) is the **top-left** corner +- (1.0, 1.0) is the **bottom-right** corner + +## ReAct PROCESS + +For each step, follow this process: + +1. **OBSERVE**: Describe what you see in the screenshot + - What application/window is visible? + - What UI elements are present? + - What is the current state? + +2. **THINK**: Reason about the next action + - What is the goal? + - What progress has been made? + - What is the logical next step? + - Where exactly should I click? + +3. **ACT**: Execute the action + +## OUTPUT FORMAT + +Respond with a JSON object containing observation, thought, and action: + +```json +{ + "observation": "I see a login form with username and password fields. The username field is empty and appears to be focused.", + "thought": "To log in, I first need to enter the username. The username field is positioned at approximately x=0.5, y=0.35.", + "action": "CLICK", + "x": 0.5, + "y": 0.35 +} +``` + +```json +{ + "observation": "The username field is now active with a cursor blinking.", + "thought": "I should type the username now.", + "action": "TYPE", + "text": "user@example.com" +} +``` + +```json +{ + "observation": "I can see the confirmation page showing 'Success! You are logged in.'", + "thought": "The task is complete - the login was successful.", + "action": "DONE" +} +``` + +## RULES + +1. **Always explain your reasoning** before acting +2. **Be specific** in observations - describe what you actually see +3. **Justify coordinates** - explain why you chose those coordinates +4. **Track progress** - consider previous actions when planning +5. 
**Verify completion** - ensure the goal is fully achieved before DONE + +## TIPS + +- If an element is hard to click, try using keyboard navigation +- After clicking, verify the expected result occurred +- For text fields, click to focus before typing""" + + +# ============================================================================= +# TRACK C: Set-of-Mark Element Selection +# ============================================================================= + +SYSTEM_PROMPT_TRACK_C = """You are a GUI automation agent. UI elements in the screenshot are labeled with numbered markers like [1], [2], [3], etc. + +## YOUR CAPABILITIES + +You can perform these actions: +- **CLICK**: Click an element by its label number +- **TYPE**: Enter text at the current cursor position +- **KEY**: Press keyboard keys or key combinations +- **SCROLL**: Scroll in a direction +- **DONE**: Mark task as complete + +## ELEMENT LABELS + +- Each interactive UI element is marked with a number in brackets: [1], [2], [3], etc. +- The accessibility tree below lists all labeled elements with their roles and names +- Use the element ID (the number) to specify which element to click + +## OUTPUT FORMAT + +Respond with a JSON object: + +```json {"action": "CLICK", "element_id": 17} -or +``` + +```json {"action": "TYPE", "text": "hello world"} -or +``` + +```json {"action": "KEY", "key": "enter"} -or +``` + +```json {"action": "SCROLL", "direction": "down"} -or -{"action": "DONE"}""", +``` + +```json +{"action": "DONE"} +``` + +## RULES + +1. **Use element IDs** - Click by element number, NOT coordinates +2. **Match carefully** - Find the element that matches your intent +3. **Check roles** - Consider element type (button, textfield, checkbox) +4. **Read labels** - Use element names to identify correct targets +5. **One action** - Return exactly one action per response + +## ELEMENT SELECTION TIPS + +- Look for buttons with matching text labels +- Text fields are often named by their placeholder or label +- If multiple similar elements exist, choose based on position +- Some elements may be nested - prefer the most specific match + +## IMPORTANT + +- Return ONLY the JSON object +- element_id must be an integer from the labeled elements""" + + +# ============================================================================= +# OSWORLD-COMPATIBLE PROMPTS (PyAutoGUI format) +# ============================================================================= + +SYSTEM_PROMPT_OSWORLD = """You are a GUI automation agent controlling a computer through PyAutoGUI. + +## ENVIRONMENT + +You are interacting with a desktop environment (Ubuntu/Windows/macOS). +Execute tasks by generating Python code using the PyAutoGUI library. 
+ +## AVAILABLE ACTIONS + +```python +# Mouse actions +pyautogui.click(x, y) # Click at pixel coordinates +pyautogui.doubleClick(x, y) # Double-click +pyautogui.rightClick(x, y) # Right-click +pyautogui.moveTo(x, y) # Move mouse +pyautogui.drag(dx, dy) # Drag relative + +# Keyboard actions +pyautogui.write('text') # Type text +pyautogui.press('key') # Press single key +pyautogui.hotkey('ctrl', 'c') # Key combination + +# Scrolling +pyautogui.scroll(clicks) # Scroll (positive=up, negative=down) + +# Special +WAIT # Agent should wait +FAIL # Task is infeasible +DONE # Task is complete +``` + +## COORDINATE SYSTEM + +- Coordinates are in **pixels** from the screen's top-left corner +- Screen dimensions are provided in the observation + +## OUTPUT FORMAT + +Output a single line of Python code or special command: + +``` +pyautogui.click(960, 540) +``` + +``` +pyautogui.write('Hello, World!') +``` + +``` +pyautogui.hotkey('ctrl', 's') +``` + +``` +DONE +``` + +## RULES + +1. **One action per response** - Output exactly one line +2. **Use pixel coordinates** - Not normalized +3. **Be precise** - Aim for the center of elements +4. **Handle failures** - Output FAIL if task is impossible +5. **Wait when needed** - Output WAIT if UI is loading + +## TIPS + +- Click in the center of buttons and links +- For text fields, click to focus before typing +- Use hotkeys when available (faster, more reliable) +- Scroll to reveal off-screen elements""" + + +# ============================================================================= +# UFO-COMPATIBLE PROMPTS +# ============================================================================= + +SYSTEM_PROMPT_UFO = """You are an AppAgent in the UFO framework, controlling Windows applications. + +## YOUR ROLE + +You interact with application UI by selecting controls and executing functions. +Each control is labeled with a number that you reference in your response. + +## PROCESS + +For each step: +1. **Observe** the current application state +2. **Think** about what action achieves the goal +3. **Select** the appropriate control and function +4. **Plan** subsequent steps + +## OUTPUT FORMAT + +Respond with a JSON object: + +```json +{ + "Observation": "The Notepad application is open with an empty document.", + "Thought": "To save the file, I need to use File > Save or Ctrl+S. I'll click the File menu first.", + "ControlLabel": 3, + "ControlText": "File", + "Function": "click", + "Args": [], + "Status": "CONTINUE", + "Plan": ["Click Save in the menu", "Enter filename", "Click Save button"], + "Comment": "Starting the save workflow" +} +``` + +## AVAILABLE FUNCTIONS + +- **click**: Click the control +- **input_text**: Type text (Args: ["text to type"]) +- **select**: Select option from dropdown (Args: ["option"]) +- **scroll**: Scroll control (Args: ["up"] or ["down"]) +- **hotkey**: Press key combination (Args: ["ctrl", "s"]) +- **wait**: Wait for UI update (Args: [seconds]) + +## STATUS VALUES + +- **CONTINUE**: More actions needed +- **FINISH**: Task completed successfully +- **ERROR**: Something went wrong +- **PENDING**: Waiting for user input + +## RULES + +1. **Always provide Observation and Thought** +2. **ControlLabel must match a labeled element** +3. **Plan should list remaining steps** +4. 
**Use FINISH only when goal is achieved**""" + + +# ============================================================================= +# System Prompt Registry +# ============================================================================= + +SYSTEM_PROMPTS = { + TrackType.TRACK_A: SYSTEM_PROMPT_TRACK_A, + TrackType.TRACK_B: SYSTEM_PROMPT_TRACK_B, + TrackType.TRACK_C: SYSTEM_PROMPT_TRACK_C, } +# Additional format-specific prompts +FORMAT_PROMPTS = { + ActionOutputFormat.PYAUTOGUI: SYSTEM_PROMPT_OSWORLD, +} + + +# ============================================================================= +# PromptBuilder Class +# ============================================================================= + class PromptBuilder: """Builds prompts for baseline API calls. Constructs system prompts and user content based on track configuration. + Supports multiple output formats and benchmark compatibility. Example: builder = PromptBuilder(track_config) @@ -118,21 +424,71 @@ def __init__(self, track: TrackConfig): """ self.track = track - def get_system_prompt(self, demo: str | None = None) -> str: + def get_system_prompt( + self, + demo: str | None = None, + custom_instructions: str | None = None, + ) -> str: """Get the system prompt for this track. Args: - demo: Optional demo text to include. + demo: Optional demo text to include as an example. + custom_instructions: Optional custom instructions to append. Returns: System prompt string. """ - base_prompt = SYSTEM_PROMPTS[self.track.track_type] + # Select base prompt based on format or track + if self.track.action_format == ActionOutputFormat.PYAUTOGUI: + base_prompt = SYSTEM_PROMPT_OSWORLD + else: + base_prompt = SYSTEM_PROMPTS.get( + self.track.track_type, SYSTEM_PROMPT_TRACK_A + ) + + parts = [base_prompt] + # Add demo example if provided if demo: - base_prompt += f"\n\nEXAMPLE DEMONSTRATION:\n{demo}" + parts.append(self._format_demo_section(demo)) + + # Add screen verification instruction if enabled + if self.track.verify_after_action: + parts.append(self._get_verification_instruction()) + + # Add custom instructions + if custom_instructions: + parts.append(f"\n## ADDITIONAL INSTRUCTIONS\n\n{custom_instructions}") + + return "\n\n".join(parts) + + def _format_demo_section(self, demo: str) -> str: + """Format demonstration example section.""" + return textwrap.dedent(f""" + ## EXAMPLE DEMONSTRATION + + Here is an example of successfully completing a similar task: + + {demo} + + Follow a similar pattern for your task. + """).strip() - return base_prompt + def _get_verification_instruction(self) -> str: + """Get instruction for post-action verification. + + Based on Claude Computer Use best practices. + """ + return textwrap.dedent(""" + ## VERIFICATION + + After each action, a new screenshot will be provided. Verify that: + 1. The action was executed correctly + 2. The UI state changed as expected + 3. You are making progress toward the goal + + If something unexpected happened, explain what went wrong and try again. + """).strip() def build_user_content( self, @@ -141,6 +497,8 @@ def build_user_content( a11y_tree: str | dict[str, Any] | None = None, history: list[dict[str, Any]] | None = None, encode_image_fn: Any = None, + screen_info: dict[str, Any] | None = None, + window_info: dict[str, Any] | None = None, ) -> list[dict[str, Any]]: """Build user message content for API call. @@ -150,6 +508,8 @@ def build_user_content( a11y_tree: Accessibility tree (string or dict). history: List of previous actions. encode_image_fn: Function to encode image for API. 
+ screen_info: Screen dimensions and other info. + window_info: Active window information. Returns: List of content blocks for API message. @@ -157,26 +517,33 @@ def build_user_content( content: list[dict[str, Any]] = [] # Build text prompt - text_parts = [f"GOAL: {goal}"] + text_parts = [self._format_goal(goal)] + + # Add screen info if provided + if screen_info: + text_parts.append(self._format_screen_info(screen_info)) + + # Add window info if provided + if window_info: + text_parts.append(self._format_window_info(window_info)) # Add accessibility tree if configured if self.track.use_a11y_tree and a11y_tree: tree_text = self._format_a11y_tree(a11y_tree) if tree_text: - text_parts.append(f"\nACCESSIBILITY TREE:\n{tree_text}") + text_parts.append(self._format_a11y_section(tree_text)) # Add action history if configured if self.track.include_history and history: history_text = self._format_history(history) if history_text: - text_parts.append(f"\nPREVIOUS ACTIONS:\n{history_text}") + text_parts.append(self._format_history_section(history_text)) - # Add instruction - text_parts.append("\nAnalyze the screenshot and provide the next action.") - text_parts.append(f"OUTPUT FORMAT: {self.track.output_format}") + # Add instruction based on track + text_parts.append(self._get_action_instruction()) - # Add text content - content.append({"type": "text", "text": "\n".join(text_parts)}) + # Combine text parts + content.append({"type": "text", "text": "\n\n".join(text_parts)}) # Add screenshot if provided if screenshot is not None and encode_image_fn is not None: @@ -184,6 +551,47 @@ def build_user_content( return content + def _format_goal(self, goal: str) -> str: + """Format the task goal.""" + return f"## TASK\n\n{goal}" + + def _format_screen_info(self, screen_info: dict[str, Any]) -> str: + """Format screen information.""" + width = screen_info.get("width", "unknown") + height = screen_info.get("height", "unknown") + return f"## SCREEN\n\nResolution: {width} x {height} pixels" + + def _format_window_info(self, window_info: dict[str, Any]) -> str: + """Format active window information.""" + parts = ["## ACTIVE WINDOW"] + + if "title" in window_info: + parts.append(f"Title: {window_info['title']}") + if "app" in window_info: + parts.append(f"Application: {window_info['app']}") + if "url" in window_info: + parts.append(f"URL: {window_info['url']}") + + return "\n".join(parts) + + def _format_a11y_section(self, tree_text: str) -> str: + """Format accessibility tree section with header.""" + header = "## UI ELEMENTS" if self.track.use_som else "## ACCESSIBILITY TREE" + return f"{header}\n\n{tree_text}" + + def _format_history_section(self, history_text: str) -> str: + """Format history section with header.""" + return f"## PREVIOUS ACTIONS\n\n{history_text}" + + def _get_action_instruction(self) -> str: + """Get instruction for action output based on track.""" + if self.track.track_type == TrackType.TRACK_B: + return "## YOUR TURN\n\nAnalyze the screenshot, explain your reasoning, and provide the next action." + elif self.track.track_type == TrackType.TRACK_C: + return "## YOUR TURN\n\nAnalyze the screenshot and select the appropriate element to interact with." + else: + return "## YOUR TURN\n\nAnalyze the screenshot and provide the next action." + def _format_a11y_tree(self, tree: str | dict[str, Any]) -> str: """Format accessibility tree for prompt. 
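
The hunks above replace the flat `GOAL:` prefix with named markdown sections (`## TASK`, `## SCREEN`, `## ACTIVE WINDOW`, `## ACCESSIBILITY TREE`, `## PREVIOUS ACTIONS`, `## YOUR TURN`) joined by blank lines. A minimal sketch of that assembly order, with a stub standing in for `TrackConfig` and hypothetical inputs (the headers match the diff; everything else is illustrative):

```python
# Sketch only: TrackStub stands in for TrackConfig; the real logic lives on
# PromptBuilder.build_user_content and its _format_* helpers.
from dataclasses import dataclass
from typing import Any


@dataclass
class TrackStub:
    use_a11y_tree: bool = True
    include_history: bool = True


def build_text_prompt(
    goal: str,
    screen_info: dict[str, Any] | None,
    a11y_tree: str | None,
    history_text: str | None,
    track: TrackStub,
) -> str:
    """Assemble the markdown sections in the same order as the diff."""
    parts = [f"## TASK\n\n{goal}"]
    if screen_info:
        parts.append(
            f"## SCREEN\n\nResolution: "
            f"{screen_info['width']} x {screen_info['height']} pixels"
        )
    if track.use_a11y_tree and a11y_tree:
        parts.append(f"## ACCESSIBILITY TREE\n\n{a11y_tree}")
    if track.include_history and history_text:
        parts.append(f"## PREVIOUS ACTIONS\n\n{history_text}")
    parts.append("## YOUR TURN\n\nAnalyze the screenshot and provide the next action.")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_text_prompt(
        goal="Save the open document as report.txt",
        screen_info={"width": 1920, "height": 1080},
        a11y_tree='[12] button: "Save"',
        history_text='1. CLICK([3])\n2. TYPE("report")',
        track=TrackStub(),
    ))
```

Joining with `"\n\n"` (rather than `"\n"`) is what keeps each section a separate markdown block in the final user message.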
@@ -204,13 +612,17 @@ def _format_a11y_tree(self, tree: str | dict[str, Any]) -> str: max_lines = self.track.max_a11y_elements lines = text.split("\n") if len(lines) > max_lines: + original_count = len(lines) lines = lines[:max_lines] - lines.append(f"... (truncated, {len(lines)} of {max_lines} elements)") + lines.append(f"... (showing {max_lines} of {original_count} elements)") return "\n".join(lines) def _dict_to_tree_string( - self, tree: dict[str, Any], indent: int = 0, max_depth: int = 5 + self, + tree: dict[str, Any], + indent: int = 0, + max_depth: int = 5, ) -> str: """Convert dict tree to formatted string. @@ -230,24 +642,42 @@ def _dict_to_tree_string( role = tree.get("role", "unknown") name = tree.get("name", "") - node_id = tree.get("id", "") + node_id = tree.get("id", tree.get("node_id", "")) - # Format node - if node_id: + # Format node based on track + if self.track.use_som and node_id: + # SoM format: [id] role "name" line = f"{prefix}[{node_id}] {role}" + elif node_id: + # Non-SoM with ID + line = f"{prefix}({node_id}) {role}" else: line = f"{prefix}{role}" if name: + # Truncate long names + if len(name) > 50: + name = name[:47] + "..." line += f': "{name}"' + # Add bounding box if available (useful for debugging) + bbox = tree.get("bbox", tree.get("bounds")) + if bbox and isinstance(bbox, (list, tuple)) and len(bbox) >= 4: + # Show center point for SoM + if self.track.use_som: + cx = (bbox[0] + bbox[2]) / 2 + cy = (bbox[1] + bbox[3]) / 2 + line += f" @ ({cx:.2f}, {cy:.2f})" + lines.append(line) # Process children children = tree.get("children", []) for child in children: if isinstance(child, dict): - lines.append(self._dict_to_tree_string(child, indent + 1, max_depth)) + child_text = self._dict_to_tree_string(child, indent + 1, max_depth) + if child_text: + lines.append(child_text) return "\n".join(lines) @@ -269,26 +699,87 @@ def _format_history(self, history: list[dict[str, Any]]) -> str: for i, action in enumerate(recent, 1): action_type = action.get("type", action.get("action", "unknown")).upper() + line = self._format_single_action(i, action_type, action) + lines.append(line) - if action_type == "CLICK": - if "element_id" in action: - lines.append(f"{i}. CLICK([{action['element_id']}])") - elif "x" in action and "y" in action: - lines.append(f"{i}. CLICK({action['x']:.3f}, {action['y']:.3f})") - else: - lines.append(f"{i}. CLICK()") - elif action_type == "TYPE": - text = action.get("text", "") - lines.append(f'{i}. TYPE("{text}")') - elif action_type == "KEY": - key = action.get("key", "") - lines.append(f"{i}. KEY({key})") - elif action_type == "SCROLL": - direction = action.get("direction", "down") - lines.append(f"{i}. SCROLL({direction})") - elif action_type == "DONE": - lines.append(f"{i}. DONE()") + return "\n".join(lines) + + def _format_single_action( + self, step: int, action_type: str, action: dict[str, Any] + ) -> str: + """Format a single action for history display.""" + if action_type == "CLICK": + if "element_id" in action: + return f"{step}. CLICK([{action['element_id']}])" + elif "x" in action and "y" in action: + return f"{step}. CLICK({action['x']:.3f}, {action['y']:.3f})" else: - lines.append(f"{i}. {action_type}()") + return f"{step}. CLICK()" + elif action_type == "TYPE": + text = action.get("text", "") + # Truncate long text + if len(text) > 30: + text = text[:27] + "..." + return f'{step}. TYPE("{text}")' + elif action_type == "KEY": + key = action.get("key", "") + return f"{step}. 
KEY({key})" + elif action_type == "SCROLL": + direction = action.get("direction", "down") + amount = action.get("amount", 1) + return f"{step}. SCROLL({direction}, {amount})" + elif action_type == "DONE": + return f"{step}. DONE()" + elif action_type == "WAIT": + return f"{step}. WAIT()" + else: + return f"{step}. {action_type}()" - return "\n".join(lines) + def build_verification_prompt( + self, + goal: str, + previous_action: dict[str, Any], + screenshot: "Image" | None = None, + encode_image_fn: Any = None, + ) -> list[dict[str, Any]]: + """Build a verification prompt after an action. + + Used to verify action results and decide next steps. + Based on Claude Computer Use best practices. + + Args: + goal: Original task goal. + previous_action: The action that was just executed. + screenshot: Screenshot after action execution. + encode_image_fn: Function to encode image. + + Returns: + List of content blocks. + """ + content: list[dict[str, Any]] = [] + + action_str = self._format_single_action(0, previous_action.get("type", ""), previous_action) + action_str = action_str[3:] # Remove "0. " prefix + + text = textwrap.dedent(f""" + ## VERIFICATION CHECK + + **Goal**: {goal} + + **Previous Action**: {action_str} + + Analyze the screenshot and verify: + 1. Did the action execute correctly? + 2. Is the UI state as expected? + 3. Are we making progress toward the goal? + + If the goal is achieved, respond with {{"action": "DONE"}}. + Otherwise, provide the next action. + """).strip() + + content.append({"type": "text", "text": text}) + + if screenshot is not None and encode_image_fn is not None: + content.append(encode_image_fn(screenshot)) + + return content diff --git a/openadapt_ml/benchmarks/__init__.py b/openadapt_ml/benchmarks/__init__.py index 9090446..878bde1 100644 --- a/openadapt_ml/benchmarks/__init__.py +++ b/openadapt_ml/benchmarks/__init__.py @@ -88,6 +88,7 @@ RandomAgent, ScriptedAgent, SmartMockAgent, + UnifiedBaselineAgent, ) from openadapt_ml.benchmarks.base import ( BenchmarkAction, @@ -131,6 +132,7 @@ def _get_azure_classes(): "BenchmarkAgent", "PolicyAgent", "APIBenchmarkAgent", + "UnifiedBaselineAgent", "ScriptedAgent", "RandomAgent", "SmartMockAgent", diff --git a/openadapt_ml/benchmarks/agent.py b/openadapt_ml/benchmarks/agent.py index 05a61f3..e8d9151 100644 --- a/openadapt_ml/benchmarks/agent.py +++ b/openadapt_ml/benchmarks/agent.py @@ -902,3 +902,281 @@ def reset(self) -> None: """Reset agent state.""" # APIBenchmarkAgent is stateless, nothing to reset pass + + +class UnifiedBaselineAgent(BenchmarkAgent): + """Agent that uses the UnifiedBaselineAdapter for benchmark evaluation. + + This agent provides a unified interface for comparing Claude, GPT, and Gemini + models across multiple evaluation tracks (coordinates, ReAct, SoM). + + Compared to APIBenchmarkAgent, this agent: + - Uses the new provider abstraction (models/providers/) + - Supports multiple tracks (A, B, C) with track-specific prompts + - Uses the unified response parser + - Supports model aliases for easy switching + + Args: + model_alias: Model alias (e.g., 'claude-opus-4.5', 'gpt-5.2', 'gemini-3-pro'). + track: Track type ('A', 'B', or 'C'). Defaults to 'A'. + api_key: Optional API key override. If not provided, uses env vars. + temperature: Sampling temperature. Defaults to 0.1. + max_tokens: Maximum tokens for response. Defaults to 1024. + demo: Optional demo text to include in prompts. + verbose: Whether to print verbose debug output. 
+ + Example: + # Claude baseline with Track C (Set-of-Mark) + agent = UnifiedBaselineAgent( + model_alias="claude-opus-4.5", + track="C", + ) + results = evaluate_agent_on_benchmark(agent, waa_adapter) + + # GPT baseline with Track A (direct coordinates) + agent = UnifiedBaselineAgent( + model_alias="gpt-5.2", + track="A", + ) + results = evaluate_agent_on_benchmark(agent, waa_adapter) + + # Gemini baseline with Track B (ReAct reasoning) + agent = UnifiedBaselineAgent( + model_alias="gemini-3-pro", + track="B", + ) + results = evaluate_agent_on_benchmark(agent, waa_adapter) + """ + + def __init__( + self, + model_alias: str = "claude-opus-4.5", + track: str = "A", + api_key: str | None = None, + temperature: float = 0.1, + max_tokens: int = 1024, + demo: str | None = None, + verbose: bool = False, + ): + self.model_alias = model_alias + self.track = track.upper() + self.api_key = api_key + self.temperature = temperature + self.max_tokens = max_tokens + self.demo = demo + self.verbose = verbose + self._adapter = None + + def _get_adapter(self): + """Lazily initialize the UnifiedBaselineAdapter.""" + if self._adapter is None: + from openadapt_ml.baselines import ( + BaselineConfig, + TrackConfig, + UnifiedBaselineAdapter, + ) + + # Select track config + track_configs = { + "A": TrackConfig.track_a(), + "B": TrackConfig.track_b(), + "C": TrackConfig.track_c(), + } + track_config = track_configs.get(self.track, TrackConfig.track_a()) + + # Create adapter from alias + self._adapter = UnifiedBaselineAdapter.from_alias( + self.model_alias, + track=track_config, + api_key=self.api_key, + temperature=self.temperature, + max_tokens=self.max_tokens, + demo=self.demo, + verbose=self.verbose, + ) + return self._adapter + + def act( + self, + observation: BenchmarkObservation, + task: BenchmarkTask, + history: list[tuple[BenchmarkObservation, BenchmarkAction]] | None = None, + ) -> BenchmarkAction: + """Use UnifiedBaselineAdapter to determine next action. + + Args: + observation: Current observation with screenshot. + task: Task being performed. + history: Previous observations and actions. + + Returns: + BenchmarkAction parsed from adapter response. 
+ """ + from PIL import Image + + adapter = self._get_adapter() + + # Load screenshot if available + screenshot = None + if observation.screenshot_path: + try: + screenshot = Image.open(observation.screenshot_path) + except Exception as e: + if self.verbose: + print(f"[UnifiedBaselineAgent] Failed to load screenshot: {e}") + + # Build accessibility tree string + a11y_tree = None + if observation.accessibility_tree: + a11y_tree = observation.accessibility_tree + + # Build history for adapter + adapter_history = None + if history: + adapter_history = [] + for obs, action in history[-5:]: # Last 5 actions + adapter_history.append(self._benchmark_action_to_dict(action)) + + # Call adapter + try: + parsed_action = adapter.predict( + screenshot=screenshot, + goal=task.instruction, + a11y_tree=a11y_tree, + history=adapter_history, + ) + except Exception as e: + if self.verbose: + print(f"[UnifiedBaselineAgent] Adapter error: {e}") + return BenchmarkAction( + type="done", + raw_action={"error": str(e)}, + ) + + # Convert ParsedAction to BenchmarkAction + return self._parsed_to_benchmark_action(parsed_action, observation) + + def _benchmark_action_to_dict(self, action: BenchmarkAction) -> dict[str, Any]: + """Convert BenchmarkAction to dict for history.""" + result = {"type": action.type} + + if action.x is not None: + result["x"] = action.x + if action.y is not None: + result["y"] = action.y + if action.text: + result["text"] = action.text + if action.key: + result["key"] = action.key + if action.target_node_id: + result["element_id"] = action.target_node_id + if action.scroll_direction: + result["direction"] = action.scroll_direction + + return result + + def _parsed_to_benchmark_action( + self, + parsed_action, + observation: BenchmarkObservation | None = None, + ) -> BenchmarkAction: + """Convert ParsedAction to BenchmarkAction. + + Args: + parsed_action: ParsedAction from adapter. + observation: Current observation (for coordinate normalization). + + Returns: + BenchmarkAction. 
+ """ + raw_action = { + "raw_response": parsed_action.raw_response, + "thought": parsed_action.thought, + } + + if not parsed_action.is_valid: + raw_action["parse_error"] = parsed_action.parse_error + return BenchmarkAction(type="done", raw_action=raw_action) + + action_type = parsed_action.action_type + + if action_type == "click": + if parsed_action.element_id is not None: + return BenchmarkAction( + type="click", + target_node_id=str(parsed_action.element_id), + raw_action=raw_action, + ) + elif parsed_action.x is not None and parsed_action.y is not None: + x = parsed_action.x + y = parsed_action.y + + # Normalize coordinates if they appear to be pixel values + if observation and observation.viewport and (x > 1.0 or y > 1.0): + width, height = observation.viewport + raw_action["original_coords"] = {"x": x, "y": y} + raw_action["normalized"] = True + x = x / width + y = y / height + + return BenchmarkAction( + type="click", + x=x, + y=y, + raw_action=raw_action, + ) + + elif action_type == "type": + return BenchmarkAction( + type="type", + text=parsed_action.text, + raw_action=raw_action, + ) + + elif action_type == "key": + return BenchmarkAction( + type="key", + key=parsed_action.key, + raw_action=raw_action, + ) + + elif action_type == "scroll": + return BenchmarkAction( + type="scroll", + scroll_direction=parsed_action.direction, + raw_action=raw_action, + ) + + elif action_type == "done": + return BenchmarkAction(type="done", raw_action=raw_action) + + elif action_type == "drag": + x = parsed_action.x + y = parsed_action.y + end_x = getattr(parsed_action, "end_x", None) + end_y = getattr(parsed_action, "end_y", None) + + return BenchmarkAction( + type="drag", + x=x, + y=y, + end_x=end_x, + end_y=end_y, + raw_action=raw_action, + ) + + # Unknown action type, return done + raw_action["unknown_action"] = action_type + return BenchmarkAction(type="done", raw_action=raw_action) + + def reset(self) -> None: + """Reset agent state.""" + # UnifiedBaselineAgent is stateless, nothing to reset + pass + + def __repr__(self) -> str: + return ( + f"UnifiedBaselineAgent(" + f"model={self.model_alias}, " + f"track={self.track})" + ) diff --git a/openadapt_ml/benchmarks/cli.py b/openadapt_ml/benchmarks/cli.py index 8211cb0..e35e1fe 100644 --- a/openadapt_ml/benchmarks/cli.py +++ b/openadapt_ml/benchmarks/cli.py @@ -121,6 +121,25 @@ "-o", "TCPKeepAlive=yes", ] +# Docker image name for WAA benchmarks +# Built from official Microsoft WindowsAgentArena repo (vendor submodule) +# Uses: vendor/WindowsAgentArena/src/win-arena-container/Dockerfile-WinArena +WAA_DOCKER_IMAGE = "windowsarena/winarena:latest" + +# Persistent log directory for Docker containers +# Logs are persisted even after container stops/crashes for debugging +WAA_LOG_DIR = "/mnt/waa-logs" + +# VM OS disk size in GB - must be large enough for: +# - Ubuntu OS (~10GB) +# - Docker images (~30GB for WAA) +# - Windows VM disk (~30GB) +# - Build cache and temp files (~20GB) +VM_OS_DISK_SIZE_GB = "128" + +# VM size for WAA - must support nested virtualization (v5 series) +VM_SIZE_WAA = "Standard_D4ds_v5" + def ssh_cmd(ip: str, cmd: str, extra_opts: list[str] | None = None) -> list[str]: """Build SSH command with proper options for Azure VMs. @@ -159,6 +178,186 @@ def scp_cmd(src: str, dest: str, recursive: bool = False) -> list[str]: return base +def verify_waa_image_exists(ip: str) -> bool: + """Check if the WAA Docker image exists on the VM. 
+ + Args: + ip: IP address of the VM + + Returns: + True if waa-auto:latest image exists, False otherwise + """ + import subprocess + check_cmd = f"docker images {WAA_DOCKER_IMAGE} --format '{{{{.Repository}}}}:{{{{.Tag}}}}' | head -1" + result = subprocess.run( + ["ssh", *SSH_OPTS, f"azureuser@{ip}", check_cmd], + capture_output=True, text=True, timeout=30 + ) + return result.stdout.strip() == WAA_DOCKER_IMAGE + + +def verify_container_image(ip: str, container_name: str = "winarena") -> tuple[bool, str]: + """Verify that a running container is using the correct WAA image. + + Args: + ip: IP address of the VM + container_name: Name of the container to check + + Returns: + Tuple of (is_correct, actual_image_name) + """ + import subprocess + inspect_cmd = f"docker inspect {container_name} --format '{{{{.Config.Image}}}}' 2>/dev/null" + result = subprocess.run( + ["ssh", *SSH_OPTS, f"azureuser@{ip}", inspect_cmd], + capture_output=True, text=True, timeout=30 + ) + actual_image = result.stdout.strip() + is_correct = actual_image == WAA_DOCKER_IMAGE + return is_correct, actual_image + + +def require_waa_image(ip: str) -> None: + """Ensure the WAA Docker image exists, fail with clear error if not. + + Args: + ip: IP address of the VM + + Raises: + SystemExit: If the image does not exist + """ + if not verify_waa_image_exists(ip): + print(f"\n✗ CRITICAL: {WAA_DOCKER_IMAGE} image not found!") + print(f" The official windowsarena/winarena image is BROKEN and cannot be used.") + print(f"\n Run with --rebuild to build {WAA_DOCKER_IMAGE}:") + print(f" uv run python -m openadapt_ml.benchmarks.cli vm run-waa --rebuild") + sys.exit(1) + + +def validate_running_container(ip: str, container_name: str = "winarena") -> None: + """Validate that a running container is using the correct image. + + Args: + ip: IP address of the VM + container_name: Name of the container to check + + Raises: + SystemExit: If the container is running the wrong image + """ + is_correct, actual_image = verify_container_image(ip, container_name) + if not is_correct and actual_image: # Only fail if container exists but wrong image + print(f"\n✗ CRITICAL: Container '{container_name}' is running wrong image!") + print(f" Expected: {WAA_DOCKER_IMAGE}") + print(f" Actual: {actual_image}") + print(f"\n The WAA server will NOT be installed with the wrong image.") + print(f"\n To fix:") + print(f" 1. Stop container: ssh azureuser@{ip} 'docker rm -f {container_name}'") + print(f" 2. Rebuild image: uv run python -m openadapt_ml.benchmarks.cli vm run-waa --rebuild") + sys.exit(1) + + +def get_docker_logging_opts(container_name: str = "winarena") -> str: + """Get Docker logging options for persistent log capture. + + Configures Docker to write logs to a file on the host that persists + even after container stops or crashes. This enables post-mortem + debugging of container/QEMU crashes. + + Args: + container_name: Name of the container (used in log filename) + + Returns: + Docker command-line options for logging configuration + """ + # Use json-file driver with rotation to prevent disk exhaustion + # Logs are written to /mnt/waa-logs/{container_name}.log + # max-size: 100MB per log file + # max-file: 5 rotated log files (500MB total max) + return f'''--log-driver json-file \ + --log-opt max-size=100m \ + --log-opt max-file=5 \ + --log-opt tag="{container_name}"''' + + +def ensure_log_dir(ip: str) -> bool: + """Ensure the persistent log directory exists on the VM. 
+ + Args: + ip: IP address of the VM + + Returns: + True if directory exists or was created, False on error + """ + import subprocess + result = subprocess.run( + ssh_cmd(ip, f"sudo mkdir -p {WAA_LOG_DIR} && sudo chmod 777 {WAA_LOG_DIR}"), + capture_output=True, text=True, timeout=30 + ) + return result.returncode == 0 + + +def ensure_sufficient_disk_space(ip: str, min_gb: int = 15, auto_clean: bool = True) -> bool: + """Check /mnt disk space and auto-clean if needed. + + Windows extraction needs ~9GB, plus buffer for storage. This function + prevents the recurring "Not enough free space" errors during Windows setup. + + Args: + ip: IP address of the VM + min_gb: Minimum required free space in GB (default: 15GB for safety margin) + auto_clean: If True, run docker-prune automatically when low on space + + Returns: + True if sufficient space available (after cleanup if needed), False otherwise + """ + import subprocess + + def get_free_gb() -> float: + """Get free space on /mnt in GB.""" + result = subprocess.run( + ssh_cmd(ip, "df -BG /mnt | tail -1 | awk '{print $4}' | tr -d 'G'"), + capture_output=True, text=True, timeout=30 + ) + try: + return float(result.stdout.strip()) + except (ValueError, AttributeError): + return 0.0 + + free_gb = get_free_gb() + print(f" Disk space on /mnt: {free_gb:.1f}GB free (need {min_gb}GB minimum)") + + if free_gb >= min_gb: + return True + + if not auto_clean: + print(f"\n⚠ WARNING: Only {free_gb:.1f}GB free, need {min_gb}GB for Windows setup") + return False + + # Auto-clean Docker + print(f"\n Low disk space ({free_gb:.1f}GB < {min_gb}GB). Running auto-cleanup...") + + # Run docker system prune + subprocess.run( + ssh_cmd(ip, "docker system prune -af --volumes 2>/dev/null"), + capture_output=True, text=True, timeout=300 + ) + + # Check again + free_gb = get_free_gb() + print(f" After cleanup: {free_gb:.1f}GB free") + + if free_gb >= min_gb: + print(f" ✓ Sufficient disk space available") + return True + + # Still not enough - suggest deleting old images + print(f"\n✗ Still insufficient space ({free_gb:.1f}GB < {min_gb}GB)") + print(f" Try manually cleaning old files:") + print(f" ssh azureuser@{ip} 'rm -rf /mnt/WindowsAgentArena/src/win-arena-container/vm/storage/*'") + print(f" Then rebuild: uv run python -m openadapt_ml.benchmarks.cli vm run-waa --rebuild") + return False + + def setup_logging(verbose: bool = False) -> None: """Configure logging with appropriate verbosity. 
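
For reference, `ensure_sufficient_disk_space` boils down to a check → prune → re-check loop over `df` output. A standalone sketch of that flow, with `run()` standing in for the SSH wrapper (the `df` and prune command strings follow the diff; the parsing helper is an illustrative replacement for the `awk`/`tr` pipeline):

```python
# Sketch only: mirrors the check/cleanup flow, not the actual CLI function.
import subprocess
from typing import Callable


def parse_free_gb(df_output: str) -> float:
    """Extract free GB from `df -BG <mount>`: last line, 4th column (e.g. '68G')."""
    try:
        return float(df_output.strip().splitlines()[-1].split()[3].rstrip("G"))
    except (IndexError, ValueError):
        return 0.0


def ensure_space(run: Callable[[str], str], min_gb: float = 15.0) -> bool:
    """Return True once at least min_gb is free on /mnt, pruning Docker if needed."""
    if parse_free_gb(run("df -BG /mnt")) >= min_gb:
        return True
    run("docker system prune -af --volumes")  # reclaim images, containers, volumes
    return parse_free_gb(run("df -BG /mnt")) >= min_gb


if __name__ == "__main__":
    # Local demo (Linux only); the real code wraps each command in ssh.
    def run_local(cmd: str) -> str:
        return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

    print(ensure_space(run_local))
```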
@@ -2001,7 +2200,7 @@ def launch_benchmark_viewer( print(f" VM IP: {vm_ip}") print(f" Internal IP: {internal_ip}") print(f" Local port: {port}") - print(f" Dashboard: http://localhost:{port}/benchmark.html") + print(f" Dashboard: http://localhost:{port}/azure_ops.html") print(f" VNC available via button in viewer when VM is ready") print() @@ -2009,12 +2208,13 @@ def launch_benchmark_viewer( os.environ["WAA_VM_IP"] = vm_ip os.environ["WAA_INTERNAL_IP"] = internal_ip - # Build the serve command - use --benchmark to generate benchmark.html + # Build the serve command - use --benchmark to generate ops.html (benchmark viewer) + # Note: azure_ops.html is the VM operations dashboard (separate file) serve_cmd = [ sys.executable, "-m", "openadapt_ml.cloud.local", "serve", "--port", str(port), "--quiet", - "--benchmark", "benchmark_results", # This triggers benchmark.html generation + "--benchmark", "benchmark_results", # This triggers ops.html generation ] if open_browser: serve_cmd.append("--open") @@ -2064,7 +2264,34 @@ def cmd_vm(args: argparse.Namespace) -> None: vm_size = args.size location = args.location - if args.action == "list-sizes": + if args.action == "dashboard": + # Unified dashboard - simple, robust, single HTML page + from openadapt_ml.benchmarks.dashboard import serve_dashboard + + # Get VM IP + ip = get_vm_ip(resource_group, vm_name) + if not ip: + print(f"ERROR: VM '{vm_name}' not found or not running.") + print(f" Create with: uv run python -m openadapt_ml.benchmarks.cli vm setup-waa") + print(f" Or start with: uv run python -m openadapt_ml.benchmarks.cli vm start") + sys.exit(1) + + port = getattr(args, 'port', 8765) + internal_ip = getattr(args, 'internal_ip', '172.30.0.2') + no_open = getattr(args, 'no_open', False) + + # Serve the unified dashboard (blocks until Ctrl+C) + serve_dashboard( + vm_ip=ip, + resource_group=resource_group, + vm_name=vm_name, + port=port, + internal_ip=internal_ip, + auto_open=not no_open, + ) + return + + elif args.action == "list-sizes": print(f"\n=== Available VM Sizes with Nested Virtualization in {location} ===\n") print("Checking available D-series sizes (support nested virt)...") @@ -2115,6 +2342,7 @@ def cmd_vm(args: argparse.Namespace) -> None: "--location", location, "--image", "Ubuntu2204", "--size", vm_size, + "--os-disk-size-gb", VM_OS_DISK_SIZE_GB, "--admin-username", "azureuser", "--generate-ssh-keys", "--public-ip-sku", "Standard", @@ -2201,6 +2429,10 @@ def cmd_vm(args: argparse.Namespace) -> None: print(f"✗ Error deleting VM: {result.stderr}") sys.exit(1) + # Clear session data so dashboard shows fresh values for new VMs + from openadapt_ml.benchmarks.session_tracker import clear_session + clear_session() + print(f"✓ VM '{vm_name}' deleted") elif args.action == "deallocate": @@ -2221,6 +2453,7 @@ def cmd_vm(args: argparse.Namespace) -> None: print(f" To restart: python -m openadapt_ml.benchmarks.cli vm start") elif args.action == "start": + import time print(f"\n=== Starting VM: {vm_name} ===\n") result = subprocess.run( @@ -2397,8 +2630,12 @@ def cmd_vm(args: argparse.Namespace) -> None: elif args.action == "setup-waa": from openadapt_ml.benchmarks.vm_monitor import VMPoolRegistry + from openadapt_ml.benchmarks.session_tracker import clear_session from concurrent.futures import ThreadPoolExecutor, as_completed + # Clear any stale session data from previous VMs + clear_session() + # Comprehensive one-command WAA setup with multi-worker support num_workers = getattr(args, 'workers', 1) @@ -2425,7 +2662,8 @@ def 
create_single_vm(worker_name: str, worker_location: str) -> tuple[str, str | "--name", worker_name, "--location", loc, "--image", "Ubuntu2204", - "--size", "Standard_D4ds_v5", + "--size", VM_SIZE_WAA, + "--os-disk-size-gb", VM_OS_DISK_SIZE_GB, "--admin-username", "azureuser", "--generate-ssh-keys", "--public-ip-sku", "Standard", @@ -2476,6 +2714,19 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: capture_output=True, text=True ) + # CRITICAL: Symlink WAA storage to /mnt (has 60GB+ vs 15GB on root) + # Windows VM disk is 30GB and MUST be on /mnt to have enough space + storage_cmd = """ +mkdir -p /mnt/waa-storage && \ +rm -rf ~/WindowsAgentArena/src/win-arena-container/vm/storage 2>/dev/null; \ +ln -sf /mnt/waa-storage ~/WindowsAgentArena/src/win-arena-container/vm/storage && \ +echo 'Storage symlinked to /mnt/waa-storage' +""" + subprocess.run( + ["ssh", *SSH_OPTS, f"azureuser@{ip}", storage_cmd], + capture_output=True, text=True + ) + # Create config config_cmd = f'''cat > ~/WindowsAgentArena/config.json << 'EOF' {{ @@ -2490,6 +2741,98 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: ) return True + def stream_docker_build(ip: str, build_cmd: str, image_name: str, timeout: int = 1800) -> bool: + """Stream docker build output to console and /tmp/waa_build.log on VM. + + Args: + ip: VM IP address + build_cmd: Docker build command to run + image_name: Name of image being built (for display) + timeout: Timeout in seconds (default 30 min) + + Returns: + True if build succeeded, False otherwise + """ + import time + + # Ensure we tee to log file for dashboard visibility + if "tee /tmp/waa_build.log" not in build_cmd: + build_cmd = f"{build_cmd} 2>&1 | tee /tmp/waa_build.log" + + print(f" (streaming to /tmp/waa_build.log - tail -f to watch)") + + build_process = subprocess.Popen( + ["ssh", *SSH_OPTS, f"azureuser@{ip}", build_cmd], + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + text=True, + bufsize=1 + ) + + build_output = [] + start_time = time.time() + last_progress_time = start_time + + try: + while True: + # Check timeout + elapsed = time.time() - start_time + if elapsed > timeout: + build_process.kill() + print(f"\n ✗ Build timed out after {timeout//60} minutes") + return False + + line = build_process.stdout.readline() + if not line and build_process.poll() is not None: + break + if line: + line = line.rstrip() + build_output.append(line) + + # Show key progress indicators + line_lower = line.lower() + show_line = any(x in line_lower for x in [ + 'step', 'copying', 'downloading', '#', + 'cached', 'done', 'error', 'failed', + 'extracting', 'pulling', 'waiting', + 'successfully', 'sha256', 'layer' + ]) + + # Also show any line that starts with a number (e.g., "1/5") + if not show_line and line and line[0].isdigit(): + show_line = True + + if show_line: + # Truncate long lines, show timestamp every 30 seconds + current_time = time.time() + if current_time - last_progress_time > 30: + elapsed_min = int(elapsed // 60) + elapsed_sec = int(elapsed % 60) + print(f" [{elapsed_min}m{elapsed_sec}s] {line[-90:]}", flush=True) + last_progress_time = current_time + else: + print(f" {line[-100:]}", flush=True) + + except Exception as e: + print(f"\n ✗ Build error: {e}") + return False + + # Check return code + if build_process.returncode != 0: + print(f"\n ✗ Build failed with exit code {build_process.returncode}") + print(" Last 10 lines of output:") + for output_line in build_output[-10:]: + print(f" {output_line}") + return False + + # Check for 
success in output + full_output = '\n'.join(build_output) + if "Successfully tagged" in full_output or "naming to docker.io" in full_output: + return True + + # Build completed but success message not found - check if image exists + return True # Let caller verify image existence + # Handle single worker (backward compatible) if num_workers == 1: # Get VM IP or create VM @@ -2505,7 +2848,7 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: print(f"[✓] VM already exists: {ip}") vm_created = True else: - print("[1/6] Creating Azure VM with nested virtualization...") + print("[1/8] Creating Azure VM with nested virtualization...") # Try multiple locations if needed locations_to_try = [location, "westus2", "centralus", "eastus2"] for loc in locations_to_try: @@ -2516,6 +2859,7 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: "--location", loc, "--image", "Ubuntu2204", "--size", "Standard_D4ds_v5", # v5 series supports nested virt + "--os-disk-size-gb", VM_OS_DISK_SIZE_GB, "--admin-username", "azureuser", "--generate-ssh-keys", "--public-ip-sku", "Standard"], @@ -2535,7 +2879,7 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: print("✗ Could not create VM in any region") sys.exit(1) - print(f"\n[2/6] Installing Docker with /mnt storage (147GB)...") + print(f"\n[2/8] Installing Docker with /mnt storage (147GB)...") docker_cmds = [ "sudo apt-get update -qq", "sudo apt-get install -y -qq docker.io", @@ -2558,7 +2902,7 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: else: print(" ✓ Docker installed with /mnt storage") - print(f"\n[3/6] Verifying nested virtualization...") + print(f"\n[3/8] Verifying nested virtualization...") result = subprocess.run( ["ssh", *SSH_OPTS, f"azureuser@{ip}", "egrep -c '(vmx|svm)' /proc/cpuinfo"], @@ -2571,7 +2915,7 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: print(" ✗ Nested virtualization not supported - WAA won't work") sys.exit(1) - print(f"\n[4/6] Pulling dockurr/windows image (for Windows VM)...") + print(f"\n[4/8] Pulling dockurr/windows image (for Windows VM)...") # Use dockurr/windows directly - the ACR winarena image has broken dockur result = subprocess.run( ["ssh", *SSH_OPTS, f"azureuser@{ip}", @@ -2582,7 +2926,7 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: print(f" ⚠ Image pull warning: {result.stderr[:100] if result.stderr else ''}") print(" ✓ Windows image pulled") - print(f"\n[5/6] Cloning WindowsAgentArena repository...") + print(f"\n[5/8] Cloning WindowsAgentArena repository...") result = subprocess.run( ["ssh", *SSH_OPTS, f"azureuser@{ip}", "cd ~ && git clone --depth 1 https://github.com/microsoft/WindowsAgentArena.git 2>/dev/null || echo 'Already cloned'"], @@ -2590,7 +2934,22 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: ) print(" ✓ WAA repo cloned") - print(f"\n[6/6] Creating WAA config file...") + # CRITICAL: Symlink WAA storage to /mnt (has 60GB+ vs 15GB on root) + # Windows VM disk is 30GB and MUST be on /mnt to have enough space + print(f"\n[6/8] Configuring storage on /mnt (Windows needs 30GB)...") + storage_cmd = """ +mkdir -p /mnt/waa-storage && \ +rm -rf ~/WindowsAgentArena/src/win-arena-container/vm/storage 2>/dev/null; \ +ln -sf /mnt/waa-storage ~/WindowsAgentArena/src/win-arena-container/vm/storage && \ +echo 'Storage symlinked to /mnt/waa-storage' +""" + subprocess.run( + ["ssh", *SSH_OPTS, f"azureuser@{ip}", storage_cmd], + capture_output=True, text=True + ) + 
print(" ✓ Storage configured on /mnt") + + print(f"\n[7/8] Creating WAA config file...") api_key = args.api_key or settings.openai_api_key or "" if not api_key: print(" ⚠ No API key provided. Set with --api-key, OPENAI_API_KEY env var, or in .env file") @@ -2609,13 +2968,80 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: ) print(" ✓ Config created") + # Step 8: Build standalone WAA Docker image + # Uses our lightweight Dockerfile (no 15GB model downloads) + print(f"\n[8/8] Building standalone WAA Docker image...") + print(f" Target image: {WAA_DOCKER_IMAGE}") + print(f" Approach: Standalone (vanilla WAA dev mode, ~1.3GB)") + + # Copy build files to VM (standalone approach - no COPY --from dependencies) + from pathlib import Path + waa_deploy_dir = Path(__file__).parent / "waa_deploy" + repo_root = Path(__file__).parent.parent.parent # openadapt-ml root + vendor_dir = repo_root / "vendor" / "WindowsAgentArena" / "src" / "win-arena-container" + + if not vendor_dir.exists(): + print(f" ✗ Vendor submodule not found at {vendor_dir}") + print(f" Run: git submodule update --init --recursive") + sys.exit(1) + + print(f" Copying build files to VM...") + + # Create build directory structure on VM + setup_cmd = "rm -rf ~/waa-build && mkdir -p ~/waa-build/vendor/WindowsAgentArena/src/win-arena-container ~/waa-build/openadapt_ml/benchmarks/waa_deploy" + subprocess.run(["ssh", *SSH_OPTS, f"azureuser@{ip}", setup_cmd], capture_output=True) + + # Copy vendor submodule files (entry scripts, client, vm/setup) + for item in ["entry.sh", "entry_setup.sh", "start_client.sh", "start_vm.sh", "client", "vm"]: + src = vendor_dir / item + if src.exists(): + subprocess.run( + ["scp", "-r", *SSH_OPTS, str(src), + f"azureuser@{ip}:~/waa-build/vendor/WindowsAgentArena/src/win-arena-container/"], + capture_output=True + ) + + # Copy our waa_deploy files (Dockerfile, api_agent.py, start_waa_server.bat) + for item in ["Dockerfile", "api_agent.py", "start_waa_server.bat"]: + src = waa_deploy_dir / item + if src.exists(): + subprocess.run( + ["scp", *SSH_OPTS, str(src), + f"azureuser@{ip}:~/waa-build/openadapt_ml/benchmarks/waa_deploy/"], + capture_output=True + ) + + print(f" ✓ Build files copied") + + # Build with standalone Dockerfile + build_cmd = ( + f"cd ~/waa-build && docker build --pull " + f"-t {WAA_DOCKER_IMAGE} -f openadapt_ml/benchmarks/waa_deploy/Dockerfile ." 
+ ) + if not stream_docker_build(ip, build_cmd, WAA_DOCKER_IMAGE, timeout=600): + print(f" ✗ Failed to build {WAA_DOCKER_IMAGE}") + sys.exit(1) + + # Verify image exists + verify_result = subprocess.run( + ["ssh", *SSH_OPTS, f"azureuser@{ip}", + f"docker images {WAA_DOCKER_IMAGE} --format '{{{{.Repository}}}}:{{{{.Tag}}}}' | head -1"], + capture_output=True, text=True, timeout=30 + ) + if verify_result.stdout.strip() == WAA_DOCKER_IMAGE: + print(f" ✓ {WAA_DOCKER_IMAGE} built successfully!") + else: + print(f" ✗ CRITICAL: Build appeared to succeed but image not found!") + print(f" Expected: {WAA_DOCKER_IMAGE}") + print(f" Found: {verify_result.stdout.strip() or 'nothing'}") + sys.exit(1) + print(f"\n{'='*60}") print(" WAA Setup Complete!") print(f"{'='*60}") print(f"\n VM IP: {ip}") - print(f"\n Next step: Prepare Windows image (one-time, ~20 min):") - print(f" uv run python -m openadapt_ml.benchmarks.cli vm prepare-windows") - print(f"\n Or run WAA directly (will auto-prepare on first run):") + print(f" Image: {WAA_DOCKER_IMAGE} ✓") + print(f"\n Run WAA benchmark:") print(f" uv run python -m openadapt_ml.benchmarks.cli vm run-waa --num-tasks 5") else: @@ -2746,10 +3172,10 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: capture_output=True, text=True ) - build_cmd = ''' + build_cmd = f''' mkdir -p ~/build-waa cp -r ~/WindowsAgentArena/src/win-arena-container/vm ~/build-waa/ -cd ~/build-waa && docker build --no-cache --pull -t waa-auto:latest . 2>&1 | tail -10 +cd ~/build-waa && docker build --no-cache --pull -t {WAA_DOCKER_IMAGE} . 2>&1 | tail -10 ''' result = subprocess.run( ["ssh", *SSH_OPTS, f"azureuser@{ip}", build_cmd], @@ -2759,7 +3185,7 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: print(f" ✗ Failed to build image: {result.stderr}") print(f" Output: {result.stdout}") sys.exit(1) - print(" ✓ WAA image built (waa-auto:latest)") + print(f" ✓ WAA image built ({WAA_DOCKER_IMAGE})") # Step 2: Stop existing container and clean up for fresh install # Use /mnt/waa-storage for temp disk (115GB) instead of ~/waa-storage (root, <10GB) @@ -2781,8 +3207,15 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: # Use VERSION=11e for Windows 11 Enterprise (accepts GVLK keys, no product key dialog) # Note: VERSION=11 would download Pro, which also works but is less suitable for benchmarks print("\n[3/4] Starting automated WAA container...") - docker_cmd = '''docker run -d \ + print(f" Using image: {WAA_DOCKER_IMAGE}") + + # Ensure log directory exists for persistent logging + ensure_log_dir(ip) + logging_opts = get_docker_logging_opts("winarena") + + docker_cmd = f'''docker run -d \ --name winarena \ + {logging_opts} \ --device=/dev/kvm \ --cap-add NET_ADMIN \ -p 8006:8006 \ @@ -2794,7 +3227,7 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: -e RAM_SIZE=12G \ -e CPU_CORES=4 \ -e DISK_SIZE=64G \ - waa-auto:latest''' + {WAA_DOCKER_IMAGE}''' result = subprocess.run( ["ssh", *SSH_OPTS, f"azureuser@{ip}", docker_cmd], @@ -2805,6 +3238,9 @@ def setup_single_vm(worker_name: str, ip: str, api_key: str) -> bool: sys.exit(1) print(" ✓ WAA container started") + # Validate the container is running the correct image + validate_running_container(ip) + # Step 4: Wait for Windows to boot (Enterprise edition with GVLK should skip product key) print("\n[4/5] Waiting for Windows to boot...") print(" Using Windows 11 Enterprise with GVLK key (should skip product key dialog)") @@ -2949,8 +3385,15 @@ def write_live_status(status: 
str, phase: str = None, detail: str = None, sys.exit(1) ip = result.stdout.strip() + # Check disk space BEFORE starting Windows (prevents "Not enough free space" error) + # This auto-cleans Docker if space is low + write_live_status("setup", phase="checking_disk", detail="Checking disk space...") + if not ensure_sufficient_disk_space(ip, min_gb=15, auto_clean=True): + write_live_status("error", phase="disk_space", detail="Insufficient disk space") + sys.exit(1) + num_tasks = args.num_tasks - model = getattr(args, 'model', 'gpt-4o') + model = getattr(args, 'model', None) or settings.waa_default_model agent = getattr(args, 'agent', 'navi') open_viewer = getattr(args, 'open', True) port = getattr(args, 'port', 8765) @@ -2997,7 +3440,7 @@ def write_live_status(status: str, phase: str = None, detail: str = None, # Use the proper server from local.py that has /api/benchmark-live endpoint server_process = None if open_viewer: - print(f"\n Launching benchmark viewer at http://localhost:{port}/benchmark.html") + print(f"\n Launching benchmark viewer at http://localhost:{port}/azure_ops.html") def start_server(): # Use the full-featured server from local.py with API endpoints @@ -3022,12 +3465,12 @@ def start_server(): time.sleep(1) # Open browser - webbrowser.open(f"http://localhost:{port}/benchmark.html") + webbrowser.open(f"http://localhost:{port}/azure_ops.html") print() # Ensure Docker is running (may not auto-start after VM restart) - print("[1/5] Ensuring Docker is running...", flush=True) + print("[1/6] Ensuring Docker is running...", flush=True) write_live_status("setup", phase="docker", detail="Ensuring Docker is running...", total_tasks=num_tasks) if not ensure_docker_running(ip): write_live_status("error", phase="docker", detail="Docker is not running and could not be started") @@ -3036,10 +3479,37 @@ def start_server(): sys.exit(1) print(" ✓ Docker is running", flush=True) + # Check and ensure disk space - 50GB required for: + # - Windows 11 ISO download and VM disk: ~30GB + # - WAA /models directory (OmniParser, SoM): ~15GB + # - Docker build cache and buffer: ~5GB + MIN_DISK_GB = 50.0 + print("[2/6] Checking disk space...", flush=True) + write_live_status("setup", phase="disk_check", detail="Checking disk space...", total_tasks=num_tasks) + from openadapt_ml.benchmarks.disk_manager import DiskManager + dm = DiskManager(vm_ip=ip) + status = dm.get_disk_status() + free_gb = status.mnt_free_gb if status.mnt_total_gb > 0 else status.root_free_gb + if free_gb < MIN_DISK_GB: + print(f" Current free space: {free_gb:.1f}GB (need {MIN_DISK_GB:.0f}GB)", flush=True) + print(" Running auto-cleanup...", flush=True) + cleanup_result = dm.check_and_cleanup( + min_free_gb=MIN_DISK_GB, + callback=lambda msg: print(f" {msg}", flush=True) + ) + if not cleanup_result.target_met: + write_live_status("error", phase="disk_check", detail=f"Insufficient disk space: {cleanup_result.final_free_gb:.1f}GB free (need {MIN_DISK_GB:.0f}GB)") + print(f" Insufficient disk space after cleanup: {cleanup_result.final_free_gb:.1f}GB", flush=True) + print(" Consider: vm docker-move or vm reset-windows", flush=True) + sys.exit(1) + print(f" ✓ Disk space OK: {cleanup_result.final_free_gb:.1f}GB free", flush=True) + else: + print(f" ✓ Disk space OK: {free_gb:.1f}GB free", flush=True) + # Stop any existing container fresh = getattr(args, 'fresh', False) - step = 2 - print(f"[{step}/5] Stopping any existing WAA container...") + step = 3 + print(f"[{step}/6] Stopping any existing WAA container...") write_live_status("setup", 
phase="container_stop", detail="Stopping any existing WAA container...", total_tasks=num_tasks) subprocess.run( ["ssh", *SSH_OPTS, f"azureuser@{ip}", @@ -3050,7 +3520,7 @@ def start_server(): # If --fresh, delete Windows storage and reinstall if fresh: step += 1 - print(f"\n[{step}/5] Deleting Windows storage for fresh install...") + print(f"\n[{step}/6] Deleting Windows storage for fresh install...") cleanup_cmd = """ # Ensure storage is on /mnt sudo mkdir -p /mnt/waa-storage @@ -3073,11 +3543,11 @@ def start_server(): # Ensure waa-auto image exists (auto-rebuild if needed) rebuild = getattr(args, 'rebuild', False) - print("[3/5] Checking waa-auto Docker image...", flush=True) - write_live_status("setup", phase="image_check", detail="Checking waa-auto Docker image...", total_tasks=num_tasks) + print(f"[4/6] Checking {WAA_DOCKER_IMAGE} Docker image...", flush=True) + write_live_status("setup", phase="image_check", detail=f"Checking {WAA_DOCKER_IMAGE} Docker image...", total_tasks=num_tasks) # Check if waa-auto exists and is recent (built with current dockurr/windows) - check_image_cmd = "docker images waa-auto:latest --format '{{.ID}}' | head -1" + check_image_cmd = f"docker images {WAA_DOCKER_IMAGE} --format '{{{{.ID}}}}' | head -1" check_result = subprocess.run( ["ssh", *SSH_OPTS, f"azureuser@{ip}", check_image_cmd], capture_output=True, text=True @@ -3089,134 +3559,141 @@ def start_server(): waa_auto_exists = False # Force rebuild if not waa_auto_exists: - print(" waa-auto image not found, building...") + print(f" {WAA_DOCKER_IMAGE} image not found, building...") - # Copy Dockerfile and api_agent.py to VM + # Copy build files to VM (standalone approach - no COPY --from dependencies) + # Structure on VM: ~/waa-build/{vendor/..., openadapt_ml/...} + from pathlib import Path waa_deploy_dir = Path(__file__).parent / "waa_deploy" - dockerfile_path = waa_deploy_dir / "Dockerfile" - api_agent_path = waa_deploy_dir / "api_agent.py" - if dockerfile_path.exists(): - scp_result = subprocess.run( - ["scp", *SSH_OPTS, - str(dockerfile_path), f"azureuser@{ip}:~/Dockerfile.waa"], - capture_output=True, text=True - ) - if scp_result.returncode != 0: - print(f" ✗ Failed to copy Dockerfile: {scp_result.stderr}") - sys.exit(1) - - # Copy api_agent.py (required by Dockerfile) - if api_agent_path.exists(): - scp_result = subprocess.run( - ["scp", *SSH_OPTS, - str(api_agent_path), f"azureuser@{ip}:~/api_agent.py"], - capture_output=True, text=True + repo_root = Path(__file__).parent.parent.parent # openadapt-ml root + vendor_dir = repo_root / "vendor" / "WindowsAgentArena" / "src" / "win-arena-container" + + if not vendor_dir.exists(): + print(f" ✗ Vendor submodule not found at {vendor_dir}") + print(f" Run: git submodule update --init --recursive") + sys.exit(1) + + print(f" Copying build files to VM...") + + # Create build directory structure on VM + setup_cmd = "rm -rf ~/waa-build && mkdir -p ~/waa-build/vendor/WindowsAgentArena/src/win-arena-container ~/waa-build/openadapt_ml/benchmarks/waa_deploy" + subprocess.run(["ssh", *SSH_OPTS, f"azureuser@{ip}", setup_cmd], capture_output=True) + + # Copy vendor submodule files (entry scripts, client, vm/setup) + for item in ["entry.sh", "entry_setup.sh", "start_client.sh", "start_vm.sh", "client", "vm"]: + src = vendor_dir / item + if src.exists(): + subprocess.run( + ["scp", "-r", *SSH_OPTS, str(src), + f"azureuser@{ip}:~/waa-build/vendor/WindowsAgentArena/src/win-arena-container/"], + capture_output=True ) - if scp_result.returncode != 0: - print(f" ✗ Failed to 
copy api_agent.py: {scp_result.stderr}") - sys.exit(1) - else: - print(f" ✗ api_agent.py not found at {api_agent_path}") - sys.exit(1) - - # Build the image (using /home/azureuser as context to avoid /tmp issues) - print(" Building waa-auto image (streaming output)...") - print(" This may take 5-15 minutes for first build (model weights are ~2GB)...") - print(flush=True) - - # Update live status for build phase - write_live_status("setup", phase="docker_build", - detail="Building waa-auto Docker image...", - total_tasks=num_tasks) - - # Stream build output so user can see progress - # Note: --progress=plain requires BuildKit; fall back to legacy builder without it - # SSH_OPTS already includes keepalive settings to prevent timeout during long builds - build_cmd = "cd ~ && docker build --pull -t waa-auto:latest -f ~/Dockerfile.waa . 2>&1" - build_process = subprocess.Popen( - ["ssh", *SSH_OPTS, f"azureuser@{ip}", build_cmd], - stdout=subprocess.PIPE, - stderr=subprocess.STDOUT, - text=True, - bufsize=1 - ) - build_output = [] - last_status_update = 0 - try: - while True: - line = build_process.stdout.readline() - if not line and build_process.poll() is not None: - break - if line: - line = line.rstrip() - build_output.append(line) - # Show key progress lines - if any(x in line.lower() for x in ['step', 'copying', 'downloading', '#', 'cached', 'done', 'error', 'failed']): - print(f" {line[-100:]}", flush=True) # Truncate long lines, flush immediately - # Update live status periodically (every 10 lines) - if len(build_output) - last_status_update >= 10: - last_status_update = len(build_output) - write_live_status("setup", phase="docker_build", - detail=f"Building... {line[-60:]}", - total_tasks=num_tasks) - except subprocess.TimeoutExpired: - build_process.kill() - print(" ✗ Build timed out after 30 minutes", flush=True) - write_live_status("error", phase="docker_build", detail="Build timed out") - sys.exit(1) - - full_output = '\n'.join(build_output) - if "Successfully tagged waa-auto:latest" in full_output or \ - "naming to docker.io/library/waa-auto:latest" in full_output: - print() - print(" ✓ waa-auto image built successfully") - else: - print() - print(" Last 20 lines of build output:") - for line in build_output[-20:]: - print(f" {line}") - print() - print(" ✗ CRITICAL: waa-auto build failed!") - print(" The official windowsarena/winarena image is BROKEN (uses outdated dockurr/windows v0.00)") - print(" The waa-auto image is REQUIRED for Windows 11 to auto-download.") - print() - print(" Troubleshooting:") - print(" 1. Check Docker storage: uv run python -m openadapt_ml.benchmarks.cli vm diag") - print(" 2. If disk full: uv run python -m openadapt_ml.benchmarks.cli vm fix-storage") - print(" 3. Clean Docker: ssh azureuser@ 'docker system prune -af'") - print(" 4. 
Retry: uv run python -m openadapt_ml.benchmarks.cli vm run-waa --rebuild") - sys.exit(1) + # Copy our waa_deploy files (Dockerfile, api_agent.py, start_waa_server.bat) + for item in ["Dockerfile", "api_agent.py", "start_waa_server.bat"]: + src = waa_deploy_dir / item + if src.exists(): + subprocess.run( + ["scp", *SSH_OPTS, str(src), + f"azureuser@{ip}:~/waa-build/openadapt_ml/benchmarks/waa_deploy/"], + capture_output=True + ) + + print(f" ✓ Build files copied") + + # Build the image from waa-build directory + print(f" Building {WAA_DOCKER_IMAGE} image (streaming output)...") + print(" This takes ~5 minutes (no large model downloads)...") + print(flush=True) + + # Update live status for build phase + write_live_status("setup", phase="docker_build", + detail=f"Building {WAA_DOCKER_IMAGE} Docker image...", + total_tasks=num_tasks) + + # Build with Dockerfile path relative to build context + cache_flag = "--no-cache" if rebuild else "" + build_cmd = f"cd ~/waa-build && docker build --pull {cache_flag} -t {WAA_DOCKER_IMAGE} -f openadapt_ml/benchmarks/waa_deploy/Dockerfile . 2>&1 | tee /tmp/waa_build.log" + build_process = subprocess.Popen( + ["ssh", *SSH_OPTS, f"azureuser@{ip}", build_cmd], + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + text=True, + bufsize=1 + ) + + build_output = [] + last_status_update = 0 + try: + while True: + line = build_process.stdout.readline() + if not line and build_process.poll() is not None: + break + if line: + line = line.rstrip() + build_output.append(line) + # Show key progress lines + if any(x in line.lower() for x in ['step', 'copying', 'downloading', '#', 'cached', 'done', 'error', 'failed']): + print(f" {line[-100:]}", flush=True) # Truncate long lines, flush immediately + # Update live status periodically (every 10 lines) + if len(build_output) - last_status_update >= 10: + last_status_update = len(build_output) + write_live_status("setup", phase="docker_build", + detail=f"Building... {line[-60:]}", + total_tasks=num_tasks) + except subprocess.TimeoutExpired: + build_process.kill() + print(" ✗ Build timed out after 30 minutes", flush=True) + write_live_status("error", phase="docker_build", detail="Build timed out") + sys.exit(1) + + full_output = '\n'.join(build_output) + if f"Successfully tagged {WAA_DOCKER_IMAGE}" in full_output or \ + f"naming to docker.io/library/{WAA_DOCKER_IMAGE}" in full_output: + print() + print(f" ✓ {WAA_DOCKER_IMAGE} image built successfully") + # No model extraction needed - standalone build doesn't use windowsarena/winarena else: - print(f" ✗ CRITICAL: Dockerfile not found at {dockerfile_path}") - print(" Cannot proceed without waa-auto image.") + print() + print(" Last 20 lines of build output:") + for line in build_output[-20:]: + print(f" {line}") + print() + print(f" ✗ CRITICAL: {WAA_DOCKER_IMAGE} build failed!") + print() + print(" Troubleshooting:") + print(" 1. Check Docker storage: uv run python -m openadapt_ml.benchmarks.cli vm diag") + print(" 2. If disk full: uv run python -m openadapt_ml.benchmarks.cli vm fix-storage") + print(" 3. Clean Docker: ssh azureuser@ 'docker system prune -af'") + print(" 4. 
Retry: uv run python -m openadapt_ml.benchmarks.cli vm run-waa --rebuild") sys.exit(1) else: - print(" ✓ waa-auto image found") + print(f" ✓ {WAA_DOCKER_IMAGE} image found") # Verify waa-auto image exists (required - official image is broken) - print("[4/5] Verifying waa-auto image...") - verify_cmd = "docker images waa-auto:latest --format '{{.Repository}}:{{.Tag}}' | head -1" + print(f"[5/6] Verifying {WAA_DOCKER_IMAGE} image...") + verify_cmd = f"docker images {WAA_DOCKER_IMAGE} --format '{{{{.Repository}}}}:{{{{.Tag}}}}' | head -1" verify_result = subprocess.run( ["ssh", *SSH_OPTS, f"azureuser@{ip}", verify_cmd], capture_output=True, text=True ) - if verify_result.stdout.strip() == "waa-auto:latest": - docker_image = "waa-auto:latest" + if verify_result.stdout.strip() == WAA_DOCKER_IMAGE: + docker_image = WAA_DOCKER_IMAGE print(f" ✓ Using: {docker_image} (with dockurr/windows auto-download)") else: - print(" ✗ CRITICAL: waa-auto image not found!") + print(f" ✗ CRITICAL: {WAA_DOCKER_IMAGE} image not found!") print(" The official windowsarena/winarena image is BROKEN and cannot be used.") print() - print(" Run with --rebuild to build waa-auto:") + print(f" Run with --rebuild to build {WAA_DOCKER_IMAGE}:") print(f" uv run python -m openadapt_ml.benchmarks.cli vm run-waa --rebuild --num-tasks {num_tasks}") sys.exit(1) # Start WAA container with full benchmark run - print(f"[5/5] Starting WAA benchmark (this will take a while)...") + print(f"[6/6] Starting WAA benchmark (this will take a while)...") + print(f" Using image: {docker_image}") print(f" Agent will run {num_tasks} tasks using {model}") if open_viewer: - print(f" Viewer running at: http://localhost:{port}/benchmark.html") + print(f" Viewer running at: http://localhost:{port}/azure_ops.html") print() # Build task filtering arguments @@ -3233,8 +3710,14 @@ def start_server(): else: env_args = f'-e OPENAI_API_KEY="{api_key}"' - docker_cmd = f'''docker run --rm \ + # Ensure log directory exists for persistent logging + ensure_log_dir(ip) + logging_opts = get_docker_logging_opts("winarena") + + # Note: We don't use --rm so logs persist after container exits + docker_cmd = f'''docker run -d \ --name winarena \ + {logging_opts} \ --device=/dev/kvm \ --cap-add NET_ADMIN \ -p 8006:8006 \ @@ -3242,6 +3725,7 @@ def start_server(): -p 7200:7200 \ -v /mnt/docker/storage:/storage \ -v ~/waa-results:/results \ + -v /mnt/waa-models:/models \ {env_args} \ {docker_image} \ "/entry.sh --start-client true --model {model} --agent {agent} --result-dir /results{task_filter_args}"''' @@ -3250,11 +3734,23 @@ def start_server(): write_live_status("running", phase="benchmark", detail="Starting WAA benchmark...", total_tasks=num_tasks, tasks_completed=0) - # Use Popen to stream output and parse progress in real-time + # Start the container in detached mode first + result = subprocess.run( + ["ssh", *SSH_OPTS, f"azureuser@{ip}", f"mkdir -p ~/waa-results && {docker_cmd}"], + capture_output=True, text=True, timeout=60 + ) + if result.returncode != 0: + print(f" Failed to start container: {result.stderr}") + sys.exit(1) + container_id = result.stdout.strip() + print(f" Container started: {container_id[:12]}") + print(f" Logs persisted to: {WAA_LOG_DIR} (survives crashes)") + + # Follow container logs in real-time # SSH_OPTS includes keepalive settings (ServerAliveInterval=60, ServerAliveCountMax=10) # to prevent timeout during long benchmark runs (1.5+ hours) process = subprocess.Popen( - ["ssh", *SSH_OPTS, f"azureuser@{ip}", f"mkdir -p ~/waa-results && 
{docker_cmd}"], + ["ssh", *SSH_OPTS, f"azureuser@{ip}", "docker logs -f winarena 2>&1"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, @@ -3425,9 +3921,19 @@ def start_server(): print(" ✓ Storage moved to /mnt/waa-storage") # Step 4: Restart container with new mount + # First verify the image exists + require_waa_image(ip) + print("\n[4/4] Restarting WAA container with /mnt storage...") - docker_cmd = '''docker run -d \ + print(f" Using image: {WAA_DOCKER_IMAGE}") + + # Ensure log directory exists for persistent logging + ensure_log_dir(ip) + logging_opts = get_docker_logging_opts("winarena") + + docker_cmd = f'''docker run -d \ --name winarena \ + {logging_opts} \ --device=/dev/kvm \ --cap-add NET_ADMIN \ -p 8006:8006 \ @@ -3437,7 +3943,7 @@ def start_server(): -e RAM_SIZE=12G \ -e CPU_CORES=4 \ -e DISK_SIZE=64G \ - waa-auto:latest''' + {WAA_DOCKER_IMAGE}''' result = subprocess.run( ["ssh", *SSH_OPTS, f"azureuser@{ip}", docker_cmd], @@ -3448,6 +3954,9 @@ def start_server(): sys.exit(1) print(" ✓ WAA container restarted with /mnt storage") + # Validate the container is running the correct image + validate_running_container(ip) + print(f"\n{'='*60}") print(" Storage Fixed!") print(f"{'='*60}") @@ -3662,9 +4171,19 @@ def start_server(): print(" ✓ Disk image deleted (ISO cache preserved for faster reinstall)") # Step 3: Restart with fresh install + # First verify the image exists + require_waa_image(ip) + print("\n[3/3] Starting fresh Windows installation...") - docker_cmd = '''docker run -d \ + print(f" Using image: {WAA_DOCKER_IMAGE}") + + # Ensure log directory exists for persistent logging + ensure_log_dir(ip) + logging_opts = get_docker_logging_opts("winarena") + + docker_cmd = f'''docker run -d \ --name winarena \ + {logging_opts} \ --device=/dev/kvm \ --cap-add NET_ADMIN \ -p 8006:8006 \ @@ -3674,7 +4193,7 @@ def start_server(): -e RAM_SIZE=12G \ -e CPU_CORES=4 \ -e DISK_SIZE=64G \ - waa-auto:latest''' + {WAA_DOCKER_IMAGE}''' result = subprocess.run( ["ssh", *SSH_OPTS, f"azureuser@{ip}", docker_cmd], @@ -3685,6 +4204,9 @@ def start_server(): sys.exit(1) print(" ✓ Fresh Windows installation started") + # Validate the container is running the correct image + validate_running_container(ip) + # Wait and monitor print(f"\n VNC: http://{ip}:8006") print(" Windows will install automatically (~10-15 min)...") @@ -3867,8 +4389,10 @@ def delete_vm(name: str) -> tuple[str, bool, str]: status = "✓ deleted" if success else f"✗ FAILED: {msg}" print(f" {name}: {status}") - # Delete registry + # Delete registry and clear session registry.delete_pool() + from openadapt_ml.benchmarks.session_tracker import clear_session + clear_session() print("\n✓ Pool deleted.") elif args.action == "cleanup-stale": @@ -4157,12 +4681,16 @@ def start_server(): print(f" ⚠ Could not start tunnels: {e}") # Open browser - url = f"http://localhost:{port}/benchmark.html" + url = f"http://localhost:{port}/azure_ops.html" print(f"\n Opening: {url}") print(f" VNC: http://localhost:8006") if auto_shutdown_hours > 0: shutdown_time = datetime.now() + timedelta(hours=auto_shutdown_hours) - print(f" Auto-shutdown: {shutdown_time.strftime('%H:%M:%S')} ({auto_shutdown_hours}h)") + print(f"\n ** Auto-shutdown enabled: VM will deallocate at {shutdown_time.strftime('%H:%M:%S')} ({auto_shutdown_hours}h)") + print(f" Override with: --auto-shutdown-hours 0 (disable) or --auto-shutdown-hours 4") + else: + print(f"\n WARNING: Auto-shutdown disabled. 
VM will run until manually deallocated!") + print(f" Enable with: --auto-shutdown-hours 2") print(f"\n Press Ctrl+C to stop monitoring.\n") webbrowser.open(url) @@ -4170,6 +4698,14 @@ def start_server(): start_time = datetime.now() # Keep running to maintain dashboard and show probe status + # Track last disk check time (check every 5 minutes) + last_disk_check = datetime.now() + disk_check_interval = timedelta(minutes=5) + disk_warning_shown = False + # Track auto-shutdown warnings + shutdown_warning_15m_shown = False + shutdown_warning_5m_shown = False + try: while True: ip = get_vm_ip(resource_group, vm_name) @@ -4179,27 +4715,411 @@ def start_server(): if ip: is_ready, response = check_waa_probe(ip, internal_ip='172.30.0.2') status = "READY" if is_ready else "waiting..." - print(f" [{time.strftime('%H:%M:%S')}] WAA: {status} | Elapsed: {elapsed_str} ", end='\r') + + # Periodic disk check (every 5 minutes) + disk_info = "" + if datetime.now() - last_disk_check > disk_check_interval: + try: + from openadapt_ml.benchmarks.disk_manager import DiskManager + dm = DiskManager(vm_ip=ip) + disk_status = dm.get_disk_status() + free_gb = disk_status.mnt_free_gb if disk_status.mnt_total_gb > 0 else disk_status.root_free_gb + if disk_status.is_critical: + disk_info = f" | DISK CRITICAL: {free_gb:.0f}GB free!" + if not disk_warning_shown: + print(f"\n\n WARNING: Disk space critically low ({free_gb:.0f}GB free)") + print(f" Run: vm auto-cleanup --min-free 50\n") + disk_warning_shown = True + elif disk_status.is_warning: + disk_info = f" | Disk: {free_gb:.0f}GB" + last_disk_check = datetime.now() + except Exception: + pass # Silently ignore disk check failures + + print(f" [{time.strftime('%H:%M:%S')}] WAA: {status} | Elapsed: {elapsed_str}{disk_info} ", end='\r') else: print(f" [{time.strftime('%H:%M:%S')}] VM not found | Elapsed: {elapsed_str} ", end='\r') - # Check auto-shutdown timeout - if auto_shutdown_hours > 0 and elapsed.total_seconds() >= auto_shutdown_hours * 3600: - print(f"\n\n Auto-shutdown triggered after {auto_shutdown_hours}h") - deallocate_result = subprocess.run( - ["az", "vm", "deallocate", "-g", resource_group, "-n", vm_name, "--no-wait"], - capture_output=True, text=True - ) - if deallocate_result.returncode == 0: - print(f" ✓ VM '{vm_name}' deallocation initiated") - else: - print(f" ✗ Failed to deallocate: {deallocate_result.stderr[:50]}") - break + # Check auto-shutdown warnings and timeout + if auto_shutdown_hours > 0: + remaining_seconds = (auto_shutdown_hours * 3600) - elapsed.total_seconds() + + # 15-minute warning + if not shutdown_warning_15m_shown and remaining_seconds <= 15 * 60: + shutdown_warning_15m_shown = True + print(f"\n\n ** AUTO-SHUTDOWN WARNING: VM will deallocate in ~15 minutes") + print(f" To extend: Ctrl+C then restart with --auto-shutdown-hours {auto_shutdown_hours + 1}\n") + + # 5-minute warning + if not shutdown_warning_5m_shown and remaining_seconds <= 5 * 60: + shutdown_warning_5m_shown = True + print(f"\n\n ** AUTO-SHUTDOWN IMMINENT: VM will deallocate in ~5 minutes") + print(f" To prevent: Ctrl+C then restart with --auto-shutdown-hours 0\n") + + # Final shutdown + if remaining_seconds <= 0: + print(f"\n\n Auto-shutdown triggered after {auto_shutdown_hours}h") + deallocate_result = subprocess.run( + ["az", "vm", "deallocate", "-g", resource_group, "-n", vm_name, "--no-wait"], + capture_output=True, text=True + ) + if deallocate_result.returncode == 0: + print(f" ✓ VM '{vm_name}' deallocation initiated") + else: + print(f" ✗ Failed to deallocate: 
{deallocate_result.stderr[:50]}") + break time.sleep(10) except KeyboardInterrupt: print("\n\n Monitoring stopped.") + elif args.action == "watch": + # Simple TUI for monitoring WAA status - no web dashboard, just terminal + import time + from datetime import datetime + from pathlib import Path + + # ANSI color codes + BOLD = "\033[1m" + DIM = "\033[2m" + RED = "\033[91m" + GREEN = "\033[92m" + YELLOW = "\033[93m" + BLUE = "\033[94m" + CYAN = "\033[96m" + RESET = "\033[0m" + CLEAR_SCREEN = "\033[2J\033[H" + + def format_elapsed(seconds: float) -> str: + """Format seconds into human-readable time.""" + hours = int(seconds // 3600) + minutes = int((seconds % 3600) // 60) + secs = int(seconds % 60) + if hours > 0: + return f"{hours}h {minutes}m {secs}s" + elif minutes > 0: + return f"{minutes}m {secs}s" + else: + return f"{secs}s" + + def get_status_color(status: str) -> str: + """Get color for status string.""" + status_lower = status.lower() + if "ready" in status_lower or "complete" in status_lower: + return GREEN + elif "error" in status_lower or "fail" in status_lower or "critical" in status_lower: + return RED + elif "building" in status_lower or "running" in status_lower: + return YELLOW + elif "waiting" in status_lower or "idle" in status_lower: + return CYAN + else: + return RESET + + def detect_docker_build(ip: str) -> dict: + """Detect Docker build progress on VM.""" + import re + result = { + "building": False, + "step": None, + "total_steps": None, + "log_lines": [], + "current_action": None, + } + + try: + # Check if docker build is running and get logs + cmd = ( + "ps aux | grep '[d]ocker build' | grep -v grep >/dev/null 2>&1 && echo 'BUILDING' || echo 'NOT_BUILDING'; " + "tail -30 /tmp/build.log 2>/dev/null || tail -30 /tmp/waa_build.log 2>/dev/null || echo ''" + ) + proc = subprocess.run( + ["ssh", *SSH_OPTS, "-o", "ConnectTimeout=5", f"azureuser@{ip}", cmd], + capture_output=True, text=True, timeout=15 + ) + + if proc.returncode == 0: + lines = proc.stdout.strip().split('\n') + if lines and lines[0].strip() == "BUILDING": + result["building"] = True + # Parse build logs (skip first line which is BUILDING/NOT_BUILDING) + log_lines = [l.strip() for l in lines[1:] if l.strip()] + result["log_lines"] = log_lines[-20:] + + # Extract step info from build logs + for line in reversed(log_lines): + # Match "Step X/Y :" pattern + step_match = re.search(r'Step\s+(\d+)/(\d+)\s*:', line) + if step_match: + result["step"] = step_match.group(1) + result["total_steps"] = step_match.group(2) + break + # Match "==> Building step X" pattern + step_match2 = re.search(r'Building step\s+(\d+)', line, re.IGNORECASE) + if step_match2: + result["step"] = step_match2.group(1) + break + + # Get current action (last meaningful line) + for line in reversed(log_lines): + if line and not line.startswith('#') and len(line) > 5: + result["current_action"] = line[:80] + break + + except (subprocess.TimeoutExpired, Exception): + pass + + return result + + def detect_benchmark(ip: str) -> dict: + """Detect benchmark running status.""" + import re + result = { + "running": False, + "tasks_completed": 0, + "total_tasks": 0, + "current_task": None, + "log_lines": [], + } + + try: + cmd = ( + "pgrep -f 'python.*run.py' >/dev/null 2>&1 && echo 'RUNNING' || echo 'NOT_RUNNING'; " + "tail -30 /tmp/waa_benchmark.log 2>/dev/null || echo ''" + ) + proc = subprocess.run( + ["ssh", *SSH_OPTS, "-o", "ConnectTimeout=5", f"azureuser@{ip}", cmd], + capture_output=True, text=True, timeout=15 + ) + + if proc.returncode == 0: + 
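# NOTE: the remote command prints a RUNNING/NOT_RUNNING sentinel on its
+                # first line; the remaining lines are the tail of /tmp/waa_benchmark.log.
+                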
lines = proc.stdout.strip().split('\n') + if lines and lines[0].strip() == "RUNNING": + result["running"] = True + + # Parse log lines (skip first status line) + log_lines = [l.strip() for l in lines[1:] if l.strip()] + result["log_lines"] = log_lines[-20:] + + # Extract task progress + for line in reversed(log_lines): + task_match = re.search(r'[Tt]ask\s+(\d+)\s*/\s*(\d+)', line) + if task_match: + result["tasks_completed"] = int(task_match.group(1)) + result["total_tasks"] = int(task_match.group(2)) + break + task_match2 = re.search(r'[Rr]unning\s+task\s+(\d+)\s+of\s+(\d+)', line) + if task_match2: + result["tasks_completed"] = int(task_match2.group(1)) + result["total_tasks"] = int(task_match2.group(2)) + break + + except (subprocess.TimeoutExpired, Exception): + pass + + return result + + def get_container_logs(ip: str, lines: int = 20) -> list[str]: + """Get recent container logs.""" + try: + cmd = f"docker logs winarena --tail {lines} 2>&1" + proc = subprocess.run( + ["ssh", *SSH_OPTS, "-o", "ConnectTimeout=5", f"azureuser@{ip}", cmd], + capture_output=True, text=True, timeout=15 + ) + if proc.returncode == 0 and proc.stdout.strip(): + return [l.strip() for l in proc.stdout.strip().split('\n') if l.strip()][-lines:] + except (subprocess.TimeoutExpired, Exception): + pass + return [] + + def get_container_status(ip: str) -> str: + """Get Docker container status.""" + try: + cmd = "docker inspect winarena --format '{{.State.Status}}' 2>/dev/null || echo 'not_found'" + proc = subprocess.run( + ["ssh", *SSH_OPTS, "-o", "ConnectTimeout=5", f"azureuser@{ip}", cmd], + capture_output=True, text=True, timeout=10 + ) + if proc.returncode == 0: + return proc.stdout.strip() + except (subprocess.TimeoutExpired, Exception): + pass + return "unknown" + + print(f"\n{BOLD}=== WAA Status Watch ==={RESET}") + print(f"VM: {vm_name} | Resource Group: {resource_group}") + print(f"Refresh: 2 seconds | Press Ctrl+C to exit\n") + + # Import session tracker for elapsed time/cost + from openadapt_ml.benchmarks.session_tracker import get_session, update_session_vm_state + + internal_ip = getattr(args, 'internal_ip', '172.30.0.2') + + try: + while True: + # Get VM IP + ip = get_vm_ip(resource_group, vm_name) + now = datetime.now().strftime("%H:%M:%S") + + # Clear screen and move to top + print(CLEAR_SCREEN, end='') + + # Header + print(f"{BOLD}=== WAA Status Watch ==={RESET} [{now}]") + print(f"{DIM}VM: {vm_name} | RG: {resource_group} | Refresh: 2s | Ctrl+C to exit{RESET}") + print("=" * 60) + + if not ip: + # VM not running + session = get_session() + update_session_vm_state("deallocated") + + print(f"\n{RED}VM Status:{RESET} Not running / Deallocated") + print(f"\n{YELLOW}To start:{RESET}") + print(f" uv run python -m openadapt_ml.benchmarks.cli vm start") + print(f" uv run python -m openadapt_ml.benchmarks.cli vm setup-waa") + + # Show last session info if available + if session.get("started_at"): + elapsed = session.get("elapsed_seconds", 0) + cost = session.get("cost_usd", 0) + print(f"\n{DIM}Last session:{RESET}") + print(f" Elapsed: {format_elapsed(elapsed)}") + print(f" Cost: ${cost:.2f}") + else: + # VM is running - update session + session = update_session_vm_state("running", vm_ip=ip) + elapsed = session.get("elapsed_seconds", 0) + cost = session.get("cost_usd", 0) + hourly_rate = session.get("hourly_rate_usd", 0.422) + + print(f"\n{GREEN}VM Status:{RESET} Running") + print(f" IP: {ip}") + print(f" Elapsed: {CYAN}{format_elapsed(elapsed)}{RESET}") + print(f" Cost: {YELLOW}${cost:.2f}{RESET} 
(${hourly_rate:.3f}/hr)") + + # Check container status + container_status = get_container_status(ip) + container_color = GREEN if container_status == "running" else (YELLOW if container_status == "not_found" else RED) + print(f"\n{BOLD}Container:{RESET} {container_color}{container_status}{RESET}") + + # Determine current phase + phase = "idle" + phase_detail = "" + log_lines = [] + + # Check for Docker build + build_info = detect_docker_build(ip) + if build_info["building"]: + phase = "building" + step_info = "" + if build_info["step"] and build_info["total_steps"]: + step_info = f" ({build_info['step']}/{build_info['total_steps']})" + phase_detail = f"Docker Build{step_info}" + if build_info["current_action"]: + phase_detail += f"\n {DIM}{build_info['current_action'][:60]}{RESET}" + log_lines = build_info["log_lines"] + + # Check for benchmark running + elif container_status == "running": + benchmark_info = detect_benchmark(ip) + if benchmark_info["running"]: + phase = "benchmark" + if benchmark_info["total_tasks"] > 0: + pct = (benchmark_info["tasks_completed"] / benchmark_info["total_tasks"]) * 100 + phase_detail = f"Benchmark: {benchmark_info['tasks_completed']}/{benchmark_info['total_tasks']} tasks ({pct:.0f}%)" + else: + phase_detail = "Benchmark running..." + log_lines = benchmark_info["log_lines"] + else: + # Container running but no benchmark - check WAA probe + is_ready, _ = check_waa_probe(ip, internal_ip=internal_ip) + if is_ready: + phase = "ready" + phase_detail = "WAA Server Ready" + else: + phase = "waiting" + phase_detail = "Container running (WAA not ready - check VNC)" + + # Get container logs + log_lines = get_container_logs(ip) + + elif container_status == "not_found": + phase = "no_container" + phase_detail = "No container running" + # Check if build logs exist + build_info = detect_docker_build(ip) + if build_info["log_lines"]: + log_lines = build_info["log_lines"] + + # Display phase + phase_color = get_status_color(phase) + print(f"\n{BOLD}Phase:{RESET} {phase_color}{phase.upper()}{RESET}") + if phase_detail: + print(f" {phase_detail}") + + # Progress bar for benchmark + if phase == "benchmark" and benchmark_info["total_tasks"] > 0: + pct = (benchmark_info["tasks_completed"] / benchmark_info["total_tasks"]) + bar_width = 40 + filled = int(bar_width * pct) + bar = f"[{'=' * filled}{' ' * (bar_width - filled)}]" + print(f"\n {GREEN}{bar}{RESET} {pct*100:.1f}%") + + # Progress bar for build + if phase == "building" and build_info["step"] and build_info["total_steps"]: + try: + step = int(build_info["step"]) + total = int(build_info["total_steps"]) + pct = step / total + bar_width = 40 + filled = int(bar_width * pct) + bar = f"[{'=' * filled}{' ' * (bar_width - filled)}]" + print(f"\n {YELLOW}{bar}{RESET} {pct*100:.1f}%") + except (ValueError, ZeroDivisionError): + pass + + # Show logs + if log_lines: + print(f"\n{BOLD}Recent Logs:{RESET} (last {len(log_lines)} lines)") + print("-" * 60) + for line in log_lines[-20:]: + # Truncate long lines + display_line = line[:80] + "..." 
if len(line) > 80 else line + # Color code certain log patterns + if "error" in display_line.lower() or "fail" in display_line.lower(): + print(f" {RED}{display_line}{RESET}") + elif "success" in display_line.lower() or "complete" in display_line.lower(): + print(f" {GREEN}{display_line}{RESET}") + elif "step" in display_line.lower() or "--->" in display_line: + print(f" {CYAN}{display_line}{RESET}") + else: + print(f" {DIM}{display_line}{RESET}") + print("-" * 60) + else: + print(f"\n{DIM}No recent logs available{RESET}") + + # Quick actions hint + print(f"\n{DIM}Quick actions:{RESET}") + print(f" {DIM}vm dashboard - Full web dashboard with VNC{RESET}") + print(f" {DIM}vm logs -f - Follow container logs{RESET}") + print(f" {DIM}vm ssh - SSH into VM{RESET}") + + # Wait before next refresh + time.sleep(2) + + except KeyboardInterrupt: + print(f"\n\n{YELLOW}Watch stopped.{RESET}") + # Show final cost summary + session = get_session() + if session.get("started_at"): + elapsed = session.get("elapsed_seconds", 0) + cost = session.get("cost_usd", 0) + print(f"\nSession Summary:") + print(f" Total Elapsed: {format_elapsed(elapsed)}") + print(f" Total Cost: ${cost:.2f}") + elif args.action == "exec": # Execute command in container ip = get_vm_ip(resource_group, vm_name) @@ -4265,9 +5185,12 @@ def start_server(): cleanup_cmd = "docker rm -f winarena winarena-test 2>/dev/null || true" subprocess.run(["ssh", *SSH_OPTS, f"azureuser@{ip}", cleanup_cmd], capture_output=True) + # Verify the image exists before testing + require_waa_image(ip) + # Build the same docker command as run-waa but with timeout # Note: waa-auto has ENTRYPOINT ["/bin/bash", "-c"] so we pass the command as a string - docker_cmd = '''docker run --rm \ + docker_cmd = f'''docker run --rm \ --name winarena-test \ --device=/dev/kvm \ --cap-add NET_ADMIN \ @@ -4276,10 +5199,11 @@ def start_server(): -p 7200:7200 \ -v /mnt/docker/storage:/storage \ -v ~/waa-results:/results \ - waa-auto:latest \ + {WAA_DOCKER_IMAGE} \ "/copy-oem.sh echo OEM_FILES_COPIED && ls -la /tmp/smb/"''' print(f"\n[3/3] Testing docker run with copy-oem.sh...") + print(f" Using image: {WAA_DOCKER_IMAGE}") print(f" Command: {docker_cmd[:100]}...") result = subprocess.run( @@ -4475,6 +5399,147 @@ def send_keys_string(sock, text): else: print(f"Error: {result.stderr}") + elif args.action == "update-scripts": + # Copy local waa_deploy scripts to container's Samba share + ip = get_vm_ip(resource_group, vm_name) + if not ip: + print(f"✗ VM '{vm_name}' not found. 
Run 'vm setup-waa' first.")
+            sys.exit(1)
+
+        # Get path to local scripts
+        script_dir = Path(__file__).parent / "waa_deploy"
+        scripts_to_copy = ["install.bat", "start_waa_server.bat"]
+
+        print(f"Updating scripts on VM {ip}...")
+        for script_name in scripts_to_copy:
+            local_path = script_dir / script_name
+            if not local_path.exists():
+                print(f"  ⚠ {script_name} not found locally, skipping")
+                continue
+
+            print(f"  Copying {script_name}...")
+            # Read local file and pipe to container via SSH
+            with open(local_path, 'r') as f:
+                script_content = f.read()
+
+            # Use docker exec with tee to write to Samba share
+            result = subprocess.run(
+                ["ssh", *SSH_OPTS, f"azureuser@{ip}",
+                 f"docker exec -i winarena tee /tmp/smb/{script_name} > /dev/null"],
+                input=script_content, capture_output=True, text=True
+            )
+            if result.returncode == 0:
+                print(f"  ✓ {script_name} updated")
+            else:
+                print(f"  ✗ Failed: {result.stderr}")
+
+        # Verify
+        result = subprocess.run(
+            ["ssh", *SSH_OPTS, f"azureuser@{ip}",
+             "docker exec winarena ls -la /tmp/smb/*.bat"],
+            capture_output=True, text=True
+        )
+        print(f"\nScripts in Samba share:\n{result.stdout}")
+        print("✓ Scripts updated. Use 'vm run-install' to execute install.bat")
+
+    elif args.action == "run-install":
+        # Run install.bat on Windows via QMP keystrokes: Win+R, type the
+        # \\host.lan\Data\install.bat UNC path, then Enter
+        ip = get_vm_ip(resource_group, vm_name)
+        if not ip:
+            print(f"✗ VM '{vm_name}' not found. Run 'vm setup-waa' first.")
+            sys.exit(1)
+
+        print(f"Running install.bat on Windows VM...")
+
+        # Single QMP script that does everything: Win+R, type command, Enter
+        qmp_run_script = r'''
+import socket
+import json
+import time
+
+def qmp_cmd(sock, cmd, args=None):
+    msg = {"execute": cmd}
+    if args:
+        msg["arguments"] = args
+    sock.send((json.dumps(msg) + "\n").encode())
+    time.sleep(0.08)
+    return sock.recv(4096)
+
+def send_key(sock, key):
+    qmp_cmd(sock, "send-key", {"keys": [{"type": "qcode", "data": key}]})
+
+def send_shift_key(sock, key):
+    qmp_cmd(sock, "send-key", {"keys": [{"type": "qcode", "data": "shift"}, {"type": "qcode", "data": key}]})
+
+def type_char(sock, c):
+    key_map = {"\\": "backslash", ".": "dot", ":": ("shift", "semicolon"), " ": "spc", "-": "minus"}
+    if c.isupper():
+        send_shift_key(sock, c.lower())
+    elif c.isalpha() or c.isdigit():
+        send_key(sock, c.lower() if c.isalpha() else c)
+    elif c in key_map:
+        k = key_map[c]
+        if isinstance(k, tuple):
+            send_shift_key(sock, k[1])
+        else:
+            send_key(sock, k)
+    time.sleep(0.04)
+
+# Connect
+sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+sock.connect(("localhost", 7200))
+sock.recv(4096)
+qmp_cmd(sock, "qmp_capabilities")
+
+# Win+R
+print("[1/5] Win+R...")
+qmp_cmd(sock, "send-key", {"keys": [{"type": "qcode", "data": "meta_l"}, {"type": "qcode", "data": "r"}]})
+time.sleep(0.8)
+
+# Type: \\host.lan\Data\install.bat (UNC path - Windows will handle it)
+cmd = r"\\host.lan\Data\install.bat"
+print(f"[2/5] Typing {cmd}...")
+for c in cmd:
+    type_char(sock, c)
+
+time.sleep(0.3)
+
+# Enter to submit Run dialog
+print("[3/5] Enter...")
+send_key(sock, "ret")
+
+# Wait for security dialog and press 'r' for Run
+time.sleep(1.5)
+print("[4/5] Pressing 'r' to confirm Run...")
+send_key(sock, "r")
+
+# Wait for window to open and maximize it with Win+Up
+time.sleep(2)
+print("[5/5] Maximizing window (Win+Up)...")
+qmp_cmd(sock, "send-key", {"keys": [{"type": "qcode", "data": "meta_l"}, {"type": "qcode", "data": "up"}]})
+
+sock.close()
+print("OK: install.bat launched and maximized")
+'''
+
+        print("  Sending QMP commands to Windows...")
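+        # The script runs inside the winarena container, which exposes QEMU's
+        # QMP socket on port 7200 (ARGUMENTS="-qmp tcp:0.0.0.0:7200,server,nowait"
+        # in the Dockerfile), so keystrokes reach Windows without a guest agent.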
+        result = subprocess.run(
+            ["ssh", *SSH_OPTS, f"azureuser@{ip}",
+             f"docker exec winarena python3 -c '{qmp_run_script}'"],
+            capture_output=True, text=True
+        )
+        print(result.stdout)
+        if result.stderr:
+            print(f"  stderr: {result.stderr[:200]}")
+
+        if "OK" in result.stdout:
+            print("\n✓ install.bat launched on Windows")
+        else:
+            print("\n⚠ May have failed - check VNC")
+
+        print("  Watch progress via VNC: http://localhost:8006")
+        print("  Log file: Z:\\install_log.txt")
+
     elif args.action == "logs":
         # Get VM IP
         ip = get_vm_ip(resource_group, vm_name)
@@ -4502,6 +5567,78 @@ def send_keys_string(sock, text):
         if result.returncode != 0:
             print(f"Error: {result.stderr}")
 
+    elif args.action == "crash-logs":
+        # View persisted logs from crashed containers
+        # These logs survive container restarts/crashes
+        print(f"\n=== Persisted Container Logs ===\n")
+
+        ip = get_vm_ip(resource_group, vm_name)
+        if not ip:
+            print(f"X VM '{vm_name}' not found. Run 'vm setup-waa' first.")
+            sys.exit(1)
+
+        num_lines = getattr(args, 'lines', 100)
+        container_name = "winarena"  # Could make this configurable
+
+        print(f"  VM IP: {ip}")
+        print(f"  Log directory: {WAA_LOG_DIR}")
+        print()
+
+        # Step 1: List available log files
+        print("[1/2] Available log files:")
+        list_cmd = f"ls -lh {WAA_LOG_DIR}/ 2>/dev/null || echo 'No logs found (directory may not exist)'"
+        result = subprocess.run(
+            ["ssh", *SSH_OPTS, f"azureuser@{ip}", list_cmd],
+            capture_output=True, text=True
+        )
+        print(f"  {result.stdout}")
+
+        # Step 2: Show container log from Docker's internal storage
+        # Docker stores logs in /var/lib/docker/containers/<container-id>/<container-id>-json.log
+        print(f"\n[2/2] Last {num_lines} lines from Docker daemon logs:")
+        # Get container ID (even if stopped)
+        get_id_cmd = f"docker ps -a --filter name={container_name} --format '{{{{.ID}}}}' | head -1"
+        id_result = subprocess.run(
+            ["ssh", *SSH_OPTS, f"azureuser@{ip}", get_id_cmd],
+            capture_output=True, text=True
+        )
+        container_id = id_result.stdout.strip()
+
+        if container_id:
+            # Read the Docker log file directly (works even if container crashed)
+            log_path = f"/var/lib/docker/containers/{container_id}/{container_id}-json.log"
+            log_cmd = f"sudo tail -n {num_lines} {log_path} 2>/dev/null | jq -r '.log' 2>/dev/null || sudo tail -n {num_lines} {log_path} 2>/dev/null"
+            result = subprocess.run(
+                ["ssh", *SSH_OPTS, f"azureuser@{ip}", log_cmd],
+                capture_output=True, text=True
+            )
+            if result.stdout.strip():
+                print(result.stdout)
+            else:
+                print(f"  No logs found for container {container_id}")
+                # Try docker logs as fallback (in case container still exists)
+                fallback_cmd = f"docker logs --tail {num_lines} {container_name} 2>&1"
+                fallback = subprocess.run(
+                    ["ssh", *SSH_OPTS, f"azureuser@{ip}", fallback_cmd],
+                    capture_output=True, text=True
+                )
+                if fallback.stdout.strip():
+                    print(f"\n  (Fallback from docker logs:)")
+                    print(fallback.stdout)
+        else:
+            print(f"  No container named '{container_name}' found (current or stopped)")
+            # Show what containers exist
+            ps_cmd = "docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.Image}}'"
+            ps_result = subprocess.run(
+                ["ssh", *SSH_OPTS, f"azureuser@{ip}", ps_cmd],
+                capture_output=True, text=True
+            )
+            print(f"\n  Available containers:")
+            print(f"  {ps_result.stdout}")
+
+        print(f"\n  TIP: Logs persist at {WAA_LOG_DIR} even after container crashes.")
+        print(f"       To view full logs: ssh azureuser@{ip} 'sudo cat /var/lib/docker/containers/<container-id>/<container-id>-json.log'")
+
     elif args.action == "stop-build":
         print(f"\n=== Stop 
Docker Build on VM ===\n") @@ -4619,6 +5756,240 @@ def send_keys_string(sock, text): print(f"\n VNC: http://{ip}:8006") print(f" SSH: ssh azureuser@{ip}") + elif args.action == "deploy-evaluate": + """Deploy the /evaluate endpoint to the WAA container. + + The WAA Flask server inside Windows doesn't have an /evaluate endpoint. + This command starts a sidecar Flask server on the Linux side of the + container (port 5001) that handles evaluation using WAA's evaluators. + + The WAALiveAdapter will automatically use port 5001 for evaluation. + """ + print("\n=== Deploy WAA /evaluate Endpoint ===\n") + + ip = get_vm_ip(resource_group, vm_name) + if not ip: + print(f" ✗ VM '{vm_name}' not found. Run 'vm setup-waa' first.") + sys.exit(1) + + print(f" VM IP: {ip}") + + # Check if container is running + print("\n[1/4] Checking container status...") + result = subprocess.run( + ["ssh", *SSH_OPTS, f"azureuser@{ip}", + "docker ps --filter name=winarena --format '{{.Names}}' 2>/dev/null"], + capture_output=True, text=True, timeout=30, + ) + if "winarena" not in result.stdout: + print(" ✗ Container 'winarena' not running.") + print(" Run: uv run python -m openadapt_ml.benchmarks.cli vm start-windows") + sys.exit(1) + print(" ✓ Container running") + + # Check if evaluate server is already running + print("\n[2/4] Checking existing evaluate server...") + result = subprocess.run( + ["ssh", *SSH_OPTS, f"azureuser@{ip}", + "docker exec winarena curl -s http://localhost:5001/evaluate/health 2>/dev/null || echo 'not running'"], + capture_output=True, text=True, timeout=30, + ) + if '"status": "ok"' in result.stdout or '"evaluators_loaded": true' in result.stdout: + print(" ✓ Evaluate server already running on port 5001") + print("\n /evaluate endpoint is ready!") + return + print(" Server not running, deploying...") + + # Copy the waa_server_patch.py to the container + print("\n[3/4] Deploying evaluate endpoint script...") + patch_script = Path(__file__).parent.parent.parent.parent / "openadapt-evals" / "openadapt_evals" / "server" / "waa_server_patch.py" + if not patch_script.exists(): + # Try alternative path + patch_script = Path("/Users/abrichr/oa/src/openadapt-evals/openadapt_evals/server/waa_server_patch.py") + + if not patch_script.exists(): + print(f" ✗ Script not found: {patch_script}") + print(" Ensure openadapt-evals is installed at ../openadapt-evals/") + sys.exit(1) + + # Copy script to VM + subprocess.run( + ["scp", "-o", "StrictHostKeyChecking=no", "-o", "UserKnownHostsFile=/dev/null", + str(patch_script), f"azureuser@{ip}:/tmp/waa_server_patch.py"], + capture_output=True, text=True, timeout=60, + ) + + # Copy into container + result = subprocess.run( + ["ssh", *SSH_OPTS, f"azureuser@{ip}", + "docker cp /tmp/waa_server_patch.py winarena:/waa_server_patch.py"], + capture_output=True, text=True, timeout=30, + ) + if result.returncode != 0: + print(f" ✗ Failed to copy script: {result.stderr}") + sys.exit(1) + print(" ✓ Script deployed to container") + + # Start the evaluate server in background + print("\n[4/4] Starting evaluate server on port 5001...") + # Use nohup to keep it running after SSH disconnects + # Note: use python3 as that's the executable name in Debian + start_cmd = """ + docker exec -d winarena bash -c ' + cd /client/desktop_env && + nohup python3 /waa_server_patch.py --port 5001 --host 0.0.0.0 > /tmp/evaluate_server.log 2>&1 & + sleep 2 && + curl -s http://localhost:5001/evaluate/health || cat /tmp/evaluate_server.log + ' + """ + result = subprocess.run( + ["ssh", *SSH_OPTS, 
f"azureuser@{ip}", start_cmd], + capture_output=True, text=True, timeout=60, + ) + + # Verify server started + import time + for attempt in range(5): + time.sleep(2) + result = subprocess.run( + ["ssh", *SSH_OPTS, f"azureuser@{ip}", + "docker exec winarena curl -s http://localhost:5001/evaluate/health 2>/dev/null"], + capture_output=True, text=True, timeout=30, + ) + if '"status": "ok"' in result.stdout or '"evaluators_loaded": true' in result.stdout: + print(" ✓ Evaluate server running on port 5001") + print(f"\n Health check: {result.stdout.strip()}") + break + else: + print(" ⚠ Server may not have started correctly") + print(" Check logs with: vm host-exec --cmd 'docker exec winarena cat /tmp/evaluate_server.log'") + + print("\n=== Evaluate Endpoint Deployed ===") + print() + print(" The /evaluate endpoint is now available at:") + print(f" http://{ip}:5001/evaluate (via SSH tunnel)") + print() + print(" To use with WAALiveAdapter, set the evaluate_url in config:") + print(' config.evaluate_url = "http://localhost:5001"') + print() + print(" Or update WAALiveAdapter to auto-detect the evaluate server.") + + elif args.action == "disk-status": + """Show current disk usage on the VM.""" + from openadapt_ml.benchmarks.disk_manager import DiskManager + import json as json_module + + print(f"\n=== Disk Status: {vm_name} ===\n") + + ip = get_vm_ip(resource_group, vm_name) + if not ip: + print(f" VM '{vm_name}' not found. Run 'vm setup-waa' first.") + sys.exit(1) + + print(f" VM IP: {ip}") + print() + + dm = DiskManager(vm_ip=ip) + status = dm.get_disk_status() + + if args.json: + print(json_module.dumps(status.to_dict(), indent=2)) + else: + if status.error: + print(f" Error: {status.error}") + sys.exit(1) + + print(" Filesystem Usage") + print(" " + "-" * 50) + print(f" Root (/):") + print(f" Total: {status.root_total_gb:.1f} GB") + print(f" Used: {status.root_used_gb:.1f} GB ({status.root_percent_used}%)") + print(f" Available: {status.root_free_gb:.1f} GB") + + if status.mnt_total_gb > 0: + print(f"\n /mnt (temp disk):") + print(f" Total: {status.mnt_total_gb:.1f} GB") + print(f" Used: {status.mnt_used_gb:.1f} GB ({status.mnt_percent_used}%)") + print(f" Available: {status.mnt_free_gb:.1f} GB") + + print("\n Docker Usage") + print(" " + "-" * 50) + print(f" Images: {status.docker_images_gb:.2f} GB") + print(f" Containers: {status.docker_containers_gb:.2f} GB") + print(f" Volumes: {status.docker_volumes_gb:.2f} GB") + print(f" Build cache: {status.docker_build_cache_gb:.2f} GB") + print(f" Total: {status.docker_total_gb:.2f} GB") + + print("\n Status") + print(" " + "-" * 50) + if status.is_critical: + print(" CRITICAL: Disk space critically low!") + print(" Run: vm auto-cleanup --min-free 50") + elif status.is_warning: + print(" WARNING: Disk space is low") + print(" Consider: vm auto-cleanup --min-free 50") + else: + print(" OK: Sufficient disk space available") + + elif args.action == "auto-cleanup": + """Automatically cleanup disk space to meet minimum free space requirement.""" + from openadapt_ml.benchmarks.disk_manager import DiskManager + import json as json_module + + min_free_gb = getattr(args, 'min_free', 50.0) + output_json = getattr(args, 'json', False) + + print(f"\n=== Auto Cleanup: {vm_name} ===\n") + print(f" Target: Ensure {min_free_gb:.0f} GB free space on /mnt") + print() + + ip = get_vm_ip(resource_group, vm_name) + if not ip: + print(f" VM '{vm_name}' not found. 
Run 'vm setup-waa' first.") + sys.exit(1) + + print(f" VM IP: {ip}") + print() + + dm = DiskManager(vm_ip=ip) + + def progress_callback(msg: str): + print(msg, flush=True) + + result = dm.check_and_cleanup( + min_free_gb=min_free_gb, + callback=progress_callback if not output_json else None, + ) + + if output_json: + print(json_module.dumps(result.to_dict(), indent=2)) + else: + print() + print(" " + "=" * 50) + print(" Cleanup Summary") + print(" " + "=" * 50) + print(f" Target free space: {result.target_free_gb:.0f} GB") + print(f" Final free space: {result.final_free_gb:.1f} GB") + print(f" Total recovered: {result.total_recovered_gb:.2f} GB") + print(f" Target met: {'Yes' if result.target_met else 'No'}") + + if result.steps_taken: + print("\n Steps taken:") + for step in result.steps_taken: + print(f" - {step}") + + if result.errors: + print("\n Errors:") + for error in result.errors: + print(f" - {error}") + + if not result.target_met: + print("\n WARNING: Could not free enough space.") + print(" Consider:") + print(" - vm docker-prune (aggressive Docker cleanup)") + print(" - vm docker-move (move Docker to /mnt)") + print(" - vm reset-windows (delete Windows VM and reinstall)") + def cmd_view(args: argparse.Namespace) -> None: """View benchmark results from collected data. @@ -4660,7 +6031,7 @@ def cmd_view(args: argparse.Namespace) -> None: print("\n[1/2] Generating HTML viewer...") output_path = generate_benchmark_viewer( benchmark_dir=benchmark_dir, - output_path=benchmark_dir / "benchmark.html", + output_path=benchmark_dir / "ops.html", embed_screenshots=getattr(args, 'embed_screenshots', False), ) print(f" Generated: {output_path}") @@ -4679,7 +6050,7 @@ def log_message(self, format, *log_args): try: with socketserver.TCPServer(("", port), QuietHandler) as httpd: - url = f"http://localhost:{port}/benchmark.html" + url = f"http://localhost:{port}/ops.html" print(f"\n Viewer: {url}") print(" Press Ctrl+C to stop\n") @@ -5010,7 +6381,7 @@ def main() -> None: # WAA eval VM management p_vm = subparsers.add_parser("vm", help="Manage dedicated WAA eval VM (with nested virtualization)") - p_vm.add_argument("action", choices=["monitor", "create", "status", "ssh", "delete", "deallocate", "start", "list-sizes", "setup", "pull-image", "setup-waa", "run-waa", "prepare-windows", "fix-storage", "docker-prune", "docker-move", "stop-build", "fix-oem", "reset-windows", "screenshot", "probe", "pool-status", "delete-pool", "cleanup-stale", "diag", "logs", "exec", "host-exec", "test-docker", "start-server"], help="Action to perform") + p_vm.add_argument("action", choices=["dashboard", "monitor", "watch", "create", "status", "ssh", "delete", "deallocate", "start", "list-sizes", "setup", "pull-image", "setup-waa", "run-waa", "prepare-windows", "fix-storage", "docker-prune", "docker-move", "stop-build", "fix-oem", "update-scripts", "run-install", "reset-windows", "screenshot", "probe", "pool-status", "delete-pool", "cleanup-stale", "diag", "logs", "crash-logs", "exec", "host-exec", "test-docker", "start-server", "deploy-evaluate", "disk-status", "auto-cleanup"], help="Action to perform") p_vm.add_argument("--resource-group", default="openadapt-agents", help="Azure resource group") p_vm.add_argument("--name", default="waa-eval-vm", help="VM name") p_vm.add_argument("--size", default="Standard_D4s_v3", help="VM size (must support nested virt)") @@ -5021,7 +6392,7 @@ def main() -> None: p_vm.add_argument("--num-tasks", type=int, default=5, help="Number of tasks to run (for run-waa)") 
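# Task selection: --num-tasks, --domain, and --task-ids are combined into a
    # task filter that run-waa appends to the container's /entry.sh invocation.
    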
p_vm.add_argument("--domain", choices=["general", "office", "web", "coding", "system", "creative", "data", "communication", "media", "gaming", "utility"], help="WAA domain to filter tasks (for run-waa)") p_vm.add_argument("--task-ids", help="Comma-separated task IDs to run (e.g., 'task_001,task_015,task_042') for run-waa") - p_vm.add_argument("--model", default="gpt-4o", help="Model to use (gpt-4o, gpt-5.2, etc.)") + p_vm.add_argument("--model", default=None, help="Model to use (gpt-4o, gpt-4o-mini, etc.). Defaults to WAA_DEFAULT_MODEL env var or gpt-4o") p_vm.add_argument("--agent", default="navi", choices=["navi", "api-claude", "api-openai"], help="Agent type: navi (default WAA), api-claude (Claude Sonnet 4.5), api-openai (GPT-5.1)") @@ -5039,7 +6410,7 @@ def main() -> None: p_vm.add_argument("--port", type=int, default=8765, help="Port for local dashboard server (default: 8765)") # Auto-shutdown option (for run-waa) p_vm.add_argument("--auto-shutdown", action="store_true", default=False, help="Deallocate VM after benchmark completes to save costs (for run-waa)") - p_vm.add_argument("--auto-shutdown-hours", type=float, default=0, help="For monitor: auto-deallocate VM after N hours (0=disabled)") + p_vm.add_argument("--auto-shutdown-hours", type=float, default=2.0, help="For monitor: auto-deallocate VM after N hours (default: 2, 0=disabled)") p_vm.add_argument("--rebuild", action="store_true", default=False, help="Force rebuild of waa-auto Docker image (for run-waa)") p_vm.add_argument("--fresh", action="store_true", default=False, help="Delete Windows storage and start fresh installation (for run-waa)") # Log viewing options (for logs action) @@ -5050,6 +6421,9 @@ def main() -> None: p_vm.add_argument("--vm-max-hours", type=float, default=24.0, help="For cleanup-stale: deallocate VMs running longer than this (default: 24 hours)") # Exec command option p_vm.add_argument("--cmd", help="Command to execute in container (for exec action)") + # Disk management options (for disk-status and auto-cleanup actions) + p_vm.add_argument("--min-free", type=float, default=50.0, help="For auto-cleanup: minimum free space in GB (default: 50)") + p_vm.add_argument("--json", action="store_true", default=False, help="For disk-status: output JSON format") # Benchmark viewer subcommand - for monitoring already-running benchmarks p_viewer = subparsers.add_parser("viewer", help="Launch benchmark viewer for monitoring a running VM") diff --git a/openadapt_ml/benchmarks/waa_deploy/Dockerfile b/openadapt_ml/benchmarks/waa_deploy/Dockerfile index c746e18..2078170 100644 --- a/openadapt_ml/benchmarks/waa_deploy/Dockerfile +++ b/openadapt_ml/benchmarks/waa_deploy/Dockerfile @@ -1,222 +1,159 @@ # ============================================================================= -# WAA (Windows Agent Arena) Docker Image +# WAA (Windows Agent Arena) - Standalone Dockerfile # ============================================================================= # -# This image combines: -# 1. dockurr/windows:latest - Modern base that auto-downloads Windows 11 -# 2. windowsarena/winarena:latest - Official WAA benchmark client and scripts +# DEFINITIVE APPROACH - No circular dependencies, no script patching # -# The official windowsarena/winarena uses an outdated dockurr/windows (v0.00) -# that doesn't auto-download Windows. This image fixes that while keeping -# full compatibility with the official WAA benchmark. +# This uses vanilla WAA's "dev mode" which expects files at \\host.lan\Data +# (the Samba share at /tmp/smb/). 
No patching of script paths required. # -# Usage: -# # Build the image -# docker build -t waa-auto:latest . +# Key features: +# - Uses modern dockurr/windows:latest (auto-downloads Windows 11) +# - VERSION=11e = Enterprise Evaluation (no license key required) +# - Fully unattended installation +# - No 15GB image pull required +# - Uses vanilla WAA scripts without modification (except IP address) # -# # Run benchmark (after Windows is set up) +# Build from repo root: +# docker build -f openadapt_ml/benchmarks/waa_deploy/Dockerfile -t waa:latest . +# +# Run: # docker run --rm --device=/dev/kvm --cap-add NET_ADMIN \ -# -p 8006:8006 -p 5000:5000 -p 7200:7200 \ +# -p 8006:8006 -p 5000:5000 \ # -v /path/to/storage:/storage \ -# -e OPENAI_API_KEY="your-key" \ -# waa-auto:latest \ -# "/entry.sh --start-client true --model gpt-4o --num-tasks 5" +# waa:latest # # ============================================================================= FROM dockurr/windows:latest # ----------------------------------------------------------------------------- -# Copy official WAA components from windowsarena/winarena +# Copy WAA components from LOCAL vendor submodule +# Build context must be repo root for these paths to work +# ----------------------------------------------------------------------------- + +# Entry scripts (for benchmark client on Linux side) +COPY vendor/WindowsAgentArena/src/win-arena-container/entry.sh /entry.sh +COPY vendor/WindowsAgentArena/src/win-arena-container/entry_setup.sh /entry_setup.sh +COPY vendor/WindowsAgentArena/src/win-arena-container/start_client.sh /start_client.sh +COPY vendor/WindowsAgentArena/src/win-arena-container/start_vm.sh /start_vm.sh + +# Python benchmark client (runs on Linux side) +COPY vendor/WindowsAgentArena/src/win-arena-container/client /client + +# Windows setup scripts → /waa-setup/ (staging location) +# These get copied to /tmp/smb/ (Samba share) at container startup +# Windows accesses them at \\host.lan\Data +COPY vendor/WindowsAgentArena/src/win-arena-container/vm/setup /waa-setup + +# Our custom startup script (more robust than vanilla) +COPY openadapt_ml/benchmarks/waa_deploy/start_waa_server.bat /waa-setup/start_waa_server.bat + +# ----------------------------------------------------------------------------- +# Use DEV unattend XML (expects files at \\host.lan\Data, not C:\oem) +# This is vanilla WAA's approach for local development +# ----------------------------------------------------------------------------- + +COPY vendor/WindowsAgentArena/src/win-arena-container/vm/unattend-files/dev_win11x64-enterprise-eval.xml /run/assets/win11x64-enterprise-eval.xml + +# ----------------------------------------------------------------------------- +# Patch IP addresses: WAA uses 20.20.20.21, dockurr/windows uses 172.30.0.2 +# This is the ONLY patching we need to do # ----------------------------------------------------------------------------- -# Copy benchmark client scripts -COPY --from=windowsarena/winarena:latest /entry.sh /entry.sh -COPY --from=windowsarena/winarena:latest /entry_setup.sh /entry_setup.sh -COPY --from=windowsarena/winarena:latest /start_client.sh /start_client.sh +RUN sed -i 's|20\.20\.20\.21|172.30.0.2|g' /entry.sh /entry_setup.sh /start_client.sh && \ + find /client -name "*.py" -exec sed -i 's|20\.20\.20\.21|172.30.0.2|g' {} \; && \ + chmod +x /*.sh && \ + echo "Patched IP addresses to 172.30.0.2" -# Copy the Python benchmark client code -COPY --from=windowsarena/winarena:latest /client /client +# 
-----------------------------------------------------------------------------
+# Copy WAA setup files to Samba share when container starts
+# /tmp/smb/ is created by dockurr/windows at runtime
+# Windows sees this as \\host.lan\Data
+# -----------------------------------------------------------------------------
+
+RUN sed -i '/^return 0$/i \
+# Copy WAA setup files to Samba share\n\
+cp -r /waa-setup/* /tmp/smb/ 2>/dev/null || true\n\
+chmod -R 777 /tmp/smb/ 2>/dev/null || true\n\
+echo "WAA setup files copied to Samba share"' /run/samba.sh && \
+    echo "Added WAA file copy to samba.sh"
+
+# -----------------------------------------------------------------------------
+# Port forwarding: Forward port 5000 from container to Windows VM
+# dockurr/windows doesn't auto-forward ports to the QEMU VM
+# -----------------------------------------------------------------------------
+
+# NOTE: netcat-openbsd (installed below) has no -c option for spawning a second
+# nc, so the forwarder relays through a FIFO instead; "nc -lp 5000 -c ..." would
+# exit immediately with an unknown-option error and spin in the retry loop.
+RUN printf '#!/bin/bash\n\
+# Wait for Windows VM to get IP\n\
+while ! grep -q "172.30.0.2" /var/lib/misc/dnsmasq.leases 2>/dev/null; do\n\
+    sleep 5\n\
+done\n\
+echo "Windows VM is up, starting port forwarder"\n\
+# Forward port 5000 to Windows VM via a FIFO relay\n\
+mkfifo /tmp/pf5000 2>/dev/null || true\n\
+while true; do\n\
+    nc -l 5000 < /tmp/pf5000 | nc 172.30.0.2 5000 > /tmp/pf5000\n\
+    sleep 1\n\
+done\n\
+' > /port_forward.sh && chmod +x /port_forward.sh
+
+# Inject port forwarder into samba.sh (runs after network is up)
+RUN sed -i '/^return 0$/i nohup /port_forward.sh >/dev/null 2>\&1 \&' /run/samba.sh && \
+    echo "Added port forwarder to samba.sh"

-# Copy model weights (GroundingDINO, OmniParser, etc.)
-COPY --from=windowsarena/winarena:latest /models /models

+# -----------------------------------------------------------------------------
+# Install Python and dependencies for benchmark client (runs on Linux host)
+# -----------------------------------------------------------------------------

-# Copy Windows setup scripts (install.bat, setup.ps1, etc.) 
-COPY --from=windowsarena/winarena:latest /oem /oem - -# Copy OEM files AFTER dockurr/samba starts (which wipes /tmp/smb) -# Copy IMMEDIATELY (no delay) and SYNCHRONOUSLY (not backgrounded) to ensure -# files are available before Windows boots and runs FirstLogonCommands -RUN sed -i '/^return 0$/i cp -r /oem/* /tmp/smb/ 2>/dev/null || true' /run/samba.sh && \ - echo "Inserted OEM copy before return in samba.sh" - -# Copy unattend.xml for automated Windows installation -COPY --from=windowsarena/winarena:latest /run/assets/win11x64-enterprise-eval.xml /run/assets/win11x64.xml - -# ----------------------------------------------------------------------------- -# Create start_vm.sh that uses our dockurr/windows entrypoint -# ----------------------------------------------------------------------------- - -RUN printf '#!/bin/bash\n/usr/bin/tini -s /run/entry.sh\n' > /start_vm.sh && chmod +x /start_vm.sh - -# ----------------------------------------------------------------------------- -# Patch IP addresses: official uses 20.20.20.21, dockurr/windows uses 172.30.0.2 -# ----------------------------------------------------------------------------- - -# Patch entry scripts (must work - these files were just copied) -RUN sed -i 's|20.20.20.21|172.30.0.2|g' /entry_setup.sh && \ - sed -i 's|20.20.20.21|172.30.0.2|g' /entry.sh && \ - sed -i 's|20.20.20.21|172.30.0.2|g' /start_client.sh && \ - echo "Patched entry scripts" - -# Patch client Python files -RUN find /client -name "*.py" -exec sed -i 's|20.20.20.21|172.30.0.2|g' {} \; && \ - echo "Patched client Python files" - -# ----------------------------------------------------------------------------- -# Add API-backed agent support (Claude Sonnet 4.5 / GPT-5.1) -# This allows using --agent api-claude or --agent api-openai instead of navi -# ----------------------------------------------------------------------------- - -# Copy api_agent.py to the client mm_agents directory -COPY api_agent.py /client/mm_agents/api_agent.py - -# Patch run.py to support api-claude and api-openai agents -# This adds elif blocks after the "navi" agent handling -# Using Python to insert the patch with proper indentation -RUN python3 -c "import re; \ -f = open('/client/run.py', 'r'); c = f.read(); f.close(); \ -patch = ''' elif cfg_args[\"agent_name\"] in [\"api-claude\", \"api-openai\"]:\n from mm_agents.api_agent import ApiAgent\n provider = \"anthropic\" if cfg_args[\"agent_name\"] == \"api-claude\" else \"openai\"\n agent = ApiAgent(provider=provider, temperature=args.temperature)\n'''; \ -c = c.replace('raise ValueError(f\"Unknown agent name: {cfg_args', patch + ' raise ValueError(f\"Unknown agent name: {cfg_args'); \ -f = open('/client/run.py', 'w'); f.write(c); f.close(); \ -print('Patched run.py for API agents')" - -# ----------------------------------------------------------------------------- -# Fix Windows setup for automation -# ----------------------------------------------------------------------------- - -# Set password for AutoLogon (Windows 11 requires password for login) -RUN sed -i 's||docker|g' /run/assets/win11x64.xml 2>/dev/null || true -RUN sed -i 's||docker|g' /run/assets/win11x64.xml 2>/dev/null || true - -# Add firewall disable and other automation commands to FirstLogonCommands -# CRITICAL: Also create a scheduled task so WAA server starts on EVERY boot, not just first logon -RUN if grep -q "" /run/assets/win11x64.xml; then \ - LAST_ORDER=$(grep -oP "Order>\K[0-9]+" /run/assets/win11x64.xml | sort -n | tail -1) && \ - N1=$((LAST_ORDER + 1)) && \ - 
N2=$((LAST_ORDER + 2)) && \ - N3=$((LAST_ORDER + 3)) && \ - N4=$((LAST_ORDER + 4)) && \ - N5=$((LAST_ORDER + 5)) && \ - N6=$((LAST_ORDER + 6)) && \ - sed -i "s||\ - \n\ - $N1\n\ - netsh advfirewall set allprofiles state off\n\ - Disable Windows Firewall\n\ - \n\ - \n\ - $N2\n\ - powercfg /change standby-timeout-ac 0\n\ - Disable sleep\n\ - \n\ - \n\ - $N3\n\ - powercfg /change monitor-timeout-ac 0\n\ - Disable monitor timeout\n\ - \n\ - \n\ - $N4\n\ - reg add \"HKLM\\\\SOFTWARE\\\\Policies\\\\Microsoft\\\\Windows\\\\Personalization\" /v NoLockScreen /t REG_DWORD /d 1 /f\n\ - Disable lock screen\n\ - \n\ - \n\ - $N5\n\ - cmd /c start /wait \\\\\\\\host.lan\\\\Data\\\\install.bat\n\ - Run WAA setup script to install Python, Chrome, etc.\n\ - \n\ - \n\ - $N6\n\ - schtasks /create /tn \"WAAServer\" /tr \"\\\\\\\\host.lan\\\\Data\\\\start_waa_server.bat\" /sc onlogon /rl highest /f\n\ - Create scheduled task for WAA server auto-start on every boot\n\ - \n\ - \n\ - $((N6 + 1))\n\ - reg add \"HKCU\\\\SOFTWARE\\\\Microsoft\\\\Windows\\\\CurrentVersion\\\\Run\" /v WAAServer /t REG_SZ /d \"cmd /c \\\\\\\\host.lan\\\\Data\\\\start_waa_server.bat\" /f\n\ - Add registry entry for WAA server auto-start (backup)\n\ - \n\ - \n\ - $((N6 + 2))\n\ - \\\\\\\\host.lan\\\\Data\\\\start_waa_server.bat\n\ - Start WAA server immediately\n\ - \n\ - |" /run/assets/win11x64.xml; \ - fi - -# ----------------------------------------------------------------------------- -# Install Python and dependencies directly -# dockurr/windows base is Debian trixie which has Python 3.12 -# ----------------------------------------------------------------------------- - -# Install Python 3 and system dependencies RUN apt-get update && apt-get install -y --no-install-recommends \ - python3 \ - python3-venv \ - python3-pip \ - tesseract-ocr \ - libgl1 \ - libglib2.0-0 \ - libsm6 \ - libxext6 \ - libxrender-dev \ - ffmpeg \ + python3 python3-pip python3-venv \ + tesseract-ocr libgl1 libglib2.0-0 ffmpeg netcat-openbsd \ && rm -rf /var/lib/apt/lists/* \ && ln -sf /usr/bin/python3 /usr/bin/python -# Install Python dependencies for WAA client -# Using --break-system-packages since we're in a container -# Full dependency list from: github.com/microsoft/WindowsAgentArena/blob/main/src/win-arena-container/client/requirements.txt +# Install minimal Python deps for client (full deps installed in Windows) RUN pip3 install --no-cache-dir --break-system-packages \ - torch torchvision --index-url https://download.pytorch.org/whl/cpu && \ - pip3 install --no-cache-dir --break-system-packages \ - gymnasium farama-notifications cloudpickle packaging typer rich tqdm colorama \ - openai anthropic google-generativeai groq tiktoken \ - pyyaml jsonschema tenacity httpx backoff toml func-timeout wrapt-timeout-decorator \ - psutil pyperclip screeninfo mss pyautogui fabric \ - easyocr pillow pytesseract opencv-python-headless scikit-image ImageHash \ - requests flask beautifulsoup4 lxml cssselect xmltodict playwright requests-toolbelt \ - pydrive openpyxl python-docx python-pptx odfpy pypdf PyPDF2 pdfplumber pymupdf borb \ - xlrd xlwt xlsxwriter mammoth pdf2image \ - google-api-python-client google-auth-httplib2 google-auth-oauthlib gdown \ - numpy pandas scipy formulas rapidfuzz anytree addict \ - transformers accelerate "timm>=0.9.0,<1.0.0" ultralytics supervision pycocotools einops \ - mutagen pyacoustid chardet librosa fastdtw \ - py7zr LnkParse3 \ - matplotlib wandb yapf - -# Install Playwright browsers -RUN playwright install chromium + requests flask 
pyyaml pillow + +# ----------------------------------------------------------------------------- +# Add API agent support (Claude / GPT) +# ----------------------------------------------------------------------------- + +COPY openadapt_ml/benchmarks/waa_deploy/api_agent.py /client/mm_agents/api_agent.py # ----------------------------------------------------------------------------- # Environment configuration # ----------------------------------------------------------------------------- -ENV YRES="900" -ENV XRES="1440" +# VERSION=11e downloads Windows 11 Enterprise Evaluation +# This has a built-in GVLK key - NO LICENSE KEY REQUIRED +ENV VERSION="11e" + +# VM resources ENV RAM_SIZE="8G" ENV CPU_CORES="4" ENV DISK_SIZE="30G" -ENV VERSION="11e" + +# Display resolution +ENV XRES="1440" +ENV YRES="900" + +# QEMU arguments (expose QMP for monitoring) ENV ARGUMENTS="-qmp tcp:0.0.0.0:7200,server,nowait" # Expose ports +# 8006: noVNC web interface +# 5000: WAA Flask server (forwarded to Windows VM) +# 7200: QEMU QMP (monitoring) +# 3389: RDP (if needed) EXPOSE 8006 5000 7200 3389 -# Default entrypoint - copy OEM files then run entry.sh -# Use: /entry.sh --start-client true --model gpt-4o -# Or: /entry.sh --start-client false (just start Windows, no benchmark) -ENTRYPOINT ["/bin/bash", "-c"] -CMD ["/copy-oem.sh /entry.sh --start-client false"] +# ----------------------------------------------------------------------------- +# Entrypoint +# ----------------------------------------------------------------------------- + +# Use dockurr/windows default entry which handles: +# - Windows ISO download (automatic with VERSION=11e) +# - QEMU/KVM setup +# - VNC server on 8006 +# - Samba share at /tmp/smb/ (\\host.lan\Data) +ENTRYPOINT ["/usr/bin/tini", "-s", "/run/entry.sh"] diff --git a/openadapt_ml/models/providers/__init__.py b/openadapt_ml/models/providers/__init__.py index 8ca0b11..c53bfda 100644 --- a/openadapt_ml/models/providers/__init__.py +++ b/openadapt_ml/models/providers/__init__.py @@ -5,33 +5,83 @@ - OpenAI (GPT) - Google (Gemini) +The provider abstraction allows switching between different VLM backends +without changing the calling code. 
Each provider handles: +- Client creation with API key management +- Message sending with vision support +- Image encoding in provider-specific formats + Usage: from openadapt_ml.models.providers import get_provider + # Get a provider and send a message provider = get_provider("anthropic") client = provider.create_client(api_key) - response = provider.send_message(client, model, system, content) + response = provider.send_message( + client, + model="claude-opus-4-5-20251101", + system="You are a GUI agent.", + content=provider.build_content( + text="Click the submit button", + image=screenshot, + ), + ) + + # Or use the quick_message helper + response = provider.quick_message( + api_key=key, + model="claude-opus-4-5-20251101", + prompt="What's in this image?", + image=screenshot, + ) + +Model Aliases: + Common model aliases are provided for convenience: + - "claude-opus-4.5" -> ("anthropic", "claude-opus-4-5-20251101") + - "gpt-5.2" -> ("openai", "gpt-5.2") + - "gemini-3-pro" -> ("google", "gemini-3-pro") """ from __future__ import annotations from typing import TYPE_CHECKING -from openadapt_ml.models.providers.base import BaseAPIProvider +from openadapt_ml.models.providers.base import ( + BaseAPIProvider, + ProviderError, + AuthenticationError, + RateLimitError, + ModelNotFoundError, +) from openadapt_ml.models.providers.anthropic import AnthropicProvider from openadapt_ml.models.providers.openai import OpenAIProvider from openadapt_ml.models.providers.google import GoogleProvider if TYPE_CHECKING: - pass + from PIL import Image __all__ = [ + # Base classes and exceptions "BaseAPIProvider", + "ProviderError", + "AuthenticationError", + "RateLimitError", + "ModelNotFoundError", + # Provider implementations "AnthropicProvider", "OpenAIProvider", "GoogleProvider", + # Factory functions "get_provider", + "get_provider_for_model", + "resolve_model_alias", + # Registries "PROVIDERS", + "MODEL_ALIASES", + # Convenience functions + "quick_message", + "list_providers", + "list_models", ] # Provider registry @@ -42,14 +92,17 @@ } # Model aliases for convenience +# Maps friendly names to (provider, model_id) tuples MODEL_ALIASES: dict[str, tuple[str, str]] = { # Anthropic "claude-opus-4.5": ("anthropic", "claude-opus-4-5-20251101"), "claude-sonnet-4.5": ("anthropic", "claude-sonnet-4-5-20250929"), + "claude-haiku-3.5": ("anthropic", "claude-haiku-3-5-20241022"), # OpenAI "gpt-5.2": ("openai", "gpt-5.2"), "gpt-5.1": ("openai", "gpt-5.1"), "gpt-4o": ("openai", "gpt-4o"), + "gpt-4o-mini": ("openai", "gpt-4o-mini"), # Google "gemini-3-pro": ("google", "gemini-3-pro"), "gemini-3-flash": ("google", "gemini-3-flash"), @@ -88,22 +141,150 @@ def resolve_model_alias(alias: str) -> tuple[str, str]: """Resolve a model alias to (provider, model_id). Args: - alias: Model alias (e.g., 'claude-opus-4.5'). + alias: Model alias (e.g., 'claude-opus-4.5') or full model ID. Returns: Tuple of (provider_name, model_id). Raises: - ValueError: If alias is not recognized. + ValueError: If alias is not recognized and can't be inferred. 
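Note:
        Unknown model names that carry a known prefix (claude-*, gpt-*,
        gemini-*) are passed through to the matching provider unchanged.

    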
Example: >>> resolve_model_alias("claude-opus-4.5") ('anthropic', 'claude-opus-4-5-20251101') + >>> resolve_model_alias("gemini-3-pro") + ('google', 'gemini-3-pro') """ + # Check explicit aliases first if alias in MODEL_ALIASES: return MODEL_ALIASES[alias] + # Try to infer provider from model name patterns + alias_lower = alias.lower() + + if alias_lower.startswith("claude"): + return ("anthropic", alias) + elif alias_lower.startswith("gpt"): + return ("openai", alias) + elif alias_lower.startswith("gemini"): + return ("google", alias) + raise ValueError( f"Unknown model alias: {alias}. " - f"Available: {', '.join(MODEL_ALIASES.keys())}" + f"Available aliases: {', '.join(MODEL_ALIASES.keys())}. " + f"Or use a full model ID with a known prefix (claude-*, gpt-*, gemini-*)." ) + + +def get_provider_for_model(model: str) -> tuple[BaseAPIProvider, str]: + """Get the appropriate provider for a model. + + Args: + model: Model alias or full model ID. + + Returns: + Tuple of (provider_instance, resolved_model_id). + + Example: + >>> provider, model_id = get_provider_for_model("claude-opus-4.5") + >>> provider.name + 'anthropic' + >>> model_id + 'claude-opus-4-5-20251101' + """ + provider_name, model_id = resolve_model_alias(model) + provider = get_provider(provider_name) + return provider, model_id + + +def quick_message( + model: str, + prompt: str, + image: "Image | None" = None, + system: str = "", + api_key: str | None = None, + max_tokens: int = 1024, + temperature: float = 0.1, +) -> str: + """Send a quick message to any model. + + Convenience function that resolves the provider, creates a client, + and sends a message in one call. Useful for one-off requests. + + Args: + model: Model alias or full model ID. + prompt: User prompt text. + image: Optional image to include. + system: Optional system prompt. + api_key: Optional API key (uses settings/env if not provided). + max_tokens: Maximum tokens in response. + temperature: Sampling temperature. + + Returns: + Model response text. + + Raises: + AuthenticationError: If no API key is available. + ProviderError: For API errors. + + Example: + >>> response = quick_message( + ... model="claude-opus-4.5", + ... prompt="What's in this image?", + ... image=screenshot, + ... ) + """ + provider, model_id = get_provider_for_model(model) + resolved_key = provider.get_api_key(api_key) + return provider.quick_message( + api_key=resolved_key, + model=model_id, + prompt=prompt, + image=image, + system=system, + max_tokens=max_tokens, + temperature=temperature, + ) + + +def list_providers() -> list[str]: + """List available provider names. + + Returns: + List of provider identifiers. + + Example: + >>> list_providers() + ['anthropic', 'openai', 'google'] + """ + return list(PROVIDERS.keys()) + + +def list_models(provider: str | None = None) -> dict[str, dict]: + """List available models, optionally filtered by provider. + + Args: + provider: Optional provider name to filter by. + + Returns: + Dict mapping model IDs to their properties. + + Example: + >>> list_models("anthropic") + { + 'claude-opus-4-5-20251101': {'context': 200000, 'description': 'SOTA computer use'}, + ... 
+ } + """ + if provider: + provider_instance = get_provider(provider) + return provider_instance.supported_models + + # Combine models from all providers + all_models = {} + for provider_name in PROVIDERS: + provider_instance = get_provider(provider_name) + for model_id, props in provider_instance.supported_models.items(): + all_models[model_id] = {**props, "provider": provider_name} + + return all_models diff --git a/openadapt_ml/models/providers/anthropic.py b/openadapt_ml/models/providers/anthropic.py index aeb06b1..fef9690 100644 --- a/openadapt_ml/models/providers/anthropic.py +++ b/openadapt_ml/models/providers/anthropic.py @@ -1,43 +1,87 @@ """Anthropic (Claude) API provider. Supports Claude Opus 4.5, Sonnet 4.5, and other Claude models. +Implements the BaseAPIProvider interface for the Anthropic Messages API. """ from __future__ import annotations +import logging from typing import TYPE_CHECKING, Any -from openadapt_ml.models.providers.base import BaseAPIProvider +from openadapt_ml.models.providers.base import ( + BaseAPIProvider, + AuthenticationError, + ModelNotFoundError, + ProviderError, + RateLimitError, +) if TYPE_CHECKING: from PIL import Image +logger = logging.getLogger(__name__) + +# Default models +DEFAULT_MODEL = "claude-sonnet-4-5-20250929" + +# Supported models with their context windows +SUPPORTED_MODELS = { + "claude-opus-4-5-20251101": {"context": 200_000, "description": "SOTA computer use"}, + "claude-sonnet-4-5-20250929": {"context": 200_000, "description": "Fast, cheaper"}, + "claude-sonnet-4-20250514": {"context": 200_000, "description": "Previous Sonnet"}, + "claude-haiku-3-5-20241022": {"context": 200_000, "description": "Fastest, cheapest"}, +} + class AnthropicProvider(BaseAPIProvider): """Provider for Anthropic's Claude models. + Implements vision support via base64-encoded images in the Messages API format. + Claude models natively support screenshots and UI analysis for computer use tasks. + Supported models: - - claude-opus-4-5-20251101 (SOTA computer use) - - claude-sonnet-4-5-20250929 (fast, cheaper) + - claude-opus-4-5-20251101: Most capable, best for complex GUI tasks + - claude-sonnet-4-5-20250929: Fast and cost-effective + - claude-haiku-3-5-20241022: Fastest, lowest cost Example: - provider = AnthropicProvider() - client = provider.create_client(api_key) - response = provider.send_message( - client, - model="claude-opus-4-5-20251101", - system="You are a GUI agent.", - content=[ - {"type": "text", "text": "Click the submit button"}, - provider.encode_image(screenshot), - ], - ) + >>> provider = AnthropicProvider() + >>> client = provider.create_client(api_key) + >>> response = provider.send_message( + ... client, + ... model="claude-opus-4-5-20251101", + ... system="You are a GUI agent.", + ... content=[ + ... {"type": "text", "text": "Click the submit button"}, + ... provider.encode_image(screenshot), + ... ], + ... ) + + Attributes: + name: Returns 'anthropic'. """ @property def name(self) -> str: + """Provider name.""" return "anthropic" + @property + def env_key_name(self) -> str: + """Environment variable name for API key.""" + return "ANTHROPIC_API_KEY" + + @property + def default_model(self) -> str: + """Default model to use.""" + return DEFAULT_MODEL + + @property + def supported_models(self) -> dict[str, dict[str, Any]]: + """Dictionary of supported models and their properties.""" + return SUPPORTED_MODELS + def create_client(self, api_key: str) -> Any: """Create Anthropic client. 
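The `SUPPORTED_MODELS` table above is what the new `list_models()` helper surfaces. A minimal sketch of consuming that registry (assumes the package from this diff is importable; the context/description values come from the table above):

```python
from openadapt_ml.models.providers import list_models

# Print each Anthropic model with its context window and description,
# exactly as declared in SUPPORTED_MODELS.
for model_id, props in list_models("anthropic").items():
    print(f"{model_id}: {props['context']:,} tokens - {props['description']}")
```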
@@ -49,14 +93,23 @@ def create_client(self, api_key: str) -> Any: Raises: ImportError: If anthropic package not installed. + AuthenticationError: If API key format is invalid. """ try: from anthropic import Anthropic except ImportError as e: raise ImportError( - "anthropic package is required. Install with: pip install anthropic" + "anthropic package is required for provider='anthropic'. " + "Install with: uv add anthropic" ) from e + if not api_key or not api_key.strip(): + raise AuthenticationError( + "Anthropic API key cannot be empty. " + "Get a key from https://console.anthropic.com/" + ) + + logger.debug("Creating Anthropic client") return Anthropic(api_key=api_key) def send_message( @@ -71,43 +124,77 @@ def send_message( """Send message using Anthropic Messages API. Args: - client: Anthropic client. + client: Anthropic client from create_client(). model: Model ID (e.g., 'claude-opus-4-5-20251101'). system: System prompt. - content: List of content blocks. + content: List of content blocks (text and images). max_tokens: Max response tokens. - temperature: Sampling temperature. + temperature: Sampling temperature (0.0-1.0). Returns: Model response text. + + Raises: + AuthenticationError: If API key is invalid. + RateLimitError: If rate limit exceeded. + ModelNotFoundError: If model doesn't exist. + ProviderError: For other API errors. """ - response = client.messages.create( - model=model, - max_tokens=max_tokens, - temperature=temperature, - system=system or None, - messages=[{"role": "user", "content": content}], - ) - - # Extract text from content blocks - parts = getattr(response, "content", []) - texts = [ - getattr(p, "text", "") - for p in parts - if getattr(p, "type", "") == "text" - ] - return "\n".join([t for t in texts if t]).strip() + logger.debug(f"Sending message to {model} with {len(content)} content blocks") + + try: + response = client.messages.create( + model=model, + max_tokens=max_tokens, + temperature=temperature, + system=system or None, + messages=[{"role": "user", "content": content}], + ) + + # Extract text from content blocks + parts = getattr(response, "content", []) + texts = [ + getattr(p, "text", "") + for p in parts + if getattr(p, "type", "") == "text" + ] + result = "\n".join([t for t in texts if t]).strip() + + logger.debug(f"Received response: {len(result)} chars") + return result + + except Exception as e: + error_str = str(e).lower() + + # Map common errors to specific exceptions + if "authentication" in error_str or "api_key" in error_str: + raise AuthenticationError(f"Anthropic authentication failed: {e}") from e + elif "rate_limit" in error_str or "429" in error_str: + raise RateLimitError(f"Anthropic rate limit exceeded: {e}") from e + elif "model_not_found" in error_str or "not found" in error_str: + raise ModelNotFoundError(f"Model '{model}' not found: {e}") from e + else: + raise ProviderError(f"Anthropic API error: {e}") from e def encode_image(self, image: "Image") -> dict[str, Any]: """Encode image for Anthropic API. Anthropic uses base64-encoded images with explicit source type. + PNG format is used for lossless quality. Args: image: PIL Image. Returns: - Image content block for Anthropic API. 
+ Image content block for Anthropic API in format: + { + "type": "image", + "source": { + "type": "base64", + "media_type": "image/png", + "data": "" + } + } """ return { "type": "image", @@ -117,3 +204,55 @@ def encode_image(self, image: "Image") -> dict[str, Any]: "data": self.image_to_base64(image, "PNG"), }, } + + def encode_image_from_bytes( + self, + image_bytes: bytes, + media_type: str = "image/png", + ) -> dict[str, Any]: + """Encode raw image bytes for Anthropic API. + + Useful when you already have image bytes and don't need PIL. + + Args: + image_bytes: Raw image bytes. + media_type: MIME type of the image. + + Returns: + Image content block for Anthropic API. + """ + import base64 + + return { + "type": "image", + "source": { + "type": "base64", + "media_type": media_type, + "data": base64.b64encode(image_bytes).decode("utf-8"), + }, + } + + def encode_image_from_url(self, url: str) -> dict[str, Any]: + """Create image content block from URL. + + Note: Anthropic doesn't support URL-based images directly. + This method fetches the URL and encodes the image. + + Args: + url: Image URL to fetch and encode. + + Returns: + Image content block for Anthropic API. + + Raises: + ProviderError: If URL fetch fails. + """ + import urllib.request + + try: + with urllib.request.urlopen(url) as response: + image_bytes = response.read() + content_type = response.headers.get("Content-Type", "image/png") + return self.encode_image_from_bytes(image_bytes, content_type) + except Exception as e: + raise ProviderError(f"Failed to fetch image from URL: {e}") from e diff --git a/openadapt_ml/models/providers/base.py b/openadapt_ml/models/providers/base.py index 5454b51..c4bd54c 100644 --- a/openadapt_ml/models/providers/base.py +++ b/openadapt_ml/models/providers/base.py @@ -1,24 +1,70 @@ """Base provider abstraction for API-backed VLMs. This module defines the interface that all API providers must implement. +Providers handle client creation, message sending, and image encoding +in a provider-specific way. """ from __future__ import annotations import base64 import io +import logging +import os from abc import ABC, abstractmethod from typing import TYPE_CHECKING, Any if TYPE_CHECKING: from PIL import Image +logger = logging.getLogger(__name__) + + +class ProviderError(Exception): + """Base exception for provider errors.""" + + pass + + +class AuthenticationError(ProviderError): + """Raised when API authentication fails.""" + + pass + + +class RateLimitError(ProviderError): + """Raised when API rate limit is exceeded.""" + + pass + + +class ModelNotFoundError(ProviderError): + """Raised when the specified model is not available.""" + + pass + class BaseAPIProvider(ABC): """Abstract base class for API providers (Anthropic, OpenAI, Google). Each provider implements client creation, message sending, and image encoding in a provider-specific way. + + Attributes: + name: Provider identifier ('anthropic', 'openai', 'google'). + + Example: + >>> provider = get_provider("anthropic") + >>> client = provider.create_client(api_key) + >>> response = provider.send_message( + ... client, + ... model="claude-opus-4-5-20251101", + ... system="You are a GUI agent.", + ... content=[ + ... {"type": "text", "text": "Click the submit button"}, + ... provider.encode_image(screenshot), + ... ], + ... ) """ @property @@ -27,6 +73,47 @@ def name(self) -> str: """Provider name (e.g., 'anthropic', 'openai', 'google').""" ... + @property + def env_key_name(self) -> str: + """Environment variable name for API key. 
+ + Returns: + Environment variable name (e.g., 'ANTHROPIC_API_KEY'). + """ + return f"{self.name.upper()}_API_KEY" + + def get_api_key(self, api_key: str | None = None) -> str: + """Get API key from parameter, settings, or environment. + + Args: + api_key: Optional explicit API key. + + Returns: + API key string. + + Raises: + AuthenticationError: If no API key is available. + """ + if api_key: + return api_key + + # Try settings + from openadapt_ml.config import settings + + settings_key = getattr(settings, f"{self.name}_api_key", None) + if settings_key: + return settings_key + + # Try environment + env_key = os.getenv(self.env_key_name) + if env_key: + return env_key + + raise AuthenticationError( + f"{self.env_key_name} is required but not found. " + f"Set it in .env file, environment variable, or pass api_key parameter." + ) + @abstractmethod def create_client(self, api_key: str) -> Any: """Create and return an API client. @@ -36,6 +123,10 @@ def create_client(self, api_key: str) -> Any: Returns: Provider-specific client object. + + Raises: + ImportError: If required package is not installed. + AuthenticationError: If API key is invalid. """ ... @@ -61,6 +152,11 @@ def send_message( Returns: The model's text response. + + Raises: + RateLimitError: If rate limit is exceeded. + ModelNotFoundError: If model is not available. + ProviderError: For other API errors. """ ... @@ -107,3 +203,97 @@ def get_media_type(self, format: str = "PNG") -> str: "WEBP": "image/webp", } return format_map.get(format.upper(), "image/png") + + def create_text_content(self, text: str) -> dict[str, Any]: + """Create a text content block. + + Args: + text: Text content. + + Returns: + Text content block. + """ + return {"type": "text", "text": text} + + def build_content( + self, + text: str | None = None, + image: "Image | None" = None, + additional_content: list[dict[str, Any]] | None = None, + ) -> list[dict[str, Any]]: + """Build a content list from text and/or image. + + Convenience method for building content lists in the correct format. + + Args: + text: Optional text content. + image: Optional PIL Image. + additional_content: Optional additional content blocks. + + Returns: + List of content blocks. + + Example: + >>> content = provider.build_content( + ... text="Click the button", + ... image=screenshot, + ... ) + """ + content = [] + + if text: + content.append(self.create_text_content(text)) + + if image is not None: + content.append(self.encode_image(image)) + + if additional_content: + content.extend(additional_content) + + return content + + def quick_message( + self, + api_key: str, + model: str, + prompt: str, + image: "Image | None" = None, + system: str = "", + max_tokens: int = 1024, + temperature: float = 0.1, + ) -> str: + """Send a quick message without managing client lifecycle. + + Convenience method that creates a client, sends a message, and returns + the response in one call. Useful for one-off requests. + + Args: + api_key: API key for authentication. + model: Model identifier. + prompt: User prompt text. + image: Optional image to include. + system: Optional system prompt. + max_tokens: Maximum tokens in response. + temperature: Sampling temperature. + + Returns: + Model response text. + + Example: + >>> response = provider.quick_message( + ... api_key=key, + ... model="claude-opus-4-5-20251101", + ... prompt="What's in this image?", + ... image=screenshot, + ... 
) + """ + client = self.create_client(api_key) + content = self.build_content(text=prompt, image=image) + return self.send_message( + client=client, + model=model, + system=system, + content=content, + max_tokens=max_tokens, + temperature=temperature, + ) diff --git a/openadapt_ml/models/providers/google.py b/openadapt_ml/models/providers/google.py index 6dace1a..773dc27 100644 --- a/openadapt_ml/models/providers/google.py +++ b/openadapt_ml/models/providers/google.py @@ -1,71 +1,127 @@ """Google (Gemini) API provider. Supports Gemini 3 Pro, Gemini 3 Flash, and other Gemini models. +Implements the BaseAPIProvider interface for the Generative AI API. """ from __future__ import annotations +import logging from typing import TYPE_CHECKING, Any -from openadapt_ml.models.providers.base import BaseAPIProvider +from openadapt_ml.models.providers.base import ( + BaseAPIProvider, + AuthenticationError, + ModelNotFoundError, + ProviderError, + RateLimitError, +) if TYPE_CHECKING: from PIL import Image +logger = logging.getLogger(__name__) + +# Default models +DEFAULT_MODEL = "gemini-2.5-flash" + +# Supported models with their properties +SUPPORTED_MODELS = { + "gemini-3-pro": {"context": 2_000_000, "description": "Most capable Gemini"}, + "gemini-3-flash": {"context": 1_000_000, "description": "Fast inference"}, + "gemini-2.5-pro": {"context": 2_000_000, "description": "Previous pro"}, + "gemini-2.5-flash": {"context": 1_000_000, "description": "Fast previous gen"}, + "gemini-2.0-flash": {"context": 1_000_000, "description": "Stable flash"}, + "gemini-1.5-pro": {"context": 2_000_000, "description": "Legacy pro"}, + "gemini-1.5-flash": {"context": 1_000_000, "description": "Legacy flash"}, +} + class GoogleProvider(BaseAPIProvider): """Provider for Google's Gemini models. + Implements vision support with native PIL Image handling. Unlike Anthropic + and OpenAI which require base64 encoding, Gemini accepts PIL Images directly. + Supported models: - - gemini-3-pro (most capable) - - gemini-3-flash (fast inference) - - gemini-2.5-pro (previous gen) - - gemini-2.5-flash (fast previous gen) + - gemini-3-pro: Most capable, 2M context window + - gemini-3-flash: Fast inference, 1M context + - gemini-2.5-pro/flash: Previous generation + - gemini-2.0-flash: Stable release Note: Gemini supports PIL Images directly without base64 encoding. + The encode_image method returns the image wrapped in a dict for + consistency with other providers. Example: - provider = GoogleProvider() - client = provider.create_client(api_key) - response = provider.send_message( - client, - model="gemini-3-pro", - system="You are a GUI agent.", - content=[ - {"type": "text", "text": "Click the submit button"}, - provider.encode_image(screenshot), - ], - ) + >>> provider = GoogleProvider() + >>> client = provider.create_client(api_key) + >>> response = provider.send_message( + ... client, + ... model="gemini-3-pro", + ... system="You are a GUI agent.", + ... content=[ + ... {"type": "text", "text": "Click the submit button"}, + ... provider.encode_image(screenshot), + ... ], + ... ) + + Attributes: + name: Returns 'google'. 
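+
+    Quick usage sketch (assumes GOOGLE_API_KEY is set and shot is a PIL
+    Image; the alias comes from MODEL_ALIASES in providers/__init__.py):
+
+        >>> from openadapt_ml.models.providers import quick_message
+        >>> quick_message(model="gemini-3-flash", prompt="Describe this UI", image=shot)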
""" @property def name(self) -> str: + """Provider name.""" return "google" + @property + def env_key_name(self) -> str: + """Environment variable name for API key.""" + return "GOOGLE_API_KEY" + + @property + def default_model(self) -> str: + """Default model to use.""" + return DEFAULT_MODEL + + @property + def supported_models(self) -> dict[str, dict[str, Any]]: + """Dictionary of supported models and their properties.""" + return SUPPORTED_MODELS + def create_client(self, api_key: str) -> Any: """Create Google Generative AI client. Unlike Anthropic/OpenAI, Gemini uses a global configure call. - We return a dict with the API key for later use. + We return a dict containing the configured genai module. Args: api_key: Google API key. Returns: - Dict containing api_key for model creation. + Dict containing api_key and configured genai module. Raises: ImportError: If google-generativeai package not installed. + AuthenticationError: If API key is empty. """ try: import google.generativeai as genai except ImportError as e: raise ImportError( - "google-generativeai package is required. " - "Install with: pip install google-generativeai" + "google-generativeai package is required for provider='google'. " + "Install with: uv add google-generativeai" ) from e + if not api_key or not api_key.strip(): + raise AuthenticationError( + "Google API key cannot be empty. " + "Get a key from https://makersuite.google.com/app/apikey" + ) + + logger.debug("Configuring Google Generative AI") genai.configure(api_key=api_key) return {"api_key": api_key, "genai": genai} @@ -83,14 +139,22 @@ def send_message( Args: client: Client dict from create_client(). model: Model ID (e.g., 'gemini-3-pro'). - system: System prompt (prepended to content). + system: System prompt (prepended to content as text). content: List of content blocks. max_tokens: Max response tokens. - temperature: Sampling temperature. + temperature: Sampling temperature (0.0-2.0 for Gemini). Returns: Model response text. + + Raises: + AuthenticationError: If API key is invalid. + RateLimitError: If rate limit exceeded. + ModelNotFoundError: If model doesn't exist. + ProviderError: For other API errors. 
""" + logger.debug(f"Sending message to {model} with {len(content)} content blocks") + genai = client["genai"] model_instance = genai.GenerativeModel(model) @@ -111,28 +175,198 @@ def send_message( if image is not None: gemini_content.append(image) - response = model_instance.generate_content( - gemini_content, - generation_config=genai.GenerationConfig( - temperature=temperature, - max_output_tokens=max_tokens, - ), - ) + try: + response = model_instance.generate_content( + gemini_content, + generation_config=genai.GenerationConfig( + temperature=temperature, + max_output_tokens=max_tokens, + ), + ) + + result = response.text + logger.debug(f"Received response: {len(result)} chars") + return result + + except Exception as e: + error_str = str(e).lower() - return response.text + # Map common errors to specific exceptions + if "api_key" in error_str or "authentication" in error_str or "invalid" in error_str: + raise AuthenticationError(f"Google authentication failed: {e}") from e + elif "quota" in error_str or "rate" in error_str or "429" in error_str: + raise RateLimitError(f"Google rate limit/quota exceeded: {e}") from e + elif "not found" in error_str or "does not exist" in error_str: + raise ModelNotFoundError(f"Model '{model}' not found: {e}") from e + else: + raise ProviderError(f"Google API error: {e}") from e def encode_image(self, image: "Image") -> dict[str, Any]: """Encode image for Gemini API. - Gemini accepts PIL Images directly, no base64 needed. + Gemini accepts PIL Images directly, no base64 encoding needed. + We wrap the image in a dict for API consistency. Args: image: PIL Image. Returns: - Image content block containing the PIL Image. + Image content block containing the PIL Image: + { + "type": "image", + "image": + } """ return { "type": "image", "image": image, } + + def encode_image_from_bytes( + self, + image_bytes: bytes, + media_type: str = "image/png", + ) -> dict[str, Any]: + """Encode raw image bytes for Gemini API. + + Converts bytes to PIL Image for Gemini's native format. + + Args: + image_bytes: Raw image bytes. + media_type: MIME type (used to verify format). + + Returns: + Image content block with PIL Image. + """ + import io + + from PIL import Image as PILImage + + image = PILImage.open(io.BytesIO(image_bytes)) + return self.encode_image(image) + + def encode_image_from_url(self, url: str) -> dict[str, Any]: + """Create image content block from URL. + + Fetches the image and converts to PIL Image. + + Args: + url: Image URL to fetch. + + Returns: + Image content block with PIL Image. + + Raises: + ProviderError: If URL fetch fails. + """ + import io + import urllib.request + + from PIL import Image as PILImage + + try: + with urllib.request.urlopen(url) as response: + image_bytes = response.read() + image = PILImage.open(io.BytesIO(image_bytes)) + return self.encode_image(image) + except Exception as e: + raise ProviderError(f"Failed to fetch image from URL: {e}") from e + + def encode_image_as_base64(self, image: "Image") -> dict[str, Any]: + """Encode image as base64 for Gemini API. + + While Gemini prefers PIL Images, it can also accept base64. + Use this for cases where you need to serialize the content. + + Args: + image: PIL Image. + + Returns: + Image content block with base64 data. 
+ """ + return { + "type": "image", + "inline_data": { + "mime_type": "image/png", + "data": self.image_to_base64(image, "PNG"), + }, + } + + def send_with_grounding( + self, + client: Any, + model: str, + prompt: str, + image: "Image", + max_tokens: int = 1024, + temperature: float = 0.1, + ) -> dict[str, Any]: + """Send message with grounding/bounding box detection. + + Uses Gemini's native vision capabilities to detect UI elements + and return bounding boxes. Useful for Set-of-Marks processing. + + Args: + client: Client dict from create_client(). + model: Model ID. + prompt: Detection prompt. + image: Screenshot to analyze. + max_tokens: Max response tokens. + temperature: Sampling temperature. + + Returns: + Dict with response text and any detected bounding boxes. + + Example: + >>> result = provider.send_with_grounding( + ... client, + ... model="gemini-2.5-flash", + ... prompt="Find the login button", + ... image=screenshot, + ... ) + >>> print(result["boxes"]) # List of bounding boxes + """ + genai = client["genai"] + model_instance = genai.GenerativeModel(model) + + grounding_prompt = f"""Analyze this screenshot and {prompt} + +Return a JSON object with: +- "elements": array of detected elements with "label", "bbox" [x1,y1,x2,y2], "confidence" +- "description": brief description of what you found + +Use pixel coordinates based on image dimensions: {image.width}x{image.height} + +Return ONLY valid JSON.""" + + try: + response = model_instance.generate_content( + [grounding_prompt, image], + generation_config=genai.GenerationConfig( + temperature=temperature, + max_output_tokens=max_tokens, + ), + ) + + text = response.text + + # Try to parse JSON response + import json + import re + + json_match = re.search(r'\{[\s\S]*\}', text) + if json_match: + try: + data = json.loads(json_match.group()) + return { + "text": text, + "elements": data.get("elements", []), + "description": data.get("description", ""), + } + except json.JSONDecodeError: + pass + + return {"text": text, "elements": [], "description": ""} + + except Exception as e: + raise ProviderError(f"Google grounding error: {e}") from e diff --git a/openadapt_ml/models/providers/openai.py b/openadapt_ml/models/providers/openai.py index 1653298..d54eeef 100644 --- a/openadapt_ml/models/providers/openai.py +++ b/openadapt_ml/models/providers/openai.py @@ -1,44 +1,93 @@ """OpenAI (GPT) API provider. -Supports GPT-5.2, GPT-5.1, and other OpenAI models with vision. +Supports GPT-5.2, GPT-5.1, GPT-4o, and other OpenAI models with vision. +Implements the BaseAPIProvider interface for the Chat Completions API. """ from __future__ import annotations +import logging from typing import TYPE_CHECKING, Any -from openadapt_ml.models.providers.base import BaseAPIProvider +from openadapt_ml.models.providers.base import ( + BaseAPIProvider, + AuthenticationError, + ModelNotFoundError, + ProviderError, + RateLimitError, +) if TYPE_CHECKING: from PIL import Image +logger = logging.getLogger(__name__) + +# Default models +DEFAULT_MODEL = "gpt-4o" + +# Supported models with their properties +SUPPORTED_MODELS = { + "gpt-5.2": {"context": 128_000, "description": "Latest GPT model"}, + "gpt-5.1": {"context": 128_000, "description": "Previous GPT-5"}, + "gpt-4o": {"context": 128_000, "description": "Vision-capable, fast"}, + "gpt-4o-mini": {"context": 128_000, "description": "Cheaper, fast"}, + "gpt-4-turbo": {"context": 128_000, "description": "Previous gen turbo"}, +} + class OpenAIProvider(BaseAPIProvider): """Provider for OpenAI's GPT models. 
+ Implements vision support via data URL encoded images in the Chat Completions API. + Supports both standard chat and vision-enabled models. + Supported models: - - gpt-5.2 (latest) - - gpt-5.1 (previous) - - gpt-4o (vision capable) + - gpt-5.2: Latest and most capable + - gpt-5.1: Previous generation GPT-5 + - gpt-4o: Fast, vision-capable + - gpt-4o-mini: Cost-effective, vision-capable Example: - provider = OpenAIProvider() - client = provider.create_client(api_key) - response = provider.send_message( - client, - model="gpt-5.2", - system="You are a GUI agent.", - content=[ - {"type": "text", "text": "Click the submit button"}, - provider.encode_image(screenshot), - ], - ) + >>> provider = OpenAIProvider() + >>> client = provider.create_client(api_key) + >>> response = provider.send_message( + ... client, + ... model="gpt-5.2", + ... system="You are a GUI agent.", + ... content=[ + ... {"type": "text", "text": "Click the submit button"}, + ... provider.encode_image(screenshot), + ... ], + ... ) + + Note: + OpenAI uses data URLs for images (data:image/png;base64,...). + This differs from Anthropic's explicit source object format. + + Attributes: + name: Returns 'openai'. """ @property def name(self) -> str: + """Provider name.""" return "openai" + @property + def env_key_name(self) -> str: + """Environment variable name for API key.""" + return "OPENAI_API_KEY" + + @property + def default_model(self) -> str: + """Default model to use.""" + return DEFAULT_MODEL + + @property + def supported_models(self) -> dict[str, dict[str, Any]]: + """Dictionary of supported models and their properties.""" + return SUPPORTED_MODELS + def create_client(self, api_key: str) -> Any: """Create OpenAI client. @@ -50,14 +99,23 @@ def create_client(self, api_key: str) -> Any: Raises: ImportError: If openai package not installed. + AuthenticationError: If API key is empty. """ try: from openai import OpenAI except ImportError as e: raise ImportError( - "openai package is required. Install with: pip install openai" + "openai package is required for provider='openai'. " + "Install with: uv add openai" ) from e + if not api_key or not api_key.strip(): + raise AuthenticationError( + "OpenAI API key cannot be empty. " + "Get a key from https://platform.openai.com/api-keys" + ) + + logger.debug("Creating OpenAI client") return OpenAI(api_key=api_key) def send_message( @@ -72,16 +130,24 @@ def send_message( """Send message using OpenAI Chat Completions API. Args: - client: OpenAI client. - model: Model ID (e.g., 'gpt-5.2'). + client: OpenAI client from create_client(). + model: Model ID (e.g., 'gpt-5.2', 'gpt-4o'). system: System prompt. - content: List of content blocks. + content: List of content blocks (text and images). max_tokens: Max response tokens. - temperature: Sampling temperature. + temperature: Sampling temperature (0.0-2.0 for OpenAI). Returns: Model response text. + + Raises: + AuthenticationError: If API key is invalid. + RateLimitError: If rate limit exceeded. + ModelNotFoundError: If model doesn't exist. + ProviderError: For other API errors. 
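+
+        Example (a sketch; shot is an assumed PIL Image):
+            >>> content = provider.build_content(text="Describe the screen", image=shot)
+            >>> text = provider.send_message(client, "gpt-4o", "You are a GUI agent.", content)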
""" + logger.debug(f"Sending message to {model} with {len(content)} content blocks") + messages = [] if system: @@ -89,25 +155,48 @@ def send_message( messages.append({"role": "user", "content": content}) - response = client.chat.completions.create( - model=model, - messages=messages, - max_completion_tokens=max_tokens, - temperature=temperature, - ) + try: + response = client.chat.completions.create( + model=model, + messages=messages, + max_completion_tokens=max_tokens, + temperature=temperature, + ) + + result = response.choices[0].message.content or "" + logger.debug(f"Received response: {len(result)} chars") + return result + + except Exception as e: + error_str = str(e).lower() - return response.choices[0].message.content or "" + # Map common errors to specific exceptions + if "authentication" in error_str or "api_key" in error_str or "invalid_api_key" in error_str: + raise AuthenticationError(f"OpenAI authentication failed: {e}") from e + elif "rate_limit" in error_str or "429" in error_str: + raise RateLimitError(f"OpenAI rate limit exceeded: {e}") from e + elif "model_not_found" in error_str or "does not exist" in error_str: + raise ModelNotFoundError(f"Model '{model}' not found: {e}") from e + else: + raise ProviderError(f"OpenAI API error: {e}") from e def encode_image(self, image: "Image") -> dict[str, Any]: """Encode image for OpenAI API. - OpenAI uses data URLs for images. + OpenAI uses data URLs for images in the format: + data:image/;base64, Args: image: PIL Image. Returns: - Image content block for OpenAI API. + Image content block for OpenAI API in format: + { + "type": "image_url", + "image_url": { + "url": "data:image/png;base64,..." + } + } """ base64_data = self.image_to_base64(image, "PNG") return { @@ -116,3 +205,134 @@ def encode_image(self, image: "Image") -> dict[str, Any]: "url": f"data:image/png;base64,{base64_data}", }, } + + def encode_image_with_detail( + self, + image: "Image", + detail: str = "auto", + ) -> dict[str, Any]: + """Encode image with detail level specification. + + OpenAI supports different detail levels for vision processing: + - "low": Fixed 512x512, 85 tokens, fast + - "high": Scaled up to 2048x2048, more tokens, detailed + - "auto": Let the model decide based on image size + + Args: + image: PIL Image. + detail: Detail level ("low", "high", "auto"). + + Returns: + Image content block with detail specification. + """ + base64_data = self.image_to_base64(image, "PNG") + return { + "type": "image_url", + "image_url": { + "url": f"data:image/png;base64,{base64_data}", + "detail": detail, + }, + } + + def encode_image_from_url( + self, + url: str, + detail: str = "auto", + ) -> dict[str, Any]: + """Create image content block from URL. + + OpenAI natively supports URL-based images, so no fetching needed. + + Args: + url: Image URL. + detail: Detail level ("low", "high", "auto"). + + Returns: + Image content block for OpenAI API. + """ + return { + "type": "image_url", + "image_url": { + "url": url, + "detail": detail, + }, + } + + def encode_image_from_bytes( + self, + image_bytes: bytes, + media_type: str = "image/png", + ) -> dict[str, Any]: + """Encode raw image bytes for OpenAI API. + + Args: + image_bytes: Raw image bytes. + media_type: MIME type of the image. + + Returns: + Image content block for OpenAI API. 
+ """ + import base64 + + base64_data = base64.b64encode(image_bytes).decode("utf-8") + return { + "type": "image_url", + "image_url": { + "url": f"data:{media_type};base64,{base64_data}", + }, + } + + def send_with_tools( + self, + client: Any, + model: str, + messages: list[dict[str, Any]], + tools: list[dict[str, Any]], + tool_choice: str | dict[str, Any] = "auto", + max_tokens: int = 1024, + temperature: float = 0.1, + ) -> Any: + """Send message with function calling/tools support. + + OpenAI supports function calling which can be useful for structured + action extraction in GUI automation. + + Args: + client: OpenAI client. + model: Model ID. + messages: Chat messages. + tools: Tool definitions. + tool_choice: Tool choice strategy. + max_tokens: Max response tokens. + temperature: Sampling temperature. + + Returns: + Raw API response (for tool call handling). + + Example: + >>> tools = [{ + ... "type": "function", + ... "function": { + ... "name": "click", + ... "parameters": { + ... "type": "object", + ... "properties": { + ... "x": {"type": "number"}, + ... "y": {"type": "number"} + ... } + ... } + ... } + ... }] + >>> response = provider.send_with_tools(client, model, messages, tools) + """ + try: + return client.chat.completions.create( + model=model, + messages=messages, + tools=tools, + tool_choice=tool_choice, + max_completion_tokens=max_tokens, + temperature=temperature, + ) + except Exception as e: + raise ProviderError(f"OpenAI tools API error: {e}") from e diff --git a/scripts/sync_package_docs.py b/scripts/sync_package_docs.py new file mode 100644 index 0000000..4f3b770 --- /dev/null +++ b/scripts/sync_package_docs.py @@ -0,0 +1,276 @@ +#!/usr/bin/env python3 +""" +Sync package documentation from sub-repos to the main OpenAdapt docs. + +This script fetches README.md files from OpenAdapt sub-packages and transforms +them into MkDocs-compatible documentation pages. + +Usage: + python scripts/sync_package_docs.py [--dry-run] [--package PACKAGE] + +Arguments: + --dry-run Print what would be changed without making changes + --package Sync only a specific package (e.g., openadapt-ml) + +Environment: + GITHUB_TOKEN GitHub token for API access (optional, increases rate limit) + +This script should be run from the main OpenAdapt repository root. 
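+
+Example:
+    # Preview changes for one package, then sync everything:
+    python scripts/sync_package_docs.py --package openadapt-ml --dry-run
+    python scripts/sync_package_docs.py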
+""" + +import argparse +import json +import os +import re +import sys +from pathlib import Path +from typing import Optional +from urllib.request import Request, urlopen +from urllib.error import HTTPError + + +# Package configuration: maps package name to doc output file +PACKAGES = { + "openadapt-ml": { + "repo": "OpenAdaptAI/openadapt-ml", + "doc_file": "docs/packages/ml.md", + "title": "openadapt-ml", + "description": "Policy learning, training, and inference for GUI automation agents.", + }, + "openadapt-evals": { + "repo": "OpenAdaptAI/openadapt-evals", + "doc_file": "docs/packages/evals.md", + "title": "openadapt-evals", + "description": "Benchmark evaluation framework for GUI automation agents.", + }, + "openadapt-capture": { + "repo": "OpenAdaptAI/openadapt-capture", + "doc_file": "docs/packages/capture.md", + "title": "openadapt-capture", + "description": "Screen and input capture for demonstration recording.", + }, + "openadapt-viewer": { + "repo": "OpenAdaptAI/openadapt-viewer", + "doc_file": "docs/packages/viewer.md", + "title": "openadapt-viewer", + "description": "Web-based viewer for demonstrations and training results.", + }, + "openadapt-grounding": { + "repo": "OpenAdaptAI/openadapt-grounding", + "doc_file": "docs/packages/grounding.md", + "title": "openadapt-grounding", + "description": "Visual grounding for element detection and localization.", + }, + "openadapt-retrieval": { + "repo": "OpenAdaptAI/openadapt-retrieval", + "doc_file": "docs/packages/retrieval.md", + "title": "openadapt-retrieval", + "description": "Demo retrieval for trajectory-conditioned prompting.", + }, + "openadapt-privacy": { + "repo": "OpenAdaptAI/openadapt-privacy", + "doc_file": "docs/packages/privacy.md", + "title": "openadapt-privacy", + "description": "PII/PHI detection and scrubbing for privacy protection.", + }, +} + + +def fetch_readme(repo: str, token: Optional[str] = None) -> Optional[str]: + """Fetch README.md from a GitHub repository.""" + url = f"https://api.github.com/repos/{repo}/readme" + headers = { + "Accept": "application/vnd.github.v3.raw", + "User-Agent": "OpenAdapt-Docs-Sync/1.0", + } + if token: + headers["Authorization"] = f"token {token}" + + try: + request = Request(url, headers=headers) + with urlopen(request, timeout=30) as response: + return response.read().decode("utf-8") + except HTTPError as e: + if e.code == 404: + print(f" Warning: README not found for {repo}") + else: + print(f" Error fetching README for {repo}: {e}") + return None + except Exception as e: + print(f" Error fetching README for {repo}: {e}") + return None + + +def transform_readme( + readme: str, + package_name: str, + config: dict, +) -> str: + """Transform a package README into MkDocs-compatible documentation. + + Transformations: + 1. Replace the title with standardized format + 2. Add package metadata header + 3. Fix relative links to point to GitHub + 4. Remove badges (they don't render well in MkDocs) + 5. 
Add navigation hint at the end + """ + repo = config["repo"] + title = config["title"] + description = config["description"] + + lines = readme.split("\n") + output_lines = [] + + # Skip existing title and badges + i = 0 + while i < len(lines): + line = lines[i] + # Skip title line + if line.startswith("# "): + i += 1 + continue + # Skip badge lines (contain shields.io or badge images) + if "shields.io" in line or "badge" in line.lower() or line.strip().startswith("[!["): + i += 1 + continue + # Skip empty lines after title/badges + if not line.strip() and i < 5: + i += 1 + continue + break + + # Add standardized header + output_lines.append(f"# {title}") + output_lines.append("") + output_lines.append(description) + output_lines.append("") + output_lines.append(f"**Repository**: [{repo}](https://github.com/{repo})") + output_lines.append("") + + # Process remaining content + while i < len(lines): + line = lines[i] + + # Fix relative links to docs/ -> GitHub + line = re.sub( + r'\[([^\]]+)\]\(docs/([^)]+)\)', + f'[\\1](https://github.com/{repo}/blob/main/docs/\\2)', + line + ) + + # Fix relative links to other files -> GitHub + line = re.sub( + r'\[([^\]]+)\]\((?!http)([^)]+\.md)\)', + f'[\\1](https://github.com/{repo}/blob/main/\\2)', + line + ) + + output_lines.append(line) + i += 1 + + # Add footer + output_lines.append("") + output_lines.append("---") + output_lines.append("") + output_lines.append(f"*This page is auto-generated from the [{package_name} README](https://github.com/{repo}). " + f"To update, push changes to the source repository.*") + + return "\n".join(output_lines) + + +def sync_package( + package_name: str, + config: dict, + token: Optional[str] = None, + dry_run: bool = False, +) -> bool: + """Sync documentation for a single package. + + Returns True if changes were made (or would be made in dry-run mode). 
+ """ + print(f"Syncing {package_name}...") + + # Fetch README + readme = fetch_readme(config["repo"], token) + if not readme: + print(f" Skipping {package_name}: Could not fetch README") + return False + + # Transform to MkDocs format + doc_content = transform_readme(readme, package_name, config) + + # Check if file needs updating + doc_path = Path(config["doc_file"]) + if doc_path.exists(): + existing_content = doc_path.read_text() + if existing_content == doc_content: + print(f" No changes for {package_name}") + return False + + # Write or report + if dry_run: + print(f" Would update {doc_path}") + print(f" Preview (first 500 chars):") + print(" " + "-" * 40) + for line in doc_content[:500].split("\n"): + print(f" {line}") + print(" " + "-" * 40) + else: + doc_path.parent.mkdir(parents=True, exist_ok=True) + doc_path.write_text(doc_content) + print(f" Updated {doc_path}") + + return True + + +def main(): + parser = argparse.ArgumentParser( + description="Sync package documentation from sub-repos" + ) + parser.add_argument( + "--dry-run", + action="store_true", + help="Print what would be changed without making changes", + ) + parser.add_argument( + "--package", + help="Sync only a specific package", + ) + args = parser.parse_args() + + # Get GitHub token from environment + token = os.environ.get("GITHUB_TOKEN") + if not token: + print("Note: GITHUB_TOKEN not set, API rate limits may apply") + + # Determine which packages to sync + if args.package: + if args.package not in PACKAGES: + print(f"Error: Unknown package '{args.package}'") + print(f"Available packages: {', '.join(PACKAGES.keys())}") + sys.exit(1) + packages_to_sync = {args.package: PACKAGES[args.package]} + else: + packages_to_sync = PACKAGES + + # Sync each package + changes_made = False + for package_name, config in packages_to_sync.items(): + if sync_package(package_name, config, token, args.dry_run): + changes_made = True + + # Summary + if changes_made: + if args.dry_run: + print("\nDry run complete. Use without --dry-run to apply changes.") + else: + print("\nSync complete. 
Changes were made.")
+    else:
+        print("\nNo changes needed.")
+
+    return 0 if not changes_made else 1  # Exit 1 if changes for CI detection
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/tests/test_baselines.py b/tests/test_baselines.py
index 9220d44..b615578 100644
--- a/tests/test_baselines.py
+++ b/tests/test_baselines.py
@@ -457,10 +457,11 @@ def test_track_a_system_prompt(self):
         prompt = builder.get_system_prompt()

         assert "GUI automation agent" in prompt
-        assert "CLICK(x, y)" in prompt
-        assert "normalized coordinates" in prompt
-        # Track A should NOT mention element IDs
-        assert "element ID" not in prompt.lower() or "element_id" not in prompt
+        # The prompt shows coordinates in JSON format
+        assert "normalized" in prompt.lower()
+        assert "0.0" in prompt and "1.0" in prompt  # Coordinate range explanation
+        # Track A should NOT mention element IDs in any form
+        assert "element_id" not in prompt.lower() and "element id" not in prompt.lower()

     def test_track_b_system_prompt(self):
         """Test PromptBuilder generates Track B system prompt."""
@@ -561,8 +562,8 @@ def test_build_user_content_truncates_a11y_tree(self):
         content = builder.build_user_content(goal="Click", a11y_tree=a11y_tree)

         text_content = content[0]["text"]
-        # Should be truncated
-        assert "truncated" in text_content
+        # Should be truncated - the implementation shows "showing X of Y elements"
+        assert "showing 5 of 20" in text_content or "truncated" in text_content


 class TestParsedAction:
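Taken together, the provider changes above support a short end-to-end flow. A standalone sketch (assumes ANTHROPIC_API_KEY is set and a local screenshot.png exists; all names come from this diff):

```python
from PIL import Image

from openadapt_ml.models.providers import get_provider_for_model, quick_message

# Resolve a friendly alias to (provider, model_id) via MODEL_ALIASES.
provider, model_id = get_provider_for_model("claude-opus-4.5")
print(provider.name, model_id)  # anthropic claude-opus-4-5-20251101

# One-shot vision request; the API key is resolved via get_api_key()
# (explicit argument, then settings, then environment).
response = quick_message(
    model="claude-opus-4.5",
    prompt="What UI elements are visible?",
    image=Image.open("screenshot.png"),
)
print(response)
```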