@abrichr abrichr commented Jan 29, 2026

Summary

Completes the Azure ML parallelization implementation for WAA evaluation, enabling parallel execution across multiple Azure VMs.

Key changes:

1. Agent config serialization (`_serialize_agent_config`):
   - Extracts provider, model, and API keys from the agent
   - Passes `OPENAI_API_KEY`/`ANTHROPIC_API_KEY` via environment variables
   - Supports OpenAI and Anthropic agents
2. Worker command building (`_build_worker_command`):
   - Uses vanilla WAA `run.py` with `--worker_id` and `--num_workers`
   - Matches Microsoft's official Azure deployment pattern
   - Task distribution is handled by WAA internally (not our code)
3. Result fetching (`_fetch_worker_results`, `_parse_waa_results`):
   - Downloads job outputs via the Azure ML SDK (`client.jobs.download()`)
   - Parses WAA `result.txt` files (0.0 or 1.0 score)
   - Handles partial results from failed jobs gracefully
4. Job status tracking:
   - Added `job_name` field to the `WorkerState` dataclass
   - Updated `_wait_and_collect_results` to poll job status (not compute status)
   - Fixed a bug: we were checking compute instance status instead of Azure ML job status
5. Log fetching (`get_job_logs` in `AzureMLClient`):
   - Downloads logs via `az ml job download`
   - Supports a `tail` parameter for the last N lines
   - Updated `health_checker` to use the new method instead of returning an empty string
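The result-parsing step can be sketched as follows. The function name mirrors `_parse_waa_results` from the list above, but the directory layout is an assumption based on the description: each task directory holds a `result.txt` containing a single score, 0.0 (fail) or 1.0 (success).

```python
from pathlib import Path


def parse_waa_results(output_dir: str) -> dict[str, float]:
    """Sketch: collect per-task WAA scores from downloaded job outputs."""
    scores: dict[str, float] = {}
    for result_file in Path(output_dir).rglob("result.txt"):
        try:
            scores[result_file.parent.name] = float(result_file.read_text().strip())
        except ValueError:
            # Malformed file from a failed job: skip it, keep partial results.
            continue
    return scores
```

Skipping unparseable files (rather than raising) is what lets a partially failed job still contribute the scores it did produce.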

Design decisions:

- Uses vanilla `windowsarena/winarena:latest` with `VERSION=11e` (no custom Dockerfile)
- Follows Microsoft's official Azure deployment pattern from `run_azure.py`
- Task distribution delegated to WAA's built-in `--worker_id` / `--num_workers` mechanism
- Minimal changes to the existing code structure
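The delegated task distribution can be illustrated with a minimal sketch of the worker invocation. The `--worker_id`/`--num_workers` flags are WAA's own; the agent argument is illustrative:

```python
def build_worker_command(worker_id: int, num_workers: int,
                         agent: str = "navi") -> list[str]:
    """Sketch of the vanilla-WAA worker invocation. WAA's run.py uses
    worker_id/num_workers to pick its own slice of the task list, so the
    caller only numbers the workers."""
    return [
        "python", "run.py",
        "--agent_name", agent,
        "--worker_id", str(worker_id),
        "--num_workers", str(num_workers),
    ]
```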

Test plan

- All 188 existing tests pass
- Import verification successful
- Ruff linting passes (only style warnings for f-strings)
- Manual test with an actual Azure deployment (requires Azure quota)

Generated with Claude Code

abrichr and others added 5 commits January 29, 2026 13:43
- Remove unvalidated badges (95%+ success rate, 67% cost savings)
- Add "First open-source WAA reproduction" as headline
- Move WAA to top as main feature with status indicator
- Change "Recent Improvements" to "Roadmap (In Progress)"
- Remove v0.2.0 version references (current is v0.1.1)
- Add Azure quota requirements note for parallelization
- Mark features as [IN PROGRESS] where appropriate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
WAA is already open-source from Microsoft. Changed to accurate claim:
"Simplified CLI toolkit for Windows Agent Arena"

Updated value proposition to reflect what we actually provide:
- Azure VM setup and SSH tunnel management
- Agent adapters for Claude/GPT/custom agents
- Results viewer
- Parallelization support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The code uses Standard_D4s_v5 (4 vCPUs) by default, not D8ds_v5.
Updated all references to be accurate.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
abrichr and others added 7 commits January 29, 2026 14:18
New `azure-setup` command that:
- Checks Azure CLI installation and login status
- Creates resource group (default: openadapt-agents)
- Creates ML workspace (default: openadapt-ml)
- Writes config to .env file

Usage: uv run python -m openadapt_evals.benchmarks.cli azure-setup

Also improved azure command error message to guide users to run setup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
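A minimal sketch of the `az` CLI calls a setup like this wraps. The default resource names come from the commit message; the location and exact flags are assumptions:

```python
def azure_setup_commands(resource_group: str = "openadapt-agents",
                         workspace: str = "openadapt-ml",
                         location: str = "eastus") -> list[list[str]]:
    """Sketch: the two az CLI invocations behind azure-setup."""
    return [
        ["az", "group", "create", "--name", resource_group,
         "--location", location],
        ["az", "ml", "workspace", "create", "--name", workspace,
         "--resource-group", resource_group],
    ]
```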
The vanilla windowsarena/winarena:latest image does NOT work for unattended
WAA installation. This adds:

- `waa-image build` - Build custom waa-auto image locally
- `waa-image push` - Push to Docker Hub or ACR
- `waa-image build-push` - Build and push in one command
- `waa-image check` - Check if image exists in registry

Also updates azure.py to use openadaptai/waa-auto:latest as default image.

The custom Dockerfile (in waa_deploy/) includes:
- Modern dockurr/windows base (auto-downloads Windows 11)
- FirstLogonCommands patches for unattended installation
- Python 3.9 with transformers 4.46.2 (navi agent compatibility)
- api_agent.py for Claude/GPT support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add ECR as the default registry (ecr, dockerhub, acr options)
- Auto-create ECR repository if it doesn't exist
- Auto-login to ECR Public using AWS CLI
- Update azure.py to use public.ecr.aws/g3w3k7s5/waa-auto:latest as default
- Update docs with new default image

ECR Public is preferred because:
- No Docker Hub login required
- Uses existing AWS credentials
- Public access for Azure ML to pull without cross-cloud auth

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
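The ECR Public auto-login step can be sketched as a helper that builds the standard AWS CLI pipeline (ECR Public auth tokens are only issued from `us-east-1`, regardless of where the image is pulled):

```python
def ecr_public_login_command(region: str = "us-east-1") -> str:
    """Sketch: fetch an ECR Public token and pipe it to docker login."""
    return (
        f"aws ecr-public get-login-password --region {region} "
        "| docker login --username AWS --password-stdin public.ecr.aws"
    )
```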
The windowsarena/winarena base image is only available for linux/amd64.
This fixes builds on macOS (arm64) by explicitly specifying the target platform.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
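The platform pin can be sketched as a build-command helper; the tag and build-context path are illustrative:

```python
def build_image_command(tag: str = "openadaptai/waa-auto:latest",
                        context: str = "waa_deploy") -> list[str]:
    """Sketch: the windowsarena/winarena base image only publishes
    linux/amd64, so pin the target platform for arm64 (macOS) hosts."""
    return ["docker", "build", "--platform", "linux/amd64", "-t", tag, context]
```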
- Add `aws-costs` command to show AWS cost breakdown using Cost Explorer API
  - Shows current month costs (total and by service)
  - Shows historical monthly costs
  - Shows ECR storage costs specifically

- Add `waa-image delete` action to clean up registry resources
  - ECR: Deletes repository with --force
  - Docker Hub: Shows manual instructions (free tier)
  - ACR: Deletes repository

- Change default registry from ECR to Docker Hub
  - Docker Hub is free (no storage charges)
  - Use ECR when rate limiting becomes an issue

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
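The current-month query behind a command like `aws-costs` can be sketched as a request builder for Cost Explorer's `get_cost_and_usage` (a real boto3 API; the parameter choices below are assumptions matching the described default view):

```python
import datetime as dt


def month_to_date_cost_params(today: dt.date) -> dict:
    """Sketch: build a month-to-date get_cost_and_usage request,
    grouped by service."""
    start = today.replace(day=1)
    return {
        # Cost Explorer treats the End date as exclusive.
        "TimePeriod": {"Start": start.isoformat(), "End": today.isoformat()},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "DIMENSION", "Key": "SERVICE"}],
    }

# Usage (requires AWS credentials and boto3):
#   import boto3
#   resp = boto3.client("ce").get_cost_and_usage(
#       **month_to_date_cost_params(dt.date.today()))
```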
Automatically bumps version and creates tags on PR merge:
- feat: minor version bump
- fix/perf: patch version bump
- docs/style/refactor/test/chore/ci/build: patch version bump

Triggers publish.yml which deploys to PyPI.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
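The bump rules above can be sketched as a pure function over Conventional Commits subjects (the `"none"` fallback for unrecognized prefixes is an assumption; the commit message only lists the mapped types):

```python
def bump_for_commit(subject: str) -> str:
    """Sketch: map a Conventional Commits subject to a semver bump."""
    # Strip an optional scope, e.g. "feat(cli): ..." -> "feat".
    commit_type = subject.split(":", 1)[0].split("(", 1)[0].strip()
    if commit_type == "feat":
        return "minor"
    if commit_type in {"fix", "perf", "docs", "style", "refactor",
                       "test", "chore", "ci", "build"}:
        return "patch"
    return "none"  # assumption: unlisted prefixes do not bump
```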
Root cause: Azure ML compute instances don't have Docker installed.
Our code used SDK V2 command jobs, which run in a bare Python environment
and never call /entry_setup.sh to start QEMU/Windows.

Fix follows Microsoft's official WAA Azure pattern:
- Add azureml-core dependency (SDK V1)
- Use DockerConfiguration with NET_ADMIN capability for QEMU networking
- Create run_entry.py that calls /entry_setup.sh before running client
- Create compute-instance-startup.sh to stop conflicting services (DNS, nginx)
- Use ScriptRunConfig instead of raw command jobs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
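A sketch of the SDK V1 wiring, assuming the pattern described above: `DockerConfiguration`'s `arguments` parameter is how the `NET_ADMIN` capability reaches the container. The submission code is shown as comments because it requires `azureml-core` and a live workspace; the experiment name is illustrative.

```python
def qemu_docker_arguments() -> list[str]:
    """Sketch: extra docker-run arguments the WAA job needs.
    QEMU networking inside the container requires NET_ADMIN."""
    return ["--cap-add=NET_ADMIN"]

# Wiring (requires azureml-core; script names follow the commit message):
#
#   from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
#   from azureml.core.runconfig import DockerConfiguration
#
#   env = Environment("waa")
#   env.docker.base_image = "openadaptai/waa-auto:latest"
#   env.python.user_managed_dependencies = True
#   src = ScriptRunConfig(
#       source_directory=".",
#       script="run_entry.py",  # calls /entry_setup.sh before the client
#       environment=env,
#       docker_runtime_config=DockerConfiguration(
#           use_docker=True, arguments=qemu_docker_arguments()),
#   )
#   Experiment(Workspace.from_config(), "waa-eval").submit(src)
```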