@abrichr abrichr commented Jan 29, 2026

Summary

Completes the Azure ML parallelization implementation for WAA evaluation, enabling parallel execution across multiple Azure VMs.

Key changes:

1. Agent config serialization (`_serialize_agent_config`):
   - Extracts provider, model, and API keys from the agent
   - Passes `OPENAI_API_KEY`/`ANTHROPIC_API_KEY` via environment variables
   - Supports OpenAI and Anthropic agents
2. Worker command building (`_build_worker_command`):
   - Uses vanilla WAA `run.py` with `--worker_id` and `--num_workers`
   - Matches Microsoft's official Azure deployment pattern
   - Task distribution is handled by WAA internally (not our code)
3. Result fetching (`_fetch_worker_results`, `_parse_waa_results`):
   - Downloads job outputs via the Azure ML SDK (`client.jobs.download()`)
   - Parses WAA `result.txt` files (0.0 or 1.0 score)
   - Handles partial results from failed jobs gracefully
4. Job status tracking:
   - Added `job_name` field to the `WorkerState` dataclass
   - Updated `_wait_and_collect_results` to poll job status (not compute status)
   - Fixed a bug: we were checking compute instance status instead of Azure ML job status
5. Log fetching (`get_job_logs` in `AzureMLClient`):
   - Downloads logs via `az ml job download`
   - Supports a `tail` parameter for the last N lines
   - Updated `health_checker` to use the new method instead of returning an empty string
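The result-parsing step can be sketched as follows. The function name mirrors `_parse_waa_results` from the list above, but the directory layout is an assumption based on the description: each task directory holds a `result.txt` containing a single score, 0.0 (fail) or 1.0 (success).

```python
from pathlib import Path


def parse_waa_results(output_dir: str) -> dict[str, float]:
    """Sketch: collect per-task WAA scores from downloaded job outputs."""
    scores: dict[str, float] = {}
    for result_file in Path(output_dir).rglob("result.txt"):
        try:
            scores[result_file.parent.name] = float(result_file.read_text().strip())
        except ValueError:
            # Malformed file from a failed job: skip it, keep partial results.
            continue
    return scores
```

Skipping unparseable files (rather than raising) is what lets a partially failed job still contribute the scores it did produce.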

Design decisions:

- Uses vanilla `windowsarena/winarena:latest` with `VERSION=11e` (no custom Dockerfile)
- Follows Microsoft's official Azure deployment pattern from `run_azure.py`
- Task distribution delegated to WAA's built-in `--worker_id` / `--num_workers` mechanism
- Minimal changes to the existing code structure
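The delegated task distribution can be illustrated with a minimal sketch of the worker invocation. The `--worker_id`/`--num_workers` flags are WAA's own; the agent argument is illustrative:

```python
def build_worker_command(worker_id: int, num_workers: int,
                         agent: str = "navi") -> list[str]:
    """Sketch of the vanilla-WAA worker invocation. WAA's run.py uses
    worker_id/num_workers to pick its own slice of the task list, so the
    caller only numbers the workers."""
    return [
        "python", "run.py",
        "--agent_name", agent,
        "--worker_id", str(worker_id),
        "--num_workers", str(num_workers),
    ]
```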

Test plan

- All 188 existing tests pass
- Import verification successful
- Ruff linting passes (only style warnings for f-strings)
- Manual test with an actual Azure deployment (requires Azure quota)

Generated with Claude Code

abrichr and others added 5 commits January 29, 2026 13:43
- Remove unvalidated badges (95%+ success rate, 67% cost savings)
- Add "First open-source WAA reproduction" as headline
- Move WAA to top as main feature with status indicator
- Change "Recent Improvements" to "Roadmap (In Progress)"
- Remove v0.2.0 version references (current is v0.1.1)
- Add Azure quota requirements note for parallelization
- Mark features as [IN PROGRESS] where appropriate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
WAA is already open-source from Microsoft. Changed to accurate claim:
"Simplified CLI toolkit for Windows Agent Arena"

Updated value proposition to reflect what we actually provide:
- Azure VM setup and SSH tunnel management
- Agent adapters for Claude/GPT/custom agents
- Results viewer
- Parallelization support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The code uses Standard_D4s_v5 (4 vCPUs) by default, not D8ds_v5.
Updated all references to be accurate.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
abrichr and others added 7 commits January 29, 2026 14:18
New `azure-setup` command that:
- Checks Azure CLI installation and login status
- Creates resource group (default: openadapt-agents)
- Creates ML workspace (default: openadapt-ml)
- Writes config to .env file

Usage: uv run python -m openadapt_evals.benchmarks.cli azure-setup

Also improved azure command error message to guide users to run setup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
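A minimal sketch of the `az` CLI calls a setup like this wraps. The default resource names come from the commit message; the location and exact flags are assumptions:

```python
def azure_setup_commands(resource_group: str = "openadapt-agents",
                         workspace: str = "openadapt-ml",
                         location: str = "eastus") -> list[list[str]]:
    """Sketch: the two az CLI invocations behind azure-setup."""
    return [
        ["az", "group", "create", "--name", resource_group,
         "--location", location],
        ["az", "ml", "workspace", "create", "--name", workspace,
         "--resource-group", resource_group],
    ]
```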
The vanilla windowsarena/winarena:latest image does NOT work for unattended
WAA installation. This adds:

- `waa-image build` - Build custom waa-auto image locally
- `waa-image push` - Push to Docker Hub or ACR
- `waa-image build-push` - Build and push in one command
- `waa-image check` - Check if image exists in registry

Also updates azure.py to use openadaptai/waa-auto:latest as default image.

The custom Dockerfile (in waa_deploy/) includes:
- Modern dockurr/windows base (auto-downloads Windows 11)
- FirstLogonCommands patches for unattended installation
- Python 3.9 with transformers 4.46.2 (navi agent compatibility)
- api_agent.py for Claude/GPT support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add ECR as the default registry (ecr, dockerhub, acr options)
- Auto-create ECR repository if it doesn't exist
- Auto-login to ECR Public using AWS CLI
- Update azure.py to use public.ecr.aws/g3w3k7s5/waa-auto:latest as default
- Update docs with new default image

ECR Public is preferred because:
- No Docker Hub login required
- Uses existing AWS credentials
- Public access for Azure ML to pull without cross-cloud auth

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
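The ECR Public auto-login step can be sketched as a helper that builds the standard AWS CLI pipeline (ECR Public auth tokens are only issued from `us-east-1`, regardless of where the image is pulled):

```python
def ecr_public_login_command(region: str = "us-east-1") -> str:
    """Sketch: fetch an ECR Public token and pipe it to docker login."""
    return (
        f"aws ecr-public get-login-password --region {region} "
        "| docker login --username AWS --password-stdin public.ecr.aws"
    )
```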
The windowsarena/winarena base image is only available for linux/amd64.
This fixes builds on macOS (arm64) by explicitly specifying the target platform.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
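The platform pin can be sketched as a build-command helper; the tag and build-context path are illustrative:

```python
def build_image_command(tag: str = "openadaptai/waa-auto:latest",
                        context: str = "waa_deploy") -> list[str]:
    """Sketch: the windowsarena/winarena base image only publishes
    linux/amd64, so pin the target platform for arm64 (macOS) hosts."""
    return ["docker", "build", "--platform", "linux/amd64", "-t", tag, context]
```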
- Add `aws-costs` command to show AWS cost breakdown using Cost Explorer API
  - Shows current month costs (total and by service)
  - Shows historical monthly costs
  - Shows ECR storage costs specifically

- Add `waa-image delete` action to clean up registry resources
  - ECR: Deletes repository with --force
  - Docker Hub: Shows manual instructions (free tier)
  - ACR: Deletes repository

- Change default registry from ECR to Docker Hub
  - Docker Hub is free (no storage charges)
  - Use ECR when rate limiting becomes an issue

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
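The current-month query behind a command like `aws-costs` can be sketched as a request builder for Cost Explorer's `get_cost_and_usage` (a real boto3 API; the parameter choices below are assumptions matching the described default view):

```python
import datetime as dt


def month_to_date_cost_params(today: dt.date) -> dict:
    """Sketch: build a month-to-date get_cost_and_usage request,
    grouped by service."""
    start = today.replace(day=1)
    return {
        # Cost Explorer treats the End date as exclusive.
        "TimePeriod": {"Start": start.isoformat(), "End": today.isoformat()},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "DIMENSION", "Key": "SERVICE"}],
    }

# Usage (requires AWS credentials and boto3):
#   import boto3
#   resp = boto3.client("ce").get_cost_and_usage(
#       **month_to_date_cost_params(dt.date.today()))
```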
Automatically bumps version and creates tags on PR merge:
- feat: minor version bump
- fix/perf: patch version bump
- docs/style/refactor/test/chore/ci/build: patch version bump

Triggers publish.yml which deploys to PyPI.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
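The bump rules above can be sketched as a pure function over Conventional Commits subjects (the `"none"` fallback for unrecognized prefixes is an assumption; the commit message only lists the mapped types):

```python
def bump_for_commit(subject: str) -> str:
    """Sketch: map a Conventional Commits subject to a semver bump."""
    # Strip an optional scope, e.g. "feat(cli): ..." -> "feat".
    commit_type = subject.split(":", 1)[0].split("(", 1)[0].strip()
    if commit_type == "feat":
        return "minor"
    if commit_type in {"fix", "perf", "docs", "style", "refactor",
                       "test", "chore", "ci", "build"}:
        return "patch"
    return "none"  # assumption: unlisted prefixes do not bump
```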
Root cause: Azure ML compute instances don't have Docker installed.
Our code used SDK V2 command jobs, which run in a bare Python environment
and never call /entry_setup.sh to start QEMU/Windows.

Fix follows Microsoft's official WAA Azure pattern:
- Add azureml-core dependency (SDK V1)
- Use DockerConfiguration with NET_ADMIN capability for QEMU networking
- Create run_entry.py that calls /entry_setup.sh before running client
- Create compute-instance-startup.sh to stop conflicting services (DNS, nginx)
- Use ScriptRunConfig instead of raw command jobs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
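A sketch of the SDK V1 wiring, assuming the pattern described above: `DockerConfiguration`'s `arguments` parameter is how the `NET_ADMIN` capability reaches the container. The submission code is shown as comments because it requires `azureml-core` and a live workspace; the experiment name is illustrative.

```python
def qemu_docker_arguments() -> list[str]:
    """Sketch: extra docker-run arguments the WAA job needs.
    QEMU networking inside the container requires NET_ADMIN."""
    return ["--cap-add=NET_ADMIN"]

# Wiring (requires azureml-core; script names follow the commit message):
#
#   from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
#   from azureml.core.runconfig import DockerConfiguration
#
#   env = Environment("waa")
#   env.docker.base_image = "openadaptai/waa-auto:latest"
#   env.python.user_managed_dependencies = True
#   src = ScriptRunConfig(
#       source_directory=".",
#       script="run_entry.py",  # calls /entry_setup.sh before the client
#       environment=env,
#       docker_runtime_config=DockerConfiguration(
#           use_docker=True, arguments=qemu_docker_arguments()),
#   )
#   Experiment(Workspace.from_config(), "waa-eval").submit(src)
```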