From 04ffbea8071328d084aa873114c1a0983fc24ddb Mon Sep 17 00:00:00 2001
From: Calvin Smith
Date: Fri, 29 May 2026 11:44:51 -0600
Subject: [PATCH] docs: add A/B testing reference for plugin automations

Add references/ab-testing.md covering variant definitions, weighted
selection, experiment config, observability tags, validation rules,
and examples. Link from SKILL.md Reference Files section and the
Choosing the Right Preset table.

Relates to OpenHands/automation#146 and OpenHands/automation#147.

Co-authored-by: openhands
---
 skills/openhands-automation/SKILL.md          |   2 +
 .../references/ab-testing.md                  | 185 ++++++++++++++++++
 2 files changed, 187 insertions(+)
 create mode 100644 skills/openhands-automation/references/ab-testing.md

diff --git a/skills/openhands-automation/SKILL.md b/skills/openhands-automation/SKILL.md
index a255219..5f1d32c 100644
--- a/skills/openhands-automation/SKILL.md
+++ b/skills/openhands-automation/SKILL.md
@@ -850,6 +850,7 @@ Pick based on **what the task needs**, not just **what is technically possible**
 |----------|-------------|
 | Reasoning, summarization, triage, code review, or open-ended tool use | **Prompt Preset** |
 | Needs plugin commands / skills / MCP configs / hooks | **Plugin Preset** |
+| Compare plugin versions or configurations across runs | **Plugin Preset with A/B testing** — see `references/ab-testing.md` |
 | **Deterministic task** (fixed data + scheduled action, e.g. healthcheck, Slack notification, rotating from a known list) — especially if it runs frequently | **Custom script, no LLM** — see `references/custom-automation.md#deterministic-script-no-llm` |
 | Custom Python dependencies, multi-file project, or direct SDK lifecycle control | **Custom script with SDK** — see `references/custom-automation.md#sdk-based-scripts` |

@@ -862,3 +863,4 @@ The **prompt preset** is the right default for genuinely agent-shaped work — a
 ## Reference Files

 - **`references/custom-automation.md`** — Detailed guide for custom automations: tarball uploads, code structure (SDK and no-LLM), environment variables, validation rules, and complete examples. Consult this whenever you need to evaluate or recommend the custom path (including for deterministic / cost-sensitive tasks per rule 0). Only *implement* a custom automation after the user agrees to that path.
+- **`references/ab-testing.md`** — A/B testing for plugin automations: defining variants with weights, experiment configuration, variant selection logic, observability via conversation tags, and complete examples. Consult this when a user wants to compare plugin versions or configurations.
diff --git a/skills/openhands-automation/references/ab-testing.md b/skills/openhands-automation/references/ab-testing.md
new file mode 100644
index 0000000..210148c
--- /dev/null
+++ b/skills/openhands-automation/references/ab-testing.md
@@ -0,0 +1,185 @@
+# A/B Testing for Plugin Automations
+
+Run A/B tests on plugin automations by defining **variants** — each with its own plugin set and selection weight — instead of a single `plugins` list. The automation service generates a tarball with all variant configs; at runtime, the SDK script selects a variant via weighted random and loads its plugins.
+
+> **Scope:** A/B testing is currently supported on the **plugin preset** endpoint only (`POST /v1/preset/plugin`). See [OpenHands/automation#147](https://github.com/OpenHands/automation/issues/147) for the roadmap to server-level variant support across all automation types.
+
+---
+
+## Quick Start
+
+Replace `plugins` with `variants` and add an `experiment_id`:
+
+```bash
+curl -X POST "${OPENHANDS_HOST}/api/automation/v1/preset/plugin" \
+  -H "Authorization: Bearer ${OPENHANDS_API_KEY}" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "Code Review A/B Test",
+    "experiment_id": "review-model-comparison",
+    "variants": [
+      {
+        "name": "control",
+        "weight": 50,
+        "plugins": [{"source": "github:owner/review-plugin", "ref": "v1.0.0"}]
+      },
+      {
+        "name": "treatment",
+        "weight": 50,
+        "plugins": [{"source": "github:owner/review-plugin", "ref": "v2.0.0-beta"}]
+      }
+    ],
+    "prompt": "Review this pull request for code quality and potential bugs.",
+    "trigger": {
+      "type": "event",
+      "source": "github",
+      "on": "pull_request.opened"
+    }
+  }'
+```
+
+## How It Works
+
+1. **At creation time**, the service generates a single tarball containing an `experiment_config.json` with all variant definitions (names, weights, plugin configs) alongside the SDK entrypoint and prompt.
+
+2. **At runtime**, `sdk_main.py` reads `experiment_config.json`, performs weighted-random selection across variants, and loads the selected variant's plugins.
+
+3. **Experiment metadata** (`experiment_id` and `variant` name) is attached as conversation tags, allowing you to filter and compare runs by variant in the UI.
+
+## API Reference
+
+### Request Fields
+
+`plugins` and `variants` are **mutually exclusive** — provide exactly one.
+
+| Field | Required | Description |
+|-------|----------|-------------|
+| `variants` | Yes* | List of experiment variants (2–10). Replaces `plugins`. |
+| `experiment_id` | Yes* | Human-readable experiment identifier (1–200 chars). Required when using `variants`. |
+
+*Required only for A/B tests. Standard plugin automations use `plugins` instead.
+
+All other fields (`name`, `prompt`, `trigger`, `timeout`, `repos`, `model`) are identical to the standard plugin preset request.
+
+### Variant Object
+
+| Field | Required | Type | Description |
+|-------|----------|------|-------------|
+| `name` | Yes | string | Unique variant name (1–100 chars) |
+| `weight` | Yes | integer | Relative selection weight (> 0) |
+| `plugins` | Yes | array | Plugin source(s) for this variant (at least one) |
+
+### Validation Rules
+
+- Exactly **one of** `plugins` or `variants` must be provided (not both, not neither)
+- `experiment_id` is **required** with `variants`, **forbidden** with `plugins`
+- At least **2** variants, at most **10**
+- Variant **names must be unique** within an experiment
+- Each variant must have at least **1 plugin**
+- Weights are relative — `[50, 50]` and `[1, 1]` both give 50/50 selection
+
+## Variant Selection
+
+Selection uses Python's `random.choices` with the configured weights. The probability of selecting variant *i* is:
+
+```
+P(variant_i) = weight_i / sum(all_weights)
+```
+
+Examples:
+- `[50, 50]` → 50% / 50%
+- `[80, 20]` → 80% / 20%
+- `[1, 1, 1]` → 33.3% each
+- `[70, 20, 10]` → 70% / 20% / 10%
+
+Selection happens independently on every run — there is no cross-run state or session stickiness.
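The variant-level validation rules and the selection formula above can be illustrated with a short sketch. This is not the service's or `sdk_main.py`'s actual code: the function names `validate_variants`, `selection_probabilities`, and `select_variant` are hypothetical, and only the use of `random.choices` and the rules themselves come from this document.

```python
import random
from collections import Counter


def validate_variants(variants):
    """Check the variant-level rules listed above; raise ValueError on failure."""
    if not 2 <= len(variants) <= 10:
        raise ValueError("an experiment needs between 2 and 10 variants")
    names = [v["name"] for v in variants]
    if len(set(names)) != len(names):
        raise ValueError("variant names must be unique")
    for v in variants:
        if not isinstance(v["weight"], int) or v["weight"] <= 0:
            raise ValueError(f"variant {v['name']!r}: weight must be a positive integer")
        if not v["plugins"]:
            raise ValueError(f"variant {v['name']!r}: at least one plugin is required")


def selection_probabilities(variants):
    """P(variant_i) = weight_i / sum(all_weights)."""
    total = sum(v["weight"] for v in variants)
    return {v["name"]: v["weight"] / total for v in variants}


def select_variant(variants, rng=random):
    """Pick one variant by weighted random choice; no state is kept between calls."""
    weights = [v["weight"] for v in variants]
    return rng.choices(variants, weights=weights, k=1)[0]


variants = [
    {"name": "stable", "weight": 80,
     "plugins": [{"source": "github:myorg/my-plugin", "ref": "v1.4.2"}]},
    {"name": "canary", "weight": 20,
     "plugins": [{"source": "github:myorg/my-plugin", "ref": "v2.0.0-rc1"}]},
]
validate_variants(variants)
print(selection_probabilities(variants))  # {'stable': 0.8, 'canary': 0.2}

# Simulate many independent runs with a seeded generator (illustration only).
rng = random.Random(0)
counts = Counter(select_variant(variants, rng)["name"] for _ in range(10_000))
print(counts)
```

Because each call is independent, the observed split only approaches 80/20 over many runs; a handful of runs can deviate noticeably from the configured weights.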
+
+## Examples
+
+### Compare two plugin versions
+
+```json
+{
+  "name": "Plugin v2 Rollout",
+  "experiment_id": "plugin-v2-rollout",
+  "variants": [
+    {
+      "name": "stable",
+      "weight": 80,
+      "plugins": [{"source": "github:myorg/my-plugin", "ref": "v1.4.2"}]
+    },
+    {
+      "name": "canary",
+      "weight": 20,
+      "plugins": [{"source": "github:myorg/my-plugin", "ref": "v2.0.0-rc1"}]
+    }
+  ],
+  "prompt": "Run the standard analysis workflow.",
+  "trigger": {"type": "cron", "schedule": "0 9 * * 1-5"}
+}
+```
+
+### Test different plugin combinations
+
+```json
+{
+  "name": "Review Pipeline Experiment",
+  "experiment_id": "review-pipeline-2026",
+  "variants": [
+    {
+      "name": "basic",
+      "weight": 1,
+      "plugins": [{"source": "github:myorg/code-review"}]
+    },
+    {
+      "name": "enhanced",
+      "weight": 1,
+      "plugins": [
+        {"source": "github:myorg/code-review"},
+        {"source": "github:myorg/security-scanner"}
+      ]
+    }
+  ],
+  "prompt": "Review the PR and report findings.",
+  "trigger": {
+    "type": "event",
+    "source": "github",
+    "on": "pull_request.opened",
+    "filter": "contains(pull_request.labels[].name, 'needs-review')"
+  }
+}
+```
+
+### Three-way comparison
+
+```json
+{
+  "name": "Scanner Comparison",
+  "experiment_id": "scanner-eval-q3",
+  "variants": [
+    {"name": "scanner-a", "weight": 1, "plugins": [{"source": "github:myorg/scanner-a"}]},
+    {"name": "scanner-b", "weight": 1, "plugins": [{"source": "github:myorg/scanner-b"}]},
+    {"name": "scanner-c", "weight": 1, "plugins": [{"source": "github:myorg/scanner-c"}]}
+  ],
+  "prompt": "Scan the repository and produce a findings report.",
+  "trigger": {"type": "cron", "schedule": "0 2 * * 0"}
+}
+```
+
+## Observability
+
+Each experiment run tags the conversation with:
+
+| Tag | Value |
+|-----|-------|
+| `experiment_id` | The `experiment_id` from the request |
+| `variant` | The name of the selected variant |
+
+Use these tags to filter runs in the OpenHands UI and compare outcomes across variants.
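Once run data has been exported, the two tags above are enough to group runs by variant. The sketch below assumes a hypothetical export shape, a list of dicts with a `tags` mapping and a boolean `succeeded` field; only the tag names `experiment_id` and `variant` come from this document, and `summarize_by_variant` is an illustrative helper, not a platform API.

```python
from collections import defaultdict


def summarize_by_variant(runs, experiment_id):
    """Count runs and successes per variant for one experiment."""
    totals = defaultdict(lambda: {"runs": 0, "succeeded": 0})
    for run in runs:
        tags = run.get("tags", {})
        if tags.get("experiment_id") != experiment_id:
            continue  # skip runs from other experiments or untagged runs
        bucket = totals[tags["variant"]]
        bucket["runs"] += 1
        bucket["succeeded"] += int(run["succeeded"])
    return {
        name: {**b, "success_rate": b["succeeded"] / b["runs"]}
        for name, b in totals.items()
    }


# Hypothetical exported records; the field layout is an assumption.
runs = [
    {"tags": {"experiment_id": "plugin-v2-rollout", "variant": "stable"}, "succeeded": True},
    {"tags": {"experiment_id": "plugin-v2-rollout", "variant": "stable"}, "succeeded": True},
    {"tags": {"experiment_id": "plugin-v2-rollout", "variant": "stable"}, "succeeded": False},
    {"tags": {"experiment_id": "plugin-v2-rollout", "variant": "canary"}, "succeeded": True},
    {"tags": {"experiment_id": "plugin-v2-rollout", "variant": "canary"}, "succeeded": False},
    {"tags": {"experiment_id": "scanner-eval-q3", "variant": "scanner-a"}, "succeeded": True},
]

summary = summarize_by_variant(runs, "plugin-v2-rollout")
for name, stats in summary.items():
    print(name, stats)
```

What counts as "succeeded" is left to the experiment owner; the same grouping works for any per-run metric, such as duration or number of findings.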
+
+## Limitations
+
+- **Plugin preset only** — A/B testing is not yet supported for prompt presets or custom automations. See [#147](https://github.com/OpenHands/automation/issues/147) for the server-level variant selection roadmap.
+- **No session stickiness** — each run independently selects a variant. There is no user- or session-based assignment.
+- **No built-in metrics** — the platform records which variant ran (via tags) but does not compute statistical significance. Export run data for external analysis.
+- **Single prompt** — all variants share the same prompt. To test different prompts, use separate automations.
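Since the platform does not compute statistical significance, one option for the external analysis mentioned above is a two-proportion z-test on per-variant success counts. This is a generic statistical sketch using only the standard library, not something the platform provides; the function name `two_proportion_z_test` and the example counts are hypothetical.

```python
import math


def two_proportion_z_test(successes_a, runs_a, successes_b, runs_b):
    """Return (z, two-sided p-value) for the difference between two success rates."""
    rate_a = successes_a / runs_a
    rate_b = successes_b / runs_b
    # Pooled success rate under the null hypothesis that both variants behave the same.
    pooled = (successes_a + successes_b) / (runs_a + runs_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / runs_a + 1 / runs_b))
    if std_err == 0:
        # All runs succeeded or all failed in both variants: no detectable difference.
        return 0.0, 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 80 of 100 "control" runs succeeded vs 60 of 100 "treatment" runs.
z, p = two_proportion_z_test(80, 100, 60, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```

The normal approximation behind this test is only reasonable with enough runs per variant; for small experiments an exact test is a safer choice, and with unequal weights the lower-weight variant takes longer to accumulate enough runs.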