feat: A/B testing support for plugin automations by csmith49 · Pull Request #146 · OpenHands/automation

csmith49 · 2026-05-29T14:42:31Z

Summary

Add A/B testing (experiment variants) to the plugin preset endpoint. Users can now provide variants instead of plugins to create automations that randomly select a plugin configuration per run and tag the resulting conversation with experiment metadata.

No database migrations. No SDK changes. No new endpoints. No new dependencies.

How it works

API: same endpoint, new field

POST /v1/preset/plugin now accepts variants as an alternative to plugins (mutually exclusive):

{
  "name": "PR Review A/B Test",
  "experiment_id": "pr-review-v2-test",
  "variants": [
    {
      "name": "control",
      "weight": 70,
      "plugins": [{"source": "github:OpenHands/extensions", "repo_path": "plugins/pr-review", "ref": "v1.0.0"}]
    },
    {
      "name": "treatment",
      "weight": 30,
      "plugins": [{"source": "github:OpenHands/extensions", "repo_path": "plugins/pr-review", "ref": "v2.0.0"}]
    }
  ],
  "prompt": "Review this PR for code quality and potential bugs.",
  "trigger": {"type": "event", "source": "github", "on": "pull_request.opened"}
}

Tarball: one extra file

When variants is used, the tarball contains experiment_config.json instead of plugins_config.json. Standard plugin automations are unchanged.

Runtime: variant selection + tagging

sdk_main.py detects experiment_config.json, selects a variant via weighted random choice, and passes experiment_id + variant tags to Conversation(). These merge with existing workspace/plugin auto-tags.

Validation

Exactly one of plugins or variants required
experiment_id required with variants
2–10 variants, unique names, positive weights
experiment_id rejected when using plugins

Changes

File	Change
`preset_router.py`	`ExperimentVariant` model, `variants`/`experiment_id` fields, mutual exclusivity validator, tarball generation
`sdk_main.py`	Experiment config detection, weighted variant selection, conversation tagging
`test_preset_router.py`	12 new unit tests (validation + tarball generation)
`test_ab_testing_integration.py`	15 new integration tests (SQLite-based, no Docker needed)

Test results

90 total | 64 passed | 0 failed | 26 skipped (pre-existing Docker-gated tests)

This PR was created by an AI agent (OpenHands) on behalf of csmith49.

Extend the plugin preset endpoint to support experiment variants. Users can now provide 'variants' (mutually exclusive with 'plugins') to create A/B tests that randomly select a plugin configuration per run and tag the resulting conversation with experiment metadata. Changes: - preset_router.py: ExperimentVariant model, variants/experiment_id fields on CreatePluginAutomationRequest with mutual exclusivity validation, experiment_config.json tarball generation - sdk_main.py: experiment config detection, weighted variant selection via random.choices, experiment tags on Conversation() - tests: 12 new tests covering validation edge cases and tarball generation for both experiment and standard paths No database migrations. No SDK changes. No new endpoints. Co-authored-by: openhands <openhands@all-hands.dev>

End-to-end tests that run without Docker by using SQLite + mock file store. Covers: - API creation flow (201 response, tarball contents, experiment_config.json) - Runtime variant selection logic (weight distribution, determinism) - Backward compatibility (standard plugin automations unchanged) - Validation errors via API (422 responses for invalid requests) 15 tests, all passing. Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-05-29T14:44:01Z

🚀 Deploy Preview PR Created/Updated

A deploy preview has been created/updated for this PR.

Deploy PR: https://github.com/OpenHands/deploy/pull/4474
Automation SHA: d44c383edd58ebbae6a74531a6904eea282dc3e3
Last updated: May 29, 2026, 01:02:39 PM ET

Once the deploy PR's CI passes, the automation service will be deployed to the feature environment.

github-actions · 2026-05-29T14:44:01Z

- Break long lines in preset_router.py (E501) - Add ClassVar annotations to test class attributes (RUF012) - Apply ruff format to both test files Co-authored-by: openhands <openhands@all-hands.dev>

Replace type: ignore comments with proper None-checks so pyright can narrow list[ExperimentVariant] | None to list[ExperimentVariant]. Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-05-29T16:06:26Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Review: A/B Testing Support for Plugin Automations

Solid implementation overall. The mutual-exclusivity design is clear, validation coverage is thorough, and backward-compatibility for standard plugin automations is preserved. I found a few places where the type-safety story can be tightened up without adding complexity, plus a misleading docstring in the test helpers worth fixing.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

- Drop intermediate bool vars (has_variants/is_experiment), use direct 'is not None' checks so pyright narrows without redundant conditions - Replace 'type: ignore' comments with explicit asserts in _generate_plugin_tarball and sdk_main.py experiment_tags block - Fix misleading docstring on _simulate_variant_selection test helper (seeded RNG ≠ 'same logic' as production) Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-05-29T16:21:01Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review — A/B Testing Support for Plugin Automations

Overall this is a well-scoped, clean feature addition. The mutual-exclusivity validation is robust, the tarball branching is clear, and the 27 new tests cover both the happy path and all validation edge cases. The backward-compatibility guarantee (standard plugin automations untouched) is explicitly tested, which is exactly the right instinct for a feature that modifies shared infrastructure.

A few issues are worth addressing before merge:

🔴 Bug: `ExperimentVariant.plugins` accepts an empty list

The plugins field on ExperimentVariant uses Field(...) (required, no length constraint). An empty [] passes Pydantic validation, produces a variant entry in experiment_config.json with "plugins": [], and when that variant is selected at runtime the agent starts with zero plugins loaded — silently, with no error. This should be caught at API ingestion time.

Fix: Add min_length=1 to the field, matching the spirit of the existing weight: int = Field(..., gt=0) constraint:

plugins: list[PluginSource] = Field(..., min_length=1, description="Plugin(s) for this variant.")

🟡 Nit: Unnecessary `f`-prefix on a non-interpolated string literal (`sdk_main.py`)

print(f"\n=== EXPERIMENT ===")   # no {} expressions

This is a plain string literal with an f-prefix but no interpolation. Static linters will flag it and it may confuse readers into looking for a missing variable.

Fix:

print("\n=== EXPERIMENT ===")

🟠 Risk: `assert` in production runtime for invariant enforcement (`sdk_main.py`)

assert selected_variant is not None  # set when experiment_id is set

This is logically correct — if experiment_id is set, selected_variant will always be set. However, assert statements are stripped when Python is run with the -O (optimize) flag, and an AssertionError produces a less informative traceback than an explicit exception. Since this is a shipped runtime script embedded in tarballs, consider raising a clear RuntimeError if the invariant is ever violated:

if selected_variant is None:
    raise RuntimeError(
        f"BUG: experiment_id is set but selected_variant is None — "
        f"experiment config may be malformed."
    )

🟡 Suggestion: Variant selection is non-reproducible from logs alone

sdk_main.py uses the global unseeded random.choices, which is correct for production. However, there is currently no way to reproduce which variant was selected for a specific run from logs alone — only the conversation tags carry that information. If post-hoc audit or debugging is needed, consider emitting the variant selection at a structured log level (not just print) so it survives in the log aggregator.

Minor: `experiment_id` allows arbitrary characters

The experiment_id field is validated for length (1–200 chars) but not for character set. Since it ends up as a conversation tag value, characters like =, :, or Unicode may behave unexpectedly depending on how the tagging system processes them. Consider documenting or enforcing a safe character set (e.g., [a-zA-Z0-9._-]), or at least note the expectation in the field description.

✅ What's working well

The validate_plugins_or_variants condition (self.plugins is None) == (self.variants is None) is idiomatic and handles both "both provided" and "neither provided" in a single expression — clean.
_generate_plugin_tarball is cleanly branched with no shared mutable state between the two code paths.
The TestBackwardCompatibility class explicitly verifies that standard plugin automations produce plugins_config.json and no experiment_config.json — exactly the regression test this feature needs.
The assert body.plugins is not None and assert plugins is not None in service-side internal helpers (post-validator) strike the right balance: readable, self-documenting, fast-failing in development. The concern above about assert applies specifically to sdk_main.py since it's a generated artifact shipped to external environments.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

… safety - Add min_length=1 to ExperimentVariant.plugins to reject empty lists - Remove extraneous f-prefix on non-interpolated string in sdk_main.py - Replace assert with explicit RuntimeError guard in sdk_main.py since generated code runs in sandboxes where python -O could strip asserts Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-05-29T16:32:34Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review — A/B Testing Support for Plugin Automations

All feedback from the two prior review rounds has been fully addressed: the redundant bool intermediates are gone, type narrowing is explicit via assertions (not # type: ignore), min_length=1 closes the empty-plugins bug, the f-prefix nit is fixed, and the assert in the generated tarball script is replaced by a proper RuntimeError guard. The PR is in good shape.

Two remaining suggestions below; both are minor.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

- Wrap experiment description in _safe_truncate(..., 200) for consistency with the plugin description path - Add test_max_variants_accepted to close off-by-one gap Co-authored-by: openhands <openhands@all-hands.dev>

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-05-29T16:53:12Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review — A/B Testing Support for Plugin Automations

This is a clean, well-scoped feature. Three rounds of review feedback have all been addressed: the empty-plugins bug is closed with min_length=1, the assert in the generated tarball script is replaced by a proper RuntimeError guard, the intermediate bool variables are gone, type narrowing is explicit, and the test docstring accurately describes the RNG divergence from production. Backward compatibility is explicitly tested. The implementation is merge-ready.

One minor suggestion below — genuinely optional.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

…stency Co-authored-by: openhands <openhands@all-hands.dev>

malhotra5

LGTM!

This is great to start with, but maybe not as flexible for a/b testing arbitrary sdk code

An alternative would be to support variants on the automation definition level between multiple tarballs. Then, for the plugin endpoint, we'd auto-generate multiple tarball variants and specify those as part of the automation. We can even track the experiment metadata as part of the automation run.

The downside, however, is that it requires a migration and the variant selection logic needs to be orchestrated by the automation server rather than within the sdk script (but I'm not so concerned about it, and avoids having to re-implement variation selection for users that want to run a/b tests for non-preset automations)

csmith49 and others added 2 commits May 28, 2026 10:35

csmith49 and others added 3 commits May 29, 2026 08:49

style: fix ruff format + RUF012 lint errors

a161616

- Break long lines in preset_router.py (E501) - Add ClassVar annotations to test class attributes (RUF012) - Apply ruff format to both test files Co-authored-by: openhands <openhands@all-hands.dev>

fix: resolve pyright type-narrowing errors for Optional variants/plugins

50d8a26

Replace type: ignore comments with proper None-checks so pyright can narrow list[ExperimentVariant] | None to list[ExperimentVariant]. Co-authored-by: openhands <openhands@all-hands.dev>

Merge branch 'main' into feat/ab-testing-plugin-variants

d18fc4a

csmith49 requested a review from all-hands-bot May 29, 2026 16:05

all-hands-bot reviewed May 29, 2026

View reviewed changes

Comment thread openhands/automation/preset_router.py Outdated

Comment thread openhands/automation/preset_router.py Outdated

Comment thread openhands/automation/presets/plugin/sdk_main.py

Comment thread tests/test_ab_testing_integration.py

csmith49 requested a review from all-hands-bot May 29, 2026 16:19

all-hands-bot reviewed May 29, 2026

View reviewed changes

Comment thread openhands/automation/preset_router.py

Comment thread openhands/automation/presets/plugin/sdk_main.py Outdated

Comment thread openhands/automation/presets/plugin/sdk_main.py Outdated

csmith49 requested a review from all-hands-bot May 29, 2026 16:31

all-hands-bot reviewed May 29, 2026

View reviewed changes

Comment thread openhands/automation/preset_router.py Outdated

Comment thread tests/test_preset_router.py

csmith49 and others added 2 commits May 29, 2026 10:43

fix: truncate experiment description, add MAX_VARIANTS boundary test

2baf544

- Wrap experiment description in _safe_truncate(..., 200) for consistency with the plugin description path - Add test_max_variants_accepted to close off-by-one gap Co-authored-by: openhands <openhands@all-hands.dev>

fix: narrow Optional type in boundary test for pyright

b044db5

Co-authored-by: openhands <openhands@all-hands.dev>

csmith49 requested a review from all-hands-bot May 29, 2026 16:52

all-hands-bot approved these changes May 29, 2026

View reviewed changes

Comment thread tests/test_ab_testing_integration.py Outdated

Comment thread tests/test_ab_testing_integration.py Outdated

fix: use valid plugin entries in test fixtures for min_length=1 consi…

d44c383

…stency Co-authored-by: openhands <openhands@all-hands.dev>

malhotra5 approved these changes May 29, 2026

View reviewed changes

csmith49 mentioned this pull request May 29, 2026

A/B testing: promote variant selection to the automation server level #147

Open

csmith49 merged commit 3a35409 into main May 29, 2026
5 checks passed

csmith49 mentioned this pull request May 29, 2026

docs: add A/B testing reference for plugin automations OpenHands/extensions#278

Open

Conversation

csmith49 commented May 29, 2026

Summary

How it works

API: same endpoint, new field

Tarball: one extra file

Runtime: variant selection + tagging

Validation

Changes

Test results

Uh oh!

github-actions Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Review: A/B Testing Support for Plugin Automations

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

all-hands-bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review — A/B Testing Support for Plugin Automations

🔴 Bug: ExperimentVariant.plugins accepts an empty list

🟡 Nit: Unnecessary f-prefix on a non-interpolated string literal (sdk_main.py)

🟠 Risk: assert in production runtime for invariant enforcement (sdk_main.py)

🟡 Suggestion: Variant selection is non-reproducible from logs alone

Minor: experiment_id allows arbitrary characters

✅ What's working well

Uh oh!

Uh oh!

Uh oh!

Uh oh!

all-hands-bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review — A/B Testing Support for Plugin Automations

Uh oh!

Uh oh!

Uh oh!

all-hands-bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review — A/B Testing Support for Plugin Automations

Uh oh!

Uh oh!

Uh oh!

malhotra5 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented May 29, 2026 •

edited

Loading

github-actions Bot commented May 29, 2026 •

edited

Loading

all-hands-bot commented May 29, 2026 •

edited

Loading

all-hands-bot commented May 29, 2026 •

edited

Loading

🔴 Bug: `ExperimentVariant.plugins` accepts an empty list

🟡 Nit: Unnecessary `f`-prefix on a non-interpolated string literal (`sdk_main.py`)

🟠 Risk: `assert` in production runtime for invariant enforcement (`sdk_main.py`)

Minor: `experiment_id` allows arbitrary characters

all-hands-bot commented May 29, 2026 •

edited

Loading

all-hands-bot commented May 29, 2026 •

edited

Loading