Skip to content

feat: A/B testing support for plugin automations#146

Merged
csmith49 merged 10 commits into
mainfrom
feat/ab-testing-plugin-variants
May 29, 2026
Merged

feat: A/B testing support for plugin automations#146
csmith49 merged 10 commits into
mainfrom
feat/ab-testing-plugin-variants

Conversation

@csmith49
Copy link
Copy Markdown
Contributor

Summary

Add A/B testing (experiment variants) to the plugin preset endpoint. Users can now provide variants instead of plugins to create automations that randomly select a plugin configuration per run and tag the resulting conversation with experiment metadata.

No database migrations. No SDK changes. No new endpoints. No new dependencies.

How it works

API: same endpoint, new field

POST /v1/preset/plugin now accepts variants as an alternative to plugins (mutually exclusive):

{
  "name": "PR Review A/B Test",
  "experiment_id": "pr-review-v2-test",
  "variants": [
    {
      "name": "control",
      "weight": 70,
      "plugins": [{"source": "github:OpenHands/extensions", "repo_path": "plugins/pr-review", "ref": "v1.0.0"}]
    },
    {
      "name": "treatment",
      "weight": 30,
      "plugins": [{"source": "github:OpenHands/extensions", "repo_path": "plugins/pr-review", "ref": "v2.0.0"}]
    }
  ],
  "prompt": "Review this PR for code quality and potential bugs.",
  "trigger": {"type": "event", "source": "github", "on": "pull_request.opened"}
}

Tarball: one extra file

When variants is used, the tarball contains experiment_config.json instead of plugins_config.json. Standard plugin automations are unchanged.

Runtime: variant selection + tagging

sdk_main.py detects experiment_config.json, selects a variant via weighted random choice, and passes experiment_id + variant tags to Conversation(). These merge with existing workspace/plugin auto-tags.

Validation

  • Exactly one of plugins or variants required
  • experiment_id required with variants
  • 2–10 variants, unique names, positive weights
  • experiment_id rejected when using plugins

Changes

File Change
preset_router.py ExperimentVariant model, variants/experiment_id fields, mutual exclusivity validator, tarball generation
sdk_main.py Experiment config detection, weighted variant selection, conversation tagging
test_preset_router.py 12 new unit tests (validation + tarball generation)
test_ab_testing_integration.py 15 new integration tests (SQLite-based, no Docker needed)

Test results

90 total | 64 passed | 0 failed | 26 skipped (pre-existing Docker-gated tests)

This PR was created by an AI agent (OpenHands) on behalf of csmith49.

csmith49 and others added 2 commits May 28, 2026 10:35
Extend the plugin preset endpoint to support experiment variants.
Users can now provide 'variants' (mutually exclusive with 'plugins')
to create A/B tests that randomly select a plugin configuration per
run and tag the resulting conversation with experiment metadata.

Changes:
- preset_router.py: ExperimentVariant model, variants/experiment_id
  fields on CreatePluginAutomationRequest with mutual exclusivity
  validation, experiment_config.json tarball generation
- sdk_main.py: experiment config detection, weighted variant
  selection via random.choices, experiment tags on Conversation()
- tests: 12 new tests covering validation edge cases and tarball
  generation for both experiment and standard paths

No database migrations. No SDK changes. No new endpoints.

Co-authored-by: openhands <openhands@all-hands.dev>
End-to-end tests that run without Docker by using SQLite + mock
file store. Covers:
- API creation flow (201 response, tarball contents, experiment_config.json)
- Runtime variant selection logic (weight distribution, determinism)
- Backward compatibility (standard plugin automations unchanged)
- Validation errors via API (422 responses for invalid requests)

15 tests, all passing.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 29, 2026

🚀 Deploy Preview PR Created/Updated

A deploy preview has been created/updated for this PR.

Deploy PR: https://github.com/OpenHands/deploy/pull/4474
Automation SHA: d44c383edd58ebbae6a74531a6904eea282dc3e3
Last updated: May 29, 2026, 01:02:39 PM ET

Once the deploy PR's CI passes, the automation service will be deployed to the feature environment.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 29, 2026

Coverage

csmith49 and others added 3 commits May 29, 2026 08:49
- Break long lines in preset_router.py (E501)
- Add ClassVar annotations to test class attributes (RUF012)
- Apply ruff format to both test files

Co-authored-by: openhands <openhands@all-hands.dev>
Replace type: ignore comments with proper None-checks so pyright
can narrow list[ExperimentVariant] | None to list[ExperimentVariant].

Co-authored-by: openhands <openhands@all-hands.dev>
@csmith49 csmith49 requested a review from all-hands-bot May 29, 2026 16:05
Copy link
Copy Markdown
Contributor

all-hands-bot commented May 29, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Contributor

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: A/B Testing Support for Plugin Automations

Solid implementation overall. The mutual-exclusivity design is clear, validation coverage is thorough, and backward-compatibility for standard plugin automations is preserved. I found a few places where the type-safety story can be tightened up without adding complexity, plus a misleading docstring in the test helpers worth fixing.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread openhands/automation/preset_router.py Outdated
Comment thread openhands/automation/preset_router.py Outdated
Comment thread openhands/automation/presets/plugin/sdk_main.py
Comment thread tests/test_ab_testing_integration.py
- Drop intermediate bool vars (has_variants/is_experiment), use direct
  'is not None' checks so pyright narrows without redundant conditions
- Replace 'type: ignore' comments with explicit asserts in
  _generate_plugin_tarball and sdk_main.py experiment_tags block
- Fix misleading docstring on _simulate_variant_selection test helper
  (seeded RNG ≠ 'same logic' as production)

Co-authored-by: openhands <openhands@all-hands.dev>
@csmith49 csmith49 requested a review from all-hands-bot May 29, 2026 16:19
Copy link
Copy Markdown
Contributor

all-hands-bot commented May 29, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Contributor

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — A/B Testing Support for Plugin Automations

Overall this is a well-scoped, clean feature addition. The mutual-exclusivity validation is robust, the tarball branching is clear, and the 27 new tests cover both the happy path and all validation edge cases. The backward-compatibility guarantee (standard plugin automations untouched) is explicitly tested, which is exactly the right instinct for a feature that modifies shared infrastructure.

A few issues are worth addressing before merge:


🔴 Bug: ExperimentVariant.plugins accepts an empty list

The plugins field on ExperimentVariant uses Field(...) (required, no length constraint). An empty [] passes Pydantic validation, produces a variant entry in experiment_config.json with "plugins": [], and when that variant is selected at runtime the agent starts with zero plugins loaded — silently, with no error. This should be caught at API ingestion time.

Fix: Add min_length=1 to the field, matching the spirit of the existing weight: int = Field(..., gt=0) constraint:

plugins: list[PluginSource] = Field(..., min_length=1, description="Plugin(s) for this variant.")

🟡 Nit: Unnecessary f-prefix on a non-interpolated string literal (sdk_main.py)

print(f"\n=== EXPERIMENT ===")   # no {} expressions

This is a plain string literal with an f-prefix but no interpolation. Static linters will flag it and it may confuse readers into looking for a missing variable.

Fix:

print("\n=== EXPERIMENT ===")

🟠 Risk: assert in production runtime for invariant enforcement (sdk_main.py)

assert selected_variant is not None  # set when experiment_id is set

This is logically correct — if experiment_id is set, selected_variant will always be set. However, assert statements are stripped when Python is run with the -O (optimize) flag, and an AssertionError produces a less informative traceback than an explicit exception. Since this is a shipped runtime script embedded in tarballs, consider raising a clear RuntimeError if the invariant is ever violated:

if selected_variant is None:
    raise RuntimeError(
        f"BUG: experiment_id is set but selected_variant is None — "
        f"experiment config may be malformed."
    )

🟡 Suggestion: Variant selection is non-reproducible from logs alone

sdk_main.py uses the global unseeded random.choices, which is correct for production. However, there is currently no way to reproduce which variant was selected for a specific run from logs alone — only the conversation tags carry that information. If post-hoc audit or debugging is needed, consider emitting the variant selection at a structured log level (not just print) so it survives in the log aggregator.


Minor: experiment_id allows arbitrary characters

The experiment_id field is validated for length (1–200 chars) but not for character set. Since it ends up as a conversation tag value, characters like =, :, or Unicode may behave unexpectedly depending on how the tagging system processes them. Consider documenting or enforcing a safe character set (e.g., [a-zA-Z0-9._-]), or at least note the expectation in the field description.


✅ What's working well

  • The validate_plugins_or_variants condition (self.plugins is None) == (self.variants is None) is idiomatic and handles both "both provided" and "neither provided" in a single expression — clean.
  • _generate_plugin_tarball is cleanly branched with no shared mutable state between the two code paths.
  • The TestBackwardCompatibility class explicitly verifies that standard plugin automations produce plugins_config.json and no experiment_config.json — exactly the regression test this feature needs.
  • The assert body.plugins is not None and assert plugins is not None in service-side internal helpers (post-validator) strike the right balance: readable, self-documenting, fast-failing in development. The concern above about assert applies specifically to sdk_main.py since it's a generated artifact shipped to external environments.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread openhands/automation/preset_router.py
Comment thread openhands/automation/presets/plugin/sdk_main.py Outdated
Comment thread openhands/automation/presets/plugin/sdk_main.py Outdated
… safety

- Add min_length=1 to ExperimentVariant.plugins to reject empty lists
- Remove extraneous f-prefix on non-interpolated string in sdk_main.py
- Replace assert with explicit RuntimeError guard in sdk_main.py since
  generated code runs in sandboxes where python -O could strip asserts

Co-authored-by: openhands <openhands@all-hands.dev>
@csmith49 csmith49 requested a review from all-hands-bot May 29, 2026 16:31
Copy link
Copy Markdown
Contributor

all-hands-bot commented May 29, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Contributor

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — A/B Testing Support for Plugin Automations

All feedback from the two prior review rounds has been fully addressed: the redundant bool intermediates are gone, type narrowing is explicit via assertions (not # type: ignore), min_length=1 closes the empty-plugins bug, the f-prefix nit is fixed, and the assert in the generated tarball script is replaced by a proper RuntimeError guard. The PR is in good shape.

Two remaining suggestions below; both are minor.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread openhands/automation/preset_router.py Outdated
Comment thread tests/test_preset_router.py
csmith49 and others added 2 commits May 29, 2026 10:43
- Wrap experiment description in _safe_truncate(..., 200) for
  consistency with the plugin description path
- Add test_max_variants_accepted to close off-by-one gap

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
@csmith49 csmith49 requested a review from all-hands-bot May 29, 2026 16:52
Copy link
Copy Markdown
Contributor

all-hands-bot commented May 29, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Contributor

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — A/B Testing Support for Plugin Automations

This is a clean, well-scoped feature. Three rounds of review feedback have all been addressed: the empty-plugins bug is closed with min_length=1, the assert in the generated tarball script is replaced by a proper RuntimeError guard, the intermediate bool variables are gone, type narrowing is explicit, and the test docstring accurately describes the RNG divergence from production. Backward compatibility is explicitly tested. The implementation is merge-ready.

One minor suggestion below — genuinely optional.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread tests/test_ab_testing_integration.py Outdated
Comment thread tests/test_ab_testing_integration.py Outdated
…stency

Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Member

@malhotra5 malhotra5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

This is great to start with, but maybe not as flexible for a/b testing arbitrary sdk code

An alternative would be to support variants on the automation definition level between multiple tarballs. Then, for the plugin endpoint, we'd auto-generate multiple tarball variants and specify those as part of the automation. We can even track the experiment metadata as part of the automation run.

The downside, however, is that it requires a migration and the variant selection logic needs to be orchestrated by the automation server rather than within the sdk script (but I'm not so concerned about it, and avoids having to re-implement variation selection for users that want to run a/b tests for non-preset automations)

@csmith49 csmith49 merged commit 3a35409 into main May 29, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants