feat: A/B testing support for plugin automations#146
Conversation
Extend the plugin preset endpoint to support experiment variants. Users can now provide 'variants' (mutually exclusive with 'plugins') to create A/B tests that randomly select a plugin configuration per run and tag the resulting conversation with experiment metadata. Changes: - preset_router.py: ExperimentVariant model, variants/experiment_id fields on CreatePluginAutomationRequest with mutual exclusivity validation, experiment_config.json tarball generation - sdk_main.py: experiment config detection, weighted variant selection via random.choices, experiment tags on Conversation() - tests: 12 new tests covering validation edge cases and tarball generation for both experiment and standard paths No database migrations. No SDK changes. No new endpoints. Co-authored-by: openhands <openhands@all-hands.dev>
End-to-end tests that run without Docker by using SQLite + mock file store. Covers: - API creation flow (201 response, tarball contents, experiment_config.json) - Runtime variant selection logic (weight distribution, determinism) - Backward compatibility (standard plugin automations unchanged) - Validation errors via API (422 responses for invalid requests) 15 tests, all passing. Co-authored-by: openhands <openhands@all-hands.dev>
|
🚀 Deploy Preview PR Created/Updated A deploy preview has been created/updated for this PR. Deploy PR: https://github.com/OpenHands/deploy/pull/4474 Once the deploy PR's CI passes, the automation service will be deployed to the feature environment. |
- Break long lines in preset_router.py (E501) - Add ClassVar annotations to test class attributes (RUF012) - Apply ruff format to both test files Co-authored-by: openhands <openhands@all-hands.dev>
Replace type: ignore comments with proper None-checks so pyright can narrow list[ExperimentVariant] | None to list[ExperimentVariant]. Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Review: A/B Testing Support for Plugin Automations
Solid implementation overall. The mutual-exclusivity design is clear, validation coverage is thorough, and backward-compatibility for standard plugin automations is preserved. I found a few places where the type-safety story can be tightened up without adding complexity, plus a misleading docstring in the test helpers worth fixing.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
- Drop intermediate bool vars (has_variants/is_experiment), use direct 'is not None' checks so pyright narrows without redundant conditions - Replace 'type: ignore' comments with explicit asserts in _generate_plugin_tarball and sdk_main.py experiment_tags block - Fix misleading docstring on _simulate_variant_selection test helper (seeded RNG ≠ 'same logic' as production) Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review — A/B Testing Support for Plugin Automations
Overall this is a well-scoped, clean feature addition. The mutual-exclusivity validation is robust, the tarball branching is clear, and the 27 new tests cover both the happy path and all validation edge cases. The backward-compatibility guarantee (standard plugin automations untouched) is explicitly tested, which is exactly the right instinct for a feature that modifies shared infrastructure.
A few issues are worth addressing before merge:
🔴 Bug: ExperimentVariant.plugins accepts an empty list
The plugins field on ExperimentVariant uses Field(...) (required, no length constraint). An empty [] passes Pydantic validation, produces a variant entry in experiment_config.json with "plugins": [], and when that variant is selected at runtime the agent starts with zero plugins loaded — silently, with no error. This should be caught at API ingestion time.
Fix: Add min_length=1 to the field, matching the spirit of the existing weight: int = Field(..., gt=0) constraint:
plugins: list[PluginSource] = Field(..., min_length=1, description="Plugin(s) for this variant.")🟡 Nit: Unnecessary f-prefix on a non-interpolated string literal (sdk_main.py)
print(f"\n=== EXPERIMENT ===") # no {} expressionsThis is a plain string literal with an f-prefix but no interpolation. Static linters will flag it and it may confuse readers into looking for a missing variable.
Fix:
print("\n=== EXPERIMENT ===")🟠 Risk: assert in production runtime for invariant enforcement (sdk_main.py)
assert selected_variant is not None # set when experiment_id is setThis is logically correct — if experiment_id is set, selected_variant will always be set. However, assert statements are stripped when Python is run with the -O (optimize) flag, and an AssertionError produces a less informative traceback than an explicit exception. Since this is a shipped runtime script embedded in tarballs, consider raising a clear RuntimeError if the invariant is ever violated:
if selected_variant is None:
raise RuntimeError(
f"BUG: experiment_id is set but selected_variant is None — "
f"experiment config may be malformed."
)🟡 Suggestion: Variant selection is non-reproducible from logs alone
sdk_main.py uses the global unseeded random.choices, which is correct for production. However, there is currently no way to reproduce which variant was selected for a specific run from logs alone — only the conversation tags carry that information. If post-hoc audit or debugging is needed, consider emitting the variant selection at a structured log level (not just print) so it survives in the log aggregator.
Minor: experiment_id allows arbitrary characters
The experiment_id field is validated for length (1–200 chars) but not for character set. Since it ends up as a conversation tag value, characters like =, :, or Unicode may behave unexpectedly depending on how the tagging system processes them. Consider documenting or enforcing a safe character set (e.g., [a-zA-Z0-9._-]), or at least note the expectation in the field description.
✅ What's working well
- The
validate_plugins_or_variantscondition(self.plugins is None) == (self.variants is None)is idiomatic and handles both "both provided" and "neither provided" in a single expression — clean. _generate_plugin_tarballis cleanly branched with no shared mutable state between the two code paths.- The
TestBackwardCompatibilityclass explicitly verifies that standard plugin automations produceplugins_config.jsonand noexperiment_config.json— exactly the regression test this feature needs. - The
assert body.plugins is not Noneandassert plugins is not Nonein service-side internal helpers (post-validator) strike the right balance: readable, self-documenting, fast-failing in development. The concern above aboutassertapplies specifically tosdk_main.pysince it's a generated artifact shipped to external environments.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
… safety - Add min_length=1 to ExperimentVariant.plugins to reject empty lists - Remove extraneous f-prefix on non-interpolated string in sdk_main.py - Replace assert with explicit RuntimeError guard in sdk_main.py since generated code runs in sandboxes where python -O could strip asserts Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review — A/B Testing Support for Plugin Automations
All feedback from the two prior review rounds has been fully addressed: the redundant bool intermediates are gone, type narrowing is explicit via assertions (not # type: ignore), min_length=1 closes the empty-plugins bug, the f-prefix nit is fixed, and the assert in the generated tarball script is replaced by a proper RuntimeError guard. The PR is in good shape.
Two remaining suggestions below; both are minor.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
- Wrap experiment description in _safe_truncate(..., 200) for consistency with the plugin description path - Add test_max_variants_accepted to close off-by-one gap Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review — A/B Testing Support for Plugin Automations
This is a clean, well-scoped feature. Three rounds of review feedback have all been addressed: the empty-plugins bug is closed with min_length=1, the assert in the generated tarball script is replaced by a proper RuntimeError guard, the intermediate bool variables are gone, type narrowing is explicit, and the test docstring accurately describes the RNG divergence from production. Backward compatibility is explicitly tested. The implementation is merge-ready.
One minor suggestion below — genuinely optional.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
…stency Co-authored-by: openhands <openhands@all-hands.dev>
malhotra5
left a comment
There was a problem hiding this comment.
LGTM!
This is great to start with, but maybe not as flexible for a/b testing arbitrary sdk code
An alternative would be to support variants on the automation definition level between multiple tarballs. Then, for the plugin endpoint, we'd auto-generate multiple tarball variants and specify those as part of the automation. We can even track the experiment metadata as part of the automation run.
The downside, however, is that it requires a migration and the variant selection logic needs to be orchestrated by the automation server rather than within the sdk script (but I'm not so concerned about it, and avoids having to re-implement variation selection for users that want to run a/b tests for non-preset automations)
Summary
Add A/B testing (experiment variants) to the plugin preset endpoint. Users can now provide
variantsinstead ofpluginsto create automations that randomly select a plugin configuration per run and tag the resulting conversation with experiment metadata.No database migrations. No SDK changes. No new endpoints. No new dependencies.
How it works
API: same endpoint, new field
POST /v1/preset/pluginnow acceptsvariantsas an alternative toplugins(mutually exclusive):{ "name": "PR Review A/B Test", "experiment_id": "pr-review-v2-test", "variants": [ { "name": "control", "weight": 70, "plugins": [{"source": "github:OpenHands/extensions", "repo_path": "plugins/pr-review", "ref": "v1.0.0"}] }, { "name": "treatment", "weight": 30, "plugins": [{"source": "github:OpenHands/extensions", "repo_path": "plugins/pr-review", "ref": "v2.0.0"}] } ], "prompt": "Review this PR for code quality and potential bugs.", "trigger": {"type": "event", "source": "github", "on": "pull_request.opened"} }Tarball: one extra file
When
variantsis used, the tarball containsexperiment_config.jsoninstead ofplugins_config.json. Standard plugin automations are unchanged.Runtime: variant selection + tagging
sdk_main.pydetectsexperiment_config.json, selects a variant via weighted random choice, and passesexperiment_id+varianttags toConversation(). These merge with existing workspace/plugin auto-tags.Validation
pluginsorvariantsrequiredexperiment_idrequired withvariantsexperiment_idrejected when usingpluginsChanges
preset_router.pyExperimentVariantmodel,variants/experiment_idfields, mutual exclusivity validator, tarball generationsdk_main.pytest_preset_router.pytest_ab_testing_integration.pyTest results
This PR was created by an AI agent (OpenHands) on behalf of csmith49.