🔬 Experiment Infrastructure Improvements
Sub-issue of #38825 | Triggered by ab-testing-advisor on 2026-06-12
Area 1: Frontmatter Schema - notify Alert-Posting Gap
The field-presence-checker confirms that analysis_type and tags are fully implemented end-to-end (parsed in Go → marshalled into GH_AW_EXPERIMENT_SPEC → rendered in JS step summary). However, notify is only partially implemented: the field is parsed in Go (compiler_experiments.go lines 201–214) and read in the JS picker to build notifyTargets (pick_experiment.cjs lines 242–246), but those targets are only displayed as text in the step summary; no code posts alerts to the referenced discussion or issue when an experiment concludes.
Proposed fix in actions/setup/js/pick_experiment.cjs:
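    // After the step summary is written, check whether the experiment has concluded.
    const counts = Object.values(variantCounts);
    // [].every() is vacuously true, so require at least one variant.
    const minSamplesReached =
      counts.length > 0 && counts.every(count => count >= (cfg.min_samples ?? 0));
    if (minSamplesReached && notifyTargets.length > 0) {
      const summary = buildResultsSummary(variantCounts, cfg); // winner, effect size, p-value
      const { owner, repo } = context.repo; // assumes the @actions/github context is in scope
      for (const target of notifyTargets) {
        if (target.type === 'discussion') {
          // Repository discussions are served by the GraphQL API: resolve the node ID, then comment.
          const { repository } = await octokit.graphql(
            `query($owner: String!, $repo: String!, $number: Int!) {
               repository(owner: $owner, name: $repo) { discussion(number: $number) { id } }
             }`,
            { owner, repo, number: Number(target.id) });
          await octokit.graphql(
            `mutation($discussionId: ID!, $body: String!) {
               addDiscussionComment(input: { discussionId: $discussionId, body: $body }) { comment { id } }
             }`,
            { discussionId: repository.discussion.id, body: summary });
        } else if (target.type === 'issue') {
          await octokit.rest.issues.createComment({ owner, repo, issue_number: target.id, body: summary });
        }
      }
    }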
This requires pick_experiment.cjs to:
- Track variantCounts across runs (already available via state.json)
- Build a results summary when min_samples is reached for all variants
- Post the comment via the GitHub API, using the existing Octokit instance if available, or via a safe-output tool
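The proposed check would fire on every run after the threshold is crossed, and it passes trivially when min_samples is unset. A minimal sketch of a stricter gate follows; `shouldNotify` and the `state.notified` flag are hypothetical names, not part of the current pick_experiment.cjs.

```javascript
// Hypothetical helper: decide whether a conclusion comment should be posted.
// Assumes `state.notified` is a boolean the caller persists in state.json after posting.
function shouldNotify(variantCounts, cfg, notifyTargets, state) {
  const minSamples = cfg.min_samples;
  // Without a positive threshold there is no defined conclusion point.
  if (!Number.isInteger(minSamples) || minSamples <= 0) return false;
  if (!Array.isArray(notifyTargets) || notifyTargets.length === 0) return false;
  if (state && state.notified) return false; // already announced

  const counts = Object.values(variantCounts);
  // [].every() is vacuously true, so require at least one variant.
  return counts.length > 0 && counts.every((n) => n >= minSamples);
}
```

The caller would set `state.notified = true` only after every comment has been posted, so a failed post is retried on the next run.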
Area 2: Reporting & Dashboards
Proposed daily experiment report pipeline
The existing daily-experiment-report workflow can be enhanced to provide a full analytics pipeline:
Step 1: Aggregate run artifacts
    # Runs without an experiments-state artifact fail to download; those errors are silenced on purpose.
    gh run list --workflow="*.lock.yml" --json databaseId --limit 200 | \
      jq -r '.[].databaseId' | \
      xargs -I{} gh run download {} --name experiments-state --dir /tmp/experiments/{} 2>/dev/null || true
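Once the artifacts are on disk, the report step has to read them back. The sketch below assumes each artifact holds a state.json at its root, so the file lands at /tmp/experiments/<run id>/state.json; `loadStates` is a hypothetical helper, and the parsed contents are returned untouched because the state.json schema is not specified here.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Collect every parsable state.json found at rootDir/<runId>/state.json.
function loadStates(rootDir) {
  const states = [];
  if (!fs.existsSync(rootDir)) return states;
  for (const entry of fs.readdirSync(rootDir, { withFileTypes: true })) {
    if (!entry.isDirectory()) continue;
    const file = path.join(rootDir, entry.name, "state.json");
    try {
      states.push({ runId: entry.name, state: JSON.parse(fs.readFileSync(file, "utf8")) });
    } catch {
      // Runs without the artifact, or with a corrupt file, are skipped.
    }
  }
  return states;
}
```

Skipping unreadable runs rather than failing mirrors the `|| true` in the download command: a missing artifact should not abort the whole report.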
Step 2: Compute running statistics per variant
For each experiment found across all state.json files:
- Group samples by variant
- Compute: mean, variance, sample_count for metric and each secondary_metric
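The grouping and statistics can be done in a single pass. This is a minimal sketch using Welford's update; the `{ variant, value }` sample shape and the name `summarizeByVariant` are assumptions, since the layout of state.json is not shown here.

```javascript
// Group numeric samples by variant and compute count, mean, and unbiased sample variance
// with Welford's single-pass update.
function summarizeByVariant(samples) {
  const acc = new Map();
  for (const { variant, value } of samples) {
    const s = acc.get(variant) ?? { sample_count: 0, mean: 0, m2: 0 };
    s.sample_count += 1;
    const delta = value - s.mean;
    s.mean += delta / s.sample_count;
    s.m2 += delta * (value - s.mean); // running sum of squared deviations
    acc.set(variant, s);
  }
  const out = {};
  for (const [variant, s] of acc) {
    out[variant] = {
      sample_count: s.sample_count,
      mean: s.mean,
      // The unbiased variance needs at least two samples.
      variance: s.sample_count > 1 ? s.m2 / (s.sample_count - 1) : null,
    };
  }
  return out;
}
```

Running the same function once per metric (the primary metric and each secondary_metric) keeps the sketch independent of how many metrics an experiment declares.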
Step 3: Apply significance test based on analysis_type
| analysis_type | Test applied |
| --- | --- |
| t_test | Welch's t-test on metric means |
| mann_whitney | Mann-Whitney U on metric distributions |
| proportion_test | Two-proportion z-test (binary outcomes) |
| bayesian_ab | Beta-binomial posterior P(B > A) |
Significance threshold: α = 0.05; post a conclusion comment when reached.
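As an illustration of the t_test row, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom from per-variant summary statistics. The function name `welchT` and the `{ mean, variance, sample_count }` input shape are assumptions. Turning (t, df) into a p-value requires the Student t CDF, which is deliberately left to a statistics library.

```javascript
// Welch's t statistic and Welch-Satterthwaite degrees of freedom.
// Each argument is { mean, variance, sample_count }; variance is the unbiased sample variance.
function welchT(a, b) {
  if (a.sample_count < 2 || b.sample_count < 2) {
    throw new RangeError("each variant needs at least two samples");
  }
  const va = a.variance / a.sample_count; // squared standard error of mean a
  const vb = b.variance / b.sample_count; // squared standard error of mean b
  const se2 = va + vb;
  if (se2 === 0) throw new RangeError("both variances are zero");
  const t = (a.mean - b.mean) / Math.sqrt(se2);
  const df = (se2 * se2) /
    ((va * va) / (a.sample_count - 1) + (vb * vb) / (b.sample_count - 1));
  return { t, df };
}
```

With equal variances and equal sample sizes n, the degrees of freedom reduce to 2(n - 1), which is a quick sanity check for the formula.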
Step 4: ASCII table in step summary
    experiment: prompt_style (daily-cli-performance)
    variant   n   regression_accuracy  ai_credits   winner?
    detailed  14  94.3% ± 3.1%         8 420 ± 620  (baseline)
    concise   11  91.8% ± 4.2%         6 210 ± 510  ✓ p=0.031
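A table like this can be produced with plain string padding. The following is a minimal sketch; `renderTable` is a hypothetical helper and assumes every row has as many cells as there are headers.

```javascript
// Render rows as a fixed-width text table. Column widths adapt to the longest cell.
function renderTable(headers, rows) {
  const table = [headers, ...rows].map((r) => r.map(String));
  const widths = headers.map((_, i) => Math.max(...table.map((r) => r[i].length)));
  return table
    .map((r) => r.map((cell, i) => cell.padEnd(widths[i])).join("  ").trimEnd())
    .join("\n");
}
```

Wrapping the result in a preformatted block when writing the step summary keeps the columns aligned, since the summary is rendered as Markdown.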
Step 5: Discussion post when significant
Post to the discussion referenced in experiments.<name>.notify.discussion once min_samples is reached and p-value < α.
Area 3: Audit & OTEL Integration
Proposed experiment observability changes
OTEL resource attributes (add to pick_experiment.cjs immediately after variant assignment):
    const existingAttrs = process.env.OTEL_RESOURCE_ATTRIBUTES ?? '';
    const experimentAttrs = [
      `experiment.name=${experimentName}`,
      `experiment.variant=${chosenVariant}`,
      `experiment.run_index=${runIndex}`
    ].join(',');
    core.exportVariable('OTEL_RESOURCE_ATTRIBUTES',
      existingAttrs ? `${existingAttrs},${experimentAttrs}` : experimentAttrs);
This causes all downstream spans in the run to carry experiment.name and experiment.variant, enabling Honeycomb/Jaeger slice-and-dice by experiment assignment without any workflow changes.
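Plain concatenation can leave duplicate keys when the step runs twice, and a variant name containing ',' or '=' would corrupt the list. A hedged sketch of a merge helper follows; `mergeResourceAttributes` is a hypothetical name, and percent-encoding the values is an assumption based on the OpenTelemetry rule that ',' and '=' must be percent-encoded in this variable.

```javascript
// Merge new attributes into an existing OTEL_RESOURCE_ATTRIBUTES string.
// New keys override existing ones; new values are percent-encoded so that ',' and '='
// inside a value cannot corrupt the key=value list.
function mergeResourceAttributes(existing, attrs) {
  const merged = new Map();
  for (const pair of (existing ?? "").split(",")) {
    const idx = pair.indexOf("=");
    if (idx <= 0) continue; // skip empty or malformed entries
    merged.set(pair.slice(0, idx).trim(), pair.slice(idx + 1).trim());
  }
  for (const [key, value] of Object.entries(attrs)) {
    merged.set(key, encodeURIComponent(String(value)));
  }
  return [...merged].map(([k, v]) => `${k}=${v}`).join(",");
}
```

The result would be passed to core.exportVariable in place of the concatenated string.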
gh aw audit output enrichment: Surface experiment assignment as a structured block:
    {
      "experiment": {
        "name": "prompt_style",
        "variant": "concise",
        "run_index": 11,
        "min_samples": 20,
        "progress": "55%",
        "assigned_at": "2026-06-12T11:38:00Z"
      }
    }
Step summary progress bar (already partially done in pick_experiment.cjs): Add run_index / min_samples ratio and estimated days to conclusion:
| Progress | 11 / 20 runs (55%), est. 3–4 weeks to conclusion |
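A rough sketch of how such a row could be computed is shown below. `progressRow` and its `runsPerDay` parameter (how often this variant is expected to be sampled) are hypothetical; the real estimate would depend on the workflow schedule and the number of variants.

```javascript
// Format the progress row for the step summary.
function progressRow(runIndex, minSamples, runsPerDay) {
  if (!(minSamples > 0)) throw new RangeError("minSamples must be positive");
  const pct = Math.round((Math.min(runIndex, minSamples) / minSamples) * 100);
  const remaining = Math.max(minSamples - runIndex, 0);
  let estimate;
  if (remaining === 0) {
    estimate = "concluded";
  } else if (runsPerDay > 0) {
    estimate = `est. ${Math.ceil(remaining / runsPerDay)} days to conclusion`;
  } else {
    estimate = "no estimate"; // no sampling rate known yet
  }
  return `| Progress | ${runIndex} / ${minSamples} runs (${pct}%), ${estimate} |`;
}
```

For example, with 9 runs remaining and a variant sampled every other day (`runsPerDay = 0.5`), the row reports an estimate of 18 days.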
Audit log filtering: Enable gh aw audit --experiment prompt_style --variant concise to return only runs matching that assignment, making it easy to compare failure modes between variants.
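The intended filter semantics can be sketched as follows. The flags themselves would live in the gh aw CLI; this JavaScript only illustrates the matching rule, and `filterRuns` plus the assumption that each audited run carries the experiment block shown earlier are hypothetical.

```javascript
// Keep only runs whose experiment block matches the requested filters.
// Omitted filters match everything; once a filter is set, runs without an
// experiment block never match.
function filterRuns(runs, { experiment, variant } = {}) {
  if (experiment === undefined && variant === undefined) return runs.slice();
  return runs.filter((run) => {
    const e = run.experiment;
    if (!e) return false;
    if (experiment !== undefined && e.name !== experiment) return false;
    if (variant !== undefined && e.variant !== variant) return false;
    return true;
  });
}
```

Treating the two flags as an AND keeps `--experiment prompt_style` on its own useful for pulling every variant of one experiment.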
Implementation Steps
- Implement notify alert-posting in pick_experiment.cjs when minSamplesReached && notifyTargets.length > 0
- Add experiment.name + experiment.variant + experiment.run_index as OTEL resource attributes in pick_experiment.cjs
- Surface the experiment assignment block in gh aw audit output
- Add run_index / min_samples progress bar and estimated-days-to-conclusion to step summary
- Enhance daily-experiment-report to aggregate artifacts, compute statistics, apply analysis_type test, and post significance results
- Add --experiment / --variant filter flags to gh aw audit
References
- pkg/workflow/compiler_experiments.go
- actions/setup/js/pick_experiment.cjs
- .github/workflows/daily-experiment-report.md
- Related to [ab-advisor] Experiment campaign for daily-cli-performance: A/B test prompt_style #38825
Generated by 🧪 Daily A/B Testing Advisor · 395.6 AIC · 21.6 AIC · 22.4K