[ab-advisor] Improve experiment infrastructure: schema, reporting & audit #38826

@github-actions

🔬 Experiment Infrastructure Improvements

Sub-issue of #38825 | Triggered by ab-testing-advisor on 2026-06-12


Area 1: Frontmatter Schema (notify Alert-Posting Gap)

The field-presence-checker confirms that analysis_type and tags are fully implemented end-to-end (parsed in Go → marshalled into GH_AW_EXPERIMENT_SPEC → rendered in the JS step summary). However, notify is only partially implemented: the field is parsed in Go (compiler_experiments.go lines 201–214) and read in the JS picker to build notifyTargets (pick_experiment.cjs lines 242–246), but those targets are only displayed as text in the step summary. No code posts alerts to the referenced discussion or issue when an experiment concludes.

Proposed fix in actions/setup/js/pick_experiment.cjs:

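// After the step summary is written, check whether the experiment has concluded.
const counts = Object.values(variantCounts);
const minSamples = cfg.min_samples ?? 0;
// every() is vacuously true on an empty array, and a missing min_samples would
// make every run look "concluded", so guard against both.
const minSamplesReached =
  minSamples > 0 && counts.length > 0 && counts.every(count => count >= minSamples);

if (minSamplesReached && notifyTargets.length > 0) {
  // Assumes the actions/github-script `context` and an authenticated `octokit` are in scope.
  const { owner, repo } = context.repo;
  const summary = buildResultsSummary(variantCounts, cfg); // winner, effect size, p-value
  for (const target of notifyTargets) {
    if (target.type === 'discussion') {
      // Discussion comments go through GraphQL: addDiscussionComment needs the
      // discussion's node ID, so resolve it from the discussion number first.
      const { repository } = await octokit.graphql(
        `query($owner: String!, $repo: String!, $number: Int!) {
           repository(owner: $owner, name: $repo) { discussion(number: $number) { id } }
         }`,
        { owner, repo, number: target.id }
      );
      await octokit.graphql(
        `mutation($discussionId: ID!, $body: String!) {
           addDiscussionComment(input: { discussionId: $discussionId, body: $body }) { comment { id } }
         }`,
        { discussionId: repository.discussion.id, body: summary }
      );
    } else if (target.type === 'issue') {
      await octokit.rest.issues.createComment({ owner, repo, issue_number: target.id, body: summary });
    }
  }
  // Note: persist a "notified" marker (for example in state.json) so later runs do not re-post.
}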

This requires pick_experiment.cjs to:

  1. Track variantCounts across runs (already available via state.json)
  2. Build a results summary when min_samples is reached for all variants
  3. Post the comment via the GitHub API, using the existing Octokit instance if available, or via a safe-output tool

Area 2: Reporting & Dashboards

Proposed daily experiment report pipeline

The existing daily-experiment-report workflow can be enhanced to provide a full analytics pipeline:

Step 1: Aggregate run artifacts

gh run list --workflow="*.lock.yml" --json databaseId --limit 200 | \
  jq -r '.[].databaseId' | xargs -I{} \
  gh run download {} --name experiments-state --dir /tmp/experiments/{} 2>/dev/null || true

Step 2: Compute running statistics per variant
For each experiment found across all state.json files:

  • Group samples by variant
  • Compute: mean, variance, sample_count for metric and each secondary_metric
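The grouping and per-variant statistics in Step 2 could be computed in a single pass with Welford's online algorithm. This is a minimal sketch, assuming each sample is an object of the form { variant, value }; that shape and the name summarizeByVariant are hypothetical, since the state.json schema is not spelled out here.

```javascript
// Hypothetical helper: the sample shape { variant, value } is an assumption,
// not the actual state.json schema.
function summarizeByVariant(samples) {
  const acc = new Map(); // variant -> { n, mean, m2 }
  for (const { variant, value } of samples) {
    const s = acc.get(variant) ?? { n: 0, mean: 0, m2: 0 };
    // Welford's online update: numerically stable running mean and variance.
    s.n += 1;
    const delta = value - s.mean;
    s.mean += delta / s.n;
    s.m2 += delta * (value - s.mean);
    acc.set(variant, s);
  }
  const out = {};
  for (const [variant, { n, mean, m2 }] of acc) {
    // Sample variance (n - 1 denominator); null when fewer than two samples exist.
    out[variant] = { sample_count: n, mean, variance: n > 1 ? m2 / (n - 1) : null };
  }
  return out;
}
```

The same function would be called once for metric and once per secondary_metric, each time with the corresponding values.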

Step 3: Apply significance test based on analysis_type

| analysis_type | Test applied |
| --- | --- |
| t_test | Welch's t-test on metric means |
| mann_whitney | Mann-Whitney U on metric distributions |
| proportion_test | Two-proportion z-test (binary outcomes) |
| bayesian_ab | Beta-binomial posterior P(B > A) |

Significance threshold: α = 0.05; post a conclusion comment once p < α.
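Two of the tests listed above are simple enough to sketch without a statistics library. The function names below are hypothetical. The proportion test returns a two-sided p-value using a normal CDF approximation (Abramowitz-Stegun 7.1.26); the Welch helper returns only the t statistic and the Welch-Satterthwaite degrees of freedom, because turning those into a p-value needs a Student-t CDF, which is not included here. Mann-Whitney and the Bayesian test are not covered.

```javascript
// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation
// (absolute error on the order of 1e-7).
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Two-proportion z-test with a pooled standard error; two-sided p-value.
function proportionTest(successA, nA, successB, nB) {
  const pA = successA / nA;
  const pB = successB / nB;
  const pooled = (successA + successB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  if (se === 0) return { z: 0, pValue: 1 }; // both arms all-success or all-failure
  const z = (pB - pA) / se;
  return { z, pValue: 2 * (1 - normalCdf(Math.abs(z))) };
}

// Welch's t statistic and Welch-Satterthwaite degrees of freedom.
// Each argument is { mean, variance, n } with n >= 2 and sample variance.
function welchT(a, b) {
  const va = a.variance / a.n;
  const vb = b.variance / b.n;
  const t = (b.mean - a.mean) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1));
  return { t, df };
}
```

A dispatcher keyed on analysis_type could route t_test to welchT and proportion_test to proportionTest, and fail loudly on any value it does not recognise.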

Step 4: ASCII table in step summary

experiment: prompt_style (daily-cli-performance)
variant     n    regression_accuracy    ai_credits     winner?
detailed    14   94.3% ± 3.1%           8 420 ± 620    (baseline)
concise     11   91.8% ± 4.2%           6 210 ± 510    ✓ p=0.031

Step 5: Discussion post when significant
Post to the discussion referenced in experiments.<name>.notify.discussion once min_samples is reached and p-value < α.
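The two conditions for posting can be combined into one small predicate. This is a sketch only: shouldPostConclusion and its argument shape are hypothetical, and it assumes min_samples applies to every variant individually, as in Area 1.

```javascript
// Hypothetical gate for the conclusion post; names and argument shape are assumptions.
function shouldPostConclusion({ variantCounts, minSamples, pValue, alpha = 0.05 }) {
  const counts = Object.values(variantCounts);
  // every() is vacuously true on an empty array, so require at least two variants
  // and an explicit, positive min_samples.
  const enoughSamples =
    minSamples > 0 && counts.length >= 2 && counts.every(n => n >= minSamples);
  return enoughSamples && Number.isFinite(pValue) && pValue < alpha;
}
```

Under this rule, an experiment with 14 and 11 samples against a min_samples of 20 would not post yet, even with p = 0.031, because significance reached before the sample floor is ignored.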

Area 3: Audit & OTEL Integration

Proposed experiment observability changes

OTEL resource attributes (add to pick_experiment.cjs immediately after variant assignment):

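const existingAttrs = process.env.OTEL_RESOURCE_ATTRIBUTES ?? '';
// Percent-encode values so a ',' or '=' in a name cannot corrupt the key=value list.
const experimentAttrs = [
  `experiment.name=${encodeURIComponent(experimentName)}`,
  `experiment.variant=${encodeURIComponent(chosenVariant)}`,
  `experiment.run_index=${runIndex}`
].join(',');
// Append to any attributes already set so existing resource metadata is preserved.
core.exportVariable('OTEL_RESOURCE_ATTRIBUTES',
  existingAttrs ? `${existingAttrs},${experimentAttrs}` : experimentAttrs);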

Because core.exportVariable writes to the job environment, spans emitted by later steps in the same job carry experiment.name, experiment.variant, and experiment.run_index, enabling Honeycomb/Jaeger slice-and-dice by experiment assignment without any workflow changes.
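If the step can run more than once in a job, plain concatenation would leave duplicate experiment.* keys in the variable. One way to avoid that is to parse the existing value and override by key, as in this sketch; mergeResourceAttributes is a hypothetical helper, not an existing function in pick_experiment.cjs.

```javascript
// Hypothetical helper: merges new key/value pairs into an OTEL_RESOURCE_ATTRIBUTES
// string, overriding existing keys and percent-encoding the new values.
function mergeResourceAttributes(existing, attrs) {
  const merged = new Map();
  for (const pair of (existing ?? '').split(',')) {
    const idx = pair.indexOf('=');
    // Skip empty or malformed fragments (no key before '=').
    if (idx > 0) merged.set(pair.slice(0, idx).trim(), pair.slice(idx + 1).trim());
  }
  for (const [key, value] of Object.entries(attrs)) {
    merged.set(key, encodeURIComponent(String(value)));
  }
  return [...merged].map(([k, v]) => `${k}=${v}`).join(',');
}
```

The result would then be passed to core.exportVariable('OTEL_RESOURCE_ATTRIBUTES', ...) in place of the concatenated string.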

gh aw audit output enrichment: Surface experiment assignment as a structured block:

{
  "experiment": {
    "name": "prompt_style",
    "variant": "concise",
    "run_index": 11,
    "min_samples": 20,
    "progress": "55%",
    "assigned_at": "2026-06-12T11:38:00Z"
  }
}
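The progress field in this block can be derived from run_index and min_samples instead of being stored separately. A minimal sketch of that mapping follows, written in JavaScript for consistency with pick_experiment.cjs; buildExperimentAuditBlock and its camelCase inputs are hypothetical, and the real audit command may assemble its output elsewhere.

```javascript
// Hypothetical builder for the audit block shown above.
function buildExperimentAuditBlock({ name, variant, runIndex, minSamples, assignedAt }) {
  // Cap at 100% so runs past min_samples do not report more than complete.
  const ratio = minSamples > 0 ? Math.min(runIndex / minSamples, 1) : 0;
  return {
    experiment: {
      name,
      variant,
      run_index: runIndex,
      min_samples: minSamples,
      progress: `${Math.round(ratio * 100)}%`,
      assigned_at: assignedAt,
    },
  };
}
```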

Step summary progress bar (already partially done in pick_experiment.cjs): Add run_index / min_samples ratio and estimated days to conclusion:

| Progress | 11 / 20 runs (55%), est. 3–4 weeks to conclusion |
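The estimate needs a run rate for this variant, which is not defined above. The sketch below takes it as a runsPerDay parameter; both that parameter and the name formatProgressRow are assumptions, and the rate would have to come from observed run history.

```javascript
// Hypothetical formatter for the progress row; runsPerDay is an assumed input.
function formatProgressRow(runIndex, minSamples, runsPerDay) {
  const pct = minSamples > 0
    ? Math.min(100, Math.round((runIndex / minSamples) * 100))
    : 0;
  const remaining = Math.max(0, minSamples - runIndex);
  let eta = 'concluded';
  if (remaining > 0) {
    // Without a positive run rate there is nothing to extrapolate from.
    eta = runsPerDay > 0
      ? `est. ${Math.ceil(remaining / runsPerDay)} days to conclusion`
      : 'no estimate available';
  }
  return `| Progress | ${runIndex} / ${minSamples} runs (${pct}%), ${eta} |`;
}
```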

Audit log filtering: Enable gh aw audit --experiment prompt_style --variant concise to return only runs matching that assignment, making it easy to compare failure modes between variants.

Implementation Steps

  • Implement notify alert-posting in pick_experiment.cjs when minSamplesReached && notifyTargets.length > 0
  • Add experiment.name + experiment.variant + experiment.run_index as OTEL resource attributes in pick_experiment.cjs
  • Add experiment assignment JSON block to gh aw audit output
  • Add run_index / min_samples progress bar and estimated-days-to-conclusion to step summary
  • Enhance daily-experiment-report to aggregate artifacts, compute statistics, apply analysis_type test, and post significance results
  • Add --experiment / --variant filter flags to gh aw audit

Generated by 🧪 Daily A/B Testing Advisor · 395.6 AIC · ⌖ 21.6 AIC · ⊞ 22.4K

  • expires on Jun 26, 2026, 3:46 AM UTC-08:00
