Why
The canary-ring promotion model (epic #495) advances a release next → ring0 → ring1 → stable, gated at every step by a soak on the prior ring. Today that gate is qualitative: the runbook says "confirm its callers' runs are healthy before advancing", which can't be automated or applied consistently. The first real promotion of the six #482 reusables is in flight (.github-private#870, next now at v2.1.0), so we need a concrete, mechanical soak standard the promotion can be held to (and that canary-rollout.sh / #501 can eventually enforce for cross-repo agents).
What the data says (why a naive rule fails)
14-day run volume on the next tier (.github-private), executed = success+failure (skips excluded); 0 failures across all six → baseline ≈ 0%:
| reusable | exec'd/day | 50%/day | notes |
|---|---|---|---|
| agent-shield | ~71 | ~36 | high volume |
| dependency-audit | ~71 | ~36 | high volume |
| pr-review-mention | ~52 | ~26 | 38% skipped |
| auto-rebase | 9 | ~5 | medium |
| dependabot-automerge | 3 | ~1.5 | 96% skipped |
| dependabot-rebase | 0 | 0 | no runs in 14d |
A literal "N hours + 50% of daily average" breaks three ways:
The time floor and the count aren't independent — at 71 runs/day, 36 runs takes ~12h, so the count silently dominates for busy reusables and the time floor never binds.
It collapses at the low end — dependabot-rebase (0 runs) gets no sample gate and may not fire in any short window; dependabot-automerge is 96% skips. Only time can protect these.
Skips and volume-capping distort the average — must count executed runs only, and the busy three are so high-volume that 50% balloons the soak.
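The coupling between the count and the clock can be checked with a little arithmetic. The sketch below is illustrative only: `hours_to_reach` and `DAILY_AVG` are hypothetical names, the rates are the executed-run averages from the table above, and runs are assumed to arrive at a steady rate, which real traffic will not do.

```python
def hours_to_reach(sample: float, runs_per_day: float) -> float:
    """Hours needed to accumulate `sample` executed runs at a steady daily rate."""
    if sample <= 0:
        return 0.0  # an empty sample is "reached" immediately, so it gates nothing
    if runs_per_day <= 0:
        return float("inf")  # a reusable that never runs never fills a sample
    return 24.0 * sample / runs_per_day


# 14-day executed-run averages per day, taken from the table above.
DAILY_AVG = {
    "agent-shield": 71,
    "dependency-audit": 71,
    "pr-review-mention": 52,
    "auto-rebase": 9,
    "dependabot-automerge": 3,
    "dependabot-rebase": 0,
}

for name, avg in DAILY_AVG.items():
    half = 0.5 * avg  # the naive "50% of daily average" sample
    print(f"{name}: {half:g} runs -> {hours_to_reach(half, avg):.1f} h")
```

Under the steady-rate assumption, half a day's worth of runs takes about half a day to arrive whatever the volume, so any time floor shorter than roughly 12 h is dominated by the count, and a reusable with no runs gets no sample gate at all.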
Proposed standard
A ring advance passes for a reusable when all of the following hold (evaluated per reusable, per ring):
Time floor: at least 12 h on the ring since the cut.
Sample: at least clamp(round(0.5 × daily_avg₁₄d), 5, 25) executed runs, where:
daily_avg₁₄d = executed runs (success+failure; exclude skipped/cancelled) over the last 14 days, computed across the tier's repos (next→.github-private; ring0→.github; ring1→TalkTerm+bmad; stable→broad fleet).
floor 5 (below that a count is noise), cap 25 (beyond that is diminishing returns and just stalls promotion).
count only runs started after the cut (so they exercise the new candidate).
Health: failure_rate ≤ baseline + ε AND zero startup_failures. The count is only the sample size; this comparison is the pass/fail (reuses decide_gate / failure_rate_permille).
Low-volume fallback: if 0.5 × daily_avg < 5, drop the count gate and require 24 h + ≥ 1 executed healthy run instead (covers dependabot-automerge, dependabot-rebase).
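The sample rule (floor 5, cap 25, with the low-volume fallback) might be expressed roughly as below. This is a sketch, not the project's implementation: `required_sample`, `SAMPLE_MIN`, `SAMPLE_MAX`, and the `fallback_below` parameter are hypothetical names. One caveat worth noting: read literally, a fallback threshold of 0.5 × daily_avg < 5 also routes auto-rebase (0.5 × 9 = 4.5) to the fallback, whereas the net-effect estimate treats it as a 5-run count gate, so the threshold is left as a parameter here.

```python
from typing import Optional

SAMPLE_MIN = 5   # below this a count is noise
SAMPLE_MAX = 25  # beyond this the extra runs mostly stall promotion


def required_sample(daily_avg: float, fallback_below: float = 5.0) -> Optional[int]:
    """Executed runs to require after the cut, or None for the low-volume fallback.

    `daily_avg` is the 14-day average of executed (success + failure) runs per day.
    """
    half = 0.5 * daily_avg
    if half < fallback_below:
        return None  # low volume: use the 24 h + >= 1 healthy run path instead
    # clamp(round(0.5 * daily_avg), SAMPLE_MIN, SAMPLE_MAX)
    return max(SAMPLE_MIN, min(SAMPLE_MAX, round(half)))


print(required_sample(71))                      # busy reusable: capped at 25
print(required_sample(9))                       # literal rule: fallback (None)
print(required_sample(9, fallback_below=4.5))   # count gate, clamped up to 5
```

Keeping the threshold separate from `SAMPLE_MIN` makes it easy to settle which path a medium-volume reusable such as auto-rebase should take without touching the clamp.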
Net effect on the in-flight promotion: busy three gate ~12 h (floor-bound); auto-rebase ~12–13 h (5 runs); the two dependabot reusables on the 24 h fallback — i.e. two waves, not six independent timers.
Open questions (to decide here)
Per-ring floor. 12 h is sized for next (first real exposure; doesn't span a daily cycle at 12 h, but the count drags busy ones there anyway). Should ring0/ring1/stable use a shorter floor (e.g. 4–6 h) since the candidate already soaked upstream, or stay uniform at 12 h?
Sample bounds. Are MIN=5 / MAX=25 right? (Rationale: ~25 clean runs makes a ~30% regression a <0.1% miss; <5 is statistically meaningless.)
ε for the failure-rate comparison — baseline is ~0 today; allowed delta = 0, or a permille threshold (e.g. ≤ baseline + 50‰)?
Per-reusable vs batch promotion (batching lets a 0-volume reusable gate the whole cohort into the 24 h path).
Run attribution — "since cut" by created-after-timestamp (proxy) vs detecting the run actually resolved the candidate SHA.
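Where the standard lives: petry-projects/.github/standards/ (org standard), cross-referenced from .github-private/docs/release/{versioning,runbook}.md. Enforced by soak-check (reads channels + run history → per-reusable PASS/WAIT + reason) for the interim, then consumed by canary-rollout.sh for cross-repo agents (#501).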
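The rationale given for the sample bounds (about 25 clean runs makes a ~30% regression a <0.1% miss) can be sanity-checked numerically. The sketch below assumes each run fails independently with the same probability, which is a simplification, and `miss_probability` is a hypothetical helper name.

```python
def miss_probability(failure_rate: float, clean_runs: int) -> float:
    """Chance that `clean_runs` independent runs all pass despite a true failure rate."""
    return (1.0 - failure_rate) ** clean_runs


for n in (1, 5, 25):
    p = miss_probability(0.30, n)
    print(f"{n:>2} clean runs: {p:.5%} chance of missing a 30% regression")
```

With a 30% regression, 5 clean runs still leave roughly a 17% chance of a miss (0.7^5 ≈ 0.168), while 25 clean runs bring it down to about 0.013%, which is consistent with the <0.1% figure above.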
Acceptance
Soak standard written (numbers + rationale + per-ring floor decision) in standards/ and cross-linked from the release runbook/versioning docs.
A mechanical soak-check that, given a reusable + tier, emits PASS/WAIT with the gating reason (time / sample / health / fallback).
Wired into the promotion path (interim: manual soak-check before each cut-release.sh --channel <ring> --push; target: canary-rollout.sh consumes it for cross-repo agents).
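A mechanical check of the kind described above could look roughly like this. Everything here is an assumption for illustration: `SoakInput`, `soak_check`, the field names, and the order in which gates are evaluated are hypothetical, and the health comparison is a simple integer permille calculation rather than the existing decide_gate / failure_rate_permille helpers, whose signatures are not shown in this issue.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SoakInput:
    hours_since_cut: float
    executed: int                    # success + failure runs started after the cut
    failures: int
    startup_failures: int
    required_sample: Optional[int]   # None selects the low-volume fallback
    baseline_permille: int = 0       # baseline failure rate, currently ~0
    epsilon_permille: int = 0        # allowed delta; still an open question


def soak_check(s: SoakInput,
               floor_hours: float = 12.0,
               fallback_hours: float = 24.0) -> Tuple[str, str]:
    """Return (verdict, gating reason) for one reusable on one ring."""
    # Health is checked first: an unhealthy candidate should never pass on time alone.
    rate = s.failures * 1000 // s.executed if s.executed else 0
    if s.startup_failures > 0 or rate > s.baseline_permille + s.epsilon_permille:
        return "WAIT", "health"
    if s.required_sample is None:
        # Low-volume fallback: longer soak plus at least one executed healthy run.
        if s.hours_since_cut < fallback_hours or s.executed < 1:
            return "WAIT", "fallback"
        return "PASS", "fallback"
    if s.hours_since_cut < floor_hours:
        return "WAIT", "time"
    if s.executed < s.required_sample:
        return "WAIT", "sample"
    return "PASS", "all gates met"


print(soak_check(SoakInput(hours_since_cut=8, executed=30, failures=0,
                           startup_failures=0, required_sample=25)))
```

Returning the reason alongside the verdict keeps the interim manual use readable (an operator sees which gate is still holding) and gives a later automated consumer a stable value to branch on.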
References
.github-private#870; automated gate .github-private#501; cross-repo cut wiring .github-private#959.
.github-private/docs/release/runbook.md ("Promotion is gated at every ring") + versioning.md.