Define the release soak standard (ring-promotion gate): time floor + traffic-scaled sample + health #548

## Why

The canary-ring promotion model (epic [#495](https://github.com/petry-projects/.github-private/issues/495)) advances a release `next → ring0 → ring1 → stable`, gated at every step by a **soak** on the prior ring. Today that gate is **qualitative**: the runbook says *"confirm its callers' runs are healthy before advancing"*, which can't be automated or applied consistently. The first real promotion of the six #482 reusables is in flight ([.github-private#870](https://github.com/petry-projects/.github-private/issues/870), `next` now at `v2.1.0`), so we need a **concrete, mechanical soak standard** that the promotion can be held to, and that `canary-rollout.sh` / [#501](https://github.com/petry-projects/.github-private/issues/501) can eventually enforce for cross-repo agents.

## What the data says (why a naive rule fails)

14-day run volume on the **next** tier (`.github-private`), *executed* = success+failure (skips excluded); **0 failures across all six → baseline ≈ 0%**:

| reusable | exec'd/day | 50%/day | notes |
|---|---|---|---|
| agent-shield | ~71 | ~36 | high volume |
| dependency-audit | ~71 | ~36 | high volume |
| pr-review-mention | ~52 | ~26 | 38% skipped |
| auto-rebase | 9 | ~5 | medium |
| dependabot-automerge | 3 | ~1.5 | **96% skipped** |
| dependabot-rebase | **0** | **0** | **no runs in 14d** |

A literal *"N hours + 50% of daily average"* breaks three ways:
1. **The time floor and the count aren't independent.** At 71 runs/day, 36 runs takes ~12h, so the *count* silently dominates for busy reusables and the time floor never binds.
2. **It collapses at the low end.** `dependabot-rebase` (0 runs) gets *no* sample gate and may not fire in any short window; `dependabot-automerge` is 96% skips. Only *time* can protect these.
3. **Skips and uncapped volume distort the target.** Only executed runs should count, and the busy three are so high-volume that an uncapped 50% balloons the required sample (26–36 runs).
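
The first two failure modes can be checked with a few lines of arithmetic. The sketch below assumes runs arrive uniformly through the day (real traffic is burstier, especially for the low-volume reusables) and uses the unrounded 50% target; `hours_to_collect` and `naive_count` are illustrative names, not existing tooling.

```python
import math

# Executed runs per day over 14 days, copied from the table above.
DAILY_AVG = {
    "agent-shield": 71,
    "dependency-audit": 71,
    "pr-review-mention": 52,
    "auto-rebase": 9,
    "dependabot-automerge": 3,
    "dependabot-rebase": 0,
}


def naive_count(daily_avg: float) -> float:
    """The literal '50% of daily average' target, unrounded."""
    return 0.5 * daily_avg


def hours_to_collect(count: float, daily_avg: float) -> float:
    """Hours needed to see `count` runs at a uniform `daily_avg` runs/day."""
    if count <= 0:
        return 0.0  # nothing to collect: the count gate is vacuous
    if daily_avg <= 0:
        return math.inf  # a positive target can never be met without traffic
    return 24.0 * count / daily_avg


for name, avg in DAILY_AVG.items():
    target = naive_count(avg)
    hours = hours_to_collect(target, avg)
    print(f"{name:22s} target={target:5.1f} runs  ~{hours:.1f} h")
```

Under the uniform-arrival assumption every reusable with any traffic needs about 12 h to reach half its daily average, whatever its volume, while a zero-volume reusable gets a target of 0 and therefore no sample gate at all.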

## Proposed standard

A ring advance **passes for a reusable** when **all** hold (evaluated per reusable, per ring):

- **Time floor: ≥ 12 h** since the channel tag moved.
- **Sample: executed healthy runs ≥ `clamp(round(0.5 × daily_avg₁₄d), 5, 25)`**
  - `daily_avg₁₄d` = executed runs (success+failure; **exclude skipped/cancelled**) over the last 14 days, computed across the **tier's repos** (`next`→.github-private; `ring0`→.github; `ring1`→TalkTerm+bmad; `stable`→broad fleet).
  - floor **5** (below that a count is noise), cap **25** (beyond that is diminishing returns and just stalls promotion).
  - count only runs started **after** the cut (so they exercise the new candidate).
- **Health: `failure_rate ≤ baseline + ε` AND zero `startup_failures`** — the count is the *sample size*; this is the pass/fail (reuses `decide_gate` / `failure_rate_permille`).
- **Low-volume fallback:** if `round(0.5 × daily_avg₁₄d) < 5` (rounding half up, so `auto-rebase` at 4.5 → 5 stays on the count gate), drop the count gate and require **24 h + ≥ 1 executed healthy run** instead (covers `dependabot-automerge`, `dependabot-rebase`).
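
A minimal sketch of how the four conditions might combine into a per-reusable PASS/WAIT decision. Everything here is illustrative: the `Run` record, `required_sample`, and `soak_status` are hypothetical names, the health check is a stand-in for `decide_gate` / `failure_rate_permille` (whose signatures are not shown in this issue), and ε is left as a parameter because it is still an open question. It also assumes the fallback test applies to the half-up rounded value, so that `auto-rebase` (0.5 × 9 = 4.5 → 5) stays on the count gate.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MIN_SAMPLE, MAX_SAMPLE = 5, 25         # proposed sample bounds
FLOOR = timedelta(hours=12)            # proposed time floor
FALLBACK_FLOOR = timedelta(hours=24)   # low-volume fallback floor


@dataclass
class Run:  # hypothetical record, not a GitHub API type
    started_at: datetime
    conclusion: str  # e.g. "success", "failure", "skipped", "startup_failure"


def required_sample(daily_avg: float):
    """Healthy-run target, or None when the low-volume fallback applies."""
    target = math.floor(0.5 * daily_avg + 0.5)  # round half up (assumption)
    if target < MIN_SAMPLE:
        return None
    return min(target, MAX_SAMPLE)


def soak_status(runs, cut_at, now, daily_avg,
                baseline_permille=0, epsilon_permille=0):
    """Return (verdict, reason) for one reusable on one ring."""
    since_cut = [r for r in runs if r.started_at > cut_at]  # timestamp proxy
    if any(r.conclusion == "startup_failure" for r in since_cut):
        return "WAIT", "health: startup_failure since cut"

    # Skipped and cancelled runs are not part of the sample.
    executed = [r for r in since_cut if r.conclusion in ("success", "failure")]
    healthy = sum(r.conclusion == "success" for r in executed)

    target = required_sample(daily_avg)
    if target is None:
        floor, needed, label = FALLBACK_FLOOR, 1, "fallback"
    else:
        floor, needed, label = FLOOR, target, "sample"

    if now - cut_at < floor:
        hours = floor.total_seconds() / 3600
        return "WAIT", f"time: {hours:.0f} h floor not reached"
    if healthy < needed:
        return "WAIT", f"{label}: {healthy}/{needed} healthy runs"

    # Stand-in health check; len(executed) >= needed >= 1 at this point.
    failures = len(executed) - healthy
    rate = 1000 * failures // len(executed)
    if rate > baseline_permille + epsilon_permille:
        return "WAIT", f"health: failure rate {rate} permille over limit"
    return "PASS", f"{label}: {healthy} healthy runs, {rate} permille failures"


cut = datetime(2024, 1, 1, tzinfo=timezone.utc)  # made-up cut time
demo = [Run(cut + timedelta(hours=h), "success") for h in range(1, 6)]
print(soak_status(demo, cut, cut + timedelta(hours=13), daily_avg=9))
```

Returning the reason alongside the verdict keeps the time, sample, health, and fallback paths distinguishable, which is what a `soak-check` report would need.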

Net effect on the in-flight promotion: busy three gate ~12 h (floor-bound); `auto-rebase` ~12–13 h (5 runs); the two dependabot reusables on the 24 h fallback — i.e. **two waves**, not six independent timers.
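
These estimates can be reproduced with a rough projection. The sketch assumes uniform arrival at the 14-day average and takes the per-reusable targets the proposed rule yields as inputs; `projected_gate_hours` is an illustrative helper, not existing tooling.

```python
import math

# name: (executed runs/day, healthy runs required, time floor in hours)
# The dependabot rows use the low-volume fallback: 24 h and at least 1 run.
PLAN = {
    "agent-shield": (71, 25, 12),
    "dependency-audit": (71, 25, 12),
    "pr-review-mention": (52, 25, 12),
    "auto-rebase": (9, 5, 12),
    "dependabot-automerge": (3, 1, 24),
    "dependabot-rebase": (0, 1, 24),
}


def projected_gate_hours(daily_avg: float, needed: int, floor_h: float) -> float:
    """Earliest hour at which both the time floor and the sample are met."""
    if daily_avg <= 0:
        return math.inf  # no traffic: the required run may never arrive
    return max(floor_h, 24.0 * needed / daily_avg)


for name, (avg, needed, floor_h) in PLAN.items():
    hours = projected_gate_hours(avg, needed, floor_h)
    print(f"{name:22s} ~{hours:.1f} h")
```

On these numbers the busy three are bound by the 12 h floor and `auto-rebase` by its 5-run sample at about 13 h. `dependabot-rebase` projects to infinity because it had no runs in the 14-day window, so its "≥ 1 executed healthy run" condition depends on traffic actually arriving after the cut.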

## Open questions (to decide here)

1. **Per-ring floor.** 12 h is sized for `next` (first real exposure; doesn't span a daily cycle at 12 h, but the count drags busy ones there anyway). Should `ring0`/`ring1`/`stable` use a **shorter floor** (e.g. 4–6 h) since the candidate already soaked upstream, or stay uniform at 12 h?
2. **Sample bounds.** Are `MIN=5` / `MAX=25` right? (Rationale: ~25 clean runs makes a ~30% regression a <0.1% miss; <5 is statistically meaningless.)
3. **ε for the failure-rate comparison** — baseline is ~0 today; allowed delta = 0, or a permille threshold (e.g. ≤ baseline + 50‰)?
4. **Per-reusable vs batch** promotion (batching lets a 0-volume reusable gate the whole cohort into the 24 h path).
5. **Run attribution.** "Since cut" by created-after timestamp (a proxy) vs detecting that the run actually resolved the candidate SHA.
6. **Where the standard lives** — `petry-projects/.github/standards/` (org standard) cross-referenced from `.github-private/docs/release/{versioning,runbook}.md`.
7. **Enforcement** — a standalone `soak-check` (reads channels + run history → per-reusable PASS/WAIT + reason) for the interim, then consumed by `canary-rollout.sh` for cross-repo agents (#501).
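
For question 2, the rationale behind the sample bounds can be checked directly. The sketch assumes runs fail independently with a fixed per-run probability, which is a simplification; `miss_probability` is an illustrative name.

```python
def miss_probability(regression_rate: float, clean_runs: int) -> float:
    """Chance that `clean_runs` independent runs all pass despite a regression
    that fails each run with probability `regression_rate`."""
    return (1.0 - regression_rate) ** clean_runs


for n in (1, 5, 25):
    print(f"n={n:2d}  miss={miss_probability(0.30, n):.4%}")
```

With a 30% regression, 25 clean runs leave roughly a 0.013% chance of missing it, comfortably under the 0.1% quoted above, while 5 clean runs still miss it about 17% of the time.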

## Acceptance

- [ ] Soak standard written (numbers + rationale + per-ring floor decision) in `standards/` and cross-linked from the release runbook/versioning docs.
- [ ] A mechanical `soak-check` that, given a reusable + tier, emits PASS/WAIT with the gating reason (time / sample / health / fallback).
- [ ] Wired into the promotion path (interim: manual `soak-check` before each `cut-release.sh --channel <ring> --push`; target: `canary-rollout.sh` consumes it for cross-repo agents).

## References
- Epic #495; promotion in flight `.github-private#870`; automated gate `.github-private#501`; cross-repo cut wiring `.github-private#959`.
- Current qualitative gate: `.github-private/docs/release/runbook.md` ("Promotion is gated at every ring") + `versioning.md`.

> Until this lands, the in-flight #870 promotion uses a provisional **12 h + healthy-runs** judgment before advancing each ring.
