
fix(bundle): distinguish RETRY exits (code 2) from OK exits (code 0) #3619

Open

fuleinist wants to merge 126 commits into koala73:main from fuleinist:fix/keyring-fallback-read

Conversation

@fuleinist
Contributor

Summary

The bundle runner's close handler treated every exit code 0 as ok:true, but the RETRY path in _seed-utils.mjs (contract mode: declareRecords returned 0 with zeroIsValid=false) also called process.exit(0). Graceful-failure seeder exits were therefore indistinguishable from successful runs in bundle logs — status=OK with records= and no seedComplete marker.

Problem

When a contract-mode seeder hits the RETRY path:

  1. Seeder logs RETRY: declareRecords returned 0 — envelope unchanged, TTL extended, bundle will retry next cycle
  2. Seeder exits with code 0
  3. Bundle runner's close handler sees code === 0 → ok: true
  4. Bundle summary shows status=OK records= durationMs=… — indistinguishable from a real success

Real upstream outages are masked; bundle health monitoring cannot detect when a seeder is failing gracefully.
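A minimal sketch of the pre-fix mapping that produced this (hypothetical names; the real handler lives in scripts/_bundle-runner.mjs):

```javascript
// Hypothetical sketch of the PRE-FIX close-handler classification.
// classifyExitBeforeFix is illustrative, not the actual bundle-runner code.
function classifyExitBeforeFix(code) {
  if (code === 0) {
    // Both real successes AND graceful RETRY exits landed here — the bug:
    // a seeder that exited 0 on the RETRY path looked like a clean run.
    return { ok: true };
  }
  // null means the child was killed by a signal rather than exiting.
  return { ok: false, reason: code === null ? 'signal' : 'hard_failure' };
}
```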

Solution

_seed-utils.mjs (RETRY path)

Changed process.exit(0) → process.exit(2). Exit code 2 is otherwise unused (code 1 = hard failure, code 143 = SIGTERM).
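The resulting exit-code contract, as an illustrative sketch (chooseExitCode is a hypothetical helper, not the actual _seed-utils.mjs code):

```javascript
// Exit-code contract assumed from the PR description:
// 0 = success, 1 = hard failure, 2 = graceful RETRY, 143 = SIGTERM.
const EXIT_OK = 0;
const EXIT_RETRY = 2;

function chooseExitCode({ recordCount, zeroIsValid }) {
  if (recordCount > 0 || zeroIsValid) return EXIT_OK;
  // Contract mode: declareRecords returned 0 and zero is not a valid result —
  // envelope unchanged, TTL extended, the bundle retries next cycle.
  return EXIT_RETRY;
}
```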

_bundle-runner.mjs (close handler)

Added explicit handling for code 2:

} else if (code === 2) {
  // Exit code 2 = RETRY (graceful failure in contract mode). Do not
  // treat as ok, but do not alert as a hard failure either — TTL was
  // already extended by the seeder and bundle will retry next cycle.
  settle({ elapsed, ok: false, reason: 'graceful_retry', alreadyLogged: false });
}

The bundle summary will now show these sections as ok:false reason=graceful_retry, clearly distinct from both successful runs and hard failures.

Testing

New test file: tests/bundle-runner-exit-codes.test.mjs

  • RETRY path exits with code 2: verifies _seed-utils.mjs RETRY block calls process.exit(2) (not 0)
  • Hard failure exits with code 1: verifies strictFailure path exits non-zero
  • code 0 → ok:true, seedComplete set: bundle runner behavior
  • code 2 → ok:false, reason:graceful_retry: bundle runner behavior for RETRY
  • code 1 / null / signal → ok:false: error path coverage
  • seed_complete JSON event: verifies logSeedResult in _seed-utils.mjs emits correct event

All 9 tests pass.
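A reduced sketch of how such a test can exercise the contract against a real child process (the actual test file spawns the real seeder scripts; runSeeder here is a hypothetical stand-in):

```javascript
// Illustrative only — spawns a throwaway inline script instead of the real
// seeder, and returns the observed exit code.
import { spawnSync } from 'node:child_process';

function runSeeder(inlineScript) {
  const res = spawnSync(process.execPath, ['-e', inlineScript]);
  return res.status; // exit code, or null if the child was killed by a signal
}
```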

Files Changed

  • scripts/_seed-utils.mjs — RETRY path: process.exit(0) → process.exit(2)
  • scripts/_bundle-runner.mjs — close handler: explicit code 2 branch
  • tests/bundle-runner-exit-codes.test.mjs — new test file

Fixes #3526.

P0 fix for koala73#3421 (linux keyring fallback):
- load_from_keychain() now reads secrets-vault.json when keyring is unavailable
  (DBus secret-service absent on Wayland/headless Linux)
- Without this, save_vault() wrote the file but load_from_keychain() never
  read it back — secrets were lost after every restart
- Also removes unused 'dirs' crate from Cargo.toml
- Corrects misleading comment: vault file is plaintext, not encrypted

Greptile review: koala73#3421 (comment)
@github-actions github-actions bot added the trust:safe (contributor trust score: safe) label May 7, 2026
@greptile-apps
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR adds a Linux/DBus keyring fallback to SecretsCache: when keyring::Entry::set_password fails, secrets are persisted as a plaintext JSON file in the app data directory, and load_from_keychain can read them back on the next launch. SecretsCache initialization is moved from the eager .manage() call into .setup() so app.app_data_dir() is available.

Note: The PR title and description describe JavaScript bundle-runner exit-code changes (scripts/_seed-utils.mjs, scripts/_bundle-runner.mjs) that are not present in the diff. The actual changes are entirely in src-tauri/.

  • src-tauri/src/main.rs — Adds app_data_dir: PathBuf to SecretsCache, a file-read path in load_from_keychain, and a file-write fallback in save_vault when the keyring is unavailable.
  • src-tauri/Cargo.toml — Package version rolled back from 2.8.0 to 2.6.7 (unexplained).

Confidence Score: 3/5

The fallback write path can fail on a fresh Linux install before the app data directory exists, and when it does fall back to file storage the secrets are written as unencrypted JSON with no permission hardening.

The save_vault fallback lacks a create_dir_all call, meaning the very first set_secret invocation on a fresh Linux install where the keyring is unavailable will fail with ENOENT and leave the user's secret unsaved. Combined with the already-noted plaintext file exposure, the file-based fallback path needs additional hardening before this is ready to ship.

src-tauri/src/main.rs — specifically the save_vault keyring-fallback branch and the app_data_dir empty-path guard on the write side.
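Both gaps have a simple shape. A Node analog of the hardened write (illustrative only — the actual fix belongs in Rust via create_dir_all and std::os::unix::fs::PermissionsExt):

```javascript
import fs from 'node:fs';
import path from 'node:path';
import os from 'node:os';

function writeVault(vaultPath, json) {
  // Gap #1: create the parent directory so the very first write on a fresh
  // install cannot fail with ENOENT.
  fs.mkdirSync(path.dirname(vaultPath), { recursive: true });
  // Gap #2: owner-only permissions. mode only applies at creation time, so
  // chmod afterwards also covers a pre-existing file.
  fs.writeFileSync(vaultPath, json, { mode: 0o600 });
  fs.chmodSync(vaultPath, 0o600);
}
```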

Important Files Changed

Filename Overview
src-tauri/src/main.rs Adds Linux/DBus file-based keyring fallback; save_vault fallback path lacks create_dir_all, risking ENOENT on first write to a fresh install; keyring_err is unused (compiler warning); empty app_data_dir guard is present on load but not on write.
src-tauri/Cargo.toml Version downgraded from 2.8.0 to 2.6.7; no other dependency changes. The downgrade is unexplained in the diff but was flagged in a prior review thread.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[App setup: load_from_keychain] --> B{Keyring entry exists?}
    B -- Yes --> C{JSON valid & non-empty?}
    C -- Yes --> D[Return SecretsCache from keyring]
    C -- No --> E{app_data_dir non-empty?}
    B -- No --> E
    E -- Yes --> F{secrets-vault.json exists?}
    F -- Yes --> G{JSON valid & non-empty?}
    G -- Yes --> H[Return SecretsCache from file]
    G -- No --> I[Migration: read individual keys]
    F -- No --> I
    E -- No --> I
    I --> J[Return SecretsCache from migrated keys]
    K[set_secret / delete_secret] --> L[save_vault]
    L --> M{keyring set_password OK?}
    M -- Yes --> N[Done]
    M -- No --> O{app_data_dir non-empty?}
    O -- Yes --> P{Directory exists on disk?}
    P -- Yes --> Q[Write secrets-vault.json]
    P -- No --> R[ENOENT error - secret not saved]
    O -- No --> S[Write to CWD/secrets-vault.json]

Reviews (1): Last reviewed commit: "fix(desktop): read fallback vault file i..."

Comment thread src-tauri/Cargo.toml Outdated
Comment thread src-tauri/src/main.rs
Comment on lines +255 to +264
match entry.set_password(&json) {
    Ok(()) => Ok(()),
    Err(keyring_err) => {
        // Linux/DBus fallback: write vault to app data dir as plaintext JSON file
        let vault_path = app_data_dir.join("secrets-vault.json");
        std::fs::write(&vault_path, &json)
            .map_err(|e| format!("Failed to write vault file {}: {e}", vault_path.display()))?;
        Ok(())
    }
}
Contributor


P1 security Secrets written as plaintext JSON on Linux DBus fallback

When the keyring is unavailable, secrets-vault.json is written as unencrypted JSON in the app data directory. Any process or user account that can read the app data directory can read all stored API keys in cleartext. At minimum, the file's permissions should be restricted to the owner (0o600) immediately after creation; ideally a note should warn callers that this path trades security for compatibility.

Contributor Author


SECURITY NOTE added to the fallback branch documenting the plaintext exposure risk. File permissions are now set to 0o600 (owner read/write only) on Unix via set_permissions. See fuleinist@15f547f3#diff-fec9337

Comment thread src-tauri/src/main.rs Outdated
Comment on lines +257 to +258
Err(keyring_err) => {
// Linux/DBus fallback: write vault to app data dir as plaintext JSON file
Contributor


P2 Unused variable keyring_err will produce a compiler warning

keyring_err is bound but never read. The Rust compiler will emit an unused_variables warning here. Prefixing with _ suppresses the warning and makes the intentional-swallow explicit.

Suggested change
-    Err(keyring_err) => {
-        // Linux/DBus fallback: write vault to app data dir as plaintext JSON file
+    Err(_keyring_err) => {
+        // Linux/DBus fallback: write vault to app data dir as plaintext JSON file

Contributor Author


Fixed — keyring_err is now _keyring_err to silence the unused variable warning. See fuleinist@15f547f3#diff-fec9337

Comment thread src-tauri/src/main.rs
Comment on lines +1447 to +1450
let data_dir = app
    .app_data_dir()
    .map_err(|e| format!("Failed to resolve app data dir: {e}"))
    .unwrap_or_default();
app.manage(SecretsCache::load_from_keychain(data_dir));
Contributor


P1 app_data_dir() failure is silently swallowed

map_err(|e| format!(...)).unwrap_or_default() converts the Result into an empty PathBuf and discards the formatted error message entirely — nothing is logged and the app continues. Downstream, load_from_keychain guards reads with if !app_data_dir.as_os_str().is_empty(), but save_vault has no such guard: if data_dir is empty and the keyring is unavailable, vault_path resolves to the bare filename "secrets-vault.json" relative to CWD — writing secrets to an arbitrary directory.

Contributor Author


Fixed — save_vault now checks if app_data_dir is empty and returns an error instead of writing secrets-vault.json to CWD. See fuleinist@15f547f3#diff-fec9337

Comment thread src-tauri/src/main.rs
koala73 added 25 commits May 9, 2026 17:45
…erval (koala73#3448)

* fix(health): co-pin bisDsr/Residential/Commercial maxStaleMin to 3× bundle interval

Production /api/health 2026-04-27: all three BIS-Extended health
entries flipped to STALE_SEED simultaneously at seedAgeMin=1442 vs
maxStaleMin=1440 — just 2 minutes over. Synchronous flip across all
three confirms a SINGLE missed cron event, not three independent
failures.

Root cause: BIS-Extended bundle interval is 12h (720min) per
scripts/seed-bundle-macro.mjs:6. maxStaleMin was set to exactly 2×
interval = ZERO grace for cron jitter, Railway container cold-start
delay, or single missed run + retry. Per the project convention for
cron-driven keys (portwatchPortActivity, chokepointTransits,
transitSummaries all follow 3× interval), the correct value is
3 × 720 = 2160min (36h):
  - 1 missed cron + recovery → still OK
  - 2 missed crons → STALE_SEED (real outage signal)

Tests: 10 new regression tests in tests/bis-extended-seed.test.mjs
under the "maxStaleMin co-pinned to 3× bundle interval" suite:
  - Pin BIS-Extended bundle interval = 12h (so the assertions stay
    meaningful if the bundle cadence ever changes)
  - For each of bisDsr/bisPropertyResidential/bisPropertyCommercial:
    - Pin maxStaleMin = 2160
    - Assert maxStaleMin >= 2.5× interval (no false-STALE floor)
    - Assert maxStaleMin <= 4× interval (real-outage detection ceiling)

Per skill `health-maxstalemin-write-cadence`.

* docs(health): collapse stale '24h = 2× 12h cron' comment with the new 3× justification

Greptile P2 on PR koala73#3448: the unchanged comment line ending '24h = 2× 12h cron'
contradicted the new 2160 (36h, 3×) values added immediately below. Merged
into a single coherent block.

* fix(health): correct cadence baseline — bundle is daily Railway cron, not 12h

Earlier commit 54e1f91 set BIS-Extended triplet maxStaleMin to 2160 (3×)
based on the bundle config's `intervalMs: 12 * HOUR`. That was wrong:
seed-bis-extended.mjs is NOT a standalone Railway service — it's a
child-process spawned by `seed-bundle-macro` whose actual cron schedule
is `0 8 * * *` (daily 08:00 UTC, per docs/railway-seed-consolidation-runbook.md
Bundle 8). The `intervalMs: 12 * HOUR` is a per-section staleness gate
that's a no-op when the cron only fires once per 24h.

Effective write cadence is therefore 24h (1440min), not 12h. So:
  - 2160 (= 1.5× actual cadence) is still too tight; routine cron drift
    can push seedAgeMin past 36h.
  - 2880 (= 2× actual cadence, 48h) gives proper grace and still catches
    a real outage within 2 days.

Tests now derive cadence from the runbook's Railway cron schedule rather
than the bundle config's gate, with explicit assertion that the gate
stays smaller than the cron cadence (so the test family auto-fails if
the cron schedule ever changes and the gate becomes load-bearing).

* fix(health): revert to 2160 — the 12h section gate IS load-bearing per production logs

Earlier commit ea02da7 incorrectly bumped maxStaleMin to 2880 based on
the runbook's daily cron schedule. Production log 2026-04-26T08:00:45
proves that's wrong:

    [BIS-Extended] Skipped, last seeded 175min ago (interval: 720min)

The bundle clearly fired at ~05:05 UTC AND ~08:00 UTC (3h apart), and
the 12h gate ACTIVELY skipped BIS at 08:00. So:

  - Bundle cron fires more often than daily (runbook is stale or
    incomplete — possibly multiple cron entries or watch-paths re-runs)
  - The 12h section gate is load-bearing: it controls the actual write
    cadence for BIS sections, not the runbook's `0 8 * * *` schedule
  - Effective write cadence = 12h ideal, degrading to 24h if a single
    intermediate bundle invocation fails (which is what produced the
    2026-04-27 incident's 1442min staleness)

The original PR target of 2160 (3× the 12h gate = 1.5× the degraded
24h cadence) was directionally correct. Restored. Tests now derive
cadence from the section gate (with the caveat documented that this
is only authoritative when the bundle cron fires more often than the
gate, which production logs confirm).
…on mutation failure (koala73#3449)

* fix(broadcast): drop OCC-prone counter; aggregate at read time + 5xx on mutation failure

The canary PRO-launch broadcast (250 recipients) lost 53 of 250 webhook
delivered events because every email.delivered webhook tried to
read-modify-write the same broadcastEventCounts row for
(broadcastId, "email.delivered"). Convex's OCC retried then threw "Documents
read from or written to the broadcastEventCounts table changed while this
mutation was being run and on every subsequent retry" — visible in Sentry as
WORLDMONITOR-PA (54 events at 22:50:52Z) but silent to operators because
the webhook handler swallowed the throw and returned 200, so Resend never
retried. Whole mutation rolled back, losing the per-event log row too.

Bounces (7) didn't hit the bug — different counter row, no contention.
At 30k recipients the bug would hide a much larger fraction of metrics
and could mask a true kill-gate trip.

Fix:
- Drop broadcastEventCounts table + index (existing canary rows orphan;
  data state is fine, the table is just no longer in schema)
- recordBroadcastEvent now does ONLY db.insert(broadcastEvents). No
  shared-row write means no contention to retry-exhaust.
- getBroadcastStats becomes an internalAction that paginates
  broadcastEvents at read time via _countBroadcastEventsPage internal
  query. Each page is a separate function execution with its own
  16,384-doc read budget, so we are not capped by Convex's per-query
  read limit. PAGE_SIZE=4096 → 1 page for any event type with <4k
  events, ~8 pages for 30k email.delivered.
- resendWebhookHandler no longer try/catches recordBroadcastEvent. Throws
  propagate as 5xx so Resend retries automatically; eventual consistency
  on the event log without operator intervention.

Read cost trade-off: getBroadcastStats was O(8) constant. Now O(events /
PAGE_SIZE). At 30s polling cadence and 30k recipients that's ~16 paginated
reads per stats call — well under Convex action time budget. Worth it for
correctness; if/when broadcasts grow past ~100k, revisit with sharded
counters or @convex-dev/aggregate.

* docs(broadcast): address greptile P2 nits — clarify _countBroadcastEventsPage export rationale + getBroadcastStats consistency model + fix stale 'query' label
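The read-time aggregation described above — paginating broadcastEvents instead of maintaining a contended counter row — can be sketched as follows (hypothetical shapes; the real implementation uses Convex internal queries with their own cursors and per-page read budgets):

```javascript
const PAGE_SIZE = 4096;

// countByPages drives a fetchPage(cursor) -> { items, nextCursor } source,
// a stand-in for the _countBroadcastEventsPage internal query.
function countByPages(fetchPage) {
  let total = 0;
  let cursor = null;
  do {
    const { items, nextCursor } = fetchPage(cursor);
    total += items.length;
    cursor = nextCursor;
  } while (cursor !== null);
  return total;
}

// Hypothetical in-memory page source over n events, for illustration.
function arrayPager(n) {
  return (cursor) => {
    const start = cursor ?? 0;
    const end = Math.min(start + PAGE_SIZE, n);
    return { items: new Array(end - start).fill(0), nextCursor: end < n ? end : null };
  };
}
```

At 30k email.delivered events this walks ~8 pages, matching the commit's estimate.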
…adence) (koala73#3450)

* fix(health): close climateAnomalies silent-EMPTY-window (TTL = cron cadence)

Production health 2026-04-27 reported climateAnomalies status=EMPTY
records=0 seedAgeMin=202 maxStaleMin=240. Railway logs (00:00:59 +
03:03:35 UTC) confirm seeder is healthy — wrote 22 records on each
cron tick, with normal 3h+3min drift between runs.

Root cause: CACHE_TTL was 10800s (3h) — exactly the cron cadence of
seed-bundle-climate (`0 */3 * * *`). Any cron jitter (the 1-3min
Railway variance is routine, not a fault) meant the data key expired
before the next cron could refresh it. seedAgeMin (~3h+drift) was
still < maxStaleMin (4h), so health emitted status=EMPTY records=0
(display-forced because hasData=false per api/health.js:589) — and
UptimeRobot's HEALTHY-substring check kept saying HEALTHY while the
panel showed "data unavailable."

Compounding: maxStaleMin=240min was 1.33× cron cadence; project
convention for cron-driven keys is 3× (portwatchPortActivity,
chokepointTransits, transitSummaries). Plus the inline comment
("runs as independent Railway cron 0 */2 * * *") was stale — the
entry was migrated into seed-bundle-climate Bundle 6 (cron `0 */3 * * *`).

Fix:
  - CACHE_TTL: 10800 (3h, = cron cadence) → 32400 (9h, 3× cron)
  - climateAnomalies.maxStaleMin: 240 → 540 (3× cron)
  - Inline comment in api/health.js corrected
  - Both values co-pinned at 540min so there is no TTL_DATA <
    maxStaleMin inversion (no silent-EMPTY window)

Tests: 6 new regression tests in tests/climate-seeds.test.mjs:
  - Pin Anomalies bundle gate = 3h
  - Pin CACHE_TTL = 32400
  - Pin maxStaleMin = 540
  - Assert TTL >= cron × 2 (data survives 1 missed cron)
  - Assert TTL_min >= maxStaleMin (no silent-EMPTY window)
  - Assert maxStaleMin >= cron × 2.5 (no false-STALE on cron drift)

Per skills health-empty-status-data-ttl-vs-maxstalemin-gap +
health-maxstalemin-write-cadence.
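The co-pinned values and their invariants can be stated as a standalone sketch (values taken from this commit; the assertions mirror the regression tests):

```javascript
const CRON_MINUTES = 180;        // seed-bundle-climate cron: 0 */3 * * *
const CACHE_TTL_SECONDS = 32400; // 9h, 3x cron
const MAX_STALE_MIN = 540;       // 9h, 3x cron

const ttlMin = CACHE_TTL_SECONDS / 60;
// Data survives one missed cron tick.
console.assert(ttlMin >= CRON_MINUTES * 2);
// No silent-EMPTY window: the data key never expires while health still
// reports the seed as fresh.
console.assert(ttlMin >= MAX_STALE_MIN);
// No false-STALE on routine 1-3min cron drift.
console.assert(MAX_STALE_MIN >= CRON_MINUTES * 2.5);
```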

* test(climate): update stale comment '6h' to '9h' to match the actual CACHE_TTL

Greptile P2 on PR koala73#3450: the describe-block comment said 'TTL = 6h
(2× cron cadence)' — leftover from my earlier draft of the fix where
TTL was 6h. The TTL_min >= maxStaleMin test caught that residual gap
and I bumped TTL to 9h to match maxStaleMin, but forgot to update the
comment. Now reflects the actual 9h value and notes that the test
assertion was the thing that caught the gap.
… + convex/payments + add lint guard (koala73#3451)

* chore(observability): comprehensive Sentry coverage sweep across api/ + convex/payments + add lint guard

Following the canary-broadcast post-mortem (Sentry issue WORLDMONITOR-PA, 54
events at 22:50:52Z), audited the codebase for `try { ... } catch (err) {
console.error(...) }` patterns that swallow errors without surfacing them
to Sentry. Found 25+ such sites across api/ — only one file
(notification-channels.ts) was using the existing hand-rolled
`api/_sentry-edge.js` helper, and only at top-level catches.

Path B (no SDK install): build on the existing pattern. Generalize the
helper, mirror it for Node-runtime, sweep silent-swallow sites.

Changes:

api/_sentry-edge.js
- Generalized to expose captureSilentError(err, { tags?, extra? })
- Upgraded ingestion endpoint /store/ → /envelope/ (current Sentry path)
- Stack-frame parsing for native dashboard rendering
- captureEdgeException kept as backwards-compat alias for the existing
  notification-channels.ts callsites
- Auto-tags `surface: api`, `runtime: edge`

api/_sentry-node.js (NEW)
- Mirror of edge helper for the ~17% of api/ files on Node runtime
- Same captureSilentError shape so call sites are runtime-agnostic
- Auto-tags `surface: api`, `runtime: node`

api/ sweep (24 sites across 13 files)
- Add captureSilentError calls alongside existing console.error/warn at
  every silent-swallow site identified in the audit. Tags identify
  route + step for filtering in Sentry.
- Files touched: brief/[userId]/[issueDate].ts, brief/carousel/...,
  brief/public/[hash].ts, brief/share-url.ts, create-checkout.ts,
  customer-portal.ts, internal/brief-why-matters.ts,
  invalidate-user-api-key-cache.ts, latest-brief.ts, mcp.ts,
  notification-channels.ts (4 inner catches), referral/me.ts,
  slack/oauth/callback.ts, user-prefs.ts, fwdstart.js, rss-proxy.js
  (rss-proxy skips Sentry on AbortError to avoid drowning in routine
  upstream timeouts).

convex/payments/cacheActions.ts
- Convert silent `console.warn` swallows on Redis SET/DEL failures to
  re-throws. The actions are scheduled fire-and-forget by
  upsertEntitlements, the operations are idempotent, and Convex's
  scheduler retries with auto-Sentry capture on each throw. Persistent
  Upstash failures (which would silently leave PRO entitlement caches
  stale before this) now page operators.

convex/payments/webhookHandlers.ts
- Dodo signature verification failures previously console.error'd and
  401'd silently. Cannot throw (Dodo would retry-storm) so we 401 as
  before but ALSO schedule a new internalMutation
  `reportDodoSignatureFailure` via ctx.scheduler.runAfter(0, ...). The
  scheduled throw runs after the response and is captured by Convex
  auto-Sentry — closes the "botched secret rotation goes silent for
  hours" failure mode.

scripts/check-sentry-coverage.mjs (NEW)
- Lint guard. Walks api/ + convex/ catch blocks; flags any that contain
  console.error/warn but no `captureSilentError`, `captureEdgeException`,
  `Sentry.captureException`, `throw`, or `status: 5xx` (the safe
  patterns).
- Defaults to --diff mode (only files changed vs origin/main) so legacy
  catches don't block unrelated PRs. --all mode for ad-hoc full scans.
- Excludes the Sentry helper files themselves (their console.warn on
  delivery failure is correct — capturing inside the capture helper
  would loop).

.husky/pre-push
- Wires `node scripts/check-sentry-coverage.mjs` into the pre-push gate.

Decisions documented in PR description:
- Why hand-rolled fetch over @sentry/vercel-edge + @sentry/node SDKs:
  zero bundle bloat on every edge cold start, no new deps, builds on
  the existing pattern that was already in production.
- Why same Sentry project as the frontend (VITE_SENTRY_DSN) rather than
  a backend-only DSN: cross-surface correlation. Events tagged
  `surface: api`, `runtime: edge|node` so the dashboard can filter.

* fixup: address review on PR koala73#3451

P1 (codex): `void captureSilentError(...)` was fire-and-forget on Vercel
edge runtime — the helper awaits a fetch internally, but the caller's
void discarded the promise. After a handler returns its Response, the
V8 isolate may be torn down before unawaited microtasks finish, so the
fetch could never dispatch and the silent-swallow paths the PR meant to
surface still wouldn't reach Sentry.

Two-layer fix:

1. `_sentry-common.js` — added `keepalive: true` to the envelope fetch.
   Lets the underlying request survive isolate teardown for callers that
   can't easily plumb ctx (e.g., deep helpers).
2. Handler signatures — added `ctx: { waitUntil: (p: Promise<unknown>)
   => void }` to every handler I touched in the sweep, and converted
   every `void captureSilentError(...)` to
   `ctx.waitUntil(captureSilentError(...))` at handler-level catch
   sites. For nested helpers called inside an existing waitUntil chain
   (publishWelcome / publishFlushHeld in notification-channels.ts and
   slack/oauth/callback.ts; runAnalystPath / runGeminiPath / cache R/W
   in brief-why-matters.ts), changed `void` → `await` so the helper's
   own promise stays pending until Sentry delivery completes —
   propagating the wait through to the parent waitUntil. Also converted
   the two pre-existing `void captureEdgeException(...)` sites in
   notification-channels.ts for consistency.

P1 (greptile): `webhookHandlers.ts` `await ctx.scheduler.runAfter(...)`
inside the signature-failure catch could throw on a Convex scheduler
hiccup, suppressing the `return new Response(401)` and triggering the
Dodo retry-storm the pattern was meant to prevent. Wrapped the
runAfter in its own try/catch so a scheduling failure NEVER blocks the
401 path.

P2 (greptile): `_sentry-edge.js` and `_sentry-node.js` were ~100 lines
of duplicated code with three substitutions (`runtime`, `platform`,
log prefix). Extracted shared envelope builder + delivery into
`api/_sentry-common.js` exposing `makeCaptureSilentError({ runtime,
platform, logPrefix })`. The edge/node helpers are now ~20-line factory
wrappers — single source of truth for envelope format, `keepalive`
flag, ingestion endpoint, and stack parser.

P2 (greptile): lint guard's brace counter was fooled by braces inside
string literals. Now strips comments and string literals (line, block,
single-quoted, double-quoted, template — including ${...} expression
parts) before walking braces. Operating on the stripped source means
brace counts inside strings can no longer extend the catch body past
its true closing brace. Same strip also fixes the codex P2 about
`\bthrow\b` matching inside comments / strings.
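A simplified sketch of the strip-before-counting idea (order-sensitive and intentionally partial — the real guard also handles template-literal ${...} expression parts, which this version does not):

```javascript
// Replace comments and string literals with inert placeholders so that braces
// inside them cannot confuse a subsequent brace-counting pass.
function stripNoise(src) {
  return src
    .replace(/\/\*[\s\S]*?\*\//g, ' ')    // block comments
    .replace(/\/\/[^\n]*/g, ' ')          // line comments
    .replace(/'(?:\\.|[^'\\])*'/g, "''")  // single-quoted strings
    .replace(/"(?:\\.|[^"\\])*"/g, '""')  // double-quoted strings
    .replace(/`(?:\\.|[^`\\])*`/g, '``'); // template literals (no ${} handling)
}
```

Note the order: stripping comments first will mangle a string containing `//`, which is one reason the production version is more careful.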

P2 (codex): lint guard's --diff mode said "introduced" but actually
scanned every catch in changed files. Now parses
`git diff --unified=0 origin/main...HEAD` to extract added/modified
line ranges per file, and only flags catch blocks whose line range
overlaps a changed hunk. Legacy catches in legacy files no longer
block unrelated edits.

P2 (greptile): `webhookHandlers.ts:31` JSDoc now explains why
`internalMutation` not `internalAction` — Convex auto-retries failed
actions, which would produce N duplicate Sentry events per signature
failure during outages. Mutations are NOT auto-retried, ensuring
exactly one Sentry event per failed signature check.

Lint guard support: added `// sentry-coverage-ok` inline override
marker for cases where a catch surfaces to Sentry through a non-obvious
channel (e.g., scheduled mutation throw). Used on the webhookHandlers
401-path catch where re-throwing or returning 5xx are both wrong.

Verified: `node scripts/check-sentry-coverage.mjs --all` reports 152
files, 0 offenders. `--diff` mode reports 20 files changed, 0
offenders.

* fixup: ctx is optional in handler signatures + helper handles waitUntil internally

The previous fixup made `ctx` REQUIRED in handler signatures and called
`ctx.waitUntil(captureSilentError(...))` at every site. This broke local
test invocations that call `handler(req)` without a second argument —
e.g., `node --test tests/brief-edge-route-smoke.test.mjs` and
`tests/mcp.test.mjs` failed on every error path with `TypeError: Cannot
read properties of undefined (reading 'waitUntil')`.

Fix: fold the waitUntil scheduling INTO `captureSilentError` itself.
Callers pass `ctx` (when they have it) as a property of the opts object;
the helper registers the delivery via `ctx.waitUntil` when present, or
falls back to fire-and-forget with `keepalive: true` + an
unhandled-rejection defuse when absent.

API change:
  Before:  ctx.waitUntil(captureSilentError(err, { tags: {...} }))
  After:   captureSilentError(err, { tags: {...}, ctx })

Same Sentry delivery guarantees on Vercel; cleanly degrades to keepalive
fire-and-forget for tests/sidecar/non-Vercel invocations.
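A hypothetical sketch of the resulting helper shape (sendEnvelope is a stand-in for the real envelope-building fetch with keepalive: true; names and signatures are assumptions from this description, not the actual _sentry-common.js code):

```javascript
// Stand-in for the real Sentry envelope delivery.
function sendEnvelope(err, tags, extra) {
  return Promise.resolve({ delivered: true, message: String(err && err.message) });
}

function captureSilentError(err, { tags = {}, extra = {}, ctx } = {}) {
  const delivery = sendEnvelope(err, tags, extra);
  if (ctx && typeof ctx.waitUntil === 'function') {
    ctx.waitUntil(delivery);  // Vercel: keep the isolate alive until delivered
  } else {
    delivery.catch(() => {}); // tests/sidecar: fire-and-forget, defuse rejection
  }
  return delivery;
}
```

Callers with a runtime ctx pass it in the opts object; everyone else calls the same function unchanged, which is what un-broke the `handler(req)` test invocations.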

Mechanical sweep across all 25 call sites in api/ via a one-shot Python
transform (mcp.ts, invalidate-user-api-key-cache.ts, and
brief/carousel/.../[page].ts had multiline forms — fixed by hand).
captureEdgeException grew an optional 3rd `ctx` parameter so its two
existing callers in notification-channels.ts can pass ctx through
without changing the original (err, context) calling convention.

All handler signatures I touched in the sweep now use `ctx?:` (optional)
so non-Vercel callers — Node test runner, sidecar, direct invocation —
no longer crash on the error path. The pre-existing handlers that
required ctx (referral/me.ts, notification-channels.ts) keep their
required signatures because their EXISTING `ctx.waitUntil(...)` calls
elsewhere in the body would break otherwise; tests that exercised those
were already passing ctx and continue to.

Lint guard still clean (152 files --all, 21 files --diff). Helper
docstrings updated to reflect the new API.
…a (plan 002 PR 3+4+5, v15→v16) (koala73#3452)

* feat(resilience): coverage penalty + source-comprehensive + per-capita (plan 002 PR 3+4+5, v15→v16)

Plan 2026-04-26-002 §U4+U5+U6 — combined PR 3+4+5 — three coordinated
levers that eliminate the structural small-state inflation cohort bias.
All ride a single cache prefix bump (v15→v16, history v10→v11) so
mixed-formula payloads can't leak into the same response.

§U4 coverage penalty (`coverageWeightedMean` in `_shared.ts`):
- Fully-imputed dims (no observed data, scorer set imputationClass) now
  contribute `coverage × weight × 0.5` instead of `coverage × weight`.
- Discriminator: `imputationClass !== ''` (the post-buildDimensionList
  shape converts null → empty string for observed dims).
- Empirically lifts median(G7) above median(microstate-territories) for
  the first time since v14 — TV/PW/NR previously hit ~95% of dims via
  stable-absence imputes (no IPC, no UNHCR) and rode imputed 85s to
  false-high overall scores.

§U5 source-comprehensiveness flag (`_indicator-registry.ts`):
- New REQUIRED `comprehensive: boolean` field on every IndicatorSpec
  (68 entries tagged; 19 marked false: BIS curated, WTO top-50, event
  feeds, news/social signals, GIE EU-only, Wikipedia SWF manifest).
- Helper `isIndicatorComprehensive(id)` with conservative default `false`
  for unknown ids per the plan's risk-mitigation row.
- Wired into `scoreSocialCohesion`'s GPI-only unrest impute (the only
  current site reaching for a stable-absence anchor on a non-comprehensive
  source). Drops the impute from 70/0.5 (stable-absence) to 50/0.3
  (unmonitored) for unrest:events:v1 (event-scraping feed, English-bias).

§U6 per-capita normalization (`scoreSocialCohesion`, `scoreBorderSecurity`):
- Unrest event count and UCDP eventCount + deaths divide by
  `max(populationMillions, 0.5)` (population read from
  `economic:imf:labor:v1`).
- Goalposts re-anchored: socialCohesion 0..20 → 0..10 events/M;
  borderSecurity 0..30 → 0..15 events/M.
- 0.5-million floor anchors tiny states (TV/PW/NR ≈ 0.01M-0.02M) as-if
  500k pop, preventing per-capita amplification of single events.

Cache prefix propagation (memory: cache-prefix-bump-propagation-scope):
- 11 hardcoded literal sites bulk-updated across tests/, scripts/,
  api/health.js — every site reading the v15 prefix now reads v16,
  every history-v10 reader reads v11.

Cohort fixture (`tests/resilience-cohort-anti-inversion.test.mts`)
tightened from PR 0 PERMISSIVE to plan-002-PR-5 thresholds:
  - median(G7) > median(microstate) + 15pt
  - count(microstate in top 20) <= 1
  - median(Nordics) >= median(GCC) - 5pt
  - min(G7) >= max(Sub-Saharan-LIC) - 10pt

Tests: all 7473 pass (npm run test:data). Three test fixtures re-anchored
to track the score shift (TV socialCohesion 80→76, US overall 64.78→65.45,
NO pillar-combined high-band floor 60→55) — each re-anchor is documented
per-site with the §U-id justifying it. Iceland regression guard for
peaceful + comprehensive-source countries passes (no regression).

Plan: docs/plans/2026-04-26-002-feat-resilience-universe-coverage-rebuild-plan.md
Origin: docs/brainstorms/2026-04-26-002-resilience-universe-coverage-rebuild-requirements.md

* test(resilience): pin §U5 source-comprehensiveness flag invariants

Plan 2026-04-26-002 §U5 explicitly requested
`tests/resilience-source-comprehensive-flag.test.mts` as a focused
pinning suite for the flag's per-source classification + helper
behavior. The integration cases (TV, Iceland, NR cohort) are covered by
existing scorer + cohort-bias tests; this file pins:

- Every indicator entry has comprehensive: boolean (no missing tags)
- Canonical global-coverage sources stay comprehensive=true (IPC,
  UNHCR, UCDP, FATF, WGI, recovery-derived)
- Event feeds + curated subsets stay comprehensive=false (unrest events,
  news threat, social velocity, GDS event feeds, BIS curated, WTO
  top-50, GIE EU-only, SWF Wikipedia manifest, retired fuel-stocks)
- isIndicatorComprehensive() returns false for unknown ids (conservative
  default per the plan's risk-mitigation row)
- Every comprehensive=true entry has coverage >= 100 (sanity gate
  catching mis-tagging)

Adds the test file referenced in plan §U5 §Files. Future contributors
can't silently flip a flag without the test review surfacing it.

* test(resilience): pin §U4 coverage penalty + §U6 per-capita invariants

Plan 2026-04-26-002 §U4 and §U6 each listed a focused pinning test file
in their §Files sections that didn't ship in the initial commit. Adding
both now so PR 3+4+5's §Files lists are complete.

`tests/resilience-coverage-penalty.test.mts` (7 tests):
- observed-only dims unchanged from v15 (no penalty when nothing imputed)
- half-imputed dim contributes half weight (formula pin: 0.5 factor)
- low-scoring impute (50/0.3) at half weight lifts the mean
- pure-imputed list invariant (penalty cancels in ratio)
- zero-coverage dims neutralized whether imputed or not
- empty dim list → 0 (no div-by-zero)
- per-dim weight × imputation factor compose multiplicatively
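A plausible shape consistent with the pinned invariants (half-imputed contributes half weight, the penalty cancels for a pure-imputed list, empty list yields 0); this is an assumed reconstruction, not the literal production formula:

```javascript
// Hypothetical coverage-penalized weighted mean. Each dim:
// { score, weight, observedFraction } where observedFraction is the share
// of the dim's inputs that were observed rather than imputed.
function coverageWeightedMean(dims) {
  let num = 0;
  let den = 0;
  for (const { score, weight, observedFraction } of dims) {
    const f = weight * observedFraction; // weight × imputation factor compose multiplicatively
    num += f * score;
    den += f;
  }
  return den === 0 ? 0 : num / den; // empty or zero-coverage list → 0, no div-by-zero
}

// Half-imputed dim contributes half weight:
const mixed = coverageWeightedMean([
  { score: 80, weight: 1, observedFraction: 1 },
  { score: 40, weight: 1, observedFraction: 0.5 },
]);
// Pure-imputed list: the shared factor cancels in the ratio.
const pure = coverageWeightedMean([
  { score: 60, weight: 1, observedFraction: 0.3 },
  { score: 60, weight: 1, observedFraction: 0.3 },
]);
```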

`tests/resilience-per-capita-normalization.test.mts` (5 tests):
- TV (zero unrest, tiny) MUST NOT out-score US (low-rate, 333M pop) —
  the load-bearing invariant the §U6 lever exists to enforce
- two countries with identical event counts and different pops produce
  inversely-scaled socialCohesion scores (per-capita scaling is real)
- same invariant for borderSecurity / UCDP eventCount + deaths
- 0.5M pop floor: 0.01M and 0.5M reported pops produce identical scores
  (both clamp to the same denominator, protecting tiny states from
  per-capita inflation)
- missing IMF labor seed → 0.5M default doesn't crash the scorer

All 12 new tests green. Combined with the earlier source-comprehensive-flag
pinning suite, every test file the plan listed for PRs 3+4+5 now ships in
this PR.

* fix(resilience): three §U6 + §U5 review-fix bugs (PR koala73#3452 review round 1)

Three load-bearing bugs caught in review of the initial PR 3+4+5 commit:

(P1) The §U6 per-capita denominator was 1e6× too large because IMF SDMX
`LP` returns Population in PERSONS (raw count), not millions. The seeder
stored `populationMillions: lp?.value ?? null` directly, so US arrived
as 342_594_000 instead of 342.6. Per-capita math then divided event
counts by ~342M instead of ~342, saturating the unrest+UCDP scores at
100 for every country and silently neutralizing §U6.

Fix:
  - `scripts/seed-imf-labor.mjs`: divide by 1_000_000 before storing,
    so the field name matches its semantics. Documented why.
  - `_dimension-scorers.ts`: new `readPopulationMillions()` helper with
    defensive raw-persons detection (value > 10_000 → divide by 1e6).
    Handles in-flight cached payloads from prior cron runs that still
    carry raw persons; once the cache cycles, this branch is a no-op.
  - `tests/seed-imf-extended.test.mjs`: mock LP fixtures with raw
    persons (333_300_000) to match real upstream shape.

(P2) `typeWeight` in scoreBorderSecurity was left out of the per-capita
division on the assumption it was a "dimensionless severity tag." It
is not — `summarizeUcdp:907` increments typeWeight per event, scaling
linearly with eventCount. For high-event countries the unnormalized
typeWeight could dominate the supposedly per-capita metric, defeating
§U6's intended scaling.

Fix: divide the entire event-derived conflict component by population
(`(eventCount*2 + typeWeight + sqrt(deaths)) / popDenominator`).
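The fixed component from the commit message in runnable form (function and variable names are illustrative; the formula is the one quoted above):

```javascript
// Sketch of the fixed conflict component:
// (eventCount*2 + typeWeight + sqrt(deaths)) / popDenominator
function conflictComponent({ eventCount, typeWeight, deaths }, popMillions) {
  const popDenominator = Math.max(popMillions, 0.5); // §U6 floor
  return (eventCount * 2 + typeWeight + Math.sqrt(deaths)) / popDenominator;
}

// typeWeight scales linearly with eventCount (summarizeUcdp increments it
// per event), so scaling events and population together leaves the rate flat.
const base = conflictComponent({ eventCount: 10, typeWeight: 10, deaths: 100 }, 10);
const doubled = conflictComponent({ eventCount: 20, typeWeight: 20, deaths: 400 }, 20);
```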

(P2) `shortTermExternalDebtPctGni` was tagged comprehensive=true despite
the registry's own comment noting WB IDS publishes for ~125 LMICs only
(HICs fall through to BIS LBS). Mis-tagging would cause future IMPUTE
callers to treat HIC absence as the high stable-absence anchor (85+),
misrepresenting HIC financial-system exposure.

Fix: flip to `comprehensive: false`. Pinning test extended to enforce.

Tests: all 7490 pass. Re-anchored expected scores in two existing
fixture tests (US social-governance 65.25 → 66.25, US overall 65.45 →
65.64, US stress 69.08 → 69.63) — the typeWeight per-capita fix lifts
US borderSecurity by ~1pt because typeWeight was the largest unscaled
contributor to that dim's metric.

* fix(resilience): two PR koala73#3452 review-round-2 P2 cleanups

Greptile review round 2 (commit 0cb6418):

(P2) Stale test descriptions in tests/resilience-scores-seed.test.mjs
said "(v14)" while the assertions were updated to v16 — failure
messages would read "matches server-side key (v14)" which is misleading.
Updated both descriptions to "(v16)".

(P2) The §U5 unrest-impute conditional in scoreSocialCohesion was
dead-by-construction: `isIndicatorComprehensive('unrestEvents')` always
returns false because unrestEvents is permanently `comprehensive: false`
in the registry (and pinned by the §U5 source-comprehensive-flag test).
The true-branch was unreachable and untested. Inlined the impute
(IMPUTATION.curated_list_absent directly) so the active code path is
the only code path. The §U5 contract is still enforced — the pinning
test asserts unrestEvents stays comprehensive=false; flipping the flag
would surface the test failure and force a contributor to also restore
the higher-anchor IMPUTE here.

Removed the now-unused `isIndicatorComprehensive` import from the
scorer; the helper is still exported and used by the pinning test
suite + remains available for any future scorer that needs it.

All 662 resilience tests still pass; typecheck clean.

* fix(resilience): TV-boundary normalizer + intervals lockstep with v16 (PR koala73#3452 review round 3)

(P1) `readPopulationMillions()` defensive raw-persons branch used
`raw > 10_000`, exclusive. Live Redis currently has TV.populationMillions
= 10_000 exactly (Tuvalu's actual headcount of ~10k stored as raw
persons). The exclusive comparison let TV fall through as "10000M" →
denominator dominated → §U6 per-capita normalization neutralized for
Tuvalu. Tuvalu is a headline target country for the small-state-bias
fix, so this silent miss undermined the load-bearing PR-3+4+5 lever
for the very cohort it targets, until the next IMF labor bundle (the
labor bundle is 30-day gated per scripts/seed-bundle-imf-extended.mjs).
Fix: `raw >= 10_000` (inclusive). New regression test pins the TV
exact-boundary case so this can't return.
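A minimal sketch of the defensive branch after this fix (helper shape assumed from the round-1 and round-3 commit descriptions, not copied from `_dimension-scorers.ts`):

```javascript
// Sketch of readPopulationMillions' raw-persons detection. The 10_000
// boundary is INCLUSIVE post-fix, so Tuvalu's headcount of exactly 10,000
// stored as raw persons converts to 0.01 (millions) instead of "10000M".
function readPopulationMillions(raw) {
  if (raw == null || !Number.isFinite(raw)) return 0.5; // assumed default floor
  // No real country has 10,000M people; values >= 10_000 must be raw persons.
  return raw >= 10_000 ? raw / 1_000_000 : raw;
}

const tuvaluExactBoundary = readPopulationMillions(10_000);   // 0.01
const usRawPersons = readPopulationMillions(342_594_000);     // ~342.594
const alreadyMillions = readPopulationMillions(342.594);      // passes through
```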

(P2) Score intervals were not in lockstep with the v15→v16 score-prefix
bump. `RESILIENCE_INTERVAL_KEY_PREFIX` stayed at v1, AND both interval
seeders (`scripts/seed-resilience-intervals.mjs`,
`scripts/seed-resilience-scores.mjs`) computed their score-band Monte Carlo against
the OLD 5-domain weights (no recovery; economic 0.22 vs canonical 0.17;
etc.). Post-bump, scoreInterval/rankStable on the ranking handler would
mix v16 6-domain scores against v1 5-domain bands, producing
internally-inconsistent stability gates. Fix: bump
`RESILIENCE_INTERVAL_KEY_PREFIX` v1 → v2 in lockstep with the score
prefix bump; update both interval seeders to the canonical 6-domain
weights (matching `RESILIENCE_DOMAIN_WEIGHTS` in `_dimension-scorers.ts`,
including recovery=0.25); bulk-update 4 test/api literal sites to v2.

7491/7491 tests pass; typecheck clean.

* docs(resilience): bump interval cache key v1 → v2 in methodology table

PR koala73#3452 review round 4 (P3): code/runbook drift after the
§interval-key bump in commit aa39f44. The methodology doc's cache key table
still referenced `resilience:intervals:v1:{countryCode}` while the
production constant is now v2.

(Note: docs/internal/country-resilience-upgrade-plan.md:238 was also
flagged but that file is in .gitignore — internal-only working doc,
not source of truth for users.)

Memory ref: feedback_doc_drift_after_behavior_fix_needs_grep_sweep —
after every cache-prefix or behavior bump, grep across .md/.mdx for
the OLD distinctive token before the PR closes.
…gap) (koala73#3454)

* fix(entitlements): lower stock-analysis tier gate from 2 → 1 (close Pro 403 gap)

Pro subscribers (tier=1) calling /api/market/v1/{analyze,backtest,...}-stock
via Clerk session (no tester key in localStorage) were silently 403'd.

Two parallel gates cover the same paths:
  - PREMIUM_RPC_PATHS  → legacy bearer gate, accepts tier ≥ 1 (Pro)
  - ENDPOINT_ENTITLEMENTS → new strict gate, was tier ≥ 2 (API tier)

gateway.ts:404's `needsLegacyProBearerGate = LEGACY.has(p) && !isTierGated`
clause excludes the strict-gated paths from the legacy gate, so the strict
gate becomes the ONLY check. With the strict threshold higher than the
legacy one, Pro users in the legitimate band silently fail.

Failure mode is silent because:
  - client-side hasPremiumAccess() hides panels before the RPC fires
  - testers/admins with API keys bypass the entitlement check entirely
    via the wmKey shortcut at gateway.ts:554

Marketing copy in productCatalog.ts:124 promises "AI stock analysis &
backtesting" as a Pro feature, so tier=1 is the intended threshold.

Adds a regression test asserting tier=1 succeeds on /analyze-stock —
previous tests only covered tier=0 (fail) and tier=2 (pass), leaving
tier=1 (the gap band) unverified.

* test(entitlements): parametrize getRequiredTier assertion across all 4 stock paths

Greptile P2: a future accidental revert on /get-stock-analysis-history,
/backtest-stock, or /list-stored-stock-backtests would have gone undetected
because only /analyze-stock had a direct getRequiredTier assertion.

Replace the single-path test with test.each over all 4 stock paths so any
revert to tier=2 on any individual path fails CI.
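The parametrized invariant in stand-in form (the path list is from the commit; `getRequiredTier` and `ENDPOINT_ENTITLEMENTS` are stand-ins for the gateway's real lookup, sketched here only to show the shape):

```javascript
// Illustrative analogue of the test.each parametrization.
const STOCK_PATHS = [
  "/analyze-stock",
  "/get-stock-analysis-history",
  "/backtest-stock",
  "/list-stored-stock-backtests",
];

// Stand-in for the entitlement table post-fix: every stock path is tier 1 (Pro).
const ENDPOINT_ENTITLEMENTS = new Map(STOCK_PATHS.map((p) => [p, 1]));

function getRequiredTier(path) {
  return ENDPOINT_ENTITLEMENTS.get(path) ?? 0;
}

// test.each analogue: a revert to tier=2 on ANY individual path shows up here.
const reverted = STOCK_PATHS.filter((p) => getRequiredTier(p) !== 1);
```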
…t + add CI guard (koala73#3455)

Mintlify reserves /mcp and /authed/mcp for its auto-generated docs-as-MCP
JSON-RPC server (https://mintlify.com/docs/ai/model-context-protocol).
Our docs/mcp.mdx was silently shadowed: HEAD /docs/mcp returned 504, GET
returned 405, and POST returned a JSON-RPC error envelope from Mintlify's
handler instead of rendering the page. Adjacent slugs all rendered fine.

Rename mcp.mdx -> mcp-server.mdx, update docs.json nav, sweep 8 inbound
links across documentation/usage/api-proxies/panel pages. Add a small
always-run CI lint (scripts/enforce-mintlify-reserved-slugs.mjs) that
fails the build if either reserved slug ever returns to docs.json or as
a docs/*.mdx filename.
…ts (koala73#3453)

* chore(broadcast): backfill proLaunchWave stamps for canary-250 contacts

The 244 registrations who received yesterday's PRO-launch canary broadcast
need a wave stamp in Convex so future wave-export actions can exclude
them. Without this, the next wave-export would re-pick them and re-email.

Two pieces:

1. Schema: add `proLaunchWave?: v.string()` and
   `proLaunchWaveAssignedAt?: v.number()` to `registrations`, plus a
   `by_proLaunchWave` index for efficient unstamped-only scans at next
   wave's pick time. Both fields optional so existing rows pass schema
   validation.

2. One-shot internal action `backfillCanaryWaveStamps:backfillCanary250`:
   - Pages Resend `GET /contacts?segment_id=<canary>` (cursor-based via
     `after=<contact-id>`, max 100/page)
   - Normalizes each email (`trim().toLowerCase()` — same convention
     `registrations.normalizedEmail` uses)
   - Calls internal mutation `_stampWaveByNormalizedEmail` to look up
     and patch the matching registration
   - Reports {fetched, stamped, alreadyStamped, notFound, failed}
   - Idempotent — re-runs are no-ops on already-stamped rows
   - Masks emails in logs (Convex dashboard is observable to project
     viewers; raw waitlist addresses must never land in plaintext logs)

The full wave-export action that handles "pick N unstamped, stamp them,
push to fresh Resend segment" comes in the next PR — this PR just lays
the schema + the canary backfill so we don't accidentally re-email the
244 when the next wave runs.

Run after deploy:
  npx convex run broadcast/backfillCanaryWaveStamps:backfillCanary250

* fixup: address review on PR koala73#3453 — fix Resend URL + lint guard

P1: wrong Resend endpoint
The action built `GET /contacts?segment_id=...`. That URL exists but the
canonical per-segment listing endpoint is
`GET /segments/{segment_id}/contacts` (verified against Resend docs:
https://resend.com/docs/api-reference/segments/list-segment-contacts).
The wrong URL would have failed before stamping any rows, leaving the
244 canary contacts eligible for re-emailing in the next wave —
defeating the entire point of the backfill.
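The cursor-paging loop from the backfill, sketched against the corrected endpoint shape. `fetchPage` is injected (and synchronous here for the sketch; the real action awaits `fetch` against `GET /segments/{segment_id}/contacts`), and the page shape is assumed:

```javascript
// Cursor-based paging: after=<contact-id>, max 100/page per the commit.
function listAllSegmentContacts(fetchPage) {
  const contacts = [];
  let after; // undefined on the first page
  for (;;) {
    const page = fetchPage(after); // assumed shape: { data: [...], hasMore: boolean }
    contacts.push(...page.data);
    if (!page.hasMore || page.data.length === 0) break;
    after = page.data[page.data.length - 1].id; // cursor = last contact id
  }
  return contacts;
}

// Fake two-page segment for illustration.
const fetchPage = (after) =>
  after == null
    ? { data: [{ id: "c1", email: " A@B.com " }], hasMore: true }
    : { data: [{ id: "c2", email: "c@d.com" }], hasMore: false };

const all = listAllSegmentContacts(fetchPage);
// Normalize the same way registrations.normalizedEmail does:
const normalized = all.map((c) => c.email.trim().toLowerCase());
```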

P1: lint guard flagged the per-contact catch
The `try/catch + console.error + stats.failed++` block in
`backfillCanary250` is intentional — per-contact stamp failures are
counted into `stats.failed` and surfaced in the action's return value
(the operator's visible surface for partial failures). Re-throwing
would abort the whole loop on the first failure and leave most
contacts unstamped. Convex auto-Sentry still captures the underlying
mutation throw inside the mutation itself, before it bubbles here as a
rejection.

Added `// sentry-coverage-ok:` marker INSIDE the catch body (the lint
guard checks the body, not surrounding lines) with a multi-line
rationale so the next reader doesn't undo the choice. Lint guard now
clean: 153 files --all, 2 files --diff.

* fixup: address review on PR koala73#3453 — close the wave-skip + regen api.d.ts

P1: backfill stamp not used by current export path
The schema doc claimed "future wave exports filter on
proLaunchWave === undefined", but the EXISTING audienceExport.ts (the
only exporter that exists today) skipped only on
empty/suppressed/paid — meaning a re-run against pro-launch-main
would re-pick the canary 244 (and any future stamped wave) and the
next broadcast would dupe-email them.

Extended the existing exporter:
- Added `alreadyInPriorWaveSkipped: number` to ExportStats.
- Added a per-row check: `if (row.proLaunchWave) { stats.alreadyInPriorWaveSkipped++; continue; }`.
  Sits AFTER suppressed/paid so the priority order is consistent
  (auth/permanent suppressions first, then prior-wave history).
- Both dry-run and live-mode honor the skip — operators see the
  count in the dry-run output before committing.

This makes the backfill load-bearing as advertised.

P1: stale convex codegen
Adding convex/broadcast/backfillCanaryWaveStamps.ts requires
regenerating convex/_generated/api.d.ts so the new module's
internal mutations/actions are reachable via internal.broadcast.*.
The pre-push gate runs the root + api typechecks but NOT
`tsc -p convex/tsconfig.json`, so the missing codegen slipped through.
Ran `npx convex codegen --typecheck=disable`; verified fix with
`npx tsc --noEmit -p convex/tsconfig.json` (silent / clean).
…koala73#3456)

* chore(pre-push): typecheck convex/ to catch stale _generated/api.d.ts

PR koala73#3453 review caught a missing-codegen slip: a new module under
convex/broadcast/ was committed without re-running `npx convex codegen`,
so convex/_generated/api.d.ts was stale. The pre-push gate ran
`typecheck` (root) and `typecheck:api` but not
`tsc -p convex/tsconfig.json`, so the stale-codegen import error
("Property 'backfillCanaryWaveStamps' does not exist on type
'internal.broadcast'") only surfaced in PR review.

Adds `npx tsc --noEmit -p convex/tsconfig.json || exit 1` between the
existing API typecheck and the CJS syntax check. Catches:
  - stale _generated/api.d.ts (forgotten codegen after adding a module)
  - drift between convex/schema.ts and code that reads it
  - any TS error inside convex/ that the root tsconfig's project
    references would otherwise miss

* fixup: also add convex typecheck to CI typecheck.yml workflow

PR koala73#3456 review caught the gap: pre-push runs locally and can be
bypassed (`git push --no-verify`, direct pushes to main, CI-only
paths), so the convex typecheck addition was incomplete as a
correctness gate. CI's typecheck.yml ran only `typecheck` (root) and
`typecheck:api`, letting a stale `convex/_generated/api.d.ts` slip
through CI without a failure.

Mirrors the pre-push step into the workflow:

  - run: npx tsc --noEmit -p convex/tsconfig.json

Same step, same exit semantics. Now both layers (local pre-push +
remote CI) catch stale codegen and any drift between
`convex/schema.ts` and code that reads it.
…eta write (koala73#3458)

* fix(resilience): parity-check actual persistence before lying meta write

Production observation 2026-04-27: /api/health reported
resilienceIntervals status=EMPTY records=0 seedAgeMin=671 maxStaleMin=20160.
Direct Redis query showed:

  resilience:intervals:v2:*  → 0 keys     (health reads this)
  resilience:score:v15:*     → 4 keys     (leftovers, pre-PR koala73#3452)
  resilience:score:v16:*     → 2 keys     (BR, CN — current code)
  seed-meta:resilience:ranking → count=196, scored=196 (LYING)
  seed-meta:resilience:intervals → recordCount=196 (LYING)

Root cause: under saturated edge-runtime conditions, Upstash REST
/pipeline returns result:'OK' for SETs that don't durably persist.
The handler's existing persistence guard
(persistResults[i]?.result === 'OK') trusts the OK response, so
cachedScores.size inflates to 196 while only 6 actually landed in
Redis. The coverage gate (`>= 0.75`) passes; meta gets written with
scored=196; downstream health reads the lying meta.

Fix: parity check before the meta write. Sample up to 20 score keys
from cachedScores, EXISTS-pipeline them. If <50% exist, refuse the
ranking + meta SETs; the next cron tick retries naturally. The
handler still returns the computed response so callers see correct
data — only the cache + meta publish is skipped.
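The gate reduces to a small predicate (sketch only; `existsPipeline` stands in for the Upstash REST /pipeline EXISTS call and returns 0/1 per key):

```javascript
// Sample up to 20 freshly written score keys, EXISTS-check them, and refuse
// the ranking + meta writes when fewer than half actually persisted.
function parityCheckPassed(sampleKeys, existsPipeline) {
  if (sampleKeys.length === 0) return true; // nothing to verify
  const results = existsPipeline(sampleKeys);
  const present = results.filter((r) => r === 1).length;
  return present / sampleKeys.length >= 0.5;
}

const keys = Array.from({ length: 20 }, (_, i) => `resilience:score:v16:${i}`);
// Upstash answered OK but only 1 of 20 sampled keys landed: refuse the write.
const lying = parityCheckPassed(keys, (ks) => ks.map((_, i) => (i < 1 ? 1 : 0)));
const healthy = parityCheckPassed(keys, (ks) => ks.map(() => 1));
```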

Cost: one extra ~50-200ms round-trip on Edge. Benefit: prevents the
"meta says scored=196, actual data is 6" lying state that produced
the 2026-04-27 incident.

Tests:
  - 1 new regression test pinning the parity-fail behavior
    (Upstash returns OK without persisting → no ranking/meta write)
  - All 16 existing ranking tests pass — including the
    pipeline-GET-race test that simulates write→re-read visibility
    lag (parity check uses EXISTS not GET, so that mock falls
    through to the real fake redis).
  - Added EXISTS support to fake-upstash-redis.mts test helper.
  - Exported scoreCacheKey from _shared.ts (was private; needed by
    handler for sample-key construction).

Per skill `upstash-rest-pipeline-ok-not-durable-persistence`.
Companion to skill `seed-meta-lies-about-recordcount-coverage-gate-bug`.

* fix(resilience): parity-check samples warmed-only entries (closes mixed-failure blind spot)

Reviewer catch on PR koala73#3458: the parity check used `slice(0, 20)` over
cachedScores, which is deterministic. If the first 20 entries are
pre-warmed score keys (which came from getCachedResilienceScores and
are tautologically present), and the durability failure only affects
the newly warmed tail, the parity check passes and meta still gets
written claiming scored=N — exactly the lying-meta state we're
trying to prevent.

Three changes:

1. Track `warmedCountryCodes` — the list of country codes whose
   scores were SET by THIS invocation via warmMissingResilienceScores.
   Pre-warmed entries from getCachedResilienceScores are excluded
   because verifying them is uninformative (we just READ them so
   they exist by definition).

2. Sample from `warmedCountryCodes` rather than cachedScores.
   Shuffle before slicing so the same N keys aren't checked every
   invocation — partial-failure modes that consistently affect the
   same subset (e.g. last batch of 30 fails due to queue saturation)
   are more likely to be sampled across cycles.

3. Skip the parity check entirely when warmedCountryCodes.length === 0
   (cache hit on every country — no recent writes to verify).
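The randomized sampling can be sketched as a Fisher-Yates shuffle over a copy of the warmed list (names illustrative; the production code lives in the ranking handler):

```javascript
// Shuffle a copy of the warmed-country list, then take up to `max` entries,
// so the same N keys aren't checked every invocation.
function sampleWarmedKeys(warmedCountryCodes, max = 20, rand = Math.random) {
  const pool = [...warmedCountryCodes]; // don't mutate the caller's array
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, max);
}

const warmed = ["YE", "ZZ", "NO", "US"];
const sample = sampleWarmedKeys(warmed, 2);
```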

Test: 1 new regression test in resilience-ranking.test.mts that
simulates the exact mixed-failure mode the reviewer flagged. Pre-cache
NO + US (the "first" entries that would be sampled by slice(0, 20)
in the buggy version), then warm YE + ZZ but mock the SET pipeline
to return OK without persisting. Asserts ranking + meta are NOT
written. Pre-fix (deterministic slice over cachedScores) this test
fails; post-fix (sample from warmedCountryCodes) it passes.

All 18 ranking tests pass — including the existing pipeline-GET
race test, the all-failed test from PR koala73#3458's first commit, and
this new mixed-failure regression.
…ay (koala73#3461)

* feat(notifications): forbid (realtime, all) — PR1 server+UI+transport+relay

User foot-gun: enabling Real-time × All events produced 14 emails in 22min,
including 4 NWS thunderstorm warnings for adjacent zones inside 3 minutes.
Real-time + 'all' is semantically incoherent ("interrupt me now" + "for
everything") and threatens Resend sender reputation during the PRO launch
broadcast warmup (kills at complaint > 0.08%).

Makes (digestMode='realtime', sensitivity='all') unrepresentable across
every surface — server validators, HTTP transport, settings UI, and the
notification relay's read path. Plan: plans/forbid-realtime-all-events.md
(approved after 5 rounds of Codex review).

Server (convex/alertRules.ts):
- resolveEffectivePair + assertCompatibleDeliveryMode helpers applied at
  all 6 mutations, including quiet-hours mutations whose default-insert
  path can create forbidden rows from scratch.
- sensitivity made optional in setAlertRules + setAlertRulesForUser; patch
  paths preserve existing.sensitivity when caller omits it (no silent
  narrowing of digest users).
- 4 default-insert literals flipped from 'all' to pair.sensitivity (now
  'high' on fresh insert).
- New atomic internal mutation setNotificationConfigForUser updates both
  fields together — fixes the daily+all -> realtime race the legacy
  two-call sequence has against the cross-field validator.
- Temp admin-secret-gated _countRealtimeAllRules + _migrateRealtimeAllPage
  (paginated, idempotent) for the §4 backfill, removed in PR 2.
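Illustrative shapes of the two helpers named above (helper names are from the commit; the bodies are assumptions, not the convex/alertRules.ts source):

```javascript
// The forbidden pair is (digestMode='realtime', sensitivity='all').
function resolveEffectivePair(existing, patch) {
  return {
    digestMode: patch.digestMode ?? existing?.digestMode ?? "realtime",
    // Omitted sensitivity preserves the existing value (no silent narrowing);
    // fresh inserts default to 'high', which is compatible by construction.
    sensitivity: patch.sensitivity ?? existing?.sensitivity ?? "high",
  };
}

function assertCompatibleDeliveryMode({ digestMode, sensitivity }) {
  if (digestMode === "realtime" && sensitivity === "all") {
    throw new Error("INCOMPATIBLE_DELIVERY: realtime requires high or critical");
  }
}

// daily+all stays legal; flipping only digestMode to realtime re-validates.
const daily = resolveEffectivePair(undefined, { digestMode: "daily", sensitivity: "all" });
assertCompatibleDeliveryMode(daily); // ok
let threw = false;
try {
  assertCompatibleDeliveryMode(resolveEffectivePair(daily, { digestMode: "realtime" }));
} catch {
  threw = true;
}
```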

Transport (convex/http.ts, api/notification-channels.ts,
src/services/notification-channels.ts):
- Removed the (body.sensitivity ?? "all") fallback at convex/http.ts:504
  that would have silently rewritten existing digest users on omitted-
  field calls.
- New "set-notification-config" HTTP-action and Vercel-proxy branches
  with INCOMPATIBLE_DELIVERY -> 400 passthrough (not generic 500), so
  the UI can render the helper text inline.
- New setNotificationConfig client wrapper + IncompatibleDeliveryError
  typed error.

UI (src/services/notifications-settings.ts):
- Sensitivity dropdown lifted OUT of usRealtimeSection so digest users
  can see and change it (previously hidden in digest mode).
- 'all' option disabled when delivery mode is realtime; helper text
  matches the server error wording.
- Mode-change handler snaps sensitivity to 'high' when switching TO
  realtime, then routes the save through setNotificationConfig
  atomically (catches IncompatibleDeliveryError to surface the inline
  hint).

Relay (scripts/notification-relay.cjs):
- shouldNotify normalizes effectiveSensitivity once at function entry;
  both the legacy matchesSensitivity call AND the importance-threshold
  lookup use it. Fixes the half-defense bug where wrapping only the
  match would let the threshold path silently fall through to the
  looser IMPORTANCE_SCORE_MIN floor for in-flight (realtime, all) rows.
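The coerce-at-read defense, reduced to a sketch (the match semantics, threshold values, and event shape are assumptions for illustration; only the normalize-once structure is from the commit):

```javascript
// Normalize ONCE at function entry so both downstream reads — the
// sensitivity match and the importance-threshold lookup — see the same
// coerced value for in-flight (realtime, all) rows.
function effectiveSensitivityFor(rule) {
  return rule.digestMode === "realtime" && rule.sensitivity === "all"
    ? "high"
    : rule.sensitivity;
}

const IMPORTANCE_THRESHOLD = { all: 0, high: 60, critical: 85 }; // assumed values

function shouldNotify(rule, event) {
  const effectiveSensitivity = effectiveSensitivityFor(rule); // normalized once
  const matches = event.sensitivityTags.includes(effectiveSensitivity);
  const aboveThreshold = event.importance >= IMPORTANCE_THRESHOLD[effectiveSensitivity];
  return matches && aboveThreshold; // both reads use the same coerced value
}

const inFlightForbiddenRow = { digestMode: "realtime", sensitivity: "all" };
const lowImportance = { importance: 10, sensitivityTags: ["all", "high"] };
```

Wrapping only the match (the half-defense bug) would have let `aboveThreshold` fall through to the looser `all` floor for the row above.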

Migration scripts (scripts/migrate-{discover,realtime-all-to-daily}.mjs):
- Driver scripts use ConvexHttpClient.query() / .mutation() against the
  admin-secret-gated public functions (internalQuery/internalMutation
  are unreachable via ConvexHttpClient — see notification-relay.cjs:243).
- Pagination + idempotency via the isForbidden filter.

Tests:
- convex/__tests__/alertRules.test.ts: 11 cases covering invariant
  enforcement, insert-only defaults, atomic-mutation pair flips,
  partial-update re-validation, omitted-sensitivity preservation.
- tests/notification-relay-effective-sensitivity.test.mjs: 3 source-grep
  cases confirming both reads use the same coerced value.
- tests/notifications-settings-ui-invariants.test.mjs: 7 source-grep
  cases for layout placement, disable-on-realtime state, snap logic,
  atomic-save routing, and IncompatibleDeliveryError handling.

Out of scope (separate follow-ups):
- Slot A: per-recipient hourly rate cap (generic burst airbag).
- Slot B: event-family coalesce for adjacent-zone NWS storms.
- Critical-tier severity audit.

PR 2 will run discovery + dry-run + live migration + courtesy email +
remove the temp migration functions/scripts.

* chore(notifications): lint cleanup for PR1

- Remove redundant 'use strict' from migration .mjs ES modules.
- Add blank lines around lists in plans/forbid-realtime-all-events.md
  per markdownlint MD032 (autofix).

* fix(notifications): address PR koala73#3461 Greptile review (P1 + P2-UX + P2-sec)

P1 — setQuietHours/setQuietHoursForUser blocked pre-migration users:
The new assertCompatibleDeliveryMode was called on every mutation, even
ones that didn't touch (digestMode, sensitivity). For pre-migration rows
in the forbidden state, quiet-hours saves threw INCOMPATIBLE_DELIVERY
which surfaced as a generic 500 (set-quiet-hours HTTP action has no
passthrough). Quiet-hours mutations don't touch the pair, so they can't
introduce new forbidden state — the validator was blocking unrelated
updates on pre-migration rows. Drop the assertion from both quiet-hours
mutations; keep resolveEffectivePair so default-inserts still pick
sensitivity='high' (compatible by construction). Relay coerce-at-read
continues to protect delivery during the migration window.

Added regression test:
setQuietHoursForUser({pre-migration forbidden row}) → succeeds, sensitivity preserved.

P2 (UX) — sensitivity hint always visible in digest mode:
The "Real-time delivery requires High or Critical" hint rendered
unconditionally, so digest users (e.g. daily+all) permanently saw copy
that didn't apply to them. Hide the hint with display:none when
!isRealtime; toggle on mode change. Source-grep test locks both behaviors.

P2 (security) — admin secret exposed in Convex dashboard logs:
Convex logs all public-function args to the dashboard's call history.
adminSecret was passed as a plain query/mutation arg, so anyone with
dashboard access sees it in plaintext for the lifetime of the temp
functions. Added explicit "rotate after migration" guidance to the plan
doc + PR 2 cleanup checklist. The secret should be treated as one-time
use; PR 2 removes the temp functions and the env var in the same commit.
…(cron-cadence inversion) (koala73#3459)

* fix(health): tighten resilienceIntervals maxStaleMin from 14d → 18h

Production /api/health 2026-04-27 reported resilienceIntervals
status=EMPTY records=0 seedAgeMin=671 maxStaleMin=20160. The
14-DAY threshold was 56× the actual 6h cron cadence —
the 2026-04-27 incident had data missing for 11+ hours yet health
stayed STALE-free, masking a real outage.

The seeder is bundled into seed-bundle-resilience (Railway cron
`0 */6 * * *`, every 6h, per docs/railway-seed-consolidation-runbook.md
Bundle 4) — NOT a weekly cron as the inline comment claimed. Per the
project's 3× cron-driven convention (portwatchPortActivity,
chokepointTransits, transitSummaries, bisDsr triplet), the correct
value is 3 × 360min = 1080min (18h).

Defense-in-depth:
  - 1 missed cron + recovery → still OK (no spurious page)
  - 2-3 missed crons (real outage) → STALE_SEED at 18h instead of
    silently passing for 14 days

Tests: 4 new regression assertions in
tests/resilience-cache-keys-health-sync.test.mts under the
"resilienceIntervals maxStaleMin co-pinned to 6h Railway cron
cadence" suite:
  - Pin Resilience-Scores section gate = 2h (informational)
  - Pin maxStaleMin = 1080
  - Assert maxStaleMin >= 540 (1.5× cron cadence floor)
  - Assert maxStaleMin <= 1440 (4× cron cadence ceiling — directly
    tied to the 2026-04-27 incident: 14d setting hid an 11h outage)

This PR is defense-in-depth — it does NOT solve the underlying
data-loss bug (Upstash optimistic-OK returning success without
durable persistence). That is fixed in PR koala73#3458 with a sample-based
parity check before the meta write. Together, the two PRs ensure
that (a) the lying-meta state cannot be written, and (b) future
similar incidents alarm in 18h instead of silently for 2 weeks.

Per skill `health-maxstalemin-write-cadence`.

* fix(health): correct resilienceIntervals cadence baseline (6h→2h, 1080→360min)

Reviewer caught that the runbook's `0 */6 * * *` is stale. The
authoritative source is scripts/seed-bundle-resilience.mjs:5-12,
whose own comment says hourly Railway fires + 2h section gate
make the Resilience-Scores section run "~every 2h." So the prior
1080 (18h) was 9× the real cadence, not 3×, and would still wait
~18h before alarming on data that should refresh every ~2h.

- maxStaleMin 1080 → 360 (= 3× real ~2h cadence per project convention)
- test floor 540 → 180 (1.5× of 2h)
- test ceiling 1440 → 480 (4× of 2h, catches outage within 8h)
- comments cite the bundle script's own line as authoritative; runbook
  noted as stale

Same class of outage-masking bug as the original 14d setting, just
with a smaller magnitude. Test still regression-locks the principle
(tied to bundle-script intervalMs, not the stale runbook).
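The convention in numbers, applied to the corrected ~2h cadence (pure arithmetic mirroring the values above):

```javascript
// Cadence-multiple convention: 1.5× floor, 3× target, 4× ceiling.
const cadenceMin = 120;             // ~2h effective Resilience-Scores cadence
const floorMin = 1.5 * cadenceMin;  // below this, one missed run would page
const maxStaleMin = 3 * cadenceMin; // chosen threshold
const ceilingMin = 4 * cadenceMin;  // an outage must alarm within this window
const priorMultiple = 1080 / cadenceMin; // the old 18h value vs the real cadence
```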
…nses (plan 002 PR 2) (koala73#3457)

* feat(resilience): add headlineEligible field to score + ranking responses (plan 002 PR 2 §U3)

Plan 2026-04-26-002 §U3 (PR 2 in the 8-PR sequence) — introduces a new
`bool headline_eligible = N;` field on `GetResilienceScoreResponse`
and `ResilienceRankingItem`. PR 2 populates `true` for every successful
score build (no behavior change); PR 6 / §U7 swaps the population logic
to the actual eligibility gate (coverage ≥ 0.65 AND (population ≥ 200k
OR coverage ≥ 0.85) AND !lowConfidence) and the headline ranking endpoint
filters by this field.

Why land this as a precursor: the proto + generated TS surface change is
itself a noisy diff (regenerated openapi yaml/json + client/server stubs)
that's easier to review on its own than mixed in with PR 6's gate logic.
Downstream consumers (widget, raw API) can begin reading the field
informationally before the gate flips, avoiding a coupled "field +
behavior" PR.

Files:
- `proto/worldmonitor/resilience/v1/get_resilience_score.proto` —
  `bool headline_eligible = 17;` on `GetResilienceScoreResponse`
- `proto/worldmonitor/resilience/v1/resilience.proto` —
  `bool headline_eligible = 7;` on `ResilienceRankingItem`
- `make generate` regenerated openapi + TS client/server bindings
- `server/worldmonitor/resilience/v1/_shared.ts:buildResilienceScore`
  and the two fallback paths in `ensureResilienceScoreCached` populate
  the field. Happy path → `true`; invalid country code or missing-cache
  fallback → `false` (the conservative default — those countries can't
  pass the PR-6 gate either)
- `buildRankingItem` passes through from the source-of-truth response;
  null-response fallback returns `false`
- `src/components/resilience-widget-utils.ts:LOCKED_PREVIEW` carries
  `headlineEligible: true` (informational; widget renders nothing
  different yet)
- New test `tests/resilience-headline-eligible-field.test.mts` (5
  pinning tests) — pass-through, fallback default, contract enforcement

7502/7502 tests pass (npm run test:data); typecheck + typecheck:api
clean; lint exit 0.

Plan: docs/plans/2026-04-26-002-feat-resilience-universe-coverage-rebuild-plan.md
PR 3+4+5 just merged: koala73#3452 (commit ba5474f).

* fix(resilience): backfill headlineEligible on cache read for pre-PR-2 v16 entries (PR koala73#3457 review round 1)

Reviewer P1: PR koala73#3452 (just merged) wrote v16 score + ranking cache
entries before this PR added the headlineEligible field. The cache
keys are NOT bumped in this PR (it's a no-behavior-change field
addition; bumping would force a 6h recompute window for an
informational field). So existing cache hits return objects missing
the now-required field — TypeScript types are erased at runtime, so
the wire shape would carry `undefined` instead of a boolean, breaking
any downstream `=== true / === false` discriminator that PR 6 will
introduce.

Fix: backfill on read in two sites:

- `_shared.ts:stripCacheMeta` — invoked by `ensureResilienceScoreCached`
  on every score cache hit. Default missing `headlineEligible` to
  `true` (matches the PR-2 happy-path contract for successful score
  builds).
- `get-resilience-ranking.ts` cache-hit branch — invoked when a
  cached ranking payload is served before recompute. Backfill items[]
  AND greyedOut[] with the same `true` default.

Once the cache cycles to post-PR-2 writes (next cron tick, ~6h TTL),
the backfill becomes a no-op for the steady state. Pre-PR-6 the
default is the same as the build-time value (`true`); PR 6 / §U7 will
flip the build-time value to actual eligibility logic, at which point
the new payloads overwrite the legacy default on the next write.
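The read-side backfill described above can be sketched minimally as follows — the function name and shape are illustrative (the real logic lives inside `stripCacheMeta`), only the field name and the `true` default come from this PR:

```javascript
// Hypothetical sketch of the cache-read backfill. Pre-PR-2 v16 entries
// lack headlineEligible entirely; default it to true (the PR-2
// happy-path value) so downstream boolean discriminators stay total.
function backfillHeadlineEligible(cached) {
  if (typeof cached.headlineEligible !== 'boolean') {
    return { ...cached, headlineEligible: true };
  }
  return cached; // post-PR-2 entries pass through untouched
}
```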

Tests:
- Updated 2 existing ranking-test fixtures to include
  `headlineEligible: true` (representing the post-PR-2 steady state)
- Added a new ranking-test "backfills headlineEligible on cached items
  written before PR 2" with a fixture that deliberately omits the
  field on every item, asserting the backfill defaults to `true`
- Added a new score-test "stripCacheMeta defaults headlineEligible=true
  when the cached payload predates the field"

7504/7504 tests pass; typecheck + typecheck:api clean.

* test(resilience): wire backfill regression test to fake-upstash so it actually exercises the cache path (PR koala73#3457 review round 2)

Reviewer P2: the cache-backfill test in
tests/resilience-headline-eligible-field.test.mts:90-117 used
setCachedJson directly. Without UPSTASH_REDIS_REST_URL/TOKEN env vars
that helper silently no-ops; ensureResilienceScoreCached then took the
build-path and returned a fresh response that legitimately has
headlineEligible:true (because the build-path sets it that way) — so
the test "passed" without ever exercising the cache-read backfill it
claims to test. With env vars present, it would have written to real
Redis (worse).

Fix: switch to the fake-upstash pattern used by every other ranking/
score test in this codebase:

- import { installRedis } from './helpers/fake-upstash-redis.mts'
- const { redis } = installRedis({})
- redis.set(legacyKey, JSON.stringify(legacyPayload))

Plus two new assertions to PROVE the cache path was exercised (not the
build path silently passing):
- assert.equal(response.overallScore, 60)
  — the cached payload's value, NOT what buildResilienceScore would
    compute for an empty-fixture (typically 0)
- assert.equal(response.dataVersion, 'v16')
  — also from the cached payload
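The shape of the trap — and why the two value assertions prove the cache path — can be illustrated with a Map-backed stand-in for fake-upstash (all names here are illustrative, not the real helper APIs):

```javascript
// Minimal stand-in for the fake-upstash pattern: seed the fake store,
// then read through a cache-or-build function and assert the CACHED
// value (60) came back, not what the build path would produce (0).
function makeFakeRedis() {
  const store = new Map();
  return {
    set: (key, value) => { store.set(key, value); },
    get: (key) => store.get(key) ?? null,
  };
}

function readScoreThroughCache(redis, key, build) {
  const hit = redis.get(key);
  if (hit !== null) return JSON.parse(hit); // cache path — the branch under test
  const fresh = build();                    // build path — the silent-pass trap
  redis.set(key, JSON.stringify(fresh));
  return fresh;
}
```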

Mutation-test verified the new wiring actually catches regressions:
disabling the stripCacheMeta backfill makes this test fail (and the
other 5 in the suite still pass), confirming the backfill assertion
is now load-bearing.

Note: also pinned `_formula: 'd6'` on the legacy fixture so the stale-
formula gate in ensureResilienceScoreCached doesn't reject the legacy
payload (which would force a rebuild and silently route through the
build-path again — the same trap as the original bug).

7504/7504 tests pass; typecheck clean.

* test(resilience): replace stub-literal contract test with raw-cache-entry assertion (PR koala73#3457 review round 3)

Greptile P2: the "happy-path response includes headlineEligible" test
was a hand-crafted literal stub that asserted `'headlineEligible' in
stub`. Because the stub was defined inline and unconditionally
contained the field, it would have passed even if buildResilienceScore
or ensureResilienceScoreCached stopped emitting the field. TypeScript
type enforcement also doesn't catch a future contributor who marks
the field optional (`headlineEligible?: boolean`).

First-cut fix asserted on the response of ensureResilienceScoreCached
— but that path goes through stripCacheMeta which BACKFILLS missing
headlineEligible to true (PR-2 review round 1 defense-in-depth). So
even with buildResilienceScore not emitting the field, the response
would still test as `true`.

Correct approach: drive a real cache-miss → build → store sequence,
then read the RAW cache entry directly from fake-redis. The raw
stored payload bypasses stripCacheMeta's backfill, so a missing field
in buildResilienceScore propagates straight through and the assertion
fires.

Mutation-verified: removing `headlineEligible: true` from the
buildResilienceScore return object now causes this test to fail (1/6
in the suite). With the field present, all 6 pass.

Net change: 1 test, ~30 LOC, now actually exercises the contract it
claims to enforce instead of asserting a tautology over a hand-crafted
literal.
koala73#3464)

Two more Sentry CSP-violation issues from a follow-up triage pass after
PR koala73#3460 merged:

- WORLDMONITOR-JM (39 events / 21 users on Edge): font-src blocked
  ms-browser-extension://... — Microsoft Edge's extension scheme,
  variant of chrome|moz|safari extensions. Extended the existing
  extension regex to include `ms-browser` so blockedURI and sourceFile
  on this scheme suppress symmetrically.

- WORLDMONITOR-JQ (23 events / 18 users on Samsung Internet / Tizen):
  frame-src blocked `about` (scheme-only) — Smart TV browsers and
  ad-injectors create about:blank / about:srcdoc iframes; we never
  set frame src to about:* ourselves. New branch suppresses bare
  `about` plus any `about:*` scheme URI.

Tests: csp-filter +5 cases (ms-browser-extension URI/source, about
scheme-only, about:blank, about:srcdoc). 174/174 pass.
…esh Resend segment (koala73#3462)

* feat(broadcast): per-wave audience export — pick N, stamp, push to fresh Resend segment

The sustainable per-send primitive for the PRO-launch ramp. Replaces
manual dashboard sub-segmenting with one CLI command per wave; the
existing canary-250 stamps already in registrations naturally exclude
yesterday's recipients from being picked again.

  npx convex run broadcast/audienceWaveExport:assignAndExportWave \
    '{"waveLabel":"wave-2","count":500}'
  # → returns { segmentId, assigned, ... }

  # Then existing flow:
  npx convex run broadcast/sendBroadcast:createProLaunchBroadcast \
    '{"segmentId":"<returned>","nameSuffix":"wave-2"}'

What it does:
1. Refuse if waveLabel already has stamped rows (operator picks unique
   label per wave; prevents accidental double-stamping).
2. Page registrations.paginate (1000/page), apply same dedup rules as
   audienceExport.ts (empty / suppressed / paid / already-in-prior-wave).
   Reservoir-sample N via Algorithm R — fair sample, single pass,
   O(N) memory.
3. Stamp each picked row with proLaunchWave + assignedAt via the
   shared _stampWaveByNormalizedEmail mutation (mirrors the
   canary-250 backfill action).
4. Create a fresh Resend segment via POST /segments named
   `pro-launch-${waveLabel}`.
5. Push picked contacts via the shared upsertContactToSegment helper
   (same two-step pattern audienceExport already uses — handles the
   "global contact exists, segments field not applied on duplicate
   422" Resend API quirk).
6. Return { segmentId, assigned, linkedExisting, alreadyExists, failed,
   underfilled }.
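Step 2's single-pass fair sample (Algorithm R) can be sketched as below; the eligibility predicate stands in for the dedup rules listed above, and the function name is illustrative:

```javascript
// Reservoir sampling (Algorithm R): fair N-sample over a stream in one
// pass with O(N) memory. Fill the reservoir with the first n eligible
// rows, then replace a random slot with probability n/seen.
function reservoirSample(rows, n, isEligible, random = Math.random) {
  const picked = [];
  let seen = 0;
  for (const row of rows) {
    if (!isEligible(row)) continue;
    seen += 1;
    if (picked.length < n) {
      picked.push(row);
    } else {
      const j = Math.floor(random() * seen);
      if (j < n) picked[j] = row;
    }
  }
  return picked; // may be shorter than n → the "underfilled" stat
}
```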

Companion refactor — extracted Resend helpers to a shared module:
  - `_resendContacts.ts` (NEW): RESEND_API_BASE, USER_AGENT,
    isDuplicateContactError, UpsertOutcome, upsertContactToSegment,
    and createSegment.
  - `audienceExport.ts`: replaced its inline copies with imports from
    the new module. No behaviour change; just dedup.

Why Resend can't do this natively: verified against Resend docs —
POST /broadcasts accepts segment_id only (no exclude/sample/limit
params), POST /segments accepts name only (segments are membership
lists, not query-defined via API). Progressive waves require tracking
membership somewhere; Convex is the right source of truth since dedup
math already runs there.

Convex codegen regenerated and committed (api.d.ts now includes
audienceWaveExport's internal mutation/query/action). Convex
typecheck (`tsc -p convex/tsconfig.json`) clean. Sentry-coverage lint
guard clean.

* fixup: reorder wave export to push-first, stamp-only-on-success

P1 from review: previous order stamped all picked contacts BEFORE
attempting Resend push. If `createSegment` threw, or any
`upsertContactToSegment` returned `failed`, those contacts were
permanently excluded from future waves but never landed in a
sendable Resend segment — silently stranded.

New order:
  1. Pick N (in-memory reservoir, no side effects)
  2. createSegment — atomic, throws on failure → no contacts stamped
  3. For each picked: push first, stamp ONLY on success
     (created / linkedExisting / alreadyInSegment).
     Failed pushes leave the contact unstamped → available for
     next wave's pick.
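The reordered loop can be sketched as follows, with the upsert and stamp helpers injected as stand-ins for the real Resend/Convex calls (names and outcome strings are illustrative):

```javascript
// Push-first, stamp-only-on-success ordering. A failed push leaves the
// contact unstamped (re-pickable next wave); a failed stamp after a
// successful push is counted, not rolled back.
async function pushThenStamp(picked, upsert, stamp) {
  const stats = { assigned: 0, failed: 0, stampFailed: 0 };
  for (const contact of picked) {
    const outcome = await upsert(contact);
    if (outcome === 'failed') {
      stats.failed += 1; // unstamped → available for next wave's pick
      continue;
    }
    try {
      await stamp(contact);
      stats.assigned += 1;
    } catch {
      stats.stampFailed += 1; // in segment but unstamped (rare edge case)
    }
  }
  return stats;
}
```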

Edge case (rare): push succeeds, stamp throws. Contact is in the
Resend segment but unstamped → may be re-picked into a later wave
and receive a duplicate email. Counted in new `stampFailed` stat;
operator can manually stamp via Data Explorer if it happens. We
don't roll back the Resend push (the DELETE call is a worse risk
than the duplicate-email exposure).

The new `stampFailed: number` field in WaveExportStats surfaces
this case explicitly. Documented in the file docstring's
"Atomicity" section so the next reader doesn't try to "simplify"
back to the unsafe stamp-first ordering.
…ecipient list (koala73#3463)

* feat(notifications): add _listAffectedUserEmails for courtesy-email recipient list

Follow-up to PR koala73#3461. Adds a temp admin-secret-gated query that joins
forbidden-state alertRules rows with verified email channels, returning
the recipient list for the post-migration courtesy email.

Why this exists: PR 1's discovery script returns counts only (no PII).
The courtesy email step needs (userId, variant, enabled, email) tuples,
and they have to be captured BEFORE _migrateRealtimeAllPage runs — once
rows flip to digestMode='daily' they're indistinguishable from organic
digest users.

Workflow:
  node scripts/migrate-list-affected-emails.mjs > /tmp/recipients.json
  node scripts/migrate-realtime-all-to-daily.mjs
  # send email using /tmp/recipients.json (filter enabled=true to
  # target only the actively-harassed subset).

Skips users with unverified email or no email channel — only returns
addresses the relay would actually use to deliver. Production discovery
showed 29 affected rows, 15 enabled; expect <=15 recipients.

Same admin-secret gate, same TEMP-MIGRATION-FUNCTION marker, same
"remove in PR 2 cleanup" discipline as _countRealtimeAllRules and
_migrateRealtimeAllPage. Response contains PII (user emails) and is
logged in the Convex dashboard for the lifetime of the function, so
keep the lifetime short and rotate the admin secret post-migration.

Tests cover:
- UNAUTHORIZED on wrong/missing admin secret
- Recipients filtered to verified email channels only
- Skips users in forbidden state but without email channel
- Skips users not in forbidden state (digest users)
- enabled flag preserved so caller can target actively-harassed subset

* fix(notifications): paginate _listAffectedUserEmails — fail-closed on partial capture

P1 review finding on PR koala73#3463: _listAffectedUserEmails scanned only the
first 500 alertRules rows, not the first 500 affected rows. If the table
grew past 500 rows, affected users on later pages were silently dropped
while the driver still wrote partial JSON. Since the next migration step
makes the original recipient set unreconstructable, partial capture meant
permanently-lost recipients.

Fix:

1. Rename _listAffectedUserEmails → _listAffectedUserEmailsPage, take a
   cursor arg, return {recipients, affectedInPage, isDone, nextCursor}.
   Same paginated shape as the existing _countRealtimeAllRules and
   _migrateRealtimeAllPage.

2. Driver loops the paginated query until isDone, accumulating
   recipients across pages. Critically: writes JSON to stdout ONLY
   after the full loop completes successfully. If any page errors,
   exits non-zero with stderr message and ZERO stdout output. No more
   silent partial JSON.

3. New regression test: driver-style loop captures recipients in the
   pagination contract shape, verifies termination on isDone=true,
   includes a safety guard against infinite loops.
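A driver loop matching that pagination contract can be sketched as below — the page shape (`recipients`, `isDone`, `nextCursor`) is from the PR text, the function name and page cap are illustrative:

```javascript
// Fail-closed accumulation: return the full recipient list only after
// the final page, or throw. Any page error propagates, so the caller
// can exit non-zero with ZERO stdout — never partial JSON.
async function collectAllRecipients(queryPage, maxPages = 1000) {
  const recipients = [];
  let cursor = null;
  for (let i = 0; i < maxPages; i += 1) {
    const page = await queryPage(cursor);
    recipients.push(...page.recipients);
    if (page.isDone) return recipients;
    cursor = page.nextCursor;
  }
  throw new Error('pagination did not terminate'); // infinite-loop guard
}
```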

16/16 alertRules tests pass (was 15; +1 pagination contract).
TS + biome clean.
…il (koala73#3465)

* feat(notifications): replace opaque subscription_id with paid/list/saved/discount in new-sub admin email

The "Subscription: sub_..." row in the [WM] New User Subscribed email was the
opaque Dodo subscription_id — useless when the actual question on landing is
"did this user pay full price or use a discount". Thread recurring_pre_tax_amount,
currency, tax_inclusive, and discount_id from data.subscription.active through
to the email action, render Amount Paid + List Price (from PRODUCT_CATALOG) +
Saved + Discount rows, and drop the subscription_id row entirely.

* fix(notifications): scope List Price/Saved rows to USD and drop unused subscriptionId arg

Round-1 review fixups for koala73#3465:

P1 (greptile, real bug): PRODUCT_CATALOG.priceCents is hard-coded in USD, but
formatMoney(listCents, currency) was labelling it with whatever currency the
Dodo webhook reported. For an EU subscriber paying in EUR (Dodo adaptive
currency), the email would have rendered "List Price: €39.99 EUR" and
"Saved: €8.00 EUR" — both figures meaningless because 3999 (USD) − 3199 (EUR)
is a cross-currency subtraction. Skip the List Price + Saved rows entirely
when paid currency != USD; Amount Paid and Discount still render.
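The currency scoping can be sketched roughly as follows — row labels come from the PR text; the function shape and cent-based inputs are assumptions, not the actual email-action code:

```javascript
// List Price / Saved rows only make sense when the paid currency
// matches the catalog's hard-coded USD prices; otherwise render only
// Amount Paid (cross-currency subtraction would be meaningless).
function buildPriceRows({ paidCents, currency, listCentsUsd }) {
  const rows = [{ label: 'Amount Paid', cents: paidCents, currency }];
  if (currency === 'USD' && listCentsUsd > paidCents) {
    rows.push({ label: 'List Price', cents: listCentsUsd, currency: 'USD' });
    rows.push({ label: 'Saved', cents: listCentsUsd - paidCents, currency: 'USD' });
  }
  return rows;
}
```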

P2: subscriptionId arg is unused now that the Subscription row is gone. Made
it v.optional in the action and removed the call-site pass-through in
handleSubscriptionActive. Kept as optional (rather than removing) so any
in-flight scheduled action enqueued before this deploy still validates on
retry — required→optional is a backwards-compatible signature change.
…ONFLICT retry loop (koala73#3466)

* fix(user-prefs): structured ConvexError kinds → CONFLICT propagates as 409, killing retry-loop

Root cause traced via Convex prod logs:

  "error": "Uncaught ConvexError: CONFLICT
      at handler (../convex/userPreferences.ts:59:29)"

The mutation IS throwing `ConvexError("CONFLICT")` correctly server-side.
But the wire format from Convex's HTTP runtime to our edge surfaces the
throw as `Error("[Request ID: X] Server Error")` with `errorData`
undefined — see node_modules/convex/dist/esm/browser/http_client.js:244,
which falls through to a plain `throw new Error(respJSON.errorMessage)`
when `respJSON.errorData === void 0`. String-data ConvexErrors apparently
don't get their `.data` forwarded; object-data ConvexErrors do.

Consequence pre-fix:

  1. server: throw new ConvexError("CONFLICT")
  2. wire: { errorMessage: "[Request ID: X] Server Error", errorData: undef }
  3. edge: msg.includes('CONFLICT') doesn't match → returns 500
  4. client: treats it as transient → retries forever with the same
     expectedSyncVersion → loop until the tab closes

Sentry sample of 100 PD events (post-koala73#3460 fingerprint fix) showed one
user (`user_3CwVMBgni...`) generating 50 of the 100 events in 1h08m,
all with the same `expectedSyncVersion=12` while the server row had
already advanced to syncVersion=13 — exactly the loop the broken
CONFLICT propagation creates.

Fix (two layers + safety net):

- Server (convex/userPreferences.ts): throw ConvexError({ kind, ... })
  for all three named errors (CONFLICT, BLOB_TOO_LARGE, UNAUTHENTICATED).
  CONFLICT now also carries `actualSyncVersion` so the edge can echo
  it. Object-data ConvexErrors propagate `errorData` reliably across
  the Convex wire — verified against the http_client.js source.

- Edge (api/user-prefs.ts): new `extractConvexErrorKind` helper that
  inspects `err.data.kind` first (structured path, the load-bearing
  fix) and falls back to `msg.includes(...)` for the deploy-ordering
  window where Vercel may build before Convex is updated. CONFLICT
  responses now include `actualSyncVersion` in the body.

- Client (src/utils/cloud-prefs-sync.ts): consumes the optional
  `actualSyncVersion` from 409 bodies. Existing 409-handling at
  line 221/282 already does the right thing (refetch + reapply), so
  no behavior change to the retry loop itself; the new field is
  available for future client optimizations.
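The two-layer extraction can be sketched as below; the kind list comes from the PR text, while the exact helper body is an illustrative reconstruction, not the real `api/user-prefs.ts` code:

```javascript
// Structured err.data.kind first (the load-bearing fix), legacy message
// substring as the deploy-ordering fallback; null for anything else,
// including the opaque "[Request ID: X] Server Error" wire shape.
const KNOWN_KINDS = ['CONFLICT', 'BLOB_TOO_LARGE', 'UNAUTHENTICATED'];

function extractConvexErrorKind(err) {
  const data = err && typeof err === 'object' ? err.data : undefined;
  if (data && typeof data === 'object' && typeof data.kind === 'string') {
    return data.kind; // structured path
  }
  const msg = err instanceof Error ? err.message : String(err);
  return KNOWN_KINDS.find((k) => msg.includes(k)) ?? null; // legacy fallback
}
```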

Tests: tests/user-prefs-convex-error.test.mjs (+12 cases) covers
structured-data preference, legacy substring fallback, structured-wins-
over-message precedence, the exact pre-fix bug ([Request ID: X] Server
Error → null), and forward-compat for new error kinds.

Validation: typecheck + typecheck:api + biome + md lint + version sync
clean. test:data 7558/7558 (+12 from new file), edge bundle + edge
function tests + convex tests all pass.

Followup tracker: WORLDMONITOR-PD will collapse to ~zero post-deploy
once the dominant retry loop closes.

* chore(user-prefs): address PR koala73#3466 review nits — type-guard 409 body + extract helper to importable module

Two Greptile P2 nits, both valid:

1. **api/user-prefs.ts:130** — `actualSyncVersion` was extracted from
   `Record<string, unknown>` and forwarded to the 409 response body
   without a numeric type-guard. The client defensively type-checked it
   so no bad value was actually consumed, but the response contract was
   looser than intended. Added `readConvexErrorNumber(err, field)` which
   returns `number | undefined` after a `typeof === 'number'` check; the
   handler drops non-numeric values rather than echoing `unknown`.

2. **tests/user-prefs-convex-error.test.mjs:22** — the regex +
   `new Function` extraction was fragile (depends on column-0 closing
   brace, manual TS-stripping). Extracted both helpers to a new
   `api/_convex-error.js` JS module (matching the existing `_cors.js`
   / `_json-response.js` / `_sentry-edge.js` pattern), which the test
   now imports directly. The handler imports it via the standard
   `// @ts-expect-error — JS module` shim used elsewhere in this file.
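The type-guard from the first nit is small enough to sketch in full — the behavior (return `number | undefined`, drop non-numerics, preserve zero) is from the PR text; the body is an assumed reconstruction:

```javascript
// Numeric read from a ConvexError's data payload: undefined unless the
// field is genuinely a number, so the 409 body never echoes `unknown`.
function readConvexErrorNumber(err, field) {
  const data = err && typeof err === 'object' ? err.data : undefined;
  if (!data || typeof data !== 'object') return undefined;
  const value = data[field];
  return typeof value === 'number' ? value : undefined; // zero is preserved
}
```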

The new module also picks up the Convex wire-format note in its
file-level JSDoc so the next maintainer who hits a string-data
ConvexError-doesn't-propagate trap finds the explanation in one
place.

Tests: +5 cases for `readConvexErrorNumber` (numeric reads, missing
field, non-numeric guard, null/undefined data, zero preservation).
17/17 pass; full edge-function-isolation test 178/178 still passes
(new underscore-prefixed helper is correctly excluded from the
edge-function-discovery glob).
… v16→v17) (koala73#3469)

Plan 2026-04-26-002 §U7 (PR 6 in the 8-PR sequence) — flips
`headlineEligible` from PR 2's "true everywhere" no-behavior-change
contract to the actual eligibility logic (origin Q2 + Q5):

  coverage >= 0.65 AND (population >= 200k OR coverage >= 0.85) AND !lowConfidence

The headline ranking endpoint (`get-resilience-ranking.ts`) now filters
items[] by `headlineEligible: true`; ineligible items move to
`greyedOut`. Raw API endpoints (per-country score) keep returning the
full set with the field surfaced — only the *ranking* endpoint applies
the filter.
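The gate reads as a truth table; a minimal sketch, with the thresholds from this PR but an assumed signature and an assumed conservative treatment of unknown population (it fails the population branch, though high coverage can still compensate):

```javascript
const HEADLINE_ELIGIBLE_MIN_COVERAGE = 0.65;
const HEADLINE_ELIGIBLE_MIN_POPULATION_MILLIONS = 0.2; // 200k
const HEADLINE_ELIGIBLE_HIGH_COVERAGE = 0.85;

// coverage >= 0.65 AND (population >= 200k OR coverage >= 0.85) AND !lowConfidence
function computeHeadlineEligible({ coverage, populationMillions, lowConfidence }) {
  if (lowConfidence) return false;                              // short-circuit
  if (coverage < HEADLINE_ELIGIBLE_MIN_COVERAGE) return false;  // 0.65 floor
  const popOk = populationMillions != null                      // null = unknown
    && populationMillions >= HEADLINE_ELIGIBLE_MIN_POPULATION_MILLIONS;
  return popOk || coverage >= HEADLINE_ELIGIBLE_HIGH_COVERAGE;  // compensator
}
```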

§Files

- `_shared.ts`: new `computeHeadlineEligible()` + 3 exported constants
  (HEADLINE_ELIGIBLE_MIN_COVERAGE=0.65, _MIN_POPULATION_MILLIONS=0.2,
  _HIGH_COVERAGE=0.85). Wired into `buildResilienceScore`'s response
  population. Reader memoization extended so the new IMF labor read
  (for population) shares the per-build cache with `scoreAllDimensions`.
- `_dimension-scorers.ts`: exported `RESILIENCE_IMF_LABOR_KEY` + new
  `readCountryPopulationMillionsForGate()` helper. Differs from the §U6
  `readPopulationMillions` in two ways: returns `null` for unknown-pop
  countries (instead of the 0.5M default — the gate must distinguish
  "known small" from "unknown"), and DOES NOT apply the §U6 0.5M
  tiny-state floor (gate needs the real population to decide).
- `get-resilience-ranking.ts`: `passesHeadlineGate` predicate combines
  the existing GREY_OUT_COVERAGE_THRESHOLD (0.40) with the new
  `headlineEligible === true` check. Items[]/greyedOut[] split by it.
- `tests/helpers/resilience-release-fixtures.mts`: add
  `economic:imf:labor:v1` fixture (uniform 50M placeholder for all 43
  G20+EU27 countries) so the release-gate test's countries pass the
  population branch of the new filter.

§Cache prefixes (per cache-prefix-bump-propagation-scope skill):
- RESILIENCE_SCORE_CACHE_PREFIX v16 → v17 (pre-PR-6 score entries
  carry headlineEligible:true unconditionally; would let ineligible
  countries through the headline filter for the full 6h TTL)
- RESILIENCE_RANKING_CACHE_KEY v16 → v17 (same — pre-PR-6 ranking
  cache reflects "all true" world)
- RESILIENCE_HISTORY_KEY_PREFIX v11 → v12 (lockstep — bump pattern
  consistency + audit-trail clean even though history doesn't carry
  the field)
- 10 hardcoded literal sites bulk-updated across tests/, scripts/,
  api/health.js
- Stale "(v16)" test descriptions updated to "(v17)" in two files

§Tests
- New `tests/resilience-headline-eligible-gate.test.mts` (10 tests):
  truth-table coverage of `computeHeadlineEligible` — happy path,
  lowConfidence short-circuit, the 0.65 floor, the 200k population
  boundary, the 0.85 high-coverage compensator, and the unknown-pop
  conservative default.
- 7556/7556 tests pass; typecheck + typecheck:api + lint all clean.

Plan: docs/plans/2026-04-26-002-feat-resilience-universe-coverage-rebuild-plan.md
Predecessor: PR koala73#3457 (PR 2 / §U3 — added the headlineEligible field)
Next: PR 7 / §U8 (methodology rewrite + widget badge polish)
…a73#3471)

* ci: auto-deploy convex/ changes to Convex prod on merge to main

Vercel's build pipeline auto-deploys api/ and src/ but does NOT run
`npx convex deploy --prod` — the Convex backend has its own deployment
flow that has been manual-only in this repo. Merges that touched
convex/<module>.ts (schema changes, mutation/query bodies, action
handlers) silently landed in main without reaching production until
someone remembered to run the deploy by hand.

Surfaced concretely earlier today: PR koala73#3466's structured-data
`ConvexError({ kind, ... })` fix sat in main for 30+ minutes while
WORLDMONITOR-PD kept growing — Convex prod was still running the old
string-data throws because nobody had pushed the convex/ change to the
backend. The drift was invisible until I noticed the Sentry events
post-merge still tagged `error_shape=convex_server_error` instead of
the expected typed CONFLICT bucket.

This workflow:

- Triggers on push to main, gated by a path diff so non-convex/ merges
  don't pay CI minutes for a no-op deploy.
- Provides a `workflow_dispatch` manual fallback for hotfixes / re-runs
  off the regular code-merge cycle.
- Serializes deploys via a `concurrency` group with `cancel-in-progress:
  false`, so two back-to-back merges don't race AND every queued deploy
  eventually lands.
- Uses `npx convex deploy --yes` with `CONVEX_DEPLOY_KEY` from secrets;
  the deploy key pins the target deployment so there is no ambiguity
  about which environment we're pushing to.

One-time setup required: add `CONVEX_DEPLOY_KEY` to the repo's GitHub
Actions secrets. Generate via Convex dashboard → Settings → Deploy
Keys → "Production: deploy" scope, or via
`npx convex deploy --once-create-deploy-key` against the prod
deployment.

* ci(convex-deploy): fail-closed path detection — git diff over gh api compare

Greptile P1 on PR koala73#3471: the `gh api compare` path gate failed OPEN in
two ways:

1. API errors (rate limit, transient 5xx) silently emptied FILES via the
   `|| echo ""` fallback, then the regex grep produced no match, and we
   wrote `convex=false` — skipping a real convex/ deploy.
2. The compare endpoint paginates at 300 files. A large merge that
   touches convex/ alongside many other files could put the convex/
   entries past the first page and our single-page fetch wouldn't see
   them. Same outcome: silent skip.

Either failure mode recreates exactly the drift this workflow is meant
to prevent.

Switched to authoritative `git diff --name-only $BEFORE $AFTER --
'convex/'` against a `fetch-depth: 0` checkout. Now:

- API failures are impossible (no API call).
- Pagination is impossible (git diff is local).
- `set -euo pipefail` + explicit `git cat-file -e` reachability check
  fails CLOSED: any error or missing SHA logs a warning and deploys
  defensively rather than silently skipping. Better one redundant
  deploy than one missed deploy.
- workflow_dispatch and first-push (all-zero BEFORE SHA) cases
  preserved.

Trade-off: `fetch-depth: 0` is heavier than the default shallow
checkout, but the changes job runs ~10s either way on a small repo
and the safety guarantee is worth more than the seconds.
…la73#3467)

* feat(notifications): Slot B — NWS event-family coalesce via VTEC

Out-of-scope follow-up tracked in PR koala73#3461. Stops the adjacent-zone
NWS alert flood: same storm system propagating across multiple counties
no longer produces N notifications per user.

Real symptom: 11 alerts in one inbox, 9 of which were 3 phenomena
(severe thunderstorm warnings, severe thunderstorm watches, flood
warnings) fanned out across ~9 NWS zones. After this change, the same
storm = 1 notification per phenomenon × per office × per event tracking
number, regardless of how many zones it crosses.

How it works:
- NWS VTEC strings (/O.NEW.KSGF.SV.W.0034.250427T1257Z-250427T1330Z/)
  encode (office, phenomenon, significance, eventID) — the tuple that
  identifies one logical event across adjacent zones. Drop the action
  so NEW/CON/CAN bulletins for the same event also collapse.
- New helper deriveWeatherCoalesceKey(vtec) returns "nws:KSGF.SV.W.0034"
  or undefined for missing/malformed VTEC.
- Publisher (scripts/ais-relay.cjs:seedWeatherAlerts) extracts VTEC
  from properties.parameters.VTEC[0], derives coalesceKey, threads it
  into payload.coalesceKey for the publish call.
- Publisher dedup (publishNotificationEvent) uses coalesceKey for the
  scan-dedup key when present — adjacent-zone alerts collapse at the
  queue layer too, not just per-recipient.
- Per-recipient dedup (scripts/notification-relay.cjs:checkDedup)
  takes optional 4th param coalesceKey. Both call sites (held-event +
  realtime) thread it. Type-guarded as string before passing — defense
  against malformed payloads.
- Falls back to title-based dedup when VTEC is absent (rare advisory
  types). No regression for non-NWS publishers.
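The VTEC-to-key derivation above can be sketched like this — the key format and the drop-the-action behavior are from this commit; the regex itself is an assumed parse of the P-VTEC field layout, not the shipped implementation:

```javascript
// /O.NEW.KSGF.SV.W.0034.250427T1257Z-250427T1330Z/ encodes
// (action, office, phenomenon, significance, eventID). Drop the action
// so NEW/CON/CAN bulletins for one event share a coalesce key.
function deriveWeatherCoalesceKey(vtec) {
  if (typeof vtec !== 'string') return undefined;
  const m = vtec.match(
    /^\/?[A-Z]\.([A-Z]{3})\.([A-Z]{4})\.([A-Z]{2})\.([A-Z])\.(\d{4})\./,
  );
  if (!m) return undefined; // malformed / missing VTEC → title-based fallback
  const [, , office, phenomenon, significance, eventId] = m;
  return `nws:${office}.${phenomenon}.${significance}.${eventId}`;
}
```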

Tests (12 new in tests/notification-relay-coalesce-key.test.mjs):
- VTEC parser: typical NEW alert, NEW vs CON for same event collapse,
  different events stay distinct, different phenomena stay distinct,
  malformed/missing returns undefined.
- Source-grep contract: checkDedup signature, both call sites thread
  coalesceKey, type-guard present, publisher dedup uses coalesceKey,
  weather alert mapping captures VTEC, publish call wires coalesceKey
  via spread-conditional.

146/146 relay+notification tests pass (was 134 + 12 new).
TS typecheck clean both configs. Biome clean (warnings pre-existing).
CJS syntax check both relay scripts: OK.

Out of scope for this PR (still tracked):
- Slot A: per-recipient hourly rate cap. Generic burst airbag for any
  future bursty publisher. Defer until we see if Slot B alone is
  enough.
- Other publishers' coalesce keys (AIS vessel + bucket, market ticker
  + bucket). Add when those surfaces show similar fan-out.

* fix(notifications): Slot B P1 — pick distinct families BEFORE slicing top 3

PR koala73#3467 review finding: the naive `highSeverityAlerts.slice(0, 3)` runs
the slice on RAW alerts BEFORE coalesce. If the first 3 raw alerts are
adjacent-zone duplicates for one VTEC family, the publisher-side dedup
queues only 1 notification — AND a 4th genuinely-distinct family
(different storm / tornado / flood) sitting at index 3+ is NEVER
considered. Net result: silent loss of legit distinct events.

Fix: dedupe BY family key FIRST, accumulate up to 3 DISTINCT families,
then publish those. Family key uses VTEC-derived coalesce key when
available; falls back to a stable per-alert identity
(`nws:fallback:${id || headline || event}`) so VTEC-less alerts still
dedupe against themselves rather than collapsing on the empty-string
fallback.
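The fixed selection can be sketched as follows — `seenFamilyKeys` / `distinctFamilyAlerts` are the names the regression test pins; the function wrapper and injected key deriver are illustrative:

```javascript
// Accumulate up to `limit` DISTINCT families before publishing, instead
// of slicing raw (possibly duplicate) alerts. deriveKey stands in for
// the VTEC-derived coalesce key; VTEC-less alerts fall back to a stable
// per-alert identity so they dedupe only against themselves.
function pickDistinctFamilies(alerts, deriveKey, limit = 3) {
  const seenFamilyKeys = new Set();
  const distinctFamilyAlerts = [];
  for (const alert of alerts) {
    const key = deriveKey(alert)
      ?? `nws:fallback:${alert.id || alert.headline || alert.event}`;
    if (seenFamilyKeys.has(key)) continue;
    seenFamilyKeys.add(key);
    distinctFamilyAlerts.push(alert);
    if (distinctFamilyAlerts.length >= limit) break;
  }
  return distinctFamilyAlerts;
}
```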

Added regression test in tests/notification-relay-coalesce-key.test.mjs:
- assertion that `seenFamilyKeys` Set + `distinctFamilyAlerts` array exist
- assertion that the bug pattern (`for...of highSeverityAlerts.slice(0, 3)`)
  is GONE
- assertion that the family-key fallback includes a stable per-alert
  identity (id || headline || event)

13/13 coalesce tests pass (was 12; +1 P1 regression). CJS syntax OK,
biome clean (warnings pre-existing).
…anup) (koala73#3468)

The (realtime, all) backfill ran successfully — 29 rows migrated to
digestMode='daily', recipient list captured, courtesy emails sent.
Removing the temp admin-secret-gated migration surface added in koala73#3461
and koala73#3463 now that it's served its purpose.

Removed:
- convex/alertRules.ts: _countRealtimeAllRules, _migrateRealtimeAllPage,
  _listAffectedUserEmailsPage, assertMigrationAdmin helper, and the
  TEMP MIGRATION FUNCTIONS comment block.
- 3 driver scripts: scripts/migrate-discover-realtime-all.mjs,
  scripts/migrate-realtime-all-to-daily.mjs,
  scripts/migrate-list-affected-emails.mjs.
- The corresponding test describe() block in
  convex/__tests__/alertRules.test.ts (the _listAffectedUserEmailsPage
  cases — admin-secret gate, channel-type filtering, enabled-flag
  preservation, pagination contract).

Production-logic tests preserved unchanged: cross-field invariant
enforcement, insert-only defaults, atomic-mutation pair flips,
partial-update re-validation, omitted-sensitivity preservation. 12/12
passing post-trim.

After this PR merges, run on prod:
  npx convex env remove --prod MIGRATION_ADMIN_SECRET

The secret value was visible in Convex dashboard function-call logs
during the migration window — treat as exposed, do NOT reuse for any
other admin path. Generate a fresh value if a future migration needs
the same admin-gate pattern.

Pure delete: 500 lines removed, 0 added. TS dual typecheck clean,
biome clean, vitest green.
…userId expires (koala73#3470)

* fix(entitlements): preserve higher-tier sub when another sub on same userId expires

The entitlements table is keyed by_userId (one row per user), but a
single user can hold multiple concurrent Dodo subscriptions on the same
userId -- e.g. they upgraded by buying a higher-tier plan instead of using
plan-change in the customer portal, or admin cancelled an old plan while a
newer paid sub stays active.

handleSubscriptionExpired previously called upsertEntitlements(userId,
"free", ...) unconditionally on subscription.expired, silently
downgrading the user even when another paid sub was still covering them.
handleSubscriptionPlanChanged had a sibling form of the same risk.

Fix: before downgrading or replacing the entitlement, check the user's
other subscriptions via the by_userId index for any "still covering" row
(active, on_hold, or cancelled-with-future-currentPeriodEnd). If one
exists with equal-or-higher tier, recompute the entitlement from it
instead of clobbering.

Also adds payments/billing:deleteSubscriptionByDodoId (internal) -- an
ops tool that deletes a subscription row from Convex and re-derives the
entitlement from remaining covering subs (or downgrades to free). Use to
defuse a doomed subscription.expired for a sub you've already
cancelled/refunded admin-side without waiting for the structural guard.

Discovered while diagnosing a refund/PRO-status question on a user with
two concurrent active subs (pro_monthly cancelled by admin + api_starter
active, paid). Without this guard, the older sub's eventual expiry would
have wiped the higher-tier entitlement during a ~48-min window before the
api_starter renewal event re-upserted it.

* review: route ALL sub event handlers through one recompute helper + deterministic precedence

Addresses two P1 review findings on PR koala73#3470:

1) Coverage gap: the multi-active-sub guard only covered subscription.expired
   and subscription.plan_changed. subscription.active and subscription.renewed
   still called upsertEntitlements() directly with the event's sub, so a
   lower-tier renewal/reactivation could clobber a higher-tier entitlement on
   the same userId. Fix: collapse all four entitlement-write paths in the
   subscription event handlers (active, renewed, plan_changed, expired) into
   a single shared helper recomputeEntitlementFromAllSubs() that derives the
   entitlement from the FULL set of the user's covering subscriptions, post-
   patch. Comp-floor logic moves into the helper too. handleSubscriptionExpired
   now becomes "patch row to expired, then recompute" — no inline guard.

2) Tier-tie picker: the comparator used only features.tier, but the catalog
   has same-tier plans with different capabilities (api_starter and
   api_business are both tier 2; pro_monthly and pro_annual are both tier 1).
   Fix: introduce PLAN_PRECEDENCE in productCatalog.ts and a deterministic
   compareSubscriptionsByCoverage() comparator with three levels:
     1. higher features.tier wins
     2. higher PLAN_PRECEDENCE wins (within-tier capability tie-break)
     3. later currentPeriodEnd wins (within-plan duration tie-break)
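
The three-level comparator can be sketched as follows — the tier and
precedence tables here are illustrative stand-ins for the catalog's real
`features.tier` values and `PLAN_PRECEDENCE` map, not the shipped code:

```javascript
// Illustrative catalog data (assumption — real values live in productCatalog.ts).
const TIER = { pro_monthly: 1, pro_annual: 1, api_starter: 2, api_business: 2 };
const PLAN_PRECEDENCE = { pro_monthly: 1, pro_annual: 2, api_starter: 1, api_business: 2 };

// Returns < 0 when `a` outranks `b` for coverage.
function compareSubscriptionsByCoverage(a, b) {
  // 1. higher features.tier wins
  if (TIER[a.plan] !== TIER[b.plan]) return TIER[b.plan] - TIER[a.plan];
  // 2. higher PLAN_PRECEDENCE wins (within-tier capability tie-break)
  if (PLAN_PRECEDENCE[a.plan] !== PLAN_PRECEDENCE[b.plan]) {
    return PLAN_PRECEDENCE[b.plan] - PLAN_PRECEDENCE[a.plan];
  }
  // 3. later currentPeriodEnd wins (within-plan duration tie-break)
  return b.currentPeriodEnd - a.currentPeriodEnd;
}

// Recompute = sort the user's covering subs and take the head.
function pickCoveringSub(subs) {
  return [...subs].sort(compareSubscriptionsByCoverage)[0];
}
```

Because the comparator is deterministic at every level, recomputing from
the full set of covering subs always lands on the same entitlement
regardless of webhook arrival order.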

Also: deleteSubscriptionByDodoId in billing.ts now reuses
recomputeEntitlementFromAllSubs instead of duplicating the picker logic, so
admin cleanup never produces an entitlement state an organic webhook flow
wouldn't have produced.

Tests added (4):
- subscription.renewed on lower-tier sub does NOT clobber higher-tier
- subscription.active for a NEW lower-tier sub does NOT clobber existing higher-tier
- same-tier precedence: api_business outranks api_starter when both cover
- comparator tie-break by currentPeriodEnd within the same plan

123/123 passing.
…#3473)

* feat(broadcast): cron-driven ramp runner with kill-gate halt

Replaces the manual three-command ritual (assignAndExportWave →
createProLaunchBroadcast → sendProLaunchBroadcast) with a daily
cron at 13:00 UTC that:

  1. Fetches the prior wave's getBroadcastStats
  2. Halts (sets killGateTripped=true, deactivates ramp) if bounce
     rate > 4% or complaint rate > 0.08% — operator must clear
     before resume
  3. Otherwise runs assignAndExportWave + create + send for the
     next tier in `rampCurve`

Singleton config table `broadcastRampConfig` (keyed by literal
"current") holds the curve, current tier, kill-gate state, and last-
wave tracking. Admin mutations: initRamp / pauseRamp / resumeRamp /
clearKillGate / abortRamp / getRampStatus.

Safety rails:
- `MIN_DELIVERED_FOR_KILLGATE = 100`: kill-gate ignored until prior
  wave has enough delivered events for stable rate calc (avoids
  trip on sample-size noise: 1 bounce / 10 delivered = 10%)
- `MIN_HOURS_BETWEEN_WAVES = 18`: cron defers if prior wave is
  fresher than 18h (bounces / complaints take time to flow back via
  Resend webhook)
- `UNDERFILL_RATIO = 0.5`: deactivates ramp when assignAndExportWave
  returns < 50% of requested count (pool drained signal)
- Kill-gate latch is one-way — never auto-clears. Operator runs
  `clearKillGate '{"reason":"..."}'` after investigating, which
  stamps the cleared reason into lastRunStatus for audit
- Partial-failure recovery: if assignAndExportWave / create /
  send throws mid-flight, status records as "partial-failure" with
  the offending error and the cron blocks until cleared. Throws
  bubble to Convex auto-Sentry for paging
- `_recordWaveSent` mutation does an `expectedCurrentTier` check
  before patching — two concurrent cron firings can't both advance
  the same tier (defence-in-depth; cron isn't supposed to overlap
  but Convex doesn't guarantee at-most-once on retried cron runs)
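
The kill-gate decision itself reduces to a small predicate — a minimal
sketch assuming `getBroadcastStats` returns `{ delivered, bounced,
complained }` counts (the function name `shouldTripKillGate` is
illustrative; thresholds are the ones stated above):

```javascript
const MIN_DELIVERED_FOR_KILLGATE = 100;
const MAX_BOUNCE_RATE = 0.04;      // 4%
const MAX_COMPLAINT_RATE = 0.0008; // 0.08%

function shouldTripKillGate(stats) {
  // Too few delivered events for a stable rate — skip the gate entirely
  // (1 bounce / 10 delivered = 10% is sample-size noise, not reputation damage).
  if (stats.delivered < MIN_DELIVERED_FOR_KILLGATE) return false;
  const bounceRate = stats.bounced / stats.delivered;
  const complaintRate = stats.complained / stats.delivered;
  return bounceRate > MAX_BOUNCE_RATE || complaintRate > MAX_COMPLAINT_RATE;
}
```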

Wave-label naming: `${prefix}-${tier + offset}`. Default offset 3
means tier 0 → wave-3, tier 1 → wave-4, etc. — picks up cleanly
after manually-sent canary-250 + wave-2.

Daily-cron timing 13:00 UTC: late enough that overnight bounces /
complaints from the prior 24h have flowed back via webhook, early
enough (9am ET / 6am PT / 3pm CET) that a tripped kill-gate hits
US business hours for triage.

Files:
- convex/schema.ts: new `broadcastRampConfig` table + by_key index
- convex/broadcast/rampRunner.ts: runDailyRamp action + admin
  mutations + the two recording mutations
- convex/crons.ts: wires runDailyRamp to crons.daily
- convex/_generated/api.d.ts: regenerated

Operator setup (run once after deploy):

  npx convex run broadcast/rampRunner:initRamp '{
    "rampCurve": [1500, 5000, 15000, 25000],
    "waveLabelPrefix": "wave",
    "waveLabelOffset": 3
  }'

After that, the cron handles everything until either kill-gate trips
or the pool drains. Status check anytime via:

  npx convex run broadcast/rampRunner:getRampStatus '{}'

* fix(broadcast): seed prior wave + halt on partial export failures

PR koala73#3473 review:

P1 #1 — first automated wave skipped kill-gate for the last manually
sent wave because `initRamp` had no way to seed prior-wave metadata.
With currentTier=-1 and lastWaveBroadcastId=undefined, the kill-gate
block at runDailyRamp's Step 1 was unreachable on the first tick after
init. Add `seedLastWave*` optional args; require them as a pair when
`waveLabelOffset > 0` (operational signal that this is a resumption
after manual waves, not a fresh ramp).

P1 #2 — runner narrowed `assignAndExportWave`'s return type to only
`{segmentId, assigned, underfilled}`, dropping `failed` and
`stampFailed`. A wave that requested 500 with 250 push failures + 250
successes would have proceeded to create + send the broadcast, marking
the tier as cleanly advanced. `stampFailed > 0` is worse: contacts are
in the Resend segment (will be emailed) but unstamped (re-eligible for
the next pick → guaranteed duplicate-email). Now: widen the local type
to the full `WaveExportStats`, export it from audienceWaveExport.ts,
and abort the run with `partial-failure` status if either failure
counter is non-zero. Operator clears via the existing
`lastRunStatus === partial-failure` gate.

* fix(broadcast): add clearPartialFailure recovery mutation

PR koala73#3473 review (third P1):

The partial-failure block I added in 35091b5 (treat any non-zero
export failure counter as halt-don't-proceed) had no recovery path.
`runDailyRamp` refuses to advance while
`lastRunStatus === "partial-failure"`, but `clearKillGate` no-ops when
`killGateTripped` is false — so a partial-failure would block the cron
forever short of `abortRamp` or hand-patching the DB.

Add `clearPartialFailure(reason: string)` matching the `clearKillGate`
shape: requires partial-failure status (else no-op), records audit
reason in `lastRunStatus`, clears `lastRunError`. Kept separate from
`clearKillGate` deliberately — kill-gate is an email-reputation
investigation (bounce/complaint thresholds), partial-failure is a
mechanical export/send investigation (Resend logs, Convex stamp
errors). Different recovery requirements; conflating them would
encourage operators to clear without reading the right log.

Updated operator-usage docstring with the new command.
koala73 and others added 27 commits May 9, 2026 17:45
…tion' (koala73#3591)

* fix(news): align server finance digest key 'regulation' → 'fin-regulation'

Concrete bug: the finance-variant Financial Regulation panel rendered
empty + UNAVAILABLE pill for every visitor (anon AND pro). Root cause
was a client/server category-key mismatch: the client uses
`'fin-regulation'` (in `src/config/feeds.ts` FINANCE_FEEDS, plus a
matching panel config and a one-time storage migration in App.ts:539
that already shipped) while the server still emitted the digest bucket
under key `'regulation'`. The client iterates `Object.keys(FEEDS)` and
does `digest.categories[category]` — when the keys diverge the panel
never finds its items, the per-feed RSS fallback is gated off on web,
and the body renders `[]` → "No news available" + UNAVAILABLE.

Server is the side that drifted, so renaming server-side avoids
forcing a second round-trip through the client storage migration that
already landed for this rename earlier. This is a static config
change in `server/worldmonitor/news/v1/_feeds.ts` only — no consumer
references the literal `'regulation'` string outside this map (the
classifier keyword on line 145 is a content-match keyword, not a
category key, and is unaffected).

Add a static parity guard `tests/news-feed-key-parity.test.mts` that
asserts client `Object.keys(FEEDS)` ⊆ server `VARIANT_FEEDS[variant]`
for tech / finance / commodity. The guard surfaced two pre-existing
gaps in the tech variant (`podcasts`, `thinktanks`) — those are
separate from this PR's regulation rename and would require curated
RSS sources, so they're listed in `knownGapsClientOnly` with a TODO
pointer. The test also asserts the allowlist itself stays current
(no entries that the server now covers, no entries that don't exist
on the client) so a future cleanup pass can't carry phantom drift.
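
The guard's core assertion is a subset check plus an allowlist-staleness
check — sketched here over plain key maps (the helper name and shapes
are illustrative, not the test file's actual API):

```javascript
// clientFeeds / serverFeeds: category-key → feed-list maps.
// knownGapsClientOnly: allowlisted client-only keys (e.g. podcasts, thinktanks).
function checkFeedKeyParity(clientFeeds, serverFeeds, knownGapsClientOnly = []) {
  // Every client key must exist on the server unless explicitly allowlisted.
  const missing = Object.keys(clientFeeds).filter(
    (key) => !(key in serverFeeds) && !knownGapsClientOnly.includes(key),
  );
  // The allowlist itself must stay current: no entries the server now
  // covers, no entries that no longer exist on the client.
  const staleListed = knownGapsClientOnly.filter(
    (key) => key in serverFeeds || !(key in clientFeeds),
  );
  return { missing, staleListed };
}
```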

Verified pre-fix: parity test fails with `Missing on server: fin-regulation`.
Post-fix: parity test passes; full suite 7859/7859.

Closes todos/257 item 9 (news digest coverage drift) for the
finance-variant fin-regulation case.

Item 10 audit: only RegionalIntelligenceBoard writes to a private
`this.body.innerHTML` ref. PR koala73#3586 already neutralised the bug class
by ensuring the lock state fires via `showGatedCta` BEFORE
`loadCurrent()` runs, so writes to the now-detached body are silent
no-ops. No code change needed for item 10 itself; closing the audit.

* test(news-parity): brace-depth guard + fix `staleListed` typo

Greptile review on PR koala73#3591 — two P2 findings, both addressed.

1. `extractCategoryKeys` regex matched `<key>: [` anywhere in the
   variant body without tracking brace depth. A future feed entry
   formatted across multiple lines like
       { name: '...', tags: ['a', 'b'] }
   would emit a spurious `tags` key. The current feed maps use
   single-line objects so this isn't observable today, but the
   guard is meant to outlive style drift. Replace the global regex
   with a stateful scanner that walks the body, maintains brace
   depth, skips inside string literals, and only matches keys at
   depth 0. Smoke-tested against the exact false-positive shape
   Greptile flagged: keys returned `['cloud', 'ai']`, not `tags`.
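
A minimal version of that depth-0 scanner looks like this — simplified
assumptions: single-quoted strings only, no escape handling, and the
input is the variant body between its outer braces:

```javascript
function extractCategoryKeys(body) {
  const keys = [];
  let depth = 0;
  let inString = false;
  let token = '';
  for (const ch of body) {
    if (inString) {
      if (ch === "'") inString = false;
      else token += ch; // quoted keys like 'fin-regulation' accumulate here
      continue;
    }
    if (ch === "'") { inString = true; token = ''; continue; }
    if (ch === '{' || ch === '[') { depth++; token = ''; continue; }
    if (ch === '}' || ch === ']') { depth--; token = ''; continue; }
    if (ch === ':') {
      // Only a `key:` seen at depth 0 is a category key; `tags:` inside a
      // nested feed object sits at depth >= 2 and never matches.
      if (depth === 0 && /^[\w-]+$/.test(token.trim())) keys.push(token.trim());
      token = '';
      continue;
    }
    if (ch === ',' || ch === '\n') { token = ''; continue; }
    token += ch;
  }
  return keys;
}
```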

2. Variable name typo `stalewListed` → `staleListed`.
…t every call (koala73#3592)

User report: logged-in Pro users on tech / commodity variants saw
"Upstream API unavailable" on the Macro Stress (economic) panel,
with console showing repeated `get-fred-series-batch:1 ... 401`.
Anonymous users on the SAME variants saw the data correctly. Full
variant worked for Pro users (because their main-domain localStorage
carries `wm-pro-key` / `wm-widget-key`).

Root cause is in `premiumFetch`. Many service clients (economic,
supply-chain, …) wrap the WHOLE generated client with `premiumFetch`
even though only a few methods target a premium path. Today
`premiumFetch` attaches `Authorization: Bearer <jwt>` for ANY caller
who has a Clerk session — including for non-premium endpoints.

For a Pro user with no tester key hitting a non-premium endpoint:

  1. premiumFetch sets Authorization → wm-session interceptor sees
     it and steps aside, NOT attaching `X-WorldMonitor-Key: wms_…`.
  2. Server gateway only resolves Bearer JWTs on tier-gated paths
     (gateway.ts: `if (isTierGated) resolveClerkSession(...)`); for
     non-tier-gated paths the JWT is ignored entirely.
  3. validateApiKey() reads ONLY X-WorldMonitor-Key. With no key
     present it returns { valid: false, required: true } → 401.

For an anon user the same call falls through to plain
globalThis.fetch, the interceptor attaches wms_, and the gateway
accepts it — hence the inverse-of-expected "anon sees more" pattern.

Fix: gate the Bearer attach on PREMIUM_RPC_PATHS membership. Public
paths fall through so the wm-session interceptor handles wms_.
API-key holders and tester-key holders are unaffected — those auth
shapes travel via X-WorldMonitor-Key which works on any path.

ENDPOINT_ENTITLEMENTS (the tier-gated set) is a strict subset of
PREMIUM_RPC_PATHS at the time of writing, so the single check covers
both gates.
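
The gated attach can be sketched as a thin wrapper — path values,
`getClerkJwt`, and the injectable `fetchImpl` are illustrative
assumptions; only `premiumFetch` and `PREMIUM_RPC_PATHS` are names from
the fix itself:

```javascript
// Assumption: PREMIUM_RPC_PATHS holds exact pathname strings.
const PREMIUM_RPC_PATHS = new Set(['/rpc/get-premium-series', '/rpc/get-tier-gated-report']);

async function premiumFetch(url, init = {}, { getClerkJwt, fetchImpl = fetch } = {}) {
  const path = new URL(url, 'https://example.invalid').pathname;
  const headers = new Headers(init.headers);
  // Only attach the Clerk JWT on premium paths. Public paths fall through
  // WITHOUT Authorization, so the wm-session interceptor attaches the
  // X-WorldMonitor-Key (wms_…) instead and the gateway accepts the call.
  if (PREMIUM_RPC_PATHS.has(path)) {
    const jwt = getClerkJwt?.();
    if (jwt) headers.set('Authorization', `Bearer ${jwt}`);
  }
  return fetchImpl(url, { ...init, headers });
}
```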

Tests: 4 new regression assertions in tests/premium-fetch.test.mts.
Verified pre-fix that "non-premium path: Clerk JWT NOT attached"
fails with the old code and passes with the new code. Full
test:data suite green: 7863/7863.

Net effect: Pro users without tester keys now see public economic
data (FRED, BLS, BIS, energy) on every subdomain, matching the
behaviour anon users already had.
…kip time filter (koala73#3593)

PRODUCTION 2026-05-04: enabling the disease-outbreaks map layer renders
nothing despite /api/health reporting `diseaseOutbreaks: status=OK,
records=50`. Direct Redis read confirmed the cache holds 50 items but
none render. Three compound issues, fixed in one PR.

== A. ThinkGlobalHealth source returning 0 items ==

Two sub-bugs in `fetchThinkGlobalHealth`:

1. Wrong default branch in URL.
   The seeder hardcoded `/main/index_bundle.js` but the TGH GitHub repo's
   default branch is `master`. The `/main/` URL has been returning HTTP
   404 silently — the seeder caught the !resp.ok and returned []. Fix:
   `main` → `master`. Verified live: 200 OK, 7.5 MB bundle.

2. Bundle format change after 2026-04 webpack rebuild.
   The legacy parser anchored on `var a=[{Alert_ID:` (unquoted JS keys).
   The new bundle wraps records in `eval("var res = [...]")` blocks with
   JSON-quoted keys like `\"Alert_ID\":\"8732529\"`. The old regex
   (`/(\w+):"((?:[^"\\]|\\.)*)"/`) doesn't match the quoted-key form.
   Fix: import `parseRealtimeAlerts` from seed-vpd-tracker.mjs (which
   already handles this exact format with a battle-tested
   schema-anchored scanner — VPD and Disease share the same TGH bundle).

After both fixes, TGH contributes ~1,600 ProMED-reviewed alerts with
real lat/lng, restoring the only geo-rich source (WHO/CDC/ONT only
publish country names → require centroid fallback).

== B. UI time filter eats every item ==

The map's `filterByTimeCached` (DeckGLMap.ts:1505) gated diseaseOutbreaks
by the global timeRange dropdown (max '7d'). Disease outbreaks are
sparse-by-nature — WHO DON publishes 1-2/week, CDC HAN alerts are
infrequent, TGH carries 90 days of ProMED items. When the most recent
WHO/CDC update is 8+ days old (normal), the 7d gate dropped every item
→ empty layer. Production confirmed: 50 cached items, newest 11.0 days
old, all dropped.

Fix: skip the time filter for diseaseOutbreaks. Render all items in the
cache; the seeder's per-source lookback already bounds freshness at
write time. Other layers keep the global filter.

== Out of scope ==

- C: structural health-readiness probe (seedAgeMin tracks seeder run,
  not item freshness — separate followup).
- Static-layer zoom-gates (bases/nuclear/spaceports/economic show
  nothing at default zoom 2 because LAYER_ZOOM_THRESHOLDS[*].minZoom
  is 3-5). Intentional UX, not a data bug — separate followup if we
  want a "zoom in to see N items" affordance.
…a73#3594)

Captures the architectural follow-up identified during the 2026-05-04
disease-outbreaks incident: /api/health currently reports seeder-run
freshness, not content freshness. For sparse upstream sources
(WHO Disease Outbreak News, IEA OPEC reports, central-bank releases,
WB annual indicators) these diverge — seeder runs fine, seed-meta
fetchedAt stays fresh, but the freshest item the user sees is days or
weeks old. Health says OK; UI renders nothing.

Plan opts seeders into a parallel content-age contract:
- runSeed accepts itemTimestamp / itemsPath / maxContentAgeMin
- seed-meta carries newestItemAt/oldestItemAt/maxContentAgeMin when set
- api/health reports new STALE_CONTENT status when content is older than
  the seeder's content-age budget
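
The proposed STALE_CONTENT classification is roughly this shape — a
sketch of the plan, assuming a seed-meta record carrying the new
`newestItemAt` / `maxContentAgeMin` fields (the helper name is
hypothetical):

```javascript
function classifyContentFreshness(meta, nowMs = Date.now()) {
  // Legacy seeders without the content-age contract keep current behavior.
  if (meta.newestItemAt == null || meta.maxContentAgeMin == null) return 'OK';
  const contentAgeMin = (nowMs - meta.newestItemAt) / 60_000;
  // Seeder ran fine (fetchedAt fresh) but the freshest ITEM is past the
  // content-age budget — the "fetched-recently is not fresh-content" case.
  return contentAgeMin > meta.maxContentAgeMin ? 'STALE_CONTENT' : 'OK';
}
```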

Backwards compatible — legacy seeders without itemTimestamp keep current
behavior. Pilot on disease-outbreaks (today's incident's origin), then
migrate sparse + annual seeders over Sprints 3-4.

Companion to PRs koala73#3582 (canonical-envelope-mirror), koala73#3593 (disease-
outbreaks TGH + time-filter fixes), and the broader 'fetched-recently is
not the same as fresh-content' insight.
…revisions (koala73#3595)

Five rounds of Codex review against plan koala73#3594:

- Round 1 (8 findings): pilot threshold won't catch incident; canonical-mirror loses content fields; synthetic timestamps mask staleness; all-undated falls through to OK; wrong health.js target symbols; soft-disabled budget; brittle autodetect; missing tests. Adopted contentMeta(data) -> {newestItemAt, oldestItemAt} API.
- Round 2 (3 findings): envelope-writer chain incomplete (need _seed-envelope-source.mjs + parity mirrors + _seed-contract.mjs); classifier precedence wrong; disease snippet broke isNaN filters and mapItem.
- Round 3 (3 findings): TGH source missed migration; stale classifier code block contradicted Sprint 1; helpers leaked via list-disease-outbreaks + bootstrap.
- Round 4 (3 findings): replacement classifier still used bare 'return' but real code uses status='X' assignment; mapItem section contradicted strip contract; grep missed _originalPublishedMs.
- Round 5 (1 finding): test descriptions still said cached items 'have helper fields'; rewrote to separate pre-publish in-memory layer from post-strip published-canonical layer.

Net: ~280 prod LOC + ~250 test LOC for Sprint 1; explicit envelope-writer chain coverage; Sprint 2 disease pilot covers all 3 sources (WHO/RSS/TGH) + helper-field strip via publishTransform + anti-regression tests.

Companion: PR koala73#3593 (immediate disease-outbreaks fixes), koala73#3582 (canonical-envelope-mirror).
…way-egress blips (koala73#3600)

* fix(customs-revenue): retry Treasury MTS with backoff to survive Railway-egress blips

Health endpoint reported `customsRevenue: { status: EMPTY, records: 0,
seedAgeMin: 1845, maxStaleMin: 1440 }` — 30+ hours stale. Treasury
MTS upstream verified healthy (direct probe with the same URL + UA
returned 39 rows in <1s); the gap is Railway-side. Sibling fetchers
in the same seeder (shippingRates seedAgeMin=46m, comtradeFlows
seedAgeMin=116m) were updating fine, so the cron service is alive —
only the customs branch was rejecting in `Promise.allSettled` and
the existing rejection-warn at line 688 logged it but the next 6h
cron tick hit the same transient and never recovered. By the time
the data-key 24h TTL elapsed, the panel went EMPTY.

Add a 3-attempt retry with linear backoff (5s, 10s) wrapping the
Treasury fetch. The existing 15s per-attempt timeout stays. Final
rejection re-throws with attempt count + last error so the rejection
log line at fetchAll() carries enough context to triage from health
output alone (no need to grep Railway logs for the underlying
ECONNRESET / 5xx / etc).

Factor row parsing into `parseCustomsRows()` so the success branch
of the retry loop is clean — retry returns the parsed result
directly; only the fetch + validation steps repeat.
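
The retry wrapper's shape, with an injectable sleep so the backoff is
testable without real delays — names are illustrative, not the seeder's
actual identifiers:

```javascript
const MAX_ATTEMPTS = 3;
const BACKOFF_STEP_MS = 5_000; // attempt 1 → 5s wait, attempt 2 → 10s wait

async function fetchWithRetry(fetchOnce, sleep = (ms) => new Promise((r) => setTimeout(r, ms))) {
  let lastError;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      return await fetchOnce();
    } catch (err) {
      lastError = err;
      if (attempt < MAX_ATTEMPTS) await sleep(attempt * BACKOFF_STEP_MS); // linear backoff
    }
  }
  // Re-throw with attempt count + last error so the allSettled rejection
  // log carries enough context to triage from health output alone.
  throw new Error(`Treasury MTS failed after ${MAX_ATTEMPTS} attempts: ${lastError.message}`);
}
```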

Net effect: a single transient blip on a Railway egress/IP-policy
hiccup no longer cascades into 24+ hours of EMPTY panel data.
Pro/anon UX unchanged when Treasury is fully healthy.

Verified:
- Existing 22 customs-revenue assertions still pass.
- Two new assertions cover the retry shape (3-attempt cap, linear
  backoff, exhausted-error message, parser factored out). Pre-fix
  both new assertions fail; post-fix both pass.
- Full test:data suite: 7865/7865.
- Treasury upstream confirmed healthy via direct curl + node fetch
  probes — root cause is Railway-side transient, not parsing or
  upstream schema drift.

* fix(customs-revenue): short-circuit retry on non-transient errors + comment typo

Greptile P2 review on PR koala73#3600:

1. Block comment said "(5s, 15s)" — `attempt * 5_000` actually
   produces 5s on attempt 1 and 10s on attempt 2 (the 15s was
   accidentally pulled from the per-attempt timeout on the same
   line). Worst-case retry budget is ~60s, not ~75s. Comment now
   reads "(5s, 10s) plus the existing 15s per-attempt timeout".

2. Catch block was catch-all, so deterministic failures — `Treasury
   MTS HTTP 400/404` and the `rows.length > 100` schema-drift check —
   would burn the full 5s + 10s backoff before propagating, plus
   emit two misleading "retrying in 5000ms" warns for what is
   actually a fixed upstream / contract violation.

   Mark such errors with `err.__retryable = false` at the throw
   site; the catch block honours the marker and breaks out of the
   loop immediately. 429 (rate limit) stays retryable. "Treasury
   MTS returned no data" stays retryable too — that one CAN be
   transient (deploy gap, reseed window).
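
The marker pattern reduces to flagging at the throw site and honouring
the flag in the loop — a sketch with backoff elided; only `__retryable`
is a name from the fix, the rest is illustrative:

```javascript
function classifyHttpError(status) {
  const err = new Error(`Treasury MTS HTTP ${status}`);
  // 4xx (except 429) is a fixed client/contract error — retrying just
  // burns the backoff budget. 429 (rate limit) stays retryable.
  if (status >= 400 && status < 500 && status !== 429) err.__retryable = false;
  return err;
}

async function fetchWithShortCircuit(fetchOnce, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fetchOnce();
    } catch (err) {
      lastError = err;
      if (err.__retryable === false) break; // non-transient: stop immediately
    }
  }
  throw lastError;
}
```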

Two new regression assertions in tests/customs-revenue.test.mjs:
- 4xx-except-429 short-circuit pattern is in place + catch block
  honours `err.__retryable === false`.
- Schema-drift row-count violation gets marked non-retryable.

Pre-fix verification: stashed only the script change, both new
assertions fail with "expected 4xx-except-429 client-error
short-circuit" and "Treasury MTS … __retryable = false" missing.
Post-fix all 26 customs-revenue assertions pass; typecheck + lint
clean.
…ERVICE_UNAVAILABLE level, ignore change_ua extension noise (koala73#3601)

Three parallel fixes from one triage round; bundled because they all
touch the same Sentry-classification surfaces.

A. WORLDMONITOR-PG (5ev/4u) — JSON-shape Unauthenticated misclassified

   Convex platform-level 401 ships a JSON body
     `{"code":"Unauthenticated","message":"Could not verify OIDC token
     claim..."}`
   when the Clerk token fails Convex's own OIDC check (token expired
   between our edge's `validateBearerToken` and Convex's verify, or
   Clerk JWKS rotated). Mixed-case `"Unauthenticated"` doesn't match
   the legacy uppercase `UNAUTHENTICATED` substring check. Without the
   JSON-shape detector, this fell through to `error_shape: 'unknown'`
   AND the edge handler returned 500 instead of 401.

   Fix mirrors the existing `"code":"ServiceUnavailable"` JSON detector
   added in PR koala73#3479:
   - `api/_convex-error.js`: detect `"code":"Unauthenticated"` →
     return UNAUTHENTICATED kind. Routed through the same edge branch
     that returns 401.
   - `api/user-prefs.ts`: `error_shape` regex extended to match both
     `UNAUTHENTICATED` and `"code":"Unauthenticated"`. Both bucket as
     `convex_auth_drift` so on-call sees one issue, not two.
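
The detector shape, mirroring the existing ServiceUnavailable check — a
sketch, with the kind constants and function name as illustrative
stand-ins for the real `_convex-error.js` internals:

```javascript
function classifyConvexError(body) {
  // Convex platform 401 ships {"code":"Unauthenticated","message":"..."} —
  // mixed case, so the legacy uppercase-substring check never matched it.
  if (/"code"\s*:\s*"Unauthenticated"/.test(body)) return 'UNAUTHENTICATED';
  if (/"code"\s*:\s*"ServiceUnavailable"/.test(body)) return 'SERVICE_UNAVAILABLE';
  if (body.includes('UNAUTHENTICATED')) return 'UNAUTHENTICATED'; // legacy shape
  return 'UNKNOWN';
}
```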

B. WORLDMONITOR-QA (4ev/2u) — SERVICE_UNAVAILABLE level=warning

   Sister fix to PR koala73#3506 (CONFLICT level downgrade). Convex platform
   503 is a known transient external-system event; we already capture
   for visibility and return 503 + Retry-After. Pass `level: 'warning'`
   so the capture stays queryable but doesn't drown the error
   dashboard or page on-call. Both GET and POST SERVICE_UNAVAILABLE
   branches updated.

C. WORLDMONITOR-2D (88ev/26u) — change_ua browser extension noise

   `SyntaxError: Failed to execute 'appendChild' on 'Node': Identifier
   'change_ua' has already been declared.` `change_ua` is a known
   User-Agent-spoofing browser extension injecting the same script
   twice. Already had a regex covering `script|reportPage|element|Shop`
   in `ignoreErrors` for the same shape; just extending to include
   `change_ua`.

Tests: 7 new test cases across two test files (4 in
`user-prefs-convex-error.test.mjs` covering JSON-shape Unauthenticated
including defensive negative-control + structured-data precedence;
1 in `user-prefs-sentry-context.test.mts` for `convex_auth_drift`
classification on the JSON-shape variant). Existing 7872/7872 +
181/181 edge tests still pass.

Q9 (Checkout error: session_expired, 1ev/1u) was also resolved this
round — already correctly at info level via INFO_LEVEL_CODES, no
code change needed.
…l Model (koala73#3605)

* fix(ai-flow): cross-variant sync of AI toggles + Headline Memory under Browser Local Model parent

Two related bugs surfaced from a single user observation: "I see HuggingFace
model downloads on tech variant but not full" + "all variants should act the
same."

A. Cross-variant sync gap (sync-keys.ts)

`CLOUD_SYNC_KEYS` synced `wm-ai-flow-cloud-llm` (Cloud AI toggle) across
variants since launch, but accidentally omitted the two sister keys:
  - `wm-ai-flow-browser-model` (Browser Local Model)
  - `wm-headline-memory` (Headline Memory)

Effect: enabling Headline Memory on the full variant left the tech-variant
localStorage at default-false. The user's settings disagreed across
variants for no architectural reason — `wm-ai-flow-cloud-llm` proves the
sync path supports them. Adding both to CLOUD_SYNC_KEYS lets the existing
cloud-prefs-sync round-trip them.

B. Headline Memory escapes Browser Local Model (App.ts + ai-flow-settings.ts)

Headline Memory implementation requires a local embeddings model in the
ML worker. Pre-fix, the two toggles were independent: a user could turn
Browser Local Model OFF and leave Headline Memory ON, which silently kept
running the local ML worker (and the lazily-loaded sentiment + NER models
that piggyback on `mlWorker.isAvailable`). The "Browser Local Model" toggle
was a lie — local models still ran via the Headline Memory gate.

Fix: make Headline Memory a child of Browser Local Model.
  - `isHeadlineMemoryEnabled()` now returns `headline && browser` (effective
    value). All five existing gate sites in App.ts, country-intel.ts, and
    rss.ts inherit the new behavior automatically.
  - Added `getHeadlineMemoryRawValue()` for the settings UI render so the
    toggle still reflects the user's stored preference (re-enabling the
    parent restores their prior choice).
  - App.ts boot path uses `isHeadlineMemoryEnabled()` instead of the raw
    field; on `browserModel` toggle OFF (web), terminate the worker
    unconditionally — the previous `!isHeadlineMemoryEnabled()` clause is
    now circular under the new gating.
  - On `browserModel` toggle ON, re-load the embeddings model if the
    user's persisted Headline Memory was on, so they don't have to
    re-toggle.

C. UI consistency (preferences-content.ts)

`toggleRowHtml` extended with optional `disabled` (back-compat). The
Headline Memory toggle renders disabled when Browser Local Model is off
on web — visual signal of the parent-child dependency. Toggling Browser
Local Model live updates the Headline Memory disabled state without
re-rendering the panel.

Truth table after fix:
  Browser=ON,  Headline=ON  → worker runs (correct)
  Browser=ON,  Headline=OFF → worker runs (correct: other ML features)
  Browser=OFF, Headline=OFF → no worker (correct)
  Browser=OFF, Headline=ON  → no worker (FIXED: was running silently)

* fix(ai-flow): skip Browser Local Model parent gate on desktop runtime (P1)

Previous fix made isHeadlineMemoryEnabled() require both wm-headline-memory
AND wm-ai-flow-browser-model. But Browser Local Model is a web-only toggle
— preferences-content.ts:223 hides it on desktop, and App.ts:897 init's
the worker unconditionally on desktop via isDesktopRuntime(). Result: the
hidden web key never flips to true on desktop, and Headline Memory would
be silently dead on every desktop install.

Skip the parent gate when isDesktopRuntime() returns true. The gate exists
to keep the user's web-side opt-out honest; on desktop the user has
already opted into local AI by installing the Tauri app.
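
The effective gate after both commits can be sketched as a pure
function over the two toggle states plus the runtime check (storage
reads stubbed into parameters; the truth table in the first commit is
the contract):

```javascript
function isHeadlineMemoryEnabled({ headlineMemory, browserModel, isDesktopRuntime }) {
  if (!headlineMemory) return false;
  // Desktop installs opted into local AI by installing the Tauri app —
  // the web-only Browser Local Model key never flips true there, so the
  // parent gate is skipped.
  if (isDesktopRuntime) return true;
  // Web: Headline Memory is a child of Browser Local Model.
  return browserModel;
}
```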

* review(greptile): add missing .catch on init + drop unused getHeadlineMemoryRawValue (P2 x2)
…tandard (koala73#3610)

## Symptom

/api/health reported `bisPolicy`, `bisExchange`, `bisCredit` as
EMPTY_ON_DEMAND with `records: 0`, but `seed-meta:economic:bis` showed
`recordCount: 11` and a recent `fetchedAt`. Verified 2026-05-06 by
direct Upstash GET:

  economic:bis:policy:v1   → (key does not exist)
  economic:bis:eer:v1      → (key does not exist)
  economic:bis:credit:v1   → (key does not exist)
  seed-meta:economic:bis   → fetchedAt recent, recordCount: 11

## Root cause: TTL == cron interval (zero margin)

`seed-bis-data.mjs:32` set `TTL = 43200` (12h). `seed-bundle-macro.mjs:5`
configures the BIS-Data section with `intervalMs: 12 * HOUR`. So the
canonical-key TTL exactly matches the cron interval — any cron drift
(bundle ordering, queue delay, transient failure) leaves the canonical
TTL'd-out for a window before the next successful run rewrites it.

The 13.7h `seedAgeMin` in /api/health (vs the 12h gate) is exactly the
1.7h drift window where canonical was missing.

`seed-meta` survives the gap because it has its own much longer TTL
(30+ days under runSeed's seedMetaTtl), which is why the meta correctly
reflected last-good `recordCount: 11` while the canonical had vanished.

This is the same shape as the trap caught for `bisDsr`/`bisProperty*` at
api/health.js:268-281 in 2026-04-27 — that fix was on the maxStaleMin
(health-threshold) side; this one applies the SAME 3× gold-standard
recipe to the canonical-key TTL side, which had been overlooked.

## Fix

Bump `TTL = 43200` → `TTL = 129600` (36h = 3× the 12h gate). Covers cron
drift + one degraded-to-24h cycle. All 3 canonical writes (policy via
atomicPublish, eer + credit via writeExtraKey in afterPublish) reuse the
same constant, so one bump fixes all three simultaneously. No code-path
change; this is a pure-config fix.

## Verification

- Direct Upstash GET confirmed all 3 canonical keys missing pre-fix
- BIS upstream verified healthy 2026-05-06: WS_CBPOL/WS_EER/WS_TC all
  return 200 + valid CSV (11/12/12 countries respectively under the
  seeder's exact query)
- Seeder logic + parser locally produce 11+12+12 records when run
  end-to-end against current upstream
- typecheck:api clean; lint clean
- No existing seed-bis-data test (so no regression risk on the test
  side; the diagnostic via Upstash GET stays valid post-fix)

Once the next bundle cron tick runs (within 12h of merge + Railway
deploy), the canonical keys will be repopulated and /api/health will
flip to `status: OK` for all 3 BIS entries. Subsequent cron drift up to
24h past schedule will no longer collapse the canonical keys.
…rops surface (koala73#3611)

* fix(seed-portwatch): retry-on-empty + log-on-empty so silent ArcGIS drops surface

## Symptom

`/api/health` reports `chokepoints: COVERAGE_PARTIAL` (11/13) for hours
at a time despite the upstream ArcGIS having data for all 13. WM
2026-05-06: `cape_of_good_hope` and `gibraltar` both flagged
`dataAvailable: false` in `supply_chain:transit-summaries:v1`, traced
back to those two missing from `supply_chain:portwatch:v1` while every
other chokepoint was healthy.

## Root cause

`scripts/seed-portwatch.mjs:114-127` (pre-fix):

```js
const settled = await Promise.allSettled(batch.map(...));
for (let j = 0; j < batch.length; j++) {
  const outcome = settled[j];
  if (outcome.status === 'rejected') { console.warn(...); continue; }
  if (!outcome.value.length) continue;     // ← SILENTLY skipped
  result[batch[j].id] = ...;
}
```

When ArcGIS returns `{features: []}` (empty 200, the way per-egress-IP
rate limits manifest from Railway), the chokepoint was silently dropped:
no log line, no retry. The 0-record outcome propagated through
`seedTransitSummaries` (ais-relay.cjs) → `dataAvailable: Boolean(cpData)`
flipped false → /api/health reported COVERAGE_PARTIAL with no diagnostic
trail.

The pattern was bursty: 2 of 3 chokepoints in the same CONCURRENCY=3
batch came back empty (cape_of_good_hope, gibraltar — batch 2). Local
fetch using the seeder's exact `fetchAllPages` returned 179 features
each, confirming upstream healthy + bug is in the seeder's silent-drop
path under transient-throttling conditions.

## Fix

Two changes:

1. **Log on empty**: surface the silent-drop path in Railway logs so
   operators see WHICH chokepoint(s) returned 0 features and how often.
   No more invisible failures.

2. **Sequential retry pass**: any chokepoint rejected-or-empty on the
   concurrent first pass gets retried alone with a small delay,
   stepping out of any rate-limit burst. Retries log success/permanent-
   empty distinctly, so transient vs structural failure is visible.

Pipeline extracted as `runFetchPipeline(chokepoints, sinceEpoch,
fetchPagesFn, retryDelayMs)` so tests can inject a mock fetcher and
verify retry behavior without hitting ArcGIS. `fetchAll()` is now a
2-line wrapper that calls `runFetchPipeline` with the real fetcher,
preserving the existing seeder contract (same input, same output shape).

The retry is intentionally **1 attempt** — a permanent ArcGIS issue
for a given chokepoint should still surface as missing in seed-meta
recordCount so /api/health flags it. This isn't a band-aid for real
upstream failures; it's a recovery path for transient throttling.
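The two-pass shape can be sketched as below. This is a hedged illustration of the behavior described above, not the shipped code — the batching constant, log wording, and chokepoint object shape (`{ id }`) are assumptions:

```javascript
// Sketch: concurrent first pass in batches, then a SEQUENTIAL retry pass
// for anything rejected-or-empty, with a delay to step out of a burst.
const CONCURRENCY = 3; // assumed, per the CONCURRENCY=3 batch in the writeup

async function runFetchPipeline(chokepoints, sinceEpoch, fetchPagesFn, retryDelayMs = 0) {
  const result = {};
  const needsRetry = [];

  // First pass: concurrent batches.
  for (let i = 0; i < chokepoints.length; i += CONCURRENCY) {
    const batch = chokepoints.slice(i, i + CONCURRENCY);
    const settled = await Promise.allSettled(
      batch.map((cp) => fetchPagesFn(cp, sinceEpoch)),
    );
    for (let j = 0; j < batch.length; j++) {
      const outcome = settled[j];
      if (outcome.status === 'rejected' || !outcome.value.length) {
        // Log on empty — the previously silent drop path is now visible.
        console.warn(`[portwatch] first-pass empty/rejected: ${batch[j].id}`);
        needsRetry.push(batch[j]);
        continue;
      }
      result[batch[j].id] = outcome.value;
    }
  }

  // Second pass: one sequential retry per failed chokepoint.
  for (const cp of needsRetry) {
    if (retryDelayMs) await new Promise((r) => setTimeout(r, retryDelayMs));
    try {
      const features = await fetchPagesFn(cp, sinceEpoch);
      if (features.length) {
        console.warn(`[portwatch] retry recovered: ${cp.id}`);
        result[cp.id] = features;
      } else {
        console.warn(`[portwatch] permanently empty after retry: ${cp.id}`);
      }
    } catch (err) {
      console.warn(`[portwatch] retry rejected: ${cp.id} — ${err.message}`);
    }
  }
  return result;
}
```

A permanently failing chokepoint is dropped (no throw), so it still surfaces as missing in seed-meta recordCount downstream.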

## Verification

- 10/10 new `portwatch-retry-on-empty` tests pass, covering:
  - Healthy first pass → no retry calls
  - Recovery on empty 200 (single + multi-in-batch — the WM 2026-05-06 pattern)
  - Recovery on first-pass rejection
  - Permanent failure (empty on both passes) → drop, no throw
  - All-fail → empty result (caller decides whether to throw)
  - Retry pass is SEQUENTIAL (max in-flight stays at CONCURRENCY=3, not 6)
  - Retry honors retryDelayMs argument
  - Output shape unchanged (back-compat with consumers)
- 167/167 across full portwatch test suite (no regressions)
- typecheck:api clean; lint clean
- `seed-portwatch.mjs` is NOT in Dockerfile.relay (verified) — no relay-COPY change needed

After deploy, the next 6h bundle tick will hit the retry path on any
transient empties; over the next 24-48h Railway logs will show
recovery rates so we can quantify the throttle frequency.

* test(portwatch): loosen retry-delay timing threshold 40→25ms (Greptile P2)

50ms argument with a 40ms assertion threshold leaves only 10ms of
scheduler-jitter slack — tight enough to flake on slow/shared CI
runners. Bump to half-the-delay (25ms) per Greptile suggestion. Still
proves the delay actually fires (a 0ms gap from an ignored argument
would fail), just with more tolerance.
…ide sort cliff (koala73#3612)

## Symptom

The `seed-bundle-portwatch-port-activity` bundle runs every 12h; the
container was SIGTERM'd at the 540s budget on May 6 00:02 UTC. 36 errors
across 4 batches before timeout, all of the form
`<ISO3>: The operation was aborted due to timeout`. /api/health flagged
`portwatchPortActivity: STALE_SEED 36h+` (2 missed cron cycles).

May 4 + May 5 runs reported `Cache: 174 hits, 0 misses` + 10s duration
(replaying cached payloads — never exercised the upstream path). The
cliff hit on May 6 when all 174 per-country cache keys (synchronized
TTL) expired together and forced real upstream fetches.

## Root cause

ArcGIS migrated `Daily_Ports_Data`'s `date` column to
`esriFieldTypeDateOnly` (sometime in the prior 7 days, hidden by the
cache cliff). Server-side sort on DateOnly is 10-15× slower than no-sort.

Empirical measurements (BRA 60d window, 5,768 rows total, page size 2000):

  | orderByFields                   | per-page    | per-country |
  |---------------------------------|-------------|-------------|
  | `portid ASC,date ASC` (current) |  46.6s      | ~140s ❌ over 90s cap |
  | `ObjectId ASC`                  |  26.5s      | ~80s ⚠ borderline    |
  | (no orderBy at all)             |   4.0s ✅   | ~12s comfortably under |

returnCountOnly is sub-second (the WHERE clause is fine; only the
materialization+sort is slow), confirming this isn't a network or auth
issue — it's specifically the DateOnly orderBy code path on the ArcGIS
server.

## Fix

Drop `` orderByFields: `portid ASC,${df} ASC` `` from the EP3
paginateWindowInto request. The aggregation in that loop is
ORDER-INDEPENDENT — it sums into `Map<portId, accum>` per row without
caring about row order. ArcGIS still provides a consistent default order
(ObjectId ASC) across pages, so resultOffset pagination remains correct.
No client-side sort needed.

Also adds an extensive comment explaining WHY orderBy is deliberately
omitted, so a future contributor doesn't reintroduce it under the
plausible-but-wrong "queries should be deterministically ordered" rule.

## Why this lurked invisibly for days

Per-item Redis cache (`supply_chain:portwatch-ports:v1:<iso2>` keys)
with synchronized 7-day TTL because all keys were written in the same
successful run. While cache was alive, every "successful" run was a
~10s no-op replay — `Cache: 174 hits, 0 misses` — and the upstream
code path was never exercised. The DateOnly migration may have happened
days before the cliff but only became visible when the cache TTL'd out.

(Saved as separate memory: `per-item-cache-cliff-masks-upstream-regression`
to flag this anti-pattern for other seeders. Followups: randomize per-item
TTL on write to de-sync expiry, OR smoke-test ~5% upstream every tick.)
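The TTL-randomization followup can be sketched in one helper. The base value follows the 7-day TTL described above; the jitter width and helper name are assumptions:

```javascript
// Sketch: per-item TTL jitter so all 174 cache keys written in one
// successful run do NOT expire in the same tick (the cache cliff above).
const BASE_TTL_S = 7 * 24 * 3600;  // current 7-day per-item TTL
const JITTER_S = 24 * 3600;        // assumed: up to +1 day of de-sync

function jitteredTtlSeconds(randomFn = Math.random) {
  // randomFn is injectable so tests can pin the jitter deterministically.
  return BASE_TTL_S + Math.floor(randomFn() * JITTER_S);
}
```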

## Tests

- Two new regression-guard tests in
  `tests/portwatch-port-activity-seed.test.mjs`:
    1. EP3 query MUST NOT orderBy on the date field (`${df}`) — locks the
       fix against future re-introduction
    2. EP4 (port-reference, no date column) MAY still orderBy portid —
       prevents an over-eager "remove all orderBy" sweep
- 70/70 portwatch-port-activity tests pass (was 68; added 2 guards)
- 159/159 across full portwatch suite (no regressions)
- typecheck:api clean; lint clean
- seed-portwatch-port-activity.mjs is NOT in Dockerfile.relay — no relay-COPY change

After deploy, the next 12h bundle tick will exercise the new code path.
Per-country fetch should drop from ~140s → ~12s. Bundle's 540s budget
is plenty for 174 countries / CONCURRENCY=12 / 15 batches × ~12s ≈ 180s
total. Post-deploy: /api/health flips `portwatchPortActivity` from
STALE_SEED back to OK, and Railway logs will show real
"Seeded N countries" lines (not the cached-replay no-op).

Memories saved:
- `arcgis-dateonly-orderby-pathologically-slow` — generalises to any
  ArcGIS endpoint with esriFieldTypeDateOnly (verify schema before assuming)
- `per-item-cache-cliff-masks-upstream-regression` — the outer cache trap
koala73#3614)

* feat(brief): bump envelope v3→v4 with stable clusterId field (U1)

Adds the canonical-contract foundation for Sprint 1:
- BRIEF_ENVELOPE_VERSION 3→4; SUPPORTED_ENVELOPE_VERSIONS extends to {1,2,3,4}
  for the 7-day backward-read window covering brief:* (7d), story:track:v1:* (7d),
  digest:accumulator:v1:* TTLs.
- BriefStory.clusterId added: stable per-story-cluster identity (rep hash from
  mergedHashes[0] after materializeCluster). REQUIRED on v4 writes; OPTIONAL on
  v1-3 reads (back-compat). Empty string rejected on every version (would
  silently collapse delivered-log keys across clusters).
- Renderer assertNoExtraKeys allowlist + assertBriefEnvelope per-story
  validator updated.
- Filter ships transitional clusterId source (raw.hash with url:{sourceUrl}
  fallback) so the live cron writes valid v4 envelopes from the moment U1
  lands. U3 swaps the upstream source to mergedHashes[0] from materializeCluster
  without touching schema or assertion plumbing.
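The per-version clusterId rule can be sketched as a small predicate. The validator name is hypothetical; the rule itself (required on v4, optional on v1-3 reads, empty string rejected everywhere) is as stated above:

```javascript
// Sketch of the v4 clusterId contract from the bullet list above.
function assertStoryClusterId(envelopeVersion, story) {
  const { clusterId } = story;
  if (clusterId === undefined) {
    if (envelopeVersion >= 4) throw new Error('clusterId required on v4');
    return; // OPTIONAL on v1-3 back-reads
  }
  // Rejected on EVERY version: an empty clusterId would silently
  // collapse delivered-log keys across clusters.
  if (typeof clusterId !== 'string' || clusterId === '') {
    throw new Error('clusterId must be a non-empty string');
  }
}
```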

Producer audit (grep BRIEF_ENVELOPE_VERSION + version: literals across
scripts/ api/ server/ shared/) confirms single live writer at
shared/brief-filter.js::assembleStubbedBriefEnvelope; all readers go
through the constant. No drift risk.

Tests: 196 pass (76 baseline + new v4 happy/edge/error/integration coverage in
tests/brief-magazine-render.test.mjs). Characterization-first guard verified
by removing v3 from SUPPORTED_ENVELOPE_VERSIONS and observing 14 v3-shape
tests fail loudly.

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U1

* feat(digest): multi-rule canonicalization — option (a) (U2)

Collapses the multi-rule send loop so the email body and the magazine URL
come from the SAME per-user winning rule, eliminating the divergence
documented at scripts/seed-digest-notifications.mjs:1713-1732 (now
rewritten to document option (a)).

What changed:
- Added pure helper selectCanonicalSendRule(brief, userRules) in
  scripts/lib/digest-orchestration-helpers.mjs. Returns the user's
  winning rule for this slot or null. Defensive on missing brief,
  missing/empty chosenVariant, missing/empty rules, winner-not-in-list.
- scripts/seed-digest-notifications.mjs builds a userRulesByUserId Map
  once before the send loop and drops every non-winner rule at the top
  of each iteration via selectCanonicalSendRule(...) === rule. The
  synthesis block runs once per winner; generateDigestProse hits the
  cache row written by compose (no extra LLM call).
- Parity log alarm semantics flipped: winner_match=false was previously
  "expected divergence"; under option (a) it can ONLY indicate canonical-
  rule filter bypass OR compose↔send chosenVariant drift, so it's now
  a hard alarm with diagnostic guidance pointing at the two failure
  modes. winner_match=true && channels_equal=false retains its pre-U2
  PARITY REGRESSION semantics (canonical-synthesis cache drift).
- Comment block at lines 1713-1732 rewritten: documents option (a)
  consistency by name, replacing the prior trade-off framing.

Variant semantics: variant is per-rule (full/finance/tech/etc). Under
option (a), only the winner-rule's variant ever gets a digest:last-sent:v1
key for affected user-slots. Pre-U2 non-winner-variant keys are orphan
but harmless — 8d TTL, no consumer reads them after this change.

Test-first verified: 4 source-text guards in brief-composer-rule-dedup
failed against pre-U2 source (expected) and pass after the change. 8 unit
tests for selectCanonicalSendRule cover happy path, single-rule no-regression,
deleted-winner, missing chosenVariant, empty rules, defensive non-string variant.

Tests: 111 pass / 0 fail in targeted suites; 139/139 in adjacent brief
suites confirms no regression. Total Sprint 1 test addition through U2:
+12 tests.

Subscriber-visible: multi-rule users will see the winning rule's content
in BOTH email body and magazine URL (vs prior per-rule body + winner-rule
URL). Confirmed during planning as the intended behaviour change.

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U2

* feat(brief): wire stable clusterId from materializeCluster rep hash (U3)

Closes the U1 transitional placeholder by sourcing BriefStory.clusterId
from the canonical cluster-rep hash (mergedHashes[0] from materializeCluster)
instead of per-story raw.hash. Multi-story clusters now collapse to ONE
shared clusterId; singletons unchanged.

What changed:
- scripts/lib/brief-dedup-jaccard.mjs: tightened materializeCluster sort
  with hash-ASC tiebreak (3rd sort key). Pre-U3 sort relied on TimSort
  stability + caller iteration order — fully-tied score+mentionCount
  items resolved non-deterministically across input orderings. The plan
  claimed hash-tiebreak was already in place; verification showed it
  wasn't. Without it, U3's idempotency invariant (same cluster across
  two ticks → identical clusterId) would silently fail under any
  caller-side reorder (Map iteration, shuffled membership).
- scripts/lib/brief-compose.mjs: digestStoryToUpstreamTopStory emits a
  new clusterRepHash field, sourced from mergedHashes[0] when present,
  falling back to the rep's own hash for singletons.
- shared/brief-filter.js: replaced U1's transitional clusterId logic
  with three-tier preference — clusterRepHash → raw.hash → url:{sourceUrl}.
  Comment block fully rewritten to document U3 as the canonical landing
  (no more "transitional placeholder" or "U3 will swap" language).
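The three-tier preference can be sketched as a single derivation function (the name is hypothetical; the tier order is as documented above):

```javascript
// Sketch of the clusterId source preference in shared/brief-filter.js:
//   clusterRepHash (canonical materializeCluster rep)
//   -> raw hash (non-clustered producers)
//   -> `url:${sourceUrl}` (last-ditch)
function deriveClusterId(story) {
  if (story.clusterRepHash) return story.clusterRepHash;
  if (story.hash) return story.hash;
  if (story.sourceUrl) return `url:${story.sourceUrl}`;
  return ''; // empty is rejected later by the v4 envelope assertion
}
```

If the precedence ever flips, multi-story clusters shatter back into per-story clusterIds, which is why the order is locked by tests.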

Producer audit (re-ran from U1): assembleStubbedBriefEnvelope remains
the single live envelope writer. composeBriefForRule (only used by
news:insights tests) lacks mergedHashes by design and falls back to
raw.hash — consistent with that path's pre-clustering semantics.

Tests: 354/354 pass across 8 brief/digest test files. Added 12 U3 tests
covering singleton-clusterId-equals-own-hash, multi-story-collapse,
idempotency-across-ticks, distinctness, integration through
materializeCluster → digestStoryToUpstreamTopStory → filterTopStories
→ assertBriefEnvelope, plus 4 determinism regression locks
(materializeCluster sort key precedence + reorder-invariance).

Pre-existing failures in tests/brief-edge-route-smoke.test.mjs are
TS-import-extension issues under raw `node --test`, unrelated to U3
(verified identical on baseline via stash + rerun).

End-to-end clusterId contract: U4's delivered-log writer can now
read clusterId directly off BriefStory.clusterId — REQUIRED + non-empty
per the v4 envelope contract enforced by assertBriefEnvelope.

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U3

* test(brief): CI invariant — digest.cards ⊆ brief.cards (U7)

Locks in the structural-subset enforcement that U1+U2+U3 enable. Every
clusterId emitted by the digest channel formatter must have a matching
clusterId in the brief envelope. Pre-push hook auto-picks-up via the
tests/<name>.test.mjs glob in .husky/pre-push:113.

What changed:
- tests/brief-from-digest-stories.test.mjs gains an 8-test describe block
  + ~50-line canonical invariant rationale header (single source of truth;
  JSDoc and code comments elsewhere reference this header rather than
  re-state it, per feedback_doc_drift_after_behavior_fix_needs_grep_sweep).
- Tests cover: 5-cluster happy path, empty pool, multi-story rep collapse,
  single-rule no-canonical-needed, multi-rule post-U2 winner-pool subset,
  two error-path tests with regex-validated diagnostic messages naming
  the orphan id + delivered-log consequence + brief id set, plus a real-
  chain integration test (materializeCluster → compose → assertBriefEnvelope).

Approach: Option (C) per the plan — fixture-based test against the real
composeBriefFromDigestStories chain plus a local helper
(projectDigestEmitClusterId) that mirrors clusterId derivation. Option
(A) was unavailable: formatDigest / formatDigestHtml emit text/HTML
strings without structured clusterIds. Option (B) extraction would
cross into U4/U5 implementation territory. Test header documents
honestly why Option (C) was chosen and how the cross-check works
(if live derivation drifts, U3 idempotency tests fail and force a
helper update in lockstep).

Production finding flagged for U4: the live formatDigest call site at
scripts/seed-digest-notifications.mjs:1789 passes the RAW post-buildDigest
stories pool (capped at DIGEST_MAX_ITEMS=30) — not env.data.stories
(post-compose, capped at MAX_STORIES_PER_USER=12). When pool > 12, the
digest emits cards beyond the brief envelope. The U7 test fixtures
intentionally stay under the cap to test the structural-subset shape;
U4 must plumb env.data.stories into the formatter call site so the
invariant holds in production. Per the strategic doc's "brief-as-canonical"
direction, the brief envelope's set is the correct iteration domain.

Tests: 52/52 pass (44 existing + 8 new).

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U7

* fix(digest): address Codex PR koala73#3614 P1 + P2 review findings

P1 — compose-miss must not suppress digest delivery
  scripts/seed-digest-notifications.mjs:1683+

  Pre-fix: U2's canonical filter ran unconditionally, dropping every
  rule when briefByUser had no entry for a user. composeBriefsForRun
  returns an empty map when BRIEF_SIGNING_SECRET is missing, brief
  compose is disabled, OR a per-user compose error was caught upstream.
  My change turned any of those config/outage states into a complete
  digest-send outage for affected users.

  Fix: gate the canonical filter on `if (briefForUser)`. When missing,
  fall through to the legacy per-rule send path (multi-rule divergence
  reappears for THAT USER on THIS TICK only — acceptable trade-off
  vs silent suppression). magazineUrl already resolves to null at
  line ~1793 (brief?.magazineUrl ?? null); carousel + CTA paths already
  gate on magazineUrl truthiness, so this branch produces a brief-less
  email/text body that still delivers the curated story list.

  Added composeMissUsers Set so each user gets ONE warn per cron tick,
  not one per rule iteration. Warn line shape:
    [digest] compose-miss user=<id> — briefByUser has no entry. ...
  Uses console.warn (not console.log) so Sentry's console-breadcrumb
  hook surfaces it. Docblock cites the three failure modes
  (BRIEF_SIGNING_SECRET unset, compose disabled, per-user compose
  error) so on-call can triage without git spelunking.

P2 — sourceUrl required from v2 onward, not just on the latest version
  server/_shared/brief-render.js:342

  Pre-fix: `if (env.version === BRIEF_ENVELOPE_VERSION || st.sourceUrl !== undefined)`
  required sourceUrl only on the LATEST version (v3 pre-U1, v4 post-U1),
  contradicting the v2+ contract documented in the comment block above.
  Pre-U1 this exempted v2; post-U1 it exempted v2 AND v3 — strictly worse.

  Fix: `if (env.version >= 2 || st.sourceUrl !== undefined)`. v2/v3/v4
  all require sourceUrl; v1 stays exempt; v1 with a stray sourceUrl
  is still validated (defensive). Comment block updated to cite the
  Codex review item and explain the corrected version semantic.
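The corrected version gate reduces to a one-line predicate, sketched here with a hypothetical name (the real check lives inline in `server/_shared/brief-render.js:342`):

```javascript
// Sketch of the P2 fix: sourceUrl is required from v2 onward; a v1
// envelope is exempt unless it carries a stray sourceUrl, which is
// then still validated defensively.
function mustValidateSourceUrl(envVersion, story) {
  return envVersion >= 2 || story.sourceUrl !== undefined;
}
```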

Tests:
- tests/brief-magazine-render.test.mjs: +3 P2 regression tests
  (v3 missing sourceUrl rejects; v2-shape missing sourceUrl rejects;
  v3 valid-sourceUrl positive control)
- tests/brief-composer-rule-dedup.test.mjs: +4 P1 source-text guards
  (canonical filter gated on briefForUser; composeMissUsers dedup;
  console.warn shape; docblock cites Codex PR + names failure modes)
- 263/263 pass in targeted suites; 7928/7928 in full test:data
  (was 7922 pre-fix; +6 net new tests landed)
- typecheck + typecheck:api clean

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U1+U2 (review iteration)

* fix(digest): address Greptile PR koala73#3614 P2 inline review comments

Two minor inline findings from the Greptile review:

1. scripts/seed-digest-notifications.mjs:1804 — duplicate Map lookup
   `const brief = briefByUser.get(rule.userId)` re-fetched the same
   key that `briefForUser` (added in the Codex P1 fix earlier in the
   loop) already carries. Reuses briefForUser; saves a Map lookup
   per rule iteration and makes the relationship explicit.

2. tests/brief-from-digest-stories.test.mjs — `projectDigestEmitClusterId`
   helper diverged from live `shared/brief-filter.js` on the level-3
   fallback. Live filter has three tiers:
     1. mergedHashes[0]      — canonical materializeCluster path
     2. hash                  — back-compat for non-clustered producers
     3. url:${sourceUrl}      — last-ditch (news:insights ingestion etc)
   Pre-fix the test helper threw on level-3 ("test should never reach
   this"), leaving the url:${sourceUrl} branch structurally untested by
   the U7 invariant. If a future producer triggers level-3 in production,
   the U7 invariant would not catch a missed case. Now the helper mirrors
   all three tiers, with two new tests:

   - "level-3 fallback: digest story with only sourceUrl returns
     url:<sourceUrl>" — positive control for the third-tier path
   - "source preference order: mergedHashes[0] beats hash beats sourceUrl"
     — locks the precedence; if it ever flips, multi-story clusters
     shatter back into per-story clusterIds and the delivered-log key
     shape explodes.

   Updated the docblock + "test should never reach this" comment to
   reflect the now-three-tier shape and cite Codex PR koala73#3614 P2.

   Updated the existing error-path test docstring to clarify the story
   shape uses `link` (not `sourceUrl`) so all three sources are absent.

Greptile's third inline finding (silent send skip on briefByUser miss
at line 1699) is the same issue Codex called P1; already addressed in
a6de2c7 — comment already posted on the PR.

Tests:
- 265/265 in targeted suites (was 263; +2 new fallback-precedence tests)
- 7930/7930 in full test:data (was 7928)
- typecheck clean

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U2+U7 (review iteration)
…s — Sprint 1 Phase 2/3 (U4+U5+U6+U8) (koala73#3617)

* feat(digest): per-channel/per-cluster delivered-log writer + U7 prod-gap fix (U4)

Builds the Sprint 1 / Phase 2 substrate that U5 (cooldown decision module) and
U6 (replay harness) consume. Subscriber-visible: nothing changes — all keys
are additive; the existing digest:last-sent:v1:{user}:{variant} cron-isDue
gate is untouched.

What ships:

- scripts/lib/digest-delivered-log.mjs (new, 257 lines):
  writeDeliveredEntry({userId, channel, ruleId, clusterId, sentAt,
  sourceCount, severity}) → tri-state {written, conflicts, errors}.
  Key shape: digest:sent:v1:${userId}:${channel}:${ruleId}:${clusterId}
  — every discriminator explicit, no OR-fallback collapse per
  skill_cache_key_or_fallback_collapses_input_shapes. SET NX EX via
  JSON-body pipeline form per feedback_upstash_rest_set_ex_path_not_query.
  TTL 30d ± 0-3d jitter (uniform, dependency-injectable randomFn for
  deterministic tests; clamped at 0.9999999 against the rand=()=>1
  injection foot-gun). Trust SET NX boolean — no write-then-reread per
  feedback_upstash_write_reread_race_in_handler. ALLOWED_CHANNELS frozen
  set of {email, telegram, slack, discord, webhook}.
  aggregateResults(results) collapses N tri-state results into one
  summary for the per-rule log line.

- scripts/clear-delivered-entry.mjs (new, 288 lines):
  Operator one-shot CLI primitive. --user, --slot, --cluster, --reason
  ALL required (no --reason → exit 1, no Upstash connection). --channel
  and --rule paired (both or neither). With both: targets one specific
  key. Without either: SCAN+DEL all matching rows. Per-deletion audit
  log includes --reason + ISO timestamp. Exit codes: 0 ok/no-op,
  1 arg validation, 2 transport failure → operator retries.

- scripts/seed-digest-notifications.mjs (+217 lines):
  Cron integration. Per-channel writeDeliveredEntry call inside the
  send-success branch (sequential await, bounded ≤12 clusters × ≤5
  channels per user). Tri-state aggregation across all writes for
  this user-rule send → one [digest] U4 delivered-log summary line.
  console.warn on errors > 0 (Sentry breadcrumb); console.log
  otherwise. Defensive empty-clusterId branch warns + skips the write.
  ruleId encoded as ${variant}:${lang}:${sensitivity} so audit can
  reconstruct the rule definition without a cross-Convex lookup.

  ALSO fixes the U7 production gap (Codex/Greptile-flagged finding):
  formatDigest/formatDigestHtml now consume brief.envelope.data.stories
  (post-compose, post-filter, capped at MAX_STORIES_PER_USER=12) via a
  briefStoriesToFormatterShape compatibility shim, NOT the raw stories
  pool from buildDigest (capped at DIGEST_MAX_ITEMS=30). Without this
  swap, the email body could surface clusterIds the brief envelope
  omitted (the 18-30 stories the cap dropped), orphaning their
  delivered-log keys from the magazine side and breaking the U7
  invariant on the live send path. Compose-miss fallback (briefForUser
  undefined) continues to consume raw stories — accepted U7 degradation
  vs silent suppression for that one tick. The "Sent N stories" log
  now reports formatterStories.length (post-cap), matching what the
  user actually received.

- Dockerfile.digest-notifications (+9 lines): COPY scripts/clear-delivered-entry.mjs.
  The writer module is auto-covered by the existing recursive scripts/lib/ COPY.
  U8 will add the BFS-based static-guard test for this Dockerfile.

- tests/digest-delivered-log.test.mjs (new, 660 lines, 40 tests):
  12 describe blocks. Writer: key-shape validation (every discriminator
  required, empty/non-string rejected before pipeline call), TTL distribution
  (100-sample uniform spread bounded [30d, 30d+3d]), happy path (OK→written:1),
  idempotency (null→conflicts:1), error mapping (5xx→errors:1, malformed
  shape→errors:1, throw→errors:1), aggregation. Clear-script: arg parsing
  (required-arg validation, paired --channel/--rule check, unknown-flag
  rejection), buildSingleKey + buildScanPattern shape, runClear single-key
  + sweep modes. Mock pipeline records call args for tri-state contract testing.

- tests/brief-from-digest-stories.test.mjs (+80 lines):
  U7 production-gap source-text guard describe block (4 tests).
  Asserts the live cron's send loop reads from formatterStories
  derived from brief.envelope.data.stories via briefStoriesToFormatterShape,
  not raw stories. Pattern mirrors the U2 source-text guard precedent
  per the plan's documented test-harness limitation (no full-mock
  Upstash + Convex + Resend harness available).
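The delivered-log key shape and jittered TTL described above can be sketched as follows. Helper names are assumptions; the key template, channel set, TTL bounds, and the `rand=()=>1` clamp follow the writer description:

```javascript
// Sketch of the U4 delivered-log key builder and jittered TTL.
const ALLOWED_CHANNELS = Object.freeze(
  new Set(['email', 'telegram', 'slack', 'discord', 'webhook']),
);
const DAY_S = 24 * 3600;

function deliveredKey({ userId, channel, ruleId, clusterId }) {
  // Every discriminator explicit and non-empty — no OR-fallback collapse.
  for (const [name, v] of Object.entries({ userId, channel, ruleId, clusterId })) {
    if (typeof v !== 'string' || v === '') throw new Error(`${name} required`);
  }
  if (!ALLOWED_CHANNELS.has(channel)) throw new Error(`unknown channel: ${channel}`);
  return `digest:sent:v1:${userId}:${channel}:${ruleId}:${clusterId}`;
}

function deliveredTtlSeconds(randomFn = Math.random) {
  const r = Math.min(randomFn(), 0.9999999); // clamp the rand=()=>1 foot-gun
  return 30 * DAY_S + Math.floor(r * 3 * DAY_S); // 30d + [0, 3d) jitter
}
```

The real writer pairs this with SET NX EX via the Upstash pipeline and trusts the SET NX boolean rather than re-reading.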

Tests: 306/306 in targeted suites. typecheck + typecheck:api clean.

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U4

* feat(digest): cooldown decision module + shadow logger (U5)

Builds the Sprint 1 / Phase 3 substrate: pure cooldown-decision module,
fail-closed-on-typo kill-switch parser, and a shadow logger that emits one
summary line per user-rule send. Subscriber-visible: nothing changes — the
decision is computed and logged but never gates a send. Sprint 2 (post-U6
replay validation) flips the connection to enforcement.

What ships:

- scripts/lib/digest-cooldown-config.mjs (new):
  readCooldownConfig({DIGEST_COOLDOWN_MODE}) → {mode, invalidRaw}.
  Empty/unset → 'shadow'. Exact 'off' → 'off'. Anything else (typo,
  garbage, 'true', '1', 'enforce') → 'shadow' + invalidRaw warn surface.
  Case-folded to lowercase, whitespace-trimmed. 'enforce' is intentionally
  invalid in Sprint 1 — Sprint 2 introduces it once U6 replay validates the
  cooldown table. Treating early 'enforce' as fail-closed-to-shadow
  prevents a silent partial-enforce state where the decision is computed
  but the send-loop integration that gates on it doesn't exist yet. Pattern
  modelled on scripts/lib/brief-dedup.mjs::readOrchestratorConfig per
  feedback_kill_switch_default_on_typo.

- scripts/lib/digest-cooldown-decision.mjs (new):
  classifyStub({sourceDomain, headline, severity}) → {type, classificationMissing}.
  Five-rule type classifier (Sprint 1 stub; Sprint 3 ships final taxonomy):
    1. Source domain usni.org|csis.org|brookings.edu|*.edu|nature.com|sciencemag.org → 'analysis'
    2. *.gov + headline matches /LICENSE NO\.|Final Rule|Notice of/ → 'sanctions-regulatory'
    3. headline matches /\b(beat|miss|tops|exceeds)\s+(forecast|estimate|profit)/i → 'high-single-corporate'
    4. severity-derived: critical→'critical-developing', high→'high-event', medium→'med'
    5. fallback → 'high-event' (conservative) + classificationMissing flag for U6 telemetry
  Order of precedence: Analysis domains beat single-corp regex (a .edu publishing
  "X beats forecast" is still analysis, not earnings).

  evaluateCooldown(input) → null (mode=off) | {decision, reason, cooldownHours,
  evolutionDelta, classifiedType, classificationMissing}. Returning null on
  mode='off' (NOT 'allow with reason=cooldown_disabled') is the load-bearing
  contract per feedback_gate_on_ground_truth_not_configured_state — downstream
  observers gate on `cooldownDecision !== null`, NOT on the configured env.

  Cooldown table:
    critical-developing      4h soft  (allow on +5 sources, new fact, tier change)
    critical-sustained      24h hard  (allow on new fact only)
    high-event              18h soft  (allow on +5 sources, new fact, tier change)
    high-single-corporate   48h hard  (allow on tier escalation only — "real follow-up")
    sanctions-regulatory    18h soft
    analysis                 7d hard  (no bypass within window)
    med                     36h soft
  Tier-change has highest precedence among bypasses — strongest editorial signal.

- scripts/lib/digest-cooldown-shadow-log.mjs (new):
  emitCooldownShadowLog({userId, ruleId, slot, decisions}) — one log line
  per user-rule send (not per cluster, not per channel). Aggregates allow/
  suppress counts + reason histogram + classificationMissing count via
  aggregateCooldownDecisions. Promotes to console.warn when any decision
  carried classificationMissing=true (real signal for Sprint 3's classifier
  work). Skipped entirely when decisions array is empty (mode='off' OR no
  brief envelope OR all clusters missing clusterId).

- scripts/seed-digest-notifications.mjs (+~190 lines):
  Resolves DIGEST_COOLDOWN_MODE once per cron tick (line 1776) with loud
  invalidRaw warn at startup. Per-cluster/per-channel cooldown evaluation
  block (line 1995-2087): GETs each U4 delivered-log row, builds decision
  input from BriefStory + last-delivered JSON, calls evaluateCooldown,
  collects decisions. Post-U4-summary shadow-log emit (line 2231) — runs
  even on no-channel-success ticks (the operator-visible cases where
  shadow telemetry matters most). Send loop continues unchanged.

- tests/digest-cooldown-config.test.mjs (new, 21 tests):
  Default + valid modes, case-folding, fail-closed-to-shadow on typo
  (including the intentionally-invalid 'enforce'), purity contract.

- tests/digest-cooldown-decision.test.mjs (new, 40 tests):
  classifyStub coverage for all five rules including order-of-precedence
  (analysis domain beats single-corp regex), false-positive guards, the
  known 'beats'/'misses' false-negative locked as a regression for
  Sprint 3's broader classifier. evaluateCooldown coverage for mode='off'
  null contract, no-prior-delivery, within-floor suppression, evolution
  bypasses (+5 sources, severity tier change with precedence), hard-floor
  no-bypass (analysis 7d, single-corp 48h), classification-missing telemetry.
  Cooldown table sanity — every cell shape + plan-table snapshot guard.
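The fail-closed parser and the cooldown table above can be sketched together. Shapes are assumptions (the real modules live in `scripts/lib/`); the bypass sets for soft floors without an explicitly stated list are inferred from the table's annotations:

```javascript
// Sketch of the fail-closed kill-switch parser: anything that is not
// exactly 'off' or 'shadow' (after trim + lowercase) falls back to
// shadow and surfaces the raw value for a startup warn. 'enforce' is
// intentionally invalid until Sprint 2.
function readCooldownConfig(env = {}) {
  const raw = (env.DIGEST_COOLDOWN_MODE ?? '').trim().toLowerCase();
  if (raw === '' || raw === 'shadow') return { mode: 'shadow', invalidRaw: null };
  if (raw === 'off') return { mode: 'off', invalidRaw: null };
  return { mode: 'shadow', invalidRaw: raw }; // fail closed on typo
}

// The cooldown table transcribed as a frozen lookup (structure assumed).
const COOLDOWN_TABLE = Object.freeze({
  'critical-developing':   { hours: 4,   hard: false, bypass: ['plus5Sources', 'newFact', 'tierChange'] },
  'critical-sustained':    { hours: 24,  hard: true,  bypass: ['newFact'] },
  'high-event':            { hours: 18,  hard: false, bypass: ['plus5Sources', 'newFact', 'tierChange'] },
  'high-single-corporate': { hours: 48,  hard: true,  bypass: ['tierChange'] },
  'sanctions-regulatory':  { hours: 18,  hard: false, bypass: ['plus5Sources', 'newFact', 'tierChange'] },
  'analysis':              { hours: 168, hard: true,  bypass: [] }, // 7d, no bypass
  'med':                   { hours: 36,  hard: false, bypass: ['plus5Sources', 'newFact', 'tierChange'] },
});
```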

Tests: 61/61 in U5 suites; 8032/8032 in full test:data (was 7930 pre-U5).
typecheck + typecheck:api clean.

Resumption note: this commit lands the work the U5 subagent started but
couldn't finish — the stream watchdog killed it after writing 3 of 4
modules + the cron integration. Tests written by the orchestrator with
one minor regex-mismatch fix in test expectations (the original test
assumed 'beats'/'misses' would match; the actual Sprint 1 regex anchors
on bare verb forms — both behaviors documented as test cases now).

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U5

* feat(digest): 14-day replay harness for cooldown table validation (U6)

Sprint 1 / Phase 3 substrate: pure aggregator + thin CLI that simulate
U5 cooldown decisions across 14 days of replay-log records, then report
the would-have-suppressed drop-rate distribution. Sprint 2 cannot enable
enforce mode until this report shows a sane distribution against
production data.

What ships:

- scripts/replay-digest-cooldown.mjs (new):
  - aggregateReplayDecisions(records, options) — pure aggregator. Builds
    (ruleId, clusterId) timelines, simulates U4 delivered-log state from
    the first occurrence, runs evaluateCooldown on each subsequent
    occurrence. Returns {totalRecords, totalTimelines, totalDecisions,
    allowDecisions, suppressDecisions, dropRatePct, reasonHistogram,
    typeHistogram, severityHistogram, topSuppressed[], coverage{}}.
    Refuses to run on <minDaysCovered (default 14d) coverage unless
    {allowShortCoverage: true} is passed (test escape hatch only).
  - clusterIdFromRecord(record) — uses mergedHashes[0] when present
    (rep's own hash by U3's contract); falls back to storyHash for
    singletons. Returns '' when both are missing (caller filters).
  - renderMarkdownSummary(aggregate) — produces a paste-ready block
    for docs/internal/digest-brief-improvements.md Sprint 1 outcomes.
  - parseArgs(argv) — --days N (default 14), --rule <ruleId>,
    --allow-short-coverage, --help. Throws on unknown flag /
    non-integer --days / missing --rule value.
  - mainCli() — Upstash REST SCAN over `digest:replay-log:v1:*` keys,
    LRANGE per key, JSON.parse records. Filters by date suffix to
    honour --days. Calls aggregateReplayDecisions, prints markdown,
    writes full JSON to /tmp/replay-digest-cooldown-<date>.json.

  Replay-log key shape from scripts/lib/brief-dedup-replay-log.mjs:
    digest:replay-log:v1:{ruleId}:{YYYY-MM-DD} (Redis list, 30d TTL).
  Per-tick numeric clusterId in the replay-log is NOT stable across
  ticks (per the writer's docblock); the harness ignores it and uses
  mergedHashes[0] (= rep.hash by U3) as the canonical cluster identity.

  Channel assumption: simulated cooldown lookup uses channel='email'.
  Real production has per-channel cooldown rows; the replay-log only
  records the dedup pass (channel-agnostic), so the simulation
  conservatively models "would we have suppressed on email?".
  Multi-channel granularity is a Sprint 3 follow-on.

  Live run requires DIGEST_DEDUP_REPLAY_LOG=1 to have been on for
  the requested window. Phase 0 prereq activated 2026-05-06; earliest
  meaningful run date is 2026-05-20.

- tests/replay-digest-cooldown-harness.test.mjs (new, 27 tests):
  Pure-aggregator coverage: clusterIdFromRecord (mergedHashes
  precedence, storyHash fallback, empty case), coverage gate (empty
  input, <14d, allowShortCoverage escape hatch, exactly-14d boundary),
  single-occurrence skip (no decision to evaluate), within-floor
  suppress, beyond-floor allow, +5 sources evolution bypass, Analysis
  domain hard floor (6d → analysis_7d_hard reason), multi-timeline
  aggregation, coverage report shape, top-suppressed sorting +
  no-suppression empty case. renderMarkdownSummary section coverage.
  parseArgs full surface including throw cases.

Tests: 27/27 pass. typecheck clean. Live CLI is fixture-tested only —
the Upstash IO path runs against the real endpoint when invoked
directly post-2026-05-20.

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U6

* test(digest): Dockerfile.digest-notifications import-closure static guard (U8)

Sprint 1 / Phase 3 closing unit. Mirrors tests/dockerfile-relay-imports.test.mjs
but extends coverage to ALL cross-dir imports (scripts/, shared/, server/_shared/,
api/), since the digest cron's import graph spans all four. The relay guard only
catches missing scripts/ COPYs; this guard catches all four prefixes.

What ships:
- tests/dockerfile-digest-notifications-imports.test.mjs (new, 193 lines):
  BFS from scripts/seed-digest-notifications.mjs through the full import graph.
  For every tracked-prefix file reached, asserts it's covered by either an
  exact-match file COPY or a directory-recursive COPY. Coverage parser handles
  both file-level and directory-level directives. Tracked prefixes: scripts/,
  shared/, server/_shared/, api/. 5 tests: Dockerfile exists; coverage parser
  picks up all four prefixes; entrypoint COPY'd; U4+U5 modules covered; BFS
  closure over the full import graph.

Historical context (per the relay test header): the 2026-04-14 to 2026-04-16
chokepoint-flows 32h outage was caused by a missing COPY line for an
_seed-utils.mjs transitive import. Sprint 1's U4+U5 added 5 new files to the
digest cron's import graph; this guard locks the COPY-list invariant.

Note: strategic doc docs/internal/digest-brief-improvements.md was also
updated with a Sprint 1 outcomes section. The file is gitignored under
docs/internal/ — local-only operator artifact.

Tests: 5/5 pass. typecheck clean.

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U8

* fix(digest): address Codex PR koala73#3617 P1+P1+P2 review findings

Three valid Codex findings on the Phase 2/3 PR — two P1 correctness
bugs (one I introduced; one production-shaping defect that breaks U5's
evolution bypass) and one P2 classifier gap.

P1 — Redis SCAN glob-injection in clear-delivered-entry.mjs
  Pre-fix: parseArgs accepted any string for --user / --cluster /
  --channel / --rule, then buildScanPattern interpolated raw values
  into 'digest:sent:v1:${user}:*:*:${cluster}'. Passing --cluster '*'
  or a value containing Redis glob chars (* ? [ ] \) broadens the
  pattern beyond the intended single-user-single-cluster scope; the
  followup DEL loop then wipes far more rows than the operator
  intended.

  Fix: added REDIS_GLOB_CHARS = /[*?[\]\\]/ + SCAN_KEY_FLAGS gate in
  parseArgs. Any flag whose value reaches the SCAN/DEL pattern is
  validated; --reason is exempt (audit log only, never reaches Redis).
  9 new guard tests cover *, foo*, ?, [ ], \, --user, --channel,
  --rule injection vectors plus a regression guard for legitimate
  values containing : and -.
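The guard described above can be sketched as follows. The regex and the flag names come from this message; the function shape, its name, and the SCAN_KEY_FLAGS set are illustrative assumptions (and a later round-2 commit below scopes the guard to sweep mode only):

```javascript
// Sketch only: regex and flag names from the commit text; the function
// shape and its integration into parseArgs are assumptions.
const REDIS_GLOB_CHARS = /[*?[\]\\]/;
// Flags whose values reach the SCAN/DEL pattern; --reason is exempt
// (audit log only, never reaches Redis).
const SCAN_KEY_FLAGS = new Set(['user', 'cluster', 'channel', 'rule']);

function validateScanFlag(flag, value) {
  if (SCAN_KEY_FLAGS.has(flag) && REDIS_GLOB_CHARS.test(value)) {
    throw new Error(`--${flag} must not contain Redis glob chars (* ? [ ] \\): ${value}`);
  }
  return value;
}
```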

P1 — source count collapse in seed-digest-notifications.mjs
  Pre-fix: the U4 writer payload and U5 evaluator input both read
  'sourceCount: typeof briefStory?.source === "string" && briefStory.source.length > 0 ? 1 : 0'
  — collapsing real source counts (5, 10, 37+) to 0/1. The BriefStory
  schema only carries a single primary 'source' string; the original
  cluster's full sources[] array is not preserved in the envelope.

  Consequence: U5's '+5 sources within floor' evolution bypass cannot
  trigger in production: with counts collapsed to 0/1, the delta
  between deliveries is always 0 or 1, never ≥5. Today's shadow rows
  seed bad history that Sprint 2's
  enforce mode would inherit, leading to over-suppression of stories
  that should evolve through.

  Fix: build a sourceCountByClusterId Map once per send from the raw
  clustered 'stories' pool (post-buildDigest, pre-filterTopStories)
  where sources[] is still attached. Match by cluster identity:
  mergedHashes[0] when present (rep's own hash by U3's contract),
  else the story's own hash (singletons). Both sites (U4 writer
  payload + U5 evaluator input) now read from the Map. O(1) lookup
  per cluster iteration. Source-text guard test asserts both sites
  consume the Map and the old 0/1 collapse pattern is gone.
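A minimal sketch of the per-send Map build, assuming a story shape with hash, mergedHashes, and sources fields (the function name is illustrative, not the actual code):

```javascript
// Sketch: cluster identity is mergedHashes[0] (rep's own hash) when
// present, else the story's own hash for singletons.
function buildSourceCountByClusterId(stories) {
  const map = new Map();
  for (const story of stories) {
    const clusterId = story.mergedHashes?.[0] ?? story.hash;
    // Real source counts (5, 10, 37+) are preserved, not collapsed to 0/1.
    const count = Array.isArray(story.sources) ? story.sources.length : 0;
    map.set(clusterId, count);
  }
  return map;
}
```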

P2 — Analysis domain classifier missed www-prefixed and subdomain hosts
  Pre-fix: Rule 1 of classifyStub did 'ANALYSIS_DOMAINS.includes(host)'
  — exact match only. Real publication URLs typically resolve to
  hosts like 'www.usni.org', 'www.nature.com', 'editorial.csis.org';
  these all fell through to the severity-derived fallback (high-event
  18h floor) instead of the analysis 7d hard floor. That's silent
  shadow-mode under-classification today, and would be silent
  under-suppression once Sprint 2 flips enforce mode on.

  Fix: stripWwwPrefix(sourceDomain) helper + match three host shapes:
    1. exact: usni.org → analysis
    2. www-prefixed: www.usni.org → strip + exact match
    3. subdomain: editorial.usni.org → endsWith('.usni.org') match
  False-positive guard: notmyusni.org stays a miss (suffix match uses
  '.${domain}' with the dot separator, not bare suffix). Tests cover
  www-prefix, subdomain, case-folding, and the false-positive guard.
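The three host shapes reduce to a small pure function; this is a sketch with an illustrative ANALYSIS_DOMAINS subset, not the real list:

```javascript
// Illustrative subset of the analysis-domain list.
const ANALYSIS_DOMAINS = ['usni.org', 'nature.com', 'csis.org'];

function stripWwwPrefix(host) {
  const lower = String(host).toLowerCase();
  return lower.startsWith('www.') ? lower.slice(4) : lower;
}

function isAnalysisHost(sourceDomain) {
  const host = stripWwwPrefix(sourceDomain);
  return ANALYSIS_DOMAINS.some(
    // Exact match (covers www-prefixed hosts after stripping) or dotted
    // subdomain match; the '.' separator keeps notmyusni.org a miss.
    (domain) => host === domain || host.endsWith(`.${domain}`)
  );
}
```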

Tests: 154/154 pass in targeted suites; 8081/8081 in full test:data
(was 8064 pre-fix; +17 net new tests across all three findings).
typecheck clean.

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U4+U5 (review iteration)

* fix(digest): address Codex PR koala73#3617 second-round P1+P2+P2 review findings

P1 — Replay-log shape mismatch with U6 harness expectations
  scripts/lib/brief-dedup-replay-log.mjs + scripts/seed-digest-notifications.mjs:552
  scripts/replay-digest-cooldown.mjs

  Pre-fix: my U6 harness expected records with mergedHashes / headline /
  sourceUrl / hydrated sources. The actual writer emitted title / link /
  numeric per-tick clusterId / NO mergedHashes / replay-log written
  BEFORE source hydration so sources were always []. Result: U6
  metrics structurally missed source-count evolution (sources delta is
  always 0), misclassified analysis/corp/regulatory stories (no
  headline/sourceUrl for the classifier), and split clusters by
  storyHash.

  Fix:
  - Writer bumped v=1 → v=2. Every record now carries:
      repHash       — canonical stable cluster identity (rep's own hash;
                      rep AND non-rep records both carry it via
                      repHashByStoryHash lookup). U6 collapses by this.
      mergedHashes  — full set, set on rep records only (non-reps get null).
      headline      — alias for title (matches BriefStory + U5 classifier).
      sourceUrl     — alias for link.
    Legacy fields (title, link) preserved for v1 readers still in TTL.

  - Cron pre-hydrates sources on dedupedAll BEFORE writeReplayLog
    (single SMEMBERS pipeline, ~30 commands per tick). top items are
    references to the same objects, so the post-cap hydration block
    that lived later is now redundant and removed.

  - U6 harness: clusterIdFromRecord prefers repHash (v2) over
    mergedHashes[0] (v2 reps) over storyHash (v1 fallback). Added
    recordHeadline/recordSourceUrl helpers to read v2 names with v1
    fallbacks (the 30-day TTL window means v1 records persist for 14+
    days after the v2 cutover).
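The resulting cluster-identity fallback chain can be sketched as one function (field names from this message; the exact implementation is assumed):

```javascript
// v2 repHash (all records) > v2 mergedHashes[0] (rep records only)
// > v1 storyHash fallback; '' when nothing usable (caller filters).
function clusterIdFromRecord(record) {
  if (record.repHash) return record.repHash;
  if (record.mergedHashes?.length) return record.mergedHashes[0];
  return record.storyHash ?? '';
}
```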

P2 — high-single-corporate downgrade should NOT bypass 48h hard floor
  scripts/lib/digest-cooldown-decision.mjs

  Pre-fix: allowTierChange=true permitted ANY tier change including
  HIGH→MEDIUM downgrades. The table comment documents the bypass as
  'real follow-up event = tier escalation', but the code didn't
  enforce escalation-only. A downgrade earnings repeat inside 48h
  returned allow / severity_tier_change — editorial noise, not a
  follow-up.

  Fix: tierChangeMode: 'escalation-only' | 'any' (default 'any').
  high-single-corporate uses 'escalation-only' so only currentTierRank
  > lastTierRank passes the bypass. Other classes retain symmetric
  tier-change (a critical→high de-escalation IS editorial signal —
  'the situation cooled' is news worth re-airing).
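The escalation-only gate can be sketched as follows; the tier names and rank ordering are assumptions for illustration:

```javascript
// Assumed tier ranking; only the relative order matters here.
const TIER_RANK = { low: 0, medium: 1, high: 2, critical: 3 };

function tierChangeBypassFires(lastTier, currentTier, tierChangeMode = 'any') {
  const last = TIER_RANK[lastTier];
  const current = TIER_RANK[currentTier];
  if (last === undefined || current === undefined || last === current) return false;
  // 'escalation-only' (high-single-corporate): only upgrades pass.
  if (tierChangeMode === 'escalation-only') return current > last;
  // 'any': symmetric — a de-escalation is editorial signal too.
  return true;
}
```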

P2 — clear-delivered-entry exact-DEL mode must accept glob chars
  scripts/clear-delivered-entry.mjs

  Pre-fix: parseArgs rejected *, ?, [, ], \\ in --user / --cluster
  unconditionally. But legitimate clusterIds can be the level-3
  fallback url:${sourceUrl} (shared/brief-filter.js:300), and real
  URLs commonly contain ? for query strings. Rejecting these in
  exact-DEL mode would make those rows unrecoverable via this primitive.

  Fix: glob-char guard is sweep-mode-only (no --channel + no --rule
  → SCAN with wildcard pattern). Exact-DEL mode (both --channel +
  --rule supplied) accepts glob chars because Redis treats DEL args
  as exact strings, not patterns. Error message guides operators
  to switch to exact-DEL mode for legitimate URL-fallback clusterIds.

Tests: 162/162 in targeted suites; 8094/8094 in full test:data
(was 8081 pre-fix; +13 net new). typecheck clean.

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U4+U5+U6 (review iteration round 2)

* fix(digest): address Codex PR koala73#3617 round-3 P1+P1 review findings

Two more P1 findings on the round-2 fixes — both real and serious.

P1 — Source hydration didn't reach replay records (object-identity break)
  scripts/lib/brief-dedup-replay-log.mjs (writer)

  Pre-fix: I hydrated sources on dedupedAll[i].sources before
  writeReplayLog(), but writeReplayLog iterates the input `stories`
  array — and materializeCluster() in brief-dedup-jaccard.mjs returns
  COPIED rep objects, so dedupedAll[i] is a different object reference
  than stories[i] for the same hash. Mutating dedupedAll[i].sources
  never reached the writer's iteration over input stories. Result: the
  v2 writer still emitted sources: [] for every record, U6
  source-count evolution remained blind, and Sprint 2 enforce-mode
  would have shipped against a meaningless drop-rate report.

  Fix: writer builds a sourcesByRepHash Map from the reps array (which
  IS the post-hydration source — pre-hydration mutates dedupedAll, and
  reps === dedupedAll at writeReplayLog call time). Each record reads
  sources from sourcesByRepHash.get(repHash), with fallback to the
  input story's sources for fixture compatibility. Non-rep records
  inherit the rep's sources (cluster source-count identity is uniform
  across members — the rep is the canonical view).

P1 — Multi-record-per-tick produced false 0-hour repeat suppressions
  scripts/replay-digest-cooldown.mjs (harness)

  Pre-fix: the writer emits ONE record per input story (rep + each
  non-rep cluster member), so a 2-story cluster in one tick produces
  2 records at the same tsMs. The harness grouped by repHash but
  treated each record as a separate timeline occurrence — the second
  record (same tsMs) read the first as lastDeliveredAt and produced a
  false 0-hour repeat suppression. Every multi-member cluster doubled
  its suppression count in the report.

  Fix: collapse to ONE observation per (ruleId, repHash, tsMs) BEFORE
  building timelines. Prefer rep records over non-rep when both exist
  for a tick (the rep carries the canonical headline + sourceUrl +
  sources; non-reps may have nulled-out fields under the v2 writer's
  rep-only mergedHashes contract). Re-sort timeline records by tsMs
  after collapse for defensive iteration ordering.
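The collapse reduces to a small pure function (record fields per this message; the isRep flag is an assumption standing in for however the harness distinguishes rep records):

```javascript
// Collapse to one observation per (ruleId, repHash, tsMs), preferring
// rep records (canonical headline/sourceUrl/sources) on ties, then
// re-sort by tsMs for defensive iteration ordering.
function collapseObservations(records) {
  const byKey = new Map();
  for (const record of records) {
    const key = `${record.ruleId}\u0000${record.repHash}\u0000${record.tsMs}`;
    const existing = byKey.get(key);
    if (!existing || (record.isRep && !existing.isRep)) byKey.set(key, record);
  }
  return [...byKey.values()].sort((a, b) => a.tsMs - b.tsMs);
}
```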

Tests: 167/167 in targeted suites (+5 net new). 8099/8099 in full
test:data (was 8094 pre-fix). typecheck clean.

Three new regression tests cover the collapse path:
  - multi-member cluster in one tick → 1 observation, 0 false
    repeats (the original bug)
  - genuine multi-tick re-air still simulates correctly after collapse
  - rep record wins the tie-break — proven via classifier routing:
    if the non-rep's sourceUrl had won, an analysis-domain rep would
    have been routed to high-event (18h) instead of analysis (7d),
    inverting the suppress decision

Two new regression tests cover the writer's source-hydration fix:
  - sources come from the rep object (rep + non-reps share the set)
  - falls back to story.sources when rep has none (fixture compat)

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U4+U5+U6 (review iteration round 3)

* fix(digest): address Codex PR koala73#3617 round-4 P1+P2 review findings

P1 — Delivered-log rows never refreshed after first send (NX → SET)
  scripts/lib/digest-delivered-log.mjs:193

  Pre-fix: the writer used SET NX EX. Once a row existed for
  (user, channel, rule, cluster), every subsequent write was a no-op
  conflict. After a high-event re-air was ALLOWED at 19h post-floor,
  the Redis row still pointed to T0 — so the next re-air at 20h read
  lastDeliveredAt=T0 and saw "20h beyond 18h floor → allow", instead
  of "1h since last delivery → suppress". Production shadow telemetry
  diverged from U6 replay (which correctly updates synthetic state on
  allow), and Sprint 2 enforce-mode would have inherited the bug as
  under-suppression of high-rate clusters.

  Fix: switched to plain SET (no NX). Every successful send
  overwrites the row with new {sentAt, sourceCount, severity}. The
  30d-jitter TTL is re-applied on each write, so a cluster re-airing
  every few days never permanently expires. conflicts counter
  preserved at 0 in the return shape for back-compat with the U4
  aggregator.
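The behavioral difference can be illustrated with an in-memory stand-in (this is not the real Redis client; it only models the NX conflict semantics):

```javascript
// In-memory stand-in for the two write modes.
function makeFakeRedis() {
  const rows = new Map();
  return {
    // SET key value NX: no-op when the key already exists — the row
    // keeps its ORIGINAL value (the pre-fix behavior).
    setNx(key, value) {
      if (rows.has(key)) return null;
      rows.set(key, value);
      return 'OK';
    },
    // Plain SET: every write overwrites (and, in real Redis, re-applies
    // the TTL on each write).
    set(key, value) {
      rows.set(key, value);
      return 'OK';
    },
    get: (key) => rows.get(key),
  };
}
```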

P2 — Compose-miss fallback sends digests without U4/U5 coverage
  scripts/seed-digest-notifications.mjs (~line 1989, 2078, 2231)

  Pre-fix: the U5 cooldown loop and U4 writer were both gated on
  briefEnvelopeStories.length > 0. Under compose-miss
  (BRIEF_SIGNING_SECRET unset, brief compose disabled, per-user
  compose error), the formatter fell back to raw stories and the
  digest WAS sent — but U4/U5 skipped those clusters entirely.
  Multi-tick compose outages accumulated un-tracked deliveries;
  when compose recovered, the cooldown saw "no prior delivery" and
  re-aired everything.

  Fix: build a unified cooldownIterableStories array right after
  formatterStories. Brief-success branch uses briefEnvelopeStories
  directly. Compose-miss branch synthesizes the same shape from raw
  stories. U5 cooldown loop and U4 writer both iterate the unified
  array. sourceCountByClusterId is keyed on repHash which matches
  both branches' clusterId semantics, so per-cluster source counts
  work identically.

Tests: 226/226 in targeted suites; 8102/8102 in full test:data
(was 8099 pre-fix; +3 net new). typecheck clean.

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U4+U5 round-4

* fix(digest): address Codex PR koala73#3617 round-5 P1 — webhook channel coverage parity

Pre-fix: the webhook channel passed raw `stories` (the full pre-cap pool,
up to DIGEST_MAX_ITEMS=30) to sendWebhook, while every other channel
consumed `formatterStories` (post-cap, post-filter — the same set U4/U5
iterate via cooldownIterableStories). Webhook users were therefore
receiving cards that were never shadow-evaluated and never seeded
delivered-log rows for future cooldown enforcement. The channel-coverage
gap meant Sprint 2 enforce-mode would have under-suppressed for webhook
subscribers specifically.

Fix: pass formatterStories (NOT stories) into sendWebhook. The webhook
payload schema { title, severity, phase, sources } is preserved because
formatterStories already carries those fields — under brief-success via
the briefStoriesToFormatterShape mapping (BriefStory -> raw shape) and
under compose-miss it IS the raw stories array natively.

Effect: webhook payload now exactly matches what U4 stamped + U5
evaluated for that (user, rule, tick). Channel coverage is uniform
across email / Telegram / Slack / Discord / Webhook.

Tests: 59/59 in tests/brief-from-digest-stories; 8103/8103 in full
test:data (was 8102 pre-fix; +1 net new). typecheck clean.

The new regression test asserts both the positive shape
(`sendWebhook(..., formatterStories, briefLead)`) and a
forbidden-pattern guard against the pre-fix raw-stories form, so a
future refactor re-introducing the gap fails loudly.

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U4+U5 round-5

* fix(digest): wire EVOLUTION_NEW_FACT bypass — Greptile PR koala73#3617 P2

Pre-fix: REASON.EVOLUTION_NEW_FACT was exported as part of the stable
wire contract and `allowNewFact: true` was set on COOLDOWN_TABLE cells
for several types (critical-developing, critical-sustained, high-event,
sanctions-regulatory, med). But evaluateCooldown never checked
allowNewFact and never returned EVOLUTION_NEW_FACT — exporting an
unused contract surface that nothing produced.

Fix shipped end-to-end across writer + evaluator + cron + replay
harness:

1. scripts/lib/digest-delivered-log.mjs writeDeliveredEntry now accepts
   an optional `headline` arg. When non-empty it's persisted alongside
   {sentAt, sourceCount, severity}. Empty/missing headline omits the
   field entirely (forward-compat: older readers see no unexpected
   empty-string field).

2. scripts/lib/digest-cooldown-decision.mjs evaluateCooldown reads
   input.lastDeliveredHeadline. When the table cell allows new-fact
   bypass AND both sides are non-empty AND the headlines differ
   (case-insensitive, whitespace-trimmed compare), returns
   decision='allow', reason=EVOLUTION_NEW_FACT. Bypass precedence
   stays consistent: tier change > new fact > source count.

3. scripts/seed-digest-notifications.mjs U5 cooldown evaluator reads
   parsed.headline from the U4 row, passes as lastDeliveredHeadline
   to evaluateCooldown. U4 writer call site passes briefStory.headline
   (canonical in both branches via cooldownIterableStories).

4. scripts/replay-digest-cooldown.mjs synthetic state tracks
   lastDelivered.headline alongside sentAt/sourceCount/severity, passes
   as lastDeliveredHeadline to evaluateCooldown so U6 replay matches
   live behavior under the new-fact bypass path.

Why string-equality (not LLM-diff): Sprint 3's full classifier ships
an LLM-driven fact-diff that replaces this. For Sprint 1 string-
equality is the conservative stub — only fires when the upstream feed
produced a genuinely different headline (rephrased news, not just a
wire-rewording duplicate). False negatives keep suppression conservative;
false positives (typo-edits firing the bypass) are avoided.
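The compare described in item 2 can be sketched as follows (function name assumed; normalization rules per this message):

```javascript
// Case-insensitive, whitespace-trimmed compare; both sides must be
// non-empty, so older rows without the headline field never bypass.
function headlinesDiffer(lastHeadline, currentHeadline) {
  const normalize = (value) => (value ?? '').trim().toLowerCase();
  const last = normalize(lastHeadline);
  const current = normalize(currentHeadline);
  return last !== '' && current !== '' && last !== current;
}
```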

Hard-floor classes (analysis 7d, single-corp 48h) have allowNewFact=false
so the bypass NEVER fires within their windows — preserves the contract
that hard floors don't admit any evolution bypass.

Tests: 206/206 in targeted suites; 8112/8112 in full test:data
(was 8103 pre-fix; +9 net new). typecheck clean.

Six new evaluator tests cover: positive bypass (high-event new
headline), case/whitespace folding (no false positive), null-on-old-row
(no bypass for v4 rows without the field), hard-floor preservation
(analysis + single-corp suppress despite new headline), tier-change
precedence (tier wins over new-fact when both fire). Three new writer
tests cover headline persistence + empty-string omit + missing-arg
back-compat.

Companion note: Greptile's other inline finding (sourceCount 0/1
collapse at line 2060) was already addressed in commit 1bb11f7 from
round-1 — it was reviewed pre-fix. Current code uses
sourceCountByClusterId.get(clusterId) ?? 0; will reply on the PR thread.

Plan: docs/plans/2026-05-06-001-feat-digest-brief-canonical-contract-sprint-1-plan.md U4+U5 round-6
…t flake) (koala73#3618)

The `timeout emits terminal reason BEFORE SIGTERM/SIGKILL grace` test
flaked on PR koala73#3617's post-merge run on main (run #25494120215). The
fixture's first line is:

  process.on('SIGTERM', () => {}); console.log('hung'); setInterval(...);

On a cold/loaded CI runner Node's startup (parse + import + this line's
execution) can exceed the test's 1s timeout. SIGTERM then arrives BEFORE
process.on registered the ignore-handler, so the child dies via Node's
default SIGTERM behaviour and never reaches console.log('hung'). Total
elapsed lands at ~1.1s instead of the expected ~11s (1s timeout + 10s
SIGKILL grace), the `[HANG] hung` assertion fails, and the test never
exercises the SIGKILL escalation it's actually here to validate.

Fix: bump TIMEOUT_MS from 1000 to 3000. Typical Node startup is 50-200ms
and even loaded GitHub-hosted runners shouldn't exceed 1-2s, so 3s gives
comfortable cold-start margin while still exercising the same
timeout → SIGTERM → 10s grace → SIGKILL flow the test validates. The
test's existing 20s elapsedMs cap remains comfortably above the new
worst-case (3s + 10s grace + overhead ~= 14s).

Also relaxed the regex `/timeout after 1s — sending SIGTERM/` to
`/timeout after \d+s — sending SIGTERM/` so a future timeout bump doesn't
require a coordinated regex update — the assertion's purpose is "Failed
line names the timeout-after-N pattern", not the literal N.

Verification: ran tests/bundle-runner.test.mjs 5 times locally, all 9
tests pass each run, no flakes.

The 1s value was a real timing bug, not just a slow runner — it was
flaky because there's no contract that user code in the fixture has run
before the timer fires. The fixture's SIGTERM handler MUST be registered
before SIGTERM arrives for the test's "ignores SIGTERM, must SIGKILL"
contract to be exercised, and the 1s window didn't guarantee that.
…tions + iranEvents) (koala73#3622)

## Symptom

/api/health on 2026-05-08 reported:

  marketImplications: EMPTY records=0 seedAge=78min maxStale=120min
  iranEvents:         EMPTY records=0 seedAge=3837min (~64h) maxStale=20160min (14d)

Both have fresh seed-meta entries (rc=3 for marketImpl, rc=52 for iranEvents
from the last successful runs) but the canonical keys are MISSING from
Upstash. Same shape as BIS PR koala73#3610.

## Root cause

Canonical TTL was only ~1× the cadence — zero drift margin. Per the
"TTL ≥ 3× cron interval" gold standard codified in api/health.js:268-281
and memory `seed-meta-populated-canonical-missing-ttl-cron-match`:

  marketImplications: TTL=75min vs cron=~60min → 1.25× margin
                      Any cron drift or LLM-call slowness kills canonical
                      between ticks. seed-meta TTL is 7 days so it survives.

  iranEvents:         TTL=2 days vs operator-cadence ~weekly → 0.28× margin
                      maxStaleMin: 20160 (14d) is "2× weekly cadence" per
                      the existing comment. Operator went 2.7d between
                      manual seeds (within tolerance per maxStale), but
                      canonical TTL'd out at 2d. seed-meta survived.

## Fix

  marketImplications:  75 * 60 → 180 * 60   (75min → 180min = 3× ~60min cron)
  iranEvents:          172800   → 1209600   (2d → 14d = match maxStaleMin)

Also adds an extensive comment block on each constant explaining WHY
the new value was chosen, so future contributors don't tighten the TTL
back under the
plausible-but-wrong "TTL should match cron interval" intuition.

## Why both in one PR

Same trap family + same 1-line shape. Splitting would create churn for
no diagnostic clarity benefit. If only one PR is desired, the fix lines
themselves are independent and revertable.

## Verification

- typecheck:api clean
- lint clean
- node -c on both files clean
- No tests required for pure-config TTL bumps; seed-meta-populated-canonical-
  missing-ttl-cron-match memory documents the diagnostic recipe (curl
  canonical + curl seed-meta from Upstash) for verifying post-deploy

Once deployed:
- marketImplications: next ~hourly cron writes canonical with new 180min TTL → /api/health flips OK and stays there across normal cron drift
- iranEvents: next manual seed run writes canonical with new 14d TTL → canonical alive for full health-tolerance window

A separate, non-blocking issue: consumerPricesSpread is also EMPTY but for
a different reason — `consumer-prices-core/src/jobs/publish.ts` ran with
state=OK_ZERO (0 retailers scraped within the 2h freshness window). That's
a data-pipeline issue in the consumer-prices-core service, not a TTL trap;
filing separately.

A structural follow-up — static test that scans all WM seeders + bundle
intervalMs and asserts canonical TTL ≥ 3× cron — is being opened as a
sibling PR. That would catch this trap class on every contribution rather
than after the first production failure.
…list + AbortError noise (koala73#3623)

* fix(sentry): map Convex JSON-shape InternalServerError → 503, allowlist api.rainviewer.com, ignore zero-frame AbortError

Three independent triage findings, bundled because they all touch the
same Sentry-classification surfaces:

A. WORLDMONITOR-PG (9ev/7u) + WORLDMONITOR-PH (3ev/3u) — JSON-shape
   InternalServerError misclassified as 'unknown' → generic 500.

   Convex runtime occasionally surfaces
     `{"code":"InternalServerError","message":"Your request couldn't be
     completed. Try again later."}`
   on internal failures. Same retry-with-backoff remediation as the
   already-handled `"code":"ServiceUnavailable"`, so map it through to
   SERVICE_UNAVAILABLE → 503 + Retry-After in the edge handler. Sentry
   `error_shape` classifier gains its own `convex_internal_error` bucket
   so on-call distinguishes runtime-500s from genuine 503s.

B. WORLDMONITOR-QG (1ev/1u) — Failed to fetch (api.rainviewer.com) from
   MapContainer.fetchAndApplyRadar with chrome-extension fetch wrapper.

   The existing host-allowlist gate required a maplibre frame in the
   stack — the maplibre AJAX path was the only originally-known caller.
   `fetchAndApplyRadar` calls fetch directly (no maplibre frame), so
   the gate didn't fire even though the host was a known third-party.
   Added api.rainviewer.com to MAPLIBRE_THIRD_PARTY_TILE_HOSTS and
   dropped the maplibre-frame requirement on the host-allowlist gate.
   The host-set IS the load-bearing safety: api.worldmonitor.app is
   intentionally NOT in the set, so first-party API regressions still
   surface.

C. WORLDMONITOR-QH (1ev/1u) — bare `Uncaught Error: AbortError` zero-frame
   from Convex's auto-Sentry on server-side action timeouts. No
   actionable context; the action retries cleanly. Added anchored
   ignoreErrors entry.

Tests: 4 new cases (3 in user-prefs-convex-error.test.mjs covering
JSON-shape detection + negative-control + structured-data precedence,
1 in user-prefs-sentry-context.test.mts for the new error_shape
bucket). Existing 8118/8118 + 181/181 edge tests still pass.

WORLDMONITOR-QD (4ev/2u, deck.gl pointer-handler crash) was also
resolved this round but deferred to auto-reopen — minified Iie.Fie
symbol in our bundled MapContainer chunk; risk of masking real
first-party MapContainer.ts bugs is too high for the current event
volume.

* review(greptile): rename isMaplibreAjaxFailure → isHostScopedFetchFailure (P3)

The host-allowlist gate stopped being maplibre-specific in this PR (the
maplibre-frame requirement was dropped to support first-party callers
like MapContainer.fetchAndApplyRadar). The variable name and constant
were stale; renamed to match the new scope:
  - MAPLIBRE_THIRD_PARTY_TILE_HOSTS → THIRD_PARTY_FETCH_HOST_ALLOWLIST
  - isMaplibreAjaxFailure → isHostScopedFetchFailure
  - cross-ref comment in the zero-frame block updated

No behavior change — the suppression condition and host-set are identical.

* test(beforesend): mirror THIRD_PARTY_FETCH_HOST_ALLOWLIST rename in source-extraction
…koala73#3625)

* test(seeders): static guard — canonical TTL ≥ 3× bundle cron interval

## Why this exists

The "canonical TTL ≈ cron interval" trap has bitten WorldMonitor at
least 3 times across distinct seeders, with the same operator-facing
symptom every time:

  - PR koala73#3610 (BIS): 12h TTL == 12h cron → canonical TTL'd-out between
    drifted ticks, /api/health reported `bisPolicy: EMPTY records=0`
    while seed-meta showed last-good rc=11.
  - PR koala73#3622 (marketImplications): 75min TTL vs ~60min cron — only
    1.25× margin. Same symptom.
  - PR koala73#3622 (iranEvents): 2d TTL vs ~weekly operator-cadence (0.28×).
    Same symptom on 2026-05-08.

Each fix was tactical (bump that one TTL). This test catches the
pattern STRUCTURALLY at PR-review time, before production failure.

## What it asserts

For every section across `scripts/seed-bundle-*.mjs`:
- Resolves the section's `intervalMs` (handles HOUR/DAY/WEEK constants
  + arithmetic + numeric-separator underscores like `12 * HOUR` or
  `86_400`).
- Reads the corresponding seeder script. Resolves `ttlSeconds:` from
  the runSeed call (handles `const TTL = N;` and `export const X = N;`
  declarations + arithmetic).
- Asserts `ttlSeconds * 1000 >= 3 * intervalMs`.

When canonical TTL is at least 3× the cron interval, normal cron drift
(queue ordering, retry delays, transient failures) cannot open a window
in which the canonical has expired before the next successful write.
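The core invariant reduces to a one-line comparison. This sketch (function name assumed) checks it against the marketImplications numbers from PR koala73#3622, before and after the fix:

```javascript
const SAFETY_FACTOR = 3;

// section: { ttlSeconds, intervalMs } as resolved from the seeder +
// bundle config.
function checkTtlInvariant(section) {
  const ttlMs = section.ttlSeconds * 1000;
  return {
    ok: ttlMs >= SAFETY_FACTOR * section.intervalMs,
    ratio: ttlMs / section.intervalMs,
  };
}
```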

## Allowlist of current violations

Scanning today's main surfaced ~25 sections currently below the 3×
threshold. They're listed in `KNOWN_VIOLATIONS` with their current
ratio so the test is mergeable without coupling to a giant fix-all PR.
The test fails on:

  (a) NEW sections dropping below threshold — must fix or add to
      allowlist with justification (regression caught at PR-review).
  (b) ALLOWLISTED entries no longer violating — must remove the entry,
      otherwise the allowlist drifts (catches "I fixed this but forgot
      to remove from the list").

As future PRs bump TTLs, contributors remove the corresponding
allowlist entry. Goal: empty allowlist.

## Scope

INCLUDES: every bundle section using `runSeed(..., { ttlSeconds: ... })`.

EXCLUDES: non-bundle seeders (manually-triggered like seed-iran-events
or external-cron like seed-forecasts.mjs's MARKET_IMPLICATIONS_TTL).
Those don't have a discoverable cron interval in code; PR koala73#3622 audited
them manually. A follow-up could extend this test by also checking
`ttlSeconds * 60 >= maxStaleMin` (the health-tolerance invariant) for
seeders that aren't in any bundle.

## Resolver design

Two-stage:
1. Try direct safe-eval (digits + arithmetic + preseeded HOUR/DAY/WEEK).
2. If the expression contains identifiers, find each `const|let|var|
   export const NAME = expr` declaration in the same file, resolve
   recursively (cycle-guarded), build a scope, retry safe-eval.

Underscored numeric literals (`7_200`, `86_400`) are stripped before
eval. Function calls, member access (`process.env.X`), and any non-
arithmetic input are rejected.
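A minimal sketch of the first stage, assuming this general shape (the real resolver's internals, and the exact rejection rules, may differ):

```javascript
// Preseeded constants; values illustrative
const PRESEEDED = { HOUR: 3_600_000, DAY: 86_400_000, WEEK: 604_800_000 };

function safeEval(expr, scope = {}) {
  const cleaned = expr.replace(/(\d)_(?=\d)/g, '$1'); // 86_400 -> 86400
  if (cleaned.includes('.')) return null;             // member access rejected
  if (/\w\s*\(/.test(cleaned)) return null;           // function calls rejected
  if (!/^[\w\s+*/()-]+$/.test(cleaned)) return null;  // arithmetic only
  const names = cleaned.match(/[A-Za-z_]\w*/g) ?? [];
  const env = { ...PRESEEDED, ...scope };
  if (!names.every((n) => n in env)) return null;     // unresolved identifier -> SKIP
  return Function(...Object.keys(env), `"use strict"; return (${cleaned});`)(
    ...Object.values(env)
  );
}
```

Stage 2 would populate `scope` from in-file `const`/`export const` declarations and retry.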

When a section's intervalMs OR ttlSeconds is unresolvable, it's logged
as a SKIP (not a failure) — the resolver gap is informational, not a
violation. If the gap covers a real new violation, it'll surface as a
runtime production failure and the resolver gets extended.

## Verification

- 11/11 tests pass on this branch (1 main check + 10 resolver sanity
  tests covering numeric literals, underscored numerics, multiplication,
  preseeded HOUR/DAY/WEEK, in-file const/export const declarations,
  unresolvable-identifier handling, unsafe-input rejection,
  extractBundleSections shape)
- typecheck:api clean
- lint clean

## Future direction

When the allowlist is shrunk to <5 entries, consider tightening
SAFETY_FACTOR from 3 → 4 to give more headroom. Or add a per-entry
SAFETY_FACTOR override for sections where 3× is genuinely overkill
(e.g. annual indicators where 3× would be 3 years).

* fix(test): address Greptile P1+P2 review on PR koala73#3625

Three findings, all valid:

P1 (line 53): __filename is not auto-defined in ESM. Used on line 261
in the hygiene-check error message — would throw ReferenceError exactly
when the hygiene path fires (a contributor fixes a seeder but forgets
to remove the allowlist entry). Now declared via fileURLToPath.

P1 (line 264): KNOWN_VIOLATIONS entries that hit a SKIP path (script
file missing, unresolvable intervalMs, resolver gap on ttlSeconds) were
falsely flagged as "no longer violating," failing the hygiene check
with a misleading message. Now tracked in skippedAllowKeys and excluded
from the hygiene loop — only entries that resolved cleanly + passed the
threshold count as "fixed."

P2 (line 186): blockRe `\\{ ... \\}` non-greedy match cut off at the
first inner `}` for sections containing nested objects (e.g.
`extraHeaders: { ... }`), silently dropping them so a real new
violation could slip past the guard. Replaced with brace-balanced scan
from each `{ label: '...'` anchor — respects string literals, walks
forward until matching `}`.
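The brace-balanced scan is roughly this shape (illustrative sketch, not the PR's exact extractor):

```javascript
// Walk from an opening-brace anchor to its matching close, ignoring
// braces inside string literals
function extractBlock(src, anchorIdx) {
  let depth = 0;
  let inStr = null; // quote char of the current string literal, if any
  for (let i = anchorIdx; i < src.length; i++) {
    const ch = src[i];
    if (inStr) {
      if (ch === '\\') i++;            // skip escaped char
      else if (ch === inStr) inStr = null;
    } else if (ch === "'" || ch === '"' || ch === '`') {
      inStr = ch;
    } else if (ch === '{') {
      depth++;
    } else if (ch === '}') {
      depth--;
      if (depth === 0) return src.slice(anchorIdx, i + 1);
    }
  }
  return null; // unbalanced input
}

// Nested object AND a brace inside a string, both handled
const src = "run({ label: 'x', extraHeaders: { a: '}' } })";
const block = extractBlock(src, src.indexOf('{'));
```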

Two new tests cover the brace-balanced extractor:
- handles sections with nested objects (the P2 trap)
- handles strings containing braces (defensive)

13/13 pass (was 11/11 + 2 new). typecheck:api + lint clean.

* feat(health-readiness): Sprint 1 — content-age probe infra (opt-in contentMeta / STALE_CONTENT)

Implements Sprint 1 of the 2026-05-04 health-readiness plan
(docs/plans/2026-05-04-001-feat-health-readiness-probe-content-age-plan.md).
Adds an opt-in content-age contract that distinguishes seeder-RUN freshness
from CONTENT freshness, surfacing STALE_CONTENT in /api/health when sparse
upstreams (WHO Disease Outbreak News, IEA OPEC reports, central-bank
releases, WB annual indicators) stop publishing while seeder cron stays green.

Backwards compatible: legacy seeders without contentMeta/maxContentAgeMin
are byte-identical in behavior. The opt-in signal is presence of
maxContentAgeMin in the seed-meta and the canonical _seed envelope.

== Envelope chain (parity across all 3 mirrors) ==

- scripts/_seed-envelope-source.mjs — buildEnvelope accepts optional
  newestItemAt / oldestItemAt / maxContentAgeMin trio
- api/_seed-envelope.js — mirror
- server/_shared/seed-envelope.ts — mirror + SeedMeta interface extended
- scripts/verify-seed-envelope-parity.mjs — passes (3/3 exports verified)

== Contract validator ==

scripts/_seed-contract.mjs:
- Adds contentMeta and maxContentAgeMin to OPTIONAL_FIELDS
- Cross-field check: declaring one without the other is a hard fail
  (prevents the silently-disabled-but-looks-opted-in trap from Codex r1 P1d)
- Type checks: contentMeta must be a function; maxContentAgeMin must be
  a positive integer (rejects 0, negatives, non-integer, NaN, Infinity,
  strings, null)
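The cross-field and type rules condense to something like this (assumed shape; the real validator emits richer diagnostics):

```javascript
// Returns null when valid, otherwise a human-readable violation
function validateContentAgeOpts({ contentMeta, maxContentAgeMin }) {
  const hasMeta = contentMeta !== undefined;
  const hasBudget = maxContentAgeMin !== undefined;
  if (hasMeta !== hasBudget) {
    // the silently-disabled-but-looks-opted-in trap
    return 'CONTRACT VIOLATION: contentMeta and maxContentAgeMin must be declared together';
  }
  if (!hasMeta) return null; // legacy seeder, nothing to check
  if (typeof contentMeta !== 'function') return 'contentMeta must be a function';
  // Number.isInteger rejects NaN, Infinity, strings, null, non-integers
  if (!Number.isInteger(maxContentAgeMin) || maxContentAgeMin <= 0) {
    return 'maxContentAgeMin must be a positive integer';
  }
  return null;
}
```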

== runSeed wiring ==

scripts/_seed-utils.mjs:
- Opts destructure adds contentMeta + maxContentAgeMin
- Up-front config validation (CONTRACT VIOLATION exits 1 at config time,
  not at write time)
- ORDER CONTRACT: contentMeta(rawData) runs BEFORE publishTransform(rawData)
  so seeders can attach pre-publish helper fields (e.g.
  _publishedAtIsSynthetic) for timestamp computation, then strip them
  via publishTransform — the helpers never leak into the canonical key
  or client responses (Codex round 3 P2)
- contentMeta returning null OR throwing both produce newestItemAt: null
  in the envelope — health classifier reads as STALE_CONTENT
- Future-dated/zero/non-finite timestamps validated at runSeed boundary
- Content trio propagates into envelopeMeta on success path AND
  through readCanonicalEnvelopeMeta into the validate-fail mirror branch
  (Codex round 1 P0b — without this, STALE_CONTENT signal vanishes
  exactly when last-good-with-stale-content data is being served, the
  worst possible time for the alarm to disappear)
- writeFreshnessMetadata accepts a contentAge param
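The ORDER CONTRACT and the null/throw equivalence reduce to roughly this control flow (field names match the commit; the control flow is a hedged sketch, runSeed's real wiring is far more involved):

```javascript
function seedOnce(rawData, { contentMeta, publishTransform }) {
  let newestItemAt = null;
  try {
    // ORDER CONTRACT: contentMeta runs FIRST, on raw data WITH helpers
    const meta = contentMeta ? contentMeta(rawData) : null;
    newestItemAt = meta ? meta.newestItemAt : null;
  } catch {
    newestItemAt = null; // throwing behaves like returning null
  }
  // publishTransform then strips helpers before the canonical write
  const published = publishTransform ? publishTransform(rawData) : rawData;
  return { published, newestItemAt };
}

const raw = { items: [{ publishedAt: 1000, _publishedAtIsSynthetic: false }] };
const out = seedOnce(raw, {
  contentMeta: (d) => {
    const real = d.items.filter((i) => !i._publishedAtIsSynthetic);
    return real.length
      ? { newestItemAt: Math.max(...real.map((i) => i.publishedAt)) }
      : null;
  },
  // rest-property destructuring drops the helper field from each item
  publishTransform: (d) => ({
    items: d.items.map(({ _publishedAtIsSynthetic, ...rest }) => rest),
  }),
});
```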

== Health classifier ==

api/health.js:
- readSeedMeta surfaces contentAge: { newestItemAt, oldestItemAt,
  maxContentAgeMin, contentAgeMin (derived in minutes), contentStale (derived) }
  when seed-meta carries the trio. null for legacy seeders.
- classifyKey: NEW STALE_CONTENT branch slotted between COVERAGE_PARTIAL
  and the final OK fall-through. NO existing branches reordered or
  modified. Existing precedence preserved: REDIS_PARTIAL > SEED_ERROR >
  OK_CASCADE > EMPTY_ON_DEMAND > EMPTY > EMPTY_DATA > STALE_SEED >
  COVERAGE_PARTIAL > STALE_CONTENT > OK
- STATUS_COUNTS.STALE_CONTENT = 'warn' (operator can't fix upstream
  cadence; bucket as warn to drive degraded, not critical)
- Per-key entry surfaces contentAgeMin + maxContentAgeMin when seeder
  opted in (otherwise absent — legacy entries unchanged)
- problemKeys collector flows STALE_CONTENT through automatically (it
  filters only OK / OK_CASCADE / EMPTY_ON_DEMAND)
- Test-only __testing__ export for scoped unit tests
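A condensed precedence sketch (several statuses from the real chain omitted for brevity; field names are assumptions, not api/health.js's real internals):

```javascript
function classifyKey(entry) {
  if (entry.redisPartial) return 'REDIS_PARTIAL';
  if (entry.seedError) return 'SEED_ERROR';
  if (entry.empty) return 'EMPTY';
  if (entry.staleSeed) return 'STALE_SEED';
  if (entry.coveragePartial) return 'COVERAGE_PARTIAL';
  // NEW branch: slotted after every existing status, before OK fall-through
  if (entry.contentAge && entry.contentAge.contentStale) return 'STALE_CONTENT';
  return 'OK'; // legacy seeders (contentAge null) land here unchanged
}
```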

== Tests ==

- tests/seed-utils-empty-data-failure.test.mjs (extended): +2 cases
  - validate-fail mirror PRESERVES newestItemAt/oldestItemAt/maxContentAgeMin
  - legacy seeders without contentAge in canonical envelope keep legacy
    seed-meta shape (anti-regression for Codex round 1 P0b)
- tests/seed-content-age-contract.test.mjs (NEW): 10 cases
  - contract enforcement (4): half-config (both ways), bad budget types,
    non-function contentMeta
  - ordering (2): contentMeta sees pre-publish helpers, publishTransform
    strips them, canonical payload helper-free
  - behavior (3): null / throwing / valid timestamps
  - anti-regression (1): legacy seeders unaffected
- tests/health-content-age.test.mjs (NEW): 16 cases
  - readSeedMeta content-age surface (4): trio present, legacy null,
    contentStale boundary, null newestItemAt
  - classifyKey STALE_CONTENT branch (3): fires correctly, fresh→OK,
    legacy→OK
  - precedence vs every existing status (5): STALE_SEED, REDIS_PARTIAL,
    SEED_ERROR, EMPTY, COVERAGE_PARTIAL all outrank STALE_CONTENT
  - STATUS_COUNTS bucket (2): STALE_CONTENT=warn, anti-regression for
    existing buckets
  - per-key response shape (2): contentAgeMin+maxContentAgeMin surfaced,
    null contentAgeMin surfaced explicitly

Test totals: 79/79 pass across the seed-envelope, seed-contract,
seed-utils, content-age, and health-content-age suites. Envelope parity
verifier passes. typecheck + typecheck:api both clean.

== Net diff ==
9 files changed, 311 prod LOC + 250 test LOC.

== What's next (Sprint 2) ==
Migrate disease-outbreaks as the proof-of-concept consumer. Pilot
maxContentAgeMin=9 days (chosen so the 2026-05-04 11d-old incident
would have tripped the new alarm). Tag synthetic timestamps in
WHO/RSS/TGH parsers; strip helpers via publishTransform. See plan
Sprint 2 section.

* fix: address Greptile PR koala73#3596 P1 + P2 review findings

P1 — `_seed-utils.mjs:1278` — Content-age silently discarded for
non-contract-mode seeders.

  Pre-fix: the seed-meta mirror gated on `(contentAgeOptedIn && envelopeMeta)`.
  But `envelopeMeta` is constructed only when `contractMode === true` (when the
  seeder declared `recordCount`/`declareRecords`). Every seeder that opted into
  content-age via `contentMeta` callback but had NOT yet migrated to contract
  mode silently dropped the content-age trio from its seed-meta — defeating the
  opt-in for the majority of the cohort. The health classifier read no
  `maxContentAgeMin` and skipped STALE_CONTENT entirely for those keys.

  Fix: read content-age from the local `contentNewestAt`/`contentOldestAt`/
  `maxContentAgeMin` values (populated at line ~1088 whenever the seeder opted
  in, regardless of contractMode) instead of from `envelopeMeta`. Both branches
  publish the same trio when both are populated; reading from the local source
  unifies the two paths and makes the seed-meta mirror match the contract-mode
  envelope exactly.

P2 — `api/health.js:589` — Future-dated `newestItemAt` produces negative
`contentAgeMin`, silently suppressing the stale signal.

  Pre-fix: `contentAgeMin > maxContentAgeMin` is false for ANY negative number
  (negative is not greater than any positive budget). A feed publishing
  timestamps in the future — clock skew, timezone bug, or upstream confusing
  forecasts with observations — would silently pass the staleness check
  forever.

  Fix: detect `contentAgeMin < 0` (future-dated) and force `contentStale: true`
  alongside the existing branches. Negative `contentAgeMin` is preserved on
  the wire so operators can see HOW far in the future the timestamp was (a
  -10-minute drift is a clock-skew nit; -8760 minutes is a year-from-now
  corruption).
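The fixed predicate is essentially (assumed names; a sketch of the logic, not health.js verbatim):

```javascript
function computeContentStale(newestItemAt, maxContentAgeMin, nowMs) {
  if (newestItemAt == null) return { contentAgeMin: null, contentStale: true };
  const contentAgeMin = Math.round((nowMs - newestItemAt) / 60_000);
  // Negative age = future-dated upstream timestamp: force stale, but keep
  // the negative value on the wire as a diagnostic
  const contentStale = contentAgeMin < 0 || contentAgeMin > maxContentAgeMin;
  return { contentAgeMin, contentStale };
}
```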

Tests:
- 4 new regression tests across `tests/seed-content-age-contract.test.mjs`
  (P1: non-contract seeder mirrors content-age + null-content-meta still
  carries opt-in signal) and `tests/health-content-age.test.mjs` (P2:
  near-future + far-future newestItemAt → contentStale=true with negative
  contentAgeMin preserved as diagnostic signal).
- 30/30 in targeted suites; typecheck clean.

Both findings hit the same systemic shape: silent-suppression bugs in the
very subsystem designed to detect silent staleness. Worth fixing on the
foundation PR before the rest of the Sprint 1 stack inherits them.
…ing + 9d budget) (koala73#3597)

* feat(disease-outbreaks): Sprint 2 — content-age pilot (synthetic tagging + STALE_CONTENT @ 9d)

Implements Sprint 2 of the 2026-05-04 health-readiness plan
(docs/plans/2026-05-04-001-feat-health-readiness-probe-content-age-plan.md).
Stacked on Sprint 1 (koala73#3596 — content-age probe infra).

Migrates disease-outbreaks as the proof-of-concept content-age consumer.
Pilot maxContentAgeMin=9 days chosen so the 2026-05-04 11d-old incident
would have correctly tripped STALE_CONTENT.

== Source-parser changes (3 sources, uniform shape) ==

scripts/seed-disease-outbreaks.mjs:

WHO DON parser (line ~117): tag synthetic timestamps when the upstream
omits PublicationDateAndTime. Carry _originalPublishedMs (parsed ms or
null) and _publishedAtIsSynthetic (boolean) alongside the existing
publishedMs (which keeps its Date.now() fallback for UI consumer compat).

RSS parser (line ~150, both CDC and Outbreak News Today): same pattern
when pubDate is missing/unparseable.

TGH parser (line ~211): always carries non-synthetic since the line-198
filter rejects undated items earlier. Migration is additive — every
TGH item gets _publishedAtIsSynthetic: false and _originalPublishedMs:
publishedMs so contentMeta + publishTransform apply uniformly.

mapItem (line ~244): carries _publishedAtIsSynthetic and
_originalPublishedMs through to the output shape so contentMeta can
read them at runSeed time.

== runSeed opts (Sprint 2 contract) ==

contentMeta: excludes _publishedAtIsSynthetic items + 1h clock-skew
tolerance + null when validCount === 0 (matches list-feed-digest's
FUTURE_DATE_TOLERANCE_MS pattern).

maxContentAgeMin: 9 * 24 * 60 = 12960 minutes (9 days) — chosen
deliberately so the production incident's 11d-old cache would have
flagged STALE_CONTENT. Tighter would page on normal WHO/CDC quiet
weeks; looser would have missed the incident.
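The synthetic-exclusion plus skew-tolerance logic sketches to roughly this (the real helper is diseaseContentMeta in scripts/_disease-outbreaks-helpers.mjs; this version's internals are assumed):

```javascript
const FUTURE_DATE_TOLERANCE_MS = 60 * 60 * 1000; // 1h clock-skew tolerance

function diseaseContentMetaSketch(data, nowMs = Date.now()) {
  const valid = (data.outbreaks ?? [])
    .filter((o) => !o._publishedAtIsSynthetic) // synthetic timestamps never win
    .map((o) => o._originalPublishedMs)
    .filter((ms) => Number.isFinite(ms) && ms <= nowMs + FUTURE_DATE_TOLERANCE_MS);
  if (valid.length === 0) return null; // all-synthetic -> STALE_CONTENT
  return { newestItemAt: Math.max(...valid), oldestItemAt: Math.min(...valid) };
}
```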

publishTransform: strips _publishedAtIsSynthetic + _originalPublishedMs
from every item BEFORE atomicPublish so the helpers never reach:
  - the Redis canonical key (health:disease-outbreaks:v1)
  - /api/bootstrap response (data.diseaseOutbreaks)
  - list-disease-outbreaks RPC response
  - the DiseaseOutbreakItem proto-generated type

The Sprint 1 ordering contract (contentMeta runs BEFORE publishTransform)
guarantees contentMeta sees the helpers that publishTransform then strips.

== Anti-regression tests ==

tests/disease-outbreaks-seed.test.mjs (NEW) — 16 cases split by layer:

Pre-publish (in-memory) layer (5):
- WHO without PublicationDateAndTime → tagged synthetic
- WHO with valid PublicationDateAndTime → non-synthetic
- RSS without pubDate → tagged synthetic
- RSS with valid pubDate → non-synthetic
- TGH always non-synthetic

contentMeta behavior (5):
- All-synthetic → null (→ STALE_CONTENT)
- Mixed: synthetic with newer publishedAt does NOT win newest
- Picks newest+oldest from non-synthetic set
- Future-dated items beyond 1h tolerance excluded
- NEAR_FUTURE within 1h tolerance accepted

publishTransform strip (3):
- Both helper fields stripped from every item
- publishedAt remains non-null (UI/RPC consumer contract)
- Empty + missing outbreaks handled safely

End-to-end (1):
- contentMeta runs on raw data WITH helpers, publishTransform strips,
  canonical-shape JSON contains NEITHER _publishedAtIsSynthetic NOR
  _originalPublishedMs (combined-regex assertion per Codex round 4 P2)

Pilot threshold sanity (2):
- 11d-old items DO trip the 9d budget (anti-drift on the pilot threshold —
  any future change to 9d must update this test)
- 5d-old items DO NOT trip (no false positive on normal upstream rhythm)

Test totals: 95/95 pass across the seed-envelope, seed-contract,
seed-utils, content-age, health-content-age, and disease-outbreaks-seed
suites.

== Verification (post-deploy) ==

After Railway bundle redeploy:
1. /api/health.diseaseOutbreaks shows contentAgeMin and maxContentAgeMin.
2. Redis canonical health:disease-outbreaks:v1 contains NEITHER
   _publishedAtIsSynthetic NOR _originalPublishedMs (combined-regex
   grep returns 0).
3. /api/bootstrap?keys=diseaseOutbreaks response payload helper-free.
4. With current 11d-old WHO/CDC items + bug-pattern data, STALE_CONTENT
   surfaces in /api/health and ops can act on it.

* refactor(disease-outbreaks): extract helpers + inject nowMs to kill test drift and timing flake

Greptile P2s on PR koala73#3597:

1. tests/disease-outbreaks-seed.test.mjs replicated parser/mapper/contentMeta
   logic locally — a drift in fetchWhoDonApi or contentMeta would not have
   failed any of the 16 tests because they asserted against their own copy
   of the logic, not the seeder's.

2. The "near-future ≤1h accepted" test relied on Date.now() being stable
   between test setup and the call into contentMeta. On a loaded CI runner
   the gap could exceed the (1h - 30min) margin and flake.

Fixes both at once:

- New scripts/_disease-outbreaks-helpers.mjs exports the pure functions
  (whoNormalizeItem, rssNormalizeItem, tghNormalizeItem, mapItem,
  diseaseContentMeta, diseasePublishTransform, DISEASE_MAX_CONTENT_AGE_MIN).
  diseaseContentMeta accepts an optional nowMs for deterministic skew tests.

- Seeder imports those helpers instead of inlining them. ~150 lines
  removed; behavior unchanged (verified by node -c + smoke test).

- Test file imports the real helpers (no replicas). All skew-limit tests
  inject FIXED_NOW=1700000000000 — no wall-clock dependence.

- Tightens the "within 1h tolerance" test from +30min to +5min ahead of
  injected NOW, well clear of the 1h boundary regardless of the timing fix.

Net: -265 lines across the two existing files; +200 in the new helpers
module. 17/17 disease tests pass; 49/49 across the full Sprint 1+2 stack.

* fix(test): correct FIXED_NOW comment year (2025→2023)

Unix timestamp 1700000000000 ms is 2023-11-14T22:13:20Z, not 2025-11-14.
Test correctness unaffected (FIXED_NOW is just an injected stable epoch),
but a reader reasoning about the skew-limit arithmetic would get the
mental date math wrong. Greptile P2 on PR koala73#3598 (which copied the same
wrong comment from this file when Sprint 3a was branched off).
…3#3598)

* feat(climate-news): Sprint 3a — content-age probe (7d budget)

Sparse seeders sub-PR a/c of the 2026-05-04 health-readiness plan. Adds a
content-age contract on seed-climate-news.mjs so /api/health surfaces
STALE_CONTENT when the freshest cached climate-news item is older than 7
days — covering the failure mode where every RSS parse silently breaks at
once (e.g. our regex stops matching because a feed bundle changed) and the
seeder keeps running clean while the cache fossilizes.

Why 7 days: Carbon Brief, Guardian Environment, NASA EO, UNEP, Phys.org,
Copernicus, Inside Climate News, Climate Central, and ReliefWeb publish
collectively at multiple-times-per-day cadence. A 7d budget tolerates a
major holiday weekend across all sources without false-positive paging,
and trips on a real upstream-aggregator outage.

Why no synthetic-tagging needed (unlike disease-outbreaks Sprint 2):
seed-climate-news.mjs:76 + :132 already drop items with publishedAt=0 at
parse time, so contentMeta reads item.publishedAt directly. No helper
fields, no publishTransform stripping required.

Following the Sprint 2 post-refactor pattern: pure helper lives in
scripts/_climate-news-helpers.mjs (climateNewsContentMeta with injectable
nowMs for deterministic tests + CLIMATE_NEWS_MAX_CONTENT_AGE_MIN constant).
The seeder imports it; the test imports it. No duplicated logic, no drift
surface.

Verification: 10/10 climate-news tests pass; 59/59 across the full
content-age stack (Sprint 1 infra + Sprint 2 disease + Sprint 3a climate).
typecheck:api clean; lint clean (pre-existing warnings only).

* fix(test): correct FIXED_NOW comment year (2025→2023)

Greptile P2 on PR koala73#3598: 1700000000000 ms is 2023-11-14T22:13:20Z, not
2025. Test correctness unaffected; comment-only fix so a reader reasoning
about skew-limit arithmetic gets the right mental date math.
…la73#3599)

* feat(iea-oil-stocks): Sprint 3b — content-age probe (45d budget)

Sparse seeders sub-PR b/c of the 2026-05-04 health-readiness plan.
Branched off Sprint 1 (koala73#3596) as a parallel sibling to Sprint 2 (koala73#3597)
and Sprint 3a (koala73#3598) per the plan's "Each PR is independently shippable"
note (line 498).

## Why this matters

IEA monthly oil stocks publish on an M+2 cadence — August data ships in
late October/early November. Without a content-age probe, a stalled
publication month is invisible to /api/health: the seeder runs fine on
its 6h cron, fetchedAt stays fresh, but data.dataMonth never advances. A
45-day budget trips STALE_CONTENT exactly when a month has been missed
(e.g. cache shows "2024-08" past Dec 1 when "2024-10" should have
landed).

## Shape contract — different from Sprint 2/3a

IEA is a SINGLE-SNAPSHOT seeder: every member shares one `dataMonth`
("YYYY-MM" string at the top level), there is no per-item published-at.
The new helper parses dataMonth → end-of-month UTC ms (the latest
possible observation date in the named period) and returns it as both
`newestItemAt` and `oldestItemAt`.

Defensive: contentMeta returns null when dataMonth is missing, malformed
("2024-13", "2024-8" single-digit), or future-dated beyond 1h clock-skew
tolerance (guards against upstream yearMonth garbage producing e.g. a
2099-12 dataMonth).
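The dataMonth-to-timestamp mapping sketches to this (the real helper is dataMonthToEndOfMonthMs in scripts/_iea-oil-stocks-helpers.mjs; internals assumed):

```javascript
function dataMonthToEndOfMonthMsSketch(dataMonth) {
  const m = /^(\d{4})-(\d{2})$/.exec(dataMonth ?? '');
  if (!m) return null;                      // missing, "2024-8", garbage
  const year = Number(m[1]);
  const month = Number(m[2]);
  if (month < 1 || month > 12) return null; // "2024-13"
  // Date.UTC's day 0 rolls back to the last day of the previous monthIndex,
  // so passing the 1-based month directly yields end of the named month
  return Date.UTC(year, month, 0, 23, 59, 59, 999);
}
```

Leap years fall out for free: "2024-02" maps to Feb 29, not Feb 28.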

## Pattern parity with Sprint 2/3a

Following the established pattern: pure helpers in
`scripts/_iea-oil-stocks-helpers.mjs` (`dataMonthToEndOfMonthMs`,
`ieaOilStocksContentMeta`, `IEA_OIL_STOCKS_MAX_CONTENT_AGE_MIN`).
Seeder imports them; tests import them. No replicas.

`seed-iea-oil-stocks.mjs` is NOT in Dockerfile.relay (verified via
`grep`), so no COPY-line update needed (unlike Sprint 3a's
seed-climate-news which IS relay-COPY'd).

## Verification

- 15/15 iea content-age tests pass (incl. leap-year, month-rollover,
  invalid-shape rejection, M+2 lag realism, future-clock-skew defense)
- 78/78 across iea seed + Sprint 1 + Sprint 3b stack
- typecheck:api clean; lint clean (pre-existing warnings only)
- Dockerfile.relay closure test passes (no relay impact)

* fix(iea-oil-stocks): bump budget 45d→90d to cover M+2 natural lag

Greptile P1 on PR koala73#3599: a 45-day budget contradicts the helper's own
M+2 cadence claim. End-of-observation-month (Aug 31) is ~60-65 days
BEFORE publication (~late Oct/early Nov), so fresh-arrival data is
already past the 45d threshold at the moment a successful seed run
writes it. STALE_CONTENT would have fired on every cron tick.

Corrected math: 90d = ~60d natural M+2 lag + ~30d missed-publication
slack. Trips only when a month is missed entirely (cache stuck at
"2024-08" past mid-Jan when "2024-10" should have landed).

Also addresses 3 P2 review nits in the same edit:

- Test "60 days old" → "fresh-arrival regression guard: ~60d-old fresh
  M+2 data does NOT trip" (the math was right, name was wrong; rewrote
  the test to actually pin the failure mode the P1 cited).
- Test "~30 days old" → "~14 days old" (the fixture was "2023-10" =
  ~14d before FIXED_NOW, not 30).
- M+2 lag scenario comment "Sept data published ~Oct 25" → "~late Nov
  (M+2 cadence)" — Oct 25 is M+1, not M+2.

Added: dedicated fresh-arrival regression guard test that asserts a
~75d-old fresh M+2 dataMonth is within budget. Without it, a future
budget tightening could re-introduce the immediate-page bug invisibly.

Verification: 16/16 iea content-age (was 15/15 — added regression guard);
79/79 across iea seed + Sprint 1 + Sprint 3b stack; typecheck:api clean.
…t) (koala73#3602)

* feat(power-reliability): Sprint 4 — content-age probe (24-month budget)

Closes the plan's "Definition of done" item: at least 1 annual-data
seeder migrated. Branched off Sprint 1 (koala73#3596) as a parallel sibling
to Sprints 2/3a/3b.

## Why this matters

WB EG.ELC.LOSS.ZS publishes annually. Without a content-age probe, a
stalled WB publication cycle is invisible to /api/health: the seeder
runs fine on its 35-day TTL, fetchedAt stays fresh, but no country's
year ever advances past e.g. 2024. STALE_CONTENT trips correctly when
the cache stops advancing — for power-reliability, that means "by the
time you'd expect year-N+1 data, year-N is still latest" → page on-call.

## Why 24 months (NOT the plan's 13 months)

Plan §477-485 originally proposed `13 * 30 * 24 * 60` minutes (~13
months), but this is structurally wrong for WB indicators — verified
against live WB API on 2026-05-05:

  curl https://api.worldbank.org/v2/country/USA;CHN;...;KWT/indicator/EG.ELC.LOSS.ZS

On that date G7 max year = 2024. End-of-2024 = Dec 31 2024 = ~17 months
before the seed. WB year-N data lands in cache 12-18 months after
end-of-N (publication lag varies). A 13-month budget would have tripped
STALE_CONTENT immediately on every successful fresh-arrival — the same
failure mode Greptile P1 caught on Sprint 3b PR koala73#3599 (45d budget vs
M+2 60-day natural lag).

24mo math:
  - Year N data lands at age = 12-18 months (publication lag)
  - Year (N+1) data lands ~12 months later, resetting the clock
  - Worst case during steady state: age = ~30 months (just before next
    year drops AND publication lag at upper end)
  - 24mo budget catches catastrophic stalls (>2y silent upstream)
    without false-positive paging during normal between-publications

## Shape contract — third distinct shape this sprint

Per-country dict where each country has its OWN year (different from
Sprint 2/3a per-item arrays AND from Sprint 3b single-snapshot period):

  {countries: {US: {value, year: 2024}, KW: {year: 2021}, ...}, seededAt}

`newestItemAt` = end-of-(max year across all countries) — drives
staleness. Late reporters (KW/QA/AE) lagging G7 don't drag the panel
into STALE_CONTENT; once any country's year advances, the clock resets.

`oldestItemAt` = end-of-(min year across countries) — informational.

## Pattern parity with Sprint 2/3a/3b

Pure helpers in `scripts/_power-reliability-helpers.mjs`:
`yearToEndOfYearMs`, `powerReliabilityContentMeta` (with injectable
`nowMs`), `POWER_RELIABILITY_MAX_CONTENT_AGE_MIN`. Seeder imports;
test imports. No replicas.

`seed-power-reliability.mjs` is NOT in Dockerfile.relay (verified via
grep), so no COPY-line update needed.

## Verification

- 14/14 power-reliability content-age tests pass
- 46/46 across Sprint 1 + Sprint 4 stack
- typecheck:api clean; lint clean
- Tests include a dedicated `fresh-arrival regression guard` test
  that pins the EXACT budget/natural-lag mismatch failure mode
  (Sprint 3b lesson made concrete) so a future budget tightening
  cannot silently re-introduce the immediate-page bug
- Boundary test: 2023 data in May 2026 (~29mo) DOES trip — confirms
  the staleness clock works correctly past the budget threshold

* fix(power-reliability): bump budget 24mo→36mo to cover steady-state ceiling

Greptile P1 on PR koala73#3602: 24-month budget false-positives mid-cycle when
next-year data publishes legitimately late.

The math I missed in the initial commit: fresh-arrival lag (~17mo for WB
EG.ELC.LOSS.ZS) is the FLOOR but not the worst case. Once year N is in
cache, it stays there until year N+1 publishes — which can legitimately
take up to end-of-(N+1) + 18mo = end-of-N + 30mo under the documented
12-18 month publication-lag range. So cache age can reach 30 months
between publications WITHOUT any real upstream stall.

Corrected budget = 30mo steady-state ceiling + 6mo slack = 36 months
(36 thirty-day-months = 1080 days ≈ 3 years).

Also resolves the P2 prose-vs-math mismatch (JSDoc previously said
"730 days" but `24 * 30 * 24 * 60` = 720; new wording "36 thirty-day
months ≈ 1080 days" is internally consistent).

General formula now documented in the helper JSDoc:

  budget >= max_publication_lag + cycle_length + slack

Both halves required: fresh-arrival lag AND cycle_length. Initial PR
covered fresh-arrival (~17mo) but missed cycle_length (12mo), which is
exactly how the false-positive emerges. Same shape as Sprint 3b PR koala73#3599
P1 — that one missed fresh-arrival; this one missed steady-state.
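Plugging the commit's own numbers into the formula (month = 30 days, per its convention):

```javascript
const MONTH_MIN = 30 * 24 * 60;               // minutes per thirty-day month
const maxPublicationLag = 18 * MONTH_MIN;     // upper end of the 12-18mo WB lag
const cycleLength = 12 * MONTH_MIN;           // annual publication cycle
const slack = 6 * MONTH_MIN;
const budgetMin = maxPublicationLag + cycleLength + slack; // = 36 months
```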

Tests:
- Renamed boundary test "max year 2023 (~29mo) DOES trip" → "steady-state
  regression guard: max year 2023 (~29mo) does NOT trip — within ceiling"
  with assertion direction flipped (29mo < 30mo ceiling = legitimate
  late-publication wait, not staleness)
- Added new boundary test "max year 2022 (~40mo) DOES trip — past ceiling
  = real stall" to confirm the budget fires correctly past the ceiling
- Constant assertion: 36 * 30 * 24 * 60

15/15 power-reliability tests pass; 47/47 across Sprint 1+4 stack;
typecheck:api clean; lint clean.
…ssil-share (koala73#3603)

* feat(wb-cohort): Sprint 4 follow-up — content-age for low-carbon + fossil-share

Sprint 4 cohort follow-up of the 2026-05-04 health-readiness probe plan.
Migrates the two remaining WB resilience seeders that match power-reliability's
shape: seed-low-carbon-generation.mjs and seed-fossil-electricity-share.mjs.
Branched off Sprint 1 (koala73#3596) as a parallel sibling.

## Why a shared helper this time

Three production seeders now use the IDENTICAL per-country-dict shape
({countries: {ISO2: {value, year}}, seededAt}) with the IDENTICAL
contentMeta math (max-year selection + end-of-year UTC + 1h skew limit).
Per CLAUDE.md "three similar lines is better than a premature abstraction"
— three is exactly the line for justifying the abstraction now.

New `scripts/_wb-country-dict-content-age-helpers.mjs` exports:
  - yearToEndOfYearMs(year)
  - wbCountryDictContentMeta(data, nowMs?)

Each seeder imports it + brings its own MAX_CONTENT_AGE_MIN constant
inline (per-seeder budgets matter — see below). seed-power-reliability
keeps its own helper for now (PR koala73#3602 is in review; backporting to the
shared helper is a follow-up after merge to keep that PR's diff focused).
The math is verifiably identical.
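The shared per-country-dict math sketches to roughly this (the real exports live in scripts/_wb-country-dict-content-age-helpers.mjs; this version's internals are assumed):

```javascript
function yearToEndOfYearMsSketch(year) {
  return Date.UTC(year, 11, 31, 23, 59, 59, 999);
}

function wbCountryDictContentMetaSketch(data, nowMs = Date.now()) {
  const skewLimitMs = nowMs + 60 * 60 * 1000; // 1h clock-skew tolerance
  const valid = Object.values((data && data.countries) || {})
    .map((c) => c.year)
    .filter(Number.isInteger)
    .map(yearToEndOfYearMsSketch)
    .filter((ms) => ms <= skewLimitMs); // future-dated years excluded
  if (valid.length === 0) return null;
  // newest = end of the MAX year across countries: late reporters (KW/QA/AE)
  // don't drag the panel into STALE_CONTENT
  return { newestItemAt: Math.max(...valid), oldestItemAt: Math.min(...valid) };
}
```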

## Per-seeder budgets (NOT one-size-fits-all)

Verified against live WB API on 2026-05-05 — publication lags differ
across these "annual WB indicators":

  - low-carbon-generation (NUCL+RNEW+HYRO sum, MAX year of 3):
      max year = 2024 (driven by NUCL/HYRO; RNEW lags to 2021 but is
      masked by MAX-of-3 in the seeder's countries[iso2].year compute)
      → fresh-arrival lag ~17mo
      → 36mo budget (= 30mo steady-state ceiling + 6mo slack)
      → matches power-reliability exactly

  - fossil-electricity-share (EG.ELC.FOSL.ZS):
      max year = 2023 (NOT 2024 — slower-publishing indicator)
      → fresh-arrival lag ~29mo
      → 48mo budget (= 41mo steady-state ceiling + 7mo slack)

A naive cohort-wide budget would either false-positive on fossil-share
(if 36mo) or be wastefully loose on low-carbon (if 48mo). Per-seeder
constants are the correct response — each indicator's lag is empirically
different.

The "per-seeder budget separation" test pins this explicitly: a 41mo cache
trips low-carbon (36mo) but NOT fossil-share (48mo). Demonstrates that the
budgets aren't accidental — they reflect real upstream cadence differences.

## Renewables (RNEW.ZS) data-quality flag

Discovered during the audit: EG.ELC.RNEW.ZS max year = 2021 in May 2026,
~53mo lag. Inside low-carbon-generation it's masked by MAX(NUCL, RNEW,
HYRO), so content-age looks fine. But the underlying renewable share
data is genuinely 5+ years stale. Not addressed in this PR — flagging
as a separate data-quality concern for follow-up review.

## Verification

  - 15/15 wb-country-dict content-age tests pass (incl. fresh-arrival +
    steady-state regression guards for BOTH new seeders, plus a
    per-seeder budget separation test)
  - 47/47 across Sprint 1 + cohort follow-up stack
  - typecheck:api clean; lint clean
  - Neither seeder is in Dockerfile.relay (verified via grep) — no
    relay-COPY change needed

Sprint 4 is now done for the WB cohort (3 of 5 plan-listed indicators
migrated, with a 4th — IMF/WEO — explicitly deferred because it has
forecast-year semantics that need different content-age handling).

* fix: address Greptile PR koala73#3603 P2 nits (misleading comment + import order)

P2 — `tests/wb-country-dict-content-age.test.mjs:79` — misleading inline
comment: read `// end-of-2026 = Dec 31 23:59:59 = past FIXED_NOW (May 5)`
but FIXED_NOW is 2026-05-05 and end-of-2026 is ~7 months in the FUTURE,
not past. The test logic is correct (the EDGE year IS excluded as
future-dated beyond skew tolerance) — only the comment was wrong.

P2 — `scripts/seed-fossil-electricity-share.mjs:30` — `import iso3ToIso2`
appeared on the line immediately after `const MAX_CONTENT_AGE_MIN`.
ES module `import`s are hoisted regardless of source order, but
interleaving with declarations confuses readers (code "looks" sequential
but the import actually executes first). Moved the import up alongside
the other top-of-module imports.

Both pure-text nits — no behavior change. typecheck clean; targeted
tests/wb-country-dict-content-age.test.mjs passes 15/15.
…tics, 18mo budget) (koala73#3604)

* feat(imf-weo): Sprint 4 IMF cohort — content-age (forecast-year semantics, 18mo budget)

Closes the deferred IMF/WEO portion of Sprint 4 (plan §477-485 listed
"plus IMF/WEO/etc." as part of the annual-data migration). Branched off
Sprint 1 (koala73#3596) as a parallel sibling.

Migrates all 4 IMF SDMX seeders in one PR:
  - seed-imf-external.mjs   (BCA, TM_RPCH, TX_RPCH)
  - seed-imf-growth.mjs     (NGDP_RPCH, NGDPDPC, NGDP_R, PPPPC, PPPGDP, NID_NGDP, NGSD_NGDP)
  - seed-imf-labor.mjs      (LUR, LP)
  - seed-imf-macro.mjs      (PCPIPCH, BCA_NGDPD, GGR_NGDP, PCPI, PCPIEPCH, GGX_NGDP, GGXONLB_NGDP)

## The semantic difference from WB cohort (and why a separate helper)

WB indicators store the OBSERVED year — `record.date = "2024"` means
data observed during calendar year 2024. The WB helper maps year →
end-of-year UTC ms (the latest observation date inside the named year).

IMF/WEO stores the FORECAST horizon, NOT an observation year. The
`weoYears()` function in `_seed-utils.mjs` returns
`[currentYear, currentYear-1, currentYear-2]` and `latestValue()` picks
the first year that has a finite value. So in May 2026 after the April
2026 WEO release, max stored year = 2026 — that's IMF's freshest
*forecast* for fiscal 2026, not observations through end-of-2026.

If the IMF helper reused the WB cohort helper (`yearToEndOfYearMs`):
year=2026 → end-of-2026 = Dec 31 2026 = ~7 months FUTURE relative to
NOW → rejected by 1h skew limit → `contentMeta` returns null → every
fresh IMF cache reports STALE_CONTENT. That's the failure mode this
module avoids.

Mapping rationale: `imfForecastYearToMs(year)` returns
`Date.UTC(year - 1, 11, 31, 23, 59, 59, 999)`. Reads as: "the latest
fully-observed period this forecast vintage is built on." For year=2026
→ end-of-2025 = ~5 months ago in May 2026. Correctly fresh.

A dedicated test (`semantic difference from WB cohort: forecast year
2026 in May 2026 maps to past (NOT future)`) exists specifically to
prevent a future refactor from collapsing the WB and IMF helpers.
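
The semantic split can be sketched side by side (a minimal sketch under the FIXED_NOW=2026-05-05 assumption used by the tests; the real helpers' exact signatures are assumptions):

```javascript
// WB cohort: year is the OBSERVED year, so map to the latest instant
// inside that calendar year.
function yearToEndOfYearMs(year) {
  return Date.UTC(year, 11, 31, 23, 59, 59, 999);
}

// IMF/WEO: year is a FORECAST horizon, so map to the latest fully-observed
// period the vintage is built on, i.e. the end of the PREVIOUS year.
function imfForecastYearToMs(year) {
  return Date.UTC(year - 1, 11, 31, 23, 59, 59, 999);
}

const FIXED_NOW = Date.UTC(2026, 4, 5); // 2026-05-05
console.log(imfForecastYearToMs(2026) < FIXED_NOW); // true: ~5 months in the past
console.log(yearToEndOfYearMs(2026) > FIXED_NOW);   // true: ~7 months in the future
```

Reusing the WB mapping for IMF year=2026 would land past the 1h skew limit, which is exactly the STALE_CONTENT failure mode described above.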

## Why one shared budget across all 4 IMF seeders (NOT per-seeder)

WB cohort had per-seeder budgets because publication lags differed
(LOSS at ~17mo, FOSL at ~29mo). All 4 IMF seeders use the IDENTICAL
upstream — IMF SDMX/WEO. WEO publishes April + October vintages each
year as a single integrated release covering all WorldMonitor's
indicator codes. So all 4 share the same fresh-arrival lag and the
same steady-state ceiling. One budget = correct.

## 18-month budget — derivation

Steady-state model under "year → end-of-(year-1)" mapping:

  - After April N release: max year = N → newestItemAt = end-of-(N-1).
    Age = ~5 months.
  - After October N: max year still = N → age = ~11 months.
  - Just before April N+1: max year still = N → age = ~16 months.
  - After April N+1: max year advances to N+1 → newestItemAt resets.

Steady-state ceiling = 16mo (just before April release of next year).
Budget = 16mo + 2mo slack = 18 months. Trips when a full year of WEO
releases is missed (both April AND October vintages of one year), which
is the right pager threshold for an IMF outage.
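
The steady-state ages above can be sanity-checked with calendar-month arithmetic (`monthsBetween` is a hypothetical helper for illustration, not part of the codebase):

```javascript
// Months elapsed between two UTC timestamps, by calendar-month difference.
function monthsBetween(fromMs, toMs) {
  const f = new Date(fromMs), t = new Date(toMs);
  return (t.getUTCFullYear() - f.getUTCFullYear()) * 12
       + (t.getUTCMonth() - f.getUTCMonth());
}

// After the April 2026 release, max year = 2026, so the content-age anchor
// is end-of-2025 under the "year -> end-of-(year-1)" mapping.
const anchor = Date.UTC(2025, 11, 31);

console.log(monthsBetween(anchor, Date.UTC(2026, 4, 5)));   // 5  (May 2026)
console.log(monthsBetween(anchor, Date.UTC(2026, 10, 15))); // 11 (Nov 2026)
console.log(monthsBetween(anchor, Date.UTC(2027, 2, 20)));  // 15 (Mar 2027,
// just before the April 2027 release: roughly the ~16mo steady-state
// ceiling above, rounding from end-of-December; hence 16mo + 2mo = 18mo)
```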

## Verification

  - 15/15 imf-weo content-age tests pass (incl. fresh-arrival + steady-
    state regression guards, future-skew defense, late-reporter cohort
    handling, and the WB-vs-IMF semantic-difference guard test)
  - Tested with `npx tsx --test` against the existing IMF test suites:
    34/34 across `imf-country-data` + `seed-imf-extended` + new file
  - 47/47 across Sprint 1 + IMF cohort stack
  - typecheck:api clean; lint clean
  - Zero seed-imf-*.mjs files in Dockerfile.relay (verified via grep)
    so no relay-COPY change needed

## Sprint 4 status after this PR

  - ✅ power-reliability (koala73#3602)
  - ✅ low-carbon-generation + fossil-electricity-share (koala73#3603)
  - ✅ IMF/WEO cohort: external + growth + labor + macro (this PR)

Plan §477-485 fully closed. The plan's "Definition of done" §530
(≥1 annual-data migrated) was satisfied by koala73#3602; this PR + koala73#3603
round out the rest of the listed cohort.

* fix(imf-weo): use max forecast year for content-age, not priority-first metric

Codex PR koala73#3604 P2. The four IMF/WEO seeders write `entry.year` as the
priority-first non-null indicator's year (`ca?.year ?? tm?.year ?? tx?.year`
in seed-imf-external). That's correct as the public payload's "primary
metric vintage" but WRONG for content-age: a row with BCA=2024 +
import-volume=2026 publishes year=2024, so content-age maps it to
2023-12-31 (~17mo old, near-stale) even though the country dict carries
a fresh 2026 metric (~5mo old in May 2026).

Fix path A (preserves public payload semantics): seeders now populate a
dedicated `latestYear` field via a new `maxIntegerYear()` helper, computed
across ALL the country's indicator years. The content-age helper prefers
`entry.latestYear` over `entry.year`, falling back to `year` for back-compat
with caches written before this PR.

- scripts/_imf-weo-content-age-helpers.mjs — export `maxIntegerYear()`;
  `imfWeoContentMeta` reads `entry.latestYear` first
- scripts/seed-imf-{external,growth,labor,macro}.mjs — populate `latestYear`
  alongside existing `year` (no public payload change beyond the new field)
- tests/imf-weo-content-age.test.mjs — add maxIntegerYear unit tests +
  three mixed-indicator-year regression tests covering the fresh-metric-
  behind-stale-primary case, latestYear=null fallback, and heterogeneous
  cohort newest/oldest extraction
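
A minimal sketch of the helper's contract (assumed shape; the real export lives in `scripts/_imf-weo-content-age-helpers.mjs`):

```javascript
// Largest finite integer year across a country's indicator years;
// null/undefined/non-integer entries are skipped, null when nothing parses.
function maxIntegerYear(...years) {
  let max = null;
  for (const y of years) {
    if (y == null) continue;
    const n = Number(y);
    if (!Number.isInteger(n)) continue;
    if (max === null || n > max) max = n;
  }
  return max;
}

// Mixed-vintage row from the text: primary metric year=2024, but a sibling
// indicator carries a fresh 2026 value.
const entry = { year: 2024, latestYear: maxIntegerYear(2024, 2026, null) };

// Content-age prefers latestYear, falling back to year for caches written
// before the latestYear field existed.
const contentYear = entry.latestYear ?? entry.year;
console.log(contentYear); // 2026 (the fresh metric wins)
```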

* chore(imf-weo): adversarial-review hardening — horizon-extension trap guard + schemaVersion bump

PR koala73#3604 review findings #1 + #2. Both advisory, no behavior change today.

#1 Horizon-extension trap: weoYears() currently returns [currentYear,
   currentYear-1, currentYear-2], so max year = currentYear and the 1h
   skew filter is purely defensive. If a future Sprint extends weoYears()
   to include currentYear+1 to surface forward forecasts, the skew filter
   would silently drop every fresh +1 entry, regressing cohort
   newestItemAt to the prior year and producing FALSE STALE_CONTENT for
   genuinely-fresh data. Added load-bearing comment near the skew check
   plus a regression-guard test that documents the trap shape under
   FIXED_NOW=2026-05-05. Test asserts the trap, not desired behavior;
   when horizon extension lands the test fails and forces revisit.

#2 schemaVersion bump 1->2 across all 4 seeders. Codex P2 added the
   latestYear field; envelope newestItemAt math now differs under the
   same schema number. Bumping forces a clean republish on rollout and
   makes rollback observable rather than silently drifting envelope math
   while caches keep the new shape.
…-strike disable (koala73#3627)

* fix(consumer-prices): add pin auto-recovery — symmetric to existing 3-strike disable

## Symptom

WM 2026-05-08: /api/health flagged `consumerPricesSpread: EMPTY_DATA`
for hours despite 4 AE retailers actively scraping with freshnessMin
18-26 minutes. Investigation revealed retailer-spread aggregation
collapsed because no basket item had ≥3 retailers with active matches +
in-stock observations across all 4.

Audit revealed 48.5% of ALL product_matches across the system are
sticky-disabled via \`pin_disabled_at\`:

  basket_market    disabled  active  total  pct_disabled
  ──────────────── ───────── ─────── ────── ─────────────
  essentials-ae    111       49      174    64%
  essentials-sg    11        7       18     61%
  essentials-sa    42        27      75     56%
  essentials-au    20        23      45     44%
  essentials-us    20        27      49     41%
  essentials-gb    12        24      40     30%
  essentials-in    16        24      67     24%
  essentials-br    5         14      21     24%
  ──────────────── ───────── ─────── ────── ─────────────
  TOTAL            237       252     489    48.5%

Daily disable drip of 3-14 matches at 02:00 UTC for ~3 weeks. Disabled-
set match-score AVG = 0.99 vs active-set 0.95 — proves the disabler is
killing the BEST matches whose underlying products had transient blips
(3 consecutive out-of-stock or pin-error scrapes), not selecting bad
data.

## Root cause: sticky-disable without auto-recovery

`scripts/jobs/scrape.ts` has a 3-strike auto-disable mechanism: when a
pinned product is OOS or pin-errors for 3 consecutive scrapes,
`pin_disabled_at` gets set. **There was NO paired auto-recovery.** Once
`pin_disabled_at` is set, it's never cleared. Coverage monotonically
decays over weeks as transient blips (seasonal OOS, URL hiccups,
temporary supply issues) accumulate.

See memory `sticky-disable-without-auto-recovery-decays` for the
generalized pattern.

## Fix: BOTH halves shipped together

(A) Code: symmetric counter `consecutive_in_stock` mirrors the existing
`consecutive_out_of_stock` from migration 007. The in-stock branch in
`scrape.ts` increments it; when it crosses the same 3-consecutive
threshold the disable side uses, `pin_disabled_at` is cleared. Logged
as `[pin] auto-recovered stale pin for <target> (Nx in-stock)`.

(B) Data: one-time SQL reset of all existing `pin_disabled_at` markers
(`UPDATE ... SET pin_disabled_at = NULL`). The next scrape cycle
re-disables anything still genuinely broken; the ~70% that were
transiently OOS recover within ~3 days.

Code-only would leave the existing 237 sticky records permanently
disabled (auto-recovery only fires on successful scrapes, but
sticky-disabled may not be scraped at all if disable also cuts the
scrape path). Data-only restarts decay immediately on next nightly
scrape. Both required.
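
An in-memory model of the 3-strike symmetry (the real counters are SQL columns updated from scrape.ts; the field names follow the text, everything else here is an illustrative assumption):

```javascript
const THRESHOLD = 3; // same value on the disable and recovery sides

function onInStock(rp) {
  rp.consecutive_out_of_stock = 0; // success resets the disable counter
  rp.consecutive_in_stock = (rp.consecutive_in_stock ?? 0) + 1;
  if (rp.consecutive_in_stock >= THRESHOLD && rp.pin_disabled_at) {
    rp.pin_disabled_at = null; // auto-recover, mirroring the disable side
  }
}

function onOutOfStock(rp) {
  rp.consecutive_in_stock = 0; // every failure path resets the recovery counter
  rp.consecutive_out_of_stock = (rp.consecutive_out_of_stock ?? 0) + 1;
  if (rp.consecutive_out_of_stock >= THRESHOLD && !rp.pin_disabled_at) {
    rp.pin_disabled_at = new Date().toISOString(); // sticky-disable
  }
}

const rp = { pin_disabled_at: '2026-04-01T02:00:00Z', consecutive_in_stock: 0 };
onInStock(rp); onInStock(rp); onInStock(rp);
console.log(rp.pin_disabled_at); // null (recovered after 3 in-stock scrapes)
```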

## Migration verified

Dry-run inside a transaction (ROLLBACK):

  ALTER TABLE retailer_products ADD COLUMN ... → ✓
  UPDATE product_matches SET pin_disabled_at = NULL → 237 rows
  Post-state: still_disabled = 0
  Post-ROLLBACK: 237 disabled (production unchanged) ✓

## Verification

- 31/31 consumer-prices-core unit tests pass (no regressions)
- TypeScript clean on the modified scrape.ts (`tsc --noEmit` shows pre-
  existing implicit-any errors elsewhere; none introduced by this PR)
- Migration SQL syntactically valid + idempotent (`ADD COLUMN IF NOT
  EXISTS` allows safe re-run)
- Recovery is logged (`[pin] auto-recovered ...`) so post-deploy we can
  verify by grepping Railway logs for that pattern

## Post-deploy expectations

Within ~3 days of deploy:
- 237 sticky-disabled markers cleared by the migration
- Next scrape cycle re-disables only the genuinely-broken ones (URL
  permanently changed, product permanently out of stock)
- The transient majority (~70% based on score histogram) start
  contributing to retailer-spread aggregation
- `/api/health` flips `consumerPricesSpread` from EMPTY_DATA to OK once
  ≥2 retailers have ≥4 common items (the existing `MIN_SPREAD_ITEMS`
  quality gate)
- Coverage no longer monotonically decays — sticky disables are now
  self-healing

## Memory entries

- `sticky-disable-without-auto-recovery-decays` — captures this
  pattern's discriminator (high disable rate + disabled-set quality
  ≥ active-set quality + daily drip pattern) and the
  always-ship-both-halves rule
- `strict-full-coverage-aggregation-collapses-to-empty` — the surface
  symptom (the spread query collapsing); this PR addresses the
  underlying cause (data sparsity from monotonic decay)

* fix(consumer-prices): close 3 gaps from fresh-eyes review of koala73#3627

Self-review of koala73#3627 surfaced three real holes that would have made the
original fix not actually work in production:

## Gap 1: migration was incomplete (97% no-op)

The first cut cleared `pin_disabled_at` but left the trigger counters
(`consecutive_out_of_stock` and `pin_error_count`) at threshold.
`getPinnedUrlsForRetailer` (matches.ts:102-103) ALSO excludes products
where either counter is ≥3. Per a live-DB audit, 230 of 237 disabled
matches (97%) had at least one counter at threshold — so post-migration
they'd still be excluded from scraping → my new auto-recovery counter
would never run on them → they'd stay effectively disabled.

Fix: migration 009 now also resets both counters for any
retailer_product where they exceeded 0:

  UPDATE retailer_products SET consecutive_out_of_stock = 0,
                                pin_error_count = 0
   WHERE consecutive_out_of_stock > 0 OR pin_error_count > 0;

Live-DB dry-run (in BEGIN…ROLLBACK transaction) confirms this resets
282 retailer_products. Post-migration: 0 still-disabled, 0 still-OOS-at-
threshold, 0 still-pin-error-at-threshold. Production unchanged after
ROLLBACK.

## Gap 2: handlePinError didn't reset the recovery counter

The original handlePinError increments pin_error_count but didn't
touch consecutive_in_stock. By symmetry, every failure path must
reset the recovery counter — otherwise an Exa fallback (pin error)
interleaved with successful in-stock scrapes would let the recovery
counter accumulate falsely across failures.

Fix: handlePinError now does `consecutive_in_stock = 0` alongside the
pin_error_count increment. Same pattern already in handleStaleOnOutOfStock.

## Gap 3: zero unit tests for the new logic

The Completeness Standard says "test before shipping." First cut had
zero tests — would have shipped on dry-run + manual verification only.

Fix: extracted handleStaleOnInStock + handleStaleOnOutOfStock + handlePinError
to dedicated module `scrape-pin-recovery.ts` (avoids scrape.ts's heavy
transitive deps — exa-js, playwright, etc. — that prevented unit-test
imports). Added 9 tests in `scrape.test.ts` covering:

  - increments + atomic counter resets on each branch
  - threshold gating (3-strike on both sides)
  - idempotency on the clear (repeat in-stock observations after
    threshold safely re-fire the no-op clear)
  - defensive handling of missing/null counter values
  - symmetry contract (same threshold value, same call shape)

40/40 tests pass (was 31; added 9). TypeScript clean on all 3 modified
files. scrape.ts now delegates to the helpers via a 1-line import; the
production code path is unchanged.

## Why this iteration matters

A code-only fix without the migration counter-reset would have shipped
green CI but produced ZERO actual recovery in production — the very
products it was meant to fix would have remained excluded. Fresh-eyes
review caught this BEFORE deploy. Ship the complete thing — not the
plan to ship the complete thing.

* fix(consumer-prices): close P1 — add recovery-probe path so future disables don't decay

Reviewer P1 on PR koala73#3627: even with the symmetric counter + migration,
auto-recovery cannot run after FUTURE disables because the scrape job
excludes disabled pins from the target set.

`getPinnedUrlsForRetailer` keeps only pins where:
  - pm.pin_disabled_at IS NULL
  - rp.consecutive_out_of_stock < 3
  - rp.pin_error_count < 3

Once handleStaleOnOutOfStock or handlePinError disables a pin, future
scrape cycles never fetch that product → handleStaleOnInStock never
runs → consecutive_in_stock never increments → pin_disabled_at never
clears → decay restarts from cycle one. The migration cleared today's
backlog, but the fix was a one-shot bandaid, not self-healing.

## Architectural fix: split "scrape for recovery" from "aggregate"

Per the reviewer's suggested fix (split pins-to-scrape from pins-
eligible-for-aggregation):

(A) New function `getDisabledPinsForRecovery(retailerId, limit)` in
    matches.ts — returns up to N disabled pins per cycle, FIFO-ordered
    by pin_disabled_at ASC (oldest disable first → fairness across the
    disabled set).

(B) scrape.ts now loads BOTH sets and merges them:
    - Active pins (every cycle, current behavior)
    - Recovery probes (LIMIT=10 per cycle)
    Active wins on key collision (active set is healthier; collision
    rare).

(C) Aggregation gates in worldmonitor.ts (buildSpreadSnapshot etc.)
    continue filtering pin_disabled_at IS NULL — probed-but-still-
    disabled pins don't leak into spread until they've fully recovered
    (3 successful in-stock observations).

## Recovery dynamics

With ~30 disabled pins per retailer and LIMIT=10:
- Full probe coverage: ~3 days (10/cycle × 3 cycles)
- Recovery for a single pin: 3 successful probes spaced across ~7 days
  = ~7-9 days to clear pin_disabled_at
- Once recovered, pin returns to active rotation; new disables get
  probed automatically next FIFO cycle

Bounded scrape-budget cost: at most LIMIT extra fetches per cycle per
retailer. Tunable.
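
The merge in (B) can be sketched as a Map union where the active set is written last (the {sourceUrl, productId, matchId} value shape follows the text; the key format is an assumption):

```javascript
function mergePins(activePins, recoveryPins) {
  const merged = new Map(recoveryPins); // copy recovery probes first...
  for (const [key, pin] of activePins) merged.set(key, pin); // ...active wins
  return merged;
}

const active = new Map([
  ['milk-1l', { sourceUrl: 'https://retailer/a', productId: 1, matchId: 11 }],
]);
const probes = new Map([
  ['milk-1l', { sourceUrl: 'https://retailer/stale', productId: 1, matchId: 11 }],
  ['eggs-12', { sourceUrl: 'https://retailer/b', productId: 2, matchId: 22 }],
]);

const merged = mergePins(active, probes);
console.log(merged.size);                     // 2
console.log(merged.get('milk-1l').sourceUrl); // 'https://retailer/a' (active wins)
```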

## Verification

Live-DB read-only test of the new SQL against the 4 AE retailers:

  carrefour_ae:    10 / 40 disabled probed per cycle
  lulu_ae:         10 / 25 disabled probed per cycle
  noon_grocery_ae: 10 / 31 disabled probed per cycle
  spinneys_ae:      9 / 15 disabled probed per cycle (LIMIT-bounded by
                                                       9 unique items)

## Tests

6 new tests in src/db/queries/matches.test.ts pin the SQL contract:
- Filter polarity (IS NOT NULL, opposite of getPinnedUrlsForRetailer)
- match_status whitelist (only auto/approved enter recovery)
- FIFO ordering (pin_disabled_at ASC)
- LIMIT honored (bounded budget)
- Map<key, {sourceUrl, productId, matchId}> shape parity (so scrape.ts
  can merge both Maps)
- Empty-rows handling

46/46 tests pass (was 40 before this commit). TypeScript clean on all
modified files. scrape.ts production code path: unchanged for active
pins; merged with recovery probes via Map union.

## Why this is the proper fix

Without the recovery-probe path, the original fix is a one-shot
intervention — it cleared the historical 237 sticky markers but
provides no defense against future decay. The reviewer correctly
identified that "auto-recovery cannot run after future disables." This
commit adds the missing self-healing loop: every scrape cycle picks
up a bounded slice of disabled pins, gives them a recovery probe, and
resurrects the ones whose underlying products came back in stock.

Memory entry `sticky-disable-without-auto-recovery-decays` updated
with the "gates beneath the gate" pattern + the fix recipe (split the
scrape gate from the aggregation gate).

* fix(consumer-prices): close P1 round 2 — global FIFO via ranked CTE (no UUID starvation)

Reviewer P1 (round 2) on PR koala73#3627: the recovery-probe SQL used
`DISTINCT ON (pm.basket_item_id) ORDER BY pm.basket_item_id, pm.pin_disabled_at ASC LIMIT $2`,
which returns the first N basket UUIDs (UUID order), NOT the N oldest
disabled pins. Low-UUID basket_items would be probed every cycle while
high-UUID disabled pins would starve forever.

Live-DB verification of carrefour_ae's 40 disabled matches across 12
basket items confirmed the bug:

  BUGGY (current): returned same 10 lowest-UUID basket_items every cycle,
                    ignoring 30+ newer-disabled matches with high UUIDs
  FIXED (this PR): returns 10 globally oldest disabled pins (2026-03-23
                    through 2026-04-03 today; cycles through the rest as
                    the oldest recover)

## Fix

Per the reviewer's suggestion: ranked subquery picks one representative
per basket_item (the OLDEST-disabled match within the partition), then
the OUTER query applies global FIFO ordering and the LIMIT.

```sql
SELECT canonical_name, basket_slug, source_url, product_id, match_id
  FROM (
    SELECT cp.canonical_name, b.slug AS basket_slug, ...,
           ROW_NUMBER() OVER (
             PARTITION BY pm.basket_item_id
             ORDER BY pm.pin_disabled_at ASC
           ) AS rn
      FROM product_matches pm ...
     WHERE rp.retailer_id = $1 ... AND pm.pin_disabled_at IS NOT NULL
  ) ranked
 WHERE rn = 1
 ORDER BY pin_disabled_at ASC
 LIMIT $2
```

Verified across all 4 AE retailers on live DB — each returns globally
oldest disabled pins, NOT lowest-UUID basket_items. Spinneys' top-10
under the new SQL spans 2026-03-23 to 2026-05-04 (full disable date
range), proving global fairness.

## Test strengthened

The previous test `orders by oldest disable first (FIFO)` only checked
that the SQL contained the substring `pin_disabled_at ASC` — which the
buggy SQL ALSO contained (in the wrong position). Replaced with structural
assertions that catch the bug class:

  - MUST use ROW_NUMBER + PARTITION BY basket_item_id (not DISTINCT ON)
  - MUST filter on rn = 1 (one rep per partition)
  - MUST NOT contain DISTINCT ON anywhere
  - MUST apply OUTER ORDER BY pin_disabled_at AFTER the rn=1 filter
    AND BEFORE LIMIT (verified by index ordering of clauses in the SQL
    string — a regression to DISTINCT ON would fail this check)

46/46 tests pass. TypeScript clean. Live-DB read-only verification
confirms expected behavior.
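
A sketch of what such structural assertions can look like (hypothetical helper; the real assertions live in src/db/queries/matches.test.ts):

```javascript
// Structural check: passes only for the ranked-CTE shape, fails for the
// DISTINCT ON variant even though both contain "pin_disabled_at ASC".
function looksLikeRankedFifoSql(sql) {
  const s = sql.replace(/\s+/g, ' ').toUpperCase();
  const outerOrder = s.lastIndexOf('ORDER BY PIN_DISABLED_AT ASC');
  return (
    s.includes('ROW_NUMBER() OVER') &&
    s.includes('PARTITION BY PM.BASKET_ITEM_ID') &&
    s.includes('WHERE RN = 1') &&
    !s.includes('DISTINCT ON') &&
    outerOrder > s.indexOf('WHERE RN = 1') && // global FIFO after rn=1 filter
    outerOrder < s.lastIndexOf('LIMIT')       // ...and before the LIMIT
  );
}

const fixedSql = `SELECT * FROM (
  SELECT pm.*, ROW_NUMBER() OVER (
    PARTITION BY pm.basket_item_id ORDER BY pm.pin_disabled_at ASC) AS rn
  FROM product_matches pm WHERE pm.pin_disabled_at IS NOT NULL
) ranked WHERE rn = 1 ORDER BY pin_disabled_at ASC LIMIT $2`;

const buggySql = `SELECT DISTINCT ON (pm.basket_item_id) pm.*
  FROM product_matches pm WHERE pm.pin_disabled_at IS NOT NULL
 ORDER BY pm.basket_item_id, pm.pin_disabled_at ASC LIMIT $2`;

console.log(looksLikeRankedFifoSql(fixedSql)); // true
console.log(looksLikeRankedFifoSql(buggySql)); // false
```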

## Pattern for the memory

The original test was tautological: a substring check that a buggy
implementation could also satisfy. The lesson: strengthen test
assertions until the buggy version FAILS them, rather than merely
asserting properties the correct version happens to exhibit. Memory
entry update follows.
…k fails on legacy product IDs (koala73#3630)

WORLDMONITOR-QM (13 Sentry events / 1 user, 4 visible Dodo webhook
retries every 30-60s):

  Webhook processing failed: [Error: Uncaught TypeError:
  dynamic module import unsupported
    at resolvePlanKey (subscriptionHelpers.ts:279)
    at handleSubscriptionActive (subscriptionHelpers.ts:385)
    at handler (webhookMutations.ts:103)]

`resolvePlanKey` did `await import("../config/productCatalog")` to
read `LEGACY_PRODUCT_ALIASES` only on the legacy-alias fallback path.
Convex's V8 isolate rejects first-party `await import(...)` with the
exact phrase above. The first-party static import for `PLAN_PRECEDENCE`
on line 12 already pulls from the same module — just merged
LEGACY_PRODUCT_ALIASES into that import.

User impact (until this deploys): every Dodo subscription webhook for
a user on a rotated/legacy product ID hits a 500. Dodo retries with
backoff until it gives up. The user's entitlement never updates after
their plan change — silent paid-but-not-provisioned drift. The bug
only fires on the alias path (`mapping` returns null on the
productPlans index lookup), so users on current product IDs are
unaffected.

Comment on the now-removed `await import` line documents the Convex
isolate restriction so a future reader doesn't reintroduce it.
Repo-wide grep for `await import(` in convex/ confirms this was the
only site.

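
The fallback path can be sketched as follows (hypothetical data shapes; the key point, matching the text, is that LEGACY_PRODUCT_ALIASES comes from the same static import that already supplies PLAN_PRECEDENCE, never from `await import(...)`):

```javascript
// Assumed shapes for illustration only:
const LEGACY_PRODUCT_ALIASES = { prod_old_123: 'prod_new_456' };
const productPlans = new Map([['prod_new_456', 'pro']]);

function resolvePlanKey(productId) {
  const direct = productPlans.get(productId);
  if (direct) return direct;
  // Alias path: only fires for rotated/legacy product IDs. In Convex this
  // lookup must use a top-of-module static import; first-party
  // `await import(...)` throws "dynamic module import unsupported".
  const alias = LEGACY_PRODUCT_ALIASES[productId];
  return alias ? productPlans.get(alias) ?? null : null;
}

console.log(resolvePlanKey('prod_new_456')); // 'pro'
console.log(resolvePlanKey('prod_old_123')); // 'pro' (via legacy alias)
console.log(resolvePlanKey('prod_unknown')); // null
```
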
P1 fixes:
- Cargo.toml: restore version to 2.8.0 with comment explaining the
  prior downgrade; prevents tooling from misinterpreting version order
- save_vault: guard empty app_data_dir on write (return error instead
  of writing secrets-vault.json to CWD)
- save_vault: add create_dir_all before write to prevent ENOENT on
  first-ever set_secret call with a fresh Linux app data dir
- save_vault: set file permissions to 0o600 on Unix (owner read/write
  only) after writing the fallback vault
- save_vault: add SECURITY NOTE documenting plaintext exposure risk

P2 fix:
- prefix unused keyring_err with _ to silence compiler warning

Fixes review comments on koala73#3619.
@fuleinist fuleinist requested a review from SebastienMelki as a code owner May 9, 2026 09:45

vercel Bot commented May 9, 2026

@fuleinist is attempting to deploy a commit to the World Monitor Team on Vercel.

A member of the Team first needs to authorize it.


Development

Successfully merging this pull request may close these issues.

Bundle runner counts graceful-failure exits as OK (status=OK records=)
