From 94e983fe656541f14a8dd140d4a8c5cf6a5ab5ba Mon Sep 17 00:00:00 2001
From: Alex Lebedev <alex.l@posthog.com>
Date: Mon, 22 Jun 2026 20:21:38 +0200
Subject: [PATCH] feat: Add limits on scouts.

---
 context/skills/self-driving/description.md    |  1 +
 .../self-driving/references/2-read-context.md |  6 +--
 .../self-driving/references/6-scouts.md       | 47 ++++++++++---------
 .../references/6b-tailor-scouts.md            |  2 +-
 4 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/context/skills/self-driving/description.md b/context/skills/self-driving/description.md
index 240746c..934f924 100644
--- a/context/skills/self-driving/description.md
+++ b/context/skills/self-driving/description.md
@@ -19,6 +19,7 @@ Each step file points to the next. Run them in order. **Start by reading `refere
 - **Never disable a source the user already enabled.** You only switch things on (and tune scouts off); existing enabled rows are someone's deliberate choice.
 - **Never enable a connected-tool source the user hasn't confirmed they use.** GitHub Issues, Linear, Zendesk, and pganalyze are ask-then-connect, never blind.
 - **Stay off the internal surfaces.** Don't call `signals-scout-emit-signal` or any scratchpad-write tool, and don't change a scout's `emit` flag or `run_interval_minutes` — on configs, this skill only flips `enabled`. **Canonical scout bodies are never edited.** New scout skills are created in exactly one place: step 6b, and only ones the user approved there.
+- **Keep the scout fleet small.** Every enabled scout is a recurring LLM spend. Step 6 enables only `signals-scout-general` plus the **one or two** specialists for the products this project uses most — never error tracking or session replay (those reach the inbox as native sources) — and step 6b adds **at most two** custom scouts. Everything else stays disabled.
 - **Batch your questions.** `wizard_ask` has a small per-run budget; one multi-select beats four yes/nos. Don't skip a step or drop a connector (e.g. Linear) or custom scouts setup to save calls.
 - **Decline goes first.** Every `wizard_ask` that offers choices must include a plain-language decline option (skip / none / "keep what's there"), and it must be the **first** option so it is the default highlight — an accidental `enter` then declines instead of committing the user to something. The **one exception is step 3's GitHub gate**: the run cannot proceed without GitHub, so there the affirmative ("Done — I've installed it") stays first and the decline ("I can't connect right now", which aborts) stays last.
 
diff --git a/context/skills/self-driving/references/2-read-context.md b/context/skills/self-driving/references/2-read-context.md
index 9a852d6..3d5f638 100644
--- a/context/skills/self-driving/references/2-read-context.md
+++ b/context/skills/self-driving/references/2-read-context.md
@@ -22,7 +22,7 @@ Load via `ToolSearch select:Read,Glob,Grep,mcp__posthog-wizard__signals-scout-pr
 
 1. **Read `./posthog-setup-report.md`.** It is ground truth for what the base integration instrumented **in this repo**: events, error tracking, feature flags. Do not re-derive what it already states. It is NOT authority over project-level facts — session replay in particular may be instrumented in another repo or via the snippet, so the report can rule replay in but never out (step 4 probes the server for that).
 
-2. **Call `signals-scout-project-profile-get`.** It returns products in use, connected integrations, warehouse sources, and the signal source configs split enabled/disabled — one call instead of four. **Tolerate failure**: it can 404 or error on a team without a profile yet. If it fails, fall back to the step-1 source list and the report; do not retry more than once and do not abort. **Note "profile unavailable" in your checklist** — a profile 404 is expected on a first-run team, so any later decision that relies only on the profile must record "unknown", not a confident negative.
+2. **Call `signals-scout-project-profile-get`.** It returns products in use, connected integrations, warehouse sources, and the signal source configs split enabled/disabled — one call instead of four. It also carries **relative usage magnitude**: `top_events` (per-event count + distinct users), `recent_activity` (edits per scope), and per-entity active counts (feature flags, experiments, surveys, dashboards). Capture a rough sense of **which products this project uses most** — step 6 enables a scout only for the one or two most-used products, so a usage ranking matters, not just a binary in/out. **Tolerate failure**: it can 404 or error on a team without a profile yet. If it fails, fall back to the step-1 source list and the report; do not retry more than once and do not abort. **Note "profile unavailable" in your checklist** — a profile 404 is expected on a first-run team, so any later decision that relies only on the profile must record "unknown", not a confident negative.
 
 3. **Server-side product usage.** The run prompt's "Project state" block is authoritative for the opt-ins it lists (session replay recording, exception autocapture, surveys): **opt-in ON = product enabled**, even if no data has arrived yet. Where the block says OFF/unknown and the repo gave no signal, spend ONE cheap probe each for usage evidence (tolerate 403/404 → record "unknown"):
    - `query-session-recordings-list` — any recording → replay in use
@@ -38,6 +38,6 @@ Load via `ToolSearch select:Read,Glob,Grep,mcp__posthog-wizard__signals-scout-pr
    - **Support**: does the team use PostHog support/conversations (per the profile)?
    - **Issue trackers**: any hints of Linear, Zendesk, or pganalyze (you will still ask in step 5 — hints only shape the question, they never authorize enabling).
 
-   Do NOT crawl the whole source tree. If a question can't be answered cheaply, record "unknown" and move on — unknowns default to asking the user about sources; for surface-specific scouts, an unconfirmed surface is not justification to keep them on (step 6 disables them without evidence).
+   Do NOT crawl the whole source tree. If a question can't be answered cheaply, record "unknown" and move on — unknowns default to asking the user about sources; for scouts, an unconfirmed surface won't rank among the most-used products, so step 6 won't enable its scout.
 
-5. **Write down your working checklist** (in your own notes, not a file): candidate native sources, candidate connected tools, candidate scout disables, GitHub status if the profile revealed it. Steps 4–6 consume this.
+5. **Write down your working checklist** (in your own notes, not a file): candidate native sources, candidate connected tools, which products this project uses most (drives step 6's pick: `general` + the 1–2 most-used specialists), GitHub status if the profile revealed it. Steps 4–6 consume this.
diff --git a/context/skills/self-driving/references/6-scouts.md b/context/skills/self-driving/references/6-scouts.md
index 6840aa5..e4e29d0 100644
--- a/context/skills/self-driving/references/6-scouts.md
+++ b/context/skills/self-driving/references/6-scouts.md
@@ -4,7 +4,7 @@ next_step: 6b-tailor-scouts.md
 
 # Step 6 — Configure the scout fleet
 
-Scouts are the pull side of Signals: scheduled agents that scan the project on an interval and emit findings as `signals_scout` / `cross_source_issue` signals (which step 4's scout gate lets into the inbox). Materialize the fleet, then switch off the scouts whose product surface this project doesn't have.
+Scouts are the pull side of Signals: scheduled agents that scan the project on an interval and emit findings as `signals_scout` / `cross_source_issue` signals (which step 4's scout gate lets into the inbox). Every enabled scout is a recurring LLM spend — it costs a full run every tick even when it finds nothing — so the fleet is kept **deliberately small**: the `general` scout, plus the **one or two specialists** for the products this project uses most. Everything else is disabled.
 
 ## Status
 
@@ -24,47 +24,48 @@ Load via `ToolSearch select:mcp__posthog-wizard__signals-scout-config-sync,mcp__
 
    **Soft-degrade if the tool is missing or fails**: fall back to `signals-scout-config-list`. If that returns rows, tune those. If it returns nothing, the fleet hasn't been materialized yet — record a follow-up ("the scout fleet materializes automatically within ~30 minutes; tune it later in PostHog or re-run this setup") and continue to step 7. **Not an abort.**
 
-2. **Tune — classify every scout the sync returned; don't assume a fixed list.** The fleet is seeded from posthog and grows over time (it's ~19 scouts today), so always work from the rows `signals-scout-config-sync` actually returned, not a hardcoded set. For each scout, read its name/description and ask **"does this project have the surface this scout watches?"** — that sorts it into one of two buckets:
+2. **Decide the enabled set — the whole point of this step is to enable FEW scouts, not many.** Work from the rows `signals-scout-config-sync` actually returned (the fleet grows over time — ~19 scouts today — so never hardcode a list). The enabled set has exactly three parts:
 
-   **Always-on (cross-product).** Its surface is "any project with data," so it self-closes cheaply when there's nothing to say. Keep enabled. Examples (illustrative, not exhaustive):
+   **(a) `general` — always enabled.** `signals-scout-general` watches cross-product correlations and the surfaces no specialist covers; it self-closes cheaply when there's nothing to say. Keep it on for every project.
 
-   - `signals-scout-general` — cross-product correlations and uncovered surfaces
-   - `signals-scout-anomaly-detection` — anomalies in whatever time series exist
-   - `signals-scout-observability-gaps` — events with no insight coverage
-   - `signals-scout-health-checks` — PostHog setup health
-   - `signals-scout-inbox-validation` — whether shipped fixes actually held
+   **(b) Never enable the `error-tracking` or `session-replay` scouts.** Step 4 already enables error tracking and session replay as native **sources** — their findings reach the inbox through that pipeline, so a scout on the same surface only duplicates it. Disable `signals-scout-error-tracking` and `signals-scout-session-replay` unconditionally, regardless of evidence. This is an **intentional** exclusion, not an evidence gap, so do **not** record a re-enable follow-up for them — note them as "covered by the native source".
 
-   **Surface-specific (conditional).** Tied to a product or surface a project may not have. **Enable ONLY when step 2 found positive evidence the surface is in use** — evidence on EITHER side counts: the repo scan OR the server-side state (project-state opt-ins and usage probes). A product enabled at the project level is evidence even when this repo shows nothing. No evidence → disable. Examples of surface → evidence (illustrative, not exhaustive):
+   **(c) One or two specialists — for the products this project uses MOST.** This is a judgment call, not a checklist: weigh ALL the step-2 evidence together — the profile's `top_events` (volume + distinct users), recent activity, the active counts for feature flags / experiments / surveys / dashboards, plus any repo signals — and pick the **one or two** product surfaces that are most actually used, then enable each one's scout. The candidate pool is the entire fleet **except** `general` and the two excluded in (b); it includes both the surface-specific scouts and the remaining cross-product ones:
 
-   | Scout | Enable only with evidence of |
+   | Scout | Specialist for |
    |---|---|
-   | `signals-scout-error-tracking` | error tracking in use — exception autocapture ON, error issues exist, or the repo instruments it (the same evidence step 4 uses for the error-tracking source) |
-   | `signals-scout-session-replay` | session recording enabled (opt-in ON or recordings exist) |
-   | `signals-scout-product-analytics` | funnels / retention / lifecycle insights or product events in use |
+   | `signals-scout-product-analytics` | funnels / retention / lifecycle insights or heavy product-event usage |
    | `signals-scout-web-analytics` | web traffic / pageviews with referrer or UTM tracking |
-   | `signals-scout-feature-flags` | feature flags in use (frontend or backend) |
-   | `signals-scout-surveys` | surveys opt-in ON or surveys found (step 2) |
+   | `signals-scout-feature-flags` | feature flags in active use (frontend or backend) |
+   | `signals-scout-surveys` | surveys in use |
    | `signals-scout-revenue-analytics` | a payment SDK / revenue data |
    | `signals-scout-ai-observability` | `$ai_*` events / LLM usage |
    | `signals-scout-logs` | the PostHog logs product in use |
    | `signals-scout-csp-violations` | CSP reporting configured |
    | `signals-scout-experiments` | active A/B experiments |
-   | `signals-scout-customer-analytics` | group / accounts analytics (B2B), not a pure B2C app |
+   | `signals-scout-customer-analytics` | group / accounts analytics (B2B) |
    | `signals-scout-data-pipelines` | CDP destinations, batch exports, or hog flows |
    | `signals-scout-replay-vision` | Replay Vision scanners configured |
+   | `signals-scout-anomaly-detection` | (cross-product) anomalies in whatever time series exist |
+   | `signals-scout-observability-gaps` | (cross-product) events with no insight coverage |
+   | `signals-scout-health-checks` | (cross-product) PostHog setup health |
+   | `signals-scout-inbox-validation` | (cross-product) whether shipped fixes actually held |
 
-   **A scout neither list names** (posthog keeps adding them): classify it by the same question — read its description and decide whether its surface is product-agnostic (→ always-on) or tied to a surface you must confirm (→ conditional, evidence required). When unsure whether a surface-specific scout's surface exists, treat that as no evidence.
+   Rules for the pick:
+   - **At most two.** Even if three or more surfaces look used, keep only the two most-used. Enabling more re-creates the cost problem this step exists to prevent.
+   - **At least one.** Always end with a specialist enabled. If no product surface clearly stands out — e.g. the only products in use are error tracking / session replay (excluded in (b)), or the profile was unavailable and nothing is rankable — **fall back to one universal cross-product scout** (`signals-scout-anomaly-detection` or `signals-scout-health-checks`) as the stand-in. Avoid `signals-scout-inbox-validation` as the fallback on a fresh setup — there are no shipped fixes for it to validate yet.
+   - **A scout the table doesn't name** (posthog keeps adding them): treat it as a specialist candidate — read its description, judge whether its surface is among this project's most-used, and enable it only if it earns one of the ≤2 slots.
 
-   **"Unknown" is not evidence → disable the scout.** Unlike a dormant warehouse responder (gated on a sync, so it never fires for free), a scout runs on its schedule and costs a full LLM run every tick even when it finds nothing — so never pay for a surface you can't confirm exists. For every conditional scout you disable, record a re-enable follow-up so the user can switch it on if they do use that surface (e.g. "enable `signals-scout-logs` in PostHog if you use the logs product").
+3. **Disable every scout you did NOT enable** in (a)–(c) — this is now most of the fleet. Disable via `signals-scout-config-update` with the config `id` and `{ enabled: false }` — **nothing else**. Don't touch `emit` (dry-run posture) or `run_interval_minutes`; the defaults are correct. A failed update is a follow-up, not an abort.
 
-3. Disable via `signals-scout-config-update` with the config `id` and `{ enabled: false }` — **nothing else**. Don't touch `emit` (dry-run posture) or `run_interval_minutes`; defaults are correct for a fresh fleet. A failed update is a follow-up, not an abort.
+   For each **surface-specific** scout you disabled, record a re-enable follow-up so the user can switch it on if they do use that surface later (e.g. "enable `signals-scout-logs` in PostHog if you use the logs product"). The error-tracking / session-replay disables are intentional (see (b)) — note them as "covered by the native source", not as a re-enable follow-up.
 
-4. **Show the result.** This step asks the user nothing, so the only in-run visibility is the status line — after tuning, emit one with the outcome (short scout names, no `signals-scout-` prefix):
+4. **Show the result.** This step asks the user nothing, so the only in-run visibility is the status line — after tuning, emit one naming the enabled set (short names, no `signals-scout-` prefix):
 
 ```
-[STATUS] Scout fleet: 12 active, disabled: ai-observability, revenue-analytics, logs, csp-violations, customer-analytics, data-pipelines, experiments, replay-vision
+[STATUS] Scout fleet: 3 active (general, product-analytics, feature-flags); 16 disabled
 ```
 
-(Adjust counts and names to the actual fleet the sync returned and the decisions you made — fleet size varies as posthog adds scouts. If nothing was disabled, say "N active, none disabled".)
+(Adjust counts and names to the actual fleet and your decisions — the enabled set is always `general` + the 1–2 specialists, so "2 active" or "3 active" is expected; error-tracking and session-replay are deliberately among the disabled.)
 
-Fresh configs have never run, so they're due immediately — the first scans fire on the next coordinator tick, within ~30 minutes. Record per-scout decisions (kept / disabled + why) for the report.
+Fresh configs have never run, so they're due immediately — the first scans fire on the next coordinator tick, within ~30 minutes. Record per-scout decisions (enabled / disabled + why) for the report.
diff --git a/context/skills/self-driving/references/6b-tailor-scouts.md b/context/skills/self-driving/references/6b-tailor-scouts.md
index 61b69ba..88a155d 100644
--- a/context/skills/self-driving/references/6b-tailor-scouts.md
+++ b/context/skills/self-driving/references/6b-tailor-scouts.md
@@ -28,7 +28,7 @@ Load via `ToolSearch select:mcp__posthog-wizard__llma-skill-get,mcp__posthog-wiz
 
 2. **Do the gap analysis — this is the thinking step, take it seriously.** Lay the project evidence (the setup report's event taxonomy above all, plus the step-2 checklist: funnel structure, payment/LLM/survey surfaces, warehouse sources, integrations) against what the canonical fleet already watches. For each candidate surface ask, in order:
    - **Is it watchable?** Concrete events with names you can list, a funnel with ordered steps, a domain loop with a success/failure pair. "It's a web app" is not a surface.
-   - **Is it uncovered?** A canonical scout that step 6 kept enabled may already own it — error bursts belong to `signals-scout-error-tracking`, generic anomalies to `signals-scout-anomaly-detection`. A custom scout that duplicates an enabled canonical adds noise, not coverage.
+   - **Is it uncovered?** Three things can already own a surface, and a custom scout that duplicates any of them adds noise, not coverage: (1) a canonical scout step 6 kept enabled — the 1–2 specialists or `signals-scout-general`, which sweeps cross-product surfaces every run (e.g. generic anomalies belong to `signals-scout-anomaly-detection` if it was picked); (2) a **native source** — error tracking and session replay are consumed as sources in step 4, so never propose a custom scout for error bursts or replay analysis even though their canonical scouts are now disabled. If the surface is only watched by a canonical scout step 6 *disabled* (not `general`, not a native source), it is genuinely uncovered and fair game.
    - **Would its scout pass the quality bar?** You must be able to name its signal-vs-noise discriminator and 2–4 concrete explore patterns *before* proposing it. If you can't, the surface isn't ready for a scout — record it as a report note instead.
 
    Typical shapes that survive all three filters: the product's core funnel (creation → completion → conversion), a domain job pipeline with success/failure events, a critical third-party dependency the events expose (e.g. an external API search that can silently degrade). **Propose at most two custom scouts — never more, even if more surfaces look watchable.** Zero is a perfectly good outcome and one or two is the norm; if three or more look worthwhile, the filters were too loose — keep only the two highest-value ones and record the rest as report notes. Every scout is a recurring scheduled LLM spend — every tick costs a full run even when it's quiet — so each must earn its keep, and the hard cap also keeps the proposal readable in the terminal, where each scout needs room for its explanation.