From 214d393771832ce53e2ee35b176288d91b6413e2 Mon Sep 17 00:00:00 2001
From: Nathan Oyler
Date: Fri, 27 Mar 2026 12:19:51 -0700
Subject: [PATCH] feat(skill-creator): add enrichment workflow for autonomous
 skill improvement
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a new mode of operation to skill-creator: a 6-phase AUDIT→RESEARCH→ENRICH→
TEST→EVALUATE→PUBLISH loop that autonomously improves thin skills by adding
domain knowledge as reference content, then validates improvement via blind A/B
eval before committing.

Reuses existing eval infrastructure (run_eval.py, comparator.md,
aggregate_benchmark.py). Max 3 iterations with rotating research angles.

New files:
- references/enrichment-workflow.md: deep reference for all 6 phases plus retry logic
- references/domain-research-targets.md: per-domain lookup table of
  primary/secondary sources and extraction targets (Go, Python, K8s, TypeScript,
  React, Testing, Security, Perses, Code Review, Git/PR)
---
 skills/skill-creator/SKILL.md                 |  75 +++++
 .../references/domain-research-targets.md     | 281 +++++++++++++++++
 .../references/enrichment-workflow.md         | 289 ++++++++++++++++++
 3 files changed, 645 insertions(+)
 create mode 100644 skills/skill-creator/references/domain-research-targets.md
 create mode 100644 skills/skill-creator/references/enrichment-workflow.md

diff --git a/skills/skill-creator/SKILL.md b/skills/skill-creator/SKILL.md
index 4a525c8..3e1119f 100644
--- a/skills/skill-creator/SKILL.md
+++ b/skills/skill-creator/SKILL.md
@@ -353,6 +353,76 @@ the best description by test-set score to avoid overfitting.
 
 ---
 
+## Enriching existing skills
+
+Use this mode when a skill already exists but produces shallow, generic output — it
+has thin `references/`, no `scripts/`, and passes an eval by luck rather than
+by containing domain knowledge that changes behavior.
+
+Indicators this mode is appropriate:
+- `references/` has fewer than 2 files, or none at all
+- No `scripts/` directory
+- Eval outputs look plausible but lack domain idioms, concrete examples, or
+  checklists specific to the skill's domain
+- The skill passes a test because the model already knows the domain, not because
+  the skill contributes anything
+
+### The enrichment loop
+
+Six phases, max 3 iterations before escalating to the user:
+
+**AUDIT** — measure the skill's current depth before changing anything.
+Count `references/`, `scripts/`, `agents/` files. Run the skill against 2-3
+realistic prompts. Save outputs to `enrichment-workspace/baseline/`.
+See `references/enrichment-workflow.md` → AUDIT phase for the exact checklist.
+
+**RESEARCH** — find domain knowledge the skill is missing.
+Read the skill's SKILL.md and existing references to identify gaps. Search for
+best practices, pattern catalogs with before/after examples, common mistakes,
+and validation criteria. Where to look depends on the skill's domain — consult
+`references/domain-research-targets.md` for a lookup table of primary and
+secondary sources per domain.
+
+**ENRICH** — add the research as reference content.
+Create new files in the skill's `references/` directory. Add deterministic
+`scripts/` where operations are repeatable. Update SKILL.md only with one-line
+pointers to the new references — keep the orchestrator lean. Focus on content
+that changes behavior: concrete examples beat abstract advice.
+See `references/enrichment-workflow.md` → ENRICH phase for structuring guidance.
+
+**TEST** — A/B test the enriched skill against baseline.
+Write 2-3 realistic prompts that exercise the skill's domain. Use
+`scripts/run_eval.py` to run enriched vs baseline on the same prompts. Both
+runs use identical inputs. Save outputs to `enrichment-workspace/iteration-N/`.
+
+**EVALUATE** — dispatch blind comparators on each test prompt.
+Use `agents/comparator.md` (already bundled in this skill). Comparator scores on
+depth, accuracy, actionability, and domain idioms without knowing which version
+is which. If enriched wins 2/3 or better → PUBLISH. If tie or loss → run
+`agents/analyzer.md` to understand why, then RETRY with a different research angle.
+See `references/enrichment-workflow.md` → EVALUATE phase for scoring details.
+
+**PUBLISH** — commit validated improvements.
+Create branch `feat/enrich-{skill-name}`, commit references + scripts + SKILL.md
+pointer updates, push, create PR. See `references/enrichment-workflow.md` →
+PUBLISH phase for the exact commit/PR flow.
+
+### Retry logic
+
+Each retry uses a different research angle to avoid retreading the same ground:
+
+| Iteration | Research angle |
+|-----------|---------------|
+| 1 | Official docs + canonical best practices |
+| 2 | Common mistakes + anti-patterns (what goes wrong) |
+| 3 | Advanced patterns + edge cases (what experts know) |
+
+After 3 failed iterations, report to the user: summarize what was tried, what the
+evaluator found lacking, and ask whether to try a different approach or accept the
+current state.
+
+---
+
 ## Bundled agents
 
 The `agents/` directory contains prompts for specialized subagents used by this
 skill. Read them when you need to spawn the relevant subagent.
 - `references/complexity-tiers.md` — Skill examples by complexity tier
 - `references/workflow-patterns.md` — Reusable phase structures and gate patterns
 - `references/error-catalog.md` — Common skill creation errors with solutions
+- `references/enrichment-workflow.md` — Deep reference for the enrichment loop:
+  AUDIT checklist, RESEARCH strategy, ENRICH structuring, TEST/EVALUATE/PUBLISH phases,
+  and retry logic in detail
+- `references/domain-research-targets.md` — Lookup table: given a skill's domain,
+  which primary sources, secondary sources, and extraction targets to use during RESEARCH
 
 ---
 
diff --git a/skills/skill-creator/references/domain-research-targets.md b/skills/skill-creator/references/domain-research-targets.md
new file mode 100644
index 0000000..600b3f1
--- /dev/null
+++ b/skills/skill-creator/references/domain-research-targets.md
@@ -0,0 +1,281 @@
+# Domain Research Targets
+
+Lookup table for the enrichment loop RESEARCH phase. Given a skill's domain, this
+file tells you where to look for knowledge, what authority each source carries, and
+what to extract from it.
+
+Format per entry:
+- **Primary sources** — official docs, specs, canonical reference material (highest authority)
+- **Secondary sources** — blogs, talks, books, community guides (patterns and examples)
+- **Extract** — what form of knowledge to pull out (checklists, before/after, decision trees)
+
+---
+
+## Go general (go-testing, go-concurrency, go-error-handling, go-anti-patterns, go-code-review)
+
+**Primary sources**
+- [Effective Go](https://go.dev/doc/effective_go) — canonical idioms; extract named
+  patterns with rationale
+- [Go specification](https://go.dev/ref/spec) — authoritative on language semantics;
+  useful for edge cases and subtle behavior
+- [Go standard library source](https://cs.opensource.google/go/go) — how the stdlib
+  itself applies patterns; extract struct design, error handling, and interface choices
+- [Go Blog](https://go.dev/blog) — official in-depth articles; especially errors,
+  modules, generics, and concurrency posts
+- [Go wiki: CodeReviewComments](https://github.com/golang/go/wiki/CodeReviewComments) —
+  community-maintained list of Go code review feedback; extract as checklist
+- [Go wiki: CommonMistakes](https://github.com/golang/go/wiki/CommonMistakes) —
+  extract directly as anti-pattern catalog
+
+**Secondary sources**
+- [Go Proverbs](https://go-proverbs.github.io) (Rob Pike) — memorable heuristics;
+  useful for decision criteria
+- Dave Cheney's blog (dave.cheney.net) and talks — especially error handling, interfaces,
+  and performance; extract before/after examples
+- [100 Go Mistakes](https://100go.co) (Teiva Harsanyi) — structured mistake catalog;
+  extract mistake + root cause + fix format
+- Go 1.22+ release notes — new patterns and deprecations worth knowing
+
+**Extract**
+- Checklist: idiomatic Go review (interface size, error wrapping, goroutine hygiene)
+- Before/after: common rewrites (bare error returns → wrapped; goroutine leak → context cancel)
+- Decision tree: when to use channels vs mutexes, when to define an interface vs use concrete type
+- Anti-pattern catalog: goroutine leaks, error shadowing, interface pollution, unnecessary abstractions
+
+---
+
+## Go SAPCC (go-sapcc-conventions)
+
+This skill is already rich — it was built from extracted PR review comments from
+sapcc/keppel and sapcc/go-bits. Enrichment is low-value unless new PR review
+patterns have accumulated.
+
+**When to enrich**: mine new merged PRs from sapcc/keppel and sapcc/go-bits since
+the skill's last update date. Look for reviewer comments that establish new patterns
+not yet in the skill's references.
+
+**Primary source**: sapcc/keppel PR review history (via `skills/skill-creator/scripts/` pr-miner)
+
+**Extract**: reviewer comment → pattern name → before/after example, same format as
+existing sapcc references
+
+---
+
+## Python (python-quality-gate)
+
+**Primary sources**
+- [PEP 8](https://peps.python.org/pep-0008/) — style; extract checklist of the
+  non-obvious rules (the obvious ones are already in every model's training)
+- [PEP 484](https://peps.python.org/pep-0484/) — type hints; extract annotation patterns
+- [PEP 526](https://peps.python.org/pep-0526/) — variable annotations
+- [PEP 3107](https://peps.python.org/pep-3107/) — function annotations
+- [Python docs: typing module](https://docs.python.org/3/library/typing.html) —
+  extract: when to use Protocol vs ABC, TypeVar constraints, overload patterns
+- [mypy docs](https://mypy.readthedocs.io) — extract: common type errors and their
+  fixes, strict mode implications
+
+**Secondary sources**
+- [ruff rules reference](https://docs.astral.sh/ruff/rules/) — every rule has a
+  rationale; extract the non-obvious ones as checklist
+- Real Python tutorials — extract before/after examples from "Pythonic" articles
+- Hynek Schlawack's blog — especially async and attrs patterns
+
+**Extract**
+- Checklist: pre-commit quality gate (ruff, mypy, bandit checks that matter most)
+- Before/after: common Python anti-patterns with idiomatic rewrites
+- Decision tree: when to use dataclass vs TypedDict vs NamedTuple vs attrs
+- Anti-pattern catalog: mutable default arguments, broad except, type: ignore abuse
+
+---
+
+## Kubernetes (kubernetes-debugging, kubernetes-security)
+
+**Primary sources**
+- [Kubernetes official docs](https://kubernetes.io/docs/) — especially Concepts and
+  Tasks sections; extract patterns, not API reference
+- [RBAC best practices](https://kubernetes.io/docs/concepts/security/rbac-good-practices/)
+- [Network Policy docs](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
+- [CIS Kubernetes Benchmark](https://www.cisecurity.org/benchmark/kubernetes) —
+  extract as security checklist with severity levels
+- [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
+
+**Secondary sources**
+- *Kubernetes Patterns* (Ibryam & Huss) — extract named patterns with use-case criteria
+- Learnk8s blog — extract debugging decision trees and before/after manifests
+- [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/) —
+  extract as debugging command reference
+
+**Extract**
+- Checklist: security hardening (RBAC, network policies, pod security, secret management)
+- Decision tree: debugging pod failures (CrashLoopBackOff → ImagePullBackOff → OOMKilled flow)
+- Before/after: insecure manifest → hardened manifest examples
+- Anti-pattern catalog: over-privileged service accounts, missing resource limits, secret in env vars
+
+---
+
+## TypeScript (typescript-check)
+
+**Primary sources**
+- [TypeScript handbook](https://www.typescriptlang.org/docs/handbook/) — extract
+  non-obvious type patterns: conditional types, mapped types, template literals
+- [TypeScript release notes](https://www.typescriptlang.org/docs/handbook/release-notes/overview.html)
+  — new features per version; extract patterns introduced in 5.x
+- [@types conventions](https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/README.md)
+
+**Secondary sources**
+- Matt Pocock (total-typescript.com) — extract: advanced type patterns with before/after,
+  common TS mistakes with fixes
+- [TypeScript Deep Dive](https://basarat.gitbook.io/typescript/) — extract anti-patterns section
+
+**Extract**
+- Checklist: strict mode implications and what each flag catches
+- Before/after: `any` abuse → proper generics, type assertion abuse → type guards
+- Decision tree: when to use `interface` vs `type`, `unknown` vs `any`, generics vs overloads
+- Anti-pattern catalog: type assertions without guards, overly broad union types, enum misuse
+
+---
+
+## React / Next.js (distinctive-frontend-design, threejs-builder)
+
+**Primary sources**
+- [React docs](https://react.dev) — especially the "Thinking in React" and hooks
+  reference sections; extract composability patterns
+- [Next.js docs](https://nextjs.org/docs) — extract: App Router patterns, server
+  component vs client component decision criteria, data fetching patterns
+
+**Secondary sources**
+- Vercel blog — extract: App Router migration patterns, performance optimization cases
+- Kent C. Dodds (kentcdodds.com) — extract: compound component pattern, custom hooks
+  patterns, testing philosophy
+- Josh Comeau (joshwcomeau.com) — extract: CSS-in-JS patterns, animation approaches
+
+**Extract**
+- Decision tree: server component vs client component selection criteria
+- Before/after: common React anti-patterns (prop drilling → context, useEffect abuse → derived state)
+- Checklist: performance review (unnecessary re-renders, missing keys, large bundle items)
+- Pattern catalog: compound components, render props, custom hooks with clear interfaces
+
+---
+
+## Testing (test-driven-development, testing-anti-patterns, e2e-testing)
+
+**Primary sources**
+- [Playwright docs](https://playwright.dev/docs/intro) — extract: Page Object Model
+  structure, locator best practices, network interception patterns
+- [pytest docs](https://docs.pytest.org) — extract: fixture patterns, parametrize,
+  conftest scope decisions
+
+**Secondary sources**
+- Kent C. Dodds — Testing Trophy and [testing-library principles](https://testing-library.com/docs/guiding-principles)
+  — extract: what to test at each level
+- *Growing Object-Oriented Software, Guided by Tests* (Freeman & Pryce) — extract:
+  outside-in TDD pattern, listening to tests as design signal
+- *xUnit Test Patterns* (Meszaros) — extract: test smell catalog with names and fixes
+- Martin Fowler's [bliki on test doubles](https://martinfowler.com/bliki/TestDouble.html)
+
+**Extract**
+- Checklist: test quality review (one assertion focus, arrange-act-assert, test isolation)
+- Anti-pattern catalog with names: Mystery Guest, Eager Test, Fragile Test, Slow Test
+- Decision tree: unit vs integration vs E2E for a given scenario
+- Before/after: brittle selector → resilient locator, over-mocked test → integrated test
+
+---
+
+## Security (security-threat-model)
+
+**Primary sources**
+- [OWASP Top 10](https://owasp.org/www-project-top-ten/) — extract each category as
+  a named vulnerability with detection criteria and mitigation checklist
+- [OWASP Cheat Sheets](https://cheatsheetseries.owasp.org) — extract checklists per
+  topic (SQL injection, XSS, CSRF, auth, etc.)
+- [CWE Top 25](https://cwe.mitre.org/top25/) — extract as severity-ranked catalog
+- [NIST guidelines](https://csrc.nist.gov/publications) — especially SP 800-53 controls
+
+**Secondary sources**
+- PortSwigger Web Security Academy — extract: attack pattern → detection → fix format
+- Troy Hunt's blog — extract: real-world mistake catalog
+
+**Extract**
+- Checklist: threat modeling prompts (per STRIDE category)
+- Before/after: vulnerable code → remediated code for each OWASP Top 10 item
+- Decision tree: severity classification (Critical/High/Medium/Low with criteria)
+- Anti-pattern catalog: hard-coded secrets, overly permissive CORS, missing auth checks
+
+---
+
+## Perses (perses-*)
+
+**Primary sources**
+- [Perses docs](https://perses.dev/docs/) — extract: dashboard definition spec,
+  plugin architecture, variable interpolation formats
+- [Perses GitHub wiki](https://github.com/perses/perses/wiki) — supplementary patterns
+- [PromQL docs](https://prometheus.io/docs/prometheus/latest/querying/basics/) —
+  extract: query optimization patterns, recording rules, alerting rule structure
+
+**Secondary sources**
+- Perses GitHub issues and PR discussions — extract: community-documented gotchas
+  and workarounds
+
+**Extract**
+- Checklist: dashboard quality (variable usage, panel alignment, datasource scoping)
+- Before/after: raw PromQL → optimized PromQL with recording rules
+- Decision tree: when to use global vs project vs dashboard scope for variables
+- Anti-pattern catalog: hardcoded datasource names, missing variable fallbacks, over-complex queries
+
+---
+
+## Voice skills (create-voice, voice-writer, voice-calibrator)
+
+These skills are already rich — they have deterministic Python validators and
+wabi-sabi calibration built in. Enrichment is rarely warranted.
+
+**When to enrich**: if the banned-pattern list in `voice_validator.py` needs
+expansion, or a new voice profile introduces patterns the existing rules don't cover.
+Mine the validator's false-positive/false-negative log if one exists.
+
+---
+
+## Code review (systematic-code-review, parallel-code-review)
+
+**Primary sources**
+- [Google Engineering Practices: Code Review](https://google.github.io/eng-practices/review/)
+  — extract: reviewer standards, author responsibilities, speed guidelines
+- [Conventional Comments](https://conventionalcomments.org) — label taxonomy for
+  review comments (nitpick, suggestion, issue, question, etc.)
+
+**Secondary sources**
+- Michaela Greiler (michaelagreiler.com) — extract: research-backed review effectiveness
+  checklist, anti-patterns in reviewer behavior
+- SmartBear Code Review research papers — extract: optimal review size, defect density
+  findings as concrete thresholds
+
+**Extract**
+- Checklist: what to check at each review tier (security, logic, style, naming)
+- Before/after: vague review comment → actionable comment with label
+- Decision tree: block vs request-changes vs comment vs approve criteria
+- Anti-pattern catalog: rubber-stamping, nitpick overload, missing context in comments
+
+---
+
+## Git / PR workflows (pr-pipeline, pr-sync, git-commit-flow)
+
+**Primary sources**
+- [Conventional Commits spec](https://www.conventionalcommits.org) — extract:
+  type taxonomy, breaking change notation, footer format
+- [GitHub API docs](https://docs.github.com/en/rest) — extract: PR creation fields,
+  check run status, review request patterns
+- [gh CLI reference](https://cli.github.com/manual/) — extract: useful command
+  combinations for PR workflows
+
+**Secondary sources**
+- [Git best practices](https://sethrobertson.github.io/GitBestPractices/) — extract:
+  commit hygiene rules
+- [Chris Beams: How to Write a Git Commit Message](https://cbea.ms/git-commit/) —
+  extract: 7 rules as checklist
+
+**Extract**
+- Checklist: pre-PR commit hygiene (message format, squash policy, branch naming)
+- Before/after: bad commit message → conventional commit message
+- Decision tree: squash vs merge vs rebase for different PR types
+- Anti-pattern catalog: fixup commits left in history, force-push to shared branch,
+  PR too large to review

diff --git a/skills/skill-creator/references/enrichment-workflow.md b/skills/skill-creator/references/enrichment-workflow.md
new file mode 100644
index 0000000..2a15619
--- /dev/null
+++ b/skills/skill-creator/references/enrichment-workflow.md
@@ -0,0 +1,289 @@
+# Enrichment Workflow
+
+Deep reference for the enrichment loop described in SKILL.md → "Enriching existing skills".
+Read this file when executing any phase of the loop.
+
+---
+
+## AUDIT phase
+
+Goal: establish a factual baseline before touching anything.
+
+**Depth inventory** — count what the skill currently has:
+
+```
+references/   count files (target: 3+)
+scripts/      count files (target: 1+ deterministic tools)
+agents/       count files (target: 0–2 bundled subagents)
+```
+
+Record counts in `enrichment-workspace/audit.json`:
+
+```json
+{
+  "skill": "skill-name",
+  "audit_date": "YYYY-MM-DD",
+  "references_count": 0,
+  "scripts_count": 0,
+  "agents_count": 0,
+  "depth_verdict": "thin | adequate | rich",
+  "gaps": ["no pattern catalog", "no before/after examples", "no validation checklist"]
+}
+```
+
+**Baseline runs** — run the skill against 2–3 prompts that exercise its core domain.
+Use `scripts/run_eval.py` with `--skill-path` pointing at the skill directory.
+Save each output to `enrichment-workspace/baseline/eval-N/output.md`.
+
+Capture what is missing from baseline outputs: generic advice, no domain idioms,
+no concrete examples, no checklists, no error scenarios — these become research targets.
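The depth inventory above can be sketched in a few lines of Python — a minimal sketch, assuming the conventional `references/`/`scripts/`/`agents/` layout; `audit_skill` is illustrative, not a bundled script:

```python
# Illustrative depth inventory: count bundled files and derive a depth verdict.
# Assumes the conventional skill layout; thresholds mirror the Depth verdict rules.
import json
from pathlib import Path

def audit_skill(skill_dir: str, audit_date: str) -> dict:
    """Count references/scripts/agents files and classify the skill's depth."""
    root = Path(skill_dir)
    counts = {
        name: sum(1 for p in (root / name).glob("*") if p.is_file())
        if (root / name).is_dir() else 0
        for name in ("references", "scripts", "agents")
    }
    if counts["references"] < 2 and counts["scripts"] == 0:
        verdict = "thin"      # enrichment warranted
    elif counts["references"] >= 4 and counts["scripts"] >= 1 and counts["agents"] >= 1:
        verdict = "rich"      # enrichment unlikely to move the needle
    else:
        verdict = "adequate"  # enrichment may help; evaluate carefully
    return {
        "skill": root.name,
        "audit_date": audit_date,
        "references_count": counts["references"],
        "scripts_count": counts["scripts"],
        "agents_count": counts["agents"],
        "depth_verdict": verdict,
        "gaps": [],  # filled in by hand after reading the baseline outputs
    }

if __name__ == "__main__":
    report = audit_skill("skills/target-skill", "YYYY-MM-DD")
    Path("enrichment-workspace").mkdir(exist_ok=True)
    Path("enrichment-workspace/audit.json").write_text(json.dumps(report, indent=2))
```

The `gaps` list stays empty here deliberately: it records human judgment from reading the baseline outputs, not something the counter can derive.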
+
+**Depth verdict**:
+- `thin`: references < 2 files AND no scripts → enrichment warranted
+- `adequate`: 2–4 references OR 1+ scripts → enrichment may help, evaluate carefully
+- `rich`: 4+ references AND scripts AND agents → enrichment unlikely to move needle;
+  consider description optimization instead
+
+---
+
+## RESEARCH phase
+
+Goal: find knowledge that will change the skill's output behavior — not summaries,
+but patterns, checklists, before/after examples, and common mistakes.
+
+**Step 1 — read for gaps**
+Read the skill's SKILL.md and all files in its `references/` directory.
+List what knowledge would change a model's output if it had it:
+- Specific checklists the domain uses
+- Before/after examples of correct vs incorrect patterns
+- Common mistakes practitioners make
+- Decision criteria for choosing between approaches
+- Validation rules that are non-obvious
+
+**Step 2 — consult domain-research-targets.md**
+Look up the skill's domain in `references/domain-research-targets.md`.
+It lists primary sources (highest authority), secondary sources (patterns and
+examples), and what to extract from each.
+
+**Step 3 — gather knowledge**
+For each source in the domain table:
+
+*Official docs*: Read methodology sections, not API reference. Extract:
+- Named patterns with rationale
+- Anti-patterns the docs explicitly warn against
+- Decision trees or "when to use X vs Y" guidance
+
+*Secondary sources* (blogs, talks, books): Extract:
+- Before/after examples (these are gold — models respond to concrete diffs)
+- Common mistake catalogs with explanation of why they are mistakes
+- Checklists practitioners actually use
+
+*learning.db* (the toolkit's retro database): Run:
+```bash
+python3 ~/.claude/scripts/retro_query.py --domain "skill-domain-keyword" --min-confidence 0.7
+```
+This surfaces patterns learned from past sessions in this domain.
+
+**What to capture** (format for reference files):
+
+````markdown
+## Pattern: Name
+
+**When**: [situation where this applies]
+**Do this**:
+```code
+good example
+```
+**Not this**:
+```code
+bad example
+```
+**Why**: [one-sentence rationale]
+````
+
+Aim to collect at minimum:
+- 5–10 named patterns with before/after examples
+- 1 checklist of 8–15 items practitioners use
+- 5–10 common mistakes with explanations
+- Any domain-specific validation criteria
+
+---
+
+## ENRICH phase
+
+Goal: add research content to the skill in a form that changes behavior at execution time.
+
+**Where content goes**:
+
+| Content type | Target location |
+|---|---|
+| Pattern catalog with before/after | `references/patterns.md` (new) |
+| Checklist practitioners use | `references/checklist.md` (new) |
+| Common mistakes | `references/anti-patterns.md` (new, or add to existing) |
+| Validation criteria | `references/validation-criteria.md` (new) |
+| Repeatable deterministic operation | `scripts/tool-name.py` (new) |
+
+**Structuring reference files**:
+- Lead with the most behaviorally impactful content (checklists, before/after examples)
+- Group by use-case phase, not by alphabet
+- Keep each file focused — one theme per file is easier for the skill to load selectively
+- Include the "why" for each pattern; models generalize reasoning better than rules
+
+**Updating SKILL.md**:
+Add exactly one line per new reference file to the existing "## Reference files" section:
+```
+- `references/patterns.md` — [domain] pattern catalog: N named patterns with before/after examples
+```
+Do not expand SKILL.md prose. The orchestrator stays lean; depth lives in references.
+
+**Scripts**:
+When research reveals a repeatable mechanical operation (e.g., "run these 4 checks
+in sequence"), extract it to `scripts/`. Scripts save tokens on every invocation
+and ensure consistency. Use argparse, write to stdout, exit non-zero on error.
+
+**The focus test**: before adding any content, ask — would this change what a model
+outputs when executing the skill? If yes, add it. If it is background context the
+model already has from training, skip it.
+
+---
+
+## TEST phase
+
+Goal: measure whether the enrichment changed output quality.
+
+**Write test prompts** — 2–3 prompts that specifically exercise the domain knowledge
+you just added. If you added a pattern catalog for Go error handling, write prompts
+that require correct error wrapping. Prompts must be realistic and specific.
+
+**Run A/B eval**:
+```bash
+# Run enriched skill
+python3 skills/skill-creator/scripts/run_eval.py \
+  --skill-path skills/target-skill \
+  --prompt "realistic test prompt" \
+  --output enrichment-workspace/iteration-1/with-enrichment/eval-1/
+
+# Run baseline (the skill as it was before enrichment — point --skill-path at a
+# pristine copy, or omit it to compare against no skill at all)
+python3 skills/skill-creator/scripts/run_eval.py \
+  --prompt "realistic test prompt" \
+  --output enrichment-workspace/iteration-1/baseline/eval-1/
+```
+
+Run both with identical prompts. Save all outputs under `enrichment-workspace/iteration-N/`.
+
+**Workspace structure**:
+```
+enrichment-workspace/
+├── audit.json
+├── baseline/
+│   ├── eval-1/output.md
+│   └── eval-2/output.md
+├── iteration-1/
+│   ├── with-enrichment/
+│   │   ├── eval-1/output.md
+│   │   └── eval-2/output.md
+│   ├── baseline/
+│   │   ├── eval-1/output.md
+│   │   └── eval-2/output.md
+│   └── comparisons/
+│       ├── eval-1-comparison.json
+│       └── eval-2-comparison.json
+└── iteration-2/
+    └── ...
+```
+
+---
+
+## EVALUATE phase
+
+Goal: determine objectively whether the enriched version is better.
+
+**Dispatch comparator** for each test prompt pair.
+Load `agents/comparator.md` (bundled with skill-creator). Feed it both outputs
+labeled "Output A" and "Output B" — do not reveal which is enriched. The comparator:
+- Scores both on depth, accuracy, actionability, and domain idioms (0–10 each)
+- Picks a winner with cited evidence
+- Saves to `enrichment-workspace/iteration-N/comparisons/eval-N-comparison.json`
+
+**Decision rule**:
+- Enriched wins 2/3 prompts or better → **PUBLISH**
+- Tie (1–1 with 2 prompts, or 1–1–1 with 3) → run analyzer, then **RETRY**
+- Baseline wins majority → run analyzer, then **RETRY**
+
+**Run analyzer on loss/tie**:
+Load `agents/analyzer.md`. Feed it: the comparison results, the enrichment content
+added, and the baseline outputs. Ask it to identify specifically what the enriched
+version lacked. Common findings:
+- Content was added but never referenced in SKILL.md phases (the skill doesn't know to load it)
+- Examples were too abstract — model didn't recognize them as patterns to apply
+- Research angle was wrong for the prompt type being tested
+
+Record the analyzer's findings in `enrichment-workspace/iteration-N/analysis.md`.
+These drive the next research angle.
+
+---
+
+## PUBLISH phase
+
+Goal: commit validated improvements cleanly so they can be reviewed and merged.
+
+**Branch**:
+```bash
+git checkout -b feat/enrich-{skill-name}
+```
+
+**Stage only enrichment artifacts**:
+```bash
+git add skills/target-skill/references/
+git add skills/target-skill/scripts/   # if scripts were added
+git add skills/target-skill/SKILL.md   # pointer lines only
+```
+
+Do not commit `enrichment-workspace/` — it is ephemeral eval data.
+
+**Commit message**:
+```
+feat(target-skill): enrich with {domain} patterns and checklist
+
+Added {N} reference files covering {what}: pattern catalog with before/after
+examples, practitioner checklist, and common mistake catalog. Enriched version
+wins {2 or 3}/3 blind A/B evals against baseline on {domain} prompts.
+```
+
+**Push and create PR**:
+```bash
+git push -u origin feat/enrich-{skill-name}
+gh pr create \
+  --title "feat(target-skill): enrich with domain knowledge" \
+  --body "Enrichment loop result: N reference files added, wins X/3 blind evals."
+```
+
+---
+
+## Retry logic in detail
+
+After a failed evaluation, pick the next research angle based on what the analyzer found:
+
+**Iteration 1 — official docs + canonical best practices**
+Focus: what the domain's authoritative sources say to do. Patterns and guidelines
+from official documentation, language specs, or framework guides. This catches the
+most common gap: missing canonical patterns.
+
+**Iteration 2 — common mistakes + anti-patterns**
+Focus: what practitioners actually get wrong. PR review comments, SO questions,
+post-mortems, "gotchas" sections in docs. This adds the flip side: what NOT to do
+and why. Models often produce safer output when they know the failure modes.
+
+**Iteration 3 — advanced patterns + edge cases**
+Focus: what experts know that beginners don't. Performance trade-offs, non-obvious
+interactions, when the "standard" pattern breaks down. Only worth pursuing if
+iterations 1 and 2 produced improvement but not enough.
+
+**If iteration 3 still fails**:
+Do not silently degrade. Report to the user:
+- What enrichment was tried (3 research angles)
+- What the comparator found lacking in each iteration
+- Hypothesis for why the skill is hard to enrich (domain may require runtime context,
+  not static reference content)
+- Recommendation: accept current state, redesign the eval prompts, or try a
+  fundamentally different enrichment approach (e.g., bundled agent instead of reference file)
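The decision rule and angle rotation above can be sketched as a small control loop — illustrative only; function names like `enrichment_loop` are hypothetical, and in practice skill-creator drives this from the comparator JSON files:

```python
# Sketch of the EVALUATE decision rule plus the 3-angle retry rotation.
# Per-iteration input: comparator verdicts as (enriched wins, baseline wins, ties).
RESEARCH_ANGLES = [
    "official docs + canonical best practices",
    "common mistakes + anti-patterns",
    "advanced patterns + edge cases",
]

def decide(wins_enriched: int, wins_baseline: int, ties: int) -> str:
    """Apply the decision rule to one iteration's comparator verdicts."""
    total = wins_enriched + wins_baseline + ties
    if total and wins_enriched / total >= 2 / 3:  # enriched wins 2/3 or better
        return "PUBLISH"
    return "RETRY"  # tie or loss: run the analyzer, then retry

def enrichment_loop(iteration_results) -> str:
    """iteration_results: one (wins_enriched, wins_baseline, ties) tuple per iteration."""
    for angle, (we, wb, t) in zip(RESEARCH_ANGLES, iteration_results):
        if decide(we, wb, t) == "PUBLISH":
            return f"PUBLISH after angle: {angle}"
    return "ESCALATE to user"  # 3 failed iterations: report, do not silently degrade
```

Because `RESEARCH_ANGLES` has exactly three entries, the `zip` caps the loop at three iterations even if more results are supplied, matching the escalation rule.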