From 214d393771832ce53e2ee35b176288d91b6413e2 Mon Sep 17 00:00:00 2001
From: Nathan Oyler
Date: Fri, 27 Mar 2026 12:19:51 -0700
Subject: [PATCH] feat(skill-creator): add enrichment workflow for autonomous
 skill improvement
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a new mode of operation to skill-creator: a 6-phase AUDIT→RESEARCH→ENRICH→
TEST→EVALUATE→PUBLISH loop that autonomously improves thin skills by adding
domain knowledge as reference content, then validates improvement via blind A/B
eval before committing.

Reuses existing eval infrastructure (run_eval.py, comparator.md,
aggregate_benchmark.py). Max 3 iterations with rotating research angles.

New files:
- references/enrichment-workflow.md: deep reference for all 6 phases plus retry logic
- references/domain-research-targets.md: per-domain lookup table of
  primary/secondary sources and extraction targets (Go, Python, K8s, TypeScript,
  React, Testing, Security, Perses, Code Review, Git/PR)
---
 skills/skill-creator/SKILL.md                 |  75 +++++
 .../references/domain-research-targets.md     | 281 +++++++++++++++++
 .../references/enrichment-workflow.md         | 289 ++++++++++++++++++
 3 files changed, 645 insertions(+)
 create mode 100644 skills/skill-creator/references/domain-research-targets.md
 create mode 100644 skills/skill-creator/references/enrichment-workflow.md

diff --git a/skills/skill-creator/SKILL.md b/skills/skill-creator/SKILL.md
index 4a525c8..3e1119f 100644
--- a/skills/skill-creator/SKILL.md
+++ b/skills/skill-creator/SKILL.md
@@ -353,6 +353,76 @@ the best description by test-set score to avoid overfitting.
 
 ---
 
+## Enriching existing skills
+
+Use this mode when a skill already exists but produces shallow, generic output — it
+has thin `references/`, no `scripts/`, and passes an eval by luck rather than
+by containing domain knowledge that changes behavior.
+
+Indicators this mode is appropriate:
+- `references/` has fewer than 2 files, or none at all
+- No `scripts/` directory
+- Eval outputs look plausible but lack domain idioms, concrete examples, or
+  checklists specific to the skill's domain
+- The skill passes a test because the model already knows the domain, not because
+  the skill contributes anything
+
+### The enrichment loop
+
+Six phases, max 3 iterations before escalating to the user:
+
+**AUDIT** — measure the skill's current depth before changing anything.
+Count `references/`, `scripts/`, `agents/` files. Run the skill against 2-3
+realistic prompts. Save outputs to `enrichment-workspace/baseline/`.
+See `references/enrichment-workflow.md` → AUDIT phase for the exact checklist.
+
+**RESEARCH** — find domain knowledge the skill is missing.
+Read the skill's SKILL.md and existing references to identify gaps. Search for
+best practices, pattern catalogs with before/after examples, common mistakes,
+and validation criteria. Where to look depends on the skill's domain — consult
+`references/domain-research-targets.md` for a lookup table of primary and
+secondary sources per domain.
+
+**ENRICH** — add the research as reference content.
+Create new files in the skill's `references/` directory. Add deterministic
+`scripts/` where operations are repeatable. Update SKILL.md only with one-line
+pointers to the new references — keep the orchestrator lean. Focus on content
+that changes behavior: concrete examples beat abstract advice.
+See `references/enrichment-workflow.md` → ENRICH phase for structuring guidance.
+
+**TEST** — A/B test the enriched skill against baseline.
+Write 2-3 realistic prompts that exercise the skill's domain. Use
+`scripts/run_eval.py` to run enriched vs baseline on the same prompts. Both
+runs use identical inputs. Save outputs to `enrichment-workspace/iteration-N/`.
+
+**EVALUATE** — dispatch blind comparators on each test prompt.
+Use `agents/comparator.md` (already bundled in this skill). Comparator scores on
+depth, accuracy, actionability, and domain idioms without knowing which version
+is which. If enriched wins 2/3 or better → PUBLISH. If tie or loss → run
+`agents/analyzer.md` to understand why, then RETRY with a different research angle.
+See `references/enrichment-workflow.md` → EVALUATE phase for scoring details.
+
+**PUBLISH** — commit validated improvements.
+Create branch `feat/enrich-{skill-name}`, commit references + scripts + SKILL.md
+pointer updates, push, create PR. See `references/enrichment-workflow.md` →
+PUBLISH phase for the exact commit/PR flow.
+
+### Retry logic
+
+Each retry uses a different research angle to avoid retreading the same ground:
+
+| Iteration | Research angle |
+|-----------|---------------|
+| 1 | Official docs + canonical best practices |
+| 2 | Common mistakes + anti-patterns (what goes wrong) |
+| 3 | Advanced patterns + edge cases (what experts know) |
+
+After 3 failed iterations, report to the user: summarize what was tried, what the
+evaluator found lacking, and ask whether to try a different approach or accept the
+current state.
+
+---
+
 ## Bundled agents
 
 The `agents/` directory contains prompts for specialized subagents used by this
 skill. Read them when you need to spawn the relevant subagent.
 - `references/complexity-tiers.md` — Skill examples by complexity tier
 - `references/workflow-patterns.md` — Reusable phase structures and gate patterns
 - `references/error-catalog.md` — Common skill creation errors with solutions
+- `references/enrichment-workflow.md` — Deep reference for the enrichment loop:
+  AUDIT checklist, RESEARCH strategy, ENRICH structuring, TEST/EVALUATE/PUBLISH phases,
+  and retry logic in detail
+- `references/domain-research-targets.md` — Lookup table: given a skill's domain,
+  which primary sources, secondary sources, and extraction targets to use during RESEARCH
 
 ---
 
diff --git a/skills/skill-creator/references/domain-research-targets.md b/skills/skill-creator/references/domain-research-targets.md
new file mode 100644
index 0000000..600b3f1
--- /dev/null
+++ b/skills/skill-creator/references/domain-research-targets.md
@@ -0,0 +1,281 @@
+# Domain Research Targets
+
+Lookup table for the enrichment loop RESEARCH phase. Given a skill's domain, this
+file tells you where to look for knowledge, what authority each source carries, and
+what to extract from it.
+
+Format per entry:
+- **Primary sources** — official docs, specs, canonical reference material (highest authority)
+- **Secondary sources** — blogs, talks, books, community guides (patterns and examples)
+- **Extract** — what form of knowledge to pull out (checklists, before/after, decision trees)
+
+---
+
+## Go general (go-testing, go-concurrency, go-error-handling, go-anti-patterns, go-code-review)
+
+**Primary sources**
+- [Effective Go](https://go.dev/doc/effective_go) — canonical idioms; extract named
+  patterns with rationale
+- [Go specification](https://go.dev/ref/spec) — authoritative on language semantics;
+  useful for edge cases and subtle behavior
+- [Go standard library source](https://cs.opensource.google/go/go) — how the stdlib
+  itself applies patterns; extract struct design, error handling, and interface choices
+- [Go Blog](https://go.dev/blog) — official in-depth articles; especially errors,
+  modules, generics, and concurrency posts
+- [Go wiki: CodeReviewComments](https://github.com/golang/go/wiki/CodeReviewComments) —
+  community-maintained list of Go code review feedback; extract as checklist
+- [Go wiki: CommonMistakes](https://github.com/golang/go/wiki/CommonMistakes) —
+  extract directly as anti-pattern catalog
+
+**Secondary sources**
+- [Go Proverbs](https://go-proverbs.github.io) (Rob Pike) — memorable heuristics;
+  useful for decision criteria
+- Dave Cheney's blog (dave.cheney.net) and talks — especially error handling, interfaces,
+  and performance; extract before/after examples
+- [100 Go Mistakes](https://100go.co) (Teiva Harsanyi) — structured mistake catalog;
+  extract mistake + root cause + fix format
+- Go 1.22+ release notes — new patterns and deprecations worth knowing
+
+**Extract**
+- Checklist: idiomatic Go review (interface size, error wrapping, goroutine hygiene)
+- Before/after: common rewrites (bare error returns → wrapped; goroutine leak → context cancel)
+- Decision tree: when to use channels vs mutexes, when to define an interface vs use concrete type
+- Anti-pattern catalog: goroutine leaks, error shadowing, interface pollution, unnecessary abstractions
+
+---
+
+## Go SAPCC (go-sapcc-conventions)
+
+This skill is already rich — it was built from extracted PR review comments from
+sapcc/keppel and sapcc/go-bits. Enrichment is low-value unless new PR review
+patterns have accumulated.
+
+**When to enrich**: mine new merged PRs from sapcc/keppel and sapcc/go-bits since
+the skill's last update date. Look for reviewer comments that establish new patterns
+not yet in the skill's references.
+
+**Primary source**: sapcc/keppel PR review history (via `skills/skill-creator/scripts/` pr-miner)
+
+**Extract**: reviewer comment → pattern name → before/after example, same format as
+existing sapcc references
+
+---
+
+## Python (python-quality-gate)
+
+**Primary sources**
+- [PEP 8](https://peps.python.org/pep-0008/) — style; extract checklist of the
+  non-obvious rules (the obvious ones are already in every model's training)
+- [PEP 484](https://peps.python.org/pep-0484/) — type hints; extract annotation patterns
+- [PEP 526](https://peps.python.org/pep-0526/) — variable annotations
+- [PEP 3107](https://peps.python.org/pep-3107/) — function annotations
+- [Python docs: typing module](https://docs.python.org/3/library/typing.html) —
+  extract: when to use Protocol vs ABC, TypeVar constraints, overload patterns
+- [mypy docs](https://mypy.readthedocs.io) — extract: common type errors and their
+  fixes, strict mode implications
+
+**Secondary sources**
+- [ruff rules reference](https://docs.astral.sh/ruff/rules/) — every rule has a
+  rationale; extract the non-obvious ones as checklist
+- Real Python tutorials — extract before/after examples from "Pythonic" articles
+- Hynek Schlawack's blog — especially async and attrs patterns
+
+**Extract**
+- Checklist: pre-commit quality gate (ruff, mypy, bandit checks that matter most)
+- Before/after: common Python anti-patterns with idiomatic rewrites
+- Decision tree: when to use dataclass vs TypedDict vs NamedTuple vs attrs
+- Anti-pattern catalog: mutable default arguments, broad except, type: ignore abuse
+
+---
+
+## Kubernetes (kubernetes-debugging, kubernetes-security)
+
+**Primary sources**
+- [Kubernetes official docs](https://kubernetes.io/docs/) — especially Concepts and
+  Tasks sections; extract patterns, not API reference
+- [RBAC best practices](https://kubernetes.io/docs/concepts/security/rbac-good-practices/)
+- [Network Policy docs](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
+- [CIS Kubernetes Benchmark](https://www.cisecurity.org/benchmark/kubernetes) —
+  extract as security checklist with severity levels
+- [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
+
+**Secondary sources**
+- *Kubernetes Patterns* (Ibryam & Huss) — extract named patterns with use-case criteria
+- Learnk8s blog — extract debugging decision trees and before/after manifests
+- [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/) —
+  extract as debugging command reference
+
+**Extract**
+- Checklist: security hardening (RBAC, network policies, pod security, secret management)
+- Decision tree: debugging pod failures (CrashLoopBackOff → ImagePullBackOff → OOMKilled flow)
+- Before/after: insecure manifest → hardened manifest examples
+- Anti-pattern catalog: over-privileged service accounts, missing resource limits, secret in env vars
+
+---
+
+## TypeScript (typescript-check)
+
+**Primary sources**
+- [TypeScript handbook](https://www.typescriptlang.org/docs/handbook/) — extract
+  non-obvious type patterns: conditional types, mapped types, template literals
+- [TypeScript release notes](https://www.typescriptlang.org/docs/handbook/release-notes/overview.html)
+  — new features per version; extract patterns introduced in 5.x
+- [@types conventions](https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/README.md)
+
+**Secondary sources**
+- Matt Pocock (total-typescript.com) — extract: advanced type patterns with before/after,
+  common TS mistakes with fixes
+- [TypeScript Deep Dive](https://basarat.gitbook.io/typescript/) — extract anti-patterns section
+
+**Extract**
+- Checklist: strict mode implications and what each flag catches
+- Before/after: `any` abuse → proper generics, type assertion abuse → type guards
+- Decision tree: when to use `interface` vs `type`, `unknown` vs `any`, generics vs overloads
+- Anti-pattern catalog: type assertions without guards, overly broad union types, enum misuse
+
+---
+
+## React / Next.js (distinctive-frontend-design, threejs-builder)
+
+**Primary sources**
+- [React docs](https://react.dev) — especially the "Thinking in React" and hooks
+  reference sections; extract composability patterns
+- [Next.js docs](https://nextjs.org/docs) — extract: App Router patterns, server
+  component vs client component decision criteria, data fetching patterns
+
+**Secondary sources**
+- Vercel blog — extract: App Router migration patterns, performance optimization cases
+- Kent C. Dodds (kentcdodds.com) — extract: compound component pattern, custom hooks
+  patterns, testing philosophy
+- Josh Comeau (joshwcomeau.com) — extract: CSS-in-JS patterns, animation approaches
+
+**Extract**
+- Decision tree: server component vs client component selection criteria
+- Before/after: common React anti-patterns (prop drilling → context, useEffect abuse → derived state)
+- Checklist: performance review (unnecessary re-renders, missing keys, large bundle items)
+- Pattern catalog: compound components, render props, custom hooks with clear interfaces
+
+---
+
+## Testing (test-driven-development, testing-anti-patterns, e2e-testing)
+
+**Primary sources**
+- [Playwright docs](https://playwright.dev/docs/intro) — extract: Page Object Model
+  structure, locator best practices, network interception patterns
+- [pytest docs](https://docs.pytest.org) — extract: fixture patterns, parametrize,
+  conftest scope decisions
+
+**Secondary sources**
+- Kent C. Dodds — Testing Trophy and [testing-library principles](https://testing-library.com/docs/guiding-principles)
+  — extract: what to test at each level
+- *Growing Object-Oriented Software, Guided by Tests* (Freeman & Pryce) — extract:
+  outside-in TDD pattern, listening to tests as design signal
+- *xUnit Test Patterns* (Meszaros) — extract: test smell catalog with names and fixes
+- Martin Fowler's [bliki on test doubles](https://martinfowler.com/bliki/TestDouble.html)
+
+**Extract**
+- Checklist: test quality review (one assertion focus, arrange-act-assert, test isolation)
+- Anti-pattern catalog with names: Mystery Guest, Eager Test, Fragile Test, Slow Test
+- Decision tree: unit vs integration vs E2E for a given scenario
+- Before/after: brittle selector → resilient locator, over-mocked test → integrated test
+
+---
+
+## Security (security-threat-model)
+
+**Primary sources**
+- [OWASP Top 10](https://owasp.org/www-project-top-ten/) — extract each category as
+  a named vulnerability with detection criteria and mitigation checklist
+- [OWASP Cheat Sheets](https://cheatsheetseries.owasp.org) — extract checklists per
+  topic (SQL injection, XSS, CSRF, auth, etc.)
+- [CWE Top 25](https://cwe.mitre.org/top25/) — extract as severity-ranked catalog
+- [NIST guidelines](https://csrc.nist.gov/publications) — especially SP 800-53 controls
+
+**Secondary sources**
+- PortSwigger Web Security Academy — extract: attack pattern → detection → fix format
+- Troy Hunt's blog — extract: real-world mistake catalog
+
+**Extract**
+- Checklist: threat modeling prompts (per STRIDE category)
+- Before/after: vulnerable code → remediated code for each OWASP Top 10 item
+- Decision tree: severity classification (Critical/High/Medium/Low with criteria)
+- Anti-pattern catalog: hard-coded secrets, overly permissive CORS, missing auth checks
+
+---
+
+## Perses (perses-*)
+
+**Primary sources**
+- [Perses docs](https://perses.dev/docs/) — extract: dashboard definition spec,
+  plugin architecture, variable interpolation formats
+- [Perses GitHub wiki](https://github.com/perses/perses/wiki) — supplementary patterns
+- [PromQL docs](https://prometheus.io/docs/prometheus/latest/querying/basics/) —
+  extract: query optimization patterns, recording rules, alerting rule structure
+
+**Secondary sources**
+- Perses GitHub issues and PR discussions — extract: community-documented gotchas
+  and workarounds
+
+**Extract**
+- Checklist: dashboard quality (variable usage, panel alignment, datasource scoping)
+- Before/after: raw PromQL → optimized PromQL with recording rules
+- Decision tree: when to use global vs project vs dashboard scope for variables
+- Anti-pattern catalog: hardcoded datasource names, missing variable fallbacks, over-complex queries
+
+---
+
+## Voice skills (create-voice, voice-writer, voice-calibrator)
+
+These skills are already rich — they have deterministic Python validators and
+wabi-sabi calibration built in. Enrichment is rarely warranted.
+
+**When to enrich**: if the banned-pattern list in `voice_validator.py` needs
+expansion, or a new voice profile introduces patterns the existing rules don't cover.
+Mine the validator's false-positive/false-negative log if one exists.
+
+---
+
+## Code review (systematic-code-review, parallel-code-review)
+
+**Primary sources**
+- [Google Engineering Practices: Code Review](https://google.github.io/eng-practices/review/)
+  — extract: reviewer standards, author responsibilities, speed guidelines
+- [Conventional Comments](https://conventionalcomments.org) — label taxonomy for
+  review comments (nitpick, suggestion, issue, question, etc.)
+
+**Secondary sources**
+- Michaela Greiler (michaelagreiler.com) — extract: research-backed review effectiveness
+  checklist, anti-patterns in reviewer behavior
+- SmartBear Code Review research papers — extract: optimal review size, defect density
+  findings as concrete thresholds
+
+**Extract**
+- Checklist: what to check at each review tier (security, logic, style, naming)
+- Before/after: vague review comment → actionable comment with label
+- Decision tree: block vs request-changes vs comment vs approve criteria
+- Anti-pattern catalog: rubber-stamping, nitpick overload, missing context in comments
+
+---
+
+## Git / PR workflows (pr-pipeline, pr-sync, git-commit-flow)
+
+**Primary sources**
+- [Conventional Commits spec](https://www.conventionalcommits.org) — extract:
+  type taxonomy, breaking change notation, footer format
+- [GitHub API docs](https://docs.github.com/en/rest) — extract: PR creation fields,
+  check run status, review request patterns
+- [gh CLI reference](https://cli.github.com/manual/) — extract: useful command
+  combinations for PR workflows
+
+**Secondary sources**
+- [Git best practices](https://sethrobertson.github.io/GitBestPractices/) — extract:
+  commit hygiene rules
+- [Chris Beams: How to Write a Git Commit Message](https://cbea.ms/git-commit/) —
+  extract: 7 rules as checklist
+
+**Extract**
+- Checklist: pre-PR commit hygiene (message format, squash policy, branch naming)
+- Before/after: bad commit message → conventional commit message
+- Decision tree: squash vs merge vs rebase for different PR types
+- Anti-pattern catalog: fixup commits left in history, force-push to shared branch,
+  PR too large to review

diff --git a/skills/skill-creator/references/enrichment-workflow.md b/skills/skill-creator/references/enrichment-workflow.md
new file mode 100644
index 0000000..2a15619
--- /dev/null
+++ b/skills/skill-creator/references/enrichment-workflow.md
@@ -0,0 +1,289 @@
+# Enrichment Workflow
+
+Deep reference for the enrichment loop described in SKILL.md → "Enriching existing skills".
+Read this file when executing any phase of the loop.
+
+---
+
+## AUDIT phase
+
+Goal: establish a factual baseline before touching anything.
+
+**Depth inventory** — count what the skill currently has:
+
+```
+references/   count files (target: 3+)
+scripts/      count files (target: 1+ deterministic tools)
+agents/       count files (target: 0–2 bundled subagents)
+```
+
+Record counts in `enrichment-workspace/audit.json`:
+
+```json
+{
+  "skill": "skill-name",
+  "audit_date": "YYYY-MM-DD",
+  "references_count": 0,
+  "scripts_count": 0,
+  "agents_count": 0,
+  "depth_verdict": "thin | adequate | rich",
+  "gaps": ["no pattern catalog", "no before/after examples", "no validation checklist"]
+}
+```
+
+**Baseline runs** — run the skill against 2–3 prompts that exercise its core domain.
+Use `scripts/run_eval.py` with `--skill-path` pointing at the skill directory.
+Save each output to `enrichment-workspace/baseline/eval-N/output.md`.
+
+Capture what is missing from baseline outputs: generic advice, no domain idioms,
+no concrete examples, no checklists, no error scenarios — these become research targets.
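The depth inventory above can be sketched in a few lines of Python — a minimal sketch, assuming the conventional `references/`/`scripts/`/`agents/` layout; `audit_skill` is illustrative, not a bundled script:

```python
# Illustrative depth inventory: count bundled files and derive a depth verdict.
# Assumes the conventional skill layout; thresholds mirror the Depth verdict rules.
import json
from pathlib import Path

def audit_skill(skill_dir: str, audit_date: str) -> dict:
    """Count references/scripts/agents files and classify the skill's depth."""
    root = Path(skill_dir)
    counts = {
        name: sum(1 for p in (root / name).glob("*") if p.is_file())
        if (root / name).is_dir() else 0
        for name in ("references", "scripts", "agents")
    }
    if counts["references"] < 2 and counts["scripts"] == 0:
        verdict = "thin"      # enrichment warranted
    elif counts["references"] >= 4 and counts["scripts"] >= 1 and counts["agents"] >= 1:
        verdict = "rich"      # enrichment unlikely to move the needle
    else:
        verdict = "adequate"  # enrichment may help; evaluate carefully
    return {
        "skill": root.name,
        "audit_date": audit_date,
        "references_count": counts["references"],
        "scripts_count": counts["scripts"],
        "agents_count": counts["agents"],
        "depth_verdict": verdict,
        "gaps": [],  # filled in by hand after reading the baseline outputs
    }

if __name__ == "__main__":
    report = audit_skill("skills/target-skill", "YYYY-MM-DD")
    Path("enrichment-workspace").mkdir(exist_ok=True)
    Path("enrichment-workspace/audit.json").write_text(json.dumps(report, indent=2))
```

The `gaps` list stays empty here deliberately: it records human judgment from reading the baseline outputs, not something the counter can derive.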
+
+**Depth verdict**:
+- `thin`: references < 2 files AND no scripts → enrichment warranted
+- `adequate`: 2–4 references OR 1+ scripts → enrichment may help, evaluate carefully
+- `rich`: 4+ references AND scripts AND agents → enrichment unlikely to move needle;
+  consider description optimization instead
+
+---
+
+## RESEARCH phase
+
+Goal: find knowledge that will change the skill's output behavior — not summaries,
+but patterns, checklists, before/after examples, and common mistakes.
+
+**Step 1 — read for gaps**
+Read the skill's SKILL.md and all files in its `references/` directory.
+List what knowledge would change a model's output if it had it:
+- Specific checklists the domain uses
+- Before/after examples of correct vs incorrect patterns
+- Common mistakes practitioners make
+- Decision criteria for choosing between approaches
+- Validation rules that are non-obvious
+
+**Step 2 — consult domain-research-targets.md**
+Look up the skill's domain in `references/domain-research-targets.md`.
+It lists primary sources (highest authority), secondary sources (patterns and
+examples), and what to extract from each.
+
+**Step 3 — gather knowledge**
+For each source in the domain table:
+
+*Official docs*: Read methodology sections, not API reference. Extract:
+- Named patterns with rationale
+- Anti-patterns the docs explicitly warn against
+- Decision trees or "when to use X vs Y" guidance
+
+*Secondary sources* (blogs, talks, books): Extract:
+- Before/after examples (these are gold — models respond to concrete diffs)
+- Common mistake catalogs with explanation of why they are mistakes
+- Checklists practitioners actually use
+
+*learning.db* (the toolkit's retro database): Run:
+```bash
+python3 ~/.claude/scripts/retro_query.py --domain "skill-domain-keyword" --min-confidence 0.7
+```
+This surfaces patterns learned from past sessions in this domain.
+
+**What to capture** (format for reference files):
+
+````markdown
+## Pattern: Name
+
+**When**: [situation where this applies]
+**Do this**:
+```code
+good example
+```
+**Not this**:
+```code
+bad example
+```
+**Why**: [one-sentence rationale]
+````
+
+Aim to collect at minimum:
+- 5–10 named patterns with before/after examples
+- 1 checklist of 8–15 items practitioners use
+- 5–10 common mistakes with explanations
+- Any domain-specific validation criteria
+
+---
+
+## ENRICH phase
+
+Goal: add research content to the skill in a form that changes behavior at execution time.
+
+**Where content goes**:
+
+| Content type | Target location |
+|---|---|
+| Pattern catalog with before/after | `references/patterns.md` (new) |
+| Checklist practitioners use | `references/checklist.md` (new) |
+| Common mistakes | `references/anti-patterns.md` (new, or add to existing) |
+| Validation criteria | `references/validation-criteria.md` (new) |
+| Repeatable deterministic operation | `scripts/tool-name.py` (new) |
+
+**Structuring reference files**:
+- Lead with the most behaviorally impactful content (checklists, before/after examples)
+- Group by use-case phase, not by alphabet
+- Keep each file focused — one theme per file is easier for the skill to load selectively
+- Include the "why" for each pattern; models generalize reasoning better than rules
+
+**Updating SKILL.md**:
+Add exactly one line per new reference file to the existing "## Reference files" section:
+```
+- `references/patterns.md` — [domain] pattern catalog: N named patterns with before/after examples
+```
+Do not expand SKILL.md prose. The orchestrator stays lean; depth lives in references.
+
+**Scripts**:
+When research reveals a repeatable mechanical operation (e.g., "run these 4 checks
+in sequence"), extract it to `scripts/`. Scripts save tokens on every invocation
+and ensure consistency. Use argparse, write to stdout, exit non-zero on error.
+
+**The focus test**: before adding any content, ask — would this change what a model
+outputs when executing the skill? If yes, add it. If it is background context the
+model already has from training, skip it.
+
+---
+
+## TEST phase
+
+Goal: measure whether the enrichment changed output quality.
+
+**Write test prompts** — 2–3 prompts that specifically exercise the domain knowledge
+you just added. If you added a pattern catalog for Go error handling, write prompts
+that require correct error wrapping. Prompts must be realistic and specific.
+
+**Run A/B eval**:
+```bash
+# Run enriched skill
+python3 skills/skill-creator/scripts/run_eval.py \
+  --skill-path skills/target-skill \
+  --prompt "realistic test prompt" \
+  --output enrichment-workspace/iteration-1/with-enrichment/eval-1/
+
+# Run baseline (the skill as it was before enrichment — point --skill-path at a
+# pristine copy, or omit it to compare against no skill at all)
+python3 skills/skill-creator/scripts/run_eval.py \
+  --prompt "realistic test prompt" \
+  --output enrichment-workspace/iteration-1/baseline/eval-1/
+```
+
+Run both with identical prompts. Save all outputs under `enrichment-workspace/iteration-N/`.
+
+**Workspace structure**:
+```
+enrichment-workspace/
+├── audit.json
+├── baseline/
+│   ├── eval-1/output.md
+│   └── eval-2/output.md
+├── iteration-1/
+│   ├── with-enrichment/
+│   │   ├── eval-1/output.md
+│   │   └── eval-2/output.md
+│   ├── baseline/
+│   │   ├── eval-1/output.md
+│   │   └── eval-2/output.md
+│   └── comparisons/
+│       ├── eval-1-comparison.json
+│       └── eval-2-comparison.json
+└── iteration-2/
+    └── ...
+```
+
+---
+
+## EVALUATE phase
+
+Goal: determine objectively whether the enriched version is better.
+
+**Dispatch comparator** for each test prompt pair.
+Load `agents/comparator.md` (bundled with skill-creator). Feed it both outputs
+labeled "Output A" and "Output B" — do not reveal which is enriched. The comparator:
+- Scores both on depth, accuracy, actionability, and domain idioms (0–10 each)
+- Picks a winner with cited evidence
+- Saves to `enrichment-workspace/iteration-N/comparisons/eval-N-comparison.json`
+
+**Decision rule**:
+- Enriched wins 2/3 prompts or better → **PUBLISH**
+- Tie (1–1 with 2 prompts, or 1–1–1 with 3) → run analyzer, then **RETRY**
+- Baseline wins majority → run analyzer, then **RETRY**
+
+**Run analyzer on loss/tie**:
+Load `agents/analyzer.md`. Feed it: the comparison results, the enrichment content
+added, and the baseline outputs. Ask it to identify specifically what the enriched
+version lacked. Common findings:
+- Content was added but never referenced in SKILL.md phases (the skill doesn't know to load it)
+- Examples were too abstract — model didn't recognize them as patterns to apply
+- Research angle was wrong for the prompt type being tested
+
+Record the analyzer's findings in `enrichment-workspace/iteration-N/analysis.md`.
+These drive the next research angle.
+
+---
+
+## PUBLISH phase
+
+Goal: commit validated improvements cleanly so they can be reviewed and merged.
+
+**Branch**:
+```bash
+git checkout -b feat/enrich-{skill-name}
+```
+
+**Stage only enrichment artifacts**:
+```bash
+git add skills/target-skill/references/
+git add skills/target-skill/scripts/   # if scripts were added
+git add skills/target-skill/SKILL.md   # pointer lines only
+```
+
+Do not commit `enrichment-workspace/` — it is ephemeral eval data.
+
+**Commit message**:
+```
+feat(target-skill): enrich with {domain} patterns and checklist
+
+Added {N} reference files covering {what}: pattern catalog with before/after
+examples, practitioner checklist, and common mistake catalog. Enriched version
+wins {2 or 3}/3 blind A/B evals against baseline on {domain} prompts.
+```
+
+**Push and create PR**:
+```bash
+git push -u origin feat/enrich-{skill-name}
+gh pr create \
+  --title "feat(target-skill): enrich with domain knowledge" \
+  --body "Enrichment loop result: N reference files added, wins X/3 blind evals."
+```
+
+---
+
+## Retry logic in detail
+
+After a failed evaluation, pick the next research angle based on what the analyzer found:
+
+**Iteration 1 — official docs + canonical best practices**
+Focus: what the domain's authoritative sources say to do. Patterns and guidelines
+from official documentation, language specs, or framework guides. This catches the
+most common gap: missing canonical patterns.
+
+**Iteration 2 — common mistakes + anti-patterns**
+Focus: what practitioners actually get wrong. PR review comments, SO questions,
+post-mortems, "gotchas" sections in docs. This adds the flip side: what NOT to do
+and why. Models often produce safer output when they know the failure modes.
+
+**Iteration 3 — advanced patterns + edge cases**
+Focus: what experts know that beginners don't. Performance trade-offs, non-obvious
+interactions, when the "standard" pattern breaks down. Only worth pursuing if
+iterations 1 and 2 produced improvement but not enough.
+
+**If iteration 3 still fails**:
+Do not silently degrade. Report to the user:
+- What enrichment was tried (3 research angles)
+- What the comparator found lacking in each iteration
+- Hypothesis for why the skill is hard to enrich (domain may require runtime context,
+  not static reference content)
+- Recommendation: accept current state, redesign the eval prompts, or try a
+  fundamentally different enrichment approach (e.g., bundled agent instead of reference file)
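The decision rule and angle rotation above can be sketched as a small control loop — illustrative only; function names like `enrichment_loop` are hypothetical, and in practice skill-creator drives this from the comparator JSON files:

```python
# Sketch of the EVALUATE decision rule plus the 3-angle retry rotation.
# Per-iteration input: comparator verdicts as (enriched wins, baseline wins, ties).
RESEARCH_ANGLES = [
    "official docs + canonical best practices",
    "common mistakes + anti-patterns",
    "advanced patterns + edge cases",
]

def decide(wins_enriched: int, wins_baseline: int, ties: int) -> str:
    """Apply the decision rule to one iteration's comparator verdicts."""
    total = wins_enriched + wins_baseline + ties
    if total and wins_enriched / total >= 2 / 3:  # enriched wins 2/3 or better
        return "PUBLISH"
    return "RETRY"  # tie or loss: run the analyzer, then retry

def enrichment_loop(iteration_results) -> str:
    """iteration_results: one (wins_enriched, wins_baseline, ties) tuple per iteration."""
    for angle, (we, wb, t) in zip(RESEARCH_ANGLES, iteration_results):
        if decide(we, wb, t) == "PUBLISH":
            return f"PUBLISH after angle: {angle}"
    return "ESCALATE to user"  # 3 failed iterations: report, do not silently degrade
```

Because `RESEARCH_ANGLES` has exactly three entries, the `zip` caps the loop at three iterations even if more results are supplied, matching the escalation rule.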