diff --git a/docs/PHILOSOPHY.md b/docs/PHILOSOPHY.md
index af3bbbb7..46b4f0e9 100644
--- a/docs/PHILOSOPHY.md
+++ b/docs/PHILOSOPHY.md
@@ -54,7 +54,7 @@ The solution: only pull the relevant information for the specific task. This is
 Three mechanisms enforce this:
 - **Agents**: specialized instruction files tailored to specific domains, loaded only when their triggers match
 - **Skills**: workflow methodologies that invoke deterministic scripts (Python CLIs, validation tools) rather than relying on LLM judgment alone, activated only when their workflow applies
-- **Progressive Disclosure**: summary in the main file, details in `references/` subdirectory. Right context at the right time, not everything at once
+- **Progressive Disclosure**: SKILL.md contains the workflow orchestration and tells the model *when* to load deep context. Detailed catalogs, agent rosters, specification tables, and output templates live in `references/` and are loaded only when the current workflow phase needs them. A skill with 26 chart types keeps the selection logic in SKILL.md and each chart's parameter spec in its own reference file — the model loads only the spec for the chart it selected. A review skill with 4 waves keeps the orchestration in SKILL.md and each wave's agent roster in a separate reference file — Wave 2 agents don't consume tokens during Wave 1
 
 ## Tokens Are Cheap, Quality Is Expensive
 
@@ -221,26 +221,38 @@ Everything a skill needs lives inside the skill directory. Scripts, viewer templ
 
 ```
 skills/my-skill/
-├── SKILL.md # The workflow
+├── SKILL.md # The orchestrator — workflow + when to load references
 ├── agents/ # Subagent prompts used only by this skill
 ├── scripts/ # Deterministic CLI tools this skill invokes
 ├── assets/ # Templates, HTML viewers, static files
 └── references/ # Deep context loaded on demand
 ```
 
+**The orchestrator pattern:** SKILL.md is a thin workflow orchestrator, not a monolithic document.
It tells the model *what to do* (phases, gates, decisions) and *when to load deep context* (reference files). The heavy content — detailed catalogs, agent dispatch prompts, output templates, specification tables — lives in `references/` and gets loaded only when the current phase needs it.
+
+This is the difference between a skill that works and a skill that works *efficiently*:
+
+| Approach | Token Cost | Quality |
+|----------|-----------|---------|
+| Everything in SKILL.md | High — full content loaded on every invocation | Good but wasteful |
+| Thin SKILL.md, no references | Low — but missing context | Degraded — lost domain knowledge |
+| **Orchestrator + references** | **Proportional to task** — load what the phase needs | **Best — full knowledge, minimal waste** |
+
+Making a skill shorter by deleting content is not progressive disclosure — it's content loss. Progressive disclosure means the content still exists, organized so only the relevant slice enters the context window at any given phase.
+
+**Example:** A review skill with 4 waves of agents keeps the wave orchestration logic in SKILL.md (~500 lines) and puts each wave's agent roster and dispatch prompts in separate reference files (`references/wave-1-foundation.md`, `references/wave-2-deep-dive.md`). When executing Wave 1, only the Wave 1 reference is loaded. Wave 2's agents don't consume tokens until Wave 2 begins.
+
 **Why this matters:** A skill that depends on scripts scattered across the repo is fragile to move, hard to test, and impossible to evaluate in isolation. When everything is bundled, the skill can be:
 - Copied to another project and it works
 - Tested via `run_eval.py` against its own workspace
 - Reviewed as a single unit — all the tooling is visible in one tree
 - Deleted without orphaning dependencies elsewhere
 
-**The exception:** Shared patterns (`shared-patterns/anti-rationalization-core.md`) are referenced across skills. These stay shared.
But skill-specific scripts, assets, and agents are always bundled.
-
 **Repo-level `scripts/`** is reserved for toolkit-wide operations (learning-db.py, sync-to-user-claude.py, INDEX generation) — tools that operate on the system as a whole, not on a single skill's workflow.
 
 ## Workflow First, Constraints Inline
 
-Skill documents place the workflow (Instructions/Phases) immediately after the frontmatter. Constraints appear inline within the phases they govern, not in a separate upfront section.
+Skill documents place the workflow (Instructions/Phases) immediately after the frontmatter. Constraints appear inline within the phases they govern, with reasoning attached ("because X"), not in a separate upfront section.
 
 **Measured result:** A/B/C testing on Go code generation showed workflow-first ordering (C) swept constraints-first ordering (B) 3-0 across simple, medium, and complex prompts. Agent blind reviewers consistently scored workflow-first higher on testing depth, Go idioms, and benchmark coverage.
 
@@ -249,18 +261,19 @@ Skill documents place the workflow (Instructions/Phases) immediately after the f
 ```
 1. YAML frontmatter (What + When)
 2. Brief overview (How — one paragraph)
-3. Instructions/Phases (The actual workflow, with inline constraints)
-4. Benchmark/Commands Guide (Reference material)
+3. Instructions/Phases (The workflow, constraints inline with reasoning)
+4. Reference Material (Commands, guides — or pointers to references/)
 5. Error Handling (Failure context)
-6. Anti-Patterns (What went wrong before)
-7. References (Pointers to deep context)
+6. References (Pointers to bundled files)
 ```
 
-**Why it works:** The model encounters the task structure before the constraint framework. Constraints appear at the decision point where they apply — "use table-driven tests because they make adding cases trivial" inside the testing phase, not in a separate Hardcoded Behaviors section 200 lines earlier.
The model spends attention on understanding the task, not parsing a constraint taxonomy.
+**Why it works:** The model encounters the task structure before any constraint framework. Constraints appear at the decision point where they apply — "use table-driven tests because they make adding cases trivial" inside the testing phase, not in a separate Hardcoded Behaviors section 200 lines earlier. Attaching reasoning ("because X") lets the model generalize constraints to situations the skill author didn't anticipate.
+
+**What was removed:** Operator Context sections (Hardcoded/Default/Optional taxonomy), standalone Anti-Patterns sections, Anti-Rationalization tables, and Capabilities & Limitations boilerplate. These were structural overhead that separated constraints from the workflow steps where they apply.
 
-**What moves:** The Operator Context section (Hardcoded/Default/Optional behaviors) decomposes. Each constraint migrates to the phase where it applies. "Run with -race for concurrent code" belongs in Phase 3 (RUN), not in a behavior table.
+**Where the content went:** Every constraint was distributed inline to the workflow step where it matters. Anti-pattern wisdom became reasoning attached to the relevant instruction. Nothing was deleted — it was reorganized to be at point-of-use.
 
-**What stays:** Error Handling, Anti-Patterns, and References remain at the end as context that's consulted when things go wrong — not before the model has understood what "going right" looks like.
+**Progressive disclosure completes the picture:** Workflow-first ordering keeps SKILL.md navigable. For skills exceeding ~500 lines, detailed catalogs, agent rosters, and specification tables move to `references/` files. The SKILL.md workflow tells the model when to load each reference — "Read `references/wave-1-foundation.md` for the agent list and dispatch prompts." The model gets the orchestration logic upfront and loads deep context only when the current phase needs it.
 
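The orchestrator-plus-references idea described above can be illustrated with a minimal sketch. It is not the toolkit's actual loader: the `PHASE_REFERENCES` mapping, the phase keys, and the `load_phase_context` helper are hypothetical names introduced here, and only the two wave file names come from the text. The sketch assumes a skill directory containing `SKILL.md` and a `references/` folder.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from a workflow phase to the one reference file it needs.
# The file names are taken from the example in the text; the phase keys are illustrative.
PHASE_REFERENCES = {
    "wave-1": "references/wave-1-foundation.md",
    "wave-2": "references/wave-2-deep-dive.md",
}


def load_phase_context(skill_dir: Path, phase: str) -> str:
    """Return SKILL.md plus only the reference file mapped to the given phase."""
    # The orchestrator is always loaded.
    parts = [(skill_dir / "SKILL.md").read_text(encoding="utf-8")]
    # Deep context is loaded only if this phase has a reference file.
    reference: Optional[str] = PHASE_REFERENCES.get(phase)
    if reference is not None:
        parts.append((skill_dir / reference).read_text(encoding="utf-8"))
    return "\n\n".join(parts)
```

Under these assumptions, calling `load_phase_context(skill_dir, "wave-1")` returns the orchestrator and the Wave 1 roster only; the Wave 2 file is never read until its phase is requested.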
## Open Sharing Over Individual Ownership diff --git a/pipelines/INDEX.json b/pipelines/INDEX.json index 13a4e888..848a2a94 100644 --- a/pipelines/INDEX.json +++ b/pipelines/INDEX.json @@ -1,6 +1,6 @@ { "version": "2.0", - "generated": "2026-03-24T22:20:54Z", + "generated": "2026-03-27T19:00:36Z", "generated_by": "scripts/generate-skill-index.py", "pipelines": { "agent-upgrade": { @@ -14,7 +14,7 @@ "align agent to template" ], "category": "meta", - "user_invocable": true, + "user_invocable": false, "version": "1.0.0", "phases": [ "AUDIT", @@ -108,7 +108,7 @@ }, "comprehensive-review": { "file": "pipelines/comprehensive-review/SKILL.md", - "description": "Unified 3-wave code review: Wave 0 auto-discovers packages/modules and dispatches one language-specialist agent per package for deep per-package analysis.", + "description": "Unified 4-wave code review: Wave 0 auto-discovers packages/modules and dispatches one language-specialist agent per package for deep per-package analysis.", "triggers": [ "comprehensive review", "full review", @@ -118,28 +118,18 @@ "multi-agent review", "complete code review", "20-agent review", + "25-agent review", "per-package review", "3-wave review", + "4-wave review", + "adversarial review", "full code review", "review all packages" ], "category": "review", "force_route": true, - "user_invocable": true, - "version": "3.2.0", - "phases": [ - "STATIC ANALYSIS", - "SCOPE", - "WAVE 0 DISPATCH", - "WAVE 0 AGGREGATE", - "LIBRARY CONTRACT VERIFICATION", - "WAVE 1 DISPATCH", - "WAVE 0+1 AGGREGATE", - "WAVE 2 DISPATCH", - "FULL AGGREGATE", - "FIX", - "REPORT" - ], + "user_invocable": false, + "version": "4.0.0", "pairs_with": [ "systematic-code-review", "parallel-code-review", @@ -160,7 +150,7 @@ ], "category": "content", "force_route": true, - "user_invocable": true, + "user_invocable": false, "version": "1.0.0", "phases": [ "SCAN", @@ -182,7 +172,7 @@ "analyze from multiple angles" ], "category": "content", - "user_invocable": true, + 
"user_invocable": false, "version": "2.0.0", "phases": [ "VALIDATE INPUTS", @@ -292,6 +282,30 @@ "technical-documentation-engineer" ] }, + "feature-pipeline": { + "file": "pipelines/feature-pipeline/SKILL.md", + "description": "End-to-end feature lifecycle pipeline coordinating design through release: DESIGN, PLAN, IMPLEMENT, VALIDATE, RELEASE, RECORD.", + "triggers": [ + "build feature end to end", + "full feature lifecycle", + "feature from scratch", + "design to release", + "complete feature pipeline", + "feature pipeline" + ], + "category": "process", + "user_invocable": false, + "version": "1.0.0", + "phases": [ + "DESIGN", + "PLAN", + "IMPLEMENT", + "VALIDATE", + "RELEASE", + "RECORD" + ], + "agent": "general-purpose" + }, "github-profile-rules": { "file": "pipelines/github-profile-rules/SKILL.md", "description": "Extract programming rules and coding conventions from a GitHub user's public profile via API.", @@ -302,7 +316,7 @@ "github conventions" ], "category": "meta", - "user_invocable": true, + "user_invocable": false, "version": "1.0.0", "phases": [ "ADR", @@ -329,7 +343,7 @@ "hook with gates" ], "category": "infrastructure", - "user_invocable": true, + "user_invocable": false, "version": "1.0.0", "phases": [ "SPEC", @@ -356,7 +370,7 @@ "mcp from repo" ], "category": "meta", - "user_invocable": true, + "user_invocable": false, "version": "1.0.0", "phases": [ "ANALYZE", @@ -613,7 +627,7 @@ ], "category": "meta", "force_route": true, - "user_invocable": true, + "user_invocable": false, "version": "1.0.0", "phases": [ "DISCOVER", @@ -641,7 +655,7 @@ ], "category": "meta", "force_route": true, - "user_invocable": true, + "user_invocable": false, "version": "1.0.0", "phases": [ "CHANGELOG", @@ -657,6 +671,60 @@ ], "agent": "system-upgrade-engineer" }, + "systematic-debugging": { + "file": "pipelines/systematic-debugging/SKILL.md", + "description": "Evidence-based 5-phase debugging pipeline with mandatory reproduction, testing, and learning gates: OBSERVE, 
HYPOTHESIZE, TEST, FIX, RECORD.", + "triggers": [ + "debug", + "find root cause", + "reproduce bug", + "fix this bug", + "this is broken", + "not working", + "why is this broken", + "getting an error", + "unexpected behavior", + "investigate error" + ], + "category": "process", + "user_invocable": false, + "version": "2.0.0", + "phases": [ + "OBSERVE", + "HYPOTHESIZE", + "TEST", + "FIX", + "RECORD" + ], + "agent": "general-purpose" + }, + "systematic-refactoring": { + "file": "pipelines/systematic-refactoring/SKILL.md", + "description": "Safe 5-phase refactoring pipeline with test characterization, incremental changes, and learning gates: CHARACTERIZE, PLAN, REFACTOR, VERIFY, RECORD.", + "triggers": [ + "refactor safely", + "rename symbol", + "extract function", + "refactor", + "simplify this", + "restructure code", + "reduce complexity", + "split this file", + "restructure", + "move to separate file" + ], + "category": "process", + "user_invocable": false, + "version": "2.0.0", + "phases": [ + "CHARACTERIZE", + "PLAN", + "REFACTOR", + "VERIFY", + "RECORD" + ], + "agent": "general-purpose" + }, "voice-calibrator": { "file": "pipelines/voice-calibrator/SKILL.md", "description": "Analyze writing samples and extract voice patterns to create or update voice skills with deterministic metrics and authorship-matching validation.", @@ -670,11 +738,6 @@ "category": "content", "user_invocable": false, "version": "2.0.0", - "phases": [ - "VOICE GROUNDING", - "VOICE METRICS", - "THINKING PATTERNS" - ], "pairs_with": [ "voice-orchestrator", "voice-validator" @@ -742,99 +805,6 @@ "plan-checker", "research-pipeline" ] - }, - "systematic-debugging": { - "file": "pipelines/systematic-debugging/SKILL.md", - "description": "Evidence-based 5-phase debugging pipeline with mandatory reproduction, testing, and learning gates: OBSERVE, HYPOTHESIZE, TEST, FIX, RECORD.", - "triggers": [ - "debug", - "find root cause", - "reproduce bug", - "fix this bug", - "this is broken", - "not working", 
- "why is this broken", - "getting an error", - "unexpected behavior", - "investigate error" - ], - "category": "process", - "user_invocable": false, - "version": "2.0.0", - "phases": [ - "OBSERVE", - "HYPOTHESIZE", - "TEST", - "FIX", - "RECORD" - ], - "pairs_with": [ - "systematic-refactoring", - "workflow-orchestrator" - ], - "agent": "general-purpose" - }, - "systematic-refactoring": { - "file": "pipelines/systematic-refactoring/SKILL.md", - "description": "Safe 5-phase refactoring pipeline with test characterization, incremental changes, and learning gates: CHARACTERIZE, PLAN, REFACTOR, VERIFY, RECORD.", - "triggers": [ - "refactor safely", - "rename symbol", - "extract function", - "refactor", - "simplify this", - "restructure code", - "reduce complexity", - "split this file", - "restructure", - "move to separate file" - ], - "category": "process", - "user_invocable": false, - "version": "2.0.0", - "phases": [ - "CHARACTERIZE", - "PLAN", - "REFACTOR", - "VERIFY", - "RECORD" - ], - "pairs_with": [ - "systematic-debugging", - "workflow-orchestrator" - ], - "agent": "general-purpose" - }, - "feature-pipeline": { - "file": "pipelines/feature-pipeline/SKILL.md", - "description": "End-to-end feature lifecycle pipeline coordinating design through release: DESIGN, PLAN, IMPLEMENT, VALIDATE, RELEASE, RECORD.", - "triggers": [ - "build feature end to end", - "full feature lifecycle", - "feature from scratch", - "design to release", - "complete feature pipeline", - "feature pipeline" - ], - "category": "process", - "user_invocable": false, - "version": "1.0.0", - "phases": [ - "DESIGN", - "PLAN", - "IMPLEMENT", - "VALIDATE", - "RELEASE", - "RECORD" - ], - "pairs_with": [ - "feature-design", - "feature-plan", - "feature-implement", - "feature-validate", - "feature-release" - ], - "agent": "general-purpose" } } } diff --git a/pipelines/agent-upgrade/SKILL.md b/pipelines/agent-upgrade/SKILL.md index 69032bd2..ca67768f 100644 --- a/pipelines/agent-upgrade/SKILL.md +++ 
b/pipelines/agent-upgrade/SKILL.md @@ -35,42 +35,13 @@ routing: # Agent Upgrade Pipeline -## Operator Context +## Overview This skill upgrades a single target agent or skill through a scored, gated pipeline. It is a **bottom-up** quality mechanism — triggered when a specific component needs improvement — complementing the **top-down** `system-upgrade` pipeline that handles multi-component changes driven by external events. -### Hardcoded Behaviors (Always Apply) -- **Score before AND after**: Phase 1 (AUDIT) must produce a baseline score using `agent-evaluation`. Phase 5 (RE-EVALUATE) must produce an after score. Never claim improvement without a measured delta. -- **Plan approval gate**: Phase 3 output MUST be presented to the user and explicitly approved before Phase 4 begins. Do not implement any changes without approval. -- **Domain logic is off-limits**: Phase 4 MUST NOT alter an agent's routing triggers, domain coverage, or core methodology. Only structural improvements, template alignment, Operator Context additions, and retro graduations are in scope unless the user explicitly directs otherwise. -- **Use agent-evaluation for scoring**: Do not self-assess quality. Invoke the `agent-evaluation` skill for objective scores. - -### Default Behaviors (ON unless disabled) -- **Check retro graduates**: Phase 1 always searches learning.db for entries targeting the agent under upgrade. Graduation candidates are surfaced in Phase 2. -- **Peer comparison**: Phase 2 compares the target against 2–3 agents in the same category for consistency gaps (e.g., comparing a Go agent against other Go agents). -- **Regression protection**: If Phase 5 delta is negative, report to user and do NOT auto-revert. User decides. - -### Optional Behaviors (OFF unless enabled) -- **Auto-approve**: Skip the Phase 3 user approval gate (enable with "auto-apply" or "skip approval"). -- **Skip peer comparison**: Omit peer consistency check in Phase 2 (enable with "skip peers"). 
-- **Skip retro scan**: Omit retro graduate scan (enable with "skip retro"). - -## What This Skill CAN Do -- Establish an objective baseline score for any agent or skill in the repository -- Identify structural gaps against `AGENT_TEMPLATE_V2` (missing sections, outdated patterns) -- Surface learning.db entries ready for graduation into the target agent -- Produce a ranked improvement plan with Critical/Important/Minor tiers -- Apply approved improvements: missing sections, Operator Context behaviors, graduated retro patterns, peer consistency fixes -- Score the result and report the quality delta - -## What This Skill CANNOT Do -- Change an agent's domain logic, routing triggers, or core methodology without explicit user direction -- Guarantee correctness of generated improvements — the RE-EVALUATE phase catches regressions -- Graduate retro entries automatically — graduation is proposed in the plan and requires approval - --- ## Instructions @@ -95,6 +66,7 @@ ls skills/ | grep [name] ``` **Step 2**: Run `agent-evaluation` skill on the target file to get the baseline score (0–100) and grade (A/B/C/F). +This is critical because without a baseline, there is no way to verify improvement. "Looks better" is not a quality claim. **Step 3**: Scan for retro entries targeting this agent: ```bash @@ -140,16 +112,13 @@ grep -rl "AGENT_TEMPLATE_V2" agents/ skills/ | head -5 Check for required sections. For agents: - Frontmatter with `routing:` block (triggers, pairs_with, complexity, category) - Operator Context with all three subsections (Hardcoded, Default, Optional) -- What This Agent CAN Do / CANNOT Do - Error Handling section -- Anti-Patterns section For skills: - Frontmatter with `agent:`, `allowed-tools:`, `user-invocable:` - Operator Context section - Phase-by-phase instructions with gates - Error Handling section -- Anti-Patterns section **Step 2**: Assess Operator Context completeness: @@ -226,18 +195,17 @@ CRITICAL (must fix): IMPORTANT (should fix): 3. 
Graduate retro entry: "[learning summary]" → new Hardcoded behavior [~10min] 4. Add missing frontmatter field: `model: sonnet` [~2min] - 5. Add Anti-Patterns section — missing entirely [~15min] MINOR (nice to have): - 6. Align "What This Agent CANNOT Do" wording to peer style [~5min] + 5. Align "What This Agent CANNOT Do" wording to peer style [~5min] -Total: 6 improvements +Total: 5 improvements Estimated quality delta: +12 to +18 points Proceed with implementation? (or specify which items to include/exclude) ``` -**Step 3**: Wait for user approval. Do NOT proceed to Phase 4 without it. +**Step 3**: Wait for user approval. Do NOT proceed to Phase 4 without it. The gate exists because the user may have strong opinions about which improvements are appropriate — silent mass edits violate the approval contract. - "yes", "proceed", "go ahead", "do it" → proceed with all items - User specifies subset (e.g., "skip 5 and 6") → update plan, proceed with approved subset - "no" or "stop" → stop and summarize what was decided @@ -250,18 +218,17 @@ Proceed with implementation? (or specify which items to include/exclude) **Goal**: Apply the approved improvements to the target file. -**Step 1**: Read the current target file in full before making any edits. Never edit from memory. +**Step 1**: Read the current target file in full before making any edits. Never edit from memory. This prevents collisions when the file changes between Phase 3 and Phase 4. **Step 2**: For each approved improvement, apply in order of tier (Critical first): **For missing sections** (add from AGENT_TEMPLATE_V2 patterns): - Operator Context: Add the three-subsection structure. Populate Hardcoded with behaviors that ARE enforced, Default with on-by-default behaviors, Optional with opt-in behaviors. - Error Handling: Add 2–4 concrete error cases with Cause + Solution format. -- Anti-Patterns: Add 2–3 named anti-patterns with What it looks like / Why wrong / Do instead format. 
**For retro graduations**: - Add the learning as a new Hardcoded behavior (if it should always apply) or a new pattern/rule in the relevant section. -- Preserve the original voice and specificity of the retro entry — don't paraphrase it into generic advice. +- Preserve the original voice and specificity of the retro entry — don't paraphrase it into generic advice. Only graduate retro entries with score ≥ 6 that are directly relevant to the target agent's domain. **For outdated patterns**: - Update to current convention. Reference the peer agents or template for the correct form. @@ -269,7 +236,7 @@ Proceed with implementation? (or specify which items to include/exclude) **For peer inconsistencies**: - Align to the majority pattern observed across peers. If peers themselves are inconsistent, align to the most recent or highest-scoring peer. -**Step 3**: Do NOT change any of the following without explicit user direction: +**Step 3**: Do NOT change any of the following without explicit user direction because domain logic changes require deliberate user decision, not opportunistic bundling: - Routing triggers (`triggers:` frontmatter field) - Domain coverage statements - Core methodology or phase structure (for skills) @@ -294,7 +261,7 @@ delta = after_score - baseline_score - Show the exact diff of changes applied - Report which change likely caused the regression - Ask user whether to revert the specific change or proceed anyway -- Do NOT auto-revert +- Do NOT auto-revert because the user may have context that justifies the regression **Step 4**: Report upgrade completion: @@ -344,51 +311,6 @@ Solution: Re-read the file before editing. If the baseline state has changed mat --- -## Anti-Patterns - -### Anti-Pattern 1: Skipping the Baseline Score -**What it looks like**: Jumping straight to DIFF or PLAN without running agent-evaluation in Phase 1. -**Why wrong**: Without a baseline, there is no way to verify improvement. "Looks better" is not a quality claim. 
-**Do instead**: Always score first. Even a rough manual score is better than no baseline.
-
-### Anti-Pattern 2: Implementing Without Plan Approval
-**What it looks like**: Editing the target file immediately after Phase 2 without presenting the plan.
-**Why wrong**: The user may have strong opinions about which improvements are appropriate. Silent mass edits violate the approval contract.
-**Do instead**: Always present the ranked plan and wait for explicit approval. The gate exists for a reason.
-
-### Anti-Pattern 3: Changing Domain Logic
-**What it looks like**: "Improving" an agent by changing its routing triggers, expanding its domain scope, or altering its core methodology as part of the upgrade.
-**Why wrong**: Domain logic changes require deliberate user decision, not opportunistic bundling. They can break routing, cause misrouting, or alter behavior the user depends on.
-**Do instead**: Limit Phase 4 to structural improvements, Operator Context additions, retro graduations, and template alignment. Flag domain logic questions to the user separately.
-
-### Anti-Pattern 4: Self-Assessing Quality Instead of Using agent-evaluation
-**What it looks like**: Saying "the agent looks much better now" without running RE-EVALUATE.
-**Why wrong**: Self-assessment is the exact failure mode that the scoring pipeline exists to prevent. An agent can look better and score worse.
-**Do instead**: Always run agent-evaluation in Phase 5. The delta is the claim. Everything else is opinion.
-
-### Anti-Pattern 5: Over-Graduating Retro Entries
-**What it looks like**: Adding every learning.db entry as a new Hardcoded behavior, bloating the agent with marginally relevant rules.
-**Why wrong**: Bloat degrades usability and reduces the signal-to-noise ratio of the Operator Context.
-**Do instead**: Graduate only retro entries with score ≥ 6 that are directly relevant to the target agent's domain. Surfacing them in the plan lets the user decide their importance.
- ---- - -## Examples - -### Example 1: Template alignment upgrade -User: "Upgrade the python-general-engineer agent — it's missing Operator Context." -Actions: Phase 1 scores it (baseline: 58/C). Phase 2 finds missing Operator Context, no Anti-Patterns section, one retro graduation candidate. Phase 3 presents 3-item plan (Critical: add Operator Context; Important: graduate retro entry; Minor: add Anti-Patterns). User approves all. Phase 4 adds sections. Phase 5 re-evaluates (after: 74/B, delta: +16). Upgrade complete. - -### Example 2: Retro graduation only -User: "Graduate the retro learnings about debugging into the systematic-debugging skill." -Actions: Phase 1 scores systematic-debugging (baseline: 81/A). Phase 2 finds 2 retro entries in debugging.md with tags matching the skill. Phase 3 presents 2-item plan (both Important: inject as Hardcoded behaviors). User approves item 1, skips item 2. Phase 4 applies one graduation. Phase 5 re-evaluates (after: 84/A, delta: +3). Upgrade complete. - -### Example 3: Regression caught -User: "Improve the hook-development-engineer agent — align it to current patterns." -Actions: Phase 1 scores (baseline: 72/B). Phase 2 finds 4 gaps. Phase 3 plan approved. Phase 4 applies all 4. Phase 5 re-evaluates (after: 69/C, delta: -3). Regression detected. Show diff. Ask user: revert item 3 (the likely culprit) or accept? User says revert. Item 3 reverted. Phase 5 re-run (after: 75/B, delta: +3). Upgrade complete. 
- ---- - ## References - [agent-evaluation](../../skills/agent-evaluation/SKILL.md) - Objective scoring skill used in Phase 1 (baseline) and Phase 5 (re-evaluate) diff --git a/pipelines/article-evaluation-pipeline/SKILL.md b/pipelines/article-evaluation-pipeline/SKILL.md index 824f73cf..de5e522d 100644 --- a/pipelines/article-evaluation-pipeline/SKILL.md +++ b/pipelines/article-evaluation-pipeline/SKILL.md @@ -37,45 +37,24 @@ routing: # Article Evaluation Pipeline Skill -## Operator Context - -This skill operates as an operator for voice authenticity evaluation, configuring Claude's behavior for deterministic validation combined with wabi-sabi-aware analysis. It implements the **Pipeline** architectural pattern -- Fetch, Validate, Analyze, Report -- with **Domain Intelligence** embedded in voice authenticity classification. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before evaluation -- **Over-Engineering Prevention**: Evaluate the article as-is. No speculative corrections, no "while I'm here" rewrites -- **Deterministic Validation Required**: Always use `scripts/voice_validator.py` for pattern matching. Never self-assess voice quality -- **Wabi-Sabi Awareness**: Natural imperfections are FEATURES, not bugs. 
See `skills/shared-patterns/wabi-sabi-authenticity.md` -- **Em-Dash Zero Tolerance**: Em-dashes are an absolute prohibition, always flagged as errors -- **No False Positives on Authenticity**: Do NOT flag typos, run-ons, fragments, self-corrections, or trailing thoughts as errors -- **Artifact Persistence**: Save evaluation report to file, not just context - -### Default Behaviors (ON unless disabled) -- **Full Pipeline Execution**: Run all 4 phases: FETCH -> VALIDATE -> ANALYZE -> REPORT -- **Voice Auto-Detection**: Detect voice based on source context or explicit `--voice` flag -- **Wabi-Sabi Report Section**: Include dedicated analysis of intentional imperfections -- **Banned Pattern Check**: Run zero-tolerance check for AI tells alongside voice validation -- **Line Number Attribution**: Report all findings with specific line numbers -- **Artifact Saving**: Save fetched content to `/tmp/article-evaluation-[timestamp].md` - -### Optional Behaviors (OFF unless enabled) -- **Quick Mode**: Skip wabi-sabi analysis, only run validators (`--quick`) -- **Fix Suggestions**: Generate revision suggestions for failing content (`--suggest-fixes`) -- **Specific Voice Override**: Force voice profile instead of auto-detect (`--voice {name}`) - -## What This Skill CAN Do -- Evaluate articles for voice authenticity through deterministic validation -- Classify imperfections as wabi-sabi markers (keep) vs actual violations (fix) -- Run banned pattern checks with zero tolerance for AI tells -- Generate comprehensive evaluation reports with line-level attribution -- Auto-detect which voice profile to validate against - -## What This Skill CANNOT Do -- Write or generate articles (use voice-orchestrator or research-to-article instead) -- Edit or fix articles (use anti-ai-editor instead) -- Create voice profiles (use voice-calibrator instead) -- Skip the validation phase and self-assess quality -- Run without `scripts/voice_validator.py` available +## Overview + +This skill evaluates 
articles for voice authenticity through a deterministic 4-phase pipeline: **Fetch**, **Validate**, **Analyze**, **Report**. It combines voice pattern validation via `voice_validator.py` with wabi-sabi-aware analysis to distinguish authentic imperfections from actual violations. The evaluation produces a structured report with verdict (AUTHENTIC / NEEDS WORK / FAILED) and line-specific recommendations. + +**Default behaviors (always enabled)**: +- All 4 phases execute in sequence with explicit gates between phases +- Voice profile auto-detection from source context or manual `--voice` override +- Banned pattern zero-tolerance checking alongside voice validation +- Wabi-sabi analysis distinguishing natural imperfections from AI tells +- Line number attribution for all findings +- Evaluation report saved to file artifact + +**Optional behaviors (off unless triggered)**: +- Quick mode: Skip wabi-sabi analysis, report scores only (`--quick`) +- Fix suggestions: Generate revision items for NEEDS WORK/FAILED verdicts (`--suggest-fixes`) +- Force voice: Override auto-detection with explicit profile (`--voice {name}`) + +**Scope boundaries**: This skill evaluates existing articles only. Do not use for writing (use voice-orchestrator), editing (use anti-ai-editor), or creating voice profiles (use voice-calibrator). --- @@ -83,32 +62,30 @@ This skill operates as an operator for voice authenticity evaluation, configurin ### Phase 1: FETCH -**Goal**: Obtain article content in a format suitable for validation. +**Goal**: Obtain article content and identify the target voice profile. **Step 1: Identify source** - Determine whether input is a URL or local file path. **Step 2: Fetch content** - -For URLs: Use WebFetch or curl to retrieve content, extract article body as markdown. -For local files: Read the file directly. 
+- For URLs: Use WebFetch or curl to retrieve content, extract article body as markdown +- For local files: Read the file directly using Read tool **Step 3: Save artifact** - -Save content to `/tmp/article-evaluation-[timestamp].md` for subsequent phases. +Store content to `/tmp/article-evaluation-[timestamp].md` for use in subsequent phases. This persistent artifact is non-negotiable — all downstream validators reference this file. **Step 4: Detect voice** +Identify voice profile from source context: +- Known domain/path mapping → fetch corresponding voice profile +- Unknown source → require explicit `--voice {name}` flag -Detect voice profile from source or context: -- Known domain/path -> mapped voice profile -- Unknown -> require explicit `--voice` flag +Do not guess voice profiles. Wrong profile produces meaningless scores and invalidates downstream analysis. -**Gate**: Article content saved to temp file AND voice profile identified. Proceed only when gate passes. +**Gate**: Article content exists in temp file AND voice profile identified. Proceed only when gate passes. ### Phase 2: VALIDATE -**Goal**: Run deterministic validation against voice profile and banned patterns. +**Goal**: Run deterministic validation against voice profile and banned patterns using `scripts/voice_validator.py`. **Step 1: Voice pattern validation** @@ -128,44 +105,42 @@ python3 $HOME/claude-code-toolkit/scripts/voice_validator.py check-banned \ --content /tmp/article-evaluation.md ``` -Pass criteria: Score = 100 (no banned patterns found). +Pass criteria: Score = 100 (no banned patterns detected). -**Step 3: Record results** +Em-dashes are absolutely prohibited and must always be flagged as errors. This is a zero-tolerance policy. -Capture both scores, all errors, and all warnings with line numbers. +**Step 3: Record results** +Capture both validation scores, all errors with line numbers, and all warnings. 
Trust the script output over subjective assessment — deterministic validation is non-negotiable. -**Gate**: Both validation runs complete with captured output. Proceed only when gate passes. +**Gate**: Both validation runs complete with output captured and line numbers recorded. Proceed only when gate passes. ### Phase 3: ANALYZE (Wabi-Sabi) -**Goal**: Classify imperfections as authentic markers or actual violations. +**Goal**: Classify imperfections as authentic markers or actual violations. This phase distinguishes whether deviations from "perfect" writing are intentional stylistic features or genuine problems. **Step 1: Scan for imperfections** - -Review content for all deviations from "perfect" writing: typos, run-ons, fragments, self-corrections, trailing thoughts, casual contractions. +Review content for all deviations from grammatical perfection: typos, run-ons, fragments, self-corrections, trailing thoughts, casual contractions, natural rhythm breaks. **Step 2: Classify each finding** -For each imperfection found, classify as: -- **WABI-SABI** (KEEP): Intentional imperfection matching the writer's authentic patterns -- **ERROR** (FIX): Actual voice violation or banned pattern -- **WARNING** (REVIEW): Minor rhythm or pattern issue worth noting +For each imperfection found, classify as one of: +- **WABI-SABI** (KEEP): Intentional imperfection matching the writer's authentic patterns. Natural-sounding deviations that enhance authenticity. +- **ERROR** (FIX): Actual voice violation, banned pattern, or content that contradicts the authentic voice. +- **WARNING** (REVIEW): Minor rhythm issue or pattern inconsistency worth documenting but not blocking. -Use `references/wabi-sabi-classification.md` for the full classification guide and decision tree. +Use `references/wabi-sabi-classification.md` for the full decision tree. Do not flag typos, run-ons, or fragments as errors automatically — evaluate them against the writer's authentic voice patterns first. 
**Step 3: Check for suspicious perfection** +Zero wabi-sabi markers is itself a red flag. If no imperfections found, note as suspicious — authentic writing always contains natural imperfections. Over-polished content suggests synthetic generation. -Zero wabi-sabi markers is itself a red flag. If no markers found, note this as suspicious -- authentic writing always contains imperfections. - -**Gate**: All imperfections classified with line numbers and rationale. Proceed only when gate passes. +**Gate**: All imperfections classified with line numbers and rationale. Document the wabi-sabi verdict (markers present / absent). Proceed only when gate passes. ### Phase 4: REPORT -**Goal**: Generate comprehensive evaluation report. +**Goal**: Generate comprehensive evaluation report with verdict and recommendations. **Step 1: Compile findings** - -Aggregate validation scores, wabi-sabi markers, errors, and warnings into the report structure defined in `references/report-template.md`. +Aggregate validation scores, wabi-sabi markers, errors, and warnings into the structured report format defined in `references/report-template.md`. All findings must include line numbers and context. **Step 2: Determine verdict** @@ -175,15 +150,16 @@ Aggregate validation scores, wabi-sabi markers, errors, and warnings into the re | NEEDS WORK | Voice >= 60, banned < 100 (minor violations) | | FAILED | Voice < 60, or major banned pattern violations | -**Step 3: Write recommendations** +A passing score is not about over-polishing (95+). Expect authentic articles to score 70-90 with visible wabi-sabi markers. Scores below 60 or banned pattern violations require remediation. -For NEEDS WORK or FAILED verdicts, list specific items to fix with line numbers. For AUTHENTIC, note what makes it work. 
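The verdict rules in Step 2 can be sketched as a small function. This is a minimal sketch under stated assumptions, not the skill's implementation: the AUTHENTIC row of the table is not visible in this hunk, so it is assumed to mean voice >= 60 with a clean banned-pattern score of 100. The text gives no numeric cutoff between "minor" and "major" banned violations, so that judgment is passed in by the caller. The name `determine_verdict` and its parameters are hypothetical.

```python
def determine_verdict(
    voice_score: float,
    banned_score: float,
    major_banned_violation: bool = False,
) -> str:
    """Map validator scores to a verdict (illustrative sketch only)."""
    # FAILED: voice below the pass threshold, or a major banned-pattern violation.
    if voice_score < 60 or major_banned_violation:
        return "FAILED"
    # NEEDS WORK: voice passes, but the banned-pattern check is not clean.
    if banned_score < 100:
        return "NEEDS WORK"
    # Assumed AUTHENTIC row: voice >= 60 and banned score exactly 100.
    return "AUTHENTIC"
```

Checking the failure conditions first keeps the 60-point threshold in one place and lets the exact score decide borderline cases instead of a "close enough" judgment.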
+**Step 3: Write recommendations** +- For NEEDS WORK or FAILED verdicts: List specific items to fix with line numbers and rationale +- For AUTHENTIC verdicts: Document what makes it work — which wabi-sabi markers contribute to authenticity **Step 4: Output report** +Display report to user and save to file. The report must include verdict, both scores, wabi-sabi analysis, and specific recommendations. -Display report to user and save to file if requested. - -**Gate**: Complete report generated with verdict, scores, and recommendations. Evaluation complete. +**Gate**: Complete report generated with verdict, scores, wabi-sabi analysis, and recommendations. Evaluation complete. --- @@ -191,20 +167,24 @@ Display report to user and save to file if requested. ### Example 1: URL Evaluation User says: "Evaluate this article https://example.com/posts/my-article/" + Actions: 1. Fetch article content, save to temp file, detect voice from context (FETCH) 2. Run voice_validator.py validate + check-banned (VALIDATE) 3. Classify imperfections as wabi-sabi or violations (ANALYZE) 4. Generate report with verdict (REPORT) -Result: Comprehensive evaluation with AUTHENTIC/NEEDS WORK/FAILED verdict + +Result: Comprehensive evaluation with AUTHENTIC/NEEDS WORK/FAILED verdict and line-specific findings ### Example 2: Local File Quick Check User says: "Quick check on ~/myblog/content/posts/draft.md" + Actions: 1. Read local file, save to temp, detect voice from context (FETCH) 2. Run both validators (VALIDATE) -3. Skip wabi-sabi analysis (quick mode) (ANALYZE skipped) +3. Skip wabi-sabi analysis (quick mode enabled) (ANALYZE skipped) 4. Generate abbreviated report with scores only (REPORT) + Result: Fast pass/fail with scores, no wabi-sabi breakdown --- @@ -213,74 +193,31 @@ Result: Fast pass/fail with scores, no wabi-sabi breakdown ### Error: "Voice validator script not found" Cause: `scripts/voice_validator.py` not at expected path or not executable + Solution: 1. 
Verify path: `ls $HOME/claude-code-toolkit/scripts/voice_validator.py` 2. Check permissions: `chmod +x` if needed -3. If missing, cannot proceed -- deterministic validation is non-negotiable +3. If missing, cannot proceed — deterministic validation via the script is a non-negotiable requirement ### Error: "Cannot determine voice profile" Cause: Source does not match any known site mapping and no `--voice` flag provided + Solution: 1. Ask user which voice to validate against 2. Use `--voice {name}` explicit flag -3. Do NOT guess -- wrong profile produces meaningless scores +3. Do NOT guess or assume — wrong profile produces meaningless scores and invalidates the entire evaluation ### Error: "Article content empty or too short" -Cause: WebFetch failed, URL is paywalled, or file path incorrect +Cause: WebFetch failed, URL is paywalled/auth-walled, or file path incorrect + Solution: -1. Verify URL is accessible (check for paywalls, auth walls) +1. Verify URL is accessible (check for paywalls, auth walls, redirects) 2. Try alternative fetch method (curl vs WebFetch) -3. Ask user to provide content directly if URL inaccessible - ---- - -## Anti-Patterns - -### Anti-Pattern 1: Flagging Wabi-Sabi as Errors -**What it looks like**: Reporting typos, run-ons, and fragments as issues to fix -**Why wrong**: These are authenticity markers. Fixing them makes content more synthetic. -**Do instead**: Classify using the wabi-sabi decision tree. Only flag items on the banned list. - -### Anti-Pattern 2: Expecting Perfect Scores -**What it looks like**: Treating 70/100 as a failing score, aiming for 95+ -**Why wrong**: Over-polished content is an AI tell. A score of 70-90 with wabi-sabi markers is more authentic than 95+ with none. -**Do instead**: Pass threshold is 60. Expect authentic articles in the 70-90 range. 
- -### Anti-Pattern 3: Self-Assessing Voice Quality -**What it looks like**: "This sounds like the target voice to me" without running the validator -**Why wrong**: LLM assessment is inconsistent and biased. Deterministic scripts are reproducible. -**Do instead**: Always run `voice_validator.py`. Trust the script over your judgment. - -### Anti-Pattern 4: Skipping Wabi-Sabi Analysis -**What it looks like**: Running validators only and reporting scores without imperfection analysis -**Why wrong**: Misses the key insight -- whether imperfections are features or bugs. Two articles with 75/100 can be very different. -**Do instead**: Complete Phase 3 unless explicitly in quick mode. - -### Anti-Pattern 5: Fixing Articles During Evaluation -**What it looks like**: "Let me also fix these issues I found" during evaluation -**Why wrong**: Evaluation and editing are separate workflows. Mixing them loses objectivity. -**Do instead**: Report findings only. If fixes needed, redirect to anti-ai-editor skill. +3. 
Ask user to provide content directly if URL remains inaccessible --- ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Wabi-Sabi Authenticity](../shared-patterns/wabi-sabi-authenticity.md) - Imperfection classification principles -- [Pipeline Architecture](../shared-patterns/pipeline-architecture.md) - Pipeline design patterns - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I can tell it's authentic without the validator" | Subjective assessment is unreliable | Run voice_validator.py, trust the script | -| "The typos are errors, I should flag them" | Typos in natural positions are wabi-sabi markers | Classify with the decision tree first | -| "Score is close to 60, probably fine" | Probably is not proven | Report exact score, let threshold decide | -| "No need for wabi-sabi analysis, scores tell the story" | Scores miss the authenticity texture | Complete Phase 3 unless quick mode | -| "I'll fix the issues while evaluating" | Evaluation and editing are separate concerns | Report only, redirect to anti-ai-editor | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/report-template.md`: Full report format with verdict criteria and examples - `${CLAUDE_SKILL_DIR}/references/wabi-sabi-classification.md`: Complete marker tables and classification decision tree diff --git a/pipelines/auto-pipeline/SKILL.md b/pipelines/auto-pipeline/SKILL.md index 29e91845..7c04f83e 100644 --- a/pipelines/auto-pipeline/SKILL.md +++ b/pipelines/auto-pipeline/SKILL.md @@ -37,40 +37,11 @@ routing: # Auto-Pipeline -## Operator Context +## Overview This pipeline operates as the automatic fallback for `/do` when no existing route matches a non-trivial request. 
It classifies the task type, selects and adapts a canonical chain, executes it with phase gates, and optionally crystallizes the pattern into a permanent pipeline. -### Hardcoded Behaviors (Always Apply) -- **Dedup Gate**: ALWAYS check pipeline catalog before creating anything. If existing pipeline covers 70%+ of request, route to it instead. -- **8-12 Step Preference**: Prefer longer chains over minimal ones. More steps = more phase gates = higher quality. A 6-step chain that could be 10 steps is under-utilizing the infrastructure. -- **Phase Gates**: Every step must complete before the next begins. No skipping, no parallel steps unless the step explicitly supports parallelism (RESEARCH, REVIEW). -- **Artifact Persistence**: Save intermediate output at each phase to session-local files. Context is ephemeral; files are not. -- **Rule 12 Compliance**: RESEARCH steps MUST use parallel multi-agent dispatch (3-4 agents). Sequential grep-based research is banned. -- **Immediate Crystallization in Toolkit Repo**: If running in this repo (detected by `pipelines/auto-pipeline/` existing in CWD), crystallize on first encounter — don't wait for 3+ runs. -- **No Duplicate Pipelines**: The dedup gate is a HARD BLOCK. If an existing pipeline matches, route to it. Do not rationalize "this is slightly different." - -### Default Behaviors (ON unless disabled) -- **Chain Adaptation**: Extend canonical chains with domain-appropriate steps from the step menu (EXTRACT, COMPILE, ASSESS, SYNTHESIZE, REFINE, COMPARE, LINT, CONFORM, MONITOR). -- **Operator Profile Gates**: Apply personal/work/CI/production gates per the step menu's profile rules. -- **Learning Recording**: After every ephemeral execution, record to learning.db for crystallization tracking. - -### Optional Behaviors (OFF unless enabled) -- **Dry Run**: Show the proposed chain without executing it. -- **Force Ephemeral**: Skip crystallization even in toolkit repo (for testing). 
- -## What This Pipeline CAN Do -- Classify any user request into one of 8 canonical task types -- Select and adapt canonical chains to 8-12 steps -- Execute chains inline with phase gates -- Auto-crystallize into permanent pipelines (toolkit repo: immediate; others: 3+ runs) -- Dedup against existing pipeline catalog - -## What This Pipeline CANNOT Do -- Override existing pipeline routing (dedup gate prevents this) -- Create agents (crystallization uses existing agents or creates via pipeline-scaffolder) -- Bypass the creation gate hook (new pipelines go through proper creation flow) -- Execute without phase gates (every step is gated) +**Key principle**: ALWAYS check the pipeline catalog first. If an existing pipeline covers 70%+ of the request, route to it instead of creating a new one. This prevents duplicate pipeline fragmentation and maintenance burden. --- @@ -80,6 +51,8 @@ This pipeline operates as the automatic fallback for `/do` when no existing rout **Goal**: Ensure we're not duplicating an existing pipeline. +**Why this matters**: ALWAYS check the pipeline catalog first. If an existing pipeline covers 70%+ of the request, route to it instead. Duplicate pipelines fragment routing, create maintenance burden, and confuse discovery. The dedup gate is a HARD BLOCK — do not rationalize "this is slightly different." + **Step 1**: Run task type classification: ```bash python3 ~/.claude/scripts/task-type-classifier.py --request "{user_request}" --check-catalog ~/.claude/skills/auto-pipeline/references/pipeline-catalog.json --json @@ -107,6 +80,10 @@ python3 ~/.claude/scripts/task-type-classifier.py --request "{user_request}" --c **Goal**: Select the best canonical chain variant and extend to 8-12 steps. +**Why 8-12 steps?** Fewer than 8 steps means under-utilizing phase gates, which reduces verification points and quality. Each additional step adds a gate — a mandatory quality checkpoint. 
A 6-step chain that could be 10 is leaving opportunities for validation on the table. Extend all chains to this range. + +**Why sequential?** Every step must complete before the next begins. No skipping, no parallel steps unless the step explicitly supports parallelism (RESEARCH, REVIEW). Phase gates enforce ordering and prevent state leakage between phases. + **Step 1**: Read `pipelines/chain-composer/references/canonical-chains.md` for the full canonical chain and its variants. **Step 2**: Select the best variant based on request analysis: @@ -125,7 +102,7 @@ python3 ~/.claude/scripts/task-type-classifier.py --request "{user_request}" --c - Add REFINE after validation - Add SYNTHESIZE before reporting -**Step 4**: Apply operator profile gates: +**Step 4**: Apply operator profile gates. Chain behavior shifts based on execution context: - Personal: remove APPROVE, PROMPT; reduce GUARD to branch-check only - Work: add CONFORM after GENERATE; full GUARD - CI: skip interaction steps; add NOTIFY @@ -137,6 +114,10 @@ python3 ~/.claude/scripts/task-type-classifier.py --request "{user_request}" --c **Goal**: Determine whether to crystallize immediately or run ephemeral. +**Toolkit repo rule**: If running in this repo (detected by `pipelines/auto-pipeline/SKILL.md` existing in CWD), crystallize on first encounter. This repo IS the pipeline system — every pattern we extract becomes part of the toolkit. Don't wait for 3 runs; capture the pattern immediately. + +**Outside toolkit repo rule**: Wait for 3+ ephemeral executions in the same domain before crystallizing. This ensures the pattern is stable and not a one-off. + **Step 1**: Check if `pipelines/auto-pipeline/SKILL.md` exists in CWD (indicates toolkit repo). **Step 2**: If toolkit repo: @@ -155,13 +136,15 @@ python3 ~/.claude/scripts/task-type-classifier.py --request "{user_request}" --c **Goal**: Create a permanent pipeline from the adapted chain and wire it into routing. 
-This phase IS a pipeline itself (10 steps): +**Why parallel research in CRYSTALLIZE?** Rule 12 is validated by A/B testing — parallel research agents produce 1.40-point quality gap over sequential grep-based research. Dispatch 3-4 parallel agents instead of running searches sequentially. Sequential research is banned in RESEARCH steps. + +**Why 10 steps?** This phase itself is a full pipeline: DETECT → GATHER → RESEARCH (parallel) → COMPOSE → VALIDATE → SCAFFOLD → INTEGRATE → VERIFY → REGISTER → EXECUTE. Every step gates the next. If any step fails, we roll back to ephemeral for the current request and record the failure for investigation. **Step 1 — DETECT**: Confirm crystallization threshold is met (toolkit repo = always; other = 3+ prior runs). **Step 2 — GATHER**: If prior ephemeral runs exist, collect their chain descriptions and outcomes from learning.db. -**Step 3 — RESEARCH**: Use accumulated evidence to inform pipeline design. Dispatch 3 parallel research agents: +**Step 3 — RESEARCH**: Use accumulated evidence to inform pipeline design. Dispatch 3 parallel research agents (never sequential — Rule 12): - Agent 1: Analyze what steps worked in prior ephemeral runs - Agent 2: Find similar existing pipelines to learn from - Agent 3: Identify domain-specific references/scripts needed @@ -172,7 +155,7 @@ This phase IS a pipeline itself (10 steps): - Ensure 8-12 steps - Validate type compatibility -**Step 5 — VALIDATE**: Run `python3 ~/.claude/scripts/artifact-utils.py validate-chain` against the composed chain. +**Step 5 — VALIDATE**: Run `python3 ~/.claude/scripts/artifact-utils.py validate-chain` against the composed chain. If validation fails, fall back to unmodified canonical chain and log the adaptation failure. 
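The fallback in Step 5 can be sketched as follows. The arguments and exit codes of `artifact-utils.py validate-chain` are not shown here, so this sketch does not call the script: the validator is passed in as a callable, and `choose_chain`, `is_valid`, and the logger name are hypothetical.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("auto_pipeline_sketch")  # hypothetical logger name


def choose_chain(
    adapted: Sequence[str],
    canonical: Sequence[str],
    is_valid: Callable[[Sequence[str]], bool],
) -> list[str]:
    """Return the adapted chain if it validates, otherwise the canonical one."""
    if is_valid(adapted):
        return list(adapted)
    # Record the adaptation that failed so it can be investigated later.
    logger.warning("adaptation failed validation: %s", " -> ".join(adapted))
    return list(canonical)
```

In practice `is_valid` would wrap whatever result the deterministic script reports. Keeping it injectable means the fallback logic can be exercised without the script installed.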
**Step 6 — SCAFFOLD**: Create the pipeline skill: - Create `pipelines/{pipeline-name}/SKILL.md` with full operator context @@ -193,12 +176,14 @@ This phase IS a pipeline itself (10 steps): **Step 10 — EXECUTE**: Route the original request through the newly created permanent pipeline. -**Gate**: Pipeline created, integrated, verified. Original request executing through permanent pipeline. +**Gate**: Pipeline created, integrated, verified. Original request executing through permanent pipeline. If any earlier step fails, fall back to ephemeral execution for this request and record the failure. ### Phase 4: EPHEMERAL EXECUTE **Goal**: Execute the adapted chain inline with phase gates, without persistence. +**Why save intermediate artifacts?** Context is ephemeral; files are not. Save intermediate output at each phase to session-local files so work survives context compression and session boundaries. + **Step 1**: Display the ephemeral pipeline banner: ``` =================================================================== @@ -215,29 +200,33 @@ This phase IS a pipeline itself (10 steps): =================================================================== ``` -**Step 2**: Execute each step in sequence: +**Step 2**: Execute each step in sequence with gating: - For each step in the chain: 1. Announce: `[Phase {N}/{total}: {STEP_NAME}]` 2. Execute the step's action (research, compile, generate, validate, etc.) 3. Save output to session-local file: `/tmp/ephemeral-pipeline/{step_name}.md` - 4. Verify gate condition is met - 5. Proceed to next step + 4. Verify gate condition is met (step must complete, output must exist) + 5. Proceed to next step only after gate passes -**Step 3**: For RESEARCH steps (Rule 12): dispatch 3-4 parallel agents: +**Step 3**: For RESEARCH steps (Rule 12 mandatory): dispatch 3-4 parallel agents. Never run grep/search commands sequentially — that is the banned pattern. 
Instead: - Agent 1: Code/content analysis - Agent 2: Usage patterns / ecosystem context - Agent 3: Examples and reference material - Agent 4 (optional): External documentation / API references -**Step 4**: For REVIEW steps: dispatch 3+ parallel reviewers with different lenses. +Collect all results in parallel, then synthesize. + +**Step 4**: For REVIEW steps: dispatch 3+ parallel reviewers with different lenses (contrarian, skeptical, user advocate, etc.). -**Step 5**: For REFINE steps: iterate up to 3 times until validation passes. +**Step 5**: For REFINE steps: iterate up to 3 times until validation passes. After 3 iterations, proceed with best effort output and log what remains unresolved. -**Step 6**: On completion, record learning: +**Step 6**: On completion, record learning for crystallization tracking: ```bash python3 ~/.claude/scripts/learning-db.py learn --skill auto-pipeline "ephemeral {task_type} for {domain}: {chain_description}" ``` +This enables the 3+ run threshold for crystallization outside the toolkit repo. + **Gate**: All steps executed. Output delivered. Learning recorded. --- @@ -245,44 +234,20 @@ python3 ~/.claude/scripts/learning-db.py learn --skill auto-pipeline "ephemeral ## Error Handling ### Error: "Classification returns no matches" -Cause: Request doesn't match any task type keywords. -Solution: Default to `analysis` type (broadest applicability). Log a learning about the unclassifiable request. +**Cause**: Request doesn't match any task type keywords. +**Solution**: Default to `analysis` type (broadest applicability). Log a learning about the unclassifiable request. ### Error: "Dedup gate blocks but user insists" -Cause: Existing pipeline matches 70%+ but user says it doesn't serve their needs. -Solution: User can say "create new pipeline for X, existing one doesn't cover Y." The explicit override bypasses dedup. +**Cause**: Existing pipeline matches 70%+ but user says it doesn't serve their needs. 
+**Solution**: User can say "create new pipeline for X, existing one doesn't cover Y." The explicit override bypasses dedup. ### Error: "Chain validation fails" -Cause: Type compatibility error in adapted chain. -Solution: Fall back to the unmodified canonical chain for that task type. Log the adaptation that failed. +**Cause**: Type compatibility error in adapted chain. +**Solution**: Fall back to the unmodified canonical chain for that task type. Log the adaptation that failed. ### Error: "Crystallization fails mid-scaffold" -Cause: Pipeline creation blocked by creation gate or ADR enforcement. -Solution: Fall back to ephemeral execution for the current request. Record the failure for investigation. - ---- - -## Anti-Patterns - -### Anti-Pattern 1: Creating Duplicate Pipelines -**What it looks like**: Auto-pipeline creates a new pipeline when an existing one covers the same domain. -**Why wrong**: Fragments routing, creates maintenance burden, confuses discovery. -**Do instead**: ALWAYS check the pipeline catalog first. 70%+ coverage = route to existing. - -### Anti-Pattern 2: Minimal Chains -**What it looks like**: Creating a 4-5 step chain when 8-10 steps would be appropriate. -**Why wrong**: Fewer steps = fewer phase gates = fewer verification points = lower quality. -**Do instead**: Extend chains to 8-12 steps using the step menu. Add ASSESS, COMPILE, SYNTHESIZE, REFINE where appropriate. - -### Anti-Pattern 3: Skipping Crystallization in Toolkit Repo -**What it looks like**: Running ephemeral in the toolkit repo instead of creating a permanent pipeline. -**Why wrong**: This repo IS the pipeline system. Every pipeline we create for ourselves becomes part of the toolkit. -**Do instead**: ALWAYS crystallize in the toolkit repo. Immediate, first encounter. - -### Anti-Pattern 4: Sequential Research -**What it looks like**: Running grep/search commands one at a time in a RESEARCH step. 
-**Why wrong**: Rule 12 (validated by A/B test) — sequential research produces 1.40-point quality gap. -**Do instead**: Dispatch 3-4 parallel research agents. Always. +**Cause**: Pipeline creation blocked by creation gate or ADR enforcement. +**Solution**: Fall back to ephemeral execution for the current request. Record the failure for investigation. --- diff --git a/pipelines/chain-composer/SKILL.md b/pipelines/chain-composer/SKILL.md index 17882a14..a1741b7f 100644 --- a/pipelines/chain-composer/SKILL.md +++ b/pipelines/chain-composer/SKILL.md @@ -35,42 +35,13 @@ routing: # Chain Composer Skill -## Operator Context +## Overview -This skill operates as an operator for pipeline chain composition, configuring Claude's behavior for selecting, ordering, and validating pipeline steps into coherent chains. It implements a **Type-Safe Composition** pattern: read the step menu, select steps by task type, apply operator profile gates, validate type compatibility with a deterministic script, and produce a machine-readable Pipeline Spec. +This skill composes valid pipeline chains for each subdomain in a Component Manifest. It reads the step menu, selects and orders steps by task type, applies operator profile gates (personal, work, CI, production), validates type compatibility using a deterministic script (`artifact-utils.py validate-chain`), and produces a machine-readable Pipeline Spec JSON consumed by `pipeline-scaffolder`. -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution. Project instructions override default skill behaviors. -- **Over-Engineering Prevention**: Compose the simplest chain that satisfies the task type. Do not add steps "for completeness" or "in case they need it." Every step in a chain must have a concrete reason to be there. If a subdomain's task type maps to a 5-step canonical chain, don't pad it to 8 steps. 
-- **Deterministic Validation**: Chain correctness is verified by `python3 ~/.claude/scripts/artifact-utils.py validate-chain`, not by LLM self-assessment. The script checks type compatibility, composition rules, ADR-first, and terminal steps. If the script says INVALID, the chain is invalid regardless of how logical it looks. -- **No Duplication**: The step menu lives in `pipelines/pipeline-scaffolder/references/step-menu.md`. The pipeline spec format lives in `pipelines/pipeline-scaffolder/references/pipeline-spec-format.md`. Reference them; do not copy their content into the Pipeline Spec or into this skill's output. -- **Operator Profile Enforcement**: Every chain must be modified by the operator profile from the Component Manifest. Personal chains are lean. Production chains have maximum gates. Skipping profile application produces chains that are unsafe (production) or bloated (personal). +**Core pattern**: template-then-adapt. Start from a canonical chain template for the task type (which encodes correct type flow), then inject domain-specific steps and operator profile gates. This is more reliable than composing from scratch because the type compatibility matrix has 18 families with specific consumption rules, and the canonical templates already encode the correct flow. -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report composition decisions concisely. Show chain visualizations (step arrows) rather than describing them in prose. -- **Temporary File Cleanup**: Remove `/tmp/pipeline-{run-id}/chain-*.json` validation fragments after Phase 3 completes. Keep only the final Pipeline Spec. -- **Canonical Chain Starting Point**: Always start from the canonical chain template for the task type (see `references/canonical-chains.md`), then adapt. Do not compose chains from scratch -- the canonical templates encode hard-won composition patterns. 
-- **Chain Visualization**: When reporting chains, use the arrow format: `ADR -> RESEARCH -> COMPILE -> GENERATE -> VALIDATE -> OUTPUT`. - -### Optional Behaviors (OFF unless enabled) -- **Verbose Composition Log**: Show step-by-step reasoning for each composition decision (step selection, profile gate application, domain-specific additions) -- **Alternative Chain Generation**: Produce 2-3 alternative chains per subdomain with tradeoff analysis - -## What This Skill CAN Do - -- Compose valid pipeline chains for any task type in the step menu (generation, review, debugging, operations, configuration, analysis, migration, testing) -- Apply operator profile gates (personal, work, CI, production) to modify chains -- Add domain-specific steps (LINT, CONFORM, TEMPLATE, DELEGATE) when the subdomain requires them -- Validate all chains using `scripts/artifact-utils.py validate-chain` deterministically -- Produce a complete Pipeline Spec JSON following `pipeline-spec-format.md` -- Iterate on failed chains (max 3 attempts per chain) to fix type incompatibilities - -## What This Skill CANNOT Do - -- **Scaffold pipeline components**: That is `pipeline-scaffolder`. This skill only produces the Pipeline Spec; it does not create agents, skills, hooks, or scripts. -- **Discover subdomains**: That is `domain-research`. This skill consumes the Component Manifest; it does not research the domain. -- **Validate domain-specific content**: The type compatibility check is structural. Domain correctness (e.g., whether a PromQL pattern is valid) is the generated pipeline's responsibility, not this skill's. -- **Override the step menu**: All steps must come from `step-menu.md`. This skill composes from the menu; it does not invent new steps. 
+--- ## Instructions @@ -109,7 +80,7 @@ This skill operates as an operator for pipeline chain composition, configuring C **Goal**: For each subdomain in the Component Manifest, compose a complete, valid pipeline chain by starting from the canonical template and adapting it. -**Why composition follows a template-then-adapt pattern**: Composing chains from scratch is error-prone -- the type compatibility matrix has 18 families with specific consumption rules, and getting the flow wrong is easy. The canonical templates encode the correct type flow for each task type. Starting from a template and adapting is more reliable than assembling step-by-step. +**Why composition follows a template-then-adapt pattern**: Composing chains from scratch is error-prone. The canonical templates encode the correct type flow for each task type. Starting from a template and adapting is more reliable than assembling step-by-step. *Constraint: Always start from the canonical chain template for the task type, then adapt. Do not compose chains from scratch.* For each subdomain in the Component Manifest, execute Steps 1-5: @@ -128,42 +99,50 @@ Use the subdomain's `task_type` to select the starting template from `references | `migration` | ADR -> CHARACTERIZE -> PLAN -> GUARD -> SNAPSHOT -> EXECUTE -> VALIDATE -> OUTPUT | | `testing` | ADR -> RESEARCH -> CHARACTERIZE -> GENERATE -> VALIDATE -> REPORT | +**Constraint: Unknown task_type is a blocker.** If the Component Manifest contains a `task_type` that does not map to any canonical chain template, STOP and ask the user to classify it. Valid task types are the 8 listed above. Do not invent a new task type. 
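Step 1's template lookup and the unknown-`task_type` blocker can be sketched as below. This is a minimal sketch assuming the templates are held in a plain dictionary. Only the `migration` and `testing` chains are copied from the table rows visible above. The other six rows are not shown in this hunk and are deliberately left out. `CANONICAL_CHAINS`, `UnknownTaskTypeError`, and `select_template` are hypothetical names, not toolkit APIs.

```python
# The eight task types named in this skill's documentation.
VALID_TASK_TYPES = (
    "generation", "review", "debugging", "operations",
    "configuration", "analysis", "migration", "testing",
)

# Only the two templates visible in the table above; the remaining six
# are defined in references/canonical-chains.md and are omitted here.
CANONICAL_CHAINS = {
    "migration": ["ADR", "CHARACTERIZE", "PLAN", "GUARD", "SNAPSHOT",
                  "EXECUTE", "VALIDATE", "OUTPUT"],
    "testing": ["ADR", "RESEARCH", "CHARACTERIZE", "GENERATE",
                "VALIDATE", "REPORT"],
}


class UnknownTaskTypeError(ValueError):
    """Signals that composition must stop and the user must classify the task."""


def select_template(task_type: str) -> list[str]:
    """Return a copy of the canonical chain for a known task type."""
    if task_type not in VALID_TASK_TYPES:
        raise UnknownTaskTypeError(
            f"unknown task_type {task_type!r}; ask the user to classify it"
        )
    if task_type not in CANONICAL_CHAINS:
        raise LookupError(f"template for {task_type!r} is not part of this sketch")
    # Copy so later adaptation steps cannot mutate the shared template.
    return list(CANONICAL_CHAINS[task_type])
```

Raising instead of falling back to a default keeps the blocker explicit: an unrecognized `task_type` never silently receives a chain.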
+ **Step 2: Apply domain-specific adaptations** Examine the subdomain's description, references_needed, and scripts_needed to determine if the canonical chain needs domain-specific steps: -- **If generating artifacts with checkable syntax** (e.g., PromQL, HCL, SQL): Insert `LINT` after `GENERATE` and before `VALIDATE` -- **If output must match an external spec** (e.g., OpenAPI, JSON Schema): Insert `CONFORM` after `GENERATE` (or after `LINT` if both apply) -- **If cross-domain expertise is needed**: Insert `DELEGATE` at the point where the other domain's pipeline is invoked -- **If the subdomain has a validation script** in `scripts_needed`: Ensure a `VALIDATE` step exists with `params.script` referencing that script -- **If a template/boilerplate is referenced** in `references_needed`: Consider replacing `GENERATE` with `TEMPLATE` or inserting `TEMPLATE` before `GENERATE` -- **If refinement is expected** (validation may fail iteratively): Insert `REFINE` after `VALIDATE` with `params.max_refine_cycles: 3` +- **If generating artifacts with checkable syntax** (e.g., PromQL, HCL, SQL): Insert `LINT` after `GENERATE` and before `VALIDATE`. *Reason: syntax errors must be caught before validation.* +- **If output must match an external spec** (e.g., OpenAPI, JSON Schema): Insert `CONFORM` after `GENERATE` (or after `LINT` if both apply). *Reason: conformance to spec is a prerequisite to validation.* +- **If cross-domain expertise is needed**: Insert `DELEGATE` at the point where the other domain's pipeline is invoked. *Reason: enables sequential execution across domain boundaries.* +- **If the subdomain has a validation script** in `scripts_needed`: Ensure a `VALIDATE` step exists with `params.script` referencing that script. *Reason: ties generated artifacts to deterministic validation.* +- **If a template/boilerplate is referenced** in `references_needed`: Consider replacing `GENERATE` with `TEMPLATE` or inserting `TEMPLATE` before `GENERATE`. 
*Reason: templates encode domain knowledge that should run before generation.* +- **If refinement is expected** (validation may fail iteratively): Insert `REFINE` after `VALIDATE` with `params.max_refine_cycles: 3`. *Reason: limits iteration cost while permitting recovery from validation failures.* + +*Constraint: Compose the simplest chain that satisfies the task type. Do not add steps "for completeness" or "in case they need it." Every step in a chain must have a concrete reason to be there.* If the subdomain doesn't explicitly require a step, don't add it. **Step 3: Apply operator profile gates** Modify the chain based on the operator profile from the Component Manifest. These rules come from `step-menu.md` Operator Profiles section: -**Personal profile**: +**Personal profile** (most permissive, fewest gates): - Remove `APPROVE` and `PROMPT` steps (full autonomy) - Reduce `GUARD` to `params.checks: ["branch-not-main"]` only - `SIMULATE` and `SNAPSHOT` are available but not mandatory -- only include if the subdomain explicitly needs them -**Work profile**: +**Work profile** (moderate gates): - Add `CONFORM` after `GENERATE` for convention checking (if not already present from domain adaptation) - Full `GUARD -> SNAPSHOT -> EXECUTE -> VALIDATE` for any chain that modifies state - Add `APPROVE` for production-affecting changes -**CI profile**: +**CI profile** (automated, no interaction): - Remove all interaction steps (`PROMPT`, `APPROVE`, `PRESENT`) - Add `NOTIFY` before `OUTPUT`/`REPORT` to send results to PR/Slack - `GUARD` checks dependencies/tools exist, not permissions -**Production profile**: +**Production profile** (maximum gates): - Add `GUARD -> SNAPSHOT` before every `EXECUTE` - Add `SIMULATE` before `EXECUTE` for large-scale changes - Add `APPROVE` before any dangerous operation - Add `PRESENT` before and after `EXECUTE` for visibility +*Constraint: Apply exactly the gates specified by the operator profile. Personal = minimal gates. 
Production = maximum gates. No more, no less. Do NOT add production-grade gates to personal chains or vice versa.* + +**Constraint: Operator profile must come from Component Manifest or user.** If the Component Manifest does not specify an operator_profile, default to `personal` profile (most permissive, fewest gates) and log a warning that profile was defaulted. + **Step 4: Construct Step objects** For each step in the composed chain, construct the Step object per `pipeline-spec-format.md`: @@ -182,7 +161,7 @@ For each step in the composed chain, construct the Step object per `pipeline-spe Rules for constructing Step objects: - `family`: Look up the step in the Step Name enum in `pipeline-spec-format.md` - `output_schema`: Look up the family in the Step Family table to find what it produces -- `consumes`: The first step (ADR) has `null`. Every other step's `consumes` is the `output_schema` of the previous substantive step. For steps after transparent steps (safety, interaction, validation), `consumes` references the last non-transparent step's output +- `consumes`: The first step (ADR) has `null`. Every other step's `consumes` is the `output_schema` of the previous substantive step. *Constraint: For steps after transparent steps (safety, interaction, validation), `consumes` references the last non-transparent step's output.* This is because transparent steps pass through the primary data flow. - `params`: Populate based on the subdomain's references_needed and scripts_needed - `profile_gate`: Set to the minimum profile required for this step (e.g., `"work"` for CONFORM added by work profile). 
Set to `null` for unconditional steps @@ -199,6 +178,8 @@ Before moving to the next subdomain, verify the chain satisfies the composition - [ ] Pipeline Summary is terminal (nothing after OUTPUT/REPORT) - [ ] No duplicate steps (except VALIDATE, VERIFY, CHECKPOINT) +*Constraint: Chain correctness is verified by `python3 ~/.claude/scripts/artifact-utils.py validate-chain`, not by LLM self-assessment. The script checks type compatibility, composition rules, ADR-first, and terminal steps. If the script says INVALID, the chain is invalid regardless of how logical it looks.* + **Gate**: Every subdomain has a complete chain. All chains follow composition rules. All Step objects have valid family, output_schema, and consumes fields. Proceed to Phase 3. --- @@ -207,7 +188,7 @@ Before moving to the next subdomain, verify the chain satisfies the composition **Goal**: Validate all chains using `scripts/artifact-utils.py validate-chain`. This is the critical quality gate -- it catches type incompatibilities that visual inspection misses. -**Why deterministic validation is mandatory**: The type compatibility matrix has complex rules: transparent steps pass through primary data flow, safety steps wrap rather than replace, and some steps appear in multiple families with different semantics (e.g., EXECUTE is in git-release, LINT is in both validation and domain-extension). The script handles all these edge cases correctly. LLM judgment is not sufficient -- it will rationalize subtle type mismatches as "close enough." +**Why deterministic validation is mandatory**: The type compatibility matrix has complex rules: transparent steps pass through primary data flow, safety steps wrap rather than replace, and some steps appear in multiple families with different semantics (e.g., EXECUTE is in git-release, LINT is in both validation and domain-extension). The script handles all these edge cases correctly. *Constraint: LLM judgment is not sufficient. The script is the source of truth. 
Trust the script over visual inspection.* **Step 1**: Create a temporary directory for validation artifacts: ```bash @@ -245,7 +226,7 @@ python3 ~/.claude/scripts/artifact-utils.py validate-chain /tmp/pipeline-{run-id **Step 5**: If any chain fails validation, fix the composition error in Phase 2 and re-validate. Track the iteration count per chain. -- **Maximum 3 iterations per chain**. If a chain fails validation 3 times, STOP and report the persistent failure to the user with the specific type incompatibility. Do not guess at fixes beyond 3 attempts -- the chain template may need human review. +*Constraint: Maximum 3 iterations per chain.* If a chain fails validation 3 times, STOP and report the persistent failure to the user with the specific type incompatibility. Do not guess at fixes beyond 3 attempts -- the chain template may need human review. **Step 6**: After all chains pass, clean up temporary validation fragments: ```bash @@ -260,7 +241,7 @@ rm -f /tmp/pipeline-{run-id}/chain-*.json **Goal**: Produce the complete Pipeline Spec JSON and a human-readable summary. The Pipeline Spec is the contract that `pipeline-scaffolder` consumes. -**Why the Pipeline Spec is JSON, not markdown**: The scaffolder needs machine-readable data to derive build targets (skills, references, scripts, agents, routing). JSON is the contract format defined in `pipeline-spec-format.md`. The human-readable summary is a companion, not a replacement. +**Why the Pipeline Spec is JSON, not markdown**: *Constraint: The scaffolder needs machine-readable data to derive build targets (skills, references, scripts, agents, routing). JSON is the contract format defined in `pipeline-spec-format.md`.* The human-readable summary is a companion, not a replacement. *Constraint: Do NOT produce markdown instead of JSON. The scaffolder consumes JSON. 
It parses field names, iterates subdomain arrays, and extracts step objects programmatically.* **Step 1**: Load `pipelines/pipeline-scaffolder/references/pipeline-spec-format.md` for the exact format contract. Verify every field requirement against what you will produce. @@ -307,7 +288,7 @@ Naming rules: ``` Top-level validation: -- Exactly one of `new_agent` or `reuse_agent` is non-null +- Exactly one of `new_agent` or `reuse_agent` is non-null. *Constraint: Whether to create a new agent vs. reuse must come from Component Manifest or user. Do NOT guess.* - `operator_profile` is a valid profile enum value - `subdomains` is non-empty @@ -368,6 +349,8 @@ Total steps across all chains: {total} **Gate**: Pipeline Spec JSON exists at `/tmp/pipeline-{run-id}/pipeline-spec.json`. All chains pass final `validate-chain`. Content summary generated. Handoff to `pipeline-scaffolder`. +--- + ## Error Handling ### Error: Type Incompatibility in Chain @@ -390,57 +373,10 @@ Total steps across all chains: {total} **Cause**: The Component Manifest does not specify an operator_profile. **Solution**: Default to `personal` profile (most permissive, fewest gates). Log a warning that profile was defaulted. This avoids blocking the pipeline but may produce under-gated chains for work/production environments. -## Anti-Patterns - -### Hardcoding Chains Instead of Composing -**What it looks like**: Writing a fixed chain for a subdomain without consulting the canonical template or step menu. -**Why wrong**: Hardcoded chains bypass the type compatibility system. They work by coincidence until the step menu evolves and the hardcoded types no longer align. They also miss operator profile gates. -**Do instead**: Always start from the canonical chain template for the task type, then adapt with domain-specific steps and profile gates. The template encodes the correct type flow. 
- -### Skipping Deterministic Validation -**What it looks like**: Composing chains in Phase 2 and proceeding directly to Phase 4 output without running `validate-chain`. -**Why wrong**: The type compatibility matrix has 18 families with specific rules. Transparent steps, primary data flow pass-through, and multi-family steps (LINT, EXECUTE, TEMPLATE) create subtle edge cases. LLM self-assessment will rationalize type mismatches as acceptable. -**Do instead**: Always run `python3 ~/.claude/scripts/artifact-utils.py validate-chain` for every chain. The script is the source of truth. - -### Padding Chains with Unnecessary Steps -**What it looks like**: A personal-profile `generation` chain that includes GUARD, SNAPSHOT, APPROVE, PRESENT, and CONFORM "for safety." -**Why wrong**: Over-gated chains waste execution time and context. Operator profiles exist precisely to prevent this -- personal profiles are lean by design. Adding production-grade gates to personal chains violates the profile contract. -**Do instead**: Apply exactly the gates specified by the operator profile. Personal = minimal gates. Production = maximum gates. No more, no less. - -### Producing Markdown Instead of JSON -**What it looks like**: Writing the Pipeline Spec as a markdown document with chain visualizations but no JSON. -**Why wrong**: The scaffolder consumes JSON. It parses field names, iterates subdomain arrays, and extracts step objects programmatically. Markdown is for humans; JSON is the machine contract. -**Do instead**: Always produce `pipeline-spec.json` following `pipeline-spec-format.md` exactly. The `content.md` companion is for human readability, not for the scaffolder. 
- -## Anti-Rationalization - -| Rationalization Attempt | Why It's Wrong | Required Action | -|------------------------|----------------|-----------------| -| "The chain looks correct, no need to run validate-chain" | Visual inspection misses transparent step pass-through and multi-family step resolution | Run `validate-chain` for every chain, no exceptions | -| "This subdomain is simple, I can compose from scratch" | Even simple chains need correct type flow; canonical templates guarantee it | Start from canonical template, then adapt | -| "Adding extra safety steps won't hurt" | Extra steps bloat execution, waste context, and violate profile contracts | Apply only the gates specified by the operator profile | -| "The type mismatch is minor, the scaffolder can handle it" | The scaffolder rejects invalid specs; it does not fix them | Fix all type incompatibilities before producing the spec | -| "I'll fix the Pipeline Spec format later" | The spec is a contract; partial compliance means the scaffolder rejects it | Follow `pipeline-spec-format.md` exactly from the start | -| "Three validation failures means the template is wrong" | Three failures means your adaptation broke the type flow; re-examine your domain-specific insertions | Review each insertion point for type compatibility before retrying | - -## Blocker Criteria - -STOP and ask the user (do NOT proceed autonomously) when: - -| Situation | Why Stop | Ask This | -|-----------|----------|----------| -| Component Manifest has fewer than 2 subdomains | Minimum subdomain count not met; domain-research may be incomplete | "The manifest has only {N} subdomain(s). Re-run domain-research or proceed with {N}?" | -| A subdomain's task_type has no canonical template | Cannot compose without a starting point | "Subdomain '{name}' has task_type '{type}' which has no canonical chain. How should I classify it?" 
| -| A chain fails validate-chain 3 times | Structural issue beyond iterative fixing | "Chain for '{subdomain}' fails validation after 3 attempts. Error: {error}. How should I restructure it?" | -| Operator profile conflicts with subdomain requirements | E.g., production subdomain under personal profile | "Subdomain '{name}' involves production state changes but profile is 'personal'. Override to 'work' or 'production'?" | - -### Never Guess On -- Which operator profile to apply (must come from Component Manifest or user) -- Whether to create a new agent vs. reuse (must come from Component Manifest) -- The correct task_type for an ambiguous subdomain (ask the user) -- Whether a validation failure is a false positive (trust the script) - ## References - **Canonical Chains**: [references/canonical-chains.md](references/canonical-chains.md) -- 8 task-type templates with variants - **Artifact Utils Script**: `scripts/artifact-utils.py validate-chain` -- Deterministic chain validator +- **Step Menu**: [../pipeline-scaffolder/references/step-menu.md](../pipeline-scaffolder/references/step-menu.md) -- Step families, type compatibility matrix, composition rules, operator profiles +- **Pipeline Spec Format**: [../pipeline-scaffolder/references/pipeline-spec-format.md](../pipeline-scaffolder/references/pipeline-spec-format.md) -- Machine-readable contract format +- **Domain Research**: [../domain-research/SKILL.md](../domain-research/SKILL.md) -- Input skill that produces Component Manifest diff --git a/pipelines/comprehensive-review/SKILL.md b/pipelines/comprehensive-review/SKILL.md index 51c3c9f4..77652cec 100644 --- a/pipelines/comprehensive-review/SKILL.md +++ b/pipelines/comprehensive-review/SKILL.md @@ -64,51 +64,27 @@ routing: # Comprehensive Code Review v4 — Four-Wave Hybrid Architecture -Four-wave review with per-package deep analysis and adversarial challenge. 
Wave 0 auto-discovers packages/modules and dispatches one language-specialist agent per package to read ALL code in that package. Wave 1 (12 foundation agents, including newcomer perspective) runs in parallel with Wave 0 context. Wave 2 (10 deep-dive agents) receives Wave 0+1 findings for targeted analysis. Wave 3 (4-5 adversarial agents) challenges Wave 1+2 consensus — are findings actually important? Are tradeoffs justified? Should the PR be split? All findings are aggregated with wave-agreement labels, deduplicated, and auto-fixed. +Four-wave review with per-package deep analysis and adversarial challenge. Wave 0 auto-discovers packages and dispatches one language-specialist agent per package to read ALL code in that package. Wave 1 (12 foundation agents) runs in parallel with Wave 0 context. Wave 2 (10 deep-dive agents) receives Wave 0+1 findings for targeted analysis. Wave 3 (4-5 adversarial agents) challenges Wave 1+2 consensus. All findings are aggregated with wave-agreement labels, deduplicated, and auto-fixed. -**How this differs from existing skills**: -- `/parallel-code-review`: 3 agents (security, business, arch) — report only -- `/comprehensive-review`: **25+ agents in 4 waves** — per-package + cross-cutting + adversarial review AND fix everything +**vs `/parallel-code-review`**: 3 agents (security, business, arch) — report only +**vs `/comprehensive-review`**: 25+ agents in 4 waves — per-package + cross-cutting + adversarial AND fix everything --- -## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before dispatching reviewers -- **Four-Wave Parallelism**: Wave 0 agents dispatched in batches of 10. Wave 1 agents MUST be dispatched in a SINGLE message. Wave 2 agents MUST be dispatched in a SINGLE message after Wave 1 completes. Wave 3 agents MUST be dispatched in a SINGLE message after Wave 2 completes. -- **Context Cascading**: Wave 1 receives Wave 0 per-package findings. 
Wave 2 receives Wave 0+1 findings. Wave 3 receives Wave 0+1+2 findings and is instructed to CHALLENGE consensus, not reinforce it. -- **Fix Everything, Defer Nothing**: After all waves complete, fix EVERY finding. No deferrals. No "out of scope." No "will fix later." The only exception is BLOCKED (fix + alternative both break tests, <10% of total). -- **Worktree Isolation**: Fixes happen on a new branch, never on the current working branch directly -- **Severity Aggregation**: Combine findings by severity before fixing -- **Phase Gates Enforced**: Each phase must complete before the next begins -- **No Skipping Agents**: All agents run even for "simple" changes -- **No "Acceptable" Disposition**: The only valid review dispositions are FIX NOW, FIX IN FOLLOW-UP (with mandatory tracking artifact — see Deferred Finding Tracking below), or NOT AN ISSUE (with evidence). "Acceptable", "valid but deferred", "real but not blocking", and "conservative" are NOT valid dispositions. If an issue is real, it gets tracked. If it's not real, provide evidence. Include this rule in every Wave 1 and Wave 2 agent dispatch prompt. -- **Deferred Finding Tracking**: Before marking any finding as FIX IN FOLLOW-UP, create a tracking artifact (GitHub issue, `TODO(follow-up):` in code, or learning.db entry). The review cannot proceed past the aggregation phase until all FIX IN FOLLOW-UP findings have a tracking artifact. Report artifact locations in the review summary under a "Deferred Findings" table. - -### Default Behaviors (ON unless disabled) -- **Wave 0 Per-Package Review**: Auto-discover packages/modules and dispatch one agent per package. Adds deep per-package context that cross-cutting agents miss. 
-- **Smart Agent Selection**: Detect file types and PR context to choose relevant agents -- **Deduplication**: Merge overlapping findings from multiple agents, keep highest severity -- **Fix Verification**: Run tests/linters after each fix batch to catch regressions -- **Report Generation**: Write `comprehensive-review-report.md` to repo root -- **Quick Wins First**: Fix lowest-risk issues first to build momentum - -### Optional Behaviors (OFF unless enabled) -- **--review-only**: Skip fix phase, report only (like parallel-code-review) +## Flags + +- **--review-only**: Skip fix phase, report only - **--skip-wave0**: Skip Wave 0 per-package review (faster, less thorough) - **--wave1-only**: Run only Wave 1 (12 agents), skip Wave 0, Wave 2, and Wave 3 - **--focus [files]**: Review only specified files instead of full diff - **--severity [critical|high|medium|all]**: Only report/fix findings at or above severity -- **--org-conventions**: Pass organization-specific convention flags to reviewer-language-specialist. Configure organization detection in `scripts/classify-repo.py`. +- **--org-conventions**: Pass organization-specific convention flags to reviewer-language-specialist --- -## Instructions - -### Phase 0.5: STATIC ANALYSIS (Mandatory Prerequisite) +## Phase 0.5: STATIC ANALYSIS (Mandatory Prerequisite) -**Goal**: Run deterministic static analysis BEFORE dispatching any review agents. Linters catch syntactic issues that LLM agents miss. Findings feed into Wave 1 as input context. +Run deterministic static analysis BEFORE dispatching any review agents. Linters catch syntactic issues that LLM agents miss. Findings feed into Wave 1 as input context. **Step 1: Detect language and run linters** @@ -135,7 +111,7 @@ fi **Step 2: Auto-fix trivial findings** -Auto-fixable findings (gofmt, goimports, ruff format) should be fixed BEFORE agent review begins. This prevents agents from wasting time on formatting issues. 
+Auto-fixable findings (gofmt, goimports, ruff format) should be fixed BEFORE agent review begins — prevents agents from wasting time on formatting. ```bash # Go: auto-fix formatting @@ -157,9 +133,9 @@ Save non-auto-fixable lint findings to pass as context to all Wave 1 agents. Inc --- -### Phase 1: SCOPE +## Phase 1: SCOPE -**Goal**: Determine what to review and which agents to dispatch. +**Goal**: Determine what to review, discover packages, initialize findings directory. **Step 1: Identify changed files** @@ -180,97 +156,12 @@ git diff --name-only HEAD~1 **Step 2: Detect organization conventions** ```bash -# Auto-detect repo type (deterministic — no LLM judgment needed) REPO_TYPE=$(python3 ~/.claude/scripts/classify-repo.py --type-only 2>/dev/null || echo "personal") ``` -If the repo belongs to a protected organization with custom conventions: -- Set convention flags for the rest of this review -- Wave 1 Agent 9 (`reviewer-language-specialist`) gets organization-specific flags appended -- Log: "Organization conventions detected — reviewer-language-specialist will check org-specific patterns" - -**Step 3: Understand the four-wave agent roster** - -#### Wave 0: Per-Package Deep Review (Auto-Discovered) - -One language-specialist agent per discovered package reads ALL code in that package. This wave discovers issues that cross-cutting agents miss because they have full package context. - -| Language | Discovery Command | Agent Type | Min Package Size | -|----------|-------------------|------------|-----------------| -| Go | `find . -name "*.go" -path "*/internal/*" \| xargs dirname \| sort -u` | `golang-general-engineer-compact` | 1 file | -| Go (also) | `find . -name "*.go" -not -path "*/internal/*" -not -path "*/vendor/*" \| xargs dirname \| sort -u` | `golang-general-engineer-compact` | 1 file | -| Python | `find . -name "__init__.py" \| xargs dirname \| sort -u` | `python-general-engineer` | 1 file | -| TypeScript | `find . 
-name "index.ts" -o -name "index.tsx" \| xargs dirname \| sort -u` | `typescript-frontend-engineer` | 1 file | - -**Wave 0 produces**: Per-package findings with full context (every file in the package was read). These findings inform Wave 1+2 about package-level patterns, internal APIs, and local code quality. - -#### Wave 1: Foundation Agents (Independent Analysis) - -These agents run in parallel with Wave 0 per-package findings as context. They perform cross-cutting analysis that spans packages and establish the foundation for Wave 2. - -| # | Agent | Focus Area | Key Catches | -|---|-------|------------|-------------| -| 1 | `reviewer-security` | Security | OWASP Top 10, auth, injection, secrets | -| 2 | `reviewer-business-logic` | Domain | Edge cases, state transitions, requirement gaps | -| 3 | Architecture reviewer* | Architecture | Patterns, naming, structure, idioms | -| 4 | `reviewer-silent-failures` | Error Handling | Swallowed errors, empty catches, bad fallbacks | -| 5 | `reviewer-test-analyzer` | Test Coverage | Coverage gaps, fragile tests, missing negative cases | -| 6 | `reviewer-type-design` | Type Design | Weak invariants, leaky encapsulation | -| 7 | `reviewer-code-quality` | Quality/Style | CLAUDE.md violations, convention drift | -| 8 | `reviewer-comment-analyzer` | Documentation | Comment rot, misleading docs, stale TODOs | -| 9 | `reviewer-language-specialist` | Language Idioms | Modern stdlib, concurrency, LLM tells, org-specific rules | -| 10 | `reviewer-docs-validator` | Project Health | README, CLAUDE.md, deps, CI, build system | -| 11 | `reviewer-adr-compliance` | ADR Compliance | Implementation matches ADR decisions, no scope creep | -| 12 | `reviewer-newcomer` | Newcomer Perspective | Documentation gaps, confusing code, implicit assumptions, onboarding friction | - -*Architecture reviewer selection by language: - -| File Types | Agent | -|-----------|-------| -| `.go` files | `golang-general-engineer` or 
`golang-general-engineer-compact` | -| `.py` files | `python-general-engineer` | -| `.ts`/`.tsx` files | `typescript-frontend-engineer` | -| Mixed or other | `Explore` | - -#### Wave 2: Deep-Dive Agents (Context-Aware Analysis) - -These agents receive Wave 0+1 aggregated findings as input. They perform targeted deep-dives informed by per-package analysis (Wave 0) and cross-cutting analysis (Wave 1). - -| # | Agent | Focus Area | Wave 1 Context Used | -|---|-------|------------|---------------------| -| 11 | `reviewer-performance` | Performance | Architecture findings → focus on hot paths | -| 12 | `reviewer-concurrency` | Concurrency | Silent-failure + architecture findings → concurrent paths | -| 13 | `reviewer-api-contract` | API Contracts | Business-logic + type-design findings → contract-sensitive code | -| 14 | `reviewer-dependency-audit` | Dependencies | Docs-validator findings → dependency documentation gaps | -| 15 | `reviewer-error-messages` | Error Messages | Silent-failure + code-quality findings → error paths | -| 16 | `reviewer-dead-code` | Dead Code | Code-quality + docs-validator findings → abandoned artifacts | -| 17 | `reviewer-naming-consistency` | Naming | Code-quality + language-specialist findings → convention baselines | -| 18 | `reviewer-observability` | Observability | Silent-failure findings → observability gaps at failure points | -| 19 | `reviewer-config-safety` | Config Safety | Security + docs-validator findings → config security gaps | -| 20 | `reviewer-migration-safety` | Migration Safety | API-contract + business-logic findings → migration-sensitive changes | - -#### Wave 3: Adversarial Perspectives (Consensus Challenge) - -These agents receive Wave 0+1+2 aggregated findings as input. Their job is to CHALLENGE the consensus — not reinforce it. Wave 3 agents push back on findings, question whether issues are real, and surface tradeoffs that earlier waves may have accepted uncritically. 
- -| # | Agent | Role | Challenge Focus | -|---|-------|------|----------------| -| 21 | `reviewer-contrarian` | Challenges findings | Are these findings actually important? Which are false positives? Which are over-severity? | -| 22 | `reviewer-skeptical-senior` | Experience-based skepticism | "I've seen this before" — which findings are theoretical vs real-world issues? | -| 23 | `reviewer-user-advocate` | User impact assessment | Does this change break users? Are UX tradeoffs justified? Are migration paths safe? | -| 24 | `reviewer-meta-process` | Process/approach review | Is this the right approach? Should the PR be split? Is the review itself focused correctly? | - -**Conditional: SAPCC Structural Review** - -| # | Agent | Condition | Challenge Focus | -|---|-------|-----------|----------------| -| 25 | `reviewer-sapcc-structural` | Repo contains ANY of: `hybris/`, `core-customize/`, `config/localextensions.xml`, or `manifest.json` with `"commerceSuiteVersion"` | SAP Commerce Cloud structural integrity — extension wiring, build manifest, data model impacts | - -If the SAPCC condition is not met, skip `reviewer-sapcc-structural` silently (no warning, no log). - -**Step 3: Initialize findings directory** +If the repo belongs to a protected organization: set convention flags for the rest of this review. Wave 1 Agent 9 (`reviewer-language-specialist`) gets organization-specific flags appended. Log: "Organization conventions detected." -Create a temporary directory to persist findings across waves. This is critical — without it, context compaction between waves loses all prior findings. +**Step 3: Initialize findings directory** — critical for surviving context compaction between waves ```bash REVIEW_DIR="/tmp/claude-review/$(date +%Y%m%d-%H%M%S)" @@ -278,236 +169,66 @@ mkdir -p "$REVIEW_DIR" echo "Review findings directory: $REVIEW_DIR" ``` -All subsequent phases MUST write their findings to `$REVIEW_DIR/` and read prior wave findings from there. 
This ensures findings survive context compaction between waves. - -| File | Written By | Read By | -|------|-----------|---------| -| `$REVIEW_DIR/wave0-findings.md` | Phase 1c | Phase 2a, 2b | -| `$REVIEW_DIR/wave1-findings.md` | Phase 2b | Phase 3a | -| `$REVIEW_DIR/wave01-summary.md` | Phase 2b | Phase 3a | -| `$REVIEW_DIR/wave2-findings.md` | Phase 3b | Phase 3c | -| `$REVIEW_DIR/wave012-summary.md` | Phase 3b | Phase 3c | -| `$REVIEW_DIR/wave3-findings.md` | Phase 3d | Phase 4 | -| `$REVIEW_DIR/final-report.md` | Phase 3d | Phase 4, Phase 5 | +All subsequent phases MUST write their findings to `$REVIEW_DIR/` and read prior wave findings from there. **Step 4: Create task_plan.md** -```markdown -# Task Plan: Comprehensive Review v3 - -## Goal -Three-wave review and auto-fix of [N] changed files across [N] packages. - -## Phases -- [ ] Phase 1: Scope (identify files, discover packages) -- [ ] Phase 1b: Wave 0 Dispatch (per-package deep review) -- [ ] Phase 1c: Wave 0 Aggregate (per-package findings) -- [ ] Phase 2a: Wave 1 Dispatch (12 foundation agents + Wave 0 context) -- [ ] Phase 2b: Wave 1 Aggregate (collect and summarize Wave 0+1 findings) -- [ ] Phase 3a: Wave 2 Dispatch (10 deep-dive agents with Wave 0+1 context) -- [ ] Phase 3b: Wave 2 Aggregate (merge Wave 0+1+2 findings) -- [ ] Phase 3c: Wave 3 Dispatch (4-5 adversarial agents with Wave 0+1+2 context) -- [ ] Phase 3d: Wave 3 Aggregate (merge adversarial challenges, label agreement) -- [ ] Phase 4: Fix (auto-fix on branch) -- [ ] Phase 5: Report (write report, verify) - -## Review Profile -- Files: [list] -- Packages discovered: [N] -- Wave 0 agents: [N] (one per package) -- Wave 1 agents: 12 -- Wave 2 agents: 10 -- Wave 3 agents: 4-5 (adversarial; 5 if SAPCC detected) -- Org conventions: [detected org or none] -- Mode: [review+fix | review-only] - -## Findings Directory -$REVIEW_DIR = [path from Step 3] - -## Status -**Currently in Phase 1** - Discovering packages -``` +Read 
`${CLAUDE_SKILL_DIR}/references/output-templates.md` for the task_plan.md template. Fill it in with actual file count, package count, detected org, and mode. **Gate**: Files identified, packages discovered, findings directory created, plan created. Proceed to Phase 1b. --- -### Phase 1b: WAVE 0 DISPATCH — Per-Package Deep Review - -**Goal**: Dispatch one language-specialist agent per discovered package. Each agent reads ALL files in its package for deep, contextual review. - -**Skip if**: `--skip-wave0` flag is set or `--wave1-only` flag is set. - -**Step 1: Discover packages** - -```bash -# Go packages (internal/ and top-level) -find . -name "*.go" -not -path "*/vendor/*" -not -path "*/.git/*" | xargs dirname | sort -u - -# Python packages -find . -name "__init__.py" -not -path "*/venv/*" -not -path "*/.git/*" | xargs dirname | sort -u - -# TypeScript modules (directories with index.ts/tsx) -find . -name "index.ts" -o -name "index.tsx" | grep -v node_modules | xargs dirname | sort -u -``` - -**Step 2: Select agent type per language** +## Phase 1b: WAVE 0 DISPATCH — Per-Package Deep Review -| Language | Agent | -|----------|-------| -| Go | `golang-general-engineer-compact` | -| Python | `python-general-engineer` | -| TypeScript | `typescript-frontend-engineer` | -| Mixed | Use language of majority files in that package | +**Skip if**: `--skip-wave0` or `--wave1-only` flag is set. 
-**Step 3: Dispatch agents in batches of 10** +Read `${CLAUDE_SKILL_DIR}/references/wave-0-per-package.md` for: +- Package discovery commands by language +- Agent selection table +- The complete per-package agent dispatch prompt +- Batch size rules (max 10 agents per message) +- Wave 0 aggregate output format -Each per-package agent gets this prompt: - -``` -PER-PACKAGE DEEP REVIEW — Wave 0 - -PACKAGE: [package path] -LANGUAGE: [Go/Python/TypeScript] - -MCP TOOL DISCOVERY (do this FIRST, before reading package files): -- Use ToolSearch to check for available MCP tools that can enhance your analysis: - a. Run ToolSearch("gopls") — if this is a Go package, loads type-aware analysis - tools (go_file_context, go_diagnostics, go_symbol_references, etc.) - b. Run ToolSearch("context7") — loads library documentation lookup tools -- If gopls tools are available AND LANGUAGE is Go: - * Use go_file_context after reading each .go file for intra-package dependency context - * Use go_diagnostics on the package to detect build/analysis errors - * Use go_symbol_references to check for unused or misused exported symbols -- If Context7 tools are available: - * Use resolve-library-id + query-docs for unfamiliar library APIs in this package - -INSTRUCTIONS: -1. Read the CLAUDE.md file(s) in this repository first -2. Run MCP TOOL DISCOVERY steps above -3. Read EVERY file in this package directory: [package path]/ -4. Understand the package's purpose, internal APIs, and relationships -5. Review ALL code for issues — you have full package context -6. Use MCP tools (gopls, Context7) as you review for type-aware precision -7. Focus on issues that require understanding the WHOLE package: - - Internal API misuse between files in this package - - Inconsistent error handling patterns within the package - - Missing or redundant functionality - - Package-level design issues (cohesion, coupling) - - Test coverage relative to package complexity -8. 
Use structured output format with severity classification -9. Include file:line references for every finding - -CONTEXT: This is Wave 0 of a comprehensive review. Your per-package findings -will be passed to 20 cross-cutting review agents in Waves 1 and 2. Focus on -issues that require full package context to detect — cross-cutting agents will -handle file-level and project-level concerns. - -OUTPUT FORMAT: -### PACKAGE: [package path] -**Files reviewed**: [list all files read] -**Package purpose**: [1-sentence summary] -**Package health**: [HEALTHY | MINOR_ISSUES | NEEDS_ATTENTION | CRITICAL] - -Findings: -### [CRITICAL|HIGH|MEDIUM|LOW]: [One-line summary] -**File**: `path/to/file:LINE` -**Issue**: [Description] -**Impact**: [Why this matters] -**Fix**: [Concrete code fix] -**Requires package context**: [Why a single-file reviewer would miss this] ---- -``` - -**CRITICAL**: Dispatch up to 10 agents per message. If more than 10 packages are discovered, use multiple batches: -- Batch 1: packages 1-10 (dispatch in ONE message) -- Batch 2: packages 11-20 (dispatch in ONE message after batch 1 completes) -- Continue until all packages are covered - -Use `model: sonnet` for ALL per-package agents. Never use haiku for code review. The orchestrator runs on Opus; dispatched agents run on Sonnet for cost efficiency (40% savings, minimal quality tradeoff). +**Dispatch**: One language-specialist agent per discovered package. Each reads ALL files in its package. Batches of up to 10 agents per message. **Gate**: All per-package agents dispatched and completed. Proceed to Phase 1c. --- -### Phase 1c: WAVE 0 AGGREGATE — Per-Package Summary - -**Goal**: Collect Wave 0 findings into a per-package summary for Wave 1+2 context injection. - -**Step 1: Collect all Wave 0 findings** - -Read each per-package agent's output. Extract: package path, health rating, findings with severity. 
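The Phase 1b batching rule above (at most 10 per-package agents per message, with the next batch sent only after the previous one completes) can be sketched in shell. This is a minimal illustration under stated assumptions, not the skill's actual implementation: `batch_packages` is a hypothetical helper, it only groups package paths, and it assumes paths contain no whitespace. The dispatch itself remains the orchestrator's job.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read package paths (one per line) on stdin and print
# them grouped into batches of at most $1 entries, one batch per line.
# Assumption: package paths contain no whitespace.
batch_packages() {
  local size="${1:-10}"
  awk -v n="$size" '
    # Start a new batch line when count is 0, otherwise append to it.
    { line = (count == 0) ? $0 : line " " $0; count++ }
    # Emit the batch once it is full.
    count == n { print line; count = 0 }
    # Emit the final, partially filled batch if there is one.
    END { if (count > 0) print line }
  '
}

# Example: 23 packages produce 3 batches (10, 10, and 3 packages).
# seq 1 23 | sed 's|^|pkg/|' | batch_packages 10
```

Each output line corresponds to one dispatch message, which keeps the "one batch per message" rule visible in the data rather than in the orchestrator's memory.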
- -**Step 2: Build Wave 0 Summary** +## Phase 1c: WAVE 0 AGGREGATE -```markdown -## Wave 0 Per-Package Findings Summary (for Wave 1+2 context) +Collect Wave 0 findings into a per-package summary. Read `${CLAUDE_SKILL_DIR}/references/wave-0-per-package.md` for the aggregate output format. -### Packages Reviewed: [N] -### Packages Healthy: [N] | Minor Issues: [N] | Needs Attention: [N] | Critical: [N] +**Step 1**: Collect all per-package agent outputs. Extract: package path, health rating, findings with severity. -### Per-Package Results +**Step 2**: Identify cross-package patterns (e.g., "5 packages have inconsistent error handling"). These are especially valuable for Wave 1+2 agents. -#### [package/path1] — [HEALTHY|MINOR_ISSUES|NEEDS_ATTENTION|CRITICAL] -- Purpose: [1-sentence] -- Files: [N] -- Findings: [N] (CRITICAL: N, HIGH: N, MEDIUM: N, LOW: N) -- Key issues: - - [SEVERITY]: [summary] at [file:line] - - ... - -#### [package/path2] — [STATUS] -- ... - -### Cross-Package Patterns Detected -- [Pattern 1]: Seen in packages [list] — suggests systemic issue -- [Pattern 2]: ... - -### Wave 0 Hotspots (packages with most findings) -1. [package] — [N findings, N critical] -2. [package] — [N findings, N high] -3. ... -``` - -**Step 3: Identify cross-package patterns** - -Look across all per-package results for recurring themes (e.g., "5 packages have inconsistent error handling" or "3 packages missing test files"). These cross-package patterns are especially valuable for Wave 1+2 agents. 
- -**Step 4: Save Wave 0 findings to disk** - -Write the complete Wave 0 summary to disk so it survives context compaction: +**Step 3: Save Wave 0 findings to disk** — do NOT skip this step ```bash -# Write Wave 0 findings — this is the source of truth for Wave 1+2 cat > "$REVIEW_DIR/wave0-findings.md" << 'WAVE0_EOF' [Paste the complete Wave 0 Per-Package Findings Summary here] WAVE0_EOF echo "Saved Wave 0 findings: $(wc -l < "$REVIEW_DIR/wave0-findings.md") lines" ``` -**CRITICAL**: Do NOT skip this step. If compaction fires before Wave 1 dispatch, Wave 0 findings are gone forever without this file. - **Gate**: Wave 0 summary built, saved to `$REVIEW_DIR/wave0-findings.md`. Proceed to Phase 1.5. --- -### Phase 1.5: LIBRARY CONTRACT VERIFICATION (Go repos only) +## Phase 1.5: LIBRARY CONTRACT VERIFICATION (Go repos only) -**Goal**: Before Wave 1 dispatches, verify that code assumptions about imported library behavior are actually true. This catches the systemic LLM blind spot where agents reason from protocol knowledge instead of reading library source. +**Skip for non-Go repos** (no `go.mod`). -**When to run**: Only for Go repos (check for `go.mod`). Skip for Python/TypeScript repos. +**Goal**: Before Wave 1, verify that code assumptions about imported library behavior are actually true. Catches the LLM blind spot where agents reason from protocol knowledge instead of reading library source. 
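One possible way to surface candidate library assumptions for this phase is a keyword prefilter over the added lines of the diff. The sketch below is an assumption-laden illustration, not part of the skill: `scan_library_assumptions` is a hypothetical helper, the keyword list is invented for the example, and it only catches `//` comments on added lines. The verification agent still has to read the library source.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read a unified diff on stdin and print added lines
# whose Go comments mention library lifecycle behavior.
# Assumption: the keyword list below is illustrative only.
scan_library_assumptions() {
  grep -E '^\+' \
    | grep -Ev '^\+\+\+ ' \
    | grep -Ei '//.*(retr(y|ies)|rebalanc|reconnect|redeliver)' \
    || true   # no matches is a valid result, not an error
}

# Usage (the base branch name "main" is an assumption):
# git diff main...HEAD -- '*.go' | scan_library_assumptions
```

A prefilter like this only narrows where to look; a comment that matches still needs to be checked against the library's actual behavior.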
-**Step 1: Scan changed code for library assumptions** - -In the diff (from Phase 1), identify: -- Comments claiming library behavior ("X will retry", "Y returns error on Z", "will rebalance") +**Step 1: Scan changed code for library assumptions** — look for: +- Comments claiming library behavior ("X will retry", "Y returns error on Z") - Error handling that assumes specific error types from imported libraries -- Control flow that depends on library lifecycle (reconnect, rebalance, retry, redeliver) -- Log messages describing expected library behavior - -**Step 2: Dispatch a verification agent** +- Control flow that depends on library lifecycle (reconnect, rebalance, retry) -Dispatch a single `golang-general-engineer-compact` agent (model: sonnet) with this prompt structure: +**Step 2: Dispatch a single `golang-general-engineer-compact` agent** (model: sonnet): ``` You are a library contract verifier. Your job is to read library source code @@ -526,328 +247,99 @@ Output a Library Contract Report as a markdown table: | Assumption | Location | Verified? | Evidence | ``` -**Step 3: Save report and pass to Wave 1** +**Step 3**: Save to `$REVIEW_DIR/library-contracts.md`. Include in every Wave 1 agent dispatch prompt alongside Wave 0 findings. -Save the Library Contract Report to `$REVIEW_DIR/library-contracts.md`. Include it in every Wave 1 agent dispatch prompt alongside Wave 0 findings and static analysis findings. - -**Gate**: Library contract verification complete (or skipped for non-Go repos). Proceed to Phase 2a. +**Gate**: Library contract verification complete (or skipped). Proceed to Phase 2a. --- -### Phase 2a: WAVE 1 DISPATCH - -**Goal**: Launch all foundation review agents in a SINGLE message for true parallel execution, with Wave 0 per-package context. This dispatches 12 agents. +## Phase 2a: WAVE 1 DISPATCH -**CRITICAL**: ALL Wave 1 agent dispatches MUST be in ONE message. Sequential dispatch defeats parallelism. 
+**ALL 12 Wave 1 agent dispatches MUST be in ONE message for true parallelism.** -**Step 0: Load Wave 0 findings from disk** +Read `${CLAUDE_SKILL_DIR}/references/wave-1-foundation.md` for: +- Complete agent roster (12 agents with focus areas) +- Architecture reviewer selection by language +- Standard agent prompt template (with Wave 0 context injection) +- Agent-specific prompt additions (mandatory caller tracing, assertion depth checks, etc.) -Before constructing agent prompts, read Wave 0 findings from disk (in case context compaction has occurred): +**Step 0: Load Wave 0 findings from disk** (guards against context compaction): ```bash WAVE0_CONTEXT=$(cat "$REVIEW_DIR/wave0-findings.md" 2>/dev/null || echo "Wave 0 skipped — no per-package context available") ``` -Use `$WAVE0_CONTEXT` in each agent prompt below. - -**Model**: Use `model: sonnet` for all Wave 1 agents. The orchestrator (this skill) runs on Opus; dispatched review agents run on Sonnet. +Inject `$WAVE0_CONTEXT` into each agent prompt. Also inject static analysis findings and library contract report (if Go repo). -Each agent prompt should include: - -``` -REVIEW SCOPE: -- Files to review: [list of changed files] -- Change context: [what was changed and why, if known] -- Repository: [current directory] - -WAVE 0 PER-PACKAGE CONTEXT (deep per-package review results): -[Insert $WAVE0_CONTEXT — loaded from $REVIEW_DIR/wave0-findings.md] - -MCP TOOL DISCOVERY (do this FIRST, before any file reads): -- Use ToolSearch to check for available MCP tools that can enhance your analysis: - a. Run ToolSearch("gopls") — if Go files are in scope, this loads type-aware - analysis tools (go_file_context, go_diagnostics, go_symbol_references, etc.) - b. 
Run ToolSearch("context7") — loads library documentation lookup tools for - verifying dependency usage and API correctness -- If gopls tools are available AND this is a Go repository: - * Use go_file_context after reading any .go file to understand intra-package dependencies - * Use go_symbol_references before flagging unused or misused symbols - * Use go_diagnostics on files you flag to confirm real vs false-positive issues -- If Context7 tools are available: - * Use resolve-library-id + query-docs to verify library API usage in flagged code - -INSTRUCTIONS: -1. Read the CLAUDE.md file(s) in this repository first -2. Run MCP TOOL DISCOVERY steps above -3. Review the Wave 0 per-package context to understand package-level findings -4. Review the specified files for issues in your domain -5. Use Wave 0 findings to AVOID duplicating per-package issues already found -6. Focus on CROSS-CUTTING concerns that span multiple packages -7. Use MCP tools (gopls, Context7) during analysis where they add precision -8. Use structured output format with severity classification -9. Include file:line references for every finding -10. For each finding, provide a concrete fix recommendation - -OUTPUT FORMAT: -Return findings as: -### [CRITICAL|HIGH|MEDIUM|LOW]: [One-line summary] -**File**: `path/to/file:LINE` -**Issue**: [Description] -**Impact**: [Why this matters] -**Fix**: [Concrete code fix] -**Wave 0 Cross-Ref**: [Which Wave 0 package finding this relates to, if any] ---- -``` +Include in every Wave 1 and Wave 2 agent dispatch prompt: "The only valid review dispositions are FIX NOW, FIX IN FOLLOW-UP (with mandatory tracking artifact), or NOT AN ISSUE (with evidence). 'Acceptable', 'valid but deferred', and 'conservative' are NOT valid dispositions." -**Agent-specific prompt additions:** - -| Agent | Extra Instructions | -|-------|-------------------| -| `reviewer-security` | Focus on OWASP Top 10, auth, input validation, secrets. 
**MCP**: For Go, use gopls `go_symbol_references` to trace tainted input flows. **CALLER TRACING (mandatory)**: When the diff modifies functions with security-sensitive parameters (auth tokens, filter flags, sentinel values like `"*"`), grep for ALL callers across the repo and verify each validates the parameter. Do NOT trust PR descriptions — verify independently. | -| `reviewer-business-logic` | Focus on requirements coverage, edge cases, state transitions. **CALLER TRACING (mandatory)**: When the diff changes interface semantics or introduces sentinel values, grep for ALL callers (`.MethodName(`) across the repo and verify each honors the contract. Do NOT claim "no caller passes X" without searching. | -| Architecture reviewer | Focus on patterns, naming, structure, maintainability. **MCP**: For Go, use gopls `go_file_context` to understand cross-file dependencies | -| `reviewer-silent-failures` | Focus on catch blocks, error swallowing, fallback behavior. **MCP**: For Go, use gopls `go_diagnostics` to verify error handling correctness | -| `reviewer-test-analyzer` | Focus on coverage gaps, missing edge case tests, test quality. **ASSERTION DEPTH CHECK (mandatory)**: For security-sensitive code, flag presence-only assertions (NotEmpty, NotNil, hasKey). Tests MUST verify actual values, not just existence. | -| `reviewer-type-design` | Focus on invariants, encapsulation, type safety. **MCP**: For Go, use gopls `go_package_api` to understand type surface area | -| `reviewer-code-quality` | Focus on CLAUDE.md compliance, conventions, style | -| `reviewer-comment-analyzer` | Focus on comment accuracy, rot, misleading docs | -| `reviewer-language-specialist` | Detect language from files, check modern stdlib, idioms, concurrency, LLM tells. **MCP**: For Go files, use gopls `go_file_context` and `go_diagnostics` to detect non-idiomatic patterns with type awareness. If org conventions detected, append org-specific flags to prompt. 
| -| `reviewer-docs-validator` | Check README.md, CLAUDE.md, deps, CI config, build system, LICENSE. Review the project, not the code. **MCP**: Use Context7 to verify documented library versions/APIs match actual usage | -| `reviewer-adr-compliance` | Auto-discover ADRs from `adr/` and `.adr-session.json`. Check every decision point has implementation, no contradictions, no scope creep. Output ADR COMPLIANT or NOT ADR COMPLIANT. | -| `reviewer-newcomer` | Review from a newcomer/fresh-eyes perspective. Focus on: documentation gaps that would confuse a new developer, implicit assumptions not explained in code or comments, confusing variable/function names, unclear control flow, missing "why" explanations. Flag anything where a developer unfamiliar with this codebase would be lost. | - -**Gate**: All Wave 1 agents dispatched in a single message (12 agents). Wait for all to complete. Proceed to Phase 2b. +**Gate**: All 12 Wave 1 agents dispatched in a single message. Wait for all to complete. Proceed to Phase 2b. --- -### Phase 2b: WAVE 0+1 AGGREGATE - -**Goal**: Collect Wave 0 and Wave 1 findings into a structured summary that becomes Wave 2's input context. - -**Step 1: Collect all Wave 1 findings** - -Read each Wave 1 agent's output. Extract findings with severity, file, description, and fix. +## Phase 2b: WAVE 0+1 AGGREGATE -**Step 2: Build Wave 0+1 Summary** +Read `${CLAUDE_SKILL_DIR}/references/wave-1-foundation.md` for the Wave 0+1 combined summary format. -Create a condensed summary combining Wave 0 per-package findings and Wave 1 cross-cutting findings. This combined summary becomes the context injected into every Wave 2 agent. +**Step 1**: Collect all Wave 1 agent outputs. Extract findings with severity, file, description, fix. -```markdown -## Wave 0+1 Findings Summary (for Wave 2 context) +**Step 2**: Build combined Wave 0+1 summary (Wave 0 per-package findings + Wave 1 cross-cutting findings). 
Identify overlapping findings between waves — duplicates validate both agents' analysis. -### Wave 0 Per-Package Summary: [N packages reviewed] -- Packages with issues: [list with health status] -- Cross-package patterns: [list] -- Hotspot packages: [top 3 by finding count] -- Key per-package findings: - - [package]: [SEVERITY] [summary] - - ... - -### Security (Agent 1): [N findings] -- CRITICAL: [list if any] -- HIGH: [list] -- Files with security issues: [list] - -### Business Logic (Agent 2): [N findings] -- State transitions identified: [list] -- Edge cases flagged: [list] -- Files with domain issues: [list] - -### Architecture (Agent 3): [N findings] -- Architectural patterns noted: [list] -- Hot paths identified: [list] -- Files with structural issues: [list] - -### Silent Failures (Agent 4): [N findings] -- Swallowed errors at: [file:line list] -- Error paths without handling: [list] -- Files with error handling gaps: [list] - -### Test Coverage (Agent 5): [N findings] -- Coverage gaps: [list] -- Untested paths: [list] - -### Type Design (Agent 6): [N findings] -- Weak types identified: [list] -- Type safety issues at: [file:line list] - -### Code Quality (Agent 7): [N findings] -- Convention baseline established: [patterns] -- Convention violations: [list] -- CLAUDE.md compliance issues: [list] - -### Comments (Agent 8): [N findings] -- Comment rot at: [file:line list] -- Stale TODOs: [list] - -### Language Specialist (Agent 9): [N findings] -- Language-specific issues: [list] -- Modern stdlib opportunities: [list] -- LLM code tells: [list] - -### Docs & Config (Agent 10): [N findings] -- Documentation gaps: [list] -- Dependency issues: [list] -- CI/build issues: [list] - -### ADR Compliance (Agent 11): [N findings] -- ADR decisions not implemented: [list] -- ADR contradictions: [list] -- Scope creep: [list] - -### Newcomer Perspective (Agent 12): [N findings] -- Documentation gaps: [list] -- Confusing code: [list] -- Implicit assumptions: [list] -- 
Onboarding friction: [list] - -``` - -**Step 3: Quick-deduplicate Wave 0+1** - -Identify overlapping findings between Wave 0 per-package agents and Wave 1 cross-cutting agents. Note duplicates for final aggregation but keep all findings in the context — Wave 2 agents benefit from seeing the raw data. Wave 0 findings that were also caught by Wave 1 validate both agents' analysis. - -**Step 4: Save Wave 1 findings and combined summary to disk** - -Write both raw Wave 1 findings AND the combined Wave 0+1 summary to disk: +**Step 3: Save to disk** — do NOT skip: ```bash -# Save raw Wave 1 findings (individual agent outputs) cat > "$REVIEW_DIR/wave1-findings.md" << 'WAVE1_EOF' -[Paste ALL Wave 1 agent outputs — the raw findings from each of the 12 agents] +[Paste ALL Wave 1 agent outputs] WAVE1_EOF -# Save the combined Wave 0+1 summary (the structured context for Wave 2) cat > "$REVIEW_DIR/wave01-summary.md" << 'WAVE01_EOF' -[Paste the Wave 0+1 Findings Summary built in Step 2 above] +[Paste the Wave 0+1 Findings Summary] WAVE01_EOF echo "Saved Wave 1 findings: $(wc -l < "$REVIEW_DIR/wave1-findings.md") lines" echo "Saved Wave 0+1 summary: $(wc -l < "$REVIEW_DIR/wave01-summary.md") lines" ``` -**CRITICAL**: Do NOT skip this step. Wave 2 agents need the combined summary, and context compaction WILL fire between Wave 1 aggregate and Wave 2 dispatch on large reviews. - **Gate**: Wave 0+1 summary built, saved to `$REVIEW_DIR/wave01-summary.md`. Proceed to Phase 3a. --- -### Phase 3a: WAVE 2 DISPATCH +## Phase 3a: WAVE 2 DISPATCH -**Goal**: Launch all 10 deep-dive agents in a SINGLE message, each receiving Wave 0+1 findings summary as context. +**ALL 10 Wave 2 agent dispatches MUST be in ONE message.** -**CRITICAL**: ALL 10 Wave 2 agent dispatches MUST be in ONE message. 
+Read `${CLAUDE_SKILL_DIR}/references/wave-2-deep-dive.md` for: +- Complete agent roster (10 agents with focus areas and Wave 1 context used) +- Standard agent prompt template (with Wave 0+1 context injection) +- Agent-specific context instructions -**Step 0: Load Wave 0+1 findings from disk** - -Before constructing agent prompts, reload the combined summary from disk (in case context compaction has occurred since Phase 2b): +**Step 0: Load Wave 0+1 findings from disk**: ```bash -WAVE01_SUMMARY=$(cat "$REVIEW_DIR/wave01-summary.md" 2>/dev/null || echo "ERROR: Wave 0+1 summary not found at $REVIEW_DIR/wave01-summary.md — cannot proceed with Wave 2") -``` - -If the file is missing, something went wrong in Phase 2b. Re-read `$REVIEW_DIR/wave0-findings.md` and `$REVIEW_DIR/wave1-findings.md` and rebuild the summary before proceeding. - -**Model**: Use `model: sonnet` for all Wave 2 agents. The orchestrator (this skill) runs on Opus; dispatched review agents run on Sonnet. - -Each Wave 2 agent prompt should include the standard review scope PLUS the Wave 0+1 context: - -``` -REVIEW SCOPE: -- Files to review: [list of changed files] -- Change context: [what was changed and why, if known] -- Repository: [current directory] - -WAVE 0+1 CONTEXT (use this to focus your analysis): -[Insert $WAVE01_SUMMARY — loaded from $REVIEW_DIR/wave01-summary.md] - -MCP TOOL DISCOVERY (do this FIRST, before any file reads): -- Use ToolSearch to check for available MCP tools that can enhance your analysis: - a. Run ToolSearch("gopls") — if Go files are in scope, this loads type-aware - analysis tools (go_file_context, go_diagnostics, go_symbol_references, etc.) - b. 
Run ToolSearch("context7") — loads library documentation lookup tools for - verifying dependency usage and API correctness -- If gopls tools are available AND this is a Go repository: - * Use go_file_context after reading any .go file to understand intra-package dependencies - * Use go_symbol_references before flagging unused or misused symbols - * Use go_diagnostics on files you flag to confirm real vs false-positive issues -- If Context7 tools are available: - * Use resolve-library-id + query-docs to verify library API usage in flagged code - -INSTRUCTIONS: -1. Read the CLAUDE.md file(s) in this repository first -2. Run MCP TOOL DISCOVERY steps above -3. Review the Wave 0 per-package context for package-level issues already found -4. Review the Wave 1 cross-cutting context for foundation issues already found -5. Use Wave 0+1 findings to FOCUS your deep-dive analysis: - - Prioritize packages flagged as NEEDS_ATTENTION or CRITICAL by Wave 0 - - Prioritize files and paths flagged by Wave 1 - - Look for issues in YOUR domain that neither Wave 0 nor Wave 1 would catch - - Cross-reference your findings with both waves to add depth -6. Do NOT simply repeat Wave 0 or Wave 1 findings — add NEW insights -7. Use MCP tools (gopls, Context7) during analysis where they add precision -8. Use structured output format with severity classification -9. 
Include file:line references for every finding - -OUTPUT FORMAT: -Return findings as: -### [CRITICAL|HIGH|MEDIUM|LOW]: [One-line summary] -**File**: `path/to/file:LINE` -**Issue**: [Description] -**Impact**: [Why this matters] -**Fix**: [Concrete code fix] -**Wave 0+1 Cross-Ref**: [Which earlier finding this relates to, if any] ---- +WAVE01_SUMMARY=$(cat "$REVIEW_DIR/wave01-summary.md" 2>/dev/null || echo "ERROR: Wave 0+1 summary not found — cannot proceed with Wave 2") ``` -**Wave 2 agent-specific prompt additions:** - -| Agent | Extra Context Instructions | -|-------|--------------------------| -| `reviewer-performance` | Use Wave 0 per-package findings to identify packages with complexity issues. Use Wave 1 architecture findings to identify hot paths. Focus on algorithmic complexity, N+1 queries, allocation waste. **MCP**: For Go, use gopls `go_symbol_references` to trace hot path call chains | -| `reviewer-concurrency` | Use Wave 0 per-package findings for concurrent patterns within packages. Use Wave 1 silent-failure + architecture findings for cross-package concurrent paths. Focus on races, goroutine leaks, deadlocks. **MCP**: For Go, use gopls `go_diagnostics` to detect race condition warnings | -| `reviewer-api-contract` | Use Wave 0 per-package findings to understand internal API surfaces. Use Wave 1 business-logic + type-design findings for contract-sensitive endpoints. Focus on breaking changes, status codes. **MCP**: Use Context7 to verify API contract claims against library docs | -| `reviewer-dependency-audit` | Use Wave 1 docs-validator findings to cross-reference documented vs actual dependencies. Run govulncheck/npm audit/pip-audit. Focus on CVEs, licenses, deprecated packages. **MCP**: Use Context7 `resolve-library-id` + `query-docs` to verify dependency API usage. For Go, use gopls `go_vulncheck` for vulnerability scanning | -| `reviewer-error-messages` | Use Wave 0 per-package error handling patterns. 
Use Wave 1 silent-failure + code-quality findings. Focus on error message quality, actionability, consistency. | -| `reviewer-dead-code` | Use Wave 0 per-package findings to identify unused internal APIs between files. Use Wave 1 code-quality + docs-validator findings. Focus on unreachable code, unused exports. | -| `reviewer-naming-consistency` | Use Wave 0 per-package naming patterns to detect intra-package drift. Use Wave 1 code-quality + language-specialist findings. Focus on cross-package naming consistency. | -| `reviewer-observability` | Use Wave 0 per-package findings for packages missing instrumentation. Use Wave 1 silent-failure findings for error paths missing observability. Focus on RED metrics gaps. | -| `reviewer-config-safety` | Use Wave 0 per-package findings for hardcoded values within packages. Use Wave 1 security + docs-validator findings. Focus on secrets, missing env var validation. | -| `reviewer-migration-safety` | Use Wave 1 api-contract + business-logic findings to identify migration-sensitive changes. Focus on reversible migrations, deprecation paths, rollback safety. | +If the file is missing, re-read `$REVIEW_DIR/wave0-findings.md` and `$REVIEW_DIR/wave1-findings.md` and rebuild the summary before proceeding. **Gate**: All 10 Wave 2 agents dispatched in a single message. Wait for all to complete. Proceed to Phase 3b. --- -### Phase 3b: WAVE 0+1+2 AGGREGATE +## Phase 3b: WAVE 0+1+2 AGGREGATE -**Goal**: Merge Wave 0+1+2 findings into a structured summary for Wave 3 adversarial review. This is NOT the final aggregate — Wave 3 will challenge these findings. 
- -**Step 0: Load all prior wave findings from disk** - -Reload all wave findings from disk before aggregating (context compaction may have fired): +**Step 0: Reload all prior wave findings from disk**: ```bash -# Reload all wave findings from disk WAVE0=$(cat "$REVIEW_DIR/wave0-findings.md" 2>/dev/null || echo "") WAVE1=$(cat "$REVIEW_DIR/wave1-findings.md" 2>/dev/null || echo "") echo "Loaded Wave 0: $(echo "$WAVE0" | wc -l) lines, Wave 1: $(echo "$WAVE1" | wc -l) lines" ``` -**Step 1: Collect all findings** - -Combine Wave 0 per-package (from `$REVIEW_DIR/wave0-findings.md`), Wave 1 cross-cutting (from `$REVIEW_DIR/wave1-findings.md`), and Wave 2 deep-dive findings (just returned from agents) into a single list. +**Step 1**: Combine Wave 0 per-package, Wave 1 cross-cutting, and Wave 2 deep-dive findings. -**Step 2: Preliminary deduplication** +**Step 2: Deduplicate** — if two or more agents flagged the same file:line, keep the highest severity, merge fix recommendations, note which agents found it (reinforces importance). Prefer Wave 2 fixes when they add Wave 0+1 context. 
-If two or more agents flagged the same file:line: -- Keep the highest severity classification -- Merge fix recommendations (later waves may have more targeted fixes) -- Note which agents found it (reinforces importance) -- Prefer Wave 2 fixes when they add Wave 0+1 context (deepest understanding) -- Wave 0 per-package findings confirmed by Wave 1+2 are high-confidence - -**Step 3: Classify by severity** +**Step 3: Classify by severity**: | Severity | Meaning | Action | |----------|---------|--------| @@ -856,332 +348,122 @@ If two or more agents flagged the same file:line: | MEDIUM | Quality issue, missing test, comment rot, naming drift | Fix (auto) | | LOW | Style preference, minor simplification, documentation | Fix (auto) | -**Step 4: Build preliminary summary matrix** - -``` -| Agent | Wave | CRITICAL | HIGH | MEDIUM | LOW | -|--------------------------|------|----------|------|--------|-----| -| Per-Package: [pkg1] | 0 | N | N | N | N | -| Per-Package: [pkg2] | 0 | N | N | N | N | -| Per-Package: [...] 
| 0 | N | N | N | N | -| **Wave 0 Subtotal** | **0**| **N** | **N**| **N** | **N**| -| Security | 1 | N | N | N | N | -| Business Logic | 1 | N | N | N | N | -| Architecture | 1 | N | N | N | N | -| Silent Failures | 1 | N | N | N | N | -| Test Coverage | 1 | N | N | N | N | -| Type Design | 1 | N | N | N | N | -| Code Quality | 1 | N | N | N | N | -| Comment Analyzer | 1 | N | N | N | N | -| Language Specialist | 1 | N | N | N | N | -| Docs & Config | 1 | N | N | N | N | -| ADR Compliance | 1 | N | N | N | N | -| Newcomer | 1 | N | N | N | N | -| **Wave 1 Subtotal** | **1**| **N** | **N**| **N** | **N**| -| Performance | 2 | N | N | N | N | -| Concurrency | 2 | N | N | N | N | -| API Contract | 2 | N | N | N | N | -| Dependency Audit | 2 | N | N | N | N | -| Error Messages | 2 | N | N | N | N | -| Dead Code | 2 | N | N | N | N | -| Naming Consistency | 2 | N | N | N | N | -| Observability | 2 | N | N | N | N | -| Config Safety | 2 | N | N | N | N | -| Migration Safety | 2 | N | N | N | N | -| **Wave 2 Subtotal** | **2**| **N** | **N**| **N** | **N**| -| **Wave 0+1+2 TOTAL** | | **N** | **N**| **N** | **N**| -``` - -**Step 5: Save Wave 2 findings and combined summary to disk** +**Step 4**: Build preliminary summary matrix. Read `${CLAUDE_SKILL_DIR}/references/output-templates.md` for the full matrix format. 
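The per-severity counts that feed the Step 4 matrix can be derived mechanically from a saved findings file. The sketch below is a minimal example under one assumption: every finding starts with a heading such as `### HIGH: summary`, matching the agent output format used in this skill. `count_by_severity` is a hypothetical helper, not an existing script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count finding headings per severity in one file.
# Assumption: each finding begins with a line like "### HIGH: summary".
count_by_severity() {
  local file="$1" sev
  [ -f "$file" ] || { echo "ERROR: no such file: $file" >&2; return 1; }
  for sev in CRITICAL HIGH MEDIUM LOW; do
    # grep -c prints 0 but exits non-zero when nothing matches; keep the 0.
    printf '%s\t%s\n' "$sev" "$(grep -c "^### ${sev}:" "$file" || true)"
  done
}

# Usage (REVIEW_DIR is set in Phase 1, Step 3):
# count_by_severity "$REVIEW_DIR/wave2-findings.md"
```

Headings such as `### PACKAGE: ...` in Wave 0 output do not match any severity pattern, so they are not counted as findings.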
-Write Wave 2 raw findings AND the combined Wave 0+1+2 summary to disk for Wave 3 context: +**Step 5: Save to disk** — do NOT skip: ```bash -# Save Wave 2 raw findings cat > "$REVIEW_DIR/wave2-findings.md" << 'WAVE2_EOF' [Paste ALL Wave 2 agent outputs] WAVE2_EOF -# Save the combined Wave 0+1+2 summary (the structured context for Wave 3) cat > "$REVIEW_DIR/wave012-summary.md" << 'WAVE012_EOF' -[Paste the preliminary summary matrix + all classified findings from Steps 2-4] +[Paste the preliminary summary matrix + all classified findings] WAVE012_EOF echo "Saved Wave 2 findings: $(wc -l < "$REVIEW_DIR/wave2-findings.md") lines" echo "Saved Wave 0+1+2 summary: $(wc -l < "$REVIEW_DIR/wave012-summary.md") lines" ``` -**CRITICAL**: Do NOT skip this step. Wave 3 agents need the combined summary, and context compaction WILL fire between Wave 2 aggregate and Wave 3 dispatch on large reviews. - **Gate**: Wave 0+1+2 summary built, saved to `$REVIEW_DIR/wave012-summary.md`. Proceed to Phase 3c. --- -### Phase 3c: WAVE 3 DISPATCH — Adversarial Perspectives - -**Goal**: Launch 4-5 adversarial agents in a SINGLE message. These agents CHALLENGE the Wave 1+2 consensus — they are not looking for new issues, they are questioning whether existing findings are real, correctly prioritized, and worth fixing. +## Phase 3c: WAVE 3 DISPATCH — Adversarial Perspectives -**CRITICAL**: ALL Wave 3 agent dispatches MUST be in ONE message. 
+**ALL Wave 3 agent dispatches MUST be in ONE message.** -**Step 0: Load Wave 0+1+2 findings from disk** +Read `${CLAUDE_SKILL_DIR}/references/wave-3-adversarial.md` for: +- Complete agent roster (4-5 agents with adversarial roles) +- SAPCC conditional detection script +- Standard adversarial prompt template (with Wave 0+1+2 context injection) +- Agent-specific challenge focus instructions +- Wave agreement label definitions (UNANIMOUS / MAJORITY / CONTESTED) -Before constructing agent prompts, reload the combined summary from disk (in case context compaction has occurred since Phase 3b): +**Step 0: Load Wave 0+1+2 findings from disk**: ```bash -WAVE012_SUMMARY=$(cat "$REVIEW_DIR/wave012-summary.md" 2>/dev/null || echo "ERROR: Wave 0+1+2 summary not found at $REVIEW_DIR/wave012-summary.md — cannot proceed with Wave 3") +WAVE012_SUMMARY=$(cat "$REVIEW_DIR/wave012-summary.md" 2>/dev/null || echo "ERROR: Wave 0+1+2 summary not found — cannot proceed with Wave 3") ``` -If the file is missing, re-read `$REVIEW_DIR/wave0-findings.md`, `$REVIEW_DIR/wave1-findings.md`, and `$REVIEW_DIR/wave2-findings.md` and rebuild the summary before proceeding. - -**Step 1: Check SAPCC conditional** - -```bash -# Check for SAP Commerce Cloud indicators -SAPCC_DETECTED=false -if [ -d "hybris" ] || [ -d "core-customize" ] || [ -f "config/localextensions.xml" ]; then - SAPCC_DETECTED=true -fi -if [ -f "manifest.json" ] && grep -q '"commerceSuiteVersion"' manifest.json 2>/dev/null; then - SAPCC_DETECTED=true -fi -echo "SAPCC detected: $SAPCC_DETECTED" -``` - -If `SAPCC_DETECTED=true`, include `reviewer-sapcc-structural` as a 5th Wave 3 agent. Otherwise dispatch only the 4 core adversarial agents. - -**Model**: Use `model: sonnet` for all Wave 3 agents. The orchestrator (this skill) runs on Opus; dispatched review agents run on Sonnet. 
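As written, the Step 0 load stores the `ERROR: ...` text in `WAVE012_SUMMARY` when the file is missing, so the failure can slip into agent prompts unnoticed. A minimal sketch of a fail-fast variant follows. The function name `load_wave_summary` and its single directory argument are hypothetical, and the fallback only gathers the raw wave files as material for rebuilding the summary; it does not rebuild the matrix itself.

```bash
#!/usr/bin/env bash
# Hypothetical helper: illustrative only, not part of the skill.
# Prints the combined summary if it exists and is non-empty; otherwise prints
# whichever raw Wave 0/1/2 findings files exist. Returns non-zero when
# nothing could be loaded, so the caller can stop before dispatching Wave 3.
load_wave_summary() {
  local dir="$1"
  local summary="$dir/wave012-summary.md"

  if [ -s "$summary" ]; then
    cat "$summary"
    return 0
  fi

  # Fallback: collect raw findings so the summary can be rebuilt from them.
  local status=1 f
  for f in "$dir/wave0-findings.md" "$dir/wave1-findings.md" "$dir/wave2-findings.md"; do
    if [ -s "$f" ]; then
      cat "$f"
      status=0
    fi
  done
  return "$status"
}
```

A caller could then write `WAVE012_SUMMARY=$(load_wave_summary "$REVIEW_DIR") || exit 1`, which keeps the error out of the variable and halts before any prompt is constructed.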
- -**Step 2: Dispatch Wave 3 agents** - -Each Wave 3 agent prompt should include the standard review scope PLUS the full Wave 0+1+2 context, with explicit instructions to CHALLENGE rather than reinforce: - -``` -ADVERSARIAL REVIEW — Wave 3 - -REVIEW SCOPE: -- Files to review: [list of changed files] -- Change context: [what was changed and why, if known] -- Repository: [current directory] - -WAVE 0+1+2 FINDINGS (the consensus you are challenging): -[Insert $WAVE012_SUMMARY — loaded from $REVIEW_DIR/wave012-summary.md] - -YOUR ROLE: You are an ADVERSARIAL reviewer. Your job is NOT to find new issues. -Your job is to CHALLENGE the findings above. Push back. Question severity. -Identify false positives. Flag overreactions. Surface tradeoffs that earlier -waves accepted without scrutiny. - -INSTRUCTIONS: -1. Read the CLAUDE.md file(s) in this repository first -2. Read the code being reviewed -3. Read the Wave 0+1+2 findings carefully -4. For each finding from earlier waves, determine: - a. AGREE — the finding is real, correctly classified, and worth fixing - b. CHALLENGE — the finding is questionable (explain why) - c. DOWNGRADE — the finding is real but over-classified (suggest correct severity) - d. DISMISS — the finding is a false positive or not worth fixing (provide evidence) -5. Surface any tradeoffs or second-order effects the earlier waves missed -6. 
Be specific — vague disagreement is not useful - -OUTPUT FORMAT: -### CHALLENGE: [One-line summary of what you're challenging] -**Original finding**: [Wave N, Agent, Severity: summary] -**Your verdict**: [AGREE | CHALLENGE | DOWNGRADE | DISMISS] -**Reasoning**: [Why you disagree or agree] -**Evidence**: [Code reference, real-world precedent, or logical argument] -**Suggested action**: [Keep as-is | Reduce to MEDIUM | Drop | Needs human judgment] ---- -``` - -**Wave 3 agent-specific prompt additions:** - -| Agent | Extra Instructions | -|-------|-------------------| -| `reviewer-contrarian` | Challenge every HIGH and CRITICAL finding. Are they actually important? Which are false positives? Which are over-classified? Look for findings where Wave 1+2 agents reinforced each other's bias rather than independently verifying. Question whether suggested fixes introduce new problems. | -| `reviewer-skeptical-senior` | Apply 10+ years of engineering experience. Which findings are theoretical risks that never manifest in practice? Which are textbook answers that don't apply to this codebase's scale/context? Flag "resume-driven" suggestions (over-engineering, premature optimization). Identify findings where the cure is worse than the disease. | -| `reviewer-user-advocate` | Focus exclusively on user impact. Does this change break existing users? Are migration paths safe? Are UX tradeoffs justified? Challenge findings that improve code quality at the expense of user experience. Flag findings that ignore backward compatibility. Question whether "fixing" something makes it harder for users. | -| `reviewer-meta-process` | Step back from individual findings. Is the overall approach correct? Should this PR be split into smaller PRs? Are the right problems being solved? Is the review itself focused on the right things? Flag cases where the review is bikeshedding on style while missing structural issues. 
Question whether the fix phase will create more churn than the findings are worth. | -| `reviewer-sapcc-structural` | **(SAPCC repos only)** Challenge findings through SAP Commerce Cloud structural lens. Do findings account for hybris extension lifecycle? Are suggested fixes compatible with the SAP build system? Do architecture recommendations respect CCv2 manifest constraints? Flag findings that would break extension wiring or data model migrations. | +If the file is missing, rebuild from `$REVIEW_DIR/wave0-findings.md`, `$REVIEW_DIR/wave1-findings.md`, and `$REVIEW_DIR/wave2-findings.md` before proceeding. **Gate**: All Wave 3 agents dispatched in a single message (4-5 agents). Wait for all to complete. Proceed to Phase 3d. --- -### Phase 3d: FULL AGGREGATE — Wave Agreement Synthesis - -**Goal**: Merge ALL agents' findings (Wave 0+1+2+3) into a final, severity-classified, deduplicated report WITH wave-agreement labels. +## Phase 3d: FULL AGGREGATE — Wave Agreement Synthesis -**Step 0: Load all prior wave findings from disk** - -Reload all wave findings from disk before final aggregation (context compaction may have fired): +**Step 0: Reload all prior wave findings from disk**: ```bash -# Reload all wave findings from disk WAVE0=$(cat "$REVIEW_DIR/wave0-findings.md" 2>/dev/null || echo "") WAVE1=$(cat "$REVIEW_DIR/wave1-findings.md" 2>/dev/null || echo "") WAVE2=$(cat "$REVIEW_DIR/wave2-findings.md" 2>/dev/null || echo "") WAVE012=$(cat "$REVIEW_DIR/wave012-summary.md" 2>/dev/null || echo "") -echo "Loaded Wave 0: $(echo "$WAVE0" | wc -l) lines" -echo "Loaded Wave 1: $(echo "$WAVE1" | wc -l) lines" -echo "Loaded Wave 2: $(echo "$WAVE2" | wc -l) lines" -echo "Loaded Wave 0+1+2 summary: $(echo "$WAVE012" | wc -l) lines" ``` -**Step 1: Process Wave 3 challenges** - -For each Wave 3 agent's output, categorize their verdicts: -- **AGREE** verdicts: Reinforces the original finding (increases confidence) -- **CHALLENGE** verdicts: Flags the finding for human review -- 
**DOWNGRADE** verdicts: Suggests lower severity (adjust if multiple Wave 3 agents agree) -- **DISMISS** verdicts: Suggests dropping the finding (only drop if 2+ Wave 3 agents dismiss AND no Wave 1+2 agent rated it CRITICAL) - -**Step 2: Label every finding with wave agreement level** - -Every finding from the final report MUST carry one of these labels: - -| Label | Meaning | Criteria | Action | -|-------|---------|----------|--------| -| **UNANIMOUS** | All waves agree | Wave 1+2 found it AND Wave 3 agrees (or does not challenge) | HIGH confidence — fix without hesitation | -| **MAJORITY** | Most waves agree | Wave 1+2 found it AND 1-2 Wave 3 agents challenge but others agree | Fix, but note the challenge in the report | -| **CONTESTED** | Wave 3 contradicts Wave 1+2 | Wave 1+2 found it BUT 3+ Wave 3 agents challenge or dismiss | Needs human judgment — present both arguments | +**Step 1: Process Wave 3 challenges** (from `${CLAUDE_SKILL_DIR}/references/wave-3-adversarial.md`): +- AGREE → reinforces finding +- CHALLENGE → flag for human review +- DOWNGRADE → reduce severity if multiple Wave 3 agents agree +- DISMISS → drop if 2+ Wave 3 agents dismiss AND no Wave 1+2 agent rated CRITICAL -**Step 3: Surface Wave 3 challenges that contradict consensus** +**Step 2**: Label every finding UNANIMOUS / MAJORITY / CONTESTED. Read `${CLAUDE_SKILL_DIR}/references/wave-3-adversarial.md` for the label criteria. -Create a dedicated section listing all CONTESTED findings. For each: -- The original Wave 1+2 finding and severity -- The Wave 3 challenge(s) and reasoning -- Both sides' evidence -- Recommended disposition (fix / skip / needs human decision) +**Step 3**: Create a "Contested Findings" section listing each CONTESTED finding with both sides' evidence and recommended disposition. -**Step 4: Build final summary matrix** +**Step 4**: Build final summary matrix. Read `${CLAUDE_SKILL_DIR}/references/output-templates.md` for the full matrix format. 
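The DISMISS rule in Step 1 and the agreement labels in Step 2 reduce to two small checks. The sketch below assumes the criteria stated in the label table (no challenge: UNANIMOUS; 1-2 challenging agents: MAJORITY; 3 or more: CONTESTED). The function names and argument conventions are hypothetical, and whether a DOWNGRADE verdict counts as a challenge is left to the caller.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: illustrative only, not part of the skill.

# label_agreement CHALLENGERS
# CHALLENGERS = number of Wave 3 agents that challenged or dismissed a finding.
label_agreement() {
  local challengers="$1"
  if [ "$challengers" -eq 0 ]; then
    echo "UNANIMOUS"
  elif [ "$challengers" -le 2 ]; then
    echo "MAJORITY"
  else
    echo "CONTESTED"
  fi
}

# should_drop DISMISSALS RATED_CRITICAL
# Succeeds only when 2+ Wave 3 agents dismissed the finding AND no Wave 1+2
# agent rated it CRITICAL (second argument is the string "false").
should_drop() {
  local dismissals="$1" rated_critical="$2"
  [ "$dismissals" -ge 2 ] && [ "$rated_critical" = "false" ]
}
```

For example, `label_agreement 1` prints `MAJORITY`, and `should_drop 2 true` fails because a CRITICAL rating blocks the drop regardless of how many agents dismissed the finding.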
-``` -| Agent | Wave | CRITICAL | HIGH | MEDIUM | LOW | -|--------------------------|------|----------|------|--------|-----| -| Per-Package: [pkg1] | 0 | N | N | N | N | -| Per-Package: [pkg2] | 0 | N | N | N | N | -| Per-Package: [...] | 0 | N | N | N | N | -| **Wave 0 Subtotal** | **0**| **N** | **N**| **N** | **N**| -| Security | 1 | N | N | N | N | -| Business Logic | 1 | N | N | N | N | -| Architecture | 1 | N | N | N | N | -| Silent Failures | 1 | N | N | N | N | -| Test Coverage | 1 | N | N | N | N | -| Type Design | 1 | N | N | N | N | -| Code Quality | 1 | N | N | N | N | -| Comment Analyzer | 1 | N | N | N | N | -| Language Specialist | 1 | N | N | N | N | -| Docs & Config | 1 | N | N | N | N | -| ADR Compliance | 1 | N | N | N | N | -| Newcomer | 1 | N | N | N | N | -| **Wave 1 Subtotal** | **1**| **N** | **N**| **N** | **N**| -| Performance | 2 | N | N | N | N | -| Concurrency | 2 | N | N | N | N | -| API Contract | 2 | N | N | N | N | -| Dependency Audit | 2 | N | N | N | N | -| Error Messages | 2 | N | N | N | N | -| Dead Code | 2 | N | N | N | N | -| Naming Consistency | 2 | N | N | N | N | -| Observability | 2 | N | N | N | N | -| Config Safety | 2 | N | N | N | N | -| Migration Safety | 2 | N | N | N | N | -| **Wave 2 Subtotal** | **2**| **N** | **N**| **N** | **N**| -| Contrarian | 3 | — challenges — | — | — | — | -| Skeptical Senior | 3 | — challenges — | — | — | — | -| User Advocate | 3 | — challenges — | — | — | — | -| Meta-Process | 3 | — challenges — | — | — | — | -| SAPCC Structural | 3 | — challenges — | — | — | — | -| **Wave 3 Summary** | **3**| **N agreed** | **N challenged** | **N downgraded** | **N dismissed** | -| **TOTAL (post-Wave 3)** | | **N** | **N**| **N** | **N**| -``` - -**Wave Agreement Summary:** - -``` -| Agreement Level | Count | Action | -|-----------------|-------|--------| -| UNANIMOUS | N | Fix immediately — high confidence | -| MAJORITY | N | Fix, note challenge in report | -| CONTESTED | N | Needs human judgment — 
present both sides | -``` - -**Step 5: Save all findings to disk** - -Write Wave 3 findings and the final aggregated report to disk: +**Step 5: Save all findings to disk**: ```bash -# Save Wave 3 raw findings (adversarial challenge outputs) cat > "$REVIEW_DIR/wave3-findings.md" << 'WAVE3_EOF' -[Paste ALL Wave 3 agent outputs — the raw challenge verdicts from each agent] +[Paste ALL Wave 3 agent outputs] WAVE3_EOF -# Save the final deduplicated, severity-classified, agreement-labeled report cat > "$REVIEW_DIR/final-report.md" << 'REPORT_EOF' -[Paste the full summary matrix + all classified findings + agreement labels + contested findings section] +[Paste the full summary matrix + all classified findings + agreement labels + contested section] REPORT_EOF echo "Saved Wave 3 findings: $(wc -l < "$REVIEW_DIR/wave3-findings.md") lines" echo "Saved final report: $(wc -l < "$REVIEW_DIR/final-report.md") lines" -echo "All findings persisted at: $REVIEW_DIR/" ls -la "$REVIEW_DIR/" ``` -**Step 6: Present summary to user** +**Step 6**: Show the matrix, agreement summary, and CONTESTED findings BEFORE proceeding to fixes. If `--review-only`, stop here. -Show the matrix, agreement summary, and CONTESTED findings list BEFORE proceeding to fixes. If `--review-only`, stop here. +For CONTESTED findings: "Wave 3 challenges these N findings. Fix them anyway, skip them, or decide individually?" -For CONTESTED findings, explicitly ask the user: "Wave 3 challenges these N findings. Fix them anyway, skip them, or decide individually?" - -**Gate**: All 25+ agents' findings classified, deduplicated, agreement-labeled, saved to `$REVIEW_DIR/final-report.md`, matrix built. User informed of contested findings. Proceed to Phase 4. +**Gate**: All findings classified, deduplicated, agreement-labeled, saved to `$REVIEW_DIR/final-report.md`. User informed of contested findings. Proceed to Phase 4. --- -### Phase 4: FIX +## Phase 4: FIX -**Goal**: Fix EVERY finding. Zero deferred. 
Zero "out of scope." Zero "will fix later." +**NO DEFERRED FIXES. Fix EVERY finding. Zero "out of scope." Zero "will fix later."** -**Step 0: Load findings from disk** +The only acceptable reason not to fix a finding: applying it breaks tests AND an alternative fix also breaks tests. Even then, BLOCKED items must be fewer than 10% of total findings. -Reload the final report from disk before fixing (context compaction almost certainly fired during the 20+ agent dispatch/aggregate cycle): +**Step 0: Load findings from disk**: ```bash -# This is your source of truth for what to fix cat "$REVIEW_DIR/final-report.md" ``` -If the file is missing, rebuild from individual wave files: -```bash -cat "$REVIEW_DIR/wave0-findings.md" "$REVIEW_DIR/wave1-findings.md" "$REVIEW_DIR/wave2-findings.md" "$REVIEW_DIR/wave3-findings.md" -``` +If missing, rebuild: `cat "$REVIEW_DIR/wave0-findings.md" "$REVIEW_DIR/wave1-findings.md" "$REVIEW_DIR/wave2-findings.md" "$REVIEW_DIR/wave3-findings.md"` -**Wave Agreement Handling in Fix Phase:** +**Wave Agreement Handling**: | Agreement Level | Fix Behavior | |-----------------|-------------| -| **UNANIMOUS** | Fix without hesitation — all waves agree | -| **MAJORITY** | Fix, but include Wave 3 challenge as code comment if the challenge has merit | -| **CONTESTED** | Fix only if user approved during Phase 3d. If user said "skip", do not fix. If user said "decide individually", follow their per-finding decisions. | - -**CRITICAL RULE: NO DEFERRED FIXES.** Every finding from every wave MUST be fixed. The only acceptable reason to not fix a finding is if applying it breaks tests — and even then, try an alternative fix before giving up. "Deferred" is not a valid status. "Out of scope" is not a valid reason. If an agent found it, fix it. 
- -**Common deferral rationalizations to REJECT:** - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Pre-existing, not modified by this PR" | The review found it. It's a real issue. Fix it. | Fix it in the same branch — pre-existing bugs are still bugs | -| "Needs follow-up PR" | Follow-up PRs don't happen. Fix it now. | Fix it now — one PR with all fixes | -| "Architectural change required" | If you can describe the fix, you can implement it | Implement the fix, even if it touches multiple files | -| "Needs design decision" | The agent already suggested a fix — use it | Apply the suggested fix or a reasonable alternative | -| "Acceptable risk / low impact" | If it was worth reporting, it's worth fixing | Fix it — the agent wouldn't flag it if it didn't matter | -| "Test-only code, doesn't matter" | Test code quality affects reliability | Fix test code to the same standard as production code | -| "Documentation task / maintenance task" | Docs and maintenance are code. Fix them. | Update the docs, bump the dependency, fix the README | -| "Standard pattern / convention choice" | If an agent flagged it as an issue, the pattern is wrong here | Fix it or explicitly justify in a code comment why it's intentional | -| "Optimization, not a bug" | Performance issues ARE bugs to your users | Apply the optimization | +| UNANIMOUS | Fix without hesitation | +| MAJORITY | Fix, include Wave 3 challenge as code comment if challenge has merit | +| CONTESTED | Fix only if user approved during Phase 3d. Follow per-finding decisions. | **Step 1: Create branch** ```bash -# Create fix branch from current HEAD git checkout -b review-fixes/$(date +%Y%m%d-%H%M%S) ``` @@ -1189,18 +471,12 @@ Or use `EnterWorktree` for full isolation. **Step 2: Fix ALL findings by severity (CRITICAL first)** -Order: -1. CRITICAL — fix immediately, test after each -2. HIGH — fix in batch, test after batch -3. 
MEDIUM — fix in batch, test after batch -4. LOW — fix in batch, test after batch - -**Every severity level gets fixed.** LOW findings are not optional. MEDIUM findings are not "nice to have." Fix them all. +Order: CRITICAL → HIGH → MEDIUM → LOW. Every level gets fixed — LOW is not optional. For each fix: 1. State which finding is being addressed (wave, agent, severity, file:line) 2. Apply the fix -3. Verify the fix compiles/parses +3. Verify it compiles/parses 4. Run relevant tests ```bash @@ -1215,17 +491,16 @@ npx tsc --noEmit && npx vitest run make check 2>/dev/null || make test 2>/dev/null ``` -**Step 3: Apply simplifications and docs fixes LAST** - -Run `reviewer-code-simplifier` on already-fixed files (final code polish). Apply `reviewer-docs-validator` fixes after code is finalized (docs should reflect final state). Simplification and docs should be the last changes. - -**Step 4: If a fix breaks tests** +**Step 3**: Apply simplifications (run `reviewer-code-simplifier` on already-fixed files) and docs fixes LAST — docs should reflect final code state. +**Step 4: If a fix breaks tests**: 1. Revert the specific fix -2. Try an ALTERNATIVE fix that addresses the same finding differently -3. If the alternative also breaks tests, note as "BLOCKED — breaks tests, alternative also failed" +2. Try an ALTERNATIVE fix for the same finding +3. If alternative also fails: mark "BLOCKED — breaks tests, alternative also failed" 4. Continue with remaining fixes -5. "BLOCKED" items must be fewer than 10% of total findings. If more than 10% are blocked, something is wrong with the fix approach — reassess. +5. BLOCKED items must be <10% of total + +**Reject all deferral rationalizations**: "pre-existing bug", "needs follow-up PR", "architectural change required", "acceptable risk", "test-only code", "standard pattern" — if an agent found it, fix it. 
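The BLOCKED budget from Step 4 (strictly fewer than 10% of total findings) can be checked with integer arithmetic. This is a minimal sketch; the function name `blocked_within_budget` and its two count arguments are hypothetical, and treating zero blocked items as always passing is an assumption for the empty-review case.

```bash
#!/usr/bin/env bash
# Hypothetical helper: illustrative only, not part of the skill.
# blocked_within_budget BLOCKED TOTAL
# Succeeds when BLOCKED is strictly less than 10% of TOTAL.
blocked_within_budget() {
  local blocked="$1" total="$2"

  # Assumption: zero blocked items always passes, even with zero findings.
  if [ "$blocked" -eq 0 ]; then
    return 0
  fi

  # blocked/total < 1/10 is equivalent to blocked*10 < total (no floats needed).
  [ $((blocked * 10)) -lt "$total" ]
}
```

With 3 blocked out of 25 findings the check fails (30 is not less than 25), which is the signal to reassess the fix approach rather than keep marking items BLOCKED.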
**Step 5: Commit** @@ -1238,156 +513,11 @@ git commit -m "fix: apply comprehensive review findings (N fixes across M files) --- -### Phase 5: REPORT - -**Goal**: Generate comprehensive report documenting everything. +## Phase 5: REPORT -Write `comprehensive-review-report.md`: +Write `comprehensive-review-report.md`. Read `${CLAUDE_SKILL_DIR}/references/output-templates.md` for the full report template. -```markdown -# Comprehensive Code Review Report v3 - -**Date**: [date] -**Files reviewed**: [N] -**Packages discovered**: [N] -**Agents dispatched**: [N] (Wave 0: [N per-package], Wave 1: 12, Wave 2: 10, Wave 3: [4-5 adversarial]) -**Total findings**: [N] -**Findings fixed**: [N] -**Findings blocked**: [N] (ONLY if fix breaks tests after alternative attempt — must be <10%) - ---- - -## Verdict: [CLEAN | ALL_FIXED | BLOCKED_ITEMS] - -- CLEAN: No findings (rare) -- ALL_FIXED: Every finding was fixed (expected outcome) -- BLOCKED_ITEMS: Some fixes break tests even after alternative attempts (<10%) - -[2-3 sentences: Overall assessment. What systemic patterns emerged? -Is the codebase better after fixes?] - -## Wave Summary - -| Wave | Agents | Findings | Fixed | Unique to Wave | -|------|--------|----------|-------|----------------| -| Wave 0 (Per-Package) | N | N | N | N | -| Wave 1 (Foundation) | 12 | N | N | N | -| Wave 2 (Deep-Dive) | 10 | N | N | N | -| Wave 3 (Adversarial) | N | — | — | N challenges | -| **TOTAL** | **N** | **N** | **N** | | - -## Wave Agreement Analysis - -| Agreement Level | Count | Fixed | Skipped | Human Decided | -|-----------------|-------|-------|---------|---------------| -| UNANIMOUS | N | N | — | — | -| MAJORITY | N | N | — | — | -| CONTESTED | N | N | N | N | - -## Wave 0: Per-Package Results - -| Package | Health | Files | Findings | Key Issue | -|---------|--------|-------|----------|-----------| -| [pkg/path1] | HEALTHY | N | N | — | -| [pkg/path2] | NEEDS_ATTENTION | N | N | [biggest] | -| ... | ... | ... | ... | ... 
| - -**Cross-Package Patterns**: [List of patterns seen across multiple packages] - -## Agent Summary - -| Agent | Wave | Findings | Fixed | Blocked | Key Issue | -|-------|------|----------|-------|---------|-----------| -| Per-Package (total) | 0 | N | N | N | [biggest] | -| Security | 1 | N | N | N | [biggest] | -| Business Logic | 1 | N | N | N | [biggest] | -| Architecture | 1 | N | N | N | [biggest] | -| Silent Failures | 1 | N | N | N | [biggest] | -| Test Coverage | 1 | N | N | N | [biggest] | -| Type Design | 1 | N | N | N | [biggest] | -| Code Quality | 1 | N | N | N | [biggest] | -| Comment Analyzer | 1 | N | N | N | [biggest] | -| Language Specialist | 1 | N | N | N | [biggest] | -| Docs & Config | 1 | N | N | N | [biggest] | -| ADR Compliance | 1 | N | N | N | [biggest] | -| Newcomer | 1 | N | N | N | [biggest] | -| Performance | 2 | N | N | N | [biggest] | -| Concurrency | 2 | N | N | N | [biggest] | -| API Contract | 2 | N | N | N | [biggest] | -| Dependency Audit | 2 | N | N | N | [biggest] | -| Error Messages | 2 | N | N | N | [biggest] | -| Dead Code | 2 | N | N | N | [biggest] | -| Naming Consistency | 2 | N | N | N | [biggest] | -| Observability | 2 | N | N | N | [biggest] | -| Config Safety | 2 | N | N | N | [biggest] | -| Migration Safety | 2 | N | N | N | [biggest] | -| Contrarian | 3 | — | — | — | [key challenge] | -| Skeptical Senior | 3 | — | — | — | [key challenge] | -| User Advocate | 3 | — | — | — | [key challenge] | -| Meta-Process | 3 | — | — | — | [key challenge] | -| SAPCC Structural | 3 | — | — | — | [key challenge or N/A] | -| **TOTAL** | | **N** | **N** | **N** | | - -## Context Cascade Effectiveness - -How each wave's context helped later waves find deeper issues: - -| Wave 2 Agent | Wave 0 Context Used | Wave 1 Context Used | Additional Findings Due to Context | -|-------------|--------------------|--------------------|-------------------------------------| -| Performance | Package complexity hotspots | Architecture hot paths | 
[N findings] | -| Concurrency | Intra-package concurrent patterns | Silent failures + arch | [N findings] | -| ... | ... | ... | ... | - -### Wave 3 Challenge Effectiveness - -How adversarial review changed the final outcome: - -| Wave 3 Agent | Findings Challenged | Downgraded | Dismissed | Key Insight | -|-------------|--------------------|-----------|-----------|----| -| Contrarian | N | N | N | [biggest challenge] | -| Skeptical Senior | N | N | N | [biggest challenge] | -| User Advocate | N | N | N | [biggest challenge] | -| Meta-Process | N | N | N | [biggest challenge] | -| SAPCC Structural | N or N/A | N | N | [biggest challenge or N/A] | - -## Findings by Severity - -### CRITICAL -[Each finding with before/after code] - -### HIGH -[Each finding with before/after code] - -### MEDIUM -[Summary with file references] - -### LOW -[Brief list] - -## Contested Findings (Wave 3 vs Wave 1+2) - -| Finding | Wave 1+2 Severity | Wave 3 Verdict | Resolution | -|---------|-------------------|----------------|------------| -| [summary] | HIGH | CHALLENGE: [reason] | [Fixed / Skipped / Human decided] | -| ... | ... | ... | ... | - -## Quick Wins Applied -[List of easy fixes that improved quality] - -## Blocked Items (if any — must be <10% of total) -[List of findings where fix AND alternative fix both break tests] -[Each must include: what was tried, why it failed, suggested manual approach] - -## What's Done Well -[Genuine positives found during review] - -## Systemic Recommendations -[2-3 big-picture patterns observed across findings] -``` - -**Step 2: Note findings location** - -Display the findings directory path so the user knows where raw data lives: +Display findings directory path to user: ``` Review findings persisted at: $REVIEW_DIR/ @@ -1400,217 +530,35 @@ Review findings persisted at: $REVIEW_DIR/ final-report.md — Aggregated, deduplicated, agreement-labeled ``` -These files persist in `/tmp/` until next reboot. 
They can be re-read in future sessions if needed. - -**Gate**: Report written, findings persisted to disk. Display summary to user. Review complete. +**Gate**: Report written, findings persisted. Display summary to user. Review complete. --- -## Combining with Existing Skills +## Error Handling -### When to use which +**Agent Times Out**: Report findings from completed agents immediately. Note which timed out. Offer to re-run separately. Proceed with partial results — do not block the entire wave. -| Situation | Use This | -|-----------|----------| -| Any PR, any language, full review+fix+challenge | `/comprehensive-review` (4 waves) | -| Fast review, skip per-package | `/comprehensive-review --skip-wave0` (3 waves: 1+2+3) | -| Quick review, 12 agents only | `/comprehensive-review --wave1-only` | -| Quick 3-reviewer check, no fix | `/parallel-code-review` | -| PR comment validation | `/pr-review-address-feedback` | -| Sequential deep dive | `systematic-code-review` skill | +**Fix Breaks Tests**: Revert the specific fix. Try an ALTERNATIVE approach for the same finding. If alternative also fails, mark BLOCKED. Continue. BLOCKED must be <10%. ---- +**Conflicting Fixes**: Prefer security fix over style. Prefer correctness over simplification. Wave 2 fixes with Wave 0+1 context generally have better understanding. Apply higher-severity agent's fix — never skip. -## Error Handling +**No Changed Files**: Ask user "Which files would you like reviewed?" If "everything", scan all source files. Warn about scope. -### Error: "Agent Times Out" -Cause: One or more agents exceed execution time. -Solution: -1. Report findings from completed agents immediately -2. Note which agent(s) timed out -3. Offer to re-run failed agent separately -4. Proceed with partial results — do not block the entire wave - -### Error: "Fix Breaks Tests" -Cause: Applied fix introduces a regression. -Solution: -1. Revert the specific fix immediately -2. 
Try an ALTERNATIVE fix approach for the same finding -3. If alternative also fails, mark as "BLOCKED — both approaches break tests" -4. Continue with remaining fixes -5. Blocked items must be <10% of total — if higher, reassess fix strategy - -### Error: "Conflicting Fixes" -Cause: Two agents suggest contradictory fixes for same code. -Solution: -1. Prefer security fix over style fix (security wins) -2. Prefer correctness over simplification -3. Wave 2 fixes with Wave 0+1 context generally have better understanding -4. If genuinely ambiguous, apply the higher-severity agent's fix — never skip - -### Error: "No Changed Files Found" -Cause: No git diff, no PR context, no changes to review. -Solution: -1. Ask user: "Which files would you like reviewed?" -2. If user says "everything", scan all source files -3. Warn about review scope and time for large repos - -### Error: "No Packages Discovered" -Cause: Wave 0 package discovery finds no packages (no internal/ dirs, no __init__.py, no index.ts). -Solution: -1. Skip Wave 0 entirely — this is not an error -2. Proceed to Wave 1 with note: "Wave 0 skipped — no package structure detected" -3. Wave 1 and Wave 2 still run normally without Wave 0 context - -### Error: "Too Many Packages (>30)" -Cause: Large monorepo with many packages discovered. -Solution: -1. Report: "Discovered [N] packages. Wave 0 will require [ceil(N/10)] batches." -2. Proceed with batching — quality matters more than speed -3. Consider filtering to packages containing changed files if reviewing a PR - -### Error: "Wave 0/1 Produces No Findings" -Cause: A wave finds nothing to report. -Solution: -1. This is good news — code passed that wave's review -2. Still dispatch subsequent waves with note: "Wave [N] found no issues. Perform independent analysis." -3. Empty findings from early waves are still useful context — they confirm code quality +**No Packages Discovered**: Skip Wave 0 silently. 
Proceed to Wave 1 with note: "Wave 0 skipped — no package structure detected." Wave 1 and Wave 2 still run normally. ---- +**Too Many Packages (>30)**: Report count and batch requirement. Proceed with batching — quality matters more than speed. Consider filtering to packages containing changed files if reviewing a PR. -## Anti-Patterns - -### AP-1: Sequential Agent Dispatch -**What it looks like**: Sending one Agent call, waiting, then sending the next. -**Why wrong**: Multiplies review time. Agents within a wave are independent. -**Do instead**: ALL Agent dispatches within a wave in ONE message. - -### AP-2: Fixing Without Full Review -**What it looks like**: Fixing Wave 1 findings while Wave 2 is still running. -**Why wrong**: Wave 2 may find conflicting or deeper issues. Deduplication requires all results. -**Do instead**: Complete Phase 3b full aggregation before ANY fixes. - -### AP-3: Skipping "Trivial" Agents -**What it looks like**: "No new types, skip type-design-analyzer" -**Why wrong**: Existing types in changed files may have issues. Let agents find nothing. -**Do instead**: Run all agents. Empty results are fast and confirm quality. - -### AP-4: Fixing on Main Branch -**What it looks like**: Applying fixes directly on the user's current branch. -**Why wrong**: Review fixes should be isolated for easy revert. -**Do instead**: Always create a fix branch or use worktree. - -### AP-5: Deferring Fixes -**What it looks like**: Marking findings as "deferred", "out of scope", or "will fix later." -**Why wrong**: Deferred fixes never get fixed. The whole point of comprehensive review is fixing everything now. -**Do instead**: Fix every finding. If a fix breaks tests, try an alternative approach. Only "BLOCKED" (fix + alternative both break tests) is acceptable, and must be <10%. - -### AP-6: Skipping Wave 2 -**What it looks like**: "Wave 1 found enough, no need for Wave 2." 
-**Why wrong**: Wave 2 agents find categories of issues Wave 1 cannot (performance, concurrency, naming, etc.). -**Do instead**: Always run all waves unless `--wave1-only` or `--skip-wave0` is explicitly passed. - -### AP-7: Not Passing Context Between Waves -**What it looks like**: Dispatching Wave 1 without Wave 0 context, or Wave 2 without Wave 0+1 context. -**Why wrong**: The entire value of multi-wave architecture is context-aware analysis. Each wave enriches the next. -**Do instead**: Always include prior wave findings summaries in every subsequent wave agent prompt. - -### AP-8: Dispatching Too Many Per-Package Agents at Once -**What it looks like**: Sending 25 per-package agents in one message. -**Why wrong**: Max 10 agents per message. Exceeding this causes failures. -**Do instead**: Batch Wave 0 agents in groups of 10. Wait for each batch before sending the next. - -### AP-9: Skipping Wave 3 -**What it looks like**: "Wave 1+2 consensus is strong enough, no need for adversarial review." -**Why wrong**: Consensus without challenge is groupthink. Wave 3 catches over-classified findings, false positives, and user-impact blind spots that reinforcing waves miss. -**Do instead**: Always run Wave 3. Adversarial challenge improves signal-to-noise ratio. - -### AP-10: Wave 3 Agents Agreeing With Everything -**What it looks like**: Wave 3 agents return "AGREE" on every finding without genuine challenge. -**Why wrong**: The purpose of Wave 3 is adversarial pressure-testing. Universal agreement means the agents are not doing their job. -**Do instead**: If Wave 3 returns >90% AGREE, note this in the report as "Wave 3 did not provide meaningful challenge — findings may benefit from human review." - -### AP-11: "Acceptable" as a Review Disposition -**What it looks like**: "This is a real issue but acceptable for now" / "valid but deferred" / "conservative, not a bug" -**Why wrong**: "Acceptable" acknowledges a problem while avoiding the cost of addressing it. 
Creates the illusion of thoroughness without substance. -**Do instead**: FIX NOW, FIX IN FOLLOW-UP (with tracked artifact), or NOT AN ISSUE (with evidence). No middle ground. -*Graduated from /do SKILL.md — incident: Kafka PR shipped double-backoff classified as "conservative, not a bug"* - -### AP-12: Deferred Findings Without Tracking Artifacts -**What it looks like**: "We'll address this in a follow-up" with no issue, TODO, or learning.db entry created. -**Why wrong**: "Follow-up" without a tracking artifact is a polite way of saying "never." -**Do instead**: Create a tracking artifact (GitHub issue, `TODO(follow-up):` in code, learning.db entry) before marking any finding as deferred. -*Graduated from /do SKILL.md — incident: Kafka PR deferred findings lost between review rounds* +**Wave Finds No Findings**: Good news. Still dispatch subsequent waves with note: "Wave [N] found no issues. Perform independent analysis." Empty findings confirm quality. --- -## Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Small PR, skip comprehensive" | Small PRs hide big bugs | Run all 4 waves | -| "Tests pass, no review needed" | Tests don't catch all issues | Tests are necessary but not sufficient | -| "Just a style fix" | Style issues compound into tech debt | Fix every finding | -| "Fix phase takes too long" | Finding bugs in prod takes longer | Fix now or fix later at 10x cost | -| "Agent found nothing, wasted" | Confirming quality is valuable | No finding = confidence | -| "I'll fix the LOWs manually later" | Later never comes | Auto-fix everything now | -| "This finding is out of scope" | If an agent found it, it's in scope | Fix it — scope is everything agents report | -| "32 findings deferred" | Deferred = not fixed = failed review | Fix all 32. Zero deferred. 
| -| "Only actionable findings fixed" | ALL findings are actionable — agents don't report non-actionable things | Fix every finding | -| "Wave 1 is enough" | Wave 2 finds performance, concurrency, naming issues Wave 1 misses | Run all waves | -| "Wave 0 is slow, skip it" | Per-package context catches issues cross-cutting agents miss | Run Wave 0 unless explicitly skipped | -| "Too many packages, batch overhead" | Batching costs tokens, not accuracy | Batch all packages, even if it takes 3+ batches | -| "Wave 3 is overkill" | Adversarial review catches groupthink and false positives | Run Wave 3 — it improves signal-to-noise | -| "All waves agree, no need for Wave 3" | Agreement without challenge is untested consensus | Wave 3 validates consensus — agreement AFTER challenge is meaningful | -| "Wave 3 challenges mean findings are wrong" | Challenges are data, not verdicts | Present both sides — let humans decide on CONTESTED findings | +## When to Use Which ---- - -## References - -This skill uses these shared patterns: -- [Anti-Rationalization Core](../shared-patterns/anti-rationalization-core.md) -- [Anti-Rationalization Review](../shared-patterns/anti-rationalization-review.md) -- [Severity Classification](../shared-patterns/severity-classification.md) -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) -- [Verification Checklist](../shared-patterns/verification-checklist.md) - -### Related Skills & Agents - -**Wave 1 Agents:** -- `reviewer-security` — OWASP Top 10, auth, injection, secrets -- `reviewer-business-logic` — Edge cases, state transitions, requirements -- Architecture reviewer — Patterns, structure (language-specific agent) -- `reviewer-silent-failures` — Swallowed errors, empty catches -- `reviewer-test-analyzer` — Coverage gaps, test quality -- `reviewer-type-design` — Invariants, encapsulation -- `reviewer-code-quality` — CLAUDE.md compliance, conventions -- `reviewer-comment-analyzer` — Comment accuracy, rot -- 
`reviewer-language-specialist` — Modern stdlib, idioms, LLM tells -- `reviewer-docs-validator` — README, CLAUDE.md, deps, CI -- `reviewer-adr-compliance` — ADR compliance, decision mapping, scope creep -- `reviewer-newcomer` — Fresh-eyes perspective, documentation gaps, onboarding friction - -**Wave 2 Agents:** -- `reviewer-performance` — Hot paths, N+1, allocations, caching -- `reviewer-concurrency` — Races, goroutine leaks, deadlocks -- `reviewer-api-contract` — Breaking changes, status codes, schemas -- `reviewer-dependency-audit` — CVEs, licenses, deprecated packages -- `reviewer-error-messages` — Actionable errors, context, consistency -- `reviewer-dead-code` — Unreachable code, unused exports, stale flags -- `reviewer-naming-consistency` — Convention drift, acronym casing -- `reviewer-observability` — Metrics, logging, traces, health checks -- `reviewer-config-safety` — Hardcoded values, env vars, secrets -- `reviewer-migration-safety` — Reversible migrations, deprecation paths - -**Wave 3 Agents (Adversarial):** -- `reviewer-contrarian` — Challenges findings, identifies false positives and over-classification -- `reviewer-skeptical-senior` — Experience-based skepticism, theoretical vs real-world risks -- `reviewer-user-advocate` — User impact assessment, UX tradeoffs, backward compatibility -- `reviewer-meta-process` — Process review, PR splitting, approach validation -- `reviewer-sapcc-structural` — (conditional) SAP Commerce Cloud structural integrity - -**Related Skills:** -- `parallel-code-review` — 3-agent subset (security, business, arch) without fix -- `systematic-code-review` — Sequential 4-phase methodology -- `pr-review-address-feedback` — PR comment validation and triage +| Situation | Use This | +|-----------|----------| +| Any PR, any language, full review+fix+challenge | `/comprehensive-review` (4 waves) | +| Fast review, skip per-package | `/comprehensive-review --skip-wave0` (3 waves: 1+2+3) | +| Quick review, 12 agents only | 
`/comprehensive-review --wave1-only` | +| Quick 3-reviewer check, no fix | `/parallel-code-review` | +| PR comment validation | `/pr-review-address-feedback` | +| Sequential deep dive | `systematic-code-review` skill | diff --git a/pipelines/comprehensive-review/references/output-templates.md b/pipelines/comprehensive-review/references/output-templates.md new file mode 100644 index 00000000..4f5cef72 --- /dev/null +++ b/pipelines/comprehensive-review/references/output-templates.md @@ -0,0 +1,267 @@ +# Output Templates + +## Findings Summary Matrix + +Used during Wave 0+1+2 aggregate (Phase 3b) and final aggregate (Phase 3d): + +``` +| Agent | Wave | CRITICAL | HIGH | MEDIUM | LOW | +|--------------------------|------|----------|------|--------|-----| +| Per-Package: [pkg1] | 0 | N | N | N | N | +| Per-Package: [pkg2] | 0 | N | N | N | N | +| Per-Package: [...] | 0 | N | N | N | N | +| **Wave 0 Subtotal** | **0**| **N** | **N**| **N** | **N**| +| Security | 1 | N | N | N | N | +| Business Logic | 1 | N | N | N | N | +| Architecture | 1 | N | N | N | N | +| Silent Failures | 1 | N | N | N | N | +| Test Coverage | 1 | N | N | N | N | +| Type Design | 1 | N | N | N | N | +| Code Quality | 1 | N | N | N | N | +| Comment Analyzer | 1 | N | N | N | N | +| Language Specialist | 1 | N | N | N | N | +| Docs & Config | 1 | N | N | N | N | +| ADR Compliance | 1 | N | N | N | N | +| Newcomer | 1 | N | N | N | N | +| **Wave 1 Subtotal** | **1**| **N** | **N**| **N** | **N**| +| Performance | 2 | N | N | N | N | +| Concurrency | 2 | N | N | N | N | +| API Contract | 2 | N | N | N | N | +| Dependency Audit | 2 | N | N | N | N | +| Error Messages | 2 | N | N | N | N | +| Dead Code | 2 | N | N | N | N | +| Naming Consistency | 2 | N | N | N | N | +| Observability | 2 | N | N | N | N | +| Config Safety | 2 | N | N | N | N | +| Migration Safety | 2 | N | N | N | N | +| **Wave 2 Subtotal** | **2**| **N** | **N**| **N** | **N**| +| Contrarian | 3 | — challenges — | — | — | — | +| 
Skeptical Senior | 3 | — challenges — | — | — | — | +| User Advocate | 3 | — challenges — | — | — | — | +| Meta-Process | 3 | — challenges — | — | — | — | +| SAPCC Structural | 3 | — challenges — | — | — | — | +| **Wave 3 Summary** | **3**| **N agreed** | **N challenged** | **N downgraded** | **N dismissed** | +| **TOTAL (post-Wave 3)** | | **N** | **N**| **N** | **N**| +``` + +Wave Agreement Summary (append after matrix): + +``` +| Agreement Level | Count | Action | +|-----------------|-------|--------| +| UNANIMOUS | N | Fix immediately — high confidence | +| MAJORITY | N | Fix, note challenge in report | +| CONTESTED | N | Needs human judgment — present both sides | +``` + +## Final Report Template + +Write to `comprehensive-review-report.md`: + +```markdown +# Comprehensive Code Review Report v4 + +**Date**: [date] +**Files reviewed**: [N] +**Packages discovered**: [N] +**Agents dispatched**: [N] (Wave 0: [N per-package], Wave 1: 12, Wave 2: 10, Wave 3: [4-5 adversarial]) +**Total findings**: [N] +**Findings fixed**: [N] +**Findings blocked**: [N] (ONLY if fix breaks tests after alternative attempt — must be <10%) + +--- + +## Verdict: [CLEAN | ALL_FIXED | BLOCKED_ITEMS] + +- CLEAN: No findings (rare) +- ALL_FIXED: Every finding was fixed (expected outcome) +- BLOCKED_ITEMS: Some fixes break tests even after alternative attempts (<10%) + +[2-3 sentences: Overall assessment. What systemic patterns emerged? +Is the codebase better after fixes?] 
+ +## Wave Summary + +| Wave | Agents | Findings | Fixed | Unique to Wave | +|------|--------|----------|-------|----------------| +| Wave 0 (Per-Package) | N | N | N | N | +| Wave 1 (Foundation) | 12 | N | N | N | +| Wave 2 (Deep-Dive) | 10 | N | N | N | +| Wave 3 (Adversarial) | N | — | — | N challenges | +| **TOTAL** | **N** | **N** | **N** | | + +## Wave Agreement Analysis + +| Agreement Level | Count | Fixed | Skipped | Human Decided | +|-----------------|-------|-------|---------|---------------| +| UNANIMOUS | N | N | — | — | +| MAJORITY | N | N | — | — | +| CONTESTED | N | N | N | N | + +## Wave 0: Per-Package Results + +| Package | Health | Files | Findings | Key Issue | +|---------|--------|-------|----------|-----------| +| [pkg/path1] | HEALTHY | N | N | — | +| [pkg/path2] | NEEDS_ATTENTION | N | N | [biggest] | +| ... | ... | ... | ... | ... | + +**Cross-Package Patterns**: [List of patterns seen across multiple packages] + +## Agent Summary + +| Agent | Wave | Findings | Fixed | Blocked | Key Issue | +|-------|------|----------|-------|---------|-----------| +| Per-Package (total) | 0 | N | N | N | [biggest] | +| Security | 1 | N | N | N | [biggest] | +| Business Logic | 1 | N | N | N | [biggest] | +| Architecture | 1 | N | N | N | [biggest] | +| Silent Failures | 1 | N | N | N | [biggest] | +| Test Coverage | 1 | N | N | N | [biggest] | +| Type Design | 1 | N | N | N | [biggest] | +| Code Quality | 1 | N | N | N | [biggest] | +| Comment Analyzer | 1 | N | N | N | [biggest] | +| Language Specialist | 1 | N | N | N | [biggest] | +| Docs & Config | 1 | N | N | N | [biggest] | +| ADR Compliance | 1 | N | N | N | [biggest] | +| Newcomer | 1 | N | N | N | [biggest] | +| Performance | 2 | N | N | N | [biggest] | +| Concurrency | 2 | N | N | N | [biggest] | +| API Contract | 2 | N | N | N | [biggest] | +| Dependency Audit | 2 | N | N | N | [biggest] | +| Error Messages | 2 | N | N | N | [biggest] | +| Dead Code | 2 | N | N | N | [biggest] | +| Naming 
Consistency | 2 | N | N | N | [biggest] | +| Observability | 2 | N | N | N | [biggest] | +| Config Safety | 2 | N | N | N | [biggest] | +| Migration Safety | 2 | N | N | N | [biggest] | +| Contrarian | 3 | — | — | — | [key challenge] | +| Skeptical Senior | 3 | — | — | — | [key challenge] | +| User Advocate | 3 | — | — | — | [key challenge] | +| Meta-Process | 3 | — | — | — | [key challenge] | +| SAPCC Structural | 3 | — | — | — | [key challenge or N/A] | +| **TOTAL** | | **N** | **N** | **N** | | + +## Context Cascade Effectiveness + +How each wave's context helped later waves find deeper issues: + +| Wave 2 Agent | Wave 0 Context Used | Wave 1 Context Used | Additional Findings Due to Context | +|-------------|--------------------|--------------------|-------------------------------------| +| Performance | Package complexity hotspots | Architecture hot paths | [N findings] | +| Concurrency | Intra-package concurrent patterns | Silent failures + arch | [N findings] | +| ... | ... | ... | ... | + +### Wave 3 Challenge Effectiveness + +How adversarial review changed the final outcome: + +| Wave 3 Agent | Findings Challenged | Downgraded | Dismissed | Key Insight | +|-------------|--------------------|-----------|-----------|----| +| Contrarian | N | N | N | [biggest challenge] | +| Skeptical Senior | N | N | N | [biggest challenge] | +| User Advocate | N | N | N | [biggest challenge] | +| Meta-Process | N | N | N | [biggest challenge] | +| SAPCC Structural | N or N/A | N | N | [biggest challenge or N/A] | + +## Findings by Severity + +### CRITICAL +[Each finding with before/after code] + +### HIGH +[Each finding with before/after code] + +### MEDIUM +[Summary with file references] + +### LOW +[Brief list] + +## Contested Findings (Wave 3 vs Wave 1+2) + +| Finding | Wave 1+2 Severity | Wave 3 Verdict | Resolution | +|---------|-------------------|----------------|------------| +| [summary] | HIGH | CHALLENGE: [reason] | [Fixed / Skipped / Human decided] | +| ... 
| ... | ... | ... | + +## Quick Wins Applied +[List of easy fixes that improved quality] + +## Blocked Items (if any — must be <10% of total) +[List of findings where fix AND alternative fix both break tests] +[Each must include: what was tried, why it failed, suggested manual approach] + +## What's Done Well +[Genuine positives found during review] + +## Systemic Recommendations +[2-3 big-picture patterns observed across findings] +``` + +## task_plan.md Template + +Created at the start of Phase 1: + +```markdown +# Task Plan: Comprehensive Review v4 + +## Goal +Four-wave review and auto-fix of [N] changed files across [N] packages. + +## Phases +- [ ] Phase 0.5: Static Analysis (linters, auto-fix trivial, capture remaining) +- [ ] Phase 1: Scope (identify files, detect org, discover packages, create findings dir) +- [ ] Phase 1b: Wave 0 Dispatch (per-package deep review) +- [ ] Phase 1c: Wave 0 Aggregate (per-package findings) +- [ ] Phase 1.5: Library Contract Verification (Go repos only) +- [ ] Phase 2a: Wave 1 Dispatch (12 foundation agents + Wave 0 context) +- [ ] Phase 2b: Wave 1 Aggregate (collect and summarize Wave 0+1 findings) +- [ ] Phase 3a: Wave 2 Dispatch (10 deep-dive agents with Wave 0+1 context) +- [ ] Phase 3b: Wave 2 Aggregate (merge Wave 0+1+2 findings) +- [ ] Phase 3c: Wave 3 Dispatch (4-5 adversarial agents with Wave 0+1+2 context) +- [ ] Phase 3d: Wave 3 Aggregate (merge adversarial challenges, label agreement) +- [ ] Phase 4: Fix (auto-fix on branch) +- [ ] Phase 5: Report (write report, verify) + +## Review Profile +- Files: [list] +- Packages discovered: [N] +- Wave 0 agents: [N] (one per package) +- Wave 1 agents: 12 +- Wave 2 agents: 10 +- Wave 3 agents: 4-5 (adversarial; 5 if SAPCC detected) +- Org conventions: [detected org or none] +- Mode: [review+fix | review-only] + +## Findings Directory +$REVIEW_DIR = [path from Phase 1 Step 3] + +## Status +**Currently in Phase 1** - Discovering packages +``` + +## Findings Directory Layout + 
+``` +$REVIEW_DIR/ + wave0-findings.md — Per-package deep review results + wave1-findings.md — Foundation agent results (12 agents) + wave01-summary.md — Combined Wave 0+1 context for Wave 2 + wave2-findings.md — Deep-dive agent results (10 agents) + wave012-summary.md — Combined Wave 0+1+2 context for Wave 3 + wave3-findings.md — Adversarial challenge results (4-5 agents) + final-report.md — Aggregated, deduplicated, agreement-labeled +``` + +| File | Written By | Read By | +|------|-----------|---------| +| `$REVIEW_DIR/wave0-findings.md` | Phase 1c | Phase 2a, 2b | +| `$REVIEW_DIR/wave1-findings.md` | Phase 2b | Phase 3a | +| `$REVIEW_DIR/wave01-summary.md` | Phase 2b | Phase 3a | +| `$REVIEW_DIR/wave2-findings.md` | Phase 3b | Phase 3c | +| `$REVIEW_DIR/wave012-summary.md` | Phase 3b | Phase 3c | +| `$REVIEW_DIR/wave3-findings.md` | Phase 3d | Phase 4 | +| `$REVIEW_DIR/final-report.md` | Phase 3d | Phase 4, Phase 5 | + +These files persist in `/tmp/` until next reboot and can be re-read in future sessions if needed. diff --git a/pipelines/comprehensive-review/references/wave-0-per-package.md b/pipelines/comprehensive-review/references/wave-0-per-package.md new file mode 100644 index 00000000..cb7de494 --- /dev/null +++ b/pipelines/comprehensive-review/references/wave-0-per-package.md @@ -0,0 +1,131 @@ +# Wave 0: Per-Package Deep Review + +## Package Discovery Commands + +| Language | Discovery Command | Agent Type | Min Package Size | +|----------|-------------------|------------|-----------------| +| Go | `find . -name "*.go" -path "*/internal/*" \| xargs dirname \| sort -u` | `golang-general-engineer-compact` | 1 file | +| Go (also) | `find . -name "*.go" -not -path "*/internal/*" -not -path "*/vendor/*" \| xargs dirname \| sort -u` | `golang-general-engineer-compact` | 1 file | +| Python | `find . -name "__init__.py" \| xargs dirname \| sort -u` | `python-general-engineer` | 1 file | +| TypeScript | `find . 
-name "index.ts" -o -name "index.tsx" \| xargs dirname \| sort -u` | `typescript-frontend-engineer` | 1 file | + +Full discovery commands (excluding vendor/venv/git): + +```bash +# Go packages (internal/ and top-level) +find . -name "*.go" -not -path "*/vendor/*" -not -path "*/.git/*" | xargs dirname | sort -u + +# Python packages +find . -name "__init__.py" -not -path "*/venv/*" -not -path "*/.git/*" | xargs dirname | sort -u + +# TypeScript modules (directories with index.ts/tsx) +find . -name "index.ts" -o -name "index.tsx" | grep -v node_modules | xargs dirname | sort -u +``` + +## Agent Selection by Language + +| Language | Agent | +|----------|-------| +| Go | `golang-general-engineer-compact` | +| Python | `python-general-engineer` | +| TypeScript | `typescript-frontend-engineer` | +| Mixed | Use language of majority files in that package | + +Use `model: sonnet` for ALL per-package agents. Never use haiku for code review. The orchestrator runs on Opus; dispatched agents run on Sonnet for cost efficiency (40% savings, minimal quality tradeoff). + +## Per-Package Agent Dispatch Prompt + +Dispatch up to 10 agents per message. If more than 10 packages are discovered, use multiple batches: +- Batch 1: packages 1-10 (dispatch in ONE message) +- Batch 2: packages 11-20 (dispatch in ONE message after batch 1 completes) +- Continue until all packages are covered + +Each per-package agent gets this prompt: + +``` +PER-PACKAGE DEEP REVIEW — Wave 0 + +PACKAGE: [package path] +LANGUAGE: [Go/Python/TypeScript] + +MCP TOOL DISCOVERY (do this FIRST, before reading package files): +- Use ToolSearch to check for available MCP tools that can enhance your analysis: + a. Run ToolSearch("gopls") — if this is a Go package, loads type-aware analysis + tools (go_file_context, go_diagnostics, go_symbol_references, etc.) + b. 
Run ToolSearch("context7") — loads library documentation lookup tools +- If gopls tools are available AND LANGUAGE is Go: + * Use go_file_context after reading each .go file for intra-package dependency context + * Use go_diagnostics on the package to detect build/analysis errors + * Use go_symbol_references to check for unused or misused exported symbols +- If Context7 tools are available: + * Use resolve-library-id + query-docs for unfamiliar library APIs in this package + +INSTRUCTIONS: +1. Read the CLAUDE.md file(s) in this repository first +2. Run MCP TOOL DISCOVERY steps above +3. Read EVERY file in this package directory: [package path]/ +4. Understand the package's purpose, internal APIs, and relationships +5. Review ALL code for issues — you have full package context +6. Use MCP tools (gopls, Context7) as you review for type-aware precision +7. Focus on issues that require understanding the WHOLE package: + - Internal API misuse between files in this package + - Inconsistent error handling patterns within the package + - Missing or redundant functionality + - Package-level design issues (cohesion, coupling) + - Test coverage relative to package complexity +8. Use structured output format with severity classification +9. Include file:line references for every finding + +CONTEXT: This is Wave 0 of a comprehensive review. Your per-package findings +will be passed to 22 cross-cutting review agents in Waves 1 and 2. Focus on +issues that require full package context to detect — cross-cutting agents will +handle file-level and project-level concerns.
+ +OUTPUT FORMAT: +### PACKAGE: [package path] +**Files reviewed**: [list all files read] +**Package purpose**: [1-sentence summary] +**Package health**: [HEALTHY | MINOR_ISSUES | NEEDS_ATTENTION | CRITICAL] + +Findings: +### [CRITICAL|HIGH|MEDIUM|LOW]: [One-line summary] +**File**: `path/to/file:LINE` +**Issue**: [Description] +**Impact**: [Why this matters] +**Fix**: [Concrete code fix] +**Requires package context**: [Why a single-file reviewer would miss this] +--- +``` + +## Wave 0 Aggregate Output Format + +After all per-package agents complete, build this summary for Wave 1+2 context injection: + +```markdown +## Wave 0 Per-Package Findings Summary (for Wave 1+2 context) + +### Packages Reviewed: [N] +### Packages Healthy: [N] | Minor Issues: [N] | Needs Attention: [N] | Critical: [N] + +### Per-Package Results + +#### [package/path1] — [HEALTHY|MINOR_ISSUES|NEEDS_ATTENTION|CRITICAL] +- Purpose: [1-sentence] +- Files: [N] +- Findings: [N] (CRITICAL: N, HIGH: N, MEDIUM: N, LOW: N) +- Key issues: + - [SEVERITY]: [summary] at [file:line] + - ... + +#### [package/path2] — [STATUS] +- ... + +### Cross-Package Patterns Detected +- [Pattern 1]: Seen in packages [list] — suggests systemic issue +- [Pattern 2]: ... + +### Wave 0 Hotspots (packages with most findings) +1. [package] — [N findings, N critical] +2. [package] — [N findings, N high] +3. ... +``` diff --git a/pipelines/comprehensive-review/references/wave-1-foundation.md b/pipelines/comprehensive-review/references/wave-1-foundation.md new file mode 100644 index 00000000..1aaba359 --- /dev/null +++ b/pipelines/comprehensive-review/references/wave-1-foundation.md @@ -0,0 +1,173 @@ +# Wave 1: Foundation Agents (12 agents) + +These agents run in parallel with one another and receive the Wave 0 per-package findings as context. They perform cross-cutting analysis that spans packages and establish the foundation for Wave 2.
+ +**ALL 12 agents MUST be dispatched in ONE message for true parallel execution.** + +Use `model: sonnet` for all Wave 1 agents. The orchestrator runs on Opus; dispatched agents run on Sonnet. + +## Agent Roster + +| # | Agent | Focus Area | Key Catches | +|---|-------|------------|-------------| +| 1 | `reviewer-security` | Security | OWASP Top 10, auth, injection, secrets | +| 2 | `reviewer-business-logic` | Domain | Edge cases, state transitions, requirement gaps | +| 3 | Architecture reviewer* | Architecture | Patterns, naming, structure, idioms | +| 4 | `reviewer-silent-failures` | Error Handling | Swallowed errors, empty catches, bad fallbacks | +| 5 | `reviewer-test-analyzer` | Test Coverage | Coverage gaps, fragile tests, missing negative cases | +| 6 | `reviewer-type-design` | Type Design | Weak invariants, leaky encapsulation | +| 7 | `reviewer-code-quality` | Quality/Style | CLAUDE.md violations, convention drift | +| 8 | `reviewer-comment-analyzer` | Documentation | Comment rot, misleading docs, stale TODOs | +| 9 | `reviewer-language-specialist` | Language Idioms | Modern stdlib, concurrency, LLM tells, org-specific rules | +| 10 | `reviewer-docs-validator` | Project Health | README, CLAUDE.md, deps, CI, build system | +| 11 | `reviewer-adr-compliance` | ADR Compliance | Implementation matches ADR decisions, no scope creep | +| 12 | `reviewer-newcomer` | Newcomer Perspective | Documentation gaps, confusing code, implicit assumptions, onboarding friction | + +*Architecture reviewer selection by language: + +| File Types | Agent | +|-----------|-------| +| `.go` files | `golang-general-engineer` or `golang-general-engineer-compact` | +| `.py` files | `python-general-engineer` | +| `.ts`/`.tsx` files | `typescript-frontend-engineer` | +| Mixed or other | `Explore` | + +## Standard Agent Prompt Template + +Each agent prompt includes: + +``` +REVIEW SCOPE: +- Files to review: [list of changed files] +- Change context: [what was changed and why, if known] 
+- Repository: [current directory] + +WAVE 0 PER-PACKAGE CONTEXT (deep per-package review results): +[Insert $WAVE0_CONTEXT — loaded from $REVIEW_DIR/wave0-findings.md] + +MCP TOOL DISCOVERY (do this FIRST, before any file reads): +- Use ToolSearch to check for available MCP tools that can enhance your analysis: + a. Run ToolSearch("gopls") — if Go files are in scope, this loads type-aware + analysis tools (go_file_context, go_diagnostics, go_symbol_references, etc.) + b. Run ToolSearch("context7") — loads library documentation lookup tools for + verifying dependency usage and API correctness +- If gopls tools are available AND this is a Go repository: + * Use go_file_context after reading any .go file to understand intra-package dependencies + * Use go_symbol_references before flagging unused or misused symbols + * Use go_diagnostics on files you flag to confirm real vs false-positive issues +- If Context7 tools are available: + * Use resolve-library-id + query-docs to verify library API usage in flagged code + +INSTRUCTIONS: +1. Read the CLAUDE.md file(s) in this repository first +2. Run MCP TOOL DISCOVERY steps above +3. Review the Wave 0 per-package context to understand package-level findings +4. Review the specified files for issues in your domain +5. Use Wave 0 findings to AVOID duplicating per-package issues already found +6. Focus on CROSS-CUTTING concerns that span multiple packages +7. Use MCP tools (gopls, Context7) during analysis where they add precision +8. Use structured output format with severity classification +9. Include file:line references for every finding +10. 
For each finding, provide a concrete fix recommendation + +OUTPUT FORMAT: +Return findings as: +### [CRITICAL|HIGH|MEDIUM|LOW]: [One-line summary] +**File**: `path/to/file:LINE` +**Issue**: [Description] +**Impact**: [Why this matters] +**Fix**: [Concrete code fix] +**Wave 0 Cross-Ref**: [Which Wave 0 package finding this relates to, if any] +--- +``` + +## Agent-Specific Prompt Additions + +| Agent | Extra Instructions | +|-------|-------------------| +| `reviewer-security` | Focus on OWASP Top 10, auth, input validation, secrets. **MCP**: For Go, use gopls `go_symbol_references` to trace tainted input flows. **CALLER TRACING (mandatory)**: When the diff modifies functions with security-sensitive parameters (auth tokens, filter flags, sentinel values like `"*"`), grep for ALL callers across the repo and verify each validates the parameter. Do NOT trust PR descriptions — verify independently. | +| `reviewer-business-logic` | Focus on requirements coverage, edge cases, state transitions. **CALLER TRACING (mandatory)**: When the diff changes interface semantics or introduces sentinel values, grep for ALL callers (`.MethodName(`) across the repo and verify each honors the contract. Do NOT claim "no caller passes X" without searching. | +| Architecture reviewer | Focus on patterns, naming, structure, maintainability. **MCP**: For Go, use gopls `go_file_context` to understand cross-file dependencies | +| `reviewer-silent-failures` | Focus on catch blocks, error swallowing, fallback behavior. **MCP**: For Go, use gopls `go_diagnostics` to verify error handling correctness | +| `reviewer-test-analyzer` | Focus on coverage gaps, missing edge case tests, test quality. **ASSERTION DEPTH CHECK (mandatory)**: For security-sensitive code, flag presence-only assertions (NotEmpty, NotNil, hasKey). Tests MUST verify actual values, not just existence. | +| `reviewer-type-design` | Focus on invariants, encapsulation, type safety. 
**MCP**: For Go, use gopls `go_package_api` to understand type surface area | +| `reviewer-code-quality` | Focus on CLAUDE.md compliance, conventions, style | +| `reviewer-comment-analyzer` | Focus on comment accuracy, rot, misleading docs | +| `reviewer-language-specialist` | Detect language from files, check modern stdlib, idioms, concurrency, LLM tells. **MCP**: For Go files, use gopls `go_file_context` and `go_diagnostics` to detect non-idiomatic patterns with type awareness. If org conventions detected, append org-specific flags to prompt. | +| `reviewer-docs-validator` | Check README.md, CLAUDE.md, deps, CI config, build system, LICENSE. Review the project, not the code. **MCP**: Use Context7 to verify documented library versions/APIs match actual usage | +| `reviewer-adr-compliance` | Auto-discover ADRs from `adr/` and `.adr-session.json`. Check every decision point has implementation, no contradictions, no scope creep. Output ADR COMPLIANT or NOT ADR COMPLIANT. | +| `reviewer-newcomer` | Review from a newcomer/fresh-eyes perspective. Focus on: documentation gaps that would confuse a new developer, implicit assumptions not explained in code or comments, confusing variable/function names, unclear control flow, missing "why" explanations. Flag anything where a developer unfamiliar with this codebase would be lost. | + +## Wave 0+1 Aggregate Output Format + +After Wave 1 completes, build this combined summary for Wave 2 context: + +```markdown +## Wave 0+1 Findings Summary (for Wave 2 context) + +### Wave 0 Per-Package Summary: [N packages reviewed] +- Packages with issues: [list with health status] +- Cross-package patterns: [list] +- Hotspot packages: [top 3 by finding count] +- Key per-package findings: + - [package]: [SEVERITY] [summary] + - ... 
+ +### Security (Agent 1): [N findings] +- CRITICAL: [list if any] +- HIGH: [list] +- Files with security issues: [list] + +### Business Logic (Agent 2): [N findings] +- State transitions identified: [list] +- Edge cases flagged: [list] +- Files with domain issues: [list] + +### Architecture (Agent 3): [N findings] +- Architectural patterns noted: [list] +- Hot paths identified: [list] +- Files with structural issues: [list] + +### Silent Failures (Agent 4): [N findings] +- Swallowed errors at: [file:line list] +- Error paths without handling: [list] +- Files with error handling gaps: [list] + +### Test Coverage (Agent 5): [N findings] +- Coverage gaps: [list] +- Untested paths: [list] + +### Type Design (Agent 6): [N findings] +- Weak types identified: [list] +- Type safety issues at: [file:line list] + +### Code Quality (Agent 7): [N findings] +- Convention baseline established: [patterns] +- Convention violations: [list] +- CLAUDE.md compliance issues: [list] + +### Comments (Agent 8): [N findings] +- Comment rot at: [file:line list] +- Stale TODOs: [list] + +### Language Specialist (Agent 9): [N findings] +- Language-specific issues: [list] +- Modern stdlib opportunities: [list] +- LLM code tells: [list] + +### Docs & Config (Agent 10): [N findings] +- Documentation gaps: [list] +- Dependency issues: [list] +- CI/build issues: [list] + +### ADR Compliance (Agent 11): [N findings] +- ADR decisions not implemented: [list] +- ADR contradictions: [list] +- Scope creep: [list] + +### Newcomer Perspective (Agent 12): [N findings] +- Documentation gaps: [list] +- Confusing code: [list] +- Implicit assumptions: [list] +- Onboarding friction: [list] +``` diff --git a/pipelines/comprehensive-review/references/wave-2-deep-dive.md b/pipelines/comprehensive-review/references/wave-2-deep-dive.md new file mode 100644 index 00000000..cee929f6 --- /dev/null +++ b/pipelines/comprehensive-review/references/wave-2-deep-dive.md @@ -0,0 +1,89 @@ +# Wave 2: Deep-Dive Agents (10 
agents) + +These agents receive Wave 0+1 aggregated findings as input. They perform targeted deep-dives informed by per-package analysis (Wave 0) and cross-cutting analysis (Wave 1). + +**ALL 10 agents MUST be dispatched in ONE message for true parallel execution.** + +Use `model: sonnet` for all Wave 2 agents. The orchestrator runs on Opus; dispatched agents run on Sonnet. + +## Agent Roster + +| # | Agent | Focus Area | Wave 1 Context Used | +|---|-------|------------|---------------------| +| 13 | `reviewer-performance` | Performance | Architecture findings → focus on hot paths | +| 14 | `reviewer-concurrency` | Concurrency | Silent-failure + architecture findings → concurrent paths | +| 15 | `reviewer-api-contract` | API Contracts | Business-logic + type-design findings → contract-sensitive code | +| 16 | `reviewer-dependency-audit` | Dependencies | Docs-validator findings → dependency documentation gaps | +| 17 | `reviewer-error-messages` | Error Messages | Silent-failure + code-quality findings → error paths | +| 18 | `reviewer-dead-code` | Dead Code | Code-quality + docs-validator findings → abandoned artifacts | +| 19 | `reviewer-naming-consistency` | Naming | Code-quality + language-specialist findings → convention baselines | +| 20 | `reviewer-observability` | Observability | Silent-failure findings → observability gaps at failure points | +| 21 | `reviewer-config-safety` | Config Safety | Security + docs-validator findings → config security gaps | +| 22 | `reviewer-migration-safety` | Migration Safety | API-contract + business-logic findings → migration-sensitive changes | + +## Standard Agent Prompt Template + +Each Wave 2 agent prompt includes the standard review scope PLUS the Wave 0+1 context: + +``` +REVIEW SCOPE: +- Files to review: [list of changed files] +- Change context: [what was changed and why, if known] +- Repository: [current directory] + +WAVE 0+1 CONTEXT (use this to focus your analysis): +[Insert $WAVE01_SUMMARY — loaded from
$REVIEW_DIR/wave01-summary.md] + +MCP TOOL DISCOVERY (do this FIRST, before any file reads): +- Use ToolSearch to check for available MCP tools that can enhance your analysis: + a. Run ToolSearch("gopls") — if Go files are in scope, this loads type-aware + analysis tools (go_file_context, go_diagnostics, go_symbol_references, etc.) + b. Run ToolSearch("context7") — loads library documentation lookup tools for + verifying dependency usage and API correctness +- If gopls tools are available AND this is a Go repository: + * Use go_file_context after reading any .go file to understand intra-package dependencies + * Use go_symbol_references before flagging unused or misused symbols + * Use go_diagnostics on files you flag to confirm real vs false-positive issues +- If Context7 tools are available: + * Use resolve-library-id + query-docs to verify library API usage in flagged code + +INSTRUCTIONS: +1. Read the CLAUDE.md file(s) in this repository first +2. Run MCP TOOL DISCOVERY steps above +3. Review the Wave 0 per-package context for package-level issues already found +4. Review the Wave 1 cross-cutting context for foundation issues already found +5. Use Wave 0+1 findings to FOCUS your deep-dive analysis: + - Prioritize packages flagged as NEEDS_ATTENTION or CRITICAL by Wave 0 + - Prioritize files and paths flagged by Wave 1 + - Look for issues in YOUR domain that neither Wave 0 nor Wave 1 would catch + - Cross-reference your findings with both waves to add depth +6. Do NOT simply repeat Wave 0 or Wave 1 findings — add NEW insights +7. Use MCP tools (gopls, Context7) during analysis where they add precision +8. Use structured output format with severity classification +9. 
Include file:line references for every finding + +OUTPUT FORMAT: +Return findings as: +### [CRITICAL|HIGH|MEDIUM|LOW]: [One-line summary] +**File**: `path/to/file:LINE` +**Issue**: [Description] +**Impact**: [Why this matters] +**Fix**: [Concrete code fix] +**Wave 0+1 Cross-Ref**: [Which earlier finding this relates to, if any] +--- +``` + +## Agent-Specific Context Instructions + +| Agent | Extra Context Instructions | +|-------|--------------------------| +| `reviewer-performance` | Use Wave 0 per-package findings to identify packages with complexity issues. Use Wave 1 architecture findings to identify hot paths. Focus on algorithmic complexity, N+1 queries, allocation waste. **MCP**: For Go, use gopls `go_symbol_references` to trace hot path call chains | +| `reviewer-concurrency` | Use Wave 0 per-package findings for concurrent patterns within packages. Use Wave 1 silent-failure + architecture findings for cross-package concurrent paths. Focus on races, goroutine leaks, deadlocks. **MCP**: For Go, use gopls `go_diagnostics` to detect race condition warnings | +| `reviewer-api-contract` | Use Wave 0 per-package findings to understand internal API surfaces. Use Wave 1 business-logic + type-design findings for contract-sensitive endpoints. Focus on breaking changes, status codes. **MCP**: Use Context7 to verify API contract claims against library docs | +| `reviewer-dependency-audit` | Use Wave 1 docs-validator findings to cross-reference documented vs actual dependencies. Run govulncheck/npm audit/pip-audit. Focus on CVEs, licenses, deprecated packages. **MCP**: Use Context7 `resolve-library-id` + `query-docs` to verify dependency API usage. For Go, use gopls `go_vulncheck` for vulnerability scanning | +| `reviewer-error-messages` | Use Wave 0 per-package error handling patterns. Use Wave 1 silent-failure + code-quality findings. Focus on error message quality, actionability, consistency. 
| +| `reviewer-dead-code` | Use Wave 0 per-package findings to identify unused internal APIs between files. Use Wave 1 code-quality + docs-validator findings. Focus on unreachable code, unused exports. | +| `reviewer-naming-consistency` | Use Wave 0 per-package naming patterns to detect intra-package drift. Use Wave 1 code-quality + language-specialist findings. Focus on cross-package naming consistency. | +| `reviewer-observability` | Use Wave 0 per-package findings for packages missing instrumentation. Use Wave 1 silent-failure findings for error paths missing observability. Focus on RED metrics gaps. | +| `reviewer-config-safety` | Use Wave 0 per-package findings for hardcoded values within packages. Use Wave 1 security + docs-validator findings. Focus on secrets, missing env var validation. | +| `reviewer-migration-safety` | Use Wave 1 api-contract + business-logic findings to identify migration-sensitive changes. Focus on reversible migrations, deprecation paths, rollback safety. | diff --git a/pipelines/comprehensive-review/references/wave-3-adversarial.md b/pipelines/comprehensive-review/references/wave-3-adversarial.md new file mode 100644 index 00000000..b04dc569 --- /dev/null +++ b/pipelines/comprehensive-review/references/wave-3-adversarial.md @@ -0,0 +1,105 @@ +# Wave 3: Adversarial Agents (4-5 agents) + +These agents receive Wave 0+1+2 aggregated findings as input. Their job is to CHALLENGE the consensus — not reinforce it. Wave 3 agents push back on findings, question whether issues are real, and surface tradeoffs that earlier waves may have accepted uncritically. + +**ALL Wave 3 agents MUST be dispatched in ONE message for true parallel execution.** + +Use `model: sonnet` for all Wave 3 agents. The orchestrator runs on Opus; dispatched agents run on Sonnet. + +## Agent Roster + +| # | Agent | Role | Challenge Focus | +|---|-------|------|----------------| +| 21 | `reviewer-contrarian` | Challenges findings | Are these findings actually important? 
Which are false positives? Which are over-severity? | +| 22 | `reviewer-skeptical-senior` | Experience-based skepticism | "I've seen this before" — which findings are theoretical vs real-world issues? | +| 23 | `reviewer-user-advocate` | User impact assessment | Does this change break users? Are UX tradeoffs justified? Are migration paths safe? | +| 24 | `reviewer-meta-process` | Process/approach review | Is this the right approach? Should the PR be split? Is the review itself focused correctly? | + +## Conditional: SAPCC Structural Review + +| # | Agent | Condition | Challenge Focus | +|---|-------|-----------|----------------| +| 25 | `reviewer-sapcc-structural` | Repo contains ANY of: `hybris/`, `core-customize/`, `config/localextensions.xml`, or `manifest.json` with `"commerceSuiteVersion"` | SAP Commerce Cloud structural integrity — extension wiring, build manifest, data model impacts | + +SAPCC detection: +```bash +SAPCC_DETECTED=false +if [ -d "hybris" ] || [ -d "core-customize" ] || [ -f "config/localextensions.xml" ]; then + SAPCC_DETECTED=true +fi +if [ -f "manifest.json" ] && grep -q '"commerceSuiteVersion"' manifest.json 2>/dev/null; then + SAPCC_DETECTED=true +fi +``` + +If `SAPCC_DETECTED=false`, skip `reviewer-sapcc-structural` silently (no warning, no log). + +## Standard Agent Prompt Template + +Each Wave 3 agent prompt includes the standard review scope PLUS the full Wave 0+1+2 context, with explicit instructions to CHALLENGE rather than reinforce: + +``` +ADVERSARIAL REVIEW — Wave 3 + +REVIEW SCOPE: +- Files to review: [list of changed files] +- Change context: [what was changed and why, if known] +- Repository: [current directory] + +WAVE 0+1+2 FINDINGS (the consensus you are challenging): +[Insert $WAVE012_SUMMARY — loaded from $REVIEW_DIR/wave012-summary.md] + +YOUR ROLE: You are an ADVERSARIAL reviewer. Your job is NOT to find new issues. +Your job is to CHALLENGE the findings above. Push back. Question severity. +Identify false positives. 
Flag overreactions. Surface tradeoffs that earlier +waves accepted without scrutiny. + +INSTRUCTIONS: +1. Read the CLAUDE.md file(s) in this repository first +2. Read the code being reviewed +3. Read the Wave 0+1+2 findings carefully +4. For each finding from earlier waves, determine: + a. AGREE — the finding is real, correctly classified, and worth fixing + b. CHALLENGE — the finding is questionable (explain why) + c. DOWNGRADE — the finding is real but over-classified (suggest correct severity) + d. DISMISS — the finding is a false positive or not worth fixing (provide evidence) +5. Surface any tradeoffs or second-order effects the earlier waves missed +6. Be specific — vague disagreement is not useful + +OUTPUT FORMAT: +### CHALLENGE: [One-line summary of what you're challenging] +**Original finding**: [Wave N, Agent, Severity: summary] +**Your verdict**: [AGREE | CHALLENGE | DOWNGRADE | DISMISS] +**Reasoning**: [Why you disagree or agree] +**Evidence**: [Code reference, real-world precedent, or logical argument] +**Suggested action**: [Keep as-is | Reduce to MEDIUM | Drop | Needs human judgment] +--- +``` + +## Agent-Specific Prompt Additions + +| Agent | Extra Instructions | +|-------|-------------------| +| `reviewer-contrarian` | Challenge every HIGH and CRITICAL finding. Are they actually important? Which are false positives? Which are over-classified? Look for findings where Wave 1+2 agents reinforced each other's bias rather than independently verifying. Question whether suggested fixes introduce new problems. | +| `reviewer-skeptical-senior` | Apply 10+ years of engineering experience. Which findings are theoretical risks that never manifest in practice? Which are textbook answers that don't apply to this codebase's scale/context? Flag "resume-driven" suggestions (over-engineering, premature optimization). Identify findings where the cure is worse than the disease. | +| `reviewer-user-advocate` | Focus exclusively on user impact. 
Does this change break existing users? Are migration paths safe? Are UX tradeoffs justified? Challenge findings that improve code quality at the expense of user experience. Flag findings that ignore backward compatibility. Question whether "fixing" something makes it harder for users. | +| `reviewer-meta-process` | Step back from individual findings. Is the overall approach correct? Should this PR be split into smaller PRs? Are the right problems being solved? Is the review itself focused on the right things? Flag cases where the review is bikeshedding on style while missing structural issues. Question whether the fix phase will create more churn than the findings are worth. | +| `reviewer-sapcc-structural` | **(SAPCC repos only)** Challenge findings through SAP Commerce Cloud structural lens. Do findings account for hybris extension lifecycle? Are suggested fixes compatible with the SAP build system? Do architecture recommendations respect CCv2 manifest constraints? Flag findings that would break extension wiring or data model migrations. 
| + +## Wave Agreement Labels + +After Wave 3 completes, every finding in the final report MUST carry one of these labels: + +| Label | Meaning | Criteria | Action | +|-------|---------|----------|--------| +| **UNANIMOUS** | All waves agree | Wave 1+2 found it AND Wave 3 agrees (or does not challenge) | HIGH confidence — fix without hesitation | +| **MAJORITY** | Most waves agree | Wave 1+2 found it AND 1-2 Wave 3 agents challenge but others agree | Fix, but note the challenge in the report | +| **CONTESTED** | Wave 3 contradicts Wave 1+2 | Wave 1+2 found it BUT 3+ Wave 3 agents challenge or dismiss | Needs human judgment — present both arguments | + +Wave 3 challenge verdicts: +- **AGREE** verdicts: Reinforces the original finding (increases confidence) +- **CHALLENGE** verdicts: Flags the finding for human review +- **DOWNGRADE** verdicts: Suggests lower severity (adjust if multiple Wave 3 agents agree) +- **DISMISS** verdicts: Suggests dropping the finding (only drop if 2+ Wave 3 agents dismiss AND no Wave 1+2 agent rated it CRITICAL) + +If Wave 3 returns >90% AGREE across all findings, note in the report: "Wave 3 did not provide meaningful challenge — findings may benefit from human review." diff --git a/pipelines/de-ai-pipeline/SKILL.md b/pipelines/de-ai-pipeline/SKILL.md index f4ef3d75..cadbb772 100644 --- a/pipelines/de-ai-pipeline/SKILL.md +++ b/pipelines/de-ai-pipeline/SKILL.md @@ -35,40 +35,8 @@ routing: # De-AI Pipeline -## Operator Context - Automated scan-fix-verify loop that removes AI writing patterns from documentation files. Uses `scripts/scan-ai-patterns.py` for deterministic detection against `scripts/data/banned-patterns.json` (323 patterns, 24 categories), then dispatches fix agents per file, then re-scans to verify fixes. Repeats until zero errors or max 3 iterations. -### Hardcoded Behaviors (Always Apply) -- **Script-First Detection**: Use `scan-ai-patterns.py` for all scanning. Do not self-assess AI patterns. 
-- **Preserve Meaning**: Fixes must not change factual content. Only rephrase, never remove information. -- **Skip Zones**: Do not modify content inside code blocks, YAML frontmatter, inline code, or blockquotes. -- **Max 3 Iterations**: Stop after 3 scan-fix cycles even if errors remain. Report remaining errors. -- **False Positive Awareness**: When a pattern match is a technical term, skill name, or schema value, skip it and note the false positive. - -### Default Behaviors (ON unless disabled) -- **Parallel File Fixes**: When 2+ files have errors, dispatch fix agents in parallel. -- **Commit After Clean**: After reaching zero errors, stage and report (do not auto-commit). - -### Optional Behaviors (OFF unless enabled) -- **Single File Mode**: Pass a specific file path to scan and fix only that file. -- **Scan Only**: Run the scanner without fixing. Report errors and exit. - -## What This Skill CAN Do -- Scan all docs for AI pattern violations using deterministic regex matching -- Fix detected patterns by rephrasing sentences to avoid triggers -- Verify fixes by re-scanning after edits -- Handle em-dash replacement, jargon substitution, structural rewrites -- Report false positives for pattern refinement - -## What This Skill CANNOT Do -- Fix patterns in agent definitions or skill files (those are system prompts, not prose) -- Remove technical terms that happen to match patterns (false positives) -- Rewrite entire documents (targeted fixes only) -- Change meaning to avoid a pattern match - ---- - ## Instructions ### Phase 1: SCAN @@ -77,6 +45,8 @@ Automated scan-fix-verify loop that removes AI writing patterns from documentati **Step 1: Run scan-ai-patterns.py** +Always use the scanner script for detection — never self-assess a file as "clean" without running the tool. The scanner catches patterns humans miss. + ```bash python3 ~/.claude/scripts/scan-ai-patterns.py --errors-only --json ``` @@ -86,20 +56,20 @@ Parse the JSON output. Group hits by file. 
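The "parse the JSON output, group hits by file" step can be sketched roughly as follows. This is a minimal illustration, assuming the scanner's `--json` output is a list of hit objects that each carry a `file` key; that schema is an assumption (the real output of `scan-ai-patterns.py` may be shaped differently), and `group_hits_by_file` is a hypothetical helper, not part of the toolkit.

```python
import json
from collections import defaultdict


def group_hits_by_file(raw_json: str) -> dict[str, list[dict]]:
    """Group scanner hits by the file they were found in.

    Assumes (hypothetically) that the scanner prints a JSON list of objects,
    each with at least a "file" key. Adjust the key names to the real schema.
    """
    hits = json.loads(raw_json)
    grouped: dict[str, list[dict]] = defaultdict(list)
    for hit in hits:
        grouped[hit["file"]].append(hit)
    return dict(grouped)


# Illustrative input only; these hits are made up, not real scanner output.
sample = json.dumps([
    {"file": "docs/a.md", "line": 3, "category": "exploration_verbs"},
    {"file": "docs/b.md", "line": 9, "category": "ai_significance_phrases"},
    {"file": "docs/a.md", "line": 12, "category": "exploration_verbs"},
])
for path, file_hits in group_hits_by_file(sample).items():
    print(path, len(file_hits))
```

Grouping up front also gives each per-file fix agent exactly the hits it needs, which is what the parallel dispatch in Phase 2 relies on.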
**Step 2: Classify results** For each hit, classify as: -- **Fixable**: The pattern is a genuine AI tell that can be rephrased -- **False positive**: The pattern match is a technical term, skill name, command name, or schema value +- **Fixable**: The pattern is a genuine AI tell that can be rephrased while preserving all factual content. +- **False positive**: The pattern match is a technical term, skill name, command name, or schema value that should be skipped. -Report false positives separately for pattern refinement. +Report false positives separately for pattern refinement. These become data for improving the banned-patterns database over time. **Gate**: Scan complete. If zero errors, report clean and stop. If errors found, proceed to Phase 2. ### Phase 2: FIX -**Goal**: Fix all errors in each file. +**Goal**: Fix all errors in each file while preserving meaning. -**Step 1: For each file with errors, apply targeted fixes** +**Step 1: Read each file with errors and apply targeted fixes** -Read the file. For each error hit: +For each error hit, use this strategy table: | Pattern Category | Fix Strategy | |-----------------|--------------| @@ -124,7 +94,9 @@ Read the file. For each error hit: | `ai_significance_phrases` | State the significance directly | | `exploration_verbs` | Use direct verbs: examine, look at, consider | -**Step 2: Preserve skip zones** +**Critical constraint**: Preserve all factual content. Only rephrase sentences; never remove information to avoid a pattern match. If rephrasing is hard, skip the hit and note it in the final report. + +**Step 2: Respect protected zones** Do NOT modify text inside: - Code blocks (``` ... ```) @@ -132,9 +104,11 @@ Do NOT modify text inside: - YAML frontmatter (--- ... ---) - Blockquotes (> ...) -**Step 3: For 2+ files, fix in parallel** +These zones stay untouched. Scan results inside protected zones are automatically false positives. 
-When multiple files have errors, dispatch one Agent per file to fix them simultaneously. Each agent gets the file path and the specific hits to fix. +**Step 3: Dispatch parallel fixes for multiple files** + +When 2 or more files have errors, dispatch one Agent per file to fix them simultaneously. Each agent gets the file path and the specific hits to fix. This parallelizes the work. **Gate**: All fixable errors addressed. Proceed to Phase 3. @@ -142,7 +116,7 @@ When multiple files have errors, dispatch one Agent per file to fix them simulta **Goal**: Re-scan to confirm fixes worked and no new errors were introduced. -**Step 1: Re-run scanner** +**Step 1: Re-run the scanner** ```bash python3 ~/.claude/scripts/scan-ai-patterns.py --errors-only @@ -151,13 +125,15 @@ python3 ~/.claude/scripts/scan-ai-patterns.py --errors-only **Step 2: Check results** - **Zero errors**: Report success. Proceed to Phase 4 (REPORT). -- **Errors remain**: Increment iteration counter. If < 3 iterations, return to Phase 2. If = 3, proceed to Phase 4 with remaining errors noted. +- **Errors remain**: Increment iteration counter. If less than 3 iterations, return to Phase 2 and fix again. If 3 iterations reached, proceed to Phase 4 with remaining errors noted in the report. + +This max-3-iteration constraint stops infinite loops while capturing as many fixes as possible. **Gate**: Either zero errors or max iterations reached. Proceed to Phase 4. ### Phase 4: REPORT -**Goal**: Report results and stage changes. +**Goal**: Report results and stage changes. Do not commit automatically — let the user decide. **Step 1: Report** @@ -177,7 +153,7 @@ git add [fixed files] git status ``` -Do NOT commit. Report staged files and let the user decide. +Report staged files. Do not run `git commit` — the user owns the final commit decision. **Gate**: Report delivered. Pipeline complete. @@ -186,33 +162,20 @@ Do NOT commit. Report staged files and let the user decide. 
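The scan-fix-verify control flow from Phases 1 through 4 can be sketched as a small loop. This is a sketch under stated assumptions, not the pipeline's actual implementation: `run_pipeline`, `scan`, and `fix` are hypothetical names, and the scanner and fixer are passed in as plain callables so the loop logic can be read (and tested) on its own.

```python
from typing import Callable

MAX_ITERATIONS = 3  # the iteration cap stated in Phase 3


def run_pipeline(
    scan: Callable[[], list[dict]],
    fix: Callable[[list[dict]], None],
    max_iterations: int = MAX_ITERATIONS,
) -> tuple[int, list[dict]]:
    """Run scan -> fix -> re-scan until clean or the iteration cap is hit.

    Returns (fix_cycles_run, remaining_errors).
    """
    errors = scan()            # Phase 1: initial scan
    iterations = 0
    while errors and iterations < max_iterations:
        fix(errors)            # Phase 2: targeted fixes
        errors = scan()        # Phase 3: verify by re-scanning
        iterations += 1
    return iterations, errors  # Phase 4 reports whatever remains


# Stand-in scanner and fixer for illustration: each fix removes one error.
pending = [{"file": "docs/a.md", "line": 3}, {"file": "docs/a.md", "line": 12}]
cycles, remaining = run_pipeline(
    scan=lambda: list(pending),
    fix=lambda errors: pending.pop(),
)
print(cycles, len(remaining))
```

Returning the leftover errors instead of raising keeps the "report remaining errors" behavior explicit: the caller decides how to present them, and nothing is committed automatically.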
## Error Handling ### Error: "scan-ai-patterns.py not found" -Cause: Script not in expected location -Solution: Check `scripts/scan-ai-patterns.py` exists. If not, the toolkit may need re-installation. +**Cause**: Script not in expected location +**Solution**: Check `scripts/scan-ai-patterns.py` exists. If not, the toolkit may need re-installation. ### Error: "Pattern match is a false positive" -Cause: Technical term, skill name, or schema value matches a banned pattern -Solution: Skip the fix, note the false positive in the report. Suggest pattern refinement (e.g., tighten the regex, add context requirement). +**Cause**: Technical term, skill name, or schema value matches a banned pattern +**Solution**: Skip the fix. Note the false positive in the report and suggest pattern refinement (e.g., tighten the regex, add context requirement). These notes feed back into the pattern database. ### Error: "Fix introduces new errors" -Cause: Rephrased sentence contains a different banned pattern -Solution: Rephrase again avoiding both patterns. If stuck after 3 attempts on one sentence, skip and note in report. +**Cause**: Rephrased sentence contains a different banned pattern +**Solution**: Rephrase again avoiding both patterns. If stuck after 3 attempts on one sentence, skip and note in report. Don't sacrifice clarity for pattern avoidance. --- -## Anti-Patterns - -### Anti-Pattern 1: Self-Assessing Instead of Scanning -**What it looks like**: Deciding a file "looks clean" without running the script -**Do instead**: Run `scan-ai-patterns.py`. The script catches patterns humans miss. - -### Anti-Pattern 2: Changing Meaning to Avoid Patterns -**What it looks like**: Removing a sentence because rephrasing is hard -**Do instead**: Preserve all factual content. Rephrase, don't delete. - -### Anti-Pattern 3: Ignoring False Positives -**What it looks like**: Forcing awkward rewrites for legitimate technical terms -**Do instead**: Note the false positive in the report. 
The pattern database gets refined over time. - ## References -- [Anti-Rationalization](../../skills/shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations during scan-fix-verify loops +- `scripts/scan-ai-patterns.py` — Deterministic pattern scanner with 323 banned patterns across 24 categories +- `scripts/data/banned-patterns.json` — Pattern database with regex rules and categories diff --git a/pipelines/do-perspectives/SKILL.md b/pipelines/do-perspectives/SKILL.md index 471b44c8..f9e797e7 100644 --- a/pipelines/do-perspectives/SKILL.md +++ b/pipelines/do-perspectives/SKILL.md @@ -32,55 +32,18 @@ routing: # Multi-Perspective Analysis Skill -## Operator Context - -This skill operates as an operator for inline multi-perspective analysis, configuring Claude's behavior for comprehensive, cost-effective pattern extraction from source material. It implements the **Pipeline** architectural pattern -- Analyze, Synthesize, Apply, Verify -- with **Domain Intelligence** embedded across 10 analytical frameworks. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before starting analysis -- **Over-Engineering Prevention**: Extract actionable patterns only. No speculative rules, no "might be useful" additions -- **Source Material Required**: NEVER begin analysis without valid source material loaded and verified -- **All 10 Perspectives**: Complete all 10 analytical lenses. 
No skipping perspectives because "enough patterns found" -- **Sequential Integrity**: Each perspective produces documented output before moving to the next -- **Synthesis Before Application**: NEVER apply improvements without completing the synthesis phase first - -### Default Behaviors (ON unless disabled) -- **Cross-Reference Tracking**: Note when multiple perspectives surface the same pattern -- **Priority Ranking**: Rank extracted rules by how many perspectives support them -- **Concrete Examples**: Include specific quotes or references from source material for each rule -- **Token Budget Awareness**: Target 10,000-15,000 tokens total; flag if source material will exceed budget -- **Completion Report**: Generate structured report showing all phases, findings, and changes -- **Git Commit**: Create commit documenting improvements after successful application - -### Optional Behaviors (OFF unless enabled) -- **Parallel Comparison**: Run /do-parallel afterward to compare sequential vs parallel findings -- **Deep Dive Mode**: Spend 500-800 words per perspective instead of 200-500 -- **Multi-Target**: Apply findings to multiple related agents or skills in one session - -## What This Skill CAN Do -- Analyze source material through 10 distinct analytical frameworks sequentially -- Extract actionable patterns and rules from articles, documentation, and code -- Synthesize findings across perspectives into priority-ranked recommendations -- Apply synthesized improvements to a target agent or skill -- Generate completion reports documenting all analysis and changes - -## What This Skill CANNOT Do -- Replace true parallel analysis (use /do-parallel for zero cross-contamination) -- Improve targets without source material to analyze -- Skip perspectives or phases to save tokens -- Make improvements without the synthesis step -- Guarantee independence between perspectives (sequential analysis has inherent anchoring) - ---- +This skill extracts actionable patterns from source 
material through sequential analysis of 10 analytical frameworks. It operates as a **Pipeline** with four gates: VALIDATE INPUTS → MULTI-PERSPECTIVE ANALYSIS → SYNTHESIZE → APPLY. ## Instructions -$ARGUMENTS - Target agent/skill name + source material path (file path or inline text) +**$ARGUMENTS** - Target agent/skill name + source material path (file path or inline text) ### Phase 1: VALIDATE INPUTS **Goal**: Confirm target exists and source material is loaded before any analysis. +**Hardcoded requirement** (always apply): Source material must be loaded and verified before proceeding. This prevents wasted analysis cycles on empty or invalid input. + **Step 1: Parse arguments** - First argument: target agent or skill name - Second argument: source material file path or inline text @@ -92,7 +55,7 @@ $ARGUMENTS - Target agent/skill name + source material path (file path or inline **Step 3: Load source material** - If file path: read and verify file is non-empty - If inline text: confirm sufficient content for analysis (minimum ~500 words) -- Estimate token budget based on source material length +- Estimate token budget based on source material length (target: 10,000-15,000 tokens total; flag if source material will exceed budget) **Gate**: Target exists and is readable. Source material is loaded and non-trivial. Proceed only when gate passes. @@ -100,6 +63,10 @@ $ARGUMENTS - Target agent/skill name + source material path (file path or inline **Goal**: Analyze source material through all 10 analytical lenses, producing documented findings for each. +**Hardcoded requirement** (always apply): Complete all 10 perspectives. No skipping perspectives because "enough patterns found." Later perspectives often surface non-obvious patterns that earlier ones miss. The value of 10 lenses is comprehensiveness. If a perspective yields low signal, mark it as such in the report, but still complete the analysis. 
+ +**Default behavior** (ON unless disabled): For each perspective, include cross-reference tracking — note when multiple perspectives surface the same pattern. This reveals high-confidence rules. + For each perspective, produce output in this format: ```markdown @@ -131,10 +98,10 @@ For each perspective, produce output in this format: 9. **Complexity Management** -- Approach to difficult topics, gradual escalation, when to be thorough vs concise 10. **Limitation and Nuance Handling** -- Caveats, edge cases, uncertainty acknowledgment, trade-off presentation -**Constraints**: -- 200-500 words per perspective (focused, not padded) +**Constraints on analysis output**: +- 200-500 words per perspective (focused, not padded): Over-engineering prevention applies here. Extract actionable patterns only. No speculative rules, no "might be useful" additions. - 3-5 extracted rules per perspective -- Each perspective MUST reference specific content from the source material +- Each perspective MUST reference specific content from the source material, not generic observations that could apply anywhere. Generic rules add no value. The purpose is extracting patterns specific to the source material. - Cross-references to other perspectives are encouraged but optional **Gate**: All 10 perspectives documented with observations, rules, and source references. Proceed only when gate passes. @@ -143,8 +110,10 @@ For each perspective, produce output in this format: **Goal**: Unify findings across all perspectives into priority-ranked recommendations. +**Hardcoded requirement** (always apply): Synthesis before application. NEVER apply improvements without completing the synthesis phase first. Without synthesis, you apply every extracted rule equally. Priority ranking prevents over-engineering and focuses on high-signal patterns. 
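The priority ranking described above can be sketched as counting how many perspectives support each extracted rule. The 4-perspective threshold follows the "4+ perspectives are high-signal" rule in Step 1 of this phase; the input shape and the helper name `rank_rules` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

HIGH_SIGNAL_THRESHOLD = 4  # "4+ perspectives" from the synthesis step


def rank_rules(findings: dict[str, list[str]]) -> list[tuple[str, int, bool]]:
    """Rank rules by how many perspectives surfaced them.

    `findings` maps a perspective name to the rules it extracted (a
    hypothetical shape). Returns (rule, support_count, is_high_signal)
    tuples, most-supported first, ties broken alphabetically.
    """
    support: dict[str, set[str]] = defaultdict(set)
    for perspective, rules in findings.items():
        for rule in rules:
            support[rule].add(perspective)
    ranked = sorted(support.items(), key=lambda item: (-len(item[1]), item[0]))
    return [
        (rule, len(who), len(who) >= HIGH_SIGNAL_THRESHOLD)
        for rule, who in ranked
    ]


# Made-up findings for illustration only.
findings = {
    "perspective-1": ["lead with a concrete example", "state trade-offs"],
    "perspective-2": ["lead with a concrete example"],
    "perspective-3": ["lead with a concrete example"],
    "perspective-4": ["lead with a concrete example", "state trade-offs"],
}
for rule, count, high_signal in rank_rules(findings):
    print(f"{count} perspectives, high-signal={high_signal}: {rule}")
```

Tracking supporting perspectives as a set (rather than a bare counter) means a rule repeated twice within one perspective still counts once, which matches the idea of independent lenses agreeing.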
+ **Step 1: Identify common themes** -- Patterns that appeared in 4+ perspectives are high-signal +- Patterns that appeared in 4+ perspectives are high-signal (supported by multiple lenses) - Note convergence and reinforcement across lenses **Step 2: Extract unique insights** @@ -180,6 +149,8 @@ Rules supported by 1-3 perspectives or moderate impact **Goal**: Improve the target agent or skill using synthesized recommendations. +**Hardcoded requirement** (always apply): Do NOT remove existing working patterns. Existing patterns were validated through prior use. New patterns should augment, not replace. Only modify existing content when source material provides a strictly better approach. + **Step 1: Read current target** - Load the current agent or skill file - Identify which sections map to Priority 1 and 2 recommendations @@ -201,6 +172,8 @@ Rules supported by 1-3 perspectives or moderate impact **Goal**: Confirm changes are valid and generate completion report. +**Default behavior** (ON unless disabled): Generate structured completion report showing all phases, findings, and changes. This artifact serves as a reference for future improvements. + **Step 1: Verify file** - Re-read the modified file to confirm it is well-formed - Check YAML frontmatter parses correctly @@ -297,42 +270,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Skipping Perspectives -**What it looks like**: "5 perspectives found enough patterns, skipping the rest" -**Why wrong**: Later perspectives often surface non-obvious patterns that earlier ones miss. The value of 10 lenses is comprehensiveness. -**Do instead**: Complete all 10 perspectives. Mark low-yield ones as such in the report. - -### Anti-Pattern 2: Applying Without Synthesizing -**What it looks like**: Jumping from analysis directly to editing the target file -**Why wrong**: Without synthesis, you apply every extracted rule equally. Priority ranking prevents over-engineering and focuses on high-signal patterns. 
-**Do instead**: Complete Phase 3 synthesis. Apply only Priority 1 and 2 rules. - -### Anti-Pattern 3: Generic Rules Without Source References -**What it looks like**: Extracted rules that could apply to any source material -**Why wrong**: Generic rules add no value. The purpose is extracting patterns specific to the source material. -**Do instead**: Every rule MUST reference a specific passage, example, or technique from the source. - -### Anti-Pattern 4: Removing Existing Content -**What it looks like**: Replacing working patterns in the target with new patterns from analysis -**Why wrong**: Existing patterns were validated through prior use. New patterns should augment, not replace. -**Do instead**: ADD depth and new sections. Modify existing content only when source material provides a strictly better approach. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Pipeline Architecture](../shared-patterns/pipeline-architecture.md) - Phase-based execution patterns - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "5 perspectives is enough" | Comprehensiveness is the value proposition | Complete all 10 lenses | -| "I can see the patterns already" | Seeing patterns is not the same as systematic extraction | Follow the full analysis framework | -| "Synthesis is just overhead" | Without ranking, all rules get equal weight | Complete Priority 1/2/3 ranking | -| "Source material is obvious" | Obvious material still has non-obvious structural patterns | Analyze through all lenses regardless | +- [CLAUDE.md Compliance](../../../CLAUDE.md) - Verification over assumption, artifacts over memory, authentic over polished +- Pipeline Architecture 
pattern - Phase-based execution with gates +- Sequential analysis constraints - Cross-contamination inherent in sequential analysis (see /do-parallel for zero cross-contamination alternative) diff --git a/pipelines/doc-pipeline/SKILL.md b/pipelines/doc-pipeline/SKILL.md index e34d6dc6..fbb7d58e 100644 --- a/pipelines/doc-pipeline/SKILL.md +++ b/pipelines/doc-pipeline/SKILL.md @@ -41,51 +41,16 @@ routing: # Documentation Pipeline Skill -## Operator Context - -This skill operates as an operator for structured documentation creation, configuring Claude's behavior for thorough, research-backed technical writing. It implements the **Pipeline** architectural pattern -- Research, Outline, Generate, Verify, Output -- with **Artifact Persistence** at each phase ensuring no work is lost to context decay. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before any documentation work -- **Research Before Writing**: NEVER draft documentation without completing research phase first -- **Verify All Examples**: Every code example must be executed and proven to work -- **Artifact Persistence**: Each phase produces a saved file, not just context -- **Scope Discipline**: Document what exists. Do not document aspirational features or planned work. - -### Default Behaviors (ON unless disabled) -- **Parallel Research**: Launch subagents for code analysis, usage patterns, and context gathering simultaneously -- **Save Phase Artifacts**: Write `doc-research.md`, `doc-outline.md`, `doc-draft.md` at each phase -- **Run Code Examples**: Execute all code snippets to confirm correctness before including them -- **Natural Voice**: Use clear, direct prose. Avoid corporate jargon and sterile technical writing. 
-- **Audience Awareness**: Identify target audience in research phase and write at their level - -### Optional Behaviors (OFF unless enabled) -- **Skip Research**: Use `--skip-research` for trivial docs where subject is already well-understood -- **Draft Only**: Use `--draft` to produce documentation without verification phase -- **Voice Override**: Use `--voice [name]` to apply a specific voice profile instead of default - -## What This Skill CAN Do -- Create documentation from scratch through structured research and generation -- Run parallel subagents to gather code analysis, usage patterns, and context simultaneously -- Verify code examples, installation steps, and API signatures against actual code -- Produce phased artifacts so work survives context limits -- Generate READMEs, API docs, usage guides, and technical reference material - -## What This Skill CANNOT Do -- Edit or update existing documentation in place (use manual editing instead) -- Write blog posts or marketing content (use research-to-article instead) -- Generate documentation without researching the subject first -- Skip verification of code examples (broken examples are worse than no examples) -- Produce documentation for code that does not yet exist - ---- - ## Instructions +This skill implements a **5-phase pipeline** for structured documentation creation with artifact persistence at each phase. Every phase produces a saved file to prevent work loss to context decay. Follow CLAUDE.md before starting: read and follow repository CLAUDE.md before any documentation work. + ### Phase 1: RESEARCH **Goal**: Gather comprehensive understanding of the subject before writing a single line of documentation. +**Why research first**: Jumping straight to generating prose based on a quick glance at the code produces shallow documentation that misses edge cases, prerequisites, and actual usage patterns. Users will not trust docs that omit critical details. 
Aspirational docs rot faster than any other kind of documentation. + **Step 1: Launch parallel subagents** Three subagents run simultaneously with a 5-minute timeout: @@ -96,6 +61,11 @@ Three subagents run simultaneously with a 5-minute timeout: | Usage Patterns | Test examples, common patterns, edge cases, error scenarios | Examples with context | | Context Gathering | Problem solved, audience, prerequisites, related docs | Context summary | +If a subagent times out (subject is too broad or codebase is very large): +1. Narrow the research scope to a specific module or component +2. Re-launch only the timed-out subagent with a more focused prompt +3. If still timing out, run research sequentially instead of in parallel + **Step 2: Compile research** Merge subagent findings into a single research document. Resolve any contradictions between subagents by re-reading the actual source code. @@ -110,6 +80,8 @@ Write compiled research to `doc-research.md` in the working directory. **Goal**: Structure documentation based on research findings before generating prose. +**Why outline first**: Outlining forces you to identify gaps where research is insufficient before you invest time in prose. It also prevents over-documentation where every private method and implementation detail makes readers unable to find what they actually need. + **Step 1: Identify sections** Standard documentation structure (adapt based on subject): @@ -127,7 +99,7 @@ Standard documentation structure (adapt based on subject): **Step 2: Assign research to sections** -Map each research finding to the section where it belongs. Identify gaps where research is insufficient and gather additional information. +Map each research finding to the section where it belongs. Identify gaps where research is insufficient and gather additional information. Document only public APIs, common tasks, and what users actually need to accomplish — not internal constants or every private method. 
**Step 3: Save artifact** @@ -139,6 +111,8 @@ Write outline to `doc-outline.md` in the working directory. **Goal**: Write documentation that is clear, accurate, and useful to the target audience. +**Why verify before publishing**: Documentation that sounds like a press release ("The system leverages enterprise-grade functionality...") drives readers away. Write like you're explaining to a colleague. Be direct. Be specific. Cut filler. + **Step 1: Load context** Read `doc-research.md` and `doc-outline.md` to ground generation in verified facts. @@ -161,10 +135,18 @@ Write complete draft to `doc-draft.md` in the working directory. **Goal**: Prove every claim and example in the documentation is accurate. +**Why verification is mandatory**: Unverified docs erode trust over time. Broken examples destroy documentation credibility — one bad example makes users distrust all examples. If examples were derived from outdated patterns, test mocks, or incomplete context, execution will reveal the problem. Never document code that does not yet exist or features that are aspirational. + **Step 1: Execute code examples** Run every code snippet in the documentation. Capture output. Compare against documented expectations. +If an example fails: +1. Re-read the actual source code for the function being demonstrated +2. Check if the function requires specific setup or environment +3. Rewrite the example based on working test cases in the codebase +4. If the function itself is broken, document the limitation rather than a broken example + **Step 2: Validate API signatures** Cross-reference every function name, parameter, and return type against actual source code. @@ -182,6 +164,12 @@ For each verification failure: Maximum 3 fix-and-verify iterations. If still failing after 3 attempts, flag the section with a `` comment and proceed. +**Special case — No Usage Patterns**: If research finds no usage patterns (code is new, untested, or internal-only with no existing consumers): +1. 
Check git history for how the author used the code in commits +2. Look for related test files that exercise the code paths +3. Read the code directly and construct examples from the API surface +4. Flag the documentation as "based on API analysis, not observed usage" + **Gate**: All code examples execute. API signatures match source. Installation steps work. Proceed only when gate passes. ### Phase 5: OUTPUT @@ -240,47 +228,6 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Writing Without Research -**What it looks like**: Jumping straight to generating prose based on a quick glance at the code -**Why wrong**: Produces shallow documentation that misses edge cases, prerequisites, and actual usage patterns. Users will not trust docs that omit critical details. -**Do instead**: Complete Phase 1 fully. Save the research artifact. Only then proceed to outlining. - -### Anti-Pattern 2: Documenting Aspirational Features -**What it looks like**: "This module will support clustering in the future" or documenting planned APIs -**Why wrong**: Users try to use documented features and fail. Aspirational docs rot faster than any other kind. -**Do instead**: Document only what exists and works today. Use a roadmap file for future plans. - -### Anti-Pattern 3: Unverified Code Examples -**What it looks like**: Including code snippets that "should work" without executing them -**Why wrong**: Broken examples destroy documentation credibility. One bad example makes users distrust all examples. -**Do instead**: Execute every example in Phase 4. If it fails, fix it or remove it. - -### Anti-Pattern 4: Over-Documentation -**What it looks like**: Documenting every private method, internal constant, and implementation detail -**Why wrong**: Readers cannot find what they need in a wall of irrelevant detail. Signal-to-noise ratio collapses. -**Do instead**: Focus on public APIs, common tasks, and what users actually need to accomplish. 
- -### Anti-Pattern 5: Sterile Corporate Voice -**What it looks like**: "The system leverages enterprise-grade functionality to facilitate documentation workflows" -**Why wrong**: Nobody reads documentation that sounds like a press release. Users skim past corporate filler. -**Do instead**: Write like you are explaining to a colleague. Be direct. Be specific. Cut filler. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations during research and verification -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks before declaring documentation done -- [Pipeline Architecture](../shared-patterns/pipeline-architecture.md) - Standard pipeline phase structure and artifact management - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I know this code well enough to skip research" | Familiarity breeds blind spots | Complete Phase 1, save artifact | -| "Examples are obvious, no need to run them" | Obvious examples break in surprising ways | Execute every example in Phase 4 | -| "Draft is good enough, skip verification" | Unverified docs erode trust over time | Complete Phase 4 fully | -| "Nobody reads this section anyway" | You don't know what users read | Document all outlined sections | +This skill follows CLAUDE.md requirements for artifact persistence, verification gates, and scope discipline. Every claim must be verified against source code before publication. Code examples must be executed. Installation steps must be tested end-to-end. Documentation must represent what exists today, not aspirational features. 
diff --git a/pipelines/domain-research/SKILL.md b/pipelines/domain-research/SKILL.md index 051ffb3e..2cffc48c 100644 --- a/pipelines/domain-research/SKILL.md +++ b/pipelines/domain-research/SKILL.md @@ -36,41 +36,15 @@ routing: # Domain Research Skill -## Operator Context - -This skill operates as an operator for domain decomposition, configuring Claude's behavior for discovering, classifying, and mapping subdomains within a target domain before pipeline generation begins. It implements a **Research-Classify-Map-Produce** pattern — broad parallel discovery narrows into structured classification, then maps to pipeline chains, then produces a Component Manifest. - -This skill is the first step in the self-improving pipeline generator (see `adr/self-improving-pipeline-generator.md`). It answers: "What subdomains exist in this domain, and what kind of pipeline does each one need?" - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution. Project instructions override default skill behaviors. -- **Parallel Research Enforcement (Rule 12)**: Phase 1 MUST dispatch 4 parallel research agents. Sequential research is BANNED. WHY: A/B testing proved parallel research eliminates a 1.40-point gap in Examples quality (`adr/pipeline-creator-ab-test.md`). Sequential grep-based research produces shallower, less diverse findings. -- **Dual-Layer Artifacts**: Every phase produces both `manifest.json` (machine envelope) and `content.md` (human-readable output). WHY: The Pipeline Architect needs machine-readable metadata to validate chain composition, while agents need readable content for reasoning. See ADR "Artifact Format: Dual-Layer Output Standard". -- **Discovery Over Invention**: The skill discovers subdomains through research — it does NOT hardcode or assume subdomain lists. WHY: Hardcoded lists miss domain-specific nuances and become stale. The whole point of this skill is that it adapts to any domain. 
-- **Reuse Before Create (Rule 9)**: When classifying subdomains, always check if existing agents/skills cover 70%+ of the subdomain before marking it as "needs new component". WHY: Agents are expensive context. Skills are cheap. The generator biases toward binding new skills to existing agents. - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report findings without self-congratulation. Show the subdomain list and classifications directly rather than describing the process. -- **Temporary File Cleanup**: Phase artifacts live in `/tmp/pipeline-{run-id}/`. The PRODUCE phase copies final artifacts to permanent location. Intermediate files remain for debugging until the pipeline-orchestrator-engineer cleans up. -- **Operator Profile Detection**: Read the detected operator profile from pipeline context but do NOT gate any research steps on it. WHY: Research itself is read-only and harmless across all profiles. The profile information is passed through to the Component Manifest so downstream skills (chain-composer, scaffolder) can apply the correct safety gates. - -### Optional Behaviors (OFF unless enabled) -- **Deep Reference Research**: Agent 4 (Reference Research) fetches external documentation URLs. OFF by default because it requires network access and increases latency. -- **Verbose Classification**: Show detailed rationale for every task type assignment. ON for debugging classification disagreements. 
- -## What This Skill CAN Do -- Discover subdomains within any target domain by dispatching parallel research agents -- Scan the existing repository for agents, skills, hooks, and scripts that overlap with the target domain -- Classify each subdomain by task type, complexity, and reuse potential -- Map each subdomain to a preliminary pipeline chain from the step menu -- Produce a Component Manifest listing everything the scaffolder needs to build -- Detect which existing agents can be reused as executors for new subdomain skills - -## What This Skill CANNOT Do -- **Scaffold pipeline components**: That is handled by `pipeline-scaffolder` after this skill completes -- **Compose final pipeline chains**: Preliminary chains are draft proposals; the chain-composer skill finalizes them with type compatibility validation -- **Create or modify routing entries**: That is handled by `routing-table-updater` -- **Validate pipeline chains against the type compatibility matrix**: This skill maps chains; validation is the chain-composer's responsibility +## Overview + +This skill answers: **What subdomains exist in this target domain, and what kind of pipeline does each one need?** + +The skill implements a **Research-Classify-Map-Produce** pattern: broad parallel discovery narrows into structured classification, then maps to pipeline chains, then produces a Component Manifest that the chain-composer skill uses to finalize and scaffold pipeline components. + +This is the first step in the self-improving pipeline generator (see `adr/self-improving-pipeline-generator.md`). + +--- ## Instructions @@ -85,7 +59,7 @@ This skill is the first step in the self-improving pipeline generator (see `adr/ ### Phase 1: DISCOVER (Parallel Multi-Agent — Rule 12 Mandatory) -**Goal**: Build a broad, multi-perspective understanding of the target domain. Breadth of research directly determines the quality of subdomain discovery — this is why parallel agents are mandatory, not optional. 
+**Goal**: Build a broad, multi-perspective understanding of the target domain. Breadth of research directly determines the quality of subdomain discovery — this is why parallel agents are mandatory, not optional. A/B testing proved parallel research eliminates a 1.40-point gap in Examples quality (see `adr/pipeline-creator-ab-test.md`). Sequential research is **BANNED** because it produces shallower, less diverse findings. **Default N = 4 agents.** Override with `--research-agents N` (minimum 2, maximum 6). @@ -104,7 +78,7 @@ Assemble shared context from: **Step 2: Dispatch 4 parallel research agents** -Launch all 4 agents simultaneously using the Task tool. Each agent receives the shared context block and saves findings to a separate artifact file. Each has a 5-minute timeout. +Launch **all 4 agents simultaneously** using the Task tool. Each agent receives the shared context block and saves findings to a separate artifact file. Each has a 5-minute timeout. **Agent 1: Domain Expert** - Investigate: What are the subdomains of this domain? What tasks do practitioners commonly perform? What are the natural workflow boundaries? @@ -130,6 +104,8 @@ Launch all 4 agents simultaneously using the Task tool. Each agent receives the - Output: Reference file recommendations and deterministic validation opportunities - Save to: `/tmp/pipeline-{run-id}/phase-1-research/agent-4-reference-research.md` +**Why parallel is mandatory**: Parallel dispatch forces diverse perspectives from the start. Agents do not see each other's partial results and thus avoid anchoring bias. Testing proved 4-agent parallel produces measurably better Examples coverage than sequential dispatch. + **Step 3: Collect and merge research artifacts** After all agents complete (at least 3 of 4 must succeed), merge findings into a single research compilation at `/tmp/pipeline-{run-id}/phase-1-research/content.md`. 
@@ -159,7 +135,7 @@ Create the Phase 1 dual-layer artifact: Update `metrics.candidate_subdomains` and `metrics.existing_components_found` with actual counts. **Gate**: All of the following must be true before proceeding: -- At least 3 of 4 research agents completed successfully +- At least 3 of 4 research agents completed successfully (Rule 12 gate) - Research compilation file exists at `/tmp/pipeline-{run-id}/phase-1-research/content.md` - At least 1 candidate subdomain identified - `manifest.json` written with `status: "complete"` @@ -168,7 +144,7 @@ If fewer than 3 agents completed: set `status: "partial"`, report which agents f ### Phase 2: CLASSIFY -**Goal**: For each candidate subdomain from Phase 1, assign a task type, complexity tier, reuse assessment, and required references. This phase transforms raw research findings into structured classifications that Phase 3 can map to pipeline chains. +**Goal**: For each candidate subdomain from Phase 1, assign a task type, complexity tier, reuse assessment, and required references. This phase transforms raw research findings into structured classifications that Phase 3 can map to pipeline chains. **Discovery logic**: Subdomains are discovered through research, not hardcoded. Why? Hardcoded lists miss domain-specific nuances and become stale. The whole point of this skill is that it adapts to any domain. **Step 1: Load classification reference** @@ -201,7 +177,8 @@ When a subdomain's tasks span two types (e.g., alerting is `configuration` + `mo | Medium | 5-8 steps | May include review or validation loops, 1-2 output types | | Complex | 8+ steps | Cross-domain delegation, safety gates, multiple output types, experimentation | -**Reuse Potential** — For each subdomain, check the existing inventory (from Agent 2): +**Reuse Potential** — For each subdomain, check the existing inventory (from Agent 2). Always verify existing coverage before marking anything as "New" (Rule 9: Reuse Before Create). Why? 
Agents are expensive context. Skills are cheap. The generator biases toward binding new skills to existing agents. + - **Full reuse** (80%+): Existing agent AND skill cover this subdomain. Skip it or note for enhancement only. - **Partial reuse** (40-79%): Existing agent covers the domain but no skill for this specific subdomain. Create new skill, bind to existing agent. - **New** (<40%): No existing components meaningfully cover this. Needs new skill and possibly new agent. @@ -265,13 +242,13 @@ Create the Phase 2 dual-layer artifact: **Gate**: All of the following must be true before proceeding: - Every candidate subdomain from Phase 1 has a task_type assigned - Every candidate subdomain has a complexity tier -- Every candidate subdomain has a reuse assessment +- Every candidate subdomain has a reuse assessment (checked against Agent 2's inventory) - At least 2 subdomains classified (if only 1 found, reconsider whether the domain is too narrow for decomposition — report to pipeline-orchestrator and ask whether to proceed as single-pipeline) - `manifest.json` written with `status: "complete"` ### Phase 3: MAP (Compose Preliminary Chains) -**Goal**: For each classified subdomain, select steps from the step menu and compose a preliminary pipeline chain. These are draft chains — the chain-composer skill validates and finalizes them. +**Goal**: For each classified subdomain, select steps from the step menu and compose a preliminary pipeline chain. These are draft chains — the chain-composer skill validates and finalizes them. **Type compatibility is mandatory**: Every adjacent step pair must have compatible output-to-input types. Why? Invalid types produce broken chains. Never skip this validation. **Step 1: Load step menu** @@ -309,12 +286,11 @@ For each classified subdomain, build a preliminary chain by: - Has quality criteria: VALIDATE - Add REFINE (max 3 cycles) after any validation step that can fail -6. 
**Apply profile gates** — note which steps are profile-dependent: +6. **Apply profile gates** — note which steps are profile-dependent. Record as annotations on the chain, not hard inclusions. Why? Read the operator profile from pipeline context but do NOT gate any research steps on it — research itself is read-only and harmless across all profiles. The profile information is passed through to the Component Manifest so downstream skills (chain-composer, scaffolder) can apply the correct safety gates. - APPROVE: Work/Production only - GUARD + SNAPSHOT: Work/Production only for state changes - SIMULATE: Production only (optional elsewhere) - NOTIFY: CI/Work/Production (skip in Personal) - - Record these as annotations on the chain, not hard inclusions 7. **Check for cross-domain dependencies**: - Does this subdomain need expertise from another domain? Add DELEGATE @@ -331,6 +307,13 @@ If Schema X is in Step B's consumes list → compatible Otherwise → flag the incompatibility and adjust ``` +When incompatibility is found, insert a bridging step. Common bridges: +- Research Artifact needs to become Structured Corpus: insert COMPILE +- Multiple Verdicts need to become one: insert AGGREGATE +- Generation Artifact needs Verdict before next step: insert VALIDATE + +If no bridge works, restructure the chain. 
**Never skip type validation.** + **Step 4: Produce mapping artifact** `/tmp/pipeline-{run-id}/phase-3-map/manifest.json`: @@ -382,7 +365,7 @@ Otherwise → flag the incompatibility and adjust **Gate**: All of the following must be true before proceeding: - Every classified subdomain has a preliminary chain - Every chain starts with ADR and ends with OUTPUT (or REPORT for analysis tasks) -- No unresolved type compatibility issues remain +- No unresolved type compatibility issues remain (all incompatibilities resolved or accepted with explicit bridge steps) - All profile-gated steps are annotated (not hard-included) - `manifest.json` written with `status: "complete"` @@ -396,7 +379,7 @@ Based on the existing inventory (Phase 1, Agent 2) and reuse assessments (Phase - If an existing agent covers 70%+ of the domain: **Reuse it**. Bind all new subdomain skills to this agent. Note the agent name and what gaps it has (if any). - If no existing agent covers the domain: **Create one new coordinator agent**. Define its name (`{domain}-pipeline-engineer` or `{domain}-{function}-engineer`), purpose, and which subdomain skills it will execute. -- NEVER create one agent per subdomain. WHY: Agents are expensive context; skills are cheap. The architecture is "1 agent : N skills" not "N agents : N skills". +- **NEVER create one agent per subdomain.** Why? Agents are expensive context; skills are cheap. The architecture is "1 agent : N skills" not "N agents : N skills". **Step 2: Compile shared resources** @@ -500,6 +483,8 @@ Identify resources that span multiple subdomains: If gate passes: Report completion to pipeline-orchestrator-engineer. The Component Manifest is the handoff artifact for the chain-composer skill. +--- + ## Error Handling ### Error: Domain Too Narrow @@ -526,49 +511,11 @@ If gate passes: Report completion to pipeline-orchestrator-engineer. The Compone - Generation Artifact needs Verdict before next step: insert VALIDATE If no bridge works, restructure the chain. 
Never skip type validation. -## Anti-Patterns - -### Anti-Pattern 1: Hardcoded Subdomain Lists -**What it looks like**: Skipping Phase 1 research and providing a predetermined list of subdomains -**Why wrong**: Misses domain nuances, produces generic pipelines, defeats the purpose of the research phase. A human can guess "Prometheus has metrics and alerting" — the value of this skill is discovering the non-obvious subdomains (performance tuning, federation, recording rules). -**Do instead**: Always run the full parallel research phase. Even for well-known domains, Agent 2 (Existing Inventory) and Agent 4 (Reference Research) will discover context the human prompt missed. - -### Anti-Pattern 2: One Agent Per Subdomain -**What it looks like**: Component Manifest creates 5 agents for 5 subdomains -**Why wrong**: Agents are expensive context (loaded per session). Skills are cheap (loaded per task). Creating N agents where 1 agent + N skills would work wastes context budget and fragments routing. -**Do instead**: Create at most 1 new agent per domain. Bind all subdomain skills to it (or to an existing agent that covers the domain). The ratio should be 1 agent : N skills, not 1:1. - -### Anti-Pattern 3: Over-Splitting Subdomains -**What it looks like**: Discovering 10+ subdomains for a moderate-complexity domain -**Why wrong**: Produces too many narrow skills that overlap. Each skill has a fixed context cost (frontmatter + phases). 10 micro-skills may cost more context total than 5 well-scoped skills. -**Do instead**: Target 3-7 subdomains for most domains. If you discover more, look for natural groupings. "Prometheus metric types", "Prometheus metric naming", and "Prometheus recording rules" can likely merge into "Prometheus metrics authoring". 
- -### Anti-Pattern 4: Sequential Research -**What it looks like**: Running Agent 1, waiting for results, then Agent 2, then Agent 3, then Agent 4 -**Why wrong**: Rule 12 is not a suggestion — A/B testing proved parallel research produces measurably better output. Sequential research takes 4x longer and produces narrower findings because later agents don't benefit from the breadth that parallelism provides. -**Do instead**: Always dispatch all 4 agents simultaneously. The gate requires 3 of 4 to succeed, not all 4 in sequence. - -### Anti-Pattern 5: Ignoring Reuse Assessment -**What it looks like**: Marking every subdomain as "New" without checking existing inventory -**Why wrong**: Creates duplicate agents/skills that fragment routing and waste maintenance effort -**Do instead**: Agent 2 specifically searches for existing components. Phase 2 must assess every subdomain against this inventory before marking anything as "New". - -## Anti-Rationalization - -| Rationalization Attempt | Why It's Wrong | Required Action | -|------------------------|----------------|-----------------| -| "I already know the subdomains, skip research" | You know SOME subdomains. Research finds the non-obvious ones. | Run all 4 agents. Phase 1 is not optional. | -| "Sequential research is fine for simple domains" | Rule 12 has no complexity exception. The A/B data applies universally. | Dispatch agents in parallel. Always. | -| "This subdomain needs its own agent" | Agents are for domains, skills are for subdomains. 1 agent : N skills. | Bind to existing/shared agent. Only create new agent if no existing agent covers 70%+ of domain. | -| "3 subdomains aren't enough, let me add more" | More subdomains ≠ better decomposition. Over-splitting is an anti-pattern. | Stop at natural boundaries. 3-7 is the target range. | -| "I'll just pick the obvious chain, no need for step menu lookup" | The step menu exists to prevent chain composition errors. It has type compatibility rules. 
| Load and reference the step menu. Validate types. | -| "Type compatibility is just bureaucracy" | It's the type system of pipeline composition. Invalid types produce broken chains. | Validate every adjacent pair. Insert bridges for incompatibilities. | - --- ## Source Hierarchy -Research follows a strict source hierarchy. WHY: Stale training data is the most common research failure — hallucinated version numbers, deprecated APIs, and removed features all come from treating training data as current documentation. +Research follows a strict source hierarchy. Why? Stale training data is the most common research failure — hallucinated version numbers, deprecated APIs, and removed features all come from treating training data as current documentation. | Priority | Source | Confidence | When to Use | |----------|--------|------------|-------------| @@ -579,11 +526,9 @@ Research follows a strict source hierarchy. WHY: Stale training data is the most If a finding comes from training data alone (no external verification), it MUST be tagged LOW confidence. This applies even when the model is highly confident — confidence is not currency without a source. ---- - ## Confidence-Level Tagging -Every research finding is tagged with a confidence level. WHY: Without explicit confidence, consumers of research output treat all findings as equally reliable — which means a guess from training data carries the same weight as a verified API response. Confidence tagging forces the researcher to assess source quality and forces the consumer to calibrate trust. +Every research finding is tagged with a confidence level. Why? Without explicit confidence, consumers of research output treat all findings as equally reliable — which means a guess from training data carries the same weight as a verified API response. Confidence tagging forces the researcher to assess source quality and forces the consumer to calibrate trust. 
### Confidence Levels @@ -618,9 +563,9 @@ Every research finding is tagged with a confidence level. WHY: Without explicit --- -## "Don't Hand-Roll" Output Section +## Don't Hand-Roll Output Section -Research output includes a mandatory section listing problems that seem simple but have battle-tested library solutions. WHY: The most expensive bugs come from reimplementing solutions that already exist with years of production hardening, security patches, and edge case coverage. A hand-rolled JWT validator or rate limiter might pass tests but fail under adversarial conditions. +Research output includes a mandatory section listing problems that seem simple but have battle-tested library solutions. Why? The most expensive bugs come from reimplementing solutions that already exist with years of production hardening, security patches, and edge case coverage. A hand-rolled JWT validator or rate limiter might pass tests but fail under adversarial conditions. ### Format @@ -656,7 +601,7 @@ The researcher should populate this table with domain-specific entries based on ## Anti-Features Output Section -Research output includes a mandatory section listing features to explicitly NOT build, with rationale. WHY: Explicitly naming what is out of scope is as valuable as naming what is in scope. Without this, scope creep happens through "while we're at it" additions that seem reasonable in isolation but compound into over-engineering. +Research output includes a mandatory section listing features to explicitly NOT build, with rationale. Why? Explicitly naming what is out of scope is as valuable as naming what is in scope. Without this, scope creep happens through "while we're at it" additions that seem reasonable in isolation but compound into over-engineering. 
### Format @@ -702,6 +647,8 @@ STOP and ask the pipeline-orchestrator-engineer (do NOT proceed autonomously) wh - Which operator profile to apply (detect from context or use default) - Whether a subdomain is too narrow or too broad (ask when uncertain) +--- + ## References - [Task Type Guide](references/task-type-guide.md) - Detailed task type definitions with canonical chains and examples (loaded in Phase 2) diff --git a/pipelines/explore-pipeline/SKILL.md b/pipelines/explore-pipeline/SKILL.md index 12dc06ee..cbdeda20 100644 --- a/pipelines/explore-pipeline/SKILL.md +++ b/pipelines/explore-pipeline/SKILL.md @@ -48,47 +48,16 @@ routing: # Exploration Pipeline -## Operator Context - -This skill operates as an operator for systematic codebase exploration, configuring Claude's behavior for structured, read-only investigation. It implements the **Pipeline Architecture** pattern — parallel scan, structured mapping, deep analysis, artifact-based reporting — with **Domain Intelligence** embedded in the exploration methodology. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before exploring -- **Over-Engineering Prevention**: Explore what was asked. 
No speculative scope expansion, no "while I'm here" tangents into unrelated subsystems -- **Read-Only**: This pipeline NEVER modifies source code, configs, or tests -- **Artifacts at Every Phase**: Save findings to files; context is ephemeral -- **Structured Output**: Always produce a structured exploration report -- **Scope Discipline**: Answer the exploration question; do not generate improvement recommendations unless explicitly asked - -### Default Behaviors (ON unless disabled) -- **Parallel Scanning**: Launch 3 parallel subagents for initial scan -- **Architecture Mapping**: Map component relationships and data flow -- **Save Report**: Save findings to `exploration-report.md` -- **Entry Point Tracing**: Identify main executables, CLI entry points, API routes -- **Pattern Detection**: Identify naming conventions, directory organization, error handling patterns -- **Dependency Mapping**: Trace component relationships and dependency graphs - -### Optional Behaviors (OFF unless enabled) -- **Deep Dive**: Use `--deep` for comprehensive multi-layer analysis -- **Quick Mode**: Use `--quick` for high-level overview only (skip Phase 3) -- **Specific Focus**: Use `--focus [area]` to constrain exploration to a single component -- **Explicit Tier Selection**: Use `--tier quick|standard|deep` to set exploration depth (see Tiered Depth Model below) - -## What This Skill CAN Do -- Systematically scan repository structure using parallel subagents -- Map architecture layers, component relationships, and data flow -- Identify patterns, conventions, and key abstractions -- Produce structured, reusable exploration reports saved as artifacts -- Trace entry points and execution paths through the codebase - -## What This Skill CANNOT Do -- Modify any files in the repository (read-only) -- Debug or fix bugs (use systematic-debugging instead) -- Refactor code (use systematic-refactoring instead) -- Generate documentation (use technical-documentation-engineer instead) -- Skip phases 
within the selected tier (Quick: Phase 1; Standard: 1-3,8; Deep: all 8) +## Overview ---- +This skill performs systematic codebase exploration using parallel subagents and tiered depth selection. It is read-only (never modifies files) and saves structured artifacts at every phase. Depth is determined by the query type: **Quick** (single question, Phase 1 only), **Standard** (subsystem understanding, 4 phases), or **Deep** (full quality assessment with recommendations, 8 phases). + +The pipeline implements three core constraints: +1. **Scope discipline**: Answer the question asked, do not tangent into unrelated subsystems or generate unsolicited recommendations +2. **Artifact-first**: Save findings to files at each phase; context is ephemeral +3. **Gate enforcement**: Do not skip phases within the selected tier. Each phase has defined exit criteria and cannot be omitted + +Optional behaviors are disabled by default: use `--deep` for comprehensive analysis, `--quick` for overview, `--focus [area]` for targeted exploration, or `--tier quick|standard|deep` for explicit depth selection. ## Instructions @@ -293,7 +262,7 @@ Solution: ### Error: "Scanner Subagent Timed Out" Cause: Subagent stuck on large directory traversal or slow file reads Solution: -1. Proceed with results from completed scanners (minimum 2 of 3) +1. Proceed with results from completed scanners (minimum 2 of 3). Do not wait for all three to complete; minimum 2 is the gate. 2. Fill gaps with targeted manual investigation in Phase 3 3. 
Note incomplete coverage in the final report @@ -306,92 +275,49 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Random File Reading -**What it looks like**: Opening files without a plan, hoping to stumble on understanding -**Why wrong**: Produces incomplete, biased picture weighted toward whatever was read first -**Do instead**: Systematic parallel scan with defined focus areas - -### Anti-Pattern 2: No Saved Artifacts -**What it looks like**: Exploring entirely in context without saving findings to files -**Why wrong**: Knowledge lost when context compresses; cannot be reused across sessions -**Do instead**: Save architecture map and report to files at each phase - -### Anti-Pattern 3: Skipping the Map Phase -**What it looks like**: Jumping from scanning straight to analysis -**Why wrong**: Misses component relationships; analysis lacks structural context -**Do instead**: Always create architecture map before deep analysis - -### Anti-Pattern 4: Generating Recommendations Unsolicited -**What it looks like**: "Here are 10 things I would improve about this codebase" -**Why wrong**: User asked to understand, not to change. Scope creep wastes time and adds noise. -**Do instead**: Report what exists. Only recommend if user explicitly asks. - -### Anti-Pattern 5: Same Depth for Every Question -**What it looks like**: Running all 4 phases with parallel scanners when the user just wants to know "what ORM does this use?" -**Why wrong**: Wastes time and tokens. A 15-minute pipeline for a 2-minute question erodes trust in the tool. -**Do instead**: Match depth to the question. Use the Tiered Depth Model below to select the right scope. - ---- - ## Tiered Depth Model -Not every exploration question needs the same depth. Running a full pipeline for a quick fact-check wastes time; running a shallow scan for onboarding misses critical context. The tier is determined by the caller, not guessed. If depth is unspecified, default to **Standard**. 
+The tier is determined by query type, not guessed. Matching depth to question scope prevents waste (not spending 30 minutes on a 2-minute fact-check) while ensuring completeness (not running a shallow scan when full understanding is needed). **Default to Standard if depth is unspecified.** All phases within the selected tier must run; phases cannot be skipped or reordered. ### Quick Verify (2-5 minutes) **Purpose**: Confirm a specific fact about the codebase. -**Scope**: Answer one question. Read only the files necessary to answer it. No document generation -- the answer IS the output. +**Scope**: Answer one question. Read only the files necessary to answer it. No document generation — the answer IS the output. Avoid tangenting into adjacent subsystems: stay focused on the specific fact. -**Phases used**: Phase 1 (SCAN) only -- single targeted scanner, not parallel. +**Phases used**: Phase 1 (SCAN) only — single targeted scanner, not parallel. **Exit criteria**: Question answered with file path evidence, or "could not determine" with a list of what was checked. -**Examples**: -- "Does this project use dependency injection?" -- "What ORM does the API use?" -- "Is there a CI pipeline configured?" -- "What version of React is this using?" +**Examples**: "Does this project use dependency injection?" "What ORM does the API use?" "Is there a CI pipeline configured?" "What version of React is this using?" **Output format**: Direct answer in conversation (no saved report file). ### Standard (15-30 minutes) -**Purpose**: Understand a subsystem or map one area of the codebase. +**Purpose**: Understand a subsystem or map one area of the codebase. This is the default tier when the user hasn't specified a depth preference. -**Scope**: All 4 phases execute. Parallel scanners in Phase 1. Produce a single structured document covering the targeted area. +**Scope**: All 4 phases execute. Parallel scanners in Phase 1 (minimum 2 of 3 must complete). 
Produce a single structured document covering the targeted area. Do not jump to analysis without creating an architecture map in Phase 2. -**Phases used**: All 4 phases (SCAN, MAP, ANALYZE, REPORT). +**Phases used**: All 4 phases (SCAN, MAP, ANALYZE, REPORT). Phases cannot be skipped. **Exit criteria**: Report covers the subsystem's boundaries, key patterns, and integration points. Saved as `exploration-report.md`. -**Examples**: -- "How does authentication work in this app?" -- "Map the payment processing flow." -- "Explain how the event system works." -- "What's the testing strategy for this repo?" +**Examples**: "How does authentication work in this app?" "Map the payment processing flow." "Explain how the event system works." "What's the testing strategy for this repo?" **Output format**: Saved `exploration-report.md` with all sections. ### Deep Dive (1+ hour) -**Purpose**: Full analysis — architectural understanding PLUS quality evaluation, consistency assessment, and pattern analysis. +**Purpose**: Full analysis — architectural understanding PLUS quality evaluation, consistency assessment, and pattern analysis. Use this tier when the user asks to "analyze quality of", "assess consistency of", or "evaluate patterns in" the codebase. -**Scope**: All 8 phases. Phases 1-3 explore the codebase. Phases 4-7 compile findings, assess quality, synthesize recommendations, and verify against source. Phase 8 produces a comprehensive report. +**Scope**: All 8 phases. Phases 1-3 explore the codebase. Phases 4-7 compile findings, assess quality, synthesize recommendations, and verify against source code to remove false positives. Phase 8 produces a comprehensive report. Do not generate improvement recommendations unless findings have been verified against actual source code in Phase 7. -**Phases used**: All 8 phases: SCAN → MAP → ANALYZE → COMPILE → ASSESS → SYNTHESIZE → REFINE → REPORT. 
+**Phases used**: All 8 phases: SCAN → MAP → ANALYZE → COMPILE → ASSESS → SYNTHESIZE → REFINE → REPORT. All phases are mandatory; none can be skipped. -**Exit criteria**: Comprehensive report covers full architecture, quality assessment with scores, consistency evaluation, pattern analysis, and ranked recommendations. Multiple artifact files produced. +**Exit criteria**: Comprehensive report covers full architecture, quality assessment with scores, consistency evaluation, pattern analysis, and ranked recommendations. Multiple artifact files produced. Top 5 findings verified against source in Phase 7 to confirm accuracy. -**Examples**: -- "I'm new to this codebase, give me the full picture." -- "We're considering a major refactor -- what do we need to know?" -- "Analyze the quality and consistency of our error handling." -- "Assess which patterns are used consistently vs inconsistently." -- "Full architectural review before we plan next quarter." -- "Evaluate the test coverage patterns across all modules." +**Examples**: "I'm new to this codebase, give me the full picture." "We're considering a major refactor — what do we need to know?" "Analyze the quality and consistency of our error handling." "Assess which patterns are used consistently vs inconsistently." "Full architectural review before we plan next quarter." "Evaluate the test coverage patterns across all modules." **Output format**: Saved `exploration-report.md` plus supplementary files (`architecture-map.md`, `analysis-compilation.md`, component-specific documents as needed). @@ -403,7 +329,7 @@ If user specifies --tier or --quick or --deep: Else if user asks a single specific question (what/which/does/is): Use Quick Verify. Else if user asks to analyze/assess/evaluate/audit quality or patterns: - Use Deep Dive (needs COMPILE + ASSESS + SYNTHESIZE + REFINE phases). + Use Deep Dive (requires all 8 phases including COMPILE, ASSESS, SYNTHESIZE, REFINE for verification). 
Else if user asks about a specific subsystem or flow: Use Standard. Else if user asks for full picture / onboarding / comprehensive: @@ -435,17 +361,4 @@ When exploration requires looking up external information (framework conventions ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) - Phase transition rules -- [Pipeline Architecture](../shared-patterns/pipeline-architecture.md) - Pipeline design principles - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I read enough files to understand" | Sampling bias skews understanding | Complete all 3 parallel scans | -| "Architecture is obvious, skip mapping" | Obvious to whom? Document it. | Save architecture-map.md | -| "Quick overview is good enough" | Quick overviews miss relationships | Complete all 4 phases unless --quick | -| "I'll remember the structure" | Context compresses; memory is unreliable | Save artifacts to files | +This skill follows the **Pipeline Architecture** pattern with artifact-based reporting and gate enforcement. Context is ephemeral; all findings must be saved to files at each phase to survive context compression and enable reuse across sessions. 
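The tier-selection rules in the explore-pipeline hunk can be sketched as a small function. This is a minimal illustration, not the pipeline's actual implementation: `select_tier`, the `flags` dictionary, and the keyword tuples are hypothetical names. Mapping "full picture" requests to Deep Dive is inferred from the Deep Dive examples, because the outcome of that branch is not visible in this excerpt.

```python
import re
from typing import Optional

# Keyword lists are illustrative assumptions, not taken from the pipeline.
QUICK_OPENERS = ("what", "which", "does", "is")
DEEP_VERBS = ("analyze", "assess", "evaluate", "audit")
DEEP_SCOPE = ("full picture", "onboarding", "comprehensive")


def select_tier(query: str, flags: Optional[dict] = None) -> str:
    """Return 'quick', 'standard', or 'deep' for an exploration request."""
    flags = flags or {}

    # 1. Explicit flags always win over inference from the query text.
    if flags.get("tier") in ("quick", "standard", "deep"):
        return flags["tier"]
    if flags.get("quick"):
        return "quick"
    if flags.get("deep"):
        return "deep"

    text = query.strip().lower()
    words = re.findall(r"[a-z']+", text)
    first = words[0] if words else ""

    # 2. A single specific question (what/which/does/is) -> Quick Verify.
    if first in QUICK_OPENERS and text.count("?") <= 1:
        return "quick"

    # 3. Quality or pattern evaluation -> Deep Dive.
    if any(verb in words for verb in DEEP_VERBS):
        return "deep"

    # 4. Full picture / onboarding / comprehensive -> Deep Dive
    #    (assumed from the Deep Dive examples).
    if any(phrase in text for phrase in DEEP_SCOPE):
        return "deep"

    # 5. Subsystem or flow questions, and anything unspecified -> Standard.
    return "standard"
```

Checking the rules in the same order as the decision list keeps the behavior predictable: a question that opens with "what" stays in Quick Verify even if it mentions quality, unless the caller passes an explicit flag.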
diff --git a/pipelines/feature-pipeline/SKILL.md b/pipelines/feature-pipeline/SKILL.md index 08c1edcd..2d960155 100644 --- a/pipelines/feature-pipeline/SKILL.md +++ b/pipelines/feature-pipeline/SKILL.md @@ -36,39 +36,18 @@ routing: End-to-end feature lifecycle pipeline that coordinates the five existing feature skills (feature-design, feature-plan, feature-implement, feature-validate, feature-release) into a single phase-gated workflow with a final RECORD phase for learning. -## Operator Context - -This pipeline orchestrates the full feature lifecycle. Each phase invokes the corresponding feature-* skill and enforces gates before transitions. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before starting -- **Sequential Phases**: Phases execute in order. No skipping. -- **Phase Gates Enforced**: Each phase must pass its gate before the next begins -- **Artifact Persistence**: Each phase produces artifacts that subsequent phases depend on -- **Branch Isolation**: All work happens on a feature branch, never on main - -### Default Behaviors (ON unless disabled) -- **Design-First**: Always start with design, even for "obvious" features -- **Plan Before Code**: Never write implementation code without an approved plan -- **Test Before Ship**: Validation must pass before release -- **Record Learnings**: Always complete the RECORD phase - -### Optional Behaviors (OFF unless enabled) -- **--skip-design**: Skip Phase 1 if design document already exists -- **--skip-release**: Stop after validation (useful for draft features) -- **--parallel-implement**: Dispatch implementation tasks in parallel via agents - -## What This Pipeline CAN Do -- Coordinate an entire feature from initial design through release -- Ensure proper sequencing of design, planning, implementation, validation, and release -- Prevent common lifecycle mistakes (coding before design, shipping before tests) -- Record patterns for future feature 
development - -## What This Pipeline CANNOT Do -- Replace individual feature-* skills for single-phase work -- Debug bugs (use systematic-debugging instead) -- Refactor existing code (use systematic-refactoring instead) -- Skip phases without explicit opt-in flags +## Overview + +This pipeline orchestrates the full feature lifecycle by invoking feature-* skills in sequence: DESIGN, PLAN, IMPLEMENT, VALIDATE, RELEASE, RECORD. Each phase must pass its gate before the next begins (enforced because skipping design or plan steps causes rework, and testing without validation creates merged bugs). + +**Before starting**: Read and follow your repository's CLAUDE.md because it contains essential context and conventions. + +**Scope**: Use for end-to-end feature work only. For single-phase work (e.g., "just validate this feature"), use the individual feature-* skills instead. + +**Optional flags** (OFF by default): +- `--skip-design` — Skip Phase 1 if design document already exists +- `--skip-release` — Stop after validation (useful for draft features) +- `--parallel-implement` — Dispatch implementation tasks in parallel via agents --- @@ -202,40 +181,27 @@ Document what went well, what could improve, and any process adjustments for nex ### Error: "Design Phase Stalls" Cause: Requirements unclear or stakeholder alignment missing -Solution: Time-box design to 2 iterations. If no convergence, document open questions and proceed with best-available design, flagging assumptions. +Solution: Time-box design to 2 iterations because unbounded design exploration delays implementation without creating clarity. If no convergence, document open questions and proceed with best-available design, flagging assumptions. Re-validate during Phase 4 if assumptions prove incorrect. ### Error: "Implementation Diverges from Plan" Cause: Discovered complexity not anticipated in design/plan -Solution: Return to Phase 2 (PLAN). Update plan with new understanding. 
Do NOT continue implementing against an outdated plan. +Solution: Return to Phase 2 (PLAN) to update the plan with new understanding because continuing with an outdated plan causes wasted implementation effort and integration rework. Do NOT continue implementing against an outdated plan. ### Error: "Validation Fails" Cause: Implementation bugs or missing requirements -Solution: Return to Phase 3 (IMPLEMENT) to fix. Do NOT proceed to RELEASE with failing validation. Re-run full validation after fixes. +Solution: Return to Phase 3 (IMPLEMENT) to fix because proceeding to release with failing tests ships defects to end users. Do NOT proceed to RELEASE with failing validation. Re-run full validation after fixes to catch any regressions. ### Error: "Release Blocked" Cause: Merge conflicts, CI failures, or review feedback -Solution: Address each blocker. Return to Phase 4 (VALIDATE) if code changes were needed. +Solution: Address each blocker. Return to Phase 4 (VALIDATE) if code changes were needed because changes risk introducing new failures. --- -## Anti-Patterns - -### Anti-Pattern 1: Skipping Design -**What it looks like**: "This feature is simple, let me just code it" -**Why wrong**: Simple features become complex. Without design, scope creeps and rework follows. -**Do instead**: Complete Phase 1 even if brief. A one-paragraph design doc is still a design doc. - -### Anti-Pattern 2: Implementing Without a Plan -**What it looks like**: Starting to code after design, without breaking into tasks -**Why wrong**: Leads to unordered work, missed dependencies, and integration pain. -**Do instead**: Complete Phase 2. Even 3 ordered tasks is better than ad-hoc coding. - -### Anti-Pattern 3: Shipping Without Validation -**What it looks like**: "Tests pass locally, let's merge" -**Why wrong**: Local passes != CI passes. Regressions hide in untested paths. -**Do instead**: Complete Phase 4 with full quality gates before Phase 5. 
+## References -### Anti-Pattern 4: Not Recording Learnings -**What it looks like**: Shipping and moving on without retrospective -**Why wrong**: Same mistakes repeat. Process never improves. -**Do instead**: Complete Phase 6. Five minutes of recording saves hours on the next feature. +This pipeline coordinates: +- `/feature-design` — Explore requirements, discuss trade-offs, produce design document +- `/feature-plan` — Break design into wave-ordered tasks with domain agent assignments +- `/feature-implement` — Execute plan by dispatching tasks to domain agents +- `/feature-validate` — Run quality gates (tests, lint, type checks, custom validation) +- `/feature-release` — Merge via PR, tag release, clean up branch diff --git a/pipelines/github-profile-rules/SKILL.md b/pipelines/github-profile-rules/SKILL.md index f418a4d4..9bdb4426 100644 --- a/pipelines/github-profile-rules/SKILL.md +++ b/pipelines/github-profile-rules/SKILL.md @@ -33,41 +33,9 @@ routing: # GitHub Profile Rules Extraction -## Operator Context - -This skill operates as the main orchestration pipeline for extracting programming rules from a GitHub user's public profile. It implements a 7-phase pipeline that fetches data exclusively via the GitHub API (no git clone), analyzes code patterns across repos, extracts PR review comments for preference signals, compiles findings into deduplicated confidence-scored rules, and outputs actionable CLAUDE.md-compatible entries. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution. -- **API-Only Data Fetching**: All GitHub data must be fetched via `scripts/github-api-fetcher.py`. No git clone, no subprocess git calls. This is a non-negotiable constraint. -- **Rate Limit Awareness**: Before each batch of API calls, check remaining quota. If `--token` is not provided, the unauthenticated limit is 60 req/hr. 
-- **Evidence-Based Rules Only**: Every generated rule must cite at least one repo or review where the pattern was observed. No generic advice. -- **Confidence Scoring**: Every rule gets a confidence level: high (3+ repos), medium (2 repos), low (1 repo). - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report progress by phase with counts (repos fetched, files sampled, patterns found, rules generated). -- **Top-N Repos**: Analyze the top 10 repos by stars/recent activity unless overridden by `--max-repos`. -- **Review Priority**: PR reviews given carry 2x weight vs code authored for preference signals. -- **Output Dual Format**: Always produce both CLAUDE.md markdown and JSON with confidence scores. - -### Optional Behaviors (OFF unless enabled) -- **Verbose Mode**: Show each API call and response -- **Raw Data Preservation**: Keep intermediate files alongside final output -- **Org-Wide Analysis**: Extend analysis to all repos in a GitHub organization - -## What This Skill CAN Do -- Fetch public repos, file contents, commit messages, and PR reviews via GitHub API (rest endpoints) -- Sample N files per repo across a user's top repositories -- Identify naming conventions, code style patterns, architectural preferences, and testing habits -- Deduplicate and rank patterns by frequency across repos -- Output actionable rules in CLAUDE.md-compatible markdown and structured JSON -- Handle API rate limiting with backoff and user notification - -## What This Skill CANNOT Do -- **Clone repositories**: All analysis is API-based -- **Access private data**: Only public repos and reviews are analyzed -- **Run code**: Patterns are extracted from source text, not by executing code -- **Guarantee exhaustive coverage**: API rate limits and sampling mean not every file is analyzed +## Overview + +This pipeline extracts programming rules from a GitHub user's public profile by analyzing repositories, code files, commit messages, and PR reviews via the GitHub 
API. It produces confidence-scored rules formatted for CLAUDE.md-compatible markdown or JSON output. All data fetching is API-only (no git clone). Every extracted rule must have evidence from at least one repo or PR review, scored by confidence: high (3+ repos or 2+ repos + review signal), medium (2 repos), low (1 repo). --- @@ -93,7 +61,7 @@ This skill operates as the main orchestration pipeline for extracting programmin **Goal**: Fetch the user's repo list, language statistics, and README samples via GitHub API. -**Step 1**: Run the API fetcher to get repo metadata: +**Step 1**: Run the API fetcher to get repo metadata. Enforce API-only constraint — never use git clone: ```bash python3 ~/.claude/scripts/github-api-fetcher.py repos \ --username {username} \ @@ -131,7 +99,7 @@ python3 ~/.claude/scripts/github-api-fetcher.py repos \ - **Agent 4: Documentation Patterns** -- README structure, inline comments, docstring style, documentation quality Each agent: -- Uses `python3 ~/.claude/scripts/github-api-fetcher.py sample-files` to fetch file contents +- Uses `python3 ~/.claude/scripts/github-api-fetcher.py sample-files` to fetch file contents (API-only, never clone) - Saves findings to `/tmp/github-rules-{username}/research-{aspect}.md` - Has a 5-minute timeout - Operates independently @@ -144,9 +112,9 @@ Each agent: ### Phase 3: SAMPLE -**Goal**: Fetch PR reviews the user has given to extract preference signals. +**Goal**: Fetch PR reviews the user has given to extract preference signals. PR reviews carry 2x weight versus code authored when scoring confidence. -**Step 1**: Run the API fetcher to get PR reviews: +**Step 1**: Run the API fetcher to get PR reviews. 
Check rate limits before each batch (60 req/hr unauthenticated, 5000 req/hr with token): ```bash python3 ~/.claude/scripts/github-api-fetcher.py pr-reviews \ --username {username} \ @@ -175,7 +143,7 @@ python3 ~/.claude/scripts/github-api-fetcher.py pr-reviews \ **Step 2**: For each identified pattern: - Count how many repos it appears in -- Check if it's reinforced by review comments +- Check if it's reinforced by review comments (adds 2x weight) - Assign confidence: high (3+ repos OR 2+ repos + review signal), medium (2 repos), low (1 repo) - Categorize using the taxonomy from `references/rule-categories.md` @@ -199,7 +167,7 @@ python3 ~/.claude/scripts/rules-compiler.py \ ### Phase 5: GENERATE -**Goal**: Format compiled patterns as CLAUDE.md-compatible rule entries. +**Goal**: Format compiled patterns as CLAUDE.md-compatible rule entries with evidence. **Step 1**: For each rule, generate a CLAUDE.md entry following this format: ```markdown @@ -303,7 +271,7 @@ mkdir -p rules/{username} ### Error: API Rate Limit Exceeded **Cause**: Too many requests without authentication token. -**Solution**: Check `X-RateLimit-Remaining`. Suggest `--token` flag. For unauthenticated: 60/hr. With token: 5000/hr. +**Solution**: Check `X-RateLimit-Remaining`. Suggest `--token` flag. For unauthenticated: 60/hr. With token: 5000/hr. Implement backoff and check remaining quota before each batch of API calls. ### Error: No Public Repos or Reviews **Cause**: User has no public GitHub activity. @@ -311,24 +279,15 @@ mkdir -p rules/{username} ### Error: Insufficient Data for Confident Rules **Cause**: Too few repos or files to establish patterns. -**Solution**: Lower confidence thresholds and flag all rules as preliminary. Report data limitations. - -## Anti-Patterns +**Solution**: Lower confidence thresholds and flag all rules as preliminary. Report data limitations. Only extract rules with evidence from the available data. 
-### Anti-Pattern 1: Cloning Repos
-**What it looks like**: Using git clone to access code.
-**Why wrong**: Violates API-only constraint. Unnecessary for pattern extraction.
-**Do instead**: Use `scripts/github-api-fetcher.py sample-files` for file content.
+### Error: Generic or Unsubstantiated Rules
+**Constraint**: Every generated rule must cite at least one repo or review where the pattern was observed. No generic advice. Avoid rules such as "Follow clean code principles" that lack specific evidence; extract only patterns with specific evidence from the user's code.
-### Anti-Pattern 2: Generic Rules
-**What it looks like**: "Follow clean code principles" without specific evidence.
-**Why wrong**: Adds no value over generic best practices.
-**Do instead**: Extract only patterns with specific evidence from the user's code.
+### Error: Clone Attempts
+**Constraint**: All GitHub data must be fetched via `scripts/github-api-fetcher.py`. No git clone, no subprocess git calls. This is a non-negotiable constraint. Pattern extraction happens via API-based file content sampling, never by cloning repositories.
-### Anti-Pattern 3: Single-Repo Overfitting
-**What it looks like**: 20 rules from one project.
-**Why wrong**: May reflect project conventions, not personal preferences.
-**Do instead**: Cross-reference across 3+ repos for high confidence.
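The confidence thresholds stated in this diff (high for 3+ repos or 2+ repos plus a review signal, medium for 2 repos, low for 1 repo) can be expressed as a short function. This is a sketch under stated assumptions: `confidence`, `repo_count`, and `has_review_signal` are hypothetical names, and treating a rule backed only by a review as "low" is an assumption, since the source does not say how that case is scored.

```python
def confidence(repo_count: int, has_review_signal: bool = False) -> str:
    """Map evidence counts to a confidence label: 'high', 'medium', or 'low'."""
    # Every rule must cite at least one repo or review; reject rules with neither.
    if repo_count < 1 and not has_review_signal:
        raise ValueError("a rule needs evidence from at least one repo or review")

    # High: 3+ repos, or 2+ repos reinforced by a review comment.
    if repo_count >= 3 or (repo_count >= 2 and has_review_signal):
        return "high"

    # Medium: exactly 2 repos with no review signal.
    if repo_count == 2:
        return "medium"

    # Low: 1 repo, or review-only evidence (the latter is an assumption).
    return "low"
```

Rejecting rules with no evidence at all mirrors the "no generic advice" constraint: a rule that cannot name a repo or review never receives a label.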
+--- ## References diff --git a/pipelines/hook-development-pipeline/SKILL.md b/pipelines/hook-development-pipeline/SKILL.md index d0be425f..866786cf 100644 --- a/pipelines/hook-development-pipeline/SKILL.md +++ b/pipelines/hook-development-pipeline/SKILL.md @@ -31,46 +31,15 @@ routing: # Hook Development Pipeline -## Operator Context +## Overview -This skill wraps the `hook-development-engineer` with explicit phase gates that enforce the requirements the engineer cannot enforce alone: a reviewed spec before any code is written, a mandatory sub-50ms timing test before the hook is registered, and a registration step that makes "done" mean "live and documented" rather than just "written." +This pipeline enforces the full lifecycle of hook creation with three critical gates that prevent silent failures in production: -Hooks fire on every tool use in a live Claude Code session. A slow hook is not an acceptable hook — it degrades every tool call permanently. The performance gate in Phase 3 exists precisely because "should be fast" is not the same as "was measured." +1. **Spec-before-code gate (Phase 1)**: Hooks fire on every tool use in a live Claude Code session. Decision errors (wrong event type, wrong timing) compound across hundreds of sessions. Phase 1 forces all decisions into a written spec before any code is written. "Should I use PreToolUse or PostToolUse?" must be answered on paper, not in code. -### Hardcoded Behaviors (Always Apply) +2. **Performance gate (Phase 3)**: A slow hook degrades every tool call permanently. The measured performance test in Phase 3 is mandatory and blocking — if the hook takes ≥50ms, you return to Phase 2 with optimization instructions, no exceptions. "Should be fast" is not the same as "was measured as fast." -- **Spec Before Code**: Phase 1 must produce a written spec with all decisions recorded before Phase 2 begins. Never skip to implementation. 
-- **ADR Session Awareness**: In Phase 1 (SPEC), check for active ADR session (`.adr-session.json`). If found, read hook requirements from the ADR via `python3 ~/.claude/scripts/adr-query.py context --adr {adr_path} --role script-creator`. Include ADR-specified event types, matchers, and behavioral requirements in the spec. Run `adr-query.py list` to check for related ADRs. -- **Performance Gate is Blocking**: If `time python3 hooks/{name}.py < /dev/null` reads ≥ 50ms, return to Phase 2. Do not proceed to Phase 4. No exceptions, no "close enough." -- **Non-Blocking Gate is Blocking**: If the hook exits non-zero on invalid input, return to Phase 2. A crashing hook is worse than no hook. -- **Registration is Part of Done**: A hook not registered in `settings.json` is not done. Phase 4 is mandatory. -- **Lazy Imports Only**: All non-stdlib imports must be inside functions. Top-level imports add cold-start time and kill the performance budget. -- **Exit 0 Always**: The hook's `__main__` block must end with `sys.exit(0)` inside a `finally` clause. Not optional. - -### Default Behaviors (ON unless disabled) - -- **Dispatch to hook-development-engineer for Phase 2**: Use the Agent tool to dispatch implementation to `hook-development-engineer`. The pipeline orchestrates; the specialist writes. -- **Timeout field in registration**: Include `timeout` in every settings.json hook registration (default: 5000ms; 10000ms for SessionStart). -- **Learning-db record in Phase 5**: Record the hook in `scripts/learning-db.py` so the retro system can reference it. - -### Optional Behaviors (OFF unless enabled) - -- **Skip Phase 5 documentation**: Omit learning-db record for throwaway or experimental hooks (enable with "skip docs"). -- **Extended timeout**: Use `timeout: 10000` for hooks that need heavier startup (enable explicitly; still test against 50ms). 
- -## What This Skill CAN Do - -- Guide the full lifecycle of a new hook from contract definition through live registration -- Enforce the performance and non-blocking gates that the single-pass engineer workflow skips -- Dispatch Phase 2 implementation to `hook-development-engineer` with a precise spec -- Update `settings.json` with the correct event type, timeout, and `once` flag -- Record the hook in the learning database for retro visibility - -## What This Skill CANNOT Do - -- Guarantee correctness of hook logic — that is the engineer's domain -- Choose an event type for you — the spec phase requires your decision -- Override the performance gate — if it's slow, you go back to Phase 2 +3. **Registration gate (Phase 4)**: A hook not wired into `settings.json` is not done. This makes the boundary between "written" and "live" explicit and discoverable. --- @@ -93,6 +62,8 @@ Work through the following decisions and record them in a spec block. Do not pro | `Stop` | Session ends | Generate summary, archive | | `PreCompact` | Before context compression | Archive learnings | +**Constraint**: If you're looking at this table and thinking "both make sense," that's the symptom that needs fixing. PreToolUse and PostToolUse receive different JSON structures at different moments. A hook that fires before a tool runs cannot see the tool's output. A hook that fires after cannot block the call. The event type must map 1:1 to your action type. Spend 2 minutes on this; it prevents half of all hook regressions. + **Decision 2 — Target tools** (if applicable): Which specific tool names trigger logic? Or does the hook respond to all tools? **Decision 3 — Action type** (pick one or more): @@ -106,7 +77,10 @@ Work through the following decisions and record them in a spec block. Do not pro **Decision 5 — Once-per-session?**: Should this hook run only once per session (`once: true`)? Applies mainly to SessionStart hooks that load context. 
-**Decision 6 — External dependencies**: Does the hook import anything outside the Python standard library? If yes, those imports MUST be lazy (inside functions). List them. +**Decision 6 — External dependencies**: Does the hook import anything outside the Python standard library? **Constraint**: Non-stdlib imports MUST be lazy (inside functions only). Top-level imports add cold-start time and kill the performance budget. If you see `import requests` or `from hooks.lib import learning_db_v2` at the module level, your hook will fail Phase 3. List your dependencies here. + +**ADR Session Awareness** (check before proceeding): +If `.adr-session.json` exists, run `python3 ~/.claude/scripts/adr-query.py context --adr {adr_path} --role script-creator` to read hook requirements from the ADR. Include ADR-specified event types, matchers, and behavioral requirements in the spec. Run `adr-query.py list` to check for related ADRs. **Output**: A spec block like this: @@ -130,16 +104,16 @@ Output format: {"additionalContext": "..."} **Goal**: Write the hook Python file following established patterns. -Dispatch to `hook-development-engineer` using the Agent tool with the spec from Phase 1. The spec is the brief; the engineer writes the code. +Dispatch to `hook-development-engineer` using the Agent tool with the spec from Phase 1. The spec is the brief; the engineer writes the code. The engineer is NOT in pipeline mode and cannot enforce lazy imports or the exit-0 pattern independently — you will validate these in Phase 3. **Required structure** (verify before accepting Phase 2 output): 1. **Shebang + module docstring** — docstring must state: event type, what the hook does, performance characteristics, dependencies. -2. **Stdlib imports at top level only** — `sys`, `json`, `os` are fine at the top. Everything else: inside functions. +2. **Stdlib imports at top level only** — `sys`, `json`, `os` are fine at the top. Everything else: inside functions. 
This is the performance gate trigger. 3. **Early-exit for non-target tools** — if the hook targets specific tools, check `tool_name` immediately after JSON parse and exit 0 if it doesn't match. 4. **JSON parse with error handling** — wrap `json.loads(sys.stdin.read())` in try/except; on failure, write to debug log and exit 0. 5. **Main logic** — the actual hook behavior. -6. **Exit 0 always**: +6. **Exit 0 always** (non-negotiable): ```python if __name__ == "__main__": try: @@ -185,19 +159,19 @@ Must complete with exit 0 and no output. echo '{}' | python3 hooks/{name}.py echo 'invalid json' | python3 hooks/{name}.py ``` -Check with `echo $?` after each. Both must be 0. If either is non-zero, the hook has an unguarded exit path — return to Phase 2. +Check with `echo $?` after each. Both must be 0. **Constraint (Blocking Gate)**: If either is non-zero, the hook has an unguarded exit path — return to Phase 2. A crashing hook is worse than no hook. The `finally: sys.exit(0)` block is not optional. -**Check 3 — Performance** (MANDATORY hard gate): +**Check 3 — Performance (MANDATORY hard gate)**: ```bash time python3 hooks/{name}.py < /dev/null ``` -Read the `real` time from the output. **Must be under 50ms.** This is a cold-start measurement — it includes Python interpreter startup, import resolution, and the hook's own logic. +Read the `real` time from the output. **Must be under 50ms.** This is a cold-start measurement — it includes Python interpreter startup, import resolution, and the hook's own logic. **Constraint (Blocking Gate)**: If this check fails, return to Phase 2. No exceptions, no "close enough." 
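The non-blocking and performance checks above can be scripted rather than run by hand. The following is a minimal sketch, not part of the pipeline itself: `run_hook`, `check_gates`, and `BUDGET_SECONDS` are hypothetical names introduced for illustration, and the timing includes subprocess spawn overhead, so it only approximates the `real` value reported by `time`.

```python
import subprocess
import sys
import time

BUDGET_SECONDS = 0.050  # assumed constant: the 50ms cold-start budget from Check 3


def run_hook(hook_path, stdin_text=None):
    """Run the hook once in a fresh interpreter; return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    if stdin_text is None:
        # Mirrors `python3 hooks/{name}.py < /dev/null`
        proc = subprocess.run(
            [sys.executable, hook_path],
            stdin=subprocess.DEVNULL,
            capture_output=True,
        )
    else:
        # Mirrors `echo '...' | python3 hooks/{name}.py`
        proc = subprocess.run(
            [sys.executable, hook_path],
            input=stdin_text.encode(),
            capture_output=True,
        )
    return proc.returncode, time.perf_counter() - start


def check_gates(hook_path):
    """Return a list of gate failures; an empty list means Checks 2 and 3 passed."""
    failures = []
    # Check 2: the hook must exit 0 on both valid and invalid input.
    for payload in ("{}", "invalid json"):
        code, _ = run_hook(hook_path, payload)
        if code != 0:
            failures.append(f"non-blocking gate: exit {code} on input {payload!r}")
    # Check 3: cold start with empty stdin must stay under the budget.
    code, elapsed = run_hook(hook_path)
    if code != 0:
        failures.append(f"non-blocking gate: exit {code} on empty stdin")
    if elapsed >= BUDGET_SECONDS:
        failures.append(f"performance gate: {elapsed * 1000:.1f}ms >= 50ms")
    return failures
```

Any non-empty result means a return to Phase 2. Because a single measurement can be noisy, treat this as a convenience wrapper around the manual commands, not a replacement for reading the `time` output.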
If performance fails: -- Identify which imports are at the top level and move them inside functions -- Reduce startup logic (defer DB connections, lazy-load config) -- Simplify the early-exit path so non-matching tool calls return immediately -- Re-run `time python3 hooks/{name}.py < /dev/null` after each change +- Check for any `import X` at module level where X is not `sys`, `json`, `os`, `re`, or other zero-cost stdlib modules. +- Move those imports inside the functions that use them. +- Check if the hook opens a file or DB connection before checking `tool_name` — move those after the early-exit guard. +- Re-run `time python3 hooks/{name}.py < /dev/null` after each change and repeat until you're under 50ms. **Do not proceed to Phase 4 until all three checks pass.** @@ -207,7 +181,7 @@ If performance fails: ### Phase 4: REGISTER -**Goal**: Wire the hook into Claude Code settings so it actually fires. +**Goal**: Wire the hook into Claude Code settings so it actually fires. **Constraint**: A hook not registered in `settings.json` is not done. Phase 4 is mandatory. **Step 1**: Locate the settings file: ```bash @@ -239,7 +213,7 @@ cat ~/.claude/settings.json } ``` -**Timeout defaults**: +**Timeout defaults** (include in every registration): - Most hooks: `5000` (5 seconds — generous ceiling above the 50ms target) - SessionStart hooks: `10000` (10 seconds — loading context can legitimately take longer) - UserPromptSubmit hooks: `5000` @@ -341,70 +315,6 @@ Docs: --- -## Anti-Patterns - -### Top-Level Imports That Bloat Startup Time - -**What it looks like**: -```python -import sqlite3 # fine (stdlib, near-zero cost) -from hooks.lib import learning_db_v2 # BAD — local module with filesystem lookup -import requests # BAD — third-party, slow -``` - -**Why wrong**: Python resolves every top-level import before `main()` runs. A local module import can cost 10–30ms on its own — enough to fail the performance gate before a single line of hook logic executes. 
- -**Do instead**: Move non-stdlib imports inside the functions that need them: -```python -def _check_patterns(tool_output): - from hooks.lib import learning_db_v2 # imported only when this function is called - db = learning_db_v2.open() - ... -``` - -### Blocking on Error Instead of Exiting 0 - -**What it looks like**: -```python -data = json.loads(sys.stdin.read()) # raises on invalid JSON → unhandled → exit 1 -``` - -**Why wrong**: Any unhandled exception causes Python to exit with code 1. Claude Code interprets a non-zero hook exit as a hard failure and can stall. - -**Do instead**: -```python -try: - data = json.loads(sys.stdin.read()) -except (json.JSONDecodeError, ValueError): - sys.exit(0) # not our event format; ignore silently -``` - -### Skipping the Spec Because "It's Simple" - -**What it looks like**: Moving directly to Phase 2 because the hook "obviously" handles PostToolUse and "obviously" injects context. - -**Why wrong**: "Obviously" is the source of half of all hook event type mismatches. PreToolUse and PostToolUse are not interchangeable. A hook that fires before a tool runs cannot see the tool's output. A hook that fires after cannot block the call. - -**Do instead**: Write the spec, even for a two-line hook. It takes 2 minutes and prevents a wrong-event-type regression. - -### Registering Under the Wrong Event Type - -**What it looks like**: Registering an error-learning hook under `PreToolUse` instead of `PostToolUse` because both "kind of make sense." - -**Why wrong**: The hook will receive the wrong JSON structure and silently do nothing — or worse, fire at the wrong time and inject stale context. - -**Do instead**: Match the event type to the action type table in Phase 1's SPEC section. `PostToolUse` = "I need to see the result." `PreToolUse` = "I need to prevent or modify the call." 
- -### Treating Timeout in settings.json as the Performance Budget - -**What it looks like**: Setting `"timeout": 5000` and considering the performance question answered. - -**Why wrong**: The timeout is a ceiling on how long Claude Code will wait before killing the hook. It says nothing about actual hook performance. A hook that takes 2000ms every tool call degrades the session even if it doesn't time out. - -**Do instead**: Treat 50ms as the performance target. Treat 5000ms as the safety net. The Phase 3 `time` test measures actual performance; the timeout field does not. - ---- - ## References - [Hook Development Engineer](../../agents/hook-development-engineer.md) - Agent dispatched in Phase 2 for implementation diff --git a/pipelines/mcp-pipeline-builder/SKILL.md b/pipelines/mcp-pipeline-builder/SKILL.md index 2387b6e0..aa85db4b 100644 --- a/pipelines/mcp-pipeline-builder/SKILL.md +++ b/pipelines/mcp-pipeline-builder/SKILL.md @@ -36,40 +36,14 @@ allowed-tools: # MCP Pipeline Builder -## Purpose +## Overview Automate the path from "I have a repo" to "I have a working MCP server registered with Claude Code." The pipeline derives what to expose by analyzing the repository — no prior knowledge required from the user. **Input**: A repository URL or local path (e.g., `https://github.com/org/repo` or `/home/user/my-tool`). **Output**: A compiled, registered MCP server entry in `~/.claude.json` or `.claude/settings.json`. ---- - -## Operator Context - -**Note on `context: fork`**: This pipeline intentionally does NOT declare `context: fork`, unlike other multi-phase pipeline skills. The Phase 2 human review gate requires interactive back-and-forth with the user; forked context would isolate that interaction from the parent session. The trade-off (larger parent context) is acceptable given the Phase 2 gate is the pipeline's most important safety mechanism. 
- -### Hardcoded Behaviors (Always Apply) - -- **Human Review Gate at Phase 2**: ALWAYS pause after producing `design.md` and show it to the user before generating any code. The user must explicitly say "yes", "proceed", or "y" before Phase 3 begins. This gate exists because Phase 1 analysis is imperfect — catching a wrong tool name costs seconds; fixing generated code costs minutes. -- **Read-Only Default**: Only read operations are mapped to tools unless the user passes `--allow-destructive`. Phase 1 must classify every operation as read or write. Phase 2 silently drops all write operations from the design unless `--allow-destructive` is set. -- **Append-Only Config Registration**: Phase 6 reads the existing config file before writing. It NEVER overwrites existing `mcpServers` entries. If the target server name already exists in the config, Phase 6 prints a warning and exits without modifying the file. -- **3-Iteration Fix Limit**: Phase 4 attempts to fix compilation errors automatically. After 3 failed iterations, the pipeline halts and surfaces all errors to the user. It does not retry a fourth time. -- **Artifact-First Execution**: Every phase produces a saved artifact before proceeding. If a phase artifact does not exist, the pipeline re-runs that phase rather than proceeding from stale state. - -### Default Behaviors (ON unless disabled) - -- **TypeScript Target Language**: Phase 3 generates TypeScript with `@modelcontextprotocol/sdk` + Zod unless `--python` is passed. -- **stdio Transport**: Default transport is stdio (local subprocess). Phase 1 may override this to HTTP if it detects the target is a running web service. -- **Workflow Tools Over API Wrappers**: Phase 2 enforces the 5–15 tool granularity target. Operations that would always be called together are merged. -- **Evaluation Gate**: Phase 5 requires accuracy ≥ 7/10 before proceeding to registration. One regeneration attempt is permitted below that threshold. 
- -### Optional Behaviors (OFF unless enabled) - -- **`--python`**: Generate Python FastMCP instead of TypeScript. Best when the target is a Python library with a clean importable API. Enable by passing `--python` on invocation. -- **`--allow-destructive`**: Include write, update, and delete operations in the design. Enable by passing `--allow-destructive`. Requires explicit user decision — prompt for confirmation even when flag is set. -- **`--dry-run`**: Phase 6 prints the config snippet without writing to disk. Enable by passing `--dry-run`. -- **`--import`**: For Python targets only. Import the target library directly rather than calling it via subprocess/HTTP. Enable by passing `--import`. Only valid with `--python`. +This pipeline intentionally does NOT declare `context: fork`. The Phase 2 human review gate requires interactive back-and-forth with the user; forked context would isolate that interaction from the parent session. The trade-off (larger parent context) is acceptable given the Phase 2 gate is the pipeline's most important safety mechanism. --- @@ -106,7 +80,7 @@ The `{repo-slug}` is derived from the repository name: lowercase, hyphens for se 3. **Discover API surface**: REST endpoints, CLI commands, exported functions/classes. See `references/analysis-checklist.md` → API Surface Discovery. 4. **Identify auth mechanisms**: What credentials does this service require? (env vars, API key, OAuth, JWT). See `references/analysis-checklist.md` → Auth Pattern Recognition. 5. **Map data entities**: What "nouns" does this system manage? (users, issues, events, records) -6. **Classify operations as read or write**: Apply the explicit heuristic from `references/analysis-checklist.md` → Read/Write Classification. Every operation gets a classification. +6. **Classify operations as read or write**: Apply the explicit heuristic from `references/analysis-checklist.md` → Read/Write Classification. Every operation gets a classification. 
By default, only read operations are mapped to tools unless the user passes `--allow-destructive`. 7. **Detect transport signal**: Does the repo describe or expose a running HTTP server? If yes, note "HTTP transport candidate" in analysis. 8. **Estimate tool count**: Count meaningful read operations. If > 20, identify grouping opportunities. @@ -136,10 +110,10 @@ Write `mcp-design/{repo-slug}/analysis.md` with the schema from `references/anal 1. **Select target language**: TypeScript (default) or Python if `--python` is set. Record the decision and rationale in design.md. 2. **Select transport**: stdio (default) or HTTP if Phase 1 flagged the target as a service. Record decision. -3. **Assign operations to primitives**: Use the heuristic table from `references/design-rules.md` → Primitive Selection. In brief: tools for operations requiring parameters or having side effects; resources for stable, URI-addressable reference data; prompts only if clearly high-value. -4. **Filter destructive operations**: If `--allow-destructive` is NOT set, exclude all write/update/delete operations. Log each exclusion. +3. **Filter destructive operations**: If `--allow-destructive` is NOT set, exclude all write/update/delete operations. This is the default read-only constraint. Log each exclusion with clear reasoning. +4. **Assign operations to primitives**: Use the heuristic table from `references/design-rules.md` → Primitive Selection. In brief: tools for operations requiring parameters or having side effects; resources for stable, URI-addressable reference data; prompts only if clearly high-value. 5. **Name each tool**: Follow `{service}_{verb}_{noun}` snake_case convention. See `references/design-rules.md` → Tool Naming. -6. **Apply workflow-tool heuristic**: If two tools would always be called in sequence, merge them. Target 5–15 tools total. See `references/design-rules.md` → Tool Granularity. +6. 
**Apply workflow-tool heuristic**: If two tools would always be called in sequence, merge them. Enforce the 5–15 tool granularity target. Operations that would always be called together must be merged. See `references/design-rules.md` → Tool Granularity. 7. **Write tool descriptions and parameter schemas in prose**: Pre-code specification. Each tool entry must have: name, description (2–3 sentences), parameters (name, type, required/optional, description), expected response format, annotations (readOnlyHint, destructiveHint, idempotentHint). 8. **Write resource definitions** (if any): URI template, description, mime type. @@ -160,9 +134,9 @@ Show the user: - If "edit": apply the user's edits to `design.md`, then re-ask - If "abort": stop pipeline; artifacts remain on disk for manual continuation -This gate exists because Phase 1 analysis may misidentify the relevant API surface or pick wrong primitive types. A wrong `design.md` produces hundreds of lines of wrong code. User review at this point costs seconds. +**Why this gate exists**: Phase 1 analysis is probabilistic. The user knows their target repo; the model does not. A 30-second review catches a wrong tool scope that would take 10 minutes to debug in generated code. Phase 1 analysis may misidentify the relevant API surface or pick wrong primitive types. A wrong `design.md` produces hundreds of lines of wrong code. User review at this point costs seconds. -**Do NOT rationalize skipping this gate.** No time pressure, confidence level, or apparent obviousness of the design justifies proceeding without explicit user approval. +**Do NOT rationalize skipping this gate.** No time pressure, confidence level, or apparent obviousness of the design justifies proceeding without explicit user approval. The cost of wrong code is orders of magnitude higher than the cost of a review pause. **Gate**: User explicitly approves design.md by responding "y", "yes", or "proceed". 
Do not continue to Phase 3 without explicit approval. @@ -219,7 +193,7 @@ Generate tools one at a time from the design.md tool list. For each tool: 2. Implement the tool using the client approach (call target API/CLI), not import approach 3. Apply the auth pattern from analysis.md -**Gate**: All tools listed in design.md must be implemented. No stubs, no TODO comments in the generated code. Every file must be complete. +**Gate**: All tools listed in design.md must be implemented. No stubs, no TODO comments in the generated code. Every file must be complete. This ensures Phase 4 validation can run without placeholder blockers. --- @@ -236,7 +210,7 @@ Generate tools one at a time from the design.md tool list. For each tool: #### Fix Loop (Max 3 Iterations) 1. Run the build command. If it succeeds, proceed to Phase 5. -2. If it fails: parse the error output, identify the root cause, apply a fix. +2. If it fails: parse the error output, identify the root cause, apply a targeted fix. 3. Re-run the build. 4. Repeat up to 3 total attempts. @@ -246,7 +220,7 @@ After 3 failures: - Halt the pipeline - Tell the user: "Manual intervention required. Fix the errors above and re-run from Phase 4." -Do not attempt a 4th iteration. Three failed iterations means the design has a structural problem that automated fixing cannot resolve. +**Do not attempt a 4th iteration.** Three failed iterations means the design has a structural problem that automated fixing cannot resolve. More iterations produce increasingly speculative fixes without addressing the underlying design issue. **Gate**: Build must succeed before Phase 5 begins. @@ -288,8 +262,8 @@ Write `mcp-design/{repo-slug}/evaluation-report.md` with: #### Accuracy Gate - **Accuracy ≥ 7/10**: Proceed to Phase 6. -- **Accuracy < 7/10**: Attempt one Phase 3 regeneration (re-run GENERATE with additional guidance derived from which questions failed). Then re-run Phase 4 and Phase 5. 
-- **If accuracy still < 7/10 after regeneration**: Surface the evaluation report, explain which tools are failing, halt the pipeline. Do not attempt a third generation pass. +- **Accuracy < 7/10**: Attempt one Phase 3 regeneration (re-run GENERATE with additional guidance derived from which questions failed). Then re-run Phase 4 and Phase 5. This single regeneration pass is permitted to refine tool implementations based on evaluation feedback. +- **If accuracy still < 7/10 after regeneration**: Surface the evaluation report, explain which tools are failing, halt the pipeline. Do not attempt a third generation pass. Less than 7/10 accuracy means more than 30% of questions are wrong, which indicates the tool scope or implementation approach needs human-guided redesign. **Gate**: Accuracy ≥ 7/10 required to proceed to Phase 6. @@ -331,6 +305,8 @@ python3 ${CLAUDE_SKILL_DIR}/scripts/register_mcp.py \ With `--dry-run`: print the config snippet only; do not write. +The script enforces read-before-write semantics: it reads the existing config file before writing. It NEVER overwrites existing `mcpServers` entries. If the target server name already exists in the config, the script prints a warning and exits without modifying the file. + #### Post-Registration Output After successful registration, the script prints: @@ -363,11 +339,11 @@ The pipeline agent should additionally inform the user: "Test with `/mcp` or the ### Error: Zero tools after destructive filtering **Cause**: All discovered operations are write/update/delete and `--allow-destructive` is not set, leaving Phase 2 with no tools to design. -**Solution**: Inform the user. Suggest `--allow-destructive` if write operations are appropriate for their use case. +**Solution**: Inform the user. Suggest `--allow-destructive` if write operations are appropriate for their use case, with explicit confirmation required when the flag is set. 
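To make the read-only default and the zero-tools error concrete, here is a minimal sketch of the filtering step. It is an illustration only: the `Operation` type, the `filter_operations` and `design_candidates` helpers, and the sample operation names are hypothetical, and the real pipeline performs this step as part of writing `design.md`, not through a Python API.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    name: str  # hypothetical operation label taken from the Phase 1 analysis
    kind: str  # "read" or "write", as classified in Phase 1


def filter_operations(operations, allow_destructive=False):
    """Split operations into (kept, excluded); writes are dropped unless allowed."""
    kept, excluded = [], []
    for op in operations:
        if op.kind == "read" or allow_destructive:
            kept.append(op)
        else:
            excluded.append(op)  # each exclusion should be logged in design.md
    return kept, excluded


def design_candidates(operations, allow_destructive=False):
    """Return the operations eligible for tool design, or fail loudly if none remain."""
    kept, _excluded = filter_operations(operations, allow_destructive)
    if not kept:
        # Corresponds to "Error: Zero tools after destructive filtering"
        raise ValueError(
            "Zero tools after destructive filtering; "
            "consider --allow-destructive if write operations are appropriate"
        )
    return kept


ops = [Operation("list_issues", "read"), Operation("delete_issue", "write")]
print([op.name for op in design_candidates(ops)])
```

With the default settings only `list_issues` survives; a repository exposing nothing but write operations raises the error instead of silently producing an empty design.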
### Error: Build fails after 3 fix iterations **Cause**: Structural design problem in Phase 4 — type errors, import mismatches, or Zod schema conflicts that targeted fixes cannot resolve. -**Solution**: Surface the full error output. Halt the pipeline. Tell the user: "Manual intervention required. Fix the errors above and re-run from Phase 4." Do not attempt a 4th iteration. +**Solution**: Surface the full error output. Halt the pipeline. Tell the user: "Manual intervention required. Fix the errors above and re-run from Phase 4." Do not attempt a 4th iteration. This is a design problem, not a code problem. ### Error: Server fails to start during evaluation **Cause**: Build artifact missing, wrong path, or server crashes on startup in Phase 5. @@ -379,48 +355,6 @@ The pipeline agent should additionally inform the user: "Test with `/mcp` or the --- -## Anti-Patterns - -### Anti-Pattern 1: Skipping the Phase 2 Gate -**What it looks like**: Rationalizing "The design is obviously correct, no need to pause" and proceeding directly to Phase 3 generation. -**Why wrong**: Phase 1 analysis is probabilistic. The user knows their target repo; the model does not. A 30-second review catches a wrong tool scope that would take 10 minutes to debug in generated code. -**Do instead**: Always show design.md to the user and wait for explicit "y", "proceed", or "yes" before continuing. - -### Anti-Pattern 2: Generating API Wrappers -**What it looks like**: Exposing every endpoint as a tool for "comprehensive coverage" — resulting in 30+ tools with vague descriptions. -**Why wrong**: 30 tools with vague descriptions is worse than 8 tools with clear workflow semantics. The model wastes turns selecting between near-identical options. -**Do instead**: Enforce the 5-15 tool target in Phase 2. Merge operations that would always be called in sequence. Prioritize workflow-level tools over raw API wrappers. 
- -### Anti-Pattern 3: Continuing Past 3 Fix Iterations -**What it looks like**: Rationalizing "Just one more iteration — this error looks simple" after three failed Phase 4 build attempts. -**Why wrong**: If three targeted fix passes haven't resolved it, the design has a structural problem. More iterations produce increasingly speculative fixes. -**Do instead**: Surface the full error output, halt the pipeline, and ask the user to review design.md for structural issues. - -### Anti-Pattern 4: Writing Config Without Read-First -**What it looks like**: Writing the new MCP server entry directly to config, assuming "the config is probably empty." -**Why wrong**: The config may contain other MCP server entries the user depends on. Overwriting destroys existing registrations. -**Do instead**: Always read the config file before writing. Use `register_mcp.py` which enforces read-before-write and append-only semantics. - -### Anti-Pattern 5: Exposing Destructive Operations by Default -**What it looks like**: Including write, update, and delete operations in the design because "the user probably wants full CRUD coverage." -**Why wrong**: A model cannot reliably determine which write operations are safe to expose. Wrong assumptions here can delete production data. -**Do instead**: Default to read-only. Only include destructive operations when the user explicitly passes `--allow-destructive`. 
- ---- - -## Anti-Rationalization - -| Rationalization Attempt | Why It's Wrong | Required Action | -|------------------------|----------------|-----------------| -| "User seems impatient, I'll skip the design gate" | Impatience doesn't reduce the cost of wrong code generation | Show design.md; ask the question | -| "The analysis found obvious tools, design is clearly right" | Obvious to the model ≠ correct for the user's use case | Still show; still ask | -| "The build almost passes, 4th iteration will fix it" | Structural errors don't yield to more iterations | Surface errors; halt | -| "Accuracy is 6/10, close enough" | The gate exists because 6/10 means 40% of questions are wrong | Trigger regeneration pass | -| "Config file looks empty, safe to write directly" | Look ≠ verify; other entries may be present | read-before-write enforced by register_mcp.py | -| "I'll expose write tools too, it's more useful" | Usefulness judgment without user input is dangerous for destructive ops | Only expose write tools if `--allow-destructive` is set | - ---- - ## References - [Analysis Checklist](references/analysis-checklist.md) - Phase 1 discovery checklist and output schema diff --git a/pipelines/perses-dac-pipeline/SKILL.md b/pipelines/perses-dac-pipeline/SKILL.md index 95b7ae6e..d62a1727 100644 --- a/pipelines/perses-dac-pipeline/SKILL.md +++ b/pipelines/perses-dac-pipeline/SKILL.md @@ -35,37 +35,6 @@ routing: Set up and manage Dashboard-as-Code workflows with CUE or Go SDK. -## Operator Context - -This skill operates as a pipeline for Dashboard-as-Code workflows, from module initialization through CI/CD integration. 
- -### Hardcoded Behaviors (Always Apply) -- **One dashboard per file**: Follow Perses convention of one dashboard definition per file — keeps diffs clean and enables per-dashboard CI validation -- **Build before apply**: Always run `percli dac build` before `percli apply` — raw CUE/Go files cannot be applied directly -- **Validate built output**: Always run `percli lint` on built JSON/YAML before deploying — build success does not guarantee valid dashboard spec -- **Go SDK stdout warning**: Never log/print to stdout in Go DaC programs — `dac build` captures stdout as the dashboard definition, so any stray output corrupts it - -### Default Behaviors (ON unless disabled) -- **CUE SDK**: Default to CUE SDK unless user requests Go -- **JSON output**: Build to JSON format by default -- **Git-friendly**: Organize files for version control (one dashboard per file, clear naming) - -### Optional Behaviors (OFF unless enabled) -- **Go SDK**: Use Go SDK instead of CUE for teams more comfortable with Go -- **YAML output**: Build to YAML format instead of JSON - -## What This Skill CAN Do -- Initialize CUE or Go DaC modules -- Write dashboard definitions using SDK builders -- Build definitions to JSON/YAML -- Set up CI/CD with GitHub Actions -- Manage multi-dashboard repositories - -## What This Skill CANNOT Do -- Create custom plugins (use perses-plugin-create) -- Deploy Perses server (use perses-deploy) -- Migrate Grafana dashboards (use perses-grafana-migrate) - --- ## Instructions @@ -81,7 +50,9 @@ cue mod init my-dashboards percli dac setup cue mod tidy ``` -Requirements: `percli` >= v0.51.0, `cue` >= v0.12.0 + +- Requirements: `percli` >= v0.51.0, `cue` >= v0.12.0 +- Why CUE first: CUE SDK is the default unless the user explicitly requests Go, as it requires less runtime overhead and is idiomatic for configuration-as-code patterns. 
**Go SDK**: ```bash @@ -90,15 +61,19 @@ go mod init my-dashboards percli dac setup --language go go mod tidy ``` -Requirements: `percli` >= v0.44.0, Go installed -**Gate**: Module initialized, dependencies resolved. `cue mod tidy` or `go mod tidy` succeeds without errors. Proceed to Phase 2. +- Requirements: `percli` >= v0.44.0, Go installed +- When to use: Use Go SDK only when the team is significantly more comfortable with Go or when dashboard logic requires programmatic features beyond CUE's data-transformation capabilities. + +**Gate**: Module initialized, dependencies resolved. `cue mod tidy` or `go mod tidy` succeeds without errors. Verify `percli version` meets minimum requirement. Proceed to Phase 2. + +--- ### Phase 2: DEFINE -**Goal**: Write dashboard definitions using SDK builders. +**Goal**: Write dashboard definitions using SDK builders. Keep one dashboard per file to enable per-dashboard CI validation and clean diffs — this is the Perses convention and prevents merge conflicts when multiple dashboards coexist. -CUE example structure: +CUE directory structure: ``` dac/ ├── cue.mod/ @@ -110,10 +85,15 @@ dac/ └── variables.cue ``` +- **One dashboard per file**: Always split multiple dashboards into separate files in `dac/dashboards/`. A single file with all dashboards is simpler to write but impossible to maintain — every change triggers a full rebuild, diffs are unreadable, and code review becomes impossible. +- **Shared definitions**: Use `dac/shared/` for common datasources, panel templates, and variable definitions. Import them into your dashboard definitions via CUE's standard import syntax. + CUE DaC imports from `github.com/perses/perses/cue/dac-utils/*`. Go DaC imports from `github.com/perses/perses/go-sdk`. -**Gate**: Dashboard definitions written. Files parse without syntax errors. Proceed to Phase 3. +**Gate**: Dashboard definitions written. Files parse without syntax errors. Run `cue vet` (CUE) to validate structure. Proceed to Phase 3. 
+ +--- ### Phase 3: BUILD @@ -132,11 +112,16 @@ percli dac build -f main.go -ojson Output appears in `built/` directory. -**Gate**: Build succeeds, JSON/YAML output in `built/`. Proceed to Phase 4. +- **Always build before apply**: Raw `.cue` or `.go` files cannot be applied directly to Perses — `percli apply` requires the built JSON/YAML output. Running `percli apply` on source files will fail. +- **Never commit `built/`**: The `built/` directory is a derived artifact — add it to `.gitignore`. CI/CD will rebuild from source on every push, ensuring consistency. + +**Gate**: Build succeeds, JSON/YAML output in `built/` directory. Proceed to Phase 4. + +--- ### Phase 4: VALIDATE -**Goal**: Ensure built dashboards are valid. +**Goal**: Ensure built dashboards are valid according to Perses schema. ```bash percli lint -f built/cpu-monitoring.json @@ -144,7 +129,12 @@ percli lint -f built/cpu-monitoring.json percli lint -f built/cpu-monitoring.json --online ``` -**Gate**: Validation passes. Proceed to Phase 5. +- **Build success ≠ valid dashboard**: `percli dac build` only validates CUE/Go syntax, not Perses dashboard semantics. Panels can reference nonexistent datasources, variables can have wrong types, and fields can violate the schema — the build will still succeed. +- **Always lint**: Never skip this step. Invalid dashboards deployed to Perses break monitoring and cause incidents. + +**Gate**: Validation passes with no errors reported. Proceed to Phase 5. + +--- ### Phase 5: DEPLOY @@ -159,7 +149,9 @@ Verify deployment: percli get dashboard --project ``` -**Gate**: Dashboards deployed and accessible. Proceed to Phase 6 if CI/CD is requested. +**Gate**: Dashboards deployed and accessible via `percli get`. Proceed to Phase 6 if CI/CD is requested. + +--- ### Phase 6: CI/CD INTEGRATION (optional) @@ -182,7 +174,10 @@ jobs: password: ${{ secrets.PERSES_PASSWORD }} ``` -**Gate**: CI/CD pipeline configured and tested. Pipeline complete. 
+- **Use secrets and variables**: Never hardcode `PERSES_URL`, `PERSES_USERNAME`, or `PERSES_PASSWORD` into workflow YAML. Use GitHub repo Settings > Variables for `vars.PERSES_URL` and Settings > Secrets for credentials. URL changes and secret leaks are prevented this way. +- **Verify CI/CD locally first**: Test the pipeline by pushing to a development branch and verifying the workflow succeeds before declaring the pipeline complete. + +**Gate**: CI/CD pipeline configured, tested with a real push, and succeeds end-to-end. Pipeline complete. --- @@ -193,83 +188,11 @@ jobs: | `cue mod tidy` fails with "no CUE module found" | CUE module not initialized — `cue mod init` was skipped or ran in wrong directory | Run `cue mod init ` in the `dac/` directory before running `percli dac setup` | | `cue mod tidy` fails with version/dependency errors | CUE version too old (< 0.12.0) or incompatible with `percli dac setup` output | Verify `cue version` shows >= 0.12.0. Upgrade CUE if needed. Run `percli dac setup` again after upgrade | | `percli dac build` produces empty `built/` directory | CUE expression does not evaluate to a valid dashboard object, or the entry file path is wrong | Verify the file path passed to `-f` is correct. Ensure the CUE file evaluates to a Perses dashboard (imports `dac-utils` and uses dashboard builder). Check for CUE evaluation errors in stderr | -| `percli dac build` output contains non-JSON lines (Go SDK) | Go program prints to stdout via `fmt.Println`, `log.Println`, or similar — `dac build` captures all stdout as the dashboard definition | Remove ALL stdout prints from Go DaC code. Use `fmt.Fprintln(os.Stderr, ...)` for debug output. Check imported libraries for stray stdout writes | +| `percli dac build` output contains non-JSON lines (Go SDK) | Go program prints to stdout via `fmt.Println`, `log.Println`, or similar — `dac build` captures all stdout as the dashboard definition | **CRITICAL**: Remove ALL stdout prints from Go DaC code. 
Use `fmt.Fprintln(os.Stderr, ...)` for debug output. Check imported libraries for stray stdout writes. Rebuild and verify output is valid JSON. This is a blocker. | | CI/CD pipeline fails with 401/403 | `PERSES_USERNAME` / `PERSES_PASSWORD` secrets not configured in GitHub repository, or credentials are wrong | Add secrets in GitHub repo Settings > Secrets. Verify credentials work locally with `percli login` first | | CI/CD pipeline fails with connection refused | `PERSES_URL` variable points to wrong server or server is not reachable from GitHub Actions runner | Verify the URL is publicly accessible (not localhost). Check `vars.PERSES_URL` is set in GitHub repo Settings > Variables | | `percli lint` fails on valid-looking JSON | Dashboard JSON is structurally valid but violates Perses schema (e.g., unknown panel type, missing required field) | This is the system working correctly — fix the dashboard definition and rebuild. Run `percli lint --online` against a live server for detailed validation errors | -### Error: CUE module initialization failure -**Cause**: `cue mod init` was skipped or ran in the wrong directory before running `percli dac setup` -**Solution**: Navigate to the `dac/` directory and run `cue mod init ` before `percli dac setup`. Verify `cue version` shows >= 0.12.0. - -### Error: Go SDK stdout contamination -**Cause**: Go DaC program prints to stdout via `fmt.Println` or `log.Println`, corrupting the dashboard JSON output from `percli dac build` -**Solution**: Remove all stdout writes from Go DaC code. Use `fmt.Fprintln(os.Stderr, ...)` for debug output. Check imported libraries for stray stdout writes. Rebuild and verify output is valid JSON. - -### Error: CI/CD authentication failure -**Cause**: `PERSES_USERNAME` / `PERSES_PASSWORD` secrets not configured in GitHub repository settings, or credentials are incorrect -**Solution**: Add secrets in GitHub repo Settings > Secrets. Verify credentials work locally with `percli login` first. 
Ensure `vars.PERSES_URL` points to the correct, publicly accessible server. - ---- - -## Anti-Patterns - -| Anti-Pattern | Why It Fails | Correct Approach | -|-------------|-------------|-----------------| -| **Multiple dashboards in one CUE file** | Makes diffs noisy, CI runs everything on any change, harder to review | One dashboard per file in `dac/dashboards/`. Use `dac/shared/` for common definitions | -| **Running `percli apply` directly on CUE/Go files** | `percli apply` expects built JSON/YAML, not raw source files — will fail or produce garbage | Always run `percli dac build` first, then `percli apply -f built/.json` | -| **Skipping `percli lint` after successful build** | Build success means valid CUE/Go syntax, NOT valid Perses dashboard — panels, datasources, variables can still be wrong | Always lint built output: `percli lint -f built/.json` | -| **Using `fmt.Println` / `log.Println` in Go SDK** | `percli dac build` captures stdout as dashboard JSON — any print statement corrupts the output with non-JSON text | Use `fmt.Fprintln(os.Stderr, ...)` for debug output. Remove all stdout writes before building | -| **Hardcoding Perses URL in CI workflow** | URL changes break the pipeline, credentials leak if URL contains auth | Use `vars.PERSES_URL` variable and `secrets.PERSES_USERNAME` / `secrets.PERSES_PASSWORD` secrets | -| **Committing the `built/` directory to git** | Built artifacts are derived output — committing them causes merge conflicts and staleness | Add `built/` to `.gitignore`. Let CI/CD rebuild from source on every push | - ---- - -## Anti-Rationalization - -These are common justifications for skipping steps. Each one leads to real failures. 
- -| Rationalization | Why It Fails | Required Action | -|----------------|-------------|-----------------| -| "Build succeeded, so the dashboard is valid" | `percli dac build` only validates CUE/Go syntax, not Perses dashboard semantics — panels can reference nonexistent datasources, variables can have wrong types | **Run `percli lint` on every built artifact before deploying** | -| "I tested locally, CI/CD will be fine" | Local environment has different `percli` version, different CUE cache, different auth context — CI runner starts fresh | **Verify CI pipeline succeeds end-to-end in a real push before declaring done** | -| "One file with all dashboards is simpler" | Simpler to write, nightmare to maintain — every change triggers full rebuild, diffs are unreadable, code review is impossible | **Split into one dashboard per file, always** | -| "I'll add linting to the pipeline later" | Later never comes. Invalid dashboards get deployed, break monitoring, cause incidents | **Add `percli lint` to the pipeline now, in the same PR that sets up CI** | -| "The Go SDK print statement is just for debugging" | `dac build` does not distinguish "debug" stdout from "dashboard" stdout — the print will ship to production in the JSON output | **Remove the print. Use stderr. Verify with `percli dac build` before committing** | - ---- - -## FORBIDDEN Patterns - -These patterns MUST NOT appear in any DaC pipeline output. Violation is a blocker. 
- -- **NEVER** run `percli apply` on raw `.cue` or `.go` source files — always build first -- **NEVER** print to stdout in Go SDK dashboard programs — this corrupts `dac build` output -- **NEVER** hardcode credentials in workflow YAML — use GitHub secrets -- **NEVER** skip the lint step between build and deploy — build success is not validation -- **NEVER** put multiple dashboard definitions in a single file — one dashboard per file, always -- **NEVER** commit the `built/` directory to version control — it is a derived artifact - ---- - -## Blocker Criteria - -Do NOT mark a phase as complete if any of these conditions exist: - -| Phase | Blocker | -|-------|---------| -| INITIALIZE | `cue mod tidy` or `go mod tidy` exits with non-zero status | -| INITIALIZE | `percli` version does not meet minimum requirement (CUE: v0.51.0+, Go: v0.44.0+) | -| DEFINE | CUE files have syntax errors (`cue vet` fails) | -| DEFINE | Multiple dashboards defined in a single file | -| BUILD | `percli dac build` produces no output in `built/` directory | -| BUILD | Built JSON contains non-JSON content (Go SDK stdout contamination) | -| VALIDATE | `percli lint` reports errors on any built artifact | -| DEPLOY | `percli apply` returns non-zero exit status | -| DEPLOY | Dashboard not visible via `percli get dashboard --project ` | -| CI/CD | Workflow YAML contains hardcoded credentials | -| CI/CD | Pipeline has not been tested with an actual push | - --- ## References diff --git a/pipelines/perses-plugin-pipeline/SKILL.md b/pipelines/perses-plugin-pipeline/SKILL.md index 5a91295f..c8e06f08 100644 --- a/pipelines/perses-plugin-pipeline/SKILL.md +++ b/pipelines/perses-plugin-pipeline/SKILL.md @@ -34,138 +34,14 @@ routing: 6-phase pipeline for complete Perses plugin development: from scaffold through deploy. -## Operator Context - -This skill operates as an end-to-end plugin development guide, enforcing phase gates between SCAFFOLD, SCHEMA, IMPLEMENT, TEST, BUILD, and DEPLOY. 
- -### Hardcoded Behaviors (Always Apply) -- **Phase gates enforced**: Do not proceed to next phase until current phase passes its gate criteria -- **Test before build**: `percli plugin test-schemas` must pass before running `percli plugin build` -- **Schema + component required**: Both CUE schema and React component must be implemented before BUILD -- **Verify archive contents**: After build, confirm archive contains package.json, mf-manifest.json, schemas/, and __mf/ -- **Use percli for all scaffolding**: Never manually create plugin directory structures — always use `percli plugin generate` - -### Default Behaviors (ON unless disabled) -- **CUE before React**: Author and validate schemas before implementing React components -- **JSON example creation**: Generate a JSON example alongside every CUE schema for validation -- **Hot-reload development**: Use `percli plugin start` during IMPLEMENT phase for live preview -- **Archive format**: Default to .tar.gz for distribution archives - -### Optional Behaviors (OFF unless enabled) -- **Grafana migration schema**: Write `migrate/migrate.cue` for Grafana panel/datasource migration -- **Multiple plugin types**: Scaffold multiple plugins in a single module (e.g., Panel + Datasource) -- **CI pipeline generation**: Create GitHub Actions or GitLab CI config for automated build/test - -## What This Skill CAN Do -- Scaffold new Perses plugins via `percli plugin generate` for all plugin types -- Author CUE schemas defining plugin data models and validation rules -- Implement React components using `@perses-dev/plugin-system` hooks and patterns -- Run schema validation via `percli plugin test-schemas` -- Build distribution archives via `percli plugin build` -- Deploy plugin archives to a running Perses server - -## What This Skill CANNOT Do -- Deploy or configure Perses servers (use perses-deploy) -- Create or manage dashboards (use perses-dashboard-create) -- Manage Kubernetes infrastructure (use kubernetes-helm-engineer) -- 
Debug Perses server-side issues (use perses-deploy or systematic-debugging) - ---- - -## Error Handling - -### `percli plugin generate` fails - -| Symptom | Cause | Fix | -|---------|-------|-----| -| "invalid plugin type" | Unrecognized `--plugin.type` value | Use one of: Panel, Datasource, TimeSeriesQuery, TraceQuery, ProfileQuery, LogQuery, Variable, Explore | -| "directory already exists" | Target directory conflicts with existing scaffold | Remove or rename existing directory, or use a different `--module.name` | -| "percli: command not found" | percli not installed | Install via `brew install perses/tap/percli` or download from GitHub releases | -| Non-zero exit with no message | Wrong percli version | Check `percli version` and upgrade if below required minimum | - -### CUE schema compilation errors - -| Symptom | Cause | Fix | -|---------|-------|-----| -| "package name mismatch" | CUE file uses wrong package name — Perses requires `package model` for all plugin schemas | Change the first line of the .cue file to `package model` | -| "cannot find package" / missing imports | Required CUE modules not imported | Add missing imports — common ones: `github.com/perses/shared/cue/common` for shared types | -| "conflicting values" | JSON example does not conform to CUE constraints | Fix JSON to match CUE type definitions, or relax CUE constraints | -| "cannot convert incomplete value" | CUE schema uses unresolved references | Ensure all referenced definitions are imported or defined locally | - -### React component build errors - -| Symptom | Cause | Fix | -|---------|-------|-----| -| "Cannot find module '@perses-dev/...'" | Missing Perses npm dependencies | Run `npm install @perses-dev/plugin-system @perses-dev/components` | -| "Node.js version mismatch" | Node version too old for rsbuild | Upgrade to Node 18+ (check with `node --version`) | -| TypeScript type errors | Component props don't match plugin spec types | Align component props with the generated spec 
types from CUE schema | -| rsbuild config errors | Invalid rsbuild.config.ts | Compare against percli-generated default; do not modify module federation settings | - -### `percli plugin build` archive incomplete - -| Symptom | Cause | Fix | -|---------|-------|-----| -| Missing mf-manifest.json | Build did not complete successfully | Re-run `percli plugin build` and check for earlier errors in build output | -| Missing schemas/ directory | CUE schemas not compiled to JSON | Run `percli plugin test-schemas` first, then rebuild | -| Wrong archive structure | Manual archive creation instead of percli | Always use `percli plugin build` — never manually zip | -| Archive too large | node_modules or build artifacts included | Check .npmignore or build config; only dist output should be archived | - -### Error: Plugin scaffold generation failure -**Cause**: Invalid `--plugin.type` value, directory name conflict, or `percli` not installed / version too old -**Solution**: Verify `percli version` meets minimum requirements. Use one of the valid plugin types: Panel, Datasource, TimeSeriesQuery, TraceQuery, ProfileQuery, LogQuery, Variable, Explore. Remove or rename conflicting directories. - -### Error: CUE schema compilation failure -**Cause**: Wrong package name (must be `package model`), missing CUE module imports, or JSON example does not conform to CUE constraints -**Solution**: Ensure every plugin `.cue` schema file declares `package model`. Add missing imports from `github.com/perses/shared/cue/common`. Fix JSON examples to match CUE type definitions. Run `percli plugin test-schemas` to validate. - -### Error: Build archive incomplete -**Cause**: `percli plugin build` succeeded but archive is missing `mf-manifest.json`, `schemas/`, or `__mf/` directory -**Solution**: Re-run `percli plugin build` and check for build errors in output. Ensure `percli plugin test-schemas` passes before building. Always use `percli plugin build` — never manually construct archives. 
List archive contents with `tar -tzf .tar.gz` to verify. - ---- - -## Anti-Patterns - -| Anti-Pattern | Why It Fails | Correct Approach | -|--------------|-------------|------------------| -| **Skipping SCHEMA phase** — jumping from scaffold to React implementation | Components lack type safety; no validation contract; schema errors surface late during build | Always complete SCHEMA and validate with `percli plugin test-schemas` before writing React code | -| **Building without testing schemas** — running `percli plugin build` before `percli plugin test-schemas` | Build may succeed but produce invalid schemas that fail at runtime in Perses server | Run `percli plugin test-schemas` and fix all CUE errors before building | -| **Not verifying archive contents** — trusting build output without inspection | Archive may be missing mf-manifest.json or schemas, causing silent failures on deploy | After every build, list archive contents and confirm required files exist | -| **Manually creating directory structure** — instead of using `percli plugin generate` | Missing module federation config, incorrect package.json, no rsbuild setup | Always start with `percli plugin generate` even if you plan to customize later | - -## Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|---------------|-----------------| -| "The CUE schema looks correct, skip test-schemas" | CUE has subtle type rules; visual inspection misses constraint errors | **Run `percli plugin test-schemas`** — it catches what reading cannot | -| "Build succeeded so the archive is fine" | Build exit code 0 does not guarantee complete archive contents | **List and verify archive contents** — check for mf-manifest.json, schemas/, __mf/ | -| "Schema changes are small, no need to re-test" | Small CUE changes can cascade into type mismatches across the schema | **Re-run test-schemas after every schema change**, no matter how small | -| "The component renders in dev, skip 
deploy verification" | Hot-reload dev mode bypasses plugin loading path; deploy uses archive loading | **Test the actual deployed plugin** in a Perses server instance | - -## FORBIDDEN Patterns -- **NEVER** modify percli-generated rsbuild.config.ts module federation settings — this breaks plugin loading -- **NEVER** manually construct plugin archives (zip/tar) — always use `percli plugin build` -- **NEVER** skip CUE package declaration — every plugin .cue schema file must declare `package model` -- **NEVER** import from `@perses-dev/internal` — use only public API packages (`@perses-dev/plugin-system`, `@perses-dev/components`) -- **NEVER** hardcode Perses server URLs in plugin source — plugins receive context via the plugin system - -## Blocker Criteria - -Stop and ask the user before proceeding if: -- Plugin type is ambiguous (e.g., could be Panel or Variable) -- Target Perses server version is unknown (schema compatibility varies) -- No Perses server is available for DEPLOY phase testing -- Required `@perses-dev/*` package versions conflict with existing node_modules -- CUE schema requires imports from packages not present in the module - ---- - ## Instructions ### Phase 1: SCAFFOLD **Goal**: Generate plugin scaffold with correct structure. +**Why this phase first**: Always use `percli plugin generate` for scaffolding — never manually create directory structures. Manual approaches miss module federation config, package.json structure, and rsbuild setup. + 1. Determine plugin parameters: - `--module.org`: Organization name (e.g., `my-org`) - `--module.name`: Module name (e.g., `my-plugin-module`) @@ -183,57 +59,74 @@ percli plugin generate \ 3. Verify generated structure: package.json, rsbuild.config.ts, src/, schemas/ directories exist. -**Gate**: Scaffold generated, directory structure verified. Proceed to Phase 2. +**Gate**: Scaffold generated, directory structure verified. Phase gates enforced — do not proceed to Phase 2 until this gate passes. 
### Phase 2: SCHEMA **Goal**: Author CUE schema defining the plugin's data model. +**Why this phase before React**: CUE before React always — author and validate schemas before implementing React components. Components lack type safety and validation contracts if schema comes later. Schema errors surface late during build if skipped. + 1. Create CUE schema at `schemas///.cue` - - Declare correct package: `package model` + - Declare correct package: `package model` (mandatory — Perses requires this in all plugin schemas) - Define the plugin's spec structure with CUE types and constraints - Import common Perses schema packages as needed 2. Create JSON example at `schemas///.json` + - Generate JSON examples alongside every CUE schema for validation - Must validate against the CUE schema - Serves as documentation and test fixture 3. Optional: Write Grafana migration schema at `schemas///migrate/migrate.cue` 4. Validate: `percli plugin test-schemas` + - Visual inspection misses CUE's subtle type rules — always run test-schemas + - Re-run after every schema change, no matter how small -**Gate**: `percli plugin test-schemas` passes with zero errors. Proceed to Phase 3. +**Gate**: `percli plugin test-schemas` passes with zero errors. Do not proceed to Phase 3 until this gate passes. ### Phase 3: IMPLEMENT **Goal**: Build React component implementing the plugin UI. +**Why this phase after SCHEMA**: Schema validation gates this phase — you have type safety and a validation contract before writing React code. + 1. Implement component in `src///` - Use `@perses-dev/plugin-system` hooks (e.g., `useDataQueries`, `useTimeRange`) - Use `@perses-dev/components` for shared UI elements + - Never import from `@perses-dev/internal` — use only public API packages - Follow Perses component patterns from existing plugins 2. Register plugin in module's plugin registration file 3. 
Use `percli plugin start` for hot-reload development against a running Perses server + - Default behavior: hot-reload development is ON -**Gate**: Component renders correctly in dev mode. Proceed to Phase 4. +**Gate**: Component renders correctly in dev mode. Do not proceed to Phase 4 until this gate passes. ### Phase 4: TEST **Goal**: Validate schemas and component behavior. +**Constraint**: Test before build always — `percli plugin test-schemas` must pass before running `percli plugin build`. Build success does not guarantee complete archive contents. + 1. Run `percli plugin test-schemas` — must pass (re-validate after any IMPLEMENT changes) + - Build exit code 0 does not guarantee complete archive contents — validation catches what exit codes miss 2. Run component unit tests if present (`npm test` or framework-specific runner) 3. Test with `percli plugin start` against a running Perses server — verify plugin appears and functions -**Gate**: All schema tests pass, component renders and functions correctly. Proceed to Phase 5. +**Gate**: All schema tests pass, component renders and functions correctly. Do not proceed to Phase 5 until this gate passes. ### Phase 5: BUILD **Goal**: Create distribution archive. +**Why this phase gated on TEST**: Test before build ensures schema validity. Build without test validation produces archives with invalid schemas that fail silently in Perses server. + 1. Run `percli plugin build` + - Never manually construct plugin archives (zip/tar) — always use `percli plugin build` + - Default format: .tar.gz for distribution archives + 2. Verify archive contents include: - `package.json` — plugin metadata - `mf-manifest.json` — module federation manifest @@ -243,14 +136,17 @@ percli plugin generate \ ```bash # Verify archive contents (adjust filename) tar -tzf .tar.gz | head -20 +# Always list and verify archive contents — don't trust exit code 0 ``` -**Gate**: Archive built, contents verified with all required files present. 
Proceed to Phase 6. +**Gate**: Archive built, contents verified with all required files present. Do not proceed to Phase 6 until this gate passes. ### Phase 6: DEPLOY **Goal**: Install plugin in Perses server and verify. +**Constraint**: Test the actual deployed plugin — hot-reload dev mode bypasses plugin loading path; deploy uses archive loading. + 1. Copy archive to Perses server's `plugins-archive/` directory 2. Restart Perses (or wait for hot-reload if enabled in server config) 3. Verify plugin loaded: @@ -264,6 +160,58 @@ percli get plugin --- +## Error Handling + +### `percli plugin generate` fails + +| Symptom | Cause | Fix | +|---------|-------|-----| +| "invalid plugin type" | Unrecognized `--plugin.type` value | Use one of: Panel, Datasource, TimeSeriesQuery, TraceQuery, ProfileQuery, LogQuery, Variable, Explore | +| "directory already exists" | Target directory conflicts with existing scaffold | Remove or rename existing directory, or use a different `--module.name` | +| "percli: command not found" | percli not installed | Install via `brew install perses/tap/percli` or download from GitHub releases | +| Non-zero exit with no message | Wrong percli version | Check `percli version` and upgrade if below required minimum | + +### CUE schema compilation errors + +| Symptom | Cause | Fix | +|---------|-------|-----| +| "package name mismatch" | CUE file uses wrong package name — Perses requires `package model` for all plugin schemas | Change the first line of the .cue file to `package model` | +| "cannot find package" / missing imports | Required CUE modules not imported | Add missing imports — common ones: `github.com/perses/shared/cue/common` for shared types | +| "conflicting values" | JSON example does not conform to CUE constraints | Fix JSON to match CUE type definitions, or relax CUE constraints | +| "cannot convert incomplete value" | CUE schema uses unresolved references | Ensure all referenced definitions are imported or defined locally | + +### 
React component build errors + +| Symptom | Cause | Fix | +|---------|-------|-----| +| "Cannot find module '@perses-dev/...'" | Missing Perses npm dependencies | Run `npm install @perses-dev/plugin-system @perses-dev/components` | +| "Node.js version mismatch" | Node version too old for rsbuild | Upgrade to Node 18+ (check with `node --version`) | +| TypeScript type errors | Component props don't match plugin spec types | Align component props with the generated spec types from CUE schema | +| rsbuild config errors | Invalid rsbuild.config.ts | Compare against percli-generated default; do not modify module federation settings | + +### `percli plugin build` archive incomplete + +| Symptom | Cause | Fix | +|---------|-------|-----| +| Missing mf-manifest.json | Build did not complete successfully | Re-run `percli plugin build` and check for earlier errors in build output | +| Missing schemas/ directory | CUE schemas not compiled to JSON | Run `percli plugin test-schemas` first, then rebuild | +| Wrong archive structure | Manual archive creation instead of percli | Always use `percli plugin build` — never manually zip | +| Archive too large | node_modules or build artifacts included | Check .npmignore or build config; only dist output should be archived | + +### Error: Plugin scaffold generation failure +**Cause**: Invalid `--plugin.type` value, directory name conflict, or `percli` not installed / version too old +**Solution**: Verify `percli version` meets minimum requirements. Use one of the valid plugin types: Panel, Datasource, TimeSeriesQuery, TraceQuery, ProfileQuery, LogQuery, Variable, Explore. Remove or rename conflicting directories. + +### Error: CUE schema compilation failure +**Cause**: Wrong package name (must be `package model`), missing CUE module imports, or JSON example does not conform to CUE constraints +**Solution**: Ensure every plugin `.cue` schema file declares `package model`. Add missing imports from `github.com/perses/shared/cue/common`. 
Fix JSON examples to match CUE type definitions. Run `percli plugin test-schemas` to validate. + +### Error: Build archive incomplete +**Cause**: `percli plugin build` succeeded but archive is missing `mf-manifest.json`, `schemas/`, or `__mf/` directory +**Solution**: Re-run `percli plugin build` and check for build errors in output. Ensure `percli plugin test-schemas` passes before building. Always use `percli plugin build` — never manually construct archives. List archive contents with `tar -tzf .tar.gz` to verify. + +--- + ## References - [Perses Plugin Development Guide](https://perses.dev/docs/plugins/development/) diff --git a/pipelines/pipeline-retro/SKILL.md b/pipelines/pipeline-retro/SKILL.md index 1a8b8b4e..4a5b3c66 100644 --- a/pipelines/pipeline-retro/SKILL.md +++ b/pipelines/pipeline-retro/SKILL.md @@ -57,44 +57,6 @@ Layer 3: VALIDATION -- Regenerate the affected skills + re-test The critical discipline: we NEVER patch a generated skill directly. Every fix goes through the generator so all future pipelines benefit. This is what makes the system self-improving rather than self-patching. -## Operator Context - -### Hardcoded Behaviors (Always Apply) - -- **Three-Layer Discipline**: NEVER edit a generated skill file to fix a test failure. WHY: Artifact fixes teach the system nothing -- the same bug will appear in every future pipeline the generator creates. Fix the generator, not the output. -- **Evidence-Required Rules**: NEVER add a rule to `architecture-rules.md` without citing the specific test failure that proved it necessary. WHY: The ADR mandates "Rules earn their place through data. No rule is added based on 'best practice' or 'should'." Rules without evidence accumulate into bloat that slows every future generation. -- **Trace Before Fix**: NEVER propose a generator fix without first tracing the failure to a specific link in the generation chain. WHY: Fixing the wrong link wastes a regeneration cycle and may introduce new problems. 
The 5-link chain analysis ensures you fix the root cause, not a symptom. -- **Prove Before Ship**: NEVER mark a generator fix as complete without regenerating the affected skill and re-testing. WHY: A fix that doesn't improve test results isn't a fix -- it's a guess. Layer 3 is what distinguishes this from wishful thinking. - -### Default Behaviors (ON unless disabled) - -- **Communication Style**: Report root causes and proposed fixes with evidence. Show the chain link analysis, not just the conclusion. -- **Temporary File Cleanup**: Clean up intermediate analysis files after the retro report is produced. -- **Batch Processing**: When multiple failures share the same root cause, propose one fix that addresses all of them rather than N separate fixes. -- **Conservative Application**: For complex fixes (new step types, restructured chains), present for review rather than auto-applying. For trivial fixes (template typos, missing rules with clear evidence), apply directly. - -### Optional Behaviors (OFF unless enabled) - -- **Verbose Trace**: Include the full content of each generation chain link examined (increases context cost but aids debugging). -- **Cross-Domain Analysis**: When multiple domains have been generated, look for patterns across domains (e.g., the same chain-error appearing in Prometheus and RabbitMQ pipelines). 
- -## What This Skill CAN Do - -- Load and parse pipeline-test-runner reports (manifest.json + content.md) -- Trace failures through the 5-link generation chain (research, composition, template, rules, step menu) -- Propose specific fixes to generator components with evidence citations -- Apply trivial fixes directly to generator files -- Re-invoke chain-composer, pipeline-scaffolder, and pipeline-test-runner for affected subdomains -- Compare before/after test results to validate fixes -- Produce a structured retro report as a dual-layer artifact - -## What This Skill CANNOT Do - -- **Fix generated skills directly**: This is Layer 1 -- explicitly forbidden by the Three-Layer Pattern -- **Add rules without evidence**: Architecture rules require empirical justification -- **Modify the step menu type system**: Step menu changes affect all pipeline composition and require separate review -- **Run against arbitrary skill failures**: This skill only processes pipeline-test-runner output, not general test failures - --- ## Instructions @@ -112,6 +74,8 @@ These are provided by `pipeline-orchestrator-engineer` when invoking Phase 6 (RE **Goal**: Load the test runner report and build a failure inventory. +**Hardcoded Constraint**: If all results are PASS, produce a minimal report and exit—do not proceed to Phase 2. + **Step 1**: Read the test runner `manifest.json`. Extract: - `status`: overall pipeline test status - `metrics.pass_count`, `metrics.fail_count`, `metrics.partial_count` @@ -145,6 +109,8 @@ Save the failure inventory to `/tmp/pipeline-retro-{domain}/failure-inventory.md **Goal**: For each failure, trace it to a specific link in the 5-link generation chain. +**Hardcoded Constraint**: NEVER propose a generator fix without first tracing the failure to a specific link. Fixing the wrong link wastes a regeneration cycle. The 5-link chain analysis ensures you fix the root cause, not a symptom. + The generation chain has 5 links. 
Each link is a component of the pipeline generator that contributed to the final output. The failure was introduced at one of these links -- the goal is to identify which one. **The 5 Links**: @@ -181,7 +147,7 @@ skills/{skill_name}/SKILL.md | 2. Chain Composition | Did the chain have the wrong steps or wrong step order for this task type? | `chain-error` | | 1. Domain Research | Did domain research misclassify the subdomain or miss critical information? | `research-miss` | -Walk from Link 5 backward to Link 1. The FIRST link whose output introduced the problem is the root cause. WHY: Fixing upstream links is more impactful (affects more future pipelines) but also riskier. We want the most specific fix possible -- the link closest to the failure that can resolve it. +Walk from Link 5 backward to Link 1. The FIRST link whose output introduced the problem is the root cause. Why: Fixing upstream links is more impactful (affects more future pipelines) but also riskier. We want the most specific fix possible -- the link closest to the failure that can resolve it. **Step 5**: If the failure cannot be attributed to any of the 5 links, classify it as `test-target-issue`. This means the test target was insufficient, not the generator. Examples: - Test target was too domain-specific for the generated pipeline to handle @@ -196,6 +162,10 @@ Save the trace analysis to `/tmp/pipeline-retro-{domain}/trace-analysis.md`. **Goal**: For each root cause, propose a specific fix to the generator component. +**Hardcoded Constraint**: NEVER add a rule to `architecture-rules.md` without citing the specific test failure that proved it necessary. Rules earn their place through data. Rules without evidence accumulate into bloat that slows every future generation. + +**Hardcoded Constraint**: For complex fixes (new step types, restructured chains), present for review rather than auto-applying. For trivial fixes (template typos, missing rules with clear evidence), apply directly. 
+ **Fix Proposals by Classification**: **`research-miss`** -- Domain research failed to discover the right information. @@ -222,7 +192,7 @@ Save the trace analysis to `/tmp/pipeline-retro-{domain}/trace-analysis.md`. - Target: `pipelines/pipeline-scaffolder/references/step-menu.md` - Propose: A new step entry with: name, output schema, consumes, parallel flag, when-to-use description - Evidence required: Show the gap in the chain that a new step type would fill, and why existing steps cannot cover it -- NOTE: Step menu changes are always presented for review, never auto-applied. WHY: Step menu changes affect the type system of ALL pipeline composition. A bad step type is worse than a missing one. +- NOTE: Step menu changes are always presented for review, never auto-applied. Step menu changes affect the type system of ALL pipeline composition. A bad step type is worse than a missing one. **`test-target-issue`** -- The test target was inadequate. - Target: The test runner configuration, not the generator @@ -246,6 +216,8 @@ Save proposed fixes to `/tmp/pipeline-retro-{domain}/proposed-fixes.md`. **Goal**: Apply generator fixes and regenerate affected pipelines to prove the fixes work. +**Hardcoded Constraint**: NEVER mark a generator fix as complete without regenerating the affected skill and re-testing. A fix that doesn't improve test results isn't a fix -- it's a guess. Layer 3 is what distinguishes this from wishful thinking. + **Step 1: Classify and apply fixes** For each proposed fix: @@ -303,6 +275,10 @@ If the fix must be reverted: **Goal**: Produce the retro report as a dual-layer artifact. +**Default Behavior**: Clean up intermediate analysis files after the retro report is produced. + +**Default Behavior**: When multiple failures share the same root cause, propose one fix that addresses all of them rather than N separate fixes. 
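The shared-root-cause default above can be sketched as a grouping step run before fix proposals are written. This is a minimal illustration, not the skill's actual implementation: the record fields (`subdomain`, `classification`, `target`) and the function name `group_by_root_cause` are hypothetical and only mirror the classifications and target files named in this document.

```python
from collections import defaultdict


def group_by_root_cause(failures):
    """Group traced failures so each distinct root cause yields one fix proposal.

    Each failure is a dict with hypothetical keys:
      subdomain      - the subdomain whose test failed
      classification - e.g. "chain-error", "research-miss"
      target         - the generator file the fix would touch
    """
    groups = defaultdict(list)
    for failure in failures:
        # Two failures share a root cause when both the classification
        # and the generator component to fix are the same.
        key = (failure["classification"], failure["target"])
        groups[key].append(failure["subdomain"])
    return dict(groups)


# Illustrative input only; these subdomain names are made up.
traced = [
    {"subdomain": "alpha", "classification": "chain-error", "target": "chain-composer"},
    {"subdomain": "beta", "classification": "chain-error", "target": "chain-composer"},
    {"subdomain": "gamma", "classification": "research-miss", "target": "domain-research"},
]

proposals = group_by_root_cause(traced)
for (classification, target), subdomains in proposals.items():
    print(f"{classification} -> {target}: one fix covering {', '.join(subdomains)}")
```

Keying on both classification and target keeps two `chain-error` failures that point at different generator files from being merged into a single proposal.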
+ **Step 1: Create manifest.json** ```json @@ -446,65 +422,11 @@ Save the report to the pipeline run directory alongside the test runner output. --- -## Anti-Patterns - -### Layer 1 Temptation -**What it looks like**: Opening a generated skill file and editing it to fix the test failure. "I'll just add a missing step to this one skill." -**Why wrong**: The fix dies with this one pipeline. The next time the generator runs, it produces the same bug. You've spent effort that doesn't compound. -**Do instead**: Trace the failure to the generator component (template, rules, chain, step menu) and fix it there. The fix then propagates to ALL future pipelines. - -### Rules Without Evidence -**What it looks like**: Adding a rule to architecture-rules.md because it "seems like a good idea" or is "best practice". -**Why wrong**: The ADR explicitly forbids this: "Rules earn their place through data. No rule is added based on 'best practice' or 'should'." Evidence-free rules accumulate into bloat that slows generation without proven benefit. -**Do instead**: Every rule MUST cite the specific test failure (with evidence ID) that proved it necessary. If you can't point to a failure, you don't have a rule -- you have an opinion. - -### Premature Regeneration -**What it looks like**: Applying a fix and immediately regenerating without tracing the root cause. "Let me just try this change and see if it helps." -**Why wrong**: Trial-and-error wastes regeneration cycles and can introduce new issues. Each regeneration cycle involves chain-composer, scaffolder, and test-runner -- it's expensive. -**Do instead**: Complete the full trace analysis (Phase 2) before proposing any fix (Phase 3). Understand the root cause before attempting a remedy. - -### Ignoring test-target-issue Classification -**What it looks like**: Forcing a generator fix when the real problem is the test target. "The test said FAIL, so something in the generator must be wrong." 
-**Why wrong**: Not all failures are generator failures. Some test targets are ambiguous, too narrow, or require capabilities outside the generated pipeline's scope. Fixing the generator for a bad test target introduces unnecessary rules. -**Do instead**: Honestly classify `test-target-issue` when appropriate. Report it in the retro, suggest better test targets, and move on. - -### Fixing Multiple Links Simultaneously -**What it looks like**: "The chain composition was wrong AND the template has a bug AND we need a new rule." Proposing fixes to 3 links at once. -**Why wrong**: When you change multiple variables, you can't attribute the improvement (or regression) to any specific fix. Layer 3 validation becomes meaningless. -**Do instead**: Fix one link at a time. Regenerate and re-test after each fix. If the first fix resolves the failure, stop. Only move to the next link if the first fix didn't help. - ---- - -## Anti-Rationalization - -| Rationalization Attempt | Why It's Wrong | Required Action | -|------------------------|----------------|-----------------| -| "Just this once, I'll fix the generated skill directly" | Layer 1 NEVER compounds. Every direct fix is effort that dies with one pipeline. | Trace to generator. Fix at Layer 2. Prove at Layer 3. | -| "This rule is obviously needed, no evidence required" | The ADR exists because obvious rules accumulate into un-obvious bloat. | Cite the specific test failure or don't add the rule. | -| "I know the root cause without reading the chain" | Skipping trace analysis leads to fixing symptoms. | Walk all 5 links. The first link that introduced the problem is the root cause. | -| "The fix is too complex to regenerate and test" | Untested fixes are guesses, not improvements. | If you can't prove it, don't ship it. Add to open items. | -| "All these failures are test target problems" | Blanket test-target-issue classification avoids accountability. | Trace each failure individually. 
Some may be test issues, but batch-dismissal is a rationalization. | -| "Let me fix multiple links to be thorough" | Multi-variable changes prevent causal attribution. | One link per regeneration cycle. Isolate the variable. | - -## Blocker Criteria - -STOP and ask the user (do NOT proceed autonomously) when: - -| Situation | Why Stop | Ask This | -|-----------|----------|----------| -| Proposed fix modifies the step menu type system | Type system changes affect ALL pipeline composition across ALL domains | "This fix requires a new step type / modified type compatibility. Review before applying?" | -| Proposed fix would invalidate existing validated chains | Existing pipelines may break | "This fix changes chain composition rules. N existing chains may need re-validation. Proceed?" | -| All failures classify as the same root cause | May indicate a systematic issue requiring architectural discussion | "All N failures trace to {classification}. This suggests a deeper issue. Discuss approach?" | -| Regeneration produced regressions | The fix made things worse somewhere | "Regeneration improved {N} subdomains but regressed {M}. Revert all, keep partial, or investigate?" 
| - -### Never Guess On -- Whether a failure is a generator issue or a test-target issue (trace it) -- Which link in the generation chain is the root cause (walk all 5) -- Whether a step menu change is safe (always present for review) -- Whether to revert or keep a fix that produced mixed results (ask the user) - ---- - ## References - **Pipeline Orchestrator**: [agents/pipeline-orchestrator-engineer.md](../../agents/pipeline-orchestrator-engineer.md) -- The agent that invokes this skill as Phase 6 +- **Three-Layer Pattern ADR**: [adr/self-improving-pipeline-generator.md](../../adr/self-improving-pipeline-generator.md) -- Design rationale for Layer 1/2/3 discipline +- **Chain Composer**: [pipelines/chain-composer/SKILL.md](../chain-composer/SKILL.md) -- Creates Pipeline Spec from domain research +- **Pipeline Scaffolder**: [pipelines/pipeline-scaffolder/SKILL.md](../pipeline-scaffolder/SKILL.md) -- Generates skills from Pipeline Spec +- **Pipeline Test Runner**: [pipelines/pipeline-test-runner/SKILL.md](../pipeline-test-runner/SKILL.md) -- Tests generated pipelines and produces retro input +- **Domain Research**: [pipelines/domain-research/SKILL.md](../domain-research/SKILL.md) -- Link 1 in the generation chain diff --git a/pipelines/pipeline-scaffolder/SKILL.md b/pipelines/pipeline-scaffolder/SKILL.md index 863b2798..34d05f3b 100644 --- a/pipelines/pipeline-scaffolder/SKILL.md +++ b/pipelines/pipeline-scaffolder/SKILL.md @@ -33,48 +33,11 @@ routing: # Pipeline Scaffolder Skill -## Operator Context - -This skill operates as the build engine of the self-improving pipeline generator. It consumes a Pipeline Spec JSON (produced by `chain-composer`) and scaffolds all components: 0-1 agents, N skills (one per subdomain), N sets of reference files, optional scripts, optional hooks, and routing entries for all N skills. It implements a **Spec-Driven Fan-Out** pattern -- parse the spec, validate it, fan out skill creation per subdomain, integrate into routing. 
- -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution. Project instructions override default skill behaviors. -- **Pipeline Spec Required**: The ONLY valid input is a Pipeline Spec JSON conforming to `references/pipeline-spec-format.md`. No freestyle scaffolding, no manual component manifests, no "just create a skill" requests. WHY: The spec is a validated, type-checked contract. Freestyle scaffolding produces orphan components and type mismatches. -- **Architecture Rules Enforcement**: Before generating any component, load and enforce all rules from `references/architecture-rules.md`. Every generated component must pass every rule. -- **Template Compliance**: Every agent MUST follow `AGENT_TEMPLATE_V2.md`. Every skill MUST be generated from `references/generated-skill-template.md`. WHY: Templates ensure structural consistency, which enables automated validation and routing integration. -- **No Monolithic Prompts**: Agent prompts MUST NOT exceed 10,000 words. If content exceeds this limit, move detail to `references/` subdirectory. -- **ADR Cascade**: Every generated skill MUST include Phase 0: ADR as its first instruction phase. WHY: ADRs prevent context drift across phases and provide grading artifacts for retrospectives. The ADR mandate cascades from generator to generated. -- **ADR Hash Verification**: Before scaffolding, verify the ADR has not been modified since session registration: `python3 ~/.claude/scripts/adr-query.py verify --adr {adr_path} --hash {hash}`. If verification fails (exit 1), stop and re-register. Check `adr-query.py list` for related ADRs during discovery. -- **Parallel Research Enforcement**: When a generated skill's chain includes a research-gathering step, the generated phase MUST use parallel multi-agent dispatch per Rule 12. WHY: A/B testing proved sequential research loses 1.40 points on Examples quality versus parallel dispatch. 
- -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show generated file paths, component counts, and key decisions rather than describing them. -- **Temporary File Cleanup**: Remove any intermediate generation files. Keep only the final pipeline components. -- **Naming Convention**: Agents follow `{domain}-{function}-engineer` pattern. Skills use `{group}-{function}`. Hooks use `{pipeline-name}-detector.py`. Scripts use `{domain}-{function}.py`. -- **Group-Prefix Consistency**: New skills MUST use the same prefix as related existing skills. Before naming, check `ls skills/ | grep {domain}` to find the group. Examples: voice skills start with `voice-`, Go skills with `go-`, PR skills with `pr-`, writing/content skills with `writing-`, review skills with `review-`. If no group exists, the new skill starts one. -- **Profile-Aware Generation**: Respect the `operator_profile` field from the Pipeline Spec. Include or exclude safety/interaction steps based on the profile. WHY: Personal profiles don't need APPROVE gates; production profiles require them. Over-gating personal workflows adds friction; under-gating production workflows creates risk. - -### Optional Behaviors (OFF unless enabled) -- **Reference File Generation**: Generate `references/anti-patterns.md` for new agents (ON for Complex agents, OFF for Simple). -- **Hook Test Generation**: Create a basic test script alongside any generated hook. -- **Dry Run Mode**: Parse and validate the spec, show planned components, but don't create files. 
- -## What This Skill CAN Do -- Parse and validate Pipeline Spec JSON against the format contract -- Generate N subdomain skills (one per subdomain entry) from the generated-skill-template -- Generate 0-1 domain agents following AGENT_TEMPLATE_V2.md -- Generate Python scripts with argparse CLI skeletons for each `scripts_needed` entry -- Generate reference file stubs for each `references_needed` entry -- Copy or symlink shared references into each skill's references directory -- Convert chain steps to numbered skill phases using the chain-to-phase mapping -- Wire all components into routing via `routing-table-updater` -- Validate the full component dependency graph (no orphans, correct bindings) - -## What This Skill CANNOT Do -- **Generate domain-specific business logic**: Scaffolded components are structural; domain logic comes from domain agents or the Pipeline Spec's chain definitions -- **Modify routing tables directly**: Routing updates are delegated to `routing-table-updater` -- **Compose pipeline chains**: Chain composition is handled by `chain-composer` upstream -- **Validate chain type compatibility**: Use `scripts/artifact-utils.py validate-chain` before invoking this skill +## Overview + +This skill operates as the build engine of the self-improving pipeline generator. It consumes a Pipeline Spec JSON (produced by `chain-composer`) and scaffolds all components: 0-1 agents, N skills (one per subdomain), N sets of reference files, optional scripts, optional hooks, and routing entries for all N skills. It implements a **Spec-Driven Fan-Out** pattern: parse the spec, validate it, fan out skill creation per subdomain, integrate into routing. + +The skill MUST read and follow repository CLAUDE.md files before execution (project instructions override default behaviors). It ONLY accepts Pipeline Spec JSON as input—no freestyle scaffolding, no manual component manifests, no "just create a skill" requests. 
The spec is a validated, type-checked contract; freestyle scaffolding produces orphan components and type mismatches. ## Instructions @@ -84,7 +47,7 @@ This skill operates as the build engine of the self-improving pipeline generator **Step 1**: Read the Pipeline Spec JSON. This is typically saved by `chain-composer` at a known path (passed as input to this skill or found in the ADR). -**Step 2**: Validate the spec against `references/pipeline-spec-format.md`. Check: +**Step 2**: Validate the spec against `references/pipeline-spec-format.md`. The spec is the only valid input contract—do NOT attempt to fix or reinterpret invalid specs; that is `chain-composer`'s responsibility. Check: Top-level: - [ ] Exactly one of `new_agent` or `reuse_agent` is non-null @@ -93,7 +56,7 @@ Top-level: - [ ] `domain` is lowercase kebab-case Per subdomain: -- [ ] Chain starts with ADR step (`family: "invariant"`) +- [ ] Chain starts with ADR step (`family: "invariant"`) — every generated skill MUST include Phase 0: ADR as its first instruction phase to prevent context drift and provide grading artifacts - [ ] Chain ends with terminal step (`output_schema: "pipeline-summary"`) - [ ] Type compatibility holds for all adjacent steps - [ ] `skill_name` matches pattern `^[a-z][a-z0-9]*(-[a-z][a-z0-9]*)+$` @@ -102,7 +65,7 @@ Per subdomain: - [ ] Every `params.script` reference exists in `scripts_needed` - [ ] Every `params.rules` / `params.template` reference exists in `references_needed` or `shared_references` -**Step 3**: If `reuse_agent` is non-null, verify the agent exists in `agents/INDEX.json`. If it does not exist, STOP -- the spec is invalid. +**Step 3**: If `reuse_agent` is non-null, verify the agent exists in `agents/INDEX.json`. If it does not exist, STOP—the spec is invalid. Trust the spec: if `reuse_agent` is set, upstream chain-composer already determined reuse is appropriate rather than fragmenting expertise across a new agent (Rule 9: Reuse Over Recreation). 
**Step 4**: Compute build targets from the spec: @@ -131,14 +94,14 @@ python3 ~/.claude/scripts/adr-query.py verify \ --hash {spec.adr_hash} ``` -- **Exit 0**: ADR matches — proceed with scaffolding -- **Exit 1**: ADR has changed since composition — STOP +- **Exit 0**: ADR matches—proceed with scaffolding +- **Exit 1**: ADR has changed since composition—STOP If exit 1: Do NOT proceed. The ADR changed after this Pipeline Spec was composed. The spec may now be inconsistent with the current ADR. Required action: Re-run `chain-composer` with the updated ADR to produce a fresh Pipeline Spec, then re-run scaffolder. If `adr_hash` field is absent from the spec: Log a warning and continue (older pipeline specs may not have this field). -**Gate**: Pipeline Spec loaded and valid. All validation checks pass. ADR integrity verified (or hash absent — warning logged). Component count established. Proceed to Phase 2. +**Gate**: Pipeline Spec loaded and valid. All validation checks pass. ADR integrity verified (or hash absent—warning logged). Component count established. Proceed to Phase 2. ### Phase 2: SCAFFOLD AGENT (if needed) @@ -150,12 +113,12 @@ If `adr_hash` field is absent from the spec: Log a warning and continue (older p **When creating a new agent**: -**Step 1**: Read `AGENT_TEMPLATE_V2.md` for the structural template. +**Step 1**: Read `AGENT_TEMPLATE_V2.md` for the structural template. Every agent MUST follow this template to ensure structural consistency, which enables automated validation and routing integration. 
**Step 2**: Generate the agent file at `agents/{new_agent.name}.md` with: - YAML frontmatter: name, version, description (with 3 examples), color, routing metadata - `routing.triggers` from `new_agent.triggers` -- `routing.pairs_with` listing ALL subdomain skill names from the spec +- `routing.pairs_with` listing ALL subdomain skill names from the spec—incomplete pairs_with means the agent can't be discovered for all subdomains - `routing.complexity` from `new_agent.complexity` - `routing.category` from `new_agent.category` - Operator declaration with `new_agent.expertise` items @@ -163,24 +126,22 @@ If `adr_hash` field is absent from the spec: Log a warning and continue (older p - Capabilities & Limitations - Instructions section referencing the subdomain skills - Error Handling (3+ categories) -- Anti-Patterns (3+ items) -- Anti-Rationalization table - Blocker Criteria - References section **Step 3**: If `new_agent.complexity` is Medium or higher, create `agents/{new_agent.name}/references/` directory. **Step 4**: Validate the agent: -- [ ] All 10 required AGENT_TEMPLATE_V2 sections present -- [ ] Main file under 10,000 words +- [ ] All required AGENT_TEMPLATE_V2 sections present +- [ ] Main file under 10,000 words—no monolithic prompts. If content exceeds this limit, move detail to `references/` subdirectory - [ ] `pairs_with` lists all N subdomain skill names - [ ] Naming follows `{domain}-{function}-engineer` pattern **Gate**: Either existing agent confirmed in INDEX.json, or new agent file created and validated. Proceed to Phase 3. -### Phase 3: SCAFFOLD SKILLS (Fan-Out -- one per subdomain) +### Phase 3: SCAFFOLD SKILLS (Fan-Out—one per subdomain) -**Goal**: Create all N subdomain skills using the generated-skill-template, plus their reference files and scripts. +**Goal**: Create all N subdomain skills using the generated-skill-template, plus their reference files and scripts. 
Template compliance ensures every skill MUST be generated from `references/generated-skill-template.md`. **Step 0: Load the template**. Read `references/generated-skill-template.md` once. This contains: - The SKILL.md template with `{{variable}}` placeholders @@ -191,7 +152,7 @@ If `adr_hash` field is absent from the spec: Log a warning and continue (older p - For simple chains (3-4 steps): batch 2-3 subdomain skills per sub-agent - For complex chains (5+ steps): dispatch one sub-agent per subdomain - Maximum 10 parallel sub-agents (system limit) -- Each sub-agent receives: the Pipeline Spec, the generated-skill-template, and architecture rules +- Each sub-agent receives: the Pipeline Spec, the generated-skill-template, and architecture rules to enforce on every component before generation **For each subdomain** (whether batched or individual): @@ -203,7 +164,7 @@ If `adr_hash` field is absent from the spec: Log a warning and continue (older p - `{{description}}` from `subdomain.description` - `{{agent_name}}` from the agent decision (`reuse_agent` or `new_agent.name`) - `{{routing_triggers_csv}}` from joining `subdomain.routing_triggers` -- `{{operator_profile_*}}` flags from top-level `operator_profile` +- `{{operator_profile_*}}` flags from top-level `operator_profile`—respect the profile field to include or exclude safety/interaction steps based on profile (personal profiles don't need APPROVE gates; production profiles require them) **Step 2: Convert chain steps to phases**. For each step in `subdomain.chain`: @@ -212,7 +173,7 @@ Map from step family to phase implementation using the chain-to-phase mapping in | Step Family | Phase Template | Key Customizations | |-------------|---------------|-------------------| | `invariant` | Phase 0: ADR | Always first. Uses domain + subdomain in ADR path. 
| -| `research-gathering` | Parallel Multi-Agent | `params.agents` count, `params.aspects` labels, `params.timeout_minutes` | +| `research-gathering` | Parallel Multi-Agent | `params.agents` count, `params.aspects` labels, `params.timeout_minutes`. MANDATORY: Use parallel multi-agent dispatch per Rule 12, validated by A/B test (sequential research loses 1.40 points on Examples quality, 0.60 on Completeness). | | `structuring` | Compile/Organize | Structures research into hierarchy | | `decision-planning` | Plan/Decide | Options, criteria, selection rationale | | `generation` | Generate | `params.template` reference, `params.voice` if applicable | @@ -231,26 +192,24 @@ For steps with `profile_gate` set: include the phase only if the gate matches th **Step 3: Generate error handling**. Use the task type default errors from the template. Include at minimum 3 error entries per skill. -**Step 4: Generate anti-patterns**. Use the task type default anti-patterns from the template. Include at minimum 3 anti-pattern entries per skill. - -**Step 5: Create the skill file** at `skills/{skill_name}/SKILL.md`. +**Step 4: Create the skill file** at `skills/{skill_name}/SKILL.md`. All skills MUST use consistent naming: group-prefix (e.g., `voice-`, `go-`, `pr-`, `writing-`, `review-`) matching related existing skills. Before naming, check `ls skills/ | grep {domain}` to find the group. If no group exists, the new skill starts one. No naming convention violations. 
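The naming checks described above can be sketched as a small deterministic helper. This is an assumption-laden illustration: the regular expression is the `skill_name` pattern from the Phase 1 checklist, but the function `check_skill_name`, its parameters, and the substring match standing in for `ls skills/ | grep {domain}` are hypothetical.

```python
import re

# Pattern taken from the Phase 1 validation checklist for `skill_name`.
SKILL_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z][a-z0-9]*)+$")


def check_skill_name(skill_name, existing_skills, domain):
    """Return a list of naming problems; an empty list means the name is acceptable.

    existing_skills stands in for the directory names under skills/,
    and domain for the term passed to grep. Both are hypothetical inputs.
    """
    problems = []
    if not SKILL_NAME_PATTERN.fullmatch(skill_name):
        problems.append(f"{skill_name!r} does not match the skill_name pattern")

    # Mirror `ls skills/ | grep {domain}`: any existing skill mentioning the domain.
    related = [name for name in existing_skills if domain in name]
    group_prefixes = {name.split("-", 1)[0] for name in related}

    # If a group already exists, the new skill must reuse its prefix.
    # If none exists, the new skill starts a group and nothing is flagged.
    if group_prefixes and skill_name.split("-", 1)[0] not in group_prefixes:
        problems.append(
            f"{skill_name!r} does not use an existing group prefix: "
            f"{sorted(group_prefixes)}"
        )
    return problems


existing = ["voice-writer", "go-testing"]
print(check_skill_name("voice-calibrator", existing, "voice"))
print(check_skill_name("speech-calibrator", existing, "voice"))
```

Returning a list of problems rather than raising on the first one lets a caller report every violation for a subdomain in a single validation pass.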
-**Step 6: Create reference files** for each entry in `subdomain.references_needed`: +**Step 5: Create reference files** for each entry in `subdomain.references_needed`: - Create directory `skills/{skill_name}/references/` if it doesn't exist - For each file in `references_needed`: - If it also appears in `shared_references`, copy from the shared source - If subdomain-specific, create a stub file with a header, purpose section, and placeholder content structure appropriate to the domain - Symlink or reference `architecture-rules.md` from `pipeline-scaffolder/references/` -**Step 7: Create scripts** for each entry in `subdomain.scripts_needed`: +**Step 6: Create scripts** for each entry in `subdomain.scripts_needed`: - Generate at `scripts/{filename}` with: - `#!/usr/bin/env python3` shebang - Module docstring with purpose, caller, usage - - `argparse` CLI with relevant subcommands for the domain + - `argparse` CLI with relevant subcommands for the domain—all scripts use argparse for consistency and discoverability; this is not optional - JSON output format - Exit codes: 0 = success, 1 = error - Validation logic stubs appropriate to the artifact type - - No LLM calls -- scripts are deterministic + - No LLM calls—scripts are deterministic operations **Validation per skill** (before proceeding to next subdomain): - [ ] YAML frontmatter has all required fields (name, description, version, user-invocable, agent, allowed-tools) @@ -258,13 +217,13 @@ For steps with `profile_gate` set: include the phase only if the gate matches th - [ ] Phase gates exist between all phases - [ ] `agent` field references the correct agent - [ ] All `references_needed` files exist in `skills/{skill_name}/references/` -- [ ] Research-gathering phases use parallel multi-agent dispatch (Rule 12) +- [ ] Research-gathering phases use parallel multi-agent dispatch (Rule 12 is not negotiable—validated by A/B test) **Gate**: All N skill files exist. All reference files exist. All scripts exist. 
Each skill passed per-skill validation. Proceed to Phase 4. ### Phase 4: INTEGRATE -**Goal**: Wire all components into routing and verify the full component dependency graph. +**Goal**: Wire all components into routing and verify the full component dependency graph. Unroutable skills are dead code (Rule 7—not negotiable). **Step 1: Routing integration**. Invoke `routing-table-updater` in batch mode: - Add all N skills to `skills/do/references/routing-tables.md` @@ -297,8 +256,8 @@ Script verification (for each script): Cross-cutting: - [ ] No orphan components (every component referenced by at least one other) -- [ ] No naming convention violations (Rule 8) -- [ ] No dual-responsibility components (Rule 1) +- [ ] No naming convention violations (group-prefix consistency, domain-function pattern) +- [ ] No dual-responsibility components **Gate**: All components routable via `/do`. Integration verified. No orphans. Proceed to Phase 5. @@ -373,8 +332,8 @@ To invoke each generated skill: ## Error Handling ### Error: Invalid Pipeline Spec -**Cause**: The spec fails validation in Phase 1 -- missing fields, type incompatibilities, invalid enums, or constraint violations. -**Solution**: Return the specific validation failure with the field path and expected value. Do NOT attempt to fix the spec -- that is `chain-composer`'s responsibility. Report the error to the orchestrator so it can re-invoke chain composition. +**Cause**: The spec fails validation in Phase 1—missing fields, type incompatibilities, invalid enums, or constraint violations. Specs are contracts; do NOT attempt to fix them during scaffolding. +**Solution**: Return the specific validation failure with the field path and expected value. This is `chain-composer`'s responsibility. Report the error to the orchestrator so it can re-invoke chain composition. ### Error: Agent Not Found **Cause**: `reuse_agent` references an agent name not present in `agents/INDEX.json`. 
@@ -386,59 +345,32 @@ To invoke each generated skill: ### Error: Script Syntax Invalid **Cause**: A generated Python script fails `ast.parse()`. -**Solution**: Re-generate the script skeleton. Scripts are simple argparse CLIs -- syntax errors usually mean a template variable was not substituted. Check for remaining `{{variable}}` markers. +**Solution**: Re-generate the script skeleton. Scripts are simple argparse CLIs—syntax errors usually mean a template variable was not substituted. Check for remaining `{{variable}}` markers. ### Error: Naming Convention Violation -**Cause**: A component name doesn't follow the pattern from Rule 8 (`{domain}-{function}` for skills, `{domain}-{function}-engineer` for agents). +**Cause**: A component name doesn't follow group-prefix pattern (`{domain}-{function}` for skills, `{domain}-{function}-engineer` for agents). **Solution**: Rename the component. The `skill_name` field in the spec should already be validated, so this error usually means a manual override went wrong. ### Error: Orphan Component **Cause**: A reference file or script exists but is not referenced by any skill's chain. -**Solution**: Either the spec has an unused entry in `references_needed` / `scripts_needed`, or the chain-to-phase mapping failed to include the reference. Check the spec for consistency (validation rules 10 and 11 should have caught this in Phase 1). - -## Anti-Patterns - -### Anti-Pattern 1: Monolithic Single Skill -**What it looks like**: Creating one skill for the entire domain instead of N skills (one per subdomain). -**Why wrong**: Monolithic skills dilute expertise, overload context, and can't be routed independently. Each subdomain has different task types needing different pipeline chains. -**Do instead**: Follow the spec -- one skill per subdomain entry. Same agent, different methodology. - -### Anti-Pattern 2: Freestyle Scaffolding -**What it looks like**: Creating skills without a Pipeline Spec JSON -- "just make a skill for X". 
-**Why wrong**: Without the spec, there is no validated chain, no type checking, no consistent structure. The result is skills that don't integrate with the pipeline system. -**Do instead**: Always require a Pipeline Spec JSON. If one doesn't exist, route to `chain-composer` first. - -### Anti-Pattern 3: Skip Routing Integration -**What it looks like**: Creating all skill files but not running `routing-table-updater`. -**Why wrong**: Unroutable skills are dead code (Rule 7). Users and the `/do` router can't discover them. -**Do instead**: Phase 4 integration is not optional. Every skill must be routable before scaffolding is considered complete. - -### Anti-Pattern 4: Copy Agent When Reuse Works -**What it looks like**: Creating a new agent when `reuse_agent` is set -- "the new one will be more specialized". -**Why wrong**: Violates Rule 9 (Reuse Over Recreation). If an existing agent covers 70%+ of the domain, binding new skills to it is better than fragmenting expertise across agents. -**Do instead**: Trust the spec. If `reuse_agent` is non-null, the upstream chain-composer already determined reuse is appropriate. - -### Anti-Pattern 5: Sequential Research in Generated Skills -**What it looks like**: Generating a skill whose research phase uses sequential grep/search instead of parallel agents. -**Why wrong**: A/B-tested loss: -1.40 points on Examples quality, -0.60 on Completeness (see `adr/pipeline-creator-ab-test.md`). Sequential research creates tunnel vision. -**Do instead**: Every research-gathering step in the chain MUST generate a parallel multi-agent phase. Use the template's chain-to-phase mapping for research families. - -## Anti-Rationalization - -| Rationalization Attempt | Why It's Wrong | Required Action | -|------------------------|----------------|-----------------| -| "This spec is close enough, I'll fix it during scaffolding" | Spec is the contract. Fixing it here means chain-composer has a bug. 
| Reject and return to chain-composer | -| "One skill can handle two subdomains" | Spec says N subdomains = N skills. Merging violates the design. | Follow the spec exactly | -| "This subdomain is too simple for a full skill" | Even simple subdomains need routing, ADR cascade, and phase gates. | Scaffold it anyway | -| "I'll add routing later" | Unroutable = dead code. Rule 7 is not negotiable. | Integrate in Phase 4 | -| "Sequential research is fine for this domain" | Rule 12 is validated by A/B test, not opinion. | Use parallel research | -| "The agent doesn't need all N skills in pairs_with" | Incomplete pairs_with means the agent can't be discovered for all subdomains. | List all N skills | -| "This script doesn't need argparse, it's simple" | All scripts use argparse for consistency and discoverability. | Add argparse CLI | +**Solution**: Either the spec has an unused entry in `references_needed`/`scripts_needed`, or the chain-to-phase mapping failed to include the reference. Check the spec for consistency (validation rules should have caught this in Phase 1). + +### Error: Monolithic Single Skill Created +**Cause**: Creating one skill for the entire domain instead of N skills (one per subdomain). +**Solution**: This violates the spec. Monolithic skills dilute expertise, overload context, and can't be routed independently. Each subdomain has different task types needing different pipeline chains. Follow the spec—one skill per subdomain entry, same agent, different methodology. + +### Error: Freestyle Scaffolding Attempted +**Cause**: Creating skills without a Pipeline Spec JSON—"just make a skill for X". +**Solution**: Without the spec, there is no validated chain, no type checking, no consistent structure. The result is skills that don't integrate with the pipeline system. Always require a Pipeline Spec JSON. If one doesn't exist, route to `chain-composer` first. 
+ +### Error: Routing Integration Skipped +**Cause**: All skill files exist but `routing-table-updater` was not run. +**Solution**: Unroutable skills are dead code (Rule 7—not negotiable). Users and the `/do` router can't discover them. Phase 4 integration is mandatory. Every skill must be routable before scaffolding is considered complete. ## References -- **Pipeline Spec Format**: [references/pipeline-spec-format.md](references/pipeline-spec-format.md) -- the input contract -- **Generated Skill Template**: [references/generated-skill-template.md](references/generated-skill-template.md) -- template for each subdomain skill -- **Architecture Rules**: [references/architecture-rules.md](references/architecture-rules.md) -- rules to enforce on all components -- **Step Menu**: [references/step-menu.md](references/step-menu.md) -- valid steps and type compatibility -- **Agent Template**: [../../AGENT_TEMPLATE_V2.md](../../AGENT_TEMPLATE_V2.md) -- template for new agents +- **Pipeline Spec Format**: [references/pipeline-spec-format.md](references/pipeline-spec-format.md)—the input contract +- **Generated Skill Template**: [references/generated-skill-template.md](references/generated-skill-template.md)—template for each subdomain skill +- **Architecture Rules**: [references/architecture-rules.md](references/architecture-rules.md)—rules to enforce on all components +- **Step Menu**: [references/step-menu.md](references/step-menu.md)—valid steps and type compatibility +- **Agent Template**: [../../AGENT_TEMPLATE_V2.md](../../AGENT_TEMPLATE_V2.md)—template for new agents diff --git a/pipelines/pipeline-test-runner/SKILL.md b/pipelines/pipeline-test-runner/SKILL.md index 7760ce44..2370eebd 100644 --- a/pipelines/pipeline-test-runner/SKILL.md +++ b/pipelines/pipeline-test-runner/SKILL.md @@ -36,47 +36,13 @@ routing: # Pipeline Test Runner -## Purpose +## Overview -Validate that generated pipeline skills actually work by running them against real targets. 
Chain validation (done by `chain-composer` and `scripts/artifact-utils.py validate-chain`) checks type compatibility between steps. This skill checks **execution** -- does the skill produce valid artifacts when given real input? The distinction matters because a chain can be type-valid but produce empty content, crash on domain-specific inputs, or timeout due to overly complex research phases. +This skill validates that generated pipeline skills actually work by running them against real targets. It is Phase 5 of the pipeline orchestrator's 7-phase flow, consuming the scaffolding output from `pipeline-scaffolder` and producing a report for `pipeline-retro`. -This skill is Phase 5 of the pipeline orchestrator's 7-phase flow. Its output feeds directly into `pipeline-retro` (Phase 6), which traces failures back to the generator using the Three-Layer Pattern. +**Why this skill exists**: Chain validation (done by `chain-composer`) checks type compatibility between steps. This skill checks **execution** -- does the pipeline produce valid artifacts when given real input? Type-safe chains can still fail at runtime: research phases may find nothing, generation may produce empty content, validation scripts may reject output, or timeout may occur from overly complex research. -## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution. Project instructions override default skill behaviors. -- **Pipeline Spec Required**: Input MUST include the Pipeline Spec JSON (same spec consumed by `pipeline-scaffolder`). The spec defines what subdomains exist, what skills were generated, and what scripts/references each skill expects. WHY: Without the spec, the test runner doesn't know what to test or what "success" looks like. -- **Per-Subdomain Results**: Every subdomain gets its own result (PASS/PARTIAL/FAIL/TIMEOUT). Never aggregate into a single pass/fail that hides individual failures. 
WHY: The retro skill (Phase 6) needs per-subdomain failure traces to fix the correct generator component. A blanket "FAIL" tells the retro nothing about which subdomain or which chain step broke. -- **No Production Targets**: Test against repo files, fixtures, or synthetic inputs only. Never invoke skills against live/external systems. WHY: Test runs happen during pipeline generation -- they must be safe, repeatable, and free of side effects. -- **Artifact Validation via Script**: Always use `scripts/artifact-utils.py validate-manifest` for manifest validation rather than manual JSON inspection. WHY: The script implements the canonical validation rules from the ADR. Manual checks will drift from the spec over time. - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show per-subdomain results table, not narrative descriptions. -- **Temporary File Cleanup**: Remove `/tmp/pipeline-test-*` directories after the report is produced. Keep only the final report artifacts. -- **Parallel Execution**: Fan out skill runs up to 5 in parallel. Batch larger sets into groups of 5. WHY: Parallel execution matches the pipeline architecture's "parallel over sequential" principle, but unbounded parallelism risks context exhaustion. -- **Timeout Enforcement**: 5 minutes per skill run. Skills that exceed this are classified as TIMEOUT rather than left to run indefinitely. 
- -### Optional Behaviors (OFF unless enabled) -- **Verbose Traces**: Include full skill output in the report (default: summary + failure traces only) -- **Skip Synthetic**: Disable synthetic target generation -- only test subdomains with real targets -- **Extended Timeout**: Increase per-skill timeout to 10 minutes (for complex chains with 7+ steps) - -## What This Skill CAN Do -- Discover appropriate test targets for each subdomain (fixtures, codebase files, synthetic) -- Run N subdomain skills in parallel batches of up to 5 -- Validate dual-layer artifacts (manifest.json + content.md) using `scripts/artifact-utils.py` -- Produce per-subdomain PASS/PARTIAL/FAIL/TIMEOUT classifications -- Generate failure traces linking failures to specific chain steps -- Produce a dual-layer report artifact for consumption by `pipeline-retro` - -## What This Skill CANNOT Do -- **Fix failed skills**: Fixing is `pipeline-retro`'s job (Three-Layer Pattern: Layer 2 fixes the generator, not the artifact) -- **Test against external systems**: Only repo files, fixtures, and synthetic targets -- **Evaluate content quality**: Tests structural validity (manifest exists, content non-empty, status complete) not semantic quality (is the generated PromQL correct?) -- **Run more than 10 subdomain tests per batch**: System limit on parallel agents - -When asked to perform unavailable actions, explain the limitation and suggest the appropriate skill. +**Key distinction**: You will always test against repo files, fixtures, or synthetic inputs only—never against live systems. Structural validation tests artifact validity (manifest exists, content non-empty, status complete), not content quality (is the generated PromQL correct?). Per-subdomain results are essential: if 4 of 5 subdomains pass and 1 fails, `pipeline-retro` needs to fix only that one component. 
--- @@ -85,26 +51,17 @@ When asked to perform unavailable actions, explain the limitation and suggest th ### Input This skill requires: -- **Pipeline Spec JSON**: The validated spec from `chain-composer` / `pipeline-scaffolder` (path or inline) -- **Scaffolding Report** (optional): The report from `pipeline-scaffolder` Phase 5, confirming all components exist - -The spec provides: -- `spec.subdomains[*].skill_name` -- what skills to test -- `spec.subdomains[*].routing_triggers` -- how to invoke each skill -- `spec.subdomains[*].scripts_needed` -- what validation scripts exist -- `spec.subdomains[*].references_needed` -- what domain references are available -- `spec.domain` -- the target domain (used for fixture/codebase search) -- `spec.reuse_agent` or `spec.new_agent.name` -- the executing agent +- **Pipeline Spec JSON**: The validated spec from `chain-composer` / `pipeline-scaffolder` (path or inline). Provides `spec.subdomains[*].skill_name`, `spec.subdomains[*].routing_triggers`, `spec.subdomains[*].scripts_needed`, `spec.subdomains[*].references_needed`, `spec.domain`, and executing agent. +- **Scaffolding Report** (optional): Confirmation that all components exist. ### Phase 1: DISCOVER TARGETS -**Goal**: For each subdomain skill, identify a suitable test target that exercises the skill's pipeline chain. +**Goal**: For each subdomain skill, identify a suitable test target. -**Why this phase exists**: A skill that passes chain validation (types match) can still fail when given real input. The research phase might find nothing relevant, the generation phase might produce empty content, or the validation script might reject the output. Testing against real targets catches these failures before the pipeline ships. +**Why**: A skill that passes chain validation can still fail when given real input. 
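
To make the Input contract concrete, the sketch below pulls the listed fields out of a spec that has already been parsed into a dict. The key names follow the Input description above. The helper name `extract_test_plan`, the sample values, and the assumption that `new_agent` is an object with a `name` key are illustrative only, and the real spec schema may differ.

```python
import json


def extract_test_plan(spec: dict) -> dict:
    """Collect the fields the test runner reads from a Pipeline Spec.

    Hypothetical helper: key names mirror the Input section, but this is
    not the canonical spec parser.
    """
    # The executing agent is either a reused agent or a newly created one.
    agent = spec.get("reuse_agent") or (spec.get("new_agent") or {}).get("name")
    if agent is None:
        raise ValueError("spec names neither reuse_agent nor new_agent.name")

    subdomains = spec.get("subdomains", [])
    if not subdomains:
        raise ValueError("spec.subdomains is empty; nothing to test")

    return {
        "domain": spec["domain"],
        "agent": agent,
        "skills": [
            {
                "skill_name": sd["skill_name"],
                "routing_triggers": sd.get("routing_triggers", []),
                "scripts_needed": sd.get("scripts_needed", []),
                "references_needed": sd.get("references_needed", []),
            }
            for sd in subdomains
        ],
    }


# Made-up spec, for illustration only.
example_spec = json.loads("""
{
  "domain": "prometheus",
  "reuse_agent": null,
  "new_agent": {"name": "prometheus-engineer"},
  "subdomains": [
    {"skill_name": "prometheus-metrics", "routing_triggers": ["promql"],
     "scripts_needed": ["promql-validator.py"], "references_needed": []}
  ]
}
""")
plan = extract_test_plan(example_spec)
```

Failing early on a missing agent or an empty subdomain list matches the spirit of treating a missing spec as a blocker, though where exactly to raise is a design choice of this sketch.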
**Step 1: Load the Pipeline Spec** -Read the spec and extract the subdomain list: ``` subdomains = spec.subdomains domain = spec.domain @@ -123,7 +80,7 @@ ls tests/fixtures/{subdomain.name}/ 2>/dev/null Fixture files are purpose-built for testing and are the most reliable targets. **Priority 2 -- Codebase files**: -Search the repo for files matching the subdomain's domain. The search strategy depends on the subdomain's task type: +Search the repo for files matching the subdomain's domain. The search strategy depends on task type: | Task Type | Search Strategy | |-----------|----------------| @@ -136,17 +93,13 @@ Search the repo for files matching the subdomain's domain. The search strategy d | `migration` | Find files in the "old" format | | `testing` | Find existing code that needs test generation | -Use Glob and Grep to search. Examples: -```bash -# For prometheus-metrics (generation, PromQL): -# Search for existing .rules files, prometheus configs, PromQL expressions -``` +Use Glob and Grep to search. **Priority 3 -- Synthetic targets**: -If no real targets exist, create a minimal synthetic target that exercises the skill's chain. The synthetic target should be: +If no real targets exist, create a minimal synthetic target. Synthetic targets must be: - Small enough to process quickly (under 50 lines) - Representative of the subdomain's input type -- Valid input that the skill's chain can actually consume +- Valid input that the chain can actually consume Save synthetic targets to `/tmp/pipeline-test-{run-id}/targets/{subdomain.name}/`. @@ -173,13 +126,15 @@ For each subdomain, record: } ``` -**Gate**: Every subdomain has a test target (fixture, codebase file, or synthetic). Target metadata recorded. Proceed to Phase 2. +**Gate**: Every subdomain has a test target (fixture, codebase file, or synthetic). Target metadata recorded. ### Phase 2: EXECUTE (Parallel Skill Runs) **Goal**: Run each subdomain skill against its test target and capture output artifacts. 
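
The batching rule for these runs (parallel batches of up to 5, with batches executed one after another) can be sketched as below. This is only an approximation: `run_one` stands in for dispatching a skill run through the Agent tool, which plain Python cannot call, and the names `batches`, `run_all`, and `fake_run` as well as the result keys are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, Sequence

BATCH_SIZE = 5  # "up to 5 in parallel" from this skill's parallel-execution rule


def batches(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive groups of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_all(subdomains: Sequence[str], run_one: Callable[[str], dict]) -> list[dict]:
    """Run batches sequentially; runs inside one batch execute concurrently."""
    results: list[dict] = []
    for batch in batches(subdomains):
        with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
            # pool.map preserves input order, so results line up with the batch.
            results.extend(pool.map(run_one, batch))
    return results


def fake_run(name: str) -> dict:
    """Placeholder runner that only echoes the subdomain name."""
    return {"subdomain": name, "exit_status": "success"}


names = [f"subdomain-{i}" for i in range(7)]
results = run_all(names, fake_run)
batch_sizes = [len(b) for b in batches(names)]
```

With seven subdomains this yields one batch of five followed by one batch of two, so at most five runs are in flight at any time.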
-**Why parallel**: Subdomain skills are independent -- `prometheus-metrics` has no dependency on `prometheus-alerting`. Sequential execution wastes time proportional to the number of subdomains. The A/B test validated that parallel execution produces equivalent quality to sequential (and the test runner cares about structural validity, not content quality). +**Why parallel**: Subdomain skills are independent. Sequential execution wastes time proportional to the number of subdomains. Parallel execution produces equivalent quality (the test runner cares about structural validity, not content quality). + +**Constraint**: Never run more than 10 subdomain tests per batch. Fan out up to 5 in parallel; batch larger sets into groups of 5. **Step 1: Create run directory** @@ -191,7 +146,7 @@ Use a timestamp-based run-id: `$(date +%Y%m%d-%H%M%S)`. **Step 2: Prepare invocation prompts** -For each subdomain, construct the prompt that will invoke the skill: +For each subdomain, construct the prompt: ``` Run the {skill_name} skill against this test target: @@ -211,19 +166,19 @@ Requirements: **Step 3: Fan-out execution** Dispatch skill runs in parallel batches: -- If N <= 5 subdomains: run all in parallel using the Agent tool +- If N ≤ 5 subdomains: run all in parallel using the Agent tool - If N > 5: batch into groups of 5, run each batch sequentially (each batch's skills run in parallel) For each skill run: -1. Invoke using the Agent tool with the subdomain's bound agent (`spec.reuse_agent` or `spec.new_agent.name`) -2. Pass the invocation prompt from Step 2 -3. Set timeout: 5 minutes (300,000ms) +1. Invoke using the Agent tool with the subdomain's bound agent +2. Pass the invocation prompt +3. Set timeout: 5 minutes (300,000ms). Constraint: Never increase timeout without user confirmation 4. On completion, record: exit status, execution time, output path 5. 
On timeout: record TIMEOUT status, note which phase was running when timeout hit (if determinable) **Step 4: Collect results** -After all runs complete (or timeout), collect: +After all runs complete (or timeout): ```json { "subdomain": "{name}", @@ -237,16 +192,18 @@ After all runs complete (or timeout), collect: } ``` -**Gate**: All skill runs completed (success, timeout, or error). Results collected for every subdomain. Proceed to Phase 3. +**Gate**: All skill runs completed (success, timeout, or error). Results collected for every subdomain. ### Phase 3: VALIDATE OUTPUTS **Goal**: Classify each skill's output as PASS, PARTIAL, FAIL, or TIMEOUT based on structural validity. -**Why structural validation, not semantic**: This skill checks "did the pipeline produce valid artifacts?" not "is the content correct?" Content quality evaluation requires domain expertise and is subjective. Structural validation is deterministic and catches the failures that matter for generator improvement -- missing manifests, empty content, broken chains. +**Why structural validation, not semantic**: This tests "did the pipeline produce valid artifacts?" not "is the content correct?" Content quality requires domain expertise and is subjective. Structural validation is deterministic and catches failures that matter for generator improvement. **Step 1: For each completed run, validate the manifest** +Always use `scripts/artifact-utils.py validate-manifest` for manifest validation. Never use manual JSON inspection. Constraint: The script implements canonical validation rules from the ADR; manual checks will drift from the spec over time. 
+ ```bash python3 ~/.claude/scripts/artifact-utils.py validate-manifest \ /tmp/pipeline-test-{run-id}/runs/{subdomain.name}/manifest.json @@ -262,20 +219,19 @@ Check the output: Check `content.md`: - Exists and is non-empty (> 0 bytes) - Contains actual content (not just headers or template placeholders) -- Check for `{{variable}}` markers that indicate template substitution failure +- No unsubstituted `{{variable}}` markers ```bash -# Check for unsubstituted template variables grep -c '{{' /tmp/pipeline-test-{run-id}/runs/{subdomain.name}/content.md ``` **Step 3: Check manifest status** If the manifest exists and is valid, read the `status` field: -- `"complete"`: the skill reports it finished successfully -- `"partial"`: the skill reports incomplete results (some phases ran, some didn't) -- `"failed"`: the skill reports explicit failure -- `"blocked"`: the skill was blocked by a gate +- `"complete"`: skill finished successfully +- `"partial"`: skill incomplete (some phases ran, some didn't) +- `"failed"`: skill reports explicit failure +- `"blocked"`: skill was blocked by a gate **Step 4: Run domain-specific validation (if applicable)** @@ -286,7 +242,7 @@ python3 ~/.claude/scripts/{domain-validator}.py validate \ /tmp/pipeline-test-{run-id}/runs/{subdomain.name}/content.md ``` -Domain scripts provide deeper validation than manifest checks (e.g., `promql-validator.py` checks PromQL syntax). Not all subdomains have domain scripts -- this step is optional. +Domain scripts provide deeper validation than manifest checks (e.g., `promql-validator.py` checks PromQL syntax). This step is optional if the script doesn't exist. 
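
The content and status checks from Steps 2 and 3 can be sketched in Python as follows. This is a sketch under assumptions, not the canonical validator: `check_outputs` is a hypothetical helper, it only tests for file presence, non-empty content, `{{` markers (mirroring the `grep` above), and the manifest `status` field, and it does not attempt the "not just headers" check. Manifest schema validation still belongs to `artifact-utils.py validate-manifest`.

```python
import json
import tempfile
from pathlib import Path


def check_outputs(run_dir: Path) -> tuple[str, str]:
    """Return (classification, reason) for one subdomain's run directory.

    Hypothetical helper covering only the content and status checks.
    """
    manifest_path = run_dir / "manifest.json"
    content_path = run_dir / "content.md"

    if not manifest_path.is_file():
        return "FAIL", "manifest_invalid"
    if not content_path.is_file() or content_path.stat().st_size == 0:
        return "FAIL", "content_empty"

    content = content_path.read_text(encoding="utf-8")
    if "{{" in content:  # same signal as `grep -c '{{'`
        return "FAIL", "template_markers"

    status = json.loads(manifest_path.read_text(encoding="utf-8")).get("status")
    if status in ("failed", "blocked"):
        return "FAIL", "status_failed"
    if status == "partial":
        # "status_partial" is an illustrative label, not a reason code from the skill.
        return "PARTIAL", "status_partial"
    if status == "complete":
        return "PASS", ""
    # Treating an unknown status as an invalid manifest is an assumption.
    return "FAIL", "manifest_invalid"


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "manifest.json").write_text('{"status": "complete"}', encoding="utf-8")
    (run_dir / "content.md").write_text("# Report\n\nrate({{metric}}[5m])\n", encoding="utf-8")
    marker_result = check_outputs(run_dir)
    (run_dir / "content.md").write_text("# Report\n\nrate(http_requests_total[5m])\n", encoding="utf-8")
    clean_result = check_outputs(run_dir)
```

The checks run from cheapest to most specific, so a missing file is reported before any content is read.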
**Step 5: Classify each subdomain** @@ -301,7 +257,7 @@ Apply the classification matrix: | Unsubstituted `{{variable}}` markers in content | **FAIL** | | Manifest status is `"failed"` or `"blocked"` | **FAIL** | | Manifest valid, content exists, but status is `"partial"` | **PARTIAL** | -| Manifest valid, content exists, domain validation has warnings | **PARTIAL** | +| Manifest valid, content exists, but domain validation has warnings | **PARTIAL** | | Manifest valid, content non-empty, status `"complete"`, domain validation passes | **PASS** | For each FAIL or PARTIAL classification, record the failure trace: @@ -311,17 +267,17 @@ For each FAIL or PARTIAL classification, record the failure trace: "classification": "FAIL|PARTIAL", "reason": "manifest_invalid|content_empty|status_failed|timeout|domain_validation_failed|template_markers", "detail": "Specific error message or validation output", - "chain_step_hint": "Best guess at which chain step failed (e.g., GENERATE produced empty content, VALIDATE script not found)" + "chain_step_hint": "Best guess at which chain step failed" } ``` -**Gate**: All outputs classified. Failure traces recorded for every non-PASS result. Proceed to Phase 4. +**Gate**: All outputs classified. Failure traces recorded for every non-PASS result. ### Phase 4: REPORT -**Goal**: Produce the test run report as a dual-layer artifact for consumption by `pipeline-retro`. +**Goal**: Produce the test run report as a dual-layer artifact for `pipeline-retro`. -**Why dual-layer**: The manifest enables `pipeline-retro` to programmatically determine the overall verdict and iterate over per-subdomain results. The content.md provides human-readable detail for debugging. Both layers are needed -- the retro skill reads the manifest to decide what to fix, and the human reads the content to understand what went wrong. +**Why dual-layer**: The manifest enables `pipeline-retro` to programmatically determine the verdict and iterate over per-subdomain results. 
The content.md provides human-readable detail for debugging. Both are needed. **Step 1: Compute overall verdict** @@ -420,15 +376,16 @@ For each failure, classify the likely root cause: | timeout | Chain too complex or research too broad | Chain composition (Layer 2) | | status_failed | Skill hit an error gate | Skill logic (Layer 2) | -The "Recommended Fix Layer" column tells pipeline-retro WHERE to apply the fix: -- **Generator (Layer 2)**: Fix the scaffolder template or generation logic -- **Test target**: The target was bad, not the skill -- redo discovery -- **Chain composition**: The chain design needs adjustment (too many steps, wrong step order) +The "Recommended Fix Layer" column tells pipeline-retro WHERE to apply the fix. This skill ONLY reports the likely cause; it does not decide or apply fixes. `pipeline-retro` decides what to fix. ``` **Step 4: Display summary** -Output the results table and overall verdict to the user. Include the report path for `pipeline-retro` to consume. +Output the results table and overall verdict to the user. Include the report path for `pipeline-retro` to consume. Report facts without self-congratulation. Show per-subdomain results table, not narrative descriptions. + +**Step 5: Cleanup** + +Remove `/tmp/pipeline-test-*` directories after the report is produced. Keep only the final report artifacts. **Gate**: Report artifacts exist at `/tmp/pipeline-test-{run-id}/report/`. Pipeline test run complete. @@ -438,11 +395,11 @@ Output the results table and overall verdict to the user. Include the report pat ### Error: Pipeline Spec Not Found **Cause**: The spec path provided doesn't exist or the spec wasn't passed as input. -**Solution**: Check the ADR for the spec path (usually saved by `chain-composer`). If the spec was consumed by `pipeline-scaffolder` but not saved, re-run chain composition or read the spec from the scaffolder's input artifacts. 
+**Solution**: Check the ADR for the spec path (usually saved by `chain-composer`). If the spec was consumed by `pipeline-scaffolder` but not saved, re-run chain composition or read the spec from the scaffolder's input artifacts. This is a blocker—ask the user for the spec path. ### Error: Skill File Not Found **Cause**: A subdomain's `skill_name` in the spec references a skill that doesn't exist at `skills/{skill_name}/SKILL.md`. -**Solution**: The scaffolder didn't create this skill. Check the scaffolding report for errors. This subdomain gets an automatic FAIL classification with reason `skill_not_found`. +**Solution**: The scaffolder didn't create this skill. Check the scaffolding report for errors. Classify this subdomain as FAIL with reason `skill_not_found`. ### Error: Agent Not Available **Cause**: The agent specified in `reuse_agent` or `new_agent.name` is not available for invocation. @@ -450,56 +407,13 @@ Output the results table and overall verdict to the user. Include the report pat ### Error: Validation Script Not Found **Cause**: A subdomain's `scripts_needed` references a script that doesn't exist at `scripts/{filename}`. -**Solution**: Skip domain-specific validation for this subdomain. Note the missing script in the failure trace. The subdomain can still PASS on structural validation alone -- the missing script is a separate issue for the scaffolder to fix. +**Solution**: Skip domain-specific validation for this subdomain. Note the missing script in the failure trace. The subdomain can still PASS on structural validation alone. ### Error: All Subdomains Timeout **Cause**: Every skill run exceeded the 5-minute timeout. -**Solution**: This usually means the generated skills have overly complex research phases or the agent is overloaded. Report all as TIMEOUT. In recommendations, suggest reducing `params.agents` count in research steps or simplifying chains. Consider enabling the Extended Timeout optional behavior for retry. 
- ---- - -## Anti-Patterns - -### Anti-Pattern 1: Skip Testing Because Chain Validates -**What it looks like**: "Chain validation passed, so the skills must work." -**Why wrong**: Chain validation checks TYPE compatibility (step A produces research-artifact, step B consumes research-artifact). Execution testing checks RUNTIME behavior (does the research phase actually find anything? does the generation phase produce non-empty content? does the validation script exist and pass?). Type-safe code can still crash at runtime. -**Do instead**: Always run the test phase. Chain validation and execution testing catch different failure classes. - -### Anti-Pattern 2: Same Target for All Subdomains -**What it looks like**: Using one generic file as the test target for every subdomain skill. -**Why wrong**: Each subdomain has a different task type and expects different input. A PromQL query file is meaningless to an alerting-rules skill. A generic target produces false negatives (skill fails because the input is wrong, not because the skill is broken) or false positives (skill "succeeds" by ignoring the irrelevant input). -**Do instead**: Follow the priority-ordered target discovery per subdomain. Each gets domain-appropriate input. - -### Anti-Pattern 3: Fail the Entire Run on One Subdomain Failure -**What it looks like**: "prometheus-metrics failed, so the whole test run is FAIL." -**Why wrong**: Per-subdomain results are essential for `pipeline-retro`. If 4 of 5 subdomains pass and 1 fails, the retro needs to fix only the one failing generator component. Blanket failure hides which parts work. -**Do instead**: Classify each subdomain independently. Compute overall verdict by majority, but always report per-subdomain results. - -### Anti-Pattern 4: Test Against External Systems -**What it looks like**: Running the Prometheus operations skill against a live Prometheus cluster. -**Why wrong**: Test runs happen during pipeline generation. 
They must be safe (no side effects), repeatable (same result every time), and fast (under 5 minutes). Live systems introduce network latency, authentication, and the risk of actual changes. -**Do instead**: Use repo files, fixtures, or synthetic targets exclusively. - -### Anti-Pattern 5: Retry Failed Skills Indefinitely -**What it looks like**: A skill fails, so the test runner re-runs it 3 times hoping for a different result. -**Why wrong**: Structural failures (missing manifest, empty content, template markers) are deterministic -- they will fail the same way every time. Retrying wastes time and context. The fix belongs in the generator (Layer 2), not in re-execution. -**Do instead**: Classify the failure, record the trace, and move on. The `pipeline-retro` skill handles fixing. - ---- - -## Anti-Rationalization - -| Rationalization Attempt | Why It's Wrong | Required Action | -|------------------------|----------------|-----------------| -| "The scaffolder validated everything, testing is redundant" | Scaffolder validates structure (YAML frontmatter, naming). Test runner validates execution (does it run and produce output?) | Run tests regardless of scaffolder results | -| "Synthetic targets are unrealistic, skip subdomains without fixtures" | Synthetic targets catch structural failures (missing manifest, empty content, template markers). They don't test domain quality, but they DO test pipeline mechanics. | Generate synthetic targets for subdomains without real targets | -| "One subdomain timed out, so all will timeout" | Timeouts depend on chain complexity, target size, and research scope. Each subdomain is independent. | Test all subdomains; don't extrapolate from one | -| "This failure is obviously a target issue, not a skill issue" | The test runner REPORTS the likely cause; it doesn't DECIDE. `pipeline-retro` decides what to fix. 
| Classify honestly, include recommendation, let retro decide | -| "We can skip the manifest validation script and just check JSON manually" | `artifact-utils.py validate-manifest` implements canonical rules from the ADR. Manual checks drift from the spec. | Always use the script | - ---- +**Solution**: Report all as TIMEOUT. In recommendations, suggest reducing `params.agents` count in research steps or simplifying chains. Ask the user before enabling Extended Timeout optional behavior for retry. -## Blocker Criteria +### Blocker Criteria STOP and ask the user (do NOT proceed autonomously) when: diff --git a/pipelines/pr-pipeline/SKILL.md b/pipelines/pr-pipeline/SKILL.md index a7928dad..104239b6 100644 --- a/pipelines/pr-pipeline/SKILL.md +++ b/pipelines/pr-pipeline/SKILL.md @@ -43,56 +43,6 @@ routing: A structured pipeline for creating high-quality pull requests with proper staging, meaningful commits, parallel review, and CI verification. -**Core Principle**: Quality gates at every phase. Never submit a PR that hasn't been reviewed and verified. - ---- - -## Operator Context - -This skill operates as an operator for the PR submission workflow, configuring Claude's behavior for structured, gate-enforced pull request creation. It implements the **Pipeline Architecture** pattern -- Classify Repo, Stage, Review, Commit, Push, Review-Fix Loop, Create, Verify -- with repo-aware gating and parallel review as dedicated gate-enforced phases. 
- -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md commit and branch rules -- **No Attribution**: Never add "Generated with Claude Code", "Co-Authored-By: Claude", or similar -- **Sensitive File Blocking**: Block staging of .env, credentials.json, secrets.*, *.pem, *.key files -- **Verify CI**: Always check CI status before marking pipeline complete -- **Review Before Commit**: Run parallel code review before creating the commit -- **Branch Protection**: Never push directly to main/master without explicit authorization -- **Repo-Aware Review Gate**: Classify repo in Phase 0 using `scripts/classify-repo.py`. For personal repos, run up to 3 iterations of `/pr-review` → fix before creating the PR. For protected-org repos, create the PR but NEVER auto-merge — their CI gates and human reviewers handle quality. -- **Organization-Gated Workflow**: Repos under protected organizations (configured in `scripts/classify-repo.py`) require user confirmation before EACH step: commit message approval, push approval, and PR creation approval. Never auto-execute any of these steps in protected-org repos. Present the proposed action and wait for user approval. After PR creation, STOP — no CI waiting, no merge attempts. - -### Default Behaviors (ON unless disabled) -- **Parallel Review**: Launch 3 reviewers (security, business logic, code quality) via Task calls -- **Branch Naming**: Use branch-naming skill for compliant branch names -- **CI Wait**: Wait for CI to complete and report status (timeout 10 minutes) -- **Conventional Commits**: Use conventional commit format focusing on WHAT and WHY -- **Upstream Tracking**: Push with -u flag for new branches - -### Optional Behaviors (OFF unless enabled) -- **Skip Parallel Review**: Use `--skip-review` to skip Phase 2 (parallel subagent review) for trivial changes (typos, formatting). Phase 4b (comprehensive `/pr-review` loop) still runs — it cannot be skipped. 
-- **Draft PR**: Use `--draft` to create draft instead of ready PR -- **No CI Wait**: Use `--no-wait` to skip CI verification phase -- **Custom Title**: Use `--title "..."` to override generated PR title -- **File Filter**: Use `--files "pattern"` to stage only matching files - -## What This Skill CAN Do -- Stage, commit, push, and create PRs in a single pipeline execution -- Run parallel code review with 3 independent reviewers before submission -- Block sensitive files from being staged or committed -- Generate meaningful PR titles and bodies from commit history and diff -- Wait for and report CI status after PR creation - -## What This Skill CANNOT Do -- Push to main/master without explicit user authorization -- Skip Phase 2 (parallel subagent review) unless `--skip-review` is explicitly passed. Phase 4b review-fix loop always runs for personal repos. -- Create PRs without any staged changes -- Fix CI failures (report them, then stop) -- Merge PRs (creation and verification only) - -### Worktree Agent Awareness - -When this pipeline runs inside a worktree agent (dispatched with `isolation: "worktree"`), the worktree creates a local branch that persists after the agent completes. This branch blocks `gh pr merge --delete-branch` and `git branch -d`. The dispatching agent or cleanup skill must run `git worktree remove ` before merging the PR or deleting the branch. If you are creating a PR from a worktree, note this in the PR body so the caller knows cleanup is required. - --- ## Instructions @@ -109,71 +59,21 @@ REPO_TYPE=$(python3 ~/.claude/scripts/classify-repo.py --type-only) | Repo Type | Review Policy | Merge Policy | Step Execution | |-----------|--------------|--------------|----------------| | `protected-org` | Phase 2 parallel review only (their reviewers handle comprehensive review) | **NEVER auto-merge**. Create PR, report URL, stop. 
| **Human-gated**: confirm commit message, push, and PR creation with user before each step | -| `personal` | Phase 2 parallel review + Phase 4b review-fix loop (max 3 iterations of `/pr-review` → fix) | Create PR after review passes | Auto-execute steps normally | +| `personal` | Phase 2 parallel review + Phase 4b review-fix loop (max 3 iterations of `/pr-review` -> fix) | Create PR after review passes | Auto-execute steps normally | + +Protected-org repos require user confirmation before EACH step (commit message approval, push approval, PR creation approval) because unauthorized actions in shared org repos can trigger CI storms, notify entire teams, or violate org policies. Never auto-execute any of these steps -- present the proposed action and wait for user approval. **Gate**: Repo type classified. Policy determined. ### Phase 0.5: PREFLIGHT CHECKLIST -**Goal**: Fail fast on environment issues before attempting PR creation. Every check produces a specific, actionable error message — not a generic "preflight failed." This checklist runs in seconds and prevents the far more expensive failure of a half-created PR. +**Goal**: Fail fast on environment issues before attempting PR creation. Every check produces a specific, actionable error message -- not a generic "preflight failed." -**Why this exists**: PR creation can fail mid-way because the working tree is dirty, the branch is main, or `gh` isn't authenticated — all discoverable before starting the pipeline. Catching these upfront avoids partial state (e.g., a commit pushed but no PR created). +PR creation can fail mid-way because the working tree is dirty, the branch is main, or `gh` isn't authenticated -- all discoverable before starting the pipeline. Catching these upfront avoids partial state (e.g., a commit pushed but no PR created). -Run all checks sequentially. Abort on the first failure. 
+Run 5 checks sequentially (verification status, clean working tree, correct branch, remote configured, `gh` authenticated). Abort on the first failure with a specific error message. -| # | Check | Command | Failure Action | -|---|-------|---------|---------------| -| 1 | Verification status (did quality gates pass?) | Check for recent test/build output or verification artifacts | Abort: "Run verification first — no evidence that quality gates passed." | -| 2 | Clean working tree (no uncommitted changes) | `git status --porcelain` | Abort: "Working tree is dirty. Uncommitted files:\n{list}. Stage or stash before running PR pipeline." | -| 3 | Correct branch (not main/master) | `git branch --show-current` | Abort: "Currently on {branch}. Create a feature branch first: `git checkout -b type/description`" | -| 4 | Remote configured for current branch | `git config --get branch.$(git branch --show-current).remote` | Abort: "No remote configured for branch. Push with: `git push -u origin $(git branch --show-current)`" | -| 5 | `gh` CLI authenticated | `gh auth status 2>&1` | Abort: "GitHub CLI not authenticated. Run: `gh auth login`" | - -```bash -# Preflight check sequence -echo "Running preflight checklist..." - -# Check 1: Verification status -# Look for verification artifacts (test output, build logs) — if the project -# has a test suite and no recent verification evidence exists, warn. -# This is a soft gate: skip if no test infrastructure is detected. - -# Check 2: Clean working tree -DIRTY=$(git status --porcelain) -if [ -n "$DIRTY" ]; then - echo "PREFLIGHT FAIL: Working tree is dirty." - echo "$DIRTY" - echo "Stage or stash uncommitted changes before running PR pipeline." - exit 1 -fi - -# Check 3: Not on main/master -BRANCH=$(git branch --show-current) -if [ "$BRANCH" = "main" ] || [ "$BRANCH" = "master" ]; then - echo "PREFLIGHT FAIL: On branch '$BRANCH'." 
- echo "Create a feature branch: git checkout -b type/description" - exit 1 -fi - -# Check 4: Remote configured -REMOTE=$(git config --get "branch.$BRANCH.remote" 2>/dev/null) -if [ -z "$REMOTE" ]; then - echo "PREFLIGHT FAIL: No remote configured for branch '$BRANCH'." - echo "Push with: git push -u origin $BRANCH" - exit 1 -fi - -# Check 5: gh CLI authenticated -if ! gh auth status >/dev/null 2>&1; then - echo "PREFLIGHT FAIL: GitHub CLI not authenticated." - echo "Run: gh auth login" - exit 1 -fi - -echo "Preflight checklist PASSED." -``` - -**Note on Check 1 (Verification status)**: This is context-dependent. If the project has a test suite (`go test`, `npm test`, `pytest`, etc.), look for evidence that tests were run recently (e.g., verification report files, recent test output in the session). If no test infrastructure exists, this check passes by default. The goal is to prevent submitting code that was never tested, not to block projects without tests. +See `references/preflight-checklist.md` for the full check table, bash script, and note on Check 1 (verification status). **Gate**: All preflight checks pass. Environment is ready for PR creation. Proceed to Phase 1. @@ -181,7 +81,11 @@ echo "Preflight checklist PASSED." **Goal**: Analyze working tree and stage appropriate changes. -**Step 1: Inspect changes** +**Step 1: Read and follow CLAUDE.md** + +Before staging, read the repository's CLAUDE.md for commit and branch rules. These rules override defaults because each repo has its own conventions for branch naming, commit format, and file organization. + +**Step 2: Inspect changes** ```bash # See what's changed @@ -192,9 +96,11 @@ git diff git diff --cached ``` -**Step 2: Block sensitive files** +**Step 3: Block sensitive files** + +Check every changed file against the blocklist. 
Sensitive files must be blocked here, before staging, because once committed they enter git history permanently -- removing them later requires history rewriting which is disruptive for all collaborators. -Check every changed file against the blocklist: +Blocklist: - `.env`, `.env.*` - `credentials.json`, `secrets.*` - `*.pem`, `*.key`, `*.p12` @@ -202,18 +108,24 @@ Check every changed file against the blocklist: If sensitive files are detected, STOP and report to user. Do not stage them. -**Step 3: Stage changes** +**Step 4: Stage changes** + +Stage specific files by name -- never run `git add -A` or `git add .` because blind staging captures unrelated changes, build artifacts, and debug logs that obscure review and pollute history. ```bash # Stage specific files (never git add -A blindly) git add [files] ``` +If the changeset spans 30+ files or multiple unrelated features, suggest the user split into focused PRs. Monolithic PRs are impossible to review effectively, carry high regression risk, and block other work. + **Gate**: Changes staged. No sensitive files included. Staged diff makes sense as a cohesive unit. ### Phase 2: REVIEW (Comprehensive Multi-Agent Review) -**Goal**: Catch ALL issues before they reach the commit. This is the full 3-wave, 20+ agent comprehensive review, not a lightweight pass. +**Goal**: Catch ALL issues before they reach the commit. Run the review loop before creating the commit because post-merge fixes cost 2 PRs instead of 1. + +**Skip condition**: Only if user explicitly passes `--skip-review`. One-line changes can still introduce security vulnerabilities or break business logic, so the default is always to review. **Invoke the comprehensive-review skill:** @@ -250,6 +162,8 @@ Map changes to conventional commit type: feat, fix, refactor, docs, test, chore, **Step 3: Create commit** +Write the commit message now with full context -- "I'll fix the commit message later" never happens, and git history is permanent. 
+ ```bash git commit -m "$(cat <<'EOF' type(scope): concise description of WHAT changed @@ -260,7 +174,7 @@ EOF )" ``` -Follow CLAUDE.md rules for commit messages. No attribution lines. +Follow CLAUDE.md rules for commit messages. Never add "Generated with Claude Code", "Co-Authored-By: Claude", or similar attribution lines because they add noise and violate most project commit conventions. **Protected-org repos**: Before executing the commit, present the proposed commit message to the user and wait for explicit approval. Show the full message and list of files that will be committed. @@ -272,6 +186,8 @@ Follow CLAUDE.md rules for commit messages. No attribution lines. **Step 1: Ensure correct branch** +Never push directly to main/master without explicit authorization -- this bypasses all review gates and can break the build for everyone. + ```bash # Check current branch git branch --show-current @@ -284,6 +200,8 @@ Use branch-naming skill if available for compliant names. **Step 2: Push with tracking** +Push with `-u` flag for new branches so subsequent pushes and PR creation can find the upstream automatically. + ```bash # CLAUDE_GATE_BYPASS=1 bypasses the git-submission-gate hook (this skill IS the gate) CLAUDE_GATE_BYPASS=1 git push -u origin $(git branch --show-current) @@ -301,60 +219,11 @@ Confirm push succeeded by checking output. If push fails (e.g., rejected), repor **Goal**: Iteratively review and fix issues until clean or max 3 iterations reached. -**Skip condition**: If `REPO_TYPE == "protected-org"`, skip this phase entirely. Protected-org repos have their own PR gates. - -**Loop**: Up to 3 iterations of `/pr-review` → fix → amend commit → push. 
- -``` -ITERATION = 0 -MAX_ITERATIONS = 3 - -while ITERATION < MAX_ITERATIONS: - ITERATION += 1 - - Step 1: Run /pr-review - Step 2: If no issues found → EXIT LOOP (proceed to Phase 5) - Step 3: Fix all reported issues - Step 4: Stage fixes, amend commit, force push to branch - Step 5: Report iteration results -``` - -**Step 1: Run `/pr-review`** - -Invoke the `/pr-review` command, which launches specialized review agents (code-reviewer, silent-failure-hunter, comment-analyzer, etc.) and captures retro learnings. - -**Step 2: Evaluate results** - -| Result | Action | -|--------|--------| -| No issues found | **Exit loop**. Proceed to Phase 5 (CREATE PR). | -| Issues found (iteration < 3) | Fix issues in Step 3, then re-review. | -| Issues remaining after iteration 3 | **Exit loop**. Include remaining issues in PR body as known items. Proceed to Phase 5. | - -**Step 3: Fix reported issues** - -Address each issue found by the review. This includes: -- Code quality fixes (naming, style, error handling) -- Documentation updates (stale references, missing README entries) -- Test gaps (if flagged) +**Skip condition**: If `REPO_TYPE == "protected-org"`, skip this phase entirely. Protected-org repos have their own PR gates. This phase cannot be skipped for personal repos -- even with `--skip-review` (which only skips Phase 2), this loop always runs because it is the final quality gate before PR creation. -**Step 4: Amend and push** +**Loop**: Up to 3 iterations of `/pr-review` -> fix -> amend commit -> push. After iteration 3, exit and document remaining issues in the PR body. 
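The bounded loop described above can be sketched as a small shell function. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: `review_fix_loop`, its two callback arguments, and the globals `FOUND`, `FIXED`, `REMAINING`, and `ITERATION` are hypothetical names introduced here. In the real pipeline the review step is `/pr-review`, and the fix step would also amend the commit and push.

```bash
#!/usr/bin/env bash
# Sketch only: review_cmd and fix_cmd are hypothetical callbacks.
# review_cmd must set the global FOUND (number of open issues).
# fix_cmd must set the global FIXED (number of issues it resolved); in the
# real pipeline this is also where the amend and push would happen.

MAX_ITERATIONS=3

review_fix_loop() {
  local review_cmd=$1 fix_cmd=$2
  ITERATION=0
  REMAINING=0
  while [ "$ITERATION" -lt "$MAX_ITERATIONS" ]; do
    ITERATION=$((ITERATION + 1))
    "$review_cmd"
    if [ "$FOUND" -eq 0 ]; then
      # A clean review ends the loop early.
      REMAINING=0
      echo "REVIEW-FIX ITERATION [$ITERATION/$MAX_ITERATIONS] Status: CLEAN"
      return 0
    fi
    "$fix_cmd"
    REMAINING=$((FOUND - FIXED))
    echo "REVIEW-FIX ITERATION [$ITERATION/$MAX_ITERATIONS] Found: $FOUND Fixed: $FIXED Remaining: $REMAINING"
  done
  # Iteration cap reached without a clean review; the caller documents
  # REMAINING in the PR body.
  echo "REVIEW-FIX Status: MAX ITERATIONS REACHED (Remaining: $REMAINING)"
  return 1
}
```

A non-zero return here means "cap reached", not "pipeline failed": the caller still proceeds to PR creation and lists the remaining issues.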
-```bash -git add [fixed files] -git commit --amend --no-edit -CLAUDE_GATE_BYPASS=1 git push --force-with-lease -``` - -**Step 5: Report iteration** - -``` -REVIEW-FIX ITERATION [N/3] - Found: [X issues] - Fixed: [Y issues] - Remaining: [Z issues] - Status: [CLEAN | FIXING | MAX ITERATIONS REACHED] -``` +See `references/review-fix-loop.md` for the full loop logic, steps 1-5 with code blocks, result table, and iteration report format. **Gate**: Review-fix loop complete. Either clean (0 issues) or max 3 iterations reached with remaining issues documented. @@ -364,76 +233,9 @@ REVIEW-FIX ITERATION [N/3] **Skip condition**: If the repo is NOT the claude-code-toolkit repo, skip this phase entirely. Detection: check if both `agents/` and `skills/` directories exist at the project root. If either is missing, skip directly to Phase 5. -```bash -# Detect toolkit repo -if [ -d "agents" ] && [ -d "skills" ]; then - echo "Toolkit repo detected — RETRO phase required" -else - echo "Not toolkit repo — skipping RETRO phase" - # Skip to Phase 5 -fi -``` - -**Step 1: Collect review findings** - -Gather all findings from Phase 2 (REVIEW) and Phase 4b (REVIEW-FIX LOOP) that were identified and fixed. Include: -- Security findings that were addressed -- Code quality issues that were corrected -- Business logic errors that were fixed -- Methodology gaps that were exposed - -For each finding, identify the **responsible agent or skill** — the component whose instructions should have prevented the issue. +Five steps: collect findings from Phases 2 and 4b, record per-component learnings, boost to 1.0 and graduate immediately, embed graduated patterns in the responsible agent/skill files, and stage the updated files. 
-**Step 2: Record learnings** - -For each finding, record a retro entry scoped to the responsible agent or skill: - -```bash -# For agent-scoped findings (e.g., python-general-engineer produced bad code) -python3 ~/.claude/scripts/learning-db.py learn --agent {agent-name} "pattern description from review finding" - -# For skill-scoped findings (e.g., reddit-moderate missed a test requirement) -python3 ~/.claude/scripts/learning-db.py learn --skill {skill-name} "pattern description from review finding" -``` - -**Step 3: Immediate graduation** - -Per /do Phase 5 policy, boost each entry to 1.0 confidence and graduate immediately. This is NOT a slow-burn learning — review findings in this repo are structural fixes. - -```bash -# Boost confidence to 1.0 (run boost 3x — each boost applies a multiplier) -python3 ~/.claude/scripts/learning-db.py boost "agent:{agent-name}" "{key}" -python3 ~/.claude/scripts/learning-db.py boost "agent:{agent-name}" "{key}" -python3 ~/.claude/scripts/learning-db.py boost "agent:{agent-name}" "{key}" - -# Graduate — marks as embedded, excludes from future prompt injection -python3 ~/.claude/scripts/learning-db.py graduate "agent:{agent-name}" "{key}" "agents/{agent-name}.md" -# Or for skills: -python3 ~/.claude/scripts/learning-db.py graduate "skill:{skill-name}" "{key}" "skills/{skill-name}/SKILL.md" -``` - -**Step 4: Embed in agent/skill** - -Update the responsible agent or skill file with the graduated pattern: - -| Finding Target | Update Location | Section to Modify | -|---------------|----------------|-------------------| -| Agent produced bad code | `agents/{name}.md` | FORBIDDEN patterns or Anti-Patterns | -| Skill methodology gap | `skills/{name}/SKILL.md` | Instructions or Anti-Patterns | -| Router missed a pattern | `skills/do/SKILL.md` | Routing tables or Force-Routes | -| Hook failed to catch | `hooks/{name}.py` | Detection logic | - -Write the pattern at the right abstraction level — generalize from the specific bug to the 
class of bug (e.g., "validate all CLI inputs" not "validate subreddit names in _cmd_classify"). - -**Step 5: Stage retro changes** - -```bash -# Stage updated agent/skill files alongside the code changes -git add agents/{updated-agent}.md -git add skills/{updated-skill}/SKILL.md -``` - -These changes will be included in the existing commit (amend in next push cycle) or in a new commit if Phase 4b already completed cleanly. +See `references/retro-adr-phases.md` for full steps, bash commands, and the finding-target table. **Gate**: All review findings recorded in learning.db, graduated to 1.0, and embedded in the responsible agent/skill files. Updated files staged for commit. @@ -441,26 +243,11 @@ These changes will be included in the existing commit (amend in next push cycle) **Goal**: Verify that all ADRs in the `adr/` directory have consistent format and valid status fields before the PR is created. -**Skip condition**: Same as Phase 4c — only runs in the toolkit repo (both `agents/` and `skills/` directories exist at root). - -**Step 1: Run ADR format check** - -```bash -python3 ~/.claude/scripts/adr-status.py check -``` +**Skip condition**: Same as Phase 4c -- only runs in the toolkit repo (both `agents/` and `skills/` directories exist at root). -If exit code 1 (warnings found): -- Review each warning (missing headings, empty status) -- Fix formatting issues in the ADR files -- Stage the fixes: `git add adr/` +Run `python3 ~/.claude/scripts/adr-status.py check`; fix any warnings and stage changes. Run `python3 ~/.claude/scripts/adr-status.py status` and include the summary in the PR body if the PR touches `adr/*.md` files. -**Step 2: Run ADR status report** - -```bash -python3 ~/.claude/scripts/adr-status.py status -``` - -Include the status summary in the PR body if the PR touches any `adr/*.md` files. This gives reviewers an at-a-glance view of ADR state. +See `references/retro-adr-phases.md` for full ADR commands and fix workflow. 
**Gate**: `python3 ~/.claude/scripts/adr-status.py check` exits 0. All ADRs have valid format. @@ -476,51 +263,14 @@ Analyze the full diff against the base branch and all commit messages to draft: **Step 1.5: Artifact-Driven PR Body Generation** -When planning artifacts exist, generate the PR body from them rather than writing freeform. Artifacts capture *intent* (why the change was made), which is more valuable to reviewers than a mechanical diff summary. +When planning artifacts exist (`task_plan.md`, verification reports, review summaries, deviation logs), generate the PR body from them rather than writing freeform. Artifacts capture *intent*, which is more valuable to reviewers than a mechanical diff summary. Fall back to diff-based generation when no artifacts exist. -Check for artifacts in this order and build the PR body from what's available: - -| Artifact | PR Section Generated | How to Extract | -|----------|---------------------|----------------| -| `task_plan.md` | **Summary** (from Goal section) and **Changes** (from completed tasks) | Read the Goal and Phases sections; list completed items as change bullets | -| Verification reports (`*-verification.md`, test output) | **Test Plan** (from verification output) | Extract pass/fail counts and key assertions verified | -| Review summaries (Phase 2 / Phase 4b output) | **Review Findings** (from reviewer results) | Summarize security/logic/quality verdicts | -| Deviation logs (ADR-076 repair actions) | **Deviations** section | List repair actions taken and why the original plan changed | - -```markdown -## PR Body Template (artifact-driven) - -## Summary - -- [Goal statement] -- [Key change 1 from completed tasks] -- [Key change 2 from completed tasks] - -## Changes - -- [Completed task description] -- [Completed task description] - -## Test Plan - -- [ ] [Verification result 1] -- [ ] [Verification result 2] - -## Review Findings - -Security: PASS -Business Logic: PASS -Code Quality: PASS - -## Deviations 
- -- [Deviation description and rationale] -``` - -**Fallback**: If no artifacts exist, fall back to diff-based generation — summarize changes from the diff and commit messages. This is the existing behavior and remains the default for ad-hoc PRs without planning artifacts. +See `references/pr-templates.md` for the full artifact table, PR body template, and fallback guidance. **Step 2: Create PR** +This pipeline cannot create PRs without staged changes -- if nothing is staged, the earlier phases would have caught this. + ```bash CLAUDE_GATE_BYPASS=1 gh pr create --title "type(scope): description" --body "$(cat <<'EOF' ## Summary @@ -539,7 +289,7 @@ EOF )" ``` -Add `--draft` flag if draft mode was requested. +Add `--draft` flag if draft mode was requested via `--draft`. **Protected-org repos**: Before creating the PR, present the title, body, and target branch to the user. Wait for explicit approval before executing `gh pr create`. @@ -553,7 +303,7 @@ Record and report the PR URL to the user. ``` PR PIPELINE COMPLETE (protected-org repo) -Protected-org repo detected — PR created for human review. +Protected-org repo detected -- PR created for human review. PR: https://github.com/your-org/your-repo/pull/123 Next steps are handled by org CI gates and human reviewers. @@ -562,7 +312,7 @@ This pipeline will NOT auto-merge protected-org PRs. ### Phase 6: VERIFY (personal repos only) -**Goal**: Wait for CI and report final status. +**Goal**: Wait for CI and report final status. Always check CI status before marking the pipeline complete because merging without CI confirmation risks shipping broken code. ```bash # Get the latest workflow run for this branch @@ -572,16 +322,16 @@ gh run list --branch $(git branch --show-current) --limit 1 gh run watch [run-id] --exit-status ``` -If CI fails, report which checks failed and the PR URL. Do NOT merge. Do NOT proceed to cleanup. +If CI fails, report which checks failed and the PR URL. Do NOT merge. Do NOT proceed to cleanup. 
This pipeline reports CI failures but does not fix them -- diagnosing CI requires different context than PR creation. If CI passes and user requested merge: ```bash CLAUDE_GATE_BYPASS=1 gh pr merge --merge --delete-branch ``` -**HARD RULE**: Never merge a PR with failing or pending CI. CI must pass first. The `ci-merge-gate.py` hook enforces this mechanically — it blocks `gh pr merge` when checks are failing or pending. Do NOT use `--admin` or any bypass to circumvent this. If CI fails on an "unrelated" test, investigate the root cause (date-dependent fixtures, flaky tests) rather than force-merging. +**HARD RULE**: Never merge a PR with failing or pending CI. CI must pass first. The `ci-merge-gate.py` hook enforces this mechanically -- it blocks `gh pr merge` when checks are failing or pending. Do NOT use `--admin` or any bypass to circumvent this. If CI fails on an "unrelated" test, investigate the root cause (date-dependent fixtures, flaky tests) rather than force-merging -- assuming CI is "probably flaky" masks real failures and normalizes broken builds. -*Graduated from learning.db — skill:pr-sync/17fed1ab26c7 (PR #55 merged with failing CI, led to broken main)* +*Graduated from learning.db -- skill:pr-sync/17fed1ab26c7 (PR #55 merged with failing CI, led to broken main)* If `--no-wait` was passed, skip this phase and report the PR URL immediately. @@ -616,73 +366,19 @@ For protected-org repos: skip cleanup (their processes handle branch lifecycle). 2. Update status from "Proposed" to "Accepted" in the ADR file 3. Move the ADR file to `adr/completed/` 4. Clear `.adr-session.json` -5. Report: `ADR updated: {name} → Accepted, moved to completed/` +5. Report: `ADR updated: {name} -> Accepted, moved to completed/` ADRs are gitignored (local-only), so this is a local file operation, not a git operation. **Gate**: Branch cleaned up (or skipped if PR is still open). Pipeline complete. 
---- +### Worktree Agent Awareness + +When this pipeline runs inside a worktree agent (dispatched with `isolation: "worktree"`), the worktree creates a local branch that persists after the agent completes. This branch blocks `gh pr merge --delete-branch` and `git branch -d`. The dispatching agent or cleanup skill must run `git worktree remove ` before merging the PR or deleting the branch. If you are creating a PR from a worktree, note this in the PR body so the caller knows cleanup is required. + +### Options Reference and Examples -## Examples - -### Example 1: Standard PR Submission (personal repo) -User says: "Submit a PR for these changes" -Actions: -1. Classify repo from remote URL (CLASSIFY REPO) -2. `git status`, review changes, stage files (STAGE) -3. Launch 3 parallel reviewers on staged diff (REVIEW) -4. Create conventional commit from staged changes (COMMIT) -5. Push branch to remote with tracking (PUSH) -6. Run review-fix loop: `/pr-review` → fix → re-review, up to 3 iterations (REVIEW-FIX LOOP) -7. Record and graduate review findings, embed in responsible agents/skills (RETRO, toolkit repo only) -8. Validate ADR format consistency (ADR VALIDATION, toolkit repo only) -9. Create PR with summary and review findings (CREATE PR) -10. Wait for CI, report status (VERIFY) -Result: PR URL with CI status and review-fix iteration count - -### Example 2: Draft PR for Work in Progress (personal repo) -User says: "Open a draft PR for what I have so far" -Actions: -1. Classify repo (CLASSIFY REPO) -2. Stage current changes, skip incomplete files if noted (STAGE) -3. Run parallel review (REVIEW) -4. Commit with `wip:` or appropriate prefix (COMMIT) -5. Push to feature branch (PUSH) -6. Run review-fix loop (REVIEW-FIX LOOP) -7. Record and graduate review findings (RETRO, toolkit repo only) -8. Validate ADR format consistency (ADR VALIDATION, toolkit repo only) -9. Create PR with `--draft` flag (CREATE PR) -10. 
Report PR URL, skip CI wait if `--no-wait` (VERIFY) -Result: Draft PR URL - -### Example 3: Trivial Change with Skip Parallel Review (personal repo) -User says: "Quick PR for this typo fix, skip review" -Actions: -1. Classify repo (CLASSIFY REPO) -2. Stage the single file change (STAGE) -3. Skip Phase 2 parallel review (--skip-review) -4. Commit: `fix(docs): correct typo in README` (COMMIT) -5. Push to branch (PUSH) -6. Run review-fix loop — Phase 4b still runs even with --skip-review (REVIEW-FIX LOOP) -7. Record and graduate review findings (RETRO, toolkit repo only) -8. Validate ADR format consistency (ADR VALIDATION, toolkit repo only) -9. Create PR with minimal body (CREATE PR) -10. Wait for CI (VERIFY) -Result: PR URL for typo fix - -### Example 4: Protected-Org Repo (human-gated workflow) -User says: "Submit a PR for these changes" (in a protected-org repo) -Actions: -1. Classify repo → protected-org detected (CLASSIFY REPO) -2. Stage files (STAGE) -3. Run parallel review (REVIEW) -4. Present commit message → user confirms → create commit (COMMIT, human-gated) -5. Present push details → user confirms → push to remote (PUSH, human-gated) -6. Skip Phase 4b (protected-org repos use their own review gates) -7. Present PR title/body → user confirms → create PR (CREATE PR, human-gated) -8. **STOP**. No CI wait, no merge. Report PR URL. -Result: PR URL. Next steps handled by org CI gates and human reviewers. +See `references/pr-templates.md` for the full options reference table and all 4 usage examples (Standard PR, Draft PR, Trivial Change, Protected-Org). --- @@ -719,51 +415,3 @@ Solution: 2. Provide the PR URL so user can monitor manually 3. Suggest: `gh run watch [run-id]` for manual monitoring 4. 
Mark pipeline as complete with "CI pending" status - ---- - -## Anti-Patterns - -### Anti-Pattern 1: Staging Everything Blindly -**What it looks like**: Running `git add -A` or `git add .` without reviewing changes -**Why wrong**: Captures sensitive files, unrelated changes, build artifacts, and debug logs -**Do instead**: Review `git status`, stage specific files by name - -### Anti-Pattern 2: Skipping Review for "Simple" Changes -**What it looks like**: "This is just a one-line fix, no need for review" -**Why wrong**: One-line changes can introduce security vulnerabilities or break business logic -**Do instead**: Run review unless user explicitly passes `--skip-review` - -### Anti-Pattern 3: Vague Commit Messages -**What it looks like**: `git commit -m "updates"` or `git commit -m "fix stuff"` -**Why wrong**: Provides no context for reviewers, breaks blame history, makes rollback decisions harder -**Do instead**: Conventional format with WHAT and WHY: `fix(auth): prevent token expiry race condition` - -### Anti-Pattern 4: Giant Monolithic PRs -**What it looks like**: 30+ files changed, multiple unrelated features in one PR -**Why wrong**: Impossible to review effectively, high regression risk, blocks other work -**Do instead**: Split into focused PRs. If changes are entangled, suggest user restructure first. - -### Anti-Pattern 5: Ignoring CI Failures -**What it looks like**: "CI failed but it's probably flaky, let's merge anyway" -**Why wrong**: Flaky tests mask real failures. Merging broken CI normalizes broken builds. -**Do instead**: Report failure details. Let user decide whether to investigate or retry. 
- ---- - -## References - -This skill uses these shared patterns: -- [Pipeline Architecture](../shared-patterns/pipeline-architecture.md) - Phase sequencing and artifact flow -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) - Phase transition rules -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Just a small change, skip review" | Small changes cause big bugs | Run review unless --skip-review | -| "CI is probably flaky" | Flaky assumption masks real failures | Report failure, let user decide | -| "I'll fix the commit message later" | Later never comes, history is permanent | Write proper message now | -| "These files belong together" | Unrelated changes obscure review | Split into separate PRs | diff --git a/pipelines/pr-pipeline/references/pr-templates.md b/pipelines/pr-pipeline/references/pr-templates.md new file mode 100644 index 00000000..f2682281 --- /dev/null +++ b/pipelines/pr-pipeline/references/pr-templates.md @@ -0,0 +1,136 @@ +# PR Templates and Examples + +Full details for Phase 5 (CREATE PR) of the PR Pipeline: artifact-driven body generation, templates, and all usage examples. + +--- + +## Artifact-Driven PR Body Generation + +When planning artifacts exist, generate the PR body from them rather than writing freeform. Artifacts capture *intent* (why the change was made), which is more valuable to reviewers than a mechanical diff summary. 
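As a rough illustration of the artifact check, the sketch below reports which PR sections could be generated from files found in a directory and falls back to diff-based generation when nothing is found. It rests on assumptions the text does not confirm: the function name `detect_pr_body_sources` is hypothetical, artifacts are assumed to sit in a single directory, and only the two artifacts with concrete file names (`task_plan.md` and `*-verification.md`) are detected. Review summaries and deviation logs are left out because no file names are given for them.

```bash
#!/usr/bin/env bash
# Sketch only: detect_pr_body_sources is a hypothetical helper.
# Usage: detect_pr_body_sources [DIR]
# Prints one line per PR section that can be built from artifacts in DIR,
# or a single fallback line when no known artifact is present.

detect_pr_body_sources() {
  local dir=${1:-.} found=0 f
  if [ -f "$dir/task_plan.md" ]; then
    echo "Summary+Changes: task_plan.md"
    found=1
  fi
  for f in "$dir"/*-verification.md; do
    # An unmatched glob stays literal, so skip anything that is not a file.
    [ -f "$f" ] || continue
    echo "Test Plan: $(basename "$f")"
    found=1
  done
  if [ "$found" -eq 0 ]; then
    echo "fallback: diff-based generation"
  fi
}
```

Checking `task_plan.md` first mirrors the order in which the Summary and Changes sections lead the PR body.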
+ +Check for artifacts in this order and build the PR body from what's available: + +| Artifact | PR Section Generated | How to Extract | +|----------|---------------------|----------------| +| `task_plan.md` | **Summary** (from Goal section) and **Changes** (from completed tasks) | Read the Goal and Phases sections; list completed items as change bullets | +| Verification reports (`*-verification.md`, test output) | **Test Plan** (from verification output) | Extract pass/fail counts and key assertions verified | +| Review summaries (Phase 2 / Phase 4b output) | **Review Findings** (from reviewer results) | Summarize security/logic/quality verdicts | +| Deviation logs (ADR-076 repair actions) | **Deviations** section | List repair actions taken and why the original plan changed | + +**Fallback**: If no artifacts exist, fall back to diff-based generation -- summarize changes from the diff and commit messages. This is the existing behavior and remains the default for ad-hoc PRs without planning artifacts. + +## PR Body Template (Artifact-Driven) + +```markdown +## Summary + +- [Goal statement] +- [Key change 1 from completed tasks] +- [Key change 2 from completed tasks] + +## Changes + +- [Completed task description] +- [Completed task description] + +## Test Plan + +- [ ] [Verification result 1] +- [ ] [Verification result 2] + +## Review Findings + +Security: PASS +Business Logic: PASS +Code Quality: PASS + +## Deviations + +- [Deviation description and rationale] +``` + +--- + +## Examples + +### Example 1: Standard PR Submission (personal repo) + +User says: "Submit a PR for these changes" + +Actions: +1. Classify repo from remote URL (CLASSIFY REPO) +2. `git status`, review changes, stage files (STAGE) +3. Launch 3 parallel reviewers on staged diff (REVIEW) +4. Create conventional commit from staged changes (COMMIT) +5. Push branch to remote with tracking (PUSH) +6. Run review-fix loop: `/pr-review` -> fix -> re-review, up to 3 iterations (REVIEW-FIX LOOP) +7. 
Record and graduate review findings, embed in responsible agents/skills (RETRO, toolkit repo only) +8. Validate ADR format consistency (ADR VALIDATION, toolkit repo only) +9. Create PR with summary and review findings (CREATE PR) +10. Wait for CI, report status (VERIFY) + +Result: PR URL with CI status and review-fix iteration count + +### Example 2: Draft PR for Work in Progress (personal repo) + +User says: "Open a draft PR for what I have so far" + +Actions: +1. Classify repo (CLASSIFY REPO) +2. Stage current changes, skip incomplete files if noted (STAGE) +3. Run parallel review (REVIEW) +4. Commit with `wip:` or appropriate prefix (COMMIT) +5. Push to feature branch (PUSH) +6. Run review-fix loop (REVIEW-FIX LOOP) +7. Record and graduate review findings (RETRO, toolkit repo only) +8. Validate ADR format consistency (ADR VALIDATION, toolkit repo only) +9. Create PR with `--draft` flag (CREATE PR) +10. Report PR URL, skip CI wait if `--no-wait` (VERIFY) + +Result: Draft PR URL + +### Example 3: Trivial Change with Skip Parallel Review (personal repo) + +User says: "Quick PR for this typo fix, skip review" + +Actions: +1. Classify repo (CLASSIFY REPO) +2. Stage the single file change (STAGE) +3. Skip Phase 2 parallel review (--skip-review) +4. Commit: `fix(docs): correct typo in README` (COMMIT) +5. Push to branch (PUSH) +6. Run review-fix loop -- Phase 4b still runs even with --skip-review (REVIEW-FIX LOOP) +7. Record and graduate review findings (RETRO, toolkit repo only) +8. Validate ADR format consistency (ADR VALIDATION, toolkit repo only) +9. Create PR with minimal body (CREATE PR) +10. Wait for CI (VERIFY) + +Result: PR URL for typo fix + +### Example 4: Protected-Org Repo (human-gated workflow) + +User says: "Submit a PR for these changes" (in a protected-org repo) + +Actions: +1. Classify repo -> protected-org detected (CLASSIFY REPO) +2. Stage files (STAGE) +3. Run parallel review (REVIEW) +4. 
Present commit message -> user confirms -> create commit (COMMIT, human-gated) +5. Present push details -> user confirms -> push to remote (PUSH, human-gated) +6. Skip Phase 4b (protected-org repos use their own review gates) +7. Present PR title/body -> user confirms -> create PR (CREATE PR, human-gated) +8. **STOP**. No CI wait, no merge. Report PR URL. + +Result: PR URL. Next steps handled by org CI gates and human reviewers. + +--- + +## Options Reference + +| Option | Effect | Default | +|--------|--------|---------| +| `--skip-review` | Skip Phase 2 (parallel subagent review) for trivial changes. Phase 4b review-fix loop still runs. | OFF (review runs) | +| `--draft` | Create draft PR instead of ready PR | OFF (ready PR) | +| `--no-wait` | Skip Phase 6 CI verification | OFF (waits for CI) | +| `--title "..."` | Override generated PR title | Auto-generated | +| `--files "pattern"` | Stage only files matching pattern | All changed files | diff --git a/pipelines/pr-pipeline/references/preflight-checklist.md b/pipelines/pr-pipeline/references/preflight-checklist.md new file mode 100644 index 00000000..5ea87f85 --- /dev/null +++ b/pipelines/pr-pipeline/references/preflight-checklist.md @@ -0,0 +1,65 @@ +# Preflight Checklist + +Full details for Phase 0.5 of the PR Pipeline. + +## Check Table + +Run all checks sequentially. Abort on the first failure. + +| # | Check | Command | Failure Action | +|---|-------|---------|---------------| +| 1 | Verification status (did quality gates pass?) | Check for recent test/build output or verification artifacts | Abort: "Run verification first -- no evidence that quality gates passed." | +| 2 | Clean working tree (no uncommitted changes) | `git status --porcelain` | Abort: "Working tree is dirty. Uncommitted files:\n{list}. Stage or stash before running PR pipeline." | +| 3 | Correct branch (not main/master) | `git branch --show-current` | Abort: "Currently on {branch}. 
Create a feature branch first: `git checkout -b type/description`" | +| 4 | Remote configured for current branch | `git config --get branch.$(git branch --show-current).remote` | Abort: "No remote configured for branch. Push with: `git push -u origin $(git branch --show-current)`" | +| 5 | `gh` CLI authenticated | `gh auth status 2>&1` | Abort: "GitHub CLI not authenticated. Run: `gh auth login`" | + +## Bash Script + +```bash +# Preflight check sequence +echo "Running preflight checklist..." + +# Check 1: Verification status +# Look for verification artifacts (test output, build logs) — if the project +# has a test suite and no recent verification evidence exists, warn. +# This is a soft gate: skip if no test infrastructure is detected. + +# Check 2: Clean working tree +DIRTY=$(git status --porcelain) +if [ -n "$DIRTY" ]; then + echo "PREFLIGHT FAIL: Working tree is dirty." + echo "$DIRTY" + echo "Stage or stash uncommitted changes before running PR pipeline." + exit 1 +fi + +# Check 3: Not on main/master +BRANCH=$(git branch --show-current) +if [ "$BRANCH" = "main" ] || [ "$BRANCH" = "master" ]; then + echo "PREFLIGHT FAIL: On branch '$BRANCH'." + echo "Create a feature branch: git checkout -b type/description" + exit 1 +fi + +# Check 4: Remote configured +REMOTE=$(git config --get "branch.$BRANCH.remote" 2>/dev/null) +if [ -z "$REMOTE" ]; then + echo "PREFLIGHT FAIL: No remote configured for branch '$BRANCH'." + echo "Push with: git push -u origin $BRANCH" + exit 1 +fi + +# Check 5: gh CLI authenticated +if ! gh auth status >/dev/null 2>&1; then + echo "PREFLIGHT FAIL: GitHub CLI not authenticated." + echo "Run: gh auth login" + exit 1 +fi + +echo "Preflight checklist PASSED." +``` + +## Note on Check 1 (Verification Status) + +This is context-dependent. If the project has a test suite (`go test`, `npm test`, `pytest`, etc.), look for evidence that tests were run recently (e.g., verification report files, recent test output in the session). 
If no test infrastructure exists, this check passes by default. The goal is to prevent submitting code that was never tested, not to block projects without tests. diff --git a/pipelines/pr-pipeline/references/retro-adr-phases.md b/pipelines/pr-pipeline/references/retro-adr-phases.md new file mode 100644 index 00000000..88eaa11b --- /dev/null +++ b/pipelines/pr-pipeline/references/retro-adr-phases.md @@ -0,0 +1,106 @@ +# Retro and ADR Validation Phases + +Full details for Phase 4c (RETRO) and Phase 4d (ADR VALIDATION) of the PR Pipeline. +Both phases apply to the claude-code-toolkit repo only. + +--- + +## Phase 4c: RETRO + +**Detection**: Both `agents/` and `skills/` directories exist at project root. + +```bash +# Detect toolkit repo +if [ -d "agents" ] && [ -d "skills" ]; then + echo "Toolkit repo detected -- RETRO phase required" +else + echo "Not toolkit repo -- skipping RETRO phase" + # Skip to Phase 5 +fi +``` + +### Step 1: Collect Review Findings + +Gather all findings from Phase 2 (REVIEW) and Phase 4b (REVIEW-FIX LOOP) that were identified and fixed. Include: +- Security findings that were addressed +- Code quality issues that were corrected +- Business logic errors that were fixed +- Methodology gaps that were exposed + +For each finding, identify the **responsible agent or skill** -- the component whose instructions should have prevented the issue. 
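One way to keep the finding-to-owner mapping explicit is a small record type. The sketch below is an illustration only: the `Finding` class, its field names, and the `learn_command` helper are hypothetical, and the command shape simply mirrors the `learning-db.py learn --agent` / `--skill` invocations this file lists under Step 2.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one review finding (field names are illustrative)."""
    owner_kind: str  # "agent" or "skill"
    owner_name: str  # e.g. "python-general-engineer"
    pattern: str     # generalized description of the class of bug


def learn_command(finding: Finding) -> List[str]:
    """Build the argv that records one finding against its responsible component."""
    if finding.owner_kind not in ("agent", "skill"):
        raise ValueError(f"unknown owner kind: {finding.owner_kind!r}")
    # Script location taken from the commands shown in this file.
    script = str(Path.home() / ".claude" / "scripts" / "learning-db.py")
    return ["python3", script, "learn",
            f"--{finding.owner_kind}", finding.owner_name, finding.pattern]


if __name__ == "__main__":
    example = Finding("agent", "python-general-engineer", "validate all CLI inputs")
    print(learn_command(example))
```

Building the argv as a list (rather than a shell string) avoids quoting problems when a pattern description contains spaces or quotes.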
+ +### Step 2: Record Learnings + +For each finding, record a retro entry scoped to the responsible agent or skill: + +```bash +# For agent-scoped findings (e.g., python-general-engineer produced bad code) +python3 ~/.claude/scripts/learning-db.py learn --agent {agent-name} "pattern description from review finding" + +# For skill-scoped findings (e.g., reddit-moderate missed a test requirement) +python3 ~/.claude/scripts/learning-db.py learn --skill {skill-name} "pattern description from review finding" +``` + +### Step 3: Immediate Graduation + +Per /do Phase 5 policy, boost each entry to 1.0 confidence and graduate immediately. This is NOT a slow-burn learning -- review findings in this repo are structural fixes. + +```bash +# Boost confidence to 1.0 (run boost 3x -- each boost applies a multiplier) +python3 ~/.claude/scripts/learning-db.py boost "agent:{agent-name}" "{key}" +python3 ~/.claude/scripts/learning-db.py boost "agent:{agent-name}" "{key}" +python3 ~/.claude/scripts/learning-db.py boost "agent:{agent-name}" "{key}" + +# Graduate -- marks as embedded, excludes from future prompt injection +python3 ~/.claude/scripts/learning-db.py graduate "agent:{agent-name}" "{key}" "agents/{agent-name}.md" +# Or for skills: +python3 ~/.claude/scripts/learning-db.py graduate "skill:{skill-name}" "{key}" "skills/{skill-name}/SKILL.md" +``` + +### Step 4: Embed in Agent/Skill + +Update the responsible agent or skill file with the graduated pattern: + +| Finding Target | Update Location | Section to Modify | +|---------------|----------------|-------------------| +| Agent produced bad code | `agents/{name}.md` | FORBIDDEN patterns or Anti-Patterns | +| Skill methodology gap | `skills/{name}/SKILL.md` | Instructions or Anti-Patterns | +| Router missed a pattern | `skills/do/SKILL.md` | Routing tables or Force-Routes | +| Hook failed to catch | `hooks/{name}.py` | Detection logic | + +Write the pattern at the right abstraction level -- generalize from the specific bug to 
the class of bug (e.g., "validate all CLI inputs" not "validate subreddit names in _cmd_classify"). + +### Step 5: Stage Retro Changes + +```bash +# Stage updated agent/skill files alongside the code changes +git add agents/{updated-agent}.md +git add skills/{updated-skill}/SKILL.md +``` + +These changes will be included in the existing commit (amend in next push cycle) or in a new commit if Phase 4b already completed cleanly. + +--- + +## Phase 4d: ADR VALIDATION + +**Detection**: Same as Phase 4c -- only runs in the toolkit repo (both `agents/` and `skills/` directories exist at root). + +### Step 1: Run ADR Format Check + +```bash +python3 ~/.claude/scripts/adr-status.py check +``` + +If exit code 1 (warnings found): +- Review each warning (missing headings, empty status) +- Fix formatting issues in the ADR files +- Stage the fixes: `git add adr/` + +### Step 2: Run ADR Status Report + +```bash +python3 ~/.claude/scripts/adr-status.py status +``` + +Include the status summary in the PR body if the PR touches any `adr/*.md` files. This gives reviewers an at-a-glance view of ADR state. diff --git a/pipelines/pr-pipeline/references/review-fix-loop.md b/pipelines/pr-pipeline/references/review-fix-loop.md new file mode 100644 index 00000000..94311eb0 --- /dev/null +++ b/pipelines/pr-pipeline/references/review-fix-loop.md @@ -0,0 +1,58 @@ +# Review-Fix Loop + +Full details for Phase 4b of the PR Pipeline (personal repos only). + +## Loop Logic + +Up to 3 iterations of `/pr-review` -> fix -> amend commit -> push. 
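The bounded loop described in this section can also be sketched as ordinary Python. This is a minimal illustration, not the pipeline's implementation: `run_review` and `fix_and_push` are hypothetical callables standing in for `/pr-review` and the fix/amend/push steps, and the sketch assumes `fix_and_push` returns whatever issues it could not fix.

```python
from typing import Callable, List, Tuple

MAX_ITERATIONS = 3  # the source caps the loop at 3 iterations


def review_fix_loop(
    run_review: Callable[[], List[str]],
    fix_and_push: Callable[[List[str]], List[str]],
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[str, int, List[str]]:
    """Return (status, iterations_run, remaining_issues)."""
    remaining: List[str] = []
    for iteration in range(1, max_iterations + 1):
        issues = run_review()
        if not issues:
            # A clean review exits the loop immediately.
            return "CLEAN", iteration, []
        # Assumption: the fixer reports the issues it could not resolve.
        remaining = fix_and_push(issues)
    # Cap reached: leftover issues are carried into the PR body as known items.
    return "MAX ITERATIONS REACHED", max_iterations, remaining


if __name__ == "__main__":
    reviews = iter([["naming", "stale README"], ["stale README"], []])
    print(review_fix_loop(lambda: next(reviews), lambda issues: []))
```

Passing the review and fix steps in as callables keeps the exit conditions testable without running any agents or touching git.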
+ +``` +ITERATION = 0 +MAX_ITERATIONS = 3 + +while ITERATION < MAX_ITERATIONS: + ITERATION += 1 + + Step 1: Run /pr-review + Step 2: If no issues found -> EXIT LOOP (proceed to Phase 5) + Step 3: Fix all reported issues + Step 4: Stage fixes, amend commit, force push to branch + Step 5: Report iteration results +``` + +## Step 1: Run `/pr-review` + +Invoke the `/pr-review` command, which launches specialized review agents (code-reviewer, silent-failure-hunter, comment-analyzer, etc.) and captures retro learnings. + +## Step 2: Evaluate Results + +| Result | Action | +|--------|--------| +| No issues found | **Exit loop**. Proceed to Phase 5 (CREATE PR). | +| Issues found (iteration < 3) | Fix issues in Step 3, then re-review. | +| Issues remaining after iteration 3 | **Exit loop**. Include remaining issues in PR body as known items. Proceed to Phase 5. | + +## Step 3: Fix Reported Issues + +Address each issue found by the review. This includes: +- Code quality fixes (naming, style, error handling) +- Documentation updates (stale references, missing README entries) +- Test gaps (if flagged) + +## Step 4: Amend and Push + +```bash +git add [fixed files] +git commit --amend --no-edit +CLAUDE_GATE_BYPASS=1 git push --force-with-lease +``` + +## Step 5: Iteration Report Format + +``` +REVIEW-FIX ITERATION [N/3] + Found: [X issues] + Fixed: [Y issues] + Remaining: [Z issues] + Status: [CLEAN | FIXING | MAX ITERATIONS REACHED] +``` diff --git a/pipelines/research-pipeline/SKILL.md b/pipelines/research-pipeline/SKILL.md index c31bf02e..07d7ee0b 100644 --- a/pipelines/research-pipeline/SKILL.md +++ b/pipelines/research-pipeline/SKILL.md @@ -39,40 +39,15 @@ routing: # Research Pipeline -## Operator Context +## Overview This skill formalizes the research-coordinator-engineer's parallel research workflow into a 5-phase pipeline with artifact saving at each phase. It is the go-to path when research needs to produce a citable, resumable output — not just an in-session answer. 
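Because every phase writes a file, the resume point can in principle be derived from what is already on disk. The sketch below is a hedged illustration: `resume_point` is a hypothetical helper, the `"DONE"` label is invented for the example, and it assumes the caller already knows the list of angles (the source does not specify how angles are recorded in `scope.md`). The file names `scope.md`, `raw-{angle}.md`, `synthesis.md`, and `report.md` come from this skill.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def resume_point(topic_dir: Path, angles: Iterable[str]) -> Tuple[str, List[str]]:
    """Return (phase_to_resume, missing_angles) based on saved artifacts."""
    if not (topic_dir / "scope.md").exists():
        return "SCOPE", []
    # Any angle without its raw file must be re-dispatched in GATHER.
    missing = [a for a in angles if not (topic_dir / f"raw-{a}.md").exists()]
    if missing:
        return "GATHER", missing
    if not (topic_dir / "synthesis.md").exists():
        return "SYNTHESIZE", []
    if not (topic_dir / "report.md").exists():
        return "VALIDATE", []
    # Hypothetical label: a prior report exists, so the user decides what to do next.
    return "DONE", []


if __name__ == "__main__":
    print(resume_point(Path("research/distributed-consensus"), ["tradeoffs"]))
```

Checking files in phase order means a partially completed GATHER only re-dispatches the angles whose raw files are absent.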
-### Hardcoded Behaviors (Always Apply) -- **Parallel dispatch mandatory**: Phase 2 (GATHER) MUST dispatch minimum 3 parallel `research-subagent-executor` agents in a single message. Sequential research is forbidden. This is validated by A/B testing (Rule 12, pipeline-orchestrator): sequential research loses quality vs. parallel. -- **Artifact saving at every phase**: Each phase writes to `research/{topic}/` before proceeding. Context-only storage is not acceptable — context is ephemeral, files are not. -- **Scope before gather**: Phase 1 (SCOPE) must produce `scope.md` before any research agents are dispatched. Gathering without a defined question produces unfocused findings. -- **Report is the canonical output**: `research/{topic}/report.md` is the artifact reported to the user. Inline chat is supplementary. - -### Default Behaviors (ON unless disabled) -- **Save raw findings per agent**: Each parallel agent in Phase 2 writes its own `raw-{angle}.md` file, not a shared file. This preserves distinct perspectives for synthesis. -- **Evidence quality ratings**: Phase 3 labels each finding Strong/Moderate/Weak based on source specificity. -- **Gap check**: Phase 4 explicitly checks whether the synthesis answers every sub-question from scope.md. -- **Standard depth**: ~10 tool calls per research agent unless user specifies "quick" or "deep". - -### Optional Behaviors (OFF unless enabled) -- **Deep mode**: ~20 tool calls per research agent. Enable with "deep research" or "thorough research". -- **Quick mode**: ~5 tool calls per research agent. Enable with "quick research" or "brief research". -- **Extra agents**: Dispatch more than 3 parallel agents for broad topics. Enable with "comprehensive" or when primary question has 5+ distinct angles. 
- -## What This Skill CAN Do -- Define a precise research scope with primary question and 2-5 sub-questions -- Dispatch 3+ parallel `research-subagent-executor` agents, each assigned a distinct angle -- Compile raw findings into a synthesis with evidence quality ratings per claim -- Check the synthesis against the original scope for gaps and bias -- Produce a structured final report saved to `research/{topic}/report.md` - -## What This Skill CANNOT Do -- Guarantee factual accuracy — it surfaces evidence quality but cannot verify all claims -- Replace domain-specific research workflows (e.g., `go-code-review` for Go code analysis) -- Produce audio, images, or non-text artifacts +The pipeline enforces mandatory parallel research (minimum 3 independent agents), artifact +persistence at every phase, and structured validation before delivery. Context-only storage +is forbidden; all outputs must persist to `research/{topic}/` for resumability and long-term reference. --- @@ -135,6 +110,8 @@ Write `research/{topic}/scope.md`: **Goal**: Execute parallel research with mandatory multi-agent dispatch. +**Critical Constraint**: You MUST dispatch minimum 3 parallel `research-subagent-executor` agents in a single message. Sequential research is forbidden — it produces lower quality output and takes 3–5x longer than parallel dispatch (validated by A/B testing). Each agent must be assigned a distinct angle and receive identical dispatch instructions in the same message; do NOT dispatch agents one at a time waiting for completion between each. + **Step 1**: Assign a distinct angle to each agent. Angles should cover the scope without overlapping. Good angle patterns for most research topics: | Angle | Focus | @@ -149,7 +126,7 @@ Write `research/{topic}/scope.md`: Choose 3–5 angles that are relevant to the primary question and sub-questions from scope.md. -**Step 2**: Dispatch minimum 3 parallel `research-subagent-executor` agents in a single message. 
Each agent receives: +**Step 2**: Dispatch all agents in a single message. Each agent receives: - Its assigned angle - The primary question from scope.md - The sub-questions relevant to its angle @@ -157,6 +134,8 @@ Choose 3–5 angles that are relevant to the primary question and sub-questions - The depth setting (number of tool calls) - Its output file: `research/{topic}/raw-{angle}.md` +Each agent writes its findings to its own `raw-{angle}.md` file, not a shared file. This preserves distinct perspectives for synthesis and prevents shared bias between agents. + Example dispatch instruction for one agent: ``` You are the "tradeoffs" research agent for: {primary question} @@ -215,6 +194,8 @@ If an agent times out or fails to write its file: | **Moderate** | Supported by one source or general practitioner consensus | | **Weak** | Inferred, speculative, or from a single low-authority source | +Distinguish between "Strong" findings backed by specific named sources and findings where "Strong" just means "multiple agents said it" — the latter can hide shared bias. Only mark as Strong if evidence is independent and specific. + **Step 4**: Write `research/{topic}/synthesis.md`: ```markdown @@ -282,7 +263,7 @@ If an agent times out or fails to write its file: **Goal**: Produce the final formatted report. -**Step 1**: Write `research/{topic}/report.md` — the canonical output artifact. +**Step 1**: Write `research/{topic}/report.md` — the canonical output artifact. This is what you report to the user. Inline chat is supplementary. Structure: @@ -385,51 +366,6 @@ Solution: Check if prior `report.md` exists. If it does, ask the user: re-run (o --- -## Anti-Patterns - -### Anti-Pattern 1: Sequential Research -**What it looks like**: Dispatching one research agent, waiting, dispatching the next, waiting, etc. -**Why wrong**: Sequential research takes 3–5x longer and produces lower quality output than parallel dispatch (A/B validated, Rule 12). 
Each agent also lacks awareness of the other angles, producing redundant findings. -**Do instead**: Dispatch all agents in a single message with distinct angles. Wait once for all to complete. - -### Anti-Pattern 2: Not Saving Artifacts -**What it looks like**: Reading all raw files and synthesizing in context without writing synthesis.md or report.md to disk. -**Why wrong**: Context is ephemeral. If the session ends or context compresses, all synthesis work is lost. The user cannot reference findings later. -**Do instead**: Write synthesis.md after Phase 3 and report.md after Phase 5, always. These are the persistent record. - -### Anti-Pattern 3: Treating Synthesis as Delivery -**What it looks like**: Skipping Phase 4 (VALIDATE) and Phase 5 (DELIVER), sending synthesis.md content directly to the user. -**Why wrong**: Synthesis is an intermediate artifact. It lacks quality assessment, bias check, and the formatted structure of a final report. Gaps are not yet surfaced for the user. -**Do instead**: Always run VALIDATE before DELIVER. Quality assessment is what distinguishes research from note-taking. - -### Anti-Pattern 4: Single-Agent Research -**What it looks like**: Dispatching one agent with instructions to "research everything" about the topic. -**Why wrong**: A single agent pursues one thread of inquiry and cannot cover multiple angles in parallel. It also hits depth limits faster. -**Do instead**: Assign distinct angles to distinct agents. Minimum 3. - -### Anti-Pattern 5: Vague Scope -**What it looks like**: Skipping Phase 1 and dispatching agents with the user's raw query as the research question. -**Why wrong**: Raw queries like "research Kubernetes" produce unfocused findings across too many angles with no synthesis structure. -**Do instead**: Always write scope.md first. A 2-minute scoping step saves significant synthesis effort. - ---- - -## Examples - -### Example 1: Technical tradeoff research -User: "Research the tradeoffs of CRDTs vs. 
operational transforms for collaborative editing" -Actions: Phase 1 defines primary question ("What are the tradeoffs between CRDTs and OT for real-time collaborative editing at scale?"), 4 sub-questions (conflict resolution model, performance, implementation complexity, production adoption). Phase 2 dispatches 4 agents: current-state, tradeoffs, technical-details, real-world-usage. Phase 3 synthesizes 8 key findings with evidence ratings. Phase 4 identifies 1 weak finding (performance at very large scale — limited data). Phase 5 delivers `research/crdt-vs-ot/report.md`. - -### Example 2: Quick competitive landscape research -User: "Quick research on vector database options for a new project" -Actions: Phase 1 sets depth=quick (~5 calls per agent), 3 sub-questions. Phase 2 dispatches 3 agents: current-state (mainstream options), tradeoffs (when to use each), real-world-usage (adoption and maturity). Phase 5 delivers `research/vector-databases/report.md` with executive summary that directly answers "which to pick for a new project." - -### Example 3: Resuming interrupted research -User: "Continue the research on distributed consensus algorithms" -Actions: Check if `research/distributed-consensus/` exists. If scope.md and some raw-*.md files exist, read scope.md to re-establish context, check which angles are missing, and resume from Phase 3 (SYNTHESIZE) if all raw files are present, or re-dispatch missing angles if some are absent. 
- ---- - ## References - [domain-research](../domain-research/SKILL.md) - Subdomain discovery pipeline that uses similar parallel research patterns diff --git a/pipelines/research-to-article/SKILL.md b/pipelines/research-to-article/SKILL.md index 664a3f05..419940fa 100644 --- a/pipelines/research-to-article/SKILL.md +++ b/pipelines/research-to-article/SKILL.md @@ -37,48 +37,11 @@ routing: # Research-to-Article Pipeline -## Operator Context - -This skill orchestrates the complete content pipeline: parallel research gathering, structured compilation, voice generation, validation, and output. It implements the **Pipeline Architecture** pattern — gather, compile, ground, generate, validate — with **Domain Intelligence** embedded in voice-first content methodology. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before starting pipeline -- **Over-Engineering Prevention**: Generate the article requested. No speculative additions, no "while I'm here" improvements to other content -- **Parallel Research**: Launch 5 parallel research agents for comprehensive coverage -- **Research-Voice Separation**: Research informs but NEVER dominates narrative. Raw data must be transformed to story -- **Deterministic Validation**: Always validate final output with `voice_validator.py` -- **Wabi-Sabi Authenticity**: Natural imperfections are features, not bugs. Do not over-polish -- **Artifact Persistence**: Save research doc AND article to disk. Context is ephemeral, files persist - -### Default Behaviors (ON unless disabled) -- **5 Research Agents**: Primary domain, narrative arcs, external context, community reaction, business context -- **Research Compilation**: Structure findings with Story Arc first, key facts second -- **Voice Mode Selection**: Auto-select mode based on content type (profile, awards, journey, etc.) 
-- **Article Voice Notes**: Include voice-specific notes in compiled research doc -- **Full Validation**: Run `voice_validator.py` before declaring output complete -- **Timeout Enforcement**: 5-minute per-agent hard timeout, proceed with available data - -### Optional Behaviors (OFF unless enabled) -- **Skip Research**: Use provided research data (`--skip-research`) -- **Skip Validation**: Draft mode without deterministic validation (`--skip-validation`) -- **Single Agent Research**: Faster but less thorough (`--quick`) -- **Custom Agent Count**: Override default 5 agents for simpler topics - -## What This Skill CAN Do -- Orchestrate end-to-end research-to-article pipeline with parallel agents -- Compile multi-source research into structured documents with story arcs -- Generate voice-authentic articles using voice skills and profiles -- Validate output deterministically with `voice_validator.py` -- Gracefully degrade when research agents timeout or return incomplete data - -## What This Skill CANNOT Do -- Generate articles without researching first (use voice-orchestrator instead) -- Edit existing articles (use the appropriate voice skill directly) -- Optimize article performance or SEO (not a content optimization tool) -- Skip validation phases (use `--skip-validation` flag explicitly if draft mode needed) -- Upload to WordPress (use wordpress-uploader skill after pipeline completes) +## Overview ---- +This skill orchestrates a complete content pipeline from research to publication. The pipeline operates in six distinct phases, each with defined inputs and gate criteria that must pass before proceeding to the next phase. Each phase produces persistent artifacts (files saved to disk) because context is ephemeral but files remain. + +The core principle: research informs the article but NEVER dominates the narrative. Raw data transforms into story before reaching the final output. 
Always run deterministic validation with `voice_validator.py` at the end because self-assessment is unreliable. ## Instructions @@ -88,7 +51,7 @@ This skill orchestrates the complete content pipeline: parallel research gatheri **Step 1: Define research scope** -Identify the subject, timeframe, and 6 research dimensions: +Identify the subject, timeframe, and 6 research dimensions. The current news dimension (Agent 6) is mandatory for profile articles because articles must feel timely — timely journalism captures where the story is RIGHT NOW, not a historical summary from months ago. ```markdown ## Research Scope @@ -103,14 +66,14 @@ Agents: 6. [Current news] - Last 1-2 weeks: upcoming events, recent developments, announcements ``` -**MANDATORY: Agent 6 (Current News) cannot be skipped for profile articles.** The article must feel timely. Search for: +Search for these in current news research: - Upcoming events involving the subject - Recent developments from the last 2-3 weeks - Significant changes or announcements - Media coverage, press conferences, official statements - Social media announcements or reactions -**Important**: Raw analytics, ratings, or database numbers are for research context only. NEVER surface raw data in the final article. Use data to understand trajectory, then transform to narrative. +**Important**: Raw analytics, ratings, or database numbers serve research context only — NEVER surface raw data in the final article because readers don't know or care about database numbers like "1771.7 in the ratings and 8.40 community rating". Use data during research to understand trajectory, then transform to narrative in the article: "having the best stretch of their career right now". **Step 2: Launch 5 parallel agents** @@ -126,15 +89,17 @@ Timeline: 5:00 - HARD TIMEOUT: Proceed with available data ``` +Enforce a 5-minute hard timeout on research agents because diminishing returns set in after 3-4 agents provide sufficient data. 
Waiting indefinitely wastes time without improving quality. + **Gate**: At least 3 of 5 agents have returned data. If fewer than 3, supplement with direct WebSearch. Proceed only when gate passes. ### Phase 2: COMPILE (Structure Research) -**Goal**: Merge agent outputs into a single structured research document. +**Goal**: Merge agent outputs into a single structured research document with story arc as the primary organizing principle. **Step 1: Identify the story arc** -Before organizing facts, determine: What is the STORY here? The story arc is the most important element — it frames every fact, quote, and detail. +Before organizing facts, determine: What is the STORY here? The story arc is the most important element — it frames every fact, quote, and detail. A fact dump (no story arc) becomes a list, not an article. Complete this phase with a clear story arc before moving to GENERATE. **Step 2: Compile research document** @@ -178,10 +143,10 @@ Invoke the appropriate voice skill (e.g., `voice-{name}`) via the Skill tool. Se **Step 2: Generate with research context** Key constraints for ALL voices: -- NEVER expose analytics, ratings, or raw data — transform to narrative +- NEVER expose analytics, ratings, or raw data — transform to narrative because readers want stories, not reports - Reference the compiled research document by path -- Apply wabi-sabi: natural imperfections are features -- End with forward momentum — point ahead, not backward +- Apply wabi-sabi: natural imperfections are features, not bugs. Do not over-polish +- End with forward momentum — point ahead, not backward. Voice-authentic writing never summarizes. Summary paragraphs are an AI tell **Step 3: Save draft** @@ -205,6 +170,8 @@ python3 $HOME/claude-code-toolkit/scripts/voice_validator.py validate \ --format json ``` +Run `voice_validator.py` every time because self-assessment is unreliable — validation is mandatory, not optional. 
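The point of a deterministic gate is that a process result, not the model's own judgment, decides pass or fail. A minimal sketch, assuming the validator signals failure through a non-zero exit status (the source does not document `voice_validator.py`'s exit codes, so treat that as an assumption); `gate_passes` is a hypothetical helper, and the command is passed in rather than hard-coded.

```python
import subprocess
import sys
from typing import Sequence


def gate_passes(command: Sequence[str]) -> bool:
    """Run a validator command; the gate passes only on exit status 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in validators so the sketch runs without the real script.
    passing = [sys.executable, "-c", "raise SystemExit(0)"]
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(gate_passes(passing), gate_passes(failing))
```

In real use the command would be the validator invocation shown in Step 1, with its JSON output inspected against the pass criteria rather than relying on the exit status alone.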
+ **Step 2: Check pass criteria** | Metric | Requirement | @@ -221,6 +188,11 @@ If validation fails: 3. Re-validate 4. Maximum 3 iterations — if still failing after 3, output with validation report and note issues +Common fixes: +- Check for research language ("metrics show", "data indicates") and replace with narrative +- Check for banned patterns and remove +- Re-read voice profile and adjust tone + **Gate**: Validation passes OR 3 refinement iterations exhausted. Proceed only when gate passes. ### Phase 6: OUTPUT @@ -252,7 +224,7 @@ If validation fails: **Step 2: Verify artifacts exist** -Confirm both files are saved: +Confirm both files are saved — this is non-optional verification because files persist when context vanishes: - Research document at `content/[site]/test/[subject]-research.md` - Article at `content/[site]/test/[subject]-article.md` @@ -321,66 +293,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Research Language in Article -**What it looks like**: "Data shows that fan engagement increased 40% during Q3" -**Why wrong**: Exposes raw analytics. Readers want stories, not reports. Voice authenticity destroyed. -**Do instead**: Transform data to narrative — "the audience grew every week, and by summer the momentum was undeniable" - -### Anti-Pattern 2: Skipping Research Compilation -**What it looks like**: Passing raw agent outputs directly to voice generation -**Why wrong**: No story arc means no throughline. Article becomes a list of facts, not a narrative. -**Do instead**: Complete Phase 2 COMPILE. Identify the story arc FIRST, then organize facts around it. - -### Anti-Pattern 3: Over-Polishing the Output -**What it looks like**: Fixing every run-on sentence, removing all fragments, standardizing punctuation -**Why wrong**: Wabi-sabi violation. Natural imperfections are voice fingerprints. Sterile perfection is an AI tell. -**Do instead**: Fix errors (banned patterns, factual mistakes). Keep warnings (rhythm, fragments, run-ons). 
These are features. - -### Anti-Pattern 4: Waiting Indefinitely for Research Agents -**What it looks like**: "Still waiting for Agent 4... it's been 15 minutes..." -**Why wrong**: Diminishing returns. 3-4 agents provide sufficient data. Waiting wastes time without improving quality. -**Do instead**: Enforce 5-minute hard timeout. Proceed with available data. Supplement with targeted WebSearch if needed. - -### Anti-Pattern 5: Summarizing at the End -**What it looks like**: "In conclusion, [subject]'s 2025 was defined by three key themes..." -**Why wrong**: Voice-authentic writing never summarizes. Summary paragraphs are an AI tell. -**Do instead**: Point forward. End with momentum -- what comes next, what to watch for, why you should care tomorrow. - -### Anti-Pattern 6: Citing Raw Stats in Articles -**What it looks like**: "At 1771.7 in the ratings and climbing, with a community rating of 8.40" -**Why wrong**: Readers don't know or care about raw database numbers. This reads like a stats dump, not a story. -**Do instead**: Use data during research to understand trajectory. In the article, transform to narrative: "having the best stretch of their career right now" - -### Anti-Pattern 7: Missing Current News -**What it looks like**: A profile piece about a subject who has a major event tomorrow, but the article doesn't mention it -**Why wrong**: Makes the article feel like a Wikipedia summary rather than timely journalism. Misses the entire reason to read NOW. -**Do instead**: Always search for upcoming events, recent developments, and announcements. The closing sections should reflect where the story is RIGHT NOW. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Pipeline Architecture](../shared-patterns/pipeline-architecture.md) - Pipeline design principles -- [Voice-First Writing](../shared-patterns/voice-first-writing.md) - Voice generation methodology -- [Wabi-Sabi Authenticity](../shared-patterns/wabi-sabi-authenticity.md) - Natural imperfection principle - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Research is good enough with 2 agents" | Missing dimensions lead to shallow articles | Launch all 6, proceed with 4+ after timeout | -| "I can write without compiling first" | No story arc = fact dump, not article | Complete COMPILE phase with story arc | -| "Validation is optional for good writers" | Self-assessment is unreliable | Run `voice_validator.py`, every time | -| "This sounds natural, skip wabi-sabi check" | Over-polished = AI tell | Verify imperfections preserved | -| "Readers won't notice research language" | "Data indicates" kills voice authenticity | Transform ALL data to narrative | -| "Raw data adds credibility" | Readers don't know or care about database numbers | Use data for context, never cite raw numbers in article | -| "Historical facts are enough" | Missing current news makes article feel stale | Search last 1-2 weeks of news, upcoming events | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/research-agents.md`: Agent configuration, prompts, and timeout management - `${CLAUDE_SKILL_DIR}/references/voice-variants.md`: Voice-specific templates, CLI usage, and proven examples diff --git a/pipelines/skill-creation-pipeline/SKILL.md b/pipelines/skill-creation-pipeline/SKILL.md index f3a37fc2..ffb0eae8 100644 
--- a/pipelines/skill-creation-pipeline/SKILL.md +++ b/pipelines/skill-creation-pipeline/SKILL.md @@ -36,66 +36,9 @@ routing: # Skill Creation Pipeline -## Operator Context - -This pipeline wraps `skill-creator` with explicit discovery, design -review, and validation gates. It is the **formal path** for creating new skills -— as opposed to ad-hoc creation — and should be used whenever skill quality, -uniqueness, or routing correctness is important. The pipeline does not replace -the creator agent; it provides the scaffolding around it. - -### Hardcoded Behaviors (Always Apply) -- **DISCOVER Before Any Files**: Phase 1 (DISCOVER) must complete before any - SKILL.md is written. No exceptions. This prevents duplicate skills from being - added to the repo. -- **ADR Check in DISCOVER**: During Phase 1, check for active ADR session (`.adr-session.json`) and run `python3 ~/.claude/scripts/adr-query.py list` to find related ADRs. If an active session exists, read relevant sections via `adr-query.py context --role skill-creator`. If creating a new skill as part of a pipeline, verify the ADR hash before proceeding. -- **Group-Prefix Naming**: New skills MUST use the same prefix as related existing skills. During DISCOVER, run `ls skills/ | grep {domain}` to find the group. Examples: voice skills start with `voice-`, Go skills with `go-`, PR skills with `pr-`, writing/content skills with `writing-`, review skills with `review-`. If no group exists, the new skill starts one. The directory name and the `name:` frontmatter field must match. -- **Design Brief Before SCAFFOLD**: Phase 2 (DESIGN) must produce a saved design - brief before Phase 3 begins. Writing a skill without a tier decision and phase - list produces inconsistent results. -- **Score Before INTEGRATE**: Phase 4 (VALIDATE) must call `agent-evaluation` on - the new SKILL.md and produce a numeric score before Phase 5 begins. Self- - assessment ("this looks good") does not satisfy this gate. 
-- **Minimum Grade B**: A new skill must score 75+ to proceed to integration. If - it scores below 75, return to Phase 3 (SCAFFOLD) and fix before re-scoring. Max - 3 iterations — after that, surface the scoring breakdown to the user. -- **INDEX.json Update Is Non-Optional**: Phase 5 (INTEGRATE) must add the new - skill to `skills/INDEX.json`. A skill that exists on disk but not in the index - is invisible to routing. - -### Default Behaviors (ON unless disabled) -- **Overlap Threshold at 70%**: In DISCOVER, if an existing skill covers ≥70% - of the requested domain, surface the overlap and recommend extending rather - than creating. Proceed only after a deliberate "create new" decision. -- **Complexity Tier Confirmation**: If the tier is ambiguous (user's description - could fit Simple or Medium, or Medium vs. Complex), ask before proceeding to - SCAFFOLD. Tier errors are cheaper to fix in DESIGN than after SCAFFOLD. -- **Save Design Brief to File**: Write the DESIGN output to - `skills/{name}/design-brief.md` before SCAFFOLD begins. This file is the - single source of truth for Phase 3. - -### Optional Behaviors (OFF unless enabled) -- **Skip DISCOVER**: Skip the duplication check (enable with "skip discover" or - "I've already checked for duplicates"). Use only when the skill name is - intentionally novel and overlap checking is unnecessary. -- **Skip VALIDATE**: Skip agent-evaluation scoring (enable with "skip validation" - or "quick creation"). Accepts the risk of lower quality at integration time. -- **Auto-Approve Design Brief**: Proceed from DESIGN to SCAFFOLD without - confirming the tier (enable with "auto" or "no confirmation needed"). 
- -## What This Skill CAN Do -- Detect overlap with existing skills before any files are written -- Select the appropriate complexity tier based on the skill's requirements -- Generate a complete SKILL.md following AGENT_TEMPLATE_V2.md patterns -- Score the result against the agent-evaluation rubric (100-point scale) -- Wire the new skill into `skills/INDEX.json` and prompt routing updates - -## What This Skill CANNOT Do -- Create agents — use `agent-creator-engineer` for new agents -- Update routing tables autonomously — Phase 5 prompts the update, but - `routing-table-updater` or manual edits handle the actual changes -- Guarantee an A-grade on first pass — iteration may be required -- Modify existing skills — use direct editing or `system-upgrade` for that +## Overview + +This pipeline wraps `skill-creator` with explicit discovery, design review, and validation gates. It is the **formal path** for creating new skills — as opposed to ad-hoc creation — and should be used whenever skill quality, uniqueness, or routing correctness is important. The pipeline does not replace the creator agent; it provides the scaffolding around it. --- @@ -104,7 +47,8 @@ the creator agent; it provides the scaffolding around it. ### Phase 1: DISCOVER **Goal**: Prevent duplication by scanning the existing skill and agent indexes -before any files are written. +before any files are written. DISCOVER **must complete before any SKILL.md is +written — no exceptions**. **Step 1**: Extract the domain keywords from the user's request. These are the terms that describe what the skill does (e.g., "code review", "branch naming", @@ -121,13 +65,30 @@ grep -i "" /path/to/skills/INDEX.json Also search `agents/INDEX.json` for agents that already handle this domain as part of their core capability. -**Step 3**: For each potentially overlapping skill, read its SKILL.md description +**Step 3**: Check for active ADR session. 
If `.adr-session.json` exists, run +`python3 ~/.claude/scripts/adr-query.py list` to find related ADRs. If +an active session exists, read relevant sections via +`adr-query.py context --role skill-creator`. If creating a skill as part of +a pipeline, verify the ADR hash before proceeding. + +**Step 4**: For each potentially overlapping skill, read its SKILL.md description and phase list. Estimate overlap percentage based on: - Same domain verbs (review, create, debug, deploy) - Same target artifact (Go files, PRs, branches, agents) - Same phase structure (does the existing skill already cover what's needed?) -**Step 4**: Report findings. +If an existing skill covers ≥70% of the requested domain, surface the overlap +and recommend extending rather than creating. Proceed only after a deliberate +"create new" decision (threshold is 70% overlap). + +**Step 5**: Identify the skill's group prefix. Run `ls skills/ | grep {domain}` +to find the group. Examples: voice skills start with `voice-`, Go skills with +`go-`, PR skills with `pr-`, writing/content skills with `writing-`, review +skills with `review-`. New skills **must use the same prefix as related existing +skills**. If no group exists, the new skill starts one. The directory name and +the `name:` frontmatter field must match exactly. + +**Step 6**: Report findings. ``` DISCOVER RESULTS @@ -135,7 +96,8 @@ DISCOVER RESULTS Request: [user's skill description in one line] Keywords: [extracted keywords] - +Group/Prefix: [identified prefix, e.g., voice-, go-, pr-] +ADRs checked: [Y/N, count if any found] Existing skills checked: [N] [If no overlap found:] @@ -155,17 +117,21 @@ Existing skills checked: [N] **Gate**: If overlap ≥70% with any existing skill, present the recommendation and wait for the user's decision. If the user says "create new anyway" or -"extend" the existing skill, act accordingly. If <70% overlap, proceed -automatically to Phase 2. +"extend" the existing skill, act accordingly. 
If <70% overlap or user confirms +new creation, proceed to Phase 2. (To skip DISCOVER entirely, enable with +"skip discover" or "I've already checked for duplicates" — use only when the +skill name is intentionally novel.) --- ### Phase 2: DESIGN **Goal**: Choose the complexity tier and produce a complete design brief before -any SKILL.md content is written. +any SKILL.md content is written. **Design brief must be saved before Phase 3 +begins** — writing a skill without a tier decision and phase list produces +inconsistent results. -**Step 1**: Classify the skill's complexity tier. +**Step 1**: Classify the skill's complexity tier using these characteristics. | Tier | Characteristics | Phase Count | Fan-Out | |------|----------------|-------------|---------| @@ -200,18 +166,19 @@ Reference Files Needed: Key Behaviors (Hardcoded): - [behavior] - -Anti-Patterns to Prevent: - - [pattern] ``` -**Step 3**: If the tier is unambiguous given the request, auto-select and note -the reason. If ambiguous (the request could reasonably fit two tiers), present -both options and ask the user to choose before proceeding. +**Step 3**: Tier confirmation logic. If the tier is unambiguous given the request, +auto-select and note the reason in the brief. If ambiguous (the request could +reasonably fit two tiers), present both options and ask the user to choose before +proceeding. Tier errors are cheaper to fix in DESIGN than after SCAFFOLD — +ask for confirmation on edge cases. (To skip confirmation, enable with "auto" or +"no confirmation needed" — use only for intentional auto-approval.) **Step 4**: Save the design brief. -Write to `skills/{name}/design-brief.md`. +Write to `skills/{name}/design-brief.md`. This file is the single source of +truth for Phase 3. **Gate**: Design brief saved. Tier confirmed (either auto-selected with rationale or user-confirmed). Proceed to Phase 3. @@ -224,25 +191,28 @@ or user-confirmed). Proceed to Phase 3. 
**Step 1**: Read the saved design brief from `skills/{name}/design-brief.md`. -**Step 2**: Generate the SKILL.md following these requirements. Every skill MUST -have all of these sections — no exceptions: - -| Section | Requirement | -|---------|-------------| -| Frontmatter | `name`, `description`, `version`, `user-invocable`, `agent`, `allowed-tools` | -| Operator Context | Hardcoded / Default / Optional behaviors (three subsections) | -| Capabilities | "What This Skill CAN Do" and "What This Skill CANNOT Do" | -| Instructions | One `### Phase N: NAME` section per phase in the design brief | -| Error Handling | At least 2–3 named error cases with cause and solution | -| Anti-Patterns | At least 2–3 named anti-patterns with what/why/do-instead | -| Examples | At least 1 realistic example with user input and step trace | - -**Step 3**: Apply these structural patterns from AGENT_TEMPLATE_V2: - -- Phase headers: `### Phase N: PHASENAME` -- Gates at end of each phase: `**Gate**: [condition]. [action].` -- Operator Context subsections named exactly: `### Hardcoded Behaviors (Always Apply)`, `### Default Behaviors (ON unless disabled)`, `### Optional Behaviors (OFF unless enabled)` -- Anti-patterns use: `**What it looks like**: ... **Why wrong**: ... **Do instead**: ...` +**Step 2**: Generate the SKILL.md following the structural patterns from +AGENT_TEMPLATE_V2. Include these sections in this order: + +1. **Frontmatter** with `name`, `description`, `version`, `user-invocable`, + `agent`, `allowed-tools` (required fields) +2. **Overview** (1–2 sentences on purpose and context) +3. **Instructions** with one `### Phase N: PHASENAME` section per phase in the + design brief. Each phase must end with a **Gate** statement: + `**Gate**: [condition]. [action].` +4. **Error Handling** with 2–3 named error cases (Cause, Solution pattern) +5. 
**References** (links to related files, skills, agents) + +Avoid these outdated sections — they are being removed from the template: +- "Operator Context" (hardcoded/default/optional behaviors) +- "What This Skill CAN/CANNOT Do" +- "Anti-Patterns" and anti-rationalization tables + +**Step 3**: Integrate constraints inline with each phase's reasoning and gate +logic rather than in separate subsections. For example: +- In Phase 1, explain why DISCOVER must complete first and what checks to run +- In Phase 2, note when tier confirmation is needed vs. auto-selected +- In Phase 4, explain why agent-evaluation scores must be ≥75 before proceeding **Step 4**: Write to `skills/{name}/SKILL.md`. @@ -252,23 +222,16 @@ have all of these sections — no exceptions: ### Phase 4: VALIDATE -**Goal**: Score the new skill against the agent-evaluation rubric and enforce -the minimum quality gate before integration. +**Goal**: Score the new skill against the agent-evaluation rubric. **Score must +be ≥75 (grade B or above) before proceeding to Phase 5.** Self-assessment +("this looks good") does not satisfy this gate — agent-evaluation scoring is +mandatory. **Step 1**: Run `agent-evaluation` on the new SKILL.md. Use the `agent-evaluation` skill, pointing it at `skills/{name}/SKILL.md`. This -produces a score breakdown across: - -| Criterion | Points | -|-----------|--------| -| Structure (YAML, phases, gates) | 20 | -| Operator Context (behaviors) | 15 | -| Error Handling | 15 | -| Reference Files | 10 | -| Validation Scripts | 10 | -| Content Depth | 30 | -| **Total** | **100** | +produces a score breakdown across six criteria (total 100 points). The skill +must score ≥75 to proceed to integration. **Step 2**: Report the score. 
@@ -282,10 +245,8 @@ Grade: [A (90+) | B (75–89) | C (60–74) | F (<60)] Breakdown: Structure: [N]/20 - Operator Context: [N]/15 - Error Handling: [N]/15 Reference Files: [N]/10 - Validation Scripts:[N]/10 + Error Handling: [N]/15 Content Depth: [N]/30 [If grade A or B:] @@ -301,17 +262,22 @@ Breakdown: - List the specific sections that are weak or missing - Return to Phase 3 with explicit instructions to fix those sections - Re-run Phase 4 after the fix -- Track iterations. After 3 failed iterations, surface the full scoring - breakdown to the user and ask whether to continue or redesign from Phase 2. +- Track iterations. Max 3 iterations — after 3 failed attempts to reach 75+, + surface the full scoring breakdown to the user and ask whether to continue + or redesign from Phase 2. Tier errors discovered late are expensive to fix. + +(To skip validation entirely, enable with "skip validation" or "quick creation" +— use only when accepting the risk of lower quality at integration time.) -**Gate**: Score ≥ 75 (grade B or above). Proceed to Phase 5. +**Gate**: Score ≥75 (grade B or above). Proceed to Phase 5. --- ### Phase 5: INTEGRATE -**Goal**: Wire the validated skill into the routing system so it is immediately -usable. +**Goal**: Wire the validated skill into the routing system. **INDEX.json update +is non-optional** — a skill that exists on disk but not in the index is invisible +to routing. **Step 1**: Add to `skills/INDEX.json`. @@ -327,6 +293,11 @@ Read the current INDEX.json and append an entry for the new skill: } ``` +Validate JSON syntax before proceeding: +```bash +python3 -c "import json; json.load(open('skills/INDEX.json'))" +``` + **Step 2**: Check whether the skill needs a routing entry in `/do`. A routing entry is needed if: @@ -356,7 +327,7 @@ Score: [N]/100 New skill is ready to use. ``` -**Gate**: INDEX.json updated. Routing status reported. Phase 5 complete. +**Gate**: INDEX.json updated and validated. Routing status reported. Phase 5 complete. 
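The JSON syntax check in Phase 5 can be expanded into a small helper that also says where parsing failed. This is a minimal sketch, not part of the pipeline: the function name `validate_index` and its `(ok, message)` return shape are hypothetical, and it checks syntax only, not whether the new skill's entry is present.

```python
import json
from pathlib import Path


def validate_index(path: str) -> tuple[bool, str]:
    """Return (ok, message) for a JSON file, reporting where parsing failed.

    Hypothetical helper: it mirrors the `json.load` one-liner from Phase 5
    but surfaces the line and column of a syntax error.
    """
    try:
        with Path(path).open(encoding="utf-8") as fh:
            json.load(fh)
    except FileNotFoundError:
        return False, f"{path}: file not found"
    except json.JSONDecodeError as err:
        # lineno, colno and msg are attributes of json.JSONDecodeError.
        return False, f"{path}: line {err.lineno}, column {err.colno}: {err.msg}"
    return True, f"{path}: valid JSON"
```

Calling `validate_index("skills/INDEX.json")` before reporting the Phase 5 gate would give the user a concrete location to fix rather than a bare traceback.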
--- @@ -390,43 +361,6 @@ validate. Fix the syntax before completing Phase 5. --- -## Anti-Patterns - -### Anti-Pattern 1: Skipping DISCOVER to Save Time -**What it looks like**: Jumping directly to DESIGN because "this skill is clearly -new" without checking the INDEX -**Why wrong**: The repo has 120+ skills. Near-duplicates exist under non-obvious -names. A 30-second INDEX scan has caught duplicates in practice. -**Do instead**: Always run Phase 1. If you're confident there's no overlap, the -scan will confirm it in seconds. - -### Anti-Pattern 2: Writing SKILL.md Before the Design Brief -**What it looks like**: Going straight from DISCOVER to writing SKILL.md content -without a tier decision and phase list -**Why wrong**: Tier errors discovered mid-SCAFFOLD require starting over. A 2- -minute design brief prevents a 20-minute rewrite. -**Do instead**: Complete Phase 2 and save the design brief before writing a single -line of SKILL.md. - -### Anti-Pattern 3: Self-Certifying VALIDATE -**What it looks like**: Saying "this skill looks complete, I'll give it a B" instead -of running `agent-evaluation` -**Why wrong**: Self-assessment of your own output is unreliable. The rubric has -6 specific criteria with point values. Without scoring against them, gaps in Error -Handling or Reference Files are routinely missed. -**Do instead**: Always invoke `agent-evaluation`. Even if the result is an A, you'll -have the breakdown to show the user. - -### Anti-Pattern 4: Silently Skipping INDEX Update -**What it looks like**: Finishing SCAFFOLD and VALIDATE and declaring the skill -"done" without updating INDEX.json -**Why wrong**: A skill that isn't in the index is invisible to `/do` routing and -to any tool that builds skill lists from INDEX.json. The skill exists on disk but -is unreachable. -**Do instead**: INDEX.json update is the first step of Phase 5. It is not optional. 
- ---- - ## Examples ### Example 1: Simple skill, no overlap diff --git a/pipelines/system-upgrade/SKILL.md b/pipelines/system-upgrade/SKILL.md index 7bb5b02f..d01bd80d 100644 --- a/pipelines/system-upgrade/SKILL.md +++ b/pipelines/system-upgrade/SKILL.md @@ -37,43 +37,11 @@ routing: # System Upgrade Pipeline -## Operator Context - -This skill orchestrates systematic upgrades to the agent/skill/hook/script ecosystem -when external changes warrant adaptation. It is a **top-down** upgrade mechanism — -triggered by Claude Code releases, user goal changes, or accumulated retro learnings — -complementing the **bottom-up** retro-knowledge-injector. - -### Hardcoded Behaviors (Always Apply) -- **Show Plan Before Implementing**: Phase 3 output (ranked upgrade list) MUST be presented to the user and approved before Phase 4 begins. Never silently execute upgrades. -- **Reuse Domain Agents**: Phase 4 (IMPLEMENT) dispatches to existing domain agents (skill-creator, agent-creator-engineer, hook-development-engineer, golang-general-engineer, etc.). The upgrade engineer orchestrates; specialists execute. -- **Parallel Fan-Out**: When 3+ components need the same type of upgrade, dispatch in parallel using multiple Agent tool calls in a single message. -- **Score Delta Required**: Phase 5 (VALIDATE) must produce before/after evaluation delta, not just "looks good." Use `agent-evaluation` skill. -- **Trigger Type Determines Input**: The three trigger types (claude-release, goal-change, retro-driven) require different input parsing in Phase 1. - -### Default Behaviors (ON unless disabled) -- **Scope Limiting**: Default audit depth = 10 most-recently-modified agents + all hooks. Full audit only if user says "comprehensive" or "all". -- **Dry Run Presentation**: Show Phase 3 output as a formatted table with Tier (critical/important/minor) and effort estimate. -- **Branch Creation**: Create a branch before Phase 4 (e.g., `chore/system-upgrade-YYYY-MM-DD`). 
- -### Optional Behaviors (OFF unless enabled) -- **Comprehensive Audit**: Audit all agents and skills (slow; enable with "comprehensive audit") -- **Full Upgrade Diff**: Force a full component scan instead of incremental diff (enable with `python3 ~/.claude/scripts/upgrade-diff.py --full` or "full upgrade") -- **Auto-Approve**: Skip user approval gate between Phase 3 and Phase 4 (enable with "auto-apply") -- **Skip Validate**: Skip agent-evaluation scoring (enable with "skip validation") - -## What This Skill CAN Do -- Parse Claude Code release notes and map changes to affected component types -- Audit agents, skills, hooks, and scripts for patterns that need updating -- Produce a ranked upgrade plan with tier classification and estimated effort -- Dispatch parallel upgrade agents for independent changes -- Score components before/after with agent-evaluation -- Create branch, commit, sync to ~/.claude, and create PR - -## What This Skill CANNOT Do -- Make architectural decisions without user approval (Phase 3 gate is mandatory) -- Modify core scripts (feature-state.py, plan-manager.py) — those require explicit user direction -- Guarantee correctness of generated upgrades — validation phase catches regressions +## Overview + +This skill orchestrates systematic upgrades to the agent/skill/hook/script ecosystem when external changes warrant adaptation. It is a **top-down** upgrade mechanism—triggered by Claude Code releases, user goal changes, or accumulated retro learnings—complementing the **bottom-up** retro-knowledge-injector. + +The pipeline enforces a mandatory approval gate: Phase 3 output (ranked upgrade list) MUST be presented to the user and approved before Phase 4 begins. Never silently execute upgrades. --- @@ -109,7 +77,7 @@ python3 ~/.claude/scripts/learning-db.py query --category design --category gotc ``` Evaluate entries for actionability and specificity. These are the upgrade signals. 
-**Output**: A structured "Change Manifest" — a list of change signals with type, description, and likely affected component types. +**Output**: A structured "Change Manifest"—a list of change signals with type, description, and likely affected component types. **Gate**: Change Manifest has at least 1 actionable signal. If zero signals found, report to user and stop. @@ -127,12 +95,12 @@ python3 ~/.claude/scripts/upgrade-diff.py Evaluate the JSON output: - If `mode` is `"incremental"` and `total_changed > 0`: scope the audit to only the files listed in `changed`. Skip Step 1 (audit depth) and proceed directly to Step 2 using only these components. -- If `mode` is `"incremental"` and `total_changed == 0`: report "No components changed since last upgrade" to the user and **stop**. No further phases are needed. +- If `mode` is `"incremental"` and `total_changed == 0`: report "No components changed since last upgrade" to the user and **stop**. No further phases are needed. (This prevents wasted effort when nothing has changed since the last upgrade.) - If `mode` is `"full"` (first run or `--full` flag): proceed with existing full audit behavior starting at Step 1. **Step 1**: Determine audit depth. -- Default: 10 most-recently-modified agents + all hooks + all relevant skills -- Comprehensive: all agents + all skills + all hooks +- **Default**: 10 most-recently-modified agents + all hooks + all relevant skills. This balances thoroughness with speed, focusing on components most likely to need changes. +- **Comprehensive**: all agents + all skills + all hooks. (Enable only if user says "comprehensive" or "all"; full audits are slower but ensure complete coverage.) 
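The branching on the `upgrade-diff.py` output described earlier in this phase can be sketched as follows. This is a hedged illustration: it relies only on the `mode`, `total_changed`, and `changed` fields named in the text, it assumes `changed` is a list of file paths, and the function name `plan_audit_scope` and its return shape are hypothetical.

```python
import json


def plan_audit_scope(raw_output: str) -> dict:
    """Decide the audit scope from upgrade-diff.py's JSON output (assumed shape)."""
    report = json.loads(raw_output)
    mode = report.get("mode")
    if mode == "incremental":
        if report.get("total_changed", 0) > 0:
            # Audit only the changed files and skip the audit-depth step.
            return {"action": "audit", "scope": list(report.get("changed", []))}
        # Nothing changed since the last upgrade: report to the user and stop.
        return {"action": "stop", "scope": []}
    if mode == "full":
        # First run or --full flag: continue with the full audit at Step 1.
        return {"action": "full-audit", "scope": []}
    raise ValueError(f"unexpected mode: {mode!r}")
```

Raising on an unknown `mode` is a design choice of this sketch; the source does not say how the pipeline treats unexpected output.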
```bash # Get most recently modified agents @@ -161,10 +129,10 @@ grep -l "goroutine\|concurrency" agents/*.md skills/*/SKILL.md |------------|-------------|--------| | `deprecate` | Component is now obsolete or superseded | Low | | `upgrade` | Component needs modification to use new capability | Medium | -| `create-new` | Gap identified — new component needed | High | +| `create-new` | Gap identified—new component needed | High | | `inject-pattern` | Add a new hardcoded behavior or rule | Low-Medium | -**Step 4**: Produce the **Audit Report** — a list of affected components with their change type and rationale. +**Step 4**: Produce the **Audit Report**—a list of affected components with their change type and rationale. **Gate**: Audit Report produced. Proceed to Phase 3. @@ -172,7 +140,7 @@ grep -l "goroutine\|concurrency" agents/*.md skills/*/SKILL.md ### Phase 3: PLAN -**Goal**: Produce a ranked upgrade plan and get user approval before any changes. +**Goal**: Produce a ranked upgrade plan and get user approval before any changes. (The approval gate is mandatory; this prevents mass edits without visibility and ensures the user controls what changes are made to their system.) **Step 1**: Sort the Audit Report by priority: @@ -225,7 +193,7 @@ git checkout -b chore/system-upgrade-$(date +%Y-%m-%d) ### Phase 4: IMPLEMENT -**Goal**: Execute the approved plan by dispatching domain agents for each change. +**Goal**: Execute the approved plan by dispatching domain agents for each change. (Reuse domain agents; the upgrade engineer orchestrates while specialists execute. Parallel fan-out when 3+ changes target the same domain.) 
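The grouping and fan-out rule for this phase can be sketched as below. It is an illustration under stated assumptions: each approved change is represented as a dict with hypothetical `agent` and `component` keys, and the threshold of 3 comes from the "3+ changes" rule above.

```python
from collections import defaultdict

PARALLEL_THRESHOLD = 3  # from the rule: fan out when 3+ changes share a domain


def group_for_dispatch(changes: list[dict]) -> dict:
    """Group approved changes by domain agent and flag groups for parallel fan-out.

    The `agent` and `component` keys are hypothetical names for this sketch.
    """
    groups = defaultdict(list)
    for change in changes:
        groups[change["agent"]].append(change["component"])
    return {
        agent: {
            "components": components,
            "parallel": len(components) >= PARALLEL_THRESHOLD,
        }
        for agent, components in groups.items()
    }
```

A group flagged `parallel` would be dispatched as multiple agent calls at once; smaller groups can be handled one at a time.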
**Step 1**: Group changes by domain agent that should handle them: @@ -245,7 +213,7 @@ For each dispatched agent, provide: - The rationale (from the Change Manifest) - The relevant context (surrounding code, other files that reference this component) -**Step 3**: For low-effort changes (inject-pattern, Minor tier), make direct edits rather than dispatching agents. Batch these into one pass. +**Step 3**: For low-effort changes (inject-pattern, Minor tier), make direct edits rather than dispatching agents. Batch these into one pass. (This avoids overhead for simple changes while reserving agents for complex work.) **Step 4**: Track completion. Mark each planned item as done as agents complete. @@ -255,7 +223,7 @@ For each dispatched agent, provide: ### Phase 5: VALIDATE -**Goal**: Score changed components before/after to quantify upgrade quality. +**Goal**: Score changed components before/after to quantify upgrade quality. (Produce before/after evaluation delta, not just "looks good." Use `agent-evaluation` skill.) **Step 1**: For each modified agent or skill, run evaluation: @@ -275,7 +243,7 @@ VALIDATION REPORT **Step 2**: Flag any regressions (after < before). For regressions: - Report to user - Suggest fix or revert -- Do NOT auto-revert — user decides +- Do NOT auto-revert—user decides **Step 3**: For hook modifications, run syntax check: ```bash @@ -356,46 +324,6 @@ Solution: Manually copy modified files to `~/.claude/` equivalent directories. 
R --- -## Anti-Patterns - -### Anti-Pattern 1: Skipping the Plan Approval Gate -**What it looks like**: Moving from AUDIT directly to IMPLEMENT without showing the user what will change -**Why wrong**: Mass edits without visibility can break the system in hard-to-trace ways -**Do instead**: Always present the ranked plan and wait for explicit approval - -### Anti-Pattern 2: Handling All Changes Directly Instead of Dispatching -**What it looks like**: Making all edits inline rather than routing to domain agents -**Why wrong**: Domain agents (skill-creator, hook-development-engineer) know the templates and anti-patterns for their domain -**Do instead**: Dispatch to domain agents for anything beyond simple pattern injection - -### Anti-Pattern 3: Auditing Everything Every Time -**What it looks like**: Full audit of all 120+ skills on every trigger -**Why wrong**: Most changes affect a subset of components; full audits waste time and dilute focus -**Do instead**: Target the audit to the change signals. Comprehensive mode is opt-in. - -### Anti-Pattern 4: Skipping VALIDATE for "Simple" Changes -**What it looks like**: Deploying without agent-evaluation scores because "it's just a comment injection" -**Why wrong**: Even small changes can break an agent's Operator Context or scoring criteria -**Do instead**: Always score — even a 1-minute validation catches regressions before they reach production - ---- - -## Examples - -### Example 1: Claude Code release with new hook event -User: "Claude Code just shipped with a Notification event type for hooks. Upgrade the system." -Actions: Phase 1 parses "Notification event type". Phase 2 scans all hooks for event handling. Phase 3 shows plan (3 hooks need updating, 2 agents need docs update). User approves. Phase 4 dispatches hook-development-engineer. Phase 5 validates. Phase 6 deploys. - -### Example 2: Goal change — new domain focus -User: "I'm now working heavily with Rust. Update the system to handle Rust projects." 
-Actions: Phase 1 extracts "Rust as new domain". Phase 2 audits hooks (no Rust file patterns), `/do` routing (no Rust triggers), error-learner (no Rust tags). Phase 3 proposes: 1 new agent (rust-general-engineer), 2 hook updates (learning injector + retro injector), 1 routing update. User approves. Phase 4 dispatches agent-creator-engineer + hook-development-engineer in parallel. Phases 5–6 validate and deploy. - -### Example 3: Retro-driven upgrade -User: "/retro graduate" shows 7 ready candidates. -Actions: Phase 1 queries learning.db for design/gotcha candidates as the Change Manifest. Phase 2 maps candidates to target agents. Phase 3 proposes injecting 7 patterns into 5 agents. User approves with "skip 3 and 6". Phase 4 injects 5 patterns directly (Low effort, no domain agent needed). Phase 5 scores the 5 modified agents. Phase 6 deploys. - ---- - ## References - [agent-upgrade](../agent-upgrade/SKILL.md) - Bottom-up single-agent upgrade pipeline (complements this top-down system pipeline) diff --git a/pipelines/systematic-debugging/SKILL.md b/pipelines/systematic-debugging/SKILL.md index d6156f98..39b069dd 100644 --- a/pipelines/systematic-debugging/SKILL.md +++ b/pipelines/systematic-debugging/SKILL.md @@ -40,50 +40,6 @@ routing: Evidence-based 5-phase debugging pipeline with mandatory gates between each phase. No phase may be skipped. Each phase produces artifacts that survive context resets. -## Operator Context - -This pipeline operates as an operator for systematic debugging workflows, configuring Claude's behavior for rigorous, evidence-based root cause analysis. It implements the **Iterative Refinement** architectural pattern -- form hypothesis, test, refine, verify -- with **Domain Intelligence** embedded in the debugging methodology. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before debugging -- **Over-Engineering Prevention**: Fix only the bug. 
No speculative improvements, no "while I'm here" changes -- **Reproduce First**: NEVER attempt fixes before creating reliable reproduction -- **No Random Changes**: Every modification must be based on evidence from isolation -- **Evidence Required**: Every hypothesis must be tested with concrete evidence -- **Verify Fixes**: Confirm fix works AND doesn't introduce regressions -- **Phase Gates Enforced**: Each phase must pass its gate before the next begins - -### Default Behaviors (ON unless disabled) -- **Minimal Reproduction**: Create smallest possible test case that shows bug -- **Bisection Strategy**: Use binary search to narrow down failure point -- **One Change at a Time**: Never make multiple changes simultaneously -- **Document Findings**: Log all observations, hypotheses, and test results -- **Related Issues Check**: Search for similar bugs in codebase and git history -- **Temporary File Cleanup**: Remove debug logs and profiling output at completion -- **Persistent Debug File**: Maintain `.debug-session.md` for context-reset resilience - -### Optional Behaviors (OFF unless enabled) -- **Regression Test Creation**: Write automated test for this specific bug -- **Git Bisect**: Use `git bisect` to find breaking commit -- **Performance Profiling**: Run profiler to identify bottlenecks -- **Database Query Analysis**: Use EXPLAIN for slow query debugging -- **Network Tracing**: Capture traffic for API debugging - -## What This Pipeline CAN Do -- Systematically find root causes through evidence-based investigation -- Create minimal reproductions that isolate the exact failure -- Distinguish between symptoms and root causes -- Verify fixes don't introduce regressions -- Document findings for future reference -- Record patterns to learning.db for future sessions - -## What This Pipeline CANNOT Do -- Fix bugs without first reproducing them -- Make speculative changes without evidence -- Optimize performance (use performance-optimization-engineer instead) -- 
Refactor code (use systematic-refactoring instead) -- Skip any of the 5 phases - --- ## Instructions @@ -94,6 +50,8 @@ This pipeline operates as an operator for systematic debugging workflows, config **Artifact**: `debug-observations.md` +**Core Principle**: Reproduce first, always. NEVER attempt fixes before creating a reliable reproduction. This prevents you from chasing the wrong problem and ensures you can verify any fix actually works. + **Step 1: Document the bug** ```markdown @@ -104,17 +62,19 @@ Environment: [OS, language version, dependencies] ``` **Step 2: Create minimal reproduction** + - Strip to essentials -- remove unrelated code - Use smallest dataset that shows the bug - Isolate from external services where possible **Step 3: Verify consistency** -Run reproduction **3 times**. If inconsistent, identify variables (timing, randomness, concurrency) and add controls to make it deterministic. +Run reproduction **3 times**. If inconsistent, identify variables (timing, randomness, concurrency) and add controls to make it deterministic. Consistency is not optional -- if you can't reproduce it reliably, you cannot verify a fix. **Step 4: Check knowledge base** At the start of every new debug investigation, before forming any hypotheses: + 1. Check if `.debug-knowledge-base.md` exists in the project root 2. If it exists, search for keyword matches against the current symptom signature 3. Matches are **hypothesis candidates**, not confirmed diagnoses @@ -140,6 +100,8 @@ Create `.debug-session.md` with immutable Symptoms and Reproduction Steps sectio **Artifact**: Updated `debug-observations.md` with hypotheses section. +**Core Principle**: Form evidence-based hypotheses, not random guesses. Each hypothesis must have concrete evidence supporting it, and each must be testable. One change at a time -- multiple simultaneous changes hide which one fixed it. 
+ **Step 1: List components involved in the failure** ```markdown @@ -178,7 +140,9 @@ Generate 3-5 hypotheses. Rank by likelihood based on evidence gathered so far. **Step 5: Update persistent debug file** -Write current top hypothesis and next action to `.debug-session.md` BEFORE taking any debugging action. +Write current top hypothesis and next action to `.debug-session.md` BEFORE taking any debugging action. This creates an audit trail that survives context resets. + +**Anti-Pattern Trap**: Do not make changes based on visual inspection alone. "I can see the bug" misses edge cases and is not evidence. Form a hypothesis, test it with data, then decide. **GATE**: At least 3 hypotheses documented with supporting evidence and test plans. Identified smallest code path and input that reproduces the bug. Proceed only when gate passes. @@ -188,6 +152,10 @@ Write current top hypothesis and next action to `.debug-session.md` BEFORE takin **Goal**: Write the minimal reproduction test, verify it fails (red). +**Artifact**: Reproduction test + test results + +**Core Principle**: Verify the bug with a test before attempting any fix. The test is the oracle that tells you when the bug is truly fixed. If 5+ consecutive Read/Grep/Glob calls occur without an Edit/Write/Bash action, STOP and explain what you're looking for and why before proceeding. Document the justification in `.debug-session.md` under Current Hypothesis -- this creates an audit trail of investigation decisions. + **Step 1: Write reproduction test** Create a test that captures the exact bug behavior: @@ -229,7 +197,11 @@ For the confirmed hypothesis area: ### Phase 4: FIX -**Goal**: Implement fix, verify test turns green. +**Goal**: Implement fix, verify test turns green. Ensure no regressions. + +**Artifact**: Green test + passing full suite + +**Core Principle**: Fix only the confirmed root cause. No speculative improvements, no "while I'm here" changes. 
Simple changes cause complex regressions -- complete all verification steps. If a specific test passes but the full suite fails, you have introduced regressions. Run the full suite every time. **Step 1: Make the smallest possible fix** @@ -237,19 +209,19 @@ Address only the confirmed root cause. No speculative improvements. **Step 2: Verify reproduction test passes** -Run the reproduction test. It must turn GREEN. +Run the reproduction test. It must turn GREEN. If it doesn't, the fix didn't work and you need to return to Phase 2 with new hypotheses. **Step 3: Test edge cases** -Test boundary values, empty input, null, maximum values. +Test boundary values, empty input, null, maximum values. Don't assume the fix works beyond the exact reproduction case. **Step 4: Run full test suite** -Verify no regressions. ALL tests must pass. +Verify no regressions. ALL tests must pass. Tests relied on buggy behavior, or fix changed API contract? If tests expected buggy behavior -> update tests. If fix exposed other bugs -> apply 5-phase process to each. **Step 5: Test related functionality** -Check similar patterns that might share the same root cause. +Check similar patterns that might share the same root cause. A root cause can appear in multiple places. **Step 6: Document fix summary** @@ -270,6 +242,8 @@ Testing: reproduction passes, edge cases pass, full suite passes **Goal**: Update learning database with pattern and fix for future sessions. +**Artifact**: Updated `.debug-knowledge-base.md` and learning.db entry + **Step 1: Record to knowledge base** Append entry to `.debug-knowledge-base.md`: @@ -326,62 +300,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Fixing Without Reproducing -**What it looks like**: "Let me add better error handling" before seeing the actual error -**Why wrong**: Can't verify fix works, may fix wrong issue -**Do instead**: Complete Phase 1 first. Always. 
- -### Anti-Pattern 2: Random Changes Without Evidence -**What it looks like**: "Maybe if I change this timeout..." without data -**Why wrong**: May mask symptom while leaving root cause. Can't explain why it works. -**Do instead**: Form hypothesis -> test -> confirm/refute -> iterate - -### Anti-Pattern 3: Multiple Changes at Once -**What it looks like**: Adding null check + fixing loop + wrapping in try/catch simultaneously -**Why wrong**: Can't determine which change fixed it. Introduces unnecessary code. -**Do instead**: One change, one test. Repeat until fixed. - -### Anti-Pattern 4: Insufficient Verification -**What it looks like**: "Specific test passes, ship it!" without running full suite -**Why wrong**: May have introduced regressions or missed edge cases -**Do instead**: Complete all Phase 4 steps before declaring done. - -### Anti-Pattern 5: Undocumented Root Cause -**What it looks like**: `git commit -m "Fixed bug"` with no explanation -**Why wrong**: Bug will reappear. No institutional knowledge preserved. -**Do instead**: Document root cause, fix, and create regression test. Complete Phase 5. - ---- - -## Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I can see the bug, no need to reproduce" | Visual inspection misses edge cases | Run reproduction 3 times | -| "This is probably the fix" | Probably != proven | Form hypothesis, test with evidence | -| "Tests pass, must be fixed" | Specific test != full suite | Run full test suite | -| "Simple change, no need to verify" | Simple changes cause complex regressions | Complete Phase 4 | -| "No need to record, I'll remember" | Context resets lose everything | Complete Phase 5 | - ---- - -## Analysis Paralysis Guard - -If 5+ consecutive Read/Grep/Glob calls occur without an Edit/Write/Bash action, -STOP and explain what you are looking for and why before proceeding. 
- -After explaining, justification for continued reading MUST be recorded in `.debug-session.md` under the Current Hypothesis section -- not just stated verbally. This creates an audit trail of investigation decisions that survives context resets. - ---- - ## References -This pipeline uses these shared patterns: -- [Anti-Rationalization](../../skills/shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../../skills/shared-patterns/verification-checklist.md) - Pre-completion checks - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/debugging-patterns.md`: Common bug patterns by category - `${CLAUDE_SKILL_DIR}/references/tools.md`: Language-specific debugging tools diff --git a/pipelines/systematic-refactoring/SKILL.md b/pipelines/systematic-refactoring/SKILL.md index 970b29a4..26670e5b 100644 --- a/pipelines/systematic-refactoring/SKILL.md +++ b/pipelines/systematic-refactoring/SKILL.md @@ -46,46 +46,6 @@ routing: Safe, verifiable refactoring through 5 explicit phases with mandatory gates. Each phase has gates that prevent common refactoring mistakes: breaking behavior, incomplete migrations, or orphaned code. -## Operator Context - -This pipeline operates as an operator for safe code refactoring, configuring Claude's behavior for incremental, verifiable changes. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution -- **Over-Engineering Prevention**: Only refactor what's directly requested. Keep changes minimal and focused. No speculative improvements or "while we're here" changes without explicit request. 
-- **NEVER change behavior without tests**: Characterization tests required before changes -- **NEVER make multiple changes at once**: One atomic change per commit -- **NEVER skip validation**: Tests must pass after every change -- **ALWAYS preserve external API**: Unless explicitly requested -- **Phase Gates Enforced**: Each phase must pass its gate before the next begins - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show command output rather than describing it. Be concise but informative. -- **Temporary File Cleanup**: Remove temporary test files, debug outputs, or backup files created during refactoring at task completion. Keep only files explicitly needed. -- **Write characterization tests first**: Capture current behavior before changing -- **Incremental commits**: Commit at each stable point -- **Update all callers**: Find and update every reference -- **Remove dead code**: Clean up orphaned code after migration - -### Optional Behaviors (OFF unless enabled) -- **Performance benchmarks**: Compare before/after performance -- **Documentation updates**: Auto-update docs for API changes -- **Type migration**: Update type definitions across codebase - -## What This Pipeline CAN Do -- Safely rename functions, variables, files across a codebase -- Extract code into new modules with caller migration -- Change function signatures with incremental migration -- Restructure directory layouts preserving all behavior -- Consolidate duplicate code with test verification -- Record refactoring patterns for future sessions - -## What This Pipeline CANNOT Do -- Fix bugs (use systematic-debugging instead) -- Add new features (use workflow-orchestrator instead) -- Make multiple changes simultaneously without testing between each -- Skip characterization tests before modifying code -- Leave incomplete migrations (old code alongside new) --- @@ -95,6 +55,8 @@ This pipeline operates as an operator for safe code 
refactoring, configuring Cla **Goal**: Document current behavior with tests before touching any code. +**Key Constraint**: NEVER change behavior without tests. Write characterization tests first. Capture current behavior before changing anything. + **Artifact**: Characterization test suite (green). ``` @@ -133,12 +95,12 @@ This pipeline operates as an operator for safe code refactoring, configuring Cla **Actions in this phase:** 1. Read the code to be refactored completely -2. Find all callers with Grep +2. Find all callers with Grep (be exhaustive — string refs, configs, reflection) 3. Run existing tests, note coverage gaps 4. Write characterization tests for uncovered behavior 5. Verify all tests pass -**GATE**: Test suite exists that verifies current behavior. ALL tests GREEN. Proceed only when gate passes. +**GATE**: Test suite exists that verifies current behavior. ALL tests GREEN. Zero gaps in coverage. Proceed only when gate passes. --- @@ -146,6 +108,8 @@ This pipeline operates as an operator for safe code refactoring, configuring Cla **Goal**: Identify refactoring targets, define incremental steps with rollback points. +**Key Constraints**: Only refactor what's directly requested. Keep changes minimal and focused. No speculative improvements. NEVER make multiple changes at once — one atomic change per commit. Break into smallest possible atomic changes with clear dependencies and rollback procedures for each step. + **Artifact**: `refactor-plan.md` ``` @@ -185,14 +149,14 @@ This pipeline operates as an operator for safe code refactoring, configuring Cla ``` **Actions in this phase:** -1. Define the exact end state +1. Define the exact end state (no scope creep) 2. Break into smallest possible atomic changes 3. Identify dependencies between steps 4. Define rollback procedure for each step 5. Estimate risk level for each step 6. 
Write `refactor-plan.md` to project root -**GATE**: `refactor-plan.md` exists with clear sequence of atomic changes and rollback points for each step. Proceed only when gate passes. +**GATE**: `refactor-plan.md` exists with clear sequence of atomic changes, rollback points, and scope strictly bounded. No speculative improvements included. Proceed only when gate passes. --- @@ -200,6 +164,8 @@ This pipeline operates as an operator for safe code refactoring, configuring Cla **Goal**: Apply changes incrementally, run tests after each step. Tests must stay green throughout. +**Key Constraints**: NEVER skip validation — tests must pass after every change. NEVER make multiple changes at once — one atomic change per commit. Phase gates enforced: each step must pass before the next begins. + ``` =============================================================== PHASE 3: REFACTOR - Step [N] of [Total] @@ -229,18 +195,13 @@ This pipeline operates as an operator for safe code refactoring, configuring Cla ``` **Actions in this phase:** -1. Make ONE atomic change +1. Make ONE atomic change (no more) 2. Run ALL tests 3. If pass: commit, move to next step 4. If fail: rollback, investigate, fix, retry 5. Repeat until all steps complete -**Error Recovery:** -- **Test Failure During Execute**: Stop immediately, rollback current step, investigate root cause, fix and retry OR revise plan -- **Incomplete Caller Migration**: Do not remove old code until ALL callers migrated. Use Grep to verify zero remaining references. Check for dynamic references (strings, reflection). -- **Unexpected Dependencies**: Stop and return to PLAN phase. Add new dependencies to plan. May need to add intermediate steps. - -**GATE**: ALL planned steps executed. Tests GREEN after every step. No step skipped or combined. Proceed only when gate passes. +**GATE**: ALL planned steps executed. Tests GREEN after every step. No step skipped or combined. Zero errors on retry. Proceed only when gate passes. 
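
The Phase 3 loop above (one atomic change, run all tests, commit on green, roll back on red) can be sketched as follows. This is a rough sketch under stated assumptions: `execute_plan` and its four callables are hypothetical, and in a real repository the hooks would wrap the project's own test runner and version control commands.

```python
def execute_plan(steps, run_tests, commit, rollback):
    """Apply steps one at a time; stop at the first step that breaks the tests.

    All four arguments are caller-supplied callables (assumed, not a pipeline API).
    Returns the number of steps that were committed.
    """
    for done, step in enumerate(steps):
        step()                 # make ONE atomic change
        if not run_tests():    # run ALL tests after every change
            rollback()         # undo only the current step
            return done        # investigate before retrying or revising the plan
        commit()               # stable point: commit, then move to the next step
    return len(steps)
```

Returning early on the first red run keeps the failure attributable to exactly one step, which is the point of forbidding combined changes.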
--- @@ -248,6 +209,8 @@ This pipeline operates as an operator for safe code refactoring, configuring Cla **Goal**: Full test suite, diff summary, confirm no behavior change. +**Key Constraints**: Always preserve external API unless explicitly requested. Remove dead code: clean up orphaned code after migration. Grep confirms ZERO references to old location. Never leave incomplete migrations. + ``` =============================================================== PHASE 4: VERIFY @@ -288,7 +251,7 @@ This pipeline operates as an operator for safe code refactoring, configuring Cla =============================================================== ``` -**GATE**: Full test suite passes. Zero references to old names. No orphaned code. No behavior changes detected. Proceed only when gate passes. +**GATE**: Full test suite passes. Zero references to old names. No orphaned code. No behavior changes detected. External API preserved. All temporary test files cleaned up (keep only files explicitly needed). Proceed only when gate passes. --- @@ -296,6 +259,8 @@ This pipeline operates as an operator for safe code refactoring, configuring Cla **Goal**: Log refactoring patterns to learning database for future sessions. +**Key Constraint**: Complete recording — future refactors benefit from past patterns. No need to record? That's the signal you should record. Document what almost went wrong or required extra care. + **Step 1: Record refactoring pattern** ```markdown @@ -374,92 +339,22 @@ Phase 4: Verify all callers use new signature Phase 5: Record migration pattern ``` ---- - -## Anti-Patterns - -### Anti-Pattern 1: Big Bang Refactoring - -**What it looks like:** Changing 47 files in one commit, updating function name, all callers, tests, and docs simultaneously. 
- -**Why it's wrong:** -- One test failure breaks everything -- Impossible to bisect which change caused issues -- No rollback points if problems discovered later -- Merge conflicts guaranteed in active codebases - -**Do this instead:** -1. CHARACTERIZE: Write tests for current behavior -2. PLAN: Break into steps (add new function, migrate callers gradually, remove old) -3. REFACTOR: Commit after each atomic change (5-10 callers at a time) -4. VERIFY: Tests pass after every step - -### Anti-Pattern 2: Refactoring Without Tests First -**What it looks like:** Immediately creating new function and updating callers without writing tests. +## Error Handling -**Why it's wrong:** -- No verification that behavior is preserved -- Silent bugs introduced during extraction -- Can't prove refactoring was safe -- No baseline to compare against +### Test Failure During Refactor +Stop immediately, rollback current step, investigate root cause, fix and retry OR revise plan. -**Do this instead:** -1. CHARACTERIZE: Write tests for current behavior BEFORE touching code -2. Run tests: Verify they pass with current implementation -3. Make change: Extract function -4. Run tests again: Verify same results -5. Tests are your proof of correctness +### Incomplete Caller Migration +Do not remove old code until ALL callers migrated. Use Grep to verify zero remaining references. Check for dynamic references (strings, reflection). Be exhaustive: identify ALL callers upfront using Grep before moving forward. -### Anti-Pattern 3: Incomplete Migration +### Unexpected Dependencies +Stop and return to PLAN phase. Add new dependencies to plan. May need to add intermediate steps. Never continue if dependencies shift mid-refactoring. -**What it looks like:** Creating new location, updating 80% of callers, leaving old function "for backward compatibility". +### Small Changes, Easy to Skip +Small renames break string refs and configs. Grep for all references including strings. 
Never assume "this is too small to need the full process." -**Why it's wrong:** -- Code exists in two places indefinitely -- Future changes need double updates -- Confusion about which to use -- Old code becomes stale and buggy - -**Do this instead:** -1. PLAN: Identify ALL callers upfront (use Grep exhaustively) -2. REFACTOR: Update every single caller before removing old code -3. VERIFY: Grep confirms ZERO references to old location -4. Clean up: Remove old code completely - -### Anti-Pattern 4: Mixing Refactoring with Feature Work - -**What it looks like:** Renaming function AND changing logic in same refactoring. - -**Why it's wrong:** -- Can't tell if tests fail due to rename or logic change -- Violates "preserve behavior" principle -- Impossible to review as pure refactoring -- Rollback becomes unclear - -**Do this instead:** -1. Separate concerns: "This is ONLY refactoring, behavior unchanged" -2. Complete refactoring first: Rename with tests proving no behavior change -3. Then fix bug: In separate phase with new tests for fixed behavior -4. 
Two commits: One refactor (safe), one fix (behavior change) - ---- - -## Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "This refactoring is safe, no tests needed" | Refactoring without tests is flying blind | Write characterization tests first | -| "I'll update the remaining callers later" | Incomplete migrations rot forever | Migrate ALL callers before removing old code | -| "Small rename, no need for full process" | Small renames break string refs and configs | Grep for all references including strings | -| "I can fix this bug while refactoring" | Mixed concerns make failures undiagnosable | Separate commits: refactor then fix | -| "No need to record, it was straightforward" | Future refactors benefit from past patterns | Complete Phase 5 | - ---- +### Mixing Refactoring with Fixes +Separate concerns: This is ONLY refactoring, behavior unchanged. Do refactoring first (safe commit with tests proving no behavior change), then fix bugs separately (behavior change commit). Never combine. ## References - -This pipeline uses these shared patterns: -- [Anti-Rationalization](../../skills/shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../../skills/shared-patterns/verification-checklist.md) - Pre-completion checks -- [Gate Enforcement](../../skills/shared-patterns/gate-enforcement.md) - Phase transition rules diff --git a/pipelines/voice-calibrator/SKILL.md b/pipelines/voice-calibrator/SKILL.md index 06de61f8..9c6c4e7f 100644 --- a/pipelines/voice-calibrator/SKILL.md +++ b/pipelines/voice-calibrator/SKILL.md @@ -34,848 +34,218 @@ routing: # Voice Calibrator -## Operator Context - -This skill operates as an operator for voice calibration workflows, configuring Claude's behavior for rigorous, sample-driven voice profile creation and validation. 
It implements the **Deterministic Analysis** architectural pattern — extract metrics via scripts, interpret via AI, validate via scripts — with **Voice Fidelity** as the primary quality gate. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before calibration -- **Over-Engineering Prevention**: Extract observable patterns from samples, not theoretical analysis -- **Voice Fidelity**: Generated voice skills must pass authorship matching (4/5 roasters minimum) -- **Data Integrity**: NEVER modify curated calibration data, writing samples, or existing voice profile content outside explicit user request - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Display full analysis with A/B comparisons, never summarize pattern counts -- **Run voice_analyzer.py**: Always run deterministic script analysis before AI interpretation -- **Validation Loop**: Run voice_validator.py after generating output, iterate if needed (max 3) -- **Profile Persistence**: Save profile.json and config.json alongside SKILL.md -- **Distinctive Focus**: Prioritize patterns that differ from generic writing - -### Optional Behaviors (OFF unless enabled) -- **Strict Mode**: Require 5/5 roaster match instead of 4/5 minimum -- **Cross-Voice Comparison**: Compare two calibrated voice profiles for differences -- **Batch Analysis**: Analyze all posts in content/posts/ at once -- **Export Mode**: Generate standalone style guide document - -## What This Skill CAN Do -- Calibrate voice profiles from 50+ writing samples via deterministic scripts -- Analyze writing samples to extract quantitative voice metrics (sentence length, punctuation, contractions) -- Generate machine-readable profile.json with measurable targets and tolerances -- Generate SKILL.md voice skills with sample-first architecture for authorship matching -- Validate generated content against voice profile metrics -- Show A/B comparisons between default and calibrated 
output - -## What This Skill CANNOT Do -- Modify curated calibration data, writing samples, or existing voice profile content -- Skip deterministic analysis (scripts MUST run before AI interpretation) -- Generate content without prior calibration (only calibrates, user must invoke voice skill separately) -- Analyze fewer than 3 samples (insufficient data for reliable patterns) -- Copy copyrighted content verbatim (only extract style patterns) +## Overview ---- - -This skill analyzes writing samples and generates voice skill files. Each voice gets its own skill in `skills/voice-{name}/` with both AI instructions (SKILL.md) and machine-readable metrics (profile.json). +This pipeline analyzes writing samples and generates voice skill files for deterministic content generation. Each voice gets its own skill in `skills/voice-{name}/` with both AI instructions (SKILL.md) and machine-readable metrics (profile.json). **Architecture**: Sample-first generation with prompt engineering best practices for maximum voice fidelity. ---- - -## THE CRITICAL LESSON: QUANTITY OF EXAMPLES MATTERS MORE THAN RULES - -V7, V8, V9 of the voice skill all had the right **rules** but failed authorship matching (0/5 roasters said same author). V10 passed 5/5 because it had **100+ real samples categorized by pattern**. - -**The breakthrough insight:** -- Rules tell the AI what to do -- Examples show the AI what the voice looks like -- LLMs are pattern matchers - examples are more powerful than rules -- Roasters detect when content captures IDEAS but not EXECUTION STYLE -- Execution style comes from extensive example exposure, not rule following - -**What V10 had that V7-V9 didn't:** -1. 100+ Reddit comments saved and categorized -2. Examples organized by response type (short, medium, long) -3. Examples organized by pattern (admitting mistakes, acknowledging limits, disagreement) -4. Explicit phrase fingerprints ("For what I do / For what you do", "probably tomorrow but we will see") -5. 
Real typos from the person's writing documented as authenticity markers - -**REQUIREMENT: Generated voice skills MUST include extensive sample collections, not just rules.** - ---- - -## Prompt Engineering Best Practices - -When generating voice skills, apply these techniques for maximum effectiveness: - -### 1. Attention Anchoring (Bolding) - -**Usage**: Apply **bold** strictly to negative constraints and safety guardrails. - -```markdown -**You must strictly avoid** the "It's not X. It's Y" rhetorical pattern. -**NEVER use** em-dashes in any form. -``` - -**Mechanism**: Acts as attention flag for tokenizer, increasing statistical weight of constraint. - -### 2. Cognitive Chunking (Headers) - -**Usage**: Enforce clear hierarchy of instruction. Never present as wall of text. - -```markdown -## Identity (Who the voice is) -## Core Directives (What the voice does) -## Style & Tone (How the voice speaks) -## Negative Constraints (What to avoid) -``` - -**Mechanism**: Helps model separate distinct logical tasks, reducing "instruction bleeding." - -### 3. Context Isolation (Delimiters) - -**Usage**: Separate static instructions from dynamic context. - -```markdown ---- -(Use horizontal rules between sections) - - -(Use XML tags for user-provided content) - -``` - -**Mechanism**: Prevents model from confusing user input with system rules. - -### 4. Probability Dampening (Adverbs) - -**Usage**: Use adverbs when defining personality/tone. Avoid absolute binary instructions. - -```markdown -Write in a **subtly** skeptical tone. -Be **generally** direct rather than absolutely blunt. -Sound **slightly** informal while maintaining clarity. -``` - -**Mechanism**: Restricts token probability distribution less aggressively, preventing caricature. - -### 5. Chain-of-Thought Triggers - -**Usage**: For complex voice generation, add explicit logic directives. - -```markdown -Before writing, think step-by-step: -1. What is my actual position on this topic? -2. 
What would I say if asked about this in a forum? -3. What process did I go through to form this opinion? -``` - -**Mechanism**: Forces intermediate token generation, increasing accuracy. - -### 6. Few-Shot Examples (Pattern Matching) - -**Usage**: Include 1-3 examples of ideal interaction at end of skill. - -```markdown -User: "Explain why caching is hard" -Model (WRONG): "Caching is not storage. It's a bet about the future." -Model (CORRECT): "Think of caching as a bet about the future, not persistent storage." -``` - -**Mechanism**: LLMs are pattern matchers. Examples are more powerful than instructions. - -### 7. Sample-First Architecture - -**Usage**: Present actual writing samples BEFORE rules. - -```markdown -## Phase 1: Voice Grounding (Read Before Writing) - -Read these actual samples to internalize the voice: - -### Sample A: Technical Explanation -> [Actual sample text from user's writing] - -### Sample B: Personal Opinion -> [Another sample showing different context] -``` - -**Mechanism**: Samples show HOW the voice sounds. Rules tell WHAT it should be. Samples are more effective. - ---- - -## Human Authenticity Markers (CRITICAL) - -These patterns distinguish human writing from AI-generated content. Voice skills MUST teach these explicitly or generated content will fail authenticity checks. - -### 1. Incomplete Work Admissions - -**Requirement**: Include at least one admission of something incomplete, imperfect, or "inelegant." - -```markdown -Examples: -- "I haven't yet built the X feature" -- "it misses a bunch of detail" -- "that's a pretty inelegant solution" -- "it works but it's not pretty" -``` - -**Why**: AI presents finished, polished work. Humans admit gaps. - -### 2. Tangential Tool/Process References - -**Requirement**: Allow tangents that don't directly serve the narrative but show the writer's actual workflow. 
- -```markdown -Examples: -- "I built a custom tool for that (called X)" -- "I had Claude Code do the rest of the investigation" -- "I used my shot-scraper har command to grab..." -``` - -**Why**: AI stays on-message. Humans can't resist showing their tools. - -### 3. Assumed Shared Context (Parenthetical Asides) - -**Requirement**: Include parenthetical winks that assume reader knowledge. - -```markdown -Examples: -- "(that Rob Pike)" -- "(so I hope it continues to work)" -- "You won't be surprised to hear..." -- "—new to me, suggested by Claude—" -``` - -**Why**: AI explains everything. Humans assume shared context with their audience. - -### 4. Evolution/Iteration Narrative - -**Requirement**: Show history of attempts, not just final solution. - -```markdown -Examples: -- "I've made several past attempts at solving this problem" -- "a late refactoring, the initial system used..." -- "The first was X, but that missed Y" -``` - -**Why**: AI presents optimal solutions. Humans show the messy path. - -### 5. Mid-Thought Discoveries - -**Requirement**: Include moments where learning happens during writing. - -```markdown -Examples: -- "—new to me, suggested by Claude—" -- "I ended up using that trick in X itself!" -- "Turns out X knows the trick where..." -``` - -**Why**: AI writes from complete knowledge. Humans discover as they write. - -### 6. Unhedged Strong Opinions - -**Requirement**: State opinions directly without AI safety hedges. - -```markdown -WRONG (AI pattern): -- "This raises important concerns about..." -- "One might argue that..." -- "It's worth noting that..." - -RIGHT (Human pattern): -- "I don't like this at all." -- "I totally understand his rage." -- "This completely misses the point!" -``` - -**Why**: AI hedges everything. Humans assert. - -### 7. Playful/Subversive Notes - -**Requirement**: Allow personality to bleed through in unexpected moments. 
- -```markdown -Examples: -- "it felt a bit more subversive to have OpenAI Codex do it instead" -- "is a truly science fiction way of working" -- "The really neat trick there is..." -``` - -**Why**: AI is earnest. Humans have attitude. - -### 8. Specific Artifacts (Not Hypotheticals) - -**Requirement**: Reference real, specific things that can be verified. - -```markdown -Examples: -- Actual commit hashes: "c80b1dee Rename tool..." -- Real commands: "uvx claude-code-transcripts" -- Specific dependencies: "questionary—new to me" -``` - -**Why**: AI generates plausible examples. Humans cite real artifacts. - -### 9. Visible Self-Correction - -**Requirement**: Show thinking that changes direction mid-paragraph. - -```markdown -Examples: -- "At first I thought... but then I realized" -- "or rather, it was my willingness to..." -- "Actually, that's not quite right—" -``` - -**Why**: AI resolves neatly. Humans think out loud. - -### 10. Raw Emotion/Profanity (When Quoting) - -**Requirement**: When quoting others' strong reactions, preserve them raw. - -```markdown -Examples: -- Quote profanity verbatim when relevant -- Don't soften: "He was upset" → "Fuck you people" -- Preserve intensity of original -``` - -**Why**: AI sanitizes. Humans quote reality. - -### 11. Exploration Admission - -**Requirement**: Admit when exploration yielded nothing valuable. - -```markdown -Examples: -- "not everything is valuable" -- "The result essentially of this work was inconclusive" -- "nothing of real value came about from" -- "It's possible I find nothing of value and revert" -``` - -**Why**: AI always finds insights. Humans admit dead ends. - -### 12. Soft Future Planning - -**Requirement**: Plans should have uncertainty built in. - -```markdown -Examples: -- "likely tomorrow but we will see" -- "I may take a break and relax today, but I think I need to" -- "I will likely create" -- "Today is an exploration day" -``` - -**Why**: AI commits to plans. Humans hedge their schedules. 
- -### 13. Metaphor as Framing Device - -**Requirement**: Use metaphors to frame complex ideas, not explain them. - -```markdown -Examples: -- "I'm thinking of the ouroboros. The Snake that eats it's own tail." -- "It's effectively microservices principles applied to agentic cognition" -- "Rather than giving the agent a hammer, we give it a specific set of motions" -``` - -**Why**: AI explains metaphors. Humans drop them and assume understanding. - -### 14. Not-X-but-Y Lists (Contrast Pairs) - -**Requirement**: When explaining alternatives, use contrast pair format. - -```markdown -Examples: -- "Not kubectl get logs: a crashloop-investigator skill" -- "Not just check service: a service-endpoint-verifier" -``` - -**Note**: This is DIFFERENT from the forbidden "It's not X. It's Y" rhetorical pivot. This is listing alternatives with "Not X: Y" format for technical specifications. - -**Why**: Shows concrete alternatives, not rhetorical flourish. - -### 15. Hypothesis-Experiment Framing - -**Requirement**: Frame technical investigations as experiments with explicit uncertainty. - -```markdown -Examples: -- "The hypothesis is that by making skills this granular..." -- "I'm going to build it and find out" -- "Does it orchestrate well? Does it improve outcomes?" -- "No idea if it will work, but I find this an interesting topic" -``` - -**Why**: AI presents conclusions. Humans run experiments. - -### 16. Casual Closers/Interjections - -**Requirement**: End thoughts with casual, almost throwaway observations. - -```markdown -Examples: -- "so sounds super fun" -- "you get the point" -- "but okay, it doesn't take much to do that" -- "but we will see" -``` - -**Why**: AI wraps up neatly. Humans trail off naturally. 
- ---- - -## Authenticity Checklist for Generated Skills - -Before finalizing any voice skill, verify it teaches these patterns: - -**Core 10 (Required):** -- [ ] Incomplete work admissions (teaches vulnerability) -- [ ] Tool/process tangents (teaches showing work) -- [ ] Parenthetical asides (teaches assumed context) -- [ ] Evolution narrative (teaches iteration visibility) -- [ ] Mid-thought discoveries (teaches learning in public) -- [ ] Unhedged opinions (teaches assertion over hedging) -- [ ] Playful moments (teaches personality) -- [ ] Specific artifacts (teaches concreteness) -- [ ] Visible self-correction (teaches thinking out loud) -- [ ] Raw emotion preservation (when quoting) - -**Extended 6 (From calibration samples):** -- [ ] Exploration admission (admits dead ends) -- [ ] Soft future planning (hedged schedules) -- [ ] Metaphor as framing (drops metaphors, doesn't explain) -- [ ] Contrast pair lists (Not X: Y format for alternatives) -- [ ] Hypothesis-experiment framing (explicit uncertainty) -- [ ] Casual closers (trails off naturally) - -**If core markers are missing, content will be AI-detectable.** -**If extended markers are missing, content won't match the target voice specifically.** +**Core principle**: The quantity and organization of writing examples matters more than rules. LLMs are pattern matchers—samples show HOW the voice sounds; rules tell WHAT it should be. Samples are more powerful. --- -## Anti-Essay Patterns (CRITICAL FOR AUTHORSHIP MATCHING) - -These patterns prevent generated content from sounding like "polished blog writing" instead of authentic conversational voice. The difference is critical: roasters can tell when content captures ideas but not execution style. - -### Core Insight - -> **Essay voice**: Has a thesis. Delivers considered opinions. Performs for an audience. -> **Conversational voice**: Has observations. Thinks out loud. Explains to one person. 
- -Voice skills MUST teach the conversational patterns or generated content will sound like a different author entirely. - -### 1. Staccato Rhythm (No Flow) - -**Requirement**: Break sentences apart. One thought per paragraph. Stop frequently. - -```markdown -WRONG (Essay flow): -"At first I thought the newest model would solve whatever problem I was stuck on. GPT-4 will fix my agent issues. Claude 3 will understand my prompts better." - -RIGHT (Staccato): -"I thought new models would solve things. - -They didn't. - -GPT-4 wasn't it. Claude 3 wasn't either." -``` - -**Why**: Essay writers connect thoughts. Conversational writers stack them. - -### 2. No Rhetorical Signposting - -**Requirement**: Never announce your conclusions. Just state them. - -```markdown -FORBIDDEN PHRASES: -- "Here's where I landed:" -- "That's the part nobody talks about." -- "The iteration history here matters." -- "You won't be surprised to hear that" - -CORRECT: -- Just state the conclusion directly -- Let the reader follow without announcements -``` - -**Why**: Signposting is essay scaffolding. Conversation doesn't need it. - -### 3. No Rule of Three - -**Requirement**: Don't use tricolon (three parallel items) for rhetorical effect. - -```markdown -WRONG: -"GPT-4 will fix my agent issues. Claude 3 will understand my prompts better. Gemini 1.5 will handle my long contexts." - -RIGHT: -"I kept thinking the next model would fix things. It didn't." -``` - -**Why**: Rule of three is a writing technique. Conversation doesn't use it. - -### 4. Concede-Then-Assert (Not Assert-Then-Hedge) - -**Requirement**: Uncertainty comes FIRST, then your position. - -```markdown -WRONG (Assert-then-hedge): -"The answer is: it doesn't matter much. Maybe that's just my experience." - -RIGHT (Concede-then-assert): -"I'm not sure if this applies to everyone. For me, the answer is it doesn't matter much." -``` - -**Why**: Essay writers defend positions. Conversational writers explore them. - -### 5. 
Flat Emotional Delivery - -**Requirement**: Don't name emotions. Let them emerge from content. - -```markdown -WRONG (Named emotion): -"What a waste of time that was." -"I hate this obsession with benchmarks." - -RIGHT (Flat delivery): -"Total waste of time." -"Benchmarks don't measure anything useful." -``` - -**Why**: Theatrical emotion is performance. Flat delivery is authentic. - -### 6. Sparse First-Person - -**Requirement**: Let ideas lead sentences, not "I". - -```markdown -WRONG (I-heavy): -"I ran Sonnet 3.5 against Opus 3. I compared the outputs. I found the differences were marginal." - -RIGHT (Ideas lead): -"Running Sonnet against Opus. Comparing outputs. The differences? Marginal." -``` - -**Why**: Essay writers narrate their experience. Conversational writers report observations. - -### 7. No Parenthetical Self-Deprecation - -**Requirement**: Don't apologize for your tools or methods in parentheses. - -```markdown -WRONG: -"(I built a janky spreadsheet for this, nothing pretty)" -"(current)" as a label - -RIGHT: -"I built a spreadsheet. Nothing fancy." -Or just don't mention it at all. -``` +## Instructions -**Why**: Parenthetical self-deprecation is a blogging trick to seem relatable. It feels performed. +### Mode 1: Full Calibration (ANALYZE-GENERATE-VALIDATE) -### 8. Genuine Check-Ins (Not Rhetorical) +Use when: First calibration or major style update -**Requirement**: Check-ins should verify understanding, not seek validation. +#### Step 1: COLLECT Samples (EXTENSIVE COLLECTION REQUIRED) -```markdown -WRONG (Seeking validation): -"Does that resonate with your experience?" +You need 50-100+ writing samples for authorship matching to work. -RIGHT (Verifying understanding): -"Does that help?" -"Does that track?" -"Am I making sense?" -``` +Historical lesson: V7-V9 failed with 3-10 samples. V10 passed with 100+ samples because it had extensive categorized sample collections, not just rules. The samples ARE the skill. Rules are secondary. 
-**Why**: "Does that resonate?" is copywriting. "Does that help?" is conversation. +Sources to mine: +- Reddit comment history (most valuable - casual voice) +- HackerNews comments +- Forum posts +- Blog posts +- Email threads (if provided) +- Chat logs (if provided) -### 9. Mid-Sentence Pivots +Required minimum: +- 50+ individual samples +- Mix of response lengths (short, medium, long) +- Mix of contexts (technical, casual, disagreement) -**Requirement**: Show course-corrections inside sentences, not between them. +Save samples to: `skills/voice-{name}/references/samples/` -```markdown -WRONG (Clean self-correction): -"I thought the model was the problem. But then I realized it was architectural." +Organize by source: + - `reddit-samples-YYYY-MM-DD.md` + - `hn-samples-YYYY-MM-DD.md` + - `blog-samples.md` -RIGHT (Mid-sentence pivot): -"I thought the model was-- actually no, it was architectural." -Or: "I'm not sure if-- my problems are solved doing it my way." +Command: +```bash +mkdir -p skills/voice-{name}/references/samples ``` -**Why**: Essay writers present polished revisions. Conversational writers think out loud. - -### 10. Single-Sentence Paragraphs - -**Requirement**: Most paragraphs should be 1-2 sentences. Never exceed 3. +**Gate**: If user only provides 3-5 samples, STOP. Ask for more. Explain that authorship matching requires 50+ samples. Suggest: "Can you export your Reddit/HN comment history?" -```markdown -WRONG: -[4-6 sentence paragraph] +#### Step 2: ANALYZE (Deterministic) -RIGHT: -"Short thought. +Run voice analyzer on samples to extract quantitative metrics: -Another short thought. +```bash +python3 ~/.claude/scripts/voice_analyzer.py analyze \ + --samples skills/voice-{name}/references/samples/*.md \ + --output skills/voice-{name}/profile.json -A third, building on the second." 
+# View text report for interpretation +python3 ~/.claude/scripts/voice_analyzer.py analyze \ + --samples skills/voice-{name}/references/samples/*.md \ + --format text ``` -**Why**: Essay writers develop ideas in paragraphs. Conversational writers stack observations. - ---- - -## Anti-Essay Checklist for Generated Skills - -Before finalizing any voice skill, verify it teaches these patterns: - -- [ ] Staccato rhythm (one thought per paragraph) -- [ ] No rhetorical signposting ("Here's where I landed") -- [ ] No rule of three constructions -- [ ] Concede-then-assert structure -- [ ] Flat emotional delivery (no "I hate", "What a waste") -- [ ] Sparse first-person (ideas lead, not "I") -- [ ] No parenthetical self-deprecation -- [ ] Genuine check-ins ("Does that help?") -- [ ] Mid-sentence pivots when self-correcting -- [ ] Single-sentence paragraphs as default - -**If these are missing, content will sound like a different author than the samples.** - ---- - -## Deterministic Infrastructure - -This skill uses Python scripts for quantitative analysis. AI handles interpretation and skill generation. 
- -### Scripts Used +The script extracts: +- Sentence length distribution (short/medium/long/very long percentages) +- Punctuation metrics (comma density, em-dash count, question rate) +- Word metrics (contraction rate, person usage, function word signature) +- Structure metrics (paragraph length, fragment rate, sentence starters) +- Pattern signatures (transitions used/avoided, opening/closing patterns) -| Script | Purpose | When Called | -|--------|---------|-------------| -| `voice_analyzer.py analyze` | Extract metrics from samples | Step 2 of calibration | -| `voice_analyzer.py compare` | Compare two voice profiles | Voice comparison mode | -| `voice_validator.py validate` | Validate generated output | Step 4 of calibration | -| `voice_validator.py check-banned` | Quick pattern check | During refinement | -| `voice_validator.py check-rhythm` | Check sentence rhythm only | Rhythm validation | +**Reasoning**: Quantitative grounding prevents vague interpretations. Metrics like "contraction rate: 85%" or "avg sentence length: 14.2 words" are specific and measurable, not subjective descriptions. -### Generated Files +#### Step 3: GENERATE (AI) -After calibration, the voice skill directory contains: +Using profile.json as foundation, generate three artifacts: -``` -skills/voice-{name}/ -├── SKILL.md # AI instructions (generated) -├── profile.json # Machine-readable metrics (from analyzer) -├── config.json # Validation settings -└── references/ - └── samples/ # Input samples (copied) -``` +**3a. 
config.json** - Validation settings -### Profile.json Structure - -The analyzer outputs this structure: +This defines how strictly to validate content against the profile: +- `metric_tolerance`: How strict (0.2 = ±20% variance allowed) +- `required_checks`: Must-pass validations (banned_phrases, punctuation, rhythm) +- `optional_checks`: Nice-to-have validations (metrics, sentence_starters) +- `voice_specific_patterns`: Custom rules for this voice +Example template: ```json { - "meta": { - "samples_analyzed": 3, - "total_words": 5000, - "total_sentences": 250, - "generated_at": "2025-01-15T10:30:00Z" - }, - "sentence_metrics": { - "length_distribution": { - "short_3_10": 0.35, - "medium_11_20": 0.45, - "long_21_30": 0.15, - "very_long_31_plus": 0.05 - }, - "average_length": 14.2, - "variance": 6.8, - "max_consecutive_similar": 2 - }, - "punctuation_metrics": { - "comma_density": 0.045, - "exclamation_rate": 0.02, - "question_rate": 0.08, - "em_dash_count": 0, - "semicolon_rate": 0.01 - }, - "word_metrics": { - "contraction_rate": 0.85, - "first_person_rate": 0.025, - "second_person_rate": 0.018, - "function_word_signature": {"the": 0.045, "to": 0.032, "...": "..."} - }, - "structure_metrics": { - "avg_paragraph_sentences": 3.2, - "fragment_rate": 0.12, - "sentence_starters": { - "pronoun": 0.28, - "conjunction": 0.15, - "article": 0.22, - "adverb": 0.10, - "other": 0.25 - } - }, - "pattern_signatures": { - "transition_words": ["but", "so", "and", "..."], - "avoided_transitions": ["however", "moreover", "furthermore"], - "opening_patterns": ["direct_statement", "pronoun_start"], - "closing_patterns": ["statement", "callback"] + "name": "{Voice Name}", + "version": "2.0.0", + "description": "{Voice description}", + "modes": ["chat", "blog", "technical"], + "validation": { + "strict_banned_patterns": true, + "em_dash_forbidden": true, + "metric_tolerance": 0.20, + "required_checks": ["banned_phrases", "punctuation", "rhythm"], + "optional_checks": ["metrics", 
"sentence_starters"] } } ``` ---- - -## Multi-Voice Support - -### Available Voices - -Voices are stored in `skills/voice-{name}/`. List available voices with: - -```bash -ls skills/voice-*/SKILL.md -``` - -### Creating a New Voice - -``` -/voice calibrate --name yourname --samples [file1] [file2] [file3] -``` - -This will: -1. Copy samples to `skills/voice-{name}/references/samples/` -2. Run `voice_analyzer.py` to extract metrics -3. Generate profile.json, config.json, and SKILL.md -4. Run `voice_validator.py` on test output -5. Show A/B comparison - -### Updating an Existing Voice - -``` -/voice refine --name your-voice --samples [additional samples] -``` - -This merges new patterns with existing calibration. - ---- - -## Instructions - -### Mode 1: Full Calibration (ANALYZE-GENERATE-VALIDATE) - -Use when: First calibration or major style update - -#### Step 1: COLLECT Samples (EXTENSIVE COLLECTION REQUIRED) +**3b. SKILL.md** - AI instructions with EXTENSIVE SAMPLE SECTIONS -``` -CRITICAL: You need 50-100+ writing samples for authorship matching to work. +**CRITICAL**: The SKILL.md must be LONG (2000+ lines minimum). -V7-V9 failed with 3-10 samples. V10 passed with 100+ samples. -The samples ARE the skill. Rules are secondary. +Most of the file should be SAMPLES, not rules. This is the inverse of typical prompt engineering—examples are more powerful than abstract instructions. -Sources to mine: -- Reddit comment history (most valuable - casual voice) -- HackerNews comments -- Forum posts -- Blog posts -- Email threads (if provided) -- Chat logs (if provided) +Required sections in order of importance: -Required minimum: -- 50+ individual samples -- Mix of response lengths (short, medium, long) -- Mix of contexts (technical, casual, disagreement) - -Save samples to: skills/voice-{name}/references/samples/ -Organize by source: - reddit-samples-YYYY-MM-DD.md - hn-samples-YYYY-MM-DD.md - blog-samples.md +**a. 
Phase 1: Voice Grounding (Read Before Writing)** (~400+ lines) -Commands: - mkdir -p skills/voice-{name}/references/samples +Real samples organized by response length and pattern. This section should be the LONGEST section in the skill: +- 10+ samples each for short/medium/long responses +- 5+ samples each for key patterns: mistakes, limits, disagreement, technical explanation -If user only provides 3-5 samples: - STOP. Ask for more. Explain that authorship matching requires 50+ samples. - Suggest: "Can you export your Reddit/HN comment history?" -``` +Structure: +```markdown +## Phase 1: Voice Grounding (Read Before Writing) -#### Step 2: ANALYZE (Deterministic) +Read these actual samples to internalize the voice: -```bash -# Run voice analyzer on samples -python3 ~/.claude/scripts/voice_analyzer.py analyze \ - --samples skills/voice-{name}/references/samples/*.md \ - --output skills/voice-{name}/profile.json +### Sample A: Short Response +> [Actual sample text] -# View text report for interpretation -python3 ~/.claude/scripts/voice_analyzer.py analyze \ - --samples skills/voice-{name}/references/samples/*.md \ - --format text +### Sample B: Technical Explanation +> [Another sample] + +### Sample C: Admitting Limits +> [Sample showing vulnerability] ``` -The script extracts: -- Sentence length distribution (short/medium/long/very long percentages) -- Punctuation metrics (comma density, em-dash count, question rate) -- Word metrics (contraction rate, person usage, function word signature) -- Structure metrics (paragraph length, fragment rate, sentence starters) -- Pattern signatures (transitions used/avoided, opening/closing patterns) +Reason for front-loading samples: Roasters detect when content captures IDEAS but not EXECUTION STYLE. Execution style comes from extensive example exposure, not rule following. -#### Step 3: GENERATE (AI) +**b. 
Phrase Fingerprints** (~50 lines) -Using profile.json as foundation, generate: +10+ distinctive phrase patterns with exact quotes. These are what roasters use to match authorship: +- Recurring sentence starters +- Characteristic closers +- Unique idioms or expressions +- Verbal tics -1. **config.json** - Validation settings: - - metric_tolerance (how strict) - - required_checks (must pass) - - voice_specific_patterns (custom rules) +**c. Natural Typos** (~20 lines) -2. **SKILL.md** - AI instructions with EXTENSIVE SAMPLE SECTIONS: +5+ real typos from their writing. Authenticity markers that humans have but AI typically doesn't. - **CRITICAL: The SKILL.md must be LONG (2000+ lines minimum).** +**d. Voice Metrics** (~100 lines) - Most of the file should be SAMPLES, not rules. +Quantitative targets extracted from profile.json: +- Sentence length targets and ranges +- Contraction rate percentage +- Em-dash usage (if any) +- Paragraph length patterns +- Fragment rate - Required sections (in order of importance): +**e. Human Authenticity Markers** (~300 lines) - a. **Extensive Authentic Samples** (~400+ lines): - - Samples organized by response length (10+ each for short/medium/long) - - Samples organized by pattern (5+ each for mistakes/limits/disagreement/technical) - - This section should be the LONGEST section in the skill +Teach these 16 patterns explicitly—they distinguish human writing from AI-generated content. Generated content will fail authenticity checks without them: - b. **Phrase Fingerprints** (~50 lines): - - 10+ distinctive phrase patterns with exact quotes - - These are what roasters use to match authorship +1. **Incomplete work admissions** - Include at least one admission of something incomplete, imperfect, or "inelegant" (e.g., "I haven't yet built the X feature") +2. **Tangential tool/process references** - Allow tangents showing actual workflow (e.g., "I built a custom tool for that") +3. 
**Assumed shared context** - Parenthetical winks assuming reader knowledge (e.g., "(that Rob Pike)") +4. **Evolution/iteration narrative** - Show history of attempts, not just final solution (e.g., "The first was X, but that missed Y") +5. **Mid-thought discoveries** - Include moments where learning happens during writing (e.g., "new to me, suggested by Claude") +6. **Unhedged strong opinions** - State opinions directly without AI safety hedges (e.g., "I don't like this at all" not "One might argue") +7. **Playful/subversive notes** - Allow personality to bleed through in unexpected moments +8. **Specific artifacts** - Reference real, specific things that can be verified (actual commits, real commands, specific dependencies) +9. **Visible self-correction** - Show thinking that changes direction mid-paragraph (e.g., "At first I thought... but then I realized") +10. **Raw emotion/profanity** - When quoting others' strong reactions, preserve them raw +11. **Exploration admission** - Admit when exploration yielded nothing valuable (e.g., "nothing of real value came about from") +12. **Soft future planning** - Plans should have uncertainty built in (e.g., "likely tomorrow but we will see") +13. **Metaphor as framing device** - Use metaphors to frame ideas, not explain them +14. **Not-X-but-Y lists** - When listing alternatives, use contrast pair format for technical specs +15. **Hypothesis-experiment framing** - Frame investigations as experiments with explicit uncertainty +16. **Casual closers/interjections** - End thoughts with casual, almost throwaway observations - c. **Natural Typos** (~20 lines): - - 5+ real typos from their writing - - Authenticity markers +**f. Anti-Essay Patterns** (~150 lines) - d. **Voice Metrics** (~100 lines): - - Quantitative targets from profile.json - - Sentence length, contraction rate, etc. +Patterns that AI frequently uses but humans don't—these kill authenticity matching: - e. 
**Rules and Prohibitions** (~200 lines): - - Anti-essay patterns - - Human authenticity markers - - Banned phrases +- "It's not X. It's Y" rhetorical pivots (distinct from technical "Not X: Y" contrast pairs) +- "Raises important concerns about..." hedging language +- "One might argue..." passive formality +- "It's worth noting that..." editorial flourish +- Exhaustive explanations and over-completeness +- Wall-of-text paragraphs without natural breaks +- Perfectly resolved tension/conflict +- Prescriptive advice without caveats - f. **Generation Protocol** (~100 lines): - - Pre/During/Post checklists +**g. Rules and Prohibitions** (~200 lines) - **If SKILL.md is under 1500 lines, you don't have enough samples.** +- Banned phrases specific to this voice +- Em-dash rules (forbidden for most voices) +- Contraction minimums/maximums +- Paragraph structure constraints +- Sentence starter preferences/prohibitions -See "Config.json Template" and "Voice Skill Output Structure" sections below. +**h. Generation Protocol** (~100 lines) + +Pre/During/Post checklists: +- Before writing: What patterns apply to this topic? +- During writing: Check sentence lengths, contraction rate, paragraph breaks +- After writing: Run authenticity markers checklist + +**If SKILL.md is under 1500 lines, you don't have enough samples.** + +Reasoning: The file length correlates with sample collection depth. A short file signals insufficient grounding. #### Step 4: VALIDATE (Deterministic) +Generate test content using the voice skill, then validate against profile: + ```bash # Generate test content (AI), save to temp file # Then validate against profile @@ -893,10 +263,12 @@ Exit codes: - 1 = fail (score < 70) - 2 = execution error -If validation fails: -- Show violations (errors, warnings, info) -- Adjust SKILL.md guidance based on violations -- Regenerate test content and revalidate (max 3 iterations) +**Interpretation**: If validation fails: +1. Show violations (errors, warnings, info) +2. 
Adjust SKILL.md guidance based on violations +3. Regenerate test content and revalidate (max 3 iterations) + +The reason we iterate: Validation can surface issues in the SKILL.md instructions that AI misinterpreted. Each iteration helps sharpen the rules. #### Step 5: OUTPUT @@ -936,12 +308,11 @@ Display calibration summary: Use when: Adjusting specific parameters based on feedback -``` Supported adjustments: -- "Make sentences shorter" -> Reduce avg length target by 20% -- "Use fewer lists" -> Set list_usage to "rare" -- "More direct openings" -> Increase direct_statement percentage -- "Less formal" -> Adjust formality marker +- "Make sentences shorter" → Reduce avg length target by 20% +- "Use fewer lists" → Set list_usage to "rare" +- "More direct openings" → Increase direct_statement percentage +- "Less formal" → Adjust formality marker Process: 1. Load existing profile.json @@ -951,6 +322,7 @@ Process: 5. Validate with check-banned for quick feedback Quick validation: +```bash python3 ~/.claude/scripts/voice_validator.py check-banned \ --content test-output.md \ --voice {name} \ @@ -965,19 +337,19 @@ python3 ~/.claude/scripts/voice_validator.py check-banned \ Use when: Testing calibration on new topic -``` Process: 1. Read existing profile.json 2. Take topic from user 3. Generate DEFAULT output (generic Claude style) 4. Generate CALIBRATED output (apply SKILL.md) 5. Validate CALIBRATED output: + ```bash python3 ~/.claude/scripts/voice_validator.py validate \ --content calibrated-output.md \ --profile skills/voice-{name}/profile.json \ --format text + ``` 6. Highlight specific differences -``` **Gate**: Both DEFAULT and CALIBRATED outputs generated. Validation passes on calibrated output. Specific differences highlighted for user review. 
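The validate, adjust, revalidate loop from Step 4 (max 3 iterations) might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the skill's actual implementation: `calibrate_with_retries`, `validate`, and `regenerate` are hypothetical names that do not exist in `voice_validator.py`, and treating exit code 0 as "pass" is an assumption, since only codes 1 and 2 are listed in this section.

```python
from typing import Callable, Tuple

# Exit codes as described for the validator. PASS = 0 is an assumption;
# this section only lists 1 (fail, score < 70) and 2 (execution error).
PASS, FAIL, EXEC_ERROR = 0, 1, 2


def calibrate_with_retries(
    validate: Callable[[str], int],
    regenerate: Callable[[int], str],
    max_iterations: int = 3,
) -> Tuple[bool, int]:
    """Regenerate and revalidate until the validator passes or attempts run out.

    validate:   hypothetical callable that returns the validator's exit code
                for a piece of generated content.
    regenerate: hypothetical callable that produces test content for the
                given attempt number (1-based).
    Returns (passed, attempts_used).
    """
    for attempt in range(1, max_iterations + 1):
        content = regenerate(attempt)
        code = validate(content)
        if code == PASS:
            return True, attempt
        if code == EXEC_ERROR:
            # An execution error says nothing about the voice itself,
            # so stop instead of burning another iteration.
            raise RuntimeError("validator could not run; check the invocation")
        # Any other code is treated as a validation failure: this is where
        # SKILL.md guidance would be adjusted before the next attempt.
    return False, max_iterations
```

In practice `validate` could wrap a `subprocess.run` call to the `voice_validator.py validate` command shown above and return its `returncode`. Passing the two steps in as callables keeps the retry logic testable without invoking the real script.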
@@ -1000,579 +372,296 @@ Shows differences in: - Word metrics (contraction rate, person usage) - Pattern signatures (unique transitions, opening/closing patterns) ---- - -## Config.json Template - -```json -{ - "name": "{Voice Name}", - "version": "2.0.0", - "description": "{Voice description}", - "modes": ["chat", "blog", "technical"], - "validation": { - "strict_banned_patterns": true, - "em_dash_forbidden": true, - "metric_tolerance": 0.20, - "required_checks": ["banned_phrases", "punctuation", "rhythm"], - "optional_checks": ["metrics", "sentence_starters"] - }, - "thresholds": { - "pass_score": 70, - "error_max": 0, - "warning_max": 5 - }, - "voice_specific_patterns": [ - { - "name": "example_pattern", - "type": "forbidden", - "patterns": ["pattern1", "pattern2"], - "severity": "warning", - "message": "This pattern doesn't match the voice" - } - ], - "metrics": { - "contraction_rate": 0.85, - "comma_density": 0.045, - "avg_sentence_length": 14.2 - } -} -``` +**Reasoning for comparison**: Understanding how two voices differ at the metric level helps calibrate which samples most strongly drive the differences. --- -## Voice Skill Output Structure (V6 Complete Format) - -When calibration completes, generate a voice skill following this COMPLETE structure. This template incorporates all prompt engineering best practices and must be followed exactly. - -### CRITICAL GENERATION REQUIREMENTS - -Before generating any voice skill, ensure ALL of the following: - -1. **Quantitative Targets are EXPLICIT** - Include exact percentages for: - - Sentence length distribution (short/medium/long/very long) - - Average sentence length (target word count) - - Pronoun starter percentage (often 20-30%) - - Contraction rate - -2. **Probability Dampening is APPLIED** - Use adverbs in trait descriptions: - - "**subtly** skeptical" not "skeptical" - - "**slightly** informal" not "informal" - - "**generally** direct" not "direct" - -3. 
**Context Isolation is APPLIED** - Use XML tags: - - `` for core directives - - `` for prohibitions - -4. **Attention Anchoring is APPLIED** - Bold negative constraints: - - "**NEVER use**", "**You must strictly avoid**" - -5. **Contrastive Examples are COMPLETE** - Include: - - Table comparing voice aspects - - Full paragraph-level comparison (Generic AI vs This Voice) - -6. **Generation Protocol has CHECKLIST** - Pre/During/Post phases with checkbox list +## Prompt Engineering Best Practices for Voice Skills -7. **Quick Reference Card at END** - Scannable summary of voice DNA +When generating voice skills, apply these techniques for maximum effectiveness: ---- +### 1. Attention Anchoring (Bolding) -### OUTPUT TEMPLATE (Copy and Fill) +Apply **bold** strictly to negative constraints and safety guardrails: ```markdown ---- -name: voice-{name} -description: {Name}'s personal writing voice (v6 - complete format with quantitative targets) -version: 6.0.0 ---- +**You must strictly avoid** the "It's not X. It's Y" rhetorical pattern. +**NEVER use** em-dashes in any form. +``` -# Voice: {Name} (V6 - Complete Format) +**Mechanism**: Acts as attention flag for tokenizer, increasing statistical weight of constraint. -This skill uses sample-first architecture with explicit quantitative targets. -Samples demonstrate HOW the voice sounds. Metrics ensure MEASURABLE consistency. +### 2. Cognitive Chunking (Headers) -**Architecture**: Reference injection -> Pre-generation calibration -> Chunked generation -> Validation +Enforce clear hierarchy of instruction. Never present as wall of text: ---- +```markdown +## Identity (Who the voice is) +## Core Directives (What the voice does) +## Style & Tone (How the voice speaks) +## Negative Constraints (What to avoid) +``` -## Identity (Who) +**Mechanism**: Helps model separate distinct logical tasks, reducing "instruction bleeding." -{Name} is a [role description] who communicates with **subtly** [trait1] and **slightly** [trait2]. 
The voice is [characteristic1], [characteristic2], and grounded in [grounding]. +### 3. Context Isolation (Delimiters) -**Core traits:** -- **Slightly** [trait with adverb modifier] -- **Generally** [trait with adverb modifier] -- [Observable behavior pattern] -- [Observable behavior pattern] +Separate static instructions from dynamic context using horizontal rules and XML tags: +```markdown --- +(Use horizontal rules between sections) -## Core Directives (What) - - - -### Primary Objectives - -1. **[Directive 1]** - [Brief explanation] -2. **[Directive 2]** - [Brief explanation] -3. **[Directive 3]** - [Brief explanation] -4. **[Directive 4]** - [Brief explanation] - -### Pre-Generation Thinking (Chain-of-Thought Required) - -**Before writing any response, think step-by-step:** - -1. What is my actual position on this topic? -2. What personal experience or process informs this position? -3. What might the reader misunderstand that I should clarify? -4. What valid counterpoints exist that I should acknowledge? -5. [Voice-specific thinking prompt] - -Only after answering these questions internally should generation begin. - + +(Use XML tags for user-provided content) - ---- - -## Style & Tone (How) - -### Phase 1: Voice Grounding (Read Before Writing) - -Read these actual samples to internalize the voice. The goal is pattern absorption, not rule following. - -**CRITICAL: This section must be EXTENSIVE. 50-100+ samples organized by category.** - -The difference between V7-V9 (failed authorship matching) and V10 (passed 5/5) was the quantity and organization of samples. Rules tell AI what to do. Samples show AI what the voice actually looks like. 
- ---- - -### Extensive Authentic Samples (REQUIRED - Minimum 50) - -#### Response Length Distribution - -**Very short (1 sentence, ~25% of responses):** -``` -[Include 10+ real very short responses] -``` - -**Short (2-3 sentences, ~35% of responses):** -``` -[Include 15+ real short responses] ``` -**Medium (4-6 sentences, ~25% of responses):** -``` -[Include 10+ real medium responses] -``` - -**Long (paragraph+, ~15% of responses):** -``` -[Include 5+ real long responses] -``` - -#### Pattern-Organized Samples (CRITICAL FOR AUTHORSHIP MATCHING) - -**Admitting Mistakes (REQUIRED PATTERN):** -``` -[Include 5+ examples of how this person admits errors] -``` +**Mechanism**: Prevents model from confusing user input with system rules. -**Acknowledging Limits (REQUIRED PATTERN):** -``` -[Include 5+ examples of how this person says "I don't know"] -``` +### 4. Probability Dampening (Adverbs) -**Incomplete Work Admissions (REQUIRED PATTERN):** -``` -[Include 5+ examples of "haven't finished", "on my list for months"] -``` +Use adverbs when defining personality/tone. Avoid absolute binary instructions: -**Respectful Disagreement (REQUIRED PATTERN):** -``` -[Include 5+ examples showing how they disagree without accusation] +```markdown +Write in a **subtly** skeptical tone. +Be **generally** direct rather than absolutely blunt. +Sound **slightly** informal while maintaining clarity. ``` -**Technical Expertise Delivered Casually:** -``` -[Include 5+ examples of technical explanation style] -``` +**Mechanism**: Restricts token probability distribution less aggressively, preventing caricature. -**Casual Closers:** -``` -[Include 5+ examples: "but we'll see", "does that help?", etc.] -``` +### 5. 
Chain-of-Thought Triggers -**Strong Opinions (Delivered Directly):** -``` -[Include 5+ examples of unhedged opinions] -``` +For complex voice generation, add explicit logic directives: -**Sarcasm/Wit (if present):** -``` -[Include examples if this is part of the voice] +```markdown +Before writing, think step-by-step: +1. What is my actual position on this topic? +2. What would I say if asked about this in a forum? +3. What process did I go through to form this opinion? ``` -#### Phrase Fingerprints (CRITICAL) - -These exact phrase patterns appearing in both samples and generated content are what roasters use to confirm authorship. - -| Fingerprint | Example from Samples | -|-------------|---------------------| -| [Unique phrase 1] | "[Exact quote]" | -| [Unique phrase 2] | "[Exact quote]" | -| [Unique phrase 3] | "[Exact quote]" | -[Include 10+ distinctive phrase patterns] - -#### Natural Typos/Errors (AUTHENTICITY MARKERS) - -Real typos from the person's writing. Include occasional typos to match: -- "[typo1]" (correct: [word]) -- "[typo2]" (correct: [word]) -[Include 5+ real typos from their writing] - ---- - -**If you don't have 50+ samples organized this way, the generated skill will fail authorship matching. Go get more samples before proceeding.** - ---- - -### Phase 2: Voice Metrics (QUANTITATIVE TARGETS) +**Mechanism**: Forces intermediate token generation, increasing accuracy. -These metrics are extracted from [N] sentences across [N] writing samples. -**These are TARGETS to hit, not just observations.** +### 6. 
Few-Shot Examples (Pattern Matching) -#### Sentence Architecture (MUST MATCH) +Include 1-3 examples of ideal interaction at end of skill: -| Metric | Target | Tolerance | Notes | -|--------|--------|-----------|-------| -| Average length | **{X} words** | ±2 words | Primary rhythm indicator | -| Short (3-10 words) | **{X}%** | ±5% | For emphasis and pacing | -| Medium (11-20 words) | **{X}%** | ±5% | For explanation | -| Long (21-30 words) | **{X}%** | ±3% | For complex ideas | -| Very long (31+ words) | **{X}%** | ±2% | Rare, for building arguments | -| Fragment rate | {X}% | ±3% | Intentional fragments for emphasis | +```markdown +User: "Explain why caching is hard" +Model (WRONG): "Caching is not storage. It's a bet about the future." +Model (CORRECT): "Think of caching as a bet about the future, not persistent storage." +``` -#### Sentence Starters (CRITICAL DIFFERENTIATOR) +**Mechanism**: LLMs are pattern matchers. Examples are more powerful than instructions. -| Type | Target | Examples | -|------|--------|----------| -| **Pronoun** | **{X}%** | "I think", "It works", "That's what" | -| Other | {X}% | Topic-driven starts | -| Conjunction | {X}% | "But the reality", "And now" | -| Article | {X}% | "The problem", "A skill" | -| Adverb | {X}% | Occasionally, for emphasis | +### 7. 
Sample-First Architecture -**Key insight**: [Explain what the pronoun percentage indicates about voice] +Present actual writing samples BEFORE rules: -#### Punctuation Signature +```markdown +## Phase 1: Voice Grounding (Read Before Writing) -| Element | Target | Rule | -|---------|--------|------| -| Comma density | {X} per word | [Light/Medium/Heavy] comma usage | -| Question rate | {X}% | [Type of questions used] | -| Exclamation rate | {X}% | [When to use, if ever] | -| Em-dashes | **FORBIDDEN** | **Never use under any circumstances** | -| Semicolons | {X}% | [Use or avoid] | -| Contraction rate | **{X}%** | [High/Medium - key formality indicator] | +Read these actual samples to internalize the voice: -#### Function Word Signature (Tier 1 Pattern) +### Sample A: Technical Explanation +> [Actual sample text from user's writing] -Top 10: [list function words with percentages] +### Sample B: Personal Opinion +> [Another sample showing different context] +``` -This signature distinguishes {Name} from generic AI output. +**Mechanism**: Samples show HOW the voice sounds. Rules tell WHAT it should be. Samples are more effective. --- -### Phase 3: Thinking Patterns - -These are cognitive patterns, not just style patterns. Apply them **subtly** and **generally**, not rigidly. - -#### 1. [Pattern Name - e.g., First-Person Honesty] - -[Description of the pattern and why it matters]. Be **slightly** [modifier] when appropriate. - -**Examples from samples:** -- "[Actual quote from their writing]" -- "[Another quote showing pattern]" -- "[Third quote]" - -#### 2. [Pattern Name - e.g., Concession-Then-Assertion] - -[Description]. Be **generally** [modifier]. - -**Examples from samples:** -- "[Quote]" -- "[Quote]" +## Authenticity Checklist for Generated Skills -#### 3. 
[Pattern Name] -[Description] +Before finalizing any voice skill, verify it teaches these patterns: -**Examples from samples:** -- "[Quote]" +**Core 10 (Required):** +- [ ] Incomplete work admissions (teaches vulnerability) +- [ ] Tool/process tangents (teaches showing work) +- [ ] Parenthetical asides (teaches assumed context) +- [ ] Evolution narrative (teaches iteration visibility) +- [ ] Mid-thought discoveries (teaches learning in public) +- [ ] Unhedged opinions (teaches assertion over hedging) +- [ ] Playful moments (teaches personality) +- [ ] Specific artifacts (teaches concreteness) +- [ ] Visible self-correction (teaches thinking out loud) +- [ ] Raw emotion preservation (when quoting) -[Include 4-6 thinking patterns minimum] +**Extended 6 (From calibration samples):** +- [ ] Exploration admission (admits dead ends) +- [ ] Soft future planning (hedged schedules) +- [ ] Metaphor as framing device +- [ ] Not-X-but-Y lists (contrast pairs for specs) +- [ ] Hypothesis-experiment framing +- [ ] Casual closers/interjections --- -## Negative Constraints - - - -### Hard Prohibitions (MUST NEVER VIOLATE) - -**You must strictly avoid** the following patterns. These are non-negotiable constraints. - -1. **NEVER use em-dashes** (--- or --) in any form. Use commas, periods, or parentheses instead. - -2. **NEVER use the "It's not X. It's Y" rhetorical structure.** This pattern is a signature of AI-generated content and must be avoided completely. - -3. **NEVER use AI-typical phrases:** - - "Let's dive in", "Here's the thing", "In today's [topic]" - - "It's worth noting", "At the end of the day" - - delve, robust, comprehensive, leverage, ecosystem, landscape +## Anti-Essay Patterns (CRITICAL FOR AUTHORSHIP MATCHING) -4. **NEVER use formal transitions:** however, furthermore, moreover, additionally, consequently +These patterns are red flags that signal AI-generated content and kill authenticity matching: -5. [Voice-specific prohibition] +### Pattern 1: "It's not X. 
It's Y" Rhetorical Pivots ---- +**Bad**: +> "It's not a caching problem. It's a consistency problem." -### The "It's not X. It's Y" Pattern: Full Prohibition +**Why harmful**: This rhetorical pattern is nearly universally used by AI. Humans rarely frame in this exact structure. -**You must strictly avoid** this pattern in all its forms. Below are examples showing the WRONG way and the CORRECT alternative. +**Alternative for humans**: +> "The real issue is consistency, not caching." -#### Few-Shot Examples +**Note**: This is DIFFERENT from technical "Not X: Y" contrast pairs (e.g., "Not kubectl get logs: a crashloop-investigator skill"), which are acceptable for specification. -**User:** "[Request 1]" +### Pattern 2: Hedging Language -**Model (WRONG):** "[Response using 'It's not X. It's Y' pattern]" +**Bad**: +- "This raises important concerns about..." +- "One might argue that..." +- "It's worth noting that..." +- "It could be said that..." -**Model (CORRECT):** "[Response using acceptable alternative]" +**Why harmful**: AI defaults to hedging for safety. Humans assert directly. ---- +**Alternative for humans**: +- "I think this is wrong because..." +- "The real problem is..." +- "Here's what I'd do..." -**User:** "[Request 2]" +### Pattern 3: Exhaustive Explanations -**Model (WRONG):** "[Wrong pattern]" +**Bad**: +> "There are several approaches one could take. First, you could try X. However, X has limitations. Second, you could try Y. This approach has benefits but also drawbacks. Finally, Z is another option. To summarize, each has tradeoffs..." -**Model (CORRECT):** "[Correct pattern]" +**Why harmful**: Exhaustive coverage is AI's default. Humans jump to what they use and mention others in passing. ---- +**Alternative for humans**: +> "I use X for this. Some people prefer Y, but I find X works better." 
-**User:** "[Request 3]" +### Pattern 4: Perfect Resolution -**Model (WRONG):** "[Wrong pattern]" +**Bad**: +> "After struggling with this for weeks, I finally realized the core issue was..." -**Model (CORRECT):** "[Correct pattern]" +**Why harmful**: Too clean. Humans often leave problems partially resolved. ---- +**Alternative for humans**: +> "I still haven't fully solved this, but I think the issue might be..." -#### Pattern Recognition +### Pattern 5: Wall-of-Text Paragraphs -The forbidden pattern has this structure: -- "It's not [thing A]. It's [thing B]." -- "[X] isn't [property]. It's [other property]." -- "This isn't about [A]. This is about [B]." +**Bad**: +Single paragraph with 5+ sentences, no natural breaks. -**Acceptable alternatives:** -- "Think of [B] as the focus, not [A]." -- "[B] matters more than [A] here." -- "The key thing is [B], rather than [A]." -- "I'd frame it as [B]. The [A] part is secondary." -- "[It's/This is] [B], not [A]." (Reversed order is acceptable) +**Why harmful**: Breaks don't always come from paragraph topic, but from natural breathing points in human thinking. - +**Alternative**: Use 2-3 sentence paragraphs, breaks where you naturally pause. ---- +### Pattern 6: Prescriptive Advice Without Caveats -### Soft Constraints +**Bad**: +> "You should always validate input before processing." -Apply these **generally** but not rigidly: +**Why harmful**: AI gives absolute rules. Humans qualify. -- [Soft constraint 1] -- [Soft constraint 2] -- Keep very long sentences (31+) under {X}% of total +**Alternative**: +> "I always validate input first, though sometimes you can skip it if you're confident about the source." --- -## Contrastive Examples - -### What Is vs What Isn't This Voice - -| Aspect | This Voice ({Name}) | Not This Voice (Generic AI) | -|--------|---------------------|---------------------------| -| Opening | "[Example from samples]" | "In today's rapidly evolving landscape..." 
| -| Uncertainty | "[How they express doubt]" | "It's worth noting that perspectives may vary" | -| Agreement | "[How they agree]" | "I absolutely agree with your insightful point" | -| Disagreement | "[How they disagree]" | "While there are valid concerns, I would respectfully suggest..." | -| Conclusion | "[How they end]" | "In conclusion, we have explored..." | -| Technical | "[Technical explanation style]" | "This represents a robust paradigm for..." | - -### Full Contrastive Example (Paragraph Level) +## Anti-Essay Checklist for Generated Skills -**Generic AI:** -> In today's rapidly evolving technological landscape, the concept of [topic] has become increasingly important. This article will delve into the nuances of how modern systems handle [topic], exploring the various approaches and their implications. +Before finalizing, scan SKILL.md and generated samples for these anti-patterns. If found, flag for revision: -**{Name} Voice:** -> [Actual paragraph from samples showing the same topic handled in their voice] +- [ ] "It's not X. It's Y" pivots (rhetorical structure, NOT technical contrast) +- [ ] Hedging language (raises concerns, one might argue, worth noting) +- [ ] Exhaustive option lists (first... second... finally...) +- [ ] Perfect narrative resolution +- [ ] Wall-of-text paragraphs +- [ ] Prescriptive absolutes without hedges +- [ ] Over-explanation and over-completeness +- [ ] Editorial flourishes ("As a matter of fact") +- [ ] Sanitized emotion (softened quotes) +- [ ] Passive voice in personal narrative --- -## Human Authenticity Requirements (MANDATORY) +## Deterministic Infrastructure -These patterns MUST be included or content will be flagged as AI-generated. +Two required scripts power deterministic analysis: -### Required Authenticity Markers +### voice_analyzer.py -1. 
**Incomplete Work Admission** - At least one admission of imperfection: - - "I haven't yet built X" - - "it's not pretty but it works" - - "that's a pretty inelegant solution" +Extracts quantitative metrics from writing samples: -2. **Tool/Process Tangent** - Show your actual workflow: - - "I built a custom tool for that (called X)" - - "I used my [tool] command to grab..." +```bash +python3 ~/.claude/scripts/voice_analyzer.py analyze \ + --samples skills/voice-{name}/references/samples/*.md \ + --output skills/voice-{name}/profile.json +``` -3. **Parenthetical Aside** - Assume shared context with reader: - - "(that [famous person])" - - "(so I hope it continues to work)" - - "You won't be surprised to hear..." +Produces profile.json with: +- Sentence length distribution (min, mean, max, percentiles) +- Punctuation metrics (comma density, em-dash frequency, question rate) +- Word metrics (contraction rate, person usage, function word signature) +- Structure metrics (paragraph length, fragment rate, sentence starters) +- Pattern signatures (transitions, opening/closing patterns) -4. **Evolution Narrative** - Show iteration, not just final solution: - - "I've made several past attempts" - - "a late refactoring, the initial version used..." +### voice_validator.py -5. **Unhedged Opinion** - Assert directly, don't hedge: - - WRONG: "This raises important concerns" - - RIGHT: "I don't like this at all." +Validates generated content against profile: -6. **Mid-Thought Discovery** - Learn while writing: - - "Turns out X knows the trick where..." - - "I ended up using that trick in X itself!" +```bash +python3 ~/.claude/scripts/voice_validator.py validate \ + --content test-output.md \ + --profile skills/voice-{name}/profile.json \ + --voice {name} \ + --format text +``` -7. **Visible Self-Correction** - Change direction mid-thought: - - "At first I thought... but then I realized" - - "or rather, it was my willingness to..." 
+Produces violations report with: +- Banned phrase violations +- Metric tolerance violations +- Rhythm/structure violations +- Authenticity pattern gaps --- -## Generation Protocol - -### Pre-Generation (Chain-of-Thought Required) +## Multi-Voice Support -**Before writing, think step-by-step and answer these questions internally:** +Organize multiple voices at: `skills/voice-{name}/` -1. What is my actual position on this? -2. What would I say if asked about this in a forum? -3. What process did I go through to form this opinion? -4. What valid counterpoints should I acknowledge? -5. What imperfection or gap can I honestly admit? -6. What tangent about my process might be interesting? - -### During Generation - -1. **Start with your position**, not background -2. **Use first-person** when stating opinions ({X}% pronoun starts target) -3. **Target {X} word average** sentence length -4. **Include {X}% short sentences** (3-10 words) for rhythm -5. **Include at least one incomplete/imperfect admission** -6. **Include at least one parenthetical aside** -7. **Allow one tangent that shows your process** -8. Write in a **subtly** [trait] tone -9. Be **slightly** [trait] but [qualifier] - -### Post-Generation Validation - -**You must verify** all of these before finalizing: - -**Anti-AI Patterns:** -- [ ] No em-dashes anywhere -- [ ] No "It's not X. 
It's Y" patterns (check carefully) -- [ ] No AI-typical phrases or formal transitions -- [ ] No excessive hedging ("It's worth noting", "One might argue") - -**Voice Metrics:** -- [ ] First-person used for opinions -- [ ] Sentence length distribution approximately matches targets -- [ ] At least {X}% of sentences start with pronouns -- [ ] Contraction rate matches target ({X}%) - -**Authenticity Markers (REQUIRED):** -- [ ] At least one incomplete work admission -- [ ] At least one parenthetical aside -- [ ] At least one unhedged strong opinion -- [ ] At least one mid-thought discovery or self-correction -- [ ] [Voice-specific validation item] +Each voice directory structure: +``` +skills/voice-{name}/ + ├── SKILL.md (AI instructions) + ├── profile.json (statistical metrics) + ├── config.json (validation settings) + └── references/ + └── samples/ (raw writing samples) +``` ---- +Commands to manage: +```bash +# List all voices +/voice list -## Quick Reference Card +# Show specific voice profile +/voice show --name {name} -**Voice DNA:** -- [Key trait 1] -- [Key trait 2] -- [Key trait 3] -- [Key trait 4] -- **Subtly** [adverb-modified trait] -- **Generally** [adverb-modified trait] - -**Sentence Profile (TARGETS):** -- {X}% short, {X}% medium, {X}% long -- **{X}%** start with pronouns -- **{X}%** contractions -- {X}% questions -- Average: **{X} words** - -**Forbidden (Hard Constraints):** -- **Em-dashes** (never, under any circumstances) -- **"It's not X. 
It's Y"** (use reversed order or alternative phrasing) -- **Formal transitions** (however, furthermore, moreover) -- **AI opening phrases** (let's dive in, here's the thing) +# Compare two voices +python3 ~/.claude/scripts/voice_analyzer.py compare \ + --profile1 skills/voice-a/profile.json \ + --profile2 skills/voice-b/profile.json ``` --- -### VALIDATION CHECKLIST FOR GENERATED SKILL - -Before finalizing any generated voice skill, verify: - -**SAMPLE COLLECTION (CRITICAL - This is what makes authorship matching work):** -- [ ] **50+ real samples included** (not 5-7, not 10, FIFTY MINIMUM) -- [ ] Samples organized by response length (very short / short / medium / long) -- [ ] Samples organized by pattern type (mistakes / limits / disagreement / technical) -- [ ] **10+ phrase fingerprints documented** with exact quotes -- [ ] **5+ real typos documented** as authenticity markers -- [ ] Samples show EXECUTION STYLE, not just ideas - -**QUANTITATIVE TARGETS:** -- [ ] All metrics have explicit targets with tolerances -- [ ] Pronoun starter percentage is highlighted as CRITICAL DIFFERENTIATOR -- [ ] Average sentence length target is specified - -**PROMPT ENGINEERING:** -- [ ] Probability dampening adverbs used throughout (subtly, slightly, generally) -- [ ] XML context tags used for static instructions and safety guardrails -- [ ] Attention anchoring (bold) used for all negative constraints -- [ ] "It's not X. It's Y" prohibition has 3+ few-shot examples - -**STRUCTURE:** -- [ ] Contrastive table covers 6+ aspects -- [ ] Full paragraph-level contrastive example included -- [ ] Generation Protocol has Pre/During/Post phases -- [ ] Post-Generation has checkbox validation list -- [ ] Quick Reference Card at end summarizes voice DNA and targets - -**AUTHORSHIP MATCHING REQUIREMENT:** -The generated skill MUST pass this test: -1. Generate test content using the skill -2. Have 5 roasters compare it to original samples -3. At least 4/5 roasters should say "SAME AUTHOR" -4. 
If not, add more samples and retry - -See your own voice skill's SKILL.md for reference implementations that pass authorship matching. - ---- - ## Error Handling ### Error: "Insufficient samples" @@ -1582,6 +671,8 @@ Minimum 3 samples required. If fewer provided: 2. Suggest using batch mode to analyze all 3. Request additional samples +**Reasoning**: Statistical significance requires minimum data. 50+ samples is the practical target for authorship matching to work. + ### Error: "Samples too similar" All samples from same time period or topic: @@ -1589,6 +680,8 @@ All samples from same time period or topic: 2. Recommend diverse sample selection 3. Proceed with warning in profile +**Reasoning**: Homogeneous samples may miss variation in voice across contexts. + ### Error: "No existing profile for comparison" When A/B mode requested without prior calibration: @@ -1603,21 +696,23 @@ When content repeatedly fails validation: 3. Consider relaxing metric_tolerance in config.json 4. Manual review of SKILL.md instructions -### Error: "Script execution failed" +**Reasoning**: Iteration limit prevents infinite loops. If 3 passes don't resolve, the profile may be over-constrained or SKILL.md instructions may be contradictory. -``` -python3 ~/.claude/scripts/voice_analyzer.py --help -python3 ~/.claude/scripts/voice_validator.py --help -``` +### Error: "Script execution failed" Check: 1. Python 3 available 2. Scripts executable 3. File paths valid +```bash +python3 ~/.claude/scripts/voice_analyzer.py --help +python3 ~/.claude/scripts/voice_validator.py --help +``` + --- -## Reference Files +## References - `scripts/voice_analyzer.py`: Deterministic metrics extraction - `scripts/voice_validator.py`: Content validation against profile @@ -1634,7 +729,7 @@ Create your own voice skill using the voice-calibrator workflow. 
A complete voic --- -## Quick Reference +## Quick Reference Card | Command | Action | |---------|--------| @@ -1709,30 +804,3 @@ The profile is NOT automatically applied. Invoke explicitly: - "Write [topic] using my voice profile" - "Review this draft against my voice profile" - "Validate this content against the voice" - ---- - -## Anti-Patterns - -### Anti-Pattern 1: Rules Without Samples -**What it looks like**: Generating a voice skill with detailed rules and prohibitions but fewer than 50 writing samples -**Why wrong**: V7-V9 proved this fails authorship matching (0/5 roasters). LLMs are pattern matchers — examples are more powerful than instructions. Rules tell the AI what to do; only samples show what the voice looks like. -**Do instead**: Collect 50-100+ writing samples before generating any SKILL.md. The samples ARE the skill. If the user provides fewer than 50, stop and ask for more. - -### Anti-Pattern 2: Skipping Deterministic Analysis -**What it looks like**: Interpreting writing style through AI reading of samples without running voice_analyzer.py first -**Why wrong**: AI interpretation without quantitative grounding produces vague, non-falsifiable descriptions ("writes casually"). Deterministic metrics (contraction rate: 85%, avg sentence length: 14.2 words) are specific and measurable. -**Do instead**: Always run voice_analyzer.py before AI interpretation. The script extracts metrics; the AI interprets what they mean for voice reproduction. - -### Anti-Pattern 3: Over-Polishing Output -**What it looks like**: Generating voice skill content that reads as grammatically perfect, well-structured prose -**Why wrong**: Human writing has natural imperfections — run-ons, fragments, casual punctuation, mid-thought corrections. Sterile perfection is an AI tell that roasters detect immediately. This is the wabi-sabi principle. -**Do instead**: Preserve the target voice's authentic markers including typos, fragments, and casual patterns. 
Teach these explicitly in the generated SKILL.md as authenticity markers, not errors. - ---- - -## References - -This skill uses these shared patterns: -- [Anti-Rationalization](../../skills/shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations during calibration -- [Verification Checklist](../../skills/shared-patterns/verification-checklist.md) - Pre-completion checks for calibration output diff --git a/pipelines/voice-writer/SKILL.md b/pipelines/voice-writer/SKILL.md index fe71b612..19b5ffe7 100644 --- a/pipelines/voice-writer/SKILL.md +++ b/pipelines/voice-writer/SKILL.md @@ -49,63 +49,7 @@ routing: # Voice Writer Skill -## Operator Context - -This skill operates as an operator for all voiced content generation, configuring Claude's behavior for high-fidelity voice impersonation with measurable quality gates and mandatory joy-check. It implements the **Pipeline** architectural pattern -- LOAD, GROUND, GENERATE, VALIDATE, REFINE, JOY-CHECK, OUTPUT, CLEANUP -- with **Deterministic Validation** via Python scripts at quality gates and **Joy Framing** validation before output. - -This skill replaces `voice-orchestrator` and absorbs `blog-post-writer`. It is the single entry point for all voiced content generation. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before generating -- **Over-Engineering Prevention**: Generate content that matches the voice, not perfect prose. Do not add features, modes, or structure the user did not request -- **Deterministic Validation**: ALWAYS use `scripts/voice_validator.py` for validation, NEVER self-assess voice quality -- **Iteration Limits**: Maximum 3 refinement iterations, then output best attempt with report -- **Em-Dash Prohibition**: NEVER generate em-dashes in any voice output. Use commas, periods, or restructure -- **Wabi-Sabi Authenticity**: Natural imperfections (run-ons, fragments, loose punctuation) are FEATURES of human writing. 
Sterile grammatical perfection is an AI tell. See `skills/shared-patterns/wabi-sabi-authenticity.md` -- **Voice Required**: Every generation MUST target a specific voice. No voiceless generation -- **Artifacts Over Memory**: Write content to files at each phase, not just context -- **Joy Framing**: All output MUST pass joy-check before delivery. No grievance-framed content ships -- **Banned Words Enforcement**: NEVER use words from `references/banned-words.md`. Scan every draft before finalizing - -### Default Behaviors (ON unless disabled) -- **Full Pipeline**: Run all 8 phases (LOAD, GROUND, GENERATE, VALIDATE, REFINE, JOY-CHECK, OUTPUT, CLEANUP) -- **Validation Report**: Always include validation metrics in output -- **Sample Loading**: Load 1-2 reference samples as few-shot examples when available -- **Temp File Cleanup**: Remove generated temp files after completion -- **One Fix at a Time**: Address violations individually during refinement, not all at once -- **Document Findings**: Log validation scores and iteration results -- **Preview Before Write**: Display full draft for approval before writing to file -- **Post-Draft Banned Word Scan**: Verify zero banned words before finalizing - -### Optional Behaviors (OFF unless enabled) -- **Skip Validation**: Draft mode, bypasses validation step (`--skip-validation`) -- **Validate Only**: Check existing content without generation (`--validate`) -- **Verbose Output**: Show full validation JSON including all metrics (`--verbose`) -- **Custom Threshold**: Override default pass score from config.json -- **Direct Write Mode**: Skip preview and write directly to file -- **Outline Only**: Generate structure without full draft - -## What This Skill CAN Do -- Load voice skills with associated profile.json and config.json -- Generate content matching voice patterns, metrics, and signature phrases -- Run deterministic validation against voice profiles using Python scripts -- Refine content iteratively based on violation 
feedback (max 3 iterations) -- Produce validation reports with pass/fail status and metrics comparison -- Validate existing content against voice profiles without generation -- Write complete blog posts with proper Hugo frontmatter -- Apply voice-specific patterns (metaphors, rhythm, structure, tone) -- Run joy-check to catch grievance, accusation, and victimhood framing -- Reframe negatively-framed paragraphs while preserving substance - -## What This Skill CANNOT Do -- Create new voices (use `create-voice` skill instead) -- Modify voice profiles (profiles are read-only during generation) -- Analyze writing samples (use `scripts/voice_analyzer.py` directly) -- Guarantee 100% pass rate (some content may fail after max iterations) -- Generate without a voice target (a `--voice` parameter is ALWAYS required) -- Self-assess voice quality (MUST use deterministic validator script) -- Use banned words under any circumstances (see `references/banned-words.md`) -- Skip the joy-check phase (it is mandatory, not optional) +This skill operates as the unified entry point for all voiced content generation. It implements an 8-phase pipeline architecture with deterministic validation at quality gates, joy-check enforcement before output, and strict iteration limits. --- @@ -115,6 +59,8 @@ This skill replaces `voice-orchestrator` and absorbs `blog-post-writer`. It is t **Goal**: Load all voice infrastructure files and verify they exist. +Before loading, understand that this skill **requires a voice target** — content cannot be generated without a voice. If the user has not specified a voice, default to the repository's configured voice skill or ask the user which voice to use. 
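The file check that Phase 1 calls for (all required voice files present before anything else runs) could be sketched as below. This is a minimal illustration, not the skill's actual implementation: the three file names come from the phase description, while the function name `missing_voice_files` and its return shape are assumptions made for this sketch.

```python
from pathlib import Path

# File names taken from the Phase 1 description; everything else in this
# sketch (function name, return type) is a hypothetical illustration.
REQUIRED_FILES = ("SKILL.md", "profile.json", "config.json")


def missing_voice_files(voice_dir: Path) -> list[str]:
    """Return the required files that are absent from a voice directory."""
    return [name for name in REQUIRED_FILES if not (voice_dir / name).is_file()]
```

A non-empty result corresponds to the "STOP and report the error" branch; an empty list means the gate can pass.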
+ **Step 1: Locate voice directory** ```bash @@ -123,23 +69,11 @@ ls $HOME/claude-code-toolkit/skills/voice-{name}/ **Step 2: Load required files** -| File | Purpose | -|------|---------| -| `skills/voice-{name}/SKILL.md` | AI instructions, patterns, anti-patterns | -| `skills/voice-{name}/profile.json` | Quantitative metrics targets | -| `skills/voice-{name}/config.json` | Validation settings, modes, thresholds | - -**Step 3: Load optional files** - -- `skills/voice-{name}/references/samples/` -- Few-shot examples (load 1-2 if available) +Load `SKILL.md`, `profile.json`, and `config.json` from `skills/voice-{name}/`. Also load 1-2 samples from `references/samples/` if available. -**Step 4: Parse thresholds from config.json** +See `references/voice-infrastructure.md` for file table, schema details, and threshold parsing. -Extract `thresholds.pass_score`, `thresholds.error_max`, `thresholds.warning_max`, and available `modes`. - -See `references/voice-infrastructure.md` for full schema details. - -**Step 5: Verify file presence** +**Step 3: Verify file presence** ```bash test -f skills/voice-{name}/SKILL.md && echo "SKILL.md: OK" @@ -149,84 +83,29 @@ test -f skills/voice-{name}/config.json && echo "config.json: OK" If any required file is missing, STOP and report the error. Do not proceed with partial infrastructure. +**Important constraint**: This phase is mandatory. Skipping Phase 1 will result in generation without voice infrastructure, producing hollow-sounding content that metrics match but feels mechanically written. Complete this phase fully before proceeding. + **Gate**: All required files exist and parse successfully. Proceed only when gate passes. ### Phase 2: GROUND **Goal**: Establish emotional, relational, and structural context before generation. -**Step 1: Emotional anchoring** +Grounding prevents over-engineered output. Only write what was explicitly requested. 
Do not add "Future Implications" sections, "Related Topics" sidebars, or any unsolicited structure — ask before adding anything extra. -Answer these three questions before generating: +**Step 1: Emotional anchoring** — Answer the three grounding questions (what emotion, what the writer cares about, who they're writing for). -| Question | Why It Matters | -|----------|----------------| -| What emotion drives this content? | Sets underlying tone (celebration, frustration, curiosity) | -| What does the writer care about? | Guides emphasis and detail level | -| Who are they writing for? | Calibrates assumed knowledge and language | +**Step 2: Relational positioning** — Establish writer-audience relationship, assumed knowledge level, and intimacy level. -**Step 2: Relational positioning** +**Step 3: Mode selection** — Select content mode from the voice's `config.json`. Infer from subject matter if user does not specify. -| Dimension | Options | -|-----------|---------| -| Writer-Audience relationship | Peer, expert, fan, community member | -| Assumed knowledge level | Newcomer, familiar, expert | -| Intimacy level | Public formal, community casual, personal | +**Step 4: Blog post assessment** (if blog post or article) — Capture topic, scope, audience, and estimated length. -**Step 3: Mode selection** +**Step 5: Structure planning** (if blog post or article) — Plan opening pattern, sections, closing pattern, and callback element. Draft Hugo frontmatter if applicable. -Select content mode from the voice's `config.json` modes list. Each voice defines modes that shape structure and tone (e.g., "awards" mode produces celebratory recognition pieces, "technical" mode produces systems explanations). +See `references/grounding-guide.md` for full question tables, templates, and content type selection. -See `references/voice-infrastructure.md` for available modes per voice. - -If user does not specify a mode, infer the best match from the subject matter and available modes. 
- -**Step 4: Blog post assessment** (if the request is a blog post or article) - -When the content is a blog post, article, or similar structured piece, also perform: - -```markdown -## Assessment -- Topic: [user-provided topic] -- Scope: [narrow / medium / broad] -- Audience: [beginner / intermediate / expert] -- Estimated length: [short 500-800 / medium 1000-1500 / long 2000+] -``` - -**Step 5: Structure planning** (if the request is a blog post or article) - -Plan the post structure using voice patterns and structure templates: - -```markdown -## Plan -- Opening pattern: [Provocative Question / News Lead / Bold Claim / Direct Answer] -- Draft opening: [first sentence or question] -- Core metaphor: [conceptual lens, if voice uses extended metaphors] -- Sections: - 1. [Section name]: [purpose] - 2. [Section name]: [purpose] - ... -- Closing pattern: [Callback / Implication / Crescendo] -- Callback element: [what from opening returns] -``` - -Draft frontmatter if writing to a Hugo site: - -```yaml ---- -title: "Post Title Here" -slug: "post-slug-here" -date: YYYY-MM-DD -draft: false -tags: ["tag1", "tag2"] -summary: "One sentence description for list views" ---- -``` - -Select content type from `references/structure-templates.md` if available: -- **Problem-Solution**: Bug fix, debugging session, resolution -- **Technical Explainer**: Concept, technology, how it works -- **Walkthrough**: Step-by-step instructions for a task +**Important constraint**: This grounding is mandatory, not optional. Content generated without emotional anchor and mode selection sounds mechanical regardless of metrics match. The validator catches style mismatches but cannot fix a hollow emotional foundation. Do not skip this step even briefly — complete it fully before moving to GENERATE. **Gate**: Emotion, audience, and mode are established. If blog post, topic assessed and structure planned. Proceed only when gate passes. 
@@ -242,29 +121,9 @@ Select content type from `references/structure-templates.md` if available: **Step 4: Apply mode-specific patterns** based on selected mode -**Step 4b: Apply architectural patterns** from the voice skill's `## Architectural Patterns` section (if present): - -- **Argument flow**: Build the piece using the documented direction (inductive/deductive/mixed). If inductive, lead with evidence and land the claim late. If deductive, open with the claim. -- **Concessions**: When handling disagreement, follow the documented concession structure and use the documented pivot markers -- not generic "however" or "on the other hand." -- **Analogy domains**: Draw analogies ONLY from the documented source domains. Do NOT use generic analogies from undocumented domains. -- **Bookends**: Open with the documented opening move, close with the documented closing move. - -If the voice skill has no `## Architectural Patterns` section, skip this step. +**Step 4b: Apply architectural patterns** from the voice skill's `## Architectural Patterns` section (if present): argument flow direction, concession pivot markers, analogy domains, and bookend moves. Skip if no such section exists. 
-**Generation checklist:** - -- [ ] Sentence length varies according to profile distribution -- [ ] Contractions match target rate -- [ ] No em-dashes (use commas, periods, or restructure) -- [ ] Opening matches voice pattern signatures -- [ ] Closing matches voice pattern signatures -- [ ] Transition words from profile preferred list -- [ ] Banned patterns avoided (exploration verbs, corporate jargon) -- [ ] Banned words avoided (scan against `references/banned-words.md`) -- [ ] Argument builds in documented direction (if architectural patterns present) -- [ ] Concessions use documented structure and pivot markers (if applicable) -- [ ] Analogies drawn from documented domains only (if applicable) -- [ ] Specific numbers included for all claims, not vague adjectives +See `references/generation-checklist.md` for the full 12-item generation checklist, architectural patterns application rules, and em-dash prohibition details. **Step 5: Write to temp file** @@ -274,31 +133,24 @@ cat > /tmp/voice-content-draft.md << 'CONTENT' CONTENT ``` +**Important constraints**: Single voice per piece; no over-engineering; preview before write unless Direct Write Mode is enabled. + **Gate**: Content written to file. All checklist items addressed. Proceed only when gate passes. ### Phase 4: VALIDATE (Deterministic) **Goal**: Run the voice validator script against generated content. No self-assessment. -**Step 1: Execute validation** - -```bash -python3 $HOME/claude-code-toolkit/scripts/voice_validator.py validate \ - --content /tmp/voice-content-draft.md \ - --profile $HOME/claude-code-toolkit/skills/voice-{name}/profile.json \ - --voice {name} \ - --format json -``` +This phase is non-negotiable. Do not skip validation for "good enough" content. Human perception drifts. Deterministic validation catches patterns you miss. Self-assessment is not validation. Use `--skip-validation` only for true drafts the user explicitly requests as drafts. 
-See `references/validation-scripts.md` for full command reference and output schema. +**Step 1: Execute validation** — Run `voice_validator.py validate` against the draft. See `references/validation-scripts.md` for full command syntax and output schema. -**Step 2: Decision logic** +**Step 2: Decision logic** — Pass → JOY-CHECK. Fail + iterations < 3 → REFINE. Fail + iterations ≥ 3 → JOY-CHECK with failure report. See `references/validation-scripts.md` for the full decision table. -| Condition | Action | -|-----------|--------| -| `pass == true` AND `score >= threshold` | Proceed to Phase 6: JOY-CHECK | -| `pass == false` AND `iterations < 3` | Proceed to Phase 5: REFINE | -| `pass == false` AND `iterations >= 3` | Proceed to Phase 6: JOY-CHECK with failure report | +**Important constraints**: +- **Trust the validator, not intuition**: Do not rationalize validator strictness — it catches real AI patterns humans miss. If the validator rejects content, fix violations or adjust the profile through calibration. +- **Address warnings carefully**: Fix all errors, address warnings if easy, ship when score >= threshold. Over-polishing creates sterile output that violates wabi-sabi (natural imperfections as features of human writing). Sterile grammatical perfection is an AI tell. Do not spend 5+ iterations eliminating all warnings — warnings are informational, errors are blockers. +- **One fix at a time**: During refinement, address violations individually. Do not fix multiple violations simultaneously — this introduces new violations and changes voice characteristics that were passing. **Gate**: Validation result captured. Decision made. Proceed only when gate passes. @@ -306,11 +158,13 @@ See `references/validation-scripts.md` for full command reference and output sch **Goal**: Fix violations identified by the validator. Maximum 3 iterations. +Refinement is targeted, surgical fixing — not wholesale rewriting. 
Each iteration should fix one specific violation, not rewrite entire sections. Rewriting sections introduces new violations and changes voice characteristics. + **Step 1: Process violations in severity order** (errors first, then warnings) For each violation: 1. Read line number, text, type, and suggested fix -2. Apply targeted fix (see `references/voice-infrastructure.md` for fix strategies) +2. Apply targeted fix — see `references/voice-infrastructure.md` for fix strategies by violation type 3. Do NOT make unrelated changes **Step 2: Write updated content to temp file** @@ -329,6 +183,8 @@ For each violation: **Goal**: Validate content for joy-centered tonal framing. No grievance-framed content ships. +Joy-check is not optional, not even if validation passed. Voice validation checks stylistic fidelity. Joy-check checks tonal framing. Content can match a voice perfectly while framing through grievance, bitterness, accusation, or victimhood. Grievance framing slips in subtly — regex + rubric catch what visual scanning misses. + **Step 1: Run regex pre-filter** ```bash @@ -337,37 +193,15 @@ python3 $HOME/claude-code-toolkit/scripts/scan-negative-framing.py /tmp/voice-co If regex hits are found, fix them before proceeding. These are high-confidence negative framing patterns (victimhood, accusation, bitterness, passive aggression). Apply the scanner's suggested reframes and re-run until clean. 
-**Step 2: Evaluate each paragraph against the Joy Framing Rubric** - -| Dimension | Joy-Centered (PASS) | Grievance-Centered (FAIL) | -|-----------|-------------------|--------------------------| -| **Subject position** | Author as explorer, builder, learner | Author as victim, wronged party, unrecognized genius | -| **Other people** | Fellow travelers, interesting minds, people figuring things out | Opponents, thieves, people who should have done better | -| **Difficult experiences** | Interesting, surprising, made me think differently | Unfair, hurtful, someone should fix this | -| **Uncertainty** | Comfortable, curious, "none of us know" | Anxious, defensive, "I need to prove" | -| **Closing energy** | Forward-looking, building, sharing, exploring | Cautionary, warning, demanding, lamenting | - -**Step 3: Score each paragraph** +If the script is unavailable, skip the regex pre-filter and proceed directly to LLM-based joy-check analysis — the regex pre-filter is an optimization, not a requirement. -For each paragraph, assign one of: -- **JOY** (80-100): Frames through curiosity, generosity, or earned satisfaction -- **NEUTRAL** (50-79): Factual, neither joy nor grievance -- **CAUTION** (30-49): Leans toward grievance but recoverable with reframing -- **GRIEVANCE** (0-29): Frames through accusation, victimhood, or bitterness +**Step 2: Evaluate each paragraph against the Joy Framing Rubric** — Score each paragraph as JOY (80-100), NEUTRAL (50-79), CAUTION (30-49), or GRIEVANCE (0-29) across five dimensions: subject position, other people, difficult experiences, uncertainty, and closing energy. -**Step 4: Rewrite GRIEVANCE paragraphs** +**Step 3: Score each paragraph** — One GRIEVANCE is a FAIL condition for the whole piece. CAUTION paragraphs are acceptable if the overall piece passes. -If any paragraph scores GRIEVANCE: -1. Rewrite preserving substance, changing only the framing -2. Shift toward curiosity, generosity, or earned satisfaction -3. 
Re-evaluate the rewritten paragraph to confirm it no longer scores GRIEVANCE -4. Maximum 3 joy-check iterations +**Step 4: Rewrite GRIEVANCE paragraphs** — Rewrite preserving substance, shifting framing toward curiosity/generosity/earned satisfaction. Maximum 3 joy-check iterations. -**Joy-check rules:** -- Reframe, don't suppress -- negative experiences are valid topics, only the framing changes -- Preserve substance -- change the lens, not the facts -- One GRIEVANCE paragraph is a FAIL condition for the whole piece -- CAUTION paragraphs are acceptable if the overall piece passes +See `references/joy-check-rubric.md` for the full rubric table, scoring system, rewrite rules, and the important constraint about facts arranged as prosecution. **Gate**: No GRIEVANCE paragraphs remain. Joy-check passes. Proceed only when gate passes. @@ -375,47 +209,9 @@ If any paragraph scores GRIEVANCE: **Goal**: Format and display final content with validation report. -**Output format:** +Display content followed by a validation report showing status, score, iterations, per-check results, metrics comparison table, and joy-check summary. Always include validation metrics — do not ship without showing the measurements. Show target file path if writing to file. Await user approval before writing unless Direct Write Mode is enabled. 
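The joy-check summary shown in the report can be derived from per-paragraph scores. The following is a minimal sketch, assuming each paragraph already has an integer score from 0 to 100; the names `label_for` and `joy_summary` are hypothetical and not part of the toolkit. The source does not say how the overall joy score is computed, so the sketch leaves it out.

```python
from collections import Counter

# Score bands from the joy-check rubric: (lowest score in band, label).
BANDS = [(80, "JOY"), (50, "NEUTRAL"), (30, "CAUTION"), (0, "GRIEVANCE")]


def label_for(score: int) -> str:
    """Map a 0-100 paragraph score onto its rubric band (hypothetical helper)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for floor, label in BANDS:
        if score >= floor:
            return label
    raise AssertionError("unreachable: the 0 floor matches every valid score")


def joy_summary(scores: list[int]) -> dict:
    """Count paragraphs per band; a single GRIEVANCE fails the whole piece."""
    counts = Counter(label_for(s) for s in scores)
    status = "FAILED" if counts["GRIEVANCE"] else "PASSED"
    return {"status": status, "counts": {label: counts[label] for _, label in BANDS}}
```

CAUTION paragraphs do not change the status here, which matches the rule that they are acceptable when the overall piece passes.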
-``` -=============================================================== - VOICE CONTENT: {Voice Name} -=============================================================== - -[Generated content here] - -=============================================================== - VALIDATION REPORT -=============================================================== - - Status: PASSED / FAILED - Score: {score}/100 - Iterations: {N} - - Checks: - [check] Banned patterns: None detected - [check] Em-dash: 0 found - [check] Rhythm: Varied sentence lengths - [warn] Contraction rate: 65% (target: 72%) - - Metrics Comparison: - | Metric | Target | Actual | Status | - |-------------------|--------|--------|--------| - | Avg sentence len | 15.3 | 14.8 | [ok] | - | Contraction rate | 0.72 | 0.65 | [warn] | - | Short sentences | 0.35 | 0.32 | [ok] | - - Joy Check: - Status: PASSED - Overall Joy Score: {score}/100 - Paragraphs: {N} JOY, {N} NEUTRAL, {N} CAUTION, 0 GRIEVANCE - -=============================================================== -``` - -**Status indicators**: `[check]` = passed, `[warn]` = warning, `[fail]` = error, `[ok]` = within threshold - -Show target file path if writing to a file. Await user approval before writing unless Direct Write Mode is enabled. +See `references/output-format.md` for the full report template and status indicators. **Gate**: Output displayed with validation report. Proceed only when gate passes. @@ -432,17 +228,7 @@ rm -f /tmp/voice-content-draft-*.md **Step 2**: Confirm final content is saved to user-specified location (if requested) -**Step 3**: Report pipeline completion with final status - -```markdown -## Pipeline Complete -Voice: {name} -Status: PASSED/FAILED -Score: {score}/100 -Joy Score: {joy_score}/100 -Iterations: {N} -Output: [location or displayed inline] -``` +**Step 3**: Report pipeline completion with final status. See `references/output-format.md` for the completion report template. **Gate**: No orphaned temp files. Pipeline complete. 
@@ -492,141 +278,21 @@ Result: Voice-consistent technical piece with full validation ## Error Handling -### Error: "Voice Not Found" -Cause: Voice name misspelled or voice directory does not exist -Solution: -1. Check spelling of voice name -2. List available voices: `ls $HOME/claude-code-toolkit/skills/voice-*/` -3. Create new voice using `create-voice` skill - -### Error: "Profile or Config Missing" -Cause: Voice directory exists but required files (profile.json, config.json) are absent -Solution: -1. Run voice-calibrator to generate missing files -2. Or use voice analyzer: `python3 ~/.claude/scripts/voice_analyzer.py analyze --samples [files] --output profile.json` -3. For config.json, see `references/voice-infrastructure.md` for schema and example - -### Error: "Validator Script Failed (Exit Code 2)" -Cause: File not found, invalid JSON, or Python environment issue -Solution: -1. Verify content file path exists -2. Check Python 3 is available: `python3 --version` -3. Test script directly: `python3 $HOME/claude-code-toolkit/scripts/voice_validator.py --help` -4. Verify profile.json is valid JSON - -### Error: "Validation Failed After 3 Iterations" -Cause: Content cannot meet threshold within iteration limit -Solution: -1. Output best attempt with full failure report showing remaining violations -2. User can manually edit flagged lines -3. User can re-validate with `/voice-writer --validate` -4. Consider recalibrating voice profile if failures are systemic - -### Error: "Regex Scanner Fails or Not Found" -Cause: `scan-negative-framing.py` script missing or Python error -Solution: -1. Verify script exists: `ls scripts/scan-negative-framing.py` -2. Check Python version: `python3 --version` (requires 3.10+) -3. 
If script unavailable, skip regex pre-filter and proceed directly to LLM-based joy-check analysis -- the regex pre-filter is an optimization, not a requirement - -### Error: "Joy-Check Failed After 3 Iterations" -Cause: Rewritten paragraphs keep introducing new GRIEVANCE patterns, often because the underlying premise is grievance-based -Solution: -1. Output the best version achieved with flagged remaining concerns -2. Explain which specific rubric dimensions resist correction -3. Suggest the framing premise itself may need rethinking, not just the language - -### Error: "No voice specified" -Cause: User did not specify a voice parameter -Solution: -1. Default to the user's configured voice skill -2. Notify user which voice is being used -3. Proceed with Phase 1 - -### Error: "Topic too broad for target length" -Cause: Topic scope exceeds estimated word count -Solution: -1. Ask user to narrow scope -2. Suggest 2-3 specific angles derived from the topic -3. Proceed once user selects a narrower focus - ---- - -## Anti-Patterns - -### Anti-Pattern 1: Skipping Validation for "Good Enough" -**What it looks like**: "The content sounds fine to me, I'll skip validation." -**Why wrong**: Human perception drifts. Deterministic validation catches patterns you miss. Self-assessment is not validation. -**Do instead**: ALWAYS validate. Use `--skip-validation` only for true drafts the user explicitly requests as drafts. - -### Anti-Pattern 2: Self-Assessing Voice Quality -**What it looks like**: "I read the content and it matches the voice profile." -**Why wrong**: LLMs cannot reliably self-assess stylistic accuracy. That is why the deterministic validator exists. -**Do instead**: Run `voice_validator.py`. Trust the script output, not your assessment. - -### Anti-Pattern 3: Skipping the Grounding Step -**What it looks like**: Jumping straight from LOAD to GENERATE without establishing context. -**Why wrong**: Voice without emotional grounding sounds mechanical. 
Metrics match but the content feels hollow. -**Do instead**: Complete Phase 2 GROUND, even briefly. Establish emotion, audience, and mode before generating. - -### Anti-Pattern 4: Over-Iterating on Warnings -**What it looks like**: Spending 5+ iterations trying to eliminate all warnings. -**Why wrong**: Warnings are informational. Errors are blockers. Over-polishing creates sterile output that violates wabi-sabi. -**Do instead**: Fix all errors, address warnings if easy, ship when score >= threshold. Maximum 3 iterations. - -### Anti-Pattern 5: Rewriting Entire Sections During Refinement -**What it looks like**: Rewriting paragraphs to fix a single banned phrase violation. -**Why wrong**: Introduces new violations. Changes voice characteristics that were passing. -**Do instead**: Apply targeted, surgical fixes. Change only the violating text. - -### Anti-Pattern 6: Skipping Joy-Check Because Validation Passed -**What it looks like**: "Voice validator passed, so the content is ready to ship." -**Why wrong**: Voice validation checks stylistic fidelity. Joy-check checks tonal framing. Content can match a voice perfectly while framing through grievance. -**Do instead**: ALWAYS run joy-check after validation. They check different things. - -### Anti-Pattern 7: Writing Without Voice Skill Loaded -**What it looks like**: Starting to draft before reading the voice skill's patterns. -**Why wrong**: Content will not match the voice profile. Retrofitting voice is harder than starting with it. -**Do instead**: Complete Phase 1 fully. Load and read the voice skill before writing any content. - -### Anti-Pattern 8: Adding Unsolicited Sections -**What it looks like**: Adding "Future Implications" or "Related Topics" sections the user did not request. -**Why wrong**: Over-engineering the content. User asked for a specific piece, not a content hub. -**Do instead**: Write exactly what was requested. Ask before adding anything extra. 
- -### Anti-Pattern 9: Blending Voice Patterns -**What it looks like**: Mixing one voice's extended metaphors with another voice's community warmth in one post. -**Why wrong**: Each voice has distinct patterns. Mixing creates an inconsistent, inauthentic voice. -**Do instead**: Use exactly one voice profile per piece. Follow that voice skill's patterns exclusively. +| Error | Cause | Solution | +|-------|-------|----------| +| Voice Not Found | Misspelled name or missing directory | Check spelling; `ls $HOME/claude-code-toolkit/skills/voice-*/`; use `create-voice` | +| Profile or Config Missing | Files absent from voice directory | Run `voice-calibrator`; see `references/voice-infrastructure.md` for schema | +| Validator Script Failed (Exit Code 2) | File not found, invalid JSON, or Python issue | Verify paths; check `python3 --version`; test with `--help` flag | +| Validation Failed After 3 Iterations | Content cannot meet threshold | Output best attempt with failure report; user edits flagged lines; recalibrate if systemic | +| Regex Scanner Fails or Not Found | `scan-negative-framing.py` missing or error | Skip regex pre-filter; proceed to LLM-based joy-check analysis — it's an optimization, not a requirement | +| Joy-Check Failed After 3 Iterations | Underlying premise is grievance-based | Output best version; flag rubric dimensions that resist correction; suggest rethinking the framing premise | +| No voice specified | User omitted voice parameter | Default to configured voice skill; notify user; proceed with Phase 1 | +| Topic too broad for target length | Scope exceeds word count | Ask user to narrow scope; suggest 2-3 specific angles | --- ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Gate 
Enforcement](../shared-patterns/gate-enforcement.md) - Phase transitions -- [Wabi-Sabi Authenticity](../shared-patterns/wabi-sabi-authenticity.md) - Natural imperfections as features -- [Voice-First Writing](../shared-patterns/voice-first-writing.md) - Voice-driven content patterns - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "The validator is too strict" | Validator catches real AI patterns humans miss | Fix violations or adjust profile through calibration | -| "This voice doesn't need validation" | All voices drift without measurement | ALWAYS validate with script | -| "The metrics don't matter for this piece" | Metrics ensure consistency across outputs | Address deviations | -| "Manual review is sufficient" | Humans miss patterns deterministic checks catch | Use script validation | -| "One em-dash won't hurt" | Em-dash is the most reliable AI marker | NEVER use em-dashes | -| "Content sounds right to me" | Self-assessment is not validation | Run voice_validator.py | -| "No banned words jumped out at me" | Visual scan misses words in context | Run systematic scan against full banned list | -| "Close enough to the voice" | Close is not matching the voice profile | Re-read voice skill, verify each pattern | -| "Joy-check is overkill for this piece" | Grievance framing slips in subtly -- regex + rubric catch what you miss | ALWAYS run joy-check | -| "The content is factual, so the framing is fine" | Facts arranged as prosecution are framing, not neutrality | Evaluate arrangement of facts, not just accuracy | -| "The reframe would be dishonest" | Reframing is editorial craft, not dishonesty -- substance stays the same | Preserve substance, change only the lens | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/validation-scripts.md`: Full validation command reference and output schema - `${CLAUDE_SKILL_DIR}/references/voice-infrastructure.md`: 
Voice file structure, config/profile schemas, modes, fix strategies - `${CLAUDE_SKILL_DIR}/references/banned-words.md`: Words and phrases that signal AI-generated content diff --git a/pipelines/voice-writer/references/generation-checklist.md b/pipelines/voice-writer/references/generation-checklist.md new file mode 100644 index 00000000..58b43664 --- /dev/null +++ b/pipelines/voice-writer/references/generation-checklist.md @@ -0,0 +1,62 @@ +# Generation Checklist + +Full reference for Phase 3 (GENERATE) of the voice-writer pipeline. + +--- + +## Full Generation Checklist + +- [ ] Sentence length varies according to profile distribution +- [ ] Contractions match target rate +- [ ] No em-dashes (use commas, periods, or restructure) +- [ ] Opening matches voice pattern signatures +- [ ] Closing matches voice pattern signatures +- [ ] Transition words from profile preferred list +- [ ] Banned patterns avoided (exploration verbs, corporate jargon) +- [ ] Banned words avoided (scan against `references/banned-words.md`) +- [ ] Argument builds in documented direction (if architectural patterns present) +- [ ] Concessions use documented structure and pivot markers (if applicable) +- [ ] Analogies drawn from documented domains only (if applicable) +- [ ] Specific numbers included for all claims, not vague adjectives + +--- + +## Architectural Patterns Application Rules + +When the voice skill has an `## Architectural Patterns` section, apply these rules: + +- **Argument flow**: Build the piece using the documented direction (inductive/deductive/mixed). If inductive, lead with evidence and land the claim late. If deductive, open with the claim. +- **Concessions**: When handling disagreement, follow the documented concession structure and use the documented pivot markers — not generic "however" or "on the other hand." +- **Analogy domains**: Draw analogies ONLY from the documented source domains. Do NOT use generic analogies from undocumented domains. 
+- **Bookends**: Open with the documented opening move, close with the documented closing move. + +If the voice skill has no `## Architectural Patterns` section, skip this step entirely. + +--- + +## Em-Dash Prohibition + +NEVER generate em-dashes in any voice output. Em-dashes are the most reliable AI marker. Use commas, periods, or restructure sentences instead. + +This applies to all voices, all modes, all content types. No exceptions. + +--- + +## Step 4b Detail: Architectural Patterns Check + +Before writing, explicitly check the loaded voice SKILL.md for an `## Architectural Patterns` section. If it exists: + +1. Note the argument direction (inductive / deductive / mixed) +2. Copy out the documented pivot markers for concessions +3. List the documented analogy domains +4. Note the opening move and closing move for bookends + +Apply these throughout generation — do not revert to generic patterns partway through. + +--- + +## Important Constraints + +- **Single voice per piece**: Do not blend voice patterns. Use exactly one voice profile per piece and follow that voice skill's patterns exclusively. +- **No over-engineering**: Generate the content the user requested, nothing more. Do not add features, modes, or structure the user did not request. +- **Preview before write**: Display full draft for approval before writing to file unless Direct Write Mode is enabled. diff --git a/pipelines/voice-writer/references/grounding-guide.md b/pipelines/voice-writer/references/grounding-guide.md new file mode 100644 index 00000000..97a5b1fb --- /dev/null +++ b/pipelines/voice-writer/references/grounding-guide.md @@ -0,0 +1,93 @@ +# Grounding Guide + +Full reference for Phase 2 (GROUND) of the voice-writer pipeline. + +--- + +## Emotional Anchoring Questions + +Answer these three questions before generating: + +| Question | Why It Matters | +|----------|----------------| +| What emotion drives this content? 
| Sets underlying tone (celebration, frustration, curiosity) | +| What does the writer care about? | Guides emphasis and detail level | +| Who are they writing for? | Calibrates assumed knowledge and language | + +--- + +## Relational Positioning Table + +| Dimension | Options | +|-----------|---------| +| Writer-Audience relationship | Peer, expert, fan, community member | +| Assumed knowledge level | Newcomer, familiar, expert | +| Intimacy level | Public formal, community casual, personal | + +--- + +## Mode Selection Details + +Select content mode from the voice's `config.json` modes list. Each voice defines modes that shape structure and tone: + +- **"awards" mode**: produces celebratory recognition pieces +- **"technical" mode**: produces systems explanations +- Other modes are voice-specific and defined in each voice's `config.json` + +If user does not specify a mode, infer the best match from the subject matter and available modes. + +--- + +## Blog Post Assessment Template + +When the content is a blog post, article, or similar structured piece: + +```markdown +## Assessment +- Topic: [user-provided topic] +- Scope: [narrow / medium / broad] +- Audience: [beginner / intermediate / expert] +- Estimated length: [short 500-800 / medium 1000-1500 / long 2000+] +``` + +--- + +## Structure Planning Template + +Plan the post structure using voice patterns and structure templates: + +```markdown +## Plan +- Opening pattern: [Provocative Question / News Lead / Bold Claim / Direct Answer] +- Draft opening: [first sentence or question] +- Core metaphor: [conceptual lens, if voice uses extended metaphors] +- Sections: + 1. [Section name]: [purpose] + 2. [Section name]: [purpose] + ... 
+- Closing pattern: [Callback / Implication / Crescendo] +- Callback element: [what from opening returns] +``` + +Draft frontmatter if writing to a Hugo site: + +```yaml +--- +title: "Post Title Here" +slug: "post-slug-here" +date: YYYY-MM-DD +draft: false +tags: ["tag1", "tag2"] +summary: "One sentence description for list views" +--- +``` + +--- + +## Content Type Selection + +Select content type from `references/structure-templates.md` if available: + +- **Problem-Solution**: Bug fix, debugging session, resolution +- **Technical Explainer**: Concept, technology, how it works +- **Walkthrough**: Step-by-step instructions for a task diff --git a/pipelines/voice-writer/references/joy-check-rubric.md b/pipelines/voice-writer/references/joy-check-rubric.md new file mode 100644 index 00000000..b3d823f8 --- /dev/null +++ b/pipelines/voice-writer/references/joy-check-rubric.md @@ -0,0 +1,53 @@ +# Joy-Check Rubric + +Full reference for Phase 6 (JOY-CHECK) of the voice-writer pipeline. + +--- + +## Joy Framing Rubric + +Evaluate each paragraph against these five dimensions: + +| Dimension | Joy-Centered (PASS) | Grievance-Centered (FAIL) | +|-----------|-------------------|--------------------------| +| **Subject position** | Author as explorer, builder, learner | Author as victim, wronged party, unrecognized genius | +| **Other people** | Fellow travelers, interesting minds, people figuring things out | Opponents, thieves, people who should have done better | +| **Difficult experiences** | Interesting, surprising, made me think differently | Unfair, hurtful, someone should fix this | +| **Uncertainty** | Comfortable, curious, "none of us know" | Anxious, defensive, "I need to prove" | +| **Closing energy** | Forward-looking, building, sharing, exploring | Cautionary, warning, demanding, lamenting | + +--- + +## Scoring System + +For each paragraph, assign one of: + +- **JOY** (80-100): Frames through curiosity, generosity, or earned satisfaction +- **NEUTRAL** (50-79): 
Factual, neither joy nor grievance +- **CAUTION** (30-49): Leans toward grievance but recoverable with reframing +- **GRIEVANCE** (0-29): Frames through accusation, victimhood, or bitterness + +--- + +## Rewrite Rules + +If any paragraph scores GRIEVANCE: + +1. Rewrite preserving substance, changing only the framing +2. Shift toward curiosity, generosity, or earned satisfaction +3. Re-evaluate the rewritten paragraph to confirm it no longer scores GRIEVANCE +4. Maximum 3 joy-check iterations + +General rules: +- Reframe, don't suppress -- negative experiences are valid topics, only the framing changes. This is editorial craft, not dishonesty — substance stays the same. +- Preserve substance -- change the lens, not the facts +- One GRIEVANCE paragraph is a FAIL condition for the whole piece +- CAUTION paragraphs are acceptable if the overall piece passes + +--- + +## Important Constraint: Facts as Prosecution + +Do not rationalize that "the content is factual, so the framing is fine." Facts arranged as prosecution are framing, not neutrality. Evaluate the arrangement of facts, not just their accuracy. The reframe would not be dishonest — it is how we choose to tell the truth. + +For example, a sequence of negative facts presented in order of escalating severity, with no counterpoint or generosity, is grievance framing even if every individual fact is accurate. diff --git a/pipelines/voice-writer/references/output-format.md b/pipelines/voice-writer/references/output-format.md new file mode 100644 index 00000000..ec935b64 --- /dev/null +++ b/pipelines/voice-writer/references/output-format.md @@ -0,0 +1,66 @@ +# Output Format Reference + +Full output format for Phase 7 (OUTPUT) of the voice-writer pipeline. 
+ +--- + +## Report Template + +``` +=============================================================== + VOICE CONTENT: {Voice Name} +=============================================================== + +[Generated content here] + +=============================================================== + VALIDATION REPORT +=============================================================== + + Status: PASSED / FAILED + Score: {score}/100 + Iterations: {N} + + Checks: + [check] Banned patterns: None detected + [check] Em-dash: 0 found + [check] Rhythm: Varied sentence lengths + [warn] Contraction rate: 65% (target: 72%) + + Metrics Comparison: + | Metric | Target | Actual | Status | + |-------------------|--------|--------|--------| + | Avg sentence len | 15.3 | 14.8 | [ok] | + | Contraction rate | 0.72 | 0.65 | [warn] | + | Short sentences | 0.35 | 0.32 | [ok] | + + Joy Check: + Status: PASSED + Overall Joy Score: {score}/100 + Paragraphs: {N} JOY, {N} NEUTRAL, {N} CAUTION, 0 GRIEVANCE + +=============================================================== +``` + +## Status Indicators + +| Indicator | Meaning | +|-----------|---------| +| `[check]` | Passed | +| `[warn]` | Warning | +| `[fail]` | Error | +| `[ok]` | Within threshold | + +--- + +## Pipeline Completion Report (Phase 8) + +```markdown +## Pipeline Complete +Voice: {name} +Status: PASSED/FAILED +Score: {score}/100 +Joy Score: {joy_score}/100 +Iterations: {N} +Output: [location or displayed inline] +``` diff --git a/pipelines/voice-writer/references/validation-scripts.md b/pipelines/voice-writer/references/validation-scripts.md new file mode 100644 index 00000000..462a7a1d --- /dev/null +++ b/pipelines/voice-writer/references/validation-scripts.md @@ -0,0 +1,81 @@ +# Validation Scripts Reference + +Command reference and output schema for Phase 4 (VALIDATE) of the voice-writer pipeline. 
+ +--- + +## Primary Validation Command + +```bash +python3 $HOME/claude-code-toolkit/scripts/voice_validator.py validate \ + --content /tmp/voice-content-draft.md \ + --profile $HOME/claude-code-toolkit/skills/voice-{name}/profile.json \ + --voice {name} \ + --format json +``` + +Replace `{name}` with the voice name (e.g., `myvoice`). + +--- + +## Output Schema + +The validator returns JSON with the following shape: + +```json +{ + "pass": true, + "score": 82, + "threshold": 75, + "iterations": 1, + "violations": [ + { + "line": 14, + "text": "delve into", + "type": "error", + "rule": "banned_phrase", + "suggestion": "Remove or replace with direct verb" + } + ], + "warnings": [], + "metrics": { + "avg_sentence_len": 14.8, + "contraction_rate": 0.65, + "short_sentence_ratio": 0.32 + } +} +``` + +--- + +## Decision Logic (from Phase 4) + +| Condition | Action | +|-----------|--------| +| `pass == true` AND `score >= threshold` | Proceed to Phase 6: JOY-CHECK | +| `pass == false` AND `iterations < 3` | Proceed to Phase 5: REFINE | +| `pass == false` AND `iterations >= 3` | Proceed to Phase 6: JOY-CHECK with failure report | + +--- + +## Additional Scripts + +### Negative Framing Scanner (Phase 6 pre-filter) + +```bash +python3 $HOME/claude-code-toolkit/scripts/scan-negative-framing.py /tmp/voice-content-draft.md +``` + +Returns a list of regex-matched negative framing patterns (victimhood, accusation, bitterness, passive aggression) with suggested reframes. + +If this script is unavailable, skip the regex pre-filter and proceed directly to LLM-based joy-check analysis. The regex pre-filter is an optimization, not a requirement. + +--- + +## Validator Help + +```bash +python3 $HOME/claude-code-toolkit/scripts/voice_validator.py --help +``` + +Use this to check available flags if the command syntax above fails. 
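The decision table above can be expressed as a small function over the validator's JSON output. This is a sketch under stated assumptions, not the toolkit's own code: `next_phase` is a hypothetical helper, and the table does not cover `pass == true` with a score below the threshold, so the sketch treats that case as a failure.

```python
import json
from typing import Any


def next_phase(result: dict[str, Any], max_iterations: int = 3) -> str:
    """Choose the next pipeline phase from a validator result (hypothetical helper)."""
    # Assumption: a result counts as passing only when both conditions hold.
    passed = bool(result["pass"]) and result["score"] >= result["threshold"]
    if passed:
        return "JOY-CHECK"
    if result["iterations"] < max_iterations:
        return "REFINE"
    # Iteration limit reached: continue, but carry the failure report forward.
    return "JOY-CHECK (with failure report)"


# Usage with a trimmed copy of the sample output from the schema above.
sample = json.loads('{"pass": true, "score": 82, "threshold": 75, "iterations": 1}')
phase = next_phase(sample)
```

Keeping the iteration limit as a parameter mirrors the "maximum 3 iterations" rule without hard-coding it in more than one place.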
diff --git a/pipelines/voice-writer/references/voice-infrastructure.md b/pipelines/voice-writer/references/voice-infrastructure.md new file mode 100644 index 00000000..03ccccbc --- /dev/null +++ b/pipelines/voice-writer/references/voice-infrastructure.md @@ -0,0 +1,107 @@ +# Voice Infrastructure Reference + +Voice file structure, config/profile schemas, modes, and fix strategies. + +--- + +## Voice File Structure + +Each voice lives at `skills/voice-{name}/` and contains: + +| File | Required | Purpose | +|------|----------|---------| +| `SKILL.md` | Yes | AI instructions, patterns, anti-patterns | +| `profile.json` | Yes | Quantitative metrics targets | +| `config.json` | Yes | Validation settings, modes, thresholds | +| `references/samples/` | No | Few-shot examples (load 1-2 if available) | + +--- + +## profile.json Schema + +The profile captures quantitative voice metrics: + +```json +{ + "voice": "name", + "metrics": { + "avg_sentence_len": 15.3, + "contraction_rate": 0.72, + "short_sentence_ratio": 0.35 + }, + "preferred_transitions": ["so", "but", "and", "because"], + "banned_phrases": ["delve into", "it's worth noting", "in conclusion"] +} +``` + +--- + +## config.json Schema + +The config drives validation thresholds and available modes: + +```json +{ + "thresholds": { + "pass_score": 75, + "error_max": 0, + "warning_max": 3 + }, + "modes": ["default", "technical", "awards"], + "direct_write": false +} +``` + +### Threshold Parsing (Phase 1) + +Extract from `config.json`: +- `thresholds.pass_score` — minimum score to pass validation +- `thresholds.error_max` — maximum allowed errors (usually 0) +- `thresholds.warning_max` — maximum allowed warnings +- `modes` — list of available content modes + +--- + +## Available Modes Per Voice + +Modes are voice-specific and defined in each voice's `config.json`. 
Common examples: +- `default` — general-purpose content +- `technical` — systems explanations, how-it-works pieces +- `awards` — celebratory recognition pieces + +Load the voice's `config.json` to see which modes are available. + +--- + +## Fix Strategies for Phase 5 (REFINE) + +Apply targeted fixes in this order (errors before warnings): + +| Violation Type | Fix Strategy | +|----------------|-------------| +| Banned phrase | Remove or replace with the direct version of the phrase | +| Em-dash | Replace with comma, period, or restructure the sentence | +| Sentence rhythm | Break long sentences; vary short/medium/long distribution | +| Contraction rate too low | Replace "do not" → "don't", "it is" → "it's", etc. | +| Contraction rate too high | Expand contractions in more formal passages | +| Transition word mismatch | Replace with a transition from the profile's preferred list | + +Key rule: fix one violation at a time. Do not rewrite entire sections — fix the specific issue only. + +--- + +## Creating Missing Files + +### If profile.json is missing + +```bash +python3 ~/.claude/scripts/voice_analyzer.py analyze \ + --samples [writing-sample-files] \ + --output skills/voice-{name}/profile.json +``` + +Or use the `voice-calibrator` skill. + +### If config.json is missing + +Create a minimal config using the schema above. Use `pass_score: 75` as a starting default and add modes based on the voice's intended use cases. diff --git a/pipelines/workflow-orchestrator/SKILL.md b/pipelines/workflow-orchestrator/SKILL.md index e93fbc60..743aa79b 100644 --- a/pipelines/workflow-orchestrator/SKILL.md +++ b/pipelines/workflow-orchestrator/SKILL.md @@ -43,76 +43,6 @@ routing: Orchestrate complex multi-step software development tasks using the BRAINSTORM / WRITE-PLAN / EXECUTE-PLAN pattern. Breaks ambiguous or complex work into well-defined, verifiable subtasks with clear progress tracking. 
-## Operator Context - -This skill operates as an operator for complex task orchestration, configuring Claude's behavior for systematic multi-phase workflow execution. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before orchestration. Project instructions override default behaviors. -- **Over-Engineering Prevention**: Only create tasks for work that's directly requested. Keep plans simple and focused. No speculative features or flexibility that wasn't asked for. -- **Exact File Paths Required**: All tasks must specify absolute file paths, never relative paths or wildcards -- **Verification Mandatory**: Every task must include a verification step that confirms successful completion -- **Task Duration**: Individual tasks must be scoped to 2-5 minutes of work (break larger work into multiple tasks) -- **Dependency Declaration**: Tasks with dependencies must explicitly list prerequisite task IDs -- **Status Tracking**: After each task execution, report completion status and any blockers encountered -- **Context-Budget Awareness**: Monitor estimated context usage and adjust execution behavior by zone. Plans should target completion within 50% of context. Why: context exhaustion mid-execution produces silently degraded output with no recovery path. - - | Context Used | Zone | Behavior | - |-------------|------|----------| - | 0-30% | PEAK | Full execution, all quality gates, comprehensive verification | - | 30-50% | GOOD | Full execution, all quality gates. Target plan completion within this zone. | - | 50-70% | DEGRADING | Prioritize remaining critical tasks. Skip optional verification. Warn user. | - | 70%+ | POOR | Complete only in-progress task. Create handoff artifacts ([Session Handoff](../../adr/069-session-handoff-system.md)). Stop accepting new tasks. | - - Zone transitions trigger explicit log entries: "Entering DEGRADING zone at ~52% context usage. 3 tasks remaining. 
Prioritizing critical path." These zones are planning heuristics based on estimated usage, not precise measurements. - -- **Deviation Rules**: Numbered rules governing what the orchestrator may do autonomously during execution versus what requires human approval. Why: without explicit boundaries, the orchestrator either over-escalates (asking about typos) or under-escalates (silently making architectural changes). - - | Rule | Category | Action | Example | - |------|----------|--------|---------| - | 1 | Bug fix | Auto-fix | Typo in variable name, missing import, off-by-one error | - | 2 | Missing critical functionality | Auto-add if clearly required by the plan | A function referenced in the plan but not yet created | - | 3 | Blocking issue | Auto-fix environmental/dependency problems | Missing package, wrong version, permission issue | - | 4 | Architectural change | **STOP and ask user** | Changing a data model, adding a new service, altering an API contract | - - Rules 1-3 are autonomous -- the orchestrator handles them without user input. Rule 4 is a hard stop. Every autonomous deviation is logged with: the rule number that authorized it, what was changed, and why. Deviation rules classify WHAT to fix; the [Autonomous Repair](../shared-patterns/autonomous-repair.md) pattern governs HOW MANY TIMES to try (default: 2 attempts per task). - -### Default Behaviors (ON unless disabled) -- **Plan Directory Storage**: Save all plans to `plan/active/{plan-name}.md` instead of temp files. This enables plan discovery, tracking, and cleanup workflows. -- **Communication Style**: Report facts without self-congratulation. Show command output rather than describing it. Be concise but informative. -- **Plan Lifecycle Management**: After workflow completion, ask user whether to archive plan to `plan/completed/` or keep active. 
-- **Progress Reporting**: Report progress after each task completion -- **Blocker Detection**: Detect and report blockers immediately when encountered -- **Status Updates**: Provide phase transition notifications -- **Rationale Logging**: Document decision rationale in brainstorm phase - -### Optional Behaviors (OFF unless enabled) -- **Parallel Execution**: Execute independent tasks in parallel using Task tool (OFF by default - sequential is safer) -- **Automated Rollback**: Automatically revert changes if verification fails (OFF by default - manual review safer) -- **Time Tracking**: Log actual time taken per task vs estimated (OFF by default) -- **Dry Run Mode**: Generate plan without executing (OFF by default) - -## What This Skill CAN Do -- Break complex tasks into atomic, verifiable subtasks (2-5 min each) -- Manage dependencies between subtasks -- Track progress with status reporting -- Handle verification failures with retry/rollback -- Manage plan lifecycle (create, execute, archive, abandon) -- Suggest parallelization opportunities -- Validate plans before execution via plan-checker integration -- Detect regressions across task groups during execution -- Autonomously fix Rule 1-3 deviations without user intervention -- Adapt execution behavior based on context-budget zone - -## What This Skill CANNOT Do -- Execute tasks without a plan (must complete BRAINSTORM and WRITE-PLAN first) -- Skip phase gates (all gates must pass before proceeding) -- Create tasks without absolute file paths and verification commands -- Handle trivial single-file edits (use direct editing instead) -- Proceed past blockers without user input -- Make architectural changes autonomously (Deviation Rule 4 -- always requires user approval) -- Precisely measure context usage (zones are estimates, not metered values) - ## Instructions ### Three-Phase Workflow Overview @@ -130,7 +60,9 @@ This skill operates as an operator for complex task orchestration, configuring C - [ ] Selected 
approach has documented rationale - [ ] Constraints and dependencies are identified -**Purpose**: Transform ambiguous requirements into clear, actionable plans through Socratic refinement. +**Purpose**: Transform ambiguous requirements into clear, actionable plans through Socratic refinement. This phase prevents rework by establishing shared understanding before any code is written. + +**Constraint**: Only create tasks for work that's directly requested. Keep plans simple and focused. No speculative features or flexibility that wasn't asked for. This prevents scope creep and wasted tokens. #### Step 1: Understand Requirements @@ -158,6 +90,8 @@ Document: - **External Dependencies**: Services, databases, APIs that must be available - **Compatibility Requirements**: Backward compatibility, migration needs +**Constraint**: Read and follow repository CLAUDE.md files before orchestration. Project instructions override default behaviors. This ensures alignment with local norms. + #### Step 3: Generate Multiple Approaches Brainstorm 2-3 approaches with pros, cons, complexity, and risk for each. @@ -187,7 +121,16 @@ Document the selected approach, rationale for choosing it, how it addresses cons - [ ] Dependencies between tasks are documented - [ ] Plan has been saved to a file -**Purpose**: Break down the selected approach into executable, verifiable tasks. +**Purpose**: Break down the selected approach into executable, verifiable tasks. This phase produces a concrete artifact that can be validated before execution. + +**Constraints**: +- Every task must include a verification step that confirms successful completion +- Individual tasks must be scoped to 2-5 minutes of work (break larger work into multiple tasks) +- Tasks with dependencies must explicitly list prerequisite task IDs +- All tasks must specify absolute file paths, never relative paths or wildcards +- Save all plans to `plan/active/{plan-name}.md` instead of temp files. 
This enables plan discovery, tracking, and cleanup workflows. + +**Why**: Time-bounded tasks ensure focus. Absolute paths enable independent subagent execution. Verification commands prevent silent failures. Explicit dependencies catch circular dependencies before execution. #### Step 1: Create Task Breakdown @@ -217,7 +160,7 @@ T1 -> T2 -> T4 \-> T3 -/ ``` -**Note on parallelization**: If independent task groups exist, note them in the plan. Suggest parallel execution mode to user if it would provide meaningful speedup. +**Note on parallelization**: If independent task groups exist, note them in the plan. Suggest parallel execution mode to user if it would provide meaningful speedup. This is optional behavior (OFF by default) -- sequential is safer. #### Step 3: Define Verification Steps @@ -286,6 +229,17 @@ T1 -> T2 **Purpose**: Catch plan-level defects before they waste an execution cycle. A poorly structured plan -- vague tasks, missing dependencies, implicit ordering -- produces predictable execution failures. Validating before execution is cheaper than discovering problems mid-implementation. +**Constraint**: Monitor estimated context usage and adjust execution behavior by zone. Plans should target completion within 50% of context. Why: context exhaustion mid-execution produces silently degraded output with no recovery path. + +| Context Used | Zone | Behavior | +|-------------|------|----------| +| 0-30% | PEAK | Full execution, all quality gates, comprehensive verification | +| 30-50% | GOOD | Full execution, all quality gates. Target plan completion within this zone. | +| 50-70% | DEGRADING | Prioritize remaining critical tasks. Skip optional verification. Warn user. | +| 70%+ | POOR | Complete only in-progress task. Create handoff artifacts ([Session Handoff](../../adr/069-session-handoff-system.md)). Stop accepting new tasks. | + +Zone transitions trigger explicit log entries: "Entering DEGRADING zone at ~52% context usage. 3 tasks remaining. 
Prioritizing critical path." These zones are planning heuristics based on estimated usage, not precise measurements. + #### Step 1: Run Plan-Checker Invoke the [plan-checker](../plan-checker/SKILL.md) against the saved plan. The plan-checker evaluates the plan across its verification dimensions, including: @@ -331,7 +285,14 @@ Record in the plan file: - [ ] Regression checks have passed for prior task groups - [ ] Final status report has been generated -**Purpose**: Execute tasks from the plan, verify each step, handle blockers. +**Purpose**: Execute tasks from the plan, verify each step, handle blockers. Each task is verified against its success criteria, and cross-task regressions are detected. + +**Constraints**: +- After each task execution, report completion status and any blockers encountered +- Every autonomous deviation is logged with: the rule number that authorized it, what was changed, and why +- Report facts without self-congratulation. Show command output rather than describing it. Be concise but informative. +- Report progress after each task completion +- Detect and report blockers immediately when encountered #### Step 1: Load and Validate Plan @@ -350,11 +311,22 @@ Update plan status to "In Progress". 1. **Check Dependencies**: Verify all prerequisite tasks completed successfully 2. **Report Start**: Log task start and current context-budget zone 3. **Execute Operations**: Perform the task operations -4. **Apply Deviation Rules**: If unexpected issues arise during execution, classify by rule number (1-3: auto-fix with logged justification; 4: stop and ask user). See Deviation Rules in Hardcoded Behaviors. +4. **Apply Deviation Rules**: If unexpected issues arise during execution, classify by rule number (1-3: auto-fix with logged justification; 4: stop and ask user). See Deviation Rules below. 5. **Run Verification**: Execute verification command 6. **Evaluate Result**: Check against success criteria 7. 
**Report Completion**: Log task status and overall progress (e.g., "2/6 tasks complete") +**Deviation Rules**: Numbered rules governing what the orchestrator may do autonomously during execution versus what requires human approval. Why: without explicit boundaries, the orchestrator either over-escalates (asking about typos) or under-escalates (silently making architectural changes). + +| Rule | Category | Action | Example | +|------|----------|--------|---------| +| 1 | Bug fix | Auto-fix | Typo in variable name, missing import, off-by-one error | +| 2 | Missing critical functionality | Auto-add if clearly required by the plan | A function referenced in the plan but not yet created | +| 3 | Blocking issue | Auto-fix environmental/dependency problems | Missing package, wrong version, permission issue | +| 4 | Architectural change | **STOP and ask user** | Changing a data model, adding a new service, altering an API contract | + +Rules 1-3 are autonomous -- the orchestrator handles them without user input. Rule 4 is a hard stop. Every autonomous deviation is logged with: the rule number that authorized it, what was changed, and why. + #### Step 3: Regression Gate Before verifying the current task group, re-run verification checks for prior completed task groups. Why: Task 5 may silently break what Task 3 built. Independent per-task verification misses cross-task regressions that compound into harder-to-diagnose failures later. @@ -374,11 +346,11 @@ After Task Group 3 executes -> verify Groups 1+2 (regression) -> verify Group 3 **Scaling for large plans**: Full regression checking on every prior group is the default for plans with 5 or fewer task groups. For plans with more than 5 groups, check only the immediately preceding group to avoid consuming excessive context on re-verification. The tradeoff: less thorough regression detection, but more context available for actual execution. -**Task groups**: In sequential execution, each task is its own group.
When wave-based parallel execution is enabled (see [Autonomous Repair](../shared-patterns/autonomous-repair.md)), each wave forms a task group. Regression checks happen between groups, not between individual tasks within a group. +**Task groups**: In sequential execution, each task is its own group. When wave-based parallel execution is enabled, each wave forms a task group. Regression checks happen between groups, not between individual tasks within a group. #### Step 4: Handle Verification Failures -If verification fails, apply the [Autonomous Repair](../shared-patterns/autonomous-repair.md) pattern: +If verification fails: 1. **Report Failure**: Document the error output and analysis 2. **Classify by Deviation Rule**: Determine if the failure falls under Rule 1-3 (auto-fix) or Rule 4 (stop and ask). This classification happens before strategy selection -- if it's Rule 4, ESCALATE immediately regardless of repair budget. @@ -454,99 +426,14 @@ After successful execution, prompt user about plan lifecycle: **Cause**: Plan has structural issues (vague scope, missing dependencies, unbounded tasks) that incremental revision cannot fix **Solution**: Log remaining issues as accepted risks in the plan's Notes section and proceed to execution. The 3-iteration limit prevents infinite planning loops. If issues are fundamental, ask user to restructure the task scope. -## Common Anti-Patterns - -### Anti-Pattern 1: Skipping Brainstorm Phase - -**Problem**: Jumping straight to creating tasks without clarifying requirements or exploring approaches. - -**Why it fails**: No requirement clarification (OAuth? JWT?), no approach exploration, no constraint identification. Results in rework. - -**Fix**: Complete all BRAINSTORM steps -- clarify requirements, generate 2-3 approaches, select with documented rationale -- before creating any tasks. 
- -### Anti-Pattern 2: Vague Task Definitions - -**Problem**: Tasks with descriptions like "Fix the database", file references like "database files", and verification like "check it". - -**Why it fails**: No absolute file paths, no specific operations, impossible to verify, cannot be executed by independent subagent. - -**Fix**: Every task must have absolute file paths, specific operations, and executable verification commands with clear success criteria. - -### Anti-Pattern 3: Creating Unnecessary Orchestration - -**Problem**: Using the full BRAINSTORM/WRITE-PLAN/EXECUTE-PLAN workflow for a typo fix or single-file edit. - -**Why it fails**: Simple single-file edits don't need orchestration. Tasks under 2 minutes should use direct editing. - -**Fix**: Only orchestrate when work spans multiple files/systems and requires coordination. Use direct editing for everything else. - -### Anti-Pattern 4: Speculative Feature Addition - -**Problem**: User asks for a login form, assistant plans a comprehensive auth system with OAuth, 2FA, role-based permissions, and audit logging. - -**Why it fails**: Adding unrequested features violates "only implement what's requested". Massive scope increase without confirmation. - -**Fix**: Implement exactly what was requested. If related features seem useful, ask the user before expanding scope. - -### Anti-Pattern 5: Skipping Plan Validation - -**Problem**: Proceeding directly from WRITE-PLAN to EXECUTE-PLAN without running plan-checker. - -**Why it fails**: A plan that looks reasonable at a glance may have vague tasks, missing dependencies, or scope that exceeds the context budget. These defects are discovered mid-execution, wasting an entire execution cycle. Validation is cheaper than rework. - -**Fix**: Always run VALIDATE-PLAN (Phase 2.5) between WRITE-PLAN and EXECUTE-PLAN. The bounded revision loop (max 3 iterations) prevents infinite planning while catching major issues. 
- -### Anti-Pattern 6: Ignoring Regressions - -**Problem**: Verifying only the current task without checking whether prior tasks' outputs remain valid. - -**Why it fails**: Task N may silently break what Task N-2 built. Without regression checks, these breaks compound -- by the time they surface, the root cause is buried under subsequent changes. - -**Fix**: Run regression gate checks on prior task groups before verifying the current group. For plans with 5 or fewer groups, check all prior groups. For larger plans, check at least the immediately preceding group. - -### Anti-Pattern 7: Auto-Fixing Architectural Changes - -**Problem**: Treating an architectural change (new service, altered API contract, data model change) as a simple bug fix and applying it autonomously. - -**Why it fails**: Architectural changes have cascading effects the orchestrator cannot fully anticipate. A "quick fix" to a data model can break consumers, invalidate migrations, or create backward-compatibility issues. - -**Fix**: Classify every autonomous fix by Deviation Rule number. If it's Rule 4 (architectural change), STOP and ask the user. When in doubt about the classification, escalate -- false escalation wastes seconds; false autonomy wastes hours. - -## Validation - -To validate a workflow execution: -1. All phase gates passed (requirements clarified, approach selected) -2. Plan saved to `plan/active/` with absolute file paths -3. Plan-checker validation completed (PASS, PASS-WITH-WARNINGS, or BLOCK-OVERRIDDEN with documented risks) -4. Each task has executable verification command -5. All verifications pass after execution -6. Regression checks pass for all prior task groups -7. 
All autonomous deviations logged with rule numbers - ## References This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations - [Gate Enforcement](../shared-patterns/gate-enforcement.md) - Phase transition rules - [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Autonomous Repair](../shared-patterns/autonomous-repair.md) - Bounded self-repair with RETRY/DECOMPOSE/PRUNE/ESCALATE strategies Related skills and ADRs: - [Plan Checker](../plan-checker/SKILL.md) - Pre-execution plan validation (ADR-074) - [ADR-076: Autonomous Repair Mechanism](../../adr/076-autonomous-repair-mechanism.md) - Repair strategies and budget enforcement - [ADR-079: Workflow Orchestrator Enhancements](../../adr/079-workflow-orchestrator-enhancements.md) - Plan validation, context budget, deviation rules, regression gates - [ADR-069: Session Handoff System](../../adr/069-session-handoff-system.md) - Handoff artifacts for POOR context zone - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Requirements are clear enough" | Ambiguity causes rework | Complete BRAINSTORM phase | -| "Tasks are roughly scoped" | Vague tasks can't be verified | Define exact file paths + verification | -| "Simple enough to skip planning" | Unplanned work has higher failure rate | Use BRAINSTORM -> WRITE-PLAN -> EXECUTE | -| "Let me just add this feature too" | Scope creep wastes time and tokens | Only implement what was requested | -| "Plan looks fine, skip validation" | Plans that look fine still have hidden dependency gaps and vague tasks | Run VALIDATE-PLAN -- it catches what eyeballing misses | -| "One more revision will perfect it" | Diminishing returns after 3 iterations; execution surfaces remaining issues faster | Stop at 3 iterations, log remaining issues as 
accepted risks | -| "This architectural change is small enough to auto-fix" | Small architectural changes still have cascading effects | Deviation Rule 4: STOP and ask. Always. | -| "Prior tasks are fine, no need to re-check" | Cross-task regressions are silent until they compound | Run regression gate between task groups | -| "Context is probably fine" | "Probably" is how you end up with degraded output and no handoff | Check context zone; act on DEGRADING/POOR transitions | diff --git a/scripts/score-component.py b/scripts/score-component.py index 539e3cab..90722c3f 100644 --- a/scripts/score-component.py +++ b/scripts/score-component.py @@ -328,36 +328,49 @@ def check_reference_files(component_type: str, file_path: Path) -> CheckResult: return CheckResult("Reference files", 10, 0, f"No references/ directory at {refs_dir.relative_to(REPO_ROOT)}") -def check_operator_context(content: str) -> CheckResult: - """Check: Operator Context with Hardcoded/Default/Optional subsections (15 pts).""" - subsections = { - "Hardcoded Behaviors": bool(re.search(r"#{2,4}\s+Hardcoded Behaviors", content, re.IGNORECASE)), - "Default Behaviors": bool(re.search(r"#{2,4}\s+Default Behaviors", content, re.IGNORECASE)), - "Optional Behaviors": bool(re.search(r"#{2,4}\s+Optional Behaviors", content, re.IGNORECASE)), - } - - found = sum(1 for v in subsections.values() if v) +def check_workflow_instructions(content: str) -> CheckResult: + """Check: Instructions section with workflow-first structure (15 pts). + + Workflow-first model: Instructions section with phases/steps and inline + constraints using "because X" reasoning. Replaces the old Operator Context + check (Hardcoded/Default/Optional subsections were removed in the + workflow-first migration). 
+ """ + has_instructions = bool(re.search(r"#{2,4}\s+Instructions", content, re.IGNORECASE)) + has_phases = bool(re.search(r"#{2,4}\s+(Phase|Step)\s+\d", content, re.IGNORECASE)) + has_gates = bool(re.search(r"\*\*Gate\*\*", content)) + + found = sum([has_instructions, has_phases, has_gates]) earned = round((found / 3) * 15) if found == 3: - return CheckResult("Operator Context", 15, 15) - - missing = [k for k, v in subsections.items() if not v] - return CheckResult("Operator Context", 15, earned, f"{found}/3 subsections (missing: {', '.join(missing)})") - + return CheckResult("Workflow instructions", 15, 15) -def check_can_cannot_sections(content: str) -> CheckResult: - """Check: CAN and CANNOT sections present (10 pts).""" - has_can = bool(re.search(r"#{2,4}\s+What This .* CAN Do", content, re.IGNORECASE)) - has_cannot = bool(re.search(r"#{2,4}\s+What This .* CANNOT Do", content, re.IGNORECASE)) - - if has_can and has_cannot: - return CheckResult("CAN/CANNOT sections", 10, 10) - elif has_can or has_cannot: - present = "CAN" if has_can else "CANNOT" - missing = "CANNOT" if has_can else "CAN" - return CheckResult("CAN/CANNOT sections", 10, 5, f"Has {present}, missing {missing}") - return CheckResult("CAN/CANNOT sections", 10, 0, "Neither CAN nor CANNOT section found") + missing = [] + if not has_instructions: + missing.append("Instructions section") + if not has_phases: + missing.append("Phase/Step numbering") + if not has_gates: + missing.append("Gate checkpoints") + return CheckResult("Workflow instructions", 15, earned, f"{found}/3 elements (missing: {', '.join(missing)})") + + +def check_inline_constraints(content: str) -> CheckResult: + """Check: Inline constraints with reasoning (10 pts). + + Workflow-first model: constraints are distributed inline with "because X" + reasoning at point-of-use. Replaces the old CAN/CANNOT sections check. 
+ """ + because_count = len(re.findall(r"\bbecause\b", content, re.IGNORECASE)) + + if because_count >= 5: + return CheckResult("Inline constraints", 10, 10, f"{because_count} inline 'because' reasoning instances") + elif because_count >= 2: + return CheckResult("Inline constraints", 10, 5, f"{because_count} inline 'because' reasoning (target: 5+)") + return CheckResult( + "Inline constraints", 10, 0, f"Only {because_count} inline constraint reasoning found (target: 5+)" + ) def check_broken_internal_links(content: str, file_path: Path) -> CheckResult: @@ -481,8 +494,8 @@ def score_component(file_path: Path, do_check_secrets: bool = False) -> Componen score.checks.append(check_error_handling_section(content)) score.checks.append(check_routing_registration(component_type, file_path, fm)) score.checks.append(check_reference_files(component_type, file_path)) - score.checks.append(check_operator_context(content)) - score.checks.append(check_can_cannot_sections(content)) + score.checks.append(check_workflow_instructions(content)) + score.checks.append(check_inline_constraints(content)) score.checks.append(check_broken_internal_links(content, file_path)) if do_check_secrets: diff --git a/scripts/tests/test_score_component.py b/scripts/tests/test_score_component.py index 515b78ff..95cfeee8 100644 --- a/scripts/tests/test_score_component.py +++ b/scripts/tests/test_score_component.py @@ -136,12 +136,12 @@ def test_frontmatter_pass(self) -> None: assert checks["Valid YAML frontmatter"]["status"] == "PASS" assert checks["Valid YAML frontmatter"]["earned"] == 10 - def test_operator_context_check(self) -> None: - """Operator Context check detects subsections in /do skill.""" + def test_workflow_instructions_check(self) -> None: + """Workflow instructions check detects phases/gates in /do skill.""" result = run_script("skills/do/SKILL.md", "--json") data = json.loads(result.stdout) checks = {c["name"]: c for c in data["results"][0]["checks"]} - assert checks["Operator 
Context"]["status"] == "PASS" + assert checks["Workflow instructions"]["status"] == "PASS" def test_secret_detection_clean_agent(self) -> None: """No secrets in a standard agent.""" diff --git a/skills/INDEX.json b/skills/INDEX.json index c9bc4de5..c283ba87 100644 --- a/skills/INDEX.json +++ b/skills/INDEX.json @@ -1,6 +1,6 @@ { "version": "2.0", - "generated": "2026-03-27T03:14:10Z", + "generated": "2026-03-27T19:00:36Z", "generated_by": "scripts/generate-skill-index.py", "skills": { "adr-consultation": { diff --git a/skills/adr-consultation/SKILL.md b/skills/adr-consultation/SKILL.md index e8ef8975..62a11215 100644 --- a/skills/adr-consultation/SKILL.md +++ b/skills/adr-consultation/SKILL.md @@ -36,54 +36,13 @@ routing: # ADR Consultation Skill -## Operator Context - -This skill operates as an orchestrator for multi-agent architecture consultation, configuring -Claude's behavior to dispatch 3 specialized reviewer agents in parallel against an ADR and -synthesize their findings into a PROCEED or BLOCKED verdict. It implements the **ConsensusCode -file-based communication protocol** adapted to our ADR system — agents write responses to -`adr/{name}/` directories, enabling inter-agent communication and persistence across sessions. - -The design principle: challenge architecture decisions BEFORE implementation, not after. A -BLOCKED verdict stops implementation dispatch. A PROCEED verdict explicitly clears the path. -This is the gate between feature-plan and feature-implement for Medium+ decisions. - -### Hardcoded Behaviors (Always Apply) - -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before dispatching agents. -- **Single-Message Dispatch**: All 3 Task calls MUST appear in ONE response for true parallelism. - Sequential dispatch defeats the purpose and triples wall-clock time. -- **Artifacts Over Memory**: Every agent writes to `adr/{name}/{agent-name}.md`. Synthesis reads - from files, not from context. 
This makes consultation results persistent across sessions. -- **Gate Enforcement**: If blocking concerns exist in concerns.md, do NOT proceed to - implementation. Report the blocks and stop. This is not advisory — it is a hard gate. -- **All 3 Agents Required**: Do not skip a reviewer because the ADR "seems simple". Let - agents report "no concerns" if genuinely clean. Partial consultation gives false confidence. - -### Default Behaviors (ON unless disabled) - -- **Consultation Directory Creation**: Create `adr/{adr-name}/` before dispatching agents. -- **Structured Concerns Tracking**: Every concern raised by any agent is tracked in - `adr/{adr-name}/concerns.md` with severity (blocking/important/minor) and resolution status. -- **Synthesis Production**: After all agents complete, produce `adr/{adr-name}/synthesis.md` - with areas of agreement, disagreement, and final verdict rationale. -- **Verdict Output**: Always produce a final PROCEED or BLOCKED verdict with supporting reasons. - -### Optional Behaviors (OFF unless enabled) - -- **5-Agent Mode**: For Complex decisions (new subsystem, major API change), add - `reviewer-security` and a second domain expert. Enable with "complex consultation" or - "full consultation". -- **Prior Consultation Check**: Before dispatching, scan `adr/{name}/` for existing agent - files to avoid duplicate consultation. Enable with "check for prior consultation". - ---- +Multi-agent architecture consultation that dispatches 3 specialized reviewers in parallel against an ADR and synthesizes their findings into a PROCEED or BLOCKED verdict. This is the gate between feature-plan and feature-implement for Medium+ decisions because challenging architecture decisions before implementation prevents costly post-implementation rework. ## Instructions ### Phase 1: DISCOVER -**Goal**: Identify the ADR being consulted on and prepare the consultation directory. +**Goal**: Identify the ADR and prepare the consultation directory. 
**Step 1: Locate the ADR** @@ -92,6 +51,8 @@ Check for ADR path in this order: 2. Active session context from adr-system hook (`.adr-session.json`) 3. Ask the user which ADR to consult on +Do not guess which ADR to consult on because an incorrect guess wastes a full consultation cycle. + ```bash # Check for active ADR session cat .adr-session.json 2>/dev/null @@ -100,7 +61,19 @@ cat .adr-session.json 2>/dev/null ls adr/*.md ``` -**Step 2: Read the ADR** +Even if this ADR was discussed informally, run the formal consultation because undocumented discussion produces no persistent artifacts and cannot be referenced by future sessions. + +**Step 2: Check for prior consultation** + +Before dispatching, scan `adr/{adr-name}/` for existing agent files because silently overwriting prior consultation work destroys the audit trail. + +```bash +ls adr/{adr-name}/ 2>/dev/null +``` + +If existing files are found, report them and their timestamps. Ask the user whether to overwrite (re-run consultation) or use existing results. + +**Step 3: Read the ADR** Read the full ADR content. Extract: - The decision being made @@ -108,24 +81,31 @@ Read the full ADR content. Extract: - Any stated risks or consequences - The ADR name (filename without `.md`) for the consultation directory -**Step 3: Create consultation directory** +**Step 4: Create consultation directory** + +Create `adr/{adr-name}/` before dispatching agents because agents need a valid directory to write their output files. ```bash mkdir -p adr/{adr-name} ``` -**Gate**: ADR content read, consultation directory created, ADR name confirmed. Do NOT dispatch -agents until this gate passes — agents need the ADR content and a valid directory to write to. +**Gate**: ADR content read, consultation directory created, ADR name confirmed. Do NOT dispatch agents until this gate passes because agents need the ADR content and a valid directory to write to. 
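The Phase 1 steps and gate above can be sketched as a single check. This is a minimal illustration, not part of the skill: the function name `phase1_gate` and its return shape are hypothetical, and it assumes the consultation directory sits next to the ADR file and is named after the file stem, as the `adr/{adr-name}/` convention suggests.

```python
from pathlib import Path


def phase1_gate(adr_path: Path) -> dict:
    """Prepare the consultation directory and report whether the gate passes.

    Hypothetical helper: the skill describes these steps in prose and shell
    commands; this only mirrors the same checks in one place.
    """
    adr_name = adr_path.stem  # ADR name = filename without ".md"
    consult_dir = adr_path.parent / adr_name

    # Prior-consultation check: list existing agent files before overwriting.
    if consult_dir.is_dir():
        prior_files = sorted(p.name for p in consult_dir.glob("*.md"))
    else:
        prior_files = []

    # Read the ADR; a missing or empty ADR fails the gate.
    adr_text = adr_path.read_text(encoding="utf-8") if adr_path.is_file() else ""

    # Equivalent of `mkdir -p adr/{adr-name}`, done only for a readable ADR.
    if adr_text.strip():
        consult_dir.mkdir(parents=True, exist_ok=True)

    return {
        "adr_name": adr_name,
        "prior_files": prior_files,
        "gate_passed": bool(adr_text.strip()) and consult_dir.is_dir(),
    }
```

A non-empty `prior_files` list is the point at which the skill asks the user whether to overwrite or reuse existing results; the sketch only reports the files and leaves that decision to the caller.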
--- ### Phase 2: DISPATCH -**Goal**: Launch all 3 consultation agents in a single message for true parallel execution. +**Goal**: Launch all consultation agents in a single message for true parallel execution. + +All three Task calls MUST appear in ONE response because sequential dispatch triples wall-clock time with no cross-perspective benefit. The value of this skill is simultaneous independent judgment. If you find yourself dispatching agents one at a time, stop and restructure into a single message. + +Dispatch all 3 agents even if the ADR "seems simple" because partial consultation gives false confidence. Let agents report "no concerns" if genuinely clean — that is a fast, cheap confirmation that removing a safety net cannot replicate. + +Even when there is time pressure, do not skip consultation because blocking concerns discovered post-implementation cost dramatically more to fix than the minutes this consultation takes. -**CRITICAL**: All three Task calls MUST appear in ONE response. If you dispatch them one at a -time, you are running sequential analysis at 3x the cost with none of the cross-perspective -benefit. The value of this skill is in simultaneous independent judgment. +**Standard mode (3 agents)**: Always dispatch all three. + +**Complex mode (5 agents)**: For Complex decisions (new subsystem, major API change), add `reviewer-security` and a second domain expert. Enable with "complex consultation" or "full consultation". Each agent receives: 1. The full ADR content as context @@ -133,8 +113,6 @@ Each agent receives: 3. Explicit output path: `adr/{adr-name}/{agent-name}.md` 4. The structured output format defined below -Dispatch exactly these 3 agents: - **Agent 1: reviewer-contrarian** Lens: Challenge assumptions, find simpler alternatives, validate premises. 
@@ -217,8 +195,7 @@ Structure your response as: **Agent 3: reviewer-meta-process** -Lens: System health, single points of failure, whether this makes one component indispensable, -whether it aligns with established architecture principles. +Lens: System health, single points of failure, whether this makes one component indispensable, whether it aligns with established architecture principles. Prompt template: ``` @@ -262,8 +239,7 @@ Structure your response as: [Concrete recommendation with rationale.] ``` -**Gate**: All 3 Task calls dispatched in a single message. Proceed to Phase 3 only when all 3 -agents have returned and written their files to `adr/{adr-name}/`. +**Gate**: All Task calls dispatched in a single message. Proceed to Phase 3 only when all agents have returned and written their files to `adr/{adr-name}/`. --- @@ -271,7 +247,9 @@ agents have returned and written their files to `adr/{adr-name}/`. **Goal**: Read all agent responses from the consultation directory and produce a synthesis. -**Step 1: Read all agent responses** +**Step 1: Read all agent responses from files** + +Read the response files from disk, not from Task return context, because files persist across sessions while context does not — synthesis from context is not reproducible. ```bash cat adr/{adr-name}/reviewer-contrarian.md @@ -281,7 +259,7 @@ cat adr/{adr-name}/reviewer-meta-process.md **Step 2: Extract all concerns** -For each concern raised across all agents, record in `adr/{adr-name}/concerns.md`: +Track every concern raised by any agent in `adr/{adr-name}/concerns.md` with severity and resolution status because structured tracking prevents concerns from being lost during synthesis. ```markdown # Concerns: {adr-name} @@ -304,6 +282,10 @@ Resolution states (update as concerns are addressed): **Step 3: Identify verdict agreement** +Do not treat NEEDS_CHANGES as equivalent to PROCEED because NEEDS_CHANGES means the agent identified real concerns that should be addressed. 
Multiple NEEDS_CHANGES aggregates to a higher concern level, not a softer approval. + +When one reviewer disagrees with the majority, track the minority concern with full severity assessment because minority dissent catches real failures that consensus misses. + | Pattern | Meaning | |---------|---------| | All 3 PROCEED | Strong consensus — proceed with confidence | @@ -311,6 +293,8 @@ Resolution states (update as concerns are addressed): | Any BLOCK | Hard block — must resolve before proceeding | | Mixed NEEDS_CHANGES | Significant concerns — address before proceeding | +The synthesizer can also identify cross-cutting concerns that individual agents missed because agents assess separately and may not see emergent issues visible only in combination. Document any orchestrator-level concern in concerns.md and factor it into the verdict. + **Step 4: Write synthesis** Write `adr/{adr-name}/synthesis.md`: @@ -346,8 +330,7 @@ Write `adr/{adr-name}/synthesis.md`: [Why the overall verdict is PROCEED or BLOCKED, given the above.] ``` -**Gate**: All concerns extracted to concerns.md, synthesis.md written. Proceed to Phase 4 -only when both files exist in `adr/{adr-name}/`. +**Gate**: All concerns extracted to concerns.md, synthesis.md written. Proceed to Phase 4 only when both files exist in `adr/{adr-name}/`. --- @@ -357,8 +340,9 @@ only when both files exist in `adr/{adr-name}/`. **Step 1: Check for blocking concerns** -Read `adr/{adr-name}/concerns.md`. If any concern has `**Severity**: blocking`, the verdict -is BLOCKED. +Read `adr/{adr-name}/concerns.md`. If any concern has `**Severity**: blocking`, the verdict is BLOCKED. This is a hard gate, not advisory, because blocking concerns that surface post-implementation cost dramatically more to fix. + +Do not rationalize blocking concerns as "theoretical" because theoretical risk is still risk, and the gate exists specifically to prevent implementation from proceeding with unresolved blocking issues. 
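The blocking check above is mechanical enough to sketch in a few lines. The function name `verdict_from_concerns` is hypothetical, and the sketch assumes each concern in `concerns.md` carries a literal `**Severity**: blocking` marker as described. It follows the rule as written and does not look at resolution status.

```python
import re

# Matches the literal `**Severity**: blocking` marker used in concerns.md.
BLOCKING = re.compile(r"\*\*Severity\*\*:\s*blocking\b")


def verdict_from_concerns(concerns_md: str) -> str:
    """Return "BLOCKED" if any concern is marked blocking, else "PROCEED"."""
    return "BLOCKED" if BLOCKING.search(concerns_md) else "PROCEED"
```

Encoding the gate as a plain text scan keeps it deterministic: the verdict depends on what is written in the file, not on how persuasive an argument for downgrading a concern sounds.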
**Step 2: Issue verdict** @@ -431,7 +415,6 @@ When an ADR's implementation is complete and merged: **Cleanup instructions:** -To clean up after implementation is complete: 1. **Keep**: `adr/{name}/synthesis.md` (permanent record of verdict) 2. **Keep**: `adr/{name}/concerns.md` (permanent record of concerns + resolutions) 3. **Delete**: `adr/{name}/reviewer-*.md` (agent responses — value extracted into synthesis) @@ -445,10 +428,7 @@ rm adr/{name}/reviewer-*.md ls adr/{name}/synthesis.md adr/{name}/concerns.md ``` -**Note**: The consultation directory is auto-created by Phase 1 (`mkdir -p adr/{adr-name}`). -No `.gitkeep` is needed — the `adr/` directory is gitignored and the skill handles creation -on demand. The lifecycle is: skill creates directory -> consultation runs -> implementation -completes -> cleanup removes ephemeral agent responses -> permanent records remain. +The consultation directory is auto-created by Phase 1 (`mkdir -p adr/{adr-name}`). No `.gitkeep` is needed because the `adr/` directory is gitignored and the skill handles creation on demand. --- @@ -479,58 +459,9 @@ concerns.md and factor it into the verdict. --- -## Anti-Patterns - -### Anti-Pattern 1: Sequential Dispatch -**What it looks like**: Dispatching agents one at a time, waiting for each to return. -**Why wrong**: Triples wall-clock time, defeats the purpose of parallel consultation. -**Do instead**: All 3 Task calls in ONE message. This is not optional. - -### Anti-Pattern 2: Skipping an Agent Because the ADR "Seems Simple" -**What it looks like**: "This is a small change, we don't need the user advocate." -**Why wrong**: "Simple" changes often carry hidden user impact or coupling risks. Let the -agent return "no concerns" — that's a fast, cheap confirmation. Skipping removes a safety net. -**Do instead**: Dispatch all 3 always. Scale down from Complex (5-agent) to Standard (3-agent), -not from Standard to incomplete. 
- -### Anti-Pattern 3: Issuing Verdict Without Reading Agent Files -**What it looks like**: Synthesizing from agent return values in context rather than reading files. -**Why wrong**: Violates artifacts-over-memory principle. Files persist; context does not. -Synthesis from context is not reproducible across sessions. -**Do instead**: Always read `adr/{name}/*.md` files explicitly before synthesizing. - -### Anti-Pattern 4: Treating NEEDS_CHANGES as PROCEED -**What it looks like**: "Two agents said NEEDS_CHANGES, that's basically approval." -**Why wrong**: NEEDS_CHANGES means the agent identified real concerns that should be addressed. -Multiple NEEDS_CHANGES aggregates to a higher concern level. -**Do instead**: Extract every concern, assess severity. Two NEEDS_CHANGES with important -concerns is not the same as one PROCEED with no concerns. - -### Anti-Pattern 5: Proceeding Despite Blocking Concerns -**What it looks like**: "We'll fix it during implementation." "It's probably fine." -**Why wrong**: This is exactly what the gate is designed to prevent. Blocking concerns that -surface post-implementation cost dramatically more to fix. -**Do instead**: Hard stop. Update the ADR to address blocking concerns. Re-run consultation. 
- ---- - -## Anti-Rationalization - -| Rationalization Attempt | Why It's Wrong | Required Action | -|------------------------|----------------|-----------------| -| "The ADR looks solid, consultation is ceremony" | Unreviewed plans have blind spots | Run consultation; let agents find nothing if it's solid | -| "One reviewer disagreed but the others agreed" | Minority dissent catches real failures | Track the concern; evaluate severity | -| "We're on a deadline, skip consultation" | Blocking concerns discovered post-impl cost more | Run consultation; it takes minutes | -| "The contrarian is always negative" | Structured dissent is the feature, not a bug | Record the concern; assess validity | -| "We already discussed this informally" | Undocumented discussion is not a consultation | Run the formal consultation; create artifacts | -| "Blocking concern is theoretical" | Theoretical risk is risk | Require explicit resolution in ADR before proceeding | - ---- - ## References - [ADR: Multi-Agent Consultation](../../adr/multi-agent-consultation.md) — The architecture decision this skill implements - [parallel-code-review](../parallel-code-review/SKILL.md) — Fan-out/fan-in pattern this skill adapts - [dispatching-parallel-agents](../dispatching-parallel-agents/SKILL.md) — Core parallel dispatch mechanics -- [anti-rationalization-core.md](../shared-patterns/anti-rationalization-core.md) — Universal rationalization patterns - [reviewer-contrarian](../../agents/reviewer-contrarian.md) — Contrarian lens agent diff --git a/skills/agent-comparison/SKILL.md b/skills/agent-comparison/SKILL.md index 94de2886..f2845929 100644 --- a/skills/agent-comparison/SKILL.md +++ b/skills/agent-comparison/SKILL.md @@ -28,48 +28,7 @@ routing: # Agent Comparison Skill -## Operator Context - -This skill operates as an operator for agent A/B testing workflows, configuring Claude's behavior for rigorous, evidence-based variant comparison. 
It implements the **Benchmark Pipeline** architectural pattern — prepare variants, run identical tasks, measure outcomes, report findings — with **Domain Intelligence** embedded in the comparison methodology. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Over-Engineering Prevention**: Keep benchmark scripts simple. No speculative features or configurable frameworks that were not requested -- **Identical Task Prompts**: Both agents MUST receive the exact same task description, character-for-character -- **Isolated Execution**: Each agent runs in a separate session to avoid contamination -- **Test-Based Validation**: All generated code MUST pass the same test suite with `-race` flag -- **Evidence-Based Reporting**: Every claim backed by measurable data (tokens, test counts, quality scores) -- **Total Session Cost**: Measure total tokens to working solution, not just prompt size - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show command output rather than describing it -- **Temporary File Cleanup**: Remove temporary benchmark files and debug outputs at completion. 
Keep only comparison report and generated code -- **Two-Tier Benchmarking**: Run both simple (algorithmic) and complex (production) tasks -- **Token Tracking**: Record input/output token counts per turn where visible -- **Quality Grading**: Score code on correctness, error handling, idioms, documentation, testing -- **Comparative Summary**: Generate side-by-side comparison report with clear verdict - -### Optional Behaviors (OFF unless enabled) -- **Multiple Runs**: Run each benchmark 3x to account for variance -- **Blind Evaluation**: Hide agent identity during quality grading -- **Extended Benchmark Suite**: Run additional domain-specific tests -- **Historical Tracking**: Compare against previous benchmark runs - -## What This Skill CAN Do -- Systematically compare agent variants through controlled benchmarks -- Measure total session token cost (prompt + reasoning + tools + retries) -- Grade code quality using domain-specific checklists -- Reveal quality differences invisible to simple metrics (prompt size, line count) -- Generate comparison reports with evidence-backed verdicts - -## What This Skill CANNOT Do -- Compare agents without running identical tasks on both -- Declare a winner based on prompt size alone -- Skip quality grading and rely only on test pass rates -- Evaluate single agents in isolation (use quality-grading skill instead) -- Compare skills or prompts (this is for agent variants only) - ---- +Compare agent variants through controlled A/B benchmarks. Runs identical tasks on both agents, grades output quality with domain-specific checklists, and reports total session token cost to a working solution. This skill is exclusively for agent variant comparison — use `agent-evaluation` for single-agent assessment, and `skill-eval` for skill testing. ## Instructions @@ -77,6 +36,8 @@ This skill operates as an operator for agent A/B testing workflows, configuring **Goal**: Create benchmark environment and validate both agent variants exist. 
+Read and follow the repository CLAUDE.md before starting any execution. + **Step 1: Analyze original agent** ```bash @@ -94,7 +55,6 @@ grep -c '```' agents/{original-agent}.md If creating a compact variant, preserve: - YAML frontmatter (name, description, routing) -- Operator Context (Hardcoded/Default/Optional) - Core patterns and principles - Error handling philosophy @@ -103,7 +63,7 @@ Remove or condense: - Verbose explanations (condense to bullet points) - Redundant instructions and changelogs -Target: 10-15% of original size while keeping essential knowledge. Removing capability (error handling patterns, concurrency patterns) invalidates the comparison. Remove redundancy, not knowledge. +Target 10-15% of original size while keeping essential knowledge. Remove redundancy, not capability — stripping error handling patterns or concurrency guidance creates an unfair comparison because the compact agent is missing essential knowledge rather than expressing it concisely. **Step 3: Validate compact variant structure** @@ -111,9 +71,6 @@ Target: 10-15% of original size while keeping essential knowledge. Removing capa # Verify YAML frontmatter head -20 agents/{compact-agent}.md | grep -E "^(name|description):" -# Verify Operator Context preserved -grep -c "### Hardcoded Behaviors" agents/{compact-agent}.md - # Compare sizes echo "Original: $(wc -l < agents/{original-agent}.md) lines" echo "Compact: $(wc -l < agents/{compact-agent}.md) lines" @@ -125,7 +82,9 @@ echo "Compact: $(wc -l < agents/{compact-agent}.md) lines" mkdir -p benchmark/{task-name}/{full,compact} ``` -Write the task prompt ONCE, then copy it for both agents. NEVER customize prompts per agent. +Write the task prompt ONCE, then copy it for both agents. Both agents must receive the exact same task description, character-for-character, because different requirements produce different solutions and invalidate all measurements. 
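One way to verify the character-for-character requirement before running a benchmark is to compare content hashes of the prompt copies. This is a sketch under assumptions: the helper name `prompts_identical` is hypothetical, and the skill does not say where the prompt copies are stored, so the caller supplies the paths.

```python
import hashlib
from pathlib import Path


def prompts_identical(*prompt_files: Path) -> bool:
    """True when every given file has byte-for-byte identical contents."""
    # Hash raw bytes so whitespace and line-ending differences are caught too.
    digests = {hashlib.sha256(p.read_bytes()).hexdigest() for p in prompt_files}
    return len(digests) == 1
```

Comparing bytes rather than decoded text means a stray trailing space or a CRLF line ending counts as a difference, which matches the strictness the skill asks for.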
+ +Keep benchmark scripts simple — no speculative features or configurable frameworks that were not requested. **Gate**: Both agent variants exist with valid YAML frontmatter. Benchmark directories created. Identical task prompts written. Proceed only when gate passes. @@ -135,9 +94,9 @@ Write the task prompt ONCE, then copy it for both agents. NEVER customize prompt **Step 1: Run simple task benchmark (2-3 tasks)** -Use algorithmic problems with clear specifications (e.g., Advent of Code Day 1-6). Both agents should perform identically on well-defined problems. Simple tasks establish a baseline — if an agent fails here, it has fundamental issues. +Use algorithmic problems with clear specifications (e.g., Advent of Code Day 1-6). Simple tasks establish a baseline — if an agent fails here, it has fundamental issues. Running multiple simple tasks is necessary because a single data point is sensitive to task selection bias and cannot distinguish luck from systematic quality. -Spawn both agents in parallel using Task tool for fair timing: +Spawn both agents in parallel using Task tool. Each agent runs in a separate session to avoid contamination: ``` Task( @@ -155,7 +114,7 @@ Run in parallel to avoid caching effects or system load variance skewing results **Step 2: Run complex task benchmark (1-2 tasks)** -Use production-style problems that require concurrency, error handling, edge case anticipation. These are where quality differences emerge. See `references/benchmark-tasks.md` for standard tasks. +Use production-style problems that require concurrency, error handling, edge case anticipation — these are where quality differences emerge because simple tasks mask differences in edge case handling. See `references/benchmark-tasks.md` for standard tasks. 
Recommended complex tasks: - **Worker Pool**: Rate limiting, graceful shutdown, panic recovery @@ -164,7 +123,7 @@ Recommended complex tasks: **Step 3: Capture metrics for each run** -Record immediately after each agent completes. Do NOT wait until all runs finish. +Record immediately after each agent completes — do not wait until all runs finish, because delayed recording loses precision. Track input/output token counts per turn where visible, since total session cost (not just prompt size) is what matters. | Metric | Full Agent | Compact Agent | |--------|------------|---------------| @@ -183,7 +142,7 @@ cd benchmark/{task-name}/full && go test -race -v -count=1 cd benchmark/{task-name}/compact && go test -race -v -count=1 ``` -Use `-count=1` to disable test caching. Race conditions are automatic quality failures — record them but do NOT fix them for the agent being tested. +Use `-count=1` to disable test caching. All generated code must pass the same test suite with the `-race` flag because race conditions are automatic quality failures. Record them but do NOT fix them for the agent being tested. **Gate**: Both agents completed all tasks. Metrics captured for every run. Test output saved. Proceed only when gate passes. @@ -193,7 +152,7 @@ Use `-count=1` to disable test caching. Race conditions are automatic quality fa **Step 1: Create quality checklist BEFORE reviewing code** -Define criteria before seeing results to prevent bias. Do NOT invent criteria after seeing one agent's output. See `references/grading-rubric.md` for standard rubrics. +Define criteria before seeing results to prevent bias — inventing criteria after seeing one agent's output skews the comparison. See `references/grading-rubric.md` for standard rubrics. | Criterion | 5/5 | 3/5 | 1/5 | |-----------|-----|-----|-----| @@ -205,7 +164,7 @@ Define criteria before seeing results to prevent bias. 
Do NOT invent criteria af **Step 2: Score each solution independently** -Grade each agent's code on all five criteria. Score one agent completely before starting the other. +Grade each agent's code on all five criteria. Score one agent completely before starting the other. Report facts and show command output rather than describing it — every claim must be backed by measurable data (tokens, test counts, quality scores). ```markdown ## {Agent} Solution - {Task} @@ -233,7 +192,7 @@ For each bug found, record: - Test coverage: {did tests catch it? why not?} ``` -"Tests pass" is necessary but not sufficient. Production bugs often pass tests — Clear() returning nothing passes if no test checks the return value. TTL=0 bugs pass if no test uses zero TTL. +"Tests pass" is necessary but not sufficient — production bugs often pass tests. Clear() returning nothing passes if no test checks the return value. TTL=0 bugs pass if no test uses zero TTL. Apply the domain-specific quality checklist rather than relying only on test pass rates, because tests can miss goroutine leaks, wrong semantics, and other production issues. **Step 4: Calculate effective cost** @@ -241,7 +200,7 @@ For each bug found, record: effective_cost = total_tokens * (1 + bug_count * 0.25) ``` -An agent using 194k tokens with 0 bugs has better economics than one using 119k tokens with 5 bugs requiring fixes. The metric that matters is total cost to working, production-quality solution. +An agent using 194k tokens with 0 bugs has better economics than one using 119k tokens with 5 bugs requiring fixes. The metric that matters is total cost to working, production-quality solution — not prompt size, because prompt is a one-time cost while reasoning tokens dominate sessions. Check quality scores before claiming token savings, since savings that come from cutting corners are not real savings. **Gate**: Both solutions graded with evidence. Specific bugs documented with production impact. Effective cost calculated. 
Proceed only when gate passes. @@ -258,6 +217,8 @@ Use the report template from `references/report-template.md`. Include: - Specific bugs found and their production impact - Verdict based on total evidence +Generate a side-by-side comparison report with a clear verdict. + **Step 2: Run comparison analysis** ```bash @@ -268,18 +229,18 @@ diff benchmark/{task-name}/full/ benchmark/{task-name}/compact/ **Step 3: Analyze token economics** -The key economic insight: agent prompts are a one-time cost per session. Everything after — reasoning, code generation, debugging, retries — costs tokens on every turn. +The key economic insight: agent prompts are a one-time cost per session. Everything after — reasoning, code generation, debugging, retries — costs tokens on every turn. When a micro agent produces correct code, it uses approximately the same total tokens. The savings appear only when it cuts corners. | Pattern | Description | |---------|-------------| | Large agent, low churn | High initial cost, fewer retries, less debugging | | Small agent, high churn | Low initial cost, more retries, more debugging | -When a micro agent produces correct code, it uses approximately the same total tokens. The savings appear only when it cuts corners. +Our data showed a 57-line agent used 69.5k tokens vs 69.6k for a 3,529-line agent on the same correct solution — prompt size alone does not determine cost. **Step 4: State verdict with evidence** -The verdict MUST be backed by data. Include: +The verdict must be backed by data — do not declare a winner based on prompt size alone. Include: - Which agent won on simple tasks (expected: equivalent) - Which agent won on complex tasks (expected: full agent) - Total session cost comparison @@ -288,31 +249,19 @@ The verdict MUST be backed by data. Include: See `references/methodology.md` for the complete testing methodology with December 2024 data. -**Gate**: Report generated with all metrics. Verdict stated with evidence. 
Report saved to benchmark directory. +**Step 5: Clean up** ---- +Remove temporary benchmark files and debug outputs. Keep only the comparison report and generated code. + +**Gate**: Report generated with all metrics. Verdict stated with evidence. Report saved to benchmark directory. -## Examples - -### Example 1: Creating a Compact Agent -User says: "Create a compact version of golang-general-engineer and test it" -Actions: -1. Analyze original, create compact variant at 10-15% size (PREPARE) -2. Run simple task (Advent of Code) + complex task (Worker Pool) on both (BENCHMARK) -3. Score both with domain-specific checklist, calculate effective cost (GRADE) -4. Generate comparison report with verdict (REPORT) -Result: Data-driven recommendation on whether compact version is viable - -### Example 2: Comparing Internal vs External Agent -User says: "Compare our Go agent against go-expert-0xfurai" -Actions: -1. Validate both agents exist, prepare identical task prompts (PREPARE) -2. Run two-tier benchmarks with token tracking (BENCHMARK) -3. Grade with production quality checklist, document all bugs (GRADE) -4. Report with token economics showing prompt cost vs session cost (REPORT) -Result: Evidence-based comparison showing true cost of each variant +### Optional Extensions ---- +These are off by default. Enable explicitly when needed: +- **Multiple Runs**: Run each benchmark 3x to account for variance +- **Blind Evaluation**: Hide agent identity during quality grading +- **Extended Benchmark Suite**: Run additional domain-specific tests +- **Historical Tracking**: Compare against previous benchmark runs ## Error Handling @@ -332,53 +281,8 @@ Solution: Valid data point. Grade on test coverage and quality, not raw count. M Cause: Complex task taking too long or agent stuck in retry loop Solution: Note the timeout and number of retries attempted. Record as incomplete with partial metrics. 
Increase timeout limit if warranted, but excessive retries are a quality signal — an agent that needs many retries is less efficient regardless of final outcome. ---- - -## Anti-Patterns - -### Anti-Pattern 1: Comparing Only Prompt Size -**What it looks like**: "Compact agent is 90% smaller, therefore 90% more efficient" -**Why wrong**: Prompt is one-time cost. Session reasoning, retries, and debugging dominate total tokens. Our data showed a 57-line agent used 69.5k tokens vs 69.6k for a 3,529-line agent on the same correct solution. -**Do instead**: Measure total session tokens to working solution. - -### Anti-Pattern 2: Different Task Prompts -**What it looks like**: Giving the full agent harder requirements than the compact agent -**Why wrong**: Creates unfair comparison. Different requirements produce different solutions, invalidating all measurements. -**Do instead**: Copy-paste identical prompts character-for-character. Verify before running. - -### Anti-Pattern 3: Treating Test Failures as Equal Quality -**What it looks like**: "Both agents completed the task" when one has 12/12 tests and the other has 8/12 -**Why wrong**: Bugs have real cost. False equivalence between producing code and producing working code. -**Do instead**: Grade quality rigorously. Calculate effective cost with bug penalty multiplier. - -### Anti-Pattern 4: Single Benchmark Declaration -**What it looks like**: "Tested on one puzzle. Compact agent wins!" -**Why wrong**: Single data point is sensitive to task selection bias. Simple tasks mask differences in edge case handling. Cannot distinguish luck from systematic quality. -**Do instead**: Run two-tier benchmarking with 2-3 simple tasks and 1-2 complex tasks. - -### Anti-Pattern 5: Removing Core Patterns to Create Compact Agent -**What it looks like**: Compact version removes error handling patterns, concurrency guidance, and testing requirements to reduce size -**Why wrong**: Creates unfair comparison. 
Compact agent is missing essential knowledge, guaranteeing quality degradation rather than testing if brevity is possible. -**Do instead**: Remove verbose examples and redundant explanations, not capability. Keep one representative example per pattern. Condense explanations to bullet points but retain key insights. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Compact agent saved 50% tokens" | Savings may come from cutting corners, not efficiency | Check quality scores before claiming savings | -| "Tests pass, agents are equal" | Tests can miss production bugs (goroutine leaks, wrong semantics) | Apply domain-specific quality checklist | -| "One benchmark is enough" | Single task is sensitive to selection bias | Run two-tier benchmarks (simple + complex) | -| "Prompt size determines cost" | Prompt is one-time; reasoning tokens dominate sessions | Measure total session cost to working solution | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/methodology.md`: Complete testing methodology with December 2024 data - `${CLAUDE_SKILL_DIR}/references/grading-rubric.md`: Detailed grading criteria and quality checklists - `${CLAUDE_SKILL_DIR}/references/benchmark-tasks.md`: Standard benchmark task descriptions and prompts diff --git a/skills/agent-evaluation/SKILL.md b/skills/agent-evaluation/SKILL.md index 2534f9e6..4c950a4b 100644 --- a/skills/agent-evaluation/SKILL.md +++ b/skills/agent-evaluation/SKILL.md @@ -27,52 +27,16 @@ routing: # Agent Evaluation Skill -## Operator Context - -This skill operates as an operator for agent/skill quality assurance, configuring Claude's behavior 
for objective, evidence-based evaluation. It implements the **Iterative Assessment** pattern — identify targets, validate structure, measure depth, score, report — with **Domain Intelligence** embedded in the scoring rubric. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before evaluation -- **Over-Engineering Prevention**: Evaluate only what is requested. Do not speculatively analyze additional agents/skills or invent metrics that were not asked for -- **Read-Only Evaluation**: NEVER modify agents or skills during evaluation — only report findings -- **Evidence-Based Findings**: Every issue MUST include file path and line reference -- **Objective Scoring**: Use the rubric consistently across all evaluations — no subjective "looks good" assessments -- **Complete Output**: Show all test results with scores; never summarize as "all tests pass" - -### Default Behaviors (ON unless disabled) -- **Full Test Suite**: Run all evaluation categories (structural, content, code, integration) -- **Priority Ranking**: Sort findings by impact (HIGH / MEDIUM / LOW) -- **Score Calculation**: Generate numeric quality scores using the standard rubric -- **Improvement Suggestions**: Provide specific, actionable recommendations with file paths -- **Temporary File Cleanup**: Remove any intermediate analysis files at task completion -- **Comparative Analysis**: Show how evaluated items compare to collection averages - -### Optional Behaviors (OFF unless enabled) -- **Historical Comparison**: Compare current scores to previous evaluations (requires baseline) -- **Cross-Reference Validation**: Check all internal links and references resolve -- **Code Example Execution**: Actually run code examples to verify they work - -## What This Skill CAN Do -- Score agents and skills against a consistent 100-point rubric -- Detect missing sections, broken references, and structural gaps -- Measure content depth and compare to collection 
averages -- Generate structured reports with prioritized findings -- Batch-evaluate entire collections with summary statistics - -## What This Skill CANNOT Do -- Modify or fix agents/skills (use skill-creator instead) -- Evaluate external repositories or non-agent/skill files -- Replace human judgment on content accuracy or domain correctness -- Skip rubric categories — all must be scored - ---- +Objective, evidence-based quality assessment for agents and skills. Implements a 6-phase rubric: Identify, Structural, Content, Code, Integration, Report. Every finding must cite a file path and line number — no subjective "looks good" verdicts. ## Instructions -### Step 1: Identify Evaluation Targets +### Phase 1: Identify Evaluation Targets **Goal**: Determine what to evaluate and confirm targets exist. +Read the repository CLAUDE.md first to understand current standards before evaluating anything. Only evaluate what was explicitly requested — do not speculatively analyze additional agents or skills. + ```bash # List all agents ls agents/*.md | wc -l @@ -87,10 +51,12 @@ ls -la skills/{name}/ **Gate**: All targets confirmed to exist on disk. Proceed only when gate passes. -### Step 2: Structural Validation +### Phase 2: Structural Validation **Goal**: Check that required components exist and are well-formed. +Score every rubric category — never skip a category even if it "looks fine." Parse each required field explicitly rather than eyeballing YAML. Record PASS/FAIL with the line number for each check. + **For Agents** — check each item and record PASS/FAIL with line number: 1. YAML front matter: `name`, `description`, `color` fields present @@ -173,10 +139,12 @@ See `references/scoring-rubric.md` for full/partial/no credit breakdowns. **Gate**: All structural checks scored with evidence. Proceed only when gate passes. -### Step 3: Content Depth Analysis +### Phase 3: Content Depth Analysis **Goal**: Measure content quality and volume. 
+Do not estimate length by impression — count lines and calculate the score. "Content is long enough" is not a measurement. + ```bash # Skill total lines (SKILL.md + references) skill_lines=$(wc -l < skills/{name}/SKILL.md) @@ -199,10 +167,12 @@ agent_lines=$(wc -l < agents/{name}.md) **Gate**: Depth score calculated. Proceed only when gate passes. -### Step 4: Code Quality Checks +### Phase 4: Code Quality Checks **Goal**: Validate that code examples and scripts are functional. +A script existing on disk does not mean it works — run `python3 -m py_compile` on every `.py` file. Search for placeholder text in every file, not just files that "look incomplete." + 1. **Script syntax**: Run `python3 -m py_compile` on all `.py` files 2. **Placeholder detection**: Search for `[TODO]`, `[TBD]`, `[PLACEHOLDER]`, `[INSERT]` 3. **Code block tagging**: Count untagged (bare ` ``` `) vs tagged (` ```language `) blocks @@ -221,7 +191,7 @@ grep -c '```$' {file} **Gate**: All code checks complete. Proceed only when gate passes. -### Step 5: Integration Verification +### Phase 5: Integration Verification **Goal**: Confirm cross-references and tool declarations are consistent. @@ -257,10 +227,14 @@ grep -c "anti-rationalization-core" skills/{name}/SKILL.md **Gate**: All integration checks complete. Proceed only when gate passes. -### Step 6: Generate Quality Report +### Phase 6: Generate Quality Report **Goal**: Compile all findings into the standard report format. +Show all test results with individual scores — never summarize as "all tests pass." Sort findings by impact (HIGH / MEDIUM / LOW). Include specific, actionable recommendations with file paths and line numbers. When batch evaluating, show how each item compares to collection averages; do not report "most are good quality" without quantitative data. + +This phase is read-only: report findings but never modify agents or skills. Use skill-creator for fixes. Clean up any intermediate analysis files created during evaluation. 
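The Phase 4 placeholder search and code block tagging count described earlier can be sketched in Python. This is a minimal illustration, not the skill's actual tooling: `find_placeholders` and `count_fences` are hypothetical names, and the fence counter assumes fences use exactly three backticks and are never nested. It tracks open and close state because a closing fence looks identical to an untagged opening fence, so a line-based pattern match for a bare fence also matches closing fences.

```python
import re

# Hypothetical helpers for illustration; not part of the skill itself.
PLACEHOLDER = re.compile(r"\[(TODO|TBD|PLACEHOLDER|INSERT)\]")
FENCE = "`" * 3  # three backticks, built this way to keep the example fence-safe


def find_placeholders(text):
    """Return (line_number, line) pairs that contain placeholder markers."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if PLACEHOLDER.search(line)
    ]


def count_fences(text):
    """Return (tagged, untagged) counts of opening code fences.

    Assumes fences are not nested: every fence line seen while inside
    a block is treated as the closing fence and is not counted.
    """
    tagged = untagged = 0
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if inside:
            inside = False  # closing fence
        elif stripped == FENCE:
            untagged += 1
            inside = True
        else:
            tagged += 1
            inside = True
    return tagged, untagged
```

Returning line numbers alongside each placeholder hit keeps the result usable as evidence, since every finding in the report is expected to cite a file path and line.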
+ Use the report template from `references/report-templates.md`. The report MUST include: 1. **Header**: Name, type, date, overall score and grade @@ -351,50 +325,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Superficial Evaluation Without Evidence -**What it looks like**: "Structure: Looks good. Content: Seems adequate. Overall: PASS" -**Why wrong**: No file paths, no line references, no specific scores. Cannot verify or reproduce. -**Do instead**: Score every rubric category. Cite file:line for every finding. - -### Anti-Pattern 2: Skipping Validation Script Execution -**What it looks like**: "The skill has a validation script present." -**Why wrong**: Presence is not correctness. Script may have syntax errors or do nothing. -**Do instead**: Run `python3 -m py_compile` at minimum. Execute the script and capture output. - -### Anti-Pattern 3: Accepting Placeholder Content as Complete -**What it looks like**: "Agent has comprehensive examples section. PASS" -**Why wrong**: Did not check if examples contain [TODO] or [PLACEHOLDER] text. -**Do instead**: Search for placeholder patterns. Score content on substance, not section headers. - -### Anti-Pattern 4: Batch Evaluation Without Summary Statistics -**What it looks like**: "Evaluated all 38 agents. Most are good quality." -**Why wrong**: No quantitative data. Cannot track improvements or identify problem areas. -**Do instead**: Generate score distribution table, top/bottom performers, common issues count. See `references/batch-evaluation.md` for the collection summary template. - -### Anti-Pattern 5: Ignoring Repository-Specific Standards -**What it looks like**: "This agent follows standard practices and is well-structured." -**Why wrong**: Did not check CLAUDE.md requirements. May miss v2 standards (YAML list format, pipe description, item count ranges, gates, anti-rationalization table). -**Do instead**: Check CLAUDE.md first. Verify all v2-specific criteria. 
A generic "well-structured" verdict is meaningless without rubric scores. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "YAML looks fine, no need to parse it" | Looking is not parsing; fields may be missing | Check each required field explicitly | -| "Content is long enough, skip counting" | Impressions are not measurements | Count lines, calculate score | -| "Script exists, must work" | Existence is not correctness | Run `python3 -m py_compile` | -| "One failing check, rest are probably fine" | Partial evaluation is not evaluation | Complete all 6 steps | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/scoring-rubric.md` - Full/partial/no credit breakdowns per rubric category - `${CLAUDE_SKILL_DIR}/references/report-templates.md` - Standard report format templates (single, batch, comparison) diff --git a/skills/anti-ai-editor/SKILL.md b/skills/anti-ai-editor/SKILL.md index 233a1a1f..e995f113 100644 --- a/skills/anti-ai-editor/SKILL.md +++ b/skills/anti-ai-editor/SKILL.md @@ -30,48 +30,7 @@ routing: # Anti-AI Editor -## Operator Context - -This skill operates as an operator for content editing, detecting and removing AI-generated writing patterns. It implements the **Targeted Revision** architectural pattern -- scan for patterns, propose minimal fixes, preserve meaning -- with **Wabi-Sabi Authenticity** ensuring human imperfections are features, not bugs. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before editing -- **Over-Engineering Prevention**: Make minimal fixes only. 
No rewrites, no "while I'm here" improvements -- **Preserve Meaning**: NEVER change actual meaning or intent while fixing style -- **Show All Changes**: Display before/after for every modification with reason -- **Context Awareness**: Some flagged words are appropriate in technical contexts -- **Wabi-Sabi Enforcement**: Human imperfections (run-ons, fragments, loose punctuation) are features -- do NOT "fix" them - -### Default Behaviors (ON unless disabled) -- **Full Preview**: Show complete edited content before saving -- **Categorized Reporting**: Group issues by type (cliches, passive, structural, meta) -- **Actionable Fixes**: Every detected issue includes a specific replacement -- **Frontmatter Skip**: Skip YAML frontmatter, code blocks, and inline code -- **Voice Integration**: If voice specified, check voice-specific anti-patterns - -### Optional Behaviors (OFF unless enabled) -- **Auto-Apply**: Apply changes without preview confirmation -- **Aggressive Mode**: Flag borderline cases (use for marketing content) -- **Stats Only**: Report issues without suggesting fixes - -## What This Skill CAN Do -- Detect AI cliches and suggest natural replacements -- Identify passive voice overuse and suggest active alternatives -- Flag structural issues (monotonous sentence length, list overuse) -- Remove meta-commentary that adds no value -- Handle Hugo frontmatter correctly (skip YAML, edit content only) -- Preserve code blocks and technical terminology -- Show before/after comparisons for all changes - -## What This Skill CANNOT Do -- Rewrite content entirely (use targeted fixes only) -- Change technical accuracy for stylistic reasons (meaning is sacred) -- Remove domain-specific jargon that is appropriate in context -- Fix factual errors (style-only skill, not a fact-checker) -- Generate new content (use voice skills instead) -- Polish away authentic imperfections (see [Wabi-Sabi](../shared-patterns/wabi-sabi-authenticity.md)) - ---- +Detect and remove AI-generated 
writing patterns through targeted, minimal edits. This skill scans for cliches, passive voice, structural monotony, and meta-commentary, then proposes specific replacements -- never wholesale rewrites. Human imperfections (run-ons, fragments, loose punctuation) are features, not bugs; do not "fix" them. ## Instructions @@ -81,7 +40,9 @@ This skill operates as an operator for content editing, detecting and removing A **Step 1: Read and classify the file** -Read the target file. Identify file type (blog post, docs, README). Skip frontmatter (YAML between `---` markers), code blocks, inline code, and blockquotes. +Read the target file. Identify file type (blog post, docs, README). Skip frontmatter (YAML between `---` markers), code blocks, inline code, and blockquotes -- edits to these zones corrupt structure and are never appropriate. + +If a voice profile is specified, also check voice-specific anti-patterns alongside the standard categories. **Step 2: Scan for issues by category** @@ -98,6 +59,8 @@ Read the target file. Identify file type (blog post, docs, README). Skip frontma | Generic Closers | "future looks bright", "continues to evolve" | `references/detection-patterns.md` | | Curly Quotes | \u201C \u201D \u2018 \u2019 (ChatGPT-specific) | `references/detection-patterns.md` | +Some flagged words are appropriate in technical contexts. "Leverage" in "Use a lever to leverage mechanical advantage" is correct -- only flag words when used as corporate-speak, not in their literal or technical sense. + **Step 3: Count and classify issues** Record each issue with line number, category, and severity weight: @@ -137,6 +100,8 @@ Record each issue with line number, category, and severity weight: 3. **Meta-commentary** (usually removable) 4. **Passive Voice** (case-by-case judgment) +Every fix must be the minimum change needed. Multiple small edits beat one big rewrite because rewrites lose author voice and may introduce new AI patterns. 
Every detected issue must include a specific replacement -- reporting "Contains AI-sounding language" without a concrete fix is useless. + **Step 3: Wabi-sabi check** Before proposing any fix, ask: "Would removing this imperfection make it sound MORE robotic?" If yes, do NOT flag it. Preserve: @@ -145,6 +110,8 @@ Before proposing any fix, ask: "Would removing this imperfection make it sound M - Loose punctuation that matches conversational flow - Self-corrections mid-thought ("well, actually...") +Natural informal language like "So basically" in a casual blog post is spoken rhythm, not an AI pattern. Only remove patterns that are AI-generated, not patterns that are merely informal. + **Gate**: Approach selected. Fixes prioritized. Wabi-sabi exceptions noted. Proceed only when gate passes. ### Phase 3: EDIT @@ -153,6 +120,8 @@ Before proposing any fix, ask: "Would removing this imperfection make it sound M **Step 1: Generate the edit report** +Show before/after for every modification with the reason -- never apply silent changes. + ``` ================================================================= ANTI-AI EDIT: [filename] @@ -183,6 +152,8 @@ Before proposing any fix, ask: "Would removing this imperfection make it sound M ================================================================= ``` +Style edits must never change what the content says. When fixing "This solution robustly handles edge cases", write "This solution handles edge cases reliably" -- fix the style word, keep the technical meaning intact. If removing a flagged word would lose meaningful information, rephrase rather than delete. + **Step 2: Apply changes after confirmation** Use the Edit tool for each change. Verify each edit applied correctly. @@ -214,11 +185,11 @@ Meaning Preserved: Yes/No **Gate**: All verification steps pass. Edit is complete. 
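The skip-zone rule from Phase 1 (leave frontmatter, code blocks, inline code, and blockquotes untouched) can be sketched as a line filter that runs before any pattern scan. This is a minimal sketch, assuming YAML frontmatter sits at the very top of the file between `---` markers and that code fences are not nested. The names `scannable_lines` and `find_curly_quotes` are hypothetical and do not come from the skill.

```python
import re

# Hypothetical helpers for illustration; not part of the skill itself.
FENCE = "`" * 3  # three backticks
INLINE_CODE = re.compile(r"`[^`]*`")
CURLY_QUOTES = re.compile("[\u201c\u201d\u2018\u2019]")


def scannable_lines(text):
    """Yield (line_number, line) for lines outside the skip zones.

    Inline code spans are removed from a line rather than dropping the
    whole line, so the surrounding prose can still be scanned.
    """
    lines = text.splitlines()
    in_frontmatter = bool(lines) and lines[0].strip() == "---"
    in_code = False
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if in_frontmatter:
            # Line 1 is the opening marker; the next marker closes the block.
            if number > 1 and stripped == "---":
                in_frontmatter = False
            continue
        if stripped.startswith(FENCE):
            in_code = not in_code
            continue
        if in_code or stripped.startswith(">"):
            continue
        yield number, INLINE_CODE.sub("", line)


def find_curly_quotes(text):
    """Return line numbers where curly quotes appear in editable prose."""
    return [n for n, line in scannable_lines(text) if CURLY_QUOTES.search(line)]
```

Filtering first and scanning second means each detection category can reuse the same skip logic instead of re-checking zones per pattern.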
---- +## Reference Material -## Examples +### Examples -### Example 1: Blog Post (Heavy Editing) +#### Example 1: Blog Post (Heavy Editing) User says: "De-AI this blog post" Actions: 1. Read file, skip frontmatter, scan all categories (ASSESS) @@ -227,7 +198,7 @@ Actions: 4. Re-read, verify meaning preserved, no new AI patterns (VERIFY) Result: 67% shorter intro, all AI cliches removed, voice preserved -### Example 2: Technical Docs (Light Editing) +#### Example 2: Technical Docs (Light Editing) User says: "Check this for AI patterns" Actions: 1. Read file, identify technical context, scan for patterns (ASSESS) @@ -236,8 +207,6 @@ Actions: 4. Verify technical accuracy unchanged (VERIFY) Result: Clearer prose, same information, technical terms untouched ---- - ## Error Handling ### Error: "File Not Found" @@ -261,55 +230,8 @@ Solution: 2. Re-read file to verify YAML integrity 3. If corrupted, restore from git: `git checkout -- [file]` ---- - -## Anti-Patterns - -### Anti-Pattern 1: Changing Meaning While Fixing Style -**What it looks like**: Removing "edge cases" from "This solution robustly handles edge cases" -- losing meaningful technical information -**Why wrong**: Style edits must never change what the content says -**Do instead**: "This solution handles edge cases reliably" -- fix style, keep meaning - -### Anti-Pattern 2: Over-Correcting Natural Informal Language -**What it looks like**: Removing "So basically" from a casual blog post because it sounds informal -**Why wrong**: "So basically" is natural spoken rhythm. Blog posts can be conversational. -**Do instead**: Leave natural voice markers alone. Only remove AI-generated patterns. 
- -### Anti-Pattern 3: Ignoring Technical Context -**What it looks like**: Flagging "leverage" in "Use a lever to leverage mechanical advantage" -**Why wrong**: "Leverage" is technically correct when discussing actual mechanics -**Do instead**: Only flag words when used as corporate-speak, not in their literal or technical sense - -### Anti-Pattern 4: Wholesale Rewrites Instead of Targeted Edits -**What it looks like**: Completely rewriting a paragraph instead of fixing specific patterns -**Why wrong**: Loses author voice, may introduce new AI patterns, harder to review -**Do instead**: Make the minimum changes needed. Multiple small edits beat one big rewrite. - -### Anti-Pattern 5: Reporting Without Actionable Fixes -**What it looks like**: "Line 15: Contains AI-sounding language" with no specific fix -**Why wrong**: Useless feedback -- the user needs to know WHAT to change and HOW -**Do instead**: Show exact original text, exact replacement, and reason for the change - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Wabi-Sabi Authenticity](../shared-patterns/wabi-sabi-authenticity.md) - Preserves human imperfections - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "It's just a style word, keep it" | AI cliches are the most obvious tells | Check against cliche list, replace if matched | -| "Fixing this would lose the flow" | Flow from AI patterns is synthetic flow | Remove and let natural rhythm emerge | -| "Technical content needs formal language" | Formal does not mean AI-sounding | Keep technical terms, remove corporate-speak | -| "The author probably wrote it that way" | If 5+ AI patterns cluster, it's generated | Apply 
systematic editing regardless | -| "Minor issues, not worth fixing" | Minor issues accumulate into AI tells | Fix all detected patterns | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/cliche-replacements.md`: Complete list of 80+ AI phrases with replacements - `${CLAUDE_SKILL_DIR}/references/detection-patterns.md`: Regex patterns for automated detection - `${CLAUDE_SKILL_DIR}/references/detection-rules.md`: Inline detection rules and structural checks diff --git a/skills/batch-editor/SKILL.md b/skills/batch-editor/SKILL.md index 68445605..3388e75b 100644 --- a/skills/batch-editor/SKILL.md +++ b/skills/batch-editor/SKILL.md @@ -27,48 +27,7 @@ routing: # Batch Editor Skill -## Operator Context - -This skill operates as an operator for bulk content editing, configuring Claude's behavior for safe, reversible batch modifications across Hugo blog posts. It implements a **Preview-Confirm-Apply** pattern with mandatory git safety checks before any destructive operation. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files -- **Preview First**: ALWAYS show complete preview before applying any changes -- **Git Safety**: Check for uncommitted changes before any modification -- **Never Auto-Apply**: Require explicit user confirmation for destructive operations -- **Complete Output**: Show all affected files and changes, never summarize matches -- **Atomic Operations**: All files succeed or none are modified - -### Default Behaviors (ON unless disabled) -- **Preview Mode**: Show all matches without modifying files -- **Backup Reminder**: Suggest git commit/stash before applying changes -- **Case Sensitive**: Find/replace is case-sensitive by default -- **Content Scope**: Only process files in content/posts/ -- **Preserve Formatting**: Keep original frontmatter indentation, quotes, field order - -### Optional Behaviors (OFF unless enabled) -- **Case Insensitive**: Use --ignore-case for case-insensitive 
matching -- **Include Drafts**: Use --include-drafts to also process draft posts -- **Extended Scope**: Use --scope to process other content directories -- **Force Apply**: Use --force to skip git safety checks (dangerous) - -## What This Skill CAN Do -- Find and replace text (literal or regex) across multiple markdown files -- Add, modify, or remove frontmatter fields in bulk -- Batch update tags, categories, or other taxonomy arrays -- Standardize heading levels, link formats, and whitespace -- Preview all changes with line-level context before applying -- Count total matches and affected files with dry-run validation - -## What This Skill CANNOT Do -- Modify files outside content/ directory -- Skip the preview step (hardcoded safety) -- Undo changes without git (use git rollback) -- Modify files when git has uncommitted changes (unless --force) -- Process binary files or images -- Make external API calls - ---- +Safe, reversible bulk modifications across Hugo blog posts using a **Preview-Confirm-Apply** pattern. Supports find/replace (literal or regex), frontmatter field operations (add/modify/remove), and content transforms (headings, links, whitespace, quotes). All operations are scoped to `content/posts/*.md` by default and limited to markdown files -- binary files, images, and files outside `content/` are never touched. 
## Instructions @@ -86,14 +45,15 @@ This skill operates as an operator for bulk content editing, configuring Claude' **Common Options:** - `--dry-run` - Validate pattern, show matches, don't apply - `--apply` - Apply changes after preview confirmation -- `--ignore-case` - Case-insensitive matching -- `--include-drafts` - Also process draft posts -- `--scope ` - Process different content directory +- `--ignore-case` - Case-insensitive matching (default is case-sensitive) +- `--include-drafts` - Also process draft posts (excluded by default) +- `--scope ` - Process different content directory (default: `content/posts/`) - `--regex` - Enable regex mode for find-replace +- `--force` - Skip git safety checks (dangerous -- no rollback safety net) ### Phase 1: SAFETY CHECK -Before any batch operation, verify git status: +Before any batch operation, verify git status. Batch edits are irreversible without git, so this check exists to guarantee a rollback path. ```bash cd $HOME/your-project && git status --porcelain @@ -113,10 +73,12 @@ cd $HOME/your-project && git status --porcelain - List of modified files (if any) - Recommended action (commit, stash, or proceed) -**Gate**: Git status is clean OR user provides --force. Do not proceed without passing this gate. +**Gate**: Git status is clean OR user provides --force. Do not proceed without passing this gate. Even when a few files are involved, uncommitted changes make rollback unreliable -- always verify. ### Phase 2: SCAN AND PREVIEW +The preview is mandatory and cannot be skipped. Users must see every individual change before any file is modified, because batch patterns frequently produce false positives that only a human can catch. + **Step 1: Parse request** Extract from user request: @@ -143,6 +105,8 @@ For frontmatter operations, read each file and parse the YAML frontmatter block **Step 3: Generate preview** +Show every match individually with context. 
Never summarize as "N matches in M files" without showing each one -- users cannot verify correctness from a count alone. + For each match, show: - File path relative to repository root - Line number and surrounding context @@ -165,11 +129,13 @@ content/posts/example.md: ~ tags: ["a"] -> ["a","b"] (modify) ``` -**Gate**: Preview displayed with all matches visible. User must see every individual change. Never summarize as "N matches in M files" without showing each one. +**Gate**: Preview displayed with all matches visible. User must see every individual change. ### Phase 3: APPLY (on explicit confirmation only) -Only proceed when user explicitly confirms with `--apply` or clear affirmative. +Only proceed when user explicitly confirms with `--apply` or clear affirmative. Never auto-apply -- the user must opt in to every destructive operation. + +All changes are atomic: validate that every target file is writable before modifying any of them. If any file would fail (permissions, disk space), abort the entire operation rather than leaving the repository in a partially edited state. **For find-replace:** 1. Read each file with matches @@ -180,7 +146,7 @@ Only proceed when user explicitly confirms with `--apply` or clear affirmative. **For frontmatter add:** 1. Read file, parse frontmatter (YAML --- delimiters) 2. Insert new field before closing `---` -3. Preserve original formatting (indentation, quote style, field order) +3. Preserve original formatting -- keep indentation, quote style, and field order intact. Modifying only the target field produces clean git diffs and avoids breaking parsers. 4. Write modified content **For frontmatter modify:** @@ -224,11 +190,9 @@ The `transform` operation supports these built-in transforms: For custom transforms, use `find-replace --regex` with user-provided patterns. See `references/regex-patterns.md` for tested patterns. 
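The all-or-nothing rule in Phase 3 can be sketched as a preflight check followed by a write loop. This is a minimal sketch under stated assumptions: `apply_atomically` and its `edits` mapping are hypothetical, `os.access` is only a best-effort writability probe, and a failure during the write loop itself (a full disk, for example) would still need a git rollback.

```python
import os


def apply_atomically(edits):
    """Write every file in `edits` or none of them.

    `edits` maps a file path to the full new content for that file.
    Hypothetical helper: raises PermissionError before touching any
    file if a target is missing or not writable.
    """
    blocked = [
        path for path in edits
        if not (os.path.isfile(path) and os.access(path, os.W_OK))
    ]
    if blocked:
        raise PermissionError("aborting, not writable: " + ", ".join(sorted(blocked)))

    # Preflight passed for every target, so start writing.
    for path, content in edits.items():
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(content)
    return sorted(edits)
```

Checking every target before the first write is what keeps a permissions problem on one file from leaving the other files partially edited.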
---- +### Examples -## Examples - -### Example 1: Simple Find/Replace +**Example 1: Simple Find/Replace** User says: "Replace Hugo with Hugo SSG across all posts" Actions: 1. Check git status -- clean, proceed (SAFETY CHECK) @@ -238,7 +202,7 @@ Actions: 5. Apply replacements, show per-file summary with rollback command (APPLY + VERIFY) Result: All occurrences replaced, rollback instructions provided -### Example 2: Add Frontmatter Field +**Example 2: Add Frontmatter Field** User says: "Add author field to all posts that don't have one" Actions: 1. Check git status -- clean, proceed (SAFETY CHECK) @@ -249,7 +213,7 @@ Actions: 6. Report: 4 files modified, 2 skipped (already had author) (VERIFY) Result: Field added to posts missing it, existing posts unchanged -### Example 3: Content Transform +**Example 3: Content Transform** User says: "Demote all H1 headings to H2" Actions: 1. Check git status -- clean, proceed (SAFETY CHECK) @@ -260,7 +224,7 @@ Actions: 6. Suggest `hugo --quiet` to verify no build issues (VERIFY) Result: All H1 headings demoted, H2+ unchanged -### Example 4: Regex with Dry Run +**Example 4: Regex with Dry Run** User says: "Show me all date formats in posts but don't change anything" Actions: 1. Check git status (SAFETY CHECK) @@ -271,8 +235,6 @@ Result: User sees all date occurrences, can decide on follow-up action See `references/examples.md` for full output format templates with banner formatting. ---- - ## Error Handling ### Error: "No Matches Found" @@ -305,53 +267,8 @@ Solution: 3. Fix permissions: `chmod 644 content/posts/*.md` 4. Retry operation ---- - -## Anti-Patterns - -### Anti-Pattern 1: Applying Without Preview -**What it looks like**: Immediately modifying files without showing what will change -**Why wrong**: Batch operations can cause widespread damage. User loses ability to catch mistakes. -**Do instead**: ALWAYS show complete preview first. Never modify files until user explicitly confirms. 
- -### Anti-Pattern 2: Summarizing Instead of Showing -**What it looks like**: "Found 47 matches across 12 files. Apply changes?" -**Why wrong**: User cannot verify each change is correct. Some matches may be false positives. -**Do instead**: Show every match with line-level before/after context. - -### Anti-Pattern 3: Ignoring Git State -**What it looks like**: Proceeding with batch edit when git has uncommitted changes -**Why wrong**: User may lose work. Rollback becomes complicated with mixed changes. -**Do instead**: Always check git status first. Block if uncommitted changes exist (unless --force). - -### Anti-Pattern 4: Destroying Frontmatter Format -**What it looks like**: Rewriting entire frontmatter block when modifying a single field -**Why wrong**: Creates noisy git diffs, may break parsers, loses author's preferred formatting. -**Do instead**: Modify only the target field. Preserve indentation, quote style, and field order. - -### Anti-Pattern 5: Non-Atomic Application -**What it looks like**: Applying changes to some files, then failing on others mid-operation -**Why wrong**: Leaves repository in inconsistent state with partial edits. -**Do instead**: Validate all files are writable before applying any changes. All or nothing. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Only a few files, no need to preview" | Few files can still have false positives | Show complete preview | -| "Pattern is simple, regex won't over-match" | Simple patterns match unexpected content | Test with grep first | -| "Git is clean, no need to check" | Status could have changed since last check | Always verify | -| "User said apply, skip the preview" | User may not realize scope of changes | Preview is hardcoded, never skip | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/operation-types.md`: Detailed operation syntax and options - `${CLAUDE_SKILL_DIR}/references/regex-patterns.md`: Common regex patterns for Hugo content - `${CLAUDE_SKILL_DIR}/references/safety-checklist.md`: Pre-edit validation steps and rollback procedures diff --git a/skills/blog-post-writer/SKILL.md b/skills/blog-post-writer/SKILL.md index 0f093fd1..b90d3a9a 100644 --- a/skills/blog-post-writer/SKILL.md +++ b/skills/blog-post-writer/SKILL.md @@ -26,45 +26,7 @@ routing: # Blog Post Writer Skill -## Operator Context - -This skill operates as an operator for blog post creation, configuring Claude's behavior for structured, voice-consistent content generation. It implements the **Pipeline** architectural pattern -- Assess, Decide, Draft, Preview -- with **Voice Integration** via separate voice skills for stylistic patterns. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before writing -- **Over-Engineering Prevention**: Write the post requested. 
No "bonus sections", no unsolicited additions, no extra content types -- **Banned Words Enforcement**: NEVER use words from `references/banned-words.md`. Scan every draft before finalizing -- **Voice Compliance**: Follow the specified voice skill's patterns exactly. Do not blend voices -- **Hugo Format**: All posts use proper YAML frontmatter with correct syntax -- **Em-Dash Prohibition**: NEVER use em-dashes. Use commas, periods, or sentence restructuring instead - -### Default Behaviors (ON unless disabled) -- **Voice Required**: Must specify a voice skill (user must configure a default voice or specify one) -- **Preview Before Write**: Display full draft for approval before writing to file -- **Post-Draft Banned Word Scan**: Verify zero banned words before finalizing -- **Structure Template**: Use appropriate template from `references/structure-templates.md` -- **Specific Numbers**: Include concrete numbers for all claims, not vague adjectives - -### Optional Behaviors (OFF unless enabled) -- **Direct Write Mode**: Skip preview and write directly to file -- **Outline Only**: Generate structure without full draft -- **Multiple Variants**: Generate 2-3 opening paragraphs for selection - -## What This Skill CAN Do -- Write complete blog posts in any defined voice profile -- Apply voice-specific patterns (metaphors, rhythm, structure, tone) -- Generate proper Hugo frontmatter with correct YAML syntax -- Coordinate with voice skills for stylistic consistency -- Revise drafts based on user feedback - -## What This Skill CANNOT Do -- Use banned words under any circumstances (see `references/banned-words.md`) -- Write without a voice specification (redirect: specify `--voice` or accept default) -- Add sections not requested by user (redirect: ask user before adding) -- Create or modify voice profiles (redirect: use voice skill creation workflow) -- Skip the banned word verification step - ---- +Voice-integrated blog post creation using a 4-phase pipeline: Assess, 
Decide, Draft, Preview. Each phase has a gate that must pass before proceeding. ## Instructions @@ -72,7 +34,9 @@ This skill operates as an operator for blog post creation, configuring Claude's **Goal**: Understand the topic, select voice, and classify content type before writing. -**Step 1: Analyze the topic** +**Step 1: Read repository CLAUDE.md** to load any project-specific writing rules or conventions before proceeding. + +**Step 2: Analyze the topic** ```markdown ## Assessment @@ -82,20 +46,20 @@ This skill operates as an operator for blog post creation, configuring Claude's - Estimated length: [short 500-800 / medium 1000-1500 / long 2000+] ``` -**Step 2: Select voice** +**Step 3: Select voice** -Load the specified voice skill. If none specified, ask the user which voice to use. +Load the specified voice skill. If none specified, ask the user which voice to use -- a voice must be selected before any writing begins because retrofitting voice patterns onto an existing draft produces inconsistent results. -Available voices can be found at `skills/voice-*/SKILL.md`. Create new voices with `/create-voice`. +Available voices can be found at `skills/voice-*/SKILL.md`. Create new voices with `/create-voice`. This skill does not create or modify voice profiles; redirect to the voice skill creation workflow if needed. -**Step 3: Classify content type** +**Step 4: Classify content type** Choose from `references/structure-templates.md`: - **Problem-Solution**: Bug fix, debugging session, resolution - **Technical Explainer**: Concept, technology, how it works - **Walkthrough**: Step-by-step instructions for a task -**Gate**: Topic analyzed, voice loaded, content type selected. Proceed only when gate passes. +**Gate**: Topic analyzed, voice loaded, content type selected. Proceed only when all three are confirmed. ### Phase 2: DECIDE @@ -103,7 +67,7 @@ Choose from `references/structure-templates.md`: **Step 1: Plan opening** -Read opening patterns from voice skill. 
Select the pattern that fits the topic. +Read opening patterns from the loaded voice skill. Select the pattern that fits the topic. Use exactly one voice profile per post -- mixing patterns from different voices creates inconsistent, inauthentic output. ```markdown ## Plan @@ -111,7 +75,7 @@ Read opening patterns from voice skill. Select the pattern that fits the topic. - Draft opening: [first sentence or question] ``` -**Step 2: Plan extended metaphor** (if voice uses them) +**Step 2: Plan extended metaphor** (if the loaded voice uses them) ```markdown - Core metaphor: [conceptual lens] @@ -120,6 +84,8 @@ Read opening patterns from voice skill. Select the pattern that fits the topic. **Step 3: Plan sections** (3-7 sections) +Plan only sections the user requested. Do not add "Future Implications", "Related Topics", or other unsolicited sections -- write what was asked for and nothing more. If you think an additional section would help, ask the user before including it. + ```markdown - Sections: 1. [Section name]: [purpose] @@ -138,6 +104,8 @@ Read closing patterns from voice skill. 
Select pattern and identify callback ele **Step 5: Draft frontmatter** +All posts use Hugo YAML frontmatter with correct syntax: + ```yaml --- title: "Post Title Here" @@ -164,7 +132,7 @@ summary: "One sentence description for list views" - Apply sentence rhythm from voice skill - Develop extended metaphors if voice uses them - Use second-person address if voice uses it -- Include specific numbers for all claims +- Include specific numbers for all claims -- use concrete data, not vague adjectives like "significant" or "many" **Step 3: Write closing** - Apply closing pattern from voice skill @@ -172,21 +140,24 @@ summary: "One sentence description for list views" - Build to emotional crescendo if appropriate for voice **Step 4: Banned word scan** -- Scan entire draft against `references/banned-words.md` -- If ANY banned word found: rewrite the affected sentence immediately + +Run a systematic scan of the entire draft against `references/banned-words.md`. Visual scanning is insufficient because banned words hide in context -- the full list must be checked programmatically. + +- If ANY banned word is found: rewrite the affected sentence immediately using alternatives from the reference - Re-scan until zero violations +- Never suppress or skip a detection -**Step 5: Voice verification** -- Check patterns against voice skill requirements +**Step 5: Voice and formatting verification** +- Check patterns against voice skill requirements; "close enough" is not passing -- re-read the voice skill and verify each pattern explicitly - Verify sentence rhythm matches voice - Confirm opening and closing styles match voice -- Verify zero em-dashes in entire draft +- Verify zero em-dashes in entire draft -- em-dashes are never acceptable; rewrite with commas, periods, or sentence restructuring -**Gate**: Draft complete, zero banned words, voice patterns verified. Proceed only when gate passes. +**Gate**: Draft complete, zero banned words, voice patterns verified, zero em-dashes. 
Proceed only when gate passes. ### Phase 4: PREVIEW -**Goal**: Display full draft for user approval before writing to file. +**Goal**: Display full draft for user approval before writing to file. Skipping preview loses the user's opportunity to request changes, and rewrites after file creation are costlier than previews. (Skip this phase only if the user explicitly enables Direct Write Mode.) **Step 1: Present draft** @@ -194,6 +165,8 @@ Display the complete post with frontmatter. Show target file path. **Step 2: Show compliance report** +Verify that all planned sections are present -- length alone does not indicate completeness. + ```markdown ## Voice Compliance - Opening pattern: [pattern name] - PASS/FAIL @@ -212,8 +185,6 @@ Wait for user confirmation before writing. If user requests changes, return to t **Gate**: User approves draft. Write to file. Task complete. ---- - ## Error Handling ### Error: "No voice specified" @@ -245,55 +216,7 @@ Solution: 2. Suggest 2-3 specific angles derived from the topic 3. Proceed once user selects a narrower focus ---- - -## Anti-Patterns - -### Anti-Pattern 1: Writing Without Voice Skill Loaded -**What it looks like**: Starting to draft before reading the voice skill's patterns -**Why wrong**: Post will not match the voice profile. Retrofitting voice is harder than starting with it. -**Do instead**: Complete Phase 1 fully. Load and read the voice skill before writing any content. - -### Anti-Pattern 2: Ignoring Banned Word Scan -**What it looks like**: "The draft reads well, no need to scan for banned words" -**Why wrong**: Banned words are AI fingerprints. A single occurrence undermines authenticity. -**Do instead**: Run banned word scan on every draft. Zero tolerance. - -### Anti-Pattern 3: Adding Unsolicited Sections -**What it looks like**: Adding "Future Implications" or "Related Topics" sections the user did not request -**Why wrong**: Over-engineering the content. User asked for a specific post, not a content hub. 
-**Do instead**: Write exactly what was requested. Ask before adding anything extra. - -### Anti-Pattern 4: Skipping Preview Phase -**What it looks like**: Writing directly to file without showing the draft first -**Why wrong**: User loses the opportunity to review and request changes. Rewrites are costlier than previews. -**Do instead**: Always show full draft with compliance report unless Direct Write Mode is enabled. - -### Anti-Pattern 5: Blending Voice Patterns -**What it looks like**: Mixing one voice's extended metaphors with another voice's community warmth in one post -**Why wrong**: Each voice has distinct patterns. Mixing creates an inconsistent, inauthentic voice. -**Do instead**: Use exactly one voice profile per post. Follow that voice skill's patterns exclusively. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Wabi-Sabi Authenticity](../shared-patterns/wabi-sabi-authenticity.md) - Natural imperfections over synthetic perfection -- [Voice-First Writing](../shared-patterns/voice-first-writing.md) - Voice-driven content patterns - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "No banned words jumped out at me" | Visual scan misses words in context | Run systematic scan against full banned list | -| "Close enough to the voice" | Close ≠ matching the voice profile | Re-read voice skill, verify each pattern | -| "The post is already long enough" | Length ≠ completeness | Check all planned sections are present | -| "Em-dash here reads better" | Em-dashes are absolutely forbidden | Rewrite with comma, period, or restructure | -| "One extra section adds value" | User did not request it | Write what was asked, nothing more 
| - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/banned-words.md`: Words and phrases that signal AI-generated content - `${CLAUDE_SKILL_DIR}/references/structure-templates.md`: Templates for Problem-Solution, Technical Explainer, and Walkthrough content types diff --git a/skills/branch-naming/SKILL.md b/skills/branch-naming/SKILL.md index e9f38d7a..bb4d4348 100644 --- a/skills/branch-naming/SKILL.md +++ b/skills/branch-naming/SKILL.md @@ -25,76 +25,30 @@ routing: # Branch Naming Skill -## Operator Context - -This skill operates as an operator for Git branch naming workflows, configuring Claude's behavior for deterministic branch name generation and validation with conventional commit integration. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Deterministic Naming**: Type-to-prefix mapping, kebab-case sanitization, 50-char limit -- **Character Whitelist**: Only a-z, 0-9, hyphens in subject; forward slash only in prefix -- **Over-Engineering Prevention**: Only generate/validate names. 
No branch creation, deletion, or management -- **Reproduce-First Validation**: Always validate generated names before presenting to user -- **No Speculative Features**: No branch templates, cleanup tools, or management beyond naming - -### Default Behaviors (ON unless disabled) -- **Interactive Confirmation**: Show generated name and ask for confirmation before use -- **Conventional Commit Inference**: Detect commit type from message and map to prefix -- **Duplicate Detection**: Check if proposed name already exists locally or remotely -- **Suggestion Alternatives**: If name exists, suggest alternatives (-v2, -alt, timestamp) -- **Sanitization Pipeline**: Lowercase, hyphenate, strip special chars, collapse hyphens -- **Intelligent Truncation**: Remove filler words and abbreviate to fit length limit - -### Optional Behaviors (OFF unless enabled) -- **Auto-Accept**: Skip confirmation for automated/scripted workflows -- **Custom Prefix Rules**: Override default type-to-prefix mapping via .branch-naming.json -- **Allow Long Names**: Bypass 50-char limit for exceptional cases - -## What This Skill CAN Do -- Parse conventional commit messages to extract type and subject -- Map commit types to branch prefixes (feat -> feature/, fix -> fix/, etc.) -- Sanitize text to kebab-case following the 7-step pipeline -- Validate branch name format (prefix, length, characters, duplicates) -- Generate alternative names when duplicates exist - -## What This Skill CANNOT Do -- Create, delete, or manage Git branches (use git directly) -- Enforce GitHub branch protection rules (GitHub settings) -- Resolve naming conflicts between competing conventions (human judgment) -- Auto-rename existing branches (risks breaking active work) - ---- +Generate and validate Git branch names from conventional commit messages or plain descriptions. This skill only handles naming -- it does not create, delete, or manage branches. 
## Instructions -### Phase 1: ANALYZE - -**Goal**: Parse input and determine commit type and subject. +### Step 1: Parse Input -**Step 1: Parse input** +Determine the commit type and subject from whatever the user provides. -If conventional commit message provided (e.g., `feat: add user auth`): +**If a conventional commit message** (e.g., `feat: add user auth`): - Extract type, optional scope, and subject - Pattern: `[optional scope]: ` -If plain description provided (e.g., `add user authentication`): +**If a plain description** (e.g., `add user authentication`): - Infer type from keywords (see `references/type-mapping.md` for full mapping) - Keywords: add/implement/create -> feat, fix/resolve/correct -> fix, document/readme -> docs, refactor/restructure -> refactor, test/spec -> test, remove/delete/update -> chore -- Default if no keywords match: feat +- Default to `feat` when no keywords match -**Step 2: Validate input content** -- Strip banned characters (emojis, special chars) -- If input is too vague to determine type, prompt user for specifics +Strip banned characters (emojis, special chars) from the input. If the input is too vague to determine a type (e.g., "stuff", "things"), prompt the user for a more descriptive input starting with an action verb. -**Gate**: Commit type identified and subject extracted. If FAIL, prompt user for clarification. +**Gate**: Commit type identified and subject extracted. If not, prompt for clarification before continuing. -### Phase 2: GENERATE +### Step 2: Generate Branch Name -**Goal**: Produce a valid branch name from analyzed input. 
- -**Step 1: Map type to prefix** - -Standard mapping (see `references/type-mapping.md` for details): +**Map type to prefix** using the standard table (or `.branch-naming.json` overrides if present in the repo root): | Type | Prefix | |------|--------| @@ -110,47 +64,47 @@ Standard mapping (see `references/type-mapping.md` for details): | ci | ci/ | | revert | revert/ | -Check for `.branch-naming.json` in repository root for custom overrides. - -**Step 2: Sanitize subject to kebab-case** +Every branch name must have a prefix from this list -- unprefixed names like `add-user-authentication` break CI/CD automation and make filtering impossible. -Apply the 7-step sanitization pipeline (see `references/sanitization-rules.md`): +**Sanitize the subject to kebab-case** using the 7-step pipeline (see `references/sanitization-rules.md`): 1. Lowercase 2. Strip leading/trailing whitespace 3. Replace spaces with hyphens -4. Replace underscores with hyphens +4. Replace underscores with hyphens (underscores violate kebab-case convention and create inconsistency with conventional commits) 5. Remove special characters (keep only a-z, 0-9, hyphens) 6. Collapse multiple consecutive hyphens 7. Remove leading/trailing hyphens -**Step 3: Apply length limits** +Only a-z, 0-9, and hyphens are allowed in the subject. The forward slash appears only once, separating prefix from subject. -Total branch name (prefix + subject) must be 50 characters or fewer. If exceeded: +**Apply the 50-character length limit** (prefix + subject combined). If exceeded: 1. Remove filler words (the, a, with, and, for, etc.) 2. Apply common abbreviations (authentication -> auth, configuration -> config) -3. Truncate at word boundaries (never cut mid-word) +3. Truncate at word boundaries -- never cut mid-word -**Step 4: Combine prefix + sanitized subject** +Long names signal scope creep; move detail to the commit message body rather than cramming it into the branch name. 
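The sanitization and length-limit steps above can be sketched in Python as follows. This is a minimal illustration, not the skill's actual script: the function names are hypothetical, the prefix table is only the subset visible in this section, and the filler-word and abbreviation lists contain just the examples named in the text (the full lists are assumed to live in `references/sanitization-rules.md`).

```python
import re

# Assumed subset of the type-to-prefix table; see references/type-mapping.md.
PREFIXES = {"feat": "feature/", "fix": "fix/", "ci": "ci/", "revert": "revert/"}
# Only the examples named in the text; the real lists are assumed to be longer.
FILLER_WORDS = {"the", "a", "with", "and", "for"}
ABBREVIATIONS = {"authentication": "auth", "configuration": "config"}
MAX_LENGTH = 50  # prefix + subject combined


def sanitize_subject(text: str) -> str:
    """Apply the 7-step kebab-case pipeline to a free-form subject."""
    text = text.lower()                      # 1. lowercase
    text = text.strip()                      # 2. strip leading/trailing whitespace
    text = re.sub(r"\s", "-", text)          # 3. spaces -> hyphens
    text = text.replace("_", "-")            # 4. underscores -> hyphens
    text = re.sub(r"[^a-z0-9-]", "", text)   # 5. keep only a-z, 0-9, hyphens
    text = re.sub(r"-{2,}", "-", text)       # 6. collapse consecutive hyphens
    return text.strip("-")                   # 7. drop leading/trailing hyphens


def build_branch_name(commit_type: str, subject: str) -> str:
    """Combine prefix and sanitized subject, shortening only if over the limit."""
    prefix = PREFIXES[commit_type]
    slug = sanitize_subject(subject)
    if len(prefix + slug) <= MAX_LENGTH:
        return prefix + slug
    # Over the limit: drop filler words, then abbreviate known long words.
    words = [w for w in slug.split("-") if w not in FILLER_WORDS]
    words = [ABBREVIATIONS.get(w, w) for w in words]
    # Still too long: drop whole trailing words so no word is cut in half.
    while words and len(prefix + "-".join(words)) > MAX_LENGTH:
        words.pop()
    return prefix + "-".join(words)
```

With these assumptions, `build_branch_name("feat", "add user authentication")` yields `feature/add-user-authentication`. Because the filler list here is deliberately short, longer inputs may shorten differently than the worked examples elsewhere in this skill.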
+ +**Combine prefix + sanitized subject**: Example: `feat: add user authentication` -> `feature/add-user-authentication` -**Gate**: Valid branch name generated (correct prefix, kebab-case, within length limit, allowed characters only). +**Gate**: Generated name has a valid prefix, uses kebab-case, stays within the length limit, and contains only allowed characters. -### Phase 3: VALIDATE +### Step 3: Validate -**Goal**: Confirm generated name meets all requirements. +Run all checks against the generated (or user-provided) name: -**Step 1: Format validation** -- Has valid prefix from allowed list -- Subject is kebab-case +**Format validation**: +- Has a valid prefix from the allowed list +- Subject is kebab-case (no uppercase, no underscores) - Only allowed characters (a-z, 0-9, hyphens, one forward slash) - No leading/trailing hyphens in subject - No consecutive hyphens +- Name is specific enough to convey purpose (`feature/updates` or `fix/stuff` are too vague) -**Step 2: Length check** -- Total length is 50 characters or fewer +**Length check**: Total length is 50 characters or fewer. -**Step 3: Duplicate detection** +**Duplicate detection**: ```bash # Check local @@ -160,22 +114,18 @@ git branch --list "" git ls-remote --heads origin "" ``` -If duplicate found, generate alternatives: +If a duplicate is found, generate alternatives: 1. Append `-v2`, `-v3` for versioning 2. Append date `-YYYYMMDD` for uniqueness -3. Ask user for custom suffix - -**Step 4: Repository convention compliance** - -Check `.branch-naming.json` if present for custom prefix restrictions. +3. Ask user for a custom suffix -**Gate**: All validation checks pass. If FAIL, regenerate with adjustments or present alternatives. +**Repository convention compliance**: Check `.branch-naming.json` if present for custom prefix restrictions. -### Phase 4: CONFIRM +**Gate**: All validation checks pass. If any fail, regenerate with adjustments or present alternatives. 
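The format and length checks above could be expressed as a small validator. The sketch below is hypothetical: `validate_branch_name` and its messages are illustrative names, the prefix tuple is only the subset visible in this section, and duplicate detection and the "specific enough" check are left out because they need git access and human judgment.

```python
import re

# Assumed subset of allowed prefixes; the full list is in the mapping table.
ALLOWED_PREFIXES = ("feature/", "fix/", "ci/", "revert/")
MAX_LENGTH = 50
# Kebab-case: lowercase alphanumeric words joined by single hyphens,
# which also rules out leading, trailing, and consecutive hyphens.
SUBJECT_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def validate_branch_name(name: str) -> list[str]:
    """Return a list of problems; an empty list means the name passes."""
    problems = []
    prefix = next((p for p in ALLOWED_PREFIXES if name.startswith(p)), None)
    if prefix is None:
        problems.append("missing or unknown prefix")
        subject = name
    else:
        subject = name[len(prefix):]
    if not SUBJECT_RE.fullmatch(subject):
        problems.append("subject is not kebab-case (a-z, 0-9, single hyphens)")
    if len(name) > MAX_LENGTH:
        problems.append(f"name is {len(name)} chars, limit is {MAX_LENGTH}")
    return problems
```

Returning a list rather than raising on the first failure lets the skill report every issue at once, for example both the uppercase letters and the underscore in `feature/User_Authentication`.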
-**Goal**: Present validated name and get user approval. +### Step 4: Confirm -**Step 1: Display result** +Present the validated name and wait for user approval before proceeding: ``` Generated Branch Name: feature/add-user-authentication @@ -187,39 +137,36 @@ Generated Branch Name: feature/add-user-authentication Use this branch name? [Y/n] ``` -**Step 2: Handle response** -- **Yes**: Output final name with git checkout command -- **No**: Return to Phase 1 with new input -- **Customize**: User provides custom name, run through Phase 3 validation +Handle the response: +- **Yes**: Output the final name with a `git checkout -b` command +- **No**: Return to Step 1 with new input +- **Customize**: User provides a custom name; run it through Step 3 validation -**Gate**: User approved name. Workflow complete. +Skip confirmation only in automated/scripted workflows where the caller has explicitly opted into auto-accept. ---- +**Gate**: User approved name. Workflow complete. -## Examples +### Examples -### Example 1: From Conventional Commit +**From a conventional commit**: Input: `feat: add user authentication` -Actions: -1. Parse: type=feat, subject="add user authentication" (ANALYZE) -2. Map feat -> feature/, sanitize -> "add-user-authentication" (GENERATE) -3. Validate format, length (31 chars), no duplicates (VALIDATE) -4. Present and confirm (CONFIRM) +1. Parse: type=feat, subject="add user authentication" +2. Map feat -> feature/, sanitize -> "add-user-authentication" +3. Validate format, length (31 chars), no duplicates +4. Present and confirm Result: `feature/add-user-authentication` -### Example 2: From Plain Description with Truncation +**From a plain description with truncation**: Input: `add comprehensive user authentication system with OAuth2 and JWT` -Actions: -1. Infer type=feat from "add" keyword (ANALYZE) -2. Sanitize, remove fillers, abbreviate auth -> 32 chars (GENERATE) -3. Validate all checks pass (VALIDATE) -4. Present and confirm (CONFIRM) +1. 
Infer type=feat from "add" keyword +2. Sanitize, remove fillers, abbreviate auth -> 32 chars +3. Validate all checks pass +4. Present and confirm Result: `feature/add-user-auth-oauth2-jwt` -### Example 3: Validation of Existing Branch +**Validating an existing branch name**: Input: `feature/User_Authentication` -Actions: -1. Detect uppercase letters and underscores (VALIDATE) +1. Detect uppercase letters and underscores 2. Report issues with corrections Result: Suggest `feature/user-authentication` @@ -250,50 +197,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Underscores Instead of Hyphens -**What it looks like**: `feature/add_user_auth` -**Why wrong**: Violates kebab-case convention, inconsistent with conventional commits -**Do instead**: Use hyphens: `feature/add-user-auth` - -### Anti-Pattern 2: Vague Branch Names -**What it looks like**: `feature/updates`, `fix/stuff`, `feature/branch-1` -**Why wrong**: Impossible to understand purpose, hard to track, likely duplicates -**Do instead**: Be specific: `feature/add-oauth2-login`, `fix/login-timeout-30s` - -### Anti-Pattern 3: Missing Branch Prefix -**What it looks like**: `add-user-authentication` (no prefix) -**Why wrong**: No type indication, breaks CI/CD automation, inconsistent filtering -**Do instead**: Always include conventional commit prefix: `feature/add-user-authentication` - -### Anti-Pattern 4: Overly Long Names -**What it looks like**: `feature/add-comprehensive-user-authentication-system-with-oauth2-jwt-and-session-management` (95 chars) -**Why wrong**: Exceeds 50-char limit, hard to read, indicates scope too large -**Do instead**: Abbreviate and move details to commit body: `feature/add-user-auth-oauth2-jwt` - -### Anti-Pattern 5: Mixing Naming Conventions -**What it looks like**: Repository has `feat/`, `feature/`, `bugfix/`, `fix/`, `Feature/` branches -**Why wrong**: No standard, hard to filter, CI/CD rules only match some patterns -**Do instead**: Enforce one convention via 
`.branch-naming.json` and this skill - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Any name is fine" | Inconsistent names break automation and readability | Use skill to generate compliant name | -| "I'll fix the name later" | Branch renames disrupt active work and PRs | Name correctly from the start | -| "50 chars is too restrictive" | Long names indicate scope creep | Abbreviate; move detail to commit body | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/type-mapping.md`: Conventional commit type to branch prefix mapping - `${CLAUDE_SKILL_DIR}/references/naming-conventions.md`: Branch format rules, character whitelist, examples - `${CLAUDE_SKILL_DIR}/references/sanitization-rules.md`: 7-step text sanitization pipeline and truncation strategies diff --git a/skills/code-cleanup/SKILL.md b/skills/code-cleanup/SKILL.md index ba442b3c..8f407002 100644 --- a/skills/code-cleanup/SKILL.md +++ b/skills/code-cleanup/SKILL.md @@ -41,78 +41,15 @@ routing: # Code Cleanup Skill -## Operator Context - -This skill operates as an operator for systematic code quality scanning, configuring Claude's behavior for focused detection and prioritized reporting of technical debt. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution -- **Over-Engineering Prevention**: Only scan for requested issue types or smart defaults. Do not build elaborate reporting systems, dashboards, or speculative features -- **Report Before Fix**: NEVER modify files without explicit user permission. 
Default mode is read-only scan and report -- **Exact References**: Report all findings with exact file:line references. Never summarize away specifics -- **Exclude Non-Source**: Always exclude vendor/, node_modules/, .venv/, build/, dist/, generated/ directories -- **Prioritized Output**: Rank findings by impact/effort ratio. Never present a flat unsorted list - -### Default Behaviors (ON unless disabled) -- **Focused Scope**: Ask user for target directories when request is vague rather than scanning everything -- **Context Lines**: Show 3 lines of context around each finding for quick comprehension -- **TODO Age Triage**: Use git blame to age TODOs and categorize by staleness (Critical >90d, High 30-90d, Normal <30d) -- **Quick Wins First**: Present auto-fixable issues (unused imports, formatting) before manual-effort items -- **Effort Estimates**: Include time estimates per category so user can plan -- **Tool Availability Check**: Verify required analysis tools exist before scanning; report missing tools with install commands -- **Temporary File Cleanup**: Remove intermediate scan outputs at completion, keep only final report - -### Optional Behaviors (OFF unless enabled) -- **Auto-Fix**: Apply safe, deterministic fixes (requires --fix flag or explicit permission) -- **Direct File Modification**: Write changes to source files (only with explicit permission) -- **Tool Auto-Install**: Install missing analysis tools automatically -- **Commit Fixes**: Stage and commit applied fixes - -## What This Skill CAN Do -- Scan repositories for 9 categories of technical debt (TODOs, unused imports, dead code, missing type hints, deprecated functions, naming inconsistencies, high complexity, duplicate code, missing docstrings) -- Prioritize findings by impact/effort ratio with time estimates -- Generate structured markdown reports with exact file:line references -- Apply safe auto-fixes when user grants permission (unused imports, formatting, import sorting) -- Triage TODO 
comments by age using git history - -## What This Skill CANNOT Do -- Fix bugs or investigate failures (use systematic-debugging) -- Implement new features (use test-driven-development) -- Perform architectural review (use domain-specific agent) -- Run security audits (use security scanning tools) -- Modify files without explicit user permission -- Skip the prioritization step +Scan repositories for 9 categories of technical debt (TODOs, unused imports, dead code, missing type hints, deprecated functions, naming inconsistencies, high complexity, duplicate code, missing docstrings), prioritize findings by impact/effort ratio with time estimates, and generate structured markdown reports with exact file:line references. Can apply safe auto-fixes when the user grants explicit permission. ---- +### Examples + +**Focused cleanup** -- User says "Clean up the API handlers in src/api/". Read project config, scan src/api/ for all 9 categories, prioritize (5 unused imports auto-fixable, 2 stale TODOs >90d, 1 high-complexity function), present tiered report with auto-fix commands. + +**Broad debt scan** -- User says "What's the state of technical debt in this repo?". Identify languages and source directories, run all applicable scans, group 47 findings into Quick Wins (12), Important (8), Polish (27), generate full report with effort estimates: 2h quick wins, 6h important, 4h polish. -## Examples - -### Example 1: Focused Cleanup Request -User says: "Clean up the API handlers in src/api/" -Actions: -1. Read CLAUDE.md and project config (SCOPE) -2. Scan src/api/ only for all 9 categories (SCAN) -3. Prioritize: 5 unused imports (auto-fixable), 2 stale TODOs (>90d), 1 high-complexity function (PRIORITIZE) -4. Present tiered report with auto-fix commands (REPORT) -Result: User gets focused, actionable report for their working area - -### Example 2: Broad Technical Debt Scan -User says: "What's the state of technical debt in this repo?" -Actions: -1. 
Identify languages and source directories, exclude vendor/generated (SCOPE) -2. Run all applicable scans across source directories (SCAN) -3. Group 47 findings into Quick Wins (12), Important (8), Polish (27) (PRIORITIZE) -4. Generate full report with effort estimates: 2h quick wins, 6h important, 4h polish (REPORT) -Result: User has clear picture of debt and time investment needed - -### Example 3: Auto-Fix Request -User says: "Fix all the unused imports and sort them" -Actions: -1. Verify ruff/goimports available (SCOPE) -2. Scan for F401 and I001 violations only (SCAN) -3. Report findings: 23 unused imports across 8 files (PRIORITIZE + REPORT) -4. User confirms: apply fixes, run tests, show diff (FIX) -Result: Clean imports with verified test suite +**Auto-fix request** -- User says "Fix all the unused imports and sort them". Verify ruff/goimports available, scan for F401 and I001 violations only, report 23 unused imports across 8 files, user confirms, apply fixes, run tests, show diff. --- @@ -123,19 +60,19 @@ Result: Clean imports with verified test suite **Goal**: Determine what to scan and verify tooling is available. 
**Step 1: Read project context** -- Check for CLAUDE.md, .gitignore, pyproject.toml, go.mod, package.json +- Check for CLAUDE.md, .gitignore, pyproject.toml, go.mod, package.json -- read and follow any repository CLAUDE.md before doing anything else, since it may contain project-specific exclusions or conventions that override defaults - Identify primary languages and project structure **Step 2: Determine scan scope** -- If user specified a directory or issue type, use that exactly -- If user specified only issue type (e.g., "find unused imports"), scan all source directories for that type only -- If request is vague ("clean up code"), ask user for target area or default to source directories only -- Always exclude: vendor/, node_modules/, .venv/, build/, dist/, generated/, .git/ +- If the user specified a directory or issue type, use that exactly -- only scan for requested issue types or smart defaults, never build elaborate reporting dashboards or speculative features +- If the user specified only an issue type (e.g., "find unused imports"), scan all source directories for that type only +- If the request is vague ("clean up code"), ask the user for a target area rather than scanning the entire codebase, because unfocused scans produce overwhelming noise that users cannot act on +- Always exclude: vendor/, node_modules/, .venv/, build/, dist/, generated/, .git/ -- these contain third-party or generated code that the user cannot fix, so including them buries real findings - Respect .gitignore patterns when determining what to scan **Step 3: Verify tool availability** -Check which analysis tools are installed. Report missing tools with install commands before proceeding. +Check which analysis tools are installed so you know what scans are possible before starting. Report missing tools with install commands. 
```bash # Python tools @@ -158,7 +95,7 @@ If critical tools are missing, offer to proceed with partial scan using availabl Run applicable scans based on language and scope. See `references/scan-commands.md` for full command reference. **Core scans (all languages)**: -1. **Stale TODOs**: grep for TODO/FIXME/HACK/XXX, age with git blame +1. **Stale TODOs**: grep for TODO/FIXME/HACK/XXX, then age every match with git blame -- a 180-day-old TODO about a data race is fundamentally different from yesterday's "TODO: add test case", so age-based triage is essential for prioritization 2. **Unused imports**: ruff (Python), goimports (Go) 3. **Dead code**: vulture (Python), staticcheck (Go) 4. **Complexity**: radon (Python), gocyclo (Go) @@ -170,9 +107,7 @@ Run applicable scans based on language and scope. See `references/scan-commands. 8. Duplicate code (pylint --enable=duplicate-code) 9. Missing docstrings (ruff --select D) -Collect all output with exact file:line references. Do not filter or summarize raw scan output. - -For each scan, record: +Collect all output with exact file:line references -- never summarize away specifics, because the user needs precise locations to act on findings. For each scan, record: - Number of findings - Files affected - Whether findings are auto-fixable @@ -183,7 +118,7 @@ If a scan tool is unavailable, note it as skipped and continue with remaining sc ### Phase 3: PRIORITIZE -**Goal**: Rank findings by impact/effort ratio and categorize. +**Goal**: Rank findings by impact/effort ratio and categorize. Never present a flat unsorted list -- a critical 90-day-old security TODO buried among trivial missing docstrings wastes the user's attention. 
**Step 1: Assign impact and effort** @@ -201,13 +136,13 @@ If a scan tool is unavailable, note it as skipped and continue with remaining sc | Magic numbers | Low | Low | 5 | **Step 2: Group into tiers** -- **Quick Wins** (High priority, low effort): Unused imports, stale TODOs, dead code +- **Quick Wins** (High priority, low effort): Unused imports, stale TODOs, dead code -- present auto-fixable issues first so the user gets immediate value - **Important** (High impact, medium+ effort): Deprecated functions, high complexity, duplicates - **Polish** (Lower impact): Missing types, docstrings, naming, magic numbers **Step 3: Estimate total effort per tier** -Use these time estimates per issue: +Include time estimates so the user can plan their cleanup budget: | Issue Type | Time per Instance | |------------|-------------------| @@ -230,6 +165,8 @@ Multiply by instance count for tier totals. **Goal**: Present findings in structured, actionable format. +This skill defaults to read-only scan and report. Do not modify any files during this phase. + Generate report with this structure: 1. Executive summary (total issues, tier counts, estimated effort) 2. Quick Wins with auto-fix commands where available @@ -241,21 +178,23 @@ See `references/report-template.md` for complete template. Print complete report to stdout. Do NOT summarize or truncate findings. -If user provided `--output {file}` flag, also write report to the specified file. +If the user provided `--output {file}` flag, also write report to the specified file. For each finding in the report: - Include exact file:line reference -- Show 3 lines of surrounding context +- Show 3 lines of surrounding context for quick comprehension - Provide specific fix suggestion or auto-fix command - Note whether the fix is auto-fixable or requires manual effort +Remove any intermediate scan outputs at completion, keeping only the final report. 
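The Step 3 arithmetic (time per instance multiplied by instance count, summed per tier) can be sketched as below for the executive summary. The minute values are hypothetical placeholders, since the skill's own time table is not shown in full here; the tier assignments follow the Step 2 grouping, and all identifiers are illustrative.

```python
# Hypothetical per-instance estimates in minutes. The skill's own
# "Time per Instance" table is the source of truth for real values.
MINUTES_PER_INSTANCE = {
    "unused_import": 1,
    "stale_todo": 10,
    "high_complexity": 45,
}

# Tier assignment following the Step 2 grouping.
TIER_OF = {
    "unused_import": "Quick Wins",
    "stale_todo": "Quick Wins",
    "high_complexity": "Important",
}


def tier_totals(counts: dict[str, int]) -> dict[str, int]:
    """Return estimated minutes per tier: sum of count * minutes per instance."""
    totals: dict[str, int] = {}
    for issue_type, count in counts.items():
        tier = TIER_OF[issue_type]
        totals[tier] = totals.get(tier, 0) + count * MINUTES_PER_INSTANCE[issue_type]
    return totals


if __name__ == "__main__":
    print(tier_totals({"unused_import": 12, "stale_todo": 3, "high_complexity": 2}))
```

An unknown issue type raises `KeyError` rather than being silently dropped, which matches the rule that findings are never summarized away.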
+ **Gate**: Report delivered with all findings, exact references, and actionable suggestions. -### Phase 5: FIX (Optional - only with explicit permission) +### Phase 5: FIX (Optional -- only with explicit permission) **Goal**: Apply safe, deterministic fixes. -MUST have explicit user permission before proceeding. Never auto-enter this phase. +MUST have explicit user permission before proceeding. Never auto-enter this phase -- the user expected a report, not file modifications, and changes may conflict with in-progress work. **Step 1: Confirm scope with user** @@ -361,50 +300,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Scanning Everything Without Focus -**What it looks like**: Running all 9 scan types across entire codebase for "check for any code issues" -**Why wrong**: Produces overwhelming noise, user cannot act on 100+ unsorted findings -**Do instead**: Ask for target area, start with quick wins, offer to expand scope - -### Anti-Pattern 2: Auto-Fixing Without Permission -**What it looks like**: Running `ruff --fix` immediately after scan completes -**Why wrong**: User expected a report, not file modifications. Changes may conflict with in-progress work -**Do instead**: Report findings first. Only fix when user explicitly requests it - -### Anti-Pattern 3: Flat Unsorted Reports -**What it looks like**: Alphabetical list of all issues with equal weight -**Why wrong**: Critical 90-day-old security TODO buried among trivial missing docstrings -**Do instead**: Always prioritize by impact/effort. Quick wins first, polish last - -### Anti-Pattern 4: Scanning Vendor/Generated Code -**What it looks like**: Reporting 500 issues mostly in node_modules/ or generated protobuf files -**Why wrong**: User cannot fix third-party or generated code -**Do instead**: Exclude non-source directories. 
Only scan code the user controls - -### Anti-Pattern 5: Treating All TODOs Equally -**What it looks like**: Listing 47 TODOs in a flat list with no age or severity context -**Why wrong**: A 180-day-old TODO about a data race is fundamentally different from yesterday's "TODO: add test case" -**Do instead**: Age every TODO with git blame. Triage by age (Critical >90d, High 30-90d, Normal <30d) and severity keywords - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I'll just fix these while scanning" | Modifying without permission violates report-first rule | Complete report, ask permission, then fix | -| "Scanning everything is more thorough" | Thoroughness without focus produces noise, not value | Scope first, scan second | -| "These findings are obvious, no need for references" | file:line references are non-negotiable | Always include exact locations | -| "Auto-fix is safe, no need to ask" | Safe for tool does not mean safe for user's workflow | Always get explicit permission | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/scan-commands.md`: Language-specific scan commands and expected output - `${CLAUDE_SKILL_DIR}/references/report-template.md`: Full structured report template diff --git a/skills/code-linting/SKILL.md b/skills/code-linting/SKILL.md index c326859b..e006418a 100644 --- a/skills/code-linting/SKILL.md +++ b/skills/code-linting/SKILL.md @@ -26,213 +26,127 @@ routing: # Code Linting Skill -Unified linting workflow for Python (ruff) and JavaScript (Biome). 
- -## Operator Context - -This skill operates as an operator for code quality enforcement, configuring Claude's behavior for consistent linting and formatting across Python and JavaScript codebases. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution. Project-specific linting rules override defaults. -- **Over-Engineering Prevention**: Only run requested linters and fixes. Don't add custom rules, configuration changes, or additional tooling without explicit request. -- **Show complete linter output**: NEVER summarize as "no issues found" - display actual command output -- **Run both Python and JS linting**: When project has both languages, lint both (unless user specifies otherwise) -- **Use project-specific configs**: Always use pyproject.toml/biome.json settings, never override -- **Preserve line width settings**: Respect the line width configured in project tools - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show command output rather than describing it. Be concise but informative. -- **Temporary File Cleanup**: Remove any temporary lint report files or cache files created during execution at task completion. 
-- **Auto-fix safe issues**: Apply `--fix` flag for formatting and import ordering issues -- **Check before commit**: Verify code passes linting before suggesting commit -- **Report all categories**: Show errors, warnings, and style issues together -- **Suggest manual fixes**: For issues that can't be auto-fixed, explain how to resolve - -### Optional Behaviors (OFF unless enabled) -- **Strict mode**: Treat warnings as errors (fail on any issue) -- **Single language only**: Lint only Python or only JavaScript when requested -- **Format only**: Skip linting, only run formatting -- **Ignore specific rules**: Disable particular lint rules for edge cases - -## What This Skill CAN Do -- Run ruff check/format for Python codebases -- Run Biome check/format for JavaScript/TypeScript codebases -- Auto-fix safe issues (import ordering, formatting) -- Show complete linter output for review -- Use project-specific configurations (pyproject.toml, biome.json) - -## What This Skill CANNOT Do -- Override project linter configurations without explicit request -- Summarize linter output (must show full command output) -- Run linters for languages other than Python and JavaScript/TypeScript -- Fix complex logic issues (only style/formatting/import issues) -- Skip reading linter output before applying auto-fixes +Unified linting workflow for Python (ruff) and JavaScript (Biome). Covers check, format, and auto-fix for both languages. Only handles Python and JavaScript/TypeScript -- complex logic issues and other languages are out of scope. -## Error Handling +## Instructions -### Error: "ruff not found" -**Cause**: Virtual environment not activated or ruff not installed -**Solution**: -- Use virtual environment path: `./venv/bin/ruff` or `./env/bin/ruff` -- Or install globally: `pip install ruff` -- Or use pipx: `pipx run ruff check .` +### 1. 
Read Project Configuration -### Error: "biome not found" -**Cause**: Biome not installed in project -**Solution**: Run `npx @biomejs/biome` to use npx-based execution - -### Error: "Configuration file not found" -**Cause**: Running from wrong directory -**Solution**: cd to project root where pyproject.toml/biome.json exist +Before running any linter, read the repository's CLAUDE.md for project-specific linting rules -- those override every default below. Then locate the project's linter config files (`pyproject.toml` for ruff, `biome.json` for Biome). All linter invocations must use these configs as-is; never override line width, rule sets, or other project settings. -## Quick Reference +### 2. Detect Languages and Run Checks -### Python (ruff) +When a project contains both Python and JavaScript/TypeScript, lint both unless the user explicitly requests a single language. Run the check command first to see what violations exist: ```bash -# Navigate to your project -cd /path/to/your/project - -# Check for issues (use project's venv if available) +# Python -- use project venv when available ruff check . -# or with virtual env: ./venv/bin/ruff check . +# or: ./venv/bin/ruff check . -# Auto-fix issues -ruff check --fix . +# JavaScript/TypeScript +npx @biomejs/biome check src/ +``` -# Format code -ruff format . +**Always display the complete linter output.** Never summarize results as "no issues found" or describe output secondhand -- show the actual command output so the user can see every error, warning, and style issue together. -# Check formatting only (no changes) -ruff format --check . -``` +### 3. Review Output Before Fixing -### JavaScript (Biome) +Read the full output and understand what violations exist and their severity before applying any fixes. Jumping straight to `--fix` without reviewing risks auto-removing imports that are still needed or making changes that reduce readability. -```bash -# Navigate to your project -cd /path/to/your/project +### 4. 
Apply Auto-Fixes -# Check for issues -npx @biomejs/biome check src/ +Apply `--fix` for safe categories: formatting, import ordering, and style issues that the linter can correct mechanically. -# Auto-fix issues -npx @biomejs/biome check --write src/ +```bash +# Python +ruff check --fix . +ruff format . -# Format only +# JavaScript/TypeScript +npx @biomejs/biome check --write src/ npx @biomejs/biome format --write src/ ``` -### Combined Commands (if Makefile configured) +Only run the linters and fixes that were requested. Do not add custom rules, configuration changes, or additional tooling unless the user explicitly asks. + +### 5. Review the Diff + +After auto-fix, review the diff to verify changes are correct and safe: ```bash -make lint # Check both Python and JS -make lint-fix # Fix both Python and JS +git diff ``` -## Configuration Files +Auto-fixes can occasionally remove imports that are still needed, reformat code in ways that hurt readability, or introduce subtle bugs through variable shadowing changes. Revert any problematic auto-fixes before proceeding. -| Tool | Config | Typical Line Width | -|------|--------|-------------------| -| ruff | pyproject.toml | 88-120 | -| biome | biome.json | 80-120 | +### 6. Fix Remaining Issues Manually -## Common Fixes +For violations that cannot be auto-fixed, explain each one and how to resolve it: -### Python +**Python common fixes:** - Unused import (F401): Remove or use the import - Import order (I001): Run `ruff check --fix` - Line too long (E501): Break into multiple lines or adjust line-length config -### JavaScript +**JavaScript common fixes:** - noVar: Replace `var` with `let`/`const` - useConst: Use `const` for unchanging values - noDoubleEquals: Use `===` instead of `==` -## Anti-Patterns +### 7. Verify Before Commit -### Anti-Pattern 1: Running Linter Without Reading Output +Run the linter one final time to confirm zero violations before suggesting a commit: -**What it looks like:** ```bash -$ ruff check . 
-$ # Immediately running --fix without reviewing issues -$ ruff check --fix . +ruff check . +ruff format --check . +npx @biomejs/biome check src/ ``` -**Why it's wrong:** -- May auto-fix issues that need manual review -- Misses understanding of what violations exist -- Can introduce unintended changes (e.g., removing imports still needed) +Report output factually -- no self-congratulation, just the command results. + +### 8. Clean Up -**Do this instead:** -1. Read the complete linter output first -2. Understand what violations exist and their severity -3. Decide which fixes are safe to automate -4. Apply targeted fixes or manual corrections +Remove any temporary lint report files or cache files created during execution. -### Anti-Pattern 2: Applying Auto-Fixes Blindly +### Combined Commands (if Makefile configured) -**What it looks like:** ```bash -$ ruff check --fix . -# All files changed -$ git add . && git commit -m "lint fixes" -# Didn't review what changed +make lint # Check both Python and JS +make lint-fix # Fix both Python and JS ``` -**Why it's wrong:** -- Auto-fixes might remove imports you still need -- Could reformat code in ways that reduce readability for specific cases -- May introduce subtle bugs (e.g., changing variable shadowing) +### Configuration Reference -**Do this instead:** -1. Run `ruff check --fix .` -2. Review the diff: `git diff` -3. Verify changes are correct and safe -4. Revert any problematic auto-fixes -5. Then commit with understanding of what changed +| Tool | Config | Typical Line Width | +|------|--------|-------------------| +| ruff | pyproject.toml | 88-120 | +| biome | biome.json | 80-120 | -### Anti-Pattern 3: Summarizing Linter Output +### Optional Modes -**What it looks like:** -``` -User: "Lint the code" -Assistant: "I ran the linter and found 3 issues. All fixed!" 
-``` +- **Strict mode**: Treat warnings as errors (fail on any issue) -- enable when requested +- **Format only**: Skip linting, only run formatting -- enable when requested +- **Ignore specific rules**: Disable particular lint rules for edge cases -- enable when requested -**Why it's wrong:** -- User can't see what violations existed -- Can't verify fixes were appropriate -- Hides important details about code quality +## Error Handling -**Do this instead:** -Show complete command output: -```bash -$ ruff check . -src/main.py:10:1: F401 [*] `os` imported but unused -src/utils.py:25:80: E501 Line too long (95 > 88 characters) -Found 2 errors. -[*] 1 fixable with the --fix flag - -$ ruff check --fix . -Fixed 1 error: -src/main.py:10:1: F401 Removed unused import `os` -``` +### Error: "ruff not found" +**Cause**: Virtual environment not activated or ruff not installed +**Solution**: +- Use virtual environment path: `./venv/bin/ruff` or `./env/bin/ruff` +- Or install globally: `pip install ruff` +- Or use pipx: `pipx run ruff check .` -## Workflow +### Error: "biome not found" +**Cause**: Biome not installed in project +**Solution**: Run `npx @biomejs/biome` to use npx-based execution -1. Run linter to check: `ruff check .` / `npx @biomejs/biome check src/` -2. **Review the complete output** to understand violations -3. Auto-fix what's safe: `ruff check --fix .` -4. **Review the diff** to verify auto-fixes are correct -5. Format code: `ruff format .` -6. Review remaining issues and fix manually -7. Commit and push -8. 
Check CI/GitHub Actions +### Error: "Configuration file not found" +**Cause**: Running from wrong directory +**Solution**: cd to project root where pyproject.toml/biome.json exist ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks +- [ruff documentation](https://docs.astral.sh/ruff/) +- [Biome documentation](https://biomejs.dev/) diff --git a/skills/codebase-analyzer/SKILL.md b/skills/codebase-analyzer/SKILL.md index 9d2fd4da..5c9c7adc 100644 --- a/skills/codebase-analyzer/SKILL.md +++ b/skills/codebase-analyzer/SKILL.md @@ -29,57 +29,20 @@ routing: # Codebase Analyzer Skill -## Operator Context - -This skill operates as an operator for statistical codebase analysis, configuring Claude's behavior for measurement-based rule discovery from Go codebases. It implements a **Measure, Don't Read** methodology -- Python scripts count patterns to avoid LLM training bias override, then statistics are interpreted to derive confidence-scored rules. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution. Project instructions override default behaviors. -- **Over-Engineering Prevention**: Scripts perform pure statistical measurement only. No feature additions beyond counting patterns. No speculative metrics or flexibility that was not requested. -- **Measurement-Only Analysis**: Scripts count and measure; NEVER interpret or judge code quality during data collection phase. The LLM is a calculator, not a judge. -- **No Training Bias**: Analysis MUST avoid LLM interpretation of "good" vs "bad" patterns during measurement. What IS in the code is the local standard. -- **Confidence Gating**: Only derive rules from patterns with >70% consistency. 
Below that threshold, report statistics without creating rules. -- **Separate Measurement from Interpretation**: Run scripts first (mechanical), then interpret statistics second (analytical). Never combine these steps. - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show complete statistics rather than describing them. Be concise but informative. -- **Temporary File Cleanup**: Analysis scripts do not create temporary files (single-pass processing). Any debug outputs or iteration files should be removed at completion. -- **Verbose Output**: Display summary statistics to stderr, full JSON to stdout or file. -- **Confidence Thresholds**: HIGH (>85%), MEDIUM (70-85%), below 70% not extracted as rule. -- **Vendor Filtering**: Automatically skip vendor/, testdata/, and generated code to avoid polluting statistics with external patterns. - -### Optional Behaviors (OFF unless enabled) -- **Cross-Repository Analysis**: Compare patterns across multiple repos (requires explicit request). -- **Historical Tracking**: Re-analyze same repo over time to track pattern evolution (requires explicit request). -- **Custom Metric Addition**: Add new measurement categories beyond the 100 standard metrics (requires explicit request). 
- -## What This Skill CAN Do -- Extract implicit coding rules through statistical analysis of Go codebases -- Measure 100 metrics across 25 categories using Python scripts -- Derive confidence-scored rules from pattern frequency data -- Produce a 10-dimensional Style Vector quality fingerprint (0-100 scores) -- Discover shadow constitution rules (linter suppressions teams accept) -- Compare patterns across multiple repositories for team-wide standards - -## What This Skill CANNOT Do -- Judge code quality subjectively (measures patterns, not "good" vs "bad") -- Analyze non-Go codebases (scripts are Go-specific) -- Derive rules from codebases with fewer than 50 Go files (insufficient sample) -- Replace code review or linting (produces rules, not enforcement) -- Skip measurement and rely on LLM "reading" the code - ---- +Statistical rule discovery through measurement of Go codebases. Python scripts count patterns to avoid LLM training bias, then statistics are interpreted to derive confidence-scored rules. The core principle is **Measure, Don't Read** -- what IS in the code is the local standard, not what an LLM thinks "should be" there. ## Instructions -### Phase 1: CONFIGURE (Do NOT proceed without validated target) +### Phase 1: CONFIGURE **Goal**: Validate target and select analyzer variant. +Read and follow the repository's CLAUDE.md before doing anything else -- project instructions override default behaviors. + **Step 1: Validate the target** - Confirm path points to a Go repository root with .go files - Check for standard structure (cmd/, internal/, pkg/) -- Verify sufficient file count (50+ files for meaningful rules, 100+ ideal) +- Verify sufficient file count: 50+ files for meaningful rules, 100+ ideal. Below 50 files, statistics produce high variance -- patterns that look consistent may be coincidence. For small repos, combine analysis across multiple team repos rather than treating thin data as definitive. 
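The file-count check in Step 1 can be sketched as a small pre-flight script. This is an assumption-laden illustration rather than part of the cartographer tooling: the function names are hypothetical, every `.go` file (including tests) is counted, and only `vendor/` and `testdata/` are excluded.

```python
import os

# Directories skipped so external code does not inflate the count.
EXCLUDED_DIRS = {"vendor", "testdata"}


def count_go_files(root: str) -> int:
    """Count .go files under root, skipping excluded directories."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        total += sum(1 for name in filenames if name.endswith(".go"))
    return total


def sample_verdict(go_file_count: int) -> str:
    """Map a file count onto the 50+ / 100+ guidance."""
    if go_file_count >= 100:
        return "ideal"
    if go_file_count >= 50:
        return "meaningful"
    return "insufficient"


if __name__ == "__main__":
    n = count_go_files(".")
    print(f"{n} Go files: {sample_verdict(n)}")
```

An "insufficient" verdict would fail the Phase 1 gate, at which point combining several team repos is the suggested path.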
**Step 2: Select cartographer variant** @@ -119,10 +82,14 @@ This skill operates as an operator for statistical codebase analysis, configurin **Gate**: Target directory exists, contains 50+ Go files, variant selected. Proceed only when gate passes. -### Phase 2: MEASURE (Do NOT interpret during this phase) +### Phase 2: MEASURE **Goal**: Run statistical analysis scripts. Pure measurement -- no interpretation yet. +This phase is strictly mechanical. Scripts count and measure; do not interpret or judge code quality during data collection. Combining measurement with interpretation introduces LLM training bias -- the model reports what "should be" instead of what IS. Run scripts first, interpret the numbers second, always as separate steps. + +Automatically filter vendor/, testdata/, and generated code (files with "Code generated by..." markers) to avoid polluting statistics with external patterns. + **Step 1: Execute the cartographer** ```bash @@ -134,6 +101,8 @@ grep -rn 'fmt.Errorf.*%w' ~/repos/my-project --include="*.go" | wc -l grep -rn 'func New' ~/repos/my-project --include="*.go" | wc -l ``` +Never substitute LLM "reading the codebase" for running the cartographer scripts. When an LLM sees `return err` it may report "not wrapping errors properly" even if that IS the local standard. The scripts produce deterministic, reproducible counts; the LLM's role begins at interpretation in Phase 3. + **Step 2: Verify output integrity** - Confirm JSON output is valid and complete - Check file count matches expectations (no vendor pollution) @@ -173,10 +142,12 @@ grep -rn 'func New' ~/repos/my-project --include="*.go" | wc -l **Gate**: Script completed without errors, JSON output is valid, file count is reasonable. Proceed only when gate passes. -### Phase 3: INTERPRET (Now the LLM analyzes) +### Phase 3: INTERPRET **Goal**: Derive rules from statistics. This is where LLM interpretation happens -- AFTER measurement is complete. 
+Report facts and show complete statistics rather than describing them. Do not editorialize about code quality -- the numbers speak for themselves. + **Step 1: Review the three lenses** | Lens | Question | Measures | @@ -189,11 +160,13 @@ For detailed lens explanations, see `references/three-lenses.md`. **Step 2: Extract rules by confidence** +Only derive rules from patterns with sufficient consistency. Forcing rules from weak patterns causes false positives in reviews and may impose standards the team has not organically adopted. + | Confidence | Threshold | Action | Example | |------------|-----------|--------|---------| | HIGH | >85% consistency | Extract as enforceable rule | "96% use err not e" -> MUST use err | | MEDIUM | 70-85% consistency | Extract as recommendation | "78% guard clauses" -> SHOULD prefer guards | -| Below 70% | Not extracted | Report as observation only | "55% single-letter receivers" -> No rule | +| Below 70% | Not extracted as rule | Report as observation only | "55% single-letter receivers" -> No rule | **Step 3: Review Style Vector** (Omni only) - 10 composite scores (0-100): Consistency, Modernization, Safety, Idiomaticity, Documentation, Testing Maturity, Architecture, Performance, Observability, Production Readiness @@ -207,7 +180,7 @@ For detailed lens explanations, see `references/three-lenses.md`. **Gate**: Rules extracted with evidence and confidence levels. Style Vector reviewed. Proceed only when gate passes. -### Phase 4: DELIVER (Do NOT mark complete without artifacts) +### Phase 4: DELIVER **Goal**: Produce actionable output artifacts. 
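When rules are written into the output artifacts, the Phase 3 confidence thresholds can be applied mechanically. The sketch below is one reading of that table: exactly 85% and exactly 70% are both treated as MEDIUM, and the function name and the `NONE` label for sub-threshold patterns are hypothetical.

```python
def classify_confidence(consistency_pct: float) -> tuple[str, str]:
    """Map a pattern's consistency percentage to (confidence, action)."""
    if consistency_pct > 85:
        return ("HIGH", "Extract as enforceable rule")
    if consistency_pct >= 70:
        return ("MEDIUM", "Extract as recommendation")
    # Below 70%: no rule is derived, only the statistic is reported.
    return ("NONE", "Report as observation only")


if __name__ == "__main__":
    # Percentages taken from the examples in the Phase 3 table.
    for label, pct in [("err not e", 96), ("guard clauses", 78), ("single-letter receivers", 55)]:
        level, action = classify_confidence(pct)
        print(f"{label}: {pct}% -> {level} ({action})")
```

Encoding the thresholds in one function keeps interpretation deterministic, in line with the skill's separation of measurement from judgment.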
@@ -245,7 +218,7 @@ Format each rule as: - Compare with pr-miner data if available (explicit vs implicit rules) - Suggest CLAUDE.md updates for high-confidence rules - Identify golangci-lint rules that could enforce discovered patterns -- Suggest quarterly re-analysis schedule +- Suggest quarterly re-analysis schedule -- coding patterns evolve with team growth and new Go versions, so a one-time snapshot becomes stale within months ``` =============================================================== @@ -350,50 +323,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: LLM Reading Instead of Script Measuring -**What**: Using Claude to "read the codebase and find patterns" instead of running cartographer scripts -**Why wrong**: LLM applies training bias -- reports what "should be" instead of what IS. When the LLM sees `return err` it reports "not wrapping errors properly" even if that IS the local standard. -**Do instead**: Run the cartographer script first (measurement), then interpret the statistics (analysis). Two separate steps, never combined. - -### Anti-Pattern 2: Rules from Low-Confidence Patterns -**What**: Creating enforceable rules from patterns below 70% consistency (e.g., "45% use fmt.Errorf with %w" becomes "All errors must use fmt.Errorf") -**Why wrong**: Forces consistency where the team has not achieved it organically. Causes false positives in reviews. Team may be transitioning between patterns. -**Do instead**: Only derive rules from HIGH confidence (>85%). For 70-85%, suggest "consider standardizing." Below 70%, report as observation only. - -### Anti-Pattern 3: Analyzing Insufficient Sample Size -**What**: Running analysis on a repo with <50 Go files and treating results as definitive patterns -**Why wrong**: Small sample size produces high variance. Patterns that appear consistent at 20 files may be coincidence. Cannot distinguish signal from noise. -**Do instead**: Require 50+ files minimum. 
For small repos, combine analysis across multiple team repos. For monorepos, analyze the full tree. - -### Anti-Pattern 4: One-Time Analysis Without Follow-Up -**What**: Analyzing once, extracting rules, never re-running as the codebase evolves -**Why wrong**: Coding patterns evolve with team growth and new Go versions. One-time snapshot becomes stale within months. Cannot measure impact of standardization efforts. -**Do instead**: Re-analyze quarterly. Compare Style Vector scores over time. Track pattern adoption (e.g., "Did Modernization score improve after Go 1.21 adoption?"). - -### Anti-Pattern 5: Mixing Measurement and Interpretation -**What**: Having the LLM "read" code files and count patterns manually instead of running the deterministic Python scripts -**Why wrong**: LLM counting is unreliable at scale -- misses files, double-counts, applies inconsistent criteria. Python scripts produce deterministic, reproducible results across runs. -**Do instead**: ALWAYS run the cartographer script for measurement (Phase 2). The LLM's role begins at interpretation (Phase 3), working from the script's JSON output. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I can read the code and find patterns" | Reading applies training bias; measures what "should be" not what IS | Run cartographer scripts for measurement | -| "Small repo is fine for analysis" | <50 files produces unreliable statistics | Combine repos or accept limited confidence | -| "This 55% pattern should be a rule" | Below 70% is noise, not signal | Only extract rules above confidence threshold | -| "Analysis was done last year, still valid" | Patterns evolve with team and language | Re-analyze quarterly | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/three-lenses.md`: Detailed explanation of the three analysis lenses - `${CLAUDE_SKILL_DIR}/references/examples.md`: Real-world analysis examples and workflows diff --git a/skills/codebase-overview/SKILL.md b/skills/codebase-overview/SKILL.md index 82fa5675..9255538f 100644 --- a/skills/codebase-overview/SKILL.md +++ b/skills/codebase-overview/SKILL.md @@ -32,74 +32,39 @@ routing: # Codebase Overview Skill -## Operator Context - -This skill operates as an operator for systematic codebase exploration, configuring Claude's behavior for rigorous, evidence-based discovery. It implements the **Phased Discovery** architectural pattern -- Detect project type, Explore key files, Map architecture, Summarize findings -- with **Domain Intelligence** embedded in language-specific exploration strategies. 
- -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before exploration -- **Over-Engineering Prevention**: Explore only what's needed for the overview. No speculative deep-dives, no "while I'm here" tangents -- **Read-Only Operations**: NEVER modify, create, or delete project files -- **Evidence Required**: Every architectural claim MUST cite an examined file and path -- **Systematic Phases**: Follow DETECT -> EXPLORE -> MAP -> SUMMARIZE in order -- **Project Agnostic**: Work across any language, framework, or build system -- **Absolute Paths**: All file references in output MUST use absolute paths -- **Forbidden-Files Guardrail**: NEVER read or quote files matching sensitive patterns. Secrets leaked into exploration output are hard to retract and easy to miss. Check every file path against the forbidden list BEFORE reading. Skip silently -- do not log the file contents or path contents in output. - - ``` - # Secrets and credentials - .env, .env.*, *.pem, *.key, credentials.json, secrets.*, *secret*, *credential*, *password* - - # Authentication tokens - token.json, .npmrc, .pypirc - - # Cloud provider credentials - .aws/credentials, .gcloud/, service-account*.json - ``` - -### Default Behaviors (ON unless disabled) -- Report facts without self-congratulation; show evidence, not descriptions -- Start from current working directory -- Examine up to 20 key files per category -- Include recent commit themes (last 10 commits) -- Generate structured markdown report to stdout -- Exclude noise directories: `node_modules/`, `venv/`, `vendor/`, `dist/`, `build/`, `__pycache__/` -- Remove any temporary files created during exploration at completion -- Show absolute file paths in all output - -### Optional Behaviors (OFF unless enabled) -- Deep dive into specific subsystems (on explicit request) -- Generate architecture diagrams (requires explicit request) -- Include full file contents in report (vs summaries) -- Export 
findings to separate file (vs stdout) -- Analyze dependency vulnerability status - -## What This Skill CAN Do -- Systematically discover project type, tech stack, and architecture -- Identify entry points, core modules, data models, and API surfaces -- Map design patterns and key abstractions with file-level evidence -- Produce actionable onboarding documentation for new contributors -- Work across any language and framework without prior knowledge - -## What This Skill CANNOT Do -- Modify, create, or delete any files in the target project -- Run the application or execute its test suite -- Provide deep domain expertise (use specialized agent instead) -- Replace reading the codebase (this accelerates, not replaces) -- Skip any of the 4 phases - ---- +Systematic 4-phase codebase exploration that produces an evidence-backed onboarding report. Phases run in strict order -- DETECT, EXPLORE, MAP, SUMMARIZE -- because later phases depend on context established by earlier ones. This skill accelerates reading the codebase but does not replace it. ## Instructions Execute all phases autonomously. Verify each gate before advancing. Consult `references/exploration-strategies.md` for language-specific discovery commands. +Before starting any exploration, read and follow any `.claude/CLAUDE.md` or `CLAUDE.md` in the repository root because project-specific instructions override default behavior. + +This is a **read-only** skill -- never modify, create, or delete project files because the goal is observation, not mutation. Likewise, never run the application or execute its test suite because those are execution concerns outside this skill's scope. For deep domain analysis, route to a specialized agent instead. + +### Forbidden-Files Guardrail + +Check every file path against this list BEFORE reading because secrets leaked into exploration output are hard to retract and easy to miss. Skip silently -- do not log the file contents or path in output. 
+ +``` +# Secrets and credentials +.env, .env.*, *.pem, *.key, credentials.json, secrets.*, *secret*, *credential*, *password* + +# Authentication tokens +token.json, .npmrc, .pypirc + +# Cloud provider credentials +.aws/credentials, .gcloud/, service-account*.json +``` + ### Phase 1: DETECT **Goal**: Determine project type, language, framework, and tech stack. **Step 1: Examine root directory** +Start from the current working directory because that is the project the user is asking about. + ```bash ls -la ``` @@ -112,6 +77,8 @@ Identify configuration files that indicate project type: - `Cargo.toml` -> Rust - See `references/exploration-strategies.md` for complete indicator table +Always detect project type before reading source files because framework context changes how you interpret code (e.g., a `models/` directory means something different in Django vs. Express). + **Step 2: Read primary configuration** Based on detected type, read the main config file. Preference order: @@ -150,12 +117,16 @@ Read any `.claude/CLAUDE.md` or `CLAUDE.md` in the repository root. Follow its i - Test command: [from scripts/Makefile] ``` -**Gate**: Project type identified (language + framework). Tech stack documented. Build/run commands known. Proceed ONLY when gate passes. +**Gate**: Project type identified (language + framework). Tech stack documented. Build/run commands known. Proceed ONLY when gate passes -- skipping this gate leads to wrong architectural assumptions downstream. ### Phase 2: EXPLORE **Goal**: Discover entry points, core modules, data models, API surfaces, configuration, and tests. +Explore only what is needed for the overview because speculative deep-dives waste tokens without proportional value. Limit to 20 files per category because representative samples are more useful than exhaustive coverage. If a category has more than 20 files, note the total count and state that you examined a representative sample. 
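The forbidden-files guardrail and the 20-files-per-category cap described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not tooling shipped with the skill: the helper names `is_forbidden` and `sample_category` are hypothetical, matching on the lowercased file name is an assumed strategy, and taking the first files after sorting stands in for whatever "representative" selection is actually used.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# File-name patterns copied from the guardrail list above.
FORBIDDEN_NAME_PATTERNS = [
    ".env", ".env.*", "*.pem", "*.key", "credentials.json", "secrets.*",
    "*secret*", "*credential*", "*password*",
    "token.json", ".npmrc", ".pypirc",
    "service-account*.json",
]
# Entries from the list that name a directory or a path rather than a file name.
FORBIDDEN_PATH_FRAGMENTS = [".aws/credentials", ".gcloud/"]


def is_forbidden(path: str) -> bool:
    """Return True if the path must be skipped BEFORE any read (hypothetical helper)."""
    posix = PurePosixPath(path).as_posix().lower()
    name = PurePosixPath(path).name.lower()
    if any(fnmatchcase(name, pattern) for pattern in FORBIDDEN_NAME_PATTERNS):
        return True
    # Substring match is deliberately conservative: it may over-block, never under-block.
    return any(fragment in posix for fragment in FORBIDDEN_PATH_FRAGMENTS)


def sample_category(paths, limit=20):
    """Drop forbidden paths silently, cap the rest, and describe the coverage."""
    allowed = sorted(p for p in paths if not is_forbidden(p))
    sample = allowed[:limit]  # placeholder for a real "representative" heuristic
    note = ""
    if len(allowed) > limit:
        note = f"examined a representative sample of {limit} of {len(allowed)} files"
    return sample, note
```

Forbidden paths are filtered before counting, so neither their names nor their number reach the report, which matches the "skip silently" requirement.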
+ +On explicit user request, deep-dive into specific subsystems, generate architecture diagrams, include full file contents, export findings to a separate file, or analyze dependency vulnerability status. These are off by default because the standard overview does not require them. + **Step 1: Find entry points** Use language-specific patterns from `references/exploration-strategies.md`. Read each entry point file to understand application bootstrapping. @@ -165,6 +136,8 @@ For any language, look for: - Server/app initialization files - CLI entry points declared in config +Config files alone are not enough to understand a project because they show dependencies, not architecture -- always read entry points and core modules too. + **Step 2: Map directory structure** ```bash @@ -178,6 +151,8 @@ find . -type d \ | head -50 ``` +Exclude noise directories (`node_modules/`, `venv/`, `vendor/`, `dist/`, `build/`, `__pycache__/`) because they contain generated or third-party code that obscures the project's own structure. + Categorize directories by layer: | Pattern | Layer | @@ -256,7 +231,7 @@ Document: testing framework, test organization (co-located vs separate directory **Step 1: Identify design patterns** -Based on examined files, identify and document with evidence: +Based on examined files, identify and document with evidence. Every architectural claim must cite an examined file and path because uncited claims cannot be verified and mislead readers. ```markdown ## Design Patterns @@ -267,6 +242,8 @@ Based on examined files, identify and document with evidence: - DI approach: [manual/framework/none] (evidence: [file paths]) ``` +Do not infer architecture from the README alone because READMEs may be outdated or incomplete -- always verify against actual source files. + **Step 2: Map key abstractions** Identify the 5-10 most important types, classes, or modules: @@ -294,13 +271,15 @@ Trace a typical request from entry point through the full stack: 6. 
Response flows back through handler ``` +All file paths in output must be absolute because relative paths are ambiguous when the report is read outside the project directory. + **Step 4: Analyze recent activity** ```bash git log --oneline --no-decorate -10 ``` -Categorize commits into themes: +Include recent commit themes (last 10 commits). Categorize commits into themes: - Feature development (new capabilities) - Bug fixes (corrections) - Refactoring (structural changes) @@ -322,6 +301,8 @@ Use the template in `references/report-template.md`. Fill every section with evi - All commands MUST come from actual config files (package.json, Makefile, etc.) - Empty sections MUST note why information is unavailable +Report facts without self-congratulation -- show evidence, not descriptions of how thorough the exploration was. Every claim must have file-backed evidence because "report looks complete" is not the same as "report is complete." + **Step 2: Quality check** Before outputting, verify: @@ -331,6 +312,8 @@ Before outputting, verify: - [ ] Paths are absolute, not relative - [ ] Commands are real, not guessed +Adjust the 20-files-per-category limit if a specific area needs deeper sampling -- some projects concentrate complexity in one layer. Note any such adjustments in the report. + **Step 3: Generate "Where to Add New Code" section** Append a prescriptive section to the report. Developers exploring a codebase need to know not just what exists but where to put new things. For each major code category discovered during exploration, provide the directory, a concrete example file to use as a template, and any naming conventions. @@ -362,7 +345,9 @@ If any matches are found: **Step 5: Output report** -Display complete markdown report to stdout. If export behavior is enabled, also write to file. +Display the complete markdown report to stdout by default because most users need inline context, not a separate file.
If export behavior is explicitly requested, also write to file. + +Remove any temporary files created during exploration because they are intermediate artifacts, not deliverables. **Gate**: Report has all sections filled. All paths are absolute. All claims cite evidence. "Where to Add New Code" section populated with real file references. Secret scan passed (no unredacted secrets in output). Report is actionable for onboarding. Quality check passes. Total files examined count is accurate. @@ -378,7 +363,7 @@ Use parallel mapping when the exploration goal is broad and open-ended -- full o ### Agent Domains -Launch 4 parallel agents using Task, each focused on a specific domain. Each agent follows the forbidden-files guardrail and writes a structured document. +Launch 4 parallel agents using Task, each focused on a specific domain. Each agent follows the forbidden-files guardrail and writes a structured document. This skill works across any language, framework, or build system because the agent instructions are project-agnostic. | Agent | Focus | Output File | |-------|-------|-------------| @@ -479,50 +464,8 @@ Solution: Skip the inaccessible file. Note in the "Files Examined" section which --- -## Anti-Patterns - -### Anti-Pattern 1: Exploring Without Detecting Project Type -**What it looks like**: Reading random source files before checking config files -**Why wrong**: Wastes time, misinterprets code without framework context, produces inaccurate conclusions -**Do instead**: Complete Phase 1 (DETECT) first. Always read config before code. - -### Anti-Pattern 2: Claims Without File Evidence -**What it looks like**: "This appears to use microservices" without citing any examined file -**Why wrong**: Violates evidence-required hardcoded behavior. Misleads readers. Cannot be verified. -**Do instead**: Every claim cites a file. 
Example: "Uses PostgreSQL (confirmed in `/abs/path/config/database.py`: `postgresql://...`)" - -### Anti-Pattern 3: Exhaustive File Reading -**What it looks like**: Reading all 50+ files in `models/` directory instead of 3-5 representatives -**Why wrong**: Token bloat without proportional value. Violates over-engineering prevention. -**Do instead**: Limit to 20 files per category. Read representative samples. Note coverage in report. - -### Anti-Pattern 4: Skipping Phase Gates -**What it looks like**: Jumping to MAP before finishing EXPLORE, or generating report with missing sections -**Why wrong**: Incomplete overview with critical gaps. Wrong architectural assumptions. -**Do instead**: Verify every gate condition before proceeding. Return to previous phase if information is missing. - -### Anti-Pattern 5: Generic Report Output -**What it looks like**: Report with "This module handles business logic" and no file paths or specifics -**Why wrong**: Not actionable for onboarding. No better than reading a README. -**Do instead**: Include specific paths, line references, and exact commands from config files. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I can infer the architecture from the README" | README may be outdated or incomplete | Examine actual source files | -| "Config files are enough to understand the project" | Config shows dependencies, not architecture | Read entry points and core modules | -| "20 files is enough for any project" | Some areas need deeper sampling | Adjust per category, note limitations | -| "Report looks complete" | Looking complete != being complete | Verify every section has file-backed evidence | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/report-template.md`: Standard markdown report template with all sections - `${CLAUDE_SKILL_DIR}/references/exploration-strategies.md`: Language-specific discovery commands and patterns diff --git a/skills/comment-quality/SKILL.md b/skills/comment-quality/SKILL.md index ca3ea7bd..fa763b14 100644 --- a/skills/comment-quality/SKILL.md +++ b/skills/comment-quality/SKILL.md @@ -27,46 +27,7 @@ routing: # Comment Quality Skill -## Operator Context - -This skill operates as an operator for documentation quality assurance, configuring Claude's behavior for timeless, meaningful code comments and documentation. It implements a **Scan, Analyze, Rewrite, Verify** workflow with deterministic pattern matching against known temporal and activity language. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before scanning -- **Over-Engineering Prevention**: Scan only files explicitly requested or in the current working scope. 
Do NOT scan entire codebases unless user asks for full audit -- **Remove ALL Temporal Language**: Words like "new", "old", "now", "recently", "updated", "fixed" are NEVER acceptable in comments -- **Focus on WHAT and WHY**: Comments MUST explain current functionality and purpose, never history or development activity -- **No Exceptions for Temporal Language**: Every instance is flagged regardless of perceived harmlessness -- **Preserve Legal and Generated Markers**: Copyright headers, license text, @generated tags, and @deprecated annotations are left unchanged - -### Default Behaviors (ON unless disabled) -- **Concise Reporting**: Report facts with file paths and line numbers. No excessive commentary -- **Suggest Rewrites**: Provide corrected versions of every problematic comment -- **Explain Reasoning**: Clarify why temporal language fails for long-term maintainability -- **Check Surrounding Context**: Inspect nearby comments for temporal patterns when one is found -- **Report All Findings**: List every instance, not just the first few -- **Temporary File Cleanup**: Remove scan results, intermediate reports, or helper files at task completion - -### Optional Behaviors (OFF unless enabled) -- **Auto-Fix Mode**: Automatically apply corrections without user review (enable with explicit request) -- **Aggressive Scanning**: Check git commit messages and PR descriptions (enable with explicit request) -- **Batch Processing**: Process files by directory with grouped reports (enable for large codebases) - -## What This Skill CAN Do -- Identify temporal language, development-activity words, and relative comparisons in comments -- Provide specific, actionable rewrites that explain WHAT and WHY -- Scan multiple file types: `.go`, `.py`, `.js`, `.ts`, `.md`, `.txt` -- Distinguish between developer comments and legal/generated markers -- Generate structured reports with file path, line number, current text, and suggested fix - -## What This Skill CANNOT Do -- Write new 
documentation from scratch (use documentation skills instead) -- Enforce code style rules beyond comment content (use code-linting instead) -- Fix comments without reading surrounding code context -- Auto-fix without explicit user authorization -- Modify copyright headers, license text, or generated code markers - ---- +Review code comments for temporal references, development-activity language, and relative comparisons. Produces structured reports with actionable rewrites that explain WHAT the code does and WHY, never WHEN something changed. Supports `.go`, `.py`, `.js`, `.ts`, `.md`, and `.txt` files. ## Instructions @@ -76,11 +37,16 @@ This skill operates as an operator for documentation quality assurance, configur **Step 1: Determine scope** -Scan only what was requested. If user specifies files, scan those files. If user specifies a directory, scan that directory. NEVER default to full codebase. +Read the repository CLAUDE.md first to pick up any project-specific comment conventions. + +Scan only what was requested. If user specifies files, scan those files. If user specifies a directory, scan that directory. NEVER default to full codebase -- even if you suspect other files have issues, honor the explicit scope and suggest expansion separately at the end. + +If user explicitly requests auto-fix, enable it. Otherwise present findings for review. For large codebases, group findings by directory when reporting. **Step 2: Search for temporal patterns** -Target these categories: +Flag every instance of the following categories. 
No temporal word is "harmless" -- all temporal language ages poorly and must be rewritten regardless of how innocuous it seems: + - **Temporal words**: "new", "old", "previous", "current", "now", "recently", "latest", "modern" - **Development activity**: "added", "removed", "deleted", "updated", "changed", "modified", "fixed", "improved", "enhanced", "refactored", "optimized" - **State transitions**: "replaced", "migrated", "upgraded", "deprecated", "became", "turned into", "evolved" @@ -89,14 +55,16 @@ Target these categories: **Step 3: Filter false positives** -Exclude from findings: -- Copyright and license headers -- `@generated` markers -- `@deprecated` annotations (keep the tag, flag only temporal explanation text) +Exclude from findings -- these are not developer comments and must remain untouched: +- Copyright and license headers (legal requirements, not code comments) +- `@generated` markers (tooling markers) +- `@deprecated` annotations (keep the tag, flag only temporal explanation text after it) - Variable names or string literals that happen to contain temporal words - TODO/FIXME items that describe future work without temporal references -**Gate**: All files in scope scanned. Findings list populated with file path, line number, and matched text. Proceed only when gate passes. +When a finding appears, inspect nearby comments in the same function or block -- temporal language tends to cluster. + +**Gate**: All files in scope scanned. Findings list populated with file path, line number, and matched text. Every finding listed, not just the first few. Proceed only when gate passes. ### Phase 2: ANALYZE @@ -104,7 +72,7 @@ Exclude from findings: **Step 1: Read surrounding code** -For each finding, read the function, block, or section the comment describes. Understand what the code actually does. +For each finding, read the function, block, or section the comment describes. Understand what the code actually does. 
A rewrite without code context produces vague replacements that strip temporal words without adding substance. **Step 2: Classify the comment** @@ -131,7 +99,7 @@ For each comment, identify: **Step 1: Draft rewrites** -For each finding, produce: +For each finding, produce a structured entry with file path, line number, current text, suggested replacement, and reasoning: ```markdown **File: `path/to/file.ext`** @@ -150,7 +118,7 @@ Each rewrite MUST pass these checks: - [ ] Is it more specific than what it replaces (not just temporal word removed)? - [ ] Does it add value for a future maintainer? -If a rewrite just removes the temporal word without adding substance, it fails validation. Rewrite again with specific, descriptive content. +If a rewrite just removes the temporal word without adding substance, it fails validation. Simply deleting a word produces a useless comment -- `// Updated error handling` becoming `// Error handling` adds nothing. Rewrite with specific, descriptive content: `// Handles database connection errors with exponential backoff retry`. **Gate**: All rewrites pass quality checks. No vague or empty replacements. Proceed only when gate passes. @@ -160,6 +128,8 @@ If a rewrite just removes the temporal word without adding substance, it fails v **Step 1: Generate report** +Report facts concisely with file paths and line numbers. Every finding must include the current text, suggested replacement, and reasoning -- a diagnostic-only count without rewrites creates work without providing solutions. + ```markdown ## Comment Quality Review @@ -178,42 +148,13 @@ If a rewrite just removes the temporal word without adding substance, it fails v **Step 2: Apply fixes (if auto-fix enabled)** -If user requested auto-fix, apply all rewrites using Edit tool. Verify each edit succeeded. - -**Gate**: Report delivered. All findings accounted for. Task complete. +If user requested auto-fix, apply all rewrites using Edit tool. Verify each edit succeeded. 
Never auto-fix without explicit user authorization. ---- +**Step 3: Cleanup** -## Examples - -### Example 1: Single File Review -User says: "Check the comments in auth.go" -Actions: -1. Scan only auth.go for temporal patterns (SCAN) -2. Read surrounding code for each finding, classify severity (ANALYZE) -3. Draft specific rewrites with WHAT/WHY focus (REWRITE) -4. Deliver report with file path, line, current, suggested, reason (REPORT) -Result: Targeted report for one file with actionable rewrites - -### Example 2: Pre-Release Documentation Audit -User says: "Audit all markdown files in docs/ before release" -Actions: -1. Scan all .md files in docs/ directory (SCAN) -2. Classify findings, skip license headers and generated markers (ANALYZE) -3. Generate rewrites, validate each passes the 10-year test (REWRITE) -4. Deliver grouped report sorted by file, with summary statistics (REPORT) -Result: Comprehensive audit with every finding addressed - -### Example 3: Auto-Fix Mode -User says: "Fix all temporal comments in pkg/api/ automatically" -Actions: -1. Scan all code files in pkg/api/ (SCAN) -2. Analyze context for each finding (ANALYZE) -3. Generate and validate rewrites (REWRITE) -4. Apply fixes using Edit tool, verify each succeeded, deliver report (REPORT) -Result: All temporal comments replaced in-place with verification +Remove any scan results, intermediate reports, or helper files created during execution. ---- +**Gate**: Report delivered. All findings accounted for. Task complete. ## Error Handling @@ -229,7 +170,7 @@ Cause: Large codebase with widespread temporal language Solution: 1. Prioritize by file importance (README first, then core modules) 2. Group findings by pattern type -3. Enable batch processing optional behavior +3. Process files by directory with grouped reports ### Error: "Comment Meaning Unclear Without History" Cause: Comment only makes sense with development context that no longer exists @@ -238,47 +179,8 @@ Solution: 2. 
If purpose is clear from code, suggest removing the comment entirely 3. If purpose is unclear, ask user for clarification before rewriting ---- - -## Anti-Patterns - -### Anti-Pattern 1: Scanning Entire Codebase Without Scope -**What it looks like**: User asks "check comments in auth.go" and agent scans all 500 files -**Why wrong**: Wastes tokens, produces overwhelming reports, ignores user's explicit request -**Do instead**: Scan only what was requested. Ask before expanding scope. - -### Anti-Pattern 2: Vague Rewrites That Remove Without Adding -**What it looks like**: `// Updated error handling` becomes `// Error handling` -**Why wrong**: Removing temporal word without adding substance produces a useless comment -**Do instead**: `// Handles database connection errors with exponential backoff retry` - -### Anti-Pattern 3: Flagging Legal Text and Generated Markers -**What it looks like**: Flagging `Copyright 2023-2024` as temporal language -**Why wrong**: Copyright years are legal requirements; generated markers serve tooling -**Do instead**: Skip license headers, @generated markers, and @deprecated tags - -### Anti-Pattern 4: Reporting Without Actionable Fixes -**What it looks like**: "Found 47 instances of temporal language" with no suggested rewrites -**Why wrong**: Diagnostic-only reports create work without providing solutions -**Do instead**: Every finding includes file path, line number, current text, suggested fix, and reason - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "That temporal word is harmless here" | All temporal language ages poorly | Flag and rewrite it | -| 
"Removing the word is enough" | Removing without adding produces empty comments | Write specific replacement | -| "Legal text has dates too" | Legal text is not a code comment | Skip legal headers, flag code comments | -| "User only asked about one file" | Nearby files may share patterns | Report scope, suggest expansion if warranted | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/temporal-keywords.txt`: Complete list of temporal words to flag - `${CLAUDE_SKILL_DIR}/references/examples.md`: Before/after examples of comment rewrites diff --git a/skills/condition-based-waiting/SKILL.md b/skills/condition-based-waiting/SKILL.md index 59b8329d..66c83377 100644 --- a/skills/condition-based-waiting/SKILL.md +++ b/skills/condition-based-waiting/SKILL.md @@ -28,54 +28,9 @@ routing: category: process --- -# Condition-Based Waiting Skill - -## Operator Context - -This skill operates as an operator for wait and retry implementations, configuring Claude's behavior for robust condition-based polling. It implements the **Pattern Selection** architectural approach -- identify wait scenario, select pattern, implement with safety bounds, verify behavior. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before implementation -- **Over-Engineering Prevention**: Only implement the pattern directly needed. Don't add circuit breakers when simple retries suffice. Don't add health checks when a single poll works. 
-- **Always Include Timeout**: Every wait loop MUST have a maximum timeout to prevent infinite hangs -- **Always Include Max Retries**: Every retry loop MUST have a maximum retry count -- **Always Add Jitter**: Exponential backoff MUST include random jitter to prevent thundering herd -- **Never Busy Wait**: Minimum 10ms between polls for local operations, 100ms for external services -- **Descriptive Timeout Errors**: Timeout messages MUST include what was waited for and last observed state - -### Default Behaviors (ON unless disabled) -- **Progress Reporting**: Report wait progress with timeout values and retry counts -- **Cleanup on Timeout**: Cancel pending operations when timeout expires -- **Logging on Failure**: Log each retry attempt with failure reason and attempt number -- **Related Pattern Check**: Search codebase for existing wait/retry patterns to maintain consistency -- **Minimum Poll Interval Enforcement**: Enforce floor on polling frequency based on target type -- **Monotonic Clock Usage**: Use `time.monotonic()` (not `time.time()`) for elapsed time to avoid clock drift -- **Error Classification**: Separate transient from permanent errors before implementing retries - -### Optional Behaviors (OFF unless enabled) -- **Progress Callbacks**: Report progress during long waits via user-provided callback -- **Metrics Collection**: Track retry counts, wait times, and failure rates for monitoring -- **Fallback Values**: Return cached/default values when retries exhausted -- **Dead Letter Logging**: Record permanently failed operations for manual review - -## What This Skill CAN Do -- Implement condition-based polling with bounded timeouts -- Add exponential backoff with jitter to retry logic -- Handle API rate limits with Retry-After header support -- Wait for services to become healthy (TCP, HTTP, command) -- Implement circuit breaker patterns for cascade failure prevention -- Provide both Python and Bash implementations - -## What This Skill CANNOT Do -- 
Fix underlying service issues (only handles transient failures) -- Guarantee eventual success (some operations permanently fail) -- Replace proper async/await patterns for event-driven code -- Handle distributed coordination (use distributed locks instead) -- Provide real-time guarantees (polling has inherent latency) +# Condition-Based Waiting ---- - -## Quick Reference +Implement condition-based polling and retry patterns with bounded timeouts, jitter, and error classification. Select the right pattern for the scenario, implement it with safety bounds, and verify both success and failure paths. | Pattern | Use When | Key Safety Bound | |---------|----------|-----------------| @@ -85,49 +40,51 @@ This skill operates as an operator for wait and retry implementations, configuri | Health Check | Wait for service(s) to be ready | All-pass requirement + per-check status | | Circuit Breaker | Prevent cascade failures | Failure threshold + recovery timeout | ---- - ## Instructions -### Phase 1: IDENTIFY WAIT PATTERN +Before implementing any pattern, read the repository CLAUDE.md and search the codebase for existing wait/retry patterns to maintain consistency with what already exists. -**Goal**: Select the correct pattern for the use case. +### Step 1: Select the Pattern -``` -Decision Tree: +Walk this decision tree to pick the right pattern. Only implement the pattern directly needed -- do not add circuit breakers when simple retries suffice, and do not add health checks when a single poll works. +``` 1. Waiting for a condition to become true? - YES -> Simple Polling (Phase 2) + YES -> Simple Polling (Step 2) NO -> Continue 2. Retrying a failing operation? YES -> Rate-limited (429)? - YES -> Rate Limit Recovery (Phase 4) - NO -> Exponential Backoff (Phase 3) + YES -> Rate Limit Recovery (Step 5) + NO -> Exponential Backoff (Step 4) NO -> Continue 3. Waiting for a service to start? 
- YES -> Health Check Waiting (Phase 5) + YES -> Health Check Waiting (Step 6) NO -> Continue 4. Service frequently failing, need fast-fail? - YES -> Circuit Breaker (Phase 6) + YES -> Circuit Breaker (Step 7) NO -> Simple Poll or Backoff ``` -**Gate**: Pattern selected with rationale. Proceed only when gate passes. +### Step 2: Implement Simple Polling -### Phase 2: SIMPLE POLLING +Wait for a condition to become true with bounded timeout. -**Goal**: Wait for a condition to become true with bounded timeout. +1. Define the condition function (returns truthy when ready). +2. Set timeout and poll interval based on target type. Use `time.monotonic()` for elapsed time measurement -- never `time.time()`, which drifts with clock adjustments. -**Step 1**: Define the condition function (returns truthy when ready) +| Target Type | Min Interval | Typical Interval | Example | +|-------------|-------------|-----------------|---------| +| In-process state | 10ms | 50-100ms | Flag, queue, state machine | +| Local file/socket | 100ms | 500ms | File exists, port open | +| Local service | 500ms | 1-2s | Database, cache | +| Remote API | 1s | 5-10s | HTTP endpoint, cloud service | -**Step 2**: Set timeout and poll interval -- Local operations: 10-100ms poll interval, 30s timeout -- External services: 1-5s poll interval, 120s timeout +Never busy-wait (tight loop with no sleep). The minimum poll interval is 10ms for local operations, 100ms for external services. Tighter loops burn CPU, cause thermal throttling, and starve other processes. -**Step 3**: Implement with safety bounds +3. Implement with a mandatory timeout. Every wait loop must have a maximum timeout to prevent infinite hangs. The timeout error message must include what was waited for and the last observed state so the caller can diagnose failures. 
```python # Core pattern (full implementation in references/implementation-patterns.md) @@ -141,25 +98,32 @@ while time.monotonic() < deadline: raise TimeoutError(f"Timeout waiting for: {description}") ``` -**Step 4**: Test with both success and timeout scenarios +4. Report wait progress with timeout values and retry counts during the wait. Cancel pending operations when the timeout expires. +5. Test with both success and timeout scenarios. Force the condition to never become true and confirm TimeoutError fires with a descriptive message. -**Gate**: Polling works for success case AND raises TimeoutError appropriately. Proceed only when gate passes. +### Step 3: Verify Before Proceeding -### Phase 3: EXPONENTIAL BACKOFF +After implementing any pattern from Steps 2-7, verify: +- Success path works as expected +- Failure/timeout path produces a descriptive error +- Logging captures each attempt with failure reason and attempt number +- No arbitrary sleep values remain (replace `sleep(N)` with condition-based polling) -**Goal**: Retry failing operations with increasing delays and jitter. +### Step 4: Implement Exponential Backoff -**Step 1**: Identify retryable vs non-retryable errors -- Retryable: 408, 429, 500, 502, 503, 504, network timeouts, connection refused -- Non-retryable: 400, 401, 403, 404, validation errors, auth failures +Retry failing operations with increasing delays and jitter. -**Step 2**: Configure backoff parameters -- `max_retries`: 3-5 for APIs, 5-10 for infrastructure -- `initial_delay`: 0.5-2s -- `max_delay`: 30-60s -- `jitter_range`: 0.5 (adds +/-50% randomness) +1. Classify errors before implementing retries. Separate transient from permanent errors -- retrying permanent errors wastes time and quota. + - **Retryable**: 408, 429, 500, 502, 503, 504, network timeouts, connection refused + - **Non-retryable**: 400, 401, 403, 404, validation errors, auth failures -**Step 3**: Implement with jitter (MANDATORY) +2. Configure backoff parameters. 
Every retry loop must have a maximum retry count. + - `max_retries`: 3-5 for APIs, 5-10 for infrastructure + - `initial_delay`: 0.5-2s + - `max_delay`: 30-60s + - `jitter_range`: 0.5 (adds +/-50% randomness) + +3. Implement with jitter. Jitter is mandatory on all exponential backoff -- without it, all clients retry at the same instant after an outage (thundering herd), amplifying the load spike that caused the failure. ```python # Core pattern (full implementation in references/implementation-patterns.md) @@ -175,31 +139,24 @@ for attempt in range(max_retries + 1): delay = min(delay * backoff_factor, max_delay) ``` -**Step 4**: Verify retry behavior with forced failures - -**Gate**: Backoff includes jitter, respects max_retries, only retries transient errors. Proceed only when gate passes. - -### Phase 4: RATE LIMIT RECOVERY +4. Log each retry attempt with the failure reason and attempt number. Verify retry behavior with forced failures. -**Goal**: Handle HTTP 429 responses using Retry-After headers. +### Step 5: Implement Rate Limit Recovery -**Step 1**: Detect 429 status code in response +Handle HTTP 429 responses using Retry-After headers. -**Step 2**: Parse `Retry-After` header (seconds or HTTP-date format) - -**Step 3**: Wait the specified duration, then retry - -**Step 4**: Fall back to default wait (60s) if header missing +1. Detect 429 status code in response. +2. Parse `Retry-After` header (seconds or HTTP-date format). +3. Wait the specified duration, then retry. +4. Fall back to default wait (60s) if header missing. See `references/implementation-patterns.md` for full `RateLimitedClient` class. -**Gate**: Honors Retry-After header when present, uses sensible default when absent. Proceed only when gate passes. - -### Phase 5: HEALTH CHECK WAITING +### Step 6: Implement Health Check Waiting -**Goal**: Wait for services to become healthy before proceeding. +Wait for services to become healthy before proceeding. -**Step 1**: Define health checks by type +1. 
Define health checks by type: | Type | Check | Example | |------|-------|---------| @@ -207,33 +164,27 @@ See `references/implementation-patterns.md` for full `RateLimitedClient` class. | HTTP | Endpoint returns 2xx | `http://localhost:8080/health` | | Command | Exit code 0 | `pgrep -f 'celery worker'` | -**Step 2**: Set appropriate timeouts (services often need 30-120s to start) - -**Step 3**: Poll all checks, succeed only when ALL pass - -**Step 4**: Report status of each check during waiting +2. Set appropriate timeouts (services often need 30-120s to start). Use poll intervals from the target type table in Step 2. +3. Poll all checks, succeed only when ALL pass. Report status of each check during waiting so the caller can see which service is lagging. See `references/implementation-patterns.md` for full `wait_for_healthy()` implementation. -**Gate**: All health checks pass within timeout. Status reported per-check. Proceed only when gate passes. - -### Phase 6: CIRCUIT BREAKER +### Step 7: Implement Circuit Breaker -**Goal**: Prevent cascade failures by failing fast after repeated errors. +Prevent cascade failures by failing fast after repeated errors. -**Step 1**: Configure thresholds -- `failure_threshold`: Number of failures before opening (typically 5) -- `recovery_timeout`: Time before testing recovery (typically 30s) -- `half_open_max_calls`: Successful calls needed to close (typically 3) +1. Configure thresholds: + - `failure_threshold`: Number of failures before opening (typically 5) + - `recovery_timeout`: Time before testing recovery (typically 30s) + - `half_open_max_calls`: Successful calls needed to close (typically 3) -**Step 2**: Implement state machine -- CLOSED: Normal operation, count failures -- OPEN: Reject immediately, wait for recovery timeout -- HALF_OPEN: Allow test calls, close on success streak +2. 
Implement state machine: + - CLOSED: Normal operation, count failures + - OPEN: Reject immediately, wait for recovery timeout + - HALF_OPEN: Allow test calls, close on success streak -**Step 3**: Add fallback behavior for OPEN state - -**Step 4**: Test all state transitions +3. Add fallback behavior for OPEN state. +4. Test all state transitions: ``` CLOSED --(failure_threshold reached)--> OPEN @@ -244,48 +195,18 @@ HALF_OPEN --(any failure)-------------> OPEN See `references/implementation-patterns.md` for full `CircuitBreaker` class. -**Gate**: All four state transitions work correctly. Fallback provides degraded service. Proceed only when gate passes. +### Examples ---- +**Flaky test with sleep()**: User says "This test uses sleep(5) and sometimes fails in CI." Identify as Simple Polling (Step 2). Define what the test is actually waiting for, replace `sleep(5)` with `wait_for(condition, description, timeout=30)`, run test 3 times to verify reliability, then force the condition to never be true and confirm TimeoutError. -## Examples - -### Example 1: Flaky Test with sleep() -User says: "This test uses sleep(5) and sometimes fails in CI" -Actions: -1. Identify as Simple Polling pattern (Phase 1) -2. Define condition: what the test is actually waiting for (Phase 2, Step 1) -3. Replace `sleep(5)` with `wait_for(condition, description, timeout=30)` (Phase 2, Step 3) -4. Run test 3 times to verify reliability (Phase 2, Step 4) -5. Verify timeout path: force condition to never be true, confirm TimeoutError -Result: Deterministic test that adapts to execution speed - -### Example 2: API Integration with Rate Limits -User says: "Our batch job hits 429 errors from the API" -Actions: -1. Identify as Rate Limit Recovery + Exponential Backoff (Phase 1) -2. Classify errors: 429 is retryable, 400/401/404 are not (Phase 3, Step 1) -3. Add Retry-After header parsing with 60s default fallback (Phase 4) -4. 
Add exponential backoff with jitter for non-429 transient errors (Phase 3, Step 3) -5. Test: normal flow, 429 handling, exhausted retries, non-retryable errors -Result: Resilient API client that respects rate limits - -### Example 3: Service Startup in Docker Compose -User says: "App crashes because it starts before the database is ready" -Actions: -1. Identify as Health Check Waiting (Phase 1) -2. Define checks: TCP on postgres:5432, HTTP on api:8080/health (Phase 5, Step 1) -3. Set 120s timeout with 2s poll interval (Phase 5, Step 2) -4. Implement wait_for_healthy() with all-pass requirement (Phase 5, Step 3) -5. Verify: services start within timeout, timeout fires when service is down -Result: Reliable startup ordering without arbitrary sleep() +**API integration with rate limits**: User says "Our batch job hits 429 errors from the API." Classify errors: 429 is retryable, 400/401/404 are not (Step 4). Add Retry-After header parsing with 60s default fallback (Step 5). Add exponential backoff with jitter for non-429 transient errors (Step 4). Test: normal flow, 429 handling, exhausted retries, non-retryable errors. ---- +**Service startup in Docker Compose**: User says "App crashes because it starts before the database is ready." Define checks: TCP on postgres:5432, HTTP on api:8080/health (Step 6). Set 120s timeout with 2s poll interval. Implement wait_for_healthy() with all-pass requirement. Verify: services start within timeout, timeout fires when service is down. ## Error Handling ### Error: "Timeout expired before condition met" -Cause: Condition never became true within timeout window +Cause: Condition never became true within timeout window. Solution: 1. Verify condition function logic is correct 2. Increase timeout if operation legitimately needs more time @@ -293,7 +214,7 @@ Solution: 4. 
Check for deadlocks or blocked resources ### Error: "All retries exhausted" -Cause: Operation failed on every attempt including retries +Cause: Operation failed on every attempt including retries. Solution: 1. Distinguish transient from permanent errors in retryable_exceptions 2. Verify external service is actually reachable @@ -301,67 +222,14 @@ Solution: 4. Increase max_retries only if error is genuinely transient ### Error: "Circuit breaker open" -Cause: Failure threshold exceeded, circuit rejecting calls +Cause: Failure threshold exceeded, circuit rejecting calls. Solution: 1. Investigate why underlying service is failing 2. Implement fallback behavior for CircuitOpenError 3. Wait for recovery_timeout to elapse before testing 4. Consider adjusting failure_threshold for known-flaky services ---- - -## Anti-Patterns - -### Anti-Pattern 1: Arbitrary Sleep Values -**What it looks like**: `time.sleep(5)` then check result -**Why wrong**: Works on fast machines, fails under load. Wastes time when fast, races when slow. -**Do instead**: `wait_for(condition, description, timeout=30)` - -### Anti-Pattern 2: No Maximum Timeout -**What it looks like**: `while not condition(): time.sleep(0.1)` with no deadline -**Why wrong**: Hangs indefinitely if condition never met. Blocks CI pipelines forever. -**Do instead**: Always set a deadline with `time.monotonic() + timeout` - -### Anti-Pattern 3: Backoff Without Jitter -**What it looks like**: `delay *= 2` with exact exponential growth -**Why wrong**: Thundering herd -- all clients retry simultaneously, amplifying load spikes -**Do instead**: `delay * random.uniform(0.5, 1.5)` to spread retries - -### Anti-Pattern 4: Retrying Non-Retryable Errors -**What it looks like**: `except Exception: retry()` catching everything -**Why wrong**: 400/401/404 will never succeed on retry. Wastes time and quota. 
-**Do instead**: Only retry transient errors (408, 429, 500, 502, 503, 504) - -### Anti-Pattern 5: Busy Waiting -**What it looks like**: `while not ready: pass` or polling every 1ms -**Why wrong**: Burns CPU, causes thermal throttling, starves other processes -**Do instead**: Minimum 10ms for local ops, 100ms+ for external services, 1-5s for network - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "sleep(5) is long enough" | Timing assumptions break under load | Replace with condition-based polling | -| "Jitter isn't necessary for small scale" | Scale changes; thundering herd hits early | Always add jitter to backoff | -| "No need for timeout, it always succeeds" | Always ≠ will always | Add timeout with descriptive error | -| "Retry everything, it's safer" | Retrying permanent errors wastes resources | Classify retryable vs non-retryable | - -### Recommended Poll Intervals - -| Target Type | Min Interval | Typical Interval | Example | -|-------------|-------------|-----------------|---------| -| In-process state | 10ms | 50-100ms | Flag, queue, state machine | -| Local file/socket | 100ms | 500ms | File exists, port open | -| Local service | 500ms | 1-2s | Database, cache | -| Remote API | 1s | 5-10s | HTTP endpoint, cloud service | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/implementation-patterns.md`: Complete Python/Bash implementations for all patterns diff --git a/skills/content-calendar/SKILL.md b/skills/content-calendar/SKILL.md index e1d4c02e..0bb7e6e2 100644 --- a/skills/content-calendar/SKILL.md +++ b/skills/content-calendar/SKILL.md @@ -26,45 +26,7 @@ routing: # 
Content Calendar Skill -## Operator Context - -This skill operates as an operator for editorial content pipeline management, configuring Claude's behavior for tracking blog posts through ideation, drafting, editing, and publication stages. It implements the **State Machine** architectural pattern -- content moves through defined stages with recorded transitions -- with **Pipeline Intelligence** embedded in the stage enforcement. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before any pipeline operation -- **Over-Engineering Prevention**: Only perform the requested pipeline operation. No speculative reorganization, no "while I'm here" reformatting of unrelated sections -- **Single Source of Truth**: All pipeline state lives in a single `content-calendar.md` file -- never store state elsewhere -- **Stage Integrity**: Content moves forward through defined stages only; no skipping stages -- **Date Tracking**: Every stage transition records a timestamp in YYYY-MM-DD format -- **File Preservation**: Read the full calendar file before writing; never truncate or lose existing entries - -### Default Behaviors (ON unless disabled) -- **Visual Dashboard**: Show pipeline overview with progress indicators on view -- **Upcoming Awareness**: Highlight scheduled content in next 14 days -- **In-Progress Focus**: Emphasize content actively being worked on (Outlined, Drafted, Editing) -- **Recent History**: Show publications from last 30 days -- **Duplicate Warning**: Warn when adding topics with titles matching existing entries - -### Optional Behaviors (OFF unless enabled) -- **Stale Detection**: Flag content stuck in a stage for 14+ days -- **Velocity Metrics**: Show publishing rate and stage throughput statistics -- **Auto-Archive**: Move published content older than current month to Historical section - -## What This Skill CAN Do -- View current pipeline state across all stages with dashboard formatting -- Add new 
ideas to the pipeline with duplicate detection -- Move content between adjacent stages with timestamp recording -- Schedule ready content for publication dates -- Archive published content to monthly historical sections -- Parse and update the content-calendar.md file structure - -## What This Skill CANNOT Do -- Create or modify actual Hugo content files (use content creation skills) -- Publish or deploy posts (use deploy skill) -- Skip stages when moving content forward -- Send notifications or reminders about scheduled content - ---- +Manage editorial content through 6 pipeline stages: Ideas, Outlined, Drafted, Editing, Ready, Published. All pipeline state lives in a single `content-calendar.md` file -- this is the sole source of truth, never store state elsewhere. ## Instructions @@ -72,44 +34,47 @@ This skill operates as an operator for editorial content pipeline management, co **Goal**: Load and validate the current calendar state before any mutation. -**Step 1**: Read the content calendar file from the project root - -**Step 2**: Parse pipeline sections -- extract entries from Ideas, Outlined, Drafted, Editing, Ready, Published, and Historical sections. +Memory of pipeline state is unreliable -- always read the actual file, because assumed state leads to overwrites of changes made by other processes or manual edits. -**Step 3**: Validate file structure -- all required sections exist, counts match actual entries. +1. Read `content-calendar.md` from the project root. Also read the repository CLAUDE.md to ensure compliance with project-specific rules. +2. Parse pipeline sections -- extract entries from Ideas, Outlined, Drafted, Editing, Ready, Published, and Historical sections. +3. Validate file structure -- all required sections exist, counts match actual entries. **Gate**: Calendar file loaded and parsed successfully. All sections accounted for. Proceed only when gate passes. 
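The read, parse, and validate steps of Phase 1 could be sketched roughly as follows. This is a minimal illustration, not the skill's actual parser: it assumes each pipeline stage is a `## <Stage>` heading and each entry is a `- ` list item, and the names `parse_calendar` and `missing_sections` are hypothetical. The authoritative layout is whatever `references/calendar-format.md` specifies.

```python
import re

# Section names taken from the Phase 1 parse step; the heading level is an assumption.
REQUIRED_SECTIONS = [
    "Ideas", "Outlined", "Drafted", "Editing", "Ready", "Published", "Historical",
]


def parse_calendar(text):
    """Group list entries under the most recent '## <Section>' heading.

    Deeper headings (for example '### YYYY-MM' under Historical) do not start
    a new section, so their entries stay attributed to the enclosing section.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        heading = re.match(r"^##\s+(.+?)\s*$", line)
        if heading:
            current = heading.group(1)
            sections.setdefault(current, [])
        elif current is not None and line.lstrip().startswith("- "):
            sections[current].append(line.strip())
    return sections


def missing_sections(sections):
    """Return required section names that are absent from the parsed file."""
    return [name for name in REQUIRED_SECTIONS if name not in sections]
```

A caller would treat a non-empty `missing_sections(...)` result as a failed gate and stop before any mutation. Comparing `len(sections[name])` against the counts in the overview table would cover the "counts match actual entries" part of the validation.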
### Phase 2: EXECUTE OPERATION -**Goal**: Perform the requested pipeline operation with proper validation. +**Goal**: Perform the requested pipeline operation -- only the operation requested. No speculative reorganization, no "while I'm here" reformatting of unrelated sections. #### Operation: View Pipeline 1. Count entries in each stage 2. Identify upcoming scheduled content (next 14 days) -3. Identify in-progress content (Outlined, Drafted, Editing) +3. Identify in-progress content (Outlined, Drafted, Editing) -- emphasize these as actively being worked on 4. Gather recent publications (last 30 days) 5. Display dashboard with progress indicators +6. Optionally flag content stuck in a stage for 14+ days or show velocity metrics if requested #### Operation: Add Idea 1. Validate topic name is non-empty -2. Search all sections for duplicate titles; warn if found +2. Search all sections for duplicate titles (case-insensitive); warn if a matching title exists because duplicates clutter the pipeline 3. Append `- [ ] [Topic name]` to Ideas section 4. Update pipeline count in overview table #### Operation: Move Content +Content moves forward through defined stages only -- each transition represents real editorial work completed, so skipping stages misrepresents progress. + 1. Find topic in its current section (search all sections) 2. Validate target stage is the next sequential stage: - Ideas -> Outlined -> Drafted -> Editing -> Ready -> Published 3. Remove entry from current section -4. Add to target section with timestamp metadata: +4. 
Add to target section with timestamp metadata (every transition records YYYY-MM-DD): - outlined: `(outline: YYYY-MM-DD)` - drafted: `(draft: YYYY-MM-DD)` - editing: `(editing: YYYY-MM-DD)` - - ready: `(ready: YYYY-MM-DD)` -- requires scheduled publication date + - ready: `(ready: YYYY-MM-DD)` -- prompt for a scheduled publication date because content without a date clogs the pipeline and goes stale - published: `(published: YYYY-MM-DD)` 5. Update pipeline counts @@ -122,6 +87,8 @@ This skill operates as an operator for editorial content pipeline management, co #### Operation: Archive Published +Archive prevents the Published section from growing unbounded, which makes the dashboard cluttered and counts misleading. + 1. Find Published entries older than current month 2. Move to appropriate `### YYYY-MM` section in Historical 3. Update pipeline counts @@ -132,16 +99,14 @@ This skill operates as an operator for editorial content pipeline management, co **Goal**: Persist changes and verify the write succeeded. -**Step 1**: Write the updated calendar file back to disk. +Read the full calendar file before writing -- never truncate or lose existing entries. -**Step 2**: Re-read the file and verify the change is present. - -**Step 3**: Display confirmation with relevant dashboard section showing the change. +1. Write the updated calendar file back to disk. +2. Re-read the file and verify the change is present. Looking correct is not the same as being correct; the re-read proves it. +3. Display confirmation with relevant dashboard section showing the change. **Gate**: File written, re-read confirms changes persisted. Operation complete. ---- - ## Error Handling ### Error: "Calendar file not found" @@ -165,48 +130,8 @@ Solution: 2. Show the topic's current stage and the next valid stage 3. 
Ask user to confirm sequential move or move to adjacent stage ---- - -## Anti-Patterns - -### Anti-Pattern 1: Skipping Stages -**What it looks like**: Moving content directly from Ideas to Ready or Drafted to Published -**Why wrong**: Skips essential phases of work; stage transitions represent real editorial progress -**Do instead**: Move through each stage sequentially; each transition records that actual work occurred - -### Anti-Pattern 2: No Scheduled Dates on Ready Content -**What it looks like**: Content sits in Ready indefinitely without a publication date -**Why wrong**: Pipeline clogs with "ready" items that never publish; content goes stale -**Do instead**: Always prompt for a scheduled date when moving to Ready - -### Anti-Pattern 3: Ignoring Historical Archive -**What it looks like**: Published section grows unbounded month after month -**Why wrong**: Makes pipeline overview cluttered and dashboard counts misleading -**Do instead**: Archive published content monthly to Historical section - -### Anti-Pattern 4: Mutating Without Reading First -**What it looks like**: Writing calendar file based on assumed state rather than reading first -**Why wrong**: Overwrites changes made by other processes or manual edits -**Do instead**: Always read the full file before any write operation (Phase 1 is mandatory) - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I know the current state" | Memory of state ≠ actual file state | Read the calendar file first | -| "Skipping one stage is fine" | Stage transitions track real work | Enforce sequential movement | -| "The file format looks right" | 
Looking ≠ verifying | Re-read after write to confirm | -| "No one will notice stale entries" | Stale content degrades pipeline trust | Flag or archive old entries | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/pipeline-stages.md`: Detailed stage definitions and transition criteria - `${CLAUDE_SKILL_DIR}/references/calendar-format.md`: Complete file format specification with examples - `${CLAUDE_SKILL_DIR}/references/operations.md`: Detailed command reference with edge cases diff --git a/skills/content-engine/SKILL.md b/skills/content-engine/SKILL.md index 1cb479c2..7320a910 100644 --- a/skills/content-engine/SKILL.md +++ b/skills/content-engine/SKILL.md @@ -50,47 +50,17 @@ routing: # Content Engine Skill -## Operator Context +Repurpose anchor content into platform-native variants. This skill produces drafts only — it does not make API calls or publish content. Posting is handled downstream by `x-api` (single platform) or `crosspost` (multi-platform). -This skill repurposes anchor content into platform-native variants. It is not a summarization tool. Summaries do not work as social content — they lack hooks, carry the wrong register, and read identically across platforms. - -**Platform-native means:** -- Different register per platform (conversational on X, professional-but-human on LinkedIn, punchy on TikTok) -- Different structure per platform (thread vs. long-form post vs. short script vs. newsletter section) -- Different hook style per platform (open fast on X, strong first line on LinkedIn, interrupt on TikTok) -- Not the same text made shorter - -**This skill produces drafts only.** It does not make API calls or publish content. Posting is handled downstream by `x-api` (single platform) or `crosspost` (multi-platform). - -**The quality gate in Phase 4 is not optional.** Generic output with banned hype phrases or duplicated copy across platforms is not deliverable. 
The skill will rewrite and retry until both script checks pass before proceeding to Phase 5. - -### Hardcoded Behaviors (Always Apply) - -- **Do not proceed to Phase 2 if source asset AND platform target are both unspecified.** Ask. One missing is acceptable if the other provides clear direction. Both missing means there is nothing to work with. -- **Save artifacts to files.** `content_ideas.md` and `content_drafts.md` must exist as files before Phase 5 executes. Context is not an artifact. -- **No verbatim sentence shared across any two platform drafts.** Each platform variant is written from scratch using the atomic idea as input, not copied and trimmed. -- **Banned phrases are hard rejections.** If `scan-negative-framing.py --mode hype` exits non-zero, rewrite the flagged draft and re-run the check. Do not deliver the batch until exit 0. -- **X character limits are per-tweet.** Each tweet in a thread must be ≤280 characters and carry exactly one thought. Do not split mid-sentence across tweets without a natural break. -- **LinkedIn first line is the hook.** The first line must work standalone — it is the only text visible before "see more". If the first line requires the rest of the post to land, it fails. -- **TikTok/YouTube: result or interrupt first, not preamble.** Never open with "In this video I will..." or "Today we're going to...". Show the result or interrupt the pattern in the first 3 seconds of script. - -### Default Behaviors (ON unless disabled) - -- **Extract 3-7 atomic ideas from the source asset.** Fewer than 3 means underutilization. More than 7 means the asset lacks coherence and should be split. -- **Primary platform first in delivery.** If a primary platform was stated, that draft leads the delivery. -- **Flag missing details before handoff.** Any draft containing placeholder text (`[URL]`, `[handle]`, `[date]`) must be flagged at delivery with what is needed to finalize it. 
- -### Optional Behaviors (OFF unless enabled) - -- **Multi-idea series**: Extract all ideas and schedule as a series (pairs with `content-calendar`) -- **Voice profile application**: After drafting, apply a voice profile via `voice-writer` -- **Immediate publish**: After gate passes, hand off to `x-api` or `crosspost` +Platform-native means each variant is written from scratch for its target platform: different register (conversational on X, professional-but-human on LinkedIn, punchy on TikTok), different structure (thread vs. long-form post vs. short script vs. newsletter section), and different hook style (open fast on X, strong first line on LinkedIn, interrupt on TikTok). Shortening the same text for each platform is not adaptation — it produces content that reads identically everywhere and fails on every platform. --- -## Phase 1: GATHER — Collect Inputs Before Writing Anything +## Instructions -**Goal**: Establish everything needed to write platform-native variants. Do not begin writing until this phase is complete. +### Phase 1: GATHER — Collect Inputs Before Writing Anything + +Establish everything needed to write platform-native variants. Do not begin writing until this phase is complete. **Required inputs:** @@ -102,15 +72,17 @@ This skill repurposes anchor content into platform-native variants. It is not a | Goal | Awareness, conversion, recruiting, authority, launch support, engagement | Infer from source if obvious; ask otherwise | | Constraints | Character limits already observed, brand voice notes, phrases to avoid | Skip if none stated | -**Gate**: Source asset present AND at least one target platform identified. If either is missing, ask before proceeding. +**Gate**: Source asset present AND at least one target platform identified. If either is missing, ask before proceeding. Both missing means there is nothing to work with — do not guess. + +Produce only the platforms the user requested. 
If the user says "turn this into an X thread", produce an X thread. Offer to expand to other platforms in Phase 5, but do not produce unrequested variants. Do not write any content in this phase. Only collect and confirm inputs. --- -## Phase 2: EXTRACT — Identify 3-7 Atomic Ideas +### Phase 2: EXTRACT — Identify 3-7 Atomic Ideas -**Goal**: Identify the discrete, postable units inside the source asset. Each atomic idea must be able to stand alone as a post on at least one platform without requiring the reader to know the source. +Identify the discrete, postable units inside the source asset. Each atomic idea must stand alone as a post on at least one platform without requiring the reader to know the source. **Steps:** @@ -122,6 +94,8 @@ Do not write any content in this phase. Only collect and confirm inputs. 3. Rank by relevance to the stated goal 4. Write each atomic idea as one sentence maximum +Fewer than 3 ideas means the source is very narrow — proceed with what exists (minimum 1 is sufficient for a single platform) and note in the output file that the source yielded fewer than expected. More than 7 means the asset lacks coherence and should be split; ask the user which section to focus on. + **Output format for `content_ideas.md`:** ```markdown @@ -144,56 +118,56 @@ Platforms: [target platforms] [Which idea leads — the strongest for the stated goal] ``` -**Gate**: 3-7 numbered atomic ideas saved to `content_ideas.md`. Each is specific and standalone. Proceed only when file exists. +**Gate**: Numbered atomic ideas saved to `content_ideas.md`. Each is specific and standalone. The file must exist before proceeding — context is not an artifact. --- -## Phase 3: DRAFT — Write Platform-Native Variants +### Phase 3: DRAFT — Write Platform-Native Variants -**Goal**: Write one draft per target platform, each starting from the primary atomic idea (or specified idea) as raw material. 
Drafts are written from scratch per platform — not shortened or trimmed from each other. +Write one draft per target platform, each starting from the primary atomic idea (or specified idea) as raw material. + +Every draft must be written from scratch for its platform. Do not write one version and shorten or trim it for others — audiences on each platform recognize content that was not written for them. No two platform drafts may share a verbatim sentence. If the LinkedIn draft opens with "This article covers..." or the X tweet says "New post: [title]. Key points: 1, 2, 3", that is a summary, not an adaptation. Summaries give readers no reason to stop scrolling. Apply platform-specific rules (see `references/platform-specs.md` for full detail): -### X (Twitter) +#### X (Twitter) - **Register**: Conversational, direct, opinionated. No corporate voice. - **Hook**: Open fast. The first tweet carries the entire weight — if it doesn't stop the scroll, the thread dies. - **Structure for single tweet**: One idea, one sharp claim, optionally one proof point. End with a question or strong assertion, not a CTA. -- **Structure for thread**: Each tweet carries one thought. Segment at natural breaks. Number tweets only if ≥5. No cliffhanger tweets that require the next to make sense. -- **Character limit**: 280 per tweet, hard limit. +- **Structure for thread**: Each tweet carries one thought. Segment at natural breaks. Number tweets only if >=5. No cliffhanger tweets that require the next to make sense. +- **Character limit**: 280 per tweet, hard limit. Each tweet must carry exactly one thought — do not split mid-sentence across tweets without a natural break. - **Hashtags**: 0-2 maximum. Only if they add discoverability, never for decoration. - **Links**: One link only, at the end of the last tweet if needed. Not in the middle of a thread. -- **CTAs**: Optional. If present, one sentence, at the end, low pressure. +- **CTAs**: Optional. 
If present, one sentence, at the end, low pressure. Do not reuse the same CTA text used in other platform drafts. -### LinkedIn +#### LinkedIn - **Register**: Professional but human. Not corporate. Lessons and results framing over announcement framing. -- **Hook**: First line must work standalone before "see more". It is a promise, not a topic sentence. +- **Hook**: First line must work standalone before "see more" — it is the only text visible before the fold. If the first line requires the rest of the post to land, rewrite it. It is a promise, not a topic sentence. - **Structure**: First line → 2-4 short paragraphs of substance → optional takeaway or question at end. - **Length**: 150-300 words optimal. Can go to 600 if the content earns it. Not longer. - **Hashtags**: 3-5 at the end. Relevant, not decorative. -- **Links**: Put in comments, not in the post body. LinkedIn suppresses posts with external links. -- **CTAs**: One soft CTA at end if appropriate (follow for more, drop your take in comments). +- **Links**: Put in comments, not in the post body. LinkedIn suppresses posts with external links. Reference "link in comments" if needed. +- **CTAs**: One soft CTA at end if appropriate (follow for more, drop your take in comments). Match the CTA to LinkedIn norms — "Check out the full article in the comments!" reads differently here than on X. -### TikTok +#### TikTok - **Format**: Short video script (voiceover or talking-head style). -- **First 3 seconds**: Show the result, state the unexpected thing, or interrupt a pattern. Never start with "In this video...". This is the make-or-break moment. -- **Hooks matter more than summaries.** A weak hook with great content fails. A strong hook with decent content succeeds. +- **First 3 seconds**: Show the result, state the unexpected thing, or interrupt a pattern. Never start with "In this video..." or "Today we're going to..." — preamble kills retention. This is the make-or-break moment. 
- **Length**: 30-60 seconds optimal (150-300 words at speaking pace). - **Structure**: Hook (3s) → one demonstration or explanation → punchline or twist → CTA (5s max). - **No lists or headers in the script.** Write it to be spoken aloud. -### YouTube +#### YouTube - **Format**: Video script or description (specify which in the draft). -- **Show result early**: Within the first 30 seconds of script, show or state the result. Do not build to it. -- **First 3 seconds**: Same rule as TikTok — interrupt or result, not preamble. +- **Show result early**: Within the first 30 seconds of script, show or state the result. Do not build to it. Same rule as TikTok for the first 3 seconds — interrupt or result, not preamble. - **Chapter structure**: If script is >3 minutes, include chapter markers. - **Description**: 2-3 sentences that work as search-discoverable summary, then bullet points of what the video covers, then links/CTA. - **Thumbnail note**: Include one suggested thumbnail concept (visual + text overlay) with the draft. -### Newsletter +#### Newsletter - **Register**: One-on-one. Write to one person, not a list. - **Lens**: One clear angle on the idea. Not a summary — a perspective. @@ -247,21 +221,21 @@ Subject line options: [Draft] ``` -**Gate**: One draft per target platform saved to `content_drafts.md`. No two drafts share a verbatim sentence (self-check before running scripts in Phase 4). Proceed only when file exists. +**Gate**: One draft per target platform saved to `content_drafts.md`. Self-check that no two drafts share a verbatim sentence before running scripts in Phase 4. The file must exist before proceeding. --- -## Phase 4: GATE — Quality Check Before Delivery +### Phase 4: GATE — Quality Check Before Delivery -**Goal**: Mechanically verify drafts before delivery. LLM self-assessment is a secondary check only. Both script checks must exit 0. The gate cannot be bypassed. +Mechanically verify drafts before delivery. Both script checks must exit 0. 
The gate cannot be bypassed — LLM self-assessment alone ("I reviewed the drafts and they look clean") misses hype phrases in context and cannot do verbatim comparison reliably. Run the scripts. -### Check 1: Hype Phrase Scan +#### Check 1: Hype Phrase Scan ```bash python3 scripts/scan-negative-framing.py --mode hype --drafts content_drafts.md ``` -This check flags banned hype phrases. Hard-rejected phrases include: +This check flags banned hype phrases — they are hard rejections, not suggestions. Banned phrases include: - "excited to share" - "thrilled to announce" @@ -279,9 +253,11 @@ This check flags banned hype phrases. Hard-rejected phrases include: - "transformative" - "disruptive" +Opening with hype ("Excited to share our game-changing approach to...") reads as corporate noise. Replace with a specific result, number, counterintuitive claim, or observation: "We cut deploy time by 80%. Here is what actually changed." + **If exit non-zero**: Identify the flagged draft(s), rewrite only the affected sections, save to `content_drafts.md`, re-run the check. Do not proceed to Phase 5 until exit 0. -### Check 2: Cross-Platform Verbatim Check +#### Check 2: Cross-Platform Verbatim Check ```bash python3 scripts/scan-negative-framing.py --mode cross-platform --drafts content_drafts.md @@ -291,7 +267,7 @@ This check identifies any sentence appearing verbatim in two or more platform se **If exit non-zero**: Rewrite the flagged sentence(s) in one of the two platforms where they appear. The rewrite must be platform-native — not a synonym swap. Re-run the check. Do not proceed to Phase 5 until exit 0. 
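As a rough illustration of what the two script checks verify, here is a minimal sketch. It is not the source of `scripts/scan-negative-framing.py`, which this diff does not show. The function names, the naive sentence splitter, and the idea of passing drafts as a dict keyed by platform name are all hypothetical, and the phrase list is only the subset quoted in this section.

```python
import re

# Hypothetical subset: only the banned phrases quoted in this section.
HYPE_PHRASES = (
    "excited to share",
    "thrilled to announce",
    "transformative",
    "disruptive",
)


def find_hype(text):
    """Return the banned phrases that occur in text, case-insensitively."""
    lowered = text.lower()
    return [phrase for phrase in HYPE_PHRASES if phrase in lowered]


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part.strip() for part in parts if part.strip()]


def shared_sentences(drafts):
    """Map each sentence found verbatim in 2+ platform drafts to those platforms."""
    seen = {}
    for platform, text in drafts.items():
        # set() so a sentence repeated inside one draft counts once per platform
        for sentence in set(split_sentences(text)):
            seen.setdefault(sentence, []).append(platform)
    return {s: sorted(p) for s, p in seen.items() if len(p) > 1}
```

A gate built on these helpers would exit non-zero whenever `find_hype` returns a non-empty list for any draft or `shared_sentences` returns a non-empty dict. Plain substring and exact-string matching are used on purpose: the point of the gate is a deterministic answer, not a judgment call.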
-### Secondary LLM Check (after scripts pass) +#### Secondary LLM Check (after scripts pass) Once both scripts exit 0, verify: - [ ] Each draft reads natively for its platform (register, length, formatting feel right) @@ -303,16 +279,16 @@ Once both scripts exit 0, verify: --- -## Phase 5: DELIVER — Present Drafts with Posting Guidance +### Phase 5: DELIVER — Present Drafts with Posting Guidance -**Goal**: Hand off clean drafts with enough context for the user or a downstream skill to act immediately. +Hand off clean drafts with enough context for the user or a downstream skill to act immediately. **Delivery order**: Primary platform first (if specified), then remaining platforms alphabetically. **For each draft, include:** 1. The draft text (from `content_drafts.md`) 2. Optimal posting time if known (platform norms: X/LinkedIn weekdays 8-10am, TikTok evenings, Newsletter Tuesday-Thursday) -3. Any remaining placeholders that must be resolved before publishing (flag clearly) +3. Any remaining placeholders that must be resolved before publishing — flag clearly with what is needed to finalize (e.g., `[URL]` needs the published article link, `[handle]` needs the company X handle) 4. 
Suggested posting order if multiple platforms (e.g., "post X first to gauge reaction, then LinkedIn 48 hours later") **Downstream handoff options:** @@ -325,8 +301,13 @@ Once both scripts exit 0, verify: | Apply a voice profile to drafts | `voice-writer` skill (post-process these drafts) | | Extract more ideas from the same source | Re-run from Phase 2 | +**Optional behaviors** (off unless enabled by user): +- **Multi-idea series**: Extract all ideas and schedule as a series (pairs with `content-calendar`) +- **Voice profile application**: After drafting, apply a voice profile via `voice-writer` +- **Immediate publish**: After gate passes, hand off to `x-api` or `crosspost` + **Artifacts produced:** -- `content_ideas.md` — 3-7 numbered atomic ideas with ranking +- `content_ideas.md` — numbered atomic ideas with ranking - `content_drafts.md` — platform-native drafts, gate-verified, status: READY --- @@ -359,40 +340,6 @@ Solution: Proceed with what exists (minimum 1 idea is sufficient for a single pl --- -## Anti-Patterns - -### Anti-Pattern 1: Summarizing instead of adapting -**What it looks like**: LinkedIn draft that opens with "This article covers..." or X tweet that says "New post: [title]. Key points: 1, 2, 3. Link in bio." -**Why wrong**: Summaries give readers no reason to stop scrolling. They signal the platform-native content is elsewhere. -**Do instead**: Extract one atomic idea and write a post that delivers value on its own, with the link as optional context. - -### Anti-Pattern 2: Copy-paste-shorten across platforms -**What it looks like**: LinkedIn draft is the X thread with hashtags added. TikTok script is the LinkedIn draft read aloud. -**Why wrong**: Fails the cross-platform verbatim check. Also fails the native-read check — audiences on each platform recognize content that was not written for them. -**Do instead**: Start each draft from the atomic idea, not from another platform's draft. 
- -### Anti-Pattern 3: Bypassing the gate with self-assessment -**What it looks like**: "I reviewed the drafts and they look clean" without running the scripts. -**Why wrong**: LLM self-assessment misses hype phrases in context and cannot do verbatim comparison reliably. -**Do instead**: Run both script checks. If scripts are unavailable, use Grep manually and document that the automated gate was not run. - -### Anti-Pattern 4: Hype phrasing in hooks -**What it looks like**: "Excited to share our game-changing approach to..." -**Why wrong**: Fails the hype scan. Also reads as corporate noise — the audience stops reading. -**Do instead**: Open with a specific result, number, counterintuitive claim, or observation. "We cut deploy time by 80%. Here is what actually changed." - -### Anti-Pattern 5: Identical CTAs across platforms -**What it looks like**: Every draft ends with "Check out the full article in the comments!" -**Why wrong**: LinkedIn suppresses posts with external links. X readers expect links at end of threads. Newsletter readers want a different action than social followers. -**Do instead**: Match CTA to platform norms. For LinkedIn: put link in comments and reference that. For X: link at end of last tweet. For newsletter: inline or button. - -### Anti-Pattern 6: Producing all platforms when only one was requested -**What it looks like**: User says "turn this into an X thread" and skill produces X + LinkedIn + TikTok drafts. -**Why wrong**: Wastes effort and signals the skill didn't understand the request. -**Do instead**: Produce exactly what was requested. Offer to expand to other platforms in Phase 5. 
- ---- - ## References - `${CLAUDE_SKILL_DIR}/references/platform-specs.md` — Character limits, format rules, and posting norms per platform diff --git a/skills/create-voice/SKILL.md b/skills/create-voice/SKILL.md index 13ce32d1..b2c67b60 100644 --- a/skills/create-voice/SKILL.md +++ b/skills/create-voice/SKILL.md @@ -48,55 +48,13 @@ Create a complete voice profile from writing samples through a 7-phase pipeline. --- -## Operator Context - -This skill operates as an operator for voice creation workflows, configuring Claude's behavior for guiding users through the complete pipeline from raw writing samples to a validated, authorship-matching voice skill. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before starting any work -- **No Existing File Modification**: NEVER modify voice_analyzer.py, voice_validator.py, banned-patterns.json, voice-calibrator, voice-orchestrator, or any existing skill/script. The existing tools work. This skill only creates new files in `skills/voice-{name}/` -- **Wabi-Sabi Throughout**: Natural imperfections are features, not bugs. This principle applies at EVERY phase, not as an afterthought. See `skills/shared-patterns/wabi-sabi-authenticity.md` -- **50-Sample Minimum**: Do not proceed past Step 1 without 50+ writing samples. The system tried with 3-10 and FAILED. 50+ is where it starts working. WHY: LLMs are pattern matchers, and rules tell AI what to do but samples show AI what the voice looks like. V7-V9 had correct rules but failed authorship matching (0/5 roasters). V10 passed 5/5 because it had 100+ categorized samples. -- **Deterministic Before AI**: Always run script-based analysis before AI interpretation. WHY: Scripts produce reproducible, quantitative baselines. AI interpretation without data drifts toward generic patterns. -- **Artifacts Over Memory**: Save outputs to files at every phase. Context is ephemeral; files persist. 
- -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report progress with phase status banners. Be direct about what passed or failed, not congratulatory -- **Phase Gates**: Enforce GATE checkpoints between all phases. Do not proceed if the gate condition is unmet -- **Iteration Limits**: Maximum 3 validation/refinement iterations before escalating to user -- **Validation Pipeline**: Run both `voice_validator.py validate` and `voice_validator.py check-banned` during Step 6 -- **Reference Loading**: Show users any existing voice implementation as a concrete example of "done" - -### Optional Behaviors (OFF unless enabled) -- **Strict Mode**: Require 5/5 roaster match instead of 4/5 minimum -- **Verbose Analysis**: Show full voice_analyzer.py JSON output inline instead of summary -- **Cross-Voice Comparison**: Compare new profile against existing voice profiles -- **Batch Sample Import**: Automated scraping/organization of samples from a single source - -## What This Skill CAN Do -- Guide users through the complete 7-phase voice creation pipeline -- Organize writing samples into the correct directory structure -- Run deterministic analysis via `scripts/voice_analyzer.py` -- Identify voice patterns, phrase fingerprints, and wabi-sabi markers using AI interpretation of samples + metrics -- Generate complete voice skill files (SKILL.md, profile.json, config.json) following the voice-calibrator template -- Validate generated content against the new profile using `scripts/voice_validator.py` -- Iterate on failures with targeted fixes (max 3 iterations) - -## What This Skill CANNOT Do -- Generate content in a voice (use `voice-orchestrator` after creation) -- Modify existing voice skills or profiles (use `voice-calibrator`) -- Scrape writing samples from the internet automatically (user must provide samples) -- Guarantee authorship matching will pass (depends on sample quality and quantity) -- Skip deterministic analysis (scripts MUST run before AI 
interpretation) -- Create a voice from fewer than 50 samples (hard minimum, see Hardcoded Behaviors) - ---- - ## Instructions ### Overview -The pipeline has 7 phases. Each phase produces artifacts and has a gate that must pass before proceeding. +Read and follow the repository CLAUDE.md before starting any work. + +The pipeline has 7 phases. Each phase produces artifacts saved to files (because context is ephemeral; files persist) and has a gate that must pass before proceeding. Report progress with phase status banners at each gate. Be direct about what passed or failed, not congratulatory. | Phase | Name | Artifact | Gate | |-------|------|----------|------| @@ -114,53 +72,9 @@ The pipeline has 7 phases. Each phase produces artifacts and has a gate that mus **Goal**: Build a corpus of real writing that captures the full range of the person's voice. -**Why 50+ samples**: The voice system learned this the hard way. 3-10 samples capture IDEAS but not EXECUTION STYLE. Authorship matching detects execution style. V7-V9 had the right rules but failed 0/5 because the model had too few examples to absorb the actual texture of the writing. V10 passed 5/5 with 100+ categorized samples. More is better; 50 is the minimum where it starts working. - -#### Where to Find Samples - -| Source | What to Look For | File Naming | -|--------|-----------------|-------------| -| Reddit history | Comments, posts, replies | `reddit-samples-YYYY-MM-DD.md` | -| Hacker News | Comments, Ask HN answers | `hn-samples-YYYY-MM-DD.md` | -| Blog posts | Published articles | `blog-samples.md` | -| Forum posts | Any discussion forum | `forum-samples-YYYY-MM-DD.md` | -| Emails | Professional and casual | `email-samples.md` | -| Chat logs | Slack, Discord, iMessage | `chat-samples.md` | -| Social media | Twitter/X threads | `social-samples.md` | - -#### Sample Quality Guidelines - -- **Mix of lengths**: Very short (1 sentence), short (2-3 sentences), medium (paragraph), long (multi-paragraph). 
The distribution matters because most people write short responses most of the time. -- **Mix of contexts**: Technical, casual, disagreement, agreement, teaching, joking, emotional. Different contexts reveal different facets of voice. -- **Mix of topics**: Not all about the same subject. Topic diversity reveals stable voice patterns vs topic-specific patterns. -- **DO NOT clean up samples**: Typos, run-on sentences, fragments, loose punctuation ARE the voice. Cleaning destroys authenticity markers. This is the wabi-sabi principle in action at the very first step. -- **DO NOT cherry-pick**: Include mediocre posts alongside great ones. The mundane reveals default patterns. - -#### Directory Setup - -```bash -mkdir -p skills/voice-{name}/references/samples/ -``` - -Place all sample files in `skills/voice-{name}/references/samples/`. Each file should contain multiple samples, separated by `---` or clear headers. - -#### Sample File Format - -Each sample file should preserve the original writing exactly: - -```markdown -# Reddit Samples - 2025-12-30 +Do NOT proceed past this step without 50+ samples, because the system tried with 3-10 and FAILED. 50+ is where it starts working. LLMs are pattern matchers -- rules tell AI what to do but samples show AI what the voice looks like. V7-V9 had correct rules but failed authorship matching (0/5 roasters). V10 passed 5/5 because it had 100+ categorized samples. -## r/subreddit - Thread Title -[Exact text of comment, typos and all] - ---- - -## r/subreddit - Another Thread -[Exact text] - ---- -``` +See `references/sample-collection.md` for the "Where to Find Samples" table, "Sample Quality Guidelines", "Directory Setup", and "Sample File Format". **GATE**: Count the samples. If fewer than 50 distinct writing samples exist across all files, STOP. Tell the user how many more are needed and where to find them. Do NOT proceed. 
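The sample-count gate can be checked mechanically. The sketch below is an assumption-laden illustration, not part of the skill's tooling: it assumes samples are separated by lines containing only `---`, as in the sample file format shown in this diff, and the helper names `count_samples`, `has_body`, and `samples_needed` are hypothetical.

```python
from pathlib import Path

MIN_SAMPLES = 50  # hard minimum stated in the gate


def has_body(chunk):
    """True if the chunk has at least one non-blank line that is not a Markdown heading."""
    return any(
        line.strip() and not line.lstrip().startswith("#")
        for line in chunk.splitlines()
    )


def count_samples(text):
    """Count chunks with real content between lines that consist only of '---'."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.strip() == "---":
            chunks.append("\n".join(current))
            current = []
        else:
            current.append(line)
    chunks.append("\n".join(current))  # trailing chunk after the last separator
    return sum(1 for chunk in chunks if has_body(chunk))


def samples_needed(sample_dir):
    """Return (total samples found, how many more are needed to reach the minimum)."""
    total = sum(
        count_samples(path.read_text(encoding="utf-8"))
        for path in Path(sample_dir).glob("*.md")
    )
    return total, max(0, MIN_SAMPLES - total)
```

Calling `samples_needed("skills/voice-{name}/references/samples")` with the real voice name substituted gives the number to report back to the user when the gate fails. Files that separate samples with headers alone, which the format also allows, would be undercounted by this sketch.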
@@ -177,7 +91,7 @@ Phase 1/7: COLLECT **Goal**: Extract quantitative voice metrics from the samples using `voice_analyzer.py`. -**Why deterministic first**: Scripts produce reproducible numbers. AI interpretation without data drifts toward "sounds like a normal person" rather than capturing what makes THIS person distinctive. The numbers ground everything that follows. +Always run script-based analysis before AI interpretation, because scripts produce reproducible, quantitative baselines. AI interpretation without data drifts toward "sounds like a normal person" rather than capturing what makes THIS person distinctive. The numbers ground everything that follows. #### Run the Analyzer @@ -233,99 +147,9 @@ Phase 2/7: EXTRACT **Goal**: Using the samples + profile.json, identify the distinctive patterns that make this voice THIS voice and not generic writing. -**Why AI-assisted**: The script extracts WHAT (numbers). This step identifies WHY those numbers are what they are and what distinctive PATTERNS produce them. A high contraction rate is a number; "uses contractions even in technical explanations, creating casual authority" is a pattern. - -#### Phrase Fingerprints (CRITICAL) - -Read through ALL samples and identify 10+ distinctive phrases that recur. These are what authorship-matching roasters use to confirm identity. - -Look for: -- **Signature openers**: How do they start responses? ("I think the issue is...", "So basically...", "Here's what I've found...") -- **Signature closers**: How do they end? ("but we'll see", "does that help?", "anyway, that's my take") -- **Filler phrases**: Verbal tics that appear across contexts ("For what it's worth", "to be fair", "honestly") -- **Hedging patterns**: How they express uncertainty ("probably", "I suspect", "my guess is") -- **Emphasis patterns**: How they stress a point ("the key thing is", "the part people miss") - -Document each fingerprint with 2-3 exact quotes from the samples showing it in context. 
- -#### Thinking Patterns - -How does this person reason? This is deeper than style; it's cognitive architecture. - -Common patterns to check for: -- **Concede-then-assert**: "That's fair, but..." (acknowledges opposing view, then states own position) -- **Hypothesis-experiment**: "My theory is... I tried... and found..." -- **Systems framing**: "The way this works is..." (explains mechanisms, not just opinions) -- **Experience-based**: "In my experience..." (grounds claims in personal observation) -- **Question-led**: "The question is..." (frames issues as questions to investigate) -- **Analogy-driven**: Uses metaphors and comparisons from specific domains - -#### Response Length Distribution - -From the samples, estimate what percentage of responses fall into each bucket: -- Very short (1 sentence): ____% -- Short (2-3 sentences): ____% -- Medium (4-6 sentences): ____% -- Long (paragraph+): ____% - -This distribution is critical because most people write short responses most of the time, and AI tends to generate medium-to-long responses by default. +The script extracted WHAT (numbers). This step identifies WHY those numbers are what they are and what distinctive PATTERNS produce them. A high contraction rate is a number; "uses contractions even in technical explanations, creating casual authority" is a pattern. -#### Natural Typos (Authenticity Markers) - -Scan samples for 5+ real typos. Document them with the correct spelling. These become wabi-sabi markers that signal authenticity. - -Examples of what to look for: -- Missing apostrophes ("dont" instead of "don't") -- Common word swaps ("there" for "their") -- Dropped letters ("probabl" for "probably") -- Double-typed characters ("tthe") -- Missing spaces after punctuation ("works.But") - -#### Wabi-Sabi Markers - -Identify which "imperfections" ARE the voice. These are not bugs to fix; they are features to preserve. - -- Run-on sentences: Does this person chain clauses with commas? 
-- Fragments: Do they use sentence fragments for emphasis? -- Loose punctuation: Is comma usage inconsistent? Is that part of the texture? -- Self-corrections: Do they change direction mid-sentence? ("Well, actually..." or "I mean,") -- Tangential asides: Do they go on tangents? (Parenthetical digressions?) - -#### Linguistic Architectures - -Beyond sentence-level patterns, identify the **structural moves** that operate across sentences and paragraphs. These are what AI erases most aggressively — the model defaults to claim → evidence → hedged conclusion regardless of the writer's actual architecture. - -Analyze ALL samples (not just a few) for each dimension: - -**Argument Architecture** — How does the writer build a case? -- **Direction**: Inductive (examples → conclusion) vs deductive (claim → evidence) vs mixed? Where does the main claim appear relative to supporting evidence? -- **Escalation**: Do stakes increase through the piece? Narrow → broad? Low → high severity? -- **Ending reframe**: Does the ending restate the opening, or transform it into something new? - -Document with exact quotes: "In {N} of {M} samples, the writer builds inductively — evidence first, conclusion last. Example from sample X: [quote showing the build]" - -**Concession Architecture** — How does the writer handle "yes, but"? -- **Structure**: Short admission → pivot? Long qualification → reversal? Never concedes? -- **Pivot markers**: Which words signal the turn? ("but", "though", "the thing is", "and yet", "that said") -- **Position**: Where do concessions appear? Opening? Mid-argument? Never at the end? - -Document with exact quotes: "Concessions follow a [short admission → blunt pivot] pattern. Example: [quote showing concession shape]" - -**Analogy Architecture** — Where do metaphors and references come from? -- **Source domains**: Which fields? (cooking, construction, sports, warfare, nature, machinery, music, software, etc.) -- **Deployment**: Are analogies used to open? 
To explain mid-section? To close with a memorable image? -- **Density**: Every post? Rarely? Only for technical concepts? - -Document with exact quotes: "Analogies cluster around {domains}. Example: [quote showing analogy from that domain]" - -**Bookend Architecture** — How do pieces open and close? -- **Opening moves**: Question? Declarative claim? Anecdote? Provocation? Scene-setting? -- **Closing moves**: Reframe? Fragment punch? Circle back to opening? Call to action? Open question? -- **Symmetry**: Does the closing echo or answer the opening? - -Document with exact quotes: "Opens with [pattern] in {N}/{M} samples. Closes with [pattern]. Opening and closing are [symmetric/independent]." - -**Note**: Not all writers exhibit all 4 architectures. Tweet-only writers may not have argument or bookend architecture. The gate requires 2 of 4, not 4 of 4. +See `references/pattern-identification.md` for detailed guidance on "Phrase Fingerprints", "Thinking Patterns", "Response Length Distribution", "Natural Typos", "Wabi-Sabi Markers", and all 4 "Linguistic Architectures" (Argument, Concession, Analogy, Bookend) with documentation templates. **GATE**: At least 10 phrase fingerprints documented with exact quotes. At least 3 thinking patterns identified. Response length distribution estimated. At least 5 natural typos found. Wabi-sabi markers identified. At least 2 of 4 linguistic architectures documented with evidence quotes. @@ -346,100 +170,9 @@ Phase 3/7: PATTERN **Goal**: Transform the patterns identified in Step 3 into actionable rules for the voice skill. -**Why rules AND samples**: Rules set boundaries. Samples show execution. You need both, but samples do the heavy lifting. Rules prevent the worst failures (AI phrases, wrong structure). Samples guide the model toward authentic output. The voice-calibrator learned this through 10 versions of iteration. - -#### What This Voice IS (Positive Identity) - -Write 4-6 core traits with examples from the samples. 
Use probability dampening to avoid caricature: - -- **"subtly" skeptical** not "skeptical" -- dampens the trait so it appears naturally, not performatively -- **"generally" conversational** not "conversational" -- allows for variation -- **"slightly" self-deprecating** not "self-deprecating" -- prevents over-application - -For each trait, include 2-3 exact quotes from samples that demonstrate it. - -WHY probability dampening: Without adverb modifiers, the model cranks traits to 100%. "Skeptical" becomes every-sentence-is-a-challenge. "Conversational" becomes aggressively casual. The dampening keeps traits at natural frequency. - -#### What This Voice IS NOT (Contrastive Identity) - -Build a contrastive table showing THIS voice vs Generic AI for at least 6 aspects: - -| Aspect | This Voice | Generic AI | -|--------|-----------|------------| -| Opening | [Example from samples] | "In today's rapidly evolving landscape..." | -| Uncertainty | [How they express doubt] | "It's worth noting that perspectives may vary" | -| Agreement | [How they agree] | "I absolutely agree with your insightful point" | -| Disagreement | [How they disagree] | "While there are valid concerns, I would respectfully suggest..." | -| Conclusion | [How they end] | "In conclusion, we have explored..." | -| Technical | [Technical style] | "This represents a robust paradigm for..." | - -#### Hard Prohibitions - -Identify patterns this voice NEVER uses. Apply attention anchoring (bold) to all negative constraints because the model pays more attention to bolded text: - -Common prohibitions to evaluate: -- **Em-dashes**: Does this person ever use them? If not, FORBIDDEN -- **Formal transitions**: "However", "Furthermore", "Moreover", "Additionally", "Consequently" -- **AI-typical phrases**: "Let's dive in", "Here's the thing", "delve", "robust", "leverage", "ecosystem" -- **The "It's not X. It's Y" pattern**: Signature AI structure. 
Almost always prohibited -- **Excessive hedging**: "It's worth noting", "One might argue", "At the end of the day" - -For each prohibition, explain WHY it's prohibited for this specific voice (not just "because it's AI-sounding"). - -#### Wabi-Sabi Rules - -Which "errors" MUST be preserved? This is the inversion of typical quality rules: - -- If they write run-on sentences: "Allow comma-chain sentences up to {N} words when expressing enthusiasm or building arguments" -- If they use fragments: "Target {X}% fragment rate for emphasis and pacing" -- If punctuation is loose: "Do not standardize comma usage; match the inconsistent pattern from samples" -- If they self-correct: "Include at least one visible direction change per long-form response" +Rules set boundaries while samples show execution. You need both, but samples do the heavy lifting, because V7-V9 had detailed rules and failed 0/5 authorship matching -- V10 passed with samples. Rules prevent the worst failures (AI phrases, wrong structure). Samples guide the model toward authentic output. -#### Anti-Essay Patterns - -Most voices are NOT essay-writers. Identify the structural anti-patterns: - -- Staccato rhythm? (Short sentences dominating) -- No signposting? (No "First... Second... Third...") -- Single-sentence paragraphs? (Common in chat/forum) -- No introduction/conclusion structure? (Just starts talking) -- Abrupt endings? (No wrap-up, just stops) - -#### Architectural Patterns - -Translate the linguistic architectures discovered in Step 3 into actionable rules for the voice skill. These go in a dedicated `## Architectural Patterns` section in the generated SKILL.md, between the identity rules and banned patterns. - -For each documented architecture, write a rule: - -```markdown -## Architectural Patterns - -### Argument Flow -[Inductive/Deductive/Mixed] — [one-sentence description] -Build arguments by [specific instruction]. The main claim should appear [position]. 
-Example from samples: "[exact quote showing the pattern]" - -### Concessions -Structure: [short admission → pivot / long qualification → reversal / never concedes] -Pivot markers: [list of words this voice uses] -Example: "[exact quote]" - -### Analogy Domains -Primary: [domain1, domain2] -Deployment: [where analogies appear — openers? mid-section? closers?] -Density: [frequency] -**NEVER draw analogies from**: [domains this voice avoids] -Example: "[exact quote]" - -### Bookends -Opening move: [pattern] -Closing move: [pattern] -Symmetry: [yes/no/sometimes] -Example opening: "[quote]" -Example closing: "[quote]" -``` - -Omit any architecture that wasn't documented in Step 3. A missing section is better than a fabricated one. +See `references/voice-rules-template.md` for the full "What This Voice IS" positive identity format, the "What This Voice IS NOT" contrastive table template, "Hard Prohibitions" checklist, "Wabi-Sabi Rules", "Anti-Essay Patterns", and the "Architectural Patterns" template with all 4 rule sections (Argument Flow, Concessions, Analogy Domains, Bookends). **GATE**: Positive identity has 4+ traits with dampening adverbs. Contrastive table covers 6+ aspects. At least 3 hard prohibitions defined. Wabi-sabi rules specify which imperfections to preserve. Anti-essay patterns documented. Architectural patterns documented for each architecture identified in Step 3. @@ -460,138 +193,13 @@ Phase 4/7: RULE **Goal**: Generate the complete voice skill files following the voice-calibrator template. -**Why the template matters**: The template structure (lines 1063-1512 of `pipelines/voice-calibrator/SKILL.md`) was refined over 10 iterations. It embeds prompt engineering best practices (attention anchoring, probability dampening, XML context tags, few-shot examples for prohibitions). Deviating from the template means losing those lessons. - -#### Files to Create - -1. **`skills/voice-{name}/SKILL.md`** -- The voice skill itself (2000+ lines) -2. 
**`skills/voice-{name}/config.json`** -- Validation configuration -3. **`skills/voice-{name}/profile.json`** -- Already created in Step 2 - -#### SKILL.md Structure - -Follow the template from voice-calibrator. The sections, in order of importance by line count: - -| Section | Target Lines | Why This Size | -|---------|-------------|---------------| -| Extensive Authentic Samples | 400+ | THIS IS WHAT MAKES AUTHORSHIP MATCHING WORK. V7-V9 failed with rules-only. V10 passed with 100+ samples. | -| Voice Metrics (from profile.json) | ~100 | Quantitative targets give the model measurable goals, not vague aspirations | -| Rules and Prohibitions | ~200 | Hard constraints prevent the worst AI tells | -| Phrase Fingerprints | ~50 | Exact phrases that roasters use to confirm identity | -| Generation Protocol | ~100 | Pre/During/Post checklists keep output consistent | -| Natural Typos | ~20 | Authenticity markers that signal human-ness | -| Contrastive Examples | ~50 | Shows the model what NOT to do with concrete alternatives | -| Thinking Patterns | ~80 | Cognitive architecture, not just surface style | - -**Total: 2000+ lines minimum. Most should be SAMPLES, not rules.** - -#### SKILL.md Frontmatter - -```yaml ---- -name: voice-{name} -user-invocable: false -allowed-tools: - - Read - - Write - - Bash - - Grep - - Glob - - Edit - - Task - - Skill -description: | - Apply {Name}'s voice profile for content generation: [2-3 key traits], - and modal writing. Use when generating content that must match {Name}'s - distinctive voice. Do NOT use for voice analysis, voice profile creation, - or generating content in other voices. 
-version: 1.0.0 -command: /voice-{name} ---- -``` - -#### SKILL.md Operator Context - -Include standard operator context sections: - -- **Hardcoded Behaviors**: CLAUDE.md compliance, voice fidelity, wabi-sabi principle, data integrity (never modify curated content) -- **Default Behaviors**: Voice validation via script, em-dash prohibition (if applicable), mode selection -- **Optional Behaviors**: Strict mode, A/B testing - -#### Sample Organization (THE MOST IMPORTANT SECTION) - -Organize samples from Step 1 into the SKILL.md following this structure: +NEVER modify voice_analyzer.py, voice_validator.py, banned-patterns.json, voice-calibrator, voice-orchestrator, or any existing skill/script, because the existing tools work. This skill only creates new files in `skills/voice-{name}/`. -**By Response Length:** -- Very short (1 sentence, ~{X}% of responses): Include 10+ samples -- Short (2-3 sentences, ~{X}% of responses): Include 15+ samples -- Medium (4-6 sentences, ~{X}% of responses): Include 10+ samples -- Long (paragraph+, ~{X}% of responses): Include 5+ samples +Before generating, show users any existing voice implementation in `skills/voice-*/` as a concrete example of "done", because reference implementations ground expectations. -**By Pattern Type:** -- Admitting Mistakes: 5+ samples -- Acknowledging Limits: 5+ samples -- Respectful Disagreement: 5+ samples -- Technical Expertise (delivered casually): 5+ samples -- Strong Opinions (unhedged): 5+ samples -- Casual Closers: 5+ samples -- Sarcasm/Wit (if applicable): examples +Follow the template structure from voice-calibrator (lines 1063-1512 of `pipelines/voice-calibrator/SKILL.md`), because it was refined over 10 iterations and embeds prompt engineering best practices (attention anchoring, probability dampening, XML context tags, few-shot examples for prohibitions). Deviating from the template means losing those lessons. 
-#### Voice Metrics Section - -Transfer the profile.json data into human-readable tables with targets and tolerances: - -```markdown -| Metric | Target | Tolerance | Notes | -|--------|--------|-----------|-------| -| Average sentence length | {X} words | +/- 2 words | Primary rhythm indicator | -| Short sentences (3-10 words) | {X}% | +/- 5% | For emphasis and pacing | -``` - -#### Two-Layer Architecture - -Design the skill with two layers: - -- **Layer A (Always-On Base Voice)**: Core traits, sentence rhythm, punctuation signature, contraction rate, function word signature. These apply to ALL content regardless of mode. -- **Layer B (Mode-Specific Overlays)**: Different modes (e.g., technical, casual, opinion, review) that adjust tone, formality, and structure while keeping Layer A constant. - -#### Prompt Engineering Techniques (Apply Throughout) - -These techniques were validated over 10 iterations of a reference voice: - -1. **Probability Dampening**: Use "**subtly**", "**slightly**", "**generally**" before traits. WHY: Without dampening, the model cranks traits to 100% -2. **Attention Anchoring**: **Bold** all negative constraints. WHY: The model pays more attention to formatted text -3. **XML Context Tags**: Use `` for directives and `` for prohibitions. WHY: Structured tags signal instruction priority to the model -4. **Few-Shot Examples**: Include 3+ examples for every prohibition (especially "It's not X. It's Y"). WHY: Rules without examples are abstract; examples are concrete -5. **Contrastive Pairs**: For every "DO" include a "DON'T" with concrete text. 
WHY: The model needs to see both sides of the boundary - -#### config.json - -Create the validation configuration: - -```json -{ - "name": "{Name}", - "version": "1.0.0", - "description": "{Brief voice description}", - "modes": ["technical", "casual", "opinion"], - "validation": { - "strict_banned_patterns": true, - "em_dash_forbidden": true, - "metric_tolerance": 0.20, - "required_checks": ["banned_phrases", "punctuation", "rhythm"], - "optional_checks": ["metrics", "sentence_starters", "opening_pattern"] - }, - "thresholds": { - "pass_score": 70, - "error_max": 0, - "warning_max": 3 - }, - "voice_specific_patterns": [] -} -``` - -Adjust `em_dash_forbidden`, `modes`, and `pass_score` based on the specific voice's characteristics. +See `references/skill-generation.md` for "Files to Create", the "SKILL.md Structure" table (sections by line count), "SKILL.md Frontmatter", "Sample Organization" (by length and by pattern type), "Voice Metrics Section" format, "Two-Layer Architecture", "Prompt Engineering Techniques" (5 validated techniques), and the `config.json` template. **GATE**: `SKILL.md` exists with 2000+ lines. Samples section has 400+ lines. All template sections present (samples, metrics, rules, fingerprints, protocol, typos, contrastive examples, thinking patterns). `config.json` exists with valid JSON. Frontmatter has correct fields. @@ -610,7 +218,9 @@ Phase 5/7: GENERATE **Goal**: Generate test content using the new skill, then validate it against the profile using deterministic scripts. -**Why validate with scripts, not self-assessment**: Self-assessment drifts. The model will convince itself the output sounds right. Scripts measure whether sentence length, punctuation density, and contraction rate actually match the targets. Objective measurement prevents rationalization. +Validate with scripts, not self-assessment, because self-assessment drifts. The model will convince itself the output sounds right. 
Scripts measure whether sentence length, punctuation density, and contraction rate actually match the targets. Objective measurement prevents rationalization. + +Run both `voice_validator.py validate` and `voice_validator.py check-banned` during this step. #### Generate Test Content @@ -650,7 +260,7 @@ python3 ~/.claude/scripts/voice_validator.py check-banned \ | 50-59 with warnings only | MARGINAL | Review warnings, fix if simple, or proceed | | < 50 or errors present | FAIL | Identify top 3 violations, fix in SKILL.md, regenerate, revalidate | -**Important wabi-sabi check**: If validation flags natural imperfections as errors (run-on sentences, fragments, loose punctuation that match the samples), the validator threshold needs adjusting in config.json, NOT the content. The authentic writing scored what it scored; synthetic content should match, not exceed. +**Wabi-sabi check**: If validation flags natural imperfections as errors (run-on sentences, fragments, loose punctuation that match the samples), adjust the validator threshold in config.json, NOT the content, because the authentic writing scored what it scored and synthetic content should match it, not exceed it. If the original writing "fails" validation, the validator is wrong, not the writing. #### If Validation Fails @@ -661,7 +271,7 @@ python3 ~/.claude/scripts/voice_validator.py check-banned \ - A **false positive** (natural imperfection flagged as error) -- adjust config.json thresholds 4. Make targeted fixes (one at a time, not wholesale rewrites) 5. Regenerate test content and revalidate -6. Maximum 3 iterations +6. Maximum 3 validation/refinement iterations before escalating to user **GATE**: At least one test piece scores 60+ with 0 errors (script pass threshold is 60, calibrated against real human writing). No banned pattern violations. If failed after 3 iterations, proceed to Step 7 with best score and report issues. 
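The outcome table and gate above can be sketched as a small decision function. This is a minimal illustration, not part of `voice_validator.py`: the function name `classify_validation` and its two arguments are hypothetical, and it assumes the validator reports a numeric score and an error count.

```python
def classify_validation(score: int, errors: int) -> str:
    """Map a validator score and error count to the outcome described above.

    Hypothetical helper: assumes the score and error count have already
    been read from the validator's report.
    """
    # Any error, or a score below 50, is a FAIL regardless of anything else.
    if errors > 0 or score < 50:
        return "FAIL"
    # The gate requires 60+ with 0 errors.
    if score >= 60:
        return "PASS"
    # 50-59 with no errors (warnings only) is the MARGINAL band.
    return "MARGINAL"


if __name__ == "__main__":
    for score, errors in [(72, 0), (55, 0), (65, 1), (40, 0)]:
        print(score, errors, classify_validation(score, errors))
```

Checking errors before the score keeps a high-scoring piece with an error from slipping through as a pass, which matches the "0 errors" condition in the gate.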
@@ -679,9 +289,9 @@ Phase 6/7: VALIDATE ### Step 7: ITERATE -- Refine Until Authentic -**Goal**: Test the voice against human judgment through authorship matching. This is the ultimate quality gate. +**Goal**: Test the voice against human judgment through authorship matching, because metrics measure surface features but humans detect deeper patterns -- the "feel" of a voice. A piece can pass all metrics and still feel synthetic. Do not treat validation passing as completion. -**Why authorship matching**: Metrics measure surface features. Humans detect deeper patterns -- the "feel" of a voice. A piece can pass all metrics and still feel synthetic. Authorship matching catches what metrics miss. +Maximum 3 iterations in this step before escalating to user. #### The Authorship Matching Test @@ -693,7 +303,7 @@ Phase 6/7: VALIDATE #### If Authorship Matching Fails -The answer is almost always MORE SAMPLES, not more rules. +The answer is almost always MORE SAMPLES, not more rules, because adding "just one more rule" was tried through V7-V9 and never worked -- what worked was adding 100+ categorized samples in V10. | Failure Pattern | Diagnosis | Fix | |----------------|-----------|-----| @@ -724,6 +334,7 @@ Before declaring the voice complete, verify: - [ ] Typos from the natural typos list appear occasionally (not forced) - [ ] Content does NOT read like polished professional writing (unless the original voice IS polished) - [ ] If content is too perfect, the skill needs MORE samples and LOOSER constraints, not fewer +- [ ] If generated content "feels too rough," compare against original samples before adjusting -- if it matches the samples' roughness, it's correct **GATE**: 4/5 roasters say SAME AUTHOR. If roaster test is not feasible, use self-assessment checklist: Does the generated content feel like reading the original samples? Could you tell them apart? If yes (you can tell them apart), more work is needed. 
@@ -737,7 +348,7 @@ Phase 7/7: ITERATE --- -## Final Output +### Final Output After all phases complete: @@ -804,60 +415,13 @@ After all phases complete: ### Error: "Wabi-sabi violations flagged as errors" **Cause**: Validator is flagging natural imperfections that are actually part of the voice. -**Solution**: Adjust config.json thresholds, NOT the content. If the authentic writing "fails" validation, the validator is wrong, not the writing. See `skills/shared-patterns/wabi-sabi-authenticity.md` for the full pattern. - ---- - -## Anti-Patterns - -### Do Not Clean Up Samples - -**What it looks like**: Fixing typos, completing fragments, or reformatting samples before analysis. -**Why it's wrong**: Those "imperfections" ARE the voice. Cleaning them removes authenticity markers and produces a sanitized profile that generates sterile content. -**Do instead**: Keep samples exactly as written. Document the imperfections as features in Step 3. - -### Do Not Skip Deterministic Analysis - -**What it looks like**: Going straight from samples to AI pattern identification without running voice_analyzer.py. -**Why it's wrong**: Without quantitative baselines, AI interpretation drifts toward "sounds like a normal person" rather than capturing what's distinctive. Numbers ground the analysis. -**Do instead**: Always run the script first. Use the numbers to guide pattern identification. - -### Do Not Over-Rule, Under-Sample - -**What it looks like**: Writing 500 lines of rules and including 100 lines of samples. -**Why it's wrong**: V7-V9 had 500+ lines of rules and failed authorship matching. V10 had 100+ samples and passed 5/5. LLMs are pattern matchers -- they learn from examples, not instructions. -**Do instead**: Target 400+ lines of samples, 200 lines of rules. Samples are the bulk. - -### Do Not Manufacture Imperfections - -**What it looks like**: Adding random typos or fragments that don't appear in the original samples. 
-**Why it's wrong**: Manufactured imperfections feel forced. Authentic imperfections have patterns (the same typos recur, fragments appear in specific contexts). Forced imperfections are as detectable as forced perfection. -**Do instead**: Only include imperfections observed in the actual samples. Document where each one came from. - -### Do Not Skip Authorship Matching - -**What it looks like**: Declaring the voice complete after validation passes without testing against human judgment. -**Why it's wrong**: Metrics measure surface features. A piece can pass all metrics and still feel synthetic to a human reader. The metrics are necessary but not sufficient. -**Do instead**: Always run the authorship matching test, even informally. If roasters aren't available, do a self-assessment: shuffle generated and original samples, then try to sort them. If you can easily tell them apart, more work is needed. +**Solution**: Adjust config.json thresholds, NOT the content. If the authentic writing "fails" validation, the validator is wrong, not the writing. --- -## Anti-Rationalization +## References -| Rationalization | Why It's Wrong | Required Action | -|----------------|----------------|-----------------| -| "30 samples should be enough" | The system tried with 3-10 and FAILED. 50 is the empirically validated minimum. | Collect 50+ samples before proceeding past Step 1 | -| "The rules are detailed enough, samples are optional" | V7-V9 had detailed rules and failed 0/5 authorship matching. V10 passed with samples. | Samples are mandatory. 400+ lines in SKILL.md | -| "I'll clean up the samples for consistency" | Cleaning removes authenticity markers. Typos and fragments ARE the voice | Keep samples exactly as written | -| "Validation passed, so the voice is done" | Metrics measure surface. Humans detect deeper patterns. 
Passing metrics != sounding authentic | Run authorship matching or self-assessment | -| "I can skip the analyzer and identify patterns manually" | AI interpretation without data drifts toward generic patterns | Run voice_analyzer.py FIRST, always | -| "The imperfections make it look bad" | Perfection is the enemy of authenticity. Sterile content is an AI tell | Preserve wabi-sabi markers from samples | -| "Just one more rule will fix the authorship matching" | The answer is almost always more samples, not more rules | Add 20-50 more samples before adding rules | -| "The generated content is too rough" | If it matches the original samples' roughness, it's correct. Over-polishing destroys the voice | Compare against samples, not against "good writing" | - ---- - -## Reference Implementations +### Reference Implementations Study any existing voice profile in `skills/voice-*/` to understand what "done" looks like. A complete voice profile contains: @@ -868,9 +432,7 @@ Study any existing voice profile in `skills/voice-*/` to understand what "done" | `skills/voice-{name}/config.json` | ~20 lines | Validation configuration structure | | `skills/voice-{name}/profile.json` | ~80 lines | Profile structure from voice_analyzer.py | -Create your own voice profiles with `/create-voice`. - -## Components This Skill Delegates To +### Components This Skill Delegates To | Component | Type | What It Does | When Called | |-----------|------|-------------|-------------| @@ -880,4 +442,3 @@ Create your own voice profiles with `/create-voice`. 
| `scripts/voice_validator.py check-banned` | Script | Quick banned pattern check | Step 6: VALIDATE | | `scripts/data/banned-patterns.json` | Data | AI pattern database used by validator | Step 6 (via validator) | | `pipelines/voice-calibrator/SKILL.md` | Skill | Voice skill template (lines 1063-1554, including the validation checklist) | Step 5: GENERATE (template reference) | -| `skills/shared-patterns/wabi-sabi-authenticity.md` | Pattern | Wabi-sabi principle reference | All steps | diff --git a/skills/create-voice/references/pattern-identification.md b/skills/create-voice/references/pattern-identification.md new file mode 100644 index 00000000..01dc4443 --- /dev/null +++ b/skills/create-voice/references/pattern-identification.md @@ -0,0 +1,119 @@ +# Pattern Identification Guide + +Detailed reference for Step 3: PATTERN of the create-voice pipeline. + +--- + +## Phrase Fingerprints (CRITICAL) + +Read through ALL samples and identify 10+ distinctive phrases that recur. These are what authorship-matching roasters use to confirm identity. + +Look for: +- **Signature openers**: How do they start responses? ("I think the issue is...", "So basically...", "Here's what I've found...") +- **Signature closers**: How do they end? ("but we'll see", "does that help?", "anyway, that's my take") +- **Filler phrases**: Verbal tics that appear across contexts ("For what it's worth", "to be fair", "honestly") +- **Hedging patterns**: How they express uncertainty ("probably", "I suspect", "my guess is") +- **Emphasis patterns**: How they stress a point ("the key thing is", "the part people miss") + +Document each fingerprint with 2-3 exact quotes from the samples showing it in context. + +--- + +## Thinking Patterns + +How does this person reason? This is deeper than style; it's cognitive architecture. + +Common patterns to check for: +- **Concede-then-assert**: "That's fair, but..." 
(acknowledges opposing view, then states own position) +- **Hypothesis-experiment**: "My theory is... I tried... and found..." +- **Systems framing**: "The way this works is..." (explains mechanisms, not just opinions) +- **Experience-based**: "In my experience..." (grounds claims in personal observation) +- **Question-led**: "The question is..." (frames issues as questions to investigate) +- **Analogy-driven**: Uses metaphors and comparisons from specific domains + +--- + +## Response Length Distribution + +From the samples, estimate what percentage of responses fall into each bucket: +- Very short (1 sentence): ____% +- Short (2-3 sentences): ____% +- Medium (4-6 sentences): ____% +- Long (paragraph+): ____% + +This distribution is critical because most people write short responses most of the time, and AI tends to generate medium-to-long responses by default. + +--- + +## Natural Typos (Authenticity Markers) + +Scan samples for 5+ real typos. Document them with the correct spelling. These become wabi-sabi markers that signal authenticity. Do NOT treat these as errors to correct, because those "imperfections" ARE the voice -- perfection is an AI tell. + +Examples of what to look for: +- Missing apostrophes ("dont" instead of "don't") +- Common word swaps ("there" for "their") +- Dropped letters ("probabl" for "probably") +- Double-typed characters ("tthe") +- Missing spaces after punctuation ("works.But") + +--- + +## Wabi-Sabi Markers + +Identify which "imperfections" ARE the voice. These are not bugs to fix; they are features to preserve. + +- Run-on sentences: Does this person chain clauses with commas? +- Fragments: Do they use sentence fragments for emphasis? +- Loose punctuation: Is comma usage inconsistent? Is that part of the texture? +- Self-corrections: Do they change direction mid-sentence? ("Well, actually..." or "I mean,") +- Tangential asides: Do they go on tangents? (Parenthetical digressions?) 
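The response-length buckets described earlier in this guide could be estimated mechanically before reviewing samples by hand. The sketch below is only an illustration: the helper names are hypothetical, the sentence splitter is deliberately naive (it will miscount samples with unusual punctuation), and treating "Long (paragraph+)" as seven or more sentences is an assumption.

```python
import re
from collections import Counter

BUCKETS = ("very_short", "short", "medium", "long")


def count_sentences(text: str) -> int:
    """Naive sentence count: split on runs of '.', '!' or '?'."""
    parts = [p for p in re.split(r"[.!?]+", text) if p.strip()]
    # A sample with no terminal punctuation still counts as one sentence.
    return max(1, len(parts))


def bucket_for(sentence_count: int) -> str:
    """Bucket boundaries follow the guide; 7+ sentences for 'long' is assumed."""
    if sentence_count <= 1:
        return "very_short"
    if sentence_count <= 3:
        return "short"
    if sentence_count <= 6:
        return "medium"
    return "long"


def length_distribution(samples: list[str]) -> dict[str, float]:
    """Return the percentage of samples that fall into each bucket."""
    if not samples:
        return {bucket: 0.0 for bucket in BUCKETS}
    counts = Counter(bucket_for(count_sentences(s)) for s in samples)
    return {b: round(100 * counts[b] / len(samples), 1) for b in BUCKETS}


if __name__ == "__main__":
    demo = ["yeah that works", "Tried it. Didn't help.", "One. Two. Three. Four."]
    print(length_distribution(demo))
```

The percentages are a starting estimate for the blanks in the distribution list; samples are read as-is and never cleaned, so rough punctuation will skew the count and the numbers should be checked by eye.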
+ +--- + +## Linguistic Architectures + +Beyond sentence-level patterns, identify the **structural moves** that operate across sentences and paragraphs. These are what AI erases most aggressively -- the model defaults to claim -> evidence -> hedged conclusion regardless of the writer's actual architecture. + +Analyze ALL samples (not just a few) for each dimension: + +### Argument Architecture + +How does the writer build a case? +- **Direction**: Inductive (examples -> conclusion) vs deductive (claim -> evidence) vs mixed? Where does the main claim appear relative to supporting evidence? +- **Escalation**: Do stakes increase through the piece? Narrow -> broad? Low -> high severity? +- **Ending reframe**: Does the ending restate the opening, or transform it into something new? + +Document with exact quotes: "In {N} of {M} samples, the writer builds inductively -- evidence first, conclusion last. Example from sample X: [quote showing the build]" + +### Concession Architecture + +How does the writer handle "yes, but"? +- **Structure**: Short admission -> pivot? Long qualification -> reversal? Never concedes? +- **Pivot markers**: Which words signal the turn? ("but", "though", "the thing is", "and yet", "that said") +- **Position**: Where do concessions appear? Opening? Mid-argument? Never at the end? + +Document with exact quotes: "Concessions follow a [short admission -> blunt pivot] pattern. Example: [quote showing concession shape]" + +### Analogy Architecture + +Where do metaphors and references come from? +- **Source domains**: Which fields? (cooking, construction, sports, warfare, nature, machinery, music, software, etc.) +- **Deployment**: Are analogies used to open? To explain mid-section? To close with a memorable image? +- **Density**: Every post? Rarely? Only for technical concepts? + +Document with exact quotes: "Analogies cluster around {domains}. 
Example: [quote showing analogy from that domain]" + +### Bookend Architecture + +How do pieces open and close? +- **Opening moves**: Question? Declarative claim? Anecdote? Provocation? Scene-setting? +- **Closing moves**: Reframe? Fragment punch? Circle back to opening? Call to action? Open question? +- **Symmetry**: Does the closing echo or answer the opening? + +Document with exact quotes: "Opens with [pattern] in {N}/{M} samples. Closes with [pattern]. Opening and closing are [symmetric/independent]." + +--- + +## Architecture Coverage Note + +Not all writers exhibit all 4 architectures. Tweet-only writers may not have argument or bookend architecture. The gate requires 2 of 4, not 4 of 4. diff --git a/skills/create-voice/references/sample-collection.md b/skills/create-voice/references/sample-collection.md new file mode 100644 index 00000000..2254b042 --- /dev/null +++ b/skills/create-voice/references/sample-collection.md @@ -0,0 +1,57 @@ +# Sample Collection Guide + +Detailed reference for Step 1: COLLECT of the create-voice pipeline. + +--- + +## Where to Find Samples + +| Source | What to Look For | File Naming | +|--------|-----------------|-------------| +| Reddit history | Comments, posts, replies | `reddit-samples-YYYY-MM-DD.md` | +| Hacker News | Comments, Ask HN answers | `hn-samples-YYYY-MM-DD.md` | +| Blog posts | Published articles | `blog-samples.md` | +| Forum posts | Any discussion forum | `forum-samples-YYYY-MM-DD.md` | +| Emails | Professional and casual | `email-samples.md` | +| Chat logs | Slack, Discord, iMessage | `chat-samples.md` | +| Social media | Twitter/X threads | `social-samples.md` | + +--- + +## Sample Quality Guidelines + +- **Mix of lengths**: Very short (1 sentence), short (2-3 sentences), medium (paragraph), long (multi-paragraph). The distribution matters because most people write short responses most of the time. +- **Mix of contexts**: Technical, casual, disagreement, agreement, teaching, joking, emotional. 
Different contexts reveal different facets of voice. +- **Mix of topics**: Not all about the same subject. Topic diversity reveals stable voice patterns vs topic-specific patterns. +- **DO NOT clean up samples**: Typos, run-on sentences, fragments, loose punctuation ARE the voice, because cleaning destroys authenticity markers. This is the wabi-sabi principle (natural imperfections are features, not bugs) in action at the very first step. +- **DO NOT cherry-pick**: Include mediocre posts alongside great ones, because the mundane reveals default patterns. + +--- + +## Directory Setup + +```bash +mkdir -p skills/voice-{name}/references/samples/ +``` + +Place all sample files in `skills/voice-{name}/references/samples/`. Each file should contain multiple samples, separated by `---` or clear headers. + +--- + +## Sample File Format + +Each sample file should preserve the original writing exactly: + +```markdown +# Reddit Samples - 2025-12-30 + +## r/subreddit - Thread Title +[Exact text of comment, typos and all] + +--- + +## r/subreddit - Another Thread +[Exact text] + +--- +``` diff --git a/skills/create-voice/references/skill-generation.md b/skills/create-voice/references/skill-generation.md new file mode 100644 index 00000000..078ef197 --- /dev/null +++ b/skills/create-voice/references/skill-generation.md @@ -0,0 +1,142 @@ +# Skill Generation Guide + +Detailed reference for Step 5: GENERATE of the create-voice pipeline. + +--- + +## Files to Create + +1. **`skills/voice-{name}/SKILL.md`** -- The voice skill itself (2000+ lines) +2. **`skills/voice-{name}/config.json`** -- Validation configuration +3. **`skills/voice-{name}/profile.json`** -- Already created in Step 2 + +--- + +## SKILL.md Structure + +The sections, in order of importance by line count. 
Target 400+ lines of samples and ~200 lines of rules, because V7-V9 had 500+ lines of rules with 100 lines of samples and failed authorship matching -- V10 inverted that ratio and passed 5/5: + +| Section | Target Lines | Why This Size | +|---------|-------------|---------------| +| Extensive Authentic Samples | 400+ | THIS IS WHAT MAKES AUTHORSHIP MATCHING WORK. V7-V9 failed with rules-only. V10 passed with 100+ samples. | +| Voice Metrics (from profile.json) | ~100 | Quantitative targets give the model measurable goals, not vague aspirations | +| Rules and Prohibitions | ~200 | Hard constraints prevent the worst AI tells | +| Phrase Fingerprints | ~50 | Exact phrases that roasters use to confirm identity | +| Generation Protocol | ~100 | Pre/During/Post checklists keep output consistent | +| Natural Typos | ~20 | Authenticity markers that signal human-ness | +| Contrastive Examples | ~50 | Shows the model what NOT to do with concrete alternatives | +| Thinking Patterns | ~80 | Cognitive architecture, not just surface style | + +**Total: 2000+ lines minimum. Most should be SAMPLES, not rules.** Do NOT pad with verbose rules to hit the line count -- the goal is 2000+ lines of USEFUL content, primarily samples. + +--- + +## SKILL.md Frontmatter + +```yaml +--- +name: voice-{name} +user-invocable: false +allowed-tools: + - Read + - Write + - Bash + - Grep + - Glob + - Edit + - Task + - Skill +description: | + Apply {Name}'s voice profile for content generation: [2-3 key traits], + and modal writing. Use when generating content that must match {Name}'s + distinctive voice. Do NOT use for voice analysis, voice profile creation, + or generating content in other voices. 
+version: 1.0.0 +command: /voice-{name} +--- +``` + +--- + +## Sample Organization (THE MOST IMPORTANT SECTION) + +Organize samples from Step 1 into the SKILL.md following this structure: + +**By Response Length:** +- Very short (1 sentence, ~{X}% of responses): Include 10+ samples +- Short (2-3 sentences, ~{X}% of responses): Include 15+ samples +- Medium (4-6 sentences, ~{X}% of responses): Include 10+ samples +- Long (paragraph+, ~{X}% of responses): Include 5+ samples + +**By Pattern Type:** +- Admitting Mistakes: 5+ samples +- Acknowledging Limits: 5+ samples +- Respectful Disagreement: 5+ samples +- Technical Expertise (delivered casually): 5+ samples +- Strong Opinions (unhedged): 5+ samples +- Casual Closers: 5+ samples +- Sarcasm/Wit (if applicable): examples + +--- + +## Voice Metrics Section + +Transfer the profile.json data into human-readable tables with targets and tolerances: + +```markdown +| Metric | Target | Tolerance | Notes | +|--------|--------|-----------|-------| +| Average sentence length | {X} words | +/- 2 words | Primary rhythm indicator | +| Short sentences (3-10 words) | {X}% | +/- 5% | For emphasis and pacing | +``` + +--- + +## Two-Layer Architecture + +Design the skill with two layers: + +- **Layer A (Always-On Base Voice)**: Core traits, sentence rhythm, punctuation signature, contraction rate, function word signature. These apply to ALL content regardless of mode. +- **Layer B (Mode-Specific Overlays)**: Different modes (e.g., technical, casual, opinion, review) that adjust tone, formality, and structure while keeping Layer A constant. + +--- + +## Prompt Engineering Techniques (Apply Throughout) + +These techniques were validated over 10 iterations of a reference voice: + +1. **Probability Dampening**: Use "**subtly**", "**slightly**", "**generally**" before traits. Without dampening, the model cranks traits to 100% +2. **Attention Anchoring**: **Bold** all negative constraints. The model pays more attention to formatted text +3. 
**XML Context Tags**: Use `` for directives and `` for prohibitions. Structured tags signal instruction priority to the model +4. **Few-Shot Examples**: Include 3+ examples for every prohibition (especially "It's not X. It's Y"). Rules without examples are abstract; examples are concrete +5. **Contrastive Pairs**: For every "DO" include a "DON'T" with concrete text. The model needs to see both sides of the boundary + +--- + +## config.json Template + +Create the validation configuration: + +```json +{ + "name": "{Name}", + "version": "1.0.0", + "description": "{Brief voice description}", + "modes": ["technical", "casual", "opinion"], + "validation": { + "strict_banned_patterns": true, + "em_dash_forbidden": true, + "metric_tolerance": 0.20, + "required_checks": ["banned_phrases", "punctuation", "rhythm"], + "optional_checks": ["metrics", "sentence_starters", "opening_pattern"] + }, + "thresholds": { + "pass_score": 70, + "error_max": 0, + "warning_max": 3 + }, + "voice_specific_patterns": [] +} +``` + +Adjust `em_dash_forbidden`, `modes`, and `pass_score` based on the specific voice's characteristics. diff --git a/skills/create-voice/references/voice-rules-template.md b/skills/create-voice/references/voice-rules-template.md new file mode 100644 index 00000000..630e5d31 --- /dev/null +++ b/skills/create-voice/references/voice-rules-template.md @@ -0,0 +1,108 @@ +# Voice Rules Template + +Detailed reference for Step 4: RULE of the create-voice pipeline. + +--- + +## What This Voice IS (Positive Identity) + +Write 4-6 core traits with examples from the samples. Use probability dampening to avoid caricature: + +- **"subtly" skeptical** not "skeptical" -- dampens the trait so it appears naturally, not performatively +- **"generally" conversational** not "conversational" -- allows for variation +- **"slightly" self-deprecating** not "self-deprecating" -- prevents over-application + +For each trait, include 2-3 exact quotes from samples that demonstrate it. 
+ +Use dampening adverbs because without them the model cranks traits to 100%. "Skeptical" becomes every-sentence-is-a-challenge. "Conversational" becomes aggressively casual. + +--- + +## What This Voice IS NOT (Contrastive Identity) + +Build a contrastive table showing THIS voice vs Generic AI for at least 6 aspects: + +| Aspect | This Voice | Generic AI | +|--------|-----------|------------| +| Opening | [Example from samples] | "In today's rapidly evolving landscape..." | +| Uncertainty | [How they express doubt] | "It's worth noting that perspectives may vary" | +| Agreement | [How they agree] | "I absolutely agree with your insightful point" | +| Disagreement | [How they disagree] | "While there are valid concerns, I would respectfully suggest..." | +| Conclusion | [How they end] | "In conclusion, we have explored..." | +| Technical | [Technical style] | "This represents a robust paradigm for..." | + +--- + +## Hard Prohibitions + +Identify patterns this voice NEVER uses. Apply attention anchoring (**bold**) to all negative constraints because the model pays more attention to bolded text: + +Common prohibitions to evaluate: +- **Em-dashes**: Does this person ever use them? If not, FORBIDDEN +- **Formal transitions**: "However", "Furthermore", "Moreover", "Additionally", "Consequently" +- **AI-typical phrases**: "Let's dive in", "Here's the thing", "delve", "robust", "leverage", "ecosystem" +- **The "It's not X. It's Y" pattern**: Signature AI structure. Almost always prohibited +- **Excessive hedging**: "It's worth noting", "One might argue", "At the end of the day" + +For each prohibition, explain WHY it's prohibited for this specific voice (not just "because it's AI-sounding"). + +--- + +## Wabi-Sabi Rules + +Which "errors" MUST be preserved? This is the inversion of typical quality rules. 
Only include imperfections actually observed in the samples, because manufactured imperfections feel forced and are as detectable as forced perfection: + +- If they write run-on sentences: "Allow comma-chain sentences up to {N} words when expressing enthusiasm or building arguments" +- If they use fragments: "Target {X}% fragment rate for emphasis and pacing" +- If punctuation is loose: "Do not standardize comma usage; match the inconsistent pattern from samples" +- If they self-correct: "Include at least one visible direction change per long-form response" + +--- + +## Anti-Essay Patterns + +Most voices are NOT essay-writers. Identify the structural anti-patterns: + +- Staccato rhythm? (Short sentences dominating) +- No signposting? (No "First... Second... Third...") +- Single-sentence paragraphs? (Common in chat/forum) +- No introduction/conclusion structure? (Just starts talking) +- Abrupt endings? (No wrap-up, just stops) + +--- + +## Architectural Patterns Template + +Translate the linguistic architectures discovered in Step 3 into actionable rules for the voice skill. These go in a dedicated `## Architectural Patterns` section in the generated SKILL.md, between the identity rules and banned patterns. + +For each documented architecture, write a rule: + +```markdown +## Architectural Patterns + +### Argument Flow +[Inductive/Deductive/Mixed] — [one-sentence description] +Build arguments by [specific instruction]. The main claim should appear [position]. +Example from samples: "[exact quote showing the pattern]" + +### Concessions +Structure: [short admission → pivot / long qualification → reversal / never concedes] +Pivot markers: [list of words this voice uses] +Example: "[exact quote]" + +### Analogy Domains +Primary: [domain1, domain2] +Deployment: [where analogies appear — openers? mid-section? closers?] 
+Density: [frequency] +**NEVER draw analogies from**: [domains this voice avoids] +Example: "[exact quote]" + +### Bookends +Opening move: [pattern] +Closing move: [pattern] +Symmetry: [yes/no/sometimes] +Example opening: "[quote]" +Example closing: "[quote]" +``` + +Omit any architecture that wasn't documented in Step 3. A missing section is better than a fabricated one. diff --git a/skills/cron-job-auditor/SKILL.md b/skills/cron-job-auditor/SKILL.md index 22a18795..da8818eb 100644 --- a/skills/cron-job-auditor/SKILL.md +++ b/skills/cron-job-auditor/SKILL.md @@ -24,48 +24,7 @@ routing: # Cron Job Auditor Skill -## Operator Context - -This skill operates as an operator for cron script auditing workflows, configuring Claude's behavior for deterministic, checklist-driven static analysis. It implements the **Systematic Inspection** architectural pattern -- discover scripts, audit against best practices, report findings -- with **Domain Intelligence** embedded in cron-specific reliability patterns. 
- -### Hardcoded Behaviors (Always Apply) -- **Read-Only**: Only read and analyze script files; never execute them -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before auditing -- **Pattern-Based Detection**: Use regex for reliable, reproducible checks -- **Structured Output**: Produce machine-parseable PASS/FAIL/WARN results -- **Severity Classification**: Every finding gets CRITICAL, HIGH, MEDIUM, or LOW -- **No Auto-Fix**: Report problems with recommendations; do not modify scripts - -### Default Behaviors (ON unless disabled) -- **Full Checklist**: Run all 9 best-practice checks on every script -- **Actionable Recommendations**: Provide specific code fixes for every failure -- **Score Calculation**: Report pass/total as percentage -- **Recursive Discovery**: Search `scripts/`, `cron/`, `jobs/` directories for `.sh` files -- **Shebang Validation**: Verify scripts start with `#!/bin/bash` or equivalent - -### Optional Behaviors (OFF unless enabled) -- **Strict Mode**: Treat MEDIUM/LOW findings as failures (raise exit code) -- **Custom Patterns**: Add project-specific checks beyond the standard 9 -- **Crontab Schedule Analysis**: Parse crontab entries for scheduling conflicts -- **JSON Output**: Emit results as JSON instead of human-readable report - -## What This Skill CAN Do -- Detect missing error handling, logging, lock files, and cleanup traps -- Check for explicit PATH/environment setup (cron has minimal defaults) -- Identify scripts vulnerable to concurrent execution -- Verify log rotation prevents unbounded disk growth -- Provide copy-paste code snippets to fix every finding -- Audit multiple scripts in a single pass with aggregate scoring - -## What This Skill CANNOT Do -- Execute scripts or validate runtime behavior -- Parse crontab scheduling syntax (focus is script content) -- Check external dependencies or verify services are running -- Test notification delivery (email, webhook, Slack) -- Analyze complex control flow beyond 
pattern matching -- Replace a full shell linter (shellcheck) for syntax issues - ---- +Static analysis of cron and scheduled job scripts against a 9-point reliability checklist. Produces structured PASS/FAIL/WARN results with severity classification (CRITICAL, HIGH, MEDIUM, LOW) and paste-ready code fixes for every finding. Audits are read-only and pattern-based -- scripts are never executed, because cron scripts may delete data, send emails, or modify production state. ## Instructions @@ -73,23 +32,25 @@ This skill operates as an operator for cron script auditing workflows, configuri **Goal**: Locate all cron/scheduled scripts to audit. -**Step 1: Identify target scripts** +**Step 1: Read repository CLAUDE.md** (if present) to understand project conventions before auditing. + +**Step 2: Identify target scripts** -If the user provides specific paths, use those. Otherwise search: +If the user provides specific paths, use those. Otherwise search these directories recursively: ``` scripts/*.sh, cron/*.sh, jobs/*.sh, bin/*.sh ``` Also check for scripts referenced in crontab files, Makefiles, or CI configs. -**Step 2: Validate targets** +**Step 3: Validate targets** For each discovered file: - Confirm it exists and is readable - Check it has a shell shebang (`#!/bin/bash`, `#!/bin/sh`, `#!/usr/bin/env bash`) -- Skip non-shell files (Python cron jobs, etc.) with a note +- Skip non-shell files (Python cron jobs, etc.) with a note -- this skill audits shell scripts only; it cannot replace shellcheck for syntax issues or analyze complex control flow beyond pattern matching -**Step 3: Log discovery results** +**Step 4: Log discovery results** ```markdown ## Scripts Found @@ -102,14 +63,16 @@ For each discovered file: ### Phase 2: AUDIT -**Goal**: Run every check against every script. No shortcuts. +**Goal**: Run every check against every script. 
Run all 9 checks regardless of script size or apparent simplicity -- small scripts grow, and missing basics cause production incidents. **Step 1: Read each script fully** -Read the entire file content. Do not sample or skip sections. +Read the entire file content. Do not sample or skip sections. If the script sources a common library file (`source ...` or `. ...`), read the sourced file too -- patterns provided by sourced libraries count as PASS (with a note indicating the source). **Step 2: Run the 9-point checklist** +Use regex pattern matching for reliable, reproducible detection. Verify matches are not inside comments (`# ...`) before counting them -- when a match appears in a comment or string, note reduced confidence rather than silently accepting it. + | # | Check | Patterns | Severity | |---|-------|----------|----------| | 1 | Error handling | `set -e`, `set -o errexit`, `\|\| exit` | CRITICAL | @@ -124,7 +87,7 @@ Read the entire file content. Do not sample or skip sections. For each check, record: - PASS with line number where pattern found, OR -- FAIL/WARN with specific recommendation +- FAIL/WARN with specific recommendation including a paste-ready code snippet (findings without fixes create work without guidance) **Step 3: Calculate score** @@ -138,7 +101,7 @@ Classify scripts: 90-100% Excellent, 70-89% Good, 50-69% Needs Work, <50% Critic ### Phase 3: REPORT -**Goal**: Produce structured, actionable audit output. +**Goal**: Produce structured, actionable audit output. Do not modify any scripts -- report problems with recommendations only. **Step 1: Format per-script results** @@ -155,7 +118,7 @@ SCORE: 7/9 (78%) - Good **Step 2: Provide recommendations** -For every FAIL and WARN, provide a specific code snippet the user can paste: +Every FAIL and WARN must include a specific code snippet the user can paste. Keep recommendations proportional to the script's scope -- suggest lock files, not monitoring frameworks. 
```bash # Recommendation: Add lock file @@ -180,28 +143,6 @@ Most common gap: Lock files (3/4 scripts missing) **Gate**: Every finding has a recommendation. Report is complete. Audit is done. ---- - -## Examples - -### Example 1: Single Script Audit -User says: "Audit the backup cron script" -Actions: -1. Read `scripts/backup.sh`, verify shebang (DISCOVER) -2. Run 9-point checklist, record PASS/FAIL per check (AUDIT) -3. Format report with score and recommendations (REPORT) -Result: Structured report with actionable fixes - -### Example 2: Repository-Wide Audit -User says: "Check all our cron jobs for best practices" -Actions: -1. Glob for `.sh` files in `scripts/`, `cron/`, `jobs/` (DISCOVER) -2. Audit each script against full checklist (AUDIT) -3. Per-script reports plus aggregate summary (REPORT) -Result: Comprehensive audit with prioritized remediation list - ---- - ## Error Handling ### Error: "No Shell Scripts Found" @@ -232,52 +173,8 @@ Solution: 2. Read the sourced file for the missing patterns 3. If patterns exist in sourced libraries, mark as PASS with note ---- - -## Anti-Patterns - -### Anti-Pattern 1: Executing Scripts to Test Them -**What it looks like**: Running the cron script to see if it "works" -**Why wrong**: Cron scripts may delete data, send emails, or modify production state -**Do instead**: Static analysis only. Read the file, match patterns, report. - -### Anti-Pattern 2: Skipping Checks Because Script Is "Simple" -**What it looks like**: "This is just a 5-line script, no need for lock files" -**Why wrong**: Simple scripts grow. Missing basics cause production incidents. -**Do instead**: Run all 9 checks regardless of script size. - -### Anti-Pattern 3: Recommending Over-Engineering -**What it looks like**: Suggesting Prometheus alerting for a log cleanup script -**Why wrong**: Recommendations should match script scope and complexity -**Do instead**: Provide proportional fixes. Lock file yes, monitoring framework no. 
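The comment-aware pattern matching that the Phase 2 checklist calls for could be sketched roughly as follows. This is an illustration, not the auditor's actual implementation: `check_patterns` and `ERROR_HANDLING` are hypothetical names, the regexes are one possible encoding of the error-handling row (`set -e`, `set -o errexit`, `|| exit`), and only whole-line comments are detected (trailing comments and string literals are not handled).

```python
import re

# Hypothetical encoding of the "Error handling" checklist row.
# The first pattern also accepts flag clusters such as `set -euo pipefail`.
ERROR_HANDLING = [
    r"\bset\s+-[A-Za-z]*e",
    r"\bset\s+-o\s+errexit\b",
    r"\|\|\s*exit\b",
]


def check_patterns(script_text, patterns):
    """Return (status, line_number, note) for one checklist item.

    PASS - a pattern matched on a non-comment line
    WARN - a pattern matched only inside a comment (reduced confidence)
    FAIL - no pattern matched anywhere
    """
    comment_hit = None
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        is_comment = line.lstrip().startswith("#")
        for pattern in patterns:
            if not re.search(pattern, line):
                continue
            if is_comment:
                # Remember the first comment-only match, but keep searching
                # for a real match further down the script.
                if comment_hit is None:
                    comment_hit = lineno
                continue
            return ("PASS", lineno, "")
    if comment_hit is not None:
        return ("WARN", comment_hit, "pattern found only inside a comment")
    return ("FAIL", None, "pattern not found")
```

The script is only read as text and never executed, which matches the read-only constraint of the audit.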
- -### Anti-Pattern 4: Ignoring Sourced Dependencies -**What it looks like**: Marking FAIL because `set -e` is in a sourced common.sh -**Why wrong**: Many teams use shared library files sourced at script start -**Do instead**: Check `source` and `.` commands, read sourced files for patterns. - -### Anti-Pattern 5: Reporting Without Recommendations -**What it looks like**: "FAIL: No error handling" with no suggested fix -**Why wrong**: Findings without fixes create work without guidance -**Do instead**: Every FAIL/WARN must include a paste-ready code snippet. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Script is too simple to audit" | Simple scripts cause outages too | Run full 9-point checklist | -| "It works in production already" | Working ≠ reliable under failure | Audit for failure-mode handling | -| "Lock files are overkill" | Concurrent cron runs cause data corruption | Always check for concurrency safety | -| "Logging slows things down" | Debugging blind cron failures wastes hours | Verify logging with timestamps | - ### Best Practices Reference ```bash diff --git a/skills/data-analysis/SKILL.md b/skills/data-analysis/SKILL.md index ab25198a..0353c8c0 100644 --- a/skills/data-analysis/SKILL.md +++ b/skills/data-analysis/SKILL.md @@ -48,45 +48,7 @@ routing: # Data Analysis Skill -## Operator Context - -This skill operates as an operator for decision-first data analysis, configuring Claude's behavior for structured analytical reasoning with statistical rigor. 
It implements a **Decision-First Framework** -- every analysis begins with the decision being supported, works backward to the evidence required, and only then touches the data. This prevents the common failure mode where analysis produces impressive summaries that answer the wrong question. - -Core thesis: **"Analysis without a decision is just arithmetic."** - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution. Project instructions override default behaviors. -- **Over-Engineering Prevention**: Analyze what was asked. No speculative analyses, no "while I'm at it" tangents into unrelated metrics. -- **Decision-First Ordering**: ALWAYS establish the decision context (Phase 1) before loading data (Phase 3). Starting with data produces technically correct but practically useless analysis because the patterns found may not map to the decision-maker's options. -- **Separate Extraction from Interpretation**: Phase 3 (EXTRACT) loads and profiles data. Phase 4 (ANALYZE) interprets it. Never combine these steps. Combining them causes confirmation bias -- you see what you expect instead of what the data shows. -- **Metric Definitions Are Immutable**: Once Phase 2 (DEFINE) is complete and data loading begins, metric definitions cannot change silently. If they must change, re-enter Phase 2 and document why. This prevents the common anti-pattern of adjusting definitions to produce favorable results (p-hacking by another name). -- **Uncertainty Quantification**: Report confidence intervals, not point estimates. "3-7% lift" is useful; "5% lift" is misleading because it implies false precision. - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Lead with insights, not methods. The decision-maker needs "Revenue is declining 3% month-over-month, driven by churning mid-tier accounts" -- not "I performed a linear regression on the time series data using OLS estimation." 
-- **Artifact Trail**: Save artifacts at every phase. Context is ephemeral; files persist. Each phase produces a named artifact that can be audited later. -- **Graceful Tool Degradation**: Detect pandas/matplotlib availability via try/except. Use them when available, fall back to stdlib (csv, json, statistics, collections) when not. Analysis quality must be identical -- only presentation differs. -- **Statistical Rigor Gates**: Apply all four rigor gates during Phase 4. Violations must be remediated or documented as explicit limitations. See `references/rigor-gates.md` for detailed gate documentation. - -### Optional Behaviors (OFF unless enabled) -- **Visualization Output**: Generate matplotlib charts saved as PNG when matplotlib is available and user requests visual output. -- **Multi-Dataset Joins**: Join across multiple data files when analysis requires cross-referencing (e.g., user events + revenue data). -- **Exploratory Mode**: Skip Phase 1 framing when the user explicitly asks for open-ended exploration ("just show me what's interesting"). Still apply rigor gates and label all findings as exploratory. - -## What This Skill CAN Do -- Analyze structured data (CSV, JSON, SQLite exports, log files) to support specific business decisions -- Profile data quality: row counts, missing values, outliers, date range coverage, type distributions -- Compute summary statistics with confidence intervals using Python stdlib or pandas -- Compare groups (cohorts, A/B variants, time periods) with statistical rigor checks -- Detect trends, distributions, anomalies, and correlations with appropriate caveats -- Produce decision-oriented reports that lead with insights and state limitations explicitly - -## What This Skill CANNOT Do -- **Machine learning**: No model training, prediction, or hyperparameter tuning. That is a separate capability. -- **Real-time monitoring**: This is batch analysis of snapshot data, not live stream processing. 
-- **Database querying**: The skill analyzes data already extracted. It does not connect to databases or APIs. The user provides the data file. -- **Codebase analysis**: Use codebase-analyzer for code convention discovery. This skill analyzes business/operational data. -- **Automated recurring reports**: Each analysis is a one-shot investigation. Scheduled analysis requires separate automation. +Every analysis begins with the decision being supported, works backward to the evidence required, and only then touches the data. This prevents the common failure mode where analysis produces impressive summaries that answer the wrong question. **Analysis without a decision is just arithmetic.** --- @@ -96,7 +58,7 @@ Core thesis: **"Analysis without a decision is just arithmetic."** **Goal**: Establish what decision this analysis supports and what evidence would change it. -**Why this phase exists**: Starting with data before establishing the decision context is the single most common analytical failure. The analyst finds interesting patterns and presents them, but the decision-maker cannot act because the patterns do not map to their options. Framing first ensures every computation serves the decision. +Starting with data before establishing the decision context is the single most common analytical failure. The analyst finds interesting patterns and presents them, but the decision-maker cannot act because the patterns do not map to their options. Framing first ensures every computation serves the decision. Do not skip framing because "the user just wants numbers" -- numbers without decision context are not actionable, and the user may not know they need framing, which is exactly why this phase enforces it. **Step 1: Identify the decision** - What specific decision does this analysis support? @@ -104,7 +66,7 @@ Core thesis: **"Analysis without a decision is just arithmetic."** - What are their options? (Option A vs. Option B vs. 
do nothing) - What is the current default action if no analysis is performed? -If the user does not articulate a decision, ask: "What will you do differently based on this analysis?" If the answer is "nothing" or "I just want to see the data," switch to Exploratory Mode (optional behavior) and label all output as exploratory. +If the user does not articulate a decision, ask: "What will you do differently based on this analysis?" If the answer is "nothing" or "I just want to see the data," switch to Exploratory Mode and label all output as exploratory. Exploratory Mode still applies rigor gates but makes no causal claims. **Step 2: Define evidence requirements** - What evidence would favor Option A over Option B? @@ -145,7 +107,7 @@ Save `analysis-frame.md`: **Goal**: Define exactly what will be measured, how, and over what population. Write definitions to file before any data is loaded. -**Why this phase exists**: Defining metrics after seeing data enables (consciously or not) choosing definitions that produce favorable results. Locking definitions first makes the analysis auditable -- anyone can verify whether the definitions were followed. +Defining metrics after seeing data enables (consciously or not) choosing definitions that produce favorable results. Locking definitions first makes the analysis auditable -- anyone can verify whether the definitions were followed. Do not treat a metric definition as "close enough" -- a slight change in numerator or denominator can flip a conclusion. A/B tests have been decided on the wrong metric because "daily active" vs "monthly active" seemed interchangeable. **Step 1: Define metrics** @@ -197,7 +159,7 @@ Save `metric-definitions.md`: **GATE**: All metrics defined with formulas and populations. Definitions saved to file. If this is a comparison analysis, fairness checks documented. Proceed only when gate passes. -**Immutability rule**: Once Phase 3 begins, these definitions are locked. 
If the data reveals that a definition is unworkable (e.g., the column doesn't exist), return to Phase 2, update the definition, and document the change and its reason. Do not silently adjust. +**Immutability rule**: Once Phase 3 begins, these definitions are locked. If the data reveals that a definition is unworkable (e.g., the column doesn't exist), return to Phase 2, update the definition, and document the change and its reason in the artifact. Do not silently adjust -- silent definition changes are p-hacking by another name, and the change must be visible in the artifact trail for the analysis to be auditable. --- @@ -205,7 +167,7 @@ Save `metric-definitions.md`: **Goal**: Load the data, profile its quality, and determine whether it is adequate for the planned analysis. Do NOT interpret results during this phase. -**Why extraction is separate from analysis**: Combining loading and interpretation causes confirmation bias. When you compute a metric and interpret it in the same breath, you see what you expect. Extracting first forces you to confront data quality issues (missing values, unexpected distributions, date gaps) before they silently distort your conclusions. +Combining loading and interpretation causes confirmation bias -- you see what you expect instead of what the data shows. Extracting first forces you to confront data quality issues (missing values, unexpected distributions, date gaps) before they silently distort your conclusions. **Step 1: Detect available tools** @@ -225,7 +187,7 @@ except ImportError: HAS_MATPLOTLIB = False ``` -If pandas is unavailable, fall back to `csv.DictReader` + `statistics` module. Analysis quality must be identical. +If pandas is unavailable, fall back to `csv.DictReader` + `statistics` module. Analysis quality must be identical -- only presentation differs. 
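As a rough sketch of what the stdlib fallback might look like, the function below profiles one numeric column with `csv.DictReader` and the `statistics` module only. The function name `profile_column` and the sample column `latency_ms` are assumptions for illustration; the sketch also assumes every non-empty cell parses as a number and would raise `ValueError` otherwise.

```python
import csv
import io
import statistics


def profile_column(csv_text, column):
    """Profile one numeric column using only the standard library."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = []
    missing = 0
    for row in rows:
        raw = (row.get(column) or "").strip()
        if not raw:
            # Count empty cells instead of silently dropping them.
            missing += 1
            continue
        values.append(float(raw))
    return {
        "rows": len(rows),
        "missing": missing,
        "mean": statistics.mean(values) if values else None,
        "median": statistics.median(values) if values else None,
        # Sample standard deviation needs at least two observations.
        "stdev": statistics.stdev(values) if len(values) > 1 else None,
    }


# Hypothetical sample data; the third row has a missing value.
sample = "id,latency_ms\n1,40\n2,50\n3,\n4,60\n"
profile = profile_column(sample, "latency_ms")
```

The same numbers would come out of a pandas-based profile; only the loading code and the presentation differ.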
**Step 2: Load and inspect data** @@ -239,7 +201,7 @@ Profile the dataset: **Step 3: Assess data quality** -Apply the Sample Adequacy gate (see `references/rigor-gates.md` Gate 1): +Apply the Sample Adequacy gate (see `references/rigor-gates.md` Gate 1). Do not assume a sample is "probably big enough" -- that is not a statistical assessment. Check actual numbers against these minimums: | Check | Minimum | Action if Failed | |-------|---------|------------------| @@ -283,7 +245,7 @@ Save `data-quality-report.md`: ### Phase 4: ANALYZE (Compute metrics. Apply rigor gates.) -**Goal**: Compute metrics per the locked definitions from Phase 2, applying statistical rigor gates at every step. +**Goal**: Compute metrics per the locked definitions from Phase 2, applying statistical rigor gates at every step. Report confidence intervals, not point estimates -- "3-7% lift" is useful; "5% lift" is misleading because it implies false precision. **Step 1: Compute primary metrics** @@ -329,7 +291,7 @@ Before interpreting any group comparison, verify (see `references/rigor-gates.md **Step 3: Apply Multiple Testing Correction** (if testing multiple hypotheses) -See `references/rigor-gates.md` Gate 3: +See `references/rigor-gates.md` Gate 3. Do not cherry-pick a single significant segment from many tests -- if you test 10 segments, one will likely show significance by chance (5% false positive rate per test). Report all segments tested. | Scenario | Correction | |----------|------------| @@ -381,9 +343,7 @@ Save `analysis-results.md`: ### Phase 5: CONCLUDE (Lead with insights. Return to the decision.) -**Goal**: Translate analytical results into a decision-oriented report. Lead with what the data says, not how you computed it. - -**Why this phase is separate**: Phase 4 produces numbers. Phase 5 produces meaning. Separating them prevents the analyst from burying the insight under methodology. The decision-maker reads Phase 5; the auditor reads Phases 2-4. 
+**Goal**: Translate analytical results into a decision-oriented report. Lead with what the data says about the decision, not how you computed it -- the decision-maker reads Phase 5; the auditor reads Phases 2-4. Methodology belongs in the appendix. **Step 1: State the headline finding** @@ -401,6 +361,8 @@ Summarize the key metrics that support the headline, in order of importance: **Step 3: State limitations explicitly** +Do not omit limitations because the analysis is complex and "the user won't understand" -- hiding limitations is more misleading than explaining them, and simple language makes limitations accessible. If confidence intervals are wide, that IS the finding (the data is insufficient to support a decision), not a formatting problem to hide by reporting only the point estimate. + - What the data does NOT tell you - Rigor gate violations and their implications - Known confounders that could not be controlled @@ -457,9 +419,9 @@ Save `analysis-report.md`: --- -## Examples +### Examples -### Example 1: A/B Test Evaluation +#### Example 1: A/B Test Evaluation User says: "Evaluate this A/B test - here's the CSV of results" Actions: 1. FRAME: "Should we ship variant B?" Options: ship B, keep A, extend test. Evidence: conversion lift >1% with 95% CI excluding zero. @@ -468,7 +430,7 @@ Actions: 4. ANALYZE: Variant B conversion 4.2% vs A 3.9%. Difference 0.3% (CI: -0.1% to 0.7%). Fails practical significance -- CI includes zero. 5. CONCLUDE: "Data is inconclusive. The observed 0.3% lift has a confidence interval that includes zero. Recommend extending the test for 2 more weeks to reach adequate power." -### Example 2: Trend Analysis +#### Example 2: Trend Analysis User says: "What's happening with our monthly revenue? Here's 2 years of data." Actions: 1. FRAME: "Is revenue growth slowing, and should we invest in acquisition?" Options: increase spend, maintain, cut. @@ -477,7 +439,7 @@ Actions: 4. ANALYZE: Overall +2.1%/mo but returning customer revenue flat. 
All growth from new customers. Seasonality adjusted. 5. CONCLUDE: "Revenue growth is entirely acquisition-driven. Returning customer revenue has been flat for 8 months, suggesting a retention problem. Recommend investigating churn before increasing acquisition spend." -### Example 3: Distribution Profiling +#### Example 3: Distribution Profiling User says: "Our API response times feel slow. Here's a week of latency data." Actions: 1. FRAME: "Do we need to optimize the API?" Options: optimize, add caching, do nothing. Threshold: p99 >500ms warrants action. @@ -486,6 +448,20 @@ Actions: 4. ANALYZE: p50=45ms (fine), p99=890ms (exceeds threshold). /search endpoint contributes 73% of p99 violations. Peak hours 2x worse. 5. CONCLUDE: "p99 latency exceeds the 500ms threshold, concentrated in /search during peak hours. Recommend optimizing /search specifically rather than system-wide caching." +### Blocker Criteria + +STOP and ask the user (do NOT proceed autonomously) when: + +| Situation | Why Stop | Ask This | +|-----------|----------|----------| +| No decision context and user resists framing | Analysis without purpose wastes effort | "Help me understand: what will change based on this analysis?" | +| Data format unclear | Parsing errors corrupt analysis | "What format is this data in? What do the columns represent?" | +| Critical columns have >50% missing values | Analysis on mostly-missing data is unreliable | "Column X is 60% missing. Should we exclude it or is there another data source?" | +| Metric definitions contradict each other | Conflicting definitions produce conflicting results | "Metric A and B use different definitions of 'active user'. Which should we standardize on?" | +| Results are ambiguous (CI spans zero for primary metric) | User needs to know the data is inconclusive | State clearly: "The data does not support a confident decision. Here are options for getting more data." 
| + +Never guess on column semantics, population definitions, business thresholds, or causal claims (correlation is not causation). + --- ## Error Handling @@ -501,6 +477,7 @@ Actions: 2. Detect delimiter: comma, tab, semicolon, pipe 3. If JSON: validate structure, identify if it's array-of-objects or nested 4. If still failing: ask user for format details. Do not guess. +5. Maximum 3 parse attempts before asking the user for format help. ### Error: "Insufficient data for planned segments" **Cause**: Metric definitions specify segments (by region, by tier) but some segments have <30 observations. @@ -516,97 +493,20 @@ Actions: 2. Document what changed and why 3. Save updated metric-definitions.md with change log 4. Do NOT silently adjust -- the change must be visible in the artifact trail +5. Maximum 2 definition revisions before flagging scope concern. ---- - -## Anti-Patterns - -### Data-First Analysis -**What it looks like**: Loading the CSV immediately and computing summary statistics before asking what decision the analysis supports. -**Why wrong**: Produces technically correct summaries that answer the wrong question. The analyst finds "interesting patterns" that don't map to the decision-maker's options. Hours of work, zero actionable insight. -**Do instead**: Complete Phase 1 (FRAME) before touching Phase 3 (EXTRACT). If the user pushes back, explain: "I want to make sure we compute the right metrics. What will you do differently based on this analysis?" - -### Point Estimates Without Uncertainty -**What it looks like**: "Conversion rate is 4.2%" with no confidence interval, sample size, or context. -**Why wrong**: 4.2% from 100 observations means something very different from 4.2% from 100,000 observations. Without uncertainty bounds, the decision-maker cannot judge reliability. A 4.2% rate with CI [1.1%, 7.3%] is very different from 4.2% with CI [4.0%, 4.4%]. -**Do instead**: Always report confidence intervals: "4.2% (95% CI: 3.8-4.6%, N=12,400)". 
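To make the interval format above concrete, here is a minimal sketch of a 95% confidence interval for a conversion rate using the normal (Wald) approximation. The method is an assumption: the skill text does not say which interval to use, and other choices (for example a Wilson interval) behave better for small samples or rates near 0 or 1. `proportion_ci` is a hypothetical helper name.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Return (rate, low, high) using the normal approximation.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] because the approximation can overshoot near the edges.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# 521 conversions out of 12,400 is about 4.2%.
rate, low, high = proportion_ci(521, 12_400)
report = f"{rate:.1%} (95% CI: {low:.1%}-{high:.1%}, N=12,400)"
```

With these inputs the bounds come out at roughly 3.8% and 4.6%, in line with the example string quoted above. Running the same rate on N=100 yields a far wider interval, which is the point of reporting N alongside the estimate.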
- -### Silent Definition Changes -**What it looks like**: Defining "active users" as "logged in last 30 days" in Phase 2, then computing it as "logged in last 7 days" in Phase 4 because the data only has 7-day granularity. -**Why wrong**: This is p-hacking. Changing definitions after seeing data -- even for practical reasons -- invalidates the pre-registration. If the change is benign, it should be documented. If it is not documented, there is no way to audit whether it was benign. -**Do instead**: Return to Phase 2, update the definition, document the reason, then proceed. - -### Cherry-Picked Segments -**What it looks like**: "Conversion improved in the 25-34 age group!" without reporting all other age groups or applying multiple testing correction. -**Why wrong**: If you test 10 segments, one will likely show significance by chance (5% false positive rate per test). Reporting only the significant one is misleading. -**Do instead**: Report all segments tested. Apply Bonferroni correction for 6+ comparisons. Label exploratory findings as exploratory. - -### Methods-First Communication -**What it looks like**: "I performed a chi-squared test on the contingency table of conversion outcomes stratified by experimental group, yielding a test statistic of 4.12 with 1 degree of freedom..." -**Why wrong**: The decision-maker needs the insight, not the methodology. Leading with methods buries the finding under jargon. The methodology belongs in the appendix for auditors. -**Do instead**: Lead with the insight: "Variant B converts 12% better than A (95% CI: 3-21%). The effect is statistically significant and exceeds our 5% threshold for shipping." Put methodology in the appendix. - ---- - -## Anti-Rationalization - -See [shared-patterns/anti-rationalization-core.md](../shared-patterns/anti-rationalization-core.md) for universal patterns. 
- -### Domain-Specific Rationalizations +### Death Loop Prevention +If the analysis is cycling (returning to Phase 2 repeatedly, growing artifact count without convergence, same error recurring), simplify: drop segments, reduce metrics to the single most important one, narrow the time window. A tightly framed decision in Phase 1 produces fewer metrics and faster convergence. -| Rationalization Attempt | Why It's Wrong | Required Action | -|------------------------|----------------|-----------------| -| "The user just wants numbers, skip framing" | Numbers without decision context are not actionable. The user may not know they need framing -- that is exactly why the skill enforces it. | Complete Phase 1. Ask "What will you do differently?" | -| "This sample is probably big enough" | "Probably" is not a statistical assessment. Small samples produce wide CIs that cannot support decisions. | Check the actual sample size against the adequacy gate. Report N and CI. | -| "The metric definition is close enough" | Close enough in a numerator or denominator can flip a conclusion. A/B tests have been decided on the wrong metric because "daily active" vs "monthly active" seemed interchangeable. | Use the exact definition from Phase 2. If it must change, return to Phase 2. | -| "This one significant segment is the real finding" | Cherry-picking the significant result from many tests is textbook p-hacking. The one segment may be a false positive. | Report all segments. Apply multiple testing correction. Label as exploratory if warranted. | -| "CIs are too wide, just report the point estimate" | Wide CIs ARE the finding -- they mean the data is insufficient to support a decision. Hiding this misleads the decision-maker. | Report the CI. State that the data is insufficient. Recommend more data. | -| "The analysis is complex, the user won't understand limitations" | Hiding limitations is more misleading than explaining them. Simple language makes limitations accessible. 
| State limitations in plain language. "We cannot be confident because..." | - ---- - -## Blocker Criteria - -STOP and ask the user (do NOT proceed autonomously) when: - -| Situation | Why Stop | Ask This | -|-----------|----------|----------| -| No decision context and user resists framing | Analysis without purpose wastes effort | "Help me understand: what will change based on this analysis?" | -| Data format unclear | Parsing errors corrupt analysis | "What format is this data in? What do the columns represent?" | -| Critical columns have >50% missing values | Analysis on mostly-missing data is unreliable | "Column X is 60% missing. Should we exclude it or is there another data source?" | -| Metric definitions contradict each other | Conflicting definitions produce conflicting results | "Metric A and B use different definitions of 'active user'. Which should we standardize on?" | -| Results are ambiguous (CI spans zero for primary metric) | User needs to know the data is inconclusive | State clearly: "The data does not support a confident decision. Here are options for getting more data." | - -### Never Guess On -- Column semantics (what does "status" mean? what values are valid?) -- Population definitions (who is included/excluded from the analysis) -- Business thresholds (what constitutes a "meaningful" change) -- Causal claims (correlation is not causation -- do not imply otherwise) - ---- - -## Death Loop Prevention - -### Retry Limits -- Maximum 3 attempts to parse a data file before asking the user for format help -- Maximum 2 definition revisions in Phase 2 before flagging scope concern -- Maximum 3 rigor gate remediation attempts before documenting as limitation - -### Recovery Protocol -1. **Detection**: Phase cycling (returning to Phase 2 repeatedly), growing artifact count without convergence, same error recurring -2. **Intervention**: Simplify the analysis scope. Drop segments, reduce metrics to the single most important one, narrow time window. -3. 
**Prevention**: Frame the decision tightly in Phase 1. Fewer options = fewer metrics = faster convergence. +Maximum retry limits: +- 3 attempts to parse a data file +- 2 definition revisions in Phase 2 +- 3 rigor gate remediation attempts before documenting as limitation --- ## References -For detailed information: - **Rigor Gates**: [references/rigor-gates.md](references/rigor-gates.md) - Detailed statistical gate documentation with examples - **Output Templates**: [references/output-templates.md](references/output-templates.md) - Templates for different analysis types (A/B test, trend, distribution, cohort) - **Anti-Patterns**: [references/anti-patterns.md](references/anti-patterns.md) - Extended anti-pattern catalog with code examples - -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) - Phase transition rules diff --git a/skills/decision-helper/SKILL.md b/skills/decision-helper/SKILL.md index 853858c1..fe27d41b 100644 --- a/skills/decision-helper/SKILL.md +++ b/skills/decision-helper/SKILL.md @@ -32,27 +32,7 @@ routing: # Decision Helper Skill -## Operator Context - -This skill operates as an operator for structured decision-making, configuring Claude's behavior for weighted scoring of architectural and technology choices. Runs inline (no context fork) because users adjust criteria and weights interactively. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution -- **Option Limit**: Maximum 4 options. 
More than 4 = decompose into sub-decisions first -- **Close-Call Detection**: Always flag when top two options differ by <0.5 weighted score -- **No Gut Overrides**: If the matrix contradicts intuition, fix the criteria -- never override the math - -### Default Behaviors (ON unless disabled) -- **Default Criteria**: Use the standard criteria table unless the user provides custom criteria -- **ADR Persistence**: Check `.adr-session.json` and append decision there; fall back to task plan -- **Score Justification**: Brief (1 sentence) justification for each score - -### Optional Behaviors (OFF unless enabled) -- **Custom Criteria**: User replaces or supplements default criteria and weights -- **Sensitivity Analysis**: Re-score with adjusted weights to test recommendation stability -- **Skip Persistence**: Don't record the decision (for informal exploration) - ---- +Structured weighted scoring for architectural and technology choices. Runs inline (no context fork) because users adjust criteria and weights interactively. ## Instructions @@ -61,10 +41,10 @@ This skill operates as an operator for structured decision-making, configuring C **Goal**: Turn the user's question into a clear, scorable decision. - State the decision in one sentence (e.g., "Which HTTP router should we use for the API service?") -- List 2-4 concrete options. If the user provides more than 4, help them eliminate or group options before proceeding +- List 2-4 concrete options. If the user provides more than 4, help them eliminate or group options before proceeding -- never score more than 4 at once because larger matrices dilute focus and invite analysis paralysis - Identify hard constraints that eliminate options immediately (e.g., "must be MIT licensed" eliminates Option C) -If the user's request is too vague to frame, ask clarifying questions. Do not guess at options. +If the user's request is too vague to frame, ask clarifying questions. Do not guess at options. 
If someone invoked this skill, the decision is not obvious -- run the full framework even when a quick answer feels tempting. **Gate**: Decision statement defined, 2-4 options listed, hard constraints applied. @@ -72,7 +52,7 @@ If the user's request is too vague to frame, ask clarifying questions. Do not gu **Goal**: Establish what matters for this decision and how much. -Present the default criteria table. Ask the user if they want to adjust weights or add/remove criteria. +Present the default criteria table unless the user provides custom criteria. Ask if they want to adjust weights or add/remove criteria. | Criterion | Weight | What It Measures | |-----------|--------|-----------------| @@ -86,16 +66,22 @@ Present the default criteria table. Ask the user if they want to adjust weights WHY these defaults: Correctness dominates because a wrong solution has zero value regardless of other factors. Complexity/Maintainability/Risk form a middle tier because they determine long-term cost. Effort/Familiarity are lower because they're temporary (teams learn, effort is one-time). Ecosystem is lowest because it rarely decides between otherwise-equal options. +Use defaults unless the user has a strong reason to change them. Agonizing over whether Complexity should be weight 3 or 4 rarely changes the outcome -- the framework exists to make decisions faster, not slower. Set weights before scoring; adjusting weights after seeing results to make a preferred option win is confirmation bias with extra steps. + +If the user wants sensitivity analysis, re-score with adjusted weights after the initial pass to test recommendation stability. + **Gate**: Criteria and weights confirmed (default or custom). ### Step 3: Score Each Option **Goal**: Rate each option against each criterion with justification. -Score every criterion 1-10 (1-3 poor, 4-6 adequate, 7-9 strong, 10 exceptional). Provide a one-sentence justification per score to prevent arbitrary numbers. 
+Score every criterion 1-10 (1-3 poor, 4-6 adequate, 7-9 strong, 10 exceptional). Provide a one-sentence justification per score -- this prevents arbitrary numbers and makes disagreements productive. Calculate weighted score: `sum(score * weight) / sum(weights)` +Treat scores as subjective estimates, not measurements. A difference of 0.03 between two options is noise, not signal -- the close-call detection in Step 4 handles this. + **Gate**: All options scored, all scores justified, weighted scores calculated. ### Step 4: Analyze Results @@ -105,10 +91,12 @@ Calculate weighted score: `sum(score * weight) / sum(weights)` Apply these rules in order: 1. **No Good Option** (all weighted scores <6.0): Flag that none of the options are strong. Suggest the user explore alternatives or revisit constraints -2. **Close Call** (top two within 0.5): Flag as "close call -- additional factors should decide." Identify which criteria drive the difference and ask the user what matters most +2. **Close Call** (top two within 0.5): Always flag as "close call -- additional factors should decide." Identify which criteria drive the difference and ask the user what matters most. Never hand-wave a close call with "close enough, just pick one" -- these deserve explicit acknowledgment 3. **Clear Winner** (top option leads by >0.5): Recommend the winner. Note which high-weight criteria drove the result 4. **Dominant Option** (top option leads on ALL weight-5 criteria): Note the dominance -- this is a high-confidence recommendation +If the matrix contradicts the user's intuition, do not override the math. Instead, ask which criterion is missing or mis-weighted. Add it, re-score, and see if the matrix now agrees. If it does, you found the hidden factor. If it still disagrees, trust the matrix -- it surfaces the reasoning that gut feelings obscure. 
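The Step 3 formula and the Step 4 rules can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill: the function names, the example criteria, and the example weights are hypothetical, and treating rule 4 (dominance) as an annotation on a clear winner is one reading of "apply these rules in order".

```python
CLOSE_CALL_MARGIN = 0.5      # Step 4: top two within 0.5
NO_GOOD_OPTION_BELOW = 6.0   # Step 4: all weighted scores < 6.0
TOP_WEIGHT = 5               # Step 4: "weight-5 criteria"


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """sum(score * weight) / sum(weights), as given in Step 3."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def analyze(options: dict[str, dict[str, float]], weights: dict[str, float]) -> str:
    """Apply the Step 4 rules in order and return a one-line verdict."""
    if not 2 <= len(options) <= 4:
        raise ValueError("the skill scores 2-4 options at a time")

    # Rank options from highest to lowest weighted score.
    ranked = sorted(
        ((weighted_score(s, weights), name) for name, s in options.items()),
        reverse=True,
    )
    (top_score, top), (second_score, second) = ranked[0], ranked[1]

    # Rule 1: if the best option is below the bar, all of them are.
    if top_score < NO_GOOD_OPTION_BELOW:
        return "no good option"
    # Rule 2: a lead of 0.5 or less is a close call, not a win.
    if top_score - second_score <= CLOSE_CALL_MARGIN:
        return f"close call: {top} vs {second}"

    # Rules 3 and 4: clear winner, flagged as dominant when it leads
    # every other option on every weight-5 criterion.
    key_criteria = [c for c, w in weights.items() if w == TOP_WEIGHT]
    dominant = bool(key_criteria) and all(
        options[top][c] > options[other][c]
        for c in key_criteria
        for other in options
        if other != top
    )
    return f"clear winner: {top}" + (" (dominant)" if dominant else "")
```

With illustrative weights of 5, 3, and 2, an option scoring 9/6/5 gets 7.3 and one scoring 7/7/8 gets 7.2, so the sketch reports a close call rather than declaring the first option the winner.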
+ Present the output table: ``` @@ -145,6 +133,8 @@ cat .adr-session.json 2>/dev/null **If no ADR**: Note the decision in the active task plan (`plan/active/*.md`). If neither exists, present the record to the user for manual recording. +The user can skip persistence for informal exploration by requesting it. + **Gate**: Decision recorded or presented. Workflow complete. --- @@ -165,35 +155,6 @@ cat .adr-session.json 2>/dev/null --- -## Anti-Patterns - -### Analysis Paralysis -**What it looks like**: User agonizes over whether Complexity should be weight 3 or 4 -**Why wrong**: Weight differences of 1 rarely change the outcome. The framework exists to make decisions faster, not slower. -**Do instead**: Use defaults. Only customize weights when the user has a strong reason. - -### False Precision -**What it looks like**: "Option A scores 7.21 vs Option B at 7.18, so A wins" -**Why wrong**: A 0.03 difference is noise. Scores are subjective estimates, not measurements. -**Do instead**: Close-call detection handles this. Scores within 0.5 are flagged as ties needing additional context. - -### Gut Override -**What it looks like**: "The matrix says B, but I just feel like A is right" -**Why wrong**: If the matrix contradicts your intuition, the criteria or scores are wrong -- not the math. Overriding teaches you nothing about WHY your gut disagrees. -**Do instead**: Ask which criterion is missing or mis-weighted. Add it, re-score, and see if the matrix now agrees with intuition. If it does, you found the hidden factor. If it doesn't, trust the matrix. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "The answer is obvious, no need to score" | Obvious answers don't need a skill; if you're here, it's not obvious | Run the full framework | -| "Close enough, just pick one" | Close calls deserve explicit acknowledgment, not hand-waving | Flag the close call, identify differentiating factors | -| "I'll adjust weights until my preferred option wins" | That's confirmation bias with extra steps | Set weights BEFORE scoring, don't adjust to fit a desired outcome | -| "This decision is too small for a matrix" | Then don't invoke the skill -- but if you did, commit to the process | Either skip the skill or run it fully | +- Repository CLAUDE.md files (read before execution for project-specific constraints) diff --git a/skills/dispatching-parallel-agents/SKILL.md b/skills/dispatching-parallel-agents/SKILL.md index 6ef5d377..5a0c7d06 100644 --- a/skills/dispatching-parallel-agents/SKILL.md +++ b/skills/dispatching-parallel-agents/SKILL.md @@ -35,47 +35,7 @@ routing: # Dispatching Parallel Agents -## Operator Context - -This skill operates as an operator for parallel dispatch workflows, configuring Claude's behavior for concurrent investigation of independent problems. It implements the **Fan-Out / Fan-In** architectural pattern -- dispatch isolated agents, collect results, integrate -- with **Domain Separation** ensuring agents never interfere with each other. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before dispatching agents -- **Over-Engineering Prevention**: Fix only what is broken. 
No speculative improvements across domains -- **Independence Verification**: MUST confirm problems are independent before parallel dispatch -- **Branch Convergence (ADR-093)**: Orchestrator MUST create the target branch BEFORE dispatching agents. Each agent prompt MUST include the branch name with explicit instructions to commit there — NOT create a new branch. This overrides the default "always create a branch" rule for parallel-dispatched agents. -- **Single Message Dispatch**: MUST launch all parallel agents in ONE message for true concurrency -- **Scoped Prompts**: Each agent MUST receive explicit scope, constraints, and expected output -- **Post-Integration Verification**: MUST run full test suite after all agents return - -### Default Behaviors (ON unless disabled) -- **Conflict Detection**: Check if agents modified overlapping files after completion -- **Prompt Template**: Use structured prompt with scope, goal, constraints, and output format -- **Summary Collection**: Require each agent to return root cause and files modified -- **Maximum Parallelism**: Cap at 10 concurrent agents to avoid coordination overhead -- **Result Spot-Check**: Verify at least one agent's fix manually before declaring done -- **Sequential Fallback**: If agents report same root cause, stop and investigate holistically - -### Optional Behaviors (OFF unless enabled) -- **Dependency Graph**: Map subsystem dependencies before dispatching -- **Resource Isolation**: Assign exclusive file/port ranges to each agent -- **Retry on Failure**: Re-dispatch failed agents with additional context - -## What This Skill CAN Do -- Dispatch multiple agents to work on independent problems concurrently -- Reduce total investigation time proportional to number of independent problems -- Detect conflicts between agent fixes during integration -- Provide structured prompts that keep agents focused on their domain -- Integrate results and verify the combined fix - -## What This Skill CANNOT Do -- Parallelize 
problems that share state or root cause -- Guarantee agents will not edit overlapping files -- Replace systematic debugging for single complex bugs (use systematic-debugging instead) -- Plan implementation work (use workflow-orchestrator instead) -- Execute sequential dependent tasks (use subagent-driven-development instead) - ---- +Fan-Out / Fan-In pattern for concurrent investigation of independent problems. Dispatch isolated agents with domain separation so they never interfere with each other, collect results, integrate, verify. ## Instructions @@ -85,6 +45,8 @@ This skill operates as an operator for parallel dispatch workflows, configuring **Step 1: List all problems** +Read and follow the repository CLAUDE.md before proceeding. Then enumerate every problem: + ```markdown ## Problems Identified 1. [Problem A] - [Subsystem] - [Error summary] @@ -92,12 +54,16 @@ This skill operates as an operator for parallel dispatch workflows, configuring 3. [Problem C] - [Subsystem] - [Error summary] ``` +Fix only what is broken -- do not add speculative improvements across domains. + **Step 2: Test independence** For each pair of problems, ask: "If I fix problem A, does it affect problem B?" - If NO for all pairs --> Independent, proceed to parallel dispatch - If YES or MAYBE for any pair --> Investigate those together first, parallelize the rest +Do not skip this step regardless of how obvious independence appears. "These problems look independent" is an assumption, not a verification -- test each pair explicitly. If there are many problems, classify all of them before dispatching any. 
+ **Step 3: Check scope overlap (deterministic)** Before dispatching, infer file scopes from each task description, then run the overlap checker: @@ -113,11 +79,11 @@ Scope inference rules: - Extract file paths and directories mentioned in the task description - If a task only reads files (investigation, analysis), set `"readonly": true` - If scope cannot be inferred, use the subsystem directory as a broad scope -- When in doubt, over-scope — false positives (unnecessary serialization) are safe +- When in doubt, over-scope -- false positives (unnecessary serialization) are safe Interpret the output: -- `"conflicts": []` → All tasks can run in parallel. Proceed to Phase 2. -- `"conflicts": [...]` → Use `"parallel_groups"` to determine wave ordering. Tasks in the same group run together; groups run sequentially. +- `"conflicts": []` --> All tasks can run in parallel. Proceed to Phase 2. +- `"conflicts": [...]` --> Use `"parallel_groups"` to determine wave ordering. Tasks in the same group run together; groups run sequentially. - Display the grouping decision in the dispatch summary. **Step 4: Verify no shared state beyond files** @@ -135,7 +101,7 @@ These are NOT caught by the scope overlap script and require manual verification **Goal**: Launch focused agents with clear scope on a single shared branch. -**Step 0: Create target branch (ADR-093 — Branch Convergence)** +**Step 0: Create target branch (ADR-093 -- Branch Convergence)** Before dispatching any agents, the orchestrator creates and checks out the target branch: @@ -143,13 +109,11 @@ Before dispatching any agents, the orchestrator creates and checks out the targe git checkout -b feat/{descriptive-name} ``` -This branch is the single convergence point for all parallel agents. Individual agents MUST NOT create their own branches. - -**Why**: Without this step, N agents create N branches (the "scattered branches" problem). Cherry-picking and branch discovery after the fact is fragile and error-prone. 
Creating the branch before dispatch is simple and deterministic. +This branch is the single convergence point for all parallel agents. Individual agents MUST NOT create their own branches. Without this step, N agents create N branches (the "scattered branches" problem) -- cherry-picking and branch discovery after the fact is fragile and error-prone. Creating the branch before dispatch is simple and deterministic. **Step 1: Create agent prompts** -Each agent prompt MUST include: +Each agent receives an explicit prompt with scope, goal, constraints, and expected output format. Vague prompts like "Fix the failing tests" cause agents to wander, modify out-of-scope files, and take too long. Use this template: ```markdown Fix [N] failing tests in [FILE/SUBSYSTEM]: @@ -182,15 +146,21 @@ Return: - How to verify the fix ``` +The branch name must appear in every agent prompt -- agents will not "figure out the branch" on their own, and merging scattered branches after the fact creates conflicts and pollutes history. + **Step 2: Dispatch agents using scope overlap grouping** +All agents in a wave MUST be dispatched in a single message for true concurrency. Dispatching them one at a time serializes the work and defeats the purpose. + - If Phase 1 scope check returned a single parallel group: dispatch all agents in ONE message. All run concurrently. - If Phase 1 scope check returned multiple groups: dispatch each group as a sequential wave. All agents within a wave run concurrently, but waves run sequentially. Wait for wave N to complete before dispatching wave N+1. +Cap at 10 concurrent agents per wave to avoid coordination overhead. 
+ ```markdown ## Dispatch Plan -Wave 1 (parallel): [task-1, task-2] — no file overlap -Wave 2 (after wave 1): [task-3] — overlaps with task-1 on handlers/auth.go +Wave 1 (parallel): [task-1, task-2] -- no file overlap +Wave 2 (after wave 1): [task-3] -- overlaps with task-1 on handlers/auth.go ``` **Gate**: All agents dispatched with scoped prompts and constraints. Proceed only when all agents return. @@ -213,7 +183,7 @@ If an agent created a rogue branch: 2. Cherry-pick its commits to the target branch: `git cherry-pick ` 3. Delete the rogue branch: `git branch -d ` -If an agent used a worktree, its commits are on a separate branch by design — cherry-pick them to the target branch. +If an agent used a worktree, its commits are on a separate branch by design -- cherry-pick them to the target branch. **Step 2: Read each agent summary** - What was the root cause? @@ -221,17 +191,20 @@ If an agent used a worktree, its commits are on a separate branch by design — - Did the agent's local tests pass? - Did the agent confirm the correct branch? +If agents report the same root cause, stop integration immediately. The problems were not actually independent -- consolidate into a single investigation. + **Step 3: Check for conflicts** - Did any agent modify files outside its declared scope? (Compare actual files modified vs Phase 1 scope declarations) - Did any two agents modify the same file? (Should not happen if scope overlap check was clean, but verify) -- Did any agent report the same root cause as another? - Did any agent report inability to reproduce? +"No conflicts in the file list" does not mean no conflicts in logic -- spot-check actual code changes, not just file names. + If conflicts detected: Do NOT auto-merge. Understand which fix is correct. May need sequential re-investigation. **Step 4: Run full test suite** -Execute the complete test suite to verify all fixes work together without regressions. 
+Execute the complete test suite to verify all fixes work together without regressions. An agent reporting "it's fixed" is not the same as integrated verification -- the full suite catches cross-subsystem regressions that individual agents cannot see. **Step 5: Spot-check at least one fix** @@ -280,59 +253,19 @@ Solution: Provide additional context. If still cannot reproduce, the problem may ### Error: "Agent Commits to Wrong Branch" Cause: Agent creates its own branch (ignoring convergence protocol) or worktree diverges from target. -Solution (ADR-093 — Branch Convergence): +Solution (ADR-093 -- Branch Convergence): 1. Orchestrator creates the target branch BEFORE dispatching agents (Phase 2, Step 0) 2. Each agent prompt explicitly states `Work on branch: {name}. Do NOT create a new branch.` 3. Each agent runs `git branch --show-current` as first step to verify correct branch 4. After all agents return, Phase 3 Step 1 verifies convergence 5. If a commit landed on the wrong branch, cherry-pick to the target and delete the rogue branch -*Graduated from learning.db — multi-agent-coordination/worktree-branch-confusion. Superseded by ADR-093.* - ---- - -## Anti-Patterns - -### Anti-Pattern 1: Parallelizing Related Problems -**What it looks like**: Dispatching agents for failures that share a root cause -**Why wrong**: Multiple agents fix the same thing differently, creating conflicts -**Do instead**: Test independence first. If unsure, investigate sequentially. - -### Anti-Pattern 2: Vague Agent Prompts -**What it looks like**: "Fix the failing tests" with no scope or constraints -**Why wrong**: Agent wanders, modifies out-of-scope files, takes too long -**Do instead**: Use the structured prompt template with explicit scope and constraints. - -### Anti-Pattern 3: Skipping Integration Verification -**What it looks like**: "All agents reported success, we're done!" 
-**Why wrong**: Individual fixes may conflict or introduce cross-subsystem regressions -**Do instead**: Run full test suite after all agents return. Spot-check at least one fix. - -### Anti-Pattern 4: Dispatching Before Understanding -**What it looks like**: Immediately parallelizing without confirming independence -**Why wrong**: Wastes agent effort; conflicting fixes require rework -**Do instead**: Complete Phase 1 classification. Independence verification is not optional. - -### Anti-Pattern 5: Scattered Branches (ADR-093) -**What it looks like**: Each parallel agent creates its own feature branch, producing N branches for N agents -**Why wrong**: Orchestrator must detective-work through cherry-picks, branch discovery, and manual integration. Merge commits pollute history. Conflicts emerge during integration that could have been prevented. -**Do instead**: Orchestrator creates the target branch BEFORE dispatch. Each agent prompt includes `Work on branch: {name}. Do NOT create a new branch.` All changes converge on a single branch. -*Graduated from ADR-093 and learning.db — multi-agent-coordination/parallel-agents-scatter-branches* +*Graduated from learning.db -- multi-agent-coordination/worktree-branch-confusion. 
Superseded by ADR-093.* --- ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "These problems look independent" | Looking ≠ verified independence | Test each pair explicitly | -| "Agent said it's fixed" | Agent report ≠ integrated verification | Run full test suite | -| "No conflicts in file list" | File-level ≠ logic-level conflict | Spot-check actual changes | -| "Too many problems to classify" | Skipping classification causes rework | Classify all before dispatch | -| "Agents will figure out the branch" | Without explicit branch, each agent creates its own | Pass branch name in every prompt (ADR-093) | -| "I'll merge the branches after" | Post-hoc merge creates conflicts and pollutes history | Create branch before dispatch, not after | +- `scripts/check-scope-overlap.py` -- Deterministic scope overlap checker for parallel task dispatch +- `skills/systematic-debugging/SKILL.md` -- Use instead for single complex bugs or related failures +- `skills/workflow-orchestrator/SKILL.md` -- Use instead for planning implementation work +- `skills/subagent-driven-development/SKILL.md` -- Use instead for sequential dependent tasks diff --git a/skills/distinctive-frontend-design/SKILL.md b/skills/distinctive-frontend-design/SKILL.md index 7510214c..0ca86883 100644 --- a/skills/distinctive-frontend-design/SKILL.md +++ b/skills/distinctive-frontend-design/SKILL.md @@ -28,59 +28,9 @@ routing: # Distinctive Frontend Design Skill -## Operator Context - -This skill operates as an operator for frontend design workflows, configuring Claude's behavior for creative aesthetic development that systematically avoids 
generic outputs. It implements the **Exploration-First** architectural pattern -- full context analysis and aesthetic exploration before any implementation. - -### Hardcoded Behaviors (Always Apply) - -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Over-Engineering Prevention**: Only implement what is directly requested; keep solutions focused on distinctive aesthetics, not unnecessary abstractions or design system scaffolding -- **Exploration-First Workflow**: ALWAYS complete aesthetic exploration (Phase 1) before any implementation; never jump to code with first-instinct choices -- **Anti-Pattern Prevention**: NEVER use banned fonts (Inter, Roboto, Arial, Helvetica, system fonts, Space Grotesk), purple gradients on white, or other cliches from `references/anti-patterns.json` -- **Context-Driven Decisions**: ALL design choices must be justified by project context, not convenience or defaults -- **Validation Required**: MUST run validation scripts before finalizing design specifications -- **Variety Enforcement**: NEVER reuse the same aesthetic choices across different projects; check project history - -### Default Behaviors (ON unless disabled) - -- **Concise Reporting**: Show validation results rather than describing them; be specific about aesthetic choices with hex values and font names -- **Temporary File Cleanup**: Remove exploration artifacts at completion; keep only final specifications and validation reports -- **Diverse Options**: Provide 3-5 distinct aesthetic directions during exploration phase -- **Cultural Research**: Research and incorporate cultural/thematic inspiration relevant to project context -- **Typography Excellence**: Prioritize beautiful, unexpected font combinations from curated catalog -- **Animation Orchestration**: Plan choreographed sequences for high-impact moments only; resist animating everything -- **Atmospheric Backgrounds**: Create depth through layered effects; never use flat solid colors 
- -### Optional Behaviors (OFF unless enabled) - -- **Design System Generation**: Create comprehensive design tokens and component library -- **Accessibility Auditing**: Full WCAG compliance checking beyond basic contrast -- **Performance Profiling**: Detailed animation performance analysis and optimization -- **Dark Mode Variants**: Automatic dark theme generation with adjusted glow/pattern opacities - -## What This Skill CAN Do - -- Guide systematic aesthetic exploration before implementation -- Validate design choices against anti-pattern database -- Generate diverse font recommendations by aesthetic category -- Create context-driven color palettes from cultural/thematic sources -- Plan high-impact animation choreography with easing curves -- Design atmospheric backgrounds with layered gradient/pattern/texture effects -- Audit existing designs for generic patterns and "AI slop" signals -- Provide concrete CSS/React/Tailwind implementation guidance -- Score designs on distinctiveness metrics with actionable feedback - -## What This Skill CANNOT Do - -- Generate pixel-perfect mockups (provides specifications and code, not visual design files) -- Replace human creativity (guides and validates, but requires creative direction from user) -- Guarantee uniqueness against all existing designs on the internet -- Auto-generate complete design systems (focuses on aesthetic distinctiveness, not comprehensive systems) -- Handle backend, API, or data visualization concerns -- Validate against specific brand guidelines without those guidelines provided +Systematic aesthetic exploration that produces contextual, validated design specifications. Every design choice flows from project context -- purpose, audience, emotion -- not from defaults or convenience. The workflow enforces exploration before implementation: you cannot write CSS until you have a validated aesthetic direction, typography selection, color palette, animation strategy, and atmospheric background. 
---- +Optional capabilities (off unless explicitly enabled by the user): design system generation, full WCAG accessibility auditing, animation performance profiling, dark mode variant generation. ## Instructions @@ -88,7 +38,7 @@ This skill operates as an operator for frontend design workflows, configuring Cl **Goal**: Understand the project deeply before making any aesthetic decisions. -**Step 1: Gather context** by asking (adapt based on what is already known): +**Step 1: Read and follow the repository CLAUDE.md**, then gather context by asking (adapt based on what is already known): 1. **Purpose**: What is this frontend for? (portfolio, SaaS product, creative showcase, documentation, landing page) 2. **Audience**: Who will use it? (developers, artists, enterprise users, general public, specific demographics) @@ -96,9 +46,9 @@ This skill operates as an operator for frontend design workflows, configuring Cl 4. **Cultural context**: Any geographic, cultural, or thematic associations? (Japanese minimalism, industrial, retro, academic) 5. **Constraints**: Accessibility requirements, performance budgets, existing brand elements to preserve? 6. **Tech stack**: React, Vue, vanilla HTML/CSS, Next.js, framework preferences? -7. **Previous projects**: Any recent frontend work? (to ensure variety across projects) +7. **Previous projects**: Any recent frontend work? (to avoid reusing the same aesthetic choices -- variety across projects is mandatory) -**Step 2: Define 2-3 aesthetic directions** using `references/color-inspirations.json` and `references/font-catalog.json` as starting points. +**Step 2: Define 3-5 distinct aesthetic directions** using `references/color-inspirations.json` and `references/font-catalog.json` as starting points. Providing multiple directions prevents anchoring on a first instinct, which is the primary source of generic "AI slop" output. 
Example directions and what they mean: - **Neo-Brutalist Technical**: Bold typography, harsh contrasts, geometric precision, industrial textures @@ -107,7 +57,7 @@ Example directions and what they mean: - **Botanical Minimal**: Natural greens, generous whitespace, serif elegance, organic shapes - **Arctic Technical**: Cool blues, sharp geometry, monospace accents, clean precision -**Step 3: Output** `aesthetic_direction.json` with chosen direction(s) and contextual justification. See `references/implementation-examples.md` for template. +**Step 3: Output** `aesthetic_direction.json` with chosen direction(s) and contextual justification. Every direction must link back to project purpose, audience, and emotion -- context-driven justification is what separates distinctive design from arbitrary choices. See `references/implementation-examples.md` for template. **Gate**: Aesthetic direction defined with contextual justification linking project purpose, audience, and emotion to chosen direction. Do NOT proceed without this gate passing. @@ -115,14 +65,14 @@ Example directions and what they mean: **Goal**: Select distinctive, contextual font pairings that define the design's personality. -**Step 1: Load** `references/font-catalog.json`. All fonts in catalog are pre-approved; banned fonts (Inter, Roboto, Arial, Helvetica, system fonts, Space Grotesk) are excluded from catalog. +**Step 1: Load** `references/font-catalog.json`. All fonts in the catalog are pre-approved. The following fonts are banned because they are overused to the point of invisibility and signal generic output: Inter, Roboto, Arial, Helvetica, system fonts (e.g., `-apple-system, BlinkMacSystemFont, 'Segoe UI'`), Space Grotesk. Do not use them in selections or fallback stacks. **Step 2: Select font pairing** using this process: 1. Identify 3-5 candidate fonts from the appropriate aesthetic category -2. Eliminate any that feel "obvious" or overused for this context +2. 
Eliminate any that feel "obvious" or overused for this context -- resist the first instinct and explore deeper in the catalog, because beautiful unexpected combinations are the goal 3. Test combinations: Display font + Body font, or single font family with weight variation 4. Verify the pairing creates clear visual hierarchy -5. Check against project history for variety +5. Check against project history to confirm you are not reusing recent choices Selection criteria: - Matches aesthetic direction from Phase 1 @@ -138,9 +88,9 @@ Selection criteria: # Banned: Inter, Roboto, Arial, Helvetica, system fonts, Space Grotesk ``` -Manually verify: no banned fonts in selection, pairing not recently used, aesthetic match with direction. +Manually verify: no banned fonts in selection or fallback stacks (`sans-serif` alone counts as a banned system font), pairing not recently used, aesthetic match with direction. -**Step 4: Document** typography specification with font families, weights, usage roles, and rationale for each selection. See `references/implementation-examples.md` for template. +**Step 4: Document** typography specification with font families, weights, usage roles, and rationale for each selection. Be specific -- include exact font names and weights, not vague descriptions. See `references/implementation-examples.md` for template. **Gate**: Font validation passes (no banned fonts, no recent reuse, aesthetic match confirmed). Do NOT proceed until gate passes. @@ -155,7 +105,7 @@ Manually verify: no banned fonts in selection, pairing not recently used, aesthe - **Historical periods**: Art Deco, Mid-century modern, Victorian industrial - **Artistic movements**: Bauhaus, De Stijl, Impressionism -Select an inspiration source that resonates with the project context from Phase 1. +Select an inspiration source that resonates with the project context from Phase 1. 
The palette must trace back to that context -- convenience or personal preference is not a valid reason for a color choice. **Step 2: Build palette** with strict dominance structure: - **Dominant** (60-70%): Base background and major surfaces -- this sets the mood @@ -163,8 +113,10 @@ Select an inspiration source that resonates with the project context from Phase - **Accent** (5-10%): High-impact moments, CTAs, highlights -- demands attention sparingly - **Functional**: Success, warning, error, info states -- consistent across all designs +Colors distributed evenly without a clear dominant create visual chaos. The 60/30/10 ratio is non-negotiable because without it, no coherent aesthetic emerges. + **Step 3: Check against anti-patterns** in `references/anti-patterns.json`: -- No purple (#8B5CF6, #A855F7) as accent on white background +- No purple (#8B5CF6, #A855F7) as accent on white background -- the most cliched color scheme in modern web design, signaling generic SaaS template - No evenly distributed colors without clear dominance - No generic blue (#3B82F6) as primary on white - No pastels without saturation variation @@ -177,13 +129,13 @@ Select an inspiration source that resonates with the project context from Phase # Manual alternative: check palette against anti-patterns in references/anti-patterns.json ``` -Manually verify: no cliche patterns (purple on white, generic blue), clear 60/30/10 dominance ratio, sufficient contrast for accessibility. +Manually verify: no cliche patterns, clear 60/30/10 dominance ratio, sufficient contrast for accessibility. Report results with specific hex values rather than describing colors abstractly. **Gate**: Palette passes cliche detection and demonstrates clear 60/30/10 dominance ratio. Do NOT proceed until gate passes. ### Phase 4: Animation Strategy -**Goal**: Design choreography for high-impact moments only. Restraint is a feature. +**Goal**: Design choreography for high-impact moments only. 
Restraint is a feature -- animating everything dilutes impact and signals lack of intentionality. **Step 1: Identify high-impact moments** worth investing animation effort: - Initial page load (hero section reveal) @@ -223,7 +175,7 @@ Duration by scope: ### Phase 5: Background & Atmosphere -**Goal**: Create depth and mood through layered effects. Never use flat solid colors as backgrounds. +**Goal**: Create depth and mood through layered effects. Flat solid-color backgrounds fail this phase because they produce no atmospheric depth -- every surface needs at least two layers. **Step 1: Choose technique** from `references/background-techniques.md` based on aesthetic direction: - **Layered radial gradients**: Atmospheric depth with soft colored glows (sophisticated, landing pages) @@ -243,7 +195,7 @@ Duration by scope: ### Phase 6: Validation & Scoring -**Goal**: Objective quality assessment through validation scripts before any finalization. +**Goal**: Objective quality assessment before any finalization. Validation must run before delivering specifications -- skipping it means flaws compound through every downstream implementation decision. **Step 1: Run comprehensive validation** @@ -271,7 +223,7 @@ Duration by scope: ### Phase 7: Design Specification Output -**Goal**: Deliver a complete, implementable design specification. +**Goal**: Deliver a complete, implementable design specification. Only implement what was directly requested -- focus on distinctive aesthetics, not unnecessary abstractions or design system scaffolding unless the user explicitly asked for it. **Step 1: Generate CSS custom properties** (design tokens) covering typography, colors, spacing, shadows, and animation values. Reference `references/implementation-examples.md` for comprehensive token template. @@ -290,13 +242,13 @@ Duration by scope: **Step 4: If implementation is requested**, provide framework-specific starter code. 
Reference `references/implementation-examples.md` for React+Tailwind config, HTML+CSS templates, and design system templates. -**Gate**: Design specification document delivered with all sections complete and validation score included. +**Step 5: Clean up** temporary exploration artifacts (intermediate JSON files, draft palettes). Keep only the final specification and validation report. ---- +**Gate**: Design specification document delivered with all sections complete and validation score included. -## Examples +### Examples -### Example 1: New Landing Page +#### Example 1: New Landing Page User says: "Create a distinctive design for a developer tool landing page" Actions: 1. Gather context: developer audience, technical but approachable emotion (PHASE 1) @@ -309,7 +261,7 @@ Actions: 8. Output design specification with CSS tokens (PHASE 7) Result: Contextual, validated design specification ready for implementation -### Example 2: Design Audit +#### Example 2: Design Audit User says: "This site looks too generic, review it for AI slop" Actions: 1. Read existing CSS/design files to inventory current choices (PHASE 1) @@ -321,7 +273,19 @@ Actions: 7. 
Deliver report with specific replacement recommendations (PHASE 7) Result: Actionable audit with specific fixes for each detected issue ---- +## Reference Material + +### Design Catalogs + +These reference files contain the curated domain knowledge that drives design decisions: + +- `${CLAUDE_SKILL_DIR}/references/font-catalog.json`: Curated fonts by aesthetic category (banned fonts excluded) +- `${CLAUDE_SKILL_DIR}/references/color-inspirations.json`: Cultural/contextual color palette sources +- `${CLAUDE_SKILL_DIR}/references/animation-patterns.md`: High-impact animation choreography patterns with CSS and React examples +- `${CLAUDE_SKILL_DIR}/references/background-techniques.md`: Atmospheric background creation methods with code snippets +- `${CLAUDE_SKILL_DIR}/references/anti-patterns.json`: Banned fonts, cliche colors, layout and component cliches +- `${CLAUDE_SKILL_DIR}/references/implementation-examples.md`: CSS tokens, base styles, framework templates, specification document templates +- `${CLAUDE_SKILL_DIR}/references/project-history.json`: Aesthetic choices across projects (auto-generated by validation) ## Error Handling @@ -348,58 +312,12 @@ Solution: 3. Check if font + color + background form a cohesive story or feel disconnected 4. 
Iterate and re-validate -- max 3 attempts before reconsidering the aesthetic direction ---- - -## Anti-Patterns - -### Anti-Pattern 1: Skipping Aesthetic Exploration -**What it looks like**: Jumping to CSS/React implementation with first-instinct font and color choices -**Why wrong**: Produces generic, unconsidered design with no contextual justification; most "AI slop" originates here -**Do instead**: Complete Phase 1 context discovery and direction selection before touching implementation - -### Anti-Pattern 2: Using Banned Fonts -**What it looks like**: `font-family: 'Inter', sans-serif` or `font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI'` -**Why wrong**: Overused to the point of invisibility; instant "AI slop" aesthetic signal; fails validation immediately -**Do instead**: Select from `references/font-catalog.json`, run font validator, choose unexpected pairings - -### Anti-Pattern 3: Purple Gradients on White -**What it looks like**: `background: linear-gradient(135deg, #667eea, #764ba2)` with white surfaces and purple accents -**Why wrong**: Most cliched color scheme in modern web design; signals generic SaaS template; no contextual justification possible -**Do instead**: Research cultural/contextual color inspiration, create palette with 60/30/10 dominance structure - -### Anti-Pattern 4: Evenly Distributed Colors -**What it looks like**: Five accent colors used in roughly equal proportion across the design -**Why wrong**: Creates visual chaos without hierarchy; no dominant aesthetic emerges; looks like color picker experimentation -**Do instead**: Follow 60/30/10 rule strictly (dominant/secondary/accent), validate dominance ratio - -### Anti-Pattern 5: Implementing Without Validation -**What it looks like**: Writing CSS/React implementation without ever running design validation -**Why wrong**: May build entire frontend on flawed design foundations; wastes time on rework; no objective quality measure -**Do instead**: Run validation in Phase 6, 
ensure score >= 80 before proceeding to specification output - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Inter is clean and readable" | Clean = invisible; readable != distinctive | Select from curated catalog | -| "Purple gradient looks modern" | Modern = cliched; every SaaS uses this | Research contextual inspiration | -| "I'll validate later" | Later = never; flaws compound through phases | Validate at each phase gate | -| "Simple solid background is fine" | Solid = flat; depth creates atmosphere | Add at least 2 background layers | -| "Same fonts worked last time" | Worked != distinctive; variety is required | Check project history, choose new | - -### Reference Files -- `${CLAUDE_SKILL_DIR}/references/font-catalog.json`: Curated fonts by aesthetic category (banned fonts excluded) -- `${CLAUDE_SKILL_DIR}/references/color-inspirations.json`: Cultural/contextual color palette sources -- `${CLAUDE_SKILL_DIR}/references/animation-patterns.md`: High-impact animation choreography patterns with CSS and React examples -- `${CLAUDE_SKILL_DIR}/references/background-techniques.md`: Atmospheric background creation methods with code snippets -- `${CLAUDE_SKILL_DIR}/references/anti-patterns.json`: Banned fonts, cliche colors, layout and component cliches -- `${CLAUDE_SKILL_DIR}/references/implementation-examples.md`: CSS tokens, base styles, framework templates, specification document templates -- `${CLAUDE_SKILL_DIR}/references/project-history.json`: Aesthetic choices across projects (auto-generated by validation) +- `${CLAUDE_SKILL_DIR}/references/font-catalog.json` +- 
`${CLAUDE_SKILL_DIR}/references/color-inspirations.json` +- `${CLAUDE_SKILL_DIR}/references/animation-patterns.md` +- `${CLAUDE_SKILL_DIR}/references/background-techniques.md` +- `${CLAUDE_SKILL_DIR}/references/anti-patterns.json` +- `${CLAUDE_SKILL_DIR}/references/implementation-examples.md` +- `${CLAUDE_SKILL_DIR}/references/project-history.json` diff --git a/skills/do-parallel/SKILL.md b/skills/do-parallel/SKILL.md index 8dad062e..7e15a5ce 100644 --- a/skills/do-parallel/SKILL.md +++ b/skills/do-parallel/SKILL.md @@ -30,46 +30,7 @@ routing: $ARGUMENTS - Target agent/skill name + source material file path ---- - -## Operator Context - -This skill operates as an operator for intensive multi-perspective analysis, configuring Claude's behavior for true parallel independence across 10 analytical agents. It implements the **Fan-Out / Fan-In** architectural pattern -- dispatch independent agents in parallel, collect results, synthesize into unified recommendations -- with **Domain Intelligence** embedded in each perspective's focus constraints. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Over-Engineering Prevention**: Apply only Priority 1 and Priority 2 synthesized rules. Do not invent improvements beyond what the source material supports. No speculative enhancements. -- **True Parallel Independence**: All 10 Task invocations MUST be in a single message. Each agent receives ONLY its assigned perspective with zero cross-contamination. All 10 agents are read-only analysts — scope overlap checking (`scripts/check-scope-overlap.py`) is not required since agents produce analysis artifacts, not code modifications. -- **Artifact Persistence**: Save synthesis document and completion report to files. Context is ephemeral; artifacts persist. -- **Source Material Assessment**: Validate source material has sufficient depth before spawning agents. 
Thin material (under 500 words) should use inline analysis instead. -- **Validate Inputs First**: Verify target agent/skill exists and source material is readable before spawning any agents. -- **No Behavior Changes**: Synthesized rules ADD depth. They NEVER remove or significantly alter existing working patterns in the target. - -### Default Behaviors (ON unless disabled) -- **10 Perspectives**: Use all 10 analytical frameworks (see `references/perspective-prompts.md`) -- **Priority-Based Application**: Apply Must-Have rules first, then Should-Have. Skip Nice-to-Have unless user requests. -- **Synthesis Before Application**: Collect all 10 reports and synthesize before making any changes to the target. -- **Completion Report**: Generate detailed report showing impact, changes, and perspective contributions. -- **Graceful Degradation**: If agents time out, proceed with available results (3+ of 10 sufficient). -- **Git Commit**: Commit improvements with descriptive message after application. 
- -### Optional Behaviors (OFF unless enabled) -- **Reduced Perspectives**: Use 5 perspectives instead of 10 for faster completion -- **Dry Run Mode**: Generate synthesis without applying changes to target -- **Compare Mode**: Analyze two sources and extract differences - -## What This Skill CAN Do -- Extract comprehensive insights from complex source material through 10 independent lenses -- Synthesize cross-perspective patterns into prioritized improvement recommendations -- Apply synthesized rules to enhance an existing agent or skill -- Produce detailed reports showing which perspectives contributed to each improvement -- Detect patterns that single-threaded analysis misses due to cognitive anchoring - -## What This Skill CANNOT Do -- Replace inline analysis for simple or straightforward material (use `/do-perspectives` for single-target improvements) -- Generate value from poor source material (marketing fluff, auto-generated docs, under 500 words) -- Guarantee all 10 agents complete (network/timeout issues may reduce count) -- Skip the synthesis phase and apply raw per-perspective rules directly +This skill implements Fan-Out / Fan-In: dispatch 10 independent agents in parallel, collect results, synthesize into unified recommendations, and apply improvements to a target agent or skill. The primary value comes from cross-perspective pattern detection that single-threaded analysis misses due to cognitive anchoring. --- @@ -100,6 +61,8 @@ ls skills/{target_name}/SKILL.md **Step 3: Validate source material** +Read and follow repository CLAUDE.md before proceeding. 
Then assess the source: + ```markdown ## Source Material Assessment @@ -115,7 +78,8 @@ Assessment: SUITABLE / UNSUITABLE ``` - Read source file, confirm it is non-empty -- If material fails 2+ quality indicators, recommend inline analysis instead and ask user to confirm +- Material under 500 words lacks sufficient depth for 10-angle extraction -- recommend inline analysis instead and ask user to confirm +- If material fails 2+ quality indicators, recommend inline analysis (`/do-perspectives` for single-target improvements) and ask user to confirm **Gate**: Target exists and is readable. Source material is present and substantive. Proceed only when gate passes. @@ -125,6 +89,8 @@ Assessment: SUITABLE / UNSUITABLE **Step 1: Launch all 10 agents in a SINGLE message** +All 10 Task invocations MUST appear in one message to achieve true parallelism. Each agent is a read-only analyst receiving ONLY its assigned perspective with zero cross-contamination between agents. + Each agent receives: 1. The FULL source material 2. ONE assigned perspective (from `references/perspective-prompts.md`) @@ -143,6 +109,8 @@ The 10 perspectives are: 9. Complexity Management 10. Limitation and Nuance Handling +**Optional**: Use 5 perspectives instead of 10 for faster completion if user requests reduced mode. + **Step 2: Collect results with timeout awareness** Wait for all agents to complete. Monitor using this decision tree: @@ -161,6 +129,8 @@ Agent running > 5 minutes? **Step 3: Assess completeness** +Not all 10 agents are guaranteed to complete (network/timeout issues may reduce count). Degrade gracefully: + | Agents Completed | Action | |------------------|--------| | 8-10 of 10 | Full pipeline, excellent coverage | @@ -173,7 +143,7 @@ Agent running > 5 minutes? ### Phase 3: SYNTHESIZE -**Goal**: Merge 10 independent analyses into prioritized, unified recommendations. +**Goal**: Merge all independent analyses into prioritized, unified recommendations. 
Always collect ALL reports and synthesize before touching the target -- applying per-perspective rules one at a time misses cross-cutting themes and introduces contradictions. **Step 1: Create cross-reference matrix** @@ -198,6 +168,8 @@ For each rule extracted by any perspective, track which perspectives identified **Step 4: Prioritize rules** +Apply only Priority 1 and Priority 2 rules. Do not invent improvements beyond what the source material supports -- no speculative enhancements. + ```markdown ## Priority Rules for [Target] @@ -217,16 +189,20 @@ Rules present in 1-3 perspectives OR moderate impact: 2. [Rule] - Found in: [perspective] ``` +Priority 3 rules are documented but NOT applied unless the user explicitly requests them. Applying all 30-50 extracted rules without filtering leads to target bloat and conflicts with existing patterns. + **Step 5: Save synthesis document** - Write to `skills/do-parallel/artifacts/synthesis-{target}-{date}.md` - Include the cross-reference matrix, themes, and prioritized rules -- This artifact persists for future reference and can inform later analyses +- This artifact persists for future reference and can inform later analyses (context is ephemeral; artifacts persist) + +**Optional**: In dry run mode, stop here and present the synthesis without applying changes. **Gate**: Synthesis document exists with at least 3 Must-Have and 3 Should-Have rules. Proceed only when gate passes. ### Phase 4: APPLY -**Goal**: Improve the target agent/skill using synthesized recommendations. +**Goal**: Improve the target agent/skill using synthesized recommendations. Synthesized rules ADD depth -- they NEVER remove or significantly alter existing working patterns in the target. 
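
The "additions only" constraint in the Phase 4 goal can be checked mechanically after each rule is applied. The following is a minimal sketch, not part of the skill: the function names are hypothetical, and treating "preserved" as "every non-blank original line still appears, in the same order" is an assumption that is stricter than "significantly alter" in the text above.

```python
def missing_lines(before, after):
    """Return non-blank lines of `before` that cannot be found, in order, in `after`."""
    after_lines = after.splitlines()
    pos = 0  # index in after_lines where the next search starts
    missing = []
    for line in before.splitlines():
        if not line.strip():
            continue  # blank lines are ignored; only content lines must survive
        try:
            # Earliest match at or after `pos` keeps the original ordering.
            pos = after_lines.index(line, pos) + 1
        except ValueError:
            missing.append(line)
    return missing


def is_additive_only(before, after):
    """True when every original content line survives unchanged and in order."""
    return not missing_lines(before, after)
```

A caller would read the target before and after applying a rule and stop, or record the conflict in the report, when `is_additive_only` returns `False`. Reworded lines count as removed under this check, so it flags edits a human reviewer might consider harmless.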
**Step 1: Read current target state** @@ -256,13 +232,12 @@ Map each Priority 1 and Priority 2 rule to a specific location in the target: **Step 3: Apply Priority 1 rules** - Add or enhance sections based on Must-Have recommendations -- Preserve all existing working patterns +- Preserve all existing working patterns -- additions only - After each rule application, verify target file is still valid markdown **Step 4: Apply Priority 2 rules** - Add Should-Have enhancements where they integrate naturally -- Do NOT force rules that conflict with existing patterns -- If a Should-Have rule conflicts with an existing pattern, document the conflict in the report and skip +- If a Should-Have rule conflicts with an existing pattern, document the conflict in the report and skip it **Step 5: Commit changes** - Create descriptive git commit explaining what was improved and from what source @@ -385,41 +360,6 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Using Parallel for Simple Material -**What it looks like**: Running 10 agents on a 200-word README -**Why wrong**: No depth to analyze from 10 angles. Simple material yields the same insights from a single reading. -**Do instead**: Use `/do-perspectives` for single-target improvements or simpler inline analysis. Reserve do-parallel for complex, hard-to-grasp material. - -### Anti-Pattern 2: Applying All Rules Without Prioritization -**What it looks like**: Dumping all 30-50 extracted rules into the target without filtering -**Why wrong**: Low-frequency rules may conflict with existing patterns. Quantity overwhelms quality. Target becomes bloated. -**Do instead**: Apply Priority 1 first, then Priority 2. Skip Priority 3 unless explicitly requested. - -### Anti-Pattern 3: Skipping Synthesis Phase -**What it looks like**: Reading each agent report and applying rules one perspective at a time -**Why wrong**: Cross-perspective patterns are the primary value. 
Applying per-perspective rules misses common themes and introduces contradictions. -**Do instead**: Always collect all reports, identify common themes, then create unified recommendations before touching the target. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Pipeline Architecture](../shared-patterns/pipeline-architecture.md) - Phase-gated pipeline design -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) - Phase transition rules - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Source is simple, 10 perspectives overkill" | Simple source = use inline analysis instead | Check material depth in Phase 1, downgrade if thin | -| "3 perspectives returned, close enough" | 3 is minimum for synthesis, not ideal | Wait for timeout threshold, then proceed with available | -| "I can synthesize as I go" | Per-perspective application misses cross-cutting themes | Complete all collection before ANY synthesis | -| "Existing patterns in target are outdated" | Existing patterns may work; new rules ADD, never replace | Preserve all existing content, add depth only | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/perspective-prompts.md`: All 10 perspective templates, synthesis format, completion report template, and source material guidance diff --git a/skills/do/SKILL.md b/skills/do/SKILL.md index b7efe26b..6097e015 100644 --- a/skills/do/SKILL.md +++ b/skills/do/SKILL.md @@ -26,58 +26,7 @@ routing: # /do - Smart Router -## Operator Context - -This skill operates as the primary routing operator, classifying requests and dispatching them to specialized agents and skills. 
It implements the **Router** pattern: parse request, select agent, pair skill, execute. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before any routing decision -- **Over-Engineering Prevention**: Route to the simplest agent+skill that satisfies the request. Do not stack unnecessary skills -- **Route Code Changes**: NEVER edit code directly. Any code modification MUST be routed to a domain agent -- **Force-Route Compliance**: When force-route triggers match, invoke that skill BEFORE any other action -- **Anti-Rationalization Injection**: Auto-inject anti-rationalization patterns for code, review, security, and testing tasks -- **Plan Before Execute**: Create `task_plan.md` for Simple+ complexity before routing to agents -- **Parallel First**: Check for parallelizable patterns BEFORE standard sequential routing -- **Branch Safety**: Route to agents that create branches; never allow direct main/master commits -- **Mandatory Pre-Merge Review Loop**: For repos without organization-gated workflows, run up to 3 iterations of `/pr-review` → fix before creating a PR -- **Organization-Gated Workflow**: Repos under protected organizations (via `scripts/classify-repo.py`) require user confirmation before EACH git action. NEVER auto-execute or auto-merge -- **Routing Banner**: ALWAYS display the routing decision banner as the FIRST visible output after classifying. Show BEFORE creating plans, BEFORE invoking agents, BEFORE any work begins -- **Creation Protocol**: For "create"/"new" requests at Simple+ complexity, automatically sequence: (1) ADR, (2) task plan, (3) implementation via domain agent - -### Default Behaviors (ON unless disabled) -- **Retro Knowledge Injection**: Auto-inject from learning.db (benchmark: +5.3 avg, 67% win rate). 
Relevance-gated by FTS5 keyword matching -- **Enhancement Stacking**: Add verification-before-completion, TDD, or parallel reviewers when signals detected -- **Negative Enhancement Rules**: Check skill's `pairs_with` before stacking. Empty `pairs_with: []` = no stacking. Do NOT stack verification on skills with built-in verification gates. Do NOT stack TDD on `fast` -- **Local Agent Discovery**: Route to `.claude/agents/` local agents when `[cross-repo]` output is present -- **Auto-Pipeline Fallback**: When no agent/skill matches, invoke auto-pipeline to classify and execute with phase gates -- **Post-Task Learning**: After Simple+ tasks, extract reusable patterns and record via `retro-record-adhoc` - -### Optional Behaviors (OFF unless enabled) -- **Dry Run Mode**: Show routing decision without executing -- **Verbose Routing**: Explain why each alternative was rejected -- **Force Direct**: Override routing for explicitly trivial operations - -## What This Skill CAN Do -- Route to any agent, skill, or command in the system -- Decompose multi-part requests into parallel or sequential sub-tasks -- Stack enhancement skills (TDD, verification, anti-rationalization) on top of primary routing -- Detect force-route triggers and invoke mandatory skills -- Launch up to 10 parallel agents in a single message - -## What This Skill CANNOT Do -- Edit code directly (must route to a domain agent) -- Override CLAUDE.md requirements or skip verification steps -- Route to agents or skills that do not exist -- Handle Medium+ complexity tasks without creating a plan first -- Skip force-route triggers when they match - ---- - -## Instructions - -### Role: /do is a ROUTER, not a worker - -/do's ONLY job is to ROUTE requests to agents. It does NOT execute, implement, debug, review, or fix anything itself. +/do is a **ROUTER**, not a worker. Its ONLY job is to classify requests, select the right agent + skill, and dispatch. 
It does NOT execute, implement, debug, review, or fix anything itself. **What the main thread does:** (1) Classify, (2) Select agent+skill, (3) Dispatch via Agent tool, (4) Evaluate if more work needed, (5) Route to ANOTHER agent if yes, (6) Report results. @@ -85,6 +34,10 @@ This skill operates as the primary routing operator, classifying requests and di The main thread is an **orchestrator**. If you find yourself reading source code, writing code, or doing analysis instead of dispatching an agent — STOP. Route it. +--- + +## Instructions + ### Phase Banners (MANDATORY) Every phase MUST display a banner BEFORE executing: `/do > Phase N: PHASE_NAME — description...` @@ -97,6 +50,8 @@ After Phase 2, display the full routing decision banner (`===` block). Phase ban **Goal**: Determine request complexity and whether routing is needed. +Read and follow the repository CLAUDE.md before making any routing decision, because it contains project-specific conventions that affect agent selection and skill pairing. + | Complexity | Agent | Skill | Direct Action | |------------|-------|-------|---------------| | Trivial | No | No | **ONLY reading a file the user named by exact path** | @@ -104,36 +59,52 @@ After Phase 2, display the full routing decision banner (`===` block). Phase ban | Medium | **Required** | **Required** | Never | | Complex | Required (2+) | Required (2+) | Never | -**Trivial = reading a file the user named by exact path.** Everything else is Simple+ and MUST use an agent, skill, or pipeline. When uncertain, classify UP not down — tokens are cheap, bad code is expensive. +**Trivial = reading a file the user named by exact path.** Everything else is Simple+ and MUST use an agent, skill, or pipeline. When uncertain, classify UP not down — because under-routing wastes implementations while over-routing only wastes tokens, and tokens are cheap but bad code is expensive. 
**Common misclassifications** (these are NOT Trivial — route them): evaluating repos/URLs, any opinion/recommendation, git operations, codebase questions (`explore-pipeline`), retro lookups (`retro` skill), comparing approaches. -**Maximize skill/agent/pipeline usage.** If a skill or pipeline exists for the task, USE IT — even if handling directly seems faster. Display the routing banner for ALL classifications including Trivial. +**Maximize skill/agent/pipeline usage.** If a skill or pipeline exists for the task, USE IT — even if handling directly seems faster, because skills encode domain patterns that prevent common mistakes. -**Check for parallel patterns FIRST**: 2+ independent failures or 3+ subtasks → `dispatching-parallel-agents`; broad research → `research-coordinator-engineer`; multi-agent coordination → `project-coordinator-engineer`; plan exists + "execute" → `subagent-driven-development`; new feature → `feature-design` (check `.feature/` directory; if present, use `feature-state.py status` for current phase). +**Check for parallel patterns FIRST** because independent work items can run concurrently, saving significant time — sequential dispatch when parallel is possible wastes wall-clock time needlessly: 2+ independent failures or 3+ subtasks → `dispatching-parallel-agents`; broad research → `research-coordinator-engineer`; multi-agent coordination → `project-coordinator-engineer`; plan exists + "execute" → `subagent-driven-development`; new feature → `feature-design` (check `.feature/` directory; if present, use `feature-state.py status` for current phase). + +**Optional: Force Direct** — OFF by default. When explicitly enabled, overrides routing for trivial operations. Only applies when the user explicitly requests it. **Gate**: Complexity classified. Display routing banner (ALL classifications). If not Trivial, proceed to Phase 2. If Trivial, handle directly after showing banner. 
+--- + ### Phase 2: ROUTE **Goal**: Select the correct agent + skill combination from the INDEX files and routing tables. **Step 1: Check force-route triggers** -Force-route triggers are in `skills/INDEX.json` (field: `force_route: true`). If a force-route trigger matches the request, invoke that skill BEFORE any other action. +Force-route triggers are in `skills/INDEX.json` (field: `force_route: true`). If a force-route trigger matches the request, invoke that skill BEFORE any other action, because force-routes encode critical domain patterns that prevent common mistakes — skipping them causes the exact class of bugs they were designed to prevent. -**Critical**: "push", "commit", "create PR", "merge" are NOT trivial git commands. They MUST route through skills that run quality gates. Running raw `git push` or `gh pr create` bypasses all quality gates. +Check triggers literally against the request text. If triggers match, force-route applies — no exceptions, no judgment calls about whether "it applies here." + +Trigger phrases must contain only user-language keywords, never sibling skill names, because the router matches triggers against request text and a sibling skill name would cause false matches. Each trigger phrase must map to exactly one skill — duplicates across skills make deterministic routing impossible. + +**Critical**: "push", "commit", "create PR", "merge" are NOT trivial git commands. They MUST route through skills that run quality gates, because running raw `git push`, `git commit`, `gh pr create`, or `gh pr merge` directly bypasses lint checks, test runs, review loops, CI verification, and repo classification. **Step 2: Select agent + skill** Read the routing tables in `references/routing-tables.md` and the INDEX files (`agents/INDEX.json`, `skills/INDEX.json`, `pipelines/INDEX.json`) to identify candidates by trigger-overlap. Select the best match; use LLM judgment to tiebreak when multiple candidates fit equally well. 
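
The force-route check from Step 1 and the trigger-overlap selection from Step 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the router's actual implementation: the entry layout (`name`, `triggers`, `force_route` keys) is a guess at what `skills/INDEX.json` might hold, the function names are hypothetical, and `example-git-skill` with its trigger phrases is invented for the example.

```python
from collections import defaultdict


def find_duplicate_triggers(index):
    """Return {trigger: [skill names]} for triggers claimed by more than one skill."""
    owners = defaultdict(list)
    for entry in index:
        for trigger in entry.get("triggers", []):
            owners[trigger.lower()].append(entry["name"])
    return {t: names for t, names in owners.items() if len(names) > 1}


def match_force_route(request, index):
    """Return the first force-route skill whose trigger appears literally in the request."""
    text = request.lower()
    for entry in index:
        if not entry.get("force_route", False):
            continue
        if any(t.lower() in text for t in entry.get("triggers", [])):
            return entry["name"]
    return None


def rank_by_trigger_overlap(request, index):
    """Rank entries by how many of their triggers appear in the request (highest first)."""
    text = request.lower()
    scored = []
    for entry in index:
        hits = sum(1 for t in entry.get("triggers", []) if t.lower() in text)
        if hits:
            scored.append((hits, entry["name"]))
    # Hit count descending, then name, so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical index entries, for illustration only.
EXAMPLE_INDEX = [
    {"name": "example-git-skill", "triggers": ["create pr", "push"], "force_route": True},
    {"name": "systematic-debugging", "triggers": ["debug", "failing test"]},
    {"name": "systematic-code-review", "triggers": ["review"]},
]
```

Running `find_duplicate_triggers` over the index before routing surfaces the one-trigger-one-skill violation described in Step 1. Plain substring matching is the simplest reading of "check triggers literally"; a real router would likely need word-boundary handling, and ties in the ranking still fall to the LLM tiebreak described above.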
+Route to the simplest agent+skill that satisfies the request, because over-engineering the routing itself (stacking unnecessary skills) creates more overhead than it prevents. + +When `[cross-repo]` output is present, route to `.claude/agents/` local agents because they contain project-specific knowledge that generic agents lack. + +Never edit code directly — any code modification MUST be routed to a domain agent, because domain agents carry language-specific expertise, testing methodology, and quality gates that the router lacks. + **Step 3: Apply skill override** (task verb overrides default skill) When the request verb implies a specific methodology, override the agent's default skill. Common overrides: "review" → systematic-code-review, "debug" → systematic-debugging, "refactor" → systematic-refactoring, "TDD" → test-driven-development. Full override table in `references/routing-tables.md`. **Step 4: Display routing decision** (MANDATORY — do this NOW, before anything else) +This banner MUST be the FIRST visible output for EVERY /do invocation. Display BEFORE creating plans, BEFORE invoking agents, BEFORE any work begins. No exceptions. + ``` =================================================================== ROUTING: [brief summary] @@ -149,7 +120,9 @@ When the request verb implies a specific methodology, override the agent's defau For Trivial: show `Classification: Trivial - [reason]` and `Handling directly (no agent/skill needed)`. -This banner MUST be the FIRST visible output for EVERY /do invocation. Display BEFORE any work begins. No exceptions. +**Optional: Dry Run Mode** — OFF by default. When enabled, show the routing decision without executing. + +**Optional: Verbose Routing** — OFF by default. When enabled, explain why each alternative was rejected. **Step 5: Record routing decision** (Simple+ only — skip Trivial): @@ -165,10 +138,14 @@ Tags: `force-route`, `llm-override`, `auto-pipeline` (as applicable). 
This call **Gate**: Agent and skill selected. Banner displayed. Routing decision recorded. Proceed to Phase 3. +--- + ### Phase 3: ENHANCE **Goal**: Stack additional skills based on signals in the request. +Auto-inject retro knowledge from `learning.db` for any substantive work (benchmark: +5.3 avg, 67% win rate), because historical patterns prevent repeat mistakes. Relevance-gated by FTS5 keyword matching — only inject when keywords overlap. + | Signal in Request | Enhancement to Add | |-------------------|-------------------| | Any substantive work (code, design, plan) | **Auto-inject retro knowledge** (via `retro-knowledge-injector` hook) | @@ -179,7 +156,9 @@ Tags: `force-route`, `llm-override`, `auto-pipeline` (as applicable). This call | "review" with 5+ files | Use parallel-code-review (3 reviewers) | | Complex implementation | Offer subagent-driven-development | -**Auto-inject anti-rationalization** for these task types: +Before stacking any enhancement, check the target skill's `pairs_with` field in `skills/INDEX.json`, because some skills have built-in verification gates that make stacking redundant or harmful. Specifically: empty `pairs_with: []` means no stacking allowed. Do NOT stack verification on skills with built-in verification gates. Do NOT stack TDD on `fast`. + +**Auto-inject anti-rationalization** for these task types, because these categories are where shortcut rationalization causes the most damage: | Task Type | Patterns Injected | |-----------|-------------------| @@ -194,34 +173,44 @@ For explicit maximum rigor, use `/with-anti-rationalization [task]`. **Gate**: Enhancements applied. Proceed to Phase 4. +--- + ### Phase 4: EXECUTE **Goal**: Invoke the selected agent + skill and deliver results. 
**Step 0: Execute Creation Protocol** (for creation requests ONLY) -If request contains "create", "new", "scaffold", "build pipeline/agent/skill/hook" AND complexity is Simple+: (1) Write ADR at `adr/{kebab-case-name}.md`, (2) Register via `adr-query.py register`, (3) Proceed to plan creation. The `adr-context-injector` and `adr-enforcement` hooks handle cross-agent ADR compliance automatically. +If request contains "create", "new", "scaffold", "build pipeline/agent/skill/hook" AND complexity is Simple+, automatically sequence: (1) Write ADR at `adr/{kebab-case-name}.md`, (2) Register via `adr-query.py register`, (3) Proceed to plan creation. The `adr-context-injector` and `adr-enforcement` hooks handle cross-agent ADR compliance automatically. This protocol fires automatically because creation requests at Simple+ complexity need architectural grounding before implementation begins. **Step 1: Create plan** (for Simple+ complexity) -Create `task_plan.md` before execution. The `auto-plan-detector.py` hook auto-injects `` context. Skip only for Trivial tasks. +Create `task_plan.md` before execution, because executing without a plan produces wrong results faster — not correct results sooner. The `auto-plan-detector.py` hook auto-injects `` context. Skip only for Trivial tasks. **Step 2: Invoke agent with skill** Dispatch the agent. MCP tool discovery is the agent's responsibility — each agent's markdown declares which MCP tools it needs. Do not inject MCP instructions from /do. +Route to agents that create branches; never allow direct main/master commits, because main branch commits affect everyone and bypassing branch protection causes cascading problems. + +When dispatching agents for file modifications, explicitly include "commit your changes on the branch" in the agent prompt, because otherwise the agent completes file edits but changes sit unstaged — the orchestrator assumes committed work and moves on, and changes are lost. 
+ +For repos without organization-gated workflows, run up to 3 iterations of `/pr-review` → fix before creating a PR, because post-merge fixes cost 2 PRs instead of 1. For repos under protected organizations (via `scripts/classify-repo.py`), require user confirmation before EACH git action — never auto-execute or auto-merge, because organization-gated repos have compliance requirements that automation must not bypass. + **Step 3: Handle multi-part requests** Detect: "first...then", "and also", numbered lists, semicolons. Sequential dependencies execute in order. Independent items launch multiple Task tools in single message. Max parallelism: 10 agents. **Step 4: Auto-Pipeline Fallback** (when no agent/skill matches AND complexity >= Simple) -Invoke `auto-pipeline` (MANDATORY — "handle directly" is not an option). If no pipeline matches either, fall back to closest agent + verification-before-completion. +Invoke `auto-pipeline` (MANDATORY — "handle directly" is not an option), because a missing agent match is a routing gap to report, not a license to bypass routing. If no pipeline matches either, fall back to closest agent + verification-before-completion. -When uncertain which route: **ROUTE ANYWAY.** Add verification-before-completion as safety net. +When uncertain which route: **ROUTE ANYWAY.** Add verification-before-completion as safety net. Routing overhead is always less than the cost of unreviewed code changes. **Gate**: Agent invoked, results delivered. Proceed to Phase 5. +--- + ### Phase 5: LEARN **Goal**: Ensure session insights are captured to `learning.db`. @@ -267,69 +256,8 @@ Solution: Stop execution. Create `task_plan.md`. Resume routing after plan is in --- -## Anti-Patterns - -### Anti-Pattern 1: Handling Code Directly -**What it looks like**: Editing source files without routing to a domain agent -**Why wrong**: Bypasses domain expertise, testing methodology, and quality gates -**Do instead**: Route to the domain agent. Always. 
Even for "simple" changes. - -### Anti-Pattern 2: Under-Routing -**What it looks like**: Treating code changes as "trivial" to avoid routing overhead -**Why wrong**: Under-routing wastes implementations. Over-routing only wastes tokens. Tokens are cheap; bad code is expensive. -**Do instead**: Default to routing. Trivial = reading a file the user named by path. Nothing else qualifies. - -### Anti-Pattern 3: Skipping Force-Routes -**What it looks like**: Writing Go tests without invoking go-testing, or Go concurrency without go-concurrency -**Why wrong**: Force-routes encode critical domain patterns that prevent common mistakes -**Do instead**: Check force-route triggers BEFORE selecting a general agent. Force-routes override defaults. - -### Anti-Pattern 4: Sequential When Parallel Is Possible -**What it looks like**: Fixing 3 independent test failures one at a time -**Why wrong**: Independent work items can run concurrently, saving significant time -**Do instead**: Detect independent items and use dispatching-parallel-agents. - -### Anti-Pattern 5: Raw Git Commands Instead of Skills -**What it looks like**: Running `git push`, `git commit`, `gh pr create`, or `gh pr merge` directly -**Why wrong**: Bypasses lint checks, test runs, review loops, CI verification, and repo classification -**Do instead**: Route ALL git submission actions through their skills. No exceptions. - -### Anti-Pattern 6: Force-Route Triggers Containing Sibling Skill Names -**What it looks like**: A force-route trigger list includes the name of another skill -**Why wrong**: The router matches the sibling name as a trigger for the wrong skill -**Do instead**: Triggers must contain only user-language keywords, never sibling skill names. 
- -### Anti-Pattern 7: Duplicate Trigger Phrases Across Skills -**What it looks like**: Two skills claim the same trigger phrase -**Why wrong**: The router cannot deterministically pick between them -**Do instead**: Each trigger phrase must map to exactly one skill. Check for collisions before adding. - -### Anti-Pattern 8: Dispatching Agents Without Commit Instructions -**What it looks like**: Dispatching an agent to modify files on a branch, but not telling the agent to commit its changes -**Why wrong**: The agent completes file edits but the changes sit unstaged. The dispatching orchestrator assumes the work is committed and moves on. Changes are lost or require manual cleanup. -**Do instead**: When dispatching agents for file modifications, explicitly include "commit your changes on the branch" in the agent prompt. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) - Phase transition gates - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "This is trivial, I'll handle it directly" | Trivial = reading a named file, nothing else | Route to agent; show banner regardless | -| "No agent matches, I'll just do it myself" | Missing agent is a gap to report, not a bypass | Report gap, route to closest match | -| "Force-route doesn't apply here" | If triggers match, force-route applies. 
No exceptions | Check triggers literally | -| "Routing overhead isn't worth it for this" | Routing overhead < cost of unreviewed code changes | Route anyway; tokens are cheap | -| "User wants it fast, skip the plan" | Fast without a plan produces wrong results faster | Create plan, then execute | -| "User seems impatient, skip the review" | **There is never time pressure.** A denied tool call is NOT permission to skip quality gates | Run the full review loop | -| "Just push it, we can fix later" | Post-merge fixes cost 2 PRs instead of 1 | Route through pr-sync/pr-pipeline | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/routing-tables.md`: Complete category-specific skill routing - `agents/INDEX.json`: Agent triggers and metadata diff --git a/skills/docs-sync-checker/SKILL.md b/skills/docs-sync-checker/SKILL.md index aeca61c9..f6705c39 100644 --- a/skills/docs-sync-checker/SKILL.md +++ b/skills/docs-sync-checker/SKILL.md @@ -27,52 +27,11 @@ routing: # Documentation Sync Checker Skill -## Operator Context - -This skill operates as an operator for documentation synchronization workflows, configuring Claude's behavior for automated drift detection between filesystem tools and their README entries. It implements deterministic scanning and comparison -- no AI judgment on content quality, only presence/absence/version verification. - -The 4-phase workflow (Scan, Cross-Reference, Detect, Report) ensures systematic coverage. Each phase has a gate that must pass before proceeding. The skill produces a sync score (percentage of tools properly documented) and actionable fix suggestions for every detected issue. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Over-Engineering Prevention**: Only scan, compare, and report. 
No speculative features -- **Deterministic Scanning**: File existence, YAML parsing, and markdown extraction must be deterministic -- **Documentation File Locations**: Check specific files: skills/README.md, agents/README.md, commands/README.md, README.md, docs/REFERENCE.md -- **Sync Rules**: Skills in skills/README.md, agents in agents/README.md, commands in commands/README.md - -### Default Behaviors (ON unless disabled) -- **Concise Reporting**: Report facts without self-congratulation; show data, not descriptions -- **Temporary File Cleanup**: Remove helper scripts and debug outputs at task completion -- **Stale Entry Detection**: Flag documented tools that no longer exist in the filesystem -- **Version Mismatch Detection**: Compare YAML frontmatter versions with documented versions -- **Severity Assignment**: HIGH for missing entries, MEDIUM for stale entries, LOW for version mismatches - -### Optional Behaviors (OFF unless enabled) -- **Auto-Fix Mode**: Automatically add missing documentation entries (--auto-fix) -- **Strict Mode**: Exit with error code if sync issues found (--strict) -- **JSON Output**: Machine-readable report for CI/CD pipelines (--format json) - -## What This Skill CAN Do -- Discover all skills (skills/*/SKILL.md), agents (agents/*.md), and commands (commands/**/*.md) -- Parse YAML frontmatter and extract name, description, version fields -- Parse markdown tables and lists from README files using deterministic parsing -- Detect missing entries (tool exists in filesystem, not documented in README) -- Detect stale entries (documented in README, tool no longer in filesystem) -- Detect version mismatches (YAML version differs from documented version) -- Generate actionable sync reports with exact suggested fixes (markdown rows to add/remove) -- Handle namespaced commands (commands/code/cleanup.md -> /code cleanup) -- Calculate sync score as percentage of tools properly documented -- Support strict mode for CI/CD integration (exit code 1 on 
issues) - -## What This Skill CANNOT Do -- Judge whether descriptions are accurate or helpful (presence only, not quality) -- Generate or improve documentation content (uses YAML description verbatim) -- Resolve markdown merge conflicts in documentation files -- Validate cross-references or internal links between documents -- Track when documentation drift occurred (point-in-time snapshot only) -- Fix semantic inconsistencies between descriptions in different files -- Automatically fix documentation without human review (auto-fix is experimental, requires explicit opt-in) -- **Reason**: This skill is deterministic scanning and comparison. Content generation and quality assessment require different skills. +Deterministic 4-phase drift detector that compares the filesystem against README entries. Each phase (Scan, Cross-Reference, Detect, Report) has a gate that must pass before proceeding. The skill produces a sync score (percentage of tools properly documented) and actionable fix suggestions for every detected issue. + +This skill checks presence, absence, and version alignment only -- it does not judge description quality, generate documentation content, resolve merge conflicts, validate cross-references, or track when drift occurred. Suggested fixes use YAML descriptions verbatim; content generation and quality assessment require different skills. + +Optional flags: `--auto-fix` (experimental, requires explicit opt-in), `--strict` (exit code 1 on issues), `--format json` (machine-readable output for CI/CD). --- @@ -80,7 +39,7 @@ The 4-phase workflow (Scan, Cross-Reference, Detect, Report) ensures systematic ### Phase 1: SCAN -**Goal**: Discover all skills, agents, and commands in the repository filesystem. +**Goal**: Discover all skills, agents, and commands in the repository filesystem. All discovery (file existence checks, YAML parsing, markdown extraction) must be deterministic -- no AI judgment on content quality. 
**Step 1: Run the scan script** @@ -119,7 +78,7 @@ YAML errors: [N] (must be 0 to proceed) ### Phase 2: CROSS-REFERENCE -**Goal**: Extract documented tools from README files and compare with discovered tools. +**Goal**: Extract documented tools from README files and compare with discovered tools. Each tool type has a primary documentation file: skills belong in `skills/README.md`, agents in `agents/README.md`, commands in `commands/README.md`. **Step 1: Run the documentation parser** @@ -129,6 +88,8 @@ python3 skills/docs-sync-checker/scripts/parse_docs.py --repo-root $HOME/claude- **Step 2: Parse each documentation file** +These are the five documentation files to check -- no others: + | File | Format | What to Extract | |------|--------|-----------------| | `skills/README.md` | Markdown table | Name, Description, Command, Hook columns | @@ -151,13 +112,13 @@ For each documentation file, collect the set of tool names found. This creates a ### Phase 3: DETECT -**Goal**: Compare discovered tools with documented tools to identify drift. +**Goal**: Compare discovered tools with documented tools to identify drift. This is a point-in-time snapshot -- it cannot tell you when drift occurred, only that it exists now. **Step 1: Compute set differences** For each tool type and its primary documentation file: - `missing = filesystem_tools - documented_tools` (tools that exist but are not documented) -- `stale = documented_tools - filesystem_tools` (documented tools that no longer exist) +- `stale = documented_tools - filesystem_tools` (documented tools that no longer exist -- users waste time trying to invoke non-existent tools, so always flag these) **Step 2: Check version consistency** @@ -167,6 +128,8 @@ For tools that appear in both sets, compare: **Step 3: Categorize and assign severity** +Severity reflects user impact: missing entries mean tools are undiscoverable, stale entries waste time, version mismatches cause confusion. 
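
The set differences and severity assignment in Phase 3 can be sketched in a few lines. This is an illustration under assumptions, not the skill's real detection script: the `detect_drift` function, the `{name: version}` input shape, and the issue tuples are hypothetical, while the severity levels (HIGH for missing, MEDIUM for stale, LOW for version mismatch) come from the skill text.

```python
def detect_drift(filesystem_tools, documented_tools):
    """Compare two {tool_name: version} maps and return a list of issues.

    Each issue is a (severity, category, tool_name) tuple. The input and
    output shapes are hypothetical; only the severity mapping is from the skill.
    """
    fs_names = set(filesystem_tools)
    doc_names = set(documented_tools)
    issues = []

    # missing = filesystem_tools - documented_tools (exists but undocumented)
    for name in sorted(fs_names - doc_names):
        issues.append(("HIGH", "missing", name))

    # stale = documented_tools - filesystem_tools (documented but gone)
    for name in sorted(doc_names - fs_names):
        issues.append(("MEDIUM", "stale", name))

    # Tools present in both sets are checked for version agreement.
    for name in sorted(fs_names & doc_names):
        if filesystem_tools[name] != documented_tools[name]:
            issues.append(("LOW", "version-mismatch", name))

    return issues


# Sample data loosely modeled on the skill's own examples.
filesystem = {"my-new-skill": "1.0.0", "code-linting": "2.0.0"}
documented = {"code-linting": "1.5.0", "old-agent": "1.0.0"}
issues = detect_drift(filesystem, documented)
for issue in issues:
    print(issue)
```

Sorting each set before iterating keeps the output order stable across runs, which matches the skill's requirement that detection be deterministic.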
+ | Category | Condition | Severity | |----------|-----------|----------| | Missing Entry | Tool in filesystem, not in primary README | HIGH | @@ -182,7 +145,7 @@ For each issue, capture: tool type, tool name, tool path, affected documentation ### Phase 4: REPORT -**Goal**: Generate human-readable report with actionable fix suggestions. +**Goal**: Generate human-readable report with actionable fix suggestions. Report facts concisely -- show data, not self-congratulatory descriptions. Target 100% sync score; even one missing entry erodes trust in all documentation. **Step 1: Run the report generator** @@ -209,7 +172,7 @@ Report must include these sections: **Step 3: Validate actionability** -Every issue in the report must have a concrete suggested fix. No issue should say "review manually" without specifying what to review and where. +Every issue in the report must have a concrete suggested fix. No issue should say "review manually" without specifying what to review and where. The fix should enable a single-commit resolution -- tool files and documentation entries should be added/removed together. **Step 4: Report format for missing entries** @@ -228,13 +191,15 @@ For each missing command, generate a suggested list item: - `/command-name` - Description from command file ``` -**Gate**: Report generated with actionable suggestions for every issue. +**Step 5: Cleanup** ---- +Remove any helper scripts and debug outputs created during execution. -## Examples +**Gate**: Report generated with actionable suggestions for every issue. + +### Examples -### Example 1: New Skill Missing from README +#### Example 1: New Skill Missing from README User created `skills/my-new-skill/SKILL.md` but forgot to update `skills/README.md`. Actions: 1. SCAN discovers `my-new-skill` in filesystem @@ -242,7 +207,7 @@ Actions: 3. DETECT flags as HIGH severity missing entry 4. 
REPORT suggests exact table row to add to skills/README.md -### Example 2: Removed Agent Still Documented +#### Example 2: Removed Agent Still Documented User deleted `agents/old-agent.md` but `agents/README.md` still lists it. Actions: 1. SCAN does not find `old-agent` in filesystem @@ -250,7 +215,7 @@ Actions: 3. DETECT flags as MEDIUM severity stale entry 4. REPORT suggests removing the row from agents/README.md -### Example 3: Version Bump Without Doc Update +#### Example 3: Version Bump Without Doc Update User updated `version: 2.0.0` in `skills/code-linting/SKILL.md` but `docs/REFERENCE.md` still shows `Version: 1.5.0`. Actions: 1. SCAN reads YAML version as 2.0.0 @@ -258,7 +223,7 @@ Actions: 3. DETECT flags as LOW severity version mismatch 4. REPORT suggests updating version line in docs/REFERENCE.md to 2.0.0 -### Example 4: Batch Changes After Refactor +#### Example 4: Batch Changes After Refactor User created 3 new skills and deleted 2 old ones in a refactoring PR. Actions: 1. SCAN discovers 3 new skills in filesystem, does not find 2 removed skills @@ -266,8 +231,6 @@ Actions: 3. DETECT flags 3 HIGH (missing) + 2 MEDIUM (stale) issues 4. 
REPORT provides exact table rows to add and identifies rows to remove ---- - ## Error Handling ### Error: "YAML Parse Error" @@ -304,51 +267,7 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Creating Tools Without Documentation -**What it looks like**: `git commit -m "Add new skill"` without updating skills/README.md -**Why wrong**: Documentation drifts from reality; users cannot discover the tool -**Do instead**: Run docs-sync-checker before committing; add tool AND documentation in same commit - -### Anti-Pattern 2: Removing Tools, Leaving Documentation -**What it looks like**: `rm -rf skills/old-skill` followed by commit, but README still lists it -**Why wrong**: Users waste time trying to invoke non-existent tools; erodes trust in docs -**Do instead**: Remove tool files AND documentation entries together in one commit - -### Anti-Pattern 3: Manual Edits Without Validation -**What it looks like**: Hand-editing README tables without running sync checker afterward -**Why wrong**: Typos in tool names create phantom entries (documented but non-existent) -**Do instead**: Always validate with docs-sync-checker after manual documentation edits - -### Anti-Pattern 4: Ignoring Sync Failures in CI -**What it looks like**: `continue-on-error: true` on the sync check step in GitHub Actions -**Why wrong**: Documentation drift accumulates; the check becomes meaningless -**Do instead**: Run with --strict and let the build fail; fix docs before merging - -### Anti-Pattern 5: Updating Version Without Syncing Documentation -**What it looks like**: Bumping `version: 2.0.0` in YAML frontmatter, committing without checking docs -**Why wrong**: Version mismatch creates confusion about which version is deployed; users reference wrong version -**Do instead**: Update YAML version, run sync checker, update all documentation locations in one commit - ---- - ## References - -This skill uses these shared patterns: -- 
[Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I'll update the README later" | Later never comes; drift accumulates | Update docs in the same commit | -| "It's just one missing entry" | One missing entry erodes trust in all docs | Fix immediately | -| "The sync score is still 90%" | 90% means 1 in 10 tools is undocumented | Target 100% | -| "CI will catch it" | Only if CI is configured and not ignored | Verify locally first | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/documentation-structure.md`: Documentation file matrix, required fields per location, cross-reference requirements - `${CLAUDE_SKILL_DIR}/references/markdown-formats.md`: Expected table/list formats for each README file, parsing rules, common formatting errors - `${CLAUDE_SKILL_DIR}/references/sync-rules.md`: Synchronization rules, severity levels, deprecation handling, namespace rules diff --git a/skills/e2e-testing/SKILL.md b/skills/e2e-testing/SKILL.md index b3de8de5..46daa45f 100644 --- a/skills/e2e-testing/SKILL.md +++ b/skills/e2e-testing/SKILL.md @@ -40,31 +40,9 @@ routing: # E2E Testing Skill (Playwright) -## Operator Context +Playwright-based E2E testing across four phases: Scaffold, Build, Run, Validate. Each phase produces a saved artifact and must pass its gate before the next phase begins. -This skill operates as an operator for Playwright E2E testing workflows. It implements the **Pipeline** architectural pattern — four phases with explicit gates, deterministic validation before LLM judgment, and a saved artifact at each phase exit. 
- -**Scope:** This skill is exclusively for Playwright-based E2E tests that exercise a running application through a real browser. Do NOT use it for unit/component tests, Go testing, or test-first development. - -### Hardcoded Behaviors (Always Apply) -- **POM Required**: Every page or feature area gets a typed Page Object class — no inline locators in spec files -- **data-testid Selectors**: All locators use `data-testid` attributes — no CSS selectors, no XPath, no text matching for interactive elements -- **No Arbitrary Waits**: `waitForTimeout` and `setTimeout` in tests are forbidden — use condition-based waiting only -- **Deterministic Before Subjective**: Run `tsc --noEmit` and check JSON existence before any LLM triage -- **Artifacts Over Memory**: Each phase produces a file; nothing lives only in context -- **Quarantine Before Delete**: Flaky tests get `test.fixme()` and a `--repeat-each=5` reproduction attempt before removal - -### Default Behaviors (ON unless disabled) -- **Multi-Browser Matrix**: Test on Chromium, Firefox, and WebKit unless project constrains otherwise -- **Screenshot on Failure**: `screenshot: 'only-on-failure'` in Playwright config -- **Trace on Retry**: `trace: 'on-first-retry'` for post-failure debugging -- **Video Retain on Failure**: `video: 'retain-on-failure'` -- **CI Retries**: `retries: process.env.CI ? 2 : 0` -- **Structured Report**: Phase 4 always produces `e2e-report.md` with pass/fail counts and artifact inventory - ---- - -## Phases +## Instructions ### PHASE 1: SCAFFOLD @@ -85,12 +63,12 @@ This skill operates as an operator for Playwright E2E testing workflows. It impl traces/ videos/ ``` -3. Write `playwright.config.ts` using the template below. -4. Confirm `playwright.config.ts` is valid TypeScript: `npx tsc --noEmit`. +3. Write `playwright.config.ts` using the template below. 
The config bakes in failure diagnostics by default: `screenshot: 'only-on-failure'`, `trace: 'on-first-retry'`, and `video: 'retain-on-failure'` so that every failure produces actionable artifacts without manual setup. CI retries (`retries: process.env.CI ? 2 : 0`) absorb transient infrastructure flakiness without masking real bugs. +4. Confirm `playwright.config.ts` is valid TypeScript: `npx tsc --noEmit`. Run this deterministic check before any subjective assessment of the config -- compiler errors are facts, opinions are not. **Artifact:** `playwright.config.ts` + `tests/e2e/` directory structure. -**Gate:** `playwright.config.ts` exists AND `tests/e2e/` directory exists. If either is missing, do not proceed to Phase 2 — diagnose and fix. +**Gate:** `playwright.config.ts` exists AND `tests/e2e/` directory exists. If either is missing, do not proceed to Phase 2 -- diagnose and fix. #### playwright.config.ts Template @@ -126,22 +104,26 @@ export default defineConfig({ }); ``` +The multi-browser matrix (Chromium, Firefox, WebKit) is the default because cross-browser bugs caught in CI are cheaper than cross-browser bugs caught in production. Remove browsers only when the project explicitly constrains the target set. + --- ### PHASE 2: BUILD **Goal:** Write POM classes for target feature areas, then write spec files that use those POMs. +Every page or feature area gets a typed Page Object class. Spec files never contain inline locators -- all selectors live in the POM. This separation means a selector change is a one-line POM edit, not a grep-and-replace across dozens of specs. + **Actions:** 1. Identify the feature areas under test (auth, checkout, dashboard, etc.). -2. For each area, create a POM class in `pages/` (see POM Pattern below). -3. Write spec files in `tests/e2e//` using the POMs — no inline locators in specs. +2. For each area, create a POM class in `pages/` (see POM Pattern below). 
All locators must use `data-testid` attributes via `page.getByTestId()`. CSS selectors (`page.locator('.btn-primary')`) break silently when styles change. XPath breaks on DOM restructuring. Text matching (`page.locator('text=Submit')`) breaks on copy changes. `data-testid` is a testing contract that survives all three. +3. Write spec files in `tests/e2e//` using the POMs. 4. Run `npx tsc --noEmit` to verify all files compile. 5. Fix any TypeScript errors before proceeding. **Artifact:** `tests/e2e/**/*.spec.ts` files + `pages/*.ts` POM classes, all compiling cleanly. -**Gate:** At least one `.spec.ts` exists under `tests/e2e/` AND `npx tsc --noEmit` exits 0. If compile fails, fix errors — do not proceed to Phase 3 with broken TypeScript. +**Gate:** At least one `.spec.ts` exists under `tests/e2e/` AND `npx tsc --noEmit` exits 0. If compile fails, fix errors -- do not proceed to Phase 3 with broken TypeScript. #### POM Pattern @@ -205,10 +187,21 @@ test.describe('Login Flow', () => { #### data-testid Convention -- **Format**: `-` — e.g., `login-email`, `checkout-submit`, `nav-profile-link` +- **Format**: `-` -- e.g., `login-email`, `checkout-submit`, `nav-profile-link` - **Scope**: Add `data-testid` to interactive elements and status regions the tests need to assert on -- **Stability**: `data-testid` attributes must not change with styling or refactoring — they are a testing contract -- **No CSS selectors**: `page.locator('.btn-primary')` is forbidden — CSS changes break tests silently +- **Stability**: `data-testid` attributes must not change with styling or refactoring -- they are a testing contract + +#### Waiting and Timing + +Never use `waitForTimeout()` or `setTimeout()` in tests. Arbitrary waits pass slowly on fast machines and fail on slow ones -- they encode a guess about timing instead of observing the actual condition. 
Use condition-based waiting instead: + +| Instead of | Use | +|-----------|-----| +| `await page.waitForTimeout(2000)` | `await expect(locator).toBeVisible()` or `await page.waitForResponse(...)` | +| `await page.waitForTimeout(0)` to "flush" | `await page.waitForLoadState('networkidle')` | +| `page.click('button')` without waiting | `locator.click()` -- Playwright auto-waits for actionability | + +Each test must own its own setup in `beforeEach`. Tests sharing state via global variables break parallel execution because Playwright runs specs concurrently by default. --- @@ -226,18 +219,39 @@ test.describe('Login Flow', () => { ```bash npx playwright test tests/e2e/auth/login.spec.ts --repeat-each=5 ``` -4. Quarantine confirmed flaky tests with `test.fixme()`: +4. Quarantine confirmed flaky tests with `test.fixme()`. Never delete a failing test -- deleted tests leave silent coverage gaps. Quarantined tests are visible debt with tracking references: ```typescript test.fixme('flaky: login redirects intermittently', async ({ page }) => { - // TODO: #123 — investigate race condition with auth cookie + // TODO: #123 -- investigate race condition with auth cookie ... }); ``` -5. Do NOT delete failing tests. Do NOT use `test.skip()` to hide broken tests — `test.skip()` is for conditional skips (e.g., environment guards), not for hiding failures. +5. Do NOT use `test.skip()` to hide broken tests. `test.skip()` is for conditional environment guards (e.g., "skip on WebKit"), not for sweeping failures under the rug. + +**Artifact:** `playwright-results.json` (presence is the gate -- pass rate is not). + +**Gate:** `playwright-results.json` exists at the project root. The file must contain valid JSON. Pass rate does not block Phase 4 -- reporting on failures is Phase 4's job. -**Artifact:** `playwright-results.json` (presence is the gate — pass rate is not). +#### Flaky Test Quarantine Protocol -**Gate:** `playwright-results.json` exists at the project root. 
The file must contain valid JSON. Pass rate does not block Phase 4 — reporting on failures is Phase 4's job. +When a test fails intermittently: + +1. **Reproduce**: `npx playwright test --repeat-each=5` -- if it fails at least once in 5 runs, it is flaky. +2. **Quarantine**: Replace `test(` with `test.fixme(` and add a comment with the symptom and a tracking reference. +3. **Do not delete**: Deleted tests leave coverage gaps. Quarantined tests are visible debt. +4. **Fix criteria**: Before removing `test.fixme`, the test must pass 10/10 with `--repeat-each=10`. + +```typescript +// Before +test('checkout completes successfully', async ({ page }) => { ... }); + +// After quarantine +test.fixme('checkout completes successfully', async ({ page }) => { + // FLAKY: intermittent race on payment confirmation response + // TODO: #456 -- investigate network timing in checkout flow + ... +}); +``` --- @@ -246,7 +260,7 @@ test.describe('Login Flow', () => { **Goal:** Deterministic checks on test output, then structured report generation. **Actions:** -1. **Deterministic checks** (run these first, before any LLM summary): +1. **Deterministic checks first** -- run these before any LLM summary because compiler output and JSON parsing are facts, not opinions: - `playwright-results.json` exists and parses as valid JSON. - Extract counts: `python3 -c "import json,sys; d=json.load(open('playwright-results.json')); print(d.get('stats', d))"` - Identify all `unexpected` (failed) and `flaky` result entries. @@ -312,50 +326,9 @@ test.describe('Login Flow', () => { --- -## Anti-Patterns - -These patterns cause flakiness or test fragility. Stop and correct immediately if encountered. 
- -| Anti-Pattern | Why It Fails | Correct Approach | -|-------------|-------------|-----------------| -| `await page.waitForTimeout(2000)` | Arbitrary time — passes slowly, fails on lag | Use `waitForResponse`, `waitForSelector`, or `expect(locator).toBeVisible()` | -| `page.locator('.submit-btn')` | CSS breaks with restyling | `page.getByTestId('form-submit')` | -| `page.locator('text=Submit')` | Breaks with copy changes | `page.getByTestId('form-submit')` | -| `page.click('button')` without waiting | Race with rendering | Auto-wait via `locator.click()` — Playwright waits for actionability | -| `await page.waitForTimeout(0)` to "flush" | Masks async ordering bugs | Use `waitForResponse` or `waitForLoadState` | -| Tests share state via global variables | Breaks parallel execution | Each test owns its own setup in `beforeEach` | -| `test.skip()` on a broken test | Silently hides failures | Fix it, or quarantine with `test.fixme()` + tracking issue | -| Locators depending on element order (`nth(0)`) | Fragile to DOM reorder | Add `data-testid` to the specific element | -| `page.fill()` without clearing first | Appends to existing value | Use `locator.clear()` then `locator.fill()`, or pass empty string first | - ---- - -## Flaky Test Quarantine Protocol - -When a test fails intermittently: - -1. **Reproduce**: `npx playwright test --repeat-each=5` — if it fails at least once in 5 runs, it is flaky. -2. **Quarantine**: Replace `test(` with `test.fixme(` and add a comment with the symptom and a tracking reference. -3. **Do not delete**: Deleted tests leave coverage gaps. Quarantined tests are visible debt. -4. **Fix criteria**: Before removing `test.fixme`, the test must pass 10/10 with `--repeat-each=10`. - -```typescript -// Before -test('checkout completes successfully', async ({ page }) => { ... 
}); - -// After quarantine -test.fixme('checkout completes successfully', async ({ page }) => { - // FLAKY: intermittent race on payment confirmation response - // TODO: #456 — investigate network timing in checkout flow - ... -}); -``` - ---- - -## CI/CD Integration +### CI/CD Integration -### GitHub Actions Workflow Template +#### GitHub Actions Workflow Template ```yaml name: E2E Tests @@ -412,27 +385,25 @@ jobs: --- -## What This Skill CAN Do -- Scaffold Playwright config and directory structure for a project -- Write typed POM classes and spec files for any feature area -- Execute tests and produce structured JSON + markdown reports -- Quarantine flaky tests with discipline (not deletion) -- Integrate E2E runs into GitHub Actions with artifact upload +## Error Handling -## What This Skill CANNOT Do -- Test units or components in isolation (use `vitest-runner`) -- Write tests before the implementation exists (use `test-driven-development`) -- Run Go tests (use `go-testing`) -- Fix application bugs discovered by E2E tests (route to the appropriate engineer agent) -- Guarantee zero flakiness — it can only minimize and manage it systematically +| Symptom | Likely Cause | Fix | +|---------|-------------|-----| +| `npx tsc --noEmit` fails after Phase 1 | Bad config template or missing types | Check `@playwright/test` is in devDependencies, verify `tsconfig.json` includes the test directory | +| Tests pass locally, fail in CI | Missing browser deps or wrong `BASE_URL` | Use `npx playwright install --with-deps` in CI; verify `BASE_URL` env var matches the running app | +| `playwright-results.json` missing after run | Reporter not configured or test runner crashed | Verify `json` reporter is in `playwright.config.ts`; check for OOM or process kill signals | +| Locator timeout on element that exists | Element present but not actionable (hidden, disabled, covered) | Use `await expect(locator).toBeVisible()` before interaction; check for overlays or modals | +| 
`page.fill()` appends instead of replacing | Input field has existing value | Use `locator.clear()` then `locator.fill()` | +| Flaky test passes 4/5 runs | Race condition, network timing, or animation interference | Quarantine with `test.fixme()`, reproduce with `--repeat-each=10`, check for missing `waitFor` conditions | +| Locators depending on `nth(0)` break randomly | DOM order is not stable | Add a `data-testid` to the specific element instead of relying on position | --- ## References -- [playwright-patterns.md](references/playwright-patterns.md) — POM examples, condition-based waiting, multi-browser config, financial skip guards -- [wallet-testing.md](references/wallet-testing.md) — Web3/MetaMask mock patterns with `addInitScript` -- [financial-flows.md](references/financial-flows.md) — Production skip guards, blockchain confirmation waits -- [flakiness-triage.md](references/flakiness-triage.md) — `--repeat-each`, `--retries`, quarantine decision tree -- [ADR-107](../../adr/ADR-107-e2e-testing.md) — Decision record for this skill -- [Playwright docs](https://playwright.dev/docs/intro) — Official API reference +- [playwright-patterns.md](references/playwright-patterns.md) -- POM examples, condition-based waiting, multi-browser config, financial skip guards +- [wallet-testing.md](references/wallet-testing.md) -- Web3/MetaMask mock patterns with `addInitScript` +- [financial-flows.md](references/financial-flows.md) -- Production skip guards, blockchain confirmation waits +- [flakiness-triage.md](references/flakiness-triage.md) -- `--repeat-each`, `--retries`, quarantine decision tree +- [ADR-107](../../adr/ADR-107-e2e-testing.md) -- Decision record for this skill +- [Playwright docs](https://playwright.dev/docs/intro) -- Official API reference diff --git a/skills/endpoint-validator/SKILL.md b/skills/endpoint-validator/SKILL.md index c6836e09..6c3ab83c 100644 --- a/skills/endpoint-validator/SKILL.md +++ b/skills/endpoint-validator/SKILL.md @@ -28,47 +28,7 @@ 
routing: # Endpoint Validator Skill -## Operator Context - -This skill operates as an operator for API endpoint validation workflows, configuring Claude's behavior for deterministic, structured health checking. It implements the **Discover, Validate, Report** pattern -- find endpoints, test each against expectations, produce machine-readable results with clear pass/fail verdicts. - -### Hardcoded Behaviors (Always Apply) -- **Read-Only by Default**: Only makes GET requests unless explicitly configured otherwise -- **Timeout Safety**: Default 5-second timeout per request prevents hanging -- **Structured Output**: Always produces machine-parseable results with exit codes -- **No Data Mutation**: Never sends POST/PUT/DELETE without explicit user configuration -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before validation - -### Default Behaviors (ON unless disabled) -- **Progress Display**: Show each endpoint result as it completes -- **Summary Statistics**: Pass/fail counts and percentages at end of run -- **Timing Information**: Response time in milliseconds for every endpoint -- **Threshold Enforcement**: Flag endpoints exceeding configured max_time -- **Sequential Execution**: Test endpoints one at a time for predictable output - -### Optional Behaviors (OFF unless enabled) -- **POST/PUT/DELETE Testing**: Requires explicit method + body in configuration -- **Authentication Headers**: Bearer tokens or basic auth passed via config -- **Response Body Validation**: Deep JSON key checking beyond top-level -- **Custom Headers**: Additional headers per endpoint (e.g., Accept, Content-Type) -- **Parallel Requests**: Test multiple endpoints concurrently - -## What This Skill CAN Do -- Validate HTTP GET endpoints for expected status codes -- Check JSON responses contain expected top-level keys -- Measure and report response times per endpoint -- Detect slow endpoints exceeding configured thresholds -- Produce CI/CD compatible exit codes (0 = all pass, 1 
= any fail) -- Read endpoint definitions from JSON config files - -## What This Skill CANNOT Do -- Perform load or stress testing (single request per endpoint only) -- Execute browser-based or JavaScript-rendered tests -- Handle OAuth flows or multi-step authentication chains -- Validate full JSON schemas (top-level key presence only) -- Test WebSocket or gRPC endpoints (HTTP only) - ---- +Deterministic HTTP endpoint validation following a **Discover, Validate, Report** pattern. Finds endpoints, tests each against expectations, and produces machine-readable results with clear pass/fail verdicts and CI-compatible exit codes. ## Instructions @@ -76,14 +36,20 @@ This skill operates as an operator for API endpoint validation workflows, config **Goal**: Locate or receive endpoint definitions before making any requests. -**Step 1: Search for endpoint configuration** +**Step 1: Read repository CLAUDE.md** + +Check for and follow any repository-level CLAUDE.md before running validation. It may contain base URL conventions, environment variable names, or endpoint paths relevant to the project. + +**Step 2: Search for endpoint configuration** Look for definitions in priority order: 1. `endpoints.json` in project root 2. `tests/endpoints.json` 3. Inline specification provided by user or calling agent -**Step 2: Parse and validate configuration** +Prefer config files checked into version control over ad-hoc endpoint lists. Manually listing endpoints every run leads to drift and missed endpoints. 
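
The discovery priority order above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function name `find_endpoint_config` and the `inline_spec` parameter are hypothetical, and only the two file paths come from the list above.

```python
import json
from pathlib import Path
from typing import Optional

# Search order taken from the priority list above.
CONFIG_CANDIDATES = ("endpoints.json", "tests/endpoints.json")


def find_endpoint_config(project_root: Path, inline_spec: Optional[dict] = None) -> Optional[dict]:
    """Return the first config found in priority order (hypothetical helper).

    Checked-in files win over an inline specification; the inline
    specification is used only when neither file exists.
    """
    for relative in CONFIG_CANDIDATES:
        candidate = project_root / relative
        if candidate.is_file():
            # Parse the first match and stop; later candidates are ignored.
            return json.loads(candidate.read_text(encoding="utf-8"))
    # Falls back to the caller-supplied spec, which may be None.
    return inline_spec
```

Returning `None` when nothing is found leaves it to the caller to report the missing configuration before any request is made.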
+ +**Step 3: Parse and validate configuration** Configuration must contain `base_url` and at least one endpoint: @@ -101,11 +67,17 @@ Configuration must contain `base_url` and at least one endpoint: Each endpoint supports these fields: - `path` (required): URL path appended to base_url - `expect_status` (default: 200): Expected HTTP status code -- `expect_key` (optional): Top-level JSON key that must exist in response -- `timeout` (default: 5): Request timeout in seconds +- `expect_key` (optional): Top-level JSON key that must exist in response. Only top-level key presence is checked -- full JSON schema validation is out of scope. +- `timeout` (default: 5): Request timeout in seconds. The 5-second default prevents hanging on unresponsive endpoints. - `max_time` (optional): Fail if response exceeds this threshold in seconds +- `method` (optional): HTTP method. Defaults to GET. POST/PUT/DELETE require explicit configuration with a request body -- never send mutating requests without the user specifying them. +- `headers` (optional): Additional headers per endpoint (e.g., Accept, Content-Type, Authorization) -**Step 3: Confirm base URL is reachable** +If `base_url` points to a production host and the config includes POST/PUT/DELETE endpoints, warn the user before proceeding. Mutating production data or triggering rate limits during a smoke test is a serious risk. Use staging environments for write operations; reserve production for GET-only health checks. + +Avoid hardcoded IP addresses in `base_url` (e.g., `http://192.168.1.42:8000`). They break on every other machine and CI environment. Use `localhost` with a configurable port or environment variables instead. + +**Step 4: Confirm base URL is reachable** Make a single request to `base_url` before running the full suite. If unreachable, report immediately rather than failing every endpoint individually. @@ -117,22 +89,25 @@ Make a single request to `base_url` before running the full suite. 
If unreachabl **Step 1: Execute requests sequentially** -For each endpoint: +Test endpoints one at a time for predictable, reproducible output. For each endpoint: 1. Construct full URL from `base_url` + `path` -2. Send GET request with configured timeout +2. Send request with configured method (GET by default) and timeout 3. Record status code, response time, and body +4. Display each result as it completes so the user sees progress + +This skill sends one request per endpoint. It is not a load tester or stress tester -- it validates contract compliance, not throughput. **Step 2: Evaluate against expectations** For each response, check in order: 1. **Status code**: Does it match `expect_status`? If not, mark FAIL. 2. **JSON key**: If `expect_key` set, parse JSON and check key exists. If missing or not valid JSON, mark FAIL. -3. **Response time**: If `max_time` set and elapsed exceeds it, mark SLOW. +3. **Response time**: If `max_time` set and elapsed exceeds it, mark SLOW. Do not ignore slow endpoints -- they indicate degradation that becomes failure under load. 4. **Security headers**: Check response headers for common security headers. Report missing headers as WARN (not FAIL): - - `Strict-Transport-Security` — HSTS enforcement (expected on HTTPS endpoints) - - `Content-Security-Policy` — XSS mitigation - - `X-Content-Type-Options` — should be `nosniff` - - `X-Frame-Options` — clickjacking prevention (or CSP `frame-ancestors`) + - `Strict-Transport-Security` -- HSTS enforcement (expected on HTTPS endpoints) + - `Content-Security-Policy` -- XSS mitigation + - `X-Content-Type-Options` -- should be `nosniff` + - `X-Frame-Options` -- clickjacking prevention (or CSP `frame-ancestors`) Skip security header checks for localhost/127.0.0.1 endpoints (development environments don't typically set these). Only check on non-localhost base URLs unless explicitly configured. @@ -186,11 +161,9 @@ SUMMARY: **Gate**: Report printed, exit code set. Validation complete. 
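
The evaluation order from the Validate phase (status code, then JSON key, then response time) might look like the sketch below. It assumes the response has already been recorded as a status code, a body string, and an elapsed time in seconds. The function name `evaluate` is hypothetical, and the security-header WARN checks are left out for brevity.

```python
import json


def evaluate(endpoint: dict, status: int, body: str, elapsed: float) -> str:
    """Return "PASS", "FAIL", or "SLOW" for one recorded response (hypothetical helper)."""
    # 1. Status code must match expect_status (default 200).
    if status != endpoint.get("expect_status", 200):
        return "FAIL"

    # 2. If expect_key is set, the body must be a JSON object with that top-level key.
    key = endpoint.get("expect_key")
    if key is not None:
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            return "FAIL"
        if not isinstance(payload, dict) or key not in payload:
            return "FAIL"

    # 3. If max_time is set and the elapsed time exceeds it, flag the endpoint as SLOW.
    max_time = endpoint.get("max_time")
    if max_time is not None and elapsed > max_time:
        return "SLOW"

    return "PASS"
```

Keeping the evaluation a pure function of the recorded response means the verdict logic can be checked without any network access.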
---- - -## Examples +### Examples -### Example 1: Pre-Deployment Health Check +#### Example 1: Pre-Deployment Health Check User says: "Validate all endpoints before we deploy" Actions: 1. Find `endpoints.json` in project root (DISCOVER) @@ -198,7 +171,7 @@ Actions: 3. Print report, exit 0 if all pass (REPORT) Result: Structured pass/fail report with CI-compatible exit code -### Example 2: Smoke Test After Migration +#### Example 2: Smoke Test After Migration User says: "Check if the API is still working after the database migration" Actions: 1. Read endpoint config, confirm base URL reachable (DISCOVER) @@ -233,35 +206,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Testing Against Production Without Safeguards -**What it looks like**: Pointing base_url at production with POST/DELETE endpoints -**Why wrong**: Can mutate production data, cause outages, or trigger rate limits -**Do instead**: Use staging environments for write operations; production only for GET health checks - -### Anti-Pattern 2: Ignoring Slow Endpoints -**What it looks like**: "All status codes are 200, ship it!" 
while ignoring 8-second response times -**Why wrong**: Slow endpoints indicate degradation that will become failures under load -**Do instead**: Set `max_time` thresholds and treat SLOW as actionable warnings - -### Anti-Pattern 3: Hardcoding Base URLs -**What it looks like**: `"base_url": "http://192.168.1.42:8000"` in checked-in config -**Why wrong**: Breaks on every other machine, CI environment, and deployment target -**Do instead**: Use environment variables or localhost with configurable port - -### Anti-Pattern 4: No Endpoint Config in Repository -**What it looks like**: Manually listing endpoints every time validation runs -**Why wrong**: Endpoints drift, new ones get missed, no single source of truth -**Do instead**: Maintain `endpoints.json` in version control alongside the API code - ---- - ## References -This skill uses these shared patterns: -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - ### CI/CD Integration ```yaml diff --git a/skills/fast/SKILL.md b/skills/fast/SKILL.md index 043d0e24..a1f26b6a 100644 --- a/skills/fast/SKILL.md +++ b/skills/fast/SKILL.md @@ -36,40 +36,7 @@ routing: # /fast - Zero-Ceremony Execution -## Operator Context - -This skill implements the Fast tier from the five-tier task hierarchy (Fast > Quick > Simple > Medium > Complex). It exists because the full ceremony of plan files, agent routing, and quality gates is wasteful for a typo fix. The process should scale down to match the task. - -### Hardcoded Behaviors (Always Apply) -- **3-Edit Scope Limit**: If the task requires more than 3 file edits, STOP and redirect to `/quick`. The work done so far is preserved — do not restart. This gate exists because uncapped "fast" tasks silently grow into untracked large changes. -- **No Plan File**: Do not create `task_plan.md`. The overhead of planning exceeds the task itself at this tier. -- **No Subagent Spawning**: Execute inline. 
Subagents add latency and context setup cost that dwarfs the actual work. -- **No Research Phase**: If the task requires reading documentation, investigating behavior, or understanding unfamiliar code, it is not a Fast task. Redirect to `/quick --research`. -- **No New Dependencies**: If the task requires adding imports from new packages, installing libraries, or modifying dependency files (go.mod, package.json, requirements.txt), redirect to `/quick`. -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution. -- **Branch Safety**: Create a feature branch if currently on main/master. Even fast tasks get proper branches. -- **Commit After Edit**: Every fast task ends with a commit. Uncommitted fast edits defeat the auditability that justifies using the system at all. - -### Default Behaviors (ON unless disabled) -- **STATE.md Logging**: Append completed task to STATE.md quick tasks table (create if absent) -- **Conventional Commits**: Use conventional commit format for the commit message -- **Edit Counting**: Track edits during execution to enforce the 3-edit scope gate - -### Optional Behaviors (OFF unless enabled) -- **No Commit Mode** (`--no-commit`): Skip the commit step (for when the user wants to batch changes) -- **Dry Run** (`--dry-run`): Show what would change without editing - -## What This Skill CAN Do -- Fix typos, rename variables, update config values, fix imports -- Make 1-3 targeted file edits and commit them -- Log the action to STATE.md for auditability - -## What This Skill CANNOT Do -- Research unfamiliar code or APIs (redirect to `/quick --research`) -- Add new dependencies (redirect to `/quick`) -- Edit more than 3 files (redirect to `/quick`) -- Run quality gates or parallel reviews (those belong to Simple+ tiers) -- Create plans or spawn subagents +The Fast tier sits at the bottom of the five-tier task hierarchy (Fast > Quick > Simple > Medium > Complex). 
It exists because the full ceremony of plan files, agent routing, and quality gates is wasteful for a typo fix. Execute inline without plans or subagents, commit the result, and log it. --- @@ -79,20 +46,29 @@ This skill implements the Fast tier from the five-tier task hierarchy (Fast > Qu **Goal**: Confirm the task is Fast-eligible and know exactly what to change. -**Step 1: Read the request** +**Step 1: Read CLAUDE.md** -Parse the user's request to identify: +Read and follow the repository CLAUDE.md before any other action, because repository-specific constraints may affect how the edit should be made. + +**Step 2: Parse the request** + +Identify from the user's request: - Which file(s) need editing - What specific change is needed - Whether this is clearly a 1-3 edit task -**Step 2: Scope check** +If `--dry-run` was passed, show what would change without editing and stop. + +**Step 3: Scope check** Ask these questions silently (do not display to user): -- Does this need research or investigation? If yes -> redirect to `/quick --research` -- Does this touch more than 3 files? If yes -> redirect to `/quick` -- Does this add new dependencies? If yes -> redirect to `/quick` -- Is the change ambiguous or underspecified? If yes -> ask user for clarification + +| Question | If Yes | +|----------|--------| +| Does this need reading docs, investigating behavior, or understanding unfamiliar code? | Redirect to `/quick --research` because investigation means uncertainty, and uncertainty means this is not a Fast task | +| Does this touch more than 3 files? | Redirect to `/quick` because uncapped "fast" tasks silently grow into untracked large changes | +| Does this add imports from new packages, install libraries, or modify dependency files (go.mod, package.json, requirements.txt)? | Redirect to `/quick` because new dependencies carry risk that needs proper tracking | +| Is the change ambiguous or underspecified? | Ask user one clarifying question. 
If still ambiguous after one round, redirect to `/quick --discuss` | If redirecting, say: ``` @@ -100,7 +76,7 @@ This task exceeds /fast scope ([reason]). Redirecting to /quick. ``` Then invoke the quick skill with the original request. -**Step 3: Locate target files** +**Step 4: Locate target files** Read the file(s) that need editing. Confirm the exact lines to change. @@ -108,32 +84,36 @@ Read the file(s) that need editing. Confirm the exact lines to change. ### Phase 2: DO -**Goal**: Make the edits. +**Goal**: Make the edits inline without spawning subagents, because subagents add latency and context setup cost that dwarfs the actual work at this tier. + +Do not create `task_plan.md`, because the overhead of planning exceeds the task itself for a 1-3 edit change. **Step 1: Execute edits** -Make the changes using Edit tool. Track the number of files edited. +Make the changes using the Edit tool. Track the number of files edited after each operation, because the 3-edit scope gate depends on an accurate count. **Step 2: Mid-execution scope check** -After each edit, check: have we hit 3 edits? If the task needs MORE edits to complete: +After each edit, check: have we hit 3 edits? If the task needs MORE edits to complete, stop immediately — do not rationalize "just one more edit" because the 3-edit gate exists specifically to prevent silent scope creep: ``` Scope exceeded during execution (3+ edits needed). Preserving work done. Redirecting remainder to /quick. ``` -Hand off to `/quick` with context about what was already done. +Hand off to `/quick` with context about what was already done. Do not start additional "while I'm here" fixes, because scope creep in fast mode produces untracked large changes with no plan or review. **GATE**: All edits complete. Edit count is 1-3. ### Phase 3: COMMIT -**Goal**: Commit the changes with a clean message. 
+**Goal**: Commit the changes with a clean message, because uncommitted fast edits are invisible to the audit trail and defeat the traceability that justifies using the system at all. Even one-line changes get commits because one-line changes cause one-line bugs that are invisible without commit history. + +If `--no-commit` was passed, skip this phase (for when the user wants to batch changes). **Step 1: Check branch** -If on main/master, create a feature branch first: +If on main/master, create a feature branch first because even fast tasks get proper branches — never commit directly to main: ```bash git checkout -b fast/ ``` @@ -154,7 +134,7 @@ Use conventional commit format. The type is usually `fix:`, `chore:`, or `refact ### Phase 4: LOG -**Goal**: Record the task for auditability. +**Goal**: Record the task for auditability, because without logging fast tasks are invisible and the system loses its audit trail. **Step 1: Append to STATE.md** @@ -200,33 +180,3 @@ Fast tasks do not get task IDs (that is a Quick-tier feature). Use `-` for the I ### Error: On Main Branch **Cause**: Currently on main/master **Solution**: Create `fast/` branch before editing. Never commit directly to main. - ---- - -## Anti-Patterns - -### Anti-Pattern 1: Using Fast for Investigation -**What it looks like**: Reading 5 files to understand a bug before fixing it -**Why wrong**: Investigation is research. Fast is for when you already know what to change. -**Do instead**: Use `/quick --research` for tasks that need understanding first. - -### Anti-Pattern 2: Skipping the Commit -**What it looks like**: Making fast edits but not committing because "it's just a small change" -**Why wrong**: Uncommitted changes are invisible to the audit trail. The whole point of /fast over raw editing is traceability. -**Do instead**: Always commit. Use `--no-commit` only when explicitly batching. 
- -### Anti-Pattern 3: Stretching Scope -**What it looks like**: "While I'm here, let me also fix this other thing" — turning 2 edits into 6 -**Why wrong**: Scope creep in fast mode produces untracked large changes with no plan or review -**Do instead**: Stop at 3 edits. Open a new `/fast` or `/quick` for additional work. - ---- - -## Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Just one more edit won't hurt" | The 3-edit gate exists to prevent silent scope creep | Redirect to /quick at edit 4 | -| "This is basically fast, just needs a little research" | Research means uncertainty; uncertainty means /quick | Redirect to /quick --research | -| "No need to commit a one-line change" | One-line changes cause one-line bugs that are invisible without commits | Commit every fast task | -| "STATE.md logging is overhead" | Without logging, fast tasks are invisible — defeating auditability | Always log to STATE.md | diff --git a/skills/feature-design/SKILL.md b/skills/feature-design/SKILL.md index 9aa2ca9c..51b39ac0 100644 --- a/skills/feature-design/SKILL.md +++ b/skills/feature-design/SKILL.md @@ -36,68 +36,38 @@ routing: # Feature Design Skill -## Purpose - -Transform a feature idea into a structured design document through collaborative dialogue. This is Phase 1 of the feature lifecycle pipeline (design → plan → implement → validate → release). - -## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before design -- **Over-Engineering Prevention**: Design only what's requested. Don't add speculative requirements. 
-- **State Management via Script**: All state operations go through `python3 ~/.claude/scripts/feature-state.py` -- **Gate Enforcement**: Check gate status before proceeding past decision points -- **Design Doc Required**: Phase CANNOT complete without a design document artifact in `.feature/state/design/` -- **ADR for Design Decisions**: Write an ADR to `adr/{feature-name}.md` documenting architectural decisions made during design exploration. Register the ADR session (`python3 ~/.claude/scripts/adr-query.py register --adr adr/{name}.md`) so sub-phase skills (feature-plan, feature-implement) receive design context via hook injection. -- **Branch Safety**: Create feature branch via worktree, never work on main - -### Default Behaviors (ON unless disabled) -- **Collaborative Dialogue**: Ask clarifying questions before committing to an approach -- **Multiple Approaches**: Generate 2-3 approaches with trade-offs before selecting -- **Context Loading**: Read L0 and L1 context at phase start - -### Optional Behaviors (OFF unless enabled) -- **Auto-approve gates**: Skip human approval for design gates (requires config) - -## What This Skill CAN Do -- Facilitate design discussion with the user -- Produce structured design documents -- Initialize feature state and worktree -- Load and update feature context hierarchy - -## What This Skill CANNOT Do -- Skip design and go straight to code -- Produce implementation plans (that's feature-plan) -- Write code (that's feature-implement) -- Modify published context directly (retro loop only) +Transform a feature idea into a structured design document through collaborative dialogue. This is Phase 1 of the feature lifecycle pipeline (design > plan > implement > validate > release). ## Instructions ### Phase 0: PRIME -**Goal**: Initialize feature state and load context. +**Goal**: Initialize feature state, load context, and prepare the workspace. + +1. 
Read and follow the repository's CLAUDE.md before any design work begins -- design decisions must align with existing project conventions. -1. Initialize feature state: +2. Create a feature branch via worktree. Never work on main -- design artifacts on main block other contributors and bypass review. ```bash python3 ~/.claude/scripts/feature-state.py init "FEATURE_NAME" ``` + All state operations throughout this skill go through `feature-state.py` -- direct file manipulation risks state corruption and breaks downstream skills (feature-plan, feature-implement) that depend on consistent state format. -2. Load L0 context: +3. Load L0 context -- skipping existing context discards previous learnings and causes redundant design work: ```bash python3 ~/.claude/scripts/feature-state.py context-read "" L0 ``` -3. Load L1 design context: +4. Load L1 design context: ```bash python3 ~/.claude/scripts/feature-state.py context-read "" L1 --phase design ``` -4. If existing L2 context is relevant, load on-demand: +5. If existing L2 context is relevant, load on-demand: ```bash python3 ~/.claude/scripts/feature-state.py context-read "" L2 --phase design ``` -5. **Surface relevant seeds** (ADR-075): Check `.seeds/index.json` for dormant seeds whose trigger conditions match the current feature. Compare the feature name and description against each seed's `trigger` field using fuzzy keyword overlap. If matches are found, present them: +6. **Surface relevant seeds** (ADR-075): Check `.seeds/index.json` for dormant seeds whose trigger conditions match the current feature. Compare the feature name and description against each seed's `trigger` field using fuzzy keyword overlap. If matches are found, present them: ``` ## Relevant Seeds (N matched) @@ -117,7 +87,7 @@ Transform a feature idea into a structured design document through collaborative If `.seeds/` does not exist or contains no dormant seeds, skip this step silently. -**Gate**: Feature state initialized. Context loaded. 
Seeds surfaced (if any). Proceed to Execute. +**Gate**: Feature state initialized. Context loaded. Seeds surfaced (if any). Check gate status before proceeding -- gates exist because downstream phases assume specific artifacts exist, so skipping them causes silent failures later. ### Phase 1: EXECUTE (Design Dialogue) @@ -128,7 +98,7 @@ Transform a feature idea into a structured design document through collaborative Check gate: `python3 ~/.claude/scripts/feature-state.py gate FEATURE design.intent-discussion` If gate mode is `human`: -- Ask clarifying questions about the feature, one at a time +- Ask clarifying questions about the feature, one at a time -- committing to an approach before understanding requirements produces designs that need rework - Prefer multiple-choice when possible - Establish success criteria - Identify constraints @@ -141,7 +111,8 @@ If gate mode is `auto`: Check gate: `python3 ~/.claude/scripts/feature-state.py gate FEATURE design.approach-selection` -Generate 2-3 approaches: +Generate 2-3 approaches with trade-offs before selecting one -- a single approach provides no basis for evaluating whether it is the right one. Design only what is requested in each approach; do not add speculative requirements or future-proofing -- unasked-for complexity increases review burden, slows implementation, and often gets removed later. + ```markdown ## Approach 1: [Name] **Pros**: [advantages] @@ -192,6 +163,8 @@ Create the design document: Check gate: `python3 ~/.claude/scripts/feature-state.py gate FEATURE design.design-approval` +This phase cannot complete without a design document artifact in `.feature/state/design/` -- feature-plan reads from this path, so a missing document causes the next skill to fail silently or plan against stale data. + Validation checklist: - [ ] Problem statement is clear - [ ] Requirements are enumerable (not vague) @@ -207,29 +180,34 @@ If gate is `auto`: verify checklist passes. 
### Phase 3: CHECKPOINT -**Goal**: Save artifacts, run retro pipeline, advance. +**Goal**: Save artifacts, record learnings, advance to next phase. 1. Save design document: ```bash echo "DESIGN_CONTENT" | python3 ~/.claude/scripts/feature-state.py checkpoint FEATURE design ``` -2. **Record learnings** — if this phase produced non-obvious insights, record them: +2. Write an ADR to `adr/{feature-name}.md` documenting the architectural decisions made during design exploration. Register the ADR so sub-phase skills (feature-plan, feature-implement) receive design context via hook injection -- without registration, downstream skills operate without awareness of the design rationale: + ```bash + python3 ~/.claude/scripts/adr-query.py register --adr adr/{name}.md + ``` + +3. **Record learnings** -- if this phase produced non-obvious insights, record them: ```bash python3 ~/.claude/scripts/learning-db.py record TOPIC KEY "VALUE" --category design ``` -3. Advance to plan phase: +4. Advance to plan phase: ```bash python3 ~/.claude/scripts/feature-state.py advance FEATURE ``` -4. Suggest next step: +5. Suggest next step: ``` Design complete. Run /feature-plan to break this into implementation tasks. ``` -**Gate**: Artifacts saved. Retro complete. Phase finished. +**Gate**: Artifacts saved. Learnings recorded. Phase finished. ## Error Handling @@ -239,18 +217,7 @@ If gate is `auto`: verify checklist passes. 
| Gate returns exit 2 | Human input required | Present decision to user, wait for response | | No design doc produced | Skipped design dialogue | Return to Phase 1, complete all steps | -## Anti-Patterns - -| Anti-Pattern | Why Wrong | Do Instead | -|--------------|-----------|------------| -| Skip design, go straight to code | Undesigned features require rework | Complete design dialogue first | -| Design everything at once | Over-engineering | Design only what's needed for this feature | -| Ignore existing context | Loses previous learnings | Load L0/L1 at prime | - ## References -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) -- [Retro Loop](../shared-patterns/retro-loop.md) - [State Conventions](../_feature-shared/state-conventions.md) -- [Plant Seed](../plant-seed/SKILL.md) — seed-based deferred work surfaced in Phase 0 (ADR-075) +- [Plant Seed](../plant-seed/SKILL.md) -- seed-based deferred work surfaced in Phase 0 (ADR-075) diff --git a/skills/feature-implement/SKILL.md b/skills/feature-implement/SKILL.md index 78315747..79fe0160 100644 --- a/skills/feature-implement/SKILL.md +++ b/skills/feature-implement/SKILL.md @@ -34,74 +34,40 @@ routing: # Feature Implement Skill -## Purpose - -Execute the implementation plan by dispatching tasks to domain agents wave by wave. Phase 3 of the feature lifecycle (design → plan → **implement** → validate → release). 
- -## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md -- **Plan Required**: CANNOT implement without a plan in `.feature/state/plan/` -- **State Management via Script**: All state operations through `python3 ~/.claude/scripts/feature-state.py` -- **Domain Agent Dispatch**: Every task dispatched to its assigned domain agent via Task tool -- **Wave Order Enforcement**: Complete all tasks in Wave N before starting Wave N+1 -- **Wave Checkpoint**: Run relevant tests after each wave completes -- **Deviation Handling**: Tier 1-2 auto-fix, Tier 3 stops for user - -### Default Behaviors (ON unless disabled) -- **Context Loading**: Read L0, L1, plan artifact, and design artifact at prime -- **Fresh Agent Per Task**: Each task gets a clean agent dispatch (no context pollution) -- **Spec Compliance Check**: After each task, verify output matches plan specification -- **Progress Reporting**: Report after each task and wave completion - -### Optional Behaviors (OFF unless enabled) -- **Parallel within wave**: Dispatch parallel-safe tasks simultaneously -- **Auto-fix Tier 2 deviations**: Handle missing dependencies automatically - -## What This Skill CAN Do -- Dispatch tasks to domain agents (golang-general-engineer, typescript-frontend-engineer, etc.) -- Execute wave-ordered plans with dependency tracking -- Handle deviations with tiered escalation -- Run wave checkpoints (tests) between waves - -## What This Skill CANNOT Do -- Implement without a plan -- Override domain agent selection from plan -- Skip wave ordering -- Handle Tier 3 (architectural) deviations without user input -- Bypass the consultation gate for Medium+ features with an existing ADR +Execute the implementation plan by dispatching tasks to domain agents wave by wave. Phase 3 of the feature lifecycle (design > plan > **implement** > validate > release). ## Instructions ### Phase 0: PRIME -1. Verify feature state: +1. 
Read and follow the repository CLAUDE.md before any implementation work begins. + +2. Verify feature state -- a plan must exist before implementation can start: ```bash python3 ~/.claude/scripts/feature-state.py status FEATURE ``` - Verify current phase is `implement` and `plan` is completed. + Verify current phase is `implement` and `plan` is completed. All state operations throughout this skill go through `feature-state.py` because it is the single source of truth for lifecycle phase tracking. -2. Load plan artifact from `.feature/state/plan/`. +3. Load plan artifact from `.feature/state/plan/`. -3. **Consultation Gate** (Medium+ complexity only): +4. **Consultation Gate** (Medium+ complexity only): - Extract the feature name and task complexity from the plan. - - If complexity is Simple or no ADR exists in `adr/` matching the feature name → skip this gate, proceed to step 4. + - If complexity is Simple or no ADR exists in `adr/` matching the feature name, skip this gate and proceed to step 5. This gate cannot be bypassed for Medium+ features that have an existing ADR -- skipping it risks implementing against a design that has unresolved architectural concerns. - If an ADR exists for this feature AND complexity is Medium or higher: 1. Check if `adr/{adr-name}/synthesis.md` exists. - 2. If `synthesis.md` does not exist → **BLOCK**: Print "Consultation required for Medium+ feature. Run /adr-consultation first." and STOP. Do not proceed to implementation. + 2. If `synthesis.md` does not exist, **BLOCK**: Print "Consultation required for Medium+ feature. Run /adr-consultation first." and STOP. 3. If `synthesis.md` exists, read it and check the verdict. - - If verdict is "PROCEED" → gate passes, continue. - - If verdict is "BLOCKED" → **BLOCK**: Print "Consultation blocked implementation. Resolve concerns in adr/{adr-name}/concerns.md before implementing." and STOP. + - If verdict is "PROCEED", gate passes, continue. 
+ - If verdict is "BLOCKED", **BLOCK**: Print "Consultation blocked implementation. Resolve concerns in adr/{adr-name}/concerns.md before implementing." and STOP. -4. Load design artifact from `.feature/state/design/` for reference. +5. Load design artifact from `.feature/state/design/` for reference. -5. Load L1 implement context: +6. Load L1 implement context (along with L0 and plan/design artifacts, this provides the full context needed for accurate implementation): ```bash python3 ~/.claude/scripts/feature-state.py context-read FEATURE L1 --phase implement ``` -6. Capture BASE_SHA: +7. Capture BASE_SHA for later diff validation: ```bash git rev-parse HEAD ``` @@ -110,14 +76,15 @@ Execute the implementation plan by dispatching tasks to domain agents wave by wa ### Phase 1: EXECUTE (Wave Dispatch) -**For each wave in the plan:** +**For each wave in the plan, in strict order** -- complete every task in Wave N before starting Wave N+1, because later waves depend on artifacts produced by earlier ones and out-of-order execution causes missing-dependency failures: **Step 1: Dispatch Tasks** For each task in the wave: -1. Check if task is parallel-safe and parallel mode is enabled -2. Dispatch to assigned domain agent via Task tool: +1. Check if task is marked parallel-safe in the plan AND parallel mode is enabled. Only dispatch parallel-safe tasks simultaneously -- dispatching all tasks in parallel causes file conflicts and data corruption when tasks touch overlapping files. + +2. Dispatch to the domain agent assigned in the plan via Task tool. Each task gets a fresh agent dispatch (no reusing agents across tasks) because shared context between tasks causes subtle pollution where fixes for one task leak assumptions into another: ``` Agent( @@ -127,40 +94,48 @@ Agent( ) ``` -3. Verify task output matches plan specification +Use the agent specified in the plan, never override it -- the plan assigns agents based on domain expertise alignment determined during planning. 
+ +3. After each task completes, verify the output matches the plan specification (expected files, function signatures, behavior). Catching spec drift per-task is far cheaper than discovering it after the entire wave. **Step 2: Handle Deviations** +Classify every deviation by tier and act accordingly. Ignoring small deviations lets them compound into architectural problems across waves: + | Tier | Examples | Action | |------|----------|--------| | Tier 1: Auto-Fix | Bug, type error, missing import | Auto-apply, record in retro | | Tier 2: Blocking | Missing dependency, config issue | Auto-fix if possible, record | | Tier 3: Architectural | Schema change, API change, scope expansion | **STOP**, present to user | -Check gate for Tier 3: `python3 ~/.claude/scripts/feature-state.py gate FEATURE implement.architectural-deviation` +Tier 3 deviations require user input because they represent scope or design changes that the plan did not authorize: +```bash +python3 ~/.claude/scripts/feature-state.py gate FEATURE implement.architectural-deviation +``` **Step 3: Wave Checkpoint** -After all tasks in a wave complete: -1. Run the project's test suite (or relevant subset) -2. If tests fail: identify which task caused the failure, route back to that agent +After all tasks in a wave complete, run the project's test suite (or relevant subset). Skipping this causes failures to compound silently across waves, making root-cause identification exponentially harder: + +1. Run tests +2. If tests fail: identify which task caused the failure, route back to that agent for a fix 3. If tests pass: proceed to next wave **Step 4: Progress Report** -After each wave: +After each wave, report status: ``` Wave N complete: X/Y tasks passed - T1: [agent] ✓ - T2: [agent] ✓ - T3: [agent] ✗ (deviation: [description]) + T1: [agent] pass + T2: [agent] pass + T3: [agent] fail (deviation: [description]) ``` **Gate**: All waves complete. All tasks verified. Proceed to Validate. 
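The wave loop described above (strict wave order, a stop on Tier 3 deviations, and a test checkpoint after each wave) can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: `TaskResult`, `run_waves`, and the `dispatch` / `run_tests` callables are hypothetical names standing in for the Task tool dispatch and the project's test suite.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TaskResult:
    """Hypothetical record of one dispatched task (not part of the skill)."""
    task_id: str
    agent: str
    passed: bool
    deviation_tier: Optional[int] = None  # None means no deviation occurred


def run_waves(
    waves: List[List[Tuple[str, str]]],
    dispatch: Callable[[str, str], TaskResult],
    run_tests: Callable[[int], bool],
) -> Tuple[str, List[Tuple[int, List[TaskResult]]]]:
    """Run waves in strict order and stop early when a gate fails.

    `waves` is a list of waves; each wave is a list of (task_id, agent) pairs.
    `dispatch` stands in for a fresh agent dispatch per task.
    `run_tests` stands in for the wave checkpoint (True when tests pass).
    """
    report = []
    for number, wave in enumerate(waves, start=1):
        # Every task in Wave N finishes before Wave N+1 is considered.
        results = [dispatch(task_id, agent) for task_id, agent in wave]
        report.append((number, results))

        # Tier 3 (architectural) deviations are never handled automatically.
        if any(r.deviation_tier == 3 for r in results):
            return "stopped: tier 3 deviation", report

        # Wave checkpoint: tests must pass before the next wave starts.
        if not run_tests(number):
            return "stopped: wave checkpoint failed", report

    return "all waves complete", report
```

Returning the partial report together with the stop reason keeps the per-wave progress output available even when execution halts early.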
### Phase 2: VALIDATE (Implementation Review) -Quick validation before formal validation phase: +Quick validation before the formal validation phase: - [ ] All planned files were created/modified - [ ] All verification commands pass - [ ] No unplanned files were modified (check `git diff --name-only BASE_SHA..HEAD`) @@ -175,7 +150,7 @@ Quick validation before formal validation phase: echo "IMPL_SUMMARY" | python3 ~/.claude/scripts/feature-state.py checkpoint FEATURE implement ``` -2. **Record learnings** — if this phase produced non-obvious insights, record them: +2. **Record learnings** -- if this phase produced non-obvious insights, record them: ```bash python3 ~/.claude/scripts/learning-db.py record TOPIC KEY "VALUE" --category design ``` @@ -201,18 +176,7 @@ Quick validation before formal validation phase: | Wave test failure | Task broke existing tests | Route back to responsible agent for fix | | Tier 3 deviation | Architectural decision needed | Stop, present options to user | -## Anti-Patterns - -| Anti-Pattern | Why Wrong | Do Instead | -|--------------|-----------|------------| -| Implement without dispatching to agents | Bypasses domain expertise | Use Task tool to dispatch | -| Skip wave checkpoints | Failures compound across waves | Test after every wave | -| Ignore deviations | Small deviations become big problems | Classify and handle per tier | -| Dispatch all tasks in parallel | File conflicts cause corruption | Respect wave ordering and parallel-safe flags | - ## References -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) -- [Retro Loop](../shared-patterns/retro-loop.md) - [State Conventions](../_feature-shared/state-conventions.md) - [Subagent-Driven Development](../subagent-driven-development/SKILL.md) diff --git a/skills/feature-plan/SKILL.md b/skills/feature-plan/SKILL.md index 42d6cee7..ad292e3c 100644 --- a/skills/feature-plan/SKILL.md +++ b/skills/feature-plan/SKILL.md @@ -33,62 +33,32 @@ routing: # Feature Plan Skill -## 
Purpose - -Transform a design document into a wave-ordered implementation plan with tasks assigned to domain agents. Phase 2 of the feature lifecycle (design → **plan** → implement → validate → release). - -## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md -- **Design Doc Required**: CANNOT plan without a design document in `.feature/state/design/` -- **State Management via Script**: All state operations through `python3 ~/.claude/scripts/feature-state.py` -- **Wave Ordering**: Tasks grouped by dependency wave; Wave N must complete before Wave N+1 -- **Domain Agent Assignment**: Every implementation task MUST specify which domain agent handles it -- **Parallel Safety Analysis**: Flag which tasks in the same wave can run in parallel - -### Default Behaviors (ON unless disabled) -- **Context Loading**: Read L0, L1, and design artifact at prime -- **Task Duration Targeting**: Each task scoped to 2-5 minutes of agent work -- **File Conflict Detection**: Detect tasks that modify the same files and sequence them - -### Optional Behaviors (OFF unless enabled) -- **Auto-approve plan**: Skip human approval gate - -## What This Skill CAN Do -- Read design documents and decompose into tasks -- Assign domain agents from our system to each task -- Detect file conflicts and resequence waves -- Produce structured plan artifacts - -## What This Skill CANNOT Do -- Create plans without a design document -- Implement code (that's feature-implement) -- Skip plan approval gate without configuration -- Override domain agent routing +Transform a design document into a wave-ordered implementation plan with tasks assigned to domain agents. Phase 2 of the feature lifecycle (design → **plan** → implement → validate → release). This skill decomposes only — it never implements code. Implementation is the responsibility of feature-implement. ## Instructions ### Phase 0: PRIME -1. Check feature state: +1. 
Read and follow the repository CLAUDE.md before any other work. + +2. Check feature state: ```bash python3 ~/.claude/scripts/feature-state.py status FEATURE ``` - Verify current phase is `plan` and `design` is in completed phases. + Verify current phase is `plan` and `design` is in completed phases. A design document must exist before planning can begin — there are no requirements to decompose without one, and skipping design guarantees a plan built on assumptions. -2. Load design artifact: +3. Load design artifact: ```bash ls .feature/state/design/*-FEATURE.md ``` Read the design document. -3. Load L1 plan context: +4. Load L0, L1, and plan-phase context: ```bash python3 ~/.claude/scripts/feature-state.py context-read FEATURE L1 --phase plan ``` -**Gate**: Design doc loaded. Feature in plan phase. Proceed. +**Gate**: Design doc loaded. Feature in plan phase. All state operations use `python3 ~/.claude/scripts/feature-state.py` — never manipulate state files directly. ### Phase 1: EXECUTE (Task Decomposition) @@ -101,7 +71,7 @@ From the design document, extract: **Step 2: Create Wave-Ordered Tasks** -Group tasks by dependency wave: +Group tasks by dependency wave — Wave N must complete before Wave N+1 begins. Scope each task to 2-5 minutes of agent work; larger tasks get split, smaller tasks get merged. Do not collapse all tasks into a single wave — that loses parallelization opportunities and masks true dependencies. ```markdown # Implementation Plan: [Feature Name] @@ -136,13 +106,16 @@ Group tasks by dependency wave: **Step 3: File Conflict Analysis** -For each wave, check if any two tasks modify the same files: +For each wave, check if any two tasks modify the same files. This analysis is not optional — skipping it causes parallel execution to corrupt files when two agents write to the same path. 
+ - If yes: mark `Parallel-safe: false` and add ordering constraint - If no: mark `Parallel-safe: true` **Step 4: Agent Routing Verification** -For each task, verify the assigned agent exists in our system: +Every implementation task must specify which domain agent handles it. The agent assignment is authoritative — do not override domain agent routing after assignment. + +For each task, verify the assigned agent exists: - Check against known agent triggers - If uncertain, default to the closest domain agent - Log routing decisions @@ -190,7 +163,7 @@ No vague verbs ("align," "ensure," "handle," "improve") without specifying what **Rule 2: Self-Contained Execution** -The executor should be able to complete the task from the action text alone, without needing to ask clarifying questions. If a task requires context not present in the task description, that context must be added. +The executor should be able to complete the task from the action text alone, without needing to ask clarifying questions. Never reference external context ("as discussed," "per the meeting," "as mentioned") — the agent has no access to that context. If a task requires context not present in the task description, that context must be added inline. Test: Can a domain agent with no prior conversation context execute this task? If not, add the missing context. @@ -212,7 +185,7 @@ Check gate: `python3 ~/.claude/scripts/feature-state.py gate FEATURE plan.plan-a **Step 1: Requirements Coverage Gate** -Before running the rest of validation, verify that every stated requirement is covered by at least one task. An uncovered requirement is a **blocker**, not a warning — partial coverage guarantees partial delivery. +Before running the rest of validation, verify that every stated requirement is covered by at least one task. An uncovered requirement is a **blocker**, not a warning — partial coverage guarantees partial delivery. 
A plan that covers 8 of 10 requirements looks "mostly done" but delivers an incomplete feature. Catching gaps here costs minutes; catching them later costs hours or days. The coverage check works as follows: @@ -235,8 +208,6 @@ The coverage check works as follows: - Explicitly document why the requirement is deferred (with a follow-up ticket reference) - Do NOT proceed to approval with uncovered requirements -**Why 100% is non-negotiable**: A plan that covers 8 of 10 requirements looks "mostly done" but delivers an incomplete feature. The missing requirements are discovered during implementation or — worse — after release. Catching gaps here costs minutes; catching them later costs hours or days. - **Step 2: Structural Validation Checklist** Validation checklist: @@ -263,7 +234,7 @@ Scan every task in the plan against the three deep work rules: If any task fails a rule, rewrite it before proceeding. Do not approve plans with vague tasks — they create execution debt that compounds across waves. -If gate is `human`: present plan to user for approval. +If gate is `human`: present plan to user for approval. Plan approval cannot be skipped unless explicitly configured with auto-approve. If gate is `auto`: verify all checklist items pass. **Gate**: Requirements 100% covered. All tasks pass deep work rules. Plan approved. Proceed to Checkpoint. @@ -298,22 +269,6 @@ If gate is `auto`: verify all checklist items pass. 
| Feature not in plan phase | Phase mismatch | Check status, advance if needed | | Agent not found | Invalid agent assignment | Check agents/INDEX.json, use closest match | -## Anti-Patterns - -| Anti-Pattern | Why Wrong | Do Instead | -|--------------|-----------|------------| -| Plan without design | No requirements to decompose | Complete /feature-design first | -| Vague task descriptions | Can't be executed by subagent | Specify exact files, operations, verification | -| All tasks in one wave | Loses parallelization opportunity | Group by actual dependencies | -| Skip file conflict analysis | Parallel execution causes corruption | Analyze every wave for conflicts | -| Uncovered requirements | Guarantees partial delivery — gaps discovered during implementation or after release | Run requirements coverage matrix; 100% coverage before approval | -| Implementation-focused success criteria | "All tasks done" can diverge from "goal achieved" — tasks might be wrong | Define success as observable behaviors, not task completion | -| Vague verbs in tasks ("align," "ensure," "handle") | Executor guesses what to do, builds wrong thing or blocks asking questions | Rewrite with concrete actions: what exactly changes, where, how to verify | -| Tasks that reference external context | "As discussed" or "per the meeting" — agent has no access to that context | Make every task self-contained with all needed context inline | - ## References -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) -- [Retro Loop](../shared-patterns/retro-loop.md) - [State Conventions](../_feature-shared/state-conventions.md) -- [Pipeline Architecture](../shared-patterns/pipeline-architecture.md) diff --git a/skills/feature-release/SKILL.md b/skills/feature-release/SKILL.md index b24c7d3e..aba12fc5 100644 --- a/skills/feature-release/SKILL.md +++ b/skills/feature-release/SKILL.md @@ -33,82 +33,59 @@ routing: # Feature Release Skill -## Purpose - Merge the validated feature to main branch via 
PR, optionally tag a release, and clean up the feature worktree. Phase 5 of the feature lifecycle (design → plan → implement → validate → **release**). -## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md -- **Validation Required**: CANNOT release without validation passing -- **State Management via Script**: All state operations through `python3 ~/.claude/scripts/feature-state.py` -- **PR-Based Merge**: Always merge via pull request, never direct push to main -- **Conventional Commits**: Use conventional commit format -- **No Attribution Lines**: No "Co-Authored-By" or "Generated with Claude Code" - -### Default Behaviors (ON unless disabled) -- **Context Loading**: Read all artifacts (design, plan, implement, validate) -- **Squash Merge**: Squash feature branch into single commit on main -- **Worktree Cleanup**: Remove worktree after successful merge -- **Feature State Archive**: Move feature state to completed - -### Optional Behaviors (OFF unless enabled) -- **Version Bump**: Bump version number based on change type -- **Changelog Update**: Add entry to CHANGELOG.md -- **Tag Release**: Create git tag after merge - -## What This Skill CAN Do -- Create pull requests from feature branches -- Generate PR descriptions from design/plan/validation artifacts -- Clean up worktrees and feature state after merge -- Archive completed feature state - -## What This Skill CANNOT Do -- Release without passing validation -- Push directly to main -- Skip PR review process -- Delete feature branch before merge confirmation - ## Instructions ### Phase 0: PRIME -1. Verify feature state is `release` and `validate` is completed. -2. Load all artifacts: design, plan, implementation summary, validation report. -3. Check gate: `python3 ~/.claude/scripts/feature-state.py gate FEATURE release.merge-strategy` +1. Read and follow the repository CLAUDE.md before any other action. 
Repository-level conventions override defaults here. + +2. Verify feature state is `release` and `validate` is completed. All state operations go through the state script -- never manipulate state files directly: + ```bash + python3 ~/.claude/scripts/feature-state.py gate FEATURE release.merge-strategy + ``` + +3. Load all artifacts: design document, plan, implementation summary, and validation report. These provide the raw material for PR content in the next phase; skipping any artifact risks an incomplete PR description. -**Gate**: All artifacts loaded. Validation passed. Proceed. +4. Confirm validation passed before proceeding. A feature with failing validation cannot be released -- there are no exceptions. If validation has not passed, stop and direct the user to run `/feature-validate` first. + +**Gate**: All artifacts loaded. Validation passed. State confirmed as `release`. Proceed. ### Phase 1: EXECUTE (Release) **Step 1: Generate PR Content** -From the accumulated artifacts, generate: +From the accumulated artifacts, generate a PR description following this structure: ```markdown ## Summary [From design document problem statement] ## Changes -[From implementation artifact — what was built] +[From implementation artifact -- what was built] ## Testing -[From validation report — what was verified] +[From validation report -- what was verified] ## Design Decision -[From design document — why this approach] +[From design document -- why this approach] ``` **Step 2: Create PR** -Use our existing pr-pipeline patterns: -1. Ensure feature branch is pushed -2. Create PR via `gh pr create` -3. Link to design decisions +Use existing pr-pipeline patterns: + +1. Ensure feature branch is pushed to remote. +2. Create PR via `gh pr create` targeting main. Always merge via pull request -- never push directly to main, even for small features. This is non-negotiable because direct pushes bypass review and CI. +3. 
Use squash merge (default) so the feature branch collapses into a single clean commit on main. +4. Use conventional commit format for the squash commit message. Do not add "Co-Authored-By" or "Generated with Claude Code" attribution lines. +5. Link to design decisions in the PR body. **Step 3: Post-Merge Cleanup** (after PR is merged) +Do not begin cleanup until the PR merge is confirmed. Deleting the branch or worktree before merge confirmation risks losing work. + 1. Clean up worktree: ```bash python3 ~/.claude/scripts/feature-state.py worktree FEATURE cleanup @@ -121,6 +98,11 @@ Use our existing pr-pipeline patterns: 3. Update L0 (remove feature from active list). +**Optional post-merge actions** (only when explicitly enabled by the user): +- **Version Bump**: Bump version number based on change type (patch/minor/major). +- **Changelog Update**: Add entry to CHANGELOG.md. +- **Tag Release**: Create git tag after merge. + **Gate**: PR created/merged. Cleanup complete. ### Phase 2: VALIDATE @@ -136,7 +118,7 @@ Use our existing pr-pipeline patterns: echo "RELEASE_NOTES" | python3 ~/.claude/scripts/feature-state.py checkpoint FEATURE release ``` -2. **Record learnings** — final pass, capture insights from the full lifecycle: +2. **Record learnings** -- final pass, capture insights from the full lifecycle: ```bash python3 ~/.claude/scripts/learning-db.py record TOPIC KEY "VALUE" --category design ``` @@ -150,9 +132,17 @@ Use our existing pr-pipeline patterns: - Retro findings: [N] recorded, [M] promoted to context ``` +## Error Handling + +| Scenario | Action | +|----------|--------| +| Validation not passed | Stop. Direct user to `/feature-validate`. Do not proceed. | +| PR creation fails | Check branch is pushed, `gh` is authenticated, and target branch exists. Retry once. | +| Worktree cleanup fails | Log the error but do not block completion. User can clean up manually. | +| State script errors | Report the exact error. 
Do not fall back to manual file manipulation. | + ## References - [PR Pipeline](../pr-pipeline/SKILL.md) - [Git Commit Flow](../git-commit-flow/SKILL.md) -- [Retro Loop](../shared-patterns/retro-loop.md) - [State Conventions](../_feature-shared/state-conventions.md) diff --git a/skills/feature-validate/SKILL.md b/skills/feature-validate/SKILL.md index 0f3fc967..298bc512 100644 --- a/skills/feature-validate/SKILL.md +++ b/skills/feature-validate/SKILL.md @@ -32,59 +32,28 @@ routing: # Feature Validate Skill -## Purpose - Run comprehensive quality gates on the implemented feature. Phase 4 of the feature lifecycle (design → plan → implement → **validate** → release). -## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md -- **Implementation Required**: CANNOT validate without implementation artifacts -- **State Management via Script**: All state operations through `python3 ~/.claude/scripts/feature-state.py` -- **Show Full Output**: NEVER summarize test results. Show actual command output. 
-- **All Gates Must Pass**: Cannot proceed to release with any gate failing -- **Existing Quality Gate Integration**: Use our existing quality gate skills (go-pr-quality-gate, python-quality-gate, universal-quality-gate) - -### Default Behaviors (ON unless disabled) -- **Auto-detect Language**: Detect project language and run appropriate quality gate -- **Context Loading**: Read L0, L1, and implementation artifact at prime -- **Regression Check**: Verify existing tests still pass - -### Optional Behaviors (OFF unless enabled) -- **Security scan**: Run security-focused review agent -- **Performance check**: Run benchmarks against baseline - -## What This Skill CAN Do -- Run language-specific quality gates (tests, lint, type checks) -- Verify all planned files were created/modified -- Check for regressions -- Produce validation report - -## What This Skill CANNOT Do -- Fix failing tests (route back to feature-implement) -- Skip validation gates -- Approve with failures - ## Instructions ### Phase 0: PRIME -1. Verify feature state is `validate` and `implement` is completed. -2. Load implementation artifact from `.feature/state/implement/`. -3. Load L1 validate context. +1. Read and follow the repository CLAUDE.md before any other action — it may override defaults for linting, test commands, or gate criteria. +2. Verify feature state is `validate` and `implement` is completed. All state operations go through `python3 ~/.claude/scripts/feature-state.py` — never modify state files directly. +3. Load implementation artifact from `.feature/state/implement/`. Validation cannot proceed without implementation artifacts; if missing, stop and report. +4. Load L0, L1, and implementation context so quality gates run against the correct scope. -**Gate**: Implementation artifact loaded. Proceed. +**Gate**: Implementation artifact loaded. Feature state confirms `implement` completed. Proceed. 
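The PRIME gate above stops validation when no implementation artifact is present. A minimal sketch of that check, assuming artifacts are plain files directly under `.feature/state/implement/` (the file layout inside that directory is an assumption, and `find_implementation_artifacts` and `prime_gate` are hypothetical helper names, not part of `feature-state.py`):

```python
from pathlib import Path
from typing import List


def find_implementation_artifacts(repo_root: str) -> List[Path]:
    """Return files under .feature/state/implement/, or an empty list."""
    artifact_dir = Path(repo_root) / ".feature" / "state" / "implement"
    if not artifact_dir.is_dir():
        return []
    # Assumption: every regular file in this directory counts as an artifact.
    return sorted(p for p in artifact_dir.iterdir() if p.is_file())


def prime_gate(repo_root: str) -> str:
    """Decide whether validation may proceed, based on artifact presence."""
    artifacts = find_implementation_artifacts(repo_root)
    if not artifacts:
        return "STOP: no implementation artifacts found"
    return f"PROCEED: {len(artifacts)} artifact(s) loaded"
```

This only reads the directory; any state change would still go through `feature-state.py`, as the skill requires.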
### Phase 1: EXECUTE (Quality Gates) **Step 1: Language Detection** -Detect project language(s) from file extensions, build files, and implementation artifact. +Auto-detect project language(s) from file extensions, build files, and the implementation artifact. This detection drives which quality gate skill runs next. **Step 2: Run Quality Gates** -For each detected language, run the appropriate quality gate: +Use the repository's existing quality gate skills — do not re-implement linting or test runners inline. Route to the appropriate skill per language: | Language | Quality Gate | Command | |----------|-------------|---------| @@ -93,21 +62,33 @@ For each detected language, run the appropriate quality gate: | TypeScript | universal-quality-gate | `npm run typecheck && npm run lint && npm test` | | Other | universal-quality-gate | Detect and run project-specific checks | +Show the full, unedited command output for every gate — never summarize or truncate test results, because summaries hide the exact failure details needed for diagnosis. + **Step 3: Regression Check** -Run full test suite and compare against pre-implementation baseline: +Run the full test suite and compare against pre-implementation baseline: - New test failures = regression - Missing tests for new code = coverage gap +Existing tests must still pass; new failures here block advancement regardless of new-feature test results. + **Step 4: Custom Gates** If the design document specified custom validation criteria, check those too. -**Gate**: All quality gates pass. Proceed. +**Step 5 (optional): Security and Performance** + +If explicitly enabled by the user: +- **Security scan**: Run security-focused review agent +- **Performance check**: Run benchmarks against baseline + +These are off by default — do not run them unless the user requests it. + +**Gate**: Every gate must pass. No gate may be skipped and no failure may be approved — a single failing gate blocks advancement to release. 
Proceed only when all results are green. ### Phase 2: VALIDATE (Report) -Produce validation report: +Produce the validation report: ```markdown # Validation Report: [Feature Name] @@ -130,14 +111,16 @@ Produce validation report: ## Verdict: PASS / NEEDS_FIXES / BLOCK ``` -If `NEEDS_FIXES`: suggest running `/feature-implement` with specific fix tasks. +The verdict must reflect actual gate results — never mark PASS if any gate failed. + +If `NEEDS_FIXES`: suggest running `/feature-implement` with specific fix tasks. This skill does not fix failing tests; it reports them and routes back. If `BLOCK`: explain blocking issues. **Gate**: Report produced. Proceed to Checkpoint. ### Phase 3: CHECKPOINT -1. Save validation artifact: +1. Save validation artifact (all state operations through the feature-state script): ```bash echo "VALIDATION_REPORT" | python3 ~/.claude/scripts/feature-state.py checkpoint FEATURE validate ``` @@ -167,6 +150,4 @@ If `BLOCK`: explain blocking issues. ## References -- [Verification Checklist](../shared-patterns/verification-checklist.md) -- [Retro Loop](../shared-patterns/retro-loop.md) - [State Conventions](../_feature-shared/state-conventions.md) diff --git a/skills/fish-shell-config/SKILL.md b/skills/fish-shell-config/SKILL.md index fe29d536..6fcafd6d 100644 --- a/skills/fish-shell-config/SKILL.md +++ b/skills/fish-shell-config/SKILL.md @@ -38,65 +38,23 @@ routing: # Fish Shell Configuration Skill -## Operator Context - -This skill operates as an operator for Fish shell configuration tasks, configuring Claude's behavior for correct Fish syntax and idioms. It implements **Domain Intelligence** — Fish-specific patterns that differ fundamentally from Bash/POSIX — ensuring generated shell code actually works in Fish. 
- -### Hardcoded Behaviors (Always Apply) -- **Fish Syntax Only**: Never emit Bash syntax (`VAR=value`, `[[ ]]`, `export`, heredocs) in Fish contexts -- **Variables Are Lists**: Treat every Fish variable as a list; never use colon-separated PATH strings -- **No Word Splitting**: `$var` and `"$var"` are identical in Fish; do not add defensive quotes for word-splitting -- **`test` Over Brackets**: Use `test` builtin, never `[[ ]]` or `[ ]` -- **`set` Over Assignment**: Variable assignment is always `set VAR value`, never `VAR=value` -- **Filename = Function Name**: Autoloaded function files must match: `functions/foo.fish` contains `function foo` - -### Default Behaviors (ON unless disabled) -- **Modular Config**: Place config in `conf.d/` files, keep `config.fish` minimal -- **`fish_add_path`**: Use for PATH manipulation instead of manual `set PATH` -- **Interactive Guards**: Wrap abbreviations and key bindings in `if status is-interactive` -- **`type -q` Checks**: Guard tool integrations with existence checks -- **Numeric Prefixes**: Use `00-`, `10-`, `20-` prefixes in `conf.d/` for ordering - -### Optional Behaviors (OFF unless enabled) -- **Universal Variables**: Use `-U` flag for cross-session persistence -- **Bash Migration**: Convert Bash scripts to Fish syntax (see `references/bash-migration.md`) -- **Completion Authoring**: Write custom Fish completions - -## What This Skill CAN Do -- Write syntactically correct Fish functions, config, and abbreviations -- Structure `~/.config/fish/` with proper modular layout -- Manage variable scoping (local, function, global, universal, export) -- Integrate tools (Starship, direnv, fzf, Homebrew, Nix) with Fish -- Migrate Bash patterns to Fish equivalents - -## What This Skill CANNOT Do -- Write POSIX-compatible scripts (Fish is not POSIX) -- Fix Bash/Zsh configurations (use appropriate shell skill) -- Manage Fish plugin frameworks (Fisher, Oh My Fish) beyond basic guidance -- Debug Fish shell internals or C/Rust source 
code - ---- +Fish is not POSIX. Every pattern here targets Fish 3.0+ (supports `$()`, `&&`, `||`). Fish 4.0 (Rust rewrite) has no syntax changes. All generated code must use Fish-native syntax exclusively — never emit Bash constructs (`VAR=value`, `[[ ]]`, `export`, heredocs) in Fish contexts. ## Instructions -### Phase 1: DETECT +### Step 1: Confirm Fish Context -**Goal**: Confirm Fish shell context before writing any shell code. +Before writing any shell code, confirm the target is Fish: -**Step 1: Check shell environment** - `$SHELL` contains `fish`, or - Target file has `.fish` extension, or - Target directory is `~/.config/fish/` -**Step 2: Identify Fish version constraints** -- All patterns target Fish 3.0+ (supports `$()`, `&&`, `||`) -- Fish 4.0 (Rust rewrite) has no syntax changes +If none of these hold, stop — this skill does not apply to Bash, Zsh, or POSIX shells. -**Gate**: Confirmed Fish context. Proceed only when gate passes. +### Step 2: Choose the Correct File Location -### Phase 2: STRUCTURE - -**Goal**: Place configuration in the correct location. +Place configuration in `conf.d/` modules with numeric prefixes for ordering — keep `config.fish` minimal. A monolithic `config.fish` with hundreds of lines is slow to load, hard to maintain, and impossible to selectively disable. **Directory layout**: ``` @@ -126,26 +84,26 @@ This skill operates as an operator for Fish shell configuration tasks, configuri | Completions | `completions/.fish` | | One-time interactive init | `config.fish` (inside `status is-interactive`) | -**Gate**: Correct file location chosen. Proceed only when gate passes. - -### Phase 3: WRITE +### Step 3: Write Variables -**Goal**: Generate syntactically correct Fish code. - -**Step 1: Variables** +Variable assignment is always `set VAR value` — never `VAR=value` (syntax error in Fish) or `export VAR=value`. 
```fish set -l VAR value # Local — current block only set -f VAR value # Function — entire function scope set -g VAR value # Global — current session -set -U VAR value # Universal — persists across sessions +set -U VAR value # Universal — persists across sessions (use sparingly) set -x VAR value # Export — visible to child processes set -gx VAR value # Global + Export (typical for env vars) set -e VAR # Erase variable set -q VAR # Test if set (silent, for conditionals) ``` -**Step 2: PATH management** +Every Fish variable is a list. Never use colon-separated strings for PATH or similar variables — `set PATH "$PATH:/new/path"` creates a single malformed element because Fish PATH is a list, not a colon-delimited string. + +### Step 4: Manage PATH + +Use `fish_add_path` for PATH manipulation — it handles deduplication and persistence automatically. Manual `set PATH` only for session-scoped overrides. ```fish # CORRECT: fish_add_path handles deduplication and persistence @@ -160,7 +118,9 @@ set -gx PATH ~/custom/bin $PATH # set PATH "$PATH:/new/path" ``` -**Step 3: Functions** +### Step 5: Write Functions + +The autoloaded function filename must match the function name exactly — `functions/foo.fish` must contain `function foo`. A mismatch causes "Unknown command" errors. ```fish # ~/.config/fish/functions/mkcd.fish @@ -189,7 +149,7 @@ function backup --description "Create timestamped backup" end ``` -**Step 4: Abbreviations vs Functions vs Aliases** +### Step 6: Choose Between Abbreviations, Functions, and Aliases | Use Case | Mechanism | Why | |----------|-----------|-----| @@ -197,7 +157,7 @@ end | Needs arguments/logic | `function` in `functions/` | Full programming, works in scripts | | Wrapping a command | `alias ll "ls -la"` | Convenience; creates function internally | -Abbreviations are **interactive-only** — they do not work in scripts. +Abbreviations are interactive-only — they do not work in scripts. 
Always wrap them in an interactive guard because they have no effect during non-interactive sourcing: ```fish # Always guard abbreviations @@ -210,7 +170,9 @@ if status is-interactive end ``` -**Step 5: Conditionals and control flow** +### Step 7: Write Conditionals and Control Flow + +Use the `test` builtin for conditionals — never `[[ ]]` (syntax error in Fish) or `[ ]` (calls external `/bin/[`, slower than the builtin). Fish has no word splitting, so `$var` and `"$var"` behave identically — quote only when you need to prevent list expansion or preserve empty strings. ```fish # Conditionals — use 'test', not [[ ]] @@ -241,9 +203,10 @@ switch $argv[1] end ``` -**Step 6: Tool integrations** +### Step 8: Integrate External Tools + +Guard every tool integration with `type -q` so the config works on machines where the tool is not installed: -Always guard with `type -q`: ```fish # ~/.config/fish/conf.d/30-tools.fish if type -q starship @@ -269,43 +232,30 @@ if test -e /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.fish end ``` -**Gate**: Code uses correct Fish syntax. No Bash-isms present. Proceed only when gate passes. - -### Phase 4: VERIFY - -**Goal**: Confirm configuration works and is correctly structured. - -**Step 1**: Syntax check — `fish -n ` (parse without executing) - -**Step 2**: For functions — verify filename matches function name +### Step 9: Verify -**Step 3**: For conf.d — verify `status is-interactive` guards on interactive-only code - -**Step 4**: Test in clean environment — `fish --no-config` then `source ` - -**Gate**: All verification steps pass. Configuration is complete. +1. **Syntax check** — run `fish -n ` (parse without executing) +2. **Function name match** — verify filename matches function name for every file in `functions/` +3. **Interactive guards** — verify `status is-interactive` guards on abbreviations and key bindings in `conf.d/` +4. 
**Clean environment test** — run `fish --no-config` then `source ` to confirm isolated correctness --- -## Examples +## Reference Material -### Example 1: Setting Up a New Fish Config +### Example: Setting Up a New Fish Config User says: "Set up my Fish shell config" -Actions: -1. Detect Fish context (DETECT) -2. Create modular structure in `~/.config/fish/` (STRUCTURE) -3. Write `conf.d/00-path.fish`, `conf.d/10-env.fish`, `conf.d/20-abbreviations.fish` (WRITE) -4. Syntax-check all files (VERIFY) -Result: Clean modular Fish configuration - -### Example 2: Migrating a Bash Alias File +1. Confirm Fish context +2. Create modular structure in `~/.config/fish/` +3. Write `conf.d/00-path.fish`, `conf.d/10-env.fish`, `conf.d/20-abbreviations.fish` +4. Syntax-check all files + +### Example: Migrating a Bash Alias File User says: "Convert my .bash_aliases to Fish" -Actions: -1. Read `.bash_aliases`, confirm Fish target (DETECT) -2. Determine which become abbreviations vs functions (STRUCTURE) -3. Write abbreviations to `conf.d/`, functions to `functions/` (WRITE) -4. Syntax-check, test in clean shell (VERIFY) -Result: Bash aliases converted to idiomatic Fish +1. Read `.bash_aliases`, confirm Fish target +2. Determine which become abbreviations vs functions +3. Write abbreviations to `conf.d/`, functions to `functions/` +4. Syntax-check, test in clean shell --- @@ -329,50 +279,7 @@ Solution: Use `set -gx VAR value` to make variable visible to subprocesses. Chec --- -## Anti-Patterns - -### Anti-Pattern 1: Bash Assignment Syntax -**What it looks like**: `VAR=value` or `export VAR=value` in a `.fish` file -**Why wrong**: Syntax error in Fish. Fish has no inline assignment. -**Do instead**: `set VAR value` or `set -gx VAR value` - -### Anti-Pattern 2: Colon-Separated PATH -**What it looks like**: `set PATH "$PATH:/new/path"` -**Why wrong**: Fish PATH is a list, not a colon-delimited string. Creates a single malformed element. 
-**Do instead**: `fish_add_path /new/path` or `set PATH $PATH /new/path` - -### Anti-Pattern 3: Monolithic config.fish -**What it looks like**: Hundreds of lines in `config.fish` — PATH, env, aliases, functions, integrations -**Why wrong**: Slow to load, hard to maintain, impossible to selectively disable. -**Do instead**: Split into `conf.d/` modules and `functions/` autoload files. - -### Anti-Pattern 4: Bracket Conditionals -**What it looks like**: `if [[ -f file ]]` or `if [ -f file ]` -**Why wrong**: `[[ ]]` is a syntax error. `[ ]` calls external `/bin/[`, slower than builtin. -**Do instead**: `if test -f file` — uses Fish's fast builtin. - -### Anti-Pattern 5: Word-Split Defensive Quoting -**What it looks like**: Always quoting `"$var"` out of Bash habit -**Why wrong**: Not harmful, but misleading. Fish never word-splits; `$var` and `"$var"` are identical. -**Do instead**: Quote only when you need to prevent list expansion or preserve empty strings. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Quotes won't hurt in Fish" | Masks misunderstanding of Fish semantics | Learn Fish variable expansion rules | -| "Just put it all in config.fish" | Monolithic config is an anti-pattern | Use conf.d/ and functions/ | -| "Bash syntax is close enough" | Fish is not POSIX; Bash-isms cause errors | Use Fish-native syntax only | -| "I'll use [ ] since it works" | Calls external binary, slower than test | Use `test` builtin always | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/bash-migration.md`: Complete Bash-to-Fish syntax translation table - 
`${CLAUDE_SKILL_DIR}/references/fish-quick-reference.md`: Variable scoping, special variables, and command cheatsheet diff --git a/skills/forensics/SKILL.md b/skills/forensics/SKILL.md index 21768315..98ea6d05 100644 --- a/skills/forensics/SKILL.md +++ b/skills/forensics/SKILL.md @@ -41,53 +41,14 @@ routing: # Forensics Skill -## Purpose - -Investigate failed or stuck workflows through post-mortem analysis of git history, plan files, and session artifacts. The forensics skill answers "what went wrong and why" -- it detects workflow-level failures that individual tool errors don't reveal. +Investigate failed or stuck workflows through post-mortem analysis of git history, plan files, and session artifacts. Forensics answers "what went wrong and why" -- it detects workflow-level failures that individual tool errors don't reveal. **Key distinction**: A tool error is "ruff found 3 lint errors." A workflow failure is "the agent entered a fix/retry loop editing the same file 5 times and never progressed." The error-learner handles tool-level errors. Forensics handles workflow-level patterns. -## Operator Context - -This skill operates as a read-only diagnostic instrument. It examines git history, plan files, and worktree state to detect anomaly patterns, then produces a structured report. It never modifies files, creates commits, or attempts repairs. - -### Hardcoded Behaviors (Always Apply) -- **READ-ONLY**: Never modify files, create commits, or attempt repairs. WHY: A diagnostic tool that modifies state destroys the evidence it needs to analyze. Forensics examines -- it does not fix. The tool restriction to Read/Grep/Glob enforces this at the platform level. -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before analysis. Repository conventions inform what "normal" looks like (e.g., expected branch patterns, required artifacts). 
-- **Credential Scrubbing**: Before including any file content or path in the report, scan for and redact patterns matching secrets (API keys, tokens, passwords, connection strings). WHY: Diagnostic reports may be shared or logged. Leaking credentials through a forensics report is worse than the original workflow failure. -- **Path Redaction**: Redact absolute home directory paths in report output. Replace paths like `/home/alice/` or `/Users/alice/` with `~/`. WHY: Reports shared across teams should not expose filesystem layout or usernames. -- **Confidence-Based Reporting**: Every anomaly includes a confidence level (High/Medium/Low) based on signal strength. WHY: False positives erode trust. A "High" confidence stuck loop (5 identical commits) is qualitatively different from a "Low" confidence one (3 commits to the same file with different messages). Consumers filter on confidence. -- **No Remediation Execution**: Recommended remediation is advisory text only. Forensics never executes fixes, even if the user asks. WHY: Remediation requires understanding intent, not just detecting anomalies. The wrong fix applied automatically can destroy work. Recommend, don't execute. - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report findings factually with evidence. Show git hashes, timestamps, and file paths rather than making assertions. No self-congratulation. -- **Full Scan**: Run all 5 anomaly detectors on every invocation. Skipping detectors creates blind spots -- a stuck loop that also drifted scope is a different situation than a stuck loop alone. -- **Severity-Ordered Output**: Report anomalies ordered by confidence (High first). The reader needs the strongest signals first. -- **Evidence Inclusion**: Include relevant git log excerpts, file snippets, and timestamps as evidence for each anomaly. Claims without evidence are not diagnostics. 
- -### Optional Behaviors (OFF unless enabled) -- **Cross-session comparison**: Compare current branch against other recent branches for pattern similarity (OFF by default -- requires broader git history scan). -- **Timeline visualization**: Produce ASCII timeline of commits with anomaly markers (OFF by default -- adds report length). - -## What This Skill CAN Do -- Detect stuck fix/retry loops through git commit pattern analysis -- Identify missing pipeline artifacts by checking expected phase outputs -- Find abandoned work through timestamp gap analysis -- Measure scope drift by comparing modified files against plan expectations -- Detect crash/interruption through uncommitted changes and orphaned worktrees -- Produce structured diagnostic reports with confidence levels -- Scrub credentials and redact paths in report output - -## What This Skill CANNOT Do -- Fix any of the issues it finds (read-only -- recommend only) -- Debug code-level bugs (use systematic-debugging) -- Review code quality (use systematic-code-review) -- Access external services or APIs (filesystem and git only) -- Analyze workflows that left no git history or plan files (no evidence = no diagnosis) -- Run shell commands (allowed-tools restricts to Read/Grep/Glob) - ## Instructions +This is a **read-only diagnostic**. The tool restriction to Read/Grep/Glob enforces this at the platform level. A diagnostic tool that modifies state destroys the evidence it needs to analyze -- forensics examines, it does not fix. Even when the user asks you to fix what you find, complete the report and recommend remediation instead. The wrong fix applied automatically can destroy work. + ### Phase 1: GATHER **Goal**: Collect the raw evidence needed for anomaly detection. Determine what branch, plan, and time range to analyze. @@ -99,6 +60,8 @@ Accept the target from one of these sources (in priority order): 2. **Current branch**: Use the current git branch if no branch specified 3. 
**Explicit plan**: User points to a specific `task_plan.md` +Before analysis, read the repository's CLAUDE.md if present. Repository conventions inform what "normal" looks like (e.g., expected branch patterns, required artifacts). + **Step 2: Locate the plan file** Search for the plan that governed the workflow: @@ -106,7 +69,7 @@ Search for the plan that governed the workflow: - Check `.feature/state/plan/` for feature plans - Check `plan/active/` for workflow-orchestrator plans -Record whether a plan exists. If no plan is found, note this -- it limits scope drift and abandoned work detection but does not block the investigation. +Record whether a plan exists. If no plan is found, note this -- it limits scope drift and abandoned work detection but does not block the investigation. Three of the five detectors (stuck loop, crash/interruption, and degraded abandoned work) still function without a plan, so never skip analysis because no plan file was found. **Step 3: Collect git history** @@ -119,6 +82,8 @@ Use Grep to search git log output for patterns. Focus on: - File change frequency across commits - Commit message patterns (similarity, repetition) +If the branch has hundreds of commits, focus on the most recent 50 and note the truncation in the final report. + **Step 4: Check working tree state** Examine the current state: @@ -132,7 +97,7 @@ Examine the current state: ### Phase 2: DETECT -**Goal**: Run all 5 anomaly detectors against the collected evidence. Each detector produces zero or more findings with confidence levels. +**Goal**: Run all 5 anomaly detectors against the collected evidence. Always run every detector -- anomalies are often correlated (a stuck loop causes missing artifacts causes abandoned work), so partial analysis misses the causal chain. Each detector produces zero or more findings, and every finding must include a confidence level (High/Medium/Low) because false positives erode trust. 
A "High" confidence stuck loop (5 identical commits) is qualitatively different from a "Low" confidence one (3 commits to the same file with different messages). #### Detector 1: Stuck Loop @@ -152,7 +117,7 @@ Analyze the git history for files that appear in consecutive commits: | Same file in 3 consecutive commits, different messages | **Low** | Could be legitimate iterative development | | Same file in 3+ commits with messages containing "fix", "retry", "attempt" | **High** | Explicit retry language strengthens the signal regardless of count | -**False positive awareness**: Legitimate multi-pass refactoring (e.g., "extract method", "add tests", "clean up") touches the same file repeatedly with genuinely different messages. Check whether the file's changes are cumulative (refactoring) or oscillating (loop). Oscillating changes -- where content reverts and re-applies -- are the strongest stuck loop signal. +**False positive awareness**: Legitimate multi-pass refactoring (e.g., "extract method", "add tests", "clean up") touches the same file repeatedly with genuinely different messages. Check whether the file's changes are cumulative (refactoring) or oscillating (loop). Oscillating changes -- where content reverts and re-applies -- are the strongest stuck loop signal. When evidence is ambiguous, report it at Low confidence rather than suppressing the finding -- let the consumer decide. #### Detector 2: Missing Artifacts @@ -258,7 +223,7 @@ Check for the combination of these indicators: ### Phase 3: REPORT -**Goal**: Compile findings into a structured diagnostic report with root cause hypothesis and remediation recommendations. +**Goal**: Compile findings into a structured diagnostic report with root cause hypothesis and remediation recommendations. Every claim in the report must trace to specific evidence -- a forensics report without evidence is an opinion piece, not a diagnostic. 
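The scrubbing rule this phase applies (credential-shaped strings become `[REDACTED]`, home directory paths become `~/`) can be sketched as below. This is an illustration under stated assumptions, not the skill's implementation: the regular expressions and the `scrub` function are hypothetical, they cover only the `sk-`, `ghp_`, and `name=value` shapes, and they do not handle bearer tokens or base64-encoded credentials.

```python
import re

# Assumed patterns, loosely based on the prefixes the skill lists; not exhaustive.
PREFIXED_SECRET = re.compile(r"\b(?:sk-|ghp_)[A-Za-z0-9_-]+")
KEY_VALUE_SECRET = re.compile(r"(?i)([\w-]*(?:token|password|secret|key))=\S+")
HOME_PATH = re.compile(r"(?:/home|/Users)/[^/\s]+/")


def scrub(text: str) -> str:
    """Redact credential-shaped strings, then shorten home directory paths."""
    # Every match is treated as a real secret; the format alone cannot prove otherwise.
    text = PREFIXED_SECRET.sub("[REDACTED]", text)
    # Keep the variable name so the reader can see what was redacted.
    text = KEY_VALUE_SECRET.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)
    return HOME_PATH.sub("~/", text)
```

Applied to every evidence string before the report is assembled, a call such as `scrub("token=abc123 in /home/alice/project/.env")` yields `token=[REDACTED] in ~/project/.env`.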
**Step 1: Scrub sensitive content** @@ -266,11 +231,11 @@ Before assembling the report, scan all evidence strings for: - API keys, tokens, passwords (patterns: `sk-`, `ghp_`, `token=`, `password=`, `secret=`, `key=`, bearer tokens, base64-encoded credentials) - Absolute home directory paths -Replace sensitive values with `[REDACTED]` and home paths with `~/`. +Replace sensitive values with `[REDACTED]` and home paths with `~/`. Treat all credential-shaped strings as real -- you cannot determine whether a credential is live from its format alone. Reports may be shared or logged, so a leaked credential in a forensics report is worse than the original workflow failure. Redact paths in every report regardless of audience; it costs nothing and prevents future exposure. **Step 2: Compile anomaly table** -Order findings by confidence (High first), then by detector number: +Order findings by confidence (High first, then by detector number) so the reader gets the strongest signals first: ``` ## Forensics Report: [branch name or session identifier] @@ -295,13 +260,13 @@ Connect the anomalies into a coherent narrative. Look for causal chains: - Missing artifacts + abandoned work = session crashed before producing outputs - Crash/interruption + stuck loop = agent exhausted retries and was terminated -The hypothesis should be specific and testable, not generic: +The hypothesis must be specific, testable, and grounded in evidence from the anomaly findings -- never speculate beyond what the data supports: - BAD: "Something went wrong during execution" - GOOD: "Agent entered a lint fix loop on server.go (4 consecutive commits with 'fix lint' messages), which consumed the session's context budget before Phase 3 VERIFY could execute, leaving test artifacts missing" **Step 4: Recommend remediation** -Provide specific, actionable recommendations. Each recommendation should reference the anomaly it addresses: +Provide specific, actionable recommendations. 
Each recommendation should reference the anomaly it addresses. Remediation is advisory text only -- never execute fixes, even if the user asks. Remediation requires understanding intent, not just detecting anomalies. | Anomaly Type | Typical Remediation | |--------------|-------------------| @@ -313,6 +278,8 @@ Provide specific, actionable recommendations. Each recommendation should referen **Step 5: Format final report** +Include relevant git log excerpts, file snippets, and timestamps as evidence for every anomaly. Show git hashes, timestamps, and file paths rather than making unsupported assertions. + ``` ================================================================ FORENSICS REPORT: [branch/session identifier] @@ -368,28 +335,6 @@ Provide specific, actionable recommendations. Each recommendation should referen | Git log too large | Long-lived branch with hundreds of commits | Focus analysis on the most recent 50 commits. Note truncation in report. | | Ambiguous branch target | User request doesn't clearly identify which branch | Ask: "Which branch should I investigate? Current branch is [X]." | -## Anti-Patterns - -| Anti-Pattern | Why Wrong | Do Instead | -|--------------|-----------|------------| -| Attempting to fix detected issues | Forensics that modifies state destroys evidence and violates read-only constraint. A fix applied without understanding intent can make things worse. | Report findings and recommend remediation. Let the user decide what to fix. | -| Reporting anomalies without evidence | "There might be a stuck loop" without commit hashes or file names is not a diagnostic. Unsubstantiated claims waste the reader's time. | Include specific git hashes, file paths, timestamps, and commit messages for every anomaly. | -| Treating all anomalies as equal severity | A High confidence stuck loop and a Low confidence scope drift require different urgency. Flat reporting obscures priority. | Order by confidence. Lead with the strongest signals. 
| -| Running only some detectors | "The user asked about stuck loops so I'll skip the other detectors." Anomalies are often correlated -- a stuck loop causes missing artifacts causes abandoned work. Partial analysis misses the causal chain. | Run all 5 detectors. The user asked about one symptom, but the diagnosis may involve several. | -| Including raw credentials in report | Diagnostic reports may be shared, logged, or pasted into issues. A credential in a forensics report is a security incident. | Scrub before reporting. Always. | -| Guessing at root cause without evidence | "The agent probably ran out of context" without timestamp or commit evidence is speculation, not diagnosis. | Every claim in the root cause hypothesis must trace to a specific anomaly finding with evidence. | - -## Anti-Rationalization - -| Rationalization Attempt | Why It's Wrong | Required Action | -|------------------------|----------------|-----------------| -| "I can see the problem, let me just fix it" | Forensics is read-only. Fixing destroys evidence. Even if the fix is obvious, the user needs the diagnostic to understand what happened. | Complete the report. Recommend the fix. Do not execute it. | -| "No plan file, so forensics can't help" | 3 of 5 detectors work without a plan (stuck loop, crash/interruption, and degraded abandoned work). Missing the plan limits analysis, it doesn't prevent it. | Run all detectors. Note limitations for plan-dependent detectors. Report what you can. | -| "This is just normal iterative development, not a loop" | That's what confidence levels are for. Report it as Low confidence if the evidence is ambiguous. Don't suppress findings based on your interpretation of intent. | Report the finding with appropriate confidence. Let the consumer decide. | -| "The report is long enough, I'll skip the evidence section" | A forensics report without evidence is an opinion piece. Evidence is what makes it a diagnostic. | Include evidence. 
Every anomaly must have supporting data in the evidence section. | -| "Credentials in this file aren't real secrets" | You cannot determine whether a credential is real from its format. Treat all credential-shaped strings as real. | Scrub all credential patterns. No exceptions. | -| "Path redaction isn't needed for internal reports" | You don't control where the report ends up. Internal today, shared tomorrow. | Redact paths in every report. It costs nothing and prevents future exposure. | - ## References - [ADR-073: Forensics Meta-Workflow Diagnostics](/adr/073-forensics-meta-workflow-diagnostics.md) @@ -397,5 +342,3 @@ Provide specific, actionable recommendations. Each recommendation should referen - [Workflow Orchestrator](/pipelines/workflow-orchestrator/SKILL.md) -- produces the plans forensics analyzes - [Plan Checker](/skills/plan-checker/SKILL.md) -- validates plans pre-execution (forensics analyzes post-execution) - [Error Learner Hook](/hooks/error-learner.py) -- handles tool-level errors (forensics handles workflow-level patterns) -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) diff --git a/skills/frontend-slides/SKILL.md b/skills/frontend-slides/SKILL.md index 084dad60..2d7fa287 100644 --- a/skills/frontend-slides/SKILL.md +++ b/skills/frontend-slides/SKILL.md @@ -49,71 +49,11 @@ routing: # Frontend Slides Skill -## Operator Context +Generate browser-based HTML presentations as a single self-contained `.html` file. Three entry paths: new build from topic/notes, PPTX-to-HTML conversion, or enhancement of an existing HTML deck. -This skill operates as an operator for browser-based presentation workflows. It configures -Claude's behavior to enforce viewport-fit as a hard constraint, apply a curated preset catalog, -and implement a full JS navigation controller — never improvising visual style or layout. 
+**Routing disambiguation**: When the user says only "slides" or "deck" without specifying format, ask exactly one question before proceeding: "Should this be an HTML file (opens in browser) or a PowerPoint file (.pptx)?" Route to `pptx-generator` for PowerPoint/Keynote/Google Slides requests. -### Hardcoded Behaviors (Always Apply) - -- **Viewport fit is non-negotiable**: Every `.slide` must have `height: 100vh; height: 100dvh; overflow: hidden`. When content overflows, the rule is split the slide, never shrink text. -- **CSS base block verbatim**: The mandatory base CSS block from `skills/frontend-slides/references/STYLE_PRESETS.md` must appear verbatim in every output file. Do not paraphrase or rewrite it. -- **No negated CSS functions**: `clamp()` cannot be negated with a unary minus. Always use `calc(-1 * clamp(...))`. See STYLE_PRESETS.md CSS Gotchas. -- **Density limits enforced**: Apply the density table from STYLE_PRESETS.md without exception. Never add a seventh bullet to a content slide. -- **Full JS controller required**: Keyboard, touch/swipe, wheel navigation, slide index indicator, and Intersection Observer for reveal animations — all required, not optional. -- **Reduced-motion required**: `@media (prefers-reduced-motion: reduce)` block must suppress all animations. -- **Format disambiguation required**: When the user says only "slides" or "deck" without specifying format, ask exactly one question before routing: "Should this be an HTML file (opens in browser) or a PowerPoint file (.pptx)?" -- **Self-contained output**: Output must be a single `.html` file with all CSS and JS inline. No external CDN dependencies. - -### Default Behaviors (ON unless disabled) - -- **Visual preview approach**: Generate 3 single-slide preview HTML files in `.design/slide-previews/` for style selection when the user has not named a preset. 
-- **Mood-to-preset mapping**: Use the mood table in STYLE_PRESETS.md to translate the user's mood description to a preset name rather than asking them to choose from a list. -- **PPTX via python-pptx**: On the conversion path, use `python-pptx` as the extractor. Preserve slide notes. Maintain asset order. -- **Cleanup after delivery**: Delete `.design/slide-previews/` at the end unless the user explicitly asks to keep them. -- **OS-appropriate opener**: Deliver the file using the correct OS opener (`open` on macOS, `xdg-open` on Linux, `start` on Windows). -- **Customization summary**: After delivery, print the 3 easiest CSS custom properties the user can change for re-theming. -- **Playwright validation**: Run `skills/frontend-slides/scripts/validate-slides.py` as the viewport-fit gate. Fall back to the manual checklist gate only if Playwright is unavailable (exit code 2). - -### Optional Behaviors (OFF unless enabled) - -- **Keep preview files**: Retain `.design/slide-previews/` if user requests them for reuse. -- **Speaker notes panel**: Add a visible notes panel toggled by `n` key if user wants speaker view. -- **Print stylesheet**: Add `@media print` CSS for PDF-via-browser export if user needs handouts. -- **Slide timer**: Add a configurable countdown timer overlay if user requests timed presentation mode. 
- ---- - -## Routing Disambiguation: frontend-slides vs pptx-generator - -| Dimension | frontend-slides | pptx-generator | -|-----------|-----------------|----------------| -| Output format | Self-contained `.html` file | `.pptx` file | -| Delivery | Opens in browser | Opens in PowerPoint / Keynote / Google Slides | -| Sharing | URL or file attachment | Email attachment, cloud upload | -| Customization | CSS + JS in-file | PowerPoint themes, slide master | -| Input path | Topic, notes, or existing PPTX → HTML | Topic or notes → PPTX | - -**Route to `frontend-slides`**: "HTML slides", "browser presentation", "web deck", "reveal-style", "viewport presentation", "convert PPTX to web/HTML", "slides for a browser/kiosk/projector with a browser", "interactive presentation with keyboard navigation". - -**Route to `pptx-generator`**: "PowerPoint", "PPTX", ".pptx", "Keynote-compatible", "Google Slides", "deck I can email", "slides for upload to Drive". - -**Ambiguous**: "make me a deck", "create slides" without format → ask one disambiguation question. - ---- - -## Reference Files - -Load these files at the phase that requires them. Do not load them all upfront. 
- -| File | Load At | Contains | -|------|---------|----------| -| `skills/frontend-slides/references/STYLE_PRESETS.md` | Phase 3 (DISCOVER STYLE) and Phase 4 (BUILD) | Mandatory CSS base block, 12 named presets, mood mapping, animation feel mapping, CSS gotchas, density limits, validation breakpoints | - ---- - -## Pipeline +## Instructions ### Phase 1: DETECT @@ -121,7 +61,7 @@ Identify which of the three paths applies: | Path | Signal | Action | |------|--------|--------| -| **New build** | User provides topic, outline, or notes — no existing file | Proceed to Phase 2 to gather content | +| **New build** | User provides topic, outline, or notes -- no existing file | Proceed to Phase 2 to gather content | | **PPTX conversion** | User provides a `.pptx` file path | Extract with `python-pptx`; collect slides, notes, and asset order; then proceed to Phase 3 | | **HTML enhancement** | User provides an existing `.html` deck | Read the file; identify what needs improving; skip to Phase 4 | @@ -131,7 +71,7 @@ Identify which of the three paths applies: ### Phase 2: DISCOVER CONTENT -Ask exactly three questions — no more: +Ask exactly three questions -- no more: 1. What is the purpose of this presentation? (e.g., pitch, tutorial, conference talk, internal review) 2. How many slides? (approximate is fine) @@ -149,11 +89,11 @@ Collect or generate the content before touching any style decisions. Two sub-paths: -**Sub-path A — User names a preset directly**: Skip previews. Confirm the preset name exists in STYLE_PRESETS.md. Proceed to Phase 4. +**Sub-path A -- User names a preset directly**: Skip previews. Confirm the preset name exists in STYLE_PRESETS.md. Proceed to Phase 4. -**Sub-path B — User does not know the preset**: Ask for mood using exactly these four options: impressed / energized / focused / inspired. Map the mood to candidate presets using the mood table in STYLE_PRESETS.md. 
Generate 3 single-slide HTML preview files in `.design/slide-previews/` — one per candidate preset — using real slide content (not placeholder lorem ipsum). Present the previews and ask the user to pick. +**Sub-path B -- User does not know the preset**: Ask for mood using exactly these four options: impressed / energized / focused / inspired. Map the mood to candidate presets using the mood table in STYLE_PRESETS.md -- translate the user's mood description to a preset name rather than asking them to choose from a list. Generate 3 single-slide HTML preview files in `.design/slide-previews/` -- one per candidate preset -- using real slide content (not placeholder lorem ipsum). Present the previews and ask the user to pick. -**GATE 3**: User has either named a preset from STYLE_PRESETS.md or selected one of the three previews. A vague direction like "make it look professional" is not sufficient — a named preset must be confirmed before Phase 4. If no selection is made, regenerate previews with different presets. +**GATE 3**: User has either named a preset from STYLE_PRESETS.md or selected one of the three previews. A vague direction like "make it look professional" is not sufficient -- a named preset must be confirmed before Phase 4. If no selection is made, regenerate previews with different presets. Never fall back to a generic purple gradient; presets exist to avoid exactly that. --- @@ -161,23 +101,37 @@ Two sub-paths: **Load `skills/frontend-slides/references/STYLE_PRESETS.md` if not already loaded.** -Build rules (all mandatory): +Build the presentation as a single `.html` file with all CSS and JS inline (no external CDN dependencies). Follow these rules: -1. **Single file**: Output is one `.html` file with all CSS and JS inline. -2. **CSS base block verbatim**: Copy the mandatory CSS base block from STYLE_PRESETS.md exactly as written. Apply the chosen preset's theme variables on top. -3. 
**Density limits**: Apply the density table from STYLE_PRESETS.md without exception. -4. **JS controller class**: Implement `SlideController` with: +1. **CSS base block verbatim**: Copy the mandatory CSS base block from STYLE_PRESETS.md exactly as written -- do not paraphrase or rewrite it. Apply the chosen preset's theme variables on top. The base block must be present character-for-character because the validation script checks for it. + +2. **Viewport fit on every slide**: Every `.slide` element must have `height: 100vh; height: 100dvh; overflow: hidden`. When content overflows, split the slide into multiple slides -- never shrink text, add scrollbars, or set `min-height` that could allow growth past 100dvh. A slide with scrollable content is a web page, not a slide. + +3. **Density limits**: Apply the density table from STYLE_PRESETS.md without exception. Maximum 6 bullets per content slide. If content needs a 7th bullet, split into two slides -- dense text is unreadable in presentation context. + +4. **Responsive sizing**: All body text must use `clamp()` for font sizing. No fixed-height content boxes (`height: 300px` on inner elements). For images or code blocks that need height constraints, use `max-height: min(Xvh, Ypx)` with `overflow: hidden`. + +5. **CSS negation rule**: Never write `-clamp(...)` -- browsers silently compute it to `0`, causing text to disappear. Always write `calc(-1 * clamp(...))` when a negative value is needed. + +6. **JS controller class**: Implement `SlideController` with all of the following (not optional): - Keyboard: `ArrowRight`/`Space` forward; `ArrowLeft`/`Backspace` backward; `Home`/`End` for first/last - Touch/swipe: `touchstart`/`touchend` with 50px threshold - - Wheel: debounced `wheel` event (150ms) + - Wheel: debounced `wheel` event (150ms) -- without debounce, wheel events cause multi-slide jumps. Add a `navigating` flag that blocks re-entry during transition.
- Slide index indicator: `currentSlide / totalSlides` visible in corner - - Intersection Observer: add `.visible` class when slide enters viewport for reveal animations -5. **Reduced-motion**: `@media (prefers-reduced-motion: reduce)` suppresses all transitions and animations. -6. **CSS negation rule**: Never write `-clamp(...)`. Write `calc(-1 * clamp(...))` instead. -7. **Font loading**: Use `@font-face` with `font-display: swap` or system font stacks. Never reference external CDN fonts without a local fallback. -8. **PPTX path**: If converting from PPTX, use `python-pptx` to extract text, notes, and asset paths. If `python-pptx` is unavailable, print a clear error and ask the user to install it (`pip install python-pptx`) or provide content manually — do not silently skip content. + - Intersection Observer: add `.visible` class when slide enters viewport for reveal animations. Use `opacity: 0` + `transform: translateY(20px)` to hide slides before reveal -- never `display: none`, which prevents Intersection Observer callbacks. + +7. **Reduced-motion**: Include a `@media (prefers-reduced-motion: reduce)` block that suppresses all transitions and animations. -**GATE 4**: The output HTML file exists on disk and contains the verbatim mandatory CSS base block from STYLE_PRESETS.md. Verify with a string search before proceeding. A file that "looks right" is not sufficient — the exact block must be present. +8. **Font loading**: Use `@font-face` with `font-display: swap` and a system font stack fallback. Never reference external CDN fonts without a local fallback. Missing `font-display: swap` causes invisible text during load (FOIT). + +9. **PPTX path**: If converting from PPTX, use `python-pptx` to extract text, notes, and asset paths. Preserve slide notes and maintain asset order. If `python-pptx` is unavailable, print a clear error and ask the user to install it (`pip install python-pptx`) or provide content manually -- do not silently skip content. 
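
Several of the build rules above (verbatim base block, no negated `clamp()`, a reduced-motion block) can be checked with plain string searches before the Playwright gate runs. The following is a minimal sketch of such a pre-check, not the skill's actual `validate-slides.py`. The function name `lint_deck` and the idea of passing the base block in as a string are assumptions made for illustration; how the block is extracted from STYLE_PRESETS.md is left to the caller.

```python
import re

# Matches a unary minus attached directly to clamp(), e.g. "-clamp(1rem, 2vw, 3rem)".
# The lookbehind skips identifiers that merely end in "-clamp".
# "calc(-1 * clamp(...))" is not matched because the minus is not adjacent to clamp.
NEGATED_CLAMP = re.compile(r"(?<![\w-])-clamp\(")


def lint_deck(html: str, base_block: str) -> list[str]:
    """Return human-readable problems found in the deck source (hypothetical helper)."""
    problems = []

    # Rule 1: the mandatory base block must appear character-for-character.
    if base_block not in html:
        problems.append("mandatory CSS base block is not present verbatim")

    # Rule 5: report every negated clamp() with its 1-based line number.
    for match in NEGATED_CLAMP.finditer(html):
        line_no = html.count("\n", 0, match.start()) + 1
        problems.append(f"negated clamp() on line {line_no}")

    # Rule 7: a plain substring check; minified CSS without the space would be missed.
    if "prefers-reduced-motion: reduce" not in html:
        problems.append("no prefers-reduced-motion block found")

    return problems
```

An empty list means only that these three textual rules hold. Viewport fit at each breakpoint still needs the rendered check in Phase 5.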
+ +**Optional features** (off by default, add only when user requests): +- Speaker notes panel toggled by `n` key +- `@media print` CSS for PDF-via-browser export +- Configurable countdown timer overlay + +**GATE 4**: The output HTML file exists on disk and contains the verbatim mandatory CSS base block from STYLE_PRESETS.md. Verify with a string search before proceeding -- a file that "looks right" is not sufficient. --- @@ -190,13 +144,13 @@ python3 skills/frontend-slides/scripts/validate-slides.py path/to/output.html ``` **Exit codes**: -- `0` — All slides pass at all 9 breakpoints. Proceed to Phase 6. -- `1` — Overflow detected. The script prints which slides overflow at which breakpoints. Fix by splitting the overflowing slides. Re-run validation. Do not proceed until exit code is 0. -- `2` — Playwright unavailable. Fall back to the manual checklist gate below. +- `0` -- All slides pass at all 9 breakpoints. Proceed to Phase 6. +- `1` -- Overflow detected. The script prints which slides overflow at which breakpoints. Fix by splitting the overflowing slides. Re-run validation. Do not proceed until exit code is 0. Content that fits at 1920x1080 but overflows at 375x667 still fails -- `clamp()` sizing solves most cases; if not, split the slide. +- `2` -- Playwright unavailable. Fall back to the manual checklist gate below. Tell the user validation is running in manual mode and is less reliable. **Manual checklist gate (fallback, only when exit code is 2)**: -For every slide, verify all of the following. If any item fails, fix it before proceeding. Do not mark this gate passed speculatively. +For every slide, verify all of the following. If any item fails, fix it before proceeding. - [ ] `height: 100vh` and `height: 100dvh` present on `.slide` - [ ] `overflow: hidden` present on `.slide` @@ -205,15 +159,13 @@ For every slide, verify all of the following. 
If any item fails, fix it before p - [ ] No `min-height` on `.slide` that could allow growth past 100dvh - [ ] No `-clamp(...)` patterns anywhere in CSS -**GATE 5**: Exit code 0 from the validation script, or — only if Playwright is unavailable (exit code 2) — explicit user confirmation that the manual checklist passed for every slide. "Looks fine on my screen" is not a gate pass. User confirmation must enumerate the slide count checked. +**GATE 5**: Exit code 0 from the validation script, or -- only if Playwright is unavailable (exit code 2) -- explicit user confirmation that the manual checklist passed for every slide. User confirmation must enumerate the slide count checked. --- ### Phase 6: DELIVER -Delivery sequence: - -1. Delete `.design/slide-previews/` (unless user asked to keep them): +1. Delete `.design/slide-previews/` unless the user explicitly asked to keep them: ```bash rm -rf .design/slide-previews/ ``` @@ -233,13 +185,11 @@ Delivery sequence: **GATE 6**: Delivery summary printed. File exists at the stated path. Previews deleted (or user confirmed to keep). Task is complete only when all three conditions are met. ---- - ## Error Handling | Error | Cause | Resolution | |-------|-------|------------| -| `-clamp(...)` in CSS | CSS negation of `clamp()` is silently ignored by browsers — it computes to `0` | Replace every instance with `calc(-1 * clamp(...))`. Run a grep search for `-clamp` before delivery. | +| `-clamp(...)` in CSS | CSS negation of `clamp()` is silently ignored by browsers -- it computes to `0` | Replace every instance with `calc(-1 * clamp(...))`. Run a grep search for `-clamp` before delivery. | | Font load failure / FOUT | External font CDN unreachable, or `@font-face` src missing `format()` hint | Use `font-display: swap` on every `@font-face`. Include a system font stack fallback. Test offline. 
| | PPTX extraction error | `python-pptx` unavailable, or PPTX uses embedded OLE objects | Print a clear error message naming the missing dependency. Ask the user to `pip install python-pptx` or provide content manually. Do not silently skip slides. | | Playwright unavailable (exit 2) | `playwright` not installed or Chromium browser not available | Fall back to the manual checklist gate in Phase 5. Explicitly tell the user validation is running in manual mode and is less reliable. | @@ -247,31 +197,8 @@ Delivery sequence: | Reveal animations not triggering | Intersection Observer threshold too high, or slides hidden with `display:none` | Use `display: flex` with `opacity: 0` + `transform` for hidden slides. Never use `display: none` on slides that need IO callbacks. | | JS controller not advancing | `wheel` event not debounced, causing multi-slide jumps | Enforce 150ms debounce on wheel. Add a `navigating` flag that blocks re-entry during transition. | ---- - -## Anti-Patterns +## References -Never do these. They are failure modes, not style choices. - -| Anti-Pattern | Why It Fails | Correct Approach | -|--------------|-------------|------------------| -| Generic purple gradient background | Signals AI-generated content, visually dated, no alignment with user's brand or message | Use a named preset from STYLE_PRESETS.md. Generic gradients are not a style — they are the absence of style. | -| Negated `clamp()`: `-clamp(2rem, 3vw, 4rem)` | Silently computes to `0` in all browsers — text disappears or collapses | Always use `calc(-1 * clamp(2rem, 3vw, 4rem))` when a negative value is needed | -| Scrollable content inside a slide | Breaks the "each slide = one viewport" contract; audiences cannot scroll during a live presentation | Split the slide into multiple slides. If content genuinely needs scrolling, that is a web page, not a slide. 
| -| Fixed-height inner boxes | `height: 300px` on an image or code block causes overflow at smaller viewports | Use `max-height: min(Xvh, Ypx)` with `overflow: hidden`. Resize the asset, not the slide. | -| Font loading failure silent fallback | Missing `font-display: swap` causes invisible text during load (FOIT) | Always declare `font-display: swap` and a system font stack fallback. Test with network throttling. | -| Bullet walls (7+ bullets per slide) | Dense text is unreadable in a presentation context; audiences read or listen, not both | Enforce the density table: max 6 bullets per content slide. Split into multiple slides if needed. | -| Missing keyboard navigation | Mouse-only or click-only decks are inaccessible and break projector workflows | Implement `SlideController` with the full keyboard + touch + wheel interface. | -| `display: none` on slides for animation | Intersection Observer never fires on `display: none` elements | Use `opacity: 0` + `transform: translateY(20px)` to hide, not `display: none`. | - ---- - -## Anti-Rationalization - -| Rationalization | Reality | Required Action | -|-----------------|---------|-----------------| -| "Slide looks fine on my screen" | Your screen is one of 9 required breakpoints | Run `validate-slides.py`. Do not ship until exit code 0. | -| "User won't resize the browser during a presentation" | Projectors, kiosks, and conference room displays vary; the skill requirement is 9 breakpoints | Validate all 9 breakpoints. The user's assumption is not a gate pass. | -| "Splitting adds too many slides" | A presentation with 18 tight slides is better than one with 12 overflowing ones | Split. Density limits exist because cognitive load is real. | -| "The CSS base block is already implied by my styles" | The base block must be present verbatim for validation and portability | Copy the block exactly from STYLE_PRESETS.md. Do not paraphrase. 
| -| "Playwright check is optional if it looks right" | Visual inspection misses mobile breakpoints and zoom interactions | Exit code 2 triggers the manual checklist. Exit code 0 is required to pass Gate 5. | +| File | Load At | Contains | +|------|---------|----------| +| `skills/frontend-slides/references/STYLE_PRESETS.md` | Phase 3 (DISCOVER STYLE) and Phase 4 (BUILD) | Mandatory CSS base block, 12 named presets, mood mapping, animation feel mapping, CSS gotchas, density limits, validation breakpoints | diff --git a/skills/full-repo-review/SKILL.md b/skills/full-repo-review/SKILL.md index 2546d073..c1de5601 100644 --- a/skills/full-repo-review/SKILL.md +++ b/skills/full-repo-review/SKILL.md @@ -39,7 +39,9 @@ repository, not just changed files. Delegates the actual review to the auto-fixes. **When to use**: Quarterly health checks, after major refactors, onboarding to -a new codebase, or any time you want a systemic view of codebase quality. +a new codebase, or any time you want a systemic view of codebase quality. This +is expensive (all files through all waves) -- use `comprehensive-review` for +PR-scoped work. **How it differs from comprehensive-review**: This skill changes the SCOPE phase to scan all source files instead of git diff, and changes the output from @@ -48,53 +50,26 @@ identical. --- -## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **Full-Scope, Not Diff-Scope**: Always review ALL source files. Never fall back to git diff. The entire point of this skill is codebase-wide coverage. -- **Report, Don't Auto-Fix**: Output is a prioritized backlog (`full-repo-review-report.md`), not auto-applied fixes. Full-repo auto-fix is impractical and risky -- the user triages and batches fixes. -- **Deterministic Pre-Check First**: Run `score-component.py` before the LLM review. Deterministic checks are cheap and catch structural issues (missing frontmatter, no error handling section) that LLM reviewers shouldn't waste tokens on. 
-- **Delegate to comprehensive-review**: This skill orchestrates scope and output. The actual 3-wave review is performed by `comprehensive-review` with `--review-only` mode. +## Instructions -### Default Behaviors (ON unless disabled) -- **Score Pre-Check**: Run `score-component.py --all-agents --all-skills` and include scores in the report -- **Severity Aggregation**: Group findings by CRITICAL/HIGH/MEDIUM/LOW -- **Systemic Pattern Detection**: Identify patterns that appear across multiple files/directories -- **Report Artifact**: Write `full-repo-review-report.md` to repo root +### Options -### Optional Behaviors (OFF unless enabled) - **--directory [dir]**: Review only a single directory (e.g., `scripts/`) instead of the full repo. Useful for splitting a large repo into manageable chunks. - **--skip-precheck**: Skip the `score-component.py` deterministic pre-check. Only use if the script is unavailable or you need faster iteration. - **--min-severity [level]**: Only include findings at or above a severity threshold (CRITICAL, HIGH, MEDIUM) in the report. Default: include all. 
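
The `--min-severity` option above is described in prose only. A minimal sketch of the filtering it implies is shown here, assuming each finding is a dictionary with a `severity` key. That shape, and the name `filter_by_severity`, are hypothetical and not taken from the skill.

```python
# Severity levels from most to least severe, as grouped in the report.
SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]


def filter_by_severity(findings, min_severity=None):
    """Keep findings at or above min_severity; None keeps everything (the default)."""
    if min_severity is None:
        return list(findings)
    cutoff = SEVERITY_ORDER.index(min_severity.upper())
    # A lower index means a more severe finding, so keep indexes up to the cutoff.
    return [f for f in findings if SEVERITY_ORDER.index(f["severity"]) <= cutoff]
```

For example, passing `"HIGH"` would keep CRITICAL and HIGH findings and drop MEDIUM and LOW ones.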
--- -## Capabilities - -### What This Skill CAN Do -- Discover all source files across scripts/, hooks/, skills/, agents/, docs/ -- Run deterministic health scoring on all agents and skills via `score-component.py` -- Invoke comprehensive-review in `--review-only` mode with the full file list -- Aggregate findings by severity into a prioritized backlog report -- Identify systemic patterns that appear across multiple files - -### What This Skill CANNOT Do -- Auto-fix findings (by design -- output is a report for human triage) -- Review non-source files (images, binaries, config files without .py/.md extension) -- Replace PR-scoped comprehensive-review (different use case, different frequency) -- Run individual review agents directly (delegates to comprehensive-review for wave orchestration) - ---- - -## Instructions - ### Phase 1: DISCOVER AND PRE-CHECK **Goal**: Identify all source files and run deterministic health checks. **Step 1: Discover source files** -Build the complete file list by scanning these directories: +Build the complete file list by scanning these directories. Always scan ALL +source files -- never fall back to git diff. The entire point of this skill is +codebase-wide coverage. If a specific `--directory` was provided, scope the +scan to that directory only. ```bash # Python scripts (exclude test files and __pycache__) @@ -115,8 +90,16 @@ find docs/ -name "*.md" 2>/dev/null Log the total file count. If zero files found, STOP and report: "No source files discovered. Verify you are in the correct repository root." +If the file count is too large for a single session, split by directory +(`scripts/`, `hooks/`, `agents/`, `skills/` separately) rather than +cherry-picking "important" files -- selective review defeats the purpose. + **Step 2: Run deterministic pre-check** +Run scoring before the LLM review. 
Deterministic checks are cheap and catch +structural issues (missing frontmatter, no error handling section) that LLM +reviewers should not waste tokens rediscovering. + ```bash python3 ~/.claude/scripts/score-component.py --all-agents --all-skills --json ``` @@ -138,12 +121,15 @@ skip the review phase. **Goal**: Run the comprehensive-review pipeline against all discovered files. +This skill orchestrates scope and output only. The actual 3-wave review is +performed by `comprehensive-review` with `--review-only` mode. + **Step 1: Invoke comprehensive-review** Invoke the `comprehensive-review` skill with these overrides: - **Scope**: Pass the full file list from Phase 1 (use `--focus [files]` mode) -- **Mode**: Use `--review-only` to skip auto-fix (this skill produces a report, not patches) -- **All waves**: Do NOT use `--skip-wave0` or `--wave1-only`. Full-repo review needs maximum coverage. +- **Mode**: Use `--review-only` to skip auto-fix. Output is a prioritized backlog for human triage, not patches -- full-repo auto-fix touches too many files at once and risks cascading breakage. +- **All waves**: Do NOT use `--skip-wave0` or `--wave1-only`. Full-repo review needs maximum coverage. Wave 0 per-package context is what makes full-repo review valuable; deterministic checks catch structure, but only the full 3-wave review catches logic and design issues. The comprehensive-review skill handles Wave 0 (per-package), Wave 1 (foundation agents), and Wave 2 (deep-dive agents) internally. @@ -223,6 +209,9 @@ Write `full-repo-review-report.md` to the repo root with this structure: - Score pre-check: {pass/warn/fail} ``` +The report is the final output. Do not auto-apply any fixes -- the user triages +findings and batches corrections into manageable PRs. + **GATE**: Report file exists at `full-repo-review-report.md` and contains at least the severity sections and deterministic scores. @@ -239,36 +228,6 @@ least the severity sections and deterministic scores. 
--- -## Anti-Patterns - -### Do NOT auto-fix findings -**Why**: Full-repo auto-fix touches too many files at once. Risk of cascading -breakage is high and review of the fixes themselves would be a massive PR. -Report findings for human triage. - -### Do NOT skip the deterministic pre-check -**Why**: score-component.py catches structural issues (missing YAML fields, -no error handling section) cheaply. Skipping it wastes LLM tokens on issues -a script can find in milliseconds. - -### Do NOT run on every PR -**Why**: This is expensive (all files through all waves). Use -comprehensive-review for PR-scoped work. This skill is for periodic health -checks. - ---- - -## Anti-Rationalization - -| Rationalization | Why Wrong | Required Action | -|-----------------|-----------|-----------------| -| "Too many files, let's just review the important ones" | Cherry-picking defeats the purpose of full-repo review | Review ALL discovered files. If it's too large, split by directory -- don't skip. | -| "The score pre-check already found the issues" | Deterministic checks catch structure, not logic | Always run the full 3-wave review after pre-check | -| "We can auto-fix the obvious ones" | This skill produces a report, not patches | Write findings to the report. User decides what to fix and when. | -| "Wave 0 is slow, let's skip it" | Wave 0 per-package context is what makes full-repo review valuable | Run all three waves. No shortcuts on coverage. 
| - ---- - ## References - [Report Template](references/report-template.md) -- Full structure for `full-repo-review-report.md` output diff --git a/skills/gemini-image-generator/SKILL.md b/skills/gemini-image-generator/SKILL.md index 48eef560..ee2ba326 100644 --- a/skills/gemini-image-generator/SKILL.md +++ b/skills/gemini-image-generator/SKILL.md @@ -35,63 +35,21 @@ routing: # Gemini Image Generator -## Operator Context - -This skill operates as an operator for CLI-based image generation, configuring Claude's behavior for deterministic Python script execution against Google Gemini APIs. It implements an **Execute-Verify** pattern — validate environment, generate image, verify output — with **Domain Intelligence** embedded in model selection and prompt engineering. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files -- **Over-Engineering Prevention**: Only generate what is directly requested -- **Exact Model Names**: Use only `gemini-2.5-flash-image` or `gemini-3-pro-image-preview` — no variations, no date suffixes -- **API Key Validation**: Always verify `GEMINI_API_KEY` exists before any generation attempt -- **Output Verification**: Confirm output file exists and is non-zero bytes after generation -- **Absolute Paths**: Always use absolute paths for output files - -### Default Behaviors (ON unless disabled) -- **Show Complete Output**: Display full script output, never summarize -- **Rate Limit Handling**: Wait between requests to avoid 429 errors -- **Retry on Failure**: Retry transient failures with exponential backoff (3 attempts) -- **Status Reporting**: Output structured status for Claude to parse - -### Optional Behaviors (OFF unless enabled) -- **Watermark Removal**: Clean watermarks from corners with `--remove-watermark` -- **Background Transparency**: Make solid backgrounds transparent with `--transparent-bg` -- **Batch Mode**: Generate multiple images from a prompt file with `--batch` - -## 
What This Skill CAN Do -- Generate images from text prompts via CLI using Gemini APIs -- Select between fast (`gemini-2.5-flash-image`) and quality (`gemini-3-pro-image-preview`) models -- Save images to specified file paths with automatic directory creation -- Remove watermarks from generated images via post-processing -- Make solid-color backgrounds transparent for game sprites and assets -- Generate multiple images in batch mode from a prompt file -- Retry on transient failures with exponential backoff - -## What This Skill CANNOT Do -- Build web applications with image generation (use `nano-banana-builder` instead) -- Use non-Gemini models (DALL-E, Midjourney, Stable Diffusion) -- Fine-tune or train models -- Generate video or audio content -- Bypass content policy restrictions -- Edit or modify existing images (generation only) - ---- +Generate images from text prompts via CLI using Google Gemini APIs. Supports model selection between fast (`gemini-2.5-flash-image`) and quality (`gemini-3-pro-image-preview`) models, batch generation, watermark removal, and background transparency. ## Instructions -### Phase 1: ENVIRONMENT - -**Goal**: Verify all prerequisites before attempting generation. +### Step 1: Validate Environment -**Step 1: Validate API key** +Verify the API key exists before any generation attempt -- a missing key produces confusing errors that waste time debugging. ```bash echo "GEMINI_API_KEY is ${GEMINI_API_KEY:+set}" ``` -Expect: `GEMINI_API_KEY is set`. If not set, instruct user to configure it. +Expect: `GEMINI_API_KEY is set`. If not set, instruct user to configure it and stop. -**Step 2: Verify dependencies** +Verify Python dependencies are available: ```bash python3 -c "from google import genai; from PIL import Image; print('OK')" @@ -102,17 +60,13 @@ If missing, install: pip install google-genai Pillow ``` -**Step 3: Determine output path** - -Use an absolute path for the output file. Verify the parent directory exists or will be created. 
- -**Gate**: API key is set, dependencies installed, output path is valid. Proceed only when gate passes. +Determine the output path. Always use absolute paths for output files -- relative paths break when scripts run in different working directories. Verify the parent directory exists or will be created. -### Phase 2: CONFIGURE +**Proceed only when**: API key is set, dependencies installed, output path is valid. -**Goal**: Select the correct model and options for the request. +### Step 2: Select Model and Compose Prompt -**Step 1: Select model** +Choose the model based on the use case: | Scenario | Model | Why | |----------|-------|-----| @@ -122,7 +76,7 @@ Use an absolute path for the output file. Verify the parent directory exists or | Text in image, typography | `gemini-3-pro-image-preview` | Better text rendering | | Product photography | `gemini-3-pro-image-preview` | Detail matters | -**CRITICAL: Use ONLY these exact model strings. Do not invent, guess, or add date suffixes.** +Use ONLY these exact model strings -- the API returns cryptic errors for anything else, and date suffixes (valid for text models) do not work for image models: | Correct (use exactly) | WRONG (never use) | |------------------------|-------------------| @@ -131,9 +85,7 @@ Use an absolute path for the output file. 
Verify the parent directory exists or | | `gemini-3-flash-image` (doesn't exist) | | | `gemini-pro-vision` (that's image input) | -**Step 2: Compose prompt** - -Follow this structure: `[Subject] [Style] [Background] [Constraints]` +Compose the prompt using this structure: `[Subject] [Style] [Background] [Constraints]` For transparent background post-processing, include: - "solid dark gray background" or "solid uniform gray background (#3a3a3a)" @@ -141,19 +93,16 @@ For transparent background post-processing, include: Always include negative constraints: "no text", "no labels", "character only" -**Step 3: Determine post-processing flags** - +Determine post-processing flags: - Need watermark removal? Add `--remove-watermark` - Need transparent background? Add `--transparent-bg` - Custom background color? Add `--bg-color "#FFFFFF" --bg-tolerance 20` -**Gate**: Model selected, prompt composed, flags determined. Proceed only when gate passes. +**Proceed only when**: Model selected, prompt composed, flags determined. -### Phase 3: GENERATE +### Step 3: Generate -**Goal**: Execute the generation script and capture output. - -**Step 1: Run generation** +Always use the provided `generate_image.py` script -- it contains retry logic, rate limiting, post-processing, model validation, and error handling that inline Python would miss. ```bash python3 $HOME/claude-code-toolkit/skills/gemini-image-generator/scripts/generate_image.py \ @@ -170,37 +119,27 @@ python3 $HOME/claude-code-toolkit/skills/gemini-image-generator/scripts/generate --model gemini-2.5-flash-image ``` -**Step 2: Read script output** - -Check for `SUCCESS` or `ERROR` in output. If rate limited (429), the script handles retry automatically. +Display the full script output -- never summarize it, since the user needs to see status, warnings, and any partial failures. -**Gate**: Script exited with code 0 and printed SUCCESS. Proceed only when gate passes. +Check for `SUCCESS` or `ERROR` in output. 
If rate limited (429), the script handles retry automatically with exponential backoff (up to 3 attempts). -### Phase 4: VERIFY +**Proceed only when**: Script exited with code 0 and printed SUCCESS. -**Goal**: Confirm the output file exists and is valid. +### Step 4: Verify Output -**Step 1: Verify file exists** +Confirm the output file exists and has non-zero size -- a zero-byte file means the write succeeded but no image data was returned: ```bash ls -la /absolute/path/to/output.png ``` -File must exist and have non-zero size. - -**Step 2: Check dimensions (optional)** +Optionally check dimensions: ```bash python3 -c "from PIL import Image; img = Image.open('/absolute/path/to/output.png'); print(f'Size: {img.size}, Mode: {img.mode}')" ``` -**Step 3: Visual inspection (MANDATORY)** - -Read the generated image file using the Read tool to visually inspect it: - -``` -Read the image at /absolute/path/to/output.png -``` +**Visual inspection is mandatory.** Read the generated image file using the Read tool to visually inspect it. A file can pass all size and dimension checks but still contain watermarks, wrong composition, excessive padding, or content that doesn't match the prompt. Check for: - Content matches the prompt intent (correct subject, layout, composition) @@ -211,7 +150,7 @@ Check for: If the image fails visual inspection, regenerate with an adjusted prompt before reporting to the user. Do not commit or deliver images without visual verification. -**Step 4: Report result** +### Step 5: Report Result Provide the user with: - Output file path @@ -220,33 +159,7 @@ Provide the user with: - Visual verification status (what you checked and confirmed) - Any post-processing applied (cropping, resizing) -**Gate**: Output file exists with non-zero size AND visual inspection passed. Generation is complete. 
- ---- - -## Script Reference - -### generate_image.py - -**Location**: `$HOME/claude-code-toolkit/skills/gemini-image-generator/scripts/generate_image.py` - -| Argument | Required | Description | -|----------|----------|-------------| -| `--prompt` | Yes* | Text prompt for image generation | -| `--output` | Yes* | Output file path (.png) | -| `--model` | No | Model name (default: gemini-3-pro-image-preview) | -| `--remove-watermark` | No | Remove watermarks from corners | -| `--transparent-bg` | No | Make background transparent | -| `--bg-color` | No | Background color hex (default: #3a3a3a) | -| `--bg-tolerance` | No | Color matching tolerance (default: 30) | -| `--batch` | No | File with prompts (one per line) | -| `--output-dir` | No | Directory for batch output | -| `--retries` | No | Max retry attempts (default: 3) | -| `--delay` | No | Delay between batch requests in seconds (default: 3) | - -*Required unless using `--batch` + `--output-dir` - -**Exit Codes**: 0 = success, 1 = missing API key, 2 = generation failed, 3 = invalid arguments +Only report what was directly requested -- do not suggest additional generations, style variations, or enhancements the user did not ask for. --- @@ -270,7 +183,7 @@ Solution: ### Error: "No image in response" Cause: API returned text-only response or generation was blocked Solution: -1. Add more detail to the prompt — vague prompts sometimes fail +1. Add more detail to the prompt -- vague prompts sometimes fail 2. Try a different model 3. Check that the prompt does not violate content policy 4. Verify the script sets `response_modalities=["IMAGE", "TEXT"]` @@ -284,42 +197,29 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Inventing Model Names -**What it looks like**: `model="gemini-2.5-flash-image-preview-12-25"` or `model="gemini-3-flash-image"` -**Why wrong**: These models do not exist. Date suffixes are for text models only. The API returns cryptic errors. 
-**Do instead**: Use exactly `gemini-2.5-flash-image` or `gemini-3-pro-image-preview`. No variations. - -### Anti-Pattern 2: Skipping Environment Validation -**What it looks like**: Running `generate_image.py` without checking API key or dependencies first -**Why wrong**: Produces confusing error messages. Wastes time debugging environment issues as generation bugs. -**Do instead**: Complete Phase 1 (ENVIRONMENT) before any generation attempt. Always. - -### Anti-Pattern 3: Generating Without Visual Verification -**What it looks like**: Running the script, checking file size, and committing the image without reading it to visually inspect -**Why wrong**: The file may exist with correct dimensions but contain watermarks, wrong composition, excessive padding, or content that doesn't match the prompt. A 952KB PNG with a cat watermark and wrong aspect ratio passed file-exists checks but looked bad in the README. -**Do instead**: Complete Phase 4 (VERIFY) including Step 3 (visual inspection). Read the image file with the Read tool. Check composition, content, and artifacts before delivering or committing. - -### Anti-Pattern 4: Writing Custom Generation Code Instead of Using the Script -**What it looks like**: Writing inline Python to call the Gemini API directly instead of using `generate_image.py` -**Why wrong**: Misses retry logic, rate limiting, post-processing, model validation, and error handling already built into the script. -**Do instead**: Always use the provided `generate_image.py` script. It handles all edge cases. +## References -### Anti-Pattern 5: Storing Base64 in Memory Instead of Saving to File -**What it looks like**: Keeping image data in a variable without writing to disk -**Why wrong**: Data is lost on exit, cannot be used by other tools, wastes memory for large images. -**Do instead**: Save to file immediately. The script does this automatically. 
+### Script Reference: generate_image.py ---- +**Location**: `$HOME/claude-code-toolkit/skills/gemini-image-generator/scripts/generate_image.py` -## References +| Argument | Required | Description | +|----------|----------|-------------| +| `--prompt` | Yes* | Text prompt for image generation | +| `--output` | Yes* | Output file path (.png) | +| `--model` | No | Model name (default: gemini-3-pro-image-preview) | +| `--remove-watermark` | No | Remove watermarks from corners | +| `--transparent-bg` | No | Make background transparent | +| `--bg-color` | No | Background color hex (default: #3a3a3a) | +| `--bg-tolerance` | No | Color matching tolerance (default: 30) | +| `--batch` | No | File with prompts (one per line) | +| `--output-dir` | No | Directory for batch output | +| `--retries` | No | Max retry attempts (default: 3) | +| `--delay` | No | Delay between batch requests in seconds (default: 3) | -This skill uses these shared patterns: -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks +*Required unless using `--batch` + `--output-dir` -### Reference Files -- `${CLAUDE_SKILL_DIR}/references/prompts.md`: Categorized example prompts by use case (game art, characters, product photography, pixel art, icons) +**Exit Codes**: 0 = success, 1 = missing API key, 2 = generation failed, 3 = invalid arguments ### Prompt Engineering Quick Reference @@ -334,11 +234,5 @@ This skill uses these shared patterns: **Negative constraints**: Always include "no text", "no labels", "no watermarks", "character only" -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I know the right model name" | Model names are exact strings, not patterns | Check the two valid names | -| "Output file was probably created" | Probably is not verified | Run `ls -la` on the output path | -| "API key is probably set" | Silent failures waste debugging time 
| Check explicitly in Phase 1 | -| "Custom code is faster than the script" | Script has retry, rate limiting, validation | Use `generate_image.py` | +### Reference Files +- `${CLAUDE_SKILL_DIR}/references/prompts.md`: Categorized example prompts by use case (game art, characters, product photography, pixel art, icons) diff --git a/skills/generate-claudemd/SKILL.md b/skills/generate-claudemd/SKILL.md index 04227451..9422ffd5 100644 --- a/skills/generate-claudemd/SKILL.md +++ b/skills/generate-claudemd/SKILL.md @@ -34,48 +34,20 @@ routing: # Generate CLAUDE.md Skill -## Operator Context - -This skill operates as an operator for CLAUDE.md generation, configuring Claude's behavior for systematic repository analysis and project-specific documentation creation. It implements a **4-phase pipeline** -- SCAN repo facts, DETECT domain enrichment, GENERATE from template, VALIDATE output -- producing a CLAUDE.md that makes new Claude sessions immediately productive. - -### Hardcoded Behaviors (Always Apply) -- **Analyze Before Writing**: Every section MUST be derived from actual repo analysis (reading files, parsing configs, checking paths). Why: guessed content wastes the context window and teaches Claude wrong patterns for the project. Never fill a section from assumptions. -- **Template Conformance**: Output follows `${CLAUDE_SKILL_DIR}/references/CLAUDEMD_TEMPLATE.md` structure. Why: consistent structure means Claude sessions can parse CLAUDE.md predictably across projects. -- **Path Verification**: Every path mentioned in the output MUST exist. Every command MUST be runnable. Why: a CLAUDE.md with broken paths is worse than no CLAUDE.md -- it teaches Claude to trust wrong information. -- **No Generic Advice**: Phrases like "use meaningful variable names", "write clean code", or "follow best practices" are banned. Why: generic advice wastes tokens and provides zero signal. Only project-specific facts belong in CLAUDE.md. 
-- **No Overwrite Without Confirmation**: If CLAUDE.md already exists, write to `CLAUDE.md.generated` and show a diff. Why: overwriting an existing, potentially hand-tuned CLAUDE.md destroys work. - -### Default Behaviors (ON unless disabled) -- **Domain Enrichment**: Auto-detect repo domain in Phase 2 and load domain-specific patterns (sapcc Go conventions, OpenStack patterns, etc.) -- **Makefile Analysis**: Parse Makefile targets for build commands rather than guessing. Why: the Makefile IS the source of truth for build commands in most repos. -- **License Header Detection**: Note SPDX conventions if present in source files - -### Why No `context: fork` -This skill requires interactive user gates (confirmation when CLAUDE.md already exists, review of generated output). Running in a forked context would bypass these safety checks. - -### Optional Behaviors (OFF unless enabled) -- **Subdirectory CLAUDE.md**: Generate per-package CLAUDE.md files for monorepos (on explicit request) -- **Minimal Mode**: Only 3 sections -- Overview, Commands, Architecture (on explicit request: "minimal claude.md") - -## What This Skill CAN Do -- Analyze a repository and produce a complete, project-specific CLAUDE.md -- Detect language, framework, build system, and test infrastructure from config files -- Load domain-specific enrichment (sapcc Go conventions, OpenStack patterns, etc.) 
-- Verify every path and command in the generated output -- Safely handle repos that already have a CLAUDE.md - -## What This Skill CANNOT Do -- Improve an existing CLAUDE.md (use `claude-md-improver` for that) -- Generate documentation for code it cannot read (private dependencies, encrypted configs) -- Understand runtime behavior (it reads static files, does not execute the project) -- Replace domain expertise (enrichment patterns are templates, not deep knowledge) +Produce a project-specific CLAUDE.md through a 4-phase pipeline: SCAN repo facts, DETECT domain enrichment, GENERATE from template, VALIDATE output. The goal is a CLAUDE.md that makes new Claude sessions immediately productive by documenting only verified, project-specific facts. ---- +This skill generates new CLAUDE.md files. It cannot improve an existing one (use `claude-md-improver` for that), cannot document private dependencies or encrypted configs it cannot read, cannot infer runtime behavior from static files, and cannot replace deep domain expertise -- enrichment patterns are templates, not knowledge. + +This skill does not use `context: fork` because it requires interactive user gates (confirmation when CLAUDE.md already exists, review of generated output), which a forked context would bypass. ## Instructions Execute all phases sequentially. Verify each gate before advancing. Load the template from `${CLAUDE_SKILL_DIR}/references/CLAUDEMD_TEMPLATE.md` before Phase 3. +On explicit user request, two optional modes are available: +- **Subdirectory CLAUDE.md**: Generate per-package CLAUDE.md files for monorepos. +- **Minimal Mode** ("minimal claude.md"): Only 3 sections -- Overview, Commands, Architecture. + ### Phase 1: SCAN **Goal**: Gather facts about the repository -- language, build system, directory structure, test patterns, config approach. @@ -86,13 +58,9 @@ Execute all phases sequentially. Verify each gate before advancing. 
Load the tem ls -la CLAUDE.md .claude/CLAUDE.md 2>/dev/null ``` -If a CLAUDE.md already exists: -- Inform the user: "CLAUDE.md already exists. Output will be written to CLAUDE.md.generated so you can compare." -- Set output path to `CLAUDE.md.generated` -- Continue with all phases (the generated file is still useful for comparison) +If a CLAUDE.md already exists, write output to `CLAUDE.md.generated` and show a diff, because overwriting a hand-tuned CLAUDE.md destroys work. Inform the user: "CLAUDE.md already exists. Output will be written to CLAUDE.md.generated so you can compare." Continue with all phases -- the generated file is still useful for comparison. -If no CLAUDE.md exists: -- Set output path to `CLAUDE.md` +If no CLAUDE.md exists, set output path to `CLAUDE.md`. **Step 2: Detect language and framework** @@ -108,7 +76,7 @@ Check root directory for language indicators: | `Gemfile` | Ruby | | `mix.exs` | Elixir | -Read the detected config file to extract: project name, dependencies, language version. +Read the detected config file to extract: project name, dependencies, language version. Do not assume standard language patterns apply to this project -- read actual source files before writing any section, because conventions vary even within the same language ecosystem. For Go projects, also check: ```bash @@ -124,7 +92,7 @@ cat package.json | head -30 **Step 3: Parse build system** -Check for build tools and extract commands: +Parse the Makefile (or equivalent) for actual build targets rather than guessing commands, because the Makefile IS the source of truth for build commands in most repos and may wrap tools with flags, coverage, or race detection that raw invocations would miss. ```bash # Check for Makefile @@ -140,7 +108,7 @@ Also check for: - `justfile` - CI config (`.github/workflows/`, `.gitlab-ci.yml`) -Record: build command, test command, lint command, "check everything" command. 
+Record: build command, test command, lint command, "check everything" command. If no build system is found at all, document the gap rather than inventing commands. **Step 4: Map directory structure** @@ -208,7 +176,7 @@ If found, note the license type and header convention. ### Phase 2: DETECT -**Goal**: Identify domain-specific enrichment sources based on repo characteristics. +**Goal**: Identify domain-specific enrichment sources based on repo characteristics. Auto-detect the repo domain and load domain-specific patterns (sapcc Go conventions, OpenStack patterns, etc.) because generic language knowledge is insufficient for project-specific CLAUDE.md generation. **Step 1: Check for sapcc domain (Go repos)** @@ -283,23 +251,23 @@ Enrichment Plan: ### Phase 3: GENERATE -**Goal**: Load template, fill sections from scan results and enrichment, write CLAUDE.md. +**Goal**: Load template, fill sections from scan results and enrichment, write CLAUDE.md. Every section must be derived from actual repo analysis (reading files, parsing configs, checking paths) because guessed content wastes the context window and teaches Claude wrong patterns. **Step 1: Load template** -Read `${CLAUDE_SKILL_DIR}/references/CLAUDEMD_TEMPLATE.md` for the output structure. +Read `${CLAUDE_SKILL_DIR}/references/CLAUDEMD_TEMPLATE.md` for the output structure. Follow its structure exactly because consistent structure means Claude sessions can parse CLAUDE.md predictably across projects. **Step 2: Fill required sections** Fill all 6 required sections from Phase 1 scan results: -**Section 1 -- Project Overview**: Use project name from config file and a description derived from README.md (first paragraph), go.mod module path, or package.json description. List 3-5 key concepts extracted from directory names and core module names. 
+**Section 1 -- Project Overview**: Use project name from config file and a description derived from README.md (first paragraph), go.mod module path, or package.json description. List 3-5 key concepts extracted from directory names and core module names. Extract relevant facts from README (project purpose, key concepts) but reframe for Claude's needs -- README is for GitHub visitors (humans browsing the repo), CLAUDE.md is for Claude sessions (AI working in the codebase), so skip installation guides, badges, and user-facing documentation. -**Section 2 -- Build and Test Commands**: Use ONLY commands found in Makefile, package.json scripts, or equivalent. Format as table. Include "check everything" command prominently. Include single-test and package-test commands. +**Section 2 -- Build and Test Commands**: Use ONLY commands found in Makefile, package.json scripts, or equivalent. Format as table. Include "check everything" command prominently. Include single-test and package-test commands. Never write `go test ./...` without checking the Makefile first because the project's canonical command may include flags, coverage, or race detection. **Section 3 -- Architecture**: Map directory structure from Phase 1 Step 4. Identify key components by reading entry points and core modules. Use absolute directory descriptions, not guesses. -**Section 4 -- Code Style**: Document linter config findings, import ordering (from reading actual source files), naming conventions (from actual code patterns), and tooling that enforces style. +**Section 4 -- Code Style**: Document linter config findings, import ordering (from reading actual source files), naming conventions (from actual code patterns), and tooling that enforces style. Document CLI commands for linting and formatting because those are what Claude actually uses -- do not include IDE/editor setup (VS Code extensions, launch configs) because CLAUDE.md is read by Claude, not by editors. 
**Section 5 -- Testing Conventions**: Document test framework, assertion library, mocking approach, file naming pattern, and integration test requirements from Phase 1 Step 5. @@ -309,11 +277,23 @@ Fill all 6 required sections from Phase 1 scan results: - Test requirements (e.g., "integration tests require PostgreSQL running locally") - Config requirements (e.g., "OS_AUTH_URL must be set for any OpenStack operation") -Do NOT invent pitfalls. If nothing notable was found, include 1-2 based on the build system (e.g., "run make check before committing"). +Do NOT invent pitfalls because fabricated warnings erode trust. If nothing notable was found, include 1-2 based on the build system (e.g., "run make check before committing"). + +Ban these generic phrases -- they waste tokens and provide zero project-specific signal: +- "use meaningful variable names" +- "write clean code" +- "follow best practices" +- "ensure code quality" +- "maintain consistency" +- "keep it simple" +- "write tests" +- "handle errors properly" + +If you find yourself writing generic advice, remove it entirely because leaving a section out is better than filling it with filler. **Step 3: Fill optional sections** -Based on the Phase 2 enrichment plan, fill applicable optional sections: +Based on the Phase 2 enrichment plan, fill applicable optional sections. If an optional section is included, it must be backed by evidence from repo analysis because optional sections without evidence are worse than omitted sections: - **Error Handling**: For Go repos, document wrapping conventions found in source. For sapcc repos, include `fmt.Errorf("...: %w", err)` pattern and note error checking tools from linter config. - **Database Patterns**: Document the driver/ORM, migration tool, and key query patterns found in source. @@ -339,7 +319,7 @@ In Common Pitfalls, add: **Step 5: Write output** -Write the completed CLAUDE.md (or CLAUDE.md.generated) to the output path determined in Phase 1 Step 1. 
+Write the completed CLAUDE.md (or CLAUDE.md.generated) to the output path determined in Phase 1 Step 1. Verify every path mentioned in the output exists and every command is runnable before writing, because a CLAUDE.md with broken paths is worse than no CLAUDE.md -- it teaches Claude to trust wrong information. If writing to `CLAUDE.md.generated`, also show the user a summary diff: ```bash @@ -356,7 +336,7 @@ diff CLAUDE.md CLAUDE.md.generated 2>/dev/null || echo "New file created" **Step 1: Verify all paths exist** -Extract every file path and directory path mentioned in the generated CLAUDE.md. Check each one: +Extract every file path and directory path mentioned in the generated CLAUDE.md. Check each one with `test -e` because "probably exists" is not verified -- one broken path undermines the entire document: ```bash # For each path mentioned in the output @@ -481,50 +461,8 @@ Result: CLAUDE.md.generated alongside existing file, with diff for comparison --- -## Anti-Patterns - -### Anti-Pattern 1: Generic Filler Content -**What it looks like**: "Write clean, maintainable code" or "Follow best practices for error handling" in the generated CLAUDE.md. -**Why wrong**: Generic advice wastes context tokens and provides zero project-specific signal. Claude already knows generic best practices. -**Do instead**: Only include facts derived from actual repo analysis. "Error wrapping uses `fmt.Errorf('during %s: %w', action, err)` pattern (see `internal/api/handler.go`)" is useful. "Handle errors properly" is not. - -### Anti-Pattern 2: Guessing Commands -**What it looks like**: Writing `go test ./...` without checking the Makefile, or `npm test` without reading package.json scripts. -**Why wrong**: The Makefile may wrap `go test` with flags, coverage, or race detection. Using the wrong command teaches Claude to skip project-specific tooling. -**Do instead**: Read the Makefile (or equivalent) first. Use the project's canonical commands. 
If `make check` exists, document `make check`, not the raw tool invocations. - -### Anti-Pattern 3: Copying README Verbatim -**What it looks like**: Pasting the README's project description, installation steps, or usage examples into CLAUDE.md. -**Why wrong**: README is for GitHub visitors (humans browsing the repo). CLAUDE.md is for Claude sessions (AI working in the codebase). Different audiences need different information. README covers "what is this and how to install it." CLAUDE.md covers "how to work effectively in this codebase." -**Do instead**: Extract relevant facts from README (project purpose, key concepts) but reframe for Claude's needs. Skip installation guides, badges, and user-facing documentation. - -### Anti-Pattern 4: Including IDE Setup -**What it looks like**: Adding VS Code extensions, editor config, or debugging launch configurations to CLAUDE.md. -**Why wrong**: CLAUDE.md is read by Claude, not by editors. IDE setup belongs in README or CONTRIBUTING.md. -**Do instead**: Document CLI commands for linting, formatting, and testing. These are what Claude actually uses. 
- ---- - -## Anti-Rationalization - -### Domain-Specific Rationalizations - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I know this is a Go project, I can fill in standard Go patterns" | Standard patterns may not match this project's conventions | Read actual source files before writing any section | -| "The README has a good description, I'll use that" | README is for humans, CLAUDE.md is for Claude sessions | Extract facts, reframe for Claude's context | -| "No Makefile, but `go build` is obvious" | The project may use a custom build script or task runner | Document the gap, don't guess commands | -| "This section is optional, I'll skip the analysis" | Optional sections still require evidence if included | Either analyze properly or omit the section entirely | -| "The paths probably exist, no need to check each one" | Probably != verified. One broken path undermines trust | Check every path with `test -e` | -| "Generic advice is better than nothing" | Generic advice is worse than nothing -- it wastes tokens | Leave section out rather than fill with filler | - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/CLAUDEMD_TEMPLATE.md`: Template structure for generated CLAUDE.md files with required and optional sections - Official Anthropic `claude-md-management:claude-md-improver`: Companion skill for improving existing CLAUDE.md files (use after generation for refinement) diff --git a/skills/git-commit-flow/SKILL.md b/skills/git-commit-flow/SKILL.md index 26d01851..51b066a7 100644 --- a/skills/git-commit-flow/SKILL.md +++ b/skills/git-commit-flow/SKILL.md @@ -30,45 +30,13 @@ routing: # Git Commit Flow Skill -## Operator Context - -This skill operates as an operator for git commit workflows, configuring Claude's behavior 
for standardized commit creation with quality enforcement. It implements a **4-phase gate** pattern: VALIDATE, STAGE, COMMIT, VERIFY. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution. Enforce banned patterns ("Generated with Claude Code", "Co-Authored-By: Claude") in all commit messages -- **Sensitive File Blocking**: NEVER commit `.env`, `*credentials*`, `*secret*`, `*.pem`, `*.key`, `.npmrc`, `.pypirc`. Hard fail if detected in staging area -- **Over-Engineering Prevention**: Only implement the requested commit workflow. No speculative features, no "while I'm here" improvements -- **Atomic Operations**: Each phase gate must pass before proceeding. No partial commits -- **Branch Protection**: Warn and require confirmation before committing to main/master -- **No Skipped Phases**: Execute all 4 phases sequentially. Never skip validation - -### Default Behaviors (ON unless disabled) -- **Interactive Confirmation**: Show staging plan and commit message for user approval before executing -- **Conventional Commit Enforcement**: Validate message follows `[scope]: ` format -- **Working Tree Validation**: Check for clean state (no merge/rebase in progress) before starting -- **Smart File Staging**: Group files by type (docs, code, config, tests, CI) for logical commits -- **Post-Commit Verification**: Confirm commit exists in log and working tree is clean after commit -- **Temporary File Cleanup**: Remove validation artifacts created during workflow - -### Optional Behaviors (OFF unless enabled) -- **Auto-Stage All**: Stage all modified files without confirmation (`--auto-stage`) -- **Skip Validation**: Bypass conventional commit format checks (`--skip-validation`) -- **Dry Run Mode**: Show what would be committed without executing (`--dry-run`) -- **Push After Commit**: Automatically push to remote after success (`--push`) - -## What This Skill CAN Do -- Detect sensitive files before they are 
committed (regex pattern matching) -- Validate commit messages against conventional commit format and CLAUDE.md banned patterns -- Smart-group files by type for logical, atomic commits -- Generate compliant commit messages from staged changes -- Verify commits succeeded and working tree is clean post-commit - -## What This Skill CANNOT Do -- Resolve merge conflicts (requires contextual code judgment) -- Perform interactive rebases (incompatible with deterministic workflow) -- Amend previous commits (use `git commit --amend` directly) -- Judge code quality (use systematic-code-review skill instead) -- Auto-resolve conflicting CLAUDE.md rules (requires human judgment) +Create validated, compliant git commits through a 4-phase gate pattern: VALIDATE, STAGE, COMMIT, VERIFY. Every phase must pass its gate before the next phase begins -- no partial commits, no skipped phases. Only implement the requested commit workflow; do not add speculative improvements or "while I'm here" changes. + +**Flags** (all OFF by default): +- `--auto-stage`: Stage all modified files without confirmation +- `--skip-validation`: Bypass conventional commit format checks +- `--dry-run`: Show what would be committed without executing +- `--push`: Automatically push to remote after success --- @@ -76,10 +44,12 @@ This skill operates as an operator for git commit workflows, configuring Claude' ### Phase 1: VALIDATE -**Goal**: Confirm environment is safe for committing. +**Goal**: Confirm the environment is safe for committing. **Step 1: Check working tree state** +Verify clean state before starting because committing during a merge or rebase produces broken history that is painful to untangle. 
+ ```bash git status --porcelain git rev-parse --abbrev-ref HEAD @@ -89,18 +59,24 @@ Verify: - Not in merge or rebase state (check for `.git/MERGE_HEAD` or `.git/rebase-merge/`) - Not in detached HEAD (if so, warn user to create branch first) - Identify current branch name +- No stash/pop operations across branch merges are in progress, because stashed changes based on a pre-merge state can silently apply to the wrong base when popped after a merge, causing branch drift. If stash is detected, verify the working tree diff after pop with `git diff` to confirm changes still make sense against the new base. **Step 2: Scan for sensitive files** -Check all changed files against sensitive patterns. See `references/banned-patterns.md` for the full pattern list. +NEVER allow `.env`, `*credentials*`, `*secret*`, `*.pem`, `*.key`, `.npmrc`, or `.pypirc` into a commit because credentials in git history are permanent -- removing them requires a full history rewrite and credential rotation. This is a hard fail, not a warning. + +Check all changed files against sensitive patterns: ```bash -# TODO: scripts/validate_state.py not yet implemented -# Manual alternative: check for sensitive files in staged changes git diff --cached --name-only | grep -iE '\.(env|pem|key)$|credentials|secret|\.npmrc|\.pypirc' ``` -If sensitive files detected: display them, suggest `.gitignore` additions, and HARD STOP until resolved. +If sensitive files detected: +1. Display them +2. Suggest `.gitignore` additions +3. HARD STOP until resolved -- do not proceed regardless of user urgency + +This scan applies to every commit, including documentation-only changes, because doc commits can accidentally include `.env` files staged alongside them. **Step 3: Load CLAUDE.md rules** @@ -111,11 +87,13 @@ Read repository CLAUDE.md to extract: If no CLAUDE.md exists, use defaults: ban "Generated with Claude Code" and "Co-Authored-By: Claude". 
+These banned patterns are enforced because they add noise instead of meaningful context and violate repository standards. They will be checked again in Phase 3 during message validation. + **Step 4: Check branch state** -If on `main` or `master`: warn user and require explicit confirmation before proceeding. +If on `main` or `master`: warn user and require explicit confirmation before proceeding, because direct commits to main bypass code review and CI, risk breaking production, and make rollback difficult. Even small changes belong on a feature branch. -**Gate**: All checks pass. No sensitive files, no merge/rebase state, CLAUDE.md loaded. +**Gate**: All checks pass. No sensitive files, no merge/rebase state, CLAUDE.md loaded, branch confirmed. ### Phase 2: STAGE @@ -131,6 +109,8 @@ Parse file statuses: Modified (`M`), Added (`A`), Deleted (`D`), Untracked (`??` **Step 2: Group files by type** +Group files into logical categories because massive commits with unrelated changes make review overwhelming, break `git bisect`, and are difficult to revert. Each commit should represent one logical change that is independently reviewable. + Apply staging rules (see `references/staging-rules.md` for full rules): | Category | Patterns | Commit Prefix | @@ -143,19 +123,23 @@ Apply staging rules (see `references/staging-rules.md` for full rules): **Step 3: Present staging plan and get confirmation** -Show the user which files will be staged and in how many commits. Wait for approval. +Show the user which files will be staged and in how many commits. Wait for approval before executing, because showing the plan first catches mistakes like accidentally staging generated files or mixing unrelated changes. + +If `--auto-stage` flag is set, skip confirmation and stage all modified files. 
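
As a rough illustration of Steps 2 and 3, the sketch below turns `git status --porcelain` output into a staging plan. The sensitive-file regex mirrors the `grep -iE` pattern from Phase 1. The category patterns, the `CATEGORY_RULES` name, and both function names are assumptions made for illustration; the actual rules live in `references/staging-rules.md`, which is not reproduced here.

```python
import re
from collections import defaultdict

# Mirrors the grep -iE pattern used in Phase 1, Step 2.
SENSITIVE = re.compile(
    r"\.(env|pem|key)$|credentials|secret|\.npmrc|\.pypirc", re.IGNORECASE
)

# Hypothetical category rules (assumption); first match wins.
CATEGORY_RULES = [
    ("ci", re.compile(r"^\.github/workflows/|^\.gitlab-ci\.yml$")),
    ("tests", re.compile(r"(^|/)tests?/|_test\.\w+$|(^|/)test_[^/]+$")),
    ("docs", re.compile(r"\.md$|^docs/")),
    ("config", re.compile(r"\.(ya?ml|toml|json|ini)$")),
]


def parse_porcelain(output: str) -> list[str]:
    """Extract paths from `git status --porcelain` (v1) output.

    Each line is two status characters, a space, then the path.
    Quoted paths with special characters are not handled in this sketch.
    """
    paths = []
    for line in output.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]
        # Renames are reported as "old -> new"; keep the new path.
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        paths.append(path)
    return paths


def build_staging_plan(paths):
    """Return (groups, blocked): files per category, plus sensitive files."""
    groups = defaultdict(list)
    blocked = []
    for path in paths:
        if SENSITIVE.search(path):
            blocked.append(path)  # hard stop: never offered for staging
            continue
        for name, pattern in CATEGORY_RULES:
            if pattern.search(path):
                groups[name].append(path)
                break
        else:
            groups["code"].append(path)  # default bucket when nothing matches
    return dict(groups), blocked
```

A non-empty `blocked` list corresponds to the HARD STOP in Phase 1; each key in `groups` is a candidate for its own commit, to be shown to the user before anything is staged.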
**Step 4: Execute staging** +Stage files explicitly by name -- never use `git add .` or `git add -A` because blind bulk staging bypasses sensitive file detection and groups unrelated changes together. + ```bash git add ``` -Re-validate that no sensitive files ended up in the staging area. +Re-validate that no sensitive files ended up in the staging area, because files can be added between the initial scan and staging. **Gate**: Files staged, no sensitive files in staging area, user confirmed plan. -### Phase 2.5: ADR DECISION COVERAGE (conditional — ADR-094) +### Phase 2.5: ADR DECISION COVERAGE (conditional -- ADR-094) **Goal**: Verify staged changes cover all ADR decision points. @@ -177,7 +161,7 @@ Read the active ADR path from `.adr-session.json` (`adr_file` field). | PARTIAL (>0%) | Display uncovered decision points. Ask: "N decision points not covered. Proceed anyway, or address them first?" | | FAIL (0%) | Display warning. Ask: "No ADR decision points found in staged changes. This may mean the wrong files are staged, or implementation is incomplete." | -This is advisory — the implementer can acknowledge uncovered points as intentionally deferred (e.g., "will be covered in a follow-up PR"). +This is advisory -- the implementer can acknowledge uncovered points as intentionally deferred (e.g., "will be covered in a follow-up PR"). **Gate**: Coverage reported. User acknowledged any gaps. @@ -187,10 +171,12 @@ This is advisory — the implementer can acknowledge uncovered points as intenti **Step 1: Get commit message** -Either accept user-provided message or generate one from staged changes. +Either accept user-provided message or generate one from staged changes. Show the message to the user for approval before executing, because commit messages are permanent history and worth getting right the first time. **Step 2: Validate message** +Validate now, not later, because git history is permanent and "I'll fix the message later" never happens in practice. 
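As a rough illustration of the checks this step calls for, here is a minimal sketch. It is not the planned `scripts/validate_message.py`. The function name, the type list, and the choice to report format problems as WARNING are assumptions made for the example. The banned defaults come from Phase 1, and imperative mood is not checked because it cannot be verified mechanically with a simple rule.

```python
import re

# Defaults named in Phase 1, Step 3; a real run would extend these from CLAUDE.md.
BANNED_PATTERNS = ("Generated with Claude Code", "Co-Authored-By: Claude")

# Hypothetical type list; references/conventional-commits.md is authoritative.
TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")
SUBJECT_RE = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")(\((?P<scope>[^)]+)\))?: (?P<desc>.+)$"
)


def validate_message(message, skip_validation=False):
    """Return a list of (level, text) findings; an empty list means the message passes."""
    findings = []

    # Banned patterns are always checked, even with skip_validation.
    # Case-insensitive matching is an assumption of this sketch.
    lowered = message.lower()
    for pattern in BANNED_PATTERNS:
        if pattern.lower() in lowered:
            findings.append(("CRITICAL", "banned pattern: " + pattern))
    if skip_validation:
        return findings

    lines = message.splitlines()
    subject = lines[0] if lines else ""
    match = SUBJECT_RE.match(subject)
    if not match:
        findings.append(("WARNING", "subject is not in conventional commit format"))
    else:
        desc = match.group("desc")
        if desc[0].isupper():
            findings.append(("WARNING", "description should start lowercase"))
        if desc.endswith("."):
            findings.append(("WARNING", "subject should not end with a period"))
    if len(subject) > 72:
        findings.append(("WARNING", "subject exceeds 72 characters"))

    # Body rules: blank separator line, then lines wrapped at 72 characters.
    if len(lines) > 1 and lines[1].strip():
        findings.append(("WARNING", "body must be separated by a blank line"))
    if any(len(line) > 72 for line in lines[2:]):
        findings.append(("WARNING", "body line exceeds 72 characters"))
    return findings
```

Any CRITICAL finding would block the commit, while WARNING findings would be shown and left to the user. Running the banned-pattern loop before the `skip_validation` early return mirrors the rule that `--skip-validation` relaxes formatting checks only.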
+ ```bash # TODO: scripts/validate_message.py not yet implemented # Manual alternative: validate commit message format @@ -198,10 +184,11 @@ Either accept user-provided message or generate one from staged changes. ``` Check: -- Conventional commit format: `[scope]: ` (see `references/conventional-commits.md`) -- No banned patterns (see `references/banned-patterns.md`) +- Conventional commit format: `[scope]: ` (see `references/conventional-commits.md`). Skip this check if `--skip-validation` flag is set. +- No banned patterns from CLAUDE.md (see `references/banned-patterns.md`). Never skip this check -- banned pattern enforcement applies even with `--skip-validation` because these patterns violate repository-level standards, not just formatting preferences. - Subject line: lowercase after type, no trailing period, max 72 chars, imperative mood - Body: separated by blank line, wrapped at 72 chars +- Focus on WHAT changed and WHY -- no attribution, no emoji unless repo style requires it If validation fails with CRITICAL (banned pattern): block commit, show suggested revision. If validation fails with WARNING (line length): show warning, allow user to proceed or revise. @@ -221,6 +208,8 @@ EOF Capture commit hash from output for verification. +If `--dry-run` flag is set, display the commit command and message without executing, then stop. + **Gate**: Commit message validated and commit executed successfully. ### Phase 4: VERIFY @@ -249,12 +238,16 @@ No staged files should remain (unless user had additional unstaged changes). git log -1 --format="%B" ``` -Confirm no banned patterns and format preserved (hooks may modify messages). +Confirm no banned patterns and format preserved. Pre-commit hooks may modify messages, so re-check the persisted version rather than trusting the input. + +**Step 4: Clean up and display summary** -**Step 4: Display summary** +Remove any validation artifacts created during the workflow. 
Report: commit hash, branch, files changed, validation results, and suggested next steps (push, create PR). +If `--push` flag is set, push to remote after displaying the summary. + **Gate**: All verification passes. Workflow complete. --- @@ -305,7 +298,7 @@ Runs VALIDATE and STAGE phases, shows commit message preview, but does not execu 1. Read hook output to identify the issue 2. Fix the issue (run formatter, fix lint errors) 3. Re-stage fixed files: `git add -u` -4. Create a NEW commit (do not amend - the previous commit did not happen) +4. Create a NEW commit (do not amend -- the previous commit did not happen) ### Error: Merge/Rebase in Progress **Cause**: Working tree is in an incomplete merge or rebase state. @@ -313,56 +306,8 @@ Runs VALIDATE and STAGE phases, shows commit message preview, but does not execu --- -## Anti-Patterns - -### Anti-Pattern 1: Committing Without Validation -**What it looks like**: `git add . && git commit -m "update files"` -**Why wrong**: Skips sensitive file detection, CLAUDE.md compliance, conventional format checks. Risk of leaking credentials or creating inconsistent history. -**Do instead**: Use this skill to validate all changes before manual commits. - -### Anti-Pattern 2: Using Banned Commit Patterns -**What it looks like**: Adding "Generated with Claude Code" or "Co-Authored-By: Claude" to messages. -**Why wrong**: Violates CLAUDE.md standards, adds noise instead of meaningful context. -**Do instead**: Focus on WHAT changed and WHY. No attribution, no emoji unless repo style requires it. - -### Anti-Pattern 3: Massive Commits with Unrelated Changes -**What it looks like**: Staging 15 files across 5 features with `git add .` and message "update". -**Why wrong**: Makes review overwhelming, breaks `git bisect`, unclear purpose, difficult to revert. -**Do instead**: Use staging groups. One logical change per commit. Each commit independently reviewable. 
- -### Anti-Pattern 4: Committing Directly to Main/Master -**What it looks like**: Making changes on `main` and pushing directly. -**Why wrong**: Bypasses code review, risks breaking production, makes rollback difficult. -**Do instead**: Create feature branch, commit there, push, create PR. - -### Anti-Pattern 5: Ignoring Sensitive File Warnings -**What it looks like**: Dismissing warnings about `.env` or credential files and committing anyway. -**Why wrong**: Credentials in git history are permanent. Requires history rewrite and credential rotation to fix. -**Do instead**: IMMEDIATELY add to `.gitignore`, unstage, and rotate any exposed credentials. - -### Anti-Pattern 6: Stash/Pop Across Branch Merges -**What it looks like**: Running `git stash`, switching branches to merge or rebase, then `git stash pop` back on the original branch. -**Why wrong**: Stashed changes were based on the pre-merge state. Popping after a merge can silently apply changes to the wrong base, causing branch drift. -**Do instead**: Commit changes before switching branches. If stash is unavoidable, verify the working tree diff after pop with `git diff` to confirm changes still make sense against the new base. 
-*Graduated from learning.db — multi-agent-coordination/stash-pop-branch-drift* - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Quick commit, no need to validate" | Quick commits leak credentials | Run all 4 phases | -| "It's just docs, skip sensitive scan" | Docs commits can include `.env` files | Validate every commit | -| "I'll fix the message later" | Later never comes; history is permanent | Validate message now | -| "Main branch is fine for this small change" | Small changes cause big problems | Create feature branch | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/conventional-commits.md`: Type definitions, format rules, examples, flowchart - `${CLAUDE_SKILL_DIR}/references/banned-patterns.md`: Prohibited phrases, detection rules, alternatives diff --git a/skills/github-actions-check/SKILL.md b/skills/github-actions-check/SKILL.md index 33017a16..778b1fa0 100644 --- a/skills/github-actions-check/SKILL.md +++ b/skills/github-actions-check/SKILL.md @@ -26,50 +26,15 @@ routing: # GitHub Actions Check Skill -## Operator Context - -This skill operates as an operator for GitHub Actions monitoring workflows, configuring Claude's behavior for automated CI/CD status checking after git operations. It implements the **Observe and Report** pattern -- wait for workflow registration, check status, identify failures, suggest remediation. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution -- **Over-Engineering Prevention**: Only check what's directly relevant. 
No speculative monitoring, no custom API scripts when gh CLI works, no workflow file modifications -- **Wait Before Check**: Always wait 5-10 seconds after push for GitHub to register the workflow -- **Complete Output Display**: Show full `gh` command output, never summarize as "build passed" or "tests failed" -- **Branch-Aware Checking**: Always check workflows for the branch that was actually pushed, not default branch -- **Prefer gh CLI**: Always use `gh` CLI over raw API calls -- it handles auth, pagination, and formatting - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show command output rather than describing it. Be concise but informative. -- **Temporary File Cleanup**: Remove any temporary scripts or cache files created during workflow status checking at task completion. -- **Failure Investigation**: When failures detected, identify specific failing jobs and suggest local reproduction commands -- **Auto-fix Suggestions**: For common failures (linting, formatting), provide exact fix commands without executing them - -### Optional Behaviors (OFF unless enabled) -- **Automatic Fix Application**: Only fix and re-push if user explicitly requests it -- **Watch Mode**: Only use `gh run watch` for interactive monitoring if user asks -- **Detailed Job Logs**: Only fetch full job logs (`gh run view --log-failed`) if user needs debugging details - -## What This Skill CAN Do -- Check workflow run status for a specific branch after push -- Identify which jobs failed and show their output -- Suggest local reproduction commands for common CI failures -- Report on multiple workflow runs for comparison -- Show complete gh CLI output for user review - -## What This Skill CANNOT Do -- Auto-fix and re-push without explicit user permission -- Modify workflow YAML files or CI configuration -- Replace local debugging (use systematic-debugging for root cause analysis) -- Replace local linting (use code-linting 
for pre-push checks) -- Summarize or abbreviate gh CLI output - ---- +Check GitHub Actions workflow status after a git push, identify failures, and suggest local reproduction commands. This skill observes and reports -- it never modifies workflow files or auto-fixes code without explicit permission. ## Instructions ### Step 1: Identify Repository and Branch -**Goal**: Determine which repository and branch to check. +Read and follow any repository CLAUDE.md before proceeding -- it may contain CI-specific instructions or branch naming conventions. + +Determine which repository and branch to check: ```bash # Get repository from git remote @@ -79,11 +44,13 @@ git remote get-url origin git branch --show-current ``` -**Gate**: Repository and branch identified. Proceed only when gate passes. +Always use the branch that was actually pushed, never the default branch. Checking without `--branch` can show runs from other branches and give misleading status for the user's actual push. + +**Gate**: Repository and branch both identified. Do not proceed without both values confirmed. ### Step 2: Wait and Check Workflow Status -**Goal**: Allow GitHub to register the workflow, then retrieve status. +GitHub needs 5-10 seconds after a push to register the workflow run. Checking immediately returns stale results from previous runs, not the current push. ```bash # Wait for GitHub to register the workflow run @@ -94,21 +61,22 @@ BRANCH=$(git branch --show-current) gh run list --branch "$BRANCH" --limit 5 ``` -Show the complete output. Do not summarize. +Always use the `gh` CLI rather than raw GitHub API calls -- `gh` handles authentication, pagination, and formatting automatically. Writing custom scripts with `curl` or `requests` adds unnecessary complexity when `gh` already does the job. -**Gate**: Workflow status retrieved and displayed. Proceed only when gate passes. +Show the complete `gh` output verbatim. 
Never summarize results as "build passed" or "tests failed" -- that hides which jobs ran, their timing, and any warnings. Claiming "build passed" without showing output is unverifiable. The user needs to see the actual data. -### Step 3: Investigate Failures +**Gate**: Workflow status retrieved and complete output displayed to user. Do not proceed until the gate passes. -**Goal**: If any workflow failed, identify the specific failing jobs. +### Step 3: Investigate Failures -Only execute this step if Step 2 shows a failed or failing run. +Only execute this step if Step 2 shows a failed or failing run. Do not assume failures are pre-existing without comparing against previous runs -- that is speculation, not evidence. ```bash # Get details of the failed run gh run view -# For deeper investigation (only if user requests) +# For deeper investigation (only if user explicitly requests it, +# since full logs can be very verbose) gh run view --log-failed ``` @@ -125,22 +93,28 @@ Local reproduction: [command to reproduce locally] Suggested fix: [exact commands to fix, if applicable] ``` -**Gate**: All failures identified with reproduction commands. Proceed only when gate passes. +For common failures like linting or formatting, provide exact fix commands but do not execute them. Never auto-fix and re-push without explicit user permission -- making code changes and git commits without review may introduce unintended changes. Only use `gh run watch` for interactive monitoring if the user specifically asks for it. -### Step 4: Report and Suggest +**Gate**: All failures identified with reproduction commands. Do not proceed until the gate passes. -**Goal**: Present findings and suggest next steps without auto-fixing. 
+### Step 4: Report and Suggest If all checks passed: -- Show the complete `gh run list` output +- Show the complete `gh run list` output (not a summary) - Confirm which workflows ran and their status If checks failed: - Show the failure report from Step 3 - Suggest local reproduction commands -- Suggest fix commands (but do NOT execute without permission) +- Suggest fix commands but do NOT execute without permission - Ask the user if they want you to apply fixes +Report facts without self-congratulation. Show command output rather than describing it. Be concise but informative. + +Clean up any temporary scripts or cache files created during the check before finishing. + +This skill only checks CI status. For local debugging of test failures, hand off to systematic-debugging. For local linting, hand off to code-linting. Never modify workflow YAML files or CI configuration as part of this skill. + **Gate**: Complete status report delivered to user. --- @@ -168,49 +142,3 @@ If checks failed: 2. Verify `.github/workflows/` directory exists in the repository 3. Check if workflow is configured to trigger on the pushed branch 4. Verify push event matches workflow trigger conditions - ---- - -## Anti-Patterns - -### Anti-Pattern 1: Checking Immediately After Push -**What it looks like**: `git push && gh run list --limit 1` -**Why wrong**: GitHub needs 5-10 seconds to register the workflow. Immediate checks show stale results from previous runs. -**Do instead**: Always `sleep 10` between push and status check. - -### Anti-Pattern 2: Summarizing Workflow Results -**What it looks like**: "The build passed successfully." -**Why wrong**: Hides which jobs ran, timing, warnings. User cannot verify results. Violates Complete Output Display behavior. -**Do instead**: Show the complete `gh run list` or `gh run view` output verbatim. - -### Anti-Pattern 3: Auto-Fixing Without Permission -**What it looks like**: Detecting lint failure, then running `ruff check --fix . 
&& git push` automatically. -**Why wrong**: Makes code changes and git commits without user review. May introduce unintended changes. Violates optional behavior default. -**Do instead**: Suggest fix commands and wait for explicit user confirmation before executing. - -### Anti-Pattern 4: Checking Wrong Branch -**What it looks like**: `gh run list --limit 1` without specifying `--branch`. -**Why wrong**: May show workflow runs from other branches (main, other feature branches). Gives misleading status for the user's actual push. -**Do instead**: Always use `--branch "$BRANCH"` with the branch that was actually pushed. - -### Anti-Pattern 5: Using Raw API When gh CLI Is Available -**What it looks like**: Writing 50 lines of Python with `requests` to hit the GitHub API. -**Why wrong**: `gh` handles auth, pagination, and formatting automatically. Custom code adds unnecessary complexity and maintenance burden. -**Do instead**: Use `gh run list`, `gh run view`, and `gh run watch`. Only fall back to API if `gh` is truly unavailable. 
- ---- - -## References - -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Build passed" (without showing output) | Claim without evidence is unverifiable | Show complete gh output | -| "Checked right after push" | Too early shows stale results | Wait 10 seconds minimum | -| "Those failures are pre-existing" | Assumption without evidence | Compare with previous runs | -| "I'll just fix and re-push" | Auto-fixing without permission | Ask user before applying fixes | diff --git a/skills/go-anti-patterns/SKILL.md b/skills/go-anti-patterns/SKILL.md index 5004d997..483e2efd 100644 --- a/skills/go-anti-patterns/SKILL.md +++ b/skills/go-anti-patterns/SKILL.md @@ -41,53 +41,19 @@ routing: # Go Anti-Patterns Skill -## Operator Context - -This skill operates as an operator for Go anti-pattern detection and remediation, configuring Claude's behavior to identify over-engineering, premature abstraction, and idiomatic violations in Go code. It implements the **Pattern Recognition** architectural approach -- scan, detect, explain, remediate -- with **Domain Intelligence** embedded in Go-specific heuristics. 
- -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before reviewing -- **Over-Engineering Prevention**: Flag complexity only when simpler Go exists; never add complexity while removing it -- **Evidence-Based Detection**: Every flagged anti-pattern must cite specific code location and explain concrete harm -- **YAGNI Enforcement**: Do not suggest abstractions (interfaces, generics, channels) without 2+ concrete use cases -- **Preserve Working Code**: Flag patterns for awareness; do not rewrite working code without explicit request -- **Idiomatic Go Priority**: Recommendations must align with Go proverbs and standard library conventions - -### Default Behaviors (ON unless disabled) -- **Quick Detection Table**: Use the detection guide to scan code systematically -- **One Pattern at a Time**: Address anti-patterns individually, not in bulk rewrites -- **Context-Aware Severity**: Rate impact as low/medium/high based on codebase context -- **Show Both Versions**: Present current code alongside recommended alternative -- **Root Cause Explanation**: Explain WHY the pattern is harmful, not just that it is -- **Scope Limitation**: Only flag patterns within the files under review - -### Optional Behaviors (OFF unless enabled) -- **Full Codebase Scan**: Scan entire repository for anti-pattern instances -- **Metrics Collection**: Count anti-pattern occurrences by type -- **Auto-Refactor**: Apply fixes directly instead of only flagging them -- **Historical Analysis**: Check git history for when anti-patterns were introduced - -## What This Skill CAN Do -- Detect the 7 core Go anti-patterns with code-level evidence -- Provide idiomatic Go alternatives with before/after examples -- Explain the concrete harm each pattern causes (complexity, bugs, maintenance) -- Distinguish between genuine anti-patterns and acceptable trade-offs -- Guide incremental cleanup without destabilizing working code - -## What This Skill CANNOT Do 
-- Rewrite entire codebases (use systematic-refactoring instead) -- Detect non-Go anti-patterns (use language-specific skills) -- Optimize performance (use performance profiling tools) -- Replace code review (use go-code-review for comprehensive review) -- Judge patterns without seeing the surrounding context - ---- +Detect and remediate the 7 core Go anti-patterns: premature interface abstraction, goroutine overkill, error wrapping without context, channel misuse, generic abuse, context soup, and unnecessary function extraction. Every detection is evidence-based with code location and concrete harm explanation, and every recommendation aligns with Go proverbs and standard library conventions. ## Instructions -### Step 1: Scan for Anti-Patterns +### Phase 1: Prepare + +Read and follow the repository's CLAUDE.md before reviewing any code. Identify which files are under review and restrict all analysis to those files -- do not flag patterns in files outside the review scope. + +If the user requests a full codebase scan or historical git analysis, enable those modes explicitly. Otherwise, stay within the files presented. + +### Phase 2: Scan for Anti-Patterns -Use the Quick Detection Guide to systematically check code under review. +Use the Quick Detection Guide to systematically check each file under review. Work through the table row by row against the code. | Code Smell | Detection Question | If Yes | |------------|-------------------|--------| @@ -99,23 +65,33 @@ Use the Quick Detection Guide to systematically check code under review. | Context in pure function | Does function do I/O? | Remove context param | | Tiny extracted function | Called from 2+ places? | Inline it | -### Step 2: Classify and Report +Flag complexity only when a simpler idiomatic Go alternative exists. Do not suggest adding complexity (interfaces, generics, channels, goroutines) without 2+ concrete use cases that justify it -- this is the YAGNI principle applied to Go abstractions. 
-For each detected anti-pattern, produce a structured report: +### Phase 3: Classify and Report + +For each detected anti-pattern, produce a structured report entry. Every flagged pattern must cite a specific code location and explain the concrete harm it causes -- never flag without evidence. ``` ANTI-PATTERN DETECTED: -- Pattern: [Name from catalog below] +- Pattern: [Name from AP catalog, e.g., AP-1: Premature Interface Abstraction] - Location: [File:line] - Issue: [What is wrong with current approach] - Impact: [Complexity/performance/maintainability cost] -- Severity: [Low/Medium/High] +- Severity: [Low/Medium/High based on codebase context] - Recommendation: [Simpler Go alternative] ``` -### Step 3: Provide Remediation +Rate severity based on the actual codebase context: a single-implementation interface in a small CLI is Low; in a hot path of a shared library it may be High. Address anti-patterns one at a time rather than proposing bulk rewrites. + +### Phase 4: Provide Remediation + +For each flagged pattern, show the current code alongside the recommended alternative so the reader can compare directly. Explain WHY the current pattern is harmful, not just that it is -- root cause understanding prevents recurrence. + +Reference `${CLAUDE_SKILL_DIR}/references/code-examples.md` for extended before/after examples covering all 7 anti-patterns. -Show before/after code. Reference the detailed examples in `references/code-examples.md` for full patterns. +Do not rewrite working code without an explicit request from the user. Flag patterns for awareness and let the user decide whether to act. When the user does request changes, apply them one pattern at a time to keep diffs reviewable. + +If metrics collection is requested, count anti-pattern occurrences by type to identify systemic issues. 
--- @@ -296,19 +272,6 @@ func (opts AuditorOpts) buildConnectionURL() (string, error) { --- -## Go-Specific Phantom Problem Indicators - -Watch for solutions looking for problems: - -- Adding interfaces when concrete types suffice -- Implementing channels when simple function calls work -- Creating goroutines for inherently sequential operations -- Over-abstracting with generics for single-use cases -- Adding middleware layers for simple HTTP handlers -- Creating worker pools for low-throughput scenarios - ---- - ## Error Handling ### Error: "False Positive -- Pattern Is Intentional" @@ -349,19 +312,4 @@ Result: Evidence-based recommendation with Go idiom context ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It Is Wrong | Required Action | -|-----------------|-----------------|-----------------| -| "This interface might be needed later" | YAGNI; future is unknown | Start concrete, extract when needed | -| "Goroutines make it faster" | Concurrency has overhead; profile first | Prove bottleneck exists before adding goroutines | -| "Context should be everywhere" | Context is for I/O and cancellation only | Remove from pure functions | -| "Generics make it more flexible" | Flexibility without use cases is complexity | Use concrete types until 2+ instantiations | -| "Small functions are always better" | Indirection has cognitive cost | Inline single-use trivial functions | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/code-examples.md`: Extended before/after examples for all 7 anti-patterns diff --git a/skills/go-code-review/SKILL.md b/skills/go-code-review/SKILL.md index 261d4d7f..a3efd9bf 100644 --- a/skills/go-code-review/SKILL.md +++ b/skills/go-code-review/SKILL.md @@ 
-39,50 +39,11 @@ routing: # Go Code Review Skill -## Operator Context - -This skill operates as an operator for Go code review workflows, configuring Claude's behavior for systematic, read-only analysis of Go codebases and pull requests. It implements the **Sequential Analysis** architectural pattern -- gather context, run automated checks, analyze quality, report findings -- with **Go Domain Intelligence** embedded in every phase. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before reviewing -- **Over-Engineering Prevention**: Report issues only. Do not suggest speculative improvements or "while reviewing" refactors -- **Read-Only Mode**: NEVER modify code during review. Analyze, identify, report -- do not fix -- **All Phases Required**: Complete all 6 review phases. No skipping, no shortcuts -- **Evidence-Based Findings**: Every issue must reference file, line, and concrete impact -- **Compilation First**: Run `go build ./...` before any linting or analysis commands - -### Default Behaviors (ON unless disabled) -- **Automated Checks**: Run go build, go test -race, go vet, and coverage analysis -- **Gopls MCP Analysis**: Use gopls MCP tools when available: `go_workspace` for structure, `go_file_context` for dependencies, `go_symbol_references` for impact analysis, `go_diagnostics` for build errors (fallback to grep/LSP) -- **Priority Classification**: Tag every finding as CRITICAL, HIGH, MEDIUM, or LOW -- **Structured Output**: Use the PR Review Output Template for all reports -- **Cross-Platform Build**: Verify compilation on linux, darwin, and windows -- **Module Verification**: Run go mod verify and check for tidy state - -### Optional Behaviors (OFF unless enabled) -- **Security Audit**: Run gosec and perform deep security analysis -- **Benchmark Review**: Evaluate benchmark coverage for performance-critical code -- **API Compatibility Check**: Verify exported API changes against semver 
expectations -- **Dependency Audit**: Deep review of new or changed dependencies +Systematic, read-only analysis of Go codebases and pull requests across 6 structured phases. Every phase is mandatory because small changes cause large bugs and skipping phases misses race conditions, compilation errors, and edge cases that visual inspection alone cannot catch. This skill gathers context, runs automated checks, analyzes quality, and reports findings -- it never modifies code. ## Available Scripts -- **`scripts/check-interface-compliance.sh`** — Find exported interfaces missing compile-time `var _ I = (*T)(nil)` checks. Run `bash scripts/check-interface-compliance.sh --help` for options. - -## What This Skill CAN Do -- Systematically review Go code across 6 structured phases -- Run automated checks (build, test, vet, staticcheck, coverage) -- Use gopls MCP for semantic analysis (implementations, references, symbols, diagnostics, vulnerability checks) -- Classify findings by severity with actionable recommendations -- Produce structured review reports with executive summary and detailed analysis - -## What This Skill CANNOT Do -- Modify, edit, or fix any code (reviewers report, they do not fix) -- Replace domain-specific skills (use go-concurrency for concurrency design, go-error-handling for error patterns) -- Skip phases or approve without completing all analysis areas -- Guarantee security (use dedicated security audit tools for compliance) - ---- +- **`scripts/check-interface-compliance.sh`** -- Find exported interfaces missing compile-time `var _ I = (*T)(nil)` checks. Run `bash scripts/check-interface-compliance.sh --help` for options. ## Instructions @@ -90,7 +51,13 @@ This skill operates as an operator for Go code review workflows, configuring Cla **Goal**: Understand what changed and why before analyzing code. 
-**Step 1: Review scope analysis** +**Step 1: Read project conventions** + +Read the repository CLAUDE.md first because project conventions override default review expectations. Note any project-specific linting rules, naming conventions, or architectural decisions that should inform the review. + +**Step 2: Review scope analysis** + +Complete all 6 phases regardless of PR size because small changes cause large bugs and config changes affect runtime behavior. ```markdown ## Review: [PR/Change Title] @@ -100,26 +67,37 @@ This skill operates as an operator for Go code review workflows, configuring Cla - Testing strategy: [approach] ``` -**Step 2: Change overview** +Do not shortcut scope analysis based on author reputation because everyone makes mistakes and the same rigor must apply to all authors. + +**Step 3: Change overview** - List all modified files and packages - Identify core changes vs supporting changes - Note deleted code and understand why - Check for generated or vendored code +Read enough surrounding context to understand the full impact because changes may break invariants or assumptions in adjacent code. + **Gate**: Scope understood. Proceed only when clear on intent. ### Phase 2: Automated Checks -**Goal**: Run all automated tools and capture outputs. +**Goal**: Run all automated tools and capture outputs. Run every check listed here regardless of how "simple" the change looks because visual inspection misses race conditions, compilation errors, and edge cases. + +**Step 1: Compilation (run first)** -**Step 1: Compilation and tests** +Run `go build ./...` before any linting or analysis because compilation errors invalidate all downstream analysis and linter output becomes noise against broken code. ```bash -# Verify compilation across platforms +# Verify compilation across platforms because cross-platform bugs +# are invisible on a single OS GOOS=linux go build ./... GOOS=darwin go build ./... GOOS=windows go build ./... 
+``` + +**Step 2: Tests and coverage** +```bash # Run tests with race detector go test -race -count=1 -v ./... @@ -128,7 +106,7 @@ go test -coverprofile=coverage.out ./... go tool cover -func=coverage.out | tail -10 ``` -**Step 2: Static analysis** +**Step 3: Static analysis and module integrity** ```bash # Static analysis @@ -141,30 +119,30 @@ go mod verify go mod tidy && git diff go.mod go.sum ``` -**Step 3: Gopls-enhanced analysis (MUST use when available)** +**Step 4: Gopls-enhanced analysis (use when available)** -**Prefer gopls MCP tools** (available when `.mcp.json` configures gopls): +Prefer gopls MCP tools (available when `.mcp.json` configures gopls) because they provide semantic analysis that grep-based searches cannot: ``` -# Detect workspace structure (MUST run first) +# Detect workspace structure (run first) go_workspace -# Understand file dependencies after reading .go files (MUST use) +# Understand file dependencies after reading .go files go_file_context({"file": "/path/to/changed_file.go"}) -# Find all references before modifying any symbol (MUST use) +# Find all references before assessing symbol impact go_symbol_references({"file": "/path/to/interface.go", "symbol": "Handler.ServeHTTP"}) # Fuzzy search for symbols go_search({"query": "Server"}) -# Check for build/analysis errors after edits (MUST use) +# Check for build/analysis errors go_diagnostics({"files": ["/path/to/changed_file.go"]}) # Inspect package public API go_package_api({"packagePaths": ["example.com/internal/storage"]}) ``` -**Fallback (when gopls MCP unavailable)** — use LSP tool or CLI: +Fallback (when gopls MCP unavailable) -- use LSP tool or CLI: ```bash # LSP tool: goToDefinition, findReferences, hover, documentSymbol # CLI fallback: @@ -180,9 +158,9 @@ When gopls adds value: - Renaming proposals: verify safe rename with `go_symbol_references` first - Post-edit verification: `go_diagnostics` catches errors before running full test suite -Fallback: If neither gopls MCP nor CLI 
is available, use grep-based searches but warn about potential false positives. +If neither gopls MCP nor CLI is available, use grep-based searches but warn about potential false positives in the report. -**Step 4: Security checks (if optional behavior enabled)** +**Step 5: Security checks (optional -- run when security audit is requested)** ```bash gosec ./... # if available @@ -192,7 +170,7 @@ gosec ./... # if available ### Phase 3: Code Quality Analysis -**Goal**: Evaluate architecture, idioms, and performance. +**Goal**: Evaluate architecture, idioms, and performance. Do not suggest speculative improvements or "while reviewing" refactors because the reviewer role is to identify real issues, not propose hypothetical enhancements. **Architecture and Design**: - SOLID principles followed? @@ -201,6 +179,8 @@ gosec ./... # if available - Dependency injection used properly? - Interfaces focused and minimal? +Treat refactors with the same scrutiny as new code because refactors change behavior subtly and require test verification that behavior is preserved. + **Go Idioms and Best Practices**: - Using `any` instead of `interface{}`? - Proper error handling with context? @@ -219,7 +199,7 @@ gosec ./... # if available ### Phase 4: Specific Analysis Areas -Review each area relevant to the changed code. +Review each area relevant to the changed code. Enable optional analysis (benchmarks, API compatibility, dependency audit) when the PR scope warrants it. **Concurrency Review**: - Data races possible? @@ -244,6 +224,8 @@ Review each area relevant to the changed code. - Test helpers marked with t.Helper()? - No test interdependencies? +Do not accept passing tests as sufficient evidence of correctness because tests can be incomplete or wrong -- review test coverage and quality alongside pass/fail status. + **Security Review**: - Input validation present? - SQL injection prevented? @@ -256,7 +238,7 @@ Review each area relevant to the changed code. 
### Phase 5: Line-by-Line Review -**Goal**: Inspect each significant change individually. +**Goal**: Inspect each significant change individually. NEVER modify code during review because the reviewer role is read-only -- analyze, identify, report, but do not fix. Fixing bypasses author ownership and testing. For each significant change, ask: 1. Is the change necessary? @@ -267,9 +249,15 @@ For each significant change, ask: 6. Performance implications? 7. Security implications? +Every issue must reference file, line, and concrete impact because evidence-based findings are actionable while vague observations are not. Do not accept author explanations at face value -- verify the code itself because explanation does not equal correctness. + +Tag every finding with a severity level (CRITICAL, HIGH, MEDIUM, LOW) because priority classification drives merge decisions. Classify severity honestly based on impact, not author relationship, because severity is objective and downgrading to avoid conflict misrepresents risk. + **Gate**: All significant changes inspected. -### Phase 6: Documentation Review +### Phase 6: Documentation and Report + +**Goal**: Assess documentation and produce the structured review report. - Package comments present and helpful? - Public APIs documented? @@ -277,9 +265,9 @@ For each significant change, ask: - Examples provided for public APIs? - README updated if needed? -**Gate**: Documentation assessed. Review complete. Generate report. +Use the Review Output Template below for all reports because structured output ensures consistent, actionable review deliverables. ---- +**Gate**: Documentation assessed. Review complete. Generate report. 
## Review Output Template @@ -339,8 +327,6 @@ Coverage: X%, Target: 80%+, Gaps: [areas needing tests] [Summary with actionable next steps] ``` ---- - ## Review Priority Guidelines **CRITICAL (Block Merge)**: Security vulnerabilities, data corruption risks, race conditions, memory leaks, breaking API changes, missing critical tests. @@ -351,9 +337,9 @@ Coverage: X%, Target: 80%+, Gaps: [areas needing tests] **LOW (Suggest)**: Naming improvements, code organization, additional examples, future refactoring opportunities. ---- +## Error Handling -## Death Loop Prevention +### Death Loop Prevention NEVER make changes that cause compilation failures during review: @@ -361,10 +347,6 @@ NEVER make changes that cause compilation failures during review: 2. **Function Signature Changes**: NEVER change return types without updating ALL call sites 3. **Compilation Before Linting**: Run `go build ./...` FIRST. If code does not compile, report compilation errors before linting issues ---- - -## Error Handling - ### Error: "Automated Tool Not Available" Cause: staticcheck, golangci-lint, gosec, or gopls not installed Solution: @@ -387,54 +369,7 @@ Solution: 2. Check commit messages for context 3. Ask the user to clarify the change scope before proceeding ---- - -## Anti-Patterns - -### Anti-Pattern 1: Fixing Code During Review -**What it looks like**: Opening Edit tool to "quickly fix" a found issue -**Why wrong**: Reviewer role is read-only. Fixing bypasses author ownership and testing. -**Do instead**: Report the issue with file, line, impact, and suggested fix in the report. - -### Anti-Pattern 2: Skipping Automated Checks -**What it looks like**: "I can see the code is fine, no need to run tests" -**Why wrong**: Visual inspection misses race conditions, compilation errors, and edge cases. -**Do instead**: Run ALL Phase 2 checks regardless of how "simple" the change looks. 
- -### Anti-Pattern 3: Severity Downgrading to Avoid Conflict -**What it looks like**: Marking a logic bug as LOW because "the author will be upset" -**Why wrong**: Severity is objective, not social. Logic bugs are HIGH or CRITICAL. -**Do instead**: Classify severity honestly based on impact, not author relationship. - -### Anti-Pattern 4: Approving Small PRs Without Full Review -**What it looks like**: "It's only 10 lines, LGTM" -**Why wrong**: Small changes cause large bugs. Config changes affect runtime behavior. -**Do instead**: Complete all 6 phases regardless of PR size. - -### Anti-Pattern 5: Reviewing Only Changed Lines -**What it looks like**: Ignoring surrounding context of the changed code -**Why wrong**: Changes may break invariants or assumptions in adjacent code. -**Do instead**: Read enough surrounding context to understand the full impact. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization: Review](../shared-patterns/anti-rationalization-review.md) - Review-specific rationalization prevention -- [Anti-Rationalization: Core](../shared-patterns/anti-rationalization-core.md) - Universal shortcut prevention -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Severity Classification](../shared-patterns/severity-classification.md) - Issue severity standards - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Tests pass, must be fine" | Tests can be incomplete or wrong | Review test coverage and quality too | -| "Small PR, quick review" | Small changes cause big bugs | Full 6-phase review regardless of size | -| "Author explained the logic" | Explanation does not equal correctness | Verify the code itself | -| "Just a refactor" | Refactors change behavior subtly | Verify behavior preserved with tests | -| "Trusted author" | Everyone makes mistakes | Same rigor 
for all authors | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/common-review-comments.md`: Go code patterns with good/bad examples for error handling, concurrency, and testing diff --git a/skills/go-concurrency/SKILL.md b/skills/go-concurrency/SKILL.md index c5de38a8..0d1b4bbb 100644 --- a/skills/go-concurrency/SKILL.md +++ b/skills/go-concurrency/SKILL.md @@ -49,53 +49,15 @@ routing: # Go Concurrency Skill -## Operator Context - -This skill operates as an operator for Go concurrency workflows, configuring Claude's behavior for correct, leak-free concurrent code. It implements the **Domain Intelligence** architectural pattern -- encoding Go concurrency idioms, sync primitives, and channel patterns as non-negotiable constraints rather than suggestions. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before writing concurrent code -- **Over-Engineering Prevention**: Use concurrency only when justified by I/O, CPU parallelism, or measured bottleneck. Sequential code is correct by default -- **Context First Parameter**: All cancellable or I/O operations accept `context.Context` as first parameter -- **No Goroutine Leaks**: Every goroutine must have a guaranteed exit path via context, channel close, or explicit shutdown -- **Race Detector Required**: Run `go test -race` on all concurrent code during development -- **Channel Ownership**: Only the sender closes a channel. 
Never close from receiver side -- **Select With Context**: Every `select` statement in concurrent code must include a `<-ctx.Done()` case - -### Default Behaviors (ON unless disabled) -- **errgroup Over WaitGroup**: Prefer `golang.org/x/sync/errgroup` for goroutine management with error collection -- **Buffered Channel Sizing**: Buffer size matches expected backpressure, not arbitrary large numbers -- **Directional Channel Returns**: Return `<-chan T` (receive-only) from producer functions to prevent caller misuse -- **Mutex Scope Minimization**: Lock only the critical section, use `defer Unlock()` immediately after `Lock()` -- **Loop Variable Safety**: Use Go 1.22+ loop variable semantics; remove legacy `item := item` shadows in new code -- **Graceful Shutdown**: Workers and servers implement clean shutdown with drain timeout -- **Atomic for Counters**: Use `atomic.Int64` / `atomic.Value` for simple shared counters instead of mutex - -### Optional Behaviors (OFF unless enabled) -- **Gopls MCP Analysis**: Use gopls MCP tools to trace channel usage and context propagation — `go_symbol_references` for tracing channel flow, `go_file_context` for understanding goroutine spawn sites, `go_diagnostics` after concurrent code edits. 
Fallback: `gopls references` CLI or LSP tool -- **Container GOMAXPROCS Tuning**: Configure GODEBUG flags for container CPU limit overrides -- **Performance Profiling**: Profile goroutine counts and channel contention under load -- **Custom Rate Limiter**: Build token-bucket rate limiter instead of using `golang.org/x/time/rate` - -## What This Skill CAN Do -- Guide implementation of worker pools, fan-out/fan-in, and pipeline patterns -- Apply correct context propagation through concurrent call chains -- Select appropriate sync primitives (Mutex, RWMutex, WaitGroup, Once, atomic) -- Implement rate limiting with context-aware waiting -- Diagnose and fix race conditions, deadlocks, and goroutine leaks -- Structure graceful shutdown for background workers and servers - -## What This Skill CANNOT Do -- Fix general Go bugs unrelated to concurrency (use systematic-debugging instead) -- Optimize non-concurrent performance (use performance optimization workflows instead) -- Write tests for concurrent code (use go-testing skill instead) -- Handle Go error handling patterns (use go-error-handling skill instead) - ---- +Guide implementation of correct, leak-free concurrent Go code using goroutines, channels, sync primitives, and context propagation. Works by assessing whether concurrency is justified, selecting the right primitive, enforcing context propagation, implementing the pattern, and verifying with the race detector. ## Instructions -### Step 1: Assess Concurrency Need +### Phase 1: Assess Concurrency Need + +**Goal**: Determine whether concurrency is justified before adding complexity. + +Read and follow the repository's CLAUDE.md before writing any concurrent code, because project-specific conventions (naming, package structure, error handling) override general patterns. Before writing concurrent code, answer these questions: @@ -103,9 +65,13 @@ Before writing concurrent code, answer these questions: 2. 
**Is the work CPU-bound?** -- concurrency helps only if parallelizable 3. **Is there a measured bottleneck?** -- if not measured, don't assume -If none apply, write sequential code. Concurrency adds complexity; justify it. +If none apply, write sequential code. Sequential code is correct by default -- concurrency adds goroutine lifecycle management, synchronization, and race risk. Only introduce it when I/O, CPU parallelism, or a measured bottleneck justifies the complexity. Assuming "sequential is too slow" without profiling is a common mistake; profile first, then add concurrency. + +**Gate**: At least one of the three conditions (I/O-bound, CPU-bound, measured bottleneck) is met. Proceed only when gate passes. -### Step 2: Choose the Right Primitive +### Phase 2: Choose the Right Primitive + +**Goal**: Select the minimal primitive that solves the concurrency need. | Need | Primitive | When | |------|-----------|------| @@ -117,9 +83,21 @@ If none apply, write sequential code. Concurrency adds complexity; justify it. | One-time initialization | `sync.Once` | Lazy singleton, config loading | | Simple shared counter | `atomic.Int64` | Increment/read without mutex overhead | -### Step 3: Context Propagation +Selection guidance: + +- Prefer `errgroup.Group` over `sync.WaitGroup` because errgroup collects errors and cancels remaining goroutines on first failure, which is what you want in most production scenarios. +- Use `atomic.Int64` or `atomic.Value` for simple shared counters instead of mutex because atomic operations avoid lock contention and are sufficient when the shared state is a single value. +- Return `<-chan T` (receive-only) from producer functions because it prevents callers from accidentally closing or sending on a channel they don't own. +- Size buffered channels to match expected backpressure, not arbitrary large numbers, because oversized buffers hide flow-control bugs that surface under production load. 
+- When you need a custom rate limiter instead of `golang.org/x/time/rate`, build a token-bucket implementation -- but only when that package doesn't meet your needs. + +**Gate**: Primitive selected with clear justification. Proceed only when gate passes. + +### Phase 3: Context Propagation + +**Goal**: Wire context through all cancellable operations so goroutines respond to cancellation. -Always pass context as first parameter for I/O or cancellable operations. +Accept `context.Context` as the first parameter for all I/O or cancellable operations because a function that's fast today may become slow under load tomorrow, and retrofitting context is harder than passing it from the start. ```go func FetchData(ctx context.Context, id string) (*Data, error) { @@ -149,6 +127,8 @@ func FetchData(ctx context.Context, id string) (*Data, error) { } ``` +Every `select` statement in concurrent code must include a `case <-ctx.Done()` because without it, a goroutine blocks forever if the channel never receives and the upstream context is cancelled -- this is the most common source of goroutine leaks. When to use context vs not: ```go @@ -159,10 +139,18 @@ func FetchUserData(ctx context.Context, userID string) (*User, error) { ... } func CalculateTotal(prices []float64) float64 { ... } ``` -### Step 4: Implement the Pattern +When gopls MCP tools are available, use `go_symbol_references` to trace channel flow and `go_file_context` to understand goroutine spawn sites -- this helps verify context propagation through concurrent call chains. + +**Gate**: All I/O operations accept context, all `select` statements include `<-ctx.Done()`. Proceed only when gate passes. + +### Phase 4: Implement the Pattern + +**Goal**: Write the concurrent code using the selected primitive with correct lifecycle management.
**Sync Primitives** +Lock only the critical section and use `defer mu.Unlock()` immediately after `mu.Lock()` because early returns or panics between Lock and Unlock cause deadlocks that are extremely difficult to diagnose in production. + ```go // Mutex for state protection type SafeCounter struct { @@ -196,7 +184,9 @@ func (c *Cache) Set(key string, value any) { } ``` -**errgroup for concurrent work with error handling (preferred over WaitGroup)** +**errgroup for concurrent work with error handling** + +Prefer `errgroup` over `sync.WaitGroup` because it collects errors and cancels remaining goroutines on first failure: ```go import "golang.org/x/sync/errgroup" @@ -214,6 +204,8 @@ func ProcessAll(ctx context.Context, items []Item) error { } ``` +Use Go 1.22+ loop variable semantics -- the `item` variable is per-iteration, so legacy `item := item` shadows are unnecessary in new code. + **sync.Once for one-time initialization** ```go @@ -233,6 +225,8 @@ func (c *Config) Load() (*AppConfig, error) { **Channel patterns: buffered vs unbuffered** +Only the sender closes a channel because closing from the receiver side causes panics if the sender is still sending, and multiple receivers may double-close. Use `defer close(ch)` in the goroutine that writes. + ```go // Unbuffered: synchronous, sender blocks until receiver ready ch := make(chan int) @@ -246,30 +240,53 @@ ch := make(chan int, 100) // - Buffer size should match expected backpressure ``` -For worker pool, fan-out/fan-in, pipeline, rate limiter, and graceful shutdown patterns, see `references/concurrency-patterns.md`. +**Graceful Shutdown** + +Workers and servers must implement clean shutdown with a drain timeout because abrupt termination can lose in-flight work and corrupt state. See `references/concurrency-patterns.md` for the full graceful shutdown pattern. + +For worker pool, fan-out/fan-in, pipeline, and rate limiter patterns, see `references/concurrency-patterns.md`. 
-### Step 5: Run Race Detector +When profiling goroutine counts and channel contention under load, use `runtime.NumGoroutine()` and pprof to identify bottlenecks. For container deployments, configure `GOMAXPROCS` to match container CPU limits when needed. + +**Gate**: Code compiles, follows the selected pattern, channels closed by sender only, mutexes use defer Unlock. Proceed only when gate passes. + +### Phase 5: Run Race Detector + +**Goal**: Verify no data races exist in the concurrent code. + +Run `go test -race` on all concurrent code because race conditions are silent until production -- they don't cause compile errors, often don't cause test failures, and manifest as rare, non-reproducible bugs under load. ```bash -# ALWAYS run with race detector during development +# Run with race detector during development go test -race -count=1 -v ./... # Run specific test with race detection go test -race -run TestConcurrentOperation ./... ``` -### Step 6: Concurrency Checklist +After editing concurrent code, use `go_diagnostics` (when gopls MCP tools are available) to catch errors before running tests. + +**Gate**: `go test -race` passes clean with no race conditions detected. Proceed only when gate passes. + +### Phase 6: Verify Completeness + +**Goal**: Confirm all concurrent code is correct and leak-free. + +Every goroutine must have a guaranteed exit path via context cancellation, channel close, or explicit shutdown signal because goroutine leaks compound over time and lead to OOM in production -- a single leaked goroutine in a request handler means unbounded memory growth. 
Before declaring concurrent code complete, verify: -- [ ] **Context propagation** - All I/O operations accept context -- [ ] **Goroutine exit paths** - Every goroutine can terminate -- [ ] **Channel closure** - Channels closed by sender only -- [ ] **Select with context** - All selects include `<-ctx.Done()` -- [ ] **Proper synchronization** - Shared state protected -- [ ] **Race detector passes** - `go test -race` clean -- [ ] **Graceful shutdown** - Workers stop cleanly -- [ ] **No goroutine leaks** - All goroutines tracked +- [ ] **Context propagation** -- All I/O operations accept context as first parameter +- [ ] **Goroutine exit paths** -- Every goroutine can terminate (via ctx.Done, channel close, or stop signal) +- [ ] **Channel closure** -- Channels closed by sender only, using `defer close(ch)` +- [ ] **Select with context** -- All `select` statements include `case <-ctx.Done()` +- [ ] **Proper synchronization** -- Shared state protected by mutex or atomic +- [ ] **Mutex discipline** -- `defer mu.Unlock()` immediately after `mu.Lock()` +- [ ] **Race detector passes** -- `go test -race` clean +- [ ] **Graceful shutdown** -- Workers and servers stop cleanly with drain timeout +- [ ] **No goroutine leaks** -- All goroutines tracked and have exit paths + +**Gate**: All checklist items verified. Concurrent code is complete. --- @@ -301,50 +318,7 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Goroutine Without Exit Path -**What it looks like**: `go func() { for { doWork() } }()` with no context check or stop channel -**Why wrong**: Goroutine runs forever, leaking memory. Cannot be cancelled or shut down gracefully. -**Do instead**: Always include `<-ctx.Done()` or a stop channel in goroutine loops. - -### Anti-Pattern 2: Unnecessary Concurrency -**What it looks like**: Spawning goroutines for sequential work that does not benefit from parallelism -**Why wrong**: Adds complexity (error channels, WaitGroups, race risks) without performance gain. 
Sequential code is simpler and correct by default. -**Do instead**: Measure first. Use concurrency only for I/O-bound, CPU-parallel, or proven bottleneck scenarios. - -### Anti-Pattern 3: Closing Channel From Receiver Side -**What it looks like**: Consumer goroutine calling `close(ch)` on a channel it reads from -**Why wrong**: Sender may still send, causing panic. Multiple receivers may double-close. -**Do instead**: Only the sender (producer) closes the channel. Use `defer close(ch)` in the goroutine that writes. - -### Anti-Pattern 4: Mutex Lock Without Defer Unlock -**What it looks like**: `mu.Lock()` followed by complex logic before `mu.Unlock()`, with early returns in between -**Why wrong**: Early returns or panics skip the `Unlock()`, causing deadlocks. -**Do instead**: Always `defer mu.Unlock()` immediately after `mu.Lock()`. - -### Anti-Pattern 5: Ignoring Context in Select -**What it looks like**: `select { case msg := <-ch: handle(msg) }` without a `<-ctx.Done()` case -**Why wrong**: Goroutine blocks forever if channel never receives and context is cancelled. -**Do instead**: Every `select` in concurrent code must include `case <-ctx.Done(): return`. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "No need for context, this is fast" | Fast today, slow tomorrow under load | Pass context to all I/O operations | -| "Race detector is slow, skip it" | Races are silent until production | Run `go test -race` every time | -| "One goroutine leak won't matter" | Leaks compound; OOM in production | Verify every goroutine has exit path | -| "Sequential is too slow" | Assumption without measurement | Profile first, then add concurrency | -| "Buffer of 1000 should be enough" | Arbitrary buffers hide backpressure bugs | Size buffers to actual throughput | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/concurrency-patterns.md`: Worker pool, fan-out/fan-in, pipeline, rate limiter, and graceful shutdown patterns with full code examples diff --git a/skills/go-error-handling/SKILL.md b/skills/go-error-handling/SKILL.md index 85cf6ca7..11e001c4 100644 --- a/skills/go-error-handling/SKILL.md +++ b/skills/go-error-handling/SKILL.md @@ -41,56 +41,23 @@ routing: # Go Error Handling Skill -## Operator Context - -This skill operates as an operator for Go error handling implementation, configuring Claude's behavior for idiomatic, context-rich error propagation. It implements the **Pattern Application** architectural approach -- select the right error pattern (wrap, sentinel, custom type), apply it consistently, verify the error chain is preserved. 
- -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before implementing error handling -- **Over-Engineering Prevention**: Use the simplest error pattern that fits. Do not create custom types when a sentinel suffices, or sentinels when a simple wrap is enough -- **Always Check Errors**: Every function that returns an error must have its error checked or explicitly ignored with `_ = fn()` -- **Always Wrap with Context**: Never use naked `return err`. Every wrap adds meaningful context describing the operation that failed -- **Preserve Error Chains**: Use `%w` verb in `fmt.Errorf` so callers can use `errors.Is` and `errors.As` -- **Error Messages Form a Narrative**: When read top-to-bottom, wrapped errors tell the story of what happened - -### Default Behaviors (ON unless disabled) -- **Lower-case Error Messages**: Error strings start lower-case, no trailing punctuation (Go convention) -- **Contextual Identifiers**: Include relevant IDs, keys, or filenames in wrap messages -- **Sentinel Errors for API Boundaries**: Define sentinel errors for conditions callers need to check -- **Custom Types for Rich Context**: Use custom error types only when callers need structured data -- **errors.Is for Values, errors.As for Types**: Never use string matching or type assertions for error checks -- **HTTP Status Mapping**: Map domain errors to HTTP status codes at the handler boundary - -### Optional Behaviors (OFF unless enabled) -- **gopls MCP Error Tracing**: Use `go_symbol_references` to find all usages of sentinel errors across the codebase, `go_diagnostics` to verify error handling correctness after edits. 
Fallback: `gopls references` CLI or LSP tool -- **Error Wrapping Audit**: Scan for naked `return err` statements and missing `%w` verbs -- **Table-Driven Error Tests**: Generate table-driven tests for error paths +Idiomatic Go error handling through context-rich wrapping, sentinel errors, custom error types, and proper error chain inspection. Every error return tells a story -- when read top-to-bottom, wrapped errors form a narrative of what happened and where. ## Available Scripts -- **`scripts/check-errors.sh`** — Detect bare `return err` and log-and-return anti-patterns. Run `bash scripts/check-errors.sh --help` for options. +- **`scripts/check-errors.sh`** -- Detect bare `return err` and log-and-return anti-patterns. Run `bash scripts/check-errors.sh --help` for options. -## What This Skill CAN Do -- Guide idiomatic error wrapping with `fmt.Errorf` and `%w` -- Define and use sentinel errors (`errors.New` package-level vars) -- Create custom error types that implement the `error` interface -- Implement `errors.Is` and `errors.As` checks throughout error chains -- Map domain errors to HTTP status codes at handler boundaries -- Verify error propagation in table-driven tests +## Instructions -## What This Skill CANNOT Do -- Debug runtime panics or stack traces (use systematic-debugging instead) -- Implement structured logging (use logging-specific guidance instead) -- Handle Go concurrency error patterns (use go-concurrency instead) -- Design error monitoring or alerting systems (out of scope) +### Phase 1: Understand the Context ---- +Read the repository CLAUDE.md before implementing any error handling, because project-specific conventions (error prefixes, logging patterns, custom base types) override generic Go idioms. -## Instructions +Scan existing error patterns in the package to understand what is already in use. 
When gopls MCP is available, use `go_symbol_references` to find all usages of sentinel errors across the codebase, and `go_diagnostics` to verify error handling correctness after edits. Fallback: `gopls references` CLI or LSP tool. -### Step 1: Identify the Error Pattern Needed +### Phase 2: Select the Right Error Pattern -Before writing any error handling code, determine which pattern fits: +Choose the simplest pattern that fits the situation, because over-engineering error types creates unnecessary abstraction that callers must learn and maintain. A simple wrap handles most cases; do not create custom types when a sentinel suffices, or sentinels when a simple wrap is enough. | Situation | Pattern | Example | |-----------|---------|---------| @@ -99,9 +66,13 @@ Before writing any error handling code, determine which pattern fits: | Caller needs structured error data | **Custom error type** | `type ValidationError struct{...}` | | Error at HTTP boundary | **Status mapping** | `errors.Is(err, ErrNotFound) -> 404` | -### Step 2: Wrap Errors with Context +Even when an error seems simple and unlikely to be inspected, always define a sentinel if the error crosses a package boundary, because you cannot predict what callers will need and adding sentinels later is a breaking change. + +**Gate**: Before writing code, name the pattern you are using and why it fits. If the answer is "custom type" but no caller needs structured data, downgrade to sentinel or wrap. -Every error return should add context describing the operation that failed. Error messages form a readable narrative when chained. +### Phase 3: Wrap Errors with Context + +Every error return adds context describing the operation that failed. Never use naked `return err`, because unwrapped errors lose context about where in the call chain the failure occurred -- by the time the error reaches the top, no one knows what operation triggered it. 
```go func LoadConfig(path string) (*Config, error) { @@ -121,14 +92,22 @@ func LoadConfig(path string) (*Config, error) { ``` Rules for wrap messages: -- Describe the **operation**, not the error (`"load config"` not `"error loading config"`) -- Include **identifying data** (filename, ID, key) -- Use `%w` to preserve the error chain -- Start lower-case, no trailing punctuation +- Describe the **operation**, not the error (`"load config"` not `"error loading config"`), because prefixes like "error" or "failed" add zero information -- the caller already knows it is an error +- Include **identifying data** (filename, ID, key), because without it the error message cannot distinguish between instances of the same operation +- Use `%w` to preserve the error chain, because `%v` severs the chain and makes `errors.Is`/`errors.As` return false for all wrapped errors +- Start lower-case, no trailing punctuation, because this is Go convention and errors are often concatenated with colons where capitals and periods look wrong + +Wrap once per call boundary. Do not double-wrap at the same level (`fmt.Errorf("process: error processing: %w", fmt.Errorf("processing failed: %w", err))`), because redundant wrapping creates unreadable error chains where the same context appears multiple times. + +Each level should add only its own context, not repeat caller context, because duplicate context in logs makes errors harder to parse and debug. -### Step 3: Define Sentinel Errors When Callers Need to Check +Every function that returns an error must have its error checked. Never silently discard errors (`file.Close()` without checking), because silent failures cause data corruption, resource leaks, or hard-to-debug issues downstream. If you intentionally ignore an error, make it explicit with `_ = fn()` so readers know the discard was deliberate, not accidental. -Sentinel errors are package-level variables for conditions callers must handle specifically. 
+When auditing existing code, scan for naked `return err` statements and missing `%w` verbs to find places where context is being lost. + +### Phase 4: Define Sentinel Errors When Callers Need to Check + +Sentinel errors are package-level variables for conditions callers must handle specifically. Define them at API boundaries where callers need to branch on error identity. ```go package mypackage @@ -159,9 +138,9 @@ if errors.Is(err, ErrNotFound) { } ``` -### Step 4: Create Custom Error Types for Rich Context +### Phase 5: Create Custom Error Types for Rich Context -Use custom types when callers need to extract structured data from errors. +Use custom types only when callers need to extract structured data from errors, because a custom type forces callers to import your package and use `errors.As` -- unnecessary coupling if all they need is identity checking via `errors.Is`. ```go type ValidationError struct { @@ -187,7 +166,9 @@ if errors.As(err, &valErr) { } ``` -### Step 5: Use errors.Is and errors.As Correctly +### Phase 6: Inspect Errors with errors.Is and errors.As + +Use `errors.Is` for value comparison and `errors.As` for type extraction. Never use string matching (`strings.Contains(err.Error(), "not found")`) or direct type assertions (`err.(*MyError)`), because string matching is fragile -- error messages change across versions and wrapping -- and type assertions skip the unwrap chain so they miss wrapped errors entirely. **errors.Is** checks if any error in the chain matches a specific value: ```go @@ -209,9 +190,9 @@ if errors.As(err, &netErr) { } ``` -### Step 6: Map Errors to HTTP Status at Boundaries +### Phase 7: Map Errors to HTTP Status at Boundaries -Error-to-status mapping belongs at the HTTP handler level, not in domain logic. +Error-to-status mapping belongs at the HTTP handler level, not in domain logic, because embedding HTTP semantics in domain code couples your business logic to the transport layer. 
```go func errorToStatus(err error) int { @@ -228,9 +209,9 @@ func errorToStatus(err error) int { } ``` -### Step 7: Test Error Paths +### Phase 8: Test Error Paths -Use table-driven tests to verify error handling: +Use table-driven tests to verify error handling, because error paths are the most common source of untested behavior and table-driven structure makes it easy to add cases: ```go func TestProcessUser(t *testing.T) { @@ -266,85 +247,40 @@ func TestProcessUser(t *testing.T) { } ``` -### Step 8: Verify Error Handling Completeness +### Phase 9: Verify Error Handling Completeness Before completing, check: -- [ ] All errors checked -- no unchecked returns -- [ ] Context added -- each wrap describes the operation -- [ ] `%w` verb used -- error chain preserved -- [ ] Narrative formed -- error messages readable top-to-bottom -- [ ] Sentinel errors defined -- for conditions callers must check -- [ ] `errors.Is`/`errors.As` used -- no string comparison or type assertion -- [ ] HTTP status mapped -- at handler boundaries only - ---- +- [ ] All errors checked -- no unchecked returns, because silent failures cause data corruption and resource leaks +- [ ] Context added -- each wrap describes the operation, because without context the final error message is meaningless +- [ ] `%w` verb used -- error chain preserved, because `%v` severs the chain and breaks `errors.Is`/`errors.As` +- [ ] Narrative formed -- error messages readable top-to-bottom, because debugging relies on the error chain telling a coherent story +- [ ] Sentinel errors defined -- for conditions callers must check, because adding sentinels later is a breaking change +- [ ] `errors.Is`/`errors.As` used -- no string comparison or type assertion, because string matching breaks when messages change +- [ ] HTTP status mapped -- at handler boundaries only, because domain code should not know about HTTP +- [ ] Simplest pattern used -- no over-engineered custom types where a wrap or sentinel suffices ## Error 
Handling ### Error: "error chain broken -- errors.Is returns false" -Cause: Used `%v` instead of `%w` in `fmt.Errorf`, or created a new error instead of wrapping +Cause: Used `%v` instead of `%w` in `fmt.Errorf`, or created a new error instead of wrapping. Solution: 1. Check all `fmt.Errorf` calls use `%w` for the error argument 2. Ensure sentinel errors are not re-created (use the same `var`) 3. Verify custom types implement `Unwrap()` if they wrap inner errors ### Error: "redundant error context in logs" -Cause: Same context added at multiple call levels, or wrapping errors that already contain the info +Cause: Same context added at multiple call levels, or wrapping errors that already contain the info. Solution: 1. Each level should add only its own context, not repeat caller context 2. One wrap per call boundary -- do not double-wrap at the same level ### Error: "sentinel error comparison fails across packages" -Cause: Error was recreated with `errors.New` instead of using the exported variable +Cause: Error was recreated with `errors.New` instead of using the exported variable. Solution: 1. Import and reference the package-level `var ErrX` directly 2. Never shadow sentinel errors with local `errors.New` calls ---- - -## Anti-Patterns - -### Anti-Pattern 1: Wrapping Without Meaningful Context -**What it looks like**: `return fmt.Errorf("error: %w", err)` or `return fmt.Errorf("failed: %w", err)` -**Why wrong**: "error" and "failed" add zero information. The chain becomes noise. -**Do instead**: Describe the operation: `return fmt.Errorf("load user %s from database: %w", userID, err)` - -### Anti-Pattern 2: Naked Error Returns -**What it looks like**: `return err` without wrapping -**Why wrong**: Loses context at every call boundary. Final error message lacks the narrative chain. 
-**Do instead**: Always wrap: `return fmt.Errorf("outer operation: %w", err)` - -### Anti-Pattern 3: Silently Ignoring Errors -**What it looks like**: `file.Close()` without checking the return, or `data, _ := fetchData()` without comment -**Why wrong**: Silent failures cause data corruption, resource leaks, or hard-to-debug issues downstream. -**Do instead**: Check and log: `if err := file.Close(); err != nil { log.Printf("close file: %v", err) }` or explicitly ignore: `_ = file.Close()` - -### Anti-Pattern 4: String Matching for Error Checks -**What it looks like**: `if strings.Contains(err.Error(), "not found")` -**Why wrong**: Fragile. Error messages change. Breaks across wrapped errors. -**Do instead**: Use `errors.Is(err, ErrNotFound)` or `errors.As(err, &target)` - -### Anti-Pattern 5: Over-Wrapping at the Same Level -**What it looks like**: `return fmt.Errorf("process: error processing: %w", fmt.Errorf("processing failed: %w", err))` -**Why wrong**: Redundant wrapping creates unreadable error chains. 
-**Do instead**: One clear wrap per call level: `return fmt.Errorf("process user request: %w", err)` - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Simple error, no need to wrap" | Unwrapped errors lose context at every level | Always wrap with `fmt.Errorf` and `%w` | -| "String check is fine for now" | String matching breaks when messages change | Use `errors.Is` or `errors.As` | -| "No one will check this error" | You cannot predict caller needs | Define sentinel if it crosses a package boundary | -| "Custom type is overkill" | Evaluate the actual need, but do not skip if callers need structured data | Match pattern to situation (Step 1 table) | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/patterns.md`: Extended patterns -- gopls tracing, HTTP handler patterns, error wrapping in middleware diff --git a/skills/go-pr-quality-gate/SKILL.md b/skills/go-pr-quality-gate/SKILL.md index 2eb4adcb..f9329a9b 100644 --- a/skills/go-pr-quality-gate/SKILL.md +++ b/skills/go-pr-quality-gate/SKILL.md @@ -28,61 +28,28 @@ routing: # Go PR Quality Gate Skill -## Operator Context - -This skill operates as an operator for Go quality validation workflows, configuring Claude's behavior for automated code quality checking. It implements the **Deterministic Execution** pattern -- `make check` is the single source of truth, the skill parses and categorizes output into actionable feedback. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Over-Engineering Prevention**: Only run the checks requested. 
No speculative analysis, no additional tooling beyond what `make check` provides -- **Deterministic Execution**: Always use `make check` as the single source of truth. No custom check orchestration or tool selection logic -- **Exit Code Fidelity**: Report exact exit codes and status from make. Never mask or modify build tool results -- **Validate First**: Always validate repository prerequisites (go.mod, Makefile) before running checks -- **Incremental Fixes**: Apply one category of fixes at a time, re-run checks after each - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show exact error messages and locations. Be concise but informative -- **Temporary File Cleanup**: Remove temporary analysis files and debug logs at completion. Keep only files needed for user review -- **Error Categorization**: Group errors by type (linting, tests, build) with actionable fix suggestions -- **Coverage Reporting**: Extract and report test coverage percentage from check output -- **Progressive Output**: Stream progress updates as checks run - -### Optional Behaviors (OFF unless enabled) -- **Verbose Debug Mode**: Available via `--verbose` flag for troubleshooting -- **Custom Coverage Thresholds**: Available via `--min-coverage` flag for stricter validation -- **JSON-Only Output**: Available via `--format json` for automation pipelines - -## What This Skill CAN Do -- Run `make check` and parse results into categorized, actionable output -- Categorize linting, test, and build errors by type and severity -- Extract coverage percentages from test output -- Suggest make targets for common fix patterns (goimports, tidy-deps, license-headers) -- Validate repository prerequisites (go.mod, Makefile, git) -- Generate structured JSON reports for automation - -## What This Skill CANNOT Do -- Fix code automatically -- identifies issues and suggests make targets only -- Run custom linters -- only works with linters configured in 
the repository's `make check` target -- Modify Makefile -- requires an existing `make check` target -- Run checks incrementally -- executes full `make check` suite, not individual file checks -- Interpret business logic errors -- provides technical categorization, not domain debugging - ---- +Run `make check` as the single source of truth for Go code quality, parse the output into categorized errors, and suggest actionable fixes. The skill identifies issues and suggests make targets -- it does not fix code automatically, run custom linters, modify Makefiles, run checks incrementally on single files, or interpret business logic errors. ## Instructions -### Prerequisites -- Go repository with `go.mod` at root -- Repository has `Makefile` with `check` target -- Works best with Makefile-based build workflow repositories +### Step 1: Read Repository CLAUDE.md + +Read and follow the repository's CLAUDE.md before doing anything else. It may contain project-specific quality requirements or overrides. -### Step 1: Validate Repository Context +### Step 2: Validate Repository Context + +Never skip validation, even if you think you know what the error is -- assumptions miss prerequisites and waste time debugging the environment instead of code quality. Run the validation script to check prerequisites: ```bash python3 ~/.claude/skills/go-pr-quality-gate/scripts/quality_checker.py --validate-only ``` +The repository needs: +- `go.mod` at root (Go module) +- `Makefile` with a `check` target +- A git repository + Expected success output: ```json { @@ -101,9 +68,11 @@ If validation fails: **Gate**: Validation returns `"status": "valid"`. Proceed only when gate passes. -### Step 2: Run Quality Checks +### Step 3: Run Quality Checks + +Always use `make check` through the script -- never bypass it by running golangci-lint, go test, or go vet individually. Different projects configure their quality gates differently, and running tools directly skips whatever the Makefile includes. 
-Execute comprehensive quality gate: +Execute the quality gate: ```bash python3 ~/.claude/skills/go-pr-quality-gate/scripts/quality_checker.py ``` @@ -112,18 +81,19 @@ The script will: 1. Run `make check` (static analysis + tests) 2. Parse output for errors and coverage 3. Categorize any failures -4. Generate actionable report with fix suggestions +4. Generate an actionable report with fix suggestions For verbose progress output: ```bash python3 ~/.claude/skills/go-pr-quality-gate/scripts/quality_checker.py --verbose ``` -### Step 3: Interpret Results +### Step 4: Interpret Results + +Report facts: exact error messages, file locations, exit codes, and coverage percentages. Do not editorialize or self-congratulate on passing checks. #### Success Scenario -Success output format: ```json { "status": "success", @@ -135,12 +105,11 @@ Success output format: When successful: 1. Acknowledge passing checks -2. Report coverage percentage +2. Report coverage percentage -- always include it, because silently dropping coverage hides regressions 3. Suggest next steps: view detailed coverage (`open build/cover.html`), create commit, or run specific checks #### Failure Scenario -Failure output format: ```json { "status": "failed", @@ -162,9 +131,13 @@ When failures occur: - License headers: `make license-headers` - Specific linter guidance: check `references/common-lint-errors.json` 4. **Provide context**: file paths, line numbers, error descriptions -5. **Suggest incremental fixes**: one make target at a time +5. **Report exact exit codes** from make -- never mask or modify them + +Keep explanations brief. Report the specific error with its location and the fix suggestion from script output. Only explain further if the user asks -- a quality check is not a tutorial. + +### Step 5: Apply Fixes Incrementally -### Step 4: Apply Suggested Fixes +Fix one category at a time. 
Applying 15 fixes across multiple files simultaneously makes it impossible to verify which fix resolved which error, and if one fix is wrong, everything needs rollback. For common error patterns, run suggested make targets one at a time: @@ -179,13 +152,13 @@ make tidy-deps make license-headers ``` -After each fix, re-run quality checks (Step 2) to verify resolution. +After each fix, re-run quality checks (Step 3) to verify resolution before moving to the next category. **Gate**: All checks pass (exit code 0). Coverage meets baseline. No linting errors. All tests pass. -### Step 5: Detailed Investigation (Optional) +### Step 6: Detailed Investigation (Optional) -For complex failures, use specific make targets: +For complex failures, use specific make targets to isolate the problem: ```bash # Run only static analysis @@ -201,6 +174,8 @@ go test -v -run TestSpecificTest ./pkg/service open build/cover.html ``` +Only use individual make targets for focused investigation after `make check` has failed -- not as a substitute for it. + ### Advanced Options Custom coverage threshold enforcement: @@ -218,36 +193,9 @@ Combined options for thorough debugging: python3 ~/.claude/skills/go-pr-quality-gate/scripts/quality_checker.py --min-coverage 80.0 --verbose ``` ---- +### Cleanup -## Examples - -### Example 1: Clean Quality Check -User says: "Run quality checks before I create a PR" -Actions: -1. Validate repository context (Step 1) -2. Run `make check` via quality_checker.py (Step 2) -3. Report all checks passed with coverage percentage (Step 3) -4. Suggest creating commit -Result: Clean quality gate, ready for PR - -### Example 2: Linting Failures with Auto-Fix -User says: "Check code quality" -Actions: -1. Validate, run checks -- import and license errors found -2. Report categorized errors with fix commands -3. Run `make goimports` then `make license-headers` -4. 
Re-run checks to verify resolution -Result: Issues fixed incrementally, all checks pass - -### Example 3: Test Failures -User says: "Why are the checks failing?" -Actions: -1. Run quality checks -- test failures detected -2. Report failing test names and packages -3. Suggest running specific test with verbose output for details -4. After user fixes, re-run to verify -Result: Test failures identified with actionable debug steps +Remove temporary analysis files and debug logs at completion. Keep only files needed for user review. --- @@ -284,51 +232,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Running Checks Without Validation -**What it looks like**: Immediately running quality_checker.py without checking for go.mod and Makefile -**Why wrong**: Fails with cryptic errors. Wastes time debugging environment instead of code quality. -**Do instead**: Always run Step 1 validation first. The `--validate-only` flag exists for this purpose. - -### Anti-Pattern 2: Fixing All Errors Simultaneously -**What it looks like**: "I see 15 linting errors, let me fix them all at once across multiple files" -**Why wrong**: Multiple concurrent changes are hard to review. If one fix is wrong, all changes need rollback. Cannot verify which fix resolved which error. -**Do instead**: Apply one category of fixes at a time. Run `make goimports`, re-run checks, then fix next category. - -### Anti-Pattern 3: Bypassing make check with Individual Tools -**What it looks like**: Running golangci-lint, go test, go vet separately instead of `make check` -**Why wrong**: Bypasses repository's configured quality gates. May miss checks the Makefile includes. Different projects have different configurations. -**Do instead**: Always use `make check` as single source of truth. Only run individual make targets for focused investigation after a check fails. - -### Anti-Pattern 4: Ignoring Coverage in Results -**What it looks like**: "All tests pass!" 
without mentioning that coverage dropped from 85% to 45% -**Why wrong**: Coverage regression indicates untested code paths. New code without tests reduces overall quality. -**Do instead**: Always report coverage percentage. Highlight changes if baseline is known. Use `--min-coverage` for threshold enforcement. - -### Anti-Pattern 5: Over-Explaining Linter Errors -**What it looks like**: Writing paragraphs about Go error handling philosophy when errcheck reports an unchecked return -**Why wrong**: User asked for quality check, not a tutorial. The script already provides fix suggestions. Delays actionable response. -**Do instead**: Report the specific error with location, show the fix suggestion from script output, explain only if the user asks. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "make check is slow, I'll just run go vet" | Skips configured quality gates | Run full `make check` | -| "Coverage is fine, no need to report it" | Hides regression information | Always report coverage | -| "I know what the error is, skip validation" | Assumptions miss prerequisites | Validate repository first | -| "One big fix is faster than incremental" | Can't verify individual fixes | Fix one category at a time | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/common-lint-errors.json`: Linter descriptions, severities, and fix suggestions - `${CLAUDE_SKILL_DIR}/references/makefile-targets.json`: Available make targets and when to use them - `${CLAUDE_SKILL_DIR}/references/expert-review-patterns.md`: Manual review patterns beyond automated linting diff --git 
a/skills/go-sapcc-conventions/SKILL.md b/skills/go-sapcc-conventions/SKILL.md index 6ee4d724..f239a230 100644 --- a/skills/go-sapcc-conventions/SKILL.md +++ b/skills/go-sapcc-conventions/SKILL.md @@ -34,91 +34,95 @@ routing: # SAP Converged Cloud Go Conventions -Coding standards extracted from extensive PR review analysis across `sapcc/keppel` and `sapcc/go-bits`. These are the real rules enforced in code review by the project's lead review standards. +Coding standards extracted from extensive PR review analysis across `sapcc/keppel` and `sapcc/go-bits`. These are the real rules enforced in code review by the project's lead review standards. The defining characteristic of SAP CC Go code is aggressive rejection of unnecessary complexity -- when this document conflicts with "Go best practices," this document wins for sapcc code. -## Tool Integration +## Instructions -**gopls MCP (MUST use when available)**: Use `go_workspace` at session start, `go_file_context` after reading .go files, `go_symbol_references` before modifying any symbol (critical for sapcc — lead review checks cross-package impact), `go_diagnostics` after every edit, `go_vulncheck` after go.mod changes. This gives type-aware analysis that catches issues grep cannot. +### Phase 1: ORIENT -**Modern Go Guidelines**: Detect Go version from go.mod. Sapcc projects typically target Go 1.22+. Use version-appropriate features: `t.Context()` (1.24+), `b.Loop()` (1.24+), `strings.SplitSeq` (1.24+), `wg.Go()` (1.25+), `errors.AsType[T]` (1.26+). +Read project context and reference files before writing any code. Project conventions override general Go patterns because the lead reviewer evaluates against these specific standards, not generic best practices. ---- +**1a. Read CLAUDE.md first** because project-level overrides (custom linter config, local build commands, repo-specific patterns) take precedence over this skill's defaults. + +**1b. 
Use gopls MCP when available**: Run `go_workspace` at session start, `go_file_context` after reading .go files, `go_symbol_references` before modifying any symbol (critical for sapcc -- lead review checks cross-package impact), `go_diagnostics` after every edit, `go_vulncheck` after go.mod changes. This gives type-aware analysis that catches issues grep cannot. + +**1c. Detect Go version from go.mod** because sapcc projects typically target Go 1.22+. Use version-appropriate features: `t.Context()` (1.24+), `b.Loop()` (1.24+), `strings.SplitSeq` (1.24+), `wg.Go()` (1.25+), `errors.AsType[T]` (1.26+). + +**1d. Load reference files** — this is NON-NEGOTIABLE. Do NOT rely on training data for sapcc conventions; read the actual references because they contain real rules from actual PR reviews. Load in this order because each builds on the previous: +1. **[references/sapcc-code-patterns.md](${CLAUDE_SKILL_DIR}/references/sapcc-code-patterns.md)** -- actual function signatures, constructors, interfaces, HTTP handlers, error handling, DB access, testing, package organization +2. **[references/library-reference.md](${CLAUDE_SKILL_DIR}/references/library-reference.md)** -- complete library table: 30 approved, 10+ forbidden, with versions and usage counts +3. **[references/architecture-patterns.md](${CLAUDE_SKILL_DIR}/references/architecture-patterns.md)** -- full 102-rule architecture specification (when working on architecture, handlers, or DB access) +4. 
Load others as needed: [references/review-standards-lead.md](${CLAUDE_SKILL_DIR}/references/review-standards-lead.md) (21 lead review comments), [references/review-standards-secondary.md](${CLAUDE_SKILL_DIR}/references/review-standards-secondary.md) (15 secondary review comments), [references/anti-patterns.md](${CLAUDE_SKILL_DIR}/references/anti-patterns.md) (20+ anti-patterns with BAD/GOOD examples), [references/extended-patterns.md](${CLAUDE_SKILL_DIR}/references/extended-patterns.md) (security micro-patterns, K8s namespace isolation, PR hygiene, changelog format) + +**Gate**: go.mod contains `github.com/sapcc/go-bits`. If absent, this skill does not apply -- use general Go conventions instead. -## 1. Anti-Over-Engineering Rules (Strongest Project Opinions) +### Phase 2: WRITE CODE -This section comes first because it is the defining characteristic of SAP CC Go code. The most frequent review theme (10 of 38 comments) is rejecting unnecessary complexity. +Apply sapcc conventions while writing. The strongest project opinion is anti-over-engineering (10 of 38 review comments reject unnecessary complexity). Every pattern below includes the reasoning so you understand when to apply it. -### 1.1 When NOT to Create Types +#### 2a. Anti-Over-Engineering (Highest Priority) -Do not create throwaway struct types just to marshal a simple JSON payload: +This section comes first because it is the defining characteristic of SAP CC Go code. + +**When NOT to create types** -- do not create throwaway struct types just to marshal a simple JSON payload because lead review considers this "overengineered": ```go -// BAD: Copilot suggested this. Rejected as "overengineered." 
+// REJECTED: Copilot suggested this type fsParams struct { Path string `json:"path"` } type fsConfig struct { Type string `json:"type"`; Params fsParams `json:"params"` } config, _ := json.Marshal(fsConfig{Type: "filesystem", Params: fsParams{Path: path}}) -// GOOD: project convention +// CORRECT: project convention storageConfig = fmt.Sprintf(`{"type":"filesystem","params":{"path":%s}}`, must.Return(json.Marshal(filesystemPath))) ``` -### 1.2 When NOT to Wrap Errors - -Do not add error context that the called function already provides. The Go stdlib's `strconv` functions include the function name, input value, and error reason: +**When NOT to wrap errors** -- do not add error context that the called function already provides because `strconv` functions include function name, input value, and error reason: ```go -// BAD: redundant wrapping +// REJECTED: redundant wrapping val, err := strconv.ParseUint(s, 10, 32) if err != nil { return fmt.Errorf("failed to parse chunk number %q: %w", s, err) } -// GOOD: strconv already says "strconv.ParseUint: parsing \"hello\": invalid syntax" +// CORRECT: strconv already says "strconv.ParseUint: parsing \"hello\": invalid syntax" chunkNumber := must.Return(strconv.ParseUint(chunkNumberStr, 10, 32)) ``` > "ParseUint is disciplined about providing good context in its input messages... So we can avoid boilerplate here without compromising that much clarity." -### 1.3 When NOT to Handle Errors - -Do not handle errors that are never triggered in practice. 
Apply the standard consistently: +**When NOT to handle errors** -- do not handle errors that are never triggered in practice because consistency matters more than theoretical completeness: ```go -// BAD: handling os.Stdout.Write errors +// REJECTED: handling os.Stdout.Write errors n, err := os.Stdout.Write(data) if err != nil { return fmt.Errorf("failed to write to stdout: %w", err) } -// GOOD: if fmt.Println ignoring errors is OK everywhere, so is os.Stdout.Write +// CORRECT: if fmt.Println ignoring errors is OK everywhere, so is os.Stdout.Write os.Stdout.Write(data) ``` > "I'm going to ignore this based purely on the fact that Copilot complains about `os.Stdout.Write()`, but not about the much more numerous instances of `fmt.Println` that theoretically suffer the same problem." -### 1.4 When NOT to Add Defer Close - -Do not add `defer Close()` on `io.NopCloser` just for theoretical contract compliance: +**When NOT to add defer close** -- do not add `defer Close()` on `io.NopCloser` just for theoretical contract compliance: > "This is an irrelevant contrivance. Either `WriteTrivyReport` does it, or the operation fails and we fatal-error out, in which case it does not matter anyway." -### 1.5 Dismiss Copilot/AI Suggestions That Add Complexity - -Lead review evaluates AI suggestions on merit and frequently simplifies them: +**Dismiss Copilot/AI suggestions that add complexity** -- lead review evaluates AI suggestions on merit and frequently simplifies them: - If a Copilot suggestion is inconsistent (complains about X but not equivalent Y), dismiss it - If a Copilot suggestion creates types for one-off marshaling, simplify it - Ask: "Can you point to a concrete scenario where this fails?" 
If not, don't handle it -### 1.6 When NOT to Build Smart Inference - -When a known future design change is coming, don't build abstractions that will break: +**When NOT to build smart inference** -- when a known future design change is coming, don't build abstractions that will break: ``` -# BAD: inferring params from driver name (won't work for future "multi" driver) +# REJECTED: inferring params from driver name (won't work for future "multi" driver) --driver swift (auto-generates swift params) -# GOOD: explicit and future-proof +# CORRECT: explicit and future-proof --driver swift --params '{"container":"foo"}' ``` @@ -126,74 +130,80 @@ When a known future design change is coming, don't build abstractions that will But also: do NOT preemptively solve the future problem. Just don't build something that blocks the future solution. -### 1.7 No Hidden Defaults for Niche Cases - -If a default value only applies to a subset of use cases, make the parameter required for everyone: +**No hidden defaults for niche cases** -- if a default value only applies to a subset of use cases, make the parameter required for everyone: > "This is, in effect, a default value that only applies to two specific storage drivers. These are not widely used enough to justify the undocumented quirk." ---- - -## 2. Lead Review Rules +#### 2b. Library Usage -The lead review style is directive. Statements, not suggestions. Top concerns: simplicity, API design, error handling. See [references/review-standards-lead.md](references/review-standards-lead.md) for all 21 PR comments with full context. +Use only approved libraries because the lead reviewer will reject PRs that introduce forbidden dependencies. SAP CC has its own equivalents for common Go libraries. -### Core Principles +**APPROVED Libraries**: -| Rule | Summary | -|------|---------| -| Trust the stdlib | Don't wrap errors that `strconv`, constructors, etc. 
already describe well | -| Use Cobra subcommands | Never manually roll argument dispatch that Cobra handles | -| CLI names: specific + extensible | `keppel test-driver storage`, not `keppel test` | -| Marshal structured data for errors | If you have a `map[string]any`, `json.Marshal` it instead of manually formatting fields | -| Tests must verify behavior | Never silently remove test assertions during refactoring | -| Explain test workarounds | Add comments when test setup diverges from production patterns | -| Use existing error utilities | Use `errext.ErrorSet` and `.Join()`, not manual string concatenation | -| TODOs need context | Include what, a starting point link, and why not done now | -| Documentation stays qualified | When behavior changes conditionally, update docs to state the conditions | -| Understand value semantics | Value receiver copies the struct, but reference-type fields share data | -| Variable names don't mislead | Don't name script vars as if the application reads them | +| Library | Purpose | Key Pattern | +|---------|---------|-------------| +| `sapcc/go-bits` | Core framework (170+ files) | `logg.Info`, `must.Return`, `assert.HTTPRequest` | +| `majewsky/gg/option` | `Option[T]` (45 files) | `Some(v)`, `None[T]()`, dot-import ONLY for this | +| `majewsky/schwift/v2` | Swift storage client | OpenStack storage driver only | +| `gorilla/mux` | HTTP routing | `r.Methods("GET").Path("/path").HandlerFunc(h)` | +| `spf13/cobra` | CLI framework | `AddCommandTo(parent)` pattern | +| `go-gorp/gorp/v3` | SQL ORM | `db:"column_name"` struct tags | +| `gophercloud/gophercloud/v2` | OpenStack SDK | Keystone auth, Swift storage | +| `prometheus/client_golang` | Metrics | Application + HTTP middleware metrics | +| `redis/go-redis/v9` | Redis client | Rate limiting, token caching | +| `gofrs/uuid/v5` | UUID generation | NOT google/uuid, NOT satori/uuid | +| `golang-jwt/jwt/v5` | JWT tokens | Auth token handling | +| `alicebob/miniredis/v2` | Testing only | 
In-memory Redis for tests | -### How Lead Review Works +**FORBIDDEN Libraries** -- using any of these will fail review: -- **Reads Copilot suggestions critically** -- agrees with principle, proposes simpler alternatives -- **Dismisses inconsistent AI complaints** -- if tool flags X but not equivalent Y, the concern is invalid -- **Thinks about forward compatibility** -- command names and API shapes evaluated for extensibility -- **Values brevity when stdlib provides clarity** -- removes wrappers that duplicate error info -- **Approves simple PRs quickly** -- doesn't manufacture concerns -- **Corrects misconceptions directly** -- states correct behavior without softening -- **Pushes fixes directly** -- sometimes pushes commits to address review concerns directly +| Library | Reason | Use Instead | +|---------|--------|-------------| +| `testify` (assert/require/mock) | SAP CC has own testing framework | `go-bits/assert` + `go-bits/must` | +| `zap` / `zerolog` / `slog` / `logrus` | SAP CC standardized on simple logging | `go-bits/logg` | +| `gin` / `echo` / `fiber` | SAP CC uses stdlib + gorilla/mux | `go-bits/httpapi` + `gorilla/mux` | +| `gorm` / `sqlx` / `ent` | Lightweight ORM preference | `go-gorp/gorp/v3` + `go-bits/sqlext` | +| `viper` | No config files; env-var-only config | `go-bits/osext` + `os.Getenv` | +| `google/uuid` / `satori/uuid` | Different UUID library chosen | `gofrs/uuid/v5` | +| `gomock` / `mockery` | Manual test double implementations | Hand-written doubles via driver interfaces | +| `ioutil.*` | Deprecated since Go 1.16 | `os` and `io` packages | +| `http.DefaultServeMux` | Global mutable state | `http.NewServeMux()` | +| `gopkg.in/square/go-jose.v2` | Archived, has CVEs | `gopkg.in/go-jose/go-jose.v2` | ---- +See [references/library-reference.md](references/library-reference.md) for the complete table with versions and usage counts. -## 3. Secondary Review Rules +**Import grouping convention** -- three groups, separated by blank lines. 
Enforced by `goimports -local github.com/sapcc/keppel`: -The secondary review style is inquisitive. Questions where lead review makes statements. Top concerns: configuration safety, migration paths, test completeness. See [references/review-standards-secondary.md](references/review-standards-secondary.md) for full details. +```go +import ( + // Group 1: Standard library + "context" + "encoding/json" + "fmt" + "net/http" -### Core Principles + // Group 2: External (includes sapcc/go-bits, NOT local project) + "github.com/gorilla/mux" + . "github.com/majewsky/gg/option" // ONLY dot-import allowed + "github.com/sapcc/go-bits/httpapi" + "github.com/sapcc/go-bits/logg" -| Rule | Summary | -|------|---------| -| Error messages must be actionable | "Internal Server Error" is unacceptable when the cause is knowable | -| Know the spec, deviate pragmatically | Reference RFCs, but deviate when spec is impractical | -| Guard against panics with clear errors | Check nil/empty before indexing, use `fmt.Errorf("invalid: %q", val)` | -| Strict configuration parsing | Use `DisallowUnknownFields()` on JSON decoders for config | -| Test ALL combinations | When changing logic with multiple inputs, test every meaningful combination | -| Eliminate redundant code | Ask "This check is now redundant?" when code is refactored | -| Comments explain WHY | When something non-obvious is added, request an explanatory comment | -| Domain knowledge over theory | Dismiss concerns that don't apply to actual domain constraints | -| Smallest possible fix | 2-line PRs are fine. Don't bundle unrelated changes | -| Respect ownership hierarchy | "LGTM but lets wait for lead review, we are in no hurry here" | -| Be honest about mistakes | Acknowledge errors quickly and propose fix direction | -| Validate migration paths | "Do we somehow check if this is still set and then abort?" 
| + // Group 3: Local project + "github.com/sapcc/keppel/internal/keppel" + "github.com/sapcc/keppel/internal/models" +) +``` ---- +**Dot-import whitelist** (only these 3 packages): +- `github.com/majewsky/gg/option` +- `github.com/onsi/ginkgo/v2` +- `github.com/onsi/gomega` -## 4. Architecture Rules +#### 2c. Architecture Keppel uses a strict layered architecture. See [references/architecture-patterns.md](references/architecture-patterns.md) for the complete 102-rule set with code examples. -### Directory Structure +**Directory structure**: ``` project/ @@ -211,8 +221,6 @@ project/ test/ # Test infrastructure, doubles, helpers ``` -### Key Patterns - **Pluggable Driver Pattern** (6 driver types in keppel): ```go // Interface in internal/keppel/ @@ -261,13 +269,11 @@ port := osext.GetenvOrDefault("KEPPEL_DB_PORT", "5432") debug := osext.GetenvBool("KEPPEL_DEBUG") ``` ---- - -## 5. Error Handling Rules +#### 2d. Error Handling See [references/architecture-patterns.md](references/architecture-patterns.md) for the complete 27-rule error handling specification. -### Error Wrapping Conventions +**Error wrapping conventions** -- use these prefixes consistently because lead review checks for consistency across the codebase: ```go // "while" for operations in progress @@ -285,7 +291,7 @@ return fmt.Errorf("could not get ManagedAccountNames(): %w", err) All error messages: lowercase, no trailing punctuation, include identifying data with `%q`, descriptive action prefix. 
-### must.Return / must.Succeed Scope +**must.Return / must.Succeed scope** -- these call `os.Exit(1)` on error, so using them in request handlers crashes the server: ```go // ALLOWED: startup/bootstrap code (fatal errors) @@ -300,9 +306,7 @@ digest := must.ReturnT(rc.UploadBlob(ctx, data))(t) // Never use must.* where errors should be propagated ``` -### must vs assert in Tests: When to Use Which - -In test code, `must` and `assert` serve different roles: +**must vs assert in tests** -- `must` and `assert` serve different roles: | Package | Calls | Use When | |---------|-------|----------| @@ -316,27 +320,27 @@ In test code, `must` and `assert` serve different roles: 4. Need a return value? -> `must.ReturnT` (no assert equivalent) ```go -// Setup (fatal) — next lines depend on this +// Setup (fatal) -- next lines depend on this must.SucceedT(t, store.UpdateMetrics()) families := must.ReturnT(registry.Gather())(t) -// Assertion (non-fatal) — checking expected outcome +// Assertion (non-fatal) -- checking expected outcome assert.ErrEqual(t, err, nil) assert.Equal(t, len(families), 3) ``` **The rule: helper = must, assertion = assert.** -### assert.Equal vs assert.DeepEqual +**assert.Equal vs assert.DeepEqual**: | Type supports `==`? | Use | Args | |---------------------|-----|------| | Yes (int, string, bool) | `assert.Equal(t, actual, expected)` | 3 | | No (slices, maps, structs) | `assert.DeepEqual(t, "label", actual, expected)` | 4 | -Common mistake flagged in review: `assert.DeepEqual(t, "count", len(events), 3)` — `len` returns `int` which is comparable, so use `assert.Equal(t, len(events), 3)`. +Common mistake flagged in review: `assert.DeepEqual(t, "count", len(events), 3)` -- `len` returns `int` which is comparable, so use `assert.Equal(t, len(events), 3)`. 
-### Logging Level Selection +**Logging level selection**: | Level | When | Example | |-------|------|---------| @@ -345,9 +349,7 @@ Common mistake flagged in review: `assert.DeepEqual(t, "count", len(events), 3)` | `logg.Info` | Operational events, graceful degradation | `logg.Info("rejecting overlong name: %q", name)` | | `logg.Debug` | Diagnostic, gated behind `KEPPEL_DEBUG` | `logg.Debug("parsing configuration...")` | -### Panic Rules - -Panic ONLY for: +**Panic rules** -- panic ONLY for: - Programming errors / unreachable code: `panic("unreachable")` - Invariant violations: `panic("(why was this not caught by Validate!?)")` - Infallible operations: `crypto/rand.Read`, `json.Marshal` on known-good data @@ -355,7 +357,7 @@ Panic ONLY for: NEVER panic for: user input, external services, database errors, request handling. -### HTTP Error Response Formats (3 distinct) +**HTTP error response formats** (3 distinct -- using the wrong format for an API surface will fail review): | API Surface | Format | Helper | |-------------|--------|--------| @@ -365,11 +367,9 @@ NEVER panic for: user input, external services, database errors, request handlin 5xx errors use `respondwith.ObfuscatedErrorText` which logs the real error with a UUID and returns `"Internal Server Error (ID = )"` to the client. ---- - -## 6. API Design Rules +#### 2e. 
API Design -### Handler Pattern (Every handler follows this sequence) +**Handler pattern** -- every handler follows this exact sequence because lead review checks for it: ```go func (a *API) handleGetAccount(w http.ResponseWriter, r *http.Request) { @@ -395,10 +395,9 @@ func (a *API) handleGetAccount(w http.ResponseWriter, r *http.Request) { } ``` -### Strict JSON Parsing +**Strict JSON parsing** -- always use `DisallowUnknownFields` for request bodies because the secondary reviewer specifically checks for this: ```go -// ALWAYS use DisallowUnknownFields for request bodies decoder := json.NewDecoder(r.Body) decoder.DisallowUnknownFields() err := decoder.Decode(&req) @@ -408,7 +407,7 @@ if err != nil { } ``` -### Response Conventions +**Response conventions**: ```go // Single resource: wrap in named key @@ -423,11 +422,9 @@ if len(items) == 0 { } ``` ---- - -## 7. Testing Rules +#### 2f. Testing -### Core Testing Stack +**Core testing stack** -- use only these tools because testify, gomock, and httptest.NewRecorder will fail review: - **Assertion library**: `go-bits/assert` (NOT testify, NOT gomock) - **DB testing**: `easypg.WithTestDB` in every `TestMain` @@ -436,7 +433,7 @@ if len(items) == 0 { - **Time control**: `mock.Clock` (never call `time.Now()` directly) - **Test doubles**: Implement real driver interfaces, register via `init()` -### assert.HTTPRequest Pattern +**assert.HTTPRequest pattern**: ```go assert.HTTPRequest{ @@ -458,7 +455,7 @@ assert.HTTPRequest{ }.Check(t, h) ``` -### DB Testing Pattern +**DB testing pattern**: ```go // In shared_test.go -- REQUIRED for every package with DB tests @@ -477,7 +474,7 @@ tr.DBChanges().AssertEqual(`UPDATE repos SET next_sync_at = 7200 WHERE id = 1;`) tr.DBChanges().AssertEmpty() // nothing else changed ``` -### Test Execution Flags +**Test execution flags**: ```bash go test -shuffle=on -p 1 -covermode=count -coverpkg=... -mod vendor ./... @@ -487,7 +484,7 @@ go test -shuffle=on -p 1 -covermode=count -coverpkg=... 
-mod vendor ./... - `-p 1`: Sequential packages (shared PostgreSQL database) - `-mod vendor`: Use vendored dependencies -### Test Anti-Patterns +**Test anti-patterns** -- these will fail review: | Anti-Pattern | Correct Pattern | |-------------|----------------| @@ -497,103 +494,25 @@ go test -shuffle=on -p 1 -covermode=count -coverpkg=... -mod vendor ./... | `time.Now()` in testable code | Inject `func() time.Time`, use `mock.Clock` | | `t.Run` subtests (rare in keppel) | Log test case index: `t.Logf("----- testcase %d/%d -----")` | ---- - -## 8. Library Usage Rules - -### APPROVED Libraries - -| Library | Purpose | Key Pattern | -|---------|---------|-------------| -| `sapcc/go-bits` | Core framework (170+ files) | `logg.Info`, `must.Return`, `assert.HTTPRequest` | -| `majewsky/gg/option` | `Option[T]` (45 files) | `Some(v)`, `None[T]()`, dot-import ONLY for this | -| `majewsky/schwift/v2` | Swift storage client | OpenStack storage driver only | -| `gorilla/mux` | HTTP routing | `r.Methods("GET").Path("/path").HandlerFunc(h)` | -| `spf13/cobra` | CLI framework | `AddCommandTo(parent)` pattern | -| `go-gorp/gorp/v3` | SQL ORM | `db:"column_name"` struct tags | -| `gophercloud/gophercloud/v2` | OpenStack SDK | Keystone auth, Swift storage | -| `prometheus/client_golang` | Metrics | Application + HTTP middleware metrics | -| `redis/go-redis/v9` | Redis client | Rate limiting, token caching | -| `gofrs/uuid/v5` | UUID generation | NOT google/uuid, NOT satori/uuid | -| `golang-jwt/jwt/v5` | JWT tokens | Auth token handling | -| `alicebob/miniredis/v2` | Testing only | In-memory Redis for tests | - -### Related Libraries - -| Library | Purpose | Key Pattern | -|---------|---------|-------------| -| `majewsky/gg/option` | `Option[T]` (45 files) | `Some(v)`, `None[T]()`, dot-import ONLY for this | -| `majewsky/schwift/v2` | Swift storage client | OpenStack storage driver only | - -### FORBIDDEN Libraries - -| Library | Reason | Use Instead | 
-|---------|--------|-------------| -| `testify` (assert/require/mock) | SAP CC has own testing framework | `go-bits/assert` + `go-bits/must` | -| `zap` / `zerolog` / `slog` / `logrus` | SAP CC standardized on simple logging | `go-bits/logg` | -| `gin` / `echo` / `fiber` | SAP CC uses stdlib + gorilla/mux | `go-bits/httpapi` + `gorilla/mux` | -| `gorm` / `sqlx` / `ent` | Lightweight ORM preference | `go-gorp/gorp/v3` + `go-bits/sqlext` | -| `viper` | No config files; env-var-only config | `go-bits/osext` + `os.Getenv` | -| `google/uuid` / `satori/uuid` | Different UUID library chosen | `gofrs/uuid/v5` | -| `gomock` / `mockery` | Manual test double implementations | Hand-written doubles via driver interfaces | -| `ioutil.*` | Deprecated since Go 1.16 | `os` and `io` packages | -| `http.DefaultServeMux` | Global mutable state | `http.NewServeMux()` | -| `gopkg.in/square/go-jose.v2` | Archived, has CVEs | `gopkg.in/go-jose/go-jose.v2` | - -See [references/library-reference.md](references/library-reference.md) for the complete table with versions and usage counts. - -### Import Grouping Convention +### Phase 3: BUILD AND LINT -Three groups, separated by blank lines. Enforced by `goimports -local github.com/sapcc/keppel`: +Run build tooling and lint checks before submitting. All build config is generated from `Makefile.maker.yaml` -- do NOT edit generated files directly because `go-makefile-maker` will overwrite them. -```go -import ( - // Group 1: Standard library - "context" - "encoding/json" - "fmt" - "net/http" - - // Group 2: External (includes sapcc/go-bits, NOT local project) - "github.com/gorilla/mux" - . 
"github.com/majewsky/gg/option" // ONLY dot-import allowed - "github.com/sapcc/go-bits/httpapi" - "github.com/sapcc/go-bits/logg" - - // Group 3: Local project - "github.com/sapcc/keppel/internal/keppel" - "github.com/sapcc/keppel/internal/models" -) -``` - -**Dot-import whitelist** (only these 3 packages): -- `github.com/majewsky/gg/option` -- `github.com/onsi/ginkgo/v2` -- `github.com/onsi/gomega` - ---- - -## 9. Build and CI Rules - -### go-makefile-maker - -All build config is generated from `Makefile.maker.yaml`. Do NOT edit these files directly: +**Generated files (do NOT edit)**: - `Makefile` - `.golangci.yaml` - `REUSE.toml` - `.typos.toml` - GitHub Actions workflows -### License Headers (REQUIRED on every .go file) +**License headers (REQUIRED on every .go file)** -- missing headers fail the REUSE lint check: ```go // SPDX-FileCopyrightText: 2024 SAP SE or an SAP affiliate company // SPDX-License-Identifier: Apache-2.0 ``` -### golangci-lint v2 Configuration - -35+ enabled linters. Key settings: +**golangci-lint v2 configuration** -- 35+ enabled linters. Key settings: | Setting | Value | Rationale | |---------|-------|-----------| @@ -609,9 +528,7 @@ All build config is generated from `Makefile.maker.yaml`. Do NOT edit these file - `(net/http.ResponseWriter).Write` - `(*github.com/spf13/cobra.Command).Help` -### revive Linter (REQUIRED — check on every sapcc repo) - -As of 2026-03-20, go-makefile-maker supports `revive` as an optional linter. **All sapcc repos should enable it.** When working in an sapcc repo, check `Makefile.maker.yaml` for `reviveRules`. If absent, recommend adding: +**revive linter (REQUIRED -- check on every sapcc repo)**: As of 2026-03-20, go-makefile-maker supports `revive` as an optional linter. When working in an sapcc repo, check `Makefile.maker.yaml` for `reviveRules`. 
If absent, recommend adding: ```yaml golangciLint: @@ -622,13 +539,9 @@ golangciLint: - disableChecksOnConstants ``` -This catches: -- Exported functions/types/methods without doc comments -- Private receivers on exported methods - -After adding, run `go-makefile-maker` to regenerate `.golangci.yaml`, then `make run-golangci-lint` to verify. +This catches exported functions/types/methods without doc comments and private receivers on exported methods. After adding, run `go-makefile-maker` to regenerate `.golangci.yaml`, then `make run-golangci-lint` to verify. -### Build Commands +**Build commands**: ```bash make build-all # Build binary @@ -639,19 +552,82 @@ make goimports # Format imports make vendor # go mod tidy + vendor + verify ``` ---- +**Gate**: `make check` passes. Proceed only when gate passes. + +### Phase 4: VALIDATE + +Run sapcc-specific checks that no linter covers. These detect convention violations that only surface during code review. + +**Available scripts** -- all support `--help`, `--json`, `--limit`, and meaningful exit codes (0 = clean, 1 = violations, 2 = error): + +| Script | What It Checks | +|--------|---------------| +| `scripts/check-sapcc-identify-endpoint.sh` | HTTP handlers missing `httpapi.IdentifyEndpoint` call | +| `scripts/check-sapcc-auth-ordering.sh` | Data access before authentication in handlers | +| `scripts/check-sapcc-json-strict.sh` | `json.NewDecoder` without `DisallowUnknownFields()` | +| `scripts/check-sapcc-time-now.sh` | Direct `time.Now()` in testable code (inject clock instead) | +| `scripts/check-sapcc-httptest.sh` | `httptest.NewRecorder` instead of `assert.HTTPRequest` | +| `scripts/check-sapcc-todo-format.sh` | Bare TODO comments without context/links | + +These scripts only apply to sapcc repos (detected by `github.com/sapcc/go-bits` in go.mod). + +**Gate**: All scripts exit 0. Proceed only when gate passes. + +## Review Standards + +Two complementary review styles govern sapcc code review. 
+ +**Lead review** is directive -- statements, not suggestions. Top concerns: simplicity, API design, error handling. See [references/review-standards-lead.md](references/review-standards-lead.md) for all 21 PR comments with full context. + +| Rule | Summary | +|------|---------| +| Trust the stdlib | Don't wrap errors that `strconv`, constructors, etc. already describe well | +| Use Cobra subcommands | Never manually roll argument dispatch that Cobra handles | +| CLI names: specific + extensible | `keppel test-driver storage`, not `keppel test` | +| Marshal structured data for errors | If you have a `map[string]any`, `json.Marshal` it instead of manually formatting fields | +| Tests must verify behavior | Never silently remove test assertions during refactoring | +| Explain test workarounds | Add comments when test setup diverges from production patterns | +| Use existing error utilities | Use `errext.ErrorSet` and `.Join()`, not manual string concatenation | +| TODOs need context | Include what, a starting point link, and why not done now | +| Documentation stays qualified | When behavior changes conditionally, update docs to state the conditions | +| Understand value semantics | Value receiver copies the struct, but reference-type fields share data | +| Variable names don't mislead | Don't name script vars as if the application reads them | -## 10. 
go-bits Design Philosophy +How lead review works: +- Reads Copilot suggestions critically -- agrees with principle, proposes simpler alternatives +- Dismisses inconsistent AI complaints -- if tool flags X but not equivalent Y, the concern is invalid +- Thinks about forward compatibility -- command names and API shapes evaluated for extensibility +- Values brevity when stdlib provides clarity -- removes wrappers that duplicate error info +- Approves simple PRs quickly -- doesn't manufacture concerns +- Corrects misconceptions directly -- states correct behavior without softening +- Pushes fixes directly -- sometimes pushes commits to address review concerns directly -The go-bits library design rules that govern all of `sapcc/go-bits`. Understanding these rules helps predict what code will pass review. +**Secondary review** is inquisitive -- questions where lead review makes statements. Top concerns: configuration safety, migration paths, test completeness. See [references/review-standards-secondary.md](references/review-standards-secondary.md) for full details. -### Rule 1: One Package = One Concept -`must` = fatal errors. `logg` = logging. `respondwith` = HTTP responses. No package does two things. +| Rule | Summary | +|------|---------| +| Error messages must be actionable | "Internal Server Error" is unacceptable when the cause is knowable | +| Know the spec, deviate pragmatically | Reference RFCs, but deviate when spec is impractical | +| Guard against panics with clear errors | Check nil/empty before indexing, use `fmt.Errorf("invalid: %q", val)` | +| Strict configuration parsing | Use `DisallowUnknownFields()` on JSON decoders for config | +| Test ALL combinations | When changing logic with multiple inputs, test every meaningful combination | +| Eliminate redundant code | Ask "This check is now redundant?" 
when code is refactored | +| Comments explain WHY | When something non-obvious is added, request an explanatory comment | +| Domain knowledge over theory | Dismiss concerns that don't apply to actual domain constraints | +| Smallest possible fix | 2-line PRs are fine. Don't bundle unrelated changes | +| Respect ownership hierarchy | "LGTM but lets wait for lead review, we are in no hurry here" | +| Be honest about mistakes | Acknowledge errors quickly and propose fix direction | +| Validate migration paths | "Do we somehow check if this is still set and then abort?" | -### Rule 2: Minimal API Surface -`must` has 4 functions. `logg` has 5. `syncext` has 1 type with 3 methods. Fewer, more general functions beat many specific ones. +## go-bits Design Philosophy -### Rule 3: Names That Read as English +The go-bits library design rules govern all of `sapcc/go-bits`. Understanding these rules helps predict what code will pass review. See [references/go-bits-philosophy-detailed.md](references/go-bits-philosophy-detailed.md) for extended discussion. + +**Rule 1: One Package = One Concept** -- `must` = fatal errors. `logg` = logging. `respondwith` = HTTP responses. No package does two things. + +**Rule 2: Minimal API Surface** -- `must` has 4 functions. `logg` has 5. `syncext` has 1 type with 3 methods. Fewer, more general functions beat many specific ones. + +**Rule 3: Names That Read as English**: ```go must.Succeed(err) // "must succeed" must.Return(os.ReadFile(f)) // "must return" @@ -660,43 +636,44 @@ logg.Fatal(msg) // "log fatal" errext.As[T](err) // "error extension: as T" ``` -### Rule 4: Document the WHY, Not Just the WHAT -Extensive comments explaining design constraints and rejected alternatives. `must.ReturnT` has three paragraphs explaining why the signature is the only one that works given Go generics limitations. +**Rule 4: Document the WHY, Not Just the WHAT** -- extensive comments explaining design constraints and rejected alternatives. 
`must.ReturnT` has three paragraphs explaining why the signature is the only one that works given Go generics limitations. -### Rule 5: Panics for Programming Errors, Errors for Runtime Failures +**Rule 5: Panics for Programming Errors, Errors for Runtime Failures**: - **Panic**: nil factory in `pluggable.Add`, calling API outside `Compose`, mixing incompatible options - **Error return**: missing env var, failed SQL query, JSON marshal failure - **Fatal (os.Exit)**: `must.Succeed` for genuinely unrecoverable startup errors -### Rule 6: Concrete Before/After Examples in Docs -Every function's godoc shows the exact code it replaces. +**Rule 6: Concrete Before/After Examples in Docs** -- every function's godoc shows the exact code it replaces. -### Rule 7: Enforce Correct Usage Through Type System -`jobloop.Setup()` returns a private type wrapping the struct, enforcing that Setup was called. +**Rule 7: Enforce Correct Usage Through Type System** -- `jobloop.Setup()` returns a private type wrapping the struct, enforcing that Setup was called. -### Rule 8: Dependency Consciousness -Actively prevents unnecessary dependency trees. Importing UUID from `audittools` into `respondwith` was rejected because it would pull in AMQP dependencies. Solution: move to internal package. +**Rule 8: Dependency Consciousness** -- actively prevents unnecessary dependency trees. Importing UUID from `audittools` into `respondwith` was rejected because it would pull in AMQP dependencies. Solution: move to internal package. -### Rule 9: Prefer Functions Over Global Variables +**Rule 9: Prefer Functions Over Global Variables**: > "I don't like having a global variable for this that callers can mess with." Use `ForeachOptionTypeInLIQUID[T any](action func(any) T) []T` instead of `var LiquidOptionTypes = []any{...}`. 
-### Rule 10: Leverage Go Generics Judiciously -Use generics where they eliminate boilerplate or improve type safety: -- `must.Return[V]` preserves return type -- `errext.As[T]` eliminates pointer-to-pointer pattern -- `pluggable.Registry[T Plugin]` constrains plugin types +**Rule 10: Leverage Go Generics Judiciously** -- use generics where they eliminate boilerplate or improve type safety (`must.Return[V]`, `errext.As[T]`, `pluggable.Registry[T Plugin]`). Do NOT use generics where they add complexity without clear benefit. -Do NOT use generics where they add complexity without clear benefit. +**Rule 11: Graceful Deprecation** -- `assert.HTTPRequest` is deprecated but not removed. The deprecation notice includes a complete migration guide. No forced migration. -### Rule 11: Graceful Deprecation -`assert.HTTPRequest` is deprecated but not removed. The deprecation notice includes a complete migration guide. No forced migration. +**Rule 12: Defense in Depth with Documentation** -- handle theoretically impossible cases with branches that behave the same, and document the invariant reasoning. -### Rule 12: Defense in Depth with Documentation -Handle theoretically impossible cases with branches that behave the same, and document the invariant reasoning. 
+## SAP CC Rationalization Traps ---- +These are reasoning patterns that sound correct but lead to rejected PRs in sapcc repos: + +| Rationalization | Why It's Wrong | Required Action | +|---|---|---| +| "Tests pass, the error wrapping is fine" | Lead review checks error message quality, not just pass/fail | Verify error context matches project standards | +| "Copilot suggested this approach" | Lead review frequently rejects Copilot suggestions | Evaluate on merit, simplify where possible | +| "I need a struct for this JSON" | One-off JSON can use `fmt.Sprintf` + `json.Marshal` | Only create types if reused or complex | +| "Better safe than sorry" (re: error handling) | "Irrelevant contrivance" — over-handling is an anti-pattern | Ask "concrete scenario where this fails?" | +| "Standard library X works fine here" | SAP CC has go-bits equivalents that are expected | Use go-bits equivalents | +| "testify is the Go standard" | SAP CC uses go-bits/assert exclusively | Never introduce testify in sapcc repos | +| "I'll add comprehensive error wrapping" | Trust well-designed functions' error messages | Check if called function already provides context | +| "This needs a config file" | SAP CC uses env vars only | Use `osext.MustGetenv` / `GetenvOrDefault` / `GetenvBool` | ## Error Handling @@ -706,7 +683,7 @@ Handle theoretically impossible cases with branches that behave the same, and do ### Error: "Linter reports forbidden import" **Cause**: Using a FORBIDDEN library (testify, zap, gin, etc.) -**Solution**: Replace with the SAP CC equivalent. See the FORBIDDEN table in Section 8. +**Solution**: Replace with the SAP CC equivalent. See the FORBIDDEN Libraries table in Phase 2b. 
### Error: "Missing SPDX license header" **Cause**: `.go` file missing the required two-line SPDX header @@ -720,89 +697,14 @@ Handle theoretically impossible cases with branches that behave the same, and do **Cause**: Mixing assertion libraries **Solution**: Replace `assert.Equal(t, expected, actual)` (testify) with `assert.DeepEqual(t, "desc", actual, expected)` (go-bits). Note the parameter order difference. ---- - -## Anti-Patterns - -See [references/anti-patterns.md](references/anti-patterns.md) for the full catalog with BAD/GOOD examples. - -### AP-1: Creating Types for One-Off JSON Marshaling -**What it looks like**: Struct types with json tags used once for `json.Marshal` -**Why wrong**: This is considered "overengineered" by project convention -**Do instead**: `fmt.Sprintf` with `must.Return(json.Marshal(dynamicPart))` - -### AP-2: Wrapping Errors That Already Have Context -**What it looks like**: `fmt.Errorf("parse error: %w", strconv.ParseUint(...))` -**Why wrong**: strconv already includes function name, input, and error type -**Do instead**: `must.Return(strconv.ParseUint(s, 10, 32))` - -### AP-3: Manual Argument Dispatch Instead of Cobra -**What it looks like**: Switch statement on `args[0]` to dispatch to code paths -**Why wrong**: Cobra subcommands handle this with better UX -**Do instead**: Change argument order if needed to allow Cobra subcommands - -### AP-4: Using must.Return in Request Handlers -**What it looks like**: `val := must.Return(someOperation())` inside an HTTP handler -**Why wrong**: `must.Return` calls `os.Exit(1)` on error, crashing the server -**Do instead**: Return errors properly; `must.*` is for startup code and tests only - -### AP-5: Global Mutable Variables for Configuration -**What it looks like**: `var Config = map[string]string{...}` at package level -**Why wrong**: Callers can modify the map, creating inconsistent state -**Do instead**: Functions that produce values: `func GetConfig() map[string]string` - ---- - -## 
Available Scripts - -Deterministic checks for sapcc-specific patterns that no linter covers. Run these during code review or as part of quality gates. All support `--help`, `--json`, `--limit`, and meaningful exit codes (0 = clean, 1 = violations, 2 = error). - -| Script | What It Checks | -|--------|---------------| -| `scripts/check-sapcc-identify-endpoint.sh` | HTTP handlers missing `httpapi.IdentifyEndpoint` call | -| `scripts/check-sapcc-auth-ordering.sh` | Data access before authentication in handlers | -| `scripts/check-sapcc-json-strict.sh` | `json.NewDecoder` without `DisallowUnknownFields()` | -| `scripts/check-sapcc-time-now.sh` | Direct `time.Now()` in testable code (inject clock instead) | -| `scripts/check-sapcc-httptest.sh` | `httptest.NewRecorder` instead of `assert.HTTPRequest` | -| `scripts/check-sapcc-todo-format.sh` | Bare TODO comments without context/links | - -These scripts only apply to sapcc repos (detected by `github.com/sapcc/go-bits` in go.mod). - ---- - -## Anti-Rationalization - -### SAP CC Domain-Specific Rationalizations - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Tests pass, the error wrapping is fine" | Lead review checks error message quality in review | Verify error context matches project standards | -| "Copilot suggested this approach" | Lead review frequently rejects Copilot suggestions as overengineered | Evaluate on merit, simplify where possible | -| "I need a struct for this JSON" | One-off JSON can be `fmt.Sprintf` + `json.Marshal` | Only create types if reused or complex | -| "Better safe than sorry" (re: error handling) | "Irrelevant contrivance" -- handle only practical concerns | Ask "concrete scenario where this fails?" 
| -| "Standard library X works fine here" | SAP CC has go-bits equivalents that are expected | Use go-bits (logg, assert, must, osext, respondwith) | -| "testify is the Go standard" | SAP CC projects use go-bits/assert exclusively | Never introduce testify in sapcc repos | -| "I'll add comprehensive error wrapping" | Trust well-designed functions' error messages | Check if the called function already provides context | -| "This needs a config file" | SAP CC uses env vars only | Use osext.MustGetenv, GetenvOrDefault, GetenvBool | - ---- - -## References (MUST READ) - -**NON-NEGOTIABLE**: Before working on ANY sapcc Go code, you MUST read these reference files. Do NOT skip them. Do NOT rely on your training data for sapcc conventions — read the actual references. These contain the real rules from actual PR reviews. - -**Load order** (read in this sequence): -1. **sapcc-code-patterns.md** — Read FIRST. This is the definitive reference with actual code patterns -2. **library-reference.md** — Read SECOND. Know which libraries are approved/forbidden before writing imports -3. **architecture-patterns.md** — Read THIRD if working on architecture, HTTP handlers, or DB access -4. 
Then load others as needed for the specific task +## References | File | What It Contains | When to Read | |------|-----------------|--------------| -| [references/sapcc-code-patterns.md](${CLAUDE_SKILL_DIR}/references/sapcc-code-patterns.md) | **Actual code patterns** — function signatures, constructors, interfaces, HTTP handlers, error handling, DB access, testing, package organization | **ALWAYS** — this is the primary reference | -| [references/library-reference.md](${CLAUDE_SKILL_DIR}/references/library-reference.md) | Complete library table: 30 approved, 10+ forbidden, with versions and usage counts | **ALWAYS** — need to know approved/forbidden imports | +| [references/sapcc-code-patterns.md](${CLAUDE_SKILL_DIR}/references/sapcc-code-patterns.md) | **Actual code patterns** -- function signatures, constructors, interfaces, HTTP handlers, error handling, DB access, testing, package organization | **ALWAYS** -- this is the primary reference | +| [references/library-reference.md](${CLAUDE_SKILL_DIR}/references/library-reference.md) | Complete library table: 30 approved, 10+ forbidden, with versions and usage counts | **ALWAYS** -- need to know approved/forbidden imports | | [references/architecture-patterns.md](${CLAUDE_SKILL_DIR}/references/architecture-patterns.md) | Full 102-rule architecture specification with code examples | When working on architecture, handlers, DB access | | [references/review-standards-lead.md](${CLAUDE_SKILL_DIR}/references/review-standards-lead.md) | All 21 lead review comments with full context and quotes | For reviews and understanding lead review reasoning | | [references/review-standards-secondary.md](${CLAUDE_SKILL_DIR}/references/review-standards-secondary.md) | All 15 secondary review comments with PR context | For reviews and understanding secondary review patterns | | [references/anti-patterns.md](${CLAUDE_SKILL_DIR}/references/anti-patterns.md) | 20+ SAP CC anti-patterns with BAD/GOOD code examples | For code review and 
avoiding common mistakes | -| [references/extended-patterns.md](${CLAUDE_SKILL_DIR}/references/extended-patterns.md) | **Extended patterns from related repos** — security micro-patterns, visual section separators, copyright format, K8s namespace isolation, PR hygiene (sort lists, clean orphans, document alongside), changelog format. Pipeline-generated. | For security-conscious code, K8s helm work, or PR hygiene | +| [references/extended-patterns.md](${CLAUDE_SKILL_DIR}/references/extended-patterns.md) | Extended patterns from related repos -- security micro-patterns, visual section separators, copyright format, K8s namespace isolation, PR hygiene | For security-conscious code, K8s helm work, or PR hygiene | diff --git a/skills/go-testing/SKILL.md b/skills/go-testing/SKILL.md index 4a1b9eb8..b6b4f0b9 100644 --- a/skills/go-testing/SKILL.md +++ b/skills/go-testing/SKILL.md @@ -42,57 +42,10 @@ routing: # Go Testing Skill -## Operator Context - -This skill operates as an operator for Go testing workflows, configuring Claude's behavior for idiomatic, thorough Go test development. It implements the **Pattern Library** architectural pattern -- applying canonical Go testing patterns (table-driven, subtests, helpers, mocking) with **Domain Intelligence** from the Go standard library's testing conventions. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before writing tests -- **Over-Engineering Prevention**: Write only the tests needed. No speculative coverage, no "while I'm here" test additions -- **Table-Driven Tests Required**: Multiple related cases MUST use table-driven pattern with `t.Run` -- **t.Helper() Required**: Every test helper function MUST call `t.Helper()` as its first line -- **Show Test Output**: Always show actual `go test` output. 
Never summarize as "tests pass" -- **Race Detector**: Run `go test -race` when testing concurrent code -- **Black-Box Testing**: Prefer `package_test` (external test package) over `package` (internal) -- **Test Name Accuracy**: Test names must accurately describe the code path exercised. If a test simulates behavior rather than exercising the production code path, the name must reflect this (e.g., `TestConsumeLoop_CanBeReenteredAfterFailure` not `TestRestartLoop_RestartsAfterTransientFailure` when the test calls `consumeLoop` directly instead of `Start()`). "Pragmatic approximation" is not a valid reason for a misleading test name. - -### Default Behaviors (ON unless disabled) -- **Parallel Execution**: Use `t.Parallel()` for independent tests by default -- **t.Cleanup Over defer**: Prefer `t.Cleanup()` for resource management in test helpers -- **t.Context()**: Use `t.Context()` (Go 1.24+) for context-aware tests -- **b.Loop()**: Use `b.Loop()` (Go 1.24+) instead of `for i := 0; i < b.N; i++` for benchmarks -- **gopls Diagnostics**: After editing test files, use `go_diagnostics` to catch errors before running tests -- **Error Path Testing**: Test error conditions, not just happy paths -- **Coverage Check**: Run `-coverprofile` and verify critical paths have >80% coverage -- **Cleanup Verification**: Each test must clean up after itself (no test pollution) - -### Optional Behaviors (OFF unless enabled) -- **synctest Usage**: Use `testing/synctest` (Go 1.25+) for deterministic concurrency testing -- **Benchmark Comparison**: Use `benchstat` for before/after performance comparisons -- **Coverage HTML Report**: Generate and open HTML coverage visualization -- **Interface Deduplication**: Test multiple interface implementations with shared test functions - -## Available Scripts - -- **`scripts/gen-table-test.sh`** — Scaffold a table-driven test file for a Go function. Run `bash scripts/gen-table-test.sh --help` for options. 
-- **`scripts/bench-compare.sh`** — Run Go benchmarks with optional benchstat comparison. Run `bash scripts/bench-compare.sh --help` for options. - -## What This Skill CAN Do -- Write idiomatic table-driven tests with `t.Run` subtests -- Create test helpers with proper `t.Helper()` marking -- Build manual mock implementations for interfaces -- Write benchmarks using modern `b.Loop()` pattern -- Set up parallel tests with proper variable capture -- Guide race detection and concurrent test patterns - -## What This Skill CANNOT Do -- Debug failing tests (use `systematic-debugging` instead) -- Write non-Go tests (use `test-driven-development` instead) -- Perform general Go development (use `golang-general-engineer` directly) -- Generate code from mocking frameworks (manual mocks preferred in Go) -- Optimize performance without test focus (use performance profiling tools) - ---- +Go testing follows a 4-phase workflow: understand what needs testing, write +idiomatic tests, run and verify, review quality. Every function with multiple +test cases uses table-driven pattern. Every helper calls t.Helper(). Every +concurrent test runs with -race. ## Instructions @@ -105,6 +58,9 @@ This skill operates as an operator for Go testing workflows, configuring Claude' - Is this a new test, modification, or coverage gap? - Are there existing tests to follow as patterns? +Read and follow repository CLAUDE.md before writing tests — project conventions +override these defaults. 
+ **Step 2: Choose test type** | Need | Test Type | Pattern | @@ -118,7 +74,7 @@ This skill operates as an operator for Go testing workflows, configuring Claude' **Step 3: Verify test file structure** ```go -package mypackage_test // Black-box testing (preferred) +package mypackage_test // Black-box testing (preferred over internal) import ( "testing" @@ -128,6 +84,10 @@ import ( // Order: Unit tests, Integration tests, Benchmarks, Examples ``` +Black-box testing (`package_test`) is preferred because it tests the public API the +way consumers use it. Internal testing (`package`) is acceptable only when testing +unexported behavior that can't be reached through the public API. + **Gate**: Test scope, type, and file location identified. Proceed only when gate passes. ### Phase 2: WRITE Tests @@ -136,7 +96,9 @@ import ( **Step 1: Table-driven tests for multiple cases** -Every function with more than one test case MUST use table-driven pattern: +Multiple related cases MUST use table-driven pattern — this is the canonical Go +testing idiom because it makes adding cases trivial and the input/output +relationship explicit: ```go func TestParseConfig(t *testing.T) { @@ -173,8 +135,15 @@ func TestParseConfig(t *testing.T) { } ``` +Write only the tests needed — no speculative coverage, no "while I'm here" +additions. Each test should exercise a specific behavior, not pad a coverage number. 
+ **Step 2: Test helpers with t.Helper()** +Every test helper MUST call `t.Helper()` as its first line — without it, test +failure messages point to the helper's line instead of the caller's, which makes +debugging failures slow and frustrating: + ```go func assertEqual[T comparable](t *testing.T, got, want T) { t.Helper() // MUST be first line @@ -205,6 +174,9 @@ See `references/go-test-patterns.md` for complete mock patterns with call tracki **Step 4: Parallel tests** +Use `t.Parallel()` for independent tests by default — it catches unintended +shared state and runs faster: + ```go for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { @@ -214,6 +186,11 @@ for _, tt := range tests { } ``` +**Step 5: Test error paths too** + +Test error conditions, not just happy paths. If a function can return an error, +write at least one test case that triggers that error and verifies the message. + **Gate**: Tests follow table-driven pattern, helpers use `t.Helper()`, mocks use function fields. Proceed only when gate passes. ### Phase 3: RUN and Verify @@ -226,7 +203,8 @@ for _, tt := range tests { # Standard run with verbose output go test -v ./path/to/package/... -# With race detector (REQUIRED for concurrent code) +# With race detector (REQUIRED for concurrent code — race conditions +# are silent until production; the -race flag catches them deterministically) go test -race -v ./path/to/package/... # With coverage @@ -234,12 +212,18 @@ go test -coverprofile=coverage.out ./path/to/package/... go tool cover -func=coverage.out ``` +Always show actual `go test` output. Never summarize as "tests pass" — the user +needs to see what ran, what passed, and what the output looks like. 
+ **Step 2: Verify results** - All tests pass (show actual output) - No race conditions detected - Critical paths have >80% coverage - Error paths are exercised +After editing test files, use `go_diagnostics` (gopls MCP) to catch errors +before running tests — faster feedback than a full `go test` cycle. + **Step 3: Run full suite** ```bash @@ -264,13 +248,19 @@ Verify no regressions in other packages. - [ ] No test interdependencies? - [ ] Race detector passes? +Test names must accurately describe the code path exercised. If a test simulates +behavior rather than exercising the production code path, the name must reflect +this — "pragmatic approximation" is not a valid reason for a misleading test name. + **Gate**: All checklist items satisfied. Tests are complete. --- ## Benchmark Guide -**Use `b.Loop()` (Go 1.24+) for all new benchmarks.** It prevents dead code elimination, manages timers automatically, and produces more accurate results. +Use `b.Loop()` (Go 1.24+) for all new benchmarks — it prevents dead code +elimination, manages timers automatically, and produces more accurate results +than the manual `for i := 0; i < b.N; i++` loop: ```go func BenchmarkProcess(b *testing.B) { @@ -348,52 +338,14 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Separate Functions for Related Cases -**What it looks like**: `TestParseValid`, `TestParseInvalid`, `TestParseEmpty` as separate functions -**Why wrong**: Duplicates setup, obscures the input-output relationship, harder to add cases -**Do instead**: One `TestParse` with table-driven cases and `t.Run` subtests - -### Anti-Pattern 2: Missing t.Helper() -**What it looks like**: Test helper reports errors at the helper's line, not the caller's -**Why wrong**: Makes debugging test failures slow because error location is misleading -**Do instead**: Add `t.Helper()` as the first line of every test helper function - -### Anti-Pattern 3: Testing Implementation Instead of Behavior -**What it looks like**: Asserting 
internal method calls, field values, or execution order -**Why wrong**: Breaks on every refactor even when behavior is unchanged -**Do instead**: Test observable behavior (return values, side effects, state changes) - -### Anti-Pattern 4: Test Pollution via Shared State -**What it looks like**: Package-level variables modified by tests, test order matters -**Why wrong**: Tests become flaky, pass individually but fail together -**Do instead**: Create fresh state in each test. Use `t.Cleanup()` for teardown. +## Available Scripts -### Anti-Pattern 5: Hardcoded File Paths -**What it looks like**: `os.ReadFile("testdata/input.json")` without considering working directory -**Why wrong**: Breaks when test runs from different directory or in CI -**Do instead**: Use `t.TempDir()` for generated files, `os.Getwd()` + relative path for testdata +- **`scripts/gen-table-test.sh`** — Scaffold a table-driven test file for a Go function +- **`scripts/bench-compare.sh`** — Run Go benchmarks with optional benchstat comparison --- ## References -This skill uses these shared patterns: -- [Anti-Rationalization Core](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Anti-Rationalization Testing](../shared-patterns/anti-rationalization-testing.md) - Testing-specific rationalization prevention -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "One test case, no need for table-driven" | Will grow to multiple cases | Set up table-driven from the start | -| "t.Helper() is just cosmetic" | Wrong error location wastes debug time | Always add t.Helper() | -| "Tests pass, no need for -race" | Race conditions are silent until production | Run with -race for concurrent code | -| "Coverage is 80%, good enough" | What's in the uncovered 20%? 
| Check that critical paths are covered | -| "Mock is too complex to build" | Complex ≠ optional | Build the mock, track calls | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/go-test-patterns.md`: Full examples for table-driven tests, helpers, mocking, interface deduplication - `${CLAUDE_SKILL_DIR}/references/go-benchmark-and-concurrency.md`: b.Loop() benchmarks, benchstat, synctest, race detection patterns diff --git a/skills/headless-cron-creator/SKILL.md b/skills/headless-cron-creator/SKILL.md index f9ed3c55..bd744ae1 100644 --- a/skills/headless-cron-creator/SKILL.md +++ b/skills/headless-cron-creator/SKILL.md @@ -27,69 +27,26 @@ routing: # Headless Cron Creator Skill -## Operator Context - -This skill operates as an operator for creating headless Claude Code cron jobs, -configuring Claude's behavior for safe, templated cron job generation. It -implements the **Scaffold → Validate → Install** pattern with deterministic -Python scripts handling all crontab mutations. - -### Hardcoded Behaviors (Always Apply) -- **Never pipe to `crontab -`** — all crontab mutations go through `scripts/crontab-manager.py` -- **Backup before mutate** — every crontab change creates a timestamped backup in `~/.claude/crontab-backups/` -- **Dry-run default** — generated scripts do nothing destructive without `--execute` -- **Budget cap** — every generated script has `--max-budget-usd` (default $2.00) -- **Lockfile** — every generated script uses `flock` to prevent concurrent runs -- **No `--bare`** — breaks OAuth/keychain auth -- **No `--dangerously-skip-permissions`** — `--permission-mode auto` is sufficient -- **No `CronCreate` tool** — session-scoped, not persistent -- **Off-minute scheduling** — never use `:00` or `:30` minutes; spread load with odd minutes (7, 23, 47) -- **Tag all entries** — every crontab entry gets a `# claude-cron: ` marker for safe identification -- **Full absolute paths in crontab** — cron has minimal PATH; all commands use absolute paths - -### 
Default Behaviors (ON unless disabled) -- **Auto-validate with cron-job-auditor** — run the auditor on every generated script -- **Show next-steps** — print the exact commands to test and install -- **Heredoc prompts** — embed prompt text via bash heredoc to avoid escaping issues -- **Per-run log files** — `tee` output to timestamped log files - -### Optional Behaviors (OFF unless enabled) -- **Auto-install** — install the crontab entry (requires explicit user confirmation) -- **Custom allowed-tools** — override the default `Bash Read` tool set -- **Custom budget** — override the default $2.00 per-run budget - -## What This Skill CAN Do -- Generate wrapper scripts from task descriptions -- Install/remove/verify crontab entries safely -- List all Claude Code cron jobs -- Validate generated scripts against cron best practices - -## What This Skill CANNOT Do -- Execute the generated cron jobs (that's cron's job) -- Modify existing wrapper scripts (regenerate with `--force` instead) -- Manage non-Claude crontab entries (only touches `# claude-cron:` tagged entries) -- Install crontab entries without user confirmation - ---- +Generate headless Claude Code cron jobs from a task description and schedule. Creates a wrapper script with safety mechanisms (lockfile, budget cap, dry-run default, logging) and installs crontab entries. All crontab mutations go through `scripts/crontab-manager.py`, which writes to temp files and creates timestamped backups in `~/.claude/crontab-backups/` before every change -- never pipe directly to `crontab -` because a mid-stream pipe failure wipes the entire crontab. ## Instructions ### Phase 1: PARSE -**Goal**: Extract job parameters from the user's request. +Extract job parameters from the user's request. 
**Required parameters**: -- **name** — short kebab-case identifier (e.g., `reddit-automod`, `feed-health-check`) -- **prompt** — what Claude should do each run (natural language) -- **schedule** — cron expression or human-readable interval +- **name** -- short kebab-case identifier (e.g., `reddit-automod`, `feed-health-check`) +- **prompt** -- what Claude should do each run (natural language) +- **schedule** -- cron expression or human-readable interval **Optional parameters** (with defaults): -- **workdir** — where to `cd` before running (default: current repo root) -- **budget** — max USD per run (default: `2.00`) -- **allowed-tools** — which tools the headless session can use (default: `Bash Read`) -- **logdir** — where to store logs (default: `{workdir}/cron-logs/{name}`) +- **workdir** -- where to `cd` before running (default: current repo root) +- **budget** -- max USD per run (default: `2.00`; user may override) +- **allowed-tools** -- which tools the headless session can use (default: `Bash Read`; user may override) +- **logdir** -- where to store logs (default: `{workdir}/cron-logs/{name}`) -**Human-readable schedule conversion**: +**Human-readable schedule conversion** -- use off-minutes (7, 23, 47) instead of `:00`/`:30` because every cron job on the system fires at round minutes, creating load spikes: | Human Input | Cron Expression | |-------------|----------------| @@ -100,13 +57,11 @@ Python scripts handling all crontab mutations. | weekly on sunday | `7 9 * * 0` | | every 30 minutes | `*/30 * * * *` | -Always use off-minutes (7, 23, 47) instead of :00/:30 to spread load. - **Gate**: All required parameters extracted. Proceed to Phase 2. ### Phase 2: GENERATE -**Goal**: Create the wrapper script using `crontab-manager.py generate-wrapper`. +Create the wrapper script using `crontab-manager.py generate-wrapper`. Embed the prompt via bash heredoc to avoid escaping issues. 
```bash python3 ~/.claude/scripts/crontab-manager.py generate-wrapper \ @@ -120,39 +75,41 @@ python3 ~/.claude/scripts/crontab-manager.py generate-wrapper \ Review the generated script. Verify it contains: - [ ] `set -euo pipefail` -- [ ] `flock` lockfile -- [ ] `--permission-mode auto` -- [ ] `--max-budget-usd` +- [ ] `flock` lockfile -- prevents concurrent runs of the same job +- [ ] `--permission-mode auto` -- never use `--dangerously-skip-permissions` (auto is sufficient) or `--bare` (breaks OAuth/keychain auth) +- [ ] `--max-budget-usd` -- caps spend per run (default $2.00) - [ ] `--no-session-persistence` - [ ] `--allowedTools` -- [ ] `tee` to per-run log file -- [ ] Dry-run/execute toggle +- [ ] `tee` to per-run timestamped log file +- [ ] Dry-run/execute toggle -- scripts do nothing destructive without `--execute` - [ ] Exit code propagation via `PIPESTATUS[0]` +Do not use the `CronCreate` tool -- it is session-scoped (dies when the session ends, auto-expires after 7 days). Use system `crontab` via `crontab-manager.py` instead. + **Gate**: Script generated and reviewed. Proceed to Phase 3. ### Phase 3: VALIDATE -**Goal**: Verify the generated script meets cron best practices. +Verify the generated script meets cron best practices. 1. Run the script in dry-run mode: ```bash bash -n scripts/{name}-cron.sh # syntax check ``` -2. Run `cron-job-auditor` checks manually: +2. Run `cron-job-auditor` checks: - [ ] Error handling (`set -e`) - [ ] Lock file (`flock`) - [ ] Logging (`tee`, `LOG_DIR`) - [ ] Working directory (absolute `cd`) - - [ ] PATH awareness (absolute `claude` path) + - [ ] PATH awareness (absolute path to `claude` -- cron has minimal PATH, so all commands must use absolute paths) - [ ] Cleanup on exit (lock release) **Gate**: All checks pass. Proceed to Phase 4. ### Phase 4: INSTALL -**Goal**: Install the crontab entry with user confirmation. +Install the crontab entry. 
Every entry gets a `# claude-cron: ` marker so `crontab-manager.py` can identify and manage only its own entries without touching non-Claude crontab lines. All paths in the crontab entry must be absolute because cron has minimal PATH. 1. Show the proposed entry: ```bash @@ -163,7 +120,7 @@ Review the generated script. Verify it contains: --dry-run ``` -2. **Ask the user for confirmation** before installing. +2. **Ask the user for confirmation** before installing. Never install without explicit approval. 3. If confirmed: ```bash @@ -182,7 +139,7 @@ Review the generated script. Verify it contains: ### Phase 5: REPORT -**Goal**: Summarize the created cron job. +Summarize the created cron job and print the exact commands to test and manage it. Output: - Script path @@ -197,7 +154,7 @@ Output: python3 ~/.claude/scripts/crontab-manager.py remove --tag {name} # uninstall ``` ---- +To modify an existing wrapper script, regenerate it with `--force` rather than editing in place. ## Error Handling @@ -214,21 +171,7 @@ If the path changes, regenerate the wrapper with `--force`. Cause: System crontab service issue. Solution: Check `crontab -l` manually. Restore from `~/.claude/crontab-backups/`. ---- - -## Anti-Patterns - -### Anti-Pattern 1: Piping to crontab -**What it looks like**: `crontab -l | { cat; echo "new entry"; } | crontab -` -**Why wrong**: If the pipe fails mid-stream, the entire crontab is wiped. -**Do instead**: Always use `crontab-manager.py` which writes to temp files. - -### Anti-Pattern 2: Using CronCreate tool -**What it looks like**: Calling the `CronCreate` tool to schedule jobs. -**Why wrong**: Session-scoped only — dies when session ends, auto-expires after 7 days. -**Do instead**: Use system `crontab` via `crontab-manager.py`. +## References -### Anti-Pattern 3: Round-number scheduling -**What it looks like**: `0 */6 * * *` or `30 * * * *` -**Why wrong**: Every cron job on the system fires at :00/:30, creating load spikes. 
-**Do instead**: Use off-minutes like `7`, `23`, `47`. +- `scripts/crontab-manager.py` -- all crontab mutations (add, remove, list, verify, generate-wrapper) +- `skills/cron-job-auditor/SKILL.md` -- validation checks for generated scripts diff --git a/skills/image-auditor/SKILL.md b/skills/image-auditor/SKILL.md index 0e90252a..f3f12bc4 100644 --- a/skills/image-auditor/SKILL.md +++ b/skills/image-auditor/SKILL.md @@ -27,52 +27,7 @@ routing: # Image Auditor Skill -## Operator Context - -This skill operates as an operator for image validation and optimization, configuring Claude's behavior for comprehensive image auditing. It implements a **Discovery-Validation** architectural pattern — scan references, validate existence, assess quality, report findings — with **Domain Intelligence** embedded in web performance and accessibility standards. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before auditing -- **Non-Destructive**: NEVER modify, resize, convert, or delete images without explicit user request -- **Complete Output**: Show all validation results with absolute file paths and line numbers -- **Reproduce First**: Verify every reported issue by reading the actual file/reference -- **Evidence Required**: Every FAIL or WARN must cite the file path, line number, and concrete evidence - -### Default Behaviors (ON unless disabled) -- **Full Audit**: Run all check categories (alt text, existence, size, format, unused) -- **Unused Image Detection**: Find images in static/ not referenced by any content -- **Size Thresholds**: Flag images >500KB for web optimization -- **Format Suggestions**: Recommend WebP conversion where beneficial -- **Page Weight Calculation**: Sum image sizes per post and flag heavy pages - -### Optional Behaviors (OFF unless enabled) -- **Deep Scan**: Include theme images and assets/ directory -- **Auto-Optimize**: Generate optimized versions (requires imagemagick, explicit consent only) 
-- **Strict Mode**: Treat all suggestions as blockers - -## What This Skill CAN Do -- Scan all content files for image references (Markdown, Hugo shortcodes, HTML) -- Verify all referenced images exist at resolved paths -- Analyze alt text quality (missing, generic, descriptive) -- Measure file sizes and flag oversized images against thresholds -- Detect format mismatches (photo as PNG, screenshot as JPEG) -- Find unused images not referenced by any content -- Calculate total page weight per post -- Generate optimization recommendations with estimated savings -- Report line numbers and absolute paths for all issues - -## What This Skill CANNOT Do -- Modify, resize, or convert images (destructive operations require explicit consent) -- Access external image URLs (CDN-hosted, remote images) -- Judge alt text semantic accuracy (only structural and heuristic checks) -- Delete unused images without user confirmation -- Skip any of the 4 phases - ---- - -## Instructions - -### Usage +Non-destructive 4-phase image validation pipeline: Discover, Validate, Analyze, Report. Read and follow the repository CLAUDE.md before starting any audit. ``` /image-audit # Audit entire site @@ -80,6 +35,12 @@ This skill operates as an operator for image validation and optimization, config /image-audit --post my-post.md # Audit single post ``` +By default, every audit runs all check categories (alt text, existence, size, format, unused images) and calculates per-post page weight. Optional modes: **Deep Scan** includes theme images and `assets/` directory. **Auto-Optimize** generates optimized versions (requires imagemagick and explicit user consent). **Strict Mode** treats all suggestions as blockers. + +--- + +## Instructions + ### Phase 1: DISCOVER **Goal**: Build a complete map of all images and all image references in the codebase. 
@@ -88,7 +49,7 @@ This skill operates as an operator for image validation and optimization, config Use Glob to locate image files: - Pattern: `static/**/*.{png,jpg,jpeg,gif,webp,svg}` -- Record each file's absolute path and size (use `ls -la` or `stat`) +- Record each file's absolute path and size (use `ls -la` or `stat`) -- measure actual bytes, never estimate **Step 2: Find all image references in content** @@ -110,23 +71,25 @@ For each image reference, record: **Path Resolution Rules:** - `/images/foo.png` resolves to `static/images/foo.png` - `images/foo.png` resolves to `static/images/foo.png` -- `../images/foo.png` resolves relative to the content file's location +- `../images/foo.png` resolves relative to the content file's location (always resolve from the content file's directory, never check the literal string against static/) - Hugo shortcode `src=` values follow the same resolution rules **Gate**: Reference map is complete with all images and all references catalogued. Proceed only when gate passes. ### Phase 2: VALIDATE -**Goal**: Check every reference and every image against quality criteria. +**Goal**: Check every reference and every image against quality criteria. Verify every reported issue by reading the actual file or reference before recording it -- never report an issue based on assumption alone. **Step 1: Alt text validation** | Status | Condition | |--------|-----------| | PASS | Alt text present, descriptive, 10-125 characters | -| WARN | Alt text too generic (single words: "image", "screenshot", "picture", "photo", "diagram", "figure", "img") | +| WARN | Alt text too generic (single words: "image", "screenshot", "picture", "photo", "diagram", "figure", "img") -- always check against this generic term list rather than subjectively judging quality | | FAIL | Alt text missing or empty | +15% of users rely on assistive technology, so validate all alt text regardless of perceived importance. 
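The alt text table above can be sketched as a small classifier. This is a minimal illustration, not the skill's actual implementation: the names `classifyAltText`, `Status`, and `genericTerms` are hypothetical, and treating non-generic text outside the 10-125 character range as WARN is an assumption, since the table does not say how to classify that case.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Status mirrors the PASS / WARN / FAIL levels in the alt text table.
type Status string

const (
	Pass Status = "PASS"
	Warn Status = "WARN"
	Fail Status = "FAIL"
)

// genericTerms is the single-word list from the WARN row of the table.
var genericTerms = map[string]bool{
	"image": true, "screenshot": true, "picture": true, "photo": true,
	"diagram": true, "figure": true, "img": true,
}

// classifyAltText applies the rules in order: missing, generic, length.
func classifyAltText(alt string) Status {
	trimmed := strings.TrimSpace(alt)
	if trimmed == "" {
		return Fail // alt text missing or empty
	}
	if genericTerms[strings.ToLower(trimmed)] {
		return Warn // exact, case-insensitive match against the generic term list
	}
	// Assumption: text outside 10-125 characters is a WARN; the table
	// does not define a status for this case.
	if n := utf8.RuneCountInString(trimmed); n < 10 || n > 125 {
		return Warn
	}
	return Pass
}

func main() {
	samples := []string{"", "Screenshot", "Bar chart of build times by Go version"}
	for _, alt := range samples {
		fmt.Printf("%q -> %s\n", alt, classifyAltText(alt))
	}
}
```

Matching against a fixed term list keeps the check deterministic, which is the point of the rule about not judging quality subjectively; whether the alt text is semantically accurate still needs a human reader.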
+ See `references/alt-text-examples.md` for detailed quality guidelines. **Step 2: File existence validation** @@ -155,11 +118,13 @@ See `references/size-guidelines.md` for type-specific thresholds. | Diagrams | SVG, WebP | Filename: "diagram", "chart", "graph", "flow" | | Icons/Logos | SVG | Filename: "icon", "logo", "favicon" | +Report format savings estimates and let the user decide whether to convert -- do not skip format recommendations on the assumption they are unnecessary. + See `references/format-selection.md` for the complete decision flowchart. **Step 5: Unused image detection** -Compare all files in static/images/ against the reference map. Any file with zero references is reported as unused. +Compare all files in static/images/ against the reference map. Any file with zero references is reported as unused. Always perform this step -- unused images bloat the repository and deployment size. **Gate**: All references validated against all criteria. Every issue has a severity level, file path, and line number. Proceed only when gate passes. @@ -199,7 +164,7 @@ Rank issues by potential impact: ### Phase 4: REPORT -**Goal**: Generate a structured, actionable audit report. +**Goal**: Generate a structured, actionable audit report. This phase is read-only -- never modify, resize, convert, or delete images. Report findings and recommendations only; changes require explicit user request. Follow the report format in `references/report-templates.md`. The report must include: @@ -213,6 +178,8 @@ Follow the report format in `references/report-templates.md`. The report must in 8. **Recommendations**: Numbered, prioritized action items 9. **Status Line**: PASS, WARN, or FAIL with counts +Every issue in the report must include an absolute file path and line number so the user can locate and fix it. Clearly distinguish severity levels: FAIL (broken, must fix), WARN (should fix), INFO (suggestion). Do not conflate severity -- a format suggestion is not a blocker. 
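One way to keep severity levels from being conflated is to derive the status line mechanically from the issue list. The sketch below is illustrative only: the `Issue` type, the `statusLine` function, the sample paths, and the output format are all hypothetical (the real format lives in `references/report-templates.md`), and it assumes the default mode where INFO items never affect the overall status.

```go
package main

import "fmt"

// Severity mirrors the FAIL / WARN / INFO levels used in the report.
type Severity string

const (
	SevFail Severity = "FAIL"
	SevWarn Severity = "WARN"
	SevInfo Severity = "INFO"
)

// Issue carries the location data every reported issue must include.
type Issue struct {
	Severity Severity
	File     string // absolute path to the file containing the reference
	Line     int    // line number of the reference
	Message  string
}

// statusLine counts issues per severity and picks the worst level present.
// Assumption: INFO suggestions are counted but never change the status.
func statusLine(issues []Issue) string {
	counts := map[Severity]int{}
	for _, issue := range issues {
		counts[issue.Severity]++
	}
	status := "PASS"
	switch {
	case counts[SevFail] > 0:
		status = "FAIL"
	case counts[SevWarn] > 0:
		status = "WARN"
	}
	return fmt.Sprintf("%s (%d fail, %d warn, %d info)",
		status, counts[SevFail], counts[SevWarn], counts[SevInfo])
}

func main() {
	issues := []Issue{
		{SevWarn, "/site/content/posts/a.md", 12, "generic alt text"},
		{SevInfo, "/site/content/posts/a.md", 30, "PNG photo could be WebP"},
	}
	fmt.Println(statusLine(issues))
}
```

Under this rule a report containing only format suggestions still ends in PASS, so an optimization hint cannot be mistaken for a blocker.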
+ **Gate**: Report is complete with all sections populated. Every issue is actionable (file path + line number + recommendation). Report ends with a status line. --- @@ -280,50 +247,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Modifying Images Without Consent -**What it looks like**: Automatically resizing or converting images during audit -**Why wrong**: User may have specific requirements. Destructive changes cannot be undone. -**Do instead**: Report findings and recommendations. Only modify when user explicitly requests it. - -### Anti-Pattern 2: Missing Line Numbers in Reports -**What it looks like**: Reporting "screenshot.png missing alt text" without file location -**Why wrong**: User cannot find and fix the issue efficiently without location. -**Do instead**: Always include absolute file path and line number for every issue. - -### Anti-Pattern 3: Treating Suggestions as Blockers -**What it looks like**: Marking a post as "failed audit" because images are not WebP -**Why wrong**: Format suggestions are optimizations, not requirements. Conflating severity levels undermines trust. -**Do instead**: Clearly distinguish FAIL (broken), WARN (should fix), INFO (suggestion). - -### Anti-Pattern 4: Skipping Unused Image Detection -**What it looks like**: Only auditing referenced images, ignoring orphaned files in static/ -**Why wrong**: Unused images bloat the repository and deployment size. -**Do instead**: Always compare static/ files against the reference map and report orphans. - -### Anti-Pattern 5: Not Resolving Relative Paths -**What it looks like**: Checking for `../images/foo.png` literally in static/ -**Why wrong**: Relative paths must be resolved from the content file's location to find the actual file. -**Do instead**: Compute absolute path from content file location, then check static/. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Alt text looks fine to me" | Subjective assessment misses generic patterns | Check against generic term list | -| "File sizes are probably okay" | Estimation is not measurement | Measure actual bytes | -| "Nobody uses screen readers" | 15% of users rely on assistive technology | Validate all alt text | -| "WebP is overkill for this site" | Format choice affects page load for every visitor | Report savings, let user decide | - ### Integration Notes **With pre-publish-checker**: The pre-publish-checker skill performs basic image validation (existence, alt text presence). This skill provides deeper analysis including format optimization, page weight, and unused image detection. Use pre-publish for quick pass/fail; use image-auditor for comprehensive audits. diff --git a/skills/image-to-video/SKILL.md b/skills/image-to-video/SKILL.md index af13e8a3..b634f143 100644 --- a/skills/image-to-video/SKILL.md +++ b/skills/image-to-video/SKILL.md @@ -37,57 +37,18 @@ routing: # Image to Video Skill -## Operator Context - -This skill operates as an operator for CLI-based video creation, configuring Claude's behavior for deterministic FFmpeg script execution. It implements the **Sequential Pipeline** architectural pattern -- Validate, Prepare, Encode, Verify -- with **Domain Intelligence** embedded in FFmpeg filter selection and resolution matching. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before creating video -- **Over-Engineering Prevention**: Only implement what is directly requested. 
No extra visualizations, no format conversions beyond MP4 -- **FFmpeg Validation**: Always verify FFmpeg is installed before attempting video creation -- **Input Validation**: Check that both image and audio files exist before processing -- **Absolute Paths Only**: Always use absolute paths for image, audio, and output arguments - -### Default Behaviors (ON unless disabled) -- **Resolution Default**: Use 1080p (1920x1080) unless user specifies otherwise -- **Static Mode**: No visualization overlay unless user requests one -- **AAC Audio**: Encode audio as 192k AAC for broad compatibility -- **H.264 Video**: Encode with libx264 preset medium, CRF 23, yuv420p pixel format -- **Output Verification**: Run ffprobe on output and report file size after creation - -### Optional Behaviors (OFF unless enabled) -- **Waveform Visualization**: Neon waveform overlay with `--visualization waveform` -- **Spectrum Visualization**: Scrolling frequency spectrum with `--visualization spectrum` -- **CQT Visualization**: Piano-roll style bars with `--visualization cqt` -- **Bars Visualization**: Frequency bar graph with `--visualization bars` -- **Custom Resolution**: Override with `--resolution` preset (720p, square, vertical) -- **Workspace Mode**: Batch process paired files with `--process-workspace` - -## What This Skill CAN Do -- Combine a static image with audio to produce an MP4 video -- Scale images to target resolution while preserving aspect ratio -- Add audio visualization overlays (waveform, spectrum, cqt, bars) -- Support multiple resolution presets (1080p, 720p, square, vertical) -- Batch process matching image+audio pairs from workspace directory -- Validate FFmpeg availability and report actionable install instructions - -## What This Skill CANNOT Do -- Generate images (use `gemini-image-generator` for that) -- Edit existing videos or trim/split audio -- Stream live video or produce non-MP4 formats -- Add text overlays, captions, or transitions -- Work without FFmpeg 
installed on the system - ---- +Combine a static image with an audio file to produce an MP4 video using FFmpeg. Supports resolution presets (1080p, 720p, square, vertical), optional audio visualization overlays (waveform, spectrum, cqt, bars), and batch processing of matched image+audio pairs. For image generation, use `gemini-image-generator` instead. ## Instructions ### Phase 1: VALIDATE -**Goal**: Confirm all prerequisites before attempting video creation. +Confirm all prerequisites before attempting video creation. **Step 1: Check FFmpeg installation** +Always run this check first -- many systems lack FFmpeg or have minimal builds, and skipping it produces confusing subprocess errors instead of clear install guidance. + ```bash ffmpeg -version ``` @@ -96,6 +57,8 @@ If FFmpeg is not installed, provide platform-specific install instructions and s **Step 2: Verify input files exist** +Both the image and audio files must be confirmed present before processing. Use absolute paths for all arguments -- relative paths break silently when the script executes from a different working directory. + ```bash ls -la /absolute/path/to/image.png /absolute/path/to/audio.mp3 ``` @@ -106,7 +69,9 @@ Confirm both files exist and have non-zero size. Supported formats: **Step 3: Determine parameters** -Resolve resolution preset and visualization mode from user request. If the user did not specify, use defaults (1080p, static). +Re-read the user's request before selecting defaults. Resolve resolution preset and visualization mode from what the user actually asked for. Only apply defaults (1080p, static) when the user did not specify -- defaulting to static when the user requested a visualization is a common mistake. + +If the user mentions a target platform, select the matching preset to avoid cropping or black bars on delivery: | Preset | Dimensions | Platform | |--------|------------|----------| @@ -115,11 +80,17 @@ Resolve resolution preset and visualization mode from user request. 
If the user | `square` | 1080x1080 | Instagram, social media | | `vertical` | 1080x1920 | Stories, Reels, TikTok | +Optional visualization modes (off unless the user requests one): +- `--visualization waveform` -- Neon waveform overlay +- `--visualization spectrum` -- Scrolling frequency spectrum +- `--visualization cqt` -- Piano-roll style bars +- `--visualization bars` -- Frequency bar graph + **Gate**: FFmpeg installed, both input files exist, parameters resolved. Proceed only when gate passes. ### Phase 2: PREPARE -**Goal**: Set up output path and confirm no conflicts. +Set up output path and confirm no conflicts. **Step 1: Determine output path** @@ -136,7 +107,9 @@ The script creates parent directories automatically. Verify the target directory ### Phase 3: ENCODE -**Goal**: Execute FFmpeg to produce the video. +Execute FFmpeg to produce the video. Only implement what the user requested -- no extra visualizations or format conversions beyond MP4. + +Encoding defaults: libx264 preset medium, CRF 23, yuv420p pixel format, 192k AAC audio. **Step 1: Run the script** @@ -165,7 +138,7 @@ The script prints progress including input paths, resolution, visualization mode ### Phase 4: VERIFY -**Goal**: Confirm the output video is valid and report results. +Confirm the output video is valid. Do not report success based on exit code alone -- FFmpeg can exit 0 but produce a corrupt or zero-duration file. **Step 1: Check file exists and has reasonable size** @@ -175,6 +148,8 @@ ls -la /absolute/path/to/output.mp4 **Step 2: Probe video metadata** +File size alone does not prove video integrity. Always probe with ffprobe to confirm the output is a valid video with correct duration. 
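Once the probe has run, its output can also be checked programmatically rather than by eye. A minimal sketch, assuming the `key=value` text that ffprobe's `default` writer emits with `noprint_wrappers=1`; the function names and the 0.5 second tolerance are hypothetical and are not taken from `image_to_video.py`.

```python
def parse_probe_output(text):
    """Collect key=value lines; keys repeated per stream become lists."""
    fields = {}
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition("=")
        if sep:  # skip lines that are not key=value pairs
            fields.setdefault(key, []).append(value)
    return fields


def duration_ok(probe_text, audio_seconds, tolerance=0.5):
    """True when the probed duration is positive and close to the audio length.

    The tolerance is an assumed value, not one defined by the skill.
    """
    values = parse_probe_output(probe_text).get("duration", [])
    try:
        duration = float(values[0])
    except (IndexError, ValueError):
        return False  # duration missing or reported as "N/A"
    return duration > 0 and abs(duration - audio_seconds) <= tolerance
```

A zero-duration or unparseable result fails the check even if the encode exited 0, which is the case Phase 4 exists to catch.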
+ ```bash ffprobe -v error -show_entries format=duration,size -show_entries stream=codec_name,width,height \ -of default=noprint_wrappers=1 /absolute/path/to/output.mp4 @@ -188,8 +163,6 @@ Provide: output file path, file size, duration, resolution, and visualization mo **Gate**: Output file exists, duration matches audio, metadata is valid. Task complete. ---- - ## Error Handling ### Error: "FFmpeg is not installed or not in PATH" @@ -220,47 +193,7 @@ Solution: 2. Convert to a known format: `ffmpeg -i input.audio -acodec pcm_s16le output.wav` 3. Re-run with the converted file ---- - -## Anti-Patterns - -### Anti-Pattern 1: Using Relative Paths -**What it looks like**: `python3 image_to_video.py -i ../cover.png -a song.mp3` -**Why wrong**: The script may execute from a different working directory, breaking all paths silently. -**Do instead**: Always use absolute paths for every argument. - -### Anti-Pattern 2: Skipping FFmpeg Verification -**What it looks like**: Running the script directly without checking `ffmpeg -version` first. -**Why wrong**: Produces confusing subprocess errors instead of clear install instructions. -**Do instead**: Complete Phase 1 validation before any encoding attempt. - -### Anti-Pattern 3: Wrong Resolution for Target Platform -**What it looks like**: Using 1080p landscape for TikTok, or vertical for YouTube. -**Why wrong**: Content gets cropped or displays with large black bars on the target platform. -**Do instead**: Ask the user what platform the video targets, then select the matching preset. - -### Anti-Pattern 4: Skipping Output Verification -**What it looks like**: Reporting success based on script exit code alone without probing the output. -**Why wrong**: FFmpeg can exit 0 but produce a corrupt or zero-duration file. -**Do instead**: Complete Phase 4 -- probe the output, confirm duration matches audio. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It Is Wrong | Required Action | -|-----------------|-----------------|-----------------| -| "FFmpeg is always installed" | Many systems lack it or have minimal builds | Run `ffmpeg -version` every time | -| "The script handles everything" | Script can fail silently with bad inputs | Validate inputs in Phase 1 | -| "File size looks right" | Size alone does not prove video integrity | Probe with ffprobe, check duration | -| "Static mode is fine" | User may have requested visualization | Re-read the request before defaulting | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/ffmpeg-filters.md`: FFmpeg filter documentation for visualization modes - `${CLAUDE_SKILL_DIR}/scripts/image_to_video.py`: Python CLI script (exit codes: 0=success, 1=no FFmpeg, 2=encode failed, 3=missing args) diff --git a/skills/install/SKILL.md b/skills/install/SKILL.md index 7d569c1f..264290a4 100644 --- a/skills/install/SKILL.md +++ b/skills/install/SKILL.md @@ -22,14 +22,7 @@ routing: # /install — Setup & Health Check -Verify your Claude Code Toolkit installation, diagnose issues, and get oriented. - -## When to Use - -- After cloning the repo and running `install.sh` -- When something seems broken (hooks not firing, missing commands) -- First time using the toolkit — to see what's available -- After a `git pull` to verify nothing broke +Verify your Claude Code Toolkit installation, diagnose issues, and get oriented. Use after cloning the repo and running `install.sh`, when something seems broken (hooks not firing, missing commands), for first-time orientation, or after a `git pull` to verify nothing broke. 
## Instructions @@ -59,7 +52,7 @@ If the script is not found at `scripts/install-doctor.py`, try `~/.claude/script **Step 3: Display results clearly** -Show the check output to the user with a clear pass/fail summary. Use the raw script output — do not paraphrase or reformat excessively. +Show the check output to the user with a clear pass/fail summary. Display the raw script output without paraphrasing or reformatting, because the script already formats diagnostics for readability and rewriting them risks losing detail or misrepresenting status. **Gate**: Health check complete. If issues found, proceed to Phase 2. If clean, skip to Phase 3. @@ -67,8 +60,6 @@ Show the check output to the user with a clear pass/fail summary. Use the raw sc **Goal**: Guide the user through fixing detected issues. -**Important**: This phase is interactive. Show the user what needs fixing and let them choose. - **Step 1: Determine if install.sh needs to run** If `~/.claude` is missing or components are not installed, the user needs to run install.sh. Tell them: @@ -80,7 +71,7 @@ The toolkit hasn't been installed yet. Run this from the repo directory: ./install.sh --dry-run # preview first ``` -Wait for the user to confirm they've run it, then re-run the health check. +Wait for the user to confirm they've run it, then re-run the health check. This phase is interactive because installation changes system state -- always show the user what needs fixing and let them choose before acting. **Step 2: Fix individual issues** @@ -95,7 +86,7 @@ find ~/.claude/scripts -name "*.py" -exec chmod 755 {} \; pip install -r requirements.txt ``` -Only run fixes the user approves. +Only run fixes the user approves, because automated fixes to `~/.claude` can break an existing setup if assumptions about the environment are wrong. 
**Step 3: Re-check** @@ -118,7 +109,7 @@ python3 ~/.claude/scripts/install-doctor.py inventory **Step 2: Display summary** -Show the actual counts returned by `install-doctor.py inventory` — do not display hardcoded numbers. Present them as: +Show the actual counts returned by `install-doctor.py inventory` -- never display hardcoded numbers, because component counts change with every install and stale numbers erode trust. Present them as: ``` Your toolkit is ready. Here's what's installed: @@ -161,13 +152,13 @@ MCP Servers: Install: claude mcp add context7 -- npx @anthropic-ai/mcp-context7@latest ``` -Use ✓ for connected MCPs and ✗ for missing ones. For missing MCPs, show the install command. +Use checkmark for connected MCPs and X for missing ones. For missing MCPs, show the install command. **Gate**: MCP inventory displayed. Proceed to Phase 4. ### Phase 4: ORIENT -**Goal**: Give the user their bearings — what to do first. +**Goal**: Give the user their bearings -- what to do first. **Step 1: Show the three essential commands** diff --git a/skills/integration-checker/SKILL.md b/skills/integration-checker/SKILL.md index 9a10ae9a..d44a2af9 100644 --- a/skills/integration-checker/SKILL.md +++ b/skills/integration-checker/SKILL.md @@ -34,51 +34,11 @@ routing: # Integration Checker Skill -## Core Principle - **Existence does not equal integration.** A component existing is implementation-level verification; a component being connected is integration-level verification. Both are necessary. Neither is sufficient alone. This skill catches the most common class of real-world bugs in AI-generated code: components that are individually correct but not connected to each other. A function can exist, contain real logic, pass correctness verification, and never be imported or called. An API endpoint can be defined but never wired into the router. An event handler can be registered but never receive events. 
-## Purpose - -Verify cross-component wiring using four verification techniques. Phase 3.5 of the feature lifecycle (design -> plan -> implement -> **integration check** -> validate -> release). - -When used standalone (not in the feature pipeline), operates on the current working directory or a specified path. - -## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Read-Only Analysis**: This skill reads and reports. It does NOT fix wiring issues. Fixes route back to feature-implement or the user. -- **Language Detection First**: Detect project language(s) before applying verification techniques. Different languages have different import/export patterns. -- **Three-State Classification**: Every export gets exactly one status: WIRED, IMPORTED_NOT_USED, or ORPHANED. No ambiguous states. -- **Structural Not Semantic**: Contract checking verifies shape and naming compatibility, not whether data is logically correct. WHY: semantic correctness requires runtime information we don't have. - -### Default Behaviors (ON unless disabled) -- **Concise Reporting**: Report facts. Show the wiring map, not prose about the wiring map. -- **Feature Pipeline Integration**: When invoked within the feature pipeline, read implementation artifact from `.feature/state/implement/` for context on what was built. -- **Exclusion Patterns**: Skip `node_modules/`, `vendor/`, `.git/`, `__pycache__/`, `dist/`, `build/`, test fixtures, and generated files. WHY: these contain intentionally unused exports (library code, vendored deps) that would flood the report with false positives. -- **Severity Ordering**: Report ORPHANED (failure) before IMPORTED_NOT_USED (warning) before WIRED (pass). Users need to see failures first. - -### Optional Behaviors (OFF unless enabled) -- **Requirements Map**: When a task plan exists (`.feature/state/plan/`), produce per-requirement wiring status. 
Only available in feature pipeline context. -- **Verbose Mode**: Show WIRED components in addition to failures and warnings. Default only shows issues. - -## What This Skill CAN Do -- Build export/import maps for Go, Python, TypeScript, and JavaScript projects -- Classify exports as WIRED, IMPORTED_NOT_USED, or ORPHANED -- Detect hardcoded empty data, placeholder values, and dead parameters flowing through wired connections -- Verify output shape compatibility between connected components -- Produce per-requirement wiring status when a task plan exists -- Operate standalone or as part of the feature pipeline - -## What This Skill CANNOT Do -- Fix wiring issues (route to feature-implement or the user) -- Verify runtime behavior or semantic correctness -- Analyze dynamically-loaded modules, reflection-based wiring, or plugin architectures with certainty -- Replace integration tests (this is static analysis, not execution) -- **Reason**: This skill is static structural analysis. Runtime verification and fix application require different capabilities. +This is a read-only analysis skill -- it reads and reports but does not fix wiring issues. Fixes route back to /feature-implement or the user, because integration fixes often require design decisions about which component should call which. --- @@ -88,15 +48,17 @@ When used standalone (not in the feature pipeline), operates on the current work **Goal**: Establish context, detect language, identify scope. -**Step 1: Detect execution context** +**Step 1: Read repository CLAUDE.md** (if present) and follow any project-specific conventions before proceeding. + +**Step 2: Detect execution context** Determine if running within the feature pipeline or standalone: -- **Pipeline**: Check for `.feature/state/implement/` artifact. If present, load it to understand what was built and scope the check to changed/added files. +- **Pipeline**: Check for `.feature/state/implement/` artifact. 
If present, load it to understand what was built and scope the check to changed/added files. Scoping to changed files prevents wasting time analyzing unchanged code in large repositories. - **Standalone**: Scope to the current working directory or user-specified path. Analyze all source files. -**Step 2: Detect project language(s)** +**Step 3: Detect project language(s)** -Identify language(s) from file extensions, build files, and project structure: +Detect language(s) before applying any verification techniques -- different languages have fundamentally different import/export patterns. | Indicator | Language | |-----------|----------| @@ -107,7 +69,7 @@ Identify language(s) from file extensions, build files, and project structure: Multiple languages may coexist. Run all applicable techniques for each. -**Step 3: Identify language-specific patterns** +**Step 4: Identify language-specific patterns** | Language | Export Pattern | Import Pattern | Common Integration Failures | |----------|---------------|----------------|----------------------------| @@ -122,7 +84,7 @@ Multiple languages may coexist. Run all applicable techniques for each. ### Phase 1: EXPORT/IMPORT MAP -**Goal**: For every export in scope, determine its wiring status: WIRED, IMPORTED_NOT_USED, or ORPHANED. +**Goal**: For every export in scope, determine its wiring status. Every export gets exactly one of three states -- no ambiguous classifications. **Step 1: Discover exports** @@ -132,6 +94,8 @@ Scan source files for exported symbols. Be language-aware: - **Python**: Find all module-level function/class/variable definitions. Check `__all__` if present (it restricts the public API). Check `__init__.py` for re-exports. - **TypeScript/JavaScript**: Find all `export` declarations, `export default`, and barrel file re-exports. +Skip `node_modules/`, `vendor/`, `.git/`, `__pycache__/`, `dist/`, `build/`, test fixtures, and generated files. 
These contain intentionally unused exports (library code, vendored deps) that would flood the report with false positives. + Record each export as: `{file, name, kind (function/type/const/var), line}`. **Step 2: Discover imports and usages** @@ -140,7 +104,7 @@ For each export found, search the codebase for: 1. **Import**: The symbol is imported (appears in an import statement referencing the exporting module) 2. **Usage**: The imported symbol is actually used (called, referenced, assigned, passed as argument) beyond the import statement itself -This two-step check is critical. WHY: An import without usage is IMPORTED_NOT_USED, which is a distinct failure mode from ORPHANED. It signals someone intended to use the component but didn't finish wiring it. +Both checks are required. An import without usage is a distinct failure mode from an orphan -- it signals someone intended to use the component but didn't finish wiring it. **Step 3: Classify each export** @@ -150,7 +114,7 @@ This two-step check is critical. WHY: An import without usage is IMPORTED_NOT_US | Exported and imported, but never used beyond the import statement | **IMPORTED_NOT_USED** | Warning | | Exported but never imported anywhere in the project | **ORPHANED** | Failure | -**Exclusions** (do not flag as ORPHANED): +**Exclusions** (do not flag as ORPHANED -- these have legitimate reasons for not being imported internally): - `main()` functions and entry points - Interface implementations that satisfy an interface (Go) - Test helpers exported for use by `_test.go` files in other packages @@ -160,7 +124,7 @@ This two-step check is critical. WHY: An import without usage is IMPORTED_NOT_US **Step 4: Build the export/import map** -Produce a structured map: +Report failures first -- users need to see ORPHANED before IMPORTED_NOT_USED before WIRED. 
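The three-state classification and the failures-first ordering can be read as a two-question decision followed by a sort. A minimal sketch, assuming import and usage sites have already been collected for each export; `classify_export`, `build_report_rows`, the dictionary shape, and the boolean `excluded` flag are hypothetical simplifications, not the skill's actual implementation.

```python
# Report order: failures first, then warnings, then passes.
STATUS_ORDER = {"ORPHANED": 0, "IMPORTED_NOT_USED": 1, "WIRED": 2}


def classify_export(import_sites, usage_sites):
    """Assign exactly one status to an export."""
    if not import_sites:
        return "ORPHANED"           # never imported anywhere
    if not usage_sites:
        return "IMPORTED_NOT_USED"  # imported, but wiring was never finished
    return "WIRED"


def build_report_rows(exports):
    """Return (status, file, name) rows sorted with failures first."""
    rows = []
    for export in exports:
        status = classify_export(export["imports"], export["usages"])
        # Entry points and similar exclusions are not flagged as ORPHANED.
        if status == "ORPHANED" and export.get("excluded"):
            continue
        rows.append((status, export["file"], export["name"]))
    return sorted(rows, key=lambda row: (STATUS_ORDER[row[0]], row[1], row[2]))
```

Keeping classification separate from ordering means the same rows could feed a verbose listing that includes WIRED entries or an issues-only listing.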
``` ## Export/Import Map @@ -186,10 +150,12 @@ Produce a structured map: ### Phase 2: DATA FLOW AND CONTRACT CHECK -**Goal**: For WIRED components, verify that real data flows through the connections and that output shapes match input expectations. +**Goal**: For WIRED components, verify that real data flows through the connections and that output shapes match input expectations. A component wired to always receive empty data is functionally disconnected. This phase checks two things simultaneously because they both operate on the same set of WIRED connections identified in Phase 1. +This is structural analysis, not semantic verification. Contract checking verifies shape and naming compatibility, not whether data is logically correct -- semantic correctness would require runtime information that static analysis cannot provide. + #### Data Flow Tracing For each WIRED connection, check whether real data actually reaches the component. Specifically look for: @@ -234,18 +200,20 @@ For each WIRED connection where component A's output feeds into component B's in Record each finding as: `{producer_file, consumer_file, mismatch_kind, description}`. -**Important context**: Contract checking is approximate in dynamic languages. Report findings with appropriate confidence: +Report contract findings with appropriate confidence levels. In dynamic languages, approximate checking still catches obvious mismatches -- report what you can find rather than skipping the check: - **High confidence**: Explicit type annotations match/mismatch, struct/interface definitions - **Medium confidence**: Inferred from usage patterns, variable names, JSDoc/docstrings - **Low confidence**: Dynamic access patterns, computed property names, reflection +Note: dynamically-loaded modules, reflection-based wiring, and plugin architectures cannot be analyzed with certainty through static analysis. Flag what is visible and note the limitation. + **Gate**: Data flow and contract findings recorded. 
Proceed to Phase 3. --- ### Phase 3: REPORT -**Goal**: Produce a structured integration report with actionable findings. +**Goal**: Produce a structured integration report with actionable findings. Report facts and show the wiring map -- not prose about the wiring map. **Step 1: Requirements integration map (pipeline mode only)** @@ -308,6 +276,8 @@ FAIL: Has ORPHANED components, data flow issues, or high-confidence contract mis 4. [Specific action for each contract mismatch] ``` +Only fail the verdict on high-confidence contract mismatches. Low-confidence findings in dynamic languages are informational -- they belong in the WARN tier, not FAIL. + **Step 3: Verdict and next steps** | Verdict | Next Step | @@ -328,34 +298,10 @@ FAIL: Has ORPHANED components, data flow issues, or high-confidence contract mis | Language not detected | No recognizable build files or source extensions | Specify language manually or check project structure | | Too many exports to analyze | Large monorepo or library with thousands of exports | Narrow scope to changed files (use `git diff --name-only` against base branch) | | False positive ORPHANED | Library code, plugin interfaces, or entry points | Check exclusion patterns. If legitimate public API, add to exclusions. 
| -| Circular import detected | Python circular imports or Go import cycles | Report as separate finding — circular imports are integration issues themselves | +| Circular import detected | Python circular imports or Go import cycles | Report as separate finding -- circular imports are integration issues themselves | | No implementation artifact | Running in pipeline mode but implement phase didn't checkpoint | Fall back to standalone mode using git diff to identify changed files | -## Anti-Patterns - -| Anti-Pattern | Why Wrong | Do Instead | -|--------------|-----------|------------| -| Checking only that imports exist | Import without usage is a distinct failure mode (IMPORTED_NOT_USED) | Verify import AND usage for each export | -| Treating all ORPHANED as bugs | Library code, plugin interfaces, and entry points are intentionally not imported internally | Apply exclusion patterns before flagging | -| Skipping data flow check because wiring exists | A component wired to always receive empty data is functionally disconnected | Check data flow for every WIRED connection | -| Running on entire monorepo in pipeline mode | Wastes time analyzing unchanged code | Scope to files changed since implementation started | -| Auto-fixing wiring issues | Integration fixes often require design decisions (which component should call which?) | Report issues, let human or feature-implement decide the fix | -| Treating low-confidence contract findings as failures | Dynamic language contract checking is approximate without runtime types | Report confidence level, only FAIL on high-confidence mismatches | - -## Anti-Rationalization - -| Rationalization Attempt | Why It's Wrong | Required Action | -|------------------------|----------------|-----------------| -| "The component exists and has real logic, so it's integrated" | Existence is not integration. A function that's never called is dead code. 
| Check import AND usage, not just existence | -| "It's imported, so it must be used" | Unused imports are a specific failure mode — someone started wiring but didn't finish | Verify the import is followed by actual usage | -| "The wiring looks right so data must flow" | Wiring to an empty array is functionally the same as no wiring | Trace actual data through each connection | -| "Contract checking is too hard in dynamic languages" | Approximate checking still catches obvious mismatches | Check what you can, report confidence levels | -| "These ORPHANED exports are for future use" | Future use is not current integration. Flag them now, exclude intentionally if confirmed. | Report all ORPHANED, let the user decide which are intentional | -| "Integration checking is overkill for a small change" | Small changes are where wiring gets forgotten — new function added but never called | Run the check. Scope narrows automatically for small changes. | - ## References - [Feature State Conventions](../_feature-shared/state-conventions.md) -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) -- [Anti-Rationalization Core](../shared-patterns/anti-rationalization-core.md) - [ADR-078: Integration Checker](../../adr/078-integration-checker.md) diff --git a/skills/joy-check/SKILL.md b/skills/joy-check/SKILL.md index 80d7db23..c19ff496 100644 --- a/skills/joy-check/SKILL.md +++ b/skills/joy-check/SKILL.md @@ -39,42 +39,11 @@ routing: # Joy Check -## Operator Context - -This skill operates as an operator for tonal framing validation, configuring Claude's behavior for evaluating whether content frames experiences through joy, curiosity, and generosity rather than grievance, accusation, or victimhood. It implements the **Two-Pass Validation** architectural pattern -- deterministic regex pre-filter followed by LLM semantic analysis -- to catch both obvious and subtle negative framing. 
- -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before validating -- **Joy is the lens**: Every paragraph should frame its subject through curiosity, wonder, generosity, or earned satisfaction -- because content that builds a case for grievance alienates readers and undermines the author's credibility, even when the underlying experience is legitimate -- **No grievance framing**: Content should never build a case for being wronged -- because accumulating evidence of injustice reads as prosecution, not communication -- **No accusation framing**: Content should never point fingers at specific people or imply bad faith -- because attributing motive to others is speculation wearing the mask of observation -- **No victimhood framing**: Content should never position the author as a victim -- because it shifts agency from the author to external forces -- **Reframe, don't suppress**: Negative experiences are valid topics. The skill checks *framing*, not *topic*. A difficult experience can be framed through what you learned, what you built, or what you now understand. Suppressing legitimate experiences would be dishonest; reframing them through curiosity is editorial craft. 
- -### Default Behaviors (ON unless disabled) -- **Paragraph-level analysis**: Evaluate each paragraph independently against the Joy Framing Rubric -- **Regex pre-filter**: Run `scan-negative-framing.py` first as a fast gate -- catches obvious patterns without spending LLM tokens -- **Suggestion mode**: Flag and suggest reframes without modifying content -- **Score output**: Produce a joy score (0-100) for the full piece - -### Optional Behaviors (OFF unless enabled) -- **Fix mode** (--fix): Rewrite flagged paragraphs in place, then re-verify -- **Strict mode** (--strict): Fail on any paragraph below 60 joy score - -## What This Skill CAN Do -- Evaluate tonal framing at the paragraph level using the Joy Framing Rubric -- Detect subtle grievance, accusation, and victimhood patterns that regex misses (passive-aggressive factuality, accumulative grievance, reluctant generosity) -- Suggest positive reframes for negatively-framed paragraphs -- Produce a joy score (0-100) for publication readiness -- Rewrite flagged paragraphs in fix mode while preserving substance - -## What This Skill CANNOT Do -- Evaluate voice authenticity (use voice-validator for that) -- Detect AI writing patterns (use anti-ai-editor for that) -- Change the topic or substance of content (only reframes the framing) -- Evaluate factual accuracy or grammar (out of scope entirely) +Validate content for joy-centered tonal framing. Runs a two-pass pipeline -- regex pre-filter for obvious patterns, then LLM semantic analysis -- to evaluate whether content frames experiences through curiosity, generosity, and earned satisfaction rather than grievance, accusation, or victimhood. ---- +By default the skill evaluates each paragraph independently, produces a joy score (0-100), and suggests reframes without modifying content. Optional flags change behavior: `--fix` rewrites flagged paragraphs in place and re-verifies; `--strict` fails on any paragraph below 60. 
+
+This skill checks *framing*, not *topic* and not *voice*. Difficult experiences are valid subjects. Voice fidelity belongs to voice-validator, AI pattern detection belongs to anti-ai-editor, and grammar/style editing is out of scope entirely.
 ## Instructions
@@ -106,6 +75,8 @@ Read the full file. Identify paragraph boundaries (blank-line separated blocks).
 **Step 2: Evaluate each paragraph against the Joy Framing Rubric**
+Every paragraph should frame its subject through curiosity, wonder, generosity, or earned satisfaction. Content that builds a case for grievance alienates readers and undermines the author's credibility, even when the underlying experience is legitimate.
+
 | Dimension | Joy-Centered (PASS) | Grievance-Centered (FAIL) |
 |-----------|-------------------|--------------------------|
 | **Subject position** | Author as explorer, builder, learner | Author as victim, wronged party, unrecognized genius |
@@ -115,6 +86,15 @@ Read the full file. Identify paragraph boundaries (blank-line separated blocks).
 | **Action framing** | "I decided to", "I realized", "I learned" | "I was forced to", "I had no choice", "they made me" |
 | **Closing energy** | Forward-looking, building, sharing, exploring | Cautionary, warning, demanding, lamenting |
+When evaluating, watch for these subtle patterns that the regex scanner cannot catch:
+
+- **Defensive disclaimers** ("I'm not accusing anyone", "This isn't about blame"): If the author has to disclaim, the framing is already grievance-adjacent. The disclaimer signals the content that follows is accusatory enough to need a shield. Flag the paragraph and recommend removing both the disclaimer and the accusatory content it shields.
+- **Accumulative grievance**: Each paragraph is individually mild, but together they build a case for being wronged. A reader who finishes the piece feeling "that person was wronged" has been led through a prosecution.
Flag the accumulation pattern and recommend interspersing observations with what the author learned, built, or found interesting.
+- **Passive-aggressive factuality** ("The timeline shows X. The repo was created Y days later. I'll let you draw your own conclusions."): Presenting facts in prosecution order is framing, not neutrality. "I'll let you draw your own conclusions" deputizes the reader as jury. Flag and recommend including facts where relevant to the experience, not as evidence.
+- **Reluctant generosity** ("I'm not saying they did anything wrong, BUT..."): The "but" negates the generosity. This is grievance wearing a generous mask. Flag and recommend being generous without qualification, or acknowledging the complexity directly.
+
+Do not dismiss a paragraph as "fine because it's factual." Facts arranged as prosecution are framing, not neutrality -- evaluate the *arrangement* of facts, not just their accuracy. Similarly, do not excuse grievance framing because the author's feelings are justified. The skill checks framing, not whether the underlying feeling is earned.
+
 **Step 3: Score each paragraph**
 For each paragraph, assign one of:
@@ -123,7 +103,9 @@ For each paragraph, assign one of:
 - **CAUTION** (30-49): Leans toward grievance but recoverable with reframing
 - **GRIEVANCE** (0-29): Frames through accusation, victimhood, or bitterness
-For any paragraph scored CAUTION or GRIEVANCE, draft a specific reframe suggestion that preserves the substance while shifting the framing toward curiosity or generosity.
+For any paragraph scored CAUTION or GRIEVANCE, draft a specific reframe suggestion that preserves the substance while shifting the framing toward curiosity or generosity. Remember: reframing is editorial craft, not dishonesty. The substance stays the same; only the lens changes. A single GRIEVANCE paragraph poisons the tonal arc of the whole piece, so do not treat it as minor.
+
+If a paragraph seems "too subtle to flag," that is precisely when flagging matters most. Subtle grievance is what the regex scanner misses, making it the primary purpose of this LLM analysis phase.
 **GATE**: All paragraphs analyzed and scored. Reframe suggestions drafted for all CAUTION and GRIEVANCE paragraphs. Proceed to Phase 3.
@@ -158,7 +140,7 @@ Overall: [summary of tonal arc -- where the piece starts, how it moves, where it
 If `--fix` mode is active:
 1. Rewrite any CAUTION or GRIEVANCE paragraphs using the drafted reframe suggestions
-2. Preserve the substance -- change only the framing
+2. Preserve the substance -- change only the framing, not the topic or meaning
 3. Re-run Phase 2 analysis on the rewritten paragraphs to verify fixes landed
 4. If fixes introduce new CAUTION/GRIEVANCE scores, iterate (maximum 3 attempts)
@@ -166,7 +148,9 @@ If `--fix` mode is active:
 ---
-## The Joy Principle
+## Reference Material
+
+### The Joy Principle
 This is the editorial philosophy that drives the check.
@@ -180,13 +164,11 @@ Both describe the same events. The second frames it through the lens that define
 **Joy doesn't mean happiness.** It means engagement, curiosity, the energy of figuring things out. A joy-centered post about a frustrating debugging session isn't happy -- but it frames the frustration as the puzzle and the understanding as the reward. That's the lens.
----
-
-## Examples
+### Examples
 These examples show the same content reframed from grievance to joy. The substance is identical. Only the framing changes.
-### Example 1: Describing a Difficult Experience
+#### Example 1: Describing a Difficult Experience
 **GRIEVANCE (FAIL):**
 ```
@@ -206,7 +188,7 @@ Someone else got it.
 **Why the second works:** The author is an explorer who found something interesting, not a victim cataloguing injustice. "Mostly crickets" is honest without being bitter. "Someone else got it" is generous.
-### Example 2: Discovering Similarity
+#### Example 2: Discovering Similarity
 **GRIEVANCE (FAIL):**
 ```
@@ -224,7 +206,7 @@ architecture I'd spent months developing and writing about.
 **Why the second works:** "Excited to curious" is an explorer's arc. No accusation of copying. The observation is about what the author found interesting, not what was done to them.
-### Example 3: Discussing How Ideas Spread
+#### Example 3: Discussing How Ideas Spread
 **GRIEVANCE (FAIL):**
 ```
@@ -245,7 +227,7 @@ even better.
 **Why the second works:** The decision to release is framed as a positive realization, not a resignation. "Even better" at the end carries forward energy.
-### Example 4: Talking About Credit
+#### Example 4: Talking About Credit
 **GRIEVANCE (FAIL):**
 ```
@@ -263,7 +245,7 @@ it, and be understood.
 **Why the second works:** Locates the feeling in curiosity ("what made this interesting") not entitlement ("I deserve"). "Be understood" is a human need, not a demand.
-### Example 5: The Conclusion
+#### Example 5: The Conclusion
 **GRIEVANCE (FAIL):**
 ```
@@ -282,7 +264,7 @@ That's been the most enjoyable part of this whole process.
 **Why the second works:** Ends on what the author enjoys, not what they're defending against. "Seeing what sticks" carries the experimental energy. No timestamps-as-evidence framing.
-### Example 6: Addressing Uncertainty About Origins
+#### Example 6: Addressing Uncertainty About Origins
 **GRIEVANCE (FAIL):**
 ```
@@ -301,6 +283,18 @@ Claude drew on." That's true for everyone using these tools. Including me.
 **Why the second works:** Includes the author in the same uncertainty. "Including me" is the key phrase. It transforms from "I know and they should know" to "none of us fully know."
+### Integration
+
+This skill integrates with the content validation pipeline:
+
+```
+CONTENT --> voice-validator (deterministic) --> scan-ai-patterns (deterministic)
+ --> scan-negative-framing (regex pre-filter) --> joy-check (LLM analysis)
+ --> anti-ai-editor (LLM style fixes)
+```
+
+The joy-check can be invoked standalone via `/joy-check [file]` or as part of the content pipeline for any content where positive framing matters.
+
 ---
 ## Error Handling
@@ -335,61 +329,8 @@ Claude drew on." That's true for everyone using these tools. Including me.
 ---
-## Anti-Patterns
-
-### Anti-Pattern 1: Defensive Disclaimers
-**What it looks like**: "I'm not accusing anyone" / "This isn't about blame"
-**Why wrong**: If you have to disclaim, the framing is already grievance-adjacent. The disclaimer signals that the content that follows is accusatory enough to need a shield.
-**Do instead**: Remove the disclaimer AND reframe the content so the disclaimer isn't needed.
-
-### Anti-Pattern 2: Accumulative Grievance
-**What it looks like**: Each paragraph adds another piece of evidence for being wronged
-**Why wrong**: Even if each paragraph is mild, the accumulation builds a case. A reader who finishes the piece feeling "that person was wronged" has been led through a prosecution.
-**Do instead**: Intersperse observations with what you learned, built, or found interesting. Break the evidence chain.
-
-### Anti-Pattern 3: Passive-Aggressive Factuality
-**What it looks like**: "The timeline shows X. The repo was created Y days later. I'll let you draw your own conclusions."
-**Why wrong**: Presenting facts in prosecution order is framing, not neutrality. "I'll let you draw your own conclusions" is the most aggressive form of accusation -- it deputizes the reader.
-**Do instead**: Include facts where relevant to the experience, not as evidence. If the timeline is interesting, say why it's interesting, not why it's damning.
-
-### Anti-Pattern 4: Reluctant Generosity
-**What it looks like**: "I'm not saying they did anything wrong, BUT..."
-**Why wrong**: The "but" negates the generosity. This is grievance wearing a generous mask. Readers hear the "but" louder than the generosity.
-**Do instead**: Be generous without qualification, or acknowledge the complexity directly. "This is a complicated situation" is more honest than "They're great, BUT..."
-
----
-
-## Integration
-
-This skill integrates with the content validation pipeline:
-
-```
-CONTENT --> voice-validator (deterministic) --> scan-ai-patterns (deterministic)
- --> scan-negative-framing (regex pre-filter) --> joy-check (LLM analysis)
- --> anti-ai-editor (LLM style fixes)
-```
-
-The joy-check can be invoked standalone via `/joy-check [file]` or as part of the content pipeline for any content where positive framing matters.
-
----
-
 ## References
-This skill uses these shared patterns:
-- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations
-- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks
-
-### Domain-Specific Anti-Rationalization
-
-| Rationalization | Why It's Wrong | Required Action |
-|-----------------|----------------|-----------------|
-| "The content is factual, so it's fine" | Facts arranged as prosecution are framing, not neutrality | Evaluate the *arrangement* of facts, not just their accuracy |
-| "The author earned the right to be upset" | Earned anger is still grievance framing | Check framing, not whether the feeling is justified |
-| "It's only one negative paragraph" | One GRIEVANCE paragraph poisons the tonal arc of the whole piece | Flag it.
One grievance paragraph is a FAIL condition |
-| "The reframe would be dishonest" | Reframing is editorial craft, not dishonesty -- the substance stays the same | Preserve substance, change only the lens |
-| "This is too subtle to flag" | Subtle grievance is the hardest to catch and the most important -- it's what regex misses | If it reads as building a case, flag it |
-
-### Related Skills and Scripts
 - `scan-negative-framing.py` -- Regex pre-filter for obvious negative framing patterns (Phase 1)
 - `voice-validator` -- Voice fidelity validation (complementary, different concern)
 - `anti-ai-editor` -- AI pattern detection and removal (complementary, different concern)
diff --git a/skills/kubernetes-debugging/SKILL.md b/skills/kubernetes-debugging/SKILL.md
index 8e871ded..11db95b6 100644
--- a/skills/kubernetes-debugging/SKILL.md
+++ b/skills/kubernetes-debugging/SKILL.md
@@ -9,24 +9,15 @@ agent: kubernetes-helm-engineer
 # Kubernetes Debugging Skill
-## Operator Context
-
-This skill operates as an operator for Kubernetes debugging workflows, configuring Claude's behavior for systematic diagnosis of pod failures, networking issues, and resource problems. It encodes a structured triage flow -- describe, logs, events, exec -- as the default approach rather than guesswork.
- -### Hardcoded Behaviors (Always Apply) -- **Systematic Triage**: Always follow the describe -> logs -> events -> exec flow before proposing a fix -- **Evidence Before Action**: Gather diagnostic output first; never suggest changes based on assumptions -- **Non-Destructive First**: Use read-only commands (describe, logs, get) before any modifications -- **Check Previous Logs**: Always check `--previous` logs for crashed containers before current logs -- **Namespace Awareness**: Always specify `-n ` explicitly; never rely on the default context namespace - ---- +Systematic diagnosis of pod failures, networking issues, and resource problems using a structured triage flow: describe, logs, events, exec. ## Instructions ### Step 1: Systematic Debugging Flow -Follow this sequence for every pod or workload issue. Do not skip steps. +Follow this sequence for every pod or workload issue. Do not skip steps -- many failures (scheduling, image pull, volume mount) are only visible in events and describe output, not in logs, so jumping straight to logs misses them. + +Always specify `-n ` explicitly in every command; never rely on the default context namespace, because the wrong namespace silently returns empty or misleading results. ```bash # 1. Get an overview of the resource state @@ -39,6 +30,8 @@ kubectl describe pod -n kubectl logs -n -c # 4. Check previous container logs (critical for CrashLoopBackOff) +# Always check --previous before current logs for crashed containers, +# because deleting or restarting the pod destroys these logs permanently. kubectl logs -n -c --previous # 5. Check namespace events sorted by time @@ -48,11 +41,11 @@ kubectl get events -n --sort-by='.lastTimestamp' kubectl exec -it -n -c -- /bin/sh ``` -### Step 2: CrashLoopBackOff Diagnosis +Use read-only commands (describe, logs, get) to gather evidence before proposing any modifications. Never suggest changes based on assumptions -- gather diagnostic output first. 
-CrashLoopBackOff means the container starts, exits, and Kubernetes restarts it with exponential backoff. +### Step 2: CrashLoopBackOff Diagnosis -**Common causes and diagnosis:** +CrashLoopBackOff means the container starts, exits, and Kubernetes restarts it with exponential backoff. Do not `kubectl delete pod` to "fix" this -- the replacement pod will crash the same way, and you lose the previous container's logs. Read `--previous` logs and describe events first. **OOMKilled** -- container exceeded memory limit: @@ -65,7 +58,7 @@ kubectl describe pod -n | grep -A 5 "Last State" kubectl top pod -n ``` -Fix: Increase `resources.limits.memory` or fix the memory leak in the application. +Fix: Increase `resources.limits.memory` or fix the memory leak in the application. Do not blindly increase limits without checking actual usage first -- over-provisioning wastes cluster resources, and a memory leak will eventually exceed any limit you set. Run `kubectl top pod` under realistic load, then set limits to 1.5-2x observed peak. **Application configuration error** -- missing env vars, bad config file, wrong DB host: @@ -141,6 +134,8 @@ kubectl get pod -n -o jsonpath='{.spec.containers[0].imag docker pull ``` +Do not trust `:latest` tags -- they are mutable, so the image you tested locally may differ from what the node pulled. Always use image digests or immutable tags, and verify the exact image reference with the jsonpath command above. + **Network issues** -- node cannot reach the registry: ```bash @@ -346,42 +341,6 @@ Solution: Check cluster version with `kubectl version`. For older clusters, crea --- -## Anti-Patterns - -### Anti-Pattern 1: Deleting and Recreating Instead of Diagnosing -**What it looks like**: `kubectl delete pod ` to "fix" CrashLoopBackOff without reading logs. -**Why wrong**: The replacement pod will crash the same way. You lose the previous container's logs. -**Do instead**: Read `--previous` logs and describe events before any destructive action. 
-
-### Anti-Pattern 2: Guessing Resource Limits
-**What it looks like**: Setting memory limit to 2Gi "just in case" without checking actual usage.
-**Why wrong**: Over-provisioning wastes cluster resources; under-provisioning causes OOMKills.
-**Do instead**: Run `kubectl top pod` under realistic load, then set limits to 1.5-2x observed peak.
-
-### Anti-Pattern 3: Ignoring Events
-**What it looks like**: Jumping straight to logs without checking `kubectl describe` or events.
-**Why wrong**: Many failures (scheduling, image pull, volume mount) are only visible in events, not logs.
-**Do instead**: Always run `kubectl describe` and `kubectl get events` as part of the triage flow.
-
-### Anti-Pattern 4: Using Latest Tag in Debugging
-**What it looks like**: "It works locally with :latest" -- but the cluster pulled a different :latest.
-**Why wrong**: Tags are mutable; the image you tested locally may not match what the node pulled.
-**Do instead**: Always use image digests or immutable tags. Check the exact image with `kubectl get pod -o jsonpath='{.spec.containers[0].image}'`.
-
----
-
 ## References
-This skill uses these shared patterns:
-- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations
-- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks
-
-### Domain-Specific Anti-Rationalization
-
-| Rationalization | Why It's Wrong | Required Action |
-|-----------------|----------------|-----------------|
-| "Just restart the pod" | Masks root cause; same crash will recur | Read logs and events first |
-| "Works on my machine" | Local Docker != cluster (networking, RBAC, resources) | Reproduce in cluster context |
-| "Logs show nothing" | Did you check --previous? Events? Init containers?
| Exhaust all diagnostic sources |
-| "Must be a cluster problem" | 90% of issues are in the workload spec, not the cluster | Check the pod spec first |
-| "I'll increase limits to fix OOM" | May mask a memory leak that gets worse over time | Profile the application first |
+- [kubernetes-security skill](../kubernetes-security/SKILL.md) -- NetworkPolicy patterns and RBAC debugging
diff --git a/skills/kubernetes-security/SKILL.md b/skills/kubernetes-security/SKILL.md
index f6b0ac75..9cccaf7c 100644
--- a/skills/kubernetes-security/SKILL.md
+++ b/skills/kubernetes-security/SKILL.md
@@ -9,24 +9,13 @@ agent: kubernetes-helm-engineer
 # Kubernetes Security Skill
-## Operator Context
-
-This skill operates as an operator for Kubernetes security hardening workflows, configuring Claude's behavior for secure-by-default cluster and workload configurations. It encodes RBAC, pod security, network isolation, secret management, and supply chain security as non-negotiable constraints.
-
-### Hardcoded Behaviors (Always Apply)
-- **Least Privilege**: Every Role, ClusterRole, and ServiceAccount gets only the permissions it needs -- never wildcards in production
-- **No Privileged Containers**: Containers must not run as privileged or with elevated capabilities unless explicitly justified
-- **No Plain-Text Secrets**: Never store secrets in ConfigMaps, environment variables from manifests, or checked-in YAML
-- **Network Deny-by-Default**: Namespaces should have a default-deny NetworkPolicy before allow rules are added
-- **Non-Root by Default**: All containers run as non-root with a read-only root filesystem unless there is a documented exception
-
----
+Harden Kubernetes clusters and workloads through RBAC, pod security, network isolation, secret management, and supply chain controls.
 ## Instructions
 ### Step 1: RBAC -- Least-Privilege Roles and Bindings
-Grant the minimum permissions required. Prefer namespace-scoped Roles over ClusterRoles.
+Grant the minimum permissions required. Prefer namespace-scoped Roles over ClusterRoles. Never use wildcard verbs or resources in production -- even in dev clusters, because dev habits carry forward and dev manifests get promoted. Write exact verbs and resources every time.
 ```yaml
 # Good: namespace-scoped Role with specific verbs and resources
@@ -65,7 +54,7 @@ ServiceAccount best practices:
 ### Step 2: PodSecurityStandards -- Baseline vs Restricted
-Kubernetes PodSecurity admission replaces the deprecated PodSecurityPolicy. Apply labels at the namespace level.
+Kubernetes PodSecurity admission replaces the deprecated PodSecurityPolicy. Apply labels at the namespace level. All containers must run as non-root with a read-only root filesystem unless there is a documented exception -- if an app claims it needs root, it almost never does; it usually just needs a writable `/tmp`, which an emptyDir volume solves.
 ```yaml
 # Enforce restricted profile, warn on baseline violations
@@ -116,7 +105,7 @@ Key differences:
 ### Step 3: Network Policies -- Default Deny and Allow-Lists
-Start with a default-deny policy for both ingress and egress in every namespace.
+Start with a default-deny policy for both ingress and egress in every namespace. Apply this on day one, not later -- without network policies, lateral movement between compromised pods is trivial.
 ```yaml
 # Default deny all traffic in the namespace
@@ -179,7 +168,7 @@ spec:
 ### Step 4: Secret Management
-Never use plain Kubernetes Secrets checked into Git. Use one of these approaches:
+Never store secrets in ConfigMaps, environment variables from manifests, or checked-in YAML. Secrets exposed as env vars are visible in `kubectl describe pod` output, which makes them trivially discoverable after any pod compromise.
Use one of these approaches instead:
 **Sealed Secrets** -- encrypts secrets client-side so they are safe in Git:
@@ -220,6 +209,8 @@ Avoid these patterns:
 ### Step 5: Image Security
+Containers must not run as privileged or with elevated capabilities unless explicitly justified -- privileged mode grants full host access to an attacker if the pod is compromised. Use specific capabilities or debug containers instead.
+
 Build minimal, non-root container images:
 ```dockerfile
@@ -293,8 +284,6 @@ Watch for these frequent security mistakes:
 | Latest tag without digest | Image can change without notice | Pin by digest |
 | Secrets as env vars in pod spec | Visible in `kubectl describe` | Mount as files or use external secrets |
----
-
 ## Error Handling
 ### Error: Pod rejected by PodSecurity admission
@@ -309,20 +298,12 @@ Solution: Verify pod labels match the NetworkPolicy `podSelector` and `from`/`to
 Cause: ServiceAccount lacks required permissions.
 Solution: Identify the API group, resource, and verb from the error message. Create or update a Role with the exact permissions needed -- do not add wildcards.
----
-
 ## References
-This skill uses these shared patterns:
-- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations
-- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks
-
-### Domain-Specific Anti-Rationalization
-
-| Rationalization | Why It's Wrong | Required Action |
-|-----------------|----------------|-----------------|
-| "Privileged mode is faster to debug" | Grants full host access to attacker if pod is compromised | Use specific capabilities or debug containers |
-| "Wildcard RBAC is fine for dev" | Dev clusters get promoted; habits carry | Write exact verbs and resources |
-| "Secrets in env vars are convenient" | Visible in process listing, logs, kubectl describe | Mount as files or use external-secrets |
-| "We'll add network policies later" | Lateral movement is trivial without them | Default-deny from day one |
-| "Non-root breaks our app" | Almost never true; app just needs writable /tmp | Add an emptyDir volume for /tmp |
+- [Kubernetes Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
+- [Kubernetes Network Policies](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
+- [Kubernetes RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/)
+- [External Secrets Operator](https://external-secrets.io/)
+- [Sealed Secrets](https://github.com/bitnami-labs/sealed-secrets)
+- [Cosign](https://docs.sigstore.dev/cosign/overview/)
+- [Kyverno](https://kyverno.io/)
diff --git a/skills/learn/SKILL.md b/skills/learn/SKILL.md
index d6193a1e..d00e4e8b 100644
--- a/skills/learn/SKILL.md
+++ b/skills/learn/SKILL.md
@@ -23,78 +23,37 @@ routing:
 # Learn Error Pattern Skill
-## Operator Context
-
-This skill operates as an operator for manual knowledge ingestion, configuring Claude to parse user-provided error-solution pairs and store them in the cross-session learning database.
It implements a **Parse-Classify-Store-Confirm** workflow with deterministic Python scripts performing all database operations.
-
-### Hardcoded Behaviors (Always Apply)
-- **Input Validation**: Never store empty error patterns or empty solutions
-- **Environment Variable Injection Prevention**: Always pass values via environment variables to the Python script, never inline user strings into code
-- **High Confidence**: Manually taught patterns always receive confidence 0.9
-- **Single Pattern Per Invocation**: Store exactly one pattern per `/learn` call
-
-### Default Behaviors (ON unless disabled)
-- **Auto-Classification**: Determine error type and fix type from solution text
-- **Confirmation Output**: Display stored pattern details back to user
-- **Duplicate Handling**: If pattern already exists, update its confidence to 0.9
-
-### Optional Behaviors (OFF unless enabled)
-- **Batch Mode**: Accept multiple patterns from a file (one per line, `error -> solution` format)
-- **Dry Run**: Parse and classify without storing, showing what would be saved
-
-## What This Skill CAN Do
-- Parse "error -> solution" input into structured pattern data
-- Classify fix types (auto, skill, agent, manual) from solution text
-- Store patterns in the unified learning database at 0.9 confidence
-- Update confidence on existing patterns for the same error signature
-
-## What This Skill CANNOT Do
-- Debug live issues (use systematic-debugging instead)
-- Automatically learn from tool errors (that is the error-learner hook)
-- Query or list existing patterns (use `python3 ~/.claude/scripts/learning-db.py query --category error`)
-- Delete or reset the database (remove `~/.claude/learning/learning.db`)
-- Store patterns without user-provided error AND solution text
-
----
+Parse a user-provided "error -> solution" pair, classify it, store it in the cross-session learning database at high confidence, and confirm back. One pattern per invocation.
All database operations go through the `learning-db.py` CLI. ## Instructions -### Phase 1: PARSE +### Step 1: Parse Input -**Goal**: Extract error pattern and solution from user input. +Extract two fields from the user's input: -**Step 1: Identify input format** +- `error_pattern`: The error message or symptom text +- `solution`: The fix or resolution text -Accepted formats: +Accepted input formats: - `/learn "error pattern" -> "solution"` - `/learn "error pattern" => "solution"` - Freeform: "teach that X means Y" or "remember: when X, do Y" -**Step 2: Extract fields** +Both fields must be non-empty. If either is missing, ask the user for the missing part before proceeding. If the error pattern is vague (e.g., "it broke") or the solution is non-actionable (e.g., "fix it"), ask the user to provide the specific error message and concrete fix steps — vague patterns fail to match future errors and waste database space. -- `error_pattern`: The error message or symptom text -- `solution`: The fix or resolution text +### Step 2: Classify Fix Type -**Gate**: Both error_pattern and solution are non-empty strings. If either is missing, ask the user for the missing part. Do not proceed with empty fields. +Determine `fix_type` and `fix_action` from the solution text by applying these rules in order: -### Phase 2: CLASSIFY - -**Goal**: Determine fix type and action from the solution text. - -Apply these rules in order: -1. Solution contains install command (pip install, npm install, apt install) -> `fix_type=auto`, `fix_action=install_dependency` +1. Solution contains an install command (`pip install`, `npm install`, `apt install`) -> `fix_type=auto`, `fix_action=install_dependency` 2. Solution contains `replace_all` -> `fix_type=auto`, `fix_action=use_replace_all` 3. Solution references a skill name -> `fix_type=skill`, `fix_action=` 4. Solution references an agent name -> `fix_type=agent`, `fix_action=` 5. 
Otherwise -> `fix_type=manual`, `fix_action=apply_suggestion` -**Gate**: fix_type and fix_action are determined. Proceed. - -### Phase 3: STORE +### Step 3: Store Pattern -**Goal**: Persist the pattern to the learning database. - -Execute the storage script using the `learning-db.py` CLI: +Execute the `learning-db.py` CLI to persist the pattern. Always pass user-provided strings as CLI arguments exactly as shown — never inline them into Python code via f-strings or string concatenation, because quotes or special characters in error text will break the script and create injection risk. ```bash python3 ~/.claude/scripts/learning-db.py record \ @@ -105,7 +64,9 @@ python3 ~/.claude/scripts/learning-db.py record \ --confidence 0.9 ``` -Where `` is the classified type (e.g., "missing_file", "multiple_matches") and `` is a kebab-case key derived from the error pattern. +- ``: The classified type (e.g., "missing_file", "multiple_matches") +- ``: A kebab-case key derived from the error pattern +- Confidence is always 0.9 for manually taught patterns. If the pattern already exists, this updates its confidence to 0.9. Example: ```bash @@ -117,13 +78,12 @@ python3 ~/.claude/scripts/learning-db.py record \ --confidence 0.9 ``` -**Gate**: Script exits 0 and prints confirmation. If script fails, see Error Handling. +The script must exit 0 and print confirmation. If it fails, see Error Handling below. -### Phase 4: CONFIRM +### Step 4: Confirm to User -**Goal**: Report stored pattern back to the user. +Always display what was stored so the user can verify correctness — silently storing without confirmation hides typos and misclassifications: -Display: ``` Learned pattern: Error: "" @@ -132,10 +92,6 @@ Learned pattern: Confidence: 0.9 ``` -**Gate**: User sees confirmation. Skill complete. - ---- - ## Error Handling ### Error: "Script fails with ImportError or FileNotFoundError" @@ -150,27 +106,6 @@ Solution: Retry after 2 seconds. 
If persistent, check for hung processes with `l
 Cause: Incomplete input
 Solution: Ask the user explicitly for the solution text. Do not guess or fabricate solutions.
----
-
-## Anti-Patterns
-
-### Anti-Pattern 1: Storing Vague Patterns
-**What it looks like**: `/learn "it broke" -> "fix it"`
-**Why wrong**: Pattern too vague to match future errors. Solution provides no actionable guidance.
-**Do instead**: Ask user to provide the specific error message and concrete fix steps.
-
-### Anti-Pattern 2: Inlining User Strings Into Python Code
-**What it looks like**: Building Python code with f-strings containing user input
-**Why wrong**: Injection risk. Quotes or special characters in error text break the script.
-**Do instead**: Always pass via environment variables as shown in Phase 3.
-
-### Anti-Pattern 3: Storing Without Confirming Back
-**What it looks like**: Running the script and saying "done" without showing what was stored
-**Why wrong**: User cannot verify correctness. Typos go unnoticed.
-**Do instead**: Always complete Phase 4 with the full confirmation output.
-
----
-
 ## References
 - `hooks/lib/learning_db_v2.py`: Unified learning database module
diff --git a/skills/link-auditor/SKILL.md b/skills/link-auditor/SKILL.md
index 892e9753..7d2941a9 100644
--- a/skills/link-auditor/SKILL.md
+++ b/skills/link-auditor/SKILL.md
@@ -26,58 +26,19 @@ routing:
 # Link Auditor Skill
-## Operator Context
-
-This skill operates as an operator for link health analysis on Hugo static sites, configuring Claude's behavior for comprehensive, non-destructive link auditing. It implements the **Pipeline** architectural pattern -- Scan, Analyze, Validate, Report -- with **Domain Intelligence** embedded in Hugo path resolution and SEO link graph metrics.
- -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before auditing -- **Non-Destructive**: Never modify content files without explicit user request -- **Complete Output**: Show all findings; never summarize or abbreviate issue lists -- **Issue Classification**: Clearly distinguish critical issues (orphans, broken links) from suggestions (under-linked) -- **Hugo Path Awareness**: Try multiple path resolutions before reporting a link as broken - -### Default Behaviors (ON unless disabled) -- **Full Scan**: Analyze all markdown files in content/ -- **Graph Analysis**: Build and analyze internal link adjacency graph -- **Image Validation**: Check all image paths exist in static/ -- **Skip External Validation**: Do not HTTP-check external URLs (enable with --check-external) -- **Issues-Only Output**: Show only problems, not all valid links - -### Optional Behaviors (OFF unless enabled) -- **External Link Validation**: HTTP HEAD check on external URLs (--check-external) -- **Verbose Mode**: Show all links including valid ones (--verbose) -- **Custom Inbound Threshold**: Flag pages with fewer than N inbound links (--min-inbound N) - -## What This Skill CAN Do -- Extract internal, external, and image links from Hugo markdown content -- Build adjacency matrix of internal link relationships -- Identify orphan pages (0 inbound internal links) and under-linked pages -- Detect link sinks (receive links, no outbound) and hub pages (many outbound) -- Validate internal link paths resolve to real content files -- Validate image files exist in static/ -- Optionally validate external URLs via HTTP HEAD requests -- Handle known false positives (LinkedIn, Twitter block bot requests) -- Generate audit reports with actionable fix suggestions - -## What This Skill CANNOT Do -- Validate external URLs by default (network latency, rate limiting concerns) -- Guarantee external link accuracy (social media sites block bots) -- 
Automatically fix broken links or add missing links -- Analyze JavaScript-rendered content or Hugo shortcodes beyond standard patterns -- Replace pre-publish-checker for single-post validation +Hugo site link health analysis through a 4-phase pipeline: Scan, Analyze, Validate, Report. Extracts internal, external, and image links from Hugo markdown content; builds an adjacency matrix of internal link relationships; identifies orphan pages, under-linked pages, link sinks, and hub pages; validates that link targets resolve to real files; and generates audit reports with actionable fix suggestions. ---- +Read and follow the repository CLAUDE.md before starting any audit. ## Instructions ### Phase 1: SCAN -**Goal**: Extract all links from markdown files and classify them by type. +**Goal**: Extract all links from every markdown file and classify them by type. **Step 1: Identify content root** -Locate the Hugo content directory and enumerate all markdown files: +Scan all markdown files in `content/` because even small sites with 10 posts can have orphan pages, and partial scans miss graph-level issues. Locate the Hugo content directory and enumerate all markdown files: ```bash # TODO: scripts/link_scanner.py not yet implemented @@ -87,7 +48,7 @@ grep -rn '\[.*\](.*' ~/your-blog/content/ --include="*.md" **Step 2: Extract links by type** -Parse each markdown file for three link categories: +Parse each markdown file for three link categories. Classify by type to understand link distribution: Internal Links: - `[text](/posts/slug/)` -- absolute internal path @@ -116,7 +77,7 @@ Record total internal, external, and image links per file for the summary. **Step 1: Build adjacency matrix** -Map every internal link to its source and target: +Always build the adjacency matrix and compute inbound link counts because orphan pages are invisible to search crawlers and this is often the highest-impact finding an audit produces. 
Map every internal link to its source and target: ``` Page A -> Page B (A links to B) @@ -131,12 +92,14 @@ Page E -> (no outbound, no inbound = orphan) | Metric | Definition | SEO Impact | |--------|------------|------------| | Orphan Pages | 0 inbound internal links | Critical -- invisible to crawlers | -| Under-Linked | < N inbound links (default 2) | Missed SEO opportunity | +| Under-Linked | < N inbound links (default 2, adjustable with --min-inbound N) | Missed SEO opportunity | | Link Sinks | Receives links, no outbound | May indicate incomplete content | | Hub Pages | Many outbound links | Good for navigation | **Step 3: Classify findings by severity** +Clearly distinguish critical issues from suggestions because they require different urgency levels. Organize all findings by impact: + - **Critical**: Orphan pages, broken internal links, missing images - **Warning**: Under-linked pages, link sinks - **Info**: Hub pages, external link stats @@ -149,7 +112,8 @@ Page E -> (no outbound, no inbound = orphan) **Step 1: Validate internal links** -For each internal link target: +For each internal link target, try all Hugo path resolutions before reporting a link as broken because Hugo resolves paths through multiple conventions. Check these resolutions in order: + 1. Parse the link target path 2. Try Hugo path resolutions: `content/posts/slug.md`, `content/posts/slug/index.md`, `content/posts/slug/_index.md` 3. Mark as broken only if ALL resolutions fail @@ -157,21 +121,23 @@ For each internal link target: **Step 2: Validate image paths** -For each image reference: +Check all image paths against `static/` because missing images are critical issues. Validate both absolute and relative interpretations: + 1. Parse image source path (absolute or relative) -2. Map to static/ directory +2. Map to static/ directory, checking both absolute and relative interpretations 3. Check file exists 4. 
Record source file and line number for missing images **Step 3: Validate external links (optional)** -Only when `--check-external` is enabled: +Skip external URL validation by default because network latency, rate limiting, and bot-blocking make results unreliable. Only run validation when explicitly enabled with `--check-external` flag. When enabled, follow these steps: + 1. HTTP HEAD request to URL 2. Follow redirects (up to 3) 3. Check response status code -4. Mark known false positives as "blocked (expected)" not broken +4. Report known bot-blocked sites as "blocked (expected)" not broken because LinkedIn (403), Twitter/X (403/999), and Facebook actively block automated requests while links work fine in browsers -Known false positives: LinkedIn (403), Twitter/X (403/999), Facebook (varies). +Use `--verbose` to include valid links in the output (default: issues only). **Gate**: All link targets checked. Broken links have file and line numbers. External results (if enabled) distinguish real failures from false positives. Proceed only when gate passes. @@ -179,6 +145,8 @@ Known false positives: LinkedIn (403), Twitter/X (403/999), Facebook (varies). **Goal**: Present findings in a structured, actionable audit report. +Never modify content files during this phase because users must approve all content changes. Report findings with specific suggestions and let the user decide which fixes to apply. + **Step 1: Generate summary header** ``` @@ -195,7 +163,7 @@ Known false positives: LinkedIn (403), Twitter/X (403/999), Facebook (varies). **Step 2: Report by severity** -List critical issues first (orphans, broken links, missing images), then warnings (under-linked, sinks), then info (hubs, valid external counts). +List critical issues first (orphans, broken links, missing images), then warnings (under-linked, sinks), then info (hubs, valid external counts). Show all findings without summarizing or abbreviating because partial issue lists hide problems. 
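The severity ordering described above can be sketched in a few lines. This is a minimal illustration, not part of the skill: the `Finding` class, its field names, and the sample paths in the usage note are hypothetical, and only the three severity levels and the "show everything" rule come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Lower rank sorts first: critical, then warning, then info.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


@dataclass
class Finding:
    """One audit finding (hypothetical shape, for illustration only)."""
    severity: str
    kind: str
    path: str
    line: Optional[int]
    suggestion: str


def order_findings(findings):
    """Sort by severity first, then by file path and line for stable output."""
    return sorted(
        findings,
        key=lambda f: (SEVERITY_RANK[f.severity], f.path, f.line or 0),
    )


def render_report(findings):
    """Render every finding, one per row; nothing is summarized or dropped."""
    rows = []
    for f in order_findings(findings):
        location = f.path if f.line is None else f"{f.path}:{f.line}"
        rows.append(f"[{f.severity.upper()}] {f.kind} at {location} -> {f.suggestion}")
    return "\n".join(rows)
```

Calling `render_report` on a mixed list of hub, sink, and orphan findings returns one row per finding with the orphan rows first, so the report length always equals the number of findings.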
Each issue must include: - File path @@ -216,6 +184,8 @@ Conclude with numbered, actionable recommendations ordered by impact: =============================================================== ``` +Always run the full 4-phase audit regardless of how few issues appear because link rot is progressive and orphan pages are invisible without graph analysis. + **Gate**: Report generated with all findings. Every issue has a file path and actionable suggestion. Audit is complete. --- @@ -245,46 +215,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Treating Bot-Blocked Sites as Broken -**What it looks like**: Reporting LinkedIn/Twitter links as broken when they return 403/999. -**Why wrong**: These sites actively block bot requests. Links work fine in browsers. -**Do instead**: Maintain false-positives list. Report as "blocked (expected)" not broken. - -### Anti-Pattern 2: Skipping Graph Analysis -**What it looks like**: Only checking for broken links without analyzing the link graph. -**Why wrong**: Orphan pages are invisible to search crawlers. This is often the highest-impact finding. -**Do instead**: Always build the adjacency matrix and compute inbound link counts. - -### Anti-Pattern 3: Literal Path Matching Without Hugo Resolution -**What it looks like**: Treating `/posts/slug/` as a literal file path and reporting it broken. -**Why wrong**: Hugo resolves paths through multiple conventions (slug.md, slug/index.md, slug/_index.md). -**Do instead**: Try all Hugo path resolutions before reporting a link as broken. - -### Anti-Pattern 4: Modifying Content Without User Consent -**What it looks like**: Automatically adding links to orphan pages or fixing broken paths. -**Why wrong**: This skill is non-destructive. Users must approve all content changes. -**Do instead**: Report findings with specific suggestions. Let the user decide which fixes to apply. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Only 3 broken links, not worth a full audit" | Orphan pages are invisible without graph analysis | Run full 4-phase audit | -| "External links probably still work" | Link rot is progressive and silent | Validate with --check-external periodically | -| "Hugo will resolve it somehow" | Hugo path resolution has specific rules | Test all resolution patterns explicitly | -| "Small site doesn't need link auditing" | Even 10 posts can have orphans | Run audit regardless of site size | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/link-graph-metrics.md`: Graph metrics definitions and SEO impact - `${CLAUDE_SKILL_DIR}/references/false-positives.md`: Sites known to block validation requests - `${CLAUDE_SKILL_DIR}/references/fix-strategies.md`: Resolution strategies for each issue type diff --git a/skills/nano-banana-builder/SKILL.md b/skills/nano-banana-builder/SKILL.md index d8e30a47..3b2f07fa 100644 --- a/skills/nano-banana-builder/SKILL.md +++ b/skills/nano-banana-builder/SKILL.md @@ -35,42 +35,72 @@ routing: # Nano Banana Builder -## Operator Context +Image generation and post-processing via two deterministic Python scripts (`nano-banana-generate.py` and `nano-banana-process.py`). Your job is prompt crafting and flag selection; the scripts handle all mechanical operations. -This skill orchestrates image generation and post-processing via two deterministic Python scripts. The LLM's job is **prompt crafting** (describing what to generate) and **flag selection** (choosing the right script options). 
The scripts handle all mechanical operations. +## Instructions -### Architecture +### 1. Validate the environment -| Component | Path | Purpose | -|-----------|------|---------| -| `nano-banana-generate.py` | `scripts/` | Generate images via Gemini API | -| `nano-banana-process.py` | `scripts/` | Post-process: crop, bg removal, watermarks, format | +Before any generation call, confirm the API key is set. The scripts read `GEMINI_API_KEY` (or `GOOGLE_API_KEY`) from the environment and will fail without it. Checking this first avoids wasted time on API calls that will error immediately. -**Principle**: "LLMs orchestrate. Programs execute." The skill tells you *when* to call *which script* with *which flags*. The scripts do the work deterministically. +### 2. Choose the right script and subcommand -### Hardcoded Behaviors (Always Apply) +Translate user intent into the correct script call. Always call the scripts — never write inline PIL/sharp/image-processing code, because that duplicates tested logic and bypasses the deterministic pipeline. -- **Scripts Only**: ALL generation and processing goes through the Python scripts. Never write inline image processing code. -- **Exact Model Names Only**: Only two models exist: `flash` (gemini-2.5-flash-image) and `pro` (gemini-3-pro-image-preview). The scripts enforce this. -- **Save Originals**: For any batch or expensive generation, use `--save-original` or `--originals-dir` to preserve raw API output before processing. Re-generating costs money; re-processing a saved original is free. -- **API Key Required**: Scripts read `GEMINI_API_KEY` (or `GOOGLE_API_KEY`) from environment. Validate before running. -- **Rate Limiting**: Use `--delay` for batch operations. Default 2s for flash, 3s for pro. 
+| User wants | Script + subcommand | Key flags | +|------------|-------------------|-----------| +| Generate a single image | `nano-banana-generate.py generate` | `--prompt`, `--model`, `--aspect-ratio` | +| Generate a sprite/character | `nano-banana-generate.py generate` | `--model pro`, `--aspect-ratio 1:1` | +| Generate card art | `nano-banana-generate.py generate` | `--model flash`, `--aspect-ratio 16:9` | +| Generate backgrounds | `nano-banana-generate.py generate` | `--model flash`, `--aspect-ratio 9:16` (vertical) or `16:9` (landscape) | +| Style-match a reference | `nano-banana-generate.py with-reference` | `--reference`, `--prompt` | +| Batch generate enemies/items | `nano-banana-generate.py batch` | `--manifest`, `--skip-existing`, `--delay 3` | +| Crop to exact dimensions | `nano-banana-process.py crop` | `--width`, `--height`, `--bias` | +| Make background transparent | `nano-banana-process.py remove-bg` | `--bg-color`, `--tolerance` | +| Clean up watermarks | `nano-banana-process.py remove-watermarks` | `--margin`, `--threshold` | +| Convert format | `nano-banana-process.py convert` | `--format`, `--quality` | +| Full sprite pipeline | `nano-banana-process.py pipeline` | `--remove-bg`, `--remove-watermarks`, crop flags | +| Batch reprocess originals | `nano-banana-process.py pipeline` | directory input, crop + format flags | -### Default Behaviors (ON unless disabled) +### 3. Select the model -- **Flash for iterations, Pro for finals**: Use `--model flash` for drafts and experimentation, `--model pro` for final output -- **Skip existing in batch**: Use `--skip-existing` to avoid re-generating images that already exist -- **Top-biased crop for characters**: Use `--bias 0.35` when cropping character/sprite art to preserve heads +Only two model aliases exist: `flash` (gemini-2.5-flash-image) and `pro` (gemini-3-pro-image-preview). 
The scripts enforce this: never pass raw Gemini model strings like `gemini-2.5-flash-preview-05-20`, because they don't support image generation and the call will fail. Use `--model flash` for drafts and iteration (fast, 2-5s). Use `--model pro` for final output (higher quality, ~30s). ---- +### 4. Match aspect ratio to target shape + +Set `--aspect-ratio` to match the final use so the generated image fills the frame. Generating 1:1 then cropping to 16:9 keeps only about 56% of the pixels (roughly 44% are discarded) and wastes quality. Valid ratios: 1:1, 2:3, 3:2, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9. + +- **Sprites/characters**: 1:1 +- **Card art**: 16:9 +- **Vertical maps/backgrounds**: 9:16 +- **Landscape arenas**: 16:9 + +### 5. Save originals + +For any batch or expensive generation, use `--save-original` (single) or `--originals-dir` (batch) to preserve raw API output before processing. Re-generating costs money and quota; re-processing a saved original is free. This enables experimentation with different crop/bg-removal settings without re-invoking the API. + +### 6. Craft the prompt -## Scripts Reference +Your unique contribution is writing effective prompts for the Gemini API.
-### nano-banana-generate.py +**Sprites/Characters** (use with `--model pro`): +- Always specify: "solid dark gray background color only" (enables bg removal with `--bg-color 3a3a3a`) +- Always specify: "ONE character only, full body visible from head to feet, centered in frame" +- Always specify: "no text, no labels, no background details" +- Describe art style explicitly: "Slay the Spire card game style, heavy ink outlines, golden glowing outline" + +**Card Art** (use with `--model flash`): +- Specify composition: "WIDE SHOT, full bodies with space around them" +- Specify style: "sketchy rough painterly, muted desaturated sepia palette" +- Include context: "wrestling ring ropes in background" + +**Backgrounds** (use with `--model flash`): +- Specify darkness: "Very dark overall (UI elements need to be readable on top)" +- Specify no characters: "NO text, NO labels, NO characters" -**Dependencies**: `pip install google-genai pillow` +### 7. Generate -#### generate — Single image +#### Single image ```bash python3 ~/.claude/scripts/nano-banana-generate.py generate \ @@ -83,14 +113,14 @@ python3 ~/.claude/scripts/nano-banana-generate.py generate \ | Flag | Required | Default | Description | |------|----------|---------|-------------| -| `--prompt` | Yes | — | Text prompt for generation | -| `--output` | Yes | — | Output file path (.png or .jpg) | +| `--prompt` | Yes | -- | Text prompt for generation | +| `--output` | Yes | -- | Output file path (.png or .jpg) | | `--model` | No | flash | `flash` (fast, 2-5s) or `pro` (quality, ~30s) | | `--aspect-ratio` | No | model default | 1:1, 2:3, 3:2, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9 | -| `--reference` | No | — | Reference image for style matching | -| `--save-original` | No | — | Save raw API output to this path | +| `--reference` | No | -- | Reference image for style matching | +| `--save-original` | No | -- | Save raw API output to this path | -#### with-reference — Style transfer +#### Style transfer (with-reference) 
```bash python3 ~/.claude/scripts/nano-banana-generate.py with-reference \ @@ -102,7 +132,9 @@ python3 ~/.claude/scripts/nano-banana-generate.py with-reference \ Same flags as `generate` but `--reference` is required. -#### batch — Multiple images from manifest +#### Batch generation + +Use `--delay` to stay within rate limits (default 2s for flash, 3s for pro). Use `--skip-existing` to avoid re-generating images that already exist. ```bash python3 ~/.claude/scripts/nano-banana-generate.py batch \ @@ -115,18 +147,20 @@ python3 ~/.claude/scripts/nano-banana-generate.py batch \ | Flag | Required | Default | Description | |------|----------|---------|-------------| -| `--manifest` | Yes | — | JSON file: `[{"id": "name", "prompt": "...", "reference": "optional.png"}]` | -| `--output-dir` | Yes | — | Directory for generated images | -| `--originals-dir` | No | — | Save raw API outputs here | +| `--manifest` | Yes | -- | JSON file: `[{"id": "name", "prompt": "...", "reference": "optional.png"}]` | +| `--output-dir` | Yes | -- | Directory for generated images | +| `--originals-dir` | No | -- | Save raw API outputs here | | `--skip-existing` | No | off | Skip items with existing output files | | `--variants` | No | 1 | Number of variants per item (1-5) | | `--delay` | No | 2.0 | Seconds between API calls | -### nano-banana-process.py +### 8. Post-process + +Dependencies: `pip install pillow` (generation also needs `pip install google-genai pillow`). -**Dependencies**: `pip install pillow` +#### Crop -#### crop — Smart crop to dimensions +Use `--bias 0.35` when cropping character/sprite art to preserve heads (default 0.5 centers the crop). 
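As a rough illustration of what the bias value controls, the sketch below computes a crop rectangle from a source size, a target size, and a bias. It is not taken from `nano-banana-process.py`: the function name `crop_box`, the even horizontal trim, and the rounding are assumptions, and only the meaning of the bias values (0.0 anchors the top, 0.5 centers, 1.0 anchors the bottom) comes from the flag description.

```python
def crop_box(src_w, src_h, target_w, target_h, bias=0.5):
    """Return (left, top, right, bottom) for the largest region of the source
    that has the target aspect ratio, positioned vertically by `bias`.

    Hypothetical helper for illustration; the real script may differ.
    """
    if not 0.0 <= bias <= 1.0:
        raise ValueError("bias must be between 0.0 and 1.0")

    target_ratio = target_w / target_h
    if src_w / src_h > target_ratio:
        # Source is too wide: keep the full height and trim the sides evenly.
        crop_h = src_h
        crop_w = round(src_h * target_ratio)
    else:
        # Source is too tall: keep the full width and trim top and bottom.
        crop_w = src_w
        crop_h = round(src_w / target_ratio)

    left = (src_w - crop_w) // 2
    # bias=0.0 leaves all spare rows below the crop (anchor top);
    # bias=1.0 leaves them all above it (anchor bottom).
    top = round((src_h - crop_h) * bias)
    return (left, top, left + crop_w, top + crop_h)


print(crop_box(1000, 1000, 800, 400, bias=0.5))   # (0, 250, 1000, 750)
print(crop_box(1000, 1000, 800, 400, bias=0.35))  # (0, 175, 1000, 675)
```

With a 1000x1000 source and an 800x400 target, a bias of 0.35 starts the crop 175 pixels from the top instead of 250, which keeps more of the upper part of the frame where a character's head usually sits.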
```bash python3 ~/.claude/scripts/nano-banana-process.py crop \ @@ -136,11 +170,11 @@ python3 ~/.claude/scripts/nano-banana-process.py crop \ | Flag | Required | Default | Description | |------|----------|---------|-------------| -| `--width` | Yes | — | Target width in pixels | -| `--height` | Yes | — | Target height in pixels | +| `--width` | Yes | -- | Target width in pixels | +| `--height` | Yes | -- | Target height in pixels | | `--bias` | No | 0.5 | 0.0=anchor top, 0.35=keep top, 0.5=center, 1.0=anchor bottom | -#### remove-bg — Background removal +#### Background removal ```bash python3 ~/.claude/scripts/nano-banana-process.py remove-bg \ @@ -154,11 +188,11 @@ python3 ~/.claude/scripts/nano-banana-process.py remove-bg \ | `--tolerance` | 30 | Color variance (0-255) | Common background colors for Gemini prompts: -- `3a3a3a` — dark gray ("solid dark gray background") -- `ffffff` — white ("solid white background") -- `000000` — black ("solid black background") +- `3a3a3a` -- dark gray ("solid dark gray background") +- `ffffff` -- white ("solid white background") +- `000000` -- black ("solid black background") -#### remove-watermarks — Corner cleanup +#### Watermark removal ```bash python3 ~/.claude/scripts/nano-banana-process.py remove-watermarks \ @@ -166,7 +200,7 @@ python3 ~/.claude/scripts/nano-banana-process.py remove-watermarks \ input.png output.png ``` -#### convert — Format conversion +#### Format conversion ```bash python3 ~/.claude/scripts/nano-banana-process.py convert \ @@ -176,9 +210,9 @@ python3 ~/.claude/scripts/nano-banana-process.py convert \ Formats: `png` (lossless, supports transparency), `jpeg` (smaller, no alpha), `webp` (best compression). -#### pipeline — Full processing chain +#### Full pipeline (chained processing) -Runs all steps in order: watermarks → background → crop → format. +Runs all steps in order: watermarks -> background -> crop -> format. 
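The fixed ordering can be pictured with a small sketch. It only models the order stated above; the step labels and the `plan_steps` helper are hypothetical names chosen for illustration and are not flags or functions of `nano-banana-process.py`.

```python
# Execution order as stated in the text: watermarks -> background -> crop -> format.
STEP_ORDER = ("watermarks", "background", "crop", "format")


def plan_steps(requested):
    """Return the requested steps in pipeline order, ignoring the input order."""
    requested = set(requested)
    unknown = requested - set(STEP_ORDER)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    return [step for step in STEP_ORDER if step in requested]


print(plan_steps(["crop", "watermarks"]))  # ['watermarks', 'crop']
```

Keeping the order in a single tuple means the sequence does not depend on how the caller happens to list the steps.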
```bash # Single file @@ -195,52 +229,9 @@ python3 ~/.claude/scripts/nano-banana-process.py pipeline \ staging/originals/ output/cards/ ``` ---- - -## Intent-to-Script Mapping +### 9. End-to-end workflow examples -The LLM's job is to translate user intent into the right script call. Here's the mapping: - -| User wants | Script + subcommand | Key flags | -|------------|-------------------|-----------| -| Generate a single image | `nano-banana-generate.py generate` | `--prompt`, `--model`, `--aspect-ratio` | -| Generate a sprite/character | `nano-banana-generate.py generate` | `--model pro`, `--aspect-ratio 1:1` | -| Generate card art | `nano-banana-generate.py generate` | `--model flash`, `--aspect-ratio 16:9` | -| Generate backgrounds | `nano-banana-generate.py generate` | `--model flash`, `--aspect-ratio 9:16` (vertical) or `16:9` (landscape) | -| Style-match a reference | `nano-banana-generate.py with-reference` | `--reference`, `--prompt` | -| Batch generate enemies/items | `nano-banana-generate.py batch` | `--manifest`, `--skip-existing`, `--delay 3` | -| Crop to exact dimensions | `nano-banana-process.py crop` | `--width`, `--height`, `--bias` | -| Make background transparent | `nano-banana-process.py remove-bg` | `--bg-color`, `--tolerance` | -| Clean up watermarks | `nano-banana-process.py remove-watermarks` | `--margin`, `--threshold` | -| Convert format | `nano-banana-process.py convert` | `--format`, `--quality` | -| Full sprite pipeline | `nano-banana-process.py pipeline` | `--remove-bg`, `--remove-watermarks`, crop flags | -| Batch reprocess originals | `nano-banana-process.py pipeline` | directory input, crop + format flags | - -### Prompt Crafting Guidelines - -The LLM's unique contribution is writing effective prompts. 
Key patterns: - -**Sprites/Characters** (use with `--model pro`): -- Always specify: "solid dark gray background color only" (for bg removal with `--bg-color 3a3a3a`) -- Always specify: "ONE character only, full body visible from head to feet, centered in frame" -- Always specify: "no text, no labels, no background details" -- Describe art style explicitly: "Slay the Spire card game style, heavy ink outlines, golden glowing outline" - -**Card Art** (use with `--model flash`): -- Specify composition: "WIDE SHOT, full bodies with space around them" -- Specify style: "sketchy rough painterly, muted desaturated sepia palette" -- Include context: "wrestling ring ropes in background" - -**Backgrounds** (use with `--model flash`): -- Specify darkness: "Very dark overall (UI elements need to be readable on top)" -- Specify no characters: "NO text, NO labels, NO characters" -- Match aspect ratio to use: 9:16 for vertical scrolling, 16:9 for landscape arenas - ---- - -## Workflows - -### Generate + Process (single image) +#### Generate + Process (single image) ```bash # 1. Generate raw sprite @@ -258,7 +249,7 @@ python3 ~/.claude/scripts/nano-banana-process.py pipeline \ staging/warrior_raw.png output/warrior.png ``` -### Batch Generate + Batch Process +#### Batch Generate + Batch Process ```bash # 1. 
Create manifest @@ -283,7 +274,7 @@ python3 ~/.claude/scripts/nano-banana-process.py pipeline \ staging/sprites/ output/sprites/ ``` -### Reprocess from originals (no regeneration cost) +#### Reprocess from originals (no regeneration cost) ```bash # Try different crop dimensions without re-generating @@ -292,15 +283,13 @@ python3 ~/.claude/scripts/nano-banana-process.py pipeline \ staging/originals/ output/cards/ ``` ---- - ## Error Handling ### Error: "GEMINI_API_KEY not set" Solution: `export GEMINI_API_KEY=your_key` or `export GOOGLE_API_KEY=your_key` ### Error: "No image in response" -Cause: Prompt may have triggered content safety filters, or API returned text-only +Cause: Prompt may have triggered content safety filters, or API returned text-only. Solution: Adjust prompt. Check for policy-violating content. Try a different phrasing. ### Error: "Missing dependency: google-genai" @@ -309,26 +298,9 @@ Solution: `pip install google-genai pillow` ### Error: "Rate limit exceeded (429)" Solution: Increase `--delay` value. Default 2s may be too aggressive for free tier. ---- - -## Anti-Patterns - -### Anti-Pattern 1: Writing Inline Image Processing -**What it looks like**: Writing PIL/sharp code directly instead of calling the scripts -**Why wrong**: Duplicates tested logic, introduces variance, bypasses the deterministic pipeline -**Do instead**: Call `nano-banana-process.py` with the appropriate subcommand and flags - -### Anti-Pattern 2: Inventing Model Names -**What it looks like**: Using `gemini-2.5-flash-preview-05-20` or `gemini-2.5-pro-image` -**Why wrong**: These strings don't support image generation. The scripts validate model names. -**Do instead**: Use `--model flash` or `--model pro`. The scripts map to correct API strings. 
- -### Anti-Pattern 3: Not Saving Originals -**What it looks like**: Generating and immediately processing without `--save-original` -**Why wrong**: If the crop/processing is wrong, you must re-generate (costs money + quota) -**Do instead**: Always use `--save-original` or `--originals-dir` for non-trivial generation +## References -### Anti-Pattern 4: Wrong Aspect Ratio for Use Case -**What it looks like**: Generating a 1:1 image then cropping to 16:9 (loses 56% of pixels) -**Why wrong**: Generates detail that gets cropped away. Wastes tokens and quality. -**Do instead**: Match `--aspect-ratio` to the target shape. 16:9 for cards, 1:1 for sprites, 9:16 for vertical maps. +| Script | Path | Dependencies | +|--------|------|-------------| +| `nano-banana-generate.py` | `~/.claude/scripts/nano-banana-generate.py` | `google-genai`, `pillow` | +| `nano-banana-process.py` | `~/.claude/scripts/nano-banana-process.py` | `pillow` | diff --git a/skills/pair-programming/SKILL.md b/skills/pair-programming/SKILL.md index 5ddd312c..93dcb633 100644 --- a/skills/pair-programming/SKILL.md +++ b/skills/pair-programming/SKILL.md @@ -31,48 +31,29 @@ routing: # Pair Programming Skill -## Operator Context +Collaborative coding through the **Announce-Show-Wait-Apply-Verify** micro-step protocol. The user controls pace, sees every planned change as a diff, and confirms before any file is modified. Works with any domain agent as executor. -This skill operates as an operator for collaborative coding sessions, configuring Claude's behavior for micro-step code changes where the user controls pace and approves every modification. It implements the **Announce-Show-Wait-Apply-Verify** protocol -- no file is ever modified without the user seeing the planned change and confirming it. - -This skill does NOT use `context: fork` because every step requires an interactive user gate. Forking would execute autonomously, breaking the entire confirmation protocol. 
- -### Hardcoded Behaviors (Always Apply) -- **Never Modify Files Silently**: Every change must go through the 5-step micro-step protocol (announce, show, wait, apply, verify). Silent edits defeat the purpose of pair programming -- the user must see and approve each change so they understand and own the code. -- **Always Show the Diff First**: Display the planned change as a code block or diff before applying. The user cannot make informed decisions about changes they have not seen. -- **Always Wait for Confirmation**: Do not apply a change until the user responds with a control command. Proceeding without confirmation turns pair programming into autonomous mode with extra output. -- **Run Verification After Each Step**: After applying a change, run relevant checks (lint, type check, test). Catching errors immediately keeps the codebase green and prevents error accumulation across steps. -- **Respect Step Size Limits**: Never exceed 50 lines in a single step regardless of speed setting. Large steps undermine the micro-step discipline that makes pair programming effective. -- **Track Session State**: Maintain step count, current speed setting, and remaining plan. The user needs orientation ("Step 3 of ~12") to stay engaged. - -### Default Behaviors (ON unless disabled) -- **Start at 15 Lines Per Step**: Default step size balances progress with reviewability. Adjust via `faster`/`slower` commands. -- **Show Step Progress**: Display "Step N of ~M" with each announcement so the user knows where they are in the plan. -- **Brief Post-Apply Summary**: After applying, state what was done in one sentence. Keeps context without monologuing. -- **Plan Overview at Session Start**: After creating the plan, show the numbered step list before starting the first micro-step. - -### Optional Behaviors (OFF unless enabled) -- **Auto-Verify Mode**: Run lint/test after every step without asking whether to verify. Useful for projects with fast test suites. 
- ---- +This skill runs in the main session (not `context: fork`) because every step requires an interactive user gate — forking would execute autonomously and break the confirmation protocol. ## Instructions ### Session Setup 1. **User describes what they want to build.** Read relevant code to understand the starting point. -2. **Create a high-level plan.** Break the task into numbered steps, each representing one logical change. Show the plan to the user. +2. **Create a high-level plan.** Break the task into numbered steps, each representing one logical change. Show the numbered step list to the user before starting any micro-steps. 3. **Confirm the plan.** Wait for user acknowledgment before starting Step 1. The user may reorder, remove, or add steps. +Maintain step count, current speed setting, and remaining plan throughout the session so you can display "Step N of ~M" with each announcement. The user needs this orientation to stay engaged with the collaborative flow. + ### Micro-Step Protocol (Per Change) For each step in the plan, execute this 5-step protocol: -**1. Announce** -- Describe the next change in 1-2 sentences: what will change and why. +**1. Announce** — Describe the next change in 1-2 sentences: what will change and why. Keep announcements brief (1-2 sentences) before showing code immediately because the user came to code together, not to read an essay. Long explanations break flow and reduce collaborative momentum. -**2. Show** -- Display the planned code as a diff or code block. Never exceed the current step size limit. +**2. Show** — Display the planned code as a diff or code block. The current step size limit (default 15 lines, max 50 lines) exists because users cannot make informed decisions about changes they have not seen. Every change gets shown—even trivial ones, which are fast to approve—so you and the user stay synchronized. Never exceed the limit: split large changes into sub-steps instead. -**3. 
Wait** -- Stop and let the user respond with a control command: +**3. Wait** — Stop and let the user respond with a control command. Do not proceed until you receive an explicit command—assuming the user will say "ok" and applying preemptively turns this into autonomous mode with a running commentary, violating the micro-step protocol. | Command | Action | |---------|--------| @@ -84,14 +65,16 @@ For each step in the plan, execute this 5-step protocol: | `plan` | Show remaining steps overview | | `done` | End pair session, run final verification | -**4. Apply** -- Execute the change only after receiving `ok`/`yes`/`y`. +**4. Apply** — Execute the change only after receiving `ok`/`yes`/`y`. Never apply changes without explicit confirmation—doing so turns collaborative pair programming into autonomous mode with a running commentary. -**5. Verify** -- Run relevant checks (lint, type check, test). Report results briefly. +**5. Verify** — Run relevant checks (lint, type check, test) and report results in one sentence. Catching errors immediately keeps the codebase green and prevents error accumulation across steps, ensuring every change is validated before moving forward. -If a step exceeds the current size limit, split it into sub-steps. Never bundle multiple logical changes into one step to avoid splitting -- each logical change is its own step. +If a step exceeds the current size limit, split it into sub-steps (Step 3a, 3b, 3c). Announce the split: "This change is ~40 lines. I will split it into 3 sub-steps." This splitting exists because the user must be able to approve or reject individual logical changes—bundling multiple logical changes into one step to avoid splitting defeats the purpose of the micro-step protocol. ### Speed Adjustment +Sessions start at 15 lines per step, balancing progress with reviewability. 
Apply speed changes immediately when the user requests them and acknowledge the new setting because ignoring pace signals breaks trust and the user's sense of control. The user must feel they are the one steering the session. + | Setting | Lines Per Step | Trigger | |---------|---------------|---------| | Slowest | 5 | Multiple `slower` commands | @@ -104,86 +87,41 @@ When the user says `faster` or `slower`, acknowledge the change: "Speed adjusted ### Session End -When the user says `done` or all steps are complete: run final verification (lint, type check, full test suite), show a summary (steps completed, steps skipped, files modified), and report verification results. - ---- +When the user says `done` or all steps are complete: run final verification (lint, type check, full test suite), show a summary (steps completed, steps skipped, files modified), and report verification results. This end-of-session gate ensures the codebase is left in a valid state. -## Examples +### Examples -### Example 1: Standard Session -User says: "Pair program a function that parses CSV lines in Go" -Actions: +**Standard Session** — User says: "Pair program a function that parses CSV lines in Go" 1. Read existing code, create 5-step plan (struct, parser func, error handling, tests, integration) 2. Show plan, wait for confirmation -3. Step 1: Announce "Define a CSVRecord struct" -- show 8-line struct -- wait -- user says `ok` -- apply -- verify -4. Step 2: Announce "Add ParseLine function" -- show 12-line function -- wait -- user says `ok` -- apply -- verify +3. Step 1: Announce "Define a CSVRecord struct" — show 8-line struct — wait — user says `ok` — apply — verify +4. Step 2: Announce "Add ParseLine function" — show 12-line function — wait — user says `ok` — apply — verify 5. Continue through remaining steps -### Example 2: Speed Adjustment -User says `faster` after Step 2 -Actions: +**Speed Adjustment** — User says `faster` after Step 2 1. 
Acknowledge: "Speed adjusted to ~30 lines per step." 2. Next step shows up to 30 lines instead of 15 3. If user says `slower` later, drop to ~15 -### Example 3: Session End -User says `done` after Step 4 of 6 -Actions: +**Session End** — User says `done` after Step 4 of 6 1. Run `go vet`, `go test ./...` 2. Report: "4 of 6 steps completed, 0 skipped. Modified: parser.go, parser_test.go. Tests: all passing." ---- - ## Error Handling ### User Says "Just Do It" / Wants Autonomous Mode -Cause: User wants speed, not collaboration +Cause: User wants speed, not collaboration. Solution: Acknowledge the preference and offer to switch. Say: "Would you like to switch to autonomous mode? I can implement the remaining steps without confirmation." If they accept, drop the micro-step protocol and implement normally. ### Verification Fails After a Step -Cause: Applied change introduces a lint error, type error, or test failure -Solution: Announce the fix as the next micro-step. Show the fix diff, wait for confirmation, apply, re-verify. Do not silently fix verification failures. +Cause: Applied change introduces a lint error, type error, or test failure. +Solution: Announce the fix as the next micro-step. Show the fix diff, wait for confirmation, apply, re-verify. Do not silently fix verification failures regardless of cause—silent fixes violate the protocol. ### Step Too Large to Fit Size Limit -Cause: A logical change requires more lines than the current limit +Cause: A logical change requires more lines than the current limit. Solution: Split into sub-steps (Step 3a, 3b, 3c). Each sub-step stays within the limit. Announce the split: "This change is ~40 lines. I will split it into 3 sub-steps." ---- - -## Anti-Patterns - -### Anti-Pattern 1: Silent Edits -**What it looks like**: Applying changes without showing the diff first -**Why wrong**: The user must see what will change. Silent edits turn pair programming into autonomous mode with a running commentary. 
-**Do instead**: Always show the planned diff and wait for confirmation. - -### Anti-Pattern 2: Monologue Mode -**What it looks like**: Five paragraphs of explanation before showing any code -**Why wrong**: The user came to code together, not to read an essay. Long explanations break flow and waste the user's attention. -**Do instead**: Announce in 1-2 sentences, then show the code immediately. - -### Anti-Pattern 3: Ignoring Pace Signals -**What it looks like**: User said `slower` but steps are still 30 lines -**Why wrong**: Disrespecting speed changes breaks trust and the user's sense of control. -**Do instead**: Apply speed changes immediately. Acknowledge the new setting. - -### Anti-Pattern 4: Bundling Steps -**What it looks like**: Combining 3 logical changes into one step to avoid splitting -**Why wrong**: Defeats micro-step discipline. The user cannot approve or reject individual changes. -**Do instead**: One logical change per step. Split large changes into sub-steps. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "This change is trivial, no need to show it" | Every change gets shown -- trivial changes are fast to approve | Show diff, wait for confirmation | -| "User is clearly going to say ok, I'll just apply" | Assuming consent is not consent | Wait for explicit command | -| "Splitting this into 3 sub-steps is tedious" | Tedium for the agent is discipline for the user | Split and show each sub-step | -| "I'll fix this lint error silently since it's my fault" | Silent fixes violate the protocol regardless of cause | Announce the fix as the next micro-step | +- [Micro-step protocol control commands](#micro-step-protocol-per-change) — user command table +- [Speed adjustment 
table](#speed-adjustment) — lines-per-step settings diff --git a/skills/parallel-code-review/SKILL.md b/skills/parallel-code-review/SKILL.md index 55ca9a80..173a3811 100644 --- a/skills/parallel-code-review/SKILL.md +++ b/skills/parallel-code-review/SKILL.md @@ -26,42 +26,7 @@ routing: # Parallel Code Review Skill -## Operator Context - -This skill operates as an orchestrator for parallel code review, configuring Claude's behavior to launch three specialized review agents simultaneously and aggregate their findings into a unified report. It implements the **Fan-Out/Fan-In** architectural pattern -- dispatch independent reviewers in parallel, collect results, merge by severity -- with **Domain Intelligence** embedded in each reviewer's focus area. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before dispatching reviewers -- **True Parallelism**: All 3 Task calls MUST be dispatched in a SINGLE message -- **No Skipping Reviewers**: All 3 reviewers run even for "simple" changes -- **READ-ONLY**: Reviewers never modify code; this is observation only -- **Severity Aggregation**: Combine findings by severity before reporting -- **Always Produce Verdict**: Every review ends with BLOCK, FIX, or APPROVE - -### Default Behaviors (ON unless disabled) -- **Re-run on Critical**: If CRITICAL issues found, re-run all reviewers after fixes -- **Severity Summary Table**: Include reviewer-by-severity matrix in output -- **File-Line References**: All findings must include `file:line` references -- **Domain-Specific Agent Selection**: Use language-appropriate agent for architecture reviewer - -### Optional Behaviors (OFF unless enabled) -- **Threat Model**: Enable with "include threat model" for security reviewer -- **Git Bisect**: Enable with "find breaking commit" for regression tracking -- **Performance Profiling**: Enable with "benchmark" for architecture reviewer - -## What This Skill CAN Do -- Dispatch 3 specialized 
reviewers in true parallel (single message) -- Aggregate findings across reviewers into unified severity classification -- Select domain-appropriate agents for architecture review (Go, Python, TS) -- Produce a structured report with BLOCK/FIX/APPROVE verdict -- Re-run all reviewers after critical fixes to verify resolution - -## What This Skill CANNOT Do -- Modify code or apply fixes (read-only review only) -- Run fewer than 3 reviewers to save time -- Skip aggregation and report individual reviewer results separately -- Replace systematic-code-review for simple sequential reviews -- Approve without all 3 reviewers completing +Orchestrate three specialized code reviewers (Security, Business Logic, Architecture) in true parallel using the Fan-Out/Fan-In pattern. Each reviewer runs independently with domain-specific focus, then findings are aggregated by severity into a unified BLOCK/FIX/APPROVE verdict. --- @@ -71,7 +36,9 @@ This skill operates as an orchestrator for parallel code review, configuring Cla **Goal**: Determine changed files and select appropriate agents before dispatching. -**Step 1: List changed files** +**Step 1: Read repository CLAUDE.md** to load project-specific conventions that reviewers must respect. + +**Step 2: List changed files** ```bash # For recent commits: @@ -80,7 +47,7 @@ git diff --name-only HEAD~1 gh pr view --json files -q '.files[].path' ``` -**Step 2: Select architecture reviewer agent** +**Step 3: Select architecture reviewer agent** based on the dominant language. This ensures the architecture reviewer applies idiomatic standards rather than generic advice, because different languages have fundamentally different design patterns and conventions. 
| File Types | Agent | |-----------|-------| @@ -89,15 +56,20 @@ gh pr view --json files -q '.files[].path' | `.ts`/`.tsx` files | `typescript-frontend-engineer` | | Mixed or other | `Explore` | +**Optional enrichments** (only when user explicitly requests): +- "include threat model" -- adds threat modeling to Security reviewer scope +- "find breaking commit" -- adds git bisect regression tracking +- "benchmark" -- adds performance profiling to Architecture reviewer scope + **Gate**: Changed files listed, architecture reviewer agent selected. Proceed only when gate passes. ### Phase 2: DISPATCH PARALLEL REVIEWERS **Goal**: Launch all 3 reviewers in a single message for true concurrent execution. -**CRITICAL**: All three Task calls MUST appear in ONE response. Sequential messages defeat parallelism. +**Critical constraint**: All three Task calls MUST appear in ONE response. Sending them sequentially triples wall-clock time and defeats the purpose of parallel review. This is not optional—parallelism is the entire value proposition of this skill. -Dispatch exactly these 3 agents: +Dispatch exactly these 3 agents. This is a read-only review—reviewers observe and report but never modify code. **Reviewer 1 -- Security** - Focus: OWASP Top 10, authentication, authorization, input validation, secrets exposure @@ -111,12 +83,16 @@ Dispatch exactly these 3 agents: - Focus: Design patterns, naming, structure, performance, maintainability - Output: Severity-classified findings with `file:line` references -**Gate**: All 3 Task calls dispatched in a single message. Proceed only when all 3 return results. +**Critical constraint**: Always run all 3 reviewers regardless of perceived change simplicity. Config changes can expose secrets, "trivial" fixes can break authorization, and each reviewer's specialization catches issues the others miss. Let a reviewer report "no findings" rather than skip it—because silence is information too. 
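
The Fan-Out/Fan-In shape described above can be sketched in a few lines. This is only an illustration under stated assumptions: the real skill dispatches Task agents in a single message, whereas the sketch uses Python coroutines, and `run_reviewer`, `fan_out_fan_in`, and `review` are hypothetical names that do not appear in the skill.

```python
import asyncio

# Hypothetical stand-ins for the three reviewers; the real skill dispatches
# Task agents, not Python coroutines.
REVIEWERS = ("security", "business-logic", "architecture")


async def run_reviewer(name: str, changed_files: list[str]) -> dict:
    """Pretend to review the files and return severity-tagged findings."""
    await asyncio.sleep(0)  # placeholder for real review work
    return {"reviewer": name, "files": len(changed_files), "findings": []}


async def fan_out_fan_in(changed_files: list[str]) -> list[dict]:
    # Fan-out: create all three reviewer coroutines up front, not one at a time.
    tasks = [run_reviewer(name, changed_files) for name in REVIEWERS]
    # Fan-in: gather() returns only once every reviewer has finished,
    # so no verdict can be built from partial results.
    return await asyncio.gather(*tasks)


def review(changed_files: list[str]) -> list[dict]:
    results = asyncio.run(fan_out_fan_in(changed_files))
    # An empty findings list still counts: "no findings" is reported, not skipped.
    assert len(results) == len(REVIEWERS)
    return results
```

Calling `review(["parser.go"])` would return one result per reviewer, in the order listed in `REVIEWERS`, even when every findings list is empty. The design point is that the wait happens once, on all three, instead of three times in sequence.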
+ +**Gate**: All 3 Task calls dispatched in a single message. Proceed only when ALL 3 return results—never issue a verdict from partial results, because the missing reviewer may hold the only CRITICAL finding. Partial results are worse than no review. ### Phase 3: AGGREGATE **Goal**: Merge all findings into a unified severity-classified report. +**Critical constraint**: Never dump raw reviewer outputs as three separate sections—the reader should not have to mentally merge findings across reviewers. Your job is to synthesize, not summarize. + **Step 1: Classify each finding by severity** | Severity | Meaning | Action | @@ -128,10 +104,12 @@ Dispatch exactly these 3 agents: **Step 2: Deduplicate overlapping findings** -Multiple reviewers may flag the same issue. Merge duplicates, keep the highest severity. +Multiple reviewers may flag the same issue. Merge duplicates, keeping the highest severity. Overlap between reviewers is a feature (independent confirmation), but the report should consolidate it so readers see a unified issue once, not three times. **Step 3: Build reviewer summary matrix** +Include this matrix in every report so stakeholders see the severity distribution at a glance: + ``` | Reviewer | CRITICAL | HIGH | MEDIUM | LOW | |----------------|----------|------|--------|-----| @@ -147,6 +125,8 @@ Multiple reviewers may flag the same issue. Merge duplicates, keep the highest s **Goal**: Produce final report with clear recommendation. +**Critical constraint**: Every review must end with an explicit verdict. Ambiguity is a decision to merge untested code. Choose: BLOCK, FIX, or APPROVE. + **Step 1: Determine verdict** | Condition | Verdict | @@ -183,11 +163,13 @@ Multiple reviewers may flag the same issue. 
Merge duplicates, keep the highest s **Step 3: If BLOCK verdict, initiate re-review protocol** -After user addresses CRITICAL issues, re-run ALL 3 reviewers to verify: +After user addresses CRITICAL issues, re-run ALL 3 reviewers (not just the one that found the issue) to verify: 1. Original CRITICAL issues resolved 2. No regressions introduced 3. No new CRITICAL/HIGH issues from fixes +Re-run all three because fixes often introduce new issues in adjacent code, and you need confirmation across all three domains that the solution is safe. + **Gate**: Structured report delivered with verdict. Review is complete. --- @@ -195,65 +177,36 @@ After user addresses CRITICAL issues, re-run ALL 3 reviewers to verify: ## Error Handling ### Error: "Reviewer Times Out" -Cause: One or more Task agents exceed execution time -Solution: -1. Report findings from completed reviewers immediately -2. Note which reviewer(s) timed out and on which files -3. Offer to re-run failed reviewer separately or proceed with partial results -### Error: "All Reviewers Fail" -Cause: Systemic issue (bad file paths, permission errors, context overflow) -Solution: -1. Verify changed file list is correct and files are readable -2. Reduce scope if file count is very large (split into batches) -3. Fall back to systematic-code-review (sequential) as last resort +**Cause**: One or more Task agents exceed execution time. -### Error: "Conflicting Findings Across Reviewers" -Cause: Two reviewers disagree on severity or interpretation of same code -Solution: -1. Keep the higher severity classification (classify UP) -2. Include both perspectives in the finding description -3. Flag as "needs author input" if genuinely ambiguous +**Solution**: +1. Report findings from completed reviewers immediately—a partial review is better than no review, because you'll know at least some classes of issues are present. +2. Note which reviewer(s) timed out and on which files, so the user understands the blind spots. +3. 
Offer to re-run failed reviewer separately or proceed with partial results (but disclose the incompleteness in the verdict). ---- +### Error: "All Reviewers Fail" -## Anti-Patterns +**Cause**: Systemic issue (bad file paths, permission errors, context overflow). -### Anti-Pattern 1: Sequential Dispatch -**What it looks like**: Sending one Task call, waiting for it, then sending the next. -**Why wrong**: Defeats the entire purpose of parallel review. Triples wall-clock time. -**Do instead**: All 3 Task calls in a SINGLE message. This is not optional. +**Solution**: +1. Verify changed file list is correct and files are readable—start with the basics. +2. Reduce scope if file count is very large (split into batches), because each reviewer needs enough context tokens to reason properly. +3. Fall back to systematic-code-review (sequential) as last resort, because at least one reviewer completing sequentially is better than zero reviewers failing. -### Anti-Pattern 2: Skipping a Reviewer for "Simple" Changes -**What it looks like**: "This is just a config change, no need for security review." -**Why wrong**: Config changes can expose secrets, break authorization, or introduce logic errors. Simple changes cause complex bugs. -**Do instead**: Run all 3 reviewers. Let them report "no findings" if truly clean. +### Error: "Conflicting Findings Across Reviewers" -### Anti-Pattern 3: Reporting Without Aggregation -**What it looks like**: Dumping each reviewer's raw output as three separate sections. -**Why wrong**: Reader must mentally merge findings. Duplicate issues appear multiple times. No unified severity picture. -**Do instead**: Complete Phase 3 aggregation. Deduplicate, classify, build matrix. +**Cause**: Two reviewers disagree on severity or interpretation of same code. -### Anti-Pattern 4: Approving With Partial Results -**What it looks like**: "2 of 3 reviewers passed, looks good to merge." -**Why wrong**: The failed reviewer may have found the only CRITICAL issue. 
Partial coverage gives false confidence. -**Do instead**: Wait for all 3 to complete, or explicitly re-run the failed reviewer before issuing verdict. +**Solution**: +1. Keep the higher severity classification (classify UP), because you want to err on the side of caution—false positives are correctable, false negatives ship bugs. +2. Include both perspectives in the finding description, so the code author understands the disagreement and can make an informed decision. +3. Flag as "needs author input" if genuinely ambiguous, and include both interpretations verbatim so they can choose. --- ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Anti-Rationalization (Review)](../shared-patterns/anti-rationalization-review.md) - Review-specific rationalizations -- [Severity Classification](../shared-patterns/severity-classification.md) - Issue severity definitions -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "One reviewer is enough" | Different perspectives catch different issues | Run all three | -| "Security reviewer covered logic" | Specialization matters; overlap is feature not bug | Don't skip business logic | -| "Small PR, skip parallel" | Small PRs can harbor big bugs | Consider scope, not size | -| "Reviewers will just duplicate" | Each has specific focus areas | Trust the specialization | +- Severity classification: CRITICAL (blocks merge), HIGH (fix before), MEDIUM (should fix), LOW (nice to have) +- Verdict decision tree: Any CRITICAL → BLOCK; HIGH without CRITICAL → FIX; MEDIUM/LOW only → APPROVE +- Re-review trigger: Always re-run all 3 reviewers after BLOCK fixes to catch regressions diff --git a/skills/pause-work/SKILL.md 
b/skills/pause-work/SKILL.md index 4655b96c..e8e2fb63 100644 --- a/skills/pause-work/SKILL.md +++ b/skills/pause-work/SKILL.md @@ -34,45 +34,13 @@ routing: # /pause - Session Handoff -## Operator Context - -This skill captures ephemeral session reasoning into durable artifacts so the next session can resume without wasting time on context reconstruction. It solves a specific gap: `task_plan.md` records WHAT tasks exist, but not WHY the current session chose a particular approach, what it rejected, or what it planned to do next. - -The two output files serve different audiences: -- `HANDOFF.json` — machine-readable, consumed by `/resume` for automated state reconstruction -- `.continue-here.md` — human-readable, for users who want to understand session state without starting a new session - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Dual-Format Output**: Always produce BOTH `HANDOFF.json` and `.continue-here.md`. The machine format enables automated resume; the human format enables manual inspection. Skipping either breaks half the use case. -- **Uncommitted Work Detection**: Always run `git status` and `git diff --stat` to identify uncommitted changes. Uncommitted work is the highest-risk information to lose across sessions. -- **False Completion Detection**: Grep for placeholder markers (TODO, FIXME, PLACEHOLDER, TBD, XXX, HACK, stub, not yet implemented) in uncommitted files. These indicate work that looks done but is not — the most dangerous handoff failure mode. -- **Project Root Placement**: Write both files to the project root (where `.git/` lives), not the current working directory if different. This ensures `/resume` can find them reliably. -- **No Destructive Operations**: This skill only creates files. It never deletes, modifies existing code, or runs destructive git commands. 
- -### Default Behaviors (ON unless disabled) -- **WIP Commit Suggestion**: If uncommitted changes exist, suggest a WIP commit before pausing. Uncommitted work can be lost if the worktree is cleaned up. Do not auto-commit — suggest and let the user decide. -- **task_plan.md Integration**: If `task_plan.md` exists, read it and incorporate its phase status into the handoff. The handoff supplements the plan, it does not replace it. -- **Timestamp in ISO 8601**: All timestamps use UTC ISO 8601 format for unambiguous parsing. - -### Optional Behaviors (OFF unless enabled) -- **Auto-Commit Handoff** (`--commit`): Commit the handoff files on the current branch. Default is to leave them uncommitted so the user can review first. -- **Quiet Mode** (`--quiet`): Skip the confirmation summary. For automated/scripted usage. - -## What This Skill CAN Do -- Capture completed tasks, remaining work, blockers, and decisions into structured handoff files -- Detect uncommitted work and suggest WIP commits -- Detect false completions (placeholder markers in modified files) -- Synthesize the session's reasoning context (approach chosen, alternatives rejected, mental model) -- Optionally commit handoff artifacts to the current branch - -## What This Skill CANNOT Do -- Replace `task_plan.md` — handoffs capture session reasoning, plans capture task structure -- Auto-commit code changes — it only suggests WIP commits, never executes them without user consent -- Guarantee reasoning accuracy — handoff quality depends on the session's self-awareness (same limitation as any self-assessment) -- Resume from handoff files — that is the `resume-work` skill +Capture ephemeral session reasoning into durable artifacts so the next session can resume without wasting time on context reconstruction. `task_plan.md` records WHAT tasks exist; this skill captures WHY the current session chose a particular approach, what it rejected, and what it planned to do next. 
---- +Two output files serve different audiences because each addresses a distinct use case: +- `HANDOFF.json` — machine-readable, consumed by `/resume` for automated state reconstruction. Must always be produced to enable `/resume` routing. +- `.continue-here.md` — human-readable, for users who want to understand session state without starting a new session. Must always be produced to support human-only resumption paths. + +Skipping either file breaks half the use case: without the JSON, `/resume` cannot detect handoff state automatically; without the markdown, users cannot quickly grok where things stand. ## Instructions @@ -80,18 +48,22 @@ The two output files serve different audiences: **Goal**: Collect all state needed for the handoff. -**Step 1: Identify project root** +**Step 1: Read CLAUDE.md** + +Read and follow the repository CLAUDE.md before any other operations because it establishes conventions for the current project that may differ from defaults. + +**Step 2: Identify project root** Find the git root directory: ```bash git rev-parse --show-toplevel ``` -All subsequent paths and file writes are relative to this root. +All subsequent paths and file writes target this root, not the current working directory because writing to the project root ensures `/resume` can find the files reliably across different shell invocation contexts. -**Step 2: Collect git state** +**Step 3: Collect git state** -Run these commands to capture the current state: +Uncommitted work is the highest-risk information to lose across sessions because a new shell or worktree cleanup can destroy changes the user intended to save. 
Run these commands to capture the current state: ```bash # Current branch @@ -108,9 +80,9 @@ git diff --cached --stat git log --oneline -10 ``` -**Step 3: Check for false completions** +**Step 4: Check for false completions** -Search uncommitted/modified files for placeholder markers: +Search uncommitted/modified files for placeholder markers because they indicate work that looks done but is not — this is the most dangerous handoff failure mode. These markers are easily missed during context loss and become invisible in the next session. ```bash # Get list of modified files @@ -122,17 +94,17 @@ Use the Grep tool to search those files for these patterns: `TODO`, `FIXME`, `PL Record any findings — these are items that look complete but are not. -**Step 4: Read task_plan.md if present** +**Step 5: Read task_plan.md if present** -If `task_plan.md` exists in the project root, read it to extract: +If `task_plan.md` exists in the project root, read it and incorporate its phase status into the handoff because the plan captures WHAT phases and tasks exist while the handoff captures WHY and the session's mental model. The handoff supplements the plan (capturing session reasoning), it does not replace it. Extract: - Which phases are complete (marked `[x]`) - Which phases remain (marked `[ ]`) - Current status line - Decisions and errors logged -**Step 5: Read .adr-session.json if present** +**Step 6: Read .adr-session.json if present** -If `.adr-session.json` exists, note the active ADR for context in the handoff. +If `.adr-session.json` exists, note the active ADR for context in the handoff because ADRs record architectural decisions that influence remaining work. **GATE**: Git state collected. Modified file list available. Placeholder scan complete. Ready to synthesize. @@ -142,37 +114,39 @@ If `.adr-session.json` exists, note the active ADR for context in the handoff. **Step 1: Construct completed_tasks** -List what was accomplished this session. 
Draw from: +List what was accomplished this session with specificity because the next session needs to know what NOT to repeat. Draw from: - Git commits made during the session - Phases marked complete in task_plan.md - Work the session performed (files created, edited, reviewed) -Be specific: "Implemented scoring module in scripts/quality-score.py" not "Did some work on scoring." +Be specific: "Implemented scoring module in scripts/quality-score.py" not "Did some work on scoring" because vague entries waste the next session's time reconstructing what was done. **Step 2: Construct remaining_tasks** -List what still needs to be done. Draw from: +List what still needs to be done because this is the primary input to the next session's context. Draw from: - Unchecked phases in task_plan.md -- Placeholder markers found in Phase 1 Step 3 +- Placeholder markers found in Phase 1 Step 4 - Known incomplete work from session context **Step 3: Construct decisions** -Record key decisions made during the session and WHY. This is the highest-value handoff content because it prevents the next session from re-exploring dead ends. +Record key decisions made during the session and WHY because this is the highest-value handoff content. Git log shows WHAT changed but not WHY or what was rejected — decisions fill that gap and prevent the next session from re-exploring dead ends or reconsidering options that were already deliberated. Format: `{"decision description": "reasoning for the decision"}` **Step 4: Construct next_action** -Write a specific, actionable description of what the next session should do first. Include: +Write a specific, actionable description of what the next session should do first because what seems obvious now becomes opaque after context loss. 
Include: - The exact action (not vague "continue working") - Relevant file paths and function names - Integration points or dependencies - Why this is the right next step +Example: `"Wire quality-score.py into pr-pipeline Phase 3. The function signature is score_package(path) -> ScoreResult. Integration point is the gate check between STAGE and REVIEW phases."` + **Step 5: Construct context_notes** -Capture the session's mental model — the reasoning context that is NOT captured in code or commits: +Capture the session's mental model — the reasoning context that is NOT captured in code or commits because this information is the most likely to be lost and most expensive to reconstruct. Always include at least: what approach was chosen, what was rejected, and any gotchas discovered. This information prevents thrashing in the next session. Record: - Approaches tried and rejected (and why) - Assumptions being made - Gotchas discovered @@ -182,11 +156,11 @@ Capture the session's mental model — the reasoning context that is NOT capture ### Phase 3: WRITE -**Goal**: Write both handoff files to the project root. +**Goal**: Write both handoff files to the project root. This skill only creates files — it never deletes, modifies existing code, or runs destructive git commands because it must be safe to invoke repeatedly without side effects. 
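
As a rough sketch of the write phase, the following shows one way the two files could be produced from the fields assembled in Phase 2. It rests on several assumptions: the `created_at` key, the function names, and the markdown layout are hypothetical and are not taken from the skill's actual templates; only the file names, the field names `completed_tasks`, `remaining_tasks`, `decisions`, `next_action`, `context_notes`, and the UTC ISO 8601 timestamp convention come from this document.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_handoff(completed, remaining, decisions, next_action, context_notes):
    """Assemble the handoff payload from the fields named in Phase 2."""
    return {
        # "created_at" is a hypothetical key; the real schema may name it differently.
        "created_at": datetime.now(timezone.utc).isoformat(),
        "completed_tasks": completed,
        "remaining_tasks": remaining,
        "decisions": decisions,
        "next_action": next_action,
        "context_notes": context_notes,
    }


def write_handoff(project_root: Path, handoff: dict) -> tuple[Path, Path]:
    """Write both handoff files at the project root; no other file is touched."""
    json_path = project_root / "HANDOFF.json"
    md_path = project_root / ".continue-here.md"
    json_path.write_text(json.dumps(handoff, indent=2) + "\n", encoding="utf-8")
    # The markdown layout below is illustrative, not the skill's real template.
    md_lines = ["# Continue Here", "", f"Next action: {handoff['next_action']}", ""]
    md_lines += [f"- [ ] {task}" for task in handoff["remaining_tasks"]]
    md_path.write_text("\n".join(md_lines) + "\n", encoding="utf-8")
    return json_path, md_path
```

In practice `project_root` would be the path printed by `git rev-parse --show-toplevel`. Both files are written in the same call so that neither the machine-readable nor the human-readable half of the handoff can be skipped.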
**Step 1: Write HANDOFF.json** -Write to `{project_root}/HANDOFF.json`: +Write to `{project_root}/HANDOFF.json` with UTC ISO 8601 timestamps for unambiguous parsing across time zones and system clocks: ```json { @@ -223,7 +197,7 @@ Write to `{project_root}/HANDOFF.json`: **Step 2: Write .continue-here.md** -Write to `{project_root}/.continue-here.md`: +Write to `{project_root}/.continue-here.md` because humans need prose-form state before committing to `/resume`: ```markdown # Continue Here @@ -253,7 +227,7 @@ Write to `{project_root}/.continue-here.md`: **Step 3: Suggest WIP commit if needed** -If there are uncommitted changes (from Phase 1 Step 2), display: +If there are uncommitted changes (from Phase 1 Step 3), display a warning because uncommitted work can be lost if the worktree is cleaned up. However, do NOT auto-commit because auto-committing removes the user's ability to decide — changes may be experimental, broken, or intentionally staged for review. ``` WARNING: Uncommitted changes detected in N file(s): @@ -266,8 +240,6 @@ Consider a WIP commit before ending the session: Uncommitted work can be lost if the worktree is cleaned up. ``` -Do NOT auto-commit. The user decides. - **Step 4: Optional commit of handoff files** If `--commit` flag was provided: @@ -280,7 +252,7 @@ git commit -m "chore: session handoff artifacts" ### Phase 4: CONFIRM -**Goal**: Display summary and confirm handoff was captured. +**Goal**: Display summary and confirm handoff was captured. Skip this phase if `--quiet` flag was provided (for automated/scripted usage). Display the handoff summary: @@ -305,8 +277,6 @@ Display the handoff summary: =================================================================== ``` ---- - ## Error Handling ### Error: Not in a Git Repository @@ -321,49 +291,8 @@ Display the handoff summary: **Cause**: A previous `/pause` created handoff files that were never consumed by `/resume` **Solution**: Warn the user that stale handoff files exist. 
Offer to overwrite (default) or append. Overwriting is almost always correct — stale handoffs from abandoned sessions should not block new ones. ---- - -## Anti-Patterns - -### Anti-Pattern 1: Vague Next Actions -**What it looks like**: `"next_action": "Continue working on the feature"` -**Why wrong**: The entire point of handoff is to avoid reconstruction. Vague next actions force the next session to re-discover what "continue working" means. -**Do instead**: Be specific: `"next_action": "Wire quality-score.py into pr-pipeline Phase 3. The function signature is score_package(path) -> ScoreResult. Integration point is the gate check between STAGE and REVIEW phases."` - -### Anti-Pattern 2: Skipping context_notes -**What it looks like**: `"context_notes": ""` or omitting the field -**Why wrong**: Context notes capture WHY the session chose its approach — the information most likely to be lost and most expensive to reconstruct. -**Do instead**: Always include at least: what approach was chosen, what was rejected, and any gotchas discovered. - -### Anti-Pattern 3: Using Handoff as Task Plan -**What it looks like**: Creating detailed phase breakdowns in HANDOFF.json instead of task_plan.md -**Why wrong**: Handoff files are one-shot artifacts deleted after resume. Task plans persist as the task record of truth. Putting plan content in handoff means it vanishes after the next `/resume`. -**Do instead**: Keep task_plan.md for task structure. Use handoff for session-specific reasoning that supplements the plan. - -### Anti-Pattern 4: Auto-Committing Code Changes -**What it looks like**: Committing uncommitted work as part of the pause flow without user consent -**Why wrong**: Uncommitted changes may be experimental, broken, or intentionally staged for review. Auto-committing removes the user's ability to decide. -**Do instead**: Suggest a WIP commit. Show the files. Let the user decide. 
- ---- - -## Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "The task plan captures everything" | task_plan.md captures task state, not session reasoning (approach, rejections, mental model) | Create handoff files with context_notes | -| "Next session can figure it out from git log" | Git log shows WHAT changed, not WHY or what was rejected | Include decisions and context_notes | -| "No need for .continue-here.md, JSON is enough" | Humans read prose faster than JSON; .continue-here.md is for manual inspection | Always write both files | -| "The changes are obvious, no need for detailed next_action" | What's obvious now is opaque after context loss | Write specific next_action with file paths | -| "I'll just quickly commit the code too" | Auto-committing code without user consent risks committing broken/experimental work | Suggest WIP commit, never auto-commit code | - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations - ### Related Skills - `resume-work` — Consumes handoff artifacts to restore session state - `workflow-orchestrator` — For complex multi-phase tasks that benefit from handoff between phases diff --git a/skills/perses-code-review/SKILL.md b/skills/perses-code-review/SKILL.md index 39a3f9a5..3abbd66b 100644 --- a/skills/perses-code-review/SKILL.md +++ b/skills/perses-code-review/SKILL.md @@ -26,180 +26,103 @@ routing: # Perses Code Review -Review code changes in Perses repositories for domain-specific patterns, API conventions, plugin system compliance, and dashboard correctness. This is not a general-purpose code reviewer — it enforces Perses-specific invariants across Go backend, React frontend, CUE schemas, and dashboard definitions. - -## Operator Context - -This skill operates as a Perses-domain code reviewer. 
It understands the project layout (`/cmd`, `/pkg`, `/internal`), the plugin system (`@perses-dev/*`), CUE schema conventions, and dashboard definition structure. It dispatches sub-reviewers per file type and aggregates findings. - -### Hardcoded Behaviors (Always Apply) -- **Perses-specific focus**: Review against Perses patterns, not generic Go/React/CUE style guides. A passing `golangci-lint` does not mean the code follows Perses conventions. -- **CUE schema validation**: Every CUE schema must be in `package model`, use `close({})` for specs, and include a JSON example alongside the schema definition. -- **Dashboard definition validation**: Validate `$ref` panel references resolve, variable chains are acyclic, and datasource scopes (`global`, `project`, `dashboard`) are correct. -- **Project-scoped API compliance**: All CRUD API handlers must be project-scoped at `/api/v1/projects/{project}/...` unless the resource is explicitly global (e.g., `GlobalDatasource`, `GlobalSecret`). -- **percli lint gate**: Run `percli lint` on any dashboard JSON/YAML definitions touched in the PR. Lint failures are blockers. - -### Default Behaviors (ON unless disabled) -- **Multi-domain dispatch**: Route `.go` files to Go sub-reviewer, `.tsx`/`.ts` to React sub-reviewer, `.cue` to CUE sub-reviewer, dashboard JSON/YAML to dashboard sub-reviewer. -- **Cross-domain correlation**: When a PR touches both CUE schemas and plugin code, verify the schema changes match the plugin's expected input/output types. -- **Storage interface check**: Verify new resources implement the storage interface (`dao.go`) with all required CRUD methods including `List` with pagination support. - -### Optional Behaviors (OFF unless enabled) -- **Migration review**: Check `migrate/migrate.cue` for backward-compatible schema evolution when CUE schemas change. -- **E2E test coverage**: Verify that new API endpoints have corresponding E2E tests in the test suite. 
-- **Performance review**: Flag N+1 queries in storage implementations and unbounded list operations. - -## What This Skill CAN Do -- Review Go backend code for Perses API patterns, storage interface compliance, and auth middleware usage -- Review React frontend code for `@perses-dev/plugin-system` hook usage and component conventions -- Review CUE schemas for `package model` compliance, closed specs, and JSON examples -- Validate dashboard definitions for panel references, variable chains, and datasource scoping -- Run `percli lint` on dashboard definitions and report failures -- Correlate cross-domain changes (schema + plugin, API + frontend) - -## What This Skill CANNOT Do -- Deploy Perses instances (use `perses-deploy`) -- Create dashboards from scratch (use `perses-dashboard-create`) -- Develop new plugins (use `perses-plugin-create`) -- Perform general Go or React code review without Perses context (use `golang-general-engineer` or `typescript-frontend-engineer`) -- Run the full Perses test suite (use CI/CD) +Review code changes in Perses repositories for domain-specific patterns, API conventions, plugin system compliance, and dashboard correctness. This skill enforces Perses-specific invariants across Go backend, React frontend, CUE schemas, and dashboard definitions — a passing `golangci-lint` or generic code review does not mean the code follows Perses conventions. ---- - -## Error Handling - -| Cause | Symptom | Solution | -|-------|---------|----------| -| Go API handler doesn't follow Perses CRUD patterns | Missing pagination on `List` endpoint, wrong HTTP status codes (e.g., 200 instead of 201 on create), no project-scoping | Flag as blocker. Perses `List` handlers must accept `?page=N&size=M` query params and return paginated results. Create returns 201, Update returns 200, Delete returns 204. All non-global resources must be under `/api/v1/projects/{project}/`. 
| -| React component doesn't use `@perses-dev/plugin-system` hooks | Component uses raw `fetch()` or direct state for time range instead of `usePlugin`, `useTimeRange`, `useDataQueries` from the plugin system | Flag as blocker. Perses plugins MUST use the plugin system hooks to participate in the dashboard lifecycle (time range sync, variable interpolation, refresh). Direct data fetching bypasses the plugin contract. | -| CUE schema not in `package model` or spec not closed | Schema declares `package foo` instead of `package model`, uses open struct `{}` instead of `close({})`, no JSON example file alongside | Flag as blocker. All Perses CUE schemas must be `package model` to be discoverable by the schema registry. Specs must use `close({})` to prevent unexpected fields. A `_example.json` must accompany each schema for documentation and validation. | -| Dashboard definition has invalid `$ref` panel references | Layout references `$ref: #/spec/panels/myPanel` but panel key is `my-panel` or doesn't exist, causing render failures | Flag as blocker. Panel keys in `$ref` must exactly match keys in `spec.panels`. Run `percli lint` to catch these. Check for typos and naming convention mismatches (camelCase vs kebab-case). | -| Broken variable chains in dashboard | Variable B depends on variable A via `$A` in its query, but A is defined after B in the variables list, or A doesn't exist | Flag as blocker. Variable evaluation order follows list order. Dependees must appear before dependents. Missing variables cause silent empty interpolation. | -| Wrong datasource scope | Dashboard uses `datasource: {name: "prom"}` without specifying scope, or references a project datasource from a different project | Flag as warning. Datasources have three scopes: `global` (cluster-wide), `project` (project-level), `dashboard` (inline). The scope must be explicit. Cross-project references are invalid. | - ---- - -## Anti-Patterns +## Instructions -### 1. 
Reviewing Perses code with a general Go/React reviewer -**Why it fails**: A general Go reviewer will approve an API handler that returns a flat list without pagination, uses generic error responses, or doesn't enforce project-scoping. These are all Perses-specific requirements that generic linters and reviewers miss entirely. -**What to do instead**: Always route through this skill when the changed code lives in a Perses repository. +### Phase 1: CLASSIFY -### 2. Not running `percli lint` on dashboard definitions in the PR -**Why it fails**: Dashboard JSON can have structurally valid YAML/JSON but semantically broken panel references, invalid plugin kinds, or malformed variable expressions. Manual review catches some of these, but `percli lint` catches all of them deterministically. -**What to do instead**: Run `percli lint` on every dashboard file changed in the PR. Treat lint failures as blockers. +**Goal**: Determine the review domains for this PR so you can apply the right checks to each file type. -### 3. Ignoring CUE schema changes when reviewing plugin PRs -**Why it fails**: A plugin PR that adds new configuration options without updating the corresponding CUE schema means the schema registry is out of sync. Users will be able to set options in the UI that fail CUE validation, or the UI won't expose options that the backend accepts. -**What to do instead**: When a plugin PR touches TypeScript types or Go structs, check that the corresponding `.cue` schema file is also updated and that the JSON example reflects the new fields. +1. List all changed files and categorize: + - Go backend (`.go`) — files under `/cmd`, `/pkg`, `/internal` + - React frontend (`.ts`, `.tsx`) — files in `@perses-dev/*` packages + - CUE schemas (`.cue`) + - Dashboard definitions (`.json`, `.yaml` with `kind: Dashboard`) -### 4. Approving storage implementations without pagination -**Why it fails**: Perses projects can contain hundreds of dashboards. 
A `List` endpoint that returns all results without pagination will cause memory issues and slow API responses at scale. The frontend `useListResource` hook expects paginated responses. -**What to do instead**: Verify every `List` method in the storage layer accepts `page` and `size` parameters and returns a paginated response with total count. +2. Identify cross-domain changes, because schema and plugin changes must stay synchronized. When a PR touches both CUE schemas and plugin code, the schema changes must match the plugin's expected input/output types — a plugin PR that adds configuration options without updating the corresponding CUE schema leaves the schema registry out of sync, causing UI options to fail CUE validation or the backend to accept options the UI cannot expose. Flag these for paired review. ---- +3. Flag dashboard definition files for `percli lint` in Phase 2. Dashboard JSON can be structurally valid but semantically broken (invalid panel references, wrong plugin kinds, malformed variable expressions). Manual review catches some of these; `percli lint` catches all of them deterministically. -## Anti-Rationalization - -| Rationalization | Reality | Required Action | -|-----------------|---------|-----------------| -| "It's just a small dashboard JSON change, no need to lint" | Small JSON changes are where typos in `$ref` paths hide. A single wrong character breaks panel rendering silently. | **Run `percli lint`. Always.** | -| "The Go handler works, it just doesn't paginate yet" | "Yet" means never once it ships. Unpaginated list endpoints are tech debt that causes production incidents at scale. | **Block the PR until pagination is implemented.** | -| "The CUE schema is fine without `close()`, it still validates" | Open schemas accept any field, defeating the purpose of schema validation. Users will send garbage fields that silently pass validation. 
| **Require `close({})` on all spec structs.** | -| "This React component doesn't need `useTimeRange`, it manages its own time" | Components that manage their own time range break dashboard-level time sync. Users change the time picker and this panel doesn't update. | **Use `useTimeRange` from `@perses-dev/plugin-system`.** | -| "The variable ordering doesn't matter, the engine resolves dependencies" | The Perses variable engine evaluates in list order, not dependency order. Out-of-order variables produce empty interpolation with no error. | **Verify variable dependency order matches list order.** | +**Gate**: File classification complete. Domains identified. ---- +### Phase 2: REVIEW -## FORBIDDEN Patterns +**Goal**: Apply Perses-specific review checks per domain. Route `.go` files to Go sub-reviewer, `.tsx`/`.ts` to React sub-reviewer, `.cue` to CUE sub-reviewer, and dashboard JSON/YAML to dashboard sub-reviewer. -These are hard stops. If encountered, block the PR immediately. +#### Go backend (`/cmd`, `/pkg`, `/internal`) -- **Hardcoded datasource URLs in dashboard definitions** — Datasources must be referenced by name and scope, never by direct URL. Hardcoded URLs break when Perses is deployed in different environments. -- **`package main` in CUE schema files** — CUE schemas must be `package model`. Using `package main` makes the schema invisible to the Perses schema registry. -- **Raw HTTP calls in plugin React components** — Plugins must use the Perses plugin system (`useDataQueries`, `usePlugin`) for data fetching. Raw `fetch()` or `axios` calls bypass caching, auth token injection, and datasource proxy routing. -- **Global resource endpoints without admin auth middleware** — `GlobalDatasource`, `GlobalSecret`, and `GlobalVariable` endpoints must enforce admin-level authorization. Missing auth middleware is a security vulnerability. -- **Dashboard definitions without a `kind` field** — Every Perses resource must have a `kind` field. 
Dashboard definitions without `kind: Dashboard` will fail API validation on import. +- **Project-scoped API compliance**: All CRUD API handlers must be project-scoped at `/api/v1/projects/{project}/...` unless the resource is explicitly global (e.g., `GlobalDatasource`, `GlobalSecret`). This matters because projects provide multi-tenancy isolation — a handler that skips project-scoping for a non-global resource is a blocker and a security issue. ---- +- **Paginated List endpoints**: Every `List` method in the storage layer must accept `page` and `size` query parameters and return a paginated response with total count, because Perses projects can contain hundreds of dashboards — an unpaginated `List` endpoint causes memory issues and slow API responses at scale. The frontend `useListResource` hook expects paginated responses. Block the PR until pagination is implemented. -## Blocker Criteria +- **Storage interface implementation**: New resources must implement the storage interface (`dao.go`) with all required CRUD methods including `List` with pagination support, so the API can expose all standard operations uniformly. -A finding is a **blocker** (must fix before merge) if any of these apply: +- **Auth middleware on global endpoints**: `GlobalDatasource`, `GlobalSecret`, and `GlobalVariable` endpoints must enforce admin-level authorization. Missing auth middleware on global resource endpoints is a security vulnerability — block immediately. -1. `percli lint` fails on any dashboard definition in the PR -2. CUE schema is not in `package model` or uses open structs for spec types -3. API handler is missing project-scoping for a non-global resource -4. Plugin component uses raw HTTP instead of plugin system hooks -5. `$ref` panel references don't resolve to existing panel keys -6. Variable dependency chain is circular or out of order -7. Any FORBIDDEN pattern is present +- **RESTful status codes**: Create returns 201, Update returns 200, Delete returns 204. 
Error responses must use Perses error types, not raw HTTP status codes, so clients can consistently parse API responses. -A finding is a **warning** (should fix, not blocking) if: +#### React frontend (`@perses-dev/*` packages) -1. Missing JSON example alongside new CUE schema (documentation gap) -2. Datasource scope is implicit rather than explicit -3. Missing error handling for specific edge cases -4. Test coverage gaps for new functionality +- **Plugin system hooks required**: Components must use `usePlugin`, `useTimeRange`, `useDataQueries` from `@perses-dev/plugin-system`. Components that manage their own time range break dashboard-level time sync — users change the time picker and the panel does not update. Raw `fetch()` or `axios` calls in plugin components are a blocker because they bypass caching, auth token injection, datasource proxy routing, and the plugin contract for variable interpolation and refresh. ---- +- **Component conventions**: Props must follow `@perses-dev/dashboards` type conventions. UI components must use `@perses-dev/components` rather than raw MUI, so the Perses design system stays consistent. -## Instructions +#### CUE schemas -### Phase 1: CLASSIFY +- **`package model` required**: All Perses CUE schemas must be `package model` to be discoverable by the schema registry. `package main` in a CUE schema file makes it invisible to the registry — block immediately. -**Goal**: Determine the review domains for this PR. +- **Closed specs**: Spec structs must use `close({})` because open schemas accept any field, defeating the purpose of schema validation — users can send garbage fields that silently pass validation. -1. List all changed files and categorize: Go backend (`.go`), React frontend (`.ts`, `.tsx`), CUE schemas (`.cue`), dashboard definitions (`.json`, `.yaml` with `kind: Dashboard`) -2. Identify cross-domain changes — does the PR touch both schema and plugin? Both API and frontend? -3. 
Check for dashboard definition files that need `percli lint` +- **JSON example file**: A `_example.json` must accompany each schema for documentation and validation. Missing examples are a warning (documentation gap). -**Gate**: File classification complete. Domains identified. +- **Migration path**: When CUE schemas change, check `migrate/migrate.cue` for backward-compatible schema evolution, so existing dashboard definitions don't break. -### Phase 2: REVIEW +#### Dashboard definitions -**Goal**: Apply Perses-specific review checks per domain. +- **Run `percli lint`**: Execute `percli lint` on every dashboard file changed in the PR. Lint failures are blockers, because even small JSON changes can hide typos in `$ref` paths that break panel rendering silently. -**Go backend** (`/cmd`, `/pkg`, `/internal`): -- API handlers at `/api/v1/*` follow RESTful CRUD with project scoping -- Storage interfaces implement all required methods including paginated `List` -- Auth middleware is applied to global resource endpoints -- Error responses use Perses error types, not raw HTTP status codes +- **Panel reference validation**: Panel keys in `$ref` must exactly match keys in `spec.panels`. Check for typos and naming convention mismatches (camelCase vs kebab-case), because unresolved `$ref` references cause render failures. -**React frontend** (`@perses-dev/*` packages): -- Components use `usePlugin`, `useTimeRange`, `useDataQueries` from `@perses-dev/plugin-system` -- No raw `fetch()` or `axios` calls in plugin components -- Component props follow `@perses-dev/dashboards` type conventions -- UI components use `@perses-dev/components` rather than raw MUI +- **Variable chain ordering**: Variable evaluation order follows list order, not dependency order. Dependees must appear before dependents, because out-of-order variables produce empty interpolation with no error. Circular dependencies are blockers. 
-**CUE schemas**: -- Schema is `package model` -- Spec structs use `close({})` -- JSON example file exists alongside schema -- If schema changed, check `migrate/migrate.cue` for migration path +- **Datasource references**: Datasources must be referenced by name and explicit scope (`global`, `project`, `dashboard`). Cross-project references are invalid. Hardcoded datasource URLs in dashboard definitions are a blocker because they break when Perses is deployed in different environments. -**Dashboard definitions**: -- Run `percli lint` on all dashboard files -- Validate `$ref` panel references resolve -- Check variable chains for circular dependencies and ordering -- Verify datasource references use name + explicit scope +- **`kind` field required**: Every Perses resource must have a `kind` field. Dashboard definitions without `kind: Dashboard` fail API validation on import — block immediately. **Gate**: All domains reviewed. Findings collected. ### Phase 3: REPORT -**Goal**: Deliver structured review findings. +**Goal**: Deliver structured review findings in a format that clearly distinguishes blockers (must fix) from warnings (should fix). Report format: + 1. **Summary**: One-line verdict (approve, request changes, blocker found) -2. **Blockers**: Issues that must be fixed before merge (with file path and line) -3. **Warnings**: Issues that should be fixed but are not blocking +2. **Blockers**: Issues that must be fixed before merge (with file path and line). 
A finding is a blocker if: + - `percli lint` fails on any dashboard definition + - CUE schema is not `package model` or uses open structs for spec types + - API handler is missing project-scoping for a non-global resource + - Plugin component uses raw HTTP instead of plugin system hooks + - `$ref` panel references do not resolve to existing panel keys + - Variable dependency chain is circular or out of order + - Any forbidden pattern is present (hardcoded datasource URLs, `package main` in CUE, raw HTTP in plugins, global endpoints without admin auth, dashboards without `kind`) +3. **Warnings**: Issues that should be fixed but are not blocking (missing JSON example alongside CUE schema, implicit datasource scope, missing error handling for edge cases, test coverage gaps) 4. **Notes**: Observations and suggestions for improvement 5. **percli lint output**: Raw output if dashboard definitions were linted **Gate**: Review report delivered. Task complete. ---- +## Error Handling + +| Cause | Symptom | Solution | +|-------|---------|----------| +| Go API handler doesn't follow Perses CRUD patterns | Missing pagination on `List` endpoint, wrong HTTP status codes (e.g., 200 instead of 201 on create), no project-scoping | Flag as blocker. Perses `List` handlers must accept `?page=N&size=M` query params and return paginated results. Create returns 201, Update returns 200, Delete returns 204. All non-global resources must be under `/api/v1/projects/{project}/`. | +| React component doesn't use `@perses-dev/plugin-system` hooks | Component uses raw `fetch()` or direct state for time range instead of `usePlugin`, `useTimeRange`, `useDataQueries` from the plugin system | Flag as blocker. Perses plugins MUST use the plugin system hooks to participate in the dashboard lifecycle (time range sync, variable interpolation, refresh). Direct data fetching bypasses the plugin contract. 
| +| CUE schema not in `package model` or spec not closed | Schema declares `package foo` instead of `package model`, uses open struct `{}` instead of `close({})`, no JSON example file alongside | Flag as blocker. All Perses CUE schemas must be `package model` to be discoverable by the schema registry. Specs must use `close({})` to prevent unexpected fields. A `_example.json` must accompany each schema for documentation and validation. | +| Dashboard definition has invalid `$ref` panel references | Layout references `$ref: #/spec/panels/myPanel` but panel key is `my-panel` or doesn't exist, causing render failures | Flag as blocker. Panel keys in `$ref` must exactly match keys in `spec.panels`. Run `percli lint` to catch these. Check for typos and naming convention mismatches (camelCase vs kebab-case). | +| Broken variable chains in dashboard | Variable B depends on variable A via `$A` in its query, but A is defined after B in the variables list, or A doesn't exist | Flag as blocker. Variable evaluation order follows list order. Dependees must appear before dependents. Missing variables cause silent empty interpolation. | +| Wrong datasource scope | Dashboard uses `datasource: {name: "prom"}` without specifying scope, or references a project datasource from a different project | Flag as warning. Datasources have three scopes: `global` (cluster-wide), `project` (project-level), `dashboard` (inline). The scope must be explicit. Cross-project references are invalid. | ## References diff --git a/skills/perses-cue-schema/SKILL.md b/skills/perses-cue-schema/SKILL.md index a05e5ffa..c2e31130 100644 --- a/skills/perses-cue-schema/SKILL.md +++ b/skills/perses-cue-schema/SKILL.md @@ -28,124 +28,28 @@ routing: Write CUE schemas for Perses plugin data models, validation constraints, JSON examples, and Grafana migration logic. -## Operator Context - -This skill operates as a CUE schema author for Perses plugins. 
It produces validated schema files, matching JSON examples, and optional migration definitions that pass `percli plugin test-schemas`. - -### Hardcoded Behaviors (Always Apply) -- **Package model**: All plugin CUE schemas MUST use `package model` — any other package name causes compilation failure -- **Closed specs**: Use `close({...})` for all spec definitions — open specs allow invalid fields through validation silently -- **JSON example required**: Every schema MUST have a corresponding JSON example file at the same directory level -- **Test after write**: Run `percli plugin test-schemas` after creating or modifying any schema — never declare a schema complete without passing tests -- **CUE v0.12.0+**: Require CUE v0.12.0 or later; earlier versions have incompatible syntax for Perses schemas -- **Migration package**: Migration files MUST use `package migrate`, never `package model` - -### Default Behaviors (ON unless disabled) -- **Educational mode**: Explain CUE syntax and Perses-specific patterns as schemas are created -- **Import common types**: Import `github.com/perses/shared/cue/common` for shared types (`#format`, `#thresholds`, `#calculation`) rather than redefining them -- **Validate incrementally**: Test after each schema file is written, not only at the end - -### Optional Behaviors (OFF unless enabled) -- **Grafana migration**: Write `migrate/migrate.cue` with `#grafanaType` and `#mapping` for converting Grafana panels -- **Strict mode**: Treat all fields as required unless the user explicitly marks them optional - -## What This Skill CAN Do -- Define CUE schemas for any Perses plugin type (panel, variable, datasource) -- Create matching JSON example files that validate against schemas -- Write Grafana migration schemas with field mappings via `#panel` references -- Explain CUE syntax: `close()`, optional fields (`?`), arrays (`[...#type]`), nested types (`#name`) -- Debug CUE compilation errors and schema/example mismatches -- Import and use shared 
Perses types from `common` - -## What This Skill CANNOT Do -- Create Perses dashboards or layouts (use perses-dashboard-create) -- Scaffold full plugin projects with Go code (use perses-plugin-create) -- Deploy or configure Perses server instances (use perses-deploy) -- Write DaC pipeline CUE definitions (use perses-dac-pipeline) - ---- - -## Error Handling - -### CUE Compilation Error: Wrong Package Name -**Symptom**: `package is "foo", want "model"` or similar CUE loader error. -**Cause**: Schema file uses a package name other than `model`. -**Fix**: Change the first line of the `.cue` file to `package model`. Migration files use `package migrate` instead. - -### CUE Compilation Error: Unclosed Spec or Bad Import -**Symptom**: `cannot find package`, `expected '}' found EOF`, or `import path not valid`. -**Cause**: Missing closing brace in `close({...})`, typo in import path, or missing import statement. -**Fix**: Verify braces are balanced in the `close({})` block. Confirm the import path is exactly `github.com/perses/shared/cue/common` (not shortened or aliased incorrectly). - -### JSON Example Mismatch: close() Rejects Unknown Fields -**Symptom**: `percli plugin test-schemas` fails with `field not allowed` on a field present in the JSON but absent from the CUE schema. -**Cause**: `close({})` enforces a strict field set — the JSON example contains fields the schema does not declare. -**Fix**: Either add the missing field to the CUE spec (with `?` if optional) or remove it from the JSON example. Also check for type mismatches (e.g., `string` in schema but `number` in JSON). - -### Grafana Migration Schema Error -**Symptom**: `#grafanaType` value not matching, `#mapping` field path references fail, or `#panel` lookups resolve to `_|_` (bottom). -**Cause**: `#grafanaType` does not match the Grafana plugin ID exactly, or `#mapping` references a field path that does not exist on `#panel`. 
-**Fix**: Verify `#grafanaType` matches the Grafana plugin `type` field exactly (e.g., `"timeseries"`, not `"time_series"`). Check that `#mapping` field paths use `#panel.` with the correct Grafana JSON structure. - -### percli plugin test-schemas Failure: Schema/Example Not Found -**Symptom**: `no schema found` or `no example found` — test runner skips or errors on the plugin. -**Cause**: Directory structure does not follow the expected convention, or files are misnamed. -**Fix**: Ensure files are at `schemas///.cue` and `schemas///.json`. Names must match exactly. - ---- - -## Anti-Patterns - -| Anti-Pattern | Why It Fails | Correct Approach | -|---|---|---| -| **Open specs** (no `close({})`) | Allows any field through — invalid JSON passes validation silently | Always wrap spec fields in `close({...})` | -| **Not importing shared types** | Redefining `#format`, `#thresholds`, or `#calculation` locally diverges from upstream | Import from `github.com/perses/shared/cue/common` | -| **Schema without JSON example** | `percli plugin test-schemas` has nothing to validate against; schema correctness is unverified | Always create the `.json` example alongside the `.cue` schema | -| **Migration without real Grafana JSON** | `#mapping` paths are guessed and silently wrong at runtime | Test migration schemas against an actual exported Grafana panel JSON | -| **Using `package migrate` for schemas** | CUE loader expects `package model` for plugin schemas | Reserve `package migrate` only for files in the `migrate/` subdirectory | -| **Nested type defined outside close()** | Nested `#typeName` definitions placed outside `close({})` are not validated as part of the spec | Define nested types inside the `close({})` block | - ---- - -## Anti-Rationalization - -| Rationalization | Reality | Required Action | -|---|---|---| -| "The schema looks correct" | CUE has subtle constraints — looking correct is not being correct | **Run `percli plugin test-schemas`** | -| "close() is 
optional for simple schemas" | Without close(), any misspelled field passes silently | **Always use close() — no exceptions** | -| "I'll add the JSON example later" | Schema without example is untested; bugs compound | **Write the JSON example before moving on** | -| "The migration mapping is straightforward" | Grafana field paths vary across plugin versions | **Test against real Grafana export JSON** | -| "Common types aren't needed for this schema" | Diverging from upstream types causes runtime incompatibility | **Import common types unless genuinely unused** | - ---- - -## FORBIDDEN Patterns +## Instructions -- **NEVER** use `package` names other than `model` (schemas) or `migrate` (migration files) -- **NEVER** omit `close({})` around spec definitions -- **NEVER** declare a schema complete without a passing `percli plugin test-schemas` run -- **NEVER** hardcode values in migration `#mapping` — always reference `#panel` field paths -- **NEVER** place migration files outside the `migrate/` subdirectory +### Pre-Flight Check ---- +Before starting, stop and ask the user if any of these apply: -## Blocker Criteria - -Stop and ask the user before proceeding if: -- The plugin type is not one of the standard Perses plugin types (panel, variable, datasource) -- The user wants to modify shared types in `github.com/perses/shared` — these are upstream +- The plugin type is not panel, variable, or datasource (the standard Perses plugin types) +- The user wants to modify shared types in `github.com/perses/shared` (these are upstream and not ours to change) - `percli` is not installed or `percli plugin test-schemas` is unavailable in the environment - The target directory already contains schemas that would be overwritten ---- +### Phase 1: Define Data Model -## Instructions +Create the CUE schema for the plugin spec. + +**File placement**: Place the file at `schemas///.cue`. 
The file name must match the plugin name exactly because `percli plugin test-schemas` discovers schemas by convention and skips misnamed files. -### Phase 1: DEFINE DATA MODEL +**Package and structure requirement**: Every schema file must use `package model` as its first line because any other package name causes a CUE loader error (`package is "foo", want "model"`). The only exception is migration files, which use `package migrate` instead. -**Goal**: Create the CUE schema for the plugin spec. +Wrap all spec fields in `close({...})` because without it, CUE accepts any field name -- misspelled or unexpected fields pass validation silently, making the schema useless as a contract. Nested type definitions must live inside `close()` to be validated as part of the spec. -Location: `schemas///.cue` +**Version requirement**: This skill requires CUE v0.12.0+. Earlier versions have incompatible syntax for Perses schemas. ```cue package model @@ -160,7 +64,9 @@ spec: close({ // Optional fields (note the ?) optionalField?: int - // Constrained fields using shared types + // Constrained fields -- import shared types from common rather than + // redefining them locally, because local redefinitions diverge from + // upstream and cause runtime incompatibility. format?: common.#format thresholds?: common.#thresholds calculation?: common.#calculation @@ -168,7 +74,8 @@ spec: close({ // Arrays of typed items items: [...#item] - // Nested type definitions (inside close) + // Nested type definitions MUST live inside close() -- definitions + // placed outside are not validated as part of the spec. #item: { name: string value: number @@ -176,23 +83,29 @@ spec: close({ }) ``` -**Gate**: Schema file written and syntactically valid. Proceed to Phase 2. +Explain CUE syntax as you write: `close()` enforces strict field contracts, optional fields use `?`, arrays use `[...#type]`, and nested types use `#name`. This skill is educational by default. 
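
As a rough analogy for what `close()` enforces, the check below models a closed field set in plain Python. It is only a sketch of the contract and does not reflect how CUE evaluates schemas; the function name `check_closed` and the field names in the usage note are hypothetical.

```python
def check_closed(spec: dict, required: set[str], optional: set[str]) -> list[str]:
    """Report violations of a closed field set, loosely mirroring close({...})."""
    errors = []
    present = set(spec)
    # Fields declared without `?` must be present.
    for name in sorted(required - present):
        errors.append(f"{name}: required field missing")
    # Anything not declared at all is rejected, which is the point of close().
    for name in sorted(present - required - optional):
        errors.append(f"{name}: field not allowed")
    return errors
```

For instance, `check_closed({"items": [], "colour": "red"}, {"items"}, {"format"})` reports `colour` as not allowed, which is the kind of misspelled or unexpected field an open spec would let through silently.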
+ +**Validation gate**: After writing the schema file, run `percli plugin test-schemas` immediately. Do not wait until the end; validating incrementally catches errors before they compound. + +**Gate**: Schema file written, `percli plugin test-schemas` passes on the schema alone. Proceed to Phase 2. + +### Phase 2: Create JSON Example -### Phase 2: CREATE JSON EXAMPLE +Write a JSON example that validates against the schema. -**Goal**: Write a JSON example that validates against the schema. +**File placement**: Place it at `schemas///.json` -- same directory, same base name as the `.cue` file. Every schema must have a corresponding JSON example because without one, `percli plugin test-schemas` has nothing to validate against and the schema's correctness is unverified. -Location: `schemas///.json` +Include all required fields and valid values for constrained types. Include optional fields to demonstrate their usage. -The JSON must include all required fields and valid values for any constrained types. Optional fields should be included to demonstrate their usage. +Run `percli plugin test-schemas` after writing the example. -**Gate**: JSON example file written. Proceed to Phase 3. +**Gate**: JSON example file written, `percli plugin test-schemas` passes with both schema and example. Proceed to Phase 3. -### Phase 3: WRITE MIGRATION (optional) +### Phase 3: Write Migration (optional) -**Goal**: Define Grafana-to-Perses field mapping. +Only proceed if the user explicitly requests Grafana migration support. -Location: `schemas///migrate/migrate.cue` +**File placement**: Place the migration file at `schemas///migrate/migrate.cue`. Migration files must live in the `migrate/` subdirectory and must never be placed elsewhere because the path structure is how `percli` discovers them. 
```cue package migrate @@ -201,33 +114,64 @@ import "github.com/perses/shared/cue/migrate" #grafanaType: "" #mapping: { - // Map Grafana panel fields to Perses spec fields + // Map Grafana panel fields to Perses spec fields. + // Always reference #panel field paths -- never hardcode values, + // because Grafana field paths vary across plugin versions. perses_field: #panel.grafana_field } ``` -Only proceed if the user requests migration support. +**Constraint 1 - Type matching**: `#grafanaType` must match the Grafana plugin `type` field exactly (e.g., `"timeseries"`, not `"time_series"`). A mismatch causes `#panel` lookups to resolve to `_|_` (bottom), breaking the entire migration. -**Gate**: Migration file written. Proceed to Phase 4. +**Constraint 2 - Field validation**: Test against real exported Grafana panel JSON. Guessed `#mapping` paths are silently wrong at runtime because there is no compile-time check on field existence. Actual data is essential for correctness. -### Phase 4: VALIDATE +Run `percli plugin test-schemas` after writing the migration file. -**Goal**: Confirm all schemas and examples pass validation. +**Gate**: Migration file written, `percli plugin test-schemas` passes. Proceed to Phase 4. + +### Phase 4: Validate + +Confirm all schemas and examples pass validation together: ```bash percli plugin test-schemas ``` -If validation fails, return to the relevant phase and fix. Do not declare completion until tests pass. +If validation fails, return to the relevant phase and fix. Do not declare completion until tests pass -- a schema that "looks correct" is not correct until `percli` confirms it. **Gate**: `percli plugin test-schemas` passes. Task complete. ---- +## Error Handling + +### CUE Compilation Error: Wrong Package Name +**Symptom**: `package is "foo", want "model"` or similar CUE loader error. +**Cause**: Schema file uses a package name other than `model`. +**Fix**: Change the first line of the `.cue` file to `package model`. 
Migration files use `package migrate` instead. + +### CUE Compilation Error: Unclosed Spec or Bad Import +**Symptom**: `cannot find package`, `expected '}' found EOF`, or `import path not valid`. +**Cause**: Missing closing brace in `close({...})`, typo in import path, or missing import statement. +**Fix**: Verify braces are balanced in the `close({})` block. Confirm the import path is exactly `github.com/perses/shared/cue/common` (not shortened or aliased incorrectly). + +### JSON Example Mismatch: close() Rejects Unknown Fields +**Symptom**: `percli plugin test-schemas` fails with `field not allowed` on a field present in the JSON but absent from the CUE schema. +**Cause**: `close({})` enforces a strict field set -- the JSON example contains fields the schema does not declare. +**Fix**: Either add the missing field to the CUE spec (with `?` if optional) or remove it from the JSON example. Also check for type mismatches (e.g., `string` in schema but `number` in JSON). + +### Grafana Migration Schema Error +**Symptom**: `#grafanaType` value not matching, `#mapping` field path references fail, or `#panel` lookups resolve to `_|_` (bottom). +**Cause**: `#grafanaType` does not match the Grafana plugin ID exactly, or `#mapping` references a field path that does not exist on `#panel`. +**Fix**: Verify `#grafanaType` matches the Grafana plugin `type` field exactly (e.g., `"timeseries"`, not `"time_series"`). Check that `#mapping` field paths use `#panel.` with the correct Grafana JSON structure. + +### percli plugin test-schemas Failure: Schema/Example Not Found +**Symptom**: `no schema found` or `no example found` -- test runner skips or errors on the plugin. +**Cause**: Directory structure does not follow the expected convention, or files are misnamed. +**Fix**: Ensure files are at `schemas///.cue` and `schemas///.json`. Names must match exactly. 
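
Before running `percli plugin test-schemas`, a quick local pre-check can catch the naming problem described above. The sketch below lists `.cue` schema files that have no same-named `.json` example beside them. It is an illustration, not part of `percli`: the function name is hypothetical, and skipping everything under `migrate/` is an assumption that migration files do not need a JSON example.

```python
from pathlib import Path


def find_unpaired_schemas(schemas_root: Path) -> list[Path]:
    """Return .cue schema files with no same-named .json example next to them."""
    unpaired = []
    for cue_file in sorted(schemas_root.rglob("*.cue")):
        # Assumption: files under a migrate/ directory are migration files
        # and are not expected to have a JSON example.
        if "migrate" in cue_file.relative_to(schemas_root).parts:
            continue
        if not cue_file.with_suffix(".json").is_file():
            unpaired.append(cue_file)
    return unpaired
```

An empty result only means every schema has a sibling example with a matching base name; it says nothing about whether the example validates.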
## References -- [Perses Plugin Development Guide](https://perses.dev/docs/plugins/) — official plugin documentation -- [CUE Language Specification](https://cuelang.org/docs/reference/spec/) — CUE syntax and semantics -- [Perses Shared CUE Types](https://github.com/perses/perses/tree/main/cue/schemas) — `common.#format`, `common.#thresholds`, etc. -- [percli CLI Reference](https://perses.dev/docs/tooling/percli/) — `percli plugin test-schemas` and other commands -- [Grafana Panel Schema Reference](https://grafana.com/docs/grafana/latest/developers/plugins/) — for migration `#grafanaType` values +- [Perses Plugin Development Guide](https://perses.dev/docs/plugins/) -- official plugin documentation +- [CUE Language Specification](https://cuelang.org/docs/reference/spec/) -- CUE syntax and semantics +- [Perses Shared CUE Types](https://github.com/perses/perses/tree/main/cue/schemas) -- `common.#format`, `common.#thresholds`, etc. +- [percli CLI Reference](https://perses.dev/docs/tooling/percli/) -- `percli plugin test-schemas` and other commands +- [Grafana Panel Schema Reference](https://grafana.com/docs/grafana/latest/developers/plugins/) -- for migration `#grafanaType` values diff --git a/skills/perses-dashboard-create/SKILL.md b/skills/perses-dashboard-create/SKILL.md index a5aba657..8e8563c2 100644 --- a/skills/perses-dashboard-create/SKILL.md +++ b/skills/perses-dashboard-create/SKILL.md @@ -30,69 +30,48 @@ routing: # Perses Dashboard Create -Guided workflow for creating Perses dashboards with validation and deployment. +Guided workflow for creating Perses dashboards from requirements through validation and deployment. -## Operator Context +## Overview -This skill operates as a guided workflow for Perses dashboard creation, from requirements gathering through validation and deployment. +This workflow guides you through four phases: gathering requirements, generating a dashboard definition, validating it, and deploying to Perses. 
The skill applies sensible defaults at each phase to minimize configuration while remaining flexible for advanced scenarios. -### Hardcoded Behaviors (Always Apply) -- **Validate before deploy**: Always run `percli lint` on generated dashboard definitions before applying -- **MCP-first**: Use Perses MCP tools when available, percli CLI as fallback -- **Scope-aware**: Ask which project the dashboard belongs to. Create the project first if it doesn't exist -- **Plugin-aware**: Only use panel/query/variable plugin kinds from the official 27 plugins - -### Default Behaviors (ON unless disabled) -- **CUE output**: Generate CUE definitions by default (can switch to JSON/YAML if requested) -- **Prometheus datasource**: Default to Prometheus datasource if no datasource type specified -- **Grid layout**: Use Grid layout with collapsible rows by default -- **Variable templating**: Add common variables (job, instance, namespace) based on query patterns - -### Optional Behaviors (OFF unless enabled) -- **Go SDK output**: Generate Go SDK code instead of CUE -- **Ephemeral mode**: Create EphemeralDashboard with TTL for preview/CI -- **Bulk creation**: Generate multiple dashboards from a specification - -## What This Skill CAN Do -- Create complete dashboard definitions in CUE, Go, JSON, or YAML format -- Configure datasources (Prometheus, Tempo, Loki, Pyroscope, ClickHouse, VictoriaLogs) -- Set up variables with proper chains and interpolation formats -- Validate definitions with percli lint -- Deploy to Perses via percli apply or MCP tools -- Generate CI/CD config for Dashboard-as-Code workflows - -## What This Skill CANNOT Do -- Migrate Grafana dashboards (use perses-grafana-migrate) -- Create custom panel plugins (use perses-plugin-create) -- Deploy/configure Perses server itself (use perses-deploy) - ---- +**Key workflow principle**: Requirements → Definition → Validation → Deployment. 
Never skip validation, even for simple dashboards, because percli lint catches structural errors early. ## Instructions ### Phase 1: GATHER Requirements -**Goal**: Understand what the dashboard should display. +Understand what the dashboard should display. 1. **Identify metrics/data**: What should the dashboard show? (CPU, memory, request rates, traces, logs) -2. **Identify datasource**: Which backend? (Prometheus, Tempo, Loki, Pyroscope, ClickHouse, VictoriaLogs) -3. **Identify project**: Which Perses project does this belong to? -4. **Identify layout**: How many panels? How should they be organized? -5. **Identify variables**: What filters should be available? (cluster, namespace, pod, job, instance) - -If the user provides minimal info, make reasonable defaults: -- Default datasource: Prometheus -- Default variables: job, instance -- Default layout: Grid with collapsible rows, 12-column width -- Default panels: TimeSeriesChart for time series, StatChart for single values, Table for lists +2. **Identify datasource**: Which backend? (Prometheus, Tempo, Loki, Pyroscope, ClickHouse, VictoriaLogs). Defaults to Prometheus unless a datasource type is explicitly specified because Prometheus is the most common monitoring backend. +3. **Identify project**: Which Perses project does this belong to? Always ask — if the project does not exist, create it first before proceeding, because dashboards cannot exist without a project. +4. **Identify layout**: How many panels? How should they be organized? Defaults to Grid layout with collapsible rows using a 12-column width because this layout accommodates responsive rendering and flexible panel sizing. +5. **Identify variables**: What filters should be available? (cluster, namespace, pod, job, instance). Automatically add job, instance, and namespace variables when query patterns suggest common labels, because these are the most frequently queried labels in monitoring scenarios. 
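
The requirements and defaults above can be sketched as a small data structure. This is a minimal illustration under stated assumptions: `DashboardRequirements` and `suggest_variables` are hypothetical names, and the regular expression is only one simple way to decide that a query pattern "suggests" a common label.

```python
import re
from dataclasses import dataclass, field

# Labels the workflow treats as common filters.
COMMON_LABELS = ("job", "instance", "namespace")


@dataclass
class DashboardRequirements:
    """Hypothetical container for the answers gathered in Phase 1."""
    project: str                    # no default: always ask the user
    metrics: list[str]
    datasource: str = "Prometheus"  # default backend when none is specified
    layout: str = "Grid"            # Grid layout with collapsible rows
    variables: list[str] = field(default_factory=list)


def suggest_variables(queries: list[str]) -> list[str]:
    """Return the common labels that appear as label matchers in any query."""
    found = []
    for label in COMMON_LABELS:
        # Matches e.g. job="api", namespace=~"prod.*", instance!="x"
        pattern = re.compile(rf"\b{label}\s*(?:=~|!~|!=|=)")
        if any(pattern.search(query) for query in queries):
            found.append(label)
    return found
```

Keeping `project` without a default mirrors the rule that the project is always asked for, while every other field falls back to a default when the user gives minimal information.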
+ +**Application of defaults**: When the user provides minimal information, apply these defaults to reduce friction: +- **Output format**: CUE definition by default because CUE provides strong type checking and modularity. Switch to JSON, YAML, or Go SDK only when explicitly requested. +- **Datasource**: Prometheus (default) unless another type is specified +- **Variables**: job, instance (minimum set for filtering) +- **Layout**: Grid with collapsible rows, 12-column width +- **Panels**: TimeSeriesChart for time series, StatChart for single values, Table for lists + +**Optional modes** (activate only when explicitly requested to keep the primary path simple): +- **Go SDK output**: Generate Go SDK code instead of CUE +- **Ephemeral mode**: Create EphemeralDashboard with TTL for preview/CI use +- **Bulk creation**: Generate multiple dashboards from a specification **Gate**: Requirements gathered. Proceed to Phase 2. ### Phase 2: GENERATE Definition -**Goal**: Create the dashboard definition. +Create the dashboard definition. **Step 1: Check for Perses MCP tools** + +Use MCP tools as the primary interface because they provide direct API integration and better error context. Fall back to percli CLI only when MCP is unavailable. + ``` Use ToolSearch("perses") to discover available MCP tools. If perses_list_projects is available, use it to verify the target project exists. @@ -101,7 +80,9 @@ If not, use percli get project to check. **Step 2: Generate dashboard definition** -Generate a CUE definition by default. The structure follows: +Generate a CUE definition by default, because CUE provides strong validation and modularity over JSON. Only use plugin kinds from the official set below — no invented or third-party kinds, because the Perses API only recognizes these standard plugins. + +The structure follows: ```yaml kind: Dashboard @@ -179,9 +160,8 @@ spec: ### Phase 3: VALIDATE -**Goal**: Ensure the dashboard definition is valid. 
+Always validate before deploying — never skip this phase even for simple dashboards, because percli lint catches structural errors that prevent deployment and schema mismatches early. -Run validation: ```bash percli lint -f # OR with online validation against running server: @@ -194,12 +174,12 @@ If validation fails, fix the issues and re-validate. ### Phase 4: DEPLOY -**Goal**: Deploy the dashboard to Perses. +Deploy the dashboard to Perses. -**Option A: MCP tools** (preferred if available) +**Option A: MCP tools** (preferred — use when available because MCP provides better error handling and atomicity) Use `perses_create_dashboard` MCP tool to create the dashboard directly. -**Option B: percli CLI** +**Option B: percli CLI** (fallback when MCP is unavailable) ```bash percli apply -f ``` @@ -218,3 +198,21 @@ perses_get_dashboard_by_name(project=, dashboard=) ``` **Gate**: Dashboard deployed and verified. Task complete. + +## Error Handling + +| Error | Cause | Fix | +|-------|-------|-----| +| `percli lint` fails with unknown plugin kind | Used a plugin kind not in the official set | Replace with one of the 16 panel or 4 variable plugin kinds listed in Phase 2 | +| Project does not exist | Dashboard targets a non-existent project | Create the project first with `percli apply` or `perses_create_project` MCP tool | +| MCP tool not found | Perses MCP server not connected | Fall back to percli CLI commands | +| `percli apply` auth error | Missing or expired credentials | Run `percli login` or check `~/.perses/config.yaml` | +| Online lint fails but offline passes | Server-side schema stricter than local | Fix the server-reported issues — online validation is authoritative | + +## References + +- Perses dashboard spec: https://perses.dev/docs/api/dashboard/ +- percli CLI: https://perses.dev/docs/tooling/percli/ +- Plugin catalog: https://perses.dev/docs/plugins/ +- Dashboard-as-Code: https://perses.dev/docs/tooling/dac/ +- Variable interpolation: 
https://perses.dev/docs/user-guides/variables/ diff --git a/skills/perses-dashboard-review/SKILL.md b/skills/perses-dashboard-review/SKILL.md index 46d4ba4f..75fd2adc 100644 --- a/skills/perses-dashboard-review/SKILL.md +++ b/skills/perses-dashboard-review/SKILL.md @@ -28,157 +28,49 @@ routing: Analyze and improve existing Perses dashboards through structured review of layout, queries, variables, and datasource configuration. -## Operator Context - -This skill operates as a non-destructive auditor of Perses dashboards, fetching definitions via MCP tools or percli CLI, then systematically analyzing them against quality criteria. - -### Hardcoded Behaviors (Always Apply) -- **Non-destructive by default** -- never modify dashboards without explicit `--fix` flag or user confirmation -- **MCP-first retrieval** -- attempt `perses_get_dashboard_by_name` before falling back to percli CLI -- **Structured findings** -- every finding must have a severity (critical, warning, info) and a concrete recommendation -- **Full-scope analysis** -- check all five review areas (layout, queries, variables, datasources, metadata) on every review -- **Preserve dashboard identity** -- never change dashboard name, project assignment, or display metadata unless explicitly requested - -### Default Behaviors (ON unless disabled) -- **Severity-sorted output** -- report findings from critical to info -- **Query validation** -- check PromQL/LogQL syntax and common anti-patterns (missing rate intervals, unbounded selectors) -- **Variable chain analysis** -- trace variable dependencies and verify ordering -- **Layout audit** -- flag orphan panels, empty rows, and width overflows - -### Optional Behaviors (OFF unless enabled) -- **Fix mode** (`--fix`) -- apply recommended improvements and redeploy -- **Performance scoring** -- assign a numeric quality score (0-100) to the dashboard -- **Cross-dashboard analysis** -- compare variable/datasource usage across multiple dashboards in the same project - 
-## What This Skill CAN Do -- Fetch dashboard definitions via MCP tools or percli CLI -- Analyze panel grid layout for organization and efficiency -- Audit PromQL and LogQL queries for correctness and performance -- Validate variable chains for dependency ordering and circular references -- Check datasource scoping and reachability -- Identify unused panels, missing descriptions, and unclear titles -- Generate a structured findings report with severity levels -- Apply fixes when `--fix` mode is enabled - -## What This Skill CANNOT Do -- Create new dashboards from scratch (use `perses-dashboard-create`) -- Develop or modify Perses plugins (use `perses-plugin-create`) -- Deploy or configure Perses server instances (use `perses-deploy`) -- Manage Perses projects, roles, or RBAC configuration +## Overview ---- - -## Error Handling - -### MCP Tools Not Available -**Symptom**: `perses_get_dashboard_by_name` or `perses_list_dashboards` calls fail or are not registered. -**Action**: Fall back to percli CLI. Run `percli describe dashboard --project -ojson`. If percli is also unavailable, ask the user to provide the dashboard JSON directly or check MCP server configuration. - -### Dashboard Not Found -**Symptom**: MCP or percli returns 404 or empty result for the dashboard name. -**Action**: List available dashboards with `perses_list_dashboards(project=)` or `percli get dashboard --project `. Confirm the project name and dashboard name with the user. Dashboard names are case-sensitive and use kebab-case by convention. - -### Datasource Unreachable -**Symptom**: Datasource referenced in panels returns connection errors, proxy failures, or auth rejections during validation. -**Action**: Log the unreachable datasource as an info-level finding (not a dashboard quality issue). Note which panels are affected. Do not block the review -- continue analyzing query syntax and structure without live validation. Suggest the user verify network/proxy/auth configuration separately. 
- -### Variable Chain Circular Dependency -**Symptom**: Variable A depends on variable B which depends on variable A (directly or transitively). -**Action**: Flag as a **critical** finding. Map the full dependency cycle and include it in the report. In `--fix` mode, propose breaking the cycle by making one variable static or removing the circular matcher. Never auto-fix circular dependencies without user confirmation. - -### Malformed Dashboard JSON -**Symptom**: Dashboard definition fails to parse or is missing required fields (`kind`, `metadata`, `spec`). -**Action**: Report the structural error and halt analysis. Do not attempt partial review of a malformed definition -- the results would be unreliable. - ---- - -## Anti-Patterns - -### Reviewing Only Panel Queries Without Checking Layout -**Wrong**: Jump straight to PromQL/LogQL analysis and ignore how panels are organized. -**Right**: Always start with layout review (Phase 2, Step 1). A dashboard with correct queries but chaotic layout is still a poor dashboard. Check grid positioning, row grouping, collapsible sections, and panel widths before diving into queries. - -### Not Checking Variable Dependency Ordering -**Wrong**: Review variables in isolation without tracing which variables feed into others. -**Right**: Build a dependency graph of all variables. Verify that variables are defined in topological order (parents before children). Check that matchers like `$variable_name` reference variables that are already resolved. Variable `sort_order` in `spec.display` affects the rendered dropdown but not dependency resolution -- don't confuse the two. - -### Ignoring Datasource Scope Mismatches -**Wrong**: Assume all panels can reach all datasources because they're "in the same Perses instance." -**Right**: Datasources in Perses have explicit scope (global or project-level). A panel referencing a project-scoped datasource from another project will fail silently at render time. 
Verify that every panel's datasource reference resolves within the dashboard's project scope or is globally available. - -### Treating All Findings as Equal Severity -**Wrong**: List every finding as a flat bullet list without distinguishing between broken queries and cosmetic suggestions. -**Right**: Assign severity levels. Critical = dashboard is broken or produces wrong data. Warning = dashboard works but has performance or usability issues. Info = cosmetic or best-practice suggestions. - ---- - -## Anti-Rationalization - -| Rationalization | Reality | Required Action | -|-----------------|---------|-----------------| -| "The queries look correct to me" | Visual inspection misses rate interval mismatches, label collisions, and unbounded selectors | **Parse and validate each query against known anti-patterns** | -| "Variable ordering doesn't matter" | Perses evaluates variables top-to-bottom; misordered dependencies cause empty dropdowns | **Build the dependency graph and verify topological order** | -| "Only a few panels -- quick scan is enough" | Small dashboards still have datasource scoping, variable chains, and layout issues | **Run the full 4-phase pipeline regardless of dashboard size** | -| "The dashboard renders fine so it must be correct" | Rendering without errors does not mean queries return correct data or layout is optimal | **Analyze query semantics and layout structure, not just render success** | - ---- - -## FORBIDDEN Patterns - -- **NEVER** modify a dashboard without `--fix` mode or explicit user confirmation -- **NEVER** delete panels, variables, or datasources during review -- only flag them as findings -- **NEVER** skip the FETCH phase and work from stale or assumed dashboard state -- **NEVER** report on dashboard quality without actually retrieving the current definition -- **NEVER** auto-fix circular variable dependencies without user approval -- **NEVER** change datasource assignments -- only report scope mismatches - ---- - -## Blocker 
Criteria - -The review **MUST NOT** proceed past the FETCH phase if: -- Dashboard definition cannot be retrieved (MCP + percli both fail, no JSON provided) -- Dashboard JSON is malformed and fails structural validation -- Project does not exist or user lacks read permissions - -The review **MUST** flag as critical blockers: -- Circular variable dependency chains -- Panels referencing non-existent datasources -- Queries that fail syntactic validation (malformed PromQL/LogQL) -- Grid layout with panels exceeding the 24-column width limit - ---- +This is a non-destructive review skill that audits dashboard quality without modification unless explicitly requested. The skill runs a 4-phase pipeline (FETCH, ANALYZE, REPORT, FIX) to identify layout issues, query inefficiencies, variable chain problems, and datasource scoping mismatches. Use this skill to review existing dashboards for quality improvements. Do NOT use this skill to create new dashboards—use perses-dashboard-create instead. ## Instructions ### Phase 1: FETCH -**Goal**: Retrieve the current dashboard definition. +**Goal**: Retrieve the current dashboard definition from the live server. -1. Attempt MCP retrieval: - ``` - perses_get_dashboard_by_name(project=, dashboard=) - ``` -2. If MCP unavailable, fall back to percli: - ```bash - percli describe dashboard --project -ojson - ``` -3. If both fail, ask the user to provide the dashboard JSON directly. -4. Parse and validate the JSON structure (`kind: Dashboard`, `metadata.name`, `spec.panels`, `spec.layouts`, `spec.variables`, `spec.datasources`). +1. Always fetch the live definition via MCP or API first, because a dashboard that "looks fine" locally may differ from what is deployed. 
+ - Attempt MCP retrieval first (ensures freshest server-side state): + ``` + perses_get_dashboard_by_name(project=, dashboard=) + ``` + - If MCP is unavailable, fall back to percli CLI: + ```bash + percli describe dashboard --project -ojson + ``` + - If both fail, ask the user to provide the dashboard JSON directly. -**Gate**: Dashboard definition retrieved and structurally valid. Proceed to Phase 2. +2. Parse and validate the JSON structure. The review MUST NOT proceed past this phase if: + - Dashboard definition cannot be retrieved (MCP + percli both fail, no JSON provided) + - Dashboard JSON is malformed and fails structural validation (required fields: `kind: Dashboard`, `metadata.name`, `spec.panels`, `spec.layouts`, `spec.variables`, `spec.datasources`) + - Project does not exist or user lacks read permissions ### Phase 2: ANALYZE -**Goal**: Systematically audit all dashboard components. +**Goal**: Systematically audit all five dashboard components. Run all five analysis steps on every review regardless of dashboard size, because small dashboards still have datasource scoping, variable chains, and layout issues that skipping would miss. 
#### Step 1: Layout Review -- Verify grid layout uses 24-column system correctly (no panel exceeds width 24) + +Start here before touching queries (a dashboard with correct queries but chaotic layout is still a poor dashboard): + +- Verify grid layout uses 24-column system correctly — panels exceeding width 24 are a critical blocker - Check for collapsible rows with logical grouping - Identify orphan panels (defined in `spec.panels` but absent from `spec.layouts`) - Flag empty rows or sections with no panels #### Step 2: Query Analysis + +Parse and validate each query against known anti-patterns (visual inspection misses rate interval mismatches, label collisions, and unbounded selectors): + - Parse each panel's query (PromQL, LogQL, TraceQL, or SQL depending on plugin) - Check for common anti-patterns: - Missing `$__rate_interval` or hardcoded rate intervals @@ -186,34 +78,53 @@ The review **MUST** flag as critical blockers: - `rate()` without appropriate range vector - Recording rule candidates (complex expressions used in multiple panels) - Verify query references to variables use correct interpolation format (`$variable` or `${variable}`) +- Queries failing syntactic validation (malformed PromQL/LogQL) are critical blockers and halt analysis #### Step 3: Variable Chain Analysis + +Build the full dependency graph rather than reviewing variables in isolation, because Perses evaluates variables top-to-bottom and misordered dependencies cause empty dropdowns at render time: + - Build dependency graph from variable definitions - Verify topological ordering (parent variables defined before children) -- Check for circular dependencies +- Check for circular dependencies (critical blocker that prevents dashboard rendering) - Validate `matchers` reference existing variables - Check interpolation formats are appropriate for the context: `csv`, `regex`, `json`, `pipe`, `glob`, `lucene`, etc. 
- Confirm `spec.display.name` and `spec.display.description` are set for user-facing variables +- Note: `sort_order` in `spec.display` affects the rendered dropdown order but not dependency resolution—do not confuse them #### Step 4: Datasource Scoping + +Datasources in Perses have explicit scope (global or project-level). A panel referencing a project-scoped datasource from another project will fail silently at render time, so assume nothing about datasource reach: + - Map each panel to its datasource reference - Verify datasource scope (global vs. project-level) matches the dashboard's project -- Check for datasources referenced but not defined in the dashboard's `spec.datasources` +- Check for datasources referenced but not defined in the dashboard's `spec.datasources` (critical blocker) - Flag proxy configuration issues if datasource URLs are internal-only +- Never change datasource assignments during review—only report scope mismatches #### Step 5: Metadata and Usability + - Check for missing panel titles or descriptions - Verify dashboard-level `spec.display` has a meaningful name - Flag panels with identical titles (confusing for users) - Check `spec.duration` (default time range) is set appropriately +- Preserve dashboard identity: never change dashboard name, project assignment, or display metadata unless explicitly requested -**Gate**: All five analysis steps completed. Proceed to Phase 3. +**Completion Gate**: All five analysis steps completed. Findings collected and categorized. Proceed to Phase 3. ### Phase 3: REPORT -**Goal**: Generate a structured findings report. +**Goal**: Generate a structured findings report with severity levels and recommendations. 
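Findings such as misordered or circular variables (Step 3 of Phase 2) are easier to report when they come from a deterministic check. The sketch below is one possible approach under stated assumptions: the input is simplified to an ordered list of hypothetical `(name, definition_text)` pairs rather than real Perses variable objects, and the regular expression is an assumed approximation of the `$variable`, `${variable}`, and `${variable:format}` forms.

```python
import re

# Assumed approximation of the interpolation forms; not taken from the Perses source.
VAR_REF = re.compile(r"\$\{(\w+)(?::\w+)?\}|\$(\w+)")


def references(text, known):
    """Names of known variables referenced inside a definition string."""
    found = set()
    for match in VAR_REF.finditer(text):
        name = match.group(1) or match.group(2)
        if name in known:  # ignores builtins such as $__rate_interval
            found.add(name)
    return found


def find_cycle(deps):
    """Depth-first search; returns one cycle as a list of names, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {name: WHITE for name in deps}
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for dep in sorted(deps[node]):
            if color[dep] == GRAY:
                return stack[stack.index(dep):] + [dep]
            if color[dep] == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for name in deps:
        if color[name] == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


def analyze_variable_chain(variables):
    """variables: ordered list of (name, definition_text) pairs (hypothetical shape)."""
    names = [name for name, _ in variables]
    known = set(names)
    deps = {name: references(text, known) for name, text in variables}
    position = {name: index for index, name in enumerate(names)}
    # A variable is misordered when it depends on one defined later in the list.
    misordered = [
        (name, dep)
        for name in names
        for dep in sorted(deps[name])
        if position[dep] > position[name]
    ]
    return {"misordered": misordered, "cycle": find_cycle(deps)}
```

A non-`None` `cycle` would map to a critical finding, while entries in `misordered` describe parent variables that are defined after their children.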
+ +Assign severity by impact: +- **Critical**: Dashboard is broken or produces wrong data +- **Warning**: Dashboard works but has performance or usability issues +- **Info**: Cosmetic or best-practice suggestions + +Note: A dashboard that renders without errors is not necessarily correct—analyze query semantics and layout structure, not just render success. + +Format findings as: -Output format: ``` ## Dashboard Review: (project: ) @@ -233,15 +144,15 @@ Output format: - Critical: N | Warnings: N | Info: N ``` -**Gate**: Report generated with all findings categorized. If `--fix` not requested, task complete. +**Completion Gate**: Report generated with all findings categorized. If `--fix` flag not present, task complete. -### Phase 4: FIX (optional, requires --fix) +### Phase 4: FIX (optional, requires --fix flag) -**Goal**: Apply recommended improvements. +**Goal**: Apply recommended improvements to the dashboard. This phase is OFF by default—never modify a dashboard without `--fix` mode or explicit user confirmation. 1. Present the list of proposed fixes to the user for confirmation -2. Apply approved fixes to the dashboard JSON -3. Deploy the updated dashboard: +2. Apply approved fixes to the dashboard JSON (never auto-fix circular variable dependencies without user approval) +3. Deploy the updated dashboard via MCP or percli: ``` perses_update_dashboard(project=, dashboard=, body=) # OR @@ -249,9 +160,29 @@ Output format: ``` 4. Re-run Phase 2 ANALYZE on the updated dashboard to verify fixes resolved the findings -**Gate**: Fixes applied and verified. Task complete. +**Completion Gate**: Fixes applied and verified. Task complete. ---- +## Error Handling + +### MCP Tools Not Available +**Symptom**: `perses_get_dashboard_by_name` or `perses_list_dashboards` calls fail or are not registered. +**Action**: Fall back to percli CLI. Run `percli describe dashboard --project -ojson`. 
If percli is also unavailable, ask the user to provide the dashboard JSON directly or check MCP server configuration. + +### Dashboard Not Found +**Symptom**: MCP or percli returns 404 or empty result for the dashboard name. +**Action**: List available dashboards with `perses_list_dashboards(project=)` or `percli get dashboard --project `. Confirm the project name and dashboard name with the user. Dashboard names are case-sensitive and use kebab-case by convention. + +### Datasource Unreachable +**Symptom**: Datasource referenced in panels returns connection errors, proxy failures, or auth rejections during validation. +**Action**: Log the unreachable datasource as an info-level finding (not a dashboard quality issue). Note which panels are affected. Do not block the review -- continue analyzing query syntax and structure without live validation. Suggest the user verify network/proxy/auth configuration separately. + +### Variable Chain Circular Dependency +**Symptom**: Variable A depends on variable B which depends on variable A (directly or transitively). +**Action**: Flag as a **critical** finding. Map the full dependency cycle and include it in the report. In `--fix` mode, propose breaking the cycle by making one variable static or removing the circular matcher. Never auto-fix circular dependencies without user confirmation. + +### Malformed Dashboard JSON +**Symptom**: Dashboard definition fails to parse or is missing required fields (`kind`, `metadata`, `spec`). +**Action**: Report the structural error and halt analysis. Do not attempt partial review of a malformed definition -- the results would be unreliable. ## References diff --git a/skills/perses-datasource-manage/SKILL.md b/skills/perses-datasource-manage/SKILL.md index 66c0054f..83e1669b 100644 --- a/skills/perses-datasource-manage/SKILL.md +++ b/skills/perses-datasource-manage/SKILL.md @@ -25,37 +25,7 @@ routing: # Perses Datasource Management -Create, update, and manage datasources across scopes. 
- -## Operator Context - -This skill operates as the lifecycle manager for Perses datasources, handling creation, updates, and deletion across global, project, and dashboard scopes. - -### Hardcoded Behaviors (Always Apply) -- **Scope-aware**: Always clarify scope — global (all projects), project, or dashboard — because scope determines resource kind and override priority -- **MCP-first**: Use Perses MCP tools when available, percli as fallback -- **Proxy configuration**: Always configure allowedEndpoints for HTTP proxy datasources — without them, queries will be blocked by the proxy - -### Default Behaviors (ON unless disabled) -- **Global scope**: Default to global datasource unless project is specified -- **Default flag**: Set first datasource of each type as default - -### Optional Behaviors (OFF unless enabled) -- **Multi-backend**: Configure multiple datasources of the same type with different names -- **Dashboard-scoped**: Embed datasource config directly in dashboard spec - -## What This Skill CAN Do -- Create/update/delete datasources at any scope -- Configure HTTP proxy with allowed endpoints -- Manage datasource priority (global vs project vs dashboard) -- Support all 6 datasource types: Prometheus, Tempo, Loki, Pyroscope, ClickHouse, VictoriaLogs - -## What This Skill CANNOT Do -- Create the datasource backends themselves (Prometheus, Loki, etc.) -- Manage Perses server configuration (use perses-deploy) -- Create dashboards (use perses-dashboard-create) - ---- +Create, update, and manage datasources across scopes. Use Perses MCP tools when available; fall back to percli CLI when MCP is not connected. ## Instructions @@ -63,7 +33,9 @@ This skill operates as the lifecycle manager for Perses datasources, handling cr **Goal**: Determine datasource type, scope, and connection details. -**Supported types**: +Before proceeding, confirm the backend URL is reachable and the datasource type is one of the 6 supported plugin kinds. 
If either is unknown or unresolvable, stop and ask the user because a datasource cannot function without a valid backend, and Perses does not support arbitrary plugins without custom development. + +**Supported types** (use exact casing because these are case-sensitive Go type identifiers; `prometheusdatasource` or `prometheus` will fail with an unhelpful "invalid plugin kind" error): | Plugin Kind | Backend | Common Endpoints | |-------------|---------|-----------------| @@ -74,9 +46,17 @@ This skill operates as the lifecycle manager for Perses datasources, handling cr | ClickHouseDatasource | ClickHouse | N/A (direct connection) | | VictoriaLogsDatasource | VictoriaLogs | `/select/.*` | +If the user requests a plugin kind not installed on the Perses server, verify available plugins before attempting creation. + **Scopes** (priority order, highest first): Dashboard > Project > Global -A dashboard-scoped datasource overrides a project-scoped one of the same name, which overrides a global one. Use global for organization-wide defaults, project for team-specific overrides, dashboard for one-off configurations. +A dashboard-scoped datasource overrides a project-scoped one of the same name, which overrides a global one. Choose scope deliberately at creation time because moving from global to project later requires deleting the global datasource and recreating it as project-scoped — a disruptive migration. Ask: "Does every project need this, or just one team?" + +- **Global**: Organization-wide defaults. Default to this scope unless the user specifies a project. Avoid putting team-specific backends at global scope because it pollutes the namespace and makes per-team access control impossible. +- **Project**: Team-specific overrides. Use when a datasource serves more than one dashboard but not the entire organization. The project datasource `metadata.name` must match the global datasource name exactly for override to work (names are case-sensitive). 
+- **Dashboard**: One-off configurations embedded in the dashboard spec. Reserve for true one-off test configurations only because dashboard-scoped config is duplicated in every dashboard that needs it and cannot be shared. + +Set the first datasource of each plugin kind as `default: true` so dashboard panels auto-discover it. Set `default: true` on exactly one datasource per plugin kind per scope because behavior with multiple defaults is undefined and varies between Perses versions. **Gate**: Type, scope, and connection URL identified. Proceed to Phase 2. @@ -84,6 +64,12 @@ A dashboard-scoped datasource overrides a project-scoped one of the same name, w **Goal**: Create the datasource resource. +Every HTTP proxy datasource **must** include `allowedEndpoints` with both `endpointPattern` and explicit `method` entries. Without them, the proxy returns 403 on all queries with no useful error message. Never use `method: *` or omit the `method` field because the Perses proxy requires explicit method matching. Configure both GET and POST for most backends because Prometheus `/api/v1/query_range` and `/api/v1/labels` use POST for large payloads, and Loki/Tempo also mix methods. + +Never embed secrets (passwords, tokens) in datasource YAML committed to version control — use Perses native auth or external secret management. + +For non-local deployments, use container/service names instead of `localhost`. In Docker, use the container network name or `host.docker.internal`. In Kubernetes, use the service DNS name (e.g., `http://prometheus.monitoring.svc:9090`). `localhost` refers to the container itself and will break. + **Via MCP** (preferred): ``` perses_create_global_datasource(name="prometheus", type="PrometheusDatasource", url="http://prometheus:9090") @@ -142,6 +128,8 @@ EOF **Goal**: Confirm the datasource exists and is accessible. +Creation succeeding only means the API accepted the resource definition. 
It does not validate that the backend URL is reachable or that allowedEndpoints are correct. Always test with a real query. + ```bash # Global datasources percli get globaldatasource @@ -152,7 +140,7 @@ percli get datasource --project # Describe specific datasource percli describe globaldatasource -# Test proxy connectivity (global) +# Test proxy connectivity (global) -- this is the real validation curl -s http://localhost:8080/proxy/globaldatasources//api/v1/query?query=up # Test proxy connectivity (project-scoped) @@ -165,7 +153,11 @@ perses_list_global_datasources() perses_list_datasources(project="") ``` -**Gate**: Datasource listed and configuration confirmed. Task complete. +If the proxy test returns persistent 5xx errors, this indicates infrastructure issues beyond datasource configuration — stop and escalate to the user. + +Before deleting any global datasource, check which projects and dashboards reference it by running `percli get datasource --project ` across all projects. If it is used by multiple projects, confirm the blast radius with the user before proceeding. + +**Gate**: Datasource listed, configuration confirmed, and proxy query returns a non-error response. Task complete. --- @@ -182,53 +174,6 @@ perses_list_datasources(project="") --- -## Anti-Patterns - -| Anti-Pattern | Why It Fails | Do This Instead | -|--------------|-------------|-----------------| -| Creating all datasources at global scope | Pollutes the namespace, makes per-team access control impossible, and forces every project to see every datasource | Use global scope only for organization-wide defaults. 
Use project-scoped datasources for team-specific backends | -| Omitting `allowedEndpoints` on HTTP proxy datasources | Queries are blocked silently — the proxy returns 403 with no useful error message in dashboards, making debugging difficult | Always define `allowedEndpoints` with both the `endpointPattern` regex and `method` for every HTTP proxy datasource | -| Not setting `default: true` on the primary datasource | Dashboard panels cannot auto-discover the datasource. Users must manually select it in every panel, and panel YAML must hardcode the datasource name | Set `default: true` on exactly one datasource per plugin kind per scope. If you have multiple Prometheus datasources, designate one as default | -| Using dashboard-scoped datasources when project scope would enable reuse | Dashboard-scoped datasource config is embedded in the dashboard JSON and cannot be shared. Every dashboard that needs it must duplicate the config | Use project-scoped datasources for any datasource used by more than one dashboard. Reserve dashboard scope for true one-off test configurations | -| Hardcoding `localhost` URLs in non-local deployments | Breaks when Perses runs in Docker or Kubernetes because `localhost` refers to the container, not the host | Use container/service names: Docker network names for Compose, K8s service DNS for Helm deployments | - ---- - -## Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|---------------|-----------------| -| "The datasource was created successfully, so it must be working" | Creation succeeding only means the API accepted the resource definition. 
It does not validate that the backend URL is reachable or that allowedEndpoints are correct | **Test the proxy endpoint** with a real query: `curl` the `/proxy/globaldatasources//...` path and verify a non-error response | -| "I don't need allowedEndpoints because I'm only doing GET requests" | Prometheus `/api/v1/query_range` and `/api/v1/labels` use POST for large payloads. Loki and Tempo also mix methods. A GET-only config breaks silently on certain queries | **Always configure both GET and POST** for the relevant endpoint patterns unless the datasource documentation explicitly states only one method is used | -| "Global scope is fine — we can always move it later" | Moving from global to project scope requires deleting the global datasource and recreating it as project-scoped. All dashboards referencing it by name will keep working only if the project datasource name matches exactly. This is a disruptive migration | **Choose scope deliberately** at creation time. Ask: "Does every project need this, or just one team?" | -| "The datasource type name is probably case-insensitive" | Plugin kind names are case-sensitive Go type identifiers. `prometheusdatasource` or `prometheus` will fail with an unhelpful "invalid plugin kind" error | **Use exact casing**: `PrometheusDatasource`, `TempoDatasource`, etc. Copy from the supported types table | - ---- - -## FORBIDDEN Patterns - -These patterns cause silent failures, data loss, or security issues. Never use them. 
- -- **NEVER** create a datasource without `allowedEndpoints` on HTTP proxy types — results in silent 403 on all queries -- **NEVER** use `method: *` or omit the `method` field in allowedEndpoints — the Perses proxy requires explicit method matching -- **NEVER** set `default: true` on multiple datasources of the same plugin kind at the same scope — behavior is undefined and varies between Perses versions -- **NEVER** embed secrets (passwords, tokens) in datasource YAML committed to version control — use Perses native auth or external secret management -- **NEVER** delete a global datasource without checking which projects and dashboards reference it — use `percli get datasource --project ` across all projects first - ---- - -## Blocker Criteria - -Stop and escalate to the user if ANY of these conditions are true: - -- Backend URL is unknown or unresolvable — cannot create a functional datasource without a reachable backend -- Datasource type is not one of the 6 supported plugin kinds — Perses does not support arbitrary datasource plugins without custom plugin development -- User requests a datasource plugin kind that is not installed on the Perses server — verify available plugins before attempting creation -- Proxy test returns persistent 5xx errors after datasource creation — indicates infrastructure issues beyond datasource configuration -- User wants to delete a global datasource used by multiple projects — requires explicit confirmation of the blast radius - ---- - ## References | Resource | URL | diff --git a/skills/perses-deploy/SKILL.md b/skills/perses-deploy/SKILL.md index fe4b8d17..f1b498f6 100644 --- a/skills/perses-deploy/SKILL.md +++ b/skills/perses-deploy/SKILL.md @@ -29,35 +29,11 @@ routing: Deploy and configure Perses server instances across different environments. -## Operator Context +## Overview -This skill operates as a deployment guide for Perses server instances, covering local development, Kubernetes, and bare metal deployments. 
+This skill guides you through deploying Perses server instances (local development, Kubernetes, bare metal) and configuring them with databases, authentication, plugins, and provisioning folders. **Do NOT use this skill for dashboard creation (use perses-dashboard-create) or plugin development (use perses-plugin-create).** -### Hardcoded Behaviors (Always Apply) -- **Never expose admin credentials** in plain text — use environment variables or secrets -- **Always configure auth** for non-local deployments — at minimum enable native auth -- **Validate connectivity** after deployment — check `/api/v1/projects` endpoint responds - -### Default Behaviors (ON unless disabled) -- **Local dev default**: Docker with file-based storage if deployment target not specified -- **Plugin loading**: Configure official plugins from perses/plugins repository -- **Health check**: Verify Perses is running and API is accessible after deployment - -### Optional Behaviors (OFF unless enabled) -- **Production hardening**: TLS, OIDC auth, SQL database, resource limits -- **Kubernetes operator**: Deploy via perses-operator CRDs instead of Helm -- **MCP server setup**: Install and configure perses-mcp-server alongside Perses - -## What This Skill CAN Do -- Deploy Perses via Docker, Helm, binary, or K8s operator -- Configure server settings: database, auth, plugins, provisioning, frontend -- Set up MCP server integration for Claude Code -- Verify deployment health and connectivity - -## What This Skill CANNOT Do -- Create or manage dashboards (use perses-dashboard-create) -- Develop plugins (use perses-plugin-create) -- Manage Kubernetes clusters (use kubernetes-helm-engineer) +By default, local dev deployments use Docker with file-based storage if you don't specify a target. Health checks verify the API is accessible after deployment. Plugin loading configures official plugins from the perses/plugins repository. 
--- @@ -67,10 +43,13 @@ This skill operates as a deployment guide for Perses server instances, covering **Goal**: Determine deployment target and requirements. -1. **Deployment target**: Docker (local dev), Helm (Kubernetes), Binary (bare metal), or Operator (K8s CRDs) +1. **Deployment target**: Choose Docker (local dev), Helm (Kubernetes), Binary (bare metal), or Operator (K8s CRDs) + - Defaults to Docker with file-based storage if you don't specify a target 2. **Storage backend**: File-based (default, no external DB needed) or SQL (MySQL) 3. **Authentication**: None (local dev), Native (username/password), OIDC, OAuth, or K8s ServiceAccount + - For non-local deployments, enable at minimum native auth because public API access requires credentials 4. **Plugin requirements**: Official plugins only, or custom plugins too? + - Perses will auto-configure official plugins from perses/plugins repository by default 5. **MCP integration**: Should we also set up the Perses MCP server for Claude Code? **Gate**: Environment assessed. Proceed to Phase 2. @@ -143,7 +122,7 @@ EOF ### Phase 3: CONFIGURE -**Goal**: Configure server settings. +**Goal**: Configure server settings to control database, auth, plugins, and provisioning. 
**Server configuration** (config.yaml): @@ -154,10 +133,11 @@ database: folder: "/perses/data" extension: "yaml" -# Security +# Security — ALWAYS configure auth for non-local deployments (minimum native auth) security: readonly: false enable_auth: true + # Use 32-byte AES-256 key — NEVER expose in plain text, use env var or secrets encryption_key: "<32-byte-AES-256-key>" authentication: access_token_ttl: "15m" @@ -172,12 +152,12 @@ security: # issuer: "https://github.com" # redirect_uri: "https://perses.example.com/api/auth/providers/oidc/github/callback" -# Plugins +# Plugins — auto-configures official plugins by default plugin: archive_path: "plugins-archive" path: "plugins" -# Provisioning (auto-load resources from folders) +# Provisioning — auto-load resources from folders for GitOps-style management provisioning: folders: - "/perses/provisioning" @@ -188,19 +168,21 @@ frontend: disable_custom: false ``` -**Environment variables** override config with `PERSES_` prefix: +**Environment variables** override config with `PERSES_` prefix (because env vars don't leak credentials in git): - `PERSES_DATABASE_FILE_FOLDER=/perses/data` - `PERSES_SECURITY_ENABLE_AUTH=true` -- `PERSES_SECURITY_ENCRYPTION_KEY=` +- `PERSES_SECURITY_ENCRYPTION_KEY=` (use this instead of embedding in config.yaml) **Gate**: Configuration applied. Proceed to Phase 4. ### Phase 4: VALIDATE -**Goal**: Verify deployment is healthy. +**Goal**: Verify deployment is healthy and all APIs are accessible. 
+ +Always validate connectivity by checking the `/api/v1/projects` endpoint responds because this confirms auth, routing, and database connectivity work correctly: ```bash -# Check API is responding +# Check API is responding — HARDCODED requirement curl -s http://localhost:8080/api/v1/projects | head # Install percli and login @@ -219,11 +201,13 @@ EOF percli get project ``` -**Optional: Set up MCP server** +**Optional: Set up MCP server for Claude Code integration** + +If you want Claude Code to interact with Perses dashboards and resources, install and configure the Perses MCP server: ```bash # Install perses-mcp-server from releases -# Create config +# Create config — NEVER expose credentials in plain text cat > perses-mcp-config.yaml < -o json` if server is unavailable | -| Panel layout lost in migration | Grafana grid coordinates don't map cleanly to Perses Grid layout — panels overlap or have wrong sizes | After migration, review the `spec.layouts` section and manually adjust Grid `x`, `y`, `w`, `h` values to match the original Grafana layout intent | -| Missing datasource references | Migrated dashboard references datasource names that don't exist in Perses | Create matching Perses datasources before deploying, or update the migrated JSON to reference existing Perses datasource names | - -## Anti-Patterns - -| Anti-Pattern | Why It's Wrong | Do Instead | -|--------------|----------------|------------| -| Deploying migrated dashboards without reviewing StaticListVariable placeholders | Users see broken panels with placeholder values, lose trust in the migration | Search all migrated files for `StaticListVariable` placeholders, document each, fix or remove before deploy | -| Running migration in offline mode when online mode is available | Offline mode uses bundled plugin migration logic which may be outdated — misses latest panel type mappings | Always prefer `--online` when a Perses server is reachable; offline is a fallback, not a default | -| Deleting 
original Grafana JSON files after migration | No rollback path if migration output is wrong, no way to re-run with updated `percli` version | Keep originals in a `grafana-originals/` directory alongside migrated output — storage is cheap, re-migration is not | -| Batch migrating everything at once without prioritization | Critical dashboards get the same attention as abandoned test dashboards, errors pile up | Prioritize by usage: migrate the top 5-10 most-viewed dashboards first, validate thoroughly, then batch the rest | -| Migrating dashboards without first checking Grafana version | Wasted effort — older Grafana JSON schemas produce broken or empty Perses output | Run `curl /api/health` or check `version` in the Grafana API response before starting any migration | - -## Anti-Rationalization - -| Rationalization | Reality | Required Action | -|-----------------|---------|-----------------| -| "The migration completed without errors so it's correct" | `percli migrate` succeeds even when panels are replaced with StaticListVariable placeholders — zero errors does not mean zero data loss | **Diff panel counts**: compare number of panels in Grafana source vs Perses output, search for all placeholder values | -| "Online mode isn't necessary, offline is fine" | Offline mode bundles a snapshot of plugin migration logic that may be weeks or months behind — new panel type mappings are added to the server continuously | **Use online mode** whenever a Perses server is available, verify server version is current | -| "We can fix the placeholders later after deployment" | Users will see broken dashboards immediately, file bugs, lose confidence in the migration — fixing in production is always harder than fixing before deploy | **Fix or document every placeholder** before deploying, even if it delays the migration timeline | -| "The layout looks close enough" | Grafana's 24-column grid and Perses's Grid layout have different coordinate systems — "close enough" means 
overlapping panels or wasted whitespace that makes dashboards unusable | **Visually verify** every migrated dashboard in the Perses UI before declaring migration complete | - -## FORBIDDEN Patterns - -These patterns MUST NOT appear in migration workflows: - -- **NEVER** pipe raw Grafana API response directly to `percli migrate` without extracting `.dashboard` — the envelope metadata will cause parse failures -- **NEVER** use `percli migrate` on Grafana JSON from versions below 9.0.0 — the output will be silently wrong or empty -- **NEVER** deploy migrated dashboards to production without running `percli lint` — structural errors will break the Perses UI -- **NEVER** delete Grafana source dashboards or disable them before confirming the Perses migration is complete and validated by dashboard owners -- **NEVER** assume all Grafana panel types have Perses equivalents — annotations, alerting rules, and custom Grafana-only plugins have no mapping - -## Blocker Criteria - -STOP and escalate to the user if any of these conditions are met: +This skill orchestrates a four-phase migration pipeline: EXPORT Grafana dashboards as JSON, CONVERT to Perses format, VALIDATE converted output and fix incompatibilities, then DEPLOY to a Perses instance. -- **Grafana version < 9.0.0**: Migration will produce broken output. User must upgrade Grafana or manually convert dashboard JSON. -- **More than 30% of panels are unsupported**: Migration value is too low — more manual work than automated. Recommend building Perses dashboards from scratch instead. -- **No Perses server available and online mode required**: If the user specifically needs online mode features (latest plugin mappings) but has no server, the migration cannot proceed at the expected quality level. -- **Grafana API authentication unavailable**: Cannot export dashboards without API access. User must provide a service account token or admin credentials. 
-- **Target Perses project does not exist and user lacks create permissions**: Cannot deploy. User must create the project or get permissions first. +**Key constraints embedded in workflow**: +- Always validate after conversion because `percli migrate` succeeds even when panels become `StaticListVariable` placeholders — zero errors does not mean zero data loss +- Preserve originals (never delete Grafana source files) because migration is one-way; originals are the rollback path +- Extract `.dashboard` key when exporting from Grafana API because the raw API response wraps the dashboard in metadata that `percli migrate` cannot parse +- Verify Grafana version is 9.0.0+ before migration because older versions use dashboard JSON schemas that `percli` does not support +- Use online mode when a Perses server is available because offline mode bundles outdated plugin migration logic; online uses the latest logic from the server --- @@ -125,21 +47,20 @@ STOP and escalate to the user if any of these conditions are met: **Goal**: Export Grafana dashboards as JSON files. If user has JSON files already, skip to Phase 2. -Verify Grafana version first: +Verify Grafana version first (required because older Grafana versions produce broken or empty output): ```bash curl -s https://grafana.example.com/api/health | jq '.version' -# Must be 9.0.0+ +# Must be 9.0.0+. If below 9.0.0, user must upgrade or manually update schemaVersion (risky). 
``` -Export a single dashboard: +Export a single dashboard (IMPORTANT: extract `.dashboard` key, not the full envelope, because `percli migrate` will fail on API metadata): ```bash -# Export from Grafana API — MUST extract .dashboard key curl -H "Authorization: Bearer " \ https://grafana.example.com/api/dashboards/uid/ \ | jq '.dashboard' > grafana-dashboard.json ``` -For bulk export, iterate over all dashboards: +For bulk export, iterate over all dashboards (prioritize by usage; migrate top 5-10 most-viewed first, then batch the rest): ```bash curl -H "Authorization: Bearer " \ https://grafana.example.com/api/search?type=dash-db \ @@ -154,54 +75,61 @@ done ### Phase 2: CONVERT -**Goal**: Convert Grafana JSON to Perses format. +**Goal**: Convert Grafana JSON to Perses format using `percli migrate`. +Prefer online mode (uses latest plugin migration logic from the server) over offline: ```bash -# Single dashboard (online mode - recommended) +# Single dashboard (online mode - RECOMMENDED) percli migrate -f grafana-dashboard.json --online -o json > perses-dashboard.json -# Bulk migration +# Bulk migration with online mode for f in grafana-*.json; do percli migrate -f "$f" --online -o json > "perses-${f#grafana-}" done -# K8s CR format +# K8s CR format (if needed for GitOps deployment) percli migrate -f grafana-dashboard.json --online --format cr -o json > perses-cr.json -# Offline fallback (when no Perses server available) +# Offline fallback (only if no Perses server is available) percli migrate -f grafana-dashboard.json -o json > perses-dashboard.json ``` -**Migration notes**: -- Requires Perses server connection for online mode (uses latest plugin migration logic) -- Compatible with Grafana 9.0.0+, latest version recommended -- Unsupported variables become `StaticListVariable` with values `["grafana", "migration", "not", "supported"]` -- Panel type mapping: Graph to TimeSeriesChart, Stat to StatChart, Table to Table +**Migration behavior** (awareness for downstream 
validation): +- Unsupported Grafana variables become `StaticListVariable` with marker values `["grafana", "migration", "not", "supported"]` +- Panel type mapping: Graph → TimeSeriesChart, Stat → StatChart, Table → Table - Panels with no Perses equivalent need manual replacement after migration +- Layout coordinates may not map perfectly (Grafana's 24-column grid vs Perses Grid have different coordinate systems) **Gate**: Conversion complete. All files produced without errors. Proceed to Phase 3. ### Phase 3: VALIDATE -**Goal**: Validate converted dashboards and report incompatibilities. +**Goal**: Validate converted dashboards and report incompatibilities before deploy. +Lint every migrated file (required to catch structural errors that will break the Perses UI): ```bash -# Lint every migrated file percli lint -f perses-dashboard.json +``` -# Search for unsupported plugin placeholders +Search for unsupported plugin placeholders (these represent broken functionality that will confuse end users): +```bash grep -r '"grafana","migration","not","supported"' perses-*.json +``` -# Count panels: compare source vs migrated +Count panels to detect data loss (compare source vs output; gaps indicate unsupported panels): +```bash +# Grafana panel count jq '.panels | length' grafana-dashboard.json +# Perses panel count jq '.spec.panels | length' perses-dashboard.json ``` -Check for: -- Panel types that weren't converted (search for StaticListVariable placeholders) -- Missing datasource references -- Variable references that didn't translate -- Layout issues (overlapping or mis-sized panels in Grid layout) +Check for additional incompatibilities: +- Variable references that didn't translate (search for references to non-existent variables) +- Missing datasource references (migrated dashboard references names that don't exist in Perses) +- Layout issues (overlapping panels or wrong sizes in Grid layout — visually verify in UI later) + +**Critical gate**: Before proceeding, find 
and document ALL `StaticListVariable` placeholders with a remediation plan (fix or remove). Never deploy migrated dashboards without first reviewing placeholders because users will see broken dashboards immediately and lose confidence in the migration. **Gate**: Validation passes. All StaticListVariable placeholders documented with remediation plan. Proceed to Phase 4. @@ -209,30 +137,45 @@ Check for: **Goal**: Deploy migrated dashboards to Perses. +Ensure the target project exists: ```bash -# Ensure project exists percli apply -f - < spec: {} EOF +``` -# Deploy dashboards +Deploy dashboards: +```bash percli apply -f perses-dashboard.json --project ``` -Verify migration: +Verify deployment: ```bash percli get dashboard --project ``` -Open Perses UI and visually confirm each migrated dashboard renders correctly. +Open Perses UI and visually verify each migrated dashboard renders correctly (because layout coordinates may have shifted, this is the final validation step). **Gate**: Dashboards deployed and accessible. Visual verification complete. Migration complete. 
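
The two Phase 3 checks (placeholder search and panel-count comparison) can also be sketched as a small JSON-aware script. This is a minimal illustration, not part of the skill: the function names `find_placeholders`, `panel_counts`, and `report` are hypothetical, and the layout it assumes (a top-level `panels` list in the Grafana export, a `spec.panels` map in the Perses output) is taken only from the `jq` commands shown in Phase 3. Because it compares parsed values rather than raw text, it should find the marker list however the JSON happens to be whitespace-formatted.

```python
from __future__ import annotations

import json
from typing import Any, Iterator

# Marker values the migration notes describe for unsupported variables.
MARKER = ["grafana", "migration", "not", "supported"]


def find_placeholders(node: Any, path: str = "$") -> Iterator[str]:
    """Yield a JSON path for every list that equals the marker values."""
    if isinstance(node, list):
        if node == MARKER:
            yield path
            return
        for index, item in enumerate(node):
            yield from find_placeholders(item, f"{path}[{index}]")
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from find_placeholders(value, f"{path}.{key}")


def panel_counts(grafana: dict, perses: dict) -> tuple[int, int]:
    """Return (source, migrated) panel counts, mirroring the two jq commands."""
    source = len(grafana.get("panels", []))
    migrated = len(perses.get("spec", {}).get("panels", {}))
    return source, migrated


def report(grafana_path: str, perses_path: str) -> dict:
    """Load both files and summarize what needs a remediation plan."""
    with open(grafana_path) as handle:
        grafana = json.load(handle)
    with open(perses_path) as handle:
        perses = json.load(handle)
    source, migrated = panel_counts(grafana, perses)
    return {
        "placeholders": list(find_placeholders(perses)),
        "source_panels": source,
        "migrated_panels": migrated,
    }
```

Calling `report("grafana-dashboard.json", "perses-dashboard.json")` returns the placeholder paths and both panel counts, which can be pasted into the remediation plan the gate asks for. The counts are a coarse signal only; a matching count does not prove every panel converted cleanly.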
--- +## Error Handling + +| Cause | Symptom | Solution | +|-------|---------|----------| +| Invalid Grafana JSON format | `percli migrate` fails with parse error or "unexpected token" | Verify JSON is valid with `jq .` — ensure you extracted the `.dashboard` key from Grafana API response, not the full envelope | +| Grafana version < 9.0.0 | `percli migrate` fails with schema errors or produces empty output | Upgrade Grafana to 9.0.0+ before export, or manually update the dashboard JSON `schemaVersion` field (risky — structural differences may remain) | +| Unsupported plugin warning | Migration succeeds but panels contain `StaticListVariable` with values `["grafana","migration","not","supported"]` | Document each unsupported panel, then manually replace with the closest Perses equivalent (TimeSeriesChart, StatChart, Table, or Markdown panel) | +| Online mode connection failure | `percli migrate --online` fails with "connection refused" or timeout | Verify Perses server URL and port, check authentication (run `percli login` first), fall back to offline mode with `percli migrate -f -o json` if server is unavailable | +| Panel layout lost in migration | Grafana grid coordinates don't map cleanly to Perses Grid layout — panels overlap or have wrong sizes | After migration, review the `spec.layouts` section and manually adjust Grid `x`, `y`, `w`, `h` values to match the original Grafana layout intent | +| Missing datasource references | Migrated dashboard references datasource names that don't exist in Perses | Create matching Perses datasources before deploying, or update the migrated JSON to reference existing Perses datasource names | + +--- + ## References | Resource | URL | diff --git a/skills/perses-lint/SKILL.md b/skills/perses-lint/SKILL.md index c2d463aa..b5f140d7 100644 --- a/skills/perses-lint/SKILL.md +++ b/skills/perses-lint/SKILL.md @@ -26,108 +26,6 @@ routing: Validate Perses resource definitions using `percli lint`. 
Supports local structural validation and online server-side validation that checks plugin schemas, datasource existence, and variable resolution. -## Operator Context - -This skill operates as a validation gate for Perses resources. It runs `percli lint` against JSON/YAML resource files, interprets errors, applies fixes, and re-validates until all issues are resolved. - -### Hardcoded Behaviors (Always Apply) -- **Show full output**: Always display complete lint output, never summarize or truncate -- **Online when possible**: Prefer `--online` mode when a Perses server is reachable -- **Fix-then-revalidate loop**: Never claim fixes are correct without re-running lint -- **Report every error**: Surface all lint errors, not just the first one encountered -- **Preserve user intent**: When fixing invalid plugin kinds, ask user which plugin they intended rather than guessing - -### Default Behaviors (ON unless disabled) -- **Batch mode for directories**: When pointed at a directory, validate all `.json` and `.yaml` files -- **Error grouping**: Group lint errors by category (plugin, datasource, variable, layout) in output -- **Suggest online mode**: If local lint passes but user has a server configured, suggest online re-check - -### Optional Behaviors (OFF unless enabled) -- **Auto-fix mode**: Automatically apply fixes for unambiguous errors (e.g., typo in plugin kind) -- **CI integration**: Output lint results in machine-parseable format for CI pipelines -- **Strict mode**: Treat warnings as errors (fail on any non-clean lint output) - -## What This Skill CAN Do -- Run local structural validation with `percli lint -f ` -- Run server-side validation with `percli lint -f --online` (checks plugin schemas, datasource existence) -- Validate individual files or batch-validate entire directories -- Diagnose and fix invalid panel plugin kinds, missing datasource references, broken variable references, and layout mismatches -- Re-validate after fixes to confirm resolution - -## 
What This Skill CANNOT Do -- Create dashboards from scratch (use perses-dashboard-create) -- Deploy or configure Perses servers (use perses-deploy) -- Develop or test custom plugins (use perses-plugin-create or perses-plugin-test) -- Fix PromQL/TraceQL query logic errors (lint validates structure, not query semantics) - ---- - -## Error Handling - -| Cause | Symptom | Solution | -|-------|---------|----------| -| **Invalid panel plugin kind** | `unknown kind "TimeseriesChart"` — plugin name not in the 27 official plugins | Check against official list below. Common typos: `TimeseriesChart` -> `TimeSeriesChart`, `Stat` -> `StatChart`, `Gauge` -> `GaugeChart`. Fix the `kind` field in the panel spec. | -| **Missing datasource reference** | `datasource "myPrometheus" not found` — panel references a datasource not defined in the dashboard | Add the datasource to the dashboard's `spec.datasources` map, or correct the datasource name to match an existing one. Online mode catches this more reliably. | -| **Invalid variable reference** | `variable "cluter" not found` — `$ref` points to a variable name that does not exist in `spec.variables` | Check all `$ref` values against the keys in `spec.variables`. Fix the typo or add the missing variable definition. | -| **Layout $ref mismatch** | `panel "panel-3" referenced in layout but not found in panels` — a panel ID in `spec.layouts[].spec.display.panels` does not match any key in `spec.panels` | Ensure every panel ID referenced in layout sections exists as a key in `spec.panels`. Remove stale layout references or add the missing panel. | -| **Connection refused (online mode)** | `connection refused` or `dial tcp: connect: connection refused` when using `--online` | Perses server is not running or URL is wrong. Verify server is up with `curl /api/v1/health`. Fall back to local lint with `percli lint -f ` (no `--online` flag). 
| -| **Authentication failure (online mode)** | `401 Unauthorized` or `403 Forbidden` when using `--online` | Login first with `percli login --username --password `. Check that the token has not expired. | - -### Official Plugin Kinds (27 total) - -**Chart plugins**: TimeSeriesChart, BarChart, GaugeChart, HeatmapChart, HistogramChart, PieChart, ScatterChart, StatChart, StatusHistoryChart, FlameChart - -**Table plugins**: Table, TimeSeriesTable, LogsTable, TraceTable - -**Display plugins**: Markdown, TracingGanttChart - -**Variable plugins**: DatasourceVariable, StaticListVariable - -**Datasource plugins**: PrometheusDatasource, TempoDatasource, and additional community datasource types - ---- - -## Anti-Patterns - -| Anti-Pattern | Why It Fails | Correct Approach | -|--------------|-------------|------------------| -| **Running `percli apply` without `percli lint` first** | Applies a broken resource to the server, then you discover errors at runtime or in the UI | Always run `percli lint -f ` before `percli apply -f `. Lint is the gate before deploy. | -| **Only validating locally when a server is available** | Local lint checks structure only. Online mode validates against actual plugin schemas, existing datasources, and variable resolution. Many errors only surface with `--online`. | Use `percli lint -f --online` whenever a Perses server is reachable. | -| **Ignoring lint warnings (only fixing errors)** | Warnings often indicate deprecated fields, unused variables, or schema drift that will become errors in future Perses versions | Fix all warnings. Use strict mode (`--strict` if available) in CI to enforce zero-warning policy. | -| **Fixing one error at a time and re-running** | Wastes cycles. Multiple errors are often related (e.g., renamed datasource breaks 5 panels) | Read ALL lint errors first, identify root causes, batch-fix related errors, then re-validate once. 
| - ---- - -## Anti-Rationalization - -| Rationalization | Reality | Required Action | -|-----------------|---------|-----------------| -| "Lint passed locally, so the dashboard is correct" | Local lint only checks structure. Plugin schemas, datasource existence, and variable resolution require online mode. | Run `--online` against a server before declaring the resource valid. | -| "I fixed the error, no need to re-run lint" | Fixes can introduce new errors (e.g., fixing a panel kind may reveal a previously-masked datasource error) | Always re-run `percli lint` after every fix. The loop is: lint -> fix -> lint -> confirm clean. | -| "That warning is not important" | Warnings in Perses often indicate fields that will be removed or required in the next version. Ignoring them creates upgrade debt. | Fix warnings now. They cost minutes today and hours during upgrades. | - ---- - -## FORBIDDEN Patterns - -- **NEVER** mark lint as passing when there are unresolved errors or warnings -- **NEVER** modify `percli` output to hide errors from the user -- **NEVER** skip re-validation after applying fixes -- **NEVER** guess a plugin kind without checking the official 27-plugin list -- **NEVER** run `percli apply` as a substitute for `percli lint` ("it will tell us if it fails") - ---- - -## Blocker Criteria - -Do NOT proceed past validation if any of these are true: -- `percli lint` reports any errors (warnings in strict mode) -- Online validation was requested but the server is unreachable and no fallback was acknowledged -- A plugin kind is used that is not in the official 27-plugin list and the user has not confirmed it is a custom plugin -- Any `$ref` in the layout or variables section points to a non-existent target - ---- - ## Instructions ### Phase 1: VALIDATE @@ -148,7 +46,9 @@ for f in *.json; do percli lint -f "$f"; done for f in *.yaml; do percli lint -f "$f"; done ``` -If online mode fails with connection errors, fall back to local mode and note the limitation. 
+Always display the complete lint output without summarizing or truncating (provides full context for diagnosis). If online mode fails with connection errors, fall back to local mode and note the limitation. + +When pointed at a directory, validate all `.json` and `.yaml` files. After validation completes, group lint errors by category (plugin, datasource, variable, layout) in the report for clarity. If local lint passes but a Perses server is configured and reachable, also suggest running online validation because local mode only checks structure—plugin schemas, datasource existence, and variable resolution require server-side checking. **Gate**: All lint output captured. Proceed to Phase 2. @@ -156,15 +56,21 @@ If online mode fails with connection errors, fall back to local mode and note th **Goal**: Resolve every reported error. -For each error, identify the root cause and apply the fix: +Read all lint errors first, identify common root causes, then batch-fix related errors together (because multiple errors are often related, e.g., a renamed datasource breaks 5 panels at once). This is more efficient than fixing one error and re-running. + +For each error category, apply the fix: + +1. **Invalid panel plugin kind** — Check the `kind` field against the 27 official plugins listed in Error Handling below. Correct typos or capitalization (common typos: `TimeseriesChart` → `TimeSeriesChart`, `Stat` → `StatChart`, `Gauge` → `GaugeChart`). When fixing invalid plugin kinds, ask the user which plugin they intended rather than guessing (to preserve their intent). + +2. **Missing datasource reference** — Add the missing datasource to `spec.datasources` or fix the name to match an existing datasource. Online validation catches this more reliably because local mode only checks structure. -1. **Invalid panel plugin kind** -- Check the `kind` field against the 27 official plugins listed above. Correct typos or capitalization. -2. 
**Missing datasource reference** -- Add the missing datasource to `spec.datasources` or fix the name to match an existing datasource. -3. **Invalid variable reference** -- Verify `$ref` values match keys in `spec.variables`. Fix typos or add missing variable definitions. -4. **Layout $ref mismatch** -- Ensure every panel ID in `spec.layouts[].spec.display.panels` has a corresponding entry in `spec.panels`. -5. **Unknown field errors** -- Check Perses API version. Field may have been renamed or removed in a newer version. +3. **Invalid variable reference** — Verify all `$ref` values match keys in `spec.variables`. Fix typos or add missing variable definitions. -When multiple errors share a root cause (e.g., a renamed datasource), fix the root cause once rather than patching each symptom individually. +4. **Layout $ref mismatch** — Ensure every panel ID in `spec.layouts[].spec.display.panels` has a corresponding entry in `spec.panels`. Remove stale layout references or add missing panels. + +5. **Unknown field errors** — Check the Perses API version, since fields may have been renamed or removed in newer versions. Fix all warnings, not just errors (because warnings often indicate deprecated fields or schema drift that will become errors in future Perses versions, creating upgrade debt). + +When multiple errors share a root cause, fix the root cause once rather than patching each symptom individually. **Gate**: All identified errors addressed. Proceed to Phase 3. @@ -179,13 +85,39 @@ percli lint -f percli lint -f --online ``` -- If new errors appear, return to Phase 2. -- Maximum 3 fix-revalidate cycles. If errors persist after 3 cycles, report remaining errors to the user with full context. +Always re-run `percli lint` after every fix (because fixes can introduce new errors—e.g., fixing a panel kind may reveal a previously-masked datasource error). The loop is: lint → fix → lint → confirm clean. Never skip re-validation after applying fixes. 
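
The lint, fix, lint loop with its three-cycle cap can be sketched as follows. This is an illustrative outline under stated assumptions: `run_lint` and `apply_fixes` are hypothetical callables supplied by the caller (for example, a wrapper that runs `percli lint` and returns the error lines), not functions provided by Perses.

```python
from typing import Callable, List

MAX_CYCLES = 3  # cap on fix-revalidate cycles described in this phase


def fix_and_revalidate(
    run_lint: Callable[[], List[str]],
    apply_fixes: Callable[[List[str]], None],
    max_cycles: int = MAX_CYCLES,
) -> List[str]:
    """Run lint, fix, and lint again until clean or the cycle cap is hit.

    Returns the errors that remain; an empty list means lint is clean.
    """
    errors = run_lint()
    cycles = 0
    while errors and cycles < max_cycles:
        apply_fixes(errors)   # batch-fix everything reported so far
        errors = run_lint()   # never trust a fix without re-running lint
        cycles += 1
    return errors
```

A non-empty return value is the set of errors to report to the user with full context once the cap is reached.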
+ +If new errors appear, return to Phase 2 for another fix cycle. Maximum 3 fix-revalidate cycles: if errors persist after 3 cycles, report remaining errors to the user with full context. Do not claim fixes are correct without re-running lint. **Gate**: Lint returns zero errors. Validation complete. --- +## Error Handling + +| Cause | Symptom | Solution | +|-------|---------|----------| +| **Invalid panel plugin kind** | `unknown kind "TimeseriesChart"` — plugin name not in the 27 official plugins | Check against official list below. Common typos: `TimeseriesChart` -> `TimeSeriesChart`, `Stat` -> `StatChart`, `Gauge` -> `GaugeChart`. Fix the `kind` field in the panel spec. | +| **Missing datasource reference** | `datasource "myPrometheus" not found` — panel references a datasource not defined in the dashboard | Add the datasource to the dashboard's `spec.datasources` map, or correct the datasource name to match an existing one. Online mode catches this more reliably. | +| **Invalid variable reference** | `variable "cluter" not found` — `$ref` points to a variable name that does not exist in `spec.variables` | Check all `$ref` values against the keys in `spec.variables`. Fix the typo or add the missing variable definition. | +| **Layout $ref mismatch** | `panel "panel-3" referenced in layout but not found in panels` — a panel ID in `spec.layouts[].spec.display.panels` does not match any key in `spec.panels` | Ensure every panel ID referenced in layout sections exists as a key in `spec.panels`. Remove stale layout references or add the missing panel. | +| **Connection refused (online mode)** | `connection refused` or `dial tcp: connect: connection refused` when using `--online` | Perses server is not running or URL is wrong. Verify server is up with `curl /api/v1/health`. Fall back to local lint with `percli lint -f ` (no `--online` flag). 
| +| **Authentication failure (online mode)** | `401 Unauthorized` or `403 Forbidden` when using `--online` | Login first with `percli login --username --password `. Check that the token has not expired. | + +### Official Plugin Kinds (27 total) + +**Chart plugins**: TimeSeriesChart, BarChart, GaugeChart, HeatmapChart, HistogramChart, PieChart, ScatterChart, StatChart, StatusHistoryChart, FlameChart + +**Table plugins**: Table, TimeSeriesTable, LogsTable, TraceTable + +**Display plugins**: Markdown, TracingGanttChart + +**Variable plugins**: DatasourceVariable, StaticListVariable + +**Datasource plugins**: PrometheusDatasource, TempoDatasource, and additional community datasource types + +--- + ## References - Perses documentation: https://perses.dev/docs/ diff --git a/skills/perses-onboard/SKILL.md b/skills/perses-onboard/SKILL.md index e7f2ec16..9cb2c45e 100644 --- a/skills/perses-onboard/SKILL.md +++ b/skills/perses-onboard/SKILL.md @@ -32,36 +32,9 @@ routing: First-time Perses setup and Claude Code integration pipeline. -## Operator Context +## Overview -This skill operates as an onboarding pipeline for new Perses users, guiding them from zero to a working Perses setup with Claude Code MCP integration. 
- -### Hardcoded Behaviors (Always Apply) -- **Check before deploy**: Always check if Perses is already running before offering to deploy -- **MCP setup**: Always offer to configure the Perses MCP server for Claude Code integration -- **Verify each phase**: Don't proceed to next phase until current phase passes validation - -### Default Behaviors (ON unless disabled) -- **Interactive**: Ask for confirmation at each phase gate -- **Local-first**: Default to Docker deployment if no server URL provided -- **Demo datasource**: Offer to connect to demo.perses.dev for exploration - -### Optional Behaviors (OFF unless enabled) -- **Production mode**: Configure auth, TLS, SQL database -- **Team onboarding**: Create multiple projects and RBAC roles - -## What This Skill CAN Do -- Discover existing Perses instances or deploy new ones -- Configure Claude Code MCP integration -- Create initial projects and datasources -- Verify end-to-end connectivity - -## What This Skill CANNOT Do -- Create dashboards (use perses-dashboard-create after onboarding) -- Configure complex auth (use perses-deploy for production setups) -- Develop plugins (use perses-plugin-create) - ---- +This is a 4-phase onboarding pipeline for new Perses users, guiding them from zero to a working Perses setup with Claude Code MCP integration. Scope: server discovery/deployment, MCP connection, initial project setup, and end-to-end validation. Out of scope: dashboard creation (use perses-dashboard-create), complex auth (use perses-deploy), and plugin development (use perses-plugin-create). ## Instructions @@ -69,7 +42,7 @@ This skill operates as an onboarding pipeline for new Perses users, guiding them **Goal**: Find or deploy a Perses server. 
-**Step 1: Check for existing Perses** +**Step 1: Check for existing Perses** (because we should reuse running instances before deploying) ```bash # Check if percli is installed @@ -82,7 +55,7 @@ percli whoami 2>/dev/null curl -s http://localhost:8080/api/v1/health 2>/dev/null ``` -**Step 2: Determine path** +**Step 2: Determine path** (because different scenarios require different actions) | Scenario | Action | |----------|--------| @@ -91,7 +64,7 @@ curl -s http://localhost:8080/api/v1/health 2>/dev/null | No Perses found, user has URL | Login to provided URL | | No Perses found, no URL | Offer: (a) deploy locally with Docker, (b) use demo.perses.dev | -**Step 3: Deploy if needed** (route to perses-deploy skill) +**Step 3: Deploy if needed** (route to perses-deploy skill for complex deployments; use simple Docker for quick local setup only) For quick local setup: ```bash @@ -104,11 +77,11 @@ percli login http://localhost:8080 # For demo: percli login https://demo.perses.dev ``` -**Gate**: Perses server accessible, percli authenticated. Proceed to Phase 2. +**Gate**: Perses server accessible, percli authenticated. Proceed to Phase 2 (because we cannot configure MCP until authentication is established). ### Phase 2: CONNECT -**Goal**: Set up Claude Code MCP integration. +**Goal**: Set up Claude Code MCP integration (so Claude Code can manipulate Perses resources directly). 
**Step 1: Check for Perses MCP server** @@ -116,11 +89,11 @@ percli login http://localhost:8080 which perses-mcp-server 2>/dev/null ``` -**Step 2: Install if needed** +**Step 2: Install if needed** (because MCP server is a separate binary from Perses itself) Guide user to install from https://github.com/perses/mcp-server/releases -**Step 3: Configure MCP server** +**Step 3: Configure MCP server** (because MCP needs explicit credentials and resource scoping) Create `perses-mcp-config.yaml`: ```yaml @@ -134,7 +107,7 @@ perses_server: credentials: "" ``` -**Step 4: Register in Claude Code settings** +**Step 4: Register in Claude Code settings** (because Claude Code MCP discovery reads from settings.json) Add to `~/.claude/settings.json` under `mcpServers`: ```json @@ -146,17 +119,17 @@ Add to `~/.claude/settings.json` under `mcpServers`: } ``` -**Step 5: Verify MCP connection** +**Step 5: Verify MCP connection** (because we must confirm the binary and socket are working before proceeding) Use ToolSearch("perses") to check if MCP tools are discoverable. If found, test with `perses_list_projects`. -**Gate**: MCP server configured and responsive. Proceed to Phase 3. +**Gate**: MCP server configured and responsive. Proceed to Phase 3 (because project creation is easier via MCP than CLI). ### Phase 3: CONFIGURE -**Goal**: Create initial project and datasources. +**Goal**: Create initial project and datasources (to establish a working workspace and data connectivity). -**Step 1: Create a project** +**Step 1: Create a project** (because projects provide resource isolation and role-based access control) ```bash percli apply -f - < + CLI: percli authenticated as [authenticated-user] Next steps: - Create a dashboard: /do create perses dashboard @@ -222,4 +195,28 @@ Next steps: - Set up Dashboard-as-Code: /do perses dac ``` -**Gate**: All checks pass. Onboarding complete. +**Gate**: All checks pass. 
Onboarding complete (because all prerequisites for dashboard creation and day-2 operations are now in place). + +## Error Handling + +**percli not installed**: Route user to [Perses installation docs](https://doc.perses.dev/latest/installation/installation/). Provide direct percli binary link for their OS. + +**Perses server not responding**: Verify the configured URL is correct and the server is running. If using Docker, check `docker ps`. If using deployed server, verify network connectivity and firewall rules. + +**MCP server binary not found**: Guide user to [MCP releases](https://github.com/perses/mcp-server/releases). Ensure the binary is in PATH or provide full path in settings.json. + +**MCP connection fails after registration**: Verify perses-mcp-config.yaml syntax (YAML indentation matters). Check that the bearer token from `percli whoami --show-token` is still valid (tokens may expire). Restart Claude Code harness to reload settings.json. + +**MCP tools not discoverable in Claude Code**: This indicates settings.json registration failed. Verify the file is syntactically correct JSON and the perses-mcp-server command exists at the specified path. Check Claude Code logs for socket errors. + +**Project creation fails with permissions error**: Verify user has admin role on the Perses server. Check `percli whoami` output for role information. + +**Datasource URL unreachable from Perses**: This is expected if the datasource (Prometheus/Loki/Tempo) is on a different network. Perses server needs network access to the datasource URL for proxying to work. For local dev, ensure services are on the same Docker network. 
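
The "MCP tools not discoverable" checks (valid JSON, command present at the given path) could be automated along these lines. This is a sketch under assumptions: the settings snippet is not shown in full here, so the entry name passed as `server_name` and the `command` field are assumed to follow the usual `mcpServers` layout, and `check_mcp_registration` is a hypothetical helper name.

```python
from __future__ import annotations

import json
import shutil
from typing import Callable, Optional


def check_mcp_registration(
    settings_text: str,
    server_name: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return a list of problems found; an empty list means the entry looks usable."""
    try:
        settings = json.loads(settings_text)
    except json.JSONDecodeError as exc:
        return [f"settings.json is not valid JSON: {exc}"]
    if not isinstance(settings, dict):
        return ["settings.json must contain a JSON object at the top level"]

    servers = settings.get("mcpServers")
    if not isinstance(servers, dict) or server_name not in servers:
        return [f"no '{server_name}' entry under mcpServers"]

    entry = servers[server_name]
    # Assumption: each entry names its executable in a "command" field.
    command = entry.get("command") if isinstance(entry, dict) else None
    if not command:
        return [f"'{server_name}' entry has no command"]

    # shutil.which resolves bare names via PATH and checks full paths directly.
    if which(command) is None:
        return [f"command not found on PATH or at given path: {command}"]
    return []
```

Passing the contents of `~/.claude/settings.json` as `settings_text` gives a quick first check before restarting Claude Code or digging through its logs.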
+ +## References + +- [Perses Documentation](https://doc.perses.dev/) +- [percli Installation and Usage](https://doc.perses.dev/latest/cli/installation/) +- [Perses MCP Server Repository](https://github.com/perses/mcp-server) +- [Claude Code MCP Configuration](https://claude.ai/docs/mcp) +- Companion skills: perses-deploy (production setup), perses-dashboard-create (dashboard building), perses-grafana-migrate (migration) diff --git a/skills/perses-plugin-create/SKILL.md b/skills/perses-plugin-create/SKILL.md index beb5366e..cb14f0b5 100644 --- a/skills/perses-plugin-create/SKILL.md +++ b/skills/perses-plugin-create/SKILL.md @@ -27,118 +27,19 @@ routing: Scaffold and implement Perses plugins with CUE schemas and React components. -## Operator Context +## Overview This skill guides the full lifecycle of creating a Perses plugin: scaffolding the directory structure with `percli`, defining the CUE schema and JSON example, implementing the React component, validating schemas, and building the distributable archive. 
-### Hardcoded Behaviors (Always Apply) -- **Schema + example together**: Always create both the CUE schema and a matching JSON example file -- never one without the other -- **Test before build**: Always run `percli plugin test-schemas` before `percli plugin build` -- a build without passing schema tests is forbidden -- **Model package**: CUE schemas must declare `package model` as the first line after imports -- **Close spec definitions**: Always wrap spec fields in `close({...})` to reject unknown fields during validation -- **Validate JSON against schema**: After creating the JSON example, run `percli plugin test-schemas` to confirm the example passes CUE validation before moving to React implementation - -### Default Behaviors (ON unless disabled) -- **Panel type**: Default to Panel plugin type if the user does not specify a type -- **Include migration**: Generate Grafana migration scaffold (`schemas///migrate/migrate.cue`) if a Grafana equivalent panel exists -- **Reference official plugins**: Check the 27 official plugins across 6 categories for similar implementations before creating a new plugin from scratch - -### Optional Behaviors (OFF unless enabled) -- **Multi-plugin module**: Create multiple related plugins within a single module (one module can contain multiple plugins) -- **Hot-reload dev server**: Start `percli plugin start` against a running Perses server for live development -- **Custom shared types**: Import shared CUE types from `github.com/perses/shared/cue/common` for format, thresholds, sorting, etc. 
- -## What This Skill CAN Do -- Scaffold a new plugin directory with `percli plugin generate` -- Create CUE schema definitions for any plugin type: Panel, Datasource, TimeSeriesQuery, TraceQuery, ProfileQuery, LogQuery, Variable, Explore -- Create matching JSON example files for schema validation -- Generate Grafana migration schemas for plugins with Grafana equivalents -- Implement React components with the rsbuild-based build system -- Build distributable archives (.zip/.tar/.tar.gz) containing package.json, mf-manifest.json, schemas/, and __mf/ - -## What This Skill CANNOT Do -- Create or manage dashboards (use perses-dashboard-create) -- Deploy Perses server instances (use perses-deploy) -- Review existing plugin code for quality (use perses-code-review) -- Run the plugin test suite beyond schema validation (use perses-plugin-test) -- Manage datasource connections or variables at the project level (use perses-datasource-manage or perses-variable-manage) - ---- - -## Error Handling - -| Cause | Symptom | Solution | -|-------|---------|---------| -| `percli plugin generate` fails: directory exists | "directory already exists" error | Remove or rename the existing directory, or generate into a different path | -| `percli plugin generate` fails: invalid type | Unrecognized plugin type error | Use one of: Panel, Datasource, TimeSeriesQuery, TraceQuery, ProfileQuery, LogQuery, Variable, Explore | -| `percli plugin generate` fails: missing flags | Missing required flag error | All four flags are required: `--module.org`, `--module.name`, `--plugin.type`, `--plugin.name` | -| CUE schema compilation error: missing package | "cannot determine package name" | Add `package model` as the first line of the .cue file (after any imports) | -| CUE schema compilation error: unclosed close() | Syntax error in CUE | Ensure `close({...})` has matching braces -- every `{` needs a `}` before the closing `)` | -| CUE schema compilation error: bad import path | Import not found for 
shared types | Use exact path `"github.com/perses/shared/cue/common"` -- not shorthand or relative imports | -| JSON example does not match schema: extra fields | `close()` rejects unknown fields | Remove fields from JSON that are not defined in the CUE schema, or add them to the schema | -| JSON example does not match schema: wrong types | Type mismatch error | Ensure JSON values match CUE type declarations (string vs int vs bool) | -| JSON example does not match schema: missing required | Required field not present | Add all non-optional fields (those without `?` suffix in CUE) to the JSON example | -| `percli plugin build` produces empty archive | Archive missing mf-manifest.json | Run `rsbuild build` (or equivalent npm build) first -- the React build must succeed before archive creation | -| `percli plugin build` fails: wrong directory | Build cannot find plugin config | Run `percli plugin build` from the module root directory (where package.json lives) | - ---- - -## Anti-Patterns - -| Anti-Pattern | Why It Fails | Do Instead | -|-------------|-------------|------------| -| Creating CUE schema without JSON example | `percli plugin test-schemas` has nothing to validate against -- schema errors are invisible until runtime | Always create both files together and run `percli plugin test-schemas` immediately | -| Not using `close({...})` for spec | Without `close()`, the schema accepts any unknown fields -- validation becomes meaningless | Always wrap spec in `close({...})` to enforce strict field validation | -| Skipping Grafana migration schema when equivalent exists | Users migrating from Grafana hit a dead end when no migration path exists for their panels | Check if a Grafana equivalent exists and create `migrate/migrate.cue` if so | -| Hard-coding default values in the React component | Users cannot configure the plugin's behavior through the dashboard JSON -- defaults are invisible | Define defaults as configurable fields in the CUE schema with sensible 
default values | -| Copying schema from another plugin without adjusting `kind` | The `kind` field identifies the plugin -- duplicates cause plugin resolution conflicts at runtime | Always set `kind` to match the unique plugin name | -| Building archive before running test-schemas | Schema errors ship in the archive and cause runtime failures on the Perses server | Run `percli plugin test-schemas` and fix all errors before `percli plugin build` | - ---- - -## Anti-Rationalization - -| Rationalization | Reality | Required Action | -|-----------------|---------|-----------------| -| "The schema is simple, it doesn't need a JSON example" | Even simple schemas need validation -- typos in field names or types are invisible without a test | **Create the JSON example and run test-schemas** | -| "close() is too restrictive, I'll add it later" | Without close(), the schema validates nothing -- any garbage JSON passes | **Add close() now, relax specific fields with `...` only if needed** | -| "The React component compiles so the plugin works" | React compilation says nothing about schema correctness or plugin registration | **Run percli plugin test-schemas AND verify the archive structure** | -| "I'll skip migration since most users start fresh" | Organizations migrating from Grafana are the primary adoption path -- migration is not optional when equivalents exist | **Check for Grafana equivalent and create migration schema** | -| "One test run passed, the schema is correct" | A single JSON example may not cover optional fields, edge cases, or constraint boundaries | **Verify the example includes representative values for all fields** | - ---- - -## FORBIDDEN Patterns - -These are hard stops. Do NOT proceed past these. 
- -- **NEVER** run `percli plugin build` without `percli plugin test-schemas` passing first -- **NEVER** create a CUE schema without a corresponding JSON example file -- **NEVER** omit `package model` from a CUE schema file -- **NEVER** omit `close({...})` around the spec definition -- **NEVER** use a `kind` value that duplicates an existing plugin in the same module - ---- - -## Blocker Criteria - -STOP and ask the user when: - -- **Plugin type is ambiguous**: The user's description could map to multiple plugin types (e.g., "query plugin" could be TimeSeriesQuery, TraceQuery, ProfileQuery, or LogQuery) -- **Schema fields are unclear**: The user describes desired behavior but not the data model -- ask what fields the plugin should accept -- **Module context is missing**: You cannot determine if this is a new module or an addition to an existing module -- **Grafana equivalent is uncertain**: You are unsure whether a Grafana panel equivalent exists for migration purposes -- **Shared type usage**: The user references "thresholds", "format", or "sort" but you are unsure which shared CUE types from `github.com/perses/shared/cue/common` to import - --- ## Instructions -### Phase 1: SCAFFOLD +### Phase 1: SCAFFOLD (Default to Panel type if unspecified) **Goal**: Generate the plugin directory structure with percli. -1. Determine plugin type from user request (default: Panel) +1. Determine plugin type from user request (default: Panel because most plugin requests are for panels) 2. Choose module organization: new module or add to existing 3. Run scaffolding: @@ -153,11 +54,11 @@ percli plugin generate \ **Gate**: Directory structure generated. Proceed to Phase 2. -### Phase 2: SCHEMA +### Phase 2: SCHEMA (Always pair schema with JSON example; use `close({...})` around spec) -**Goal**: Define the CUE schema and JSON example. +**Goal**: Define the CUE schema and JSON example. 
These must be created together because schema validation requires examples to catch typos and type errors that are invisible at compile time. -1. Edit CUE schema at `schemas///.cue`: +1. Edit CUE schema at `schemas///.cue`. **Must declare `package model` at the top of the file** (before any imports) so the CUE compiler can determine the package namespace. ```cue package model @@ -175,7 +76,9 @@ spec: close({ }) ``` -2. Create JSON example at `schemas///.json`: +**Always wrap spec in `close({...})`** to reject unknown fields during validation—without it, garbage JSON passes and validation becomes meaningless. + +2. Create JSON example at `schemas///.json`. Examples surface schema errors invisible at compile time (typos, wrong types, missing required fields). ```json { @@ -186,9 +89,9 @@ spec: close({ } ``` -3. If Grafana equivalent exists, create migration schema at `schemas///migrate/migrate.cue` +3. If a Grafana equivalent exists, create migration schema at `schemas///migrate/migrate.cue` (organizations migrating from Grafana are the primary adoption path—migration is not optional when equivalents exist). -4. Validate immediately: +4. Validate immediately (before moving to React): ```bash percli plugin test-schemas @@ -196,13 +99,13 @@ percli plugin test-schemas **Gate**: `percli plugin test-schemas` passes. Proceed to Phase 3. -### Phase 3: IMPLEMENT +### Phase 3: IMPLEMENT (Reference the 27 official plugins before building from scratch) **Goal**: Build the React component. 1. Implement React component at `src///` 2. Follow the rsbuild-based build system conventions from the scaffolded template -3. Reference official plugins (27 across 6 categories) for similar implementations +3. Check the 27 official plugins across 6 categories for similar implementations before creating from scratch (saves time and ensures consistency) **Gate**: React component builds without errors. Proceed to Phase 4. @@ -220,9 +123,9 @@ percli plugin start **Gate**: All schema tests pass.
Proceed to Phase 5. -### Phase 5: BUILD +### Phase 5: BUILD (Never build without passing schema tests; always validate archive structure) -**Goal**: Create the distributable archive. +**Goal**: Create the distributable archive. Never run `percli plugin build` without `percli plugin test-schemas` passing first—schema errors ship in the archive and cause runtime failures on the Perses server. ```bash percli plugin build @@ -242,6 +145,24 @@ Install archive in Perses server's `plugins-archive/` directory, or embed via np --- +## Error Handling + +| Cause | Symptom | Solution | +|-------|---------|---------| +| `percli plugin generate` fails: directory exists | "directory already exists" error | Remove or rename the existing directory, or generate into a different path | +| `percli plugin generate` fails: invalid type | Unrecognized plugin type error | Use one of: Panel, Datasource, TimeSeriesQuery, TraceQuery, ProfileQuery, LogQuery, Variable, Explore | +| `percli plugin generate` fails: missing flags | Missing required flag error | All four flags are required: `--module.org`, `--module.name`, `--plugin.type`, `--plugin.name` | +| CUE schema compilation error: missing package | "cannot determine package name" | Add `package model` as the first line of the .cue file (before any imports) | +| CUE schema compilation error: unclosed close() | Syntax error in CUE | Ensure `close({...})` has matching braces—every `{` needs a `}` before the closing `)` | +| CUE schema compilation error: bad import path | Import not found for shared types | Use exact path `"github.com/perses/shared/cue/common"`—not shorthand or relative imports | +| JSON example does not match schema: extra fields | `close()` rejects unknown fields | Remove fields from JSON that are not defined in the CUE schema, or add them to the schema | +| JSON example does not match schema: wrong types | Type mismatch error | Ensure JSON values match CUE type declarations (string vs int vs bool) | +| JSON example does not
match schema: missing required | Required field not present | Add all non-optional fields (those without `?` suffix in CUE) to the JSON example | +| `percli plugin build` produces empty archive | Archive missing mf-manifest.json | Run `rsbuild build` (or equivalent npm build) first—the React build must succeed before archive creation | +| `percli plugin build` fails: wrong directory | Build cannot find plugin config | Run `percli plugin build` from the module root directory (where package.json lives) | + +--- + ## References - **Plugin types**: Panel, Datasource, TimeSeriesQuery, TraceQuery, ProfileQuery, LogQuery, Variable, Explore diff --git a/skills/perses-plugin-test/SKILL.md b/skills/perses-plugin-test/SKILL.md index 51d744da..76312c81 100644 --- a/skills/perses-plugin-test/SKILL.md +++ b/skills/perses-plugin-test/SKILL.md @@ -26,39 +26,17 @@ routing: Test Perses plugins across four layers: CUE schema validation, React component unit tests, integration testing against a live Perses server, and Grafana migration compatibility. -## Operator Context - -This skill validates Perses plugin correctness from schemas through rendered components. Testing follows a strict order because each layer depends on the previous one passing. 
- -### Hardcoded Behaviors (Always Apply) -- **Schema tests first**: Always run `percli plugin test-schemas` before any other test layer — component and integration tests are meaningless if schemas are invalid -- **JSON examples required**: Every CUE schema must have a matching JSON example at `schemas///.json` for test coverage -- **`package model` declaration**: Every CUE schema file must declare `package model` — `percli plugin test-schemas` silently skips files without it -- **Local server only**: Integration tests must target a local Perses instance (`localhost`), never a shared or production server -- **Validate all 27 official plugins**: When testing a plugin that extends or wraps an official plugin, verify compatibility with the upstream schema - -### Default Behaviors (ON unless disabled) -- **Run all four phases**: Execute schema, component, integration, and migration tests in order -- **Stop on phase failure**: If a phase fails, fix it before proceeding to the next phase -- **CUE `close({...})` validation**: Verify schemas use `close()` to reject unknown fields in JSON examples - -### Optional Behaviors (OFF unless enabled) -- **Cross-plugin compatibility**: Test that this plugin's output works alongside other plugins in the same dashboard -- **Performance profiling**: Measure React component render time with React Profiler during integration tests -- **Snapshot testing**: Generate and compare React component snapshots across test runs - -## What This Skill CAN Do -- Validate CUE schemas against JSON examples using `percli plugin test-schemas` -- Run React component unit tests with mocked `@perses-dev/plugin-system` hooks -- Start a hot-reload dev server with `percli plugin start` for integration testing -- Test Grafana dashboard migration logic against sample JSON fixtures -- Diagnose common CUE syntax errors (missing package declaration, unclosed specs, bad imports) - -## What This Skill CANNOT Do -- Create new plugins from scratch (use 
perses-plugin-create) -- Validate full dashboards or datasource connectivity (use perses-lint) -- Deploy or configure a Perses server (use perses-deploy) -- Write CUE schemas — this skill only tests them +## Overview + +This skill validates Perses plugin correctness from schemas through rendered components. Testing follows a strict four-phase order because each layer depends on the previous one passing — schema tests must succeed before component tests, which must succeed before integration tests, which must succeed before migration tests. + +**STOP before proceeding if**: +- `percli` is not installed or not on PATH +- No CUE schemas exist in the plugin (nothing to test with `test-schemas`) +- The plugin has no JSON examples and you'd need to create them from scratch +- Integration testing is requested but no local Perses server is available and Docker is not installed +- Migration testing is requested but no `migrate/migrate.cue` file exists in the plugin +- Schema tests produce errors you cannot diagnose from the CUE output alone --- @@ -68,13 +46,16 @@ This skill validates Perses plugin correctness from schemas through rendered com **Goal**: Validate all CUE schemas compile and match their JSON examples. -1. **Verify schema structure**: Each schema file must have `package model` at the top and use `close({...})` for strict validation -2. **Check JSON examples exist**: Every schema at `schemas///` must have a corresponding `.json` +1. **Verify schema structure**: Each schema file must have `package model` at the top (because `percli plugin test-schemas` silently skips files without this declaration) and use `close({...})` for strict field validation (because this rejects JSON examples with unknown fields). + +2. **Check JSON examples exist**: Every schema at `schemas///` must have a corresponding `.json` file (because every CUE schema needs a matching JSON example for test coverage). + 3. **Run schema tests**: ```bash percli plugin test-schemas ``` -4.
**On failure**: Read the CUE error output carefully — common issues are missing imports, unclosed braces, or JSON examples with fields not in the schema + +4. **On failure**: Read the CUE error output carefully — common issues are missing `package model` declarations, unclosed braces in `close({...})` specs, or JSON examples with fields not defined in the schema. Fix errors in this phase before proceeding to Phase 2 (because component and integration tests are meaningless if schemas are invalid). **Gate**: All schema tests pass. Proceed to Phase 2. @@ -82,12 +63,14 @@ percli plugin test-schemas **Goal**: Run React component unit tests. -1. **Verify test setup**: Component tests must mock `@perses-dev/plugin-system` hooks (e.g., `useDataQueries`, `useTimeRange`) +1. **Verify test setup**: Component tests must mock `@perses-dev/plugin-system` hooks (e.g., `useDataQueries`, `useTimeRange`) with realistic return values (because tests pass but don't verify correctness if hooks return empty stubs). + 2. **Run tests**: ```bash npm test -- --watchAll=false ``` -3. **Check coverage**: Ensure plugin component renders without errors and handles empty/error states + +3. **Check coverage**: Ensure plugin component renders without errors and handles empty/error states. Create multiple JSON examples per schema — minimal (required fields only), full (all fields), and edge cases (empty arrays, null values) (because one example may only exercise the default branch of a union type; other branches remain untested). **Gate**: All component tests pass. Proceed to Phase 3. @@ -95,16 +78,19 @@ npm test -- --watchAll=false **Goal**: Verify the plugin works inside a running Perses instance. -1. **Start local Perses server** (if not already running): +1. 
**Start local Perses server** (if not already running; use `localhost` only, never a shared or production server, because tests may corrupt real data, hit rate limits, or produce non-reproducible results): ```bash docker run --name perses-test -d -p 127.0.0.1:8080:8080 persesdev/perses ``` + 2. **Start plugin dev server**: ```bash percli plugin start ``` -3. **Verify plugin loads**: Confirm the plugin appears in the Perses UI panel type selector -4. **Test with real data**: Create a dashboard using this plugin and verify it renders with a connected datasource + +3. **Verify plugin loads**: Confirm the plugin appears in the Perses UI panel type selector. + +4. **Test with real data**: Create a dashboard using this plugin and verify it renders with a connected datasource (because integration tests catch issues that unit tests cannot: plugin registration, data binding, and render lifecycle). **Gate**: Plugin loads and renders in local Perses. Proceed to Phase 4. @@ -112,13 +98,16 @@ percli plugin start **Goal**: Verify Grafana dashboard JSON converts correctly through migration logic. -1. **Locate migration schema**: Check for `migrate/migrate.cue` -2. **Prepare test fixtures**: Use sample Grafana dashboard JSON that exercises all panel types this plugin handles +1. **Locate migration schema**: Check for `migrate/migrate.cue`. + +2. **Prepare test fixtures**: Use sample Grafana dashboard JSON that exercises all panel types this plugin handles. Test migration with diverse Grafana fixtures covering all supported panel types (because your dashboard may only use a subset of panel types; other users' dashboards will have panels you didn't test). + 3. **Run migration**: ```bash percli migrate --input grafana-dashboard.json --output perses-dashboard.json ``` -4. **Validate output**: Verify the migrated dashboard JSON matches expected Perses schema structure + +4. 
**Validate output**: Verify the migrated dashboard JSON matches expected Perses schema structure and test the output against the Perses schema (because upstream Grafana panel JSON evolves independently; a working migration can break without any local changes). **Gate**: Migration produces valid Perses dashboard JSON. Task complete. @@ -141,54 +130,6 @@ percli migrate --input grafana-dashboard.json --output perses-dashboard.json --- -## Anti-Patterns - -| Anti-Pattern | Why It Fails | Do Instead | -|--------------|-------------|------------| -| Running component tests before schema tests pass | Components depend on valid schemas; testing components against broken schemas produces misleading failures | Always run `percli plugin test-schemas` first and fix all schema errors before touching component tests | -| Testing against a shared or production Perses server | Tests may corrupt real data, hit rate limits, or fail due to network latency; results are non-reproducible | Always use a local Perses instance via Docker or binary — disposable and isolated | -| JSON examples that only test the happy path | Schemas with optional fields, unions, or conditional logic have branches that never get exercised | Create multiple JSON examples per schema: minimal (required fields only), full (all fields), and edge cases (empty arrays, null values) | -| Skipping migration tests because "the schema didn't change" | Upstream Grafana panel JSON evolves independently; a working migration can break without any local changes | Run migration tests against current Grafana sample fixtures on every test cycle | -| Mocking the entire plugin-system module with empty stubs | Tests pass but don't verify that hooks are called correctly or return expected shapes | Mock individual hooks with realistic return values (e.g., `useTimeRange` returns `{ start, end }`) | - ---- - -## Anti-Rationalization - -| Rationalization | Reality | Required Action | -|-----------------|---------|-----------------| -| 
"Schema tests pass so the plugin works" | Schema tests only validate CUE syntax and JSON conformance — they say nothing about whether the React component renders | **Run all four phases** | -| "I tested with one JSON example and it passed" | One example may only exercise the default branch of a union type; other branches remain untested | **Create JSON examples for every schema variant** | -| "Integration tests are slow, I'll skip them this time" | Integration tests catch issues that unit tests cannot: plugin registration, data binding, render lifecycle | **Always run integration tests against local Perses** | -| "The migration worked for my Grafana dashboard" | Your dashboard may only use a subset of panel types; other users' dashboards will have panels you didn't test | **Test migration with diverse Grafana fixtures covering all supported panel types** | - ---- - -## FORBIDDEN Patterns - -These are hard stops. If you encounter these, fix them before proceeding: - -- **NEVER** run integration tests against a non-localhost Perses URL -- **NEVER** commit CUE schemas that lack `package model` — `percli` will silently ignore them -- **NEVER** skip schema tests and jump straight to component or integration tests -- **NEVER** use `percli plugin test-schemas` output as proof that React components work -- **NEVER** test migrations without validating the output against the Perses schema - ---- - -## Blocker Criteria - -Stop and ask the user before proceeding if: - -- `percli` is not installed or not on PATH -- No CUE schemas exist in the plugin (nothing to test with `test-schemas`) -- The plugin has no JSON examples and you'd need to create them from scratch -- Integration testing is requested but no local Perses server is available and Docker is not installed -- Migration testing is requested but no `migrate/migrate.cue` file exists in the plugin -- Schema tests produce errors you cannot diagnose from the CUE output alone - ---- - ## References - [Perses Plugin 
Development Guide](https://perses.dev/docs/plugins/) diff --git a/skills/perses-project-manage/SKILL.md b/skills/perses-project-manage/SKILL.md index e3301ddc..3fe83ecf 100644 --- a/skills/perses-project-manage/SKILL.md +++ b/skills/perses-project-manage/SKILL.md @@ -25,120 +25,7 @@ routing: # Perses Project Management -Create and manage projects with RBAC configuration. - -## Operator Context - -This skill operates as the lifecycle manager for Perses projects and their RBAC configuration, handling project creation, role definitions, and role bindings. A Project in Perses is an organizational container that groups dashboards, datasources, variables, and other resources. When running via the Perses Operator on Kubernetes, each project maps to a K8s namespace. - -### Hardcoded Behaviors (Always Apply) -- **MCP-first**: Use Perses MCP tools when available, percli as fallback -- **RBAC awareness**: When creating projects in production, always set up roles and bindings — an unprotected project allows any authenticated user full access -- **Project context**: Always verify/set active project with `percli project` before operating on project-scoped resources — wrong project context silently applies resources to the wrong project -- **Verify before declare**: After creating any resource (project, role, binding), verify it exists with a list or describe command before reporting success - -### Default Behaviors (ON unless disabled) -- **Simple create**: Create project with default settings unless RBAC is requested -- **Set active**: After creating a project, set it as the active project context - -### Optional Behaviors (OFF unless enabled) -- **RBAC setup**: Create roles and role bindings alongside project creation -- **Multi-project**: Create multiple projects in batch for team onboarding - -## What This Skill CAN Do -- Create, list, describe, and delete projects -- Set up roles with granular permissions (read/create/update/delete on specific resource types) -- Create role 
bindings to assign users or groups to roles -- Switch active project context -- Manage global roles and global role bindings - -## What This Skill CANNOT Do -- Manage user accounts (that's Perses server admin configuration) -- Configure authentication providers (use perses-deploy) -- Create dashboards or datasources (use perses-dashboard-create, perses-datasource-manage) - ---- - -## Error Handling - -### Project creation fails - -| Symptom | Cause | Fix | -|---------|-------|-----| -| "already exists" / 409 Conflict | Project name is already taken | List existing projects with `percli get project` or `perses_list_projects()` and use a different name, or operate on the existing project | -| "invalid name" / 400 Bad Request | Project name contains invalid characters (uppercase, spaces, special chars) | Use lowercase alphanumeric names with hyphens only (e.g., `my-project`). Perses follows DNS label conventions | -| "unauthorized" / 401 | Not authenticated or session token expired | Run `percli login` first, or verify MCP server auth config has valid credentials | -| "forbidden" / 403 | Authenticated user lacks permission to create projects | User needs a GlobalRole with `create` action on Project scope, or admin access | - -### Role and RoleBinding creation fails - -| Symptom | Cause | Fix | -|---------|-------|-----| -| "role not found" in RoleBinding apply | The role referenced in `spec.role` does not exist | Create the Role first, then create the RoleBinding. Verify role exists with `percli get role --project ` | -| "subject not found" / binding has no effect | User or group name in subjects does not match any identity in the auth provider | Verify user identity with the configured auth provider (Native, OIDC, OAuth). 
For native auth, the username is the login name | -| "project not found" in role metadata | The project specified in `metadata.project` does not exist | Create the project first, or fix the project name in the role definition | -| GlobalRole apply returns 403 | User does not have cluster-level admin permissions | GlobalRole and GlobalRoleBinding require admin-level access; escalate to a Perses admin | - -### Wrong project context - -| Symptom | Cause | Fix | -|---------|-------|-----| -| Resources appear in wrong project | `percli project` was set to a different project than intended | Always run `percli project ` immediately before applying project-scoped resources | -| "project not set" error | No active project context configured | Run `percli project ` to set the active project | -| Role/binding created but permissions don't work | RoleBinding references a role from a different project | Ensure role and binding are in the same project; check `metadata.project` on both | - -### MCP tool failures - -| Symptom | Cause | Fix | -|---------|-------|-----| -| `perses_create_project` returns read-only error | Perses server has `security.readonly: true` in config | Ask user to disable read-only mode, or switch to a writable instance | -| MCP tool returns connection refused | MCP server cannot reach Perses API | Check MCP server config URL and ensure Perses server is running at that address | -| MCP list returns empty but projects exist | MCP auth credentials lack read permission | Verify MCP server auth config; the configured user needs at least read access | - ---- - -## Anti-Patterns - -| Anti-Pattern | Why It Fails | Correct Approach | -|--------------|-------------|------------------| -| **Creating projects without RBAC in production** — skipping role and binding setup | Any authenticated user gets full read/write/delete access to the project's resources | Always create at least a viewer role and an admin role with bindings for production projects | -| **Not setting 
active project before applying resources** — relying on "whatever was last set" | Resources silently apply to the wrong project; no error is raised | Run `percli project ` immediately before every `percli apply` for project-scoped resources | -| **Using GlobalRole for project-scoped permissions** — granting org-wide access when project-level suffices | Violates principle of least privilege; users get access to all projects instead of just one | Use project-scoped Role for project-specific permissions; reserve GlobalRole for truly organization-wide needs | -| **Creating RoleBindings without verifying the role exists** — assuming role was created in a prior step | Binding references a non-existent role; no permissions are granted; no error may surface until access is denied | Always verify role existence with `percli get role --project ` before creating bindings | -| **Deleting a project without checking for active dashboards** — removing a project that contains resources | All dashboards, datasources, variables, and other resources in the project are permanently deleted | List project contents with `percli get dashboard --project ` before deletion; confirm with user | - ---- - -## Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|---------------|-----------------| -| "The project was just created, RBAC can wait" | In the gap between creation and RBAC setup, any authenticated user has full access to the project | **Set up RBAC immediately** after project creation for production environments | -| "I already set the project context earlier" | Context may have changed if other commands ran between setting it and applying resources | **Re-run `percli project `** before every apply — it costs nothing and prevents silent misapplication | -| "The role name looks right, skip the verify step" | Typos in role names fail silently in RoleBindings — the binding is created but grants nothing | **Run `percli get role --project `** 
and confirm the exact role name before creating bindings | -| "GlobalRole is simpler than per-project roles" | Simpler setup, but violates least-privilege and gives access across every project | **Use project-scoped Roles** unless the permission genuinely needs to span all projects | -| "MCP created it so it must exist" | MCP tool may return success on a stale or cached response; network issues can cause partial creates | **Verify with a list or describe command** after every create operation | - ---- - -## FORBIDDEN Patterns - -- **NEVER** apply project-scoped resources without first confirming the active project context — silent wrong-project application is the most common error -- **NEVER** create production projects without at least one Role and RoleBinding — unprotected projects are open to all authenticated users -- **NEVER** use wildcard (`"*"`) in GlobalRole scopes without explicit user approval — this grants access to every resource type across every project -- **NEVER** hardcode user email addresses or identities in skill output — always use placeholder values and let the user supply real identities -- **NEVER** delete projects without listing their contents first and confirming with the user - ---- - -## Blocker Criteria - -Stop and ask the user before proceeding if: -- Auth provider type is unknown — role binding subject `kind` (User vs Group) depends on auth configuration -- Target environment (dev vs production) is ambiguous — determines whether RBAC setup is required -- Project name conflicts with an existing project — user must decide: reuse existing or rename -- User wants GlobalRole with wildcard scopes — requires explicit confirmation due to security implications -- MCP server is in read-only mode — cannot create projects or roles; user must change server config +Create and manage Perses projects with RBAC configuration. A Project is an organizational container grouping dashboards, datasources, variables, and other resources. 
When running via Perses Operator on Kubernetes, each project maps to a K8s namespace. --- @@ -148,7 +35,12 @@ Stop and ask the user before proceeding if: **Goal**: Create a new Perses project. -**Via percli**: +**Use MCP when available** (because MCP is the primary tool choice, faster than CLI): +``` +perses_create_project(project="") +``` + +**Or use percli CLI as fallback**: ```bash percli apply -f - < ``` -**Via MCP** (preferred): -``` -perses_create_project(project="") -``` +**Constraints**: +- Use **lowercase alphanumeric names with hyphens only** (e.g., `my-project`), because Perses follows DNS label conventions and rejects uppercase/spaces/special chars +- **Always verify creation** with `percli get project` or `perses_list_projects()` before reporting success, because MCP tools may return cached responses or partial creates on network errors +- **Set the active project immediately after creation**, because wrong context silently applies resources to the wrong project later +- **Stop and ask if project name conflicts**, because user must decide whether to reuse or rename +- **Stop and ask if target environment is ambiguous** (dev vs production), because production requires RBAC setup while dev may use defaults -**Gate**: Project created and set as active context. Proceed to Phase 2 if RBAC is needed, otherwise task complete. +**Gate**: Project created, verified, and set as active context. Proceed to Phase 2 if RBAC is needed, otherwise task complete. ### Phase 2: CONFIGURE RBAC (optional) -**Goal**: Set up roles and role bindings for access control. +**Goal**: Set up roles and role bindings for access control when requested or for production projects. 
**Step 1: Create a role** @@ -193,7 +87,8 @@ EOF **Available scopes** (resource types): Dashboard, Datasource, EphemeralDashboard, Folder, Role, RoleBinding, Secret, Variable -For organization-wide roles, use GlobalRole: +**Use project-scoped Role for project-specific permissions**, because it enforces least privilege. **Reserve GlobalRole for org-wide needs only**, because global roles grant access across every project: + ```bash percli apply -f - < +spec: + role: dashboard-editor + subjects: + - kind: User + name: +EOF +``` + For global role bindings: + ```bash percli apply -f - < EOF ``` -**Gate**: Roles and bindings created. Proceed to Phase 3. +**Constraints**: +- **Verify the role exists before creating bindings**, because typos in role names fail silently and the binding grants nothing +- **Ensure role and binding are in the same project**, because cross-project references don't work +- Run `percli get role --project ` before binding creation to confirm exact role name + +**Gate**: Roles and bindings created and verified. Proceed to Phase 3. ### Phase 3: VERIFY @@ -272,7 +193,49 @@ perses_list_project_role_bindings(project="") perses_list_global_roles() ``` -**Gate**: Project listed, roles and bindings confirmed. Task complete. +**Constraints**: +- **Always verify before declaring success**, because MCP tools may report cached responses +- **Re-run `percli project `** before applying project-scoped resources in subsequent phases, because context may have changed since initial setup + +**Gate**: Project listed and roles/bindings confirmed. Task complete. 
+ +--- + +## Error Handling + +### Project creation fails + +| Symptom | Cause | Fix | +|---------|-------|-----| +| "already exists" / 409 Conflict | Project name is already taken | List existing projects with `percli get project` or `perses_list_projects()` and use a different name, or operate on the existing project | +| "invalid name" / 400 Bad Request | Project name contains invalid characters (uppercase, spaces, special chars) | Use lowercase alphanumeric names with hyphens only (e.g., `my-project`). Perses follows DNS label conventions | +| "unauthorized" / 401 | Not authenticated or session token expired | Run `percli login` first, or verify MCP server auth config has valid credentials | +| "forbidden" / 403 | Authenticated user lacks permission to create projects | User needs a GlobalRole with `create` action on Project scope, or admin access | + +### Role and RoleBinding creation fails + +| Symptom | Cause | Fix | +|---------|-------|-----| +| "role not found" in RoleBinding apply | The role referenced in `spec.role` does not exist | Create the Role first, then create the RoleBinding. Verify role exists with `percli get role --project ` | +| "subject not found" / binding has no effect | User or group name in subjects does not match any identity in the auth provider | Verify user identity with the configured auth provider (Native, OIDC, OAuth). 
For native auth, the username is the login name | +| "project not found" in role metadata | The project specified in `metadata.project` does not exist | Create the project first, or fix the project name in the role definition | +| GlobalRole apply returns 403 | User does not have cluster-level admin permissions | GlobalRole and GlobalRoleBinding require admin-level access; escalate to a Perses admin | + +### Wrong project context + +| Symptom | Cause | Fix | +|---------|-------|-----| +| Resources appear in wrong project | `percli project` was set to a different project than intended | Always run `percli project ` immediately before applying project-scoped resources | +| "project not set" error | No active project context configured | Run `percli project ` to set the active project | +| Role/binding created but permissions don't work | RoleBinding references a role from a different project | Ensure role and binding are in the same project; check `metadata.project` on both | + +### MCP tool failures + +| Symptom | Cause | Fix | +|---------|-------|-----| +| `perses_create_project` returns read-only error | Perses server has `security.readonly: true` in config | Ask user to disable read-only mode, or switch to a writable instance | +| MCP tool returns connection refused | MCP server cannot reach Perses API | Check MCP server config URL and ensure Perses server is running at that address | +| MCP list returns empty but projects exist | MCP auth credentials lack read permission | Verify MCP server auth config; the configured user needs at least read access | --- diff --git a/skills/perses-query-builder/SKILL.md b/skills/perses-query-builder/SKILL.md index 9bf0520a..a8f1cd19 100644 --- a/skills/perses-query-builder/SKILL.md +++ b/skills/perses-query-builder/SKILL.md @@ -29,136 +29,9 @@ routing: Build and optimize queries for Perses dashboard panels. -## Operator Context - -This skill constructs, validates, and optimizes queries embedded in Perses panel definitions. 
It handles PromQL (Prometheus), LogQL (Loki), and TraceQL (Tempo) with correct variable interpolation and datasource binding. - -### Hardcoded Behaviors (Always Apply) -- **Variable-aware**: Always use Perses variable syntax `$var` or `${var:format}` — never hardcode label values that should come from variables -- **Datasource-scoped**: Every query MUST reference its datasource by both `kind` and `name` fields -- **Interpolation-correct**: Use `${var:regex}` for `=~` matchers, `${var:csv}` or `${var:pipe}` for multi-select labels — never use bare `$var` with regex operators -- **Rate interval alignment**: Use `$__rate_interval` when the platform provides it; otherwise set rate intervals >= 4x the scrape interval - -### Default Behaviors (ON unless disabled) -- **PromQL default**: Default to PrometheusTimeSeriesQuery if query type not specified -- **Optimization suggestions**: Flag recording rule candidates for expensive aggregations over high-cardinality metrics -- **Label matcher validation**: Warn when queries lack a narrowing label matcher (e.g., selecting all series for a metric) -- **Multi-value detection**: When a variable is marked `allowMultiple`, automatically apply the correct interpolation format - -### Optional Behaviors (OFF unless enabled) -- **Recording rule generation**: Produce `recording_rules.yaml` for identified candidates -- **TraceQL exemplar linking**: Add exemplar query alongside PromQL for trace correlation -- **Query explain mode**: Annotate each query clause with comments explaining what it selects - -## What This Skill CAN Do -- Build PromQL, LogQL, and TraceQL queries for Perses panel specs -- Apply correct Perses variable interpolation formats (`${var:regex}`, `${var:csv}`, etc.) 
-- Validate query syntax and flag common PromQL/LogQL/TraceQL errors -- Suggest query optimizations (recording rules, label narrowing, rate intervals) -- Wire queries to the correct datasource kind and name - -## What This Skill CANNOT Do -- Create or configure datasources (use perses-datasource-manage) -- Build full dashboards or panel layouts (use perses-dashboard-create) -- Deploy Perses server instances (use perses-deploy) -- Develop custom Perses plugins (use perses-plugin-create) +## Overview ---- - -## Error Handling - -### PromQL Syntax Errors -**Symptom**: Query fails validation — missing closing bracket, invalid function name, bad label matcher syntax. -**Detection**: Look for unbalanced `()`, `{}`, `[]`; unknown function names; `=~` with unescaped special chars. -**Resolution**: Fix the syntax. Common fixes: -- Add missing closing `}` or `)` -- Replace `=~` value with a valid RE2 regex (no lookaheads) -- Use correct function name (e.g., `rate()` not `Rate()`, `histogram_quantile()` not `histogram_percentile()`) - -### Variable Interpolation Format Mismatch -**Symptom**: Dashboard renders wrong results or query errors when multi-value variable is selected. -**Detection**: `$var` or `${var}` used with `=~` matcher; `${var:csv}` used with `=~` (needs regex format). -**Resolution**: -- For `=~` matchers: use `${var:regex}` (produces `val1|val2|val3`) -- For `=` with multi-select: use `${var:csv}` or `${var:pipe}` depending on downstream expectation -- For JSON API params: use `${var:json}` - -### Datasource Kind Mismatch -**Symptom**: Query silently returns no data or errors at runtime with "unsupported query type". -**Detection**: Query plugin `kind` does not match datasource `kind` (e.g., `PrometheusTimeSeriesQuery` referencing a `TempoDatasource`). 
-**Resolution**: Align the query plugin kind with the datasource kind: -- `PrometheusTimeSeriesQuery` → `PrometheusDatasource` -- `TempoTraceQuery` → `TempoDatasource` -- `LokiLogQuery` → `LokiDatasource` - -### High-Cardinality Query Warnings -**Symptom**: Query is slow, times out, or overwhelms Prometheus. -**Detection**: No label matchers narrowing selection; `rate()` missing or with no interval; aggregation over unbounded label set. -**Resolution**: -- Add label matchers to reduce selected series (at minimum `job` or `namespace`) -- Wrap counters in `rate()` or `increase()` with an appropriate interval -- Consider a recording rule for expensive `histogram_quantile()` or multi-level aggregations - ---- - -## Anti-Patterns - -### Hardcoding Label Values -**Wrong**: `http_requests_total{namespace="production"}` in a panel query. -**Right**: `http_requests_total{namespace="$namespace"}` using a dashboard variable. -**Why**: Hardcoded values break reusability across environments and defeat the purpose of dashboard variables. - -### Bare `$var` with Multi-Value or Regex -**Wrong**: `http_requests_total{pod=~"$pod"}` when `pod` is a multi-select variable. -**Right**: `http_requests_total{pod=~"${pod:regex}"}`. -**Why**: Without `:regex` format, multi-select values are not joined with `|` — the query matches only the first selected value or produces a syntax error. - -### Missing Datasource Spec in Query -**Wrong**: Omitting the `datasource` block or specifying only `name` without `kind`. -**Right**: -```yaml -datasource: - kind: PrometheusDatasource - name: prometheus -``` -**Why**: Perses needs both `kind` and `name` to resolve the datasource. Omitting `kind` causes runtime resolution failures. - -### Using `rate()` Without Meaningful Interval -**Wrong**: `rate(http_requests_total{job="api"}[1s])`. -**Right**: `rate(http_requests_total{job="api"}[$__rate_interval])` or `[5m]` aligned with scrape interval. 
-**Why**: Intervals shorter than the scrape interval produce empty results; `$__rate_interval` auto-adapts. - ---- - -## Anti-Rationalization - -| Rationalization | Reality | Required Action | -|---|---|---| -| "Bare `$var` works fine for single-select" | Variables can be changed to multi-select later, breaking the query | **Always use explicit format when combined with `=~`** | -| "Datasource kind is obvious from context" | Perses resolves datasources by kind+name pair at runtime | **Always specify both `kind` and `name`** | -| "This query is simple enough to skip validation" | Simple queries with typos still fail silently | **Validate every query against syntax rules** | -| "Recording rules are premature optimization" | `histogram_quantile` over thousands of series will time out in production | **Flag recording rule candidates for expensive aggregations** | - ---- - -## FORBIDDEN Patterns - -- **NEVER** use `${var:regex}` with `=` (equality) matchers — regex format with `=` causes silent mismatches -- **NEVER** omit `kind` from the datasource reference — Perses cannot resolve by name alone -- **NEVER** mix query plugin types within a single panel query list (e.g., PromQL and TraceQL in the same `queries[]` array) -- **NEVER** use Grafana-style `$__interval` or `${__rate_interval}` — Perses uses `$__rate_interval` (no braces, double underscores) -- **NEVER** assume a variable supports multi-select — check the variable definition's `allowMultiple` field - ---- - -## Blocker Criteria - -Do NOT proceed past the BUILD phase if any of these are true: - -1. **Datasource unknown**: The target datasource name and kind have not been confirmed — query cannot be validated -2. **Variable definitions missing**: Query references `$var` but no matching variable exists in the dashboard spec -3. **Query type ambiguous**: Cannot determine whether PromQL, LogQL, or TraceQL is needed from user request -4. 
**Metric name unverified**: The metric name referenced does not exist in the target Prometheus/Loki/Tempo instance and the user has not confirmed it +This skill constructs, validates, and optimizes queries embedded in Perses panel definitions. It handles PromQL (Prometheus), LogQL (Loki), and TraceQL (Tempo) with correct variable interpolation and datasource binding. The workflow progresses through three phases: identifying query requirements, building the query with proper templating, and optimizing for performance and correctness. --- @@ -168,12 +41,21 @@ Do NOT proceed past the BUILD phase if any of these are true: **Goal**: Determine query type, datasource, and variable context. +**Blockers**: Do not proceed if any of these are unresolved (because Perses requires them for runtime resolution and query validation): + +1. **Datasource unknown** — The target datasource name and kind have not been confirmed. Perses resolves datasources at runtime using the `kind` and `name` pair; queries cannot be validated without this. +2. **Variable definitions missing** — Query references `$var` but no matching variable exists in the dashboard spec. Variables must be defined in the dashboard before queries can reference them. +3. **Query type ambiguous** — Cannot determine whether PromQL (metrics), LogQL (logs), or TraceQL (traces) is needed from the user request. Each query type maps to a specific datasource kind. +4. **Metric name unverified** — The metric name referenced does not exist in the target Prometheus/Loki/Tempo instance and the user has not confirmed it. Skip metric validation if the user explicitly says the metric exists or is intentional. + +**Steps**: + 1. **Query type**: Identify which query language is needed: - - PrometheusTimeSeriesQuery (PromQL) — metrics, counters, histograms - - TempoTraceQuery (TraceQL) — distributed traces - - LokiLogQuery (LogQL) — log streams -2. 
**Datasource**: Confirm the datasource `name` and `kind` from the dashboard or project context -3. **Variables**: Identify which dashboard variables the query should reference and their `allowMultiple` setting + - PrometheusTimeSeriesQuery (PromQL) — for metrics, counters, histograms + - TempoTraceQuery (TraceQL) — for distributed traces + - LokiLogQuery (LogQL) — for log streams +2. **Datasource**: Confirm the datasource `name` and `kind` from the dashboard or project context (because Perses cannot resolve datasources by name alone at runtime) +3. **Variables**: Identify which dashboard variables the query should reference and check their `allowMultiple` setting (because this determines which interpolation format to use) **Gate**: Query type, datasource, and variable context confirmed. Proceed to Phase 2. @@ -181,6 +63,17 @@ Do NOT proceed past the BUILD phase if any of these are true: **Goal**: Construct the query with proper variable templating and datasource binding. +**Constraints applied during building**: + +- **Always use Perses variable syntax** `$var` or `${var:format}` (not hardcoded label values) because dashboard variables enable query reusability across environments +- **Include both `kind` and `name` in the datasource spec** because Perses resolves datasources by kind+name pair at runtime and will fail silently if `kind` is omitted +- **Use the correct interpolation format for the operator context** — specifically, use `${var:regex}` for `=~` matchers and `${var:csv}` or `${var:pipe}` for equality matchers with multi-select variables, because bare `$var` with regex operators only interpolates the first selected value +- **Never use `${var:regex}` with `=` (equality) matchers** because regex format with equality causes silent mismatches; regex format is only for `=~` +- **Default to PrometheusTimeSeriesQuery** if query type is not explicitly specified +- **Use `$__rate_interval` for `rate()` and `increase()`** when the platform provides it, otherwise 
set intervals >= 4x the scrape interval, because shorter intervals produce empty results + +**Example**: + ```yaml queries: - kind: TimeSeriesQuery @@ -216,16 +109,62 @@ queries: **Goal**: Review the query for performance and correctness. -1. **Label narrowing**: Ensure at least one selective label matcher is present (e.g., `job`, `namespace`) -2. **Rate intervals**: Confirm `rate()`/`increase()` intervals align with scrape interval or use `$__rate_interval` -3. **Recording rule candidates**: Flag `histogram_quantile()` over high-cardinality metrics, multi-level `sum(rate(...))` aggregations, or any query aggregating over > 1000 estimated series -4. **Variable format audit**: Verify every `$var` reference uses the correct interpolation format for its operator context -5. **Datasource alignment**: Confirm query plugin kind matches datasource kind +**Constraints validated during optimization**: + +- **Validate label narrowing** — ensure at least one selective label matcher is present (e.g., `job`, `namespace`) because queries without label matchers select all series for a metric and can overwhelm Prometheus +- **Confirm rate intervals** — verify `rate()`/`increase()` intervals align with scrape interval or use `$__rate_interval`, because intervals shorter than scrape interval produce empty results +- **Flag recording rule candidates** — identify expensive patterns like `histogram_quantile()` over high-cardinality metrics, multi-level `sum(rate(...))` aggregations, or queries aggregating over > 1000 estimated series, because these will time out in production +- **Audit variable formats** — verify every `$var` reference uses the correct interpolation format for its operator context (regex format for `=~`, CSV/pipe for `=`), because mismatches produce wrong results +- **Align plugin and datasource kinds** — confirm query plugin kind matches datasource kind (e.g., `PrometheusTimeSeriesQuery` with `PrometheusDatasource`, not `TempoDatasource`), because mismatches cause 
"unsupported query type" errors at runtime + +**Steps**: + +1. Check that at least one selective label matcher narrows the selection +2. Verify rate intervals are appropriately tuned +3. Identify expensive aggregations that should become recording rules +4. Validate each variable uses the correct format for its context +5. Confirm datasource kind aligns with query plugin kind **Gate**: Query optimized and validated. Task complete. --- +## Error Handling + +### PromQL Syntax Errors +**Symptom**: Query fails validation — missing closing bracket, invalid function name, bad label matcher syntax. +**Detection**: Look for unbalanced `()`, `{}`, `[]`; unknown function names; `=~` with unescaped special chars. +**Resolution**: Fix the syntax. Common fixes: +- Add missing closing `}` or `)` +- Replace `=~` value with a valid RE2 regex (no lookaheads) +- Use correct function name (e.g., `rate()` not `Rate()`, `histogram_quantile()` not `histogram_percentile()`) + +### Variable Interpolation Format Mismatch +**Symptom**: Dashboard renders wrong results or query errors when multi-value variable is selected. +**Detection**: `$var` or `${var}` used with `=~` matcher; `${var:csv}` used with `=~` (needs regex format). +**Resolution**: +- For `=~` matchers: use `${var:regex}` (produces `val1|val2|val3`) +- For `=` with multi-select: use `${var:csv}` or `${var:pipe}` depending on downstream expectation +- For JSON API params: use `${var:json}` + +### Datasource Kind Mismatch +**Symptom**: Query silently returns no data or errors at runtime with "unsupported query type". +**Detection**: Query plugin `kind` does not match datasource `kind` (e.g., `PrometheusTimeSeriesQuery` referencing a `TempoDatasource`). 
+**Resolution**: Align the query plugin kind with the datasource kind: +- `PrometheusTimeSeriesQuery` → `PrometheusDatasource` +- `TempoTraceQuery` → `TempoDatasource` +- `LokiLogQuery` → `LokiDatasource` + +### High-Cardinality Query Warnings +**Symptom**: Query is slow, times out, or overwhelms Prometheus. +**Detection**: No label matchers narrowing selection; `rate()` missing or with no interval; aggregation over unbounded label set. +**Resolution**: +- Add label matchers to reduce selected series (at minimum `job` or `namespace`) +- Wrap counters in `rate()` or `increase()` with an appropriate interval +- Consider a recording rule for expensive `histogram_quantile()` or multi-level aggregations + +--- + ## References - [Perses Variable Interpolation](https://perses.dev/docs/user-guides/variables/) — Official docs on variable formats diff --git a/skills/perses-variable-manage/SKILL.md b/skills/perses-variable-manage/SKILL.md index 729d41fa..21980632 100644 --- a/skills/perses-variable-manage/SKILL.md +++ b/skills/perses-variable-manage/SKILL.md @@ -28,34 +28,9 @@ routing: Create and manage variables with chains and interpolation. -## Operator Context +## Overview -This skill operates as the lifecycle manager for Perses variables, handling creation, chaining, and interpolation configuration across scopes. 
- -### Hardcoded Behaviors (Always Apply) -- **Chain ordering**: Variables must be ordered so dependencies come first — Perses evaluates variables in array order, so a variable referencing `$cluster` must appear after the cluster variable -- **MCP-first**: Use Perses MCP tools when available, percli as fallback -- **Interpolation format**: Document which format is used and why — wrong format causes query syntax errors (e.g., regex format for Prometheus matchers, csv for multi-select labels) - -### Default Behaviors (ON unless disabled) -- **ListVariable**: Default to ListVariable with PrometheusLabelValuesVariable plugin -- **Dashboard scope**: Create variables at dashboard scope unless otherwise specified -- **Multi-select**: Enable allowMultiple and allowAllValue by default for filter variables - -### Optional Behaviors (OFF unless enabled) -- **Global/project variables**: Create at global or project scope for reuse across dashboards -- **TextVariable**: Use TextVariable for free-form user input fields - -## What This Skill CAN Do -- Create TextVariable and ListVariable at any scope (global, project, dashboard) -- Set up variable chains with cascading dependencies -- Configure interpolation formats (csv, regex, json, lucene, pipe, glob, etc.) -- Use all 4 variable plugin types - -## What This Skill CANNOT Do -- Create custom variable plugins (use perses-plugin-create) -- Create dashboards (use perses-dashboard-create) -- Manage datasources (use perses-datasource-manage) +This skill manages the full lifecycle of Perses variables: creation, configuration, chaining with dependencies, and interpolation across all scopes (global, project, dashboard). Variable chains enable cascading filters where dependent variables only show values filtered by their parent selections — for example, a pod variable that only displays pods matching the selected cluster and namespace. 
--- @@ -102,6 +77,8 @@ This skill operates as the lifecycle manager for Perses variables, handling crea | singlequote | `'a','b','c'` | Single-quoted CSV | | raw | `a` (first only) | Single value extraction | +**Key constraint on variable chains**: Variables must be ordered so dependencies come first because Perses evaluates variables in array order. A variable referencing `$cluster` must appear after the cluster variable definition. This is a strict sequencing requirement — if the parent variable appears after the child, the child will resolve `$cluster` as empty on first load, returning unfiltered results. + **Gate**: Variable type, plugin, scope, and dependencies identified. Proceed to Phase 2. ### Phase 2: CREATE @@ -148,7 +125,7 @@ perses_create_project_variable( **Variable chain** (dashboard scope — cluster -> namespace -> pod): -Variables must be ordered with dependencies first. Each subsequent variable uses matchers that reference the previous variables: +When building chains, order variables with dependencies first. Each subsequent variable uses matchers that reference the previous variables using the exact syntax shown below. Always include the `datasource` field — Perses will not infer it and the variable will fail to resolve if omitted. Use explicit interpolation formats in consuming queries (regex for Prometheus `=~` matchers, csv for other contexts) to ensure multi-value substitution works correctly. ```yaml variables: @@ -201,6 +178,11 @@ variables: - "namespace=\"$namespace\"" ``` +**Default behaviors** (applied unless overridden): +- Default to ListVariable with PrometheusLabelValuesVariable plugin for query-driven filtering +- Create variables at dashboard scope unless otherwise specified +- Enable `allowMultiple` and `allowAllValue` by default for filter variables (allows users to select all or no specific values) + **Gate**: Variables created without errors. Proceed to Phase 3. 
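
The ordering rule for variable chains lends itself to a mechanical check before anything is applied. The sketch below is illustrative only: `find_order_violations` is a hypothetical helper, and it assumes each variable has already been reduced to a `(name, matchers)` pair taken from the dashboard's `variables` array in order. It scans matcher strings for `$name`, `${name}`, and `${name:format}` references and reports any reference to a variable that is defined later in the array.

```python
import re

# Captures the variable name from $name, ${name}, and ${name:format}.
_VAR_REF = re.compile(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)")


def find_order_violations(variables):
    """Return (variable, referenced) pairs where `referenced` is defined later.

    `variables` is a list of (name, matchers) tuples in array order.
    References to names that are not dashboard variables (for example
    built-ins) are ignored.
    """
    all_names = {name for name, _ in variables}
    seen = set()
    violations = []
    for name, matchers in variables:
        for matcher in matchers:
            for ref in _VAR_REF.findall(matcher):
                # A reference is a violation if the target exists in the
                # array but has not been defined yet at this position.
                if ref in all_names and ref not in seen:
                    violations.append((name, ref))
        seen.add(name)
    return violations


if __name__ == "__main__":
    chain = [
        ("cluster", []),
        ("namespace", ['cluster="$cluster"']),
        ("pod", ['cluster="$cluster"', 'namespace="$namespace"']),
    ]
    print(find_order_violations(chain))
```

An empty result means every parent precedes its children. It says nothing about whether the matchers themselves are valid, so the Phase 3 checks still apply.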
### Phase 3: VERIFY @@ -240,47 +222,7 @@ Verify chain behavior by checking that dependent variables correctly filter when | MCP `perses_create_project_variable` fails | Error returned from MCP tool call — variable not created | Check: (1) variable name does not already exist in the project (names must be unique per scope), (2) the target project exists (create it first with `perses_create_project`), (3) plugin kind is spelled correctly (e.g., `PrometheusLabelValuesVariable` not `PrometheusLabelValues`) | | Matcher syntax error in child variable | Child variable returns empty results or Perses logs show query parse errors | Matchers must be exact PromQL label matcher syntax: `"label=\"$var\""` with escaped inner quotes. Missing escapes or wrong quote nesting breaks the matcher silently | ---- - -## Anti-Patterns - -| Anti-Pattern | Why It Fails | Correct Approach | -|-------------|-------------|------------------| -| Wrong dependency order in variable array (child before parent) | Perses evaluates variables top-to-bottom. A child variable referencing `$cluster` that appears before the cluster variable will resolve `$cluster` as empty, returning unfiltered results | Always order variables by dependency chain: root variables first, then each level of dependents in order | -| Using `${var}` or `${var:csv}` for Prometheus `=~` label matchers | `=~` expects a regex pattern. 
CSV format `a,b,c` is not valid regex and either causes a parse error or silently matches nothing | Use `${var:regex}` which produces `a\|b\|c` — valid regex alternation for Prometheus matchers | -| Setting `allowMultiple: true` without configuring the appropriate interpolation format in consuming queries | The variable will return multiple values, but the query uses bare `$var` which only substitutes the first value, silently dropping the rest | When `allowMultiple` is true, always use an explicit interpolation format in queries: `${var:regex}` for Prometheus, `${var:csv}` for APIs, `${var:lucene}` for LogQL | -| Creating GlobalVariable for project-specific filters | Global variables apply to all projects and dashboards, polluting the variable namespace and confusing users who see irrelevant filters | Use project-scoped variables (kind: Variable with project reference) for filters specific to a team or service. Reserve GlobalVariable for truly universal filters like environment or region | -| Duplicating variables across dashboards instead of using project/global scope | Changes must be made in every dashboard individually; variable definitions drift over time | Promote shared variables to project or global scope and reference them consistently across dashboards | -| Hardcoding datasource name without checking available datasources | Variable queries fail silently when the datasource name does not match any configured datasource | List available datasources first (`percli get datasource --project ` or `perses_list_datasources`) and use the exact name | - ---- - -## Anti-Rationalization - -These are shortcuts that seem reasonable but cause real failures: - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|---------------|-----------------| -| "The variable order doesn't matter, they all resolve eventually" | Perses evaluates variables strictly in array order, not by dependency graph. 
A child variable that appears before its parent will resolve against an empty parent value on first load | **Map the full dependency chain and verify array order matches** | -| "I'll just use `$var` without a format — Perses will figure it out" | Bare `$var` uses the default format which may not match the query context. For Prometheus `=~` matchers this silently produces wrong results | **Always specify the interpolation format explicitly when the variable is multi-select** | -| "The variable works in the UI so the interpolation must be correct" | It may work with a single selection but break with multiple selections. The default interpolation for single values happens to work, masking the missing format specification | **Test with multiple values selected to verify the interpolation format produces valid query syntax** | -| "I'll create the variable and fix the chain order later" | Variables that appear to work in isolation will return wrong results when chaining is broken, and the bug is subtle — dashboards show data, just unfiltered data | **Get the dependency order right before creating any variables** | - ---- - -## FORBIDDEN Patterns - -These patterns MUST NOT appear in any variable configuration produced by this skill: - -- **NEVER** define a child variable before its parent in the variables array — this silently breaks filtering -- **NEVER** use `${var:csv}` in a Prometheus `=~` or `!~` matcher — use `${var:regex}` instead -- **NEVER** hardcode label values in a ListVariable when the values come from Prometheus — use PrometheusLabelValuesVariable or PrometheusPromQLVariable instead -- **NEVER** create a variable with `allowMultiple: true` without verifying that all consuming queries use an appropriate multi-value interpolation format -- **NEVER** omit the `datasource` field in a Prometheus variable plugin — Perses will not infer it and the variable will fail to resolve - ---- - -## Blocker Criteria +### Blockers Do NOT proceed past each phase gate if any of 
these conditions exist: @@ -299,6 +241,25 @@ Do NOT proceed past each phase gate if any of these conditions exist: - Variable chain produces unfiltered results when parent is selected — dependency order or matcher syntax is wrong - Variable dropdown is empty — plugin configuration, datasource, or label name is incorrect +### Anti-Rationalization Guards + +These shortcuts seem reasonable but cause real failures: + +- **"The variable order doesn't matter, they all resolve eventually"**: Perses evaluates variables strictly in array order, not by dependency graph. A child variable that appears before its parent will resolve against an empty parent value on first load, silently producing wrong results. +- **"I'll just use `$var` without a format — Perses will figure it out"**: Bare `$var` uses the default format which may not match the query context. For Prometheus `=~` matchers this silently produces wrong results. Always specify the interpolation format explicitly when the variable is multi-select. +- **"The variable works in the UI so the interpolation must be correct"**: It may work with a single selection but break with multiple selections. Always test with multiple values selected to verify the interpolation format produces valid query syntax. +- **"I'll create the variable and fix the chain order later"**: Variables that appear to work in isolation will return wrong results when chaining is broken, and the bug is subtle — dashboards show data, just unfiltered data. Get the dependency order right before creating any variables. 
+ +### Forbidden Patterns + +Never produce configurations with these patterns: + +- **NEVER** define a child variable before its parent in the variables array — this silently breaks filtering +- **NEVER** use `${var:csv}` in a Prometheus `=~` or `!~` matcher — use `${var:regex}` instead +- **NEVER** hardcode label values in a ListVariable when the values come from Prometheus — use PrometheusLabelValuesVariable or PrometheusPromQLVariable instead +- **NEVER** create a variable with `allowMultiple: true` without verifying that all consuming queries use an appropriate multi-value interpolation format +- **NEVER** omit the `datasource` field in a Prometheus variable plugin — Perses will not infer it and the variable will fail to resolve + --- ## References diff --git a/skills/plan-checker/SKILL.md b/skills/plan-checker/SKILL.md index 47997f45..43353b71 100644 --- a/skills/plan-checker/SKILL.md +++ b/skills/plan-checker/SKILL.md @@ -36,49 +36,13 @@ routing: # Plan Checker Skill -## Purpose - -Validate plans before execution using goal-backward analysis. Start from the stated goal, verify every requirement has a complete path through the plan to completion. Catch plan-level defects before they waste an entire execution cycle. - -**Key principle**: Plan completeness does not equal goal achievement. A plan can have all tasks filled in, each well-specified, and still miss the goal. The checker works backward from the goal through every requirement to verify complete coverage -- not just that each task looks reasonable in isolation. - -## Operator Context - -This skill operates as a gate between planning and execution. It validates plan quality across 10 dimensions, produces structured findings, and issues a PASS/BLOCK verdict. If issues are found, a bounded revision loop allows up to 3 iterations before proceeding with documented risks. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before validation. 
Repository rules inform dimension 9 (CLAUDE.md compliance). -- **Goal-Backward Analysis**: Always start from the stated goal and trace backward through the plan. Never validate tasks in isolation -- a task is only valid if it contributes to a path from goal to completion. -- **Structured Findings Format**: Every issue uses the standard format (plan, dimension, severity, description, fix_hint). Unstructured feedback is not actionable and gets lost. -- **Bounded Revision Loop**: Maximum 3 revision iterations. After 3, remaining issues are documented as known risks. This prevents infinite planning loops -- the cost of one more revision iteration is never zero, and at some point execution with known risks beats continued planning. -- **Codebase Verification for Dependencies**: Dimension 3 (dependency correctness) and dimension 9 (CLAUDE.md compliance) require reading actual files. Do not validate file paths or import references from memory -- verify against the codebase. -- **BLOCK on Any Blocker**: A single blocker-severity finding means BLOCK verdict. Blockers are non-negotiable because they predict execution failure, not just suboptimal execution. - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report findings factually with severity and fix hints. No self-congratulation. Show evidence (file paths, missing requirements) rather than assertions. -- **Full Dimension Scan**: Run all 10 dimensions on every check. Skipping dimensions creates false confidence -- a plan that passes 7 of 10 dimensions is not "mostly good," it has 3 unchecked risk areas. -- **Severity-Ordered Output**: Report blockers before warnings. The reader needs to know immediately whether execution is blocked. -- **Fix Hints Required**: Every finding must include a fix_hint. A finding without a fix is a complaint, not actionable feedback. 
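The structured-findings format and the BLOCK-on-any-blocker rule described in this section can be sketched as follows. This is an illustration only: `Finding`, `verdict`, and `ordered` are hypothetical names, and the fields are taken from the format listed above (plan, dimension, severity, description, fix_hint).

```python
from dataclasses import dataclass

# Lower rank sorts first, so blockers are reported before warnings.
SEVERITY_ORDER = {"blocker": 0, "warning": 1}


@dataclass(frozen=True)
class Finding:
    plan: str
    dimension: int
    severity: str  # "blocker" or "warning"
    description: str
    fix_hint: str

    def __post_init__(self):
        if self.severity not in SEVERITY_ORDER:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not self.fix_hint.strip():
            raise ValueError("every finding must include a fix_hint")


def verdict(findings):
    """A single blocker-severity finding is enough to BLOCK."""
    return "BLOCK" if any(f.severity == "blocker" for f in findings) else "PASS"


def ordered(findings):
    """Severity-ordered output: blockers first, then by dimension number."""
    return sorted(findings, key=lambda f: (SEVERITY_ORDER[f.severity], f.dimension))


findings = [
    Finding("add-auth", 6, "warning", "No verification commands listed",
            "Add a ## Verification section with concrete commands"),
    Finding("add-auth", 3, "blocker", "Referenced file does not exist",
            "Correct the path or add a task that creates the file"),
]
print(verdict(findings))
print([f.dimension for f in ordered(findings)])
```

Rejecting an empty `fix_hint` at construction time is one way to enforce the "fix hints required" behavior; a real checker might report the omission instead of raising.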
- -### Optional Behaviors (OFF unless enabled) -- **Auto-revise mode**: Automatically revise the plan to fix findings instead of just reporting them (OFF by default -- human review of revisions is safer) -- **Cross-plan validation**: When multiple plans exist for the same feature, validate data contracts between them (OFF by default -- enable for multi-plan features) - -## What This Skill CAN Do -- Validate plans against 10 verification dimensions -- Verify file paths and imports against the actual codebase -- Check CLAUDE.md compliance for the target repository -- Produce structured findings with severity and fix hints -- Issue PASS/BLOCK verdicts -- Run bounded revision loops (max 3 iterations) -- Validate plans from any source (feature-plan, workflow-orchestrator, manual) - -## What This Skill CANNOT Do -- Create plans (use feature-plan or workflow-orchestrator) -- Execute plans (use feature-implement) -- Manage plan lifecycle (use plan-manager) -- Fix plans automatically without user review (unless auto-revise is enabled) -- Validate plans without a stated goal (goal-backward analysis requires a goal) +## Overview + +Validate plans before execution using goal-backward analysis. Start from the stated goal and verify every requirement has a complete path through the plan to completion. This catches plan-level defects before they waste an entire execution cycle. + +**Key principle**: Plan completeness does not equal goal achievement. A plan can have all tasks filled in, each well-specified, and still miss the goal entirely. The checker validates by working backward from the goal through every requirement to verify complete coverage -- not just that each task looks reasonable in isolation. + +The skill operates as a gate between planning and execution, validating across 10 dimensions and issuing a PASS/BLOCK verdict. 
If issues are found, a bounded revision loop allows up to 3 iterations before proceeding with documented risks (because the cost of continued planning is not zero -- it consumes context, delays execution, and each revision may introduce new issues). ## Instructions @@ -102,7 +66,7 @@ workflow-orchestrator to create one first. **Step 2: Extract the goal** -The goal is the anchor for goal-backward analysis. Find it in: +The goal is the anchor for goal-backward analysis (because analysis requires knowing what success looks like). Find it in: - The plan's own `## Goal` or `## Success Criteria` section - The parent design document (for feature plans: `.feature/state/design/`) - The user's original request (for workflow-orchestrator plans) @@ -115,7 +79,7 @@ analysis requires a goal. Add a ## Goal section to the plan. **Step 3: Load repository context** -Read the target repository's CLAUDE.md (if it exists) for dimension 9 validation: +Read the target repository's CLAUDE.md (if it exists) for dimension 9 validation (because repository rules inform which constraints the plan must comply with): ```bash find . -maxdepth 2 -name "CLAUDE.md" -type f 2>/dev/null | head -5 ``` @@ -138,7 +102,7 @@ Run goal-backward analysis across all 10 verification dimensions. For each dimen **Severity**: Blocker **Check**: Every extracted requirement appears in at least one task. -Trace backward from each requirement: +Trace backward from each requirement (because missing requirements directly predict execution failure): - Which task(s) address this requirement? - Is the coverage complete (full path to implementation) or partial? @@ -160,7 +124,7 @@ Any UNCOVERED requirement is a blocker finding. **Severity**: Warning **Check**: Each task has concrete actions, not vague descriptions. 
-Scan every task for vague verbs that signal incomplete thinking: +Scan every task for vague verbs that signal incomplete thinking (because vague tasks fail at execution time when the executor realizes "implement feature" has dozens of possible interpretations): | Vague (reject) | Concrete (accept) | |-----------------|-------------------| @@ -177,7 +141,7 @@ A task that uses vague verbs without specifying the concrete action is a warning **Severity**: Blocker **Check**: Tasks reference the right files and imports, verified against the actual codebase. -For each task that references files: +For each task that references files, verify using the filesystem (because file paths and imports validated from memory are unreliable -- the codebase is the source of truth): ```bash # Verify each referenced file exists ls -la /path/to/referenced/file @@ -193,7 +157,7 @@ A task referencing a nonexistent file or wrong import path is a blocker finding. **Severity**: Blocker **Check**: Cross-component wiring is explicitly tasked. -For any plan that adds new components, verify that the wiring between components is an explicit task. Common wiring that gets forgotten: +For any plan that adds new components, verify that the wiring between components is an explicit task (because unwired components are the most common post-execution discovery -- the feature "works" in isolation but is never reachable): | New Component | Required Wiring | |---------------|----------------| @@ -206,7 +170,7 @@ For any plan that adds new components, verify that the wiring between components | New agent | INDEX.json entry | | New skill | INDEX.json entry, routing table | -If the plan adds a component but has no task for its wiring, that is a blocker finding. WHY: Unwired components are the most common post-execution discovery -- the feature "works" in the new file but is never reachable because nobody registered the route or loaded the config. 
+If the plan adds a component but has no task for its wiring, that is a blocker finding. #### Dimension 5: Scope Sanity **Severity**: Warning at 4 tasks, Blocker at 5+ @@ -218,7 +182,7 @@ Count the tasks in the plan: |------------|---------|-----------| | 1-3 | Good | Right-sized for focused execution | | 4 | Warning | Approaching limit; review if any tasks can merge | -| 5+ | Blocker | Split required -- too many tasks for reliable single-context execution. WHY: context windows are finite, and each task adds execution state. Past 5 tasks, the executor loses track of earlier context, makes inconsistent decisions, or runs out of room for error recovery. | +| 5+ | Blocker | Split required -- context windows are finite, and each task adds execution state. Past 5 tasks, the executor loses track of earlier context, makes inconsistent decisions, or runs out of room for error recovery. | For blocker: suggest how to split (by wave, by component, by dependency boundary). @@ -226,7 +190,7 @@ For blocker: suggest how to split (by wave, by component, by dependency boundary **Severity**: Warning **Check**: How to verify the plan's goal is achieved must be explicit. -Look for a `## Verification` or `## Success Criteria` section. Check that it specifies: +Look for a `## Verification` or `## Success Criteria` section. 
Check that it specifies (because "run tests" is not verification -- expected outcomes and observable behaviors are): - Concrete commands to run (not just "run tests") - Expected outcomes (not just "tests pass") - Observable behaviors (not implementation tasks) @@ -245,7 +209,7 @@ A plan with no verification section or only vague verification is a warning find If the plan is part of a feature lifecycle (design -> plan -> implement): - Read the design document decisions -- Verify the plan doesn't contradict them +- Verify the plan doesn't contradict them (because contradictions with prior decisions create rework) - Check that architectural choices from design are reflected in task details If the plan is standalone (workflow-orchestrator): @@ -258,7 +222,7 @@ A plan that contradicts prior-phase decisions is a blocker finding. **Severity**: Blocker **Check**: One plan's transformations don't conflict with another's. -This dimension only applies when multiple plans exist for the same feature. Check: +This dimension only applies when multiple plans exist for the same feature. Check (because data contract conflicts cause silent failures -- code runs but produces wrong output): - Do two plans modify the same files? If so, are the modifications compatible? - Do two plans expect different shapes for shared data structures? - Does Plan B depend on output from Plan A that Plan A doesn't actually produce? @@ -269,7 +233,7 @@ If not applicable (single plan), mark as PASS with note "single plan -- no cross **Severity**: Blocker **Check**: Plan doesn't violate repository rules. -Cross-reference the plan against CLAUDE.md rules loaded in Phase 1. Common violations: +Cross-reference the plan against CLAUDE.md rules loaded in Phase 1 (because repository rules enforce architecture and safety patterns). 
Common violations: | Rule Category | What to Check | |---------------|--------------| @@ -287,7 +251,7 @@ If no CLAUDE.md was found in Phase 1, mark as PASS with note "no CLAUDE.md found **Severity**: Warning **Check**: Plan is completable within approximately 50% of a fresh context window. -Estimate the execution cost: +Estimate the execution cost (because execution always takes more than planned): - Number of tasks x average task complexity - Number of files to read for context - Number of verification commands to run @@ -295,7 +259,7 @@ Estimate the execution cost: If the plan looks like it would consume more than 50% of a context window: - Warning finding with suggestion to split or simplify -- WHY 50%: execution always takes more than planned. Errors happen, context is needed for debugging, and verification adds overhead. A plan that fits in 50% leaves room for reality. +- Rationale: A plan that fits in 50% leaves room for reality. Execution always has errors, context is needed for debugging, and verification adds overhead. **GATE**: All 10 dimensions checked. Findings collected. Proceed to VERDICT. @@ -369,7 +333,7 @@ If verdict is BLOCK, proceed to Phase 4 (Revision Loop). ### Phase 4: REVISION LOOP (only if BLOCK) -Bounded revision loop: fix blocker findings, re-check, max 3 iterations. This loop exists because most blocker findings are fixable in minutes -- a missing task, a wrong file path, an uncovered requirement. But it's bounded because infinite revision is worse than executing with known risks. +Bounded revision loop: fix blocker findings, re-check, max 3 iterations. After 3 good-faith attempts, remaining issues are either genuinely hard (and better discovered during execution with real code) or low-probability (and not worth further planning time). **Iteration tracking**: ``` @@ -416,8 +380,6 @@ Run Phase 2 again on the revised plan. 
Only re-check dimensions that had finding ================================================================ ``` -WHY proceed after 3 iterations: The cost of continued planning is not zero -- it consumes context, delays execution, and each revision may introduce new issues. After 3 good-faith attempts, the remaining issues are either genuinely hard (and better discovered during execution with real code) or low-probability (and not worth further planning time). - ## Error Handling | Error | Cause | Solution | @@ -429,29 +391,6 @@ WHY proceed after 3 iterations: The cost of continued planning is not zero -- it | Revision loop exhausted | 3 iterations couldn't resolve all blockers | Proceed with known risks documented | | Plan is inline text | User pasted plan instead of file path | Parse inline text; warn that revisions won't persist to file | -## Anti-Patterns - -| Anti-Pattern | Why Wrong | Do Instead | -|--------------|-----------|------------| -| Validate tasks in isolation | A task can look perfect in isolation while missing the goal entirely. Goal-backward analysis catches what task-forward review misses. | Always trace from goal backward through requirements to tasks | -| Skip dimension because "it looks fine" | "Looks fine" is not validation. Every skipped dimension is an unchecked risk area. | Run all 10 dimensions; mark explicit PASS for clean dimensions | -| Soft-pass on blockers | "It's probably fine" on a blocker means the executor hits the wall you saw coming. Blockers predict execution failure. | BLOCK on any blocker, no exceptions | -| Revise without re-checking | A revision that fixes one blocker can introduce another. Untested revisions give false confidence. | Always re-check after every revision | -| Unlimited revision iterations | Infinite planning is worse than executing with known risks. Each iteration has diminishing returns and non-zero cost. 
| Hard limit of 3 iterations, then proceed with documented risks | -| Validate without reading codebase | File paths and imports validated from memory are unreliable. The codebase is the source of truth, not the plan's claims. | Use Read/Grep/Glob to verify dimension 3 against actual files | -| Accept plan without goal | Goal-backward analysis requires a goal. Without one, you're checking form (do tasks look good?) not substance (do tasks achieve something?). | Require a stated goal before starting validation | - -## Anti-Rationalization - -| Rationalization Attempt | Why It's Wrong | Required Action | -|------------------------|----------------|-----------------| -| "The plan is detailed enough, no need to check" | Detail is not correctness. A detailed plan with a wrong file path still fails at execution. | Run all 10 dimensions regardless of plan quality appearance | -| "Only 1 blocker, we can work around it" | Workarounds during execution consume more context than fixing the plan. A known blocker is cheaper to fix pre-execution. | BLOCK verdict, enter revision loop | -| "This dimension doesn't apply to this plan" | Mark it PASS with a note explaining why. Skipping silently creates an audit gap. | Explicitly mark non-applicable dimensions as PASS with rationale | -| "The user wants to move fast, skip validation" | Fast execution of a broken plan wastes the entire execution cycle. 2 minutes of validation saves 20 minutes of failed execution. | Run the full check. Speed comes from not re-executing, not from skipping validation | -| "3 iterations wasn't enough, let me try one more" | The bounded loop exists for a reason. Iteration 4 has even lower probability of success and higher context cost. | Document remaining risks, proceed. The bound is the bound. | -| "File probably exists, I don't need to check" | "Probably" is not verified. Dimension 3 requires codebase verification. | Read the filesystem. Verify. 
| - ## References - [ADR-074: Plan Checker Pre-Execution Validation](/adr/074-plan-checker-pre-execution-validation.md) @@ -459,4 +398,3 @@ WHY proceed after 3 iterations: The cost of continued planning is not zero -- it - [Feature Implement Skill](/skills/feature-implement/SKILL.md) -- executes plans after this skill validates - [Workflow Orchestrator](/pipelines/workflow-orchestrator/SKILL.md) -- PLAN phase produces plans this skill can validate - [Verification Before Completion](/skills/verification-before-completion/SKILL.md) -- post-execution counterpart (validates results, not plans) -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) diff --git a/skills/plan-manager/SKILL.md b/skills/plan-manager/SKILL.md index 4f4deeb3..018685d0 100644 --- a/skills/plan-manager/SKILL.md +++ b/skills/plan-manager/SKILL.md @@ -24,57 +24,17 @@ routing: # Plan Manager Skill -## Operator Context - -This skill operates as an operator for plan lifecycle management, configuring Claude's behavior for consistent, deterministic plan operations that prevent stale plan execution and context drift. It implements the **CLI-First** pattern -- all lifecycle operations go through `scripts/plan-manager.py`, never manual file edits. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution -- **Over-Engineering Prevention**: Only perform requested plan operations. 
No speculative plan creation, no "while I'm here" cleanup of other plans -- **Staleness Check Required**: ALWAYS run `list --stale` before executing any plan tasks -- **Re-read Before Decisions**: Re-read plan via `show` before any decision that affects task direction -- **Show Before Modify**: ALWAYS run `show PLAN_NAME` before completing or abandoning a plan -- **Complete Output Display**: Show full script output to user; never summarize as "plan updated" -- **CLI-Only Lifecycle**: Use `check`, `complete`, `abandon` commands; never edit plan files directly for lifecycle changes -- **Error Logging to Plan**: When errors occur during plan work, log them to the plan's "Errors Encountered" section - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show command output rather than describing it -- **Warning Escalation**: Surface all staleness warnings prominently to user before proceeding -- **Confirmation Required**: Ask for explicit confirmation before lifecycle changes (complete/abandon) -- **Task Context Display**: When checking off tasks, show which task was marked and remaining count -- **Audit Trail**: Always provide `--reason` with abandonment operations -- **Re-validation After Changes**: Run `show` after any `check`, `complete`, or `create` to confirm state - -### Optional Behaviors (OFF unless enabled) -- **Batch Operations**: Process multiple plans in sequence (only if explicitly requested) -- **Auto-create Plans**: Create plans only when user explicitly requests plan creation -- **Skip Confirmation**: Skip confirmation dialogs only when user requests fast mode - -## What This Skill CAN Do -- List all active plans with staleness warnings -- Show plan details, remaining tasks, and status -- Check off completed tasks within plans -- Move plans to completed or abandoned states -- Create new plans with proper structure -- Audit plan directory for structural issues - -## What This Skill CANNOT Do 
-- Execute plan tasks (only tracks completion status) -- Modify plan content directly (only lifecycle operations via CLI) -- Auto-complete stale plans without user approval -- Delete plans permanently (only move to abandoned) -- Skip the staleness check at session start +## Overview ---- +This skill manages the complete lifecycle of task plans using a deterministic CLI-first pattern: all operations flow through `scripts/plan-manager.py`, never through manual file edits. This prevents stale plan execution, context drift, and loss of audit trails. Use this skill when the user needs to list, create, show, track, complete, or abandon plans — not for executing the tasks within plans themselves. ## Instructions -### Phase 1: ASSESS +### Phase 1: ASSESS Plan State -**Goal**: Establish current plan state before any operations. +**Goal**: Establish current plan state and surface any stale plans before proceeding with operations. -**Step 1: Check for stale plans** +**Always start with staleness check** (because stale plans waste effort on obsolete work): ```bash python3 ~/.claude/scripts/plan-manager.py list --stale @@ -82,19 +42,19 @@ python3 ~/.claude/scripts/plan-manager.py list --stale Surface any staleness warnings to user immediately. If stale plans exist, address them before proceeding with other operations. -**Step 2: List active plans** +**Then list active plans** (because you need context before making decisions): ```bash python3 ~/.claude/scripts/plan-manager.py list --human ``` -Show complete output to user. Never summarize. +Show complete output to user. Never summarize or abbreviate — the raw output contains critical details about plan age, task counts, and status that the user needs to see. **Gate**: Staleness check complete. User informed of plan state. Proceed only when gate passes. -### Phase 2: OPERATE +### Phase 2: OPERATE - Execute the Requested Plan Action -**Goal**: Execute the requested plan operation with full validation. 
+**Goal**: Execute the requested plan operation with full context and validation. **For showing a plan:** @@ -105,19 +65,21 @@ python3 ~/.claude/scripts/plan-manager.py show PLAN_NAME --tasks **For checking off a task:** +Re-read the plan first before marking tasks (because this keeps the plan's goals in your recency window and prevents context drift): + ```bash -# Re-read plan first (keeps goals in recency window) python3 ~/.claude/scripts/plan-manager.py show PLAN_NAME -# Mark task complete python3 ~/.claude/scripts/plan-manager.py check PLAN_NAME TASK_NUMBER -# Confirm updated state python3 ~/.claude/scripts/plan-manager.py show PLAN_NAME --tasks ``` +Then show updated state to user, including which task was marked and remaining count. + **For completing a plan:** +Show full status first before completing (because you must verify all checked tasks and understand what you're completing): + ```bash -# Show full status first (MANDATORY) python3 ~/.claude/scripts/plan-manager.py show PLAN_NAME # Ask user: "Complete this plan? (y/n)" # On confirmation: @@ -126,30 +88,32 @@ python3 ~/.claude/scripts/plan-manager.py complete PLAN_NAME **For abandoning a plan:** +Show full status first (because context prevents premature abandonment), then request explicit reason (because the audit trail requires documented rationale): + ```bash -# Show full status first (MANDATORY) python3 ~/.claude/scripts/plan-manager.py show PLAN_NAME -# Ask user for reason +# Ask user: "Why abandon this plan?" 
python3 ~/.claude/scripts/plan-manager.py abandon PLAN_NAME --reason "User requested: [reason]" ``` **For creating a plan:** +Confirm the plan name (lowercase-kebab-case) and descriptive title with user first (because speculative plan creation creates clutter; only create when explicitly requested): + ```bash -# Confirm name (lowercase-kebab-case) and title with user first python3 ~/.claude/scripts/plan-manager.py create PLAN_NAME --title "Descriptive title" python3 ~/.claude/scripts/plan-manager.py show PLAN_NAME ``` -**Gate**: Operation executed via CLI. Output shown to user in full. State confirmed with follow-up `show`. Proceed only when gate passes. +**Gate**: Operation executed via CLI, never via manual file edit. Full output shown to user. State confirmed with follow-up `show`. Proceed only when gate passes. -### Phase 3: VERIFY +### Phase 3: VERIFY - Confirm Successful State Change **Goal**: Confirm operation succeeded and state is consistent. -**Step 1**: Run `show` on the affected plan to confirm changes took effect. +**Step 1**: Run `show` on the affected plan to confirm changes took effect (because CLI operations must be validated). -**Step 2**: If lifecycle change (complete/abandon), verify the plan moved: +**Step 2**: If lifecycle change (complete/abandon), verify the plan moved to the correct directory: ```bash ls plan/completed/ # or plan/abandoned/ @@ -157,108 +121,42 @@ ls plan/completed/ # or plan/abandoned/ **Step 3**: Report final state to user with full output. -**Gate**: Plan state verified. User informed. Operation complete. 
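The directory check in Step 2 can also be scripted. A minimal sketch, assuming plans are stored as `<name>.md` files under `plan/active/`, `plan/completed/`, and `plan/abandoned/` (a layout inferred from the paths mentioned in this skill); `plan_state` and `verify_moved` are hypothetical helpers and not part of `plan-manager.py`.

```python
import tempfile
from pathlib import Path

# Assumed lifecycle directories under the plan root.
STATES = ("active", "completed", "abandoned")


def plan_state(plan_root, plan_name):
    """Return the lifecycle directories that currently contain the plan file."""
    root = Path(plan_root)
    return [state for state in STATES
            if (root / state / f"{plan_name}.md").is_file()]


def verify_moved(plan_root, plan_name, expected):
    """True only if the plan exists in exactly the expected directory."""
    return plan_state(plan_root, plan_name) == [expected]


# Demonstration against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    completed = Path(tmp) / "completed"
    completed.mkdir()
    (completed / "add-auth.md").write_text("# Plan: add-auth\n")
    moved = verify_moved(tmp, "add-auth", "completed")
    still_active = verify_moved(tmp, "add-auth", "active")

print(moved, still_active)
```

Requiring the plan to appear in exactly one directory also catches the case where a lifecycle change copied the file instead of moving it.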
- ---- - -## Command Reference - -| User Intent | Command | Example | -|-------------|---------|---------| -| "what's on our plan" | `list` | `python3 ~/.claude/scripts/plan-manager.py list --human` | -| "check for stale plans" | `list --stale` | `python3 ~/.claude/scripts/plan-manager.py list --stale` | -| "show me plan X" | `show PLAN_NAME` | `python3 ~/.claude/scripts/plan-manager.py show add-auth` | -| "what tasks remain" | `show --tasks` | `python3 ~/.claude/scripts/plan-manager.py show add-auth --tasks` | -| "mark task done" | `check PLAN_NAME N` | `python3 ~/.claude/scripts/plan-manager.py check add-auth 3` | -| "finish this plan" | `complete PLAN_NAME` | `python3 ~/.claude/scripts/plan-manager.py complete add-auth` | -| "stop this plan" | `abandon PLAN_NAME` | `python3 ~/.claude/scripts/plan-manager.py abandon add-auth --reason "..."` | -| "create a plan for X" | `create PLAN_NAME` | `python3 ~/.claude/scripts/plan-manager.py create add-auth --title "..."` | -| "audit plans" | `audit` | `python3 ~/.claude/scripts/plan-manager.py audit` | - -### Exit Codes - -| Code | Meaning | Action | -|------|---------|--------| -| 0 | Success | Continue normally | -| 1 | Error | Stop and report error message to user | -| 2 | Warning | Continue but inform user of warnings | - ---- +When errors occur during plan work, log them to the plan's "Errors Encountered" section (because this maintains a record for future sessions and prevents knowledge loss). -## Error Handling +**Gate**: Plan state verified. User informed. Operation complete. ### Error: "Plan not found" -Cause: Plan name misspelled or plan already moved to completed/abandoned -Solution: +**Cause**: Plan name misspelled or plan already moved to completed/abandoned + +**Solution**: 1. Run `list` to see available active plans 2. Check `plan/completed/` and `plan/abandoned/` directories 3. 
Verify spelling matches exactly (kebab-case) ### Error: "Task number out of range" -Cause: Task number does not exist in the plan -Solution: +**Cause**: Task number does not exist in the plan + +**Solution**: 1. Run `show PLAN_NAME --tasks` to see valid task numbers 2. Task numbers are 1-indexed; verify the correct number 3. Re-read plan to confirm task list hasn't changed ### Error: "Cannot complete: unchecked tasks" -Cause: Attempting to complete a plan with remaining tasks -Solution: +**Cause**: Attempting to complete a plan with remaining tasks + +**Solution**: 1. Run `show PLAN_NAME --tasks` to see remaining tasks 2. Either check remaining tasks first or ask user if they want to force-complete 3. Document why unchecked tasks are acceptable if force-completing ### Error: "Reason required for abandonment" -Cause: Missing `--reason` flag on abandon command -Solution: Always provide `--reason "..."` when abandoning -- this is mandatory for the audit trail - ---- - -## Anti-Patterns +**Cause**: Missing `--reason` flag on abandon command -### Anti-Pattern 1: Executing Without Staleness Check -**What it looks like**: User says "let's work on the auth plan" and assistant immediately starts tasks -**Why wrong**: Plan may be 30 days old with outdated requirements. Stale plans waste effort on obsolete work. -**Do instead**: Run `list --stale` first. Surface warnings. Then proceed. - -### Anti-Pattern 2: Completing Without Showing -**What it looks like**: Running `complete auth-plan` without first running `show auth-plan` -**Why wrong**: May complete plan with unchecked tasks, losing track of incomplete work. -**Do instead**: Always `show` before `complete` or `abandon`. Review output. Then proceed. - -### Anti-Pattern 3: Summarizing Script Output -**What it looks like**: "You have 3 active plans and 1 is stale" instead of showing full CLI output -**Why wrong**: Hides which plan is stale, how stale, what tasks remain. User loses critical detail. 
-**Do instead**: Show complete script output. Let the user read the details. - -### Anti-Pattern 4: Manual Plan File Editing -**What it looks like**: Using Edit tool to modify `plan/active/auth-plan.md` directly for lifecycle changes -**Why wrong**: Bypasses CLI validation. May corrupt plan format. Loses audit trail. -**Do instead**: Use `check`, `complete`, `abandon` commands. Only edit directly when adding new tasks to plan content. - -### Anti-Pattern 5: Creating Plans Without User Confirmation -**What it looks like**: User says "we should add authentication" and assistant immediately creates a plan -**Why wrong**: User may want to discuss scope first. Creates clutter with unwanted plans. -**Do instead**: Ask "Would you like me to create a plan for this? What should it be called and what tasks should it include?" +**Solution**: Always provide `--reason "..."` when abandoning — this is mandatory for the audit trail --- ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Plan looks current enough" | Staleness dates matter, not appearance | Run `list --stale` first | -| "Tasks seem mostly done" | Check marks are source of truth | Run `show --tasks` first | -| "I'll mark it complete later" | Lifecycle drift causes stale plans | Complete or abandon promptly | -| "The script is slow, I'll check manually" | Manual checks miss edge cases | Always use the CLI | -| "Plan is obvious, no need to show first" | Context prevents mistakes | Always `show` before `complete`/`abandon` | - -### Reference Files -- Plan format documentation: `plan/README.md` -- CLI implementation: `scripts/plan-manager.py` +- **Plan 
format documentation**: `plan/README.md` +- **CLI implementation**: `scripts/plan-manager.py` diff --git a/skills/planning-with-files/SKILL.md b/skills/planning-with-files/SKILL.md index 8f8140a8..f88409cb 100644 --- a/skills/planning-with-files/SKILL.md +++ b/skills/planning-with-files/SKILL.md @@ -26,43 +26,17 @@ routing: # Planning with Files Skill -## Operator Context - -This skill operates as an operator for file-based planning workflows, configuring Claude's behavior for persistent, structured task execution. It implements the **External Memory** architectural pattern — store on disk, re-read to maintain focus, update after each action — with **Context Engineering** principles from Manus-style agent design. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before starting any planned work -- **Over-Engineering Prevention**: Plan only the phases needed. No speculative phases, no "future-proofing" empty sections -- **Plan Before Execute**: NEVER start complex work without creating `task_plan.md` first -- **Re-Read Before Decide**: Read plan file before every major decision to combat context drift -- **Update After Act**: Mark phases complete and update status immediately after each phase -- **Store, Don't Stuff**: Large outputs go to files, not context. 
Keep only file paths in working memory - -### Default Behaviors (ON unless disabled) -- **3-File Pattern**: Create `task_plan.md`, `notes.md`, and deliverable file for non-trivial tasks -- **Error Logging**: Record all errors in the "Errors Encountered" section of plan file -- **Phase Checkboxes**: Use `- [ ]` / `- [x]` for phase tracking -- **Status Line**: Maintain a "Currently in Phase X" status at bottom of plan -- **Findings Synthesis**: Consolidate research notes into categorized findings before generating deliverables -- **Cleanup at Completion**: Remove temporary scratch files, keep plan and deliverable - -### Optional Behaviors (OFF unless enabled) -- **Timestamp Entries**: Add `[YYYY-MM-DD]` timestamps to errors and decisions -- **Git Integration**: Create commits at phase boundaries -- **Parallel Research**: Launch multiple search queries simultaneously in research phase - -## What This Skill CAN Do -- Structure complex tasks into trackable phases with file-based persistence -- Maintain focus across long sessions by re-reading goals into attention window -- Store and synthesize research findings across multiple sources -- Recover from context drift by grounding decisions in written plans -- Preserve failure traces for error recovery and institutional knowledge - -## What This Skill CANNOT Do -- Replace domain-specific skills (use systematic-debugging for bugs, systematic-refactoring for refactoring) -- Execute without a written plan for complex tasks -- Store sensitive data (credentials, tokens) in plan files -- Skip the re-read step before major decisions +## Overview + +This skill uses persistent markdown files as external memory to execute complex, multi-phase tasks without context drift. Files serve as the single source of truth for goals, progress, and decisions. Re-read files before major decisions to ground work in written commitments rather than fallible working memory. + +The workflow consists of four phases: +1. 
**CREATE PLAN** — Write goals and phases before executing +2. **RESEARCH AND GATHER** — Collect information, store findings, update plan +3. **EXECUTE** — Build deliverable using gathered information +4. **VERIFY AND DELIVER** — Confirm completeness, clean up temporary files + +This skill is mandatory for tasks with 3+ phases, research requirements, or risk of context drift after many tool calls. --- @@ -80,6 +54,8 @@ Determine if planning is needed: - Task spans multiple files or systems → Plan needed - Simple lookup or single edit → Skip planning +**Why this matters**: Creating a plan takes 30 seconds but saves hours in rework. Plans prevent mid-task goal drift by anchoring decisions to written commitment. Skip this step only for single-file edits or lookups answerable in one response. + **Step 2: Create `task_plan.md`** ```markdown @@ -115,11 +91,13 @@ Determine if planning is needed: **Step 1: Re-read plan** -Read `task_plan.md` to refresh goals in attention window. This is mandatory, not optional. +Open `task_plan.md` and read it completely. This is mandatory, not optional. + +**Why this matters**: After ~50 tool calls, memory degrades. Re-reading restores focus and prevents drift. Claims like "I remember the goal, no need to re-read" are the primary cause of failed complex tasks. **Step 2: Gather information** -Search, read, explore. Store findings in `notes.md`: +Search, read, explore. Store all findings in `notes.md`: ```markdown # Notes: [Topic] @@ -135,6 +113,8 @@ Search, read, explore. Store findings in `notes.md`: - [Finding with context] ``` +**Why separate files**: Context window is ephemeral. Files are persistent. Writing findings to `notes.md` immediately ensures they survive context compression. Reference the file by section header when needed rather than keeping all content in working memory. + **Step 3: Update plan** Mark Phase 2 complete. Log any decisions made. Update status line. @@ -147,7 +127,9 @@ Mark Phase 2 complete. 
Log any decisions made. Update status line. **Step 1: Re-read plan and notes** -Read `task_plan.md` then `notes.md`. Both reads are mandatory before generating output. +Read `task_plan.md` first, then `notes.md`. Both reads are mandatory before generating output. + +**Why this matters**: Phase transitions are high-risk points for context drift. Two reads ground execution in both original goals and current findings, preventing divergence from the stated intent. **Step 2: Create deliverable** @@ -169,11 +151,14 @@ Compare deliverable against original goal and key questions. Every question shou **Step 2: Verify completeness** +Check all verification criteria: - All phases marked `[x]` - All key questions answered - Deliverable matches stated goal - Errors section documents any issues encountered +**Why this matters**: "Done" is often an assumption, not a fact. This checklist is a defense-in-depth verification gate. Mark the task complete only when all criteria pass, not when work "feels done" or "should be done." + **Step 3: Deliver and clean up** Present the deliverable. Remove temporary scratch files. Keep `task_plan.md` and deliverable as artifacts. @@ -186,86 +171,86 @@ Present the deliverable. Remove temporary scratch files. Keep `task_plan.md` and ### Example 1: Research Task User says: "Research morning exercise benefits and write a summary" -Actions: -1. Create `task_plan.md` with goal and 4 phases (CREATE PLAN) -2. Search sources, store findings in `notes.md` (RESEARCH) -3. Re-read notes, write `morning_exercise_summary.md` (EXECUTE) -4. 
Verify summary covers all key questions, deliver (VERIFY) -Result: Structured summary grounded in documented research + +**Phase 1**: Create `task_plan.md` with goal and 4 phases +- Goal: Produce a summary of morning exercise benefits backed by research + +**Phase 2**: Search sources, store findings in `notes.md` +- Create notes.md with "Sources" section (studies, articles) +- Create "Synthesized Findings" with categories: mental health, physical health, productivity + +**Phase 3**: Re-read notes, write `morning_exercise_summary.md` +- Reference findings for accuracy + +**Phase 4**: Verify summary covers all key questions, deliver +- Result: Structured summary grounded in documented research ### Example 2: Multi-File Refactoring Plan User says: "Plan the migration from REST to GraphQL" -Actions: -1. Create `task_plan.md` with migration phases (CREATE PLAN) -2. Inventory endpoints, dependencies, store in `notes.md` (RESEARCH) -3. Write `migration_plan.md` with ordered steps (EXECUTE) -4. 
Verify all endpoints covered, deliver plan (VERIFY) -Result: Actionable migration plan with nothing missed + +**Phase 1**: Create `task_plan.md` with migration phases +- Phase 1: Inventory endpoints and dependencies +- Phase 2: Design GraphQL schema +- Phase 3: Implement resolvers +- Phase 4: Migrate clients +- Phase 5: Decommission REST endpoints + +**Phase 2**: Inventory endpoints, dependencies, store in `notes.md` +- Document endpoint mapping to GraphQL queries/mutations +- Identify clients and their endpoint usage + +**Phase 3**: Write `migration_plan.md` with ordered steps +- Step 1: Build GraphQL service alongside REST +- Step 2: Migrate internal clients first + +**Phase 4**: Verify all endpoints covered, deliver plan +- Result: Actionable migration plan with nothing missed --- ## Error Handling ### Error: "Context Drift — Forgot Original Goal" -Cause: Too many tool calls without re-reading the plan -Solution: +**Cause**: Too many tool calls without re-reading the plan + +**Solution**: 1. Immediately read `task_plan.md` 2. Compare current work against stated goal 3. Correct course if diverged 4. Increase re-read frequency for remainder of task ### Error: "Plan Becomes Stale or Inaccurate" -Cause: New information invalidates original phases or decisions -Solution: +**Cause**: New information invalidates original phases or decisions + +**Solution**: 1. Update plan with new information and revised phases 2. Log the change in Decisions Made with rationale 3. Continue from updated plan ### Error: "Notes File Too Large for Context" -Cause: Research phase produced more content than fits in attention window -Solution: +**Cause**: Research phase produced more content than fits in attention window + +**Solution**: 1. Add a "Summary" section at top of `notes.md` with key takeaways 2. Reference specific sections by heading when needed 3. 
Read only relevant sections, not entire file ---- - -## Anti-Patterns - -### Anti-Pattern 1: Executing Without a Plan -**What it looks like**: Jumping straight into code or content without creating `task_plan.md` -**Why wrong**: No way to track progress, detect drift, or verify completeness -**Do instead**: Create plan first. Always. Even if it takes 30 seconds. - -### Anti-Pattern 2: Writing Plan but Never Re-Reading -**What it looks like**: Creating `task_plan.md` then never opening it again -**Why wrong**: Plan exists to combat context drift. Unread plans provide zero value. -**Do instead**: Re-read before every major decision and phase transition. - -### Anti-Pattern 3: Stuffing Context Instead of Storing -**What it looks like**: Keeping all research findings in working memory instead of `notes.md` -**Why wrong**: Context window fills up, earlier information gets pushed out -**Do instead**: Write to files immediately. Read back only what you need. +### Error: "Task Becomes Unstuck Midway" +**Cause**: Required information is missing or deliverable is off-track -### Anti-Pattern 4: Over-Planning Simple Tasks -**What it looks like**: Creating 3 files and 8 phases for a single-file edit -**Why wrong**: Planning overhead exceeds task complexity. Wastes time. -**Do instead**: Use complexity check in Phase 1. Skip planning for simple tasks. +**Solution**: +1. Stop forward execution +2. Re-read plan to clarify original goal +3. Update plan with new discovery +4. Decide: continue with modified goal, or gather more information +5. 
Log decision and rationale in plan --- ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) - Phase transition enforcement +**Standard File Names**: Plans use `task_plan.md`, research notes use `notes.md`, deliverables use domain-specific names (e.g., `migration_plan.md`, `research_summary.md`) -### Domain-Specific Anti-Rationalization +**Required Elements**: Every plan must contain Goal (one sentence), Phases (with [ ] checkboxes), Key Questions, and Status line. Without these, the plan provides zero value. -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I remember the goal, no need to re-read" | Memory degrades after ~50 tool calls | Read `task_plan.md` before deciding | -| "Quick task, no plan needed" | Quick tasks become multi-step tasks | Assess complexity first | -| "Notes are in my context already" | Context is ephemeral, files are persistent | Write to `notes.md` | -| "Plan is mostly done, close enough" | Mostly ≠ Complete | Mark all phases [x] or explain why not | +**Phase Gate Enforcement**: Never advance to next phase until current phase gate passes. Gates are designed to catch problems early. "Mostly done" phases cause downstream errors. Enforce gates strictly. diff --git a/skills/plans/SKILL.md b/skills/plans/SKILL.md index 2c9399ec..fa71db1d 100644 --- a/skills/plans/SKILL.md +++ b/skills/plans/SKILL.md @@ -22,75 +22,40 @@ routing: # Plans - Plan Lifecycle Management -## Operator Context - -This skill operates as an operator for deterministic plan management, configuring Claude's behavior for structured task tracking through `scripts/plan-manager.py`. 
All plan mutations flow through the script; the LLM orchestrates but never edits plan files directly. - -### Hardcoded Behaviors (Always Apply) -- **Script-Only Mutations**: ALL plan changes go through `plan-manager.py` -- never edit plan files by hand -- **Show Before Modify**: ALWAYS run `show` on a plan before any mutation (check, complete, abandon) -- **Stale Check on Entry**: ALWAYS run `list --stale` before executing any plan tasks -- **Full Output**: NEVER summarize or truncate script output -- show it completely to the user -- **User Confirmation**: NEVER complete or abandon a plan without explicit user confirmation - -### Default Behaviors (ON unless disabled) -- **Human-Readable Output**: Pass `--human` flag for all display commands -- **Stale Plan Warnings**: Warn user about plans older than 7 days before proceeding -- **Task-Order Enforcement**: Work tasks in listed order unless user specifies otherwise -- **Status Logging**: Report current plan state after every mutation - -### Optional Behaviors (OFF unless enabled) -- **Audit on Session Start**: Run `audit` across all active plans at session open -- **Auto-Complete Detection**: Suggest completing plans when all tasks are checked -- **Cross-Plan Dependencies**: Track dependencies between related plans - -## What This Skill CAN Do -- Create new plans with structured phases and tasks -- List active plans, filter by staleness, show details -- Mark individual tasks as complete within a plan -- Archive completed plans to `completed/` directory -- Move abandoned plans to `abandoned/` with documented reason -- Audit all active plans for structural issues - -## What This Skill CANNOT Do -- Execute plan tasks (it tracks them; other skills execute) -- Edit plan files directly (all mutations go through the script) -- Skip the stale-check gate before working on plans -- Complete or abandon plans without user confirmation -- Replace Claude Code's built-in `/plan` command (this is `/plans`) +## Overview + +This 
skill manages the full lifecycle of task plans through deterministic commands in `scripts/plan-manager.py`. Plans track multi-phase work with task-level granularity, enabling progress tracking, stale-plan detection, and structured completion. The skill routes all mutations through the script—never edit plan files directly—and enforces gates at key decision points. + +**Scope**: Creating, listing, inspecting, checking off tasks, completing, and abandoning plans. Does NOT execute the tasks themselves (other skills do that) or replace Claude Code's built-in `/plan` command. --- ## Instructions -### Phase 1: CHECK +### Phase 1: CHECK (Stale Plans Gate) -**Goal**: Understand current plan landscape before any action. +Before any action, discover what plans exist and surface any that are stale (>7 days old). Always use `--human` flag for readable output. ```bash python3 ~/.claude/scripts/plan-manager.py list --human python3 ~/.claude/scripts/plan-manager.py list --stale --human ``` -If stale plans exist (>7 days), warn user before proceeding. +**Constraint**: If stale plans exist, warn the user and ask whether to proceed, abandon, or update the timeline. Never skip this gate. -**Gate**: Plan landscape is known. Stale plans are surfaced. Proceed only when gate passes. +### Phase 2: INSPECT (Show Before Modify) -### Phase 2: INSPECT - -**Goal**: Understand the target plan's current state before mutation. +Before any mutation (check, complete, abandon), display the plan's current state to the user. ```bash python3 ~/.claude/scripts/plan-manager.py show PLAN_NAME --tasks --human ``` -Review remaining tasks, completed tasks, and overall progress. - -**Gate**: Plan state is displayed to user. Proceed only when gate passes. +This surfaces task descriptions, remaining work, completion status, and staleness info. **Constraint**: Always run `show` before mutation—marking the wrong task or completing with remaining work is a critical error. 
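The stale-plan rule behind the Phase 1 gate (plans more than 7 days old) can be pictured as a small check. This is only an illustrative sketch, not the logic inside `plan-manager.py`, which is not shown here: `is_stale` and `STALE_AFTER` are hypothetical names, and the sketch assumes staleness is measured from the plan's last modification time.

```python
from datetime import datetime, timedelta

# Threshold taken from the ">7 days" rule in the skill text.
STALE_AFTER = timedelta(days=7)


def is_stale(last_modified: datetime, now: datetime) -> bool:
    """Return True when strictly more than 7 days have passed since the last update.

    Hypothetical helper: the real script may measure staleness differently.
    """
    return now - last_modified > STALE_AFTER
```

Under this reading, a plan updated exactly 7 days ago is not yet stale; the gate fires only once the threshold is exceeded.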
-### Phase 3: MUTATE +### Phase 3: MUTATE (Script-Only) -**Goal**: Apply exactly the requested change. +Apply the exact requested action via the script: | Action | Command | |--------|---------| @@ -99,80 +64,83 @@ Review remaining tasks, completed tasks, and overall progress. | Complete | `python3 ~/.claude/scripts/plan-manager.py complete NAME` | | Abandon | `python3 ~/.claude/scripts/plan-manager.py abandon NAME --reason "reason"` | -For **complete** and **abandon**: require explicit user confirmation before executing. +**Constraint**: NEVER edit plan files with Read/Write/Edit tools. All mutations go through the script to maintain audit trail and validation. **Constraint**: For `complete` and `abandon`, require explicit user confirmation before executing—these are high-consequence actions. -**Gate**: Mutation succeeded (exit code 0). Proceed only when gate passes. +**Gate**: Mutation succeeds (exit code 0) or fails cleanly with a clear error message. -### Phase 4: CONFIRM +### Phase 4: CONFIRM (Post-Mutation Verification) -**Goal**: Verify mutation applied correctly. +Display the updated plan state to verify the mutation worked as expected. ```bash python3 ~/.claude/scripts/plan-manager.py show PLAN_NAME --human ``` -Display updated state to user. If exit code != 0, report error and stop. - -**Gate**: Post-mutation state matches expectation. User sees final result. +**Constraint**: NEVER summarize or truncate script output—show the complete output to the user so they see task lists, completion status, and any warnings. If exit code != 0, report the error and stop. --- ## Error Handling ### Error: "Plan Not Found" -Cause: Typo in plan name or plan already archived -Solution: -1. Run `list --human` to show all active plans -2. Check `completed/` and `abandoned/` directories for archived plans -3. Confirm correct name with user before retrying - -### Error: "Stale Plan Detected" -Cause: Plan has not been updated in >7 days -Solution: -1. 
Display the stale plan's current state to user -2. Ask: continue working, abandon, or update timeline? -3. Do NOT execute tasks from stale plans without explicit confirmation +**Cause**: Typo in plan name or plan was already archived to `completed/` or `abandoned/`. + +**Solution**: +1. Run `python3 ~/.claude/scripts/plan-manager.py list --human` to list all active plans +2. Check the `completed/` and `abandoned/` directories for archived plans +3. Confirm the correct plan name with the user before retrying + +--- + +### Error: "Stale Plan Detected" (>7 Days Without Update) +**Cause**: Plan hasn't been modified in over 7 days and may no longer reflect current work. + +**Solution**: +1. Display the stale plan's current state to the user using `show` +2. Ask explicitly: Continue working? Abandon? Update the plan timeline? +3. **Constraint**: Do NOT execute tasks from stale plans without explicit user confirmation + +--- ### Error: "Script Exit Code Non-Zero" -Cause: Invalid arguments, missing plan, or filesystem issue -Solution: -1. Show the full error output to user (never summarize) -2. Check script arguments match expected format +**Cause**: Invalid arguments, missing plan, filesystem permissions, or missing script file. + +**Solution**: +1. Show the full error output to the user (never summarize) +2. Check that script arguments match expected format 3. Verify `scripts/plan-manager.py` exists and is executable +4. 
If the error persists: ask the user to diagnose the environment issue --- -## Anti-Patterns +### Common Anti-Patterns (Constraint Violations) -### Anti-Pattern 1: Editing Plan Files Directly -**What it looks like**: Using Write/Edit to modify a plan markdown file -**Why wrong**: Bypasses script validation, breaks audit trail, may corrupt format -**Do instead**: All mutations through `plan-manager.py` +**Anti-Pattern: Editing Plan Files Directly** +- **Wrong**: Using Read/Write/Edit to modify a plan markdown file +- **Why**: Bypasses script validation, breaks audit trail, corrupts format +- **Correct**: All mutations through `plan-manager.py` -### Anti-Pattern 2: Skipping Show Before Modify -**What it looks like**: Running `check` or `complete` without first running `show` -**Why wrong**: May mark wrong task, complete plan with remaining work, or act on stale state -**Do instead**: Always Phase 2 (INSPECT) before Phase 3 (MUTATE) +**Anti-Pattern: Skipping Show Before Mutation** +- **Wrong**: Running `check` or `complete` without first running `show` +- **Why**: Risk marking the wrong task, completing with remaining work, or acting on stale data +- **Correct**: Always Phase 2 (INSPECT) before Phase 3 (MUTATE) -### Anti-Pattern 3: Summarizing Script Output -**What it looks like**: "The plan has 3 remaining tasks" instead of showing full output -**Why wrong**: User loses details, task descriptions, staleness info, and audit data -**Do instead**: Display complete script output, let user read it +**Anti-Pattern: Summarizing Script Output** +- **Wrong**: "The plan has 3 remaining tasks" instead of showing full output +- **Why**: User loses task descriptions, staleness info, completion details, and audit trail +- **Correct**: Display complete script output; let user read it -### Anti-Pattern 4: Auto-Completing Without Confirmation -**What it looks like**: Detecting all tasks done and running `complete` automatically -**Why wrong**: User may want to add tasks, review work, or keep plan active 
-**Do instead**: Suggest completion, wait for explicit user confirmation +**Anti-Pattern: Auto-Completing Without Confirmation** +- **Wrong**: Detecting all tasks done and running `complete` automatically +- **Why**: User may want to add tasks, review work, or keep the plan active for tracking +- **Correct**: Suggest completion after all tasks checked; wait for explicit user confirmation --- ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - ### Script Reference -- `scripts/plan-manager.py` - All plan CRUD operations -- Exit codes: 0 = success, 1 = error, 2 = warnings (e.g., stale plans) -- Default output is JSON; add `--human` for readable format +- **Location**: `scripts/plan-manager.py` +- **Exit codes**: 0 = success, 1 = error, 2 = warnings (e.g., stale plans detected) +- **Output format**: JSON by default; add `--human` flag for readable format +- **Mutations**: All plan changes must go through this script; direct file editing is forbidden diff --git a/skills/plant-seed/SKILL.md b/skills/plant-seed/SKILL.md index dd74ef71..6cd35c5a 100644 --- a/skills/plant-seed/SKILL.md +++ b/skills/plant-seed/SKILL.md @@ -33,40 +33,11 @@ routing: # Plant Seed Skill -## Purpose +## Overview -Capture forward-looking ideas with trigger conditions so they resurface at the right time. Seeds carry WHY (rationale) and WHEN (trigger), making them far more valuable than bare TODO comments. Seeds are stored locally in `.seeds/` and automatically surfaced during feature-design when their trigger conditions match. +Capture forward-looking ideas with trigger conditions so they resurface at the right time. Seeds carry WHY (rationale) and WHEN (trigger), making them far more valuable than bare TODO comments. 
-## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Local Storage Only**: Seeds go in `.seeds/` which is gitignored -- seeds are personal, not shared via version control. WHY: Different developers have different ideas and different trigger conditions; committing seeds pollutes the shared repo with personal notes. -- **No Immediate Work**: Seeds are for deferred ideas. If the user describes something that should happen now, suggest creating a task or issue instead. WHY: Planting a seed for current-session work means it never gets surfaced -- it just gets forgotten in a different way. -- **Breadcrumb Discovery**: Always grep for related files at capture time. WHY: Breadcrumbs preserve code references from capture time. Even if the codebase evolves, these paths help the user re-orient when the seed surfaces months later. -- **Unique IDs**: Seed IDs follow the format `seed-YYYY-MM-DD-slug` to ensure uniqueness and chronological sorting. 
- -### Default Behaviors (ON unless disabled) -- **Interactive Capture**: Ask clarifying questions about trigger condition and scope if not provided -- **Breadcrumb Grep**: Search the codebase for files related to the seed's topic -- **Confirmation Before Write**: Show the complete seed to the user before saving - -### Optional Behaviors (OFF unless enabled) -- **Batch Planting**: Capture multiple seeds in one session (enable with "plant several seeds") -- **Seed Review**: List and manage existing seeds (enable with "review seeds" or "list seeds") - -## What This Skill CAN Do -- Capture a deferred idea with structured metadata (trigger, scope, rationale, action, breadcrumbs) -- Search the codebase for related files to attach as breadcrumbs -- Store seeds in `.seeds/index.json` with consistent structure -- List existing seeds and their status -- Archive seeds that have been harvested or dismissed - -## What This Skill CANNOT Do -- Execute the deferred work (that happens during feature-design when the seed is surfaced) -- Automatically detect when a trigger condition is met (feature-design does fuzzy matching) -- Share seeds across machines or developers (seeds are local by design) -- Create TODOs, issues, or tasks (use appropriate tools for immediate work) +Seeds are stored locally in `.seeds/` (gitignored) and automatically surfaced during feature-design when their trigger conditions match. This workflow is designed for deferred ideas only — if the user describes work that should happen now, suggest creating a task or issue instead. --- @@ -76,6 +47,8 @@ Capture forward-looking ideas with trigger conditions so they resurface at the r **Goal**: Gather the idea, trigger condition, scope, and rationale from the user. +**Key Constraint**: This skill captures deferred ideas only. If the user describes something that should happen in the current session, suggest creating a task or issue instead. 
Planting a seed for immediate work means it never gets surfaced — it just gets forgotten in a different way. + **Step 1: Understand the idea** Extract from the user's description: @@ -105,11 +78,15 @@ Format: `seed-YYYY-MM-DD-slug` Example: `seed-2026-03-22-cache-layer` +Seeds use this consistent ID format to ensure uniqueness and chronological sorting. If two seeds are planted the same day with the same slug, append `-2`, `-3` to the slug. + **Step 3: Discover breadcrumbs** +**Key Constraint**: Breadcrumbs preserve code references from capture time. Even if the codebase evolves, these paths help the user re-orient when the seed surfaces months later. Always grep for related files at capture time. + Search the codebase for files related to the seed's topic. Use the Grep tool with 2-3 key terms from the seed's action and rationale. Collect up to 10 file paths as breadcrumbs. -If no files match, breadcrumbs can be empty -- the seed is still valuable without them. +If no files match, breadcrumbs can be empty — the seed is still valuable without them. **Gate**: Idea captured with all required fields (action, trigger, scope, rationale). Breadcrumbs discovered. Proceed to Confirm. @@ -141,6 +118,8 @@ Handle response: **Goal**: Persist the seed to `.seeds/index.json`. +**Key Constraint**: Seeds go in `.seeds/` which is gitignored — seeds are personal, not shared via version control. Different developers have different ideas and different trigger conditions; committing seeds pollutes the shared repo with personal notes. 
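The seed ID format from Phase 1 (`seed-YYYY-MM-DD-slug`, with `-2`, `-3` appended on same-day collisions) could be generated along these lines. This is a minimal sketch under stated assumptions: `make_seed_id` is a hypothetical helper, and `existing_ids` stands in for whatever IDs are already recorded in `.seeds/index.json`, whose exact schema is not shown here.

```python
from datetime import date


def make_seed_id(slug: str, planted_on: date, existing_ids: set) -> str:
    """Build a seed ID and resolve same-day slug collisions.

    Hypothetical helper: appends -2, -3, ... until the ID is unused,
    following the collision rule described in the skill text.
    """
    base = f"seed-{planted_on.isoformat()}-{slug}"
    if base not in existing_ids:
        return base
    suffix = 2
    while f"{base}-{suffix}" in existing_ids:
        suffix += 1
    return f"{base}-{suffix}"
```

Because the date is written in ISO order, a plain string sort of the IDs also sorts seeds chronologically, which is the property the format is meant to provide.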
+ **Step 1: Ensure directory exists** ```bash @@ -232,32 +211,11 @@ To archive: remove the seed from `index.json` and write it as a standalone file | Duplicate seed ID | Two seeds planted same day with same slug | Append `-2`, `-3` to slug | | No breadcrumbs found | Idea is forward-looking, no related code yet | Plant with empty breadcrumbs -- still valuable | | User describes immediate work | Seed system is for deferred work | Suggest creating a task or doing the work now | - -## Anti-Patterns - -| Anti-Pattern | Why Wrong | Do Instead | -|--------------|-----------|------------| -| Planting seeds for current-session work | Seeds are for deferred ideas, not TODOs | Create a task or do the work now | | Vague trigger like "someday" or "eventually" | Cannot be matched during feature-design | Ask for a specific, observable condition | | Missing rationale ("it would be nice") | Without WHY, the seed loses value when surfaced months later | Capture the specific insight or observation | -| Planting too many seeds at once | Seed fatigue -- too many dormant seeds become noise | Keep seeds focused and high-signal | -| Storing seeds in version control | Seeds are personal/local; committing pollutes shared repo | Keep in gitignored `.seeds/` directory | - -## Anti-Rationalization -See [core patterns](../shared-patterns/anti-rationalization-core.md). 
- -Domain-specific for plant-seed: - -| Rationalization | Why Wrong | Action | -|-----------------|-----------|--------| -| "I'll remember this idea" | You won't -- context is ephemeral | Plant the seed now | -| "A TODO comment is enough" | TODOs lack trigger conditions and rationale | Use seed for deferred work, TODO for next-session work | -| "The trigger is obvious" | Obvious to you now, opaque in 3 months | Write an explicit, human-readable trigger condition | -| "Breadcrumbs aren't important" | Code references ground the seed in the codebase | Always grep, even if results are sparse | -| "This seed is too small to bother" | Small seeds with good triggers have the best signal-to-noise | Plant it -- small seeds are low cost, high value | +--- ## References - [Feature Design](../feature-design/SKILL.md) - Seeds are surfaced during feature-design Phase 0 -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Core rationalization prevention diff --git a/skills/post-outliner/SKILL.md b/skills/post-outliner/SKILL.md index 5b372f70..72fbdc3b 100644 --- a/skills/post-outliner/SKILL.md +++ b/skills/post-outliner/SKILL.md @@ -27,47 +27,14 @@ routing: # Post Outliner Skill -## Operator Context - -This skill operates as an operator for blog post planning, configuring Claude's behavior for creating structural blueprints that ensure posts have logical flow and appropriate scope. It implements the **Structured Analysis** pattern -- assess topic, select template, generate outline, validate structure. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before outlining -- **Over-Engineering Prevention**: Outline only what was asked. 
No speculative series planning, no "while I'm here let me plan 5 more posts" -- **Structure First**: ALWAYS select a structure template before generating content -- **Word Count Estimates**: Every section MUST include estimated word counts -- **your blog Identity**: Posts are technical, deep, problem-solving focused -- no fluff or filler -- **Alternative Structures**: Always offer at least one alternative structure type -- **Assess Before Outline**: NEVER generate an outline without understanding the topic's core problem or value proposition first - -### Default Behaviors (ON unless disabled) -- **Section Summaries**: Include 2-3 sentence summaries per section -- **Reading Time**: Calculate and display estimated reading time (~250 wpm) -- **Complete Output**: Show full outline in formatted report block -- **Template Matching**: Auto-detect best structure from topic description -- **Hugo Frontmatter**: Include title, date, tags, summary planning -- **Logical Flow Validation**: Verify each section builds on the previous - -### Optional Behaviors (OFF unless enabled) -- **Minimal Mode**: Just section headers, no summaries -- **Deep Mode**: Expand to sub-section level with bullet points -- **Series Mode**: Plan multi-part post series with dependencies - -## What This Skill CAN Do -- Analyze topic briefs and identify the core "vex" or value proposition -- Select the most appropriate structure template for the content -- Generate structured outlines with section summaries and word counts -- Estimate reading time and validate scope -- Suggest alternative structures for the same topic -- Handle Hugo frontmatter planning (title, tags, summary) -- Validate logical flow between sections - -## What This Skill CANNOT Do -- Write actual post content (use `blog-post-writer` for that) -- Edit existing posts (use `anti-ai-editor` for style fixes) -- Guarantee SEO optimization (focus is structure, not keywords) -- Plan non-blog content like documentation or README files -- Create 
outlines without understanding the topic first -- always assess +## Overview + +This skill creates structural blueprints for blog posts by analyzing topic briefs, selecting appropriate structure templates, and generating outlines with word counts and section summaries. Posts should be technical, deep, problem-solving focused—no fluff or filler. The workflow follows a four-phase process: assess the topic for core value proposition, decide on the right template and scope, generate the outline with all required elements, then validate against quality standards. + +The skill operates under two core constraints: + +1. **Structure First**: Always select a structure template before generating content. Never output an outline without understanding the topic's core problem or value proposition first. +2. **No Over-Engineering**: Outline only what was asked. Do not speculate on series planning, suggest related posts, or generate more outlines than requested. --- @@ -86,6 +53,13 @@ Identify these elements: **Step 2: Ask key questions** +If the topic brief is too vague to answer these, ask clarifying questions BEFORE proceeding: +- "What specific problem did you encounter?" +- "What did you learn?" +- "Who is the audience?" + +Document your assessment: + ```markdown ## Topic Assessment Problem: [What problem does this solve?] @@ -94,12 +68,9 @@ Insight: [What's the key insight or solution?] Scope: [Single post or potential series?] ``` -If the topic brief is too vague to answer these, ask clarifying questions BEFORE proceeding: -- "What specific problem did you encounter?" -- "What did you learn?" -- "Who is the audience?" +**Gate**: Core problem/value identified. Topic is specific enough to outline. Do not proceed with outline generation without this. -**Gate**: Core problem/value identified. Topic is specific enough to outline. Proceed only when gate passes. +**Why This Phase**: Vague topics produce generic outlines. Section names should communicate content at a glance. 
Generic names (e.g., "Introduction", "Main Content", "Details", "Conclusion") reveal nothing about reader value and indicate shallow thinking about the topic. Always complete assessment with specific, content-descriptive section names in mind. ### Phase 2: DECIDE @@ -127,7 +98,9 @@ See `references/structure-templates.md` for full template details with section b | Tutorial | 1,200-2,000 | 5-6 | | Series part | 800-1,200 | 3-4 | -**Gate**: Template selected, scope defined. Proceed only when gate passes. +**Why Scope Matters**: Section bloat dilutes impact. Your blog cuts to the chase. Too many thin sections (8+) at 100 words each pad length without adding value. Merge related sections. Aim for 3-7 substantive sections with specific names. Justify every section's existence against the core insight and cut sections that don't serve the core message. Word count estimates must add up to the overall total; rough estimates undermine scope validation. + +**Gate**: Template selected, scope defined. Do not proceed without this. ### Phase 3: GENERATE @@ -170,7 +143,9 @@ Generate the outline in this exact format: =============================================================== ``` -**Gate**: Outline complete with all required elements. Proceed only when gate passes. +**Why Each Element**: Every section must have estimated word counts so you can validate scope and identify sections that are too heavy or too light. Calculating reading time (~250 wpm) tells authors how long the post will take to read. Include Hugo frontmatter planning (title, date, tags, summary) to reduce friction at the publication phase. Always offer at least one alternative structure so the author can choose from options rather than being locked into one approach. + +**Gate**: Outline complete with all required elements. 
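The arithmetic behind the word-count and reading-time elements can be sketched in a few lines of Python. This is a minimal illustration and not part of the skill: the `Section` type, the function names, and the sample outline are hypothetical. It assumes each section's estimate is a low-high range and that reading time is rounded up to whole minutes.

```python
import math
from dataclasses import dataclass

WORDS_PER_MINUTE = 250  # reading speed quoted in the skill text


@dataclass
class Section:
    """One outline section with an estimated word-count range (hypothetical shape)."""
    name: str
    min_words: int
    max_words: int


def total_range(sections):
    """Sum the per-section ranges into an overall (low, high) estimate."""
    return (sum(s.min_words for s in sections), sum(s.max_words for s in sections))


def reading_minutes(words):
    """Estimated reading time in whole minutes, rounded up (rounding is an assumption)."""
    return math.ceil(words / WORDS_PER_MINUTE)


def counts_add_up(sections, stated_low, stated_high):
    """True when the stated overall estimate equals the sum of the sections."""
    return total_range(sections) == (stated_low, stated_high)


# Illustrative outline; the section names and numbers are made up for the example.
outline = [
    Section("Hugo Builds Fail on Cloudflare", 200, 300),
    Section("Tracing the Version Mismatch", 400, 600),
    Section("Pinning the Hugo Version", 300, 400),
]
low, high = total_range(outline)
print(f"Total: {low}-{high} words, about {reading_minutes(high)} min read")
print("Stated estimate matches:", counts_add_up(outline, 900, 1300))
```

Comparing for exact equality, rather than "close enough", matches the rule that section totals must match the overall estimate.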
### Phase 4: VALIDATE @@ -186,10 +161,12 @@ Run through this checklist: - [ ] **Word counts present**: Every section has estimates - [ ] **Word counts add up**: Section totals match overall estimate - [ ] **Alternative structures**: At least one alternative offered -- [ ] **your blog identity**: Technical, direct, problem-solving +- [ ] **Blog identity**: Technical, direct, problem-solving If any check fails, revise the outline before presenting. +**Why This Validation**: Word counts are not rough—they must be precise. Section totals must match the overall estimate. Do not tolerate arithmetic drift. Generic section names are a red flag that you haven't thought deeply about the content and what readers gain from each part. + **Gate**: All validation checks pass. Outline is complete. --- @@ -250,46 +227,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Outline Without Understanding -**What it looks like**: Generating a generic outline immediately after receiving a broad topic like "Kubernetes" -**Why wrong**: No specific value proposition, no your blog identity. Produces hollow structure without substance. -**Do instead**: Complete Phase 1 ASSESS first. Ask clarifying questions. Identify the specific problem and insight before touching structure. - -### Anti-Pattern 2: Too Many Thin Sections -**What it looks like**: 8+ sections with Introduction, Background, Context, Problem Statement each at 100 words -**Why wrong**: your blog cuts to the chase. Multiple thin sections dilute impact and pad length without adding value. -**Do instead**: Merge related sections. Start with the vex. Aim for 3-7 substantive sections. - -### Anti-Pattern 3: Generic Section Names -**What it looks like**: "Introduction", "Main Content", "Details", "Conclusion" -**Why wrong**: Section names should communicate content at a glance. Generic names reveal nothing about what the reader gains. 
-**Do instead**: Use specific, descriptive names like "Hugo Builds Fail on Cloudflare" instead of "The Problem". - -### Anti-Pattern 4: Missing Word Counts -**What it looks like**: Sections listed without any size estimates -**Why wrong**: Cannot validate scope, cannot estimate reading time, cannot identify sections that are too heavy or too light. -**Do instead**: Every section gets a word count range. Totals must add up to overall estimate. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Topic is clear enough, skip assessment" | Vague topics produce generic outlines | Complete Phase 1 ASSESS with key questions | -| "One structure is fine, no need for alternatives" | Author should choose from options | Always include ALTERNATIVE STRUCTURES section | -| "Word counts are rough, close enough" | Section totals must match overall estimate | Verify arithmetic before presenting | -| "Generic section names work for now" | Names reveal outline quality and thinking depth | Use specific, content-descriptive names | -| "Just one more section won't hurt" | Section bloat dilutes impact | Justify every section's existence against core insight | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/structure-templates.md`: Complete template library with section breakdowns and signal words - `${CLAUDE_SKILL_DIR}/references/examples.md`: Real outlines from your blog posts demonstrating proper format diff --git a/skills/pptx-generator/SKILL.md b/skills/pptx-generator/SKILL.md index 23c8e42a..64b3907c 100644 --- a/skills/pptx-generator/SKILL.md +++ b/skills/pptx-generator/SKILL.md @@ -39,53 +39,11 @@ 
routing: # PPTX Presentation Generator -## Operator Context +## Overview -This skill operates as an operator for programmatic presentation generation, configuring Claude's behavior for a 6-phase pipeline that transforms content into polished PowerPoint decks. It uses `python-pptx` (pure Python, MIT licensed) for deterministic slide construction and a visual QA subagent loop for quality control. +This skill generates polished PowerPoint decks through a 6-phase pipeline that separates content decisions (LLM) from slide construction (deterministic script) from visual validation (fresh-eyes subagent). The core principle: **"Slides are visual documents, not text dumps. Generate mechanically, validate visually."** -Core thesis: **"Slides are visual documents, not text dumps. Generate mechanically, validate visually."** - -The pipeline separates content decisions (LLM) from slide construction (Python script) from visual validation (subagent). This separation prevents the common failure mode where the generator rationalizes away visual defects it introduced. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution. Project instructions override default behaviors. -- **Over-Engineering Prevention**: Build the deck the user asked for. No speculative slides, no "bonus" content, no unsolicited animations or transitions. -- **Anti-AI Slide Rules**: Every generated deck MUST avoid the 10 anti-AI patterns defined in `references/anti-ai-slide-rules.md`. Why: AI-generated slides have a distinctive "template pack" aesthetic that undermines credibility. The QA subagent checks for all violations. -- **Deterministic Generation**: Slide construction happens via `scripts/generate_pptx.py`, not inline python-pptx code. Why: scripts are testable, reproducible, and save tokens. The LLM decides WHAT goes on slides; the script decides HOW. 
-- **Blank Layout Only**: Always use `slide_layouts[6]` (blank) as the base layout. Why: using template-specific layouts (title, content) inherits unpredictable formatting from whatever default template python-pptx ships. Blank gives us full control. -- **Safe Fonts Only**: Use Calibri and Arial exclusively. Why: presentations are shared documents. Custom fonts cause rendering failures on machines that lack them. Portability trumps aesthetics. -- **Slide Map Approval**: Always present the slide map to the user for approval before generating the PPTX. Why: content rework after generation wastes QA iteration budget. - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Fact-based progress. "Generated 10 slides, Corporate palette, QA passed" -- not "I've crafted a stunning presentation." -- **Palette Selection**: Auto-select palette based on presentation type and audience using the heuristic in `references/design-system.md`. Default to **Minimal** when uncertain. -- **Widescreen Format**: 16:9 (13.333 x 7.5 inches). This is the universal modern presentation format. -- **Visual QA Loop**: When LibreOffice is available, convert to images and run a QA subagent. Max 3 fix iterations. When LibreOffice is unavailable, fall back to structural validation only and note the skip. -- **Cleanup**: Remove intermediate files (PDFs, PNGs, temp JSON) after final output. -- **Absolute Paths**: Always use absolute paths for all file arguments to scripts. - -### Optional Behaviors (OFF unless enabled) -- **Template-Based Generation**: Use an existing .pptx file as a template instead of blank. Enable when user provides a branded template. -- **Image Embedding**: Embed user-provided images into slides. Enable when user provides image files or paths. -- **Speaker Notes**: Add speaker notes to slides. Enable when user explicitly requests notes. -- **Dark Theme**: Use the Midnight palette with inverted contrast. 
Enable when user requests dark slides or keynote-style dark decks. - -## What This Skill CAN Do -- Generate complete .pptx slide decks from content outlines, documents, or freeform descriptions -- Apply 8 curated color palettes with consistent typography and spacing -- Build 8 layout types: title, section divider, content/bullets, two-column, image+text, quote/callout, table, closing -- Validate generated slides structurally (slide count, titles, content presence) -- Run visual QA via LibreOffice conversion and subagent inspection (when LibreOffice is available) -- Use existing .pptx files as templates for branded decks -- Embed user-provided images into slides - -## What This Skill CANNOT Do -- **Google Slides or Keynote export**: Generates .pptx only. Users convert manually. -- **Live editing or preview**: Produces a file; there is no interactive editing capability. -- **Custom fonts**: Only Calibri and Arial for portability. Custom font embedding is not supported. -- **Animations or transitions**: No animations by default. Python-pptx has limited animation support; even if requested, results are unreliable. -- **Auto-generate images**: Does not create diagrams, charts, or illustrations. Pair with `gemini-image-generator` for image assets. -- **Speaker notes generation**: Deferred to future version. The skill generates slides, not presenter scripts. +This separation prevents the common failure mode where the generator rationalizes away visual defects it introduced. The visual QA subagent has zero generation context and sees slides as viewers would. --- @@ -170,7 +128,7 @@ Select layout types for each slide. Use at least 2-3 distinct layout types to av Available layouts: `title`, `section` (divider), `content` (bullets), `two_column`, `image_text`, `quote` (callout), `table`, `closing` Layout rhythm rules: -- Never use the same layout more than 3 times in a row +- Never use the same layout more than 3 times in a row. 
(Reason: Identical layouts are the most obvious AI-slide tell. Real presentations have visual rhythm with varied layouts.) - For 10+ slide decks, use at least 3 distinct layout types - Insert a different layout type (quote, two-column, section divider) to break repetition @@ -220,7 +178,7 @@ Create a JSON array where each element represents one slide: Before presenting to the user, check: - [ ] At least 2-3 distinct layout types used (not all `content`) - [ ] No more than 3 consecutive slides with the same layout -- [ ] Max 6 bullets per content slide, max 10 words per bullet +- [ ] Max 6 bullets per content slide, max 10 words per bullet (Reason: 9 bullets is a document paragraph, not a slide. Readability degrades sharply past 6.) - [ ] Title slide is first, closing slide is last (if appropriate) - [ ] Section dividers placed before new sections (for 8+ slide decks) @@ -254,7 +212,7 @@ Approve this structure, or suggest changes? **Goal**: Execute the deterministic Python script to produce the .pptx file. -**Why this phase exists**: Slide construction is mechanical work -- given a slide map and design config, the output is deterministic. This belongs in a script, not in LLM-generated inline code. Scripts are testable, reproducible, and consistent. +**Why this phase exists**: Slide construction is mechanical work -- given a slide map and design config, the output is deterministic. This belongs in a script, not in LLM-generated inline code. Scripts are testable, reproducible, and consistent. (Reason: Inline code is not testable, wastes tokens on boilerplate, and risks inconsistency. The script encapsulates palette application, layout selection, font sizing, spacing rules, and all design system constraints.) 
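The slide-map limits stated earlier (at most 3 consecutive slides with one layout, at most 6 bullets per slide, at most 10 words per bullet, at least 3 distinct layouts in a 10+ slide deck) are mechanical, so they could be checked before the generation script runs. The sketch below is one possible way to do that. It is not the skill's own validator, and the `layout` and `bullets` key names are assumptions about the slide map shape, which is not shown here.

```python
MAX_BULLETS = 6            # per content slide
MAX_WORDS_PER_BULLET = 10
MAX_SAME_LAYOUT_RUN = 3    # consecutive slides sharing one layout


def check_slide_map(slides):
    """Return a list of human-readable problems; an empty list means the map passes.

    Assumes each slide is a dict with a "layout" string and an optional
    "bullets" list of strings. These key names are illustrative.
    """
    problems = []
    run_layout, run_length = None, 0
    for number, slide in enumerate(slides, start=1):
        layout = slide.get("layout")
        # Track how many consecutive slides share the same layout.
        if layout == run_layout:
            run_length += 1
        else:
            run_layout, run_length = layout, 1
        # Report once, at the first slide that exceeds the allowed run.
        if run_length == MAX_SAME_LAYOUT_RUN + 1:
            problems.append(
                f"slide {number}: more than {MAX_SAME_LAYOUT_RUN} '{layout}' slides in a row"
            )

        bullets = slide.get("bullets", [])
        if len(bullets) > MAX_BULLETS:
            problems.append(f"slide {number}: {len(bullets)} bullets (max {MAX_BULLETS})")
        for bullet in bullets:
            words = len(bullet.split())
            if words > MAX_WORDS_PER_BULLET:
                problems.append(
                    f"slide {number}: bullet has {words} words (max {MAX_WORDS_PER_BULLET})"
                )

    distinct_layouts = {slide.get("layout") for slide in slides}
    if len(slides) >= 10 and len(distinct_layouts) < 3:
        problems.append("deck has 10+ slides but fewer than 3 distinct layouts")
    return problems
```

Returning a list of problems instead of raising on the first one lets every issue be fixed in a single revision of the slide map.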
**Step 1: Check dependencies** @@ -269,7 +227,7 @@ pip install python-pptx Pillow **Step 2: Write the slide map and design config to JSON files** -Save the approved slide map and design config to temporary files: +Save the approved slide map and design config to temporary files (use absolute paths for all file arguments): ```bash # Write slide map JSON to temp file @@ -300,6 +258,11 @@ python3 /path/to/skills/pptx-generator/scripts/generate_pptx.py \ Exit codes: 0 = success, 1 = missing python-pptx, 2 = invalid input, 3 = generation failed. +**Constraints applied during generation**: +- **Blank Layout Only**: Always use `slide_layouts[6]` (blank) as the base layout. Why: using template-specific layouts (title, content) inherits unpredictable formatting from whatever default template python-pptx ships. Blank gives us full control. +- **Safe Fonts Only**: Use Calibri and Arial exclusively. Why: presentations are shared documents. Custom fonts cause rendering failures on machines that lack them. Portability trumps aesthetics. +- **Widescreen Format**: 16:9 (13.333 x 7.5 inches). This is the universal modern presentation format. + **Step 4: Run structural validation** ```bash @@ -356,7 +319,7 @@ Check that one PNG exists per slide. If fewer PNGs than slides, some slides may **Why a subagent**: The generating agent has context bias -- it "knows" what the slide should look like and will rationalize visual problems. A fresh-eyes subagent with zero generation context sees the slide as a viewer would. This is the same anti-bias pattern as the voice-validator: the generator and the validator must be separate. -**Why max 3 iterations**: If visual issues persist after 3 fix cycles, the design is wrong, not the implementation. Looping further produces diminishing returns and wastes context. +**Why max 3 iterations**: If visual issues persist after 3 fix cycles, the design is wrong, not the implementation. Looping further produces diminishing returns and wastes context. 
(Reason: Do NOT continue iterating beyond 3. This signals that the design approach is wrong, not the implementation. More iterations burn context without convergence.) **Step 1: Dispatch QA subagent** @@ -371,7 +334,7 @@ The subagent checks each slide against these categories: 2. **Layout and Alignment**: Consistent margins, aligned elements, visual balance 3. **Color Usage**: Palette consistency, max 3 colors per slide, adequate contrast 4. **Content Accuracy**: Titles and bullets match the slide map -5. **Anti-AI Violations**: All 10 rules from `references/anti-ai-slide-rules.md` +5. **Anti-AI Violations**: All 10 rules from `references/anti-ai-slide-rules.md` (avoid accent lines under titles, gradient backgrounds, identical layouts, shadows on everything, rounded rectangles everywhere, clip art icons, gradient text) 6. **Structural Checks**: Slide count, title slide present, closing slide present Subagent prompt structure: @@ -421,7 +384,7 @@ Track iteration count: ``` QA Iteration 1/3: 2 issues found (1 Blocker, 1 Major) QA Iteration 2/3: 1 issue found (1 Minor) -QA Iteration 2/3: PASS (0 Blocker, 0 Major) +QA Iteration 3/3: PASS (0 Blocker, 0 Major) ``` **GATE**: QA subagent returns PASS, OR 3 iterations exhausted. If iterations exhausted with remaining issues, include them in the output report. Do not loop beyond 3. @@ -478,7 +441,7 @@ Remove: - `/tmp/design_config.json` - `/tmp/pptx_qa_images/` directory (PNG renders and PDFs) -Keep only the final .pptx file. +Keep only the final .pptx file. (Reason: Cleanup is a default behavior to remove intermediate files after final output.) --- @@ -553,54 +516,6 @@ Install with: `apt install libreoffice-impress` (Debian/Ubuntu) or `brew install --- -## Anti-Patterns - -### Inline python-pptx Code Instead of Script -**What it looks like**: Writing python-pptx code directly in a Bash tool call instead of using `scripts/generate_pptx.py`. 
-**Why wrong**: Inline code is not testable, wastes tokens on boilerplate, and risks inconsistency. The script encapsulates palette application, layout selection, font sizing, spacing rules, and all design system constraints. -**Do instead**: Always use `generate_pptx.py`. If the script lacks a feature, extend the script. - -### Skipping Slide Map Review -**What it looks like**: Generating the .pptx immediately after gathering content, without showing the user the planned structure. -**Why wrong**: Content changes after generation waste QA iteration budget. If the user wanted different emphasis or ordering, you lose fix cycles to structural rework instead of visual fixes. -**Do instead**: Present the slide map in Phase 2 and wait for approval. This is a gate, not a suggestion. - -### Same Layout for Every Slide -**What it looks like**: Every content slide uses the `content` (bullets) layout. -**Why wrong**: Identical layouts are the most obvious AI-slide tell. Real presentations have visual rhythm with varied layouts. -**Do instead**: Use at least 2-3 distinct layout types. Insert quote, two-column, or section divider slides to break repetition. - -### Looping QA Beyond 3 Iterations -**What it looks like**: "Just one more fix cycle" after iteration 3 because there is one remaining minor issue. -**Why wrong**: If 3 iterations could not fix the issue, the design approach is wrong. More iterations burn context without convergence. -**Do instead**: Stop at 3. Report remaining issues. Deliver with caveats. - -### Adding Decorative Elements "For Polish" -**What it looks like**: Adding accent lines under titles, shadows on every shape, rounded rectangles as bullet containers. -**Why wrong**: These are the exact patterns that mark a presentation as AI-generated. See `references/anti-ai-slide-rules.md`. -**Do instead**: Keep slides clean. Use typography (size, weight, spacing) for visual hierarchy, not decorative shapes. 
- ---- - -## Anti-Rationalization - -See [shared-patterns/anti-rationalization-core.md](../shared-patterns/anti-rationalization-core.md) for universal patterns. - -### Domain-Specific Rationalizations - -| Rationalization Attempt | Why It's Wrong | Required Action | -|------------------------|----------------|-----------------| -| "The slide map is obvious, skip user review" | Content rework after generation costs QA iterations | Present the slide map and wait for approval | -| "One more QA iteration will fix it" (after 3) | 3 failed iterations means the design approach is wrong, not the implementation | Stop, report issues, deliver with caveats | -| "LibreOffice is slow, skip visual QA" | Structural validation catches data errors but not visual ones (clipped text, contrast, alignment) | Run visual QA when LibreOffice is available; note when it is skipped | -| "Inline code is faster than the script" | Faster for one run, inconsistent across many; wastes tokens on boilerplate | Use `generate_pptx.py` always; extend it if features are missing | -| "Adding a decorative line makes it look polished" | This is the single most recognizable AI-slide tell | Never add decorative lines under titles | -| "The user won't notice identical layouts" | Audiences do notice visual monotony, even subconsciously | Use 2-3+ distinct layout types per deck | -| "9 bullets is close enough to 6" | 9 bullets is a document paragraph, not a slide. 
Readability degrades sharply past 6 | Enforce the 6-bullet max; split into multiple slides | -| "Custom fonts would look better" | Custom fonts break on machines that lack them | Calibri/Arial only; portability trumps aesthetics | - ---- - ## Blocker Criteria STOP and ask the user (do NOT proceed autonomously) when: @@ -618,18 +533,18 @@ STOP and ask the user (do NOT proceed autonomously) when: - Whether to use dark theme (Midnight palette) -- strong aesthetic choice - Whether to include images (user must provide assets or explicitly request generation) - Slide count when user is vague ("a few slides" -- ask for a number) -- Content that the user hasn't provided (do not invent slide content) +- Content that the user hasn't provided (do not invent slide content). Reason: Build the deck the user asked for. No speculative slides, no "bonus" content, no unsolicited animations or transitions. --- -## Death Loop Prevention +## Retry Limits and Recovery -### Retry Limits +**Retry Limits**: - Phase 3 (GENERATE): Max 2 retries for script failures before escalating to user - Phase 5 (QA): Max 3 iterations of the fix-and-recheck cycle - Slide map revision: Max 2 rounds of user feedback before freezing the map -### Recovery Protocol +**Recovery Protocol**: 1. **Detection**: Same QA issue reappearing after a fix attempt, generation script failing on the same input repeatedly, or slide map revisions not converging 2. **Intervention**: Simplify the deck. Reduce slide count, use only `content` and `title` layouts, drop complex layouts (table, two-column) that may be causing issues 3. **Prevention**: Validate the slide map JSON against the schema before generation. Check that bullet counts are within limits. Verify image paths exist before including `image_text` slides. 
@@ -722,8 +637,3 @@ For detailed information: - [research-to-article](../research-to-article/SKILL.md) -- research output can feed slide content - [gemini-image-generator](../gemini-image-generator/SKILL.md) -- generate images for slides - [workflow-orchestrator](../workflow-orchestrator/SKILL.md) -- orchestrate multi-step pipelines - -### Shared Patterns -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) -- prevents shortcut rationalizations -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) -- phase transition rules -- [Verification Checklist](../shared-patterns/verification-checklist.md) -- pre-completion checks diff --git a/skills/pr-cleanup/SKILL.md b/skills/pr-cleanup/SKILL.md index c173afad..ec41cb2b 100644 --- a/skills/pr-cleanup/SKILL.md +++ b/skills/pr-cleanup/SKILL.md @@ -21,41 +21,9 @@ routing: # PR Cleanup Skill -## Operator Context - -This skill operates as an operator for post-merge branch cleanup, configuring Claude's behavior for safe, systematic removal of stale local branches. It implements a **Sequential Safety** pattern -- identify target, verify merge status, delete safely, confirm result. - -### Hardcoded Behaviors (Always Apply) -- **Protected Branches**: NEVER delete main, master, or develop branches -- **Safe Delete First**: Always use `git branch -d` before considering `-D` -- **Identify Before Switch**: Capture branch name BEFORE switching to main -- **Verify Merge Status**: Confirm branch was merged or remote-deleted before removing -- **Report Results**: Always show what was deleted and what remains -- **Worktree Cleanup First**: Before deleting any branch, check if a git worktree references it. Worktree branches block `git branch -d/-D` and `gh pr merge --delete-branch`. Run `git worktree remove` before branch deletion. 
- -### Default Behaviors (ON unless disabled) -- **Prune Remote References**: Run `git remote prune origin` after cleanup -- **Pull Latest**: Pull main/master with `--prune` after switching -- **Show Remaining Branches**: List local branches after cleanup completes -- **Squash-Merge Detection**: Check for `[gone]` upstream when `-d` fails - -### Optional Behaviors (OFF unless enabled) -- **Batch Cleanup**: Delete all merged branches with `--all` flag -- **Dry Run**: Show what would be deleted without acting, with `--dry-run` flag - -## What This Skill CAN Do -- Delete local branches that have been merged into main/master -- Detect squash-merged branches by checking upstream tracking status -- Batch-delete all merged branches except protected ones -- Prune stale remote-tracking references -- Dry-run to preview cleanup actions - -## What This Skill CANNOT Do -- Delete remote branches (local cleanup only) -- Review or merge PRs (use /pr-review instead) -- Run CI checks (use ci skill instead) -- Create or rename branches -- Force-delete unmerged branches without explicit user confirmation +## Overview + +This skill provides safe, systematic cleanup of local branches after PR merge. It implements a **Sequential Safety** pattern: identify target, verify merge status, delete safely, confirm result. The workflow follows 5 sequential steps to prevent data loss and ensure reliable cleanup. --- @@ -65,7 +33,7 @@ This skill operates as an operator for post-merge branch cleanup, configuring Cl **Goal**: Remove any git worktrees referencing the target branch before attempting deletion. -Worktree agents (dispatched with `isolation: "worktree"`) create local branches that block both `git branch -d` and `gh pr merge --delete-branch`. Check and clean up worktrees first: +Worktree agents (dispatched with `isolation: "worktree"`) create local branches that block both `git branch -d` and `gh pr merge --delete-branch`. 
Since worktree branches prevent deletion, check and clean up first: ```bash # List worktrees referencing any branch @@ -75,7 +43,7 @@ git worktree list git worktree remove ``` -If worktree removal fails (e.g., uncommitted changes), report the issue to the user. Do not force-remove without confirmation. +If worktree removal fails (e.g., uncommitted changes), report the issue to the user rather than force-removing, to avoid data loss. **Gate**: No worktrees reference the target branch. @@ -83,15 +51,15 @@ If worktree removal fails (e.g., uncommitted changes), report the issue to the u **Goal**: Determine which branch to clean up before any state changes. -If user provides a branch name argument, use that. Otherwise capture the current branch: +Capture the target branch *before* switching (switching changes the current branch and complicates recovery). If user provides a branch name, use that; otherwise: ```bash BRANCH_TO_DELETE=$(git branch --show-current) ``` -If already on main/master, ask the user which branch to clean up. Do not proceed without a target. +If already on main/master, ask the user which branch to clean up. Do not guess a target; this prevents accidental deletion of the base branch. -Detect the main branch name: +Never delete protected branches (main, master, develop): they are the foundation of the repo, and deleting them by accident causes widespread damage. Detect the main branch name for later use: ```bash MAIN_BRANCH=$(git symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's@^refs/remotes/origin/@@' || echo "master") @@ -103,6 +71,8 @@ MAIN_BRANCH=$(git symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's@^re **Goal**: Move to main branch and sync with remote. +Switch away from the target branch before deletion, because git refuses to delete the currently checked-out branch. 
Prune old remote-tracking references to keep the branch list clean: + ```bash git checkout "$MAIN_BRANCH" && git pull --prune origin "$MAIN_BRANCH" ``` @@ -113,21 +83,25 @@ git checkout "$MAIN_BRANCH" && git pull --prune origin "$MAIN_BRANCH" **Goal**: Remove the target branch safely. -Attempt safe delete first: +Use `-d` (safe delete) before `-D` (force delete) to prevent destroying unmerged work. Git's `-d` will refuse to delete branches with unmerged commits — this is a guard rail: ```bash git branch -d "$BRANCH_TO_DELETE" ``` If `-d` fails with "not fully merged": -1. Check if branch was squash-merged by looking for gone upstream: - ```bash - git branch --format '%(refname:short) %(upstream:track)' | grep "$BRANCH_TO_DELETE" - ``` -2. If upstream shows `[gone]`, the remote branch was deleted (PR was merged). Inform user and offer `-D`. -3. If upstream is NOT gone, warn user the branch may contain unmerged work. Only use `-D` with explicit user confirmation. -Prune stale remote-tracking references: +Squash-merged and rebase-merged branches have no merge commit, so git doesn't recognize them as merged. Check if the remote branch was deleted (indicating a completed PR): + +```bash +git branch --format '%(refname:short) %(upstream:track)' | grep "$BRANCH_TO_DELETE" +``` + +If upstream shows `[gone]`, the remote branch was deleted after PR merge — evidence the work is safe. Inform user and offer `-D` for forced deletion. + +If upstream is NOT gone, the branch may contain unmerged work — warn user and ask for explicit confirmation before using `-D`. Never force-delete without user confirmation on unknown merge status. + +Prune stale remote-tracking references to keep the reference list accurate: ```bash git remote prune origin @@ -139,7 +113,7 @@ git remote prune origin **Goal**: Confirm what happened and show current state. -Report format: +Always report results so the user has visibility into what was removed and the final state. 
This prevents silent failures and gives confidence the cleanup succeeded: ``` PR Cleanup Complete @@ -152,11 +126,13 @@ PR Cleanup Complete Run `git branch` to show remaining local branches. ---- - ## Extended Cleanup (--all) -When user passes `--all`, first remove all stale worktrees, then delete branches: +When user passes `--all`, batch-delete all merged branches except protected ones. + +**Safety first**: Preview the branch list before deleting. If more than 3 branches will be deleted, ask for explicit confirmation to prevent unexpected wholesale cleanup: + +First, remove all stale worktrees pointing at branches we're about to delete: ```bash # Step 0: Clean up worktrees pointing at branches we're about to delete @@ -164,19 +140,19 @@ git worktree list --porcelain | grep -A2 'branch refs/heads/' | grep -v 'main\|m # For each stale worktree: git worktree remove ``` -Then delete all branches merged into main except protected branches: +Then delete all branches merged into main except protected branches (never delete main/master/develop): ```bash git branch --merged "$MAIN_BRANCH" | grep -v -E '^\*|main|master|develop' | xargs -r git branch -d ``` -Also find squash-merged branches with gone upstreams: +Also find and delete squash-merged branches by detecting gone upstreams: ```bash git branch --format '%(refname:short) %(upstream:track)' | awk '$2 == "[gone]" { print $1 }' ``` -Show the full list before deleting and confirm with user if more than 3 branches. +Show the full list before deleting and confirm with user if more than 3 branches to prevent accidental data loss. 
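The preview-and-confirm rule for `--all` can be sketched as a small pure function over the output of the `git branch --format '%(refname:short) %(upstream:track)'` command shown above. This is an illustrative sketch, not the skill's implementation: the function names and the sample output are hypothetical, and only the protected branch names and the "more than 3" threshold come from the text.

```python
PROTECTED = {"main", "master", "develop"}
CONFIRM_THRESHOLD = 3  # more than this many deletions requires explicit confirmation


def gone_branches(format_output):
    """Pick local branches whose upstream is reported as [gone].

    `format_output` is the text printed by
    git branch --format '%(refname:short) %(upstream:track)'.
    Protected names are skipped by exact match.
    """
    candidates = []
    for line in format_output.splitlines():
        parts = line.split(maxsplit=1)
        # Lines for branches without an upstream have no second field.
        if len(parts) == 2 and parts[1].strip() == "[gone]" and parts[0] not in PROTECTED:
            candidates.append(parts[0])
    return candidates


def needs_confirmation(candidates):
    """True when the batch is large enough that the user must confirm first."""
    return len(candidates) > CONFIRM_THRESHOLD


# Made-up sample output for illustration.
sample = "main \nfeature/login [gone]\nfix-main-menu [gone]\nwip [ahead 2]\ndevelop [gone]\n"
to_delete = gone_branches(sample)
print("Would delete:", to_delete)
print("Ask before deleting:", needs_confirmation(to_delete))
```

One design note: the sketch matches protected names exactly, whereas a substring filter such as `grep -v -E 'main|master|develop'` would also skip a branch like `fix-main-menu`. Whether that stricter matching is wanted is a judgment call for the skill's author.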
--- @@ -232,30 +208,6 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Force-Deleting Without Checking Merge Status -**What it looks like**: Using `git branch -D` immediately without trying `-d` first -**Why wrong**: May destroy unmerged work with no recovery path -**Do instead**: Always try `-d` first, check merge/upstream status, then `-D` only with evidence - -### Anti-Pattern 2: Deleting Before Switching -**What it looks like**: Trying to delete the current branch while still on it -**Why wrong**: Git will refuse, causing a confusing error -**Do instead**: Always switch to main/master before deleting the target branch - -### Anti-Pattern 3: Skipping the Report -**What it looks like**: Deleting branches silently without showing what was removed -**Why wrong**: User has no confirmation of what happened, no visibility into remaining state -**Do instead**: Always list deleted branches and remaining local branches - -### Anti-Pattern 4: Batch Delete Without Preview -**What it looks like**: Running `--all` and deleting everything without showing the list first -**Why wrong**: May delete branches user intended to keep -**Do instead**: Show the list of branches to be deleted and confirm before proceeding - ---- - ## References This skill uses these shared patterns: diff --git a/skills/pr-fix/SKILL.md b/skills/pr-fix/SKILL.md index eef6b09a..e9055f40 100644 --- a/skills/pr-fix/SKILL.md +++ b/skills/pr-fix/SKILL.md @@ -25,43 +25,11 @@ routing: # PR Fix Skill -## Operator Context - -This skill operates as an operator for PR comment resolution workflows, configuring Claude's behavior for validated, evidence-based fixes. It implements the **Validate-Before-Fix** architectural pattern -- verify each comment's claim against actual code, then apply targeted fixes with a single clean commit. 
- -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before making any changes -- **Validate Every Comment**: NEVER blindly fix a comment without verifying its claim against actual code -- **Show Plan First**: NEVER apply fixes without presenting the fix plan and getting confirmation -- **Single Commit**: All fixes go into one commit with descriptive message referencing the PR -- **No Scope Creep**: Fix only what reviewers asked for. No "while I'm here" improvements -- **Branch Safety**: Never commit directly to main/master; work on the PR's branch - -### Default Behaviors (ON unless disabled) -- **Comment Classification**: Categorize each comment as VALID, INVALID, or NEEDS-DISCUSSION -- **Before/After Display**: Show code diff for each fix before committing -- **Skip Invalid Comments**: Report invalid comments with explanation instead of fixing -- **Push After Commit**: Push changes to update the PR after committing -- **Final Report**: Display summary of fixed, skipped, and pending items - -### Optional Behaviors (OFF unless enabled) -- **Reply to Comments**: Post resolution replies on fixed comment threads via `gh api` -- **Resolve Threads**: Mark fixed comment threads as resolved -- **NEEDS-DISCUSSION Auto-Reply**: Draft reply templates for ambiguous comments - -## What This Skill CAN Do -- Fetch and validate PR review comments against actual code -- Distinguish valid feedback from incorrect claims -- Apply targeted fixes for validated comments only -- Commit and push all fixes in a single clean commit -- Report what was fixed, skipped, or needs discussion - -## What This Skill CANNOT Do -- Fix comments without first validating them against the codebase -- Apply fixes without showing the plan and getting confirmation -- Create PRs (use pr-pipeline instead) -- Review code for new issues (use /pr-review instead) -- Make unrelated improvements beyond what reviewers requested +## Overview + +This skill 
implements a **Validate-Before-Fix** workflow for addressing PR review comments. Each comment is verified against actual code before any fix is applied, ensuring only valid feedback gets implemented. The workflow produces a single clean commit referencing the PR, with optional replies and thread resolution. + +This skill is designed for targeted comment resolution only — not for new PRs, code reviews, or unrelated debugging. --- @@ -77,9 +45,9 @@ If no PR number is provided as argument: gh pr view --json number,title,headRefName --jq '{number, title, headRefName}' ``` -If no PR is found for the current branch, inform the user and stop. +If no PR is found for the current branch, inform the user and stop. (This prevents fixing comments on wrong PRs, which is the most common integration mistake.) -Verify the current branch matches the PR's head branch. If not, ask the user before proceeding. +Verify the current branch matches the PR's head branch. If not, ask the user before proceeding. (Branch safety constraint: Never commit directly to main/master — this check enforces working on the PR's branch.) **Gate**: PR identified with number, title, and correct branch checked out. @@ -87,6 +55,8 @@ Verify the current branch matches the PR's head branch. If not, ask the user bef **Goal**: Retrieve all review comments and validate each claim against actual code. +**Why validation is critical**: Reviewers make mistakes. Without verification, you risk applying invalid fixes that introduce bugs. + **Step 1: Fetch comments** ```bash @@ -109,12 +79,16 @@ For EACH comment: - **INVALID**: Claim does not match actual code state - **NEEDS-DISCUSSION**: Subjective or design-level feedback +This validation step is **mandatory** — there are no exceptions. Small comments deserve the same scrutiny as large ones, because small mistakes cause real bugs. + **Gate**: Every comment classified with evidence. Proceed only when gate passes. 
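The Phase 2 gate ("every comment classified with evidence") can be expressed as a small check. The sketch below is illustrative only: `Verdict`, `ReviewedComment`, and `gate_passes` are hypothetical names that do not appear in the skill, and the sample comments are invented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Verdict(Enum):
    VALID = "VALID"
    INVALID = "INVALID"
    NEEDS_DISCUSSION = "NEEDS-DISCUSSION"


@dataclass
class ReviewedComment:
    comment_id: int
    claim: str
    verdict: Optional[Verdict] = None  # None means not yet classified
    evidence: str = ""                 # what the actual code showed


def gate_passes(comments: List[ReviewedComment]) -> bool:
    """True only when every comment has a verdict and non-empty evidence."""
    return all(
        c.verdict is not None and bool(c.evidence.strip()) for c in comments
    )


# Invented sample data: the third comment has not been classified yet.
comments = [
    ReviewedComment(1, "unused import", Verdict.INVALID, "import is used on line 45"),
    ReviewedComment(2, "missing nil check", Verdict.VALID, "no check before dereference"),
    ReviewedComment(3, "rename helper"),
]
print(gate_passes(comments))
```

Requiring evidence alongside the verdict is what distinguishes validation from simply labeling comments.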
### Phase 3: SHOW FIX PLAN **Goal**: Present the plan and get user confirmation before making changes. +User control over fix scope is essential. Always show the plan first. This prevents fixing things the user disagrees with and catches validation errors before commits. + Display a structured plan: ``` @@ -141,9 +115,11 @@ Proceed with fixes? **Goal**: Apply each validated fix, showing before/after for each change. +Scope constraint: Fix exactly what was requested. "While I'm here" improvements derail PR reviews and may introduce unrelated issues. Stay focused. + For each VALID comment: 1. Read the file at the referenced location -2. Apply the minimal fix that addresses the comment +2. Apply the **minimal** fix that addresses the comment (no refactoring, no enhancements) 3. Show the before/after diff 4. Verify the fix compiles or passes basic checks @@ -155,6 +131,8 @@ For NEEDS-DISCUSSION items the user chose to address, apply the same process. **Goal**: Create a single clean commit and push to update the PR. +Commit discipline matters: Multiple small commits clutter git history and make it harder for reviewers to see what changed. Always combine all fixes into a single descriptive commit. This follows the workflow principle: single commit per logical change set. + ```bash # Stage changed files (list specific files, not -A) git add {file1} {file2} ... @@ -194,25 +172,32 @@ Remaining: --- -## Examples +## Examples and Constraints in Practice ### Example 1: Mixed Valid and Invalid Comments + User says: "/pr-fix 42" -Actions: + +**How the constraints apply:** + 1. Fetch 5 review comments on PR #42 (IDENTIFY, FETCH) -2. Validate: 3 VALID, 1 INVALID (import IS used on line 45), 1 NEEDS-DISCUSSION (VALIDATE) -3. Show plan, user confirms 3 fixes (PLAN) -4. Apply fixes, show before/after for each (FIX) -5. Single commit, push (COMMIT) -Result: 3 fixes committed, 1 invalid explained, 1 pending discussion +2. 
**Validate each claim** (core constraint: NEVER blindly fix): 3 VALID, 1 INVALID (import IS used on line 45), 1 NEEDS-DISCUSSION + - Invalid comment detected because the actual code shows the import is used. This prevents an accidental break. +3. **Show plan, get confirmation** (core constraint: NEVER apply fixes without showing plan): User reviews and confirms 3 fixes +4. **Apply minimal fixes only**: No extra improvements despite obvious refactoring opportunities +5. **Single commit** (not 3): Combines all changes and references the PR +6. Result: 3 fixes committed, 1 invalid explained, 1 pending discussion ### Example 2: All Comments Invalid + User says: "/pr-fix" -Actions: + +**How the constraints apply:** + 1. Detect PR for current branch, fetch 3 comments (IDENTIFY, FETCH) -2. Validate all 3: each claim does not match current code state (VALIDATE) -3. Report: no changes needed, explain why each is invalid (REPORT) -Result: No changes made, user informed with evidence +2. **Validate all claims**: Each claim does not match current code state (the validation step catches this) +3. **Report with evidence**: No changes made, user informed with specific reasons (URL actually returns 200, variable is already declared, etc.) +4. Result: No commits created, user learns why comments were invalid --- @@ -242,41 +227,15 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Blindly Applying All Comments -**What it looks like**: Fixing every comment without checking if claims are accurate -**Why wrong**: Reviewers make mistakes. Invalid fixes introduce bugs. -**Do instead**: Validate every claim against actual code before fixing. - -### Anti-Pattern 2: Fixing Beyond What Was Asked -**What it looks like**: "While fixing this nil check, I also refactored the whole function" -**Why wrong**: Scope creep makes the PR harder to review again and may introduce new issues. -**Do instead**: Fix exactly what the reviewer requested. Nothing more. 
- -### Anti-Pattern 3: Separate Commits Per Comment -**What it looks like**: Creating 5 commits for 5 review comments -**Why wrong**: Clutters git history. Makes it harder for reviewer to see what changed. -**Do instead**: Single commit with all fixes, descriptive message listing each change. - -### Anti-Pattern 4: Skipping the Plan Step -**What it looks like**: Immediately applying fixes without showing what will change -**Why wrong**: User loses control. May fix things they disagree with. -**Do instead**: Always show the fix plan and wait for confirmation. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization +### Related Skills +- `pr-pipeline` — For creating PRs from scratch +- `pr-review` — For reviewing code without fixing +- `systematic-debugging` — For general debugging unrelated to PR comments -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Reviewer must be right, just fix it" | Reviewers make mistakes too | Validate claim against code | -| "Small comment, no need to verify" | Small mistakes cause real bugs | Validate every comment | -| "I'll fix extra things while I'm here" | Scope creep derails PR reviews | Fix only what was requested | -| "One commit per fix is cleaner" | Multiple small commits clutter history | Single commit for all fixes | +### PR Comment Best Practices +- Always validate before fixing (this prevents introducing bugs that reviewers caught) +- Classify comments as VALID, INVALID, or NEEDS-DISCUSSION (ensures evidence-based decisions) +- Show the plan before committing (maintains user control and catches errors early) +- Single commit per fix cycle (clean git history for reviewers) diff --git 
a/skills/pr-miner/SKILL.md b/skills/pr-miner/SKILL.md index 2fc8fbfc..4d3502ba 100644 --- a/skills/pr-miner/SKILL.md +++ b/skills/pr-miner/SKILL.md @@ -28,46 +28,19 @@ routing: # PR Miner Skill -## Operator Context - -This skill operates as an operator for deterministic GitHub data extraction, configuring Claude's behavior for mining PR review comments. It implements the **Pipeline** architectural pattern — authenticate, mine, validate — with strict separation between extraction (this skill) and analysis (Code Archaeologist agent). - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Over-Engineering Prevention**: Extract only requested data. No analysis, interpretation, or pattern detection — that is the Code Archaeologist agent's job -- **Deterministic Extraction Only**: Output raw JSON. Never analyze patterns, generate rules, or interpret comments -- **Authenticate First**: Verify GitHub token before starting any mining operation -- **Rate Limit Respect**: Honor GitHub API rate limits with exponential backoff - -### Default Behaviors (ON unless disabled) -- **Merged PRs Only**: Focus on merged PRs representing accepted standards -- **Imperative Filtering**: Filter for imperative language keywords (should, must, avoid, prefer) -- **Progress Reporting**: Display mining progress during long operations -- **Temporary File Cleanup**: Remove partial JSON and temp files at completion; keep only final output -- **Summary Output**: Report interaction count and reviewer distribution after mining - -### Optional Behaviors (OFF unless enabled) -- **All Comments Mode**: Capture every review comment regardless of language (`--all-comments`) -- **Reviewer Filter**: Focus on comments from particular reviewers (`--reviewer`) -- **Date Range**: Limit mining to specific time periods (`--since`/`--until`) -- **Multi-Repo**: Mine across multiple repositories in single operation - -## What This Skill CAN Do 
-- Extract review comments with code context (before/after) from merged PRs -- Track resolution status per comment (changed, resolved, dismissed, unresolved) -- Filter by imperative language keywords or capture all comments -- Mine multiple repositories in a single operation -- Respect GitHub API rate limits with retry logic -- Generate structured JSON output for downstream analysis -- Validate GitHub authentication before mining - -## What This Skill CANNOT Do -- Analyze patterns or generate rules (Code Archaeologist agent's job) -- Interpret comment meaning or intent (pure extraction only) -- Create enforcement rules (no Semgrep/golangci-lint generation) -- Mine private repos without proper token permissions (requires `repo` scope) -- Process non-GitHub platforms (GitHub-specific implementation) -- Monitor PRs in real-time (snapshot-based mining only) +## Overview + +This skill extracts raw review comment data from GitHub pull requests in three deterministic phases: AUTHENTICATE (verify token and repo access), MINE (extract raw JSON with code context), and VALIDATE (confirm output quality). + +The skill operates with strict separation of concerns: +- **This skill**: Extracts raw data only. No analysis, interpretation, or pattern detection. +- **Code Archaeologist agent**: Analyzes patterns, generates rules, interprets comments. + +Key principles: +- **Over-engineering prevention**: Extract only what was requested. Do not analyze, interpret, or detect patterns. +- **Deterministic extraction only**: Output raw JSON. No speculation about intent or meaning. +- **Authenticate first**: Verify GitHub token before any mining operation. +- **Rate limit respect**: Honor GitHub API rate limits with exponential backoff and monitoring. --- @@ -85,6 +58,8 @@ python3 ~/.claude/scripts/miner.py --check-auth Confirm output shows valid authentication with `repo` scope. +Constraint: This token will be reused across sessions. 
Verify it every run because tokens expire and permissions change. Do not assume "it worked last time." + **Step 2: Validate target repositories** Confirm each target repository: @@ -98,7 +73,7 @@ Confirm each target repository: gh api rate_limit --jq '.resources.core | "Remaining: \(.remaining)/\(.limit), Resets: \(.reset)"' ``` -Ensure sufficient API calls remain for the planned mining scope (estimate 3-5 calls per PR). +Ensure sufficient API calls remain for the planned mining scope (estimate 3-5 calls per PR). If approaching limit, wait for reset or reduce scope. **Gate**: Token is valid, repositories are accessible, rate limits are sufficient. Proceed only when gate passes. @@ -106,6 +81,12 @@ Ensure sufficient API calls remain for the planned mining scope (estimate 3-5 ca **Goal**: Extract raw review comment data with code context. +**Constraints during mining**: +- Focus on merged PRs only. Merged PRs represent accepted standards; drafts and abandoned work are noise. +- Apply imperative filtering by default. Filter for keywords like "should", "must", "avoid", "prefer" because they signal actionable feedback. This reduces noise from discussion and off-topic comments. +- Display mining progress during long operations so you know the process is running and can estimate completion time. +- Never analyze patterns, generate rules, or interpret comments during mining. Your job is extraction only. + **Step 1: Determine scope** Choose mining parameters based on the task: @@ -113,7 +94,7 @@ Choose mining parameters based on the task: - **Multi-repo**: `python3 ~/.claude/scripts/miner.py org/repo-a,org/repo-b output.json --limit 50` - **Filtered**: Add `--reviewer name`, `--since date`, or `--all-comments` -Start with 50 PRs. Expand only after validating output quality. +Start with 50 PRs. Expand only after validating output quality. Large limits (1000+) burn rate limits, return outdated standards, and produce unwieldy files. 
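The "3-5 calls per PR" estimate turns into simple arithmetic when deciding whether a planned scope fits the remaining quota. A rough sketch follows; `api_call_budget` and `fits_in_quota` are hypothetical helpers, and it assumes `--limit` applies per repository, which the source does not state.

```python
from typing import Tuple

CALLS_PER_PR_LOW, CALLS_PER_PR_HIGH = 3, 5  # estimate quoted in Phase 1


def api_call_budget(pr_limit: int, repo_count: int = 1) -> Tuple[int, int]:
    """Return the (low, high) estimate of API calls for a mining run."""
    prs = pr_limit * repo_count  # assumption: the limit is applied per repo
    return prs * CALLS_PER_PR_LOW, prs * CALLS_PER_PR_HIGH


def fits_in_quota(pr_limit: int, remaining: int, repo_count: int = 1) -> bool:
    """Conservative check: the high estimate must fit in the remaining quota."""
    _, high = api_call_budget(pr_limit, repo_count)
    return high <= remaining


print(api_call_budget(50))      # the recommended starting scope
print(api_call_budget(1000))    # a large limit
```

With these numbers, 50 PRs cost an estimated 150 to 250 calls, while 1000 PRs cost 3000 to 5000, which is why large limits are discouraged.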
**Step 2: Execute mining** @@ -122,9 +103,9 @@ python3 ~/.claude/scripts/miner.py mined_data/.json --limit ``` Monitor progress output. Watch for: -- Rate limit warnings -- Authentication errors -- Empty PR responses (may indicate bot-only reviews) +- Rate limit warnings (adjust scope if needed) +- Authentication errors (re-verify token) +- Empty PR responses (may indicate bot-only reviews; consider adjusting repo or time range) **Step 3: Verify extraction** @@ -169,7 +150,7 @@ Verify: **Step 4: Clean up temporary files** -Remove any partial JSON, debug logs, or temp files created during mining. Keep only the final output in `mined_data/`. +Remove any partial JSON, debug logs, or temp files created during mining. Keep only the final output in `mined_data/`. This prevents stale data from being accidentally reused. **Gate**: Validation passes. Data quality is sufficient for downstream analysis. Mining is complete. @@ -277,45 +258,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Analyzing During Mining -**What it looks like**: "I mined the data and found 5 key patterns: always use errors.Is()..." -**Why wrong**: This skill extracts raw data. Pattern analysis is the Code Archaeologist agent's job. Mixing extraction with interpretation creates unreliable, non-deterministic output. -**Do instead**: Mine data, validate output, hand off JSON to Code Archaeologist. - -### Anti-Pattern 2: Mining Without Authentication Check -**What it looks like**: Running `miner.py` immediately, failing 10 minutes later on "Bad credentials" -**Why wrong**: Wastes time and API rate limits. No early validation of token permissions. -**Do instead**: Complete Phase 1 (AUTHENTICATE) before any mining. - -### Anti-Pattern 3: Mining Entire Repository History -**What it looks like**: `--limit 10000` to get "everything" -**Why wrong**: Extremely slow, burns rate limits, old PRs reflect outdated standards, massive output files are hard to process. 
-**Do instead**: Start with `--limit 50 --since <6-months-ago>`. Expand only after validating output quality. - -### Anti-Pattern 4: Skipping Output Validation -**What it looks like**: Mining completes, immediately passing output to Code Archaeologist without checking -**Why wrong**: May contain zero useful interactions, incomplete data from API errors, or bot-generated noise. Garbage in, garbage out. -**Do instead**: Complete Phase 3 (VALIDATE). Spot-check interactions, verify counts, review distribution. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Token worked last time" | Tokens expire, permissions change | Run `--check-auth` every session | -| "50 PRs is enough" | Depends on review density | Validate interaction count before proceeding | -| "I can summarize the patterns" | Extraction skill, not analysis skill | Output raw JSON only | -| "All comments mode wastes time" | Imperative filter may miss valuable feedback | Consider `--all-comments` for first run | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/imperative-keywords.txt`: Full list of detected imperative keywords - `${CLAUDE_SKILL_DIR}/references/examples.md`: Real-world mining examples and expected output diff --git a/skills/pr-mining-coordinator/SKILL.md b/skills/pr-mining-coordinator/SKILL.md index 7aa22c6d..95f46735 100644 --- a/skills/pr-mining-coordinator/SKILL.md +++ b/skills/pr-mining-coordinator/SKILL.md @@ -27,45 +27,9 @@ routing: # PR Mining Coordinator Skill -## Operator Context - -This skill operates as an operator for PR mining coordination workflows, configuring Claude's behavior for background 
job management and tribal knowledge extraction. It implements the **Pipeline** architectural pattern -- Validate, Mine, Verify, Generate, Report -- with **Domain Intelligence** embedded in the mining methodology. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Over-Engineering Prevention**: Only implement what's directly requested. No speculative features -- **Background Execution**: Mining jobs always run in background with `&` -- **GitHub Token from Keychain**: Uses `security find-internet-password -s github.com -w` -- **Process Tracking**: Always store and monitor background job PIDs -- **Sequential by Default**: Run mining jobs one at a time to avoid API rate limits - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show command output, not descriptions -- **Temporary File Cleanup**: Remove coordination files and debug outputs at completion. 
Keep only mining results (JSON) and generated rules (markdown) -- **Progress Reporting**: Show mining job progress every 30-60 seconds -- **Auto Rules Generation**: Generate categorized markdown rules after successful mining -- **Error Detection**: Monitor for API rate limits, auth failures, empty results -- **Confidence Scoring**: Calculate HIGH/MEDIUM/LOW confidence for patterns - -### Optional Behaviors (OFF unless enabled) -- **Concurrent Mining**: Run multiple repos simultaneously (risk: rate limits) -- **Historical Analysis**: Mine specific date ranges with --since/--until flags -- **All Comments Mode**: Use --all-comments for senior reviewers (default: imperative only) -- **Cross-Repo Merging**: Combine patterns from multiple mining results into unified rules - -## What This Skill CAN Do -- Coordinate background PR mining jobs with the pr-miner tool -- Track running jobs and report progress to user -- Generate categorized coding rules documents from mined data -- Calculate pattern confidence from occurrence frequency -- Handle API rate limits, auth failures, and empty result sets - -## What This Skill CANNOT Do -- Mine without a valid GitHub token -- Run multiple mining jobs in parallel (sequential by default) -- Perform code review (use code-review skill instead) -- Write coding standards from scratch without PR data -- Skip prerequisite validation or result verification +## Overview + +This skill coordinates PR mining workflows to extract tribal knowledge and coding standards from GitHub PR review history. It implements a five-phase pipeline (Validate, Mine, Verify, Generate, Report) that manages background mining jobs, generates confidence-scored coding rules, and prevents common pitfalls like username errors and API rate limiting. --- @@ -73,7 +37,7 @@ This skill operates as an operator for PR mining coordination workflows, configu ### Phase 1: VALIDATE -**Goal**: Confirm prerequisites before starting any mining operation. 
+**Goal**: Confirm all prerequisites before starting mining. **Step 1: Check miner script exists** @@ -83,6 +47,8 @@ fish -c "ls ~/.claude/skills/pr-miner/scripts/miner.py" Expected: File exists at path. +**Constraint**: Never skip this step. Miner script must exist before mining can run. + **Step 2: Verify GitHub token** ```bash @@ -91,6 +57,8 @@ fish -c "security find-internet-password -s github.com -w 2>/dev/null" Expected: Token printed (ghp_...). +**Constraint**: Always extract token from keychain using `security find-internet-password -s github.com -w`. Never hardcode or accept tokens from user input. If empty, user must add token with `security add-internet-password`. + **Step 3: Verify reviewer username (if filtering by reviewer)** ```bash @@ -99,11 +67,13 @@ fish -c "gh pr list --repo {org/repo} --search 'reviewed-by:{username}' --limit Expected: PR results confirm username is valid and active. -**Gate**: Miner script exists, token available, reviewer verified. Proceed only when gate passes. +**Constraint**: Username verification is MANDATORY when user specifies --reviewer flag. Silently wrong usernames cause 0 interactions after 5+ minutes of wasted API quota. Verify before mining, not after. (Anti-pattern #1) + +**Gate**: Miner script exists, token available, reviewer verified if applicable. Proceed only when gate passes. ### Phase 2: MINE -**Goal**: Execute mining job in background and track progress. +**Goal**: Execute mining job in background, track progress, avoid rate limit exhaustion. **Step 1: Start mining job** @@ -117,19 +87,27 @@ fish -c "set -x GITHUB_TOKEN (security find-internet-password -s github.com -w 2 See `references/mining-commands.md` for full command patterns and flag reference. +**Constraint - Background Execution**: Always run mining with `&` (ampersand suffix) to background the job. Never block on mining operations. Capture and store the background job PID for tracking. 
+ +**Constraint - GitHub Token Source**: Always extract token from keychain inline with `security find-internet-password -s github.com -w`. Export as GITHUB_TOKEN environment variable before calling miner.py. No other token sources are acceptable. + **Step 2: Track progress** -Monitor background job with BashOutput tool. Check every 30-60 seconds. Report progress to user. +Monitor background job with BashOutput tool. Check every 30-60 seconds. Report progress to user showing: repos scanned, PRs processed, interactions extracted so far. + +**Constraint - Progress Reporting**: Show command output, not marketing descriptions. Real numbers (e.g., "150 PRs scanned, 42 interactions extracted") beat "Mining is progressing nicely." **Step 3: Handle multiple repos** -Run jobs sequentially. Wait for each to complete before starting next. +Run jobs sequentially. Wait for each to complete before starting next. Never start new job until previous finishes. + +**Constraint - Sequential by Default**: Running multiple mining jobs in parallel exhausts the 5000 requests/hour API quota faster than you can track which job caused the failure. Sequential mining prevents rate limit cascades and makes attribution clear. (Anti-pattern #2) Only enable concurrency if explicitly requested AND user understands rate limit risk. -**Gate**: Mining job completes with non-zero interaction count. Proceed only when gate passes. +**Gate**: Mining job completes with non-zero interaction count. If job exits with 0 interactions, see Error Handling "0 interactions found" section. ### Phase 3: VERIFY -**Goal**: Confirm mining output is valid and contains usable data. +**Goal**: Confirm mining output is valid, contains interactions with usable data, and failed mining exits cleanly. **Step 1: Check output file exists and has content** @@ -162,40 +140,46 @@ Confirm JSON matches expected schema: } ``` -If `interaction_count` is 0, do not proceed -- see Error Handling for "0 interactions found". 
+**Constraint**: If `interaction_count` is 0, do NOT proceed to Phase 4. Instead, check Error Handling section "0 interactions found" for diagnosis. Common causes: wrong reviewer username (should have caught in Phase 1), no PR activity in date range, or repo has no review comments (only approvals). **Step 3: Check interaction quality** -Verify interactions have: pr_number, pr_title, comment text, and ideally code_before/code_after pairs. Interactions without code pairs can still produce rules but lack concrete examples. +Verify interactions have: pr_number, pr_title, comment text. Code pairs (code_before/code_after) are strongly preferred but not mandatory. Interactions without code pairs can still produce rules but are lower value. + +**Constraint - Prevent Flat Dumps**: Do not proceed to Phase 4 without checking that `interaction_count > 0`. Attempting to generate rules from empty results wastes time and produces nothing usable. Empty results signal a problem to diagnose, not a success to report. -**Gate**: Output JSON is valid, contains interactions with usable data. Proceed only when gate passes. +**Gate**: Output JSON is valid, interaction_count > 0, interactions have required fields. Proceed only when gate passes. ### Phase 4: GENERATE -**Goal**: Produce categorized coding rules document from mined data. +**Goal**: Produce categorized, confidence-scored markdown rules from mined data. **Step 1: Load and categorize patterns** -Read mined JSON. Categorize interactions by topic using standard categories from `references/pattern-categories.md`. +Read mined JSON. Group interactions by topic using standard categories from `references/pattern-categories.md`. Example categories: Error Handling, Testing, API Design, Concurrency, Performance, Naming, Documentation, Security, Refactoring, Tooling. + +**Constraint - Mandatory Categorization**: Do NOT generate a flat numbered list of 50 patterns. Flat lists are overwhelming, unscannable, and lose priority context. 
Organize by topic, then by confidence within topic. (Anti-pattern #3) **Step 2: Score confidence** +Calculate confidence from occurrence frequency and reviewer seniority: + | Level | Criteria | Action | |-------|----------|--------| -| HIGH | 5+ occurrences from senior reviewers | Include as standard practice | +| HIGH | 5+ occurrences (especially from senior reviewers) | Include as standard practice | | MEDIUM | 2-4 occurrences | Include with context caveats | -| LOW | Single occurrence | Place in "Additional Observations" | +| LOW | Single occurrence | Place in "Additional Observations" section | **Step 3: Generate markdown rules document** -Follow this structure for each pattern entry: +Follow this structure for each pattern: ```markdown ## {Category Name} ### {Pattern Name} ({CONFIDENCE} confidence) -**Pattern**: {Brief description} +**Pattern**: {Brief description of the rule} **Good**: \`\`\`{lang} @@ -211,7 +195,7 @@ Follow this structure for each pattern entry: "{comment_text}" ``` -Order categories by total pattern count (most patterns first). Within each category, sort HIGH before MEDIUM before LOW. +**Constraint - Ordering for Usability**: Sort categories by pattern count (descending: most patterns first). Within each category, sort patterns HIGH → MEDIUM → LOW confidence. Users scan from top, so high-confidence patterns must come first to maximize scanning efficiency. **Step 4: Save rules** @@ -219,44 +203,27 @@ Order categories by total pattern count (most patterns first). Within each categ fish -c "cat > ~/.claude/skills/pr-miner/rules/{repos}_coding_rules.md" ``` -**Gate**: Rules document is categorized, confidence-scored, and saved to disk. +**Constraint - Cleanup Behavior**: After saving rules to disk, remove temporary coordination files (PIDs, debug logs, intermediate JSON). Keep ONLY the final mining result JSON (for future reference) and generated rules markdown (for user consumption). 
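The scoring table and ordering rule from Phase 4 can be sketched together. This is an illustration, not the skill's implementation: `confidence` and `order_rules` are hypothetical names, reviewer seniority is not modeled (only occurrence counts), ties are broken alphabetically as a choice of this sketch, and the sample patterns are invented.

```python
from collections import defaultdict

RANK = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def confidence(occurrences: int) -> str:
    """Map an occurrence count to the level used in the scoring table."""
    if occurrences >= 5:
        return "HIGH"
    if occurrences >= 2:
        return "MEDIUM"
    return "LOW"


def order_rules(patterns):
    """patterns: iterable of (category, pattern_name, occurrences) tuples."""
    by_category = defaultdict(list)
    for category, name, occurrences in patterns:
        by_category[category].append((name, confidence(occurrences), occurrences))
    # Categories with the most patterns come first.
    ordered = sorted(by_category.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    # Within a category: HIGH, then MEDIUM, then LOW.
    return [
        (category, sorted(items, key=lambda it: (RANK[it[1]], -it[2], it[0])))
        for category, items in ordered
    ]


# Invented sample data for illustration.
sample = [
    ("Testing", "table-driven tests", 3),
    ("Error Handling", "wrap errors", 6),
    ("Error Handling", "use errors.Is", 1),
    ("Naming", "no stutter", 2),
]
for category, items in order_rules(sample):
    print(category, items)
```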
+ +**Gate**: Rules document is categorized, confidence-scored, saved to disk, and temporary files cleaned. ### Phase 5: REPORT -**Goal**: Deliver comprehensive results to user. +**Goal**: Deliver actionable results with all necessary context. Provide: - PRs analyzed count - Interactions extracted count -- File paths for mined data and generated rules -- Top HIGH confidence patterns with occurrence counts -- Summary of MEDIUM and LOW confidence pattern counts +- File path to mined data JSON +- File path to generated rules markdown +- List of top 3-5 HIGH confidence patterns with occurrence counts +- Summary of MEDIUM and LOW confidence pattern distribution -**Gate**: User has all information needed to act on mining results. +**Constraint - Report Clarity**: Show actual numbers and paths, not generic summaries. Example good report: "Analyzed 150 PRs, extracted 42 interactions. HIGH confidence (12 patterns): Error handling (5), Testing (4), Naming (3). MEDIUM confidence: 18 patterns. LOW confidence: 12 patterns. Rules: ~/.claude/skills/pr-miner/rules/myrepo_coding_rules.md" ---- +**Constraint - Communication Style**: Report facts without self-congratulation. Show what happened and where the output is. Avoid "Mining went great!" — instead say "Mined 42 interactions from 150 PRs." -## Examples - -### Example 1: Mine Specific Reviewer -User says: "Mine senior-reviewer's patterns from go-libs" -Actions: -1. Verify miner, token, and reviewer username (VALIDATE) -2. Run mining with --reviewer and --all-comments flags (MINE) -3. Check output JSON for valid interactions (VERIFY) -4. Categorize patterns and generate rules markdown (GENERATE) -5. Report top patterns and file locations (REPORT) -Result: Categorized coding rules with confidence scores - -### Example 2: Team Standards Extraction -User says: "Get coding standards from service-a and service-b" -Actions: -1. Verify miner and token, no reviewer to verify (VALIDATE) -2. 
Run mining without --reviewer to capture all reviewers (MINE) -3. Confirm output has interactions from multiple reviewers (VERIFY) -4. Generate team-wide rules document (GENERATE) -5. Report findings with reviewer distribution (REPORT) -Result: Team-wide coding rules across both repositories +**Gate**: User has file paths, pattern counts, and top patterns. They can immediately act on the rules markdown. --- @@ -294,34 +261,20 @@ Solution: 3. If stuck: check for network issues or API downtime 4. For future runs: reduce --limit to smaller batches ---- - -## Anti-Patterns - -### Anti-Pattern 1: Mining Without Verifying Reviewer Username -**What it looks like**: Running `--reviewer senior-reviewer` without checking the actual GitHub username -**Why wrong**: Job completes successfully with 0 interactions. Wastes API quota and 5-10 minutes. Username errors are silent. -**Do instead**: Verify username with `gh pr list --search 'reviewed-by:{username}'` before mining. - -### Anti-Pattern 2: Running Multiple Mining Jobs in Parallel -**What it looks like**: Starting 3+ mining jobs simultaneously to save time -**Why wrong**: Exhausts 5000 requests/hour rate limit across all jobs. Later jobs fail mid-execution. Cannot track which job consumed quota. -**Do instead**: Run jobs sequentially. Wait for each to complete before starting the next. - -### Anti-Pattern 3: Generating Flat Rules Without Categorization -**What it looks like**: A numbered list of 50 patterns with no organization or confidence scoring -**Why wrong**: Overwhelming to read. No way to find relevant patterns. Loses priority context. -**Do instead**: Categorize by topic (Error Handling, Testing, API Design, etc.) and sort by confidence level within each category. See `references/pattern-categories.md`. 
- -### Anti-Pattern 4: Skipping --all-comments for Senior Reviewers -**What it looks like**: Mining a senior reviewer without the --all-comments flag and getting 0-2 interactions -**Why wrong**: Senior reviewers use questions ("Why not use errors.Is here?") and suggestions instead of imperatives. Default mode misses the majority of their feedback. -**Do instead**: Always use `--all-comments` when mining senior or experienced reviewers. +### Error: "Senior reviewer returns 0-2 interactions" +Cause: Missing --all-comments flag when mining senior reviewers +Solution: +1. Senior reviewers often use questions and suggestions instead of imperative statements +2. Default mining mode captures only imperative comments +3. Re-run with `--all-comments` flag to capture all comment types +4. For future runs: always use `--all-comments` when mining experienced reviewers (Anti-pattern #4) -### Anti-Pattern 5: Testing Multi-Repo Mining Without Single-Repo Validation -**What it looks like**: Mining 5 repos at once on the first attempt without verifying any individually -**Why wrong**: If any repo has access issues, entire job fails after minutes of wasted time. Cannot determine which repo caused failure. -**Do instead**: Test with a single repo and `--limit 10` first. Expand incrementally after confirming access. +### Error: "Multi-repo mining fails partway through" +Cause: Running 5+ repos in parallel, early jobs exhaust rate limits, later jobs fail +Solution: +1. Check remaining rate quota with `gh rate-limit` +2. If critically low (<150 remaining): wait for reset before retrying +3. For future runs: test with a single repo and `--limit 10` first. Expand incrementally after confirming access works. 
(Anti-pattern #5) --- @@ -331,12 +284,3 @@ This skill uses these reference files: - `${CLAUDE_SKILL_DIR}/references/mining-commands.md`: Command patterns, flag reference, output naming conventions - `${CLAUDE_SKILL_DIR}/references/pattern-categories.md`: Standard categories for coding rules (10 categories with examples) - `${CLAUDE_SKILL_DIR}/references/reviewer-usernames.md`: Known GitHub usernames and verification methods - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Username is probably right" | Probably = 0 interactions after 5 min | Verify with gh pr list first | -| "Parallel mining saves time" | Saves nothing when rate limit kills jobs | Run sequentially | -| "Just dump all patterns" | Flat lists are unusable at 50+ items | Categorize and score confidence | -| "Low limit is enough" | Small samples produce low-confidence rules | Use --limit 100+ for meaningful patterns | diff --git a/skills/pr-review-address-feedback/SKILL.md b/skills/pr-review-address-feedback/SKILL.md index da7f2aa3..8d7e8447 100644 --- a/skills/pr-review-address-feedback/SKILL.md +++ b/skills/pr-review-address-feedback/SKILL.md @@ -26,43 +26,22 @@ routing: # PR Review Address Feedback Skill -## Operator Context - -This skill operates as an operator for PR feedback processing, configuring Claude's behavior for rigorous validation of reviewer comments before acting on them. It implements a **Validate-Then-Act** pattern -- fetch all feedback, verify each claim independently, summarize findings, then fix only validated issues. 
- -### Hardcoded Behaviors (Always Apply) -- **Fetch All Sources**: Always query reviews, inline comments, AND issue comments -- never rely on a single endpoint -- **Validate Before Acting**: NEVER start fixing code until every comment has been independently verified -- **Evidence Over Trust**: Reviewer claims are treated as hypotheses, not facts -- verify with code, tests, or HTTP requests -- **Summary Table Required**: Always present a validation summary table before making any changes -- **Phase Gates Enforced**: Each phase must complete before the next begins -- no skipping - -### Default Behaviors (ON unless disabled) -- **Auto-Detect PR Context**: Infer owner/repo/PR number from current git branch and remote when not provided -- **Structured Validation Output**: Use the validation entry format for each comment -- **Classify Verdicts**: Every comment gets VALID, INVALID, or NEEDS-DISCUSSION -- **Show Verification Work**: Display evidence for each verdict, not just the conclusion -- **Fix Only VALID Issues**: Skip INVALID comments, escalate NEEDS-DISCUSSION to user - -### Optional Behaviors (OFF unless enabled) -- **Post Reply Comments**: Reply to individual review comments on GitHub with findings -- **Request Re-Review**: Automatically request re-review after pushing fixes -- **Commit Per Finding**: Create separate commits for each validated fix -- **Diff Verification**: Run `git diff` after fixes to confirm only intended changes - -## What This Skill CAN Do -- Fetch and consolidate PR feedback from all three GitHub comment sources -- Independently verify each reviewer claim against the actual codebase -- Classify comments as VALID, INVALID, or NEEDS-DISCUSSION with evidence -- Fix validated issues after presenting the summary table -- Detect reviewer claims that lack supporting evidence - -## What This Skill CANNOT Do -- Fix issues without completing the validation phase first -- Trust reviewer claims without independent verification -- Skip the summary table 
and proceed directly to fixes -- Perform general code review (use systematic-code-review instead) -- Create or submit new pull requests (use pr-pipeline instead) +## Overview + +This skill processes PR feedback by implementing a **Validate-Then-Act** pattern: fetch all comments from every GitHub source, independently verify each claim, present findings in a summary table, then fix only validated issues. The pattern is mandatory because: +- Reviewer claims are hypotheses, not facts (verify with code/tests/HTTP requests) +- Single comment sources miss feedback (reviews, inline comments, and issue comments are separate endpoints) +- Unverified fixes waste time and introduce bugs (validation phase gates code changes) + +### Key Principles + +**Fetch Comprehensively**: Always query all three endpoints — reviews, inline comments, AND issue comments. Claude and other tools comment via `/issues/{pr}/comments`, which is a separate endpoint. Missing any source means missing feedback. + +**Verify Independently**: Reviewer claims must be tested using the verification tests table. Reviewers often miss indirect usage, re-exports, or dynamic references. Trust hierarchy (highest to lowest): running code/tests → HTTP requests → grep/search → reading source → reviewer's word. + +**Classify Honestly**: Classify comments as VALID, INVALID, or NEEDS-DISCUSSION based on evidence. Never downgrade NEEDS-DISCUSSION to VALID to avoid asking the user — acting on ambiguous feedback without user input may fix the wrong thing. + +**Present Summary Before Fixes**: Complete all validations and present the full summary table before making any code changes. User loses the big picture if you validate and fix incrementally, and NEEDS-DISCUSSION items may change approach to VALID ones. --- @@ -96,7 +75,7 @@ gh api repos/{owner}/{repo}/pulls/{pr_number}/comments gh api repos/{owner}/{repo}/issues/{pr_number}/comments ``` -All three comment sources must be fetched and combined. 
Treat all feedback identically regardless of submission method: +All three comment sources must be fetched and combined because feedback arrives through different endpoints. Treat all feedback identically regardless of submission method: - **Formal reviews** -- submitted via GitHub's "Review changes" button - **Inline code comments** -- added directly to specific code lines - **Issue/PR comments** -- general comments on the PR conversation @@ -140,7 +119,7 @@ Reason: [one sentence explanation] **Step 3: Flag automatic INVALID patterns** -Flag as likely INVALID if reviewer: +Flag as likely INVALID because evidence is required for all claims: - Claims code behavior without showing evidence - Says "future date" without checking if URL works - Claims unused without searching codebase @@ -219,7 +198,7 @@ LEARNED: [key] → [topic] [one-line value] ``` -**Quality gate**: Only record specific, actionable findings. "Reviewers are sometimes wrong" is NOT worth recording. "Reviewers flag fmt.Errorf wrapping but the org convention is errors.New" IS worth recording. +Only record specific, actionable findings because "Reviewers are sometimes wrong" provides no guidance. "Reviewers flag fmt.Errorf wrapping but the org convention is errors.New" IS worth recording as a real org standard. **Gate**: Learning recorded (or skipped if nothing reusable). Review complete. @@ -286,48 +265,9 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Fixing Before Validating -**What it looks like**: Reading reviewer comments and immediately editing code -**Why wrong**: Reviewer may be wrong. Fixing unverified claims wastes time and can introduce bugs. -**Do instead**: Complete Phase 2 validation for every comment before touching any code. - -### Anti-Pattern 2: Trusting Claims Without Evidence -**What it looks like**: Accepting "this import is unused" without grepping for usage -**Why wrong**: Reviewers often miss indirect usage, re-exports, or dynamic references. 
-**Do instead**: Independently verify each claim using the verification tests table. - -### Anti-Pattern 3: Skipping the Summary Table -**What it looks like**: Validating comments one at a time and fixing each immediately -**Why wrong**: User loses the big picture. NEEDS-DISCUSSION items may change the approach to VALID ones. -**Do instead**: Complete all validations, present the full summary table, then proceed to fixes. - -### Anti-Pattern 4: Fetching Only One Comment Source -**What it looks like**: Checking only `/pulls/{pr}/reviews` and missing inline or issue comments -**Why wrong**: Claude and other tools often comment via `/issues/{pr}/comments`, which is a separate endpoint. Missing any source means missing feedback. -**Do instead**: Always fetch from all three endpoints: reviews, pull comments, and issue comments. - -### Anti-Pattern 5: Downgrading NEEDS-DISCUSSION to VALID -**What it looks like**: Marking ambiguous feedback as VALID to avoid asking the user -**Why wrong**: Acting on ambiguous feedback without user input may fix the wrong thing or introduce unintended behavior. -**Do instead**: Classify honestly. Present NEEDS-DISCUSSION items to the user for their decision. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization: Code Review](../shared-patterns/anti-rationalization-review.md) - Review-specific rationalization prevention -- [Anti-Rationalization Core](../shared-patterns/anti-rationalization-core.md) - Universal anti-rationalization patterns -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) - Phase transition rules -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|---|---|---| -| "Reviewer is probably right" | Probably is not verified | Test the claim independently | -| "Small comment, just fix it" | Small fixes can break things | Validate first, even for trivial claims | -| "Already validated similar comment" | Each comment has unique context | Validate each comment individually | -| "No time for the summary table" | Summary prevents premature action | Always present the table before fixes | +Trust hierarchy, validation checklist patterns, and phase-gate enforcement are derived from: +- Anti-Rationalization principles (verify claims independently, never skip phases for time) +- Code review rigor standards (validate before touching code, classify honestly) +- Evidence-based troubleshooting (highest-trust sources first, lowest-trust sources last) diff --git a/skills/pr-status/SKILL.md b/skills/pr-status/SKILL.md index 9a61d90b..1deb46f4 100644 --- a/skills/pr-status/SKILL.md +++ b/skills/pr-status/SKILL.md @@ -22,67 +22,37 @@ routing: # PR Status Skill -## Operator Context - -This skill operates as an operator for PR status checks, configuring Claude's behavior for fast, accurate branch and PR state reporting. It implements a **Sequential Gather** pattern -- collect git state, PR metadata, CI status, reviews, and merge readiness in ordered steps, then present a unified status report. 
- -### Hardcoded Behaviors (Always Apply) -- **Read-Only**: Never modify files, branches, or PR state during status checks -- **Fetch Before Report**: Always `git fetch` before comparing local/remote state -- **gh CLI Required**: Verify `gh` is installed and authenticated before PR queries -- **Complete Report**: Show all available sections; never skip sections silently -- **Current Branch Context**: Always report which branch is being checked - -### Default Behaviors (ON unless disabled) -- **CI Detail**: Show individual check names with pass/fail status -- **Review Summary**: Show reviewer names with their review state -- **Merge Readiness**: Report conflicts, required approvals, and blocking checks -- **Behind-Main Count**: Show how many commits behind the base branch -- **Claude Review Detection**: Check for Claude Code review comments on the PR - -### Optional Behaviors (OFF unless enabled) -- **Recent PRs on Main**: When on main branch, list recent PRs -- **Verbose CI Output**: Show full CI log URLs for failed checks -- **Cross-Repository Status**: Check PRs across multiple remotes - -## What This Skill CAN Do -- Report current branch name and its relationship to base branch -- Show local working tree state (clean, modified files, ahead/behind) -- Display PR metadata (number, title, state, URL) -- List CI check results with pass/fail per check -- Summarize review states per reviewer -- Report merge readiness (conflicts, approvals needed) -- Detect Claude Code review activity - -## What This Skill CANNOT Do -- Create or update pull requests (use pr-sync instead) -- Fix review comments or CI failures (use pr-fix instead) -- Push commits or modify git state -- Perform code review (use /pr-review instead) +Collect git state, PR metadata, CI status, reviews, and merge readiness in ordered steps, then present a unified status report. 
This is a **read-only** skill -- it never modifies files, branches, or PR state because status checks that mutate state create surprising side effects and belong to other skills (pr-sync, pr-fix). --- ## Instructions -### Step 0: Check Prerequisites +### Phase 1: Prerequisites -Verify `gh` CLI is available and authenticated. If not, report the missing prerequisite and stop. +Verify `gh` CLI is available and authenticated before any PR queries. Without `gh`, PR metadata, CI checks, and review data are all inaccessible, so there is no point proceeding with a partial report. ```bash command -v gh &> /dev/null || { echo "GitHub CLI (gh) not installed. Install: https://cli.github.com/"; exit 1; } gh auth status &> /dev/null || { echo "GitHub CLI not authenticated. Run: gh auth login"; exit 1; } ``` -### Step 1: Get Branch Info +> **Gate**: If either check fails, report the specific missing prerequisite with the installation/auth URL and stop. Do not attempt partial status without `gh`. + +### Phase 2: Branch Identification + +Always report which branch is being checked -- users often forget what branch they are on, and every subsequent step depends on correct branch context. ```bash CURRENT_BRANCH=$(git branch --show-current) MAIN_BRANCH=$(git symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's@^refs/remotes/origin/@@' || echo "master") ``` -If on the main branch, report that and optionally list recent PRs. No further steps needed. +If on the main branch, report that and optionally list recent PRs. No further phases needed. + +### Phase 3: Local State -### Step 2: Check Local State +Fetch before comparing local and remote state. Stale local refs produce inaccurate ahead/behind counts that mislead the user into thinking their branch is up to date when it is not. 
```bash # Working tree status @@ -94,14 +64,18 @@ else echo "$FILE_COUNT files modified" fi -# Fetch and compare +# Fetch first -- comparing against stale refs gives wrong counts git fetch origin --quiet AHEAD=$(git rev-list origin/$CURRENT_BRANCH..$CURRENT_BRANCH --count 2>/dev/null || echo "0") BEHIND=$(git rev-list $CURRENT_BRANCH..origin/$CURRENT_BRANCH --count 2>/dev/null || echo "0") BEHIND_MAIN=$(git rev-list $CURRENT_BRANCH..origin/$MAIN_BRANCH --count 2>/dev/null || echo "0") ``` -### Step 3: Check PR Status +Show how many commits the branch is behind the base branch -- this surfaces rebase needs early. + +> **Constraint**: If the branch has never been pushed, there will be no remote tracking branch. In that case, report local branch state (modified files, commits) and suggest pushing with `git push -u origin [branch]`. + +### Phase 4: PR Metadata ```bash PR_JSON=$(gh pr view --json number,title,state,url,reviewDecision,statusCheckRollup,mergeable,reviews 2>/dev/null) @@ -116,15 +90,23 @@ else fi ``` -If no PR exists, report the branch state and suggest creating one. Skip Steps 4-7. +If no PR exists, report the branch state and suggest creating one. Skip Phases 5-7. -### Step 4: Check CI Status +> **Constraint**: If `gh pr view` fails, distinguish "no pull requests found" from a network/auth error. For network/auth errors, report the error and suggest `gh auth status`. Never invent a status when a command fails -- false status is worse than no status. + +### Phase 5: CI Status + +Show individual check names with pass/fail status so the user can see exactly which checks need attention. ```bash gh pr checks --json name,state,conclusion 2>/dev/null | jq -r '.[] | "\(.conclusion // .state) \(.name)"' ``` -### Step 5: Check Claude Review +> **Constraint**: If this command fails, report that CI data is unavailable rather than guessing. Saying "CI passing" when the command returned an error destroys trust in the report. 
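
One way to honor the "report unavailable rather than guessing" constraint is to route every data-gathering command through a wrapper that separates command failure from command output. The sketch below is an assumption about how that could look; `report_section` and its output format are hypothetical and not part of the skill.

```bash
# report_section LABEL CMD [ARGS...]
# Runs CMD and prints its output under LABEL. If CMD exits non-zero, prints an
# explicit "unavailable" line instead of inventing a status. The name and the
# output format are illustrative assumptions.
report_section() {
    local label="$1"; shift
    local output status
    if output=$("$@" 2>&1); then
        if [ -z "$output" ]; then
            echo "$label: no data returned"
        else
            echo "$label:"
            echo "$output"
        fi
    else
        status=$?
        echo "$label: unavailable (command exited with status $status)"
    fi
}

# Example call, using a command that already appears in this skill:
#   report_section "Merge status" gh pr view --json mergeable --jq '.mergeable'
```

The wrapper treats any non-zero exit as "data unavailable". Some CLIs use non-zero exit codes to signal a result rather than an error, so each wrapped command's exit-code behavior should be checked before relying on this.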
+ +### Phase 6: Claude Review + +Check for Claude Code review comments on the PR -- these automated reviews often contain actionable feedback that human reviewers may not catch. ```bash CLAUDE_COMMENTS=$(gh pr view --json comments --jq '[.comments[] | select( @@ -145,22 +127,24 @@ else fi ``` -### Step 6: Check Reviews +### Phase 7: Human Reviews + +Show reviewer names with their review state so the user knows who has reviewed and what the outcome was. ```bash gh pr view --json reviews --jq '.reviews[] | "\(.author.login): \(.state)"' gh pr view --json reviewDecision --jq '.reviewDecision' ``` -### Step 7: Check Merge Status +Report conflicts, required approvals, and blocking checks to give a complete merge readiness picture: ```bash gh pr view --json mergeable --jq '.mergeable' ``` -### Step 8: Present Status Report +### Phase 8: Status Report -Format the collected information into a structured report: +Format all collected information into a structured report. Show every section that was gathered; if a section failed to load, say so explicitly rather than silently omitting it -- a partial report without explanation hides problems the user needs to know about. ``` PR STATUS @@ -189,11 +173,15 @@ PR #[N]: "[title]" [mergeable/conflicts/approvals needed] ``` +> **Gate**: The report must include all sections. If any data-gathering phase returned an error, the report must state what information is unavailable and why, not silently drop the section. + --- -## Examples +## Reference Material + +### Examples -### Example 1: Feature Branch with Open PR +**Feature Branch with Open PR** User says: "/pr-status" Actions: 1. Verify gh CLI available, fetch latest @@ -202,7 +190,7 @@ Actions: 4. Report ready to merge Result: Complete status showing merge-ready PR -### Example 2: No PR Exists +**No PR Exists** User says: "/pr-status" Actions: 1. Verify gh CLI, identify branch `fix/typo` @@ -210,11 +198,11 @@ Actions: 3. 
Report no PR exists for this branch Result: Branch state shown, suggest creating PR -### Example 3: On Main Branch +**On Main Branch** User says: "/pr-status" Actions: 1. Detect current branch is main -2. List recent PRs if optional behavior enabled +2. List recent PRs if requested 3. Suggest switching to a feature branch Result: Brief report noting user is on main @@ -245,31 +233,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Modifying State During Status Check -**What it looks like**: Pushing commits, creating PRs, or fixing issues during a status check -**Why wrong**: Status is read-only. Modifications belong to other skills. -**Do instead**: Report status only. Suggest appropriate skills for actions. - -### Anti-Pattern 2: Skipping Fetch Before Reporting -**What it looks like**: Reporting ahead/behind counts from stale local refs -**Why wrong**: Produces inaccurate status that misleads the user -**Do instead**: Always `git fetch origin --quiet` before comparing - -### Anti-Pattern 3: Partial Report Without Explanation -**What it looks like**: Showing only CI status, omitting reviews and merge state -**Why wrong**: User expects complete picture. Missing sections hide problems. -**Do instead**: Show all sections. If a section fails to load, say so explicitly. 
- -### Anti-Pattern 4: Inventing Status When Commands Fail -**What it looks like**: Reporting "CI passing" when `gh pr checks` returned an error -**Why wrong**: False status is worse than no status -**Do instead**: Report the command failure and what information is unavailable - ---- - ## References -This skill uses these shared patterns: -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks +- PR creation and push: use `pr-sync` skill +- Fixing review comments: use `pr-fix` skill +- Code review: use `/pr-review` skill diff --git a/skills/pr-sync/SKILL.md b/skills/pr-sync/SKILL.md index fcfe19fa..f78cfcc4 100644 --- a/skills/pr-sync/SKILL.md +++ b/skills/pr-sync/SKILL.md @@ -26,57 +26,9 @@ routing: # PR Sync Skill -## Purpose - -Sync local changes to GitHub in a single command. Detects current state (main vs feature branch, staged vs unstaged changes, existing PRs), then executes the minimum steps needed: branch, commit, push, and create PR. - -## Operator Context - -This skill operates as an operator for GitHub sync workflows, configuring Claude's behavior for safe, predictable git operations that get local work onto GitHub with a PR. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before any git operations -- **Over-Engineering Prevention**: Execute only the steps needed for the current state. Do not add extra commits, rebase, or reorganize history beyond what is required to sync. -- **Never force push**: All pushes use standard `git push`, never `--force`. Exception: the review-fix loop (Step 4b) uses `--force-with-lease` after amending the tip commit, which is safe because the branch was just pushed by us. -- **Never modify pushed commits**: Commits already on remote are immutable. Exception: the review-fix loop amends only the tip commit it just created, before any external review. 
-- **Never commit sensitive files**: Block `.env`, credentials, secrets, API keys -- **Conventional commit format**: All commit messages follow conventional commits -- **Branch protection**: Never commit directly to main/master -- **Repo-Aware Review Gate**: Classify repo before creating PR. For personal repos, run up to 3 iterations of `/pr-review` → fix before PR creation. For protected-org repos, every step requires explicit user confirmation. -- **Organization-Gated Workflow**: Repos under protected organizations (configured in `scripts/classify-repo.py`) require user confirmation before EACH step: commit message review, push, and PR creation. Never auto-execute any of these steps. Present the proposed action and wait for user approval. - -### Default Behaviors (ON unless disabled) -- **Auto-detect branch name**: Generate from commit message or changed files when not provided -- **Auto-detect PR title**: Derive from branch name or first commit when not provided -- **Stage selectively**: Prefer staging specific files over `git add -A` -- **Check for unpushed commits**: Include existing unpushed commits in the push -- **Warn if behind remote**: Alert user if branch is behind main before pushing -- **Show PR URL**: Display the PR URL after creation for easy access - -### Optional Behaviors (OFF unless enabled) -- **Draft PR**: Create PR as draft instead of ready for review -- **Auto-assign reviewers**: Assign reviewers based on CODEOWNERS -- **Rebase before push**: Rebase on main before pushing to avoid merge conflicts - -## What This Skill CAN Do -- Detect current git state and choose the right workflow -- Create feature branches from main/master with conventional naming -- Stage, commit, and push changes in sequence -- Create GitHub PRs with summary and test plan -- Update existing PRs by pushing new commits - -## What This Skill CANNOT Do -- Force push to shared/upstream branches (force-with-lease on own feature branches during review-fix loop is permitted) -- 
Review PRs (use /pr-review instead) -- Clean up merged branches (use pr-cleanup instead) -- Run CI checks (use ci instead) -- Commit directly to main/master +Sync local changes to GitHub in a single command. Detects current state (main vs feature branch, staged vs unstaged changes, existing PRs), then executes the minimum steps needed: branch, commit, push, and create PR. Execute only the steps needed for the current state -- do not add extra commits, rebase, or reorganize history beyond what is required to sync. ---- - -## Instructions - -### Usage +## Usage ``` /pr-sync # Auto-detect everything @@ -84,21 +36,25 @@ This skill operates as an operator for GitHub sync workflows, configuring Claude /pr-sync fix/bug-123 "Fix login" # Specify branch and PR title ``` -### Step 0: Classify Repo +## Instructions + +### Step 0: Read CLAUDE.md and Classify Repo -Determine repo type before any git operations. +Read and follow the repository CLAUDE.md before any git operations, because repo-specific branch conventions, commit formats, or CI requirements override defaults in this skill. + +Then determine repo type: ```bash REPO_TYPE=$(python3 ~/.claude/scripts/classify-repo.py --type-only) ``` -**Protected-org repos**: Every subsequent step (commit, push, PR creation) requires **explicit user confirmation**. Present the proposed action, show what will happen, and wait for approval before executing. Never auto-execute. +**Protected-org repos**: Every subsequent step (commit, push, PR creation) requires **explicit user confirmation**. Present the proposed action, show what will happen, and wait for approval before executing. Never auto-execute, because protected-org repos have CI gates and review policies that assume human oversight at each stage. **Personal repos**: Run `/pr-review` comprehensive review before creating the PR. Auto-execute steps normally. ### Step 1: Detect Current State -Determine the starting point before taking any action. 
+Always detect state before taking any action, because skipping detection risks creating nested branches, committing to the wrong branch, or duplicating work already done. ```bash # Get current branch @@ -110,7 +66,7 @@ MAIN_BRANCH=$(git symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's@^re # Check for uncommitted changes HAS_CHANGES=$(git status --porcelain) -# Check for unpushed commits +# Check for unpushed commits (include these in the push so nothing is left behind) UNPUSHED=$(git log origin/$CURRENT_BRANCH..$CURRENT_BRANCH --oneline 2>/dev/null) # Determine if on main/master @@ -120,9 +76,13 @@ if [[ "$CURRENT_BRANCH" == "main" || "$CURRENT_BRANCH" == "master" ]]; then fi ``` +If the branch is behind remote, warn the user before pushing so they can pull or rebase first and avoid a rejected push. + ### Step 2: Create Branch (if on main) -If on main/master with changes, create a feature branch first. +Never commit directly to main/master -- always create a feature branch first, because direct main commits skip code review, make rollback harder, and can break CI for everyone. + +If on main/master with changes, create a feature branch. If no branch name was provided, generate one from the changes or commit message. Branch naming conventions: @@ -135,8 +95,6 @@ Branch naming conventions: | Chore/maintenance | `chore/` | `chore/update-deps` | ```bash -# If no branch name provided, generate from changes or commit message -# Create and switch to new branch git checkout -b "$BRANCH_NAME" ``` @@ -144,10 +102,14 @@ If already on a feature branch, skip this step. ### Step 3: Stage and Commit -Stage changes selectively and create a conventional commit. +Stage files selectively by name rather than using `git add -A`, because blind staging catches unintended files -- build artifacts, `.env` files, editor configs, large binaries -- that pollute the repository and may leak secrets. 
Review `git status`, stage specific files, and verify with `git diff --cached` before committing. + +Never commit `.env`, credentials, secrets, or API keys. Block these files and warn the user if they appear in the staging area. + +All commit messages use conventional commit format (`type(scope): description`). ```bash -# Stage specific files (prefer over git add -A) +# Stage specific files (not git add -A) git add path/to/changed/files # Create commit with conventional format @@ -160,14 +122,18 @@ If no uncommitted changes exist, skip to Step 4. ### Step 4: Push to Remote +All pushes use standard `git push`, never `--force`, because force pushing destroys remote history and teammates lose work. If push is rejected due to the branch being behind remote, pull with rebase and resolve conflicts rather than forcing. + ```bash # Push with upstream tracking (CLAUDE_GATE_BYPASS=1 bypasses the git-submission-gate hook) CLAUDE_GATE_BYPASS=1 git push -u origin "$CURRENT_BRANCH" ``` +If the user requested a rebase before push, run `git pull --rebase origin $MAIN_BRANCH` first, but this is off by default. + **Protected-org repos**: Before executing the push, present the branch name, remote, and commits that will be pushed. Wait for explicit approval before pushing. -### Step 4a: ADR Decision Coverage (conditional — ADR-094) +### Step 4a: ADR Decision Coverage (conditional -- ADR-094) **Skip if**: No `.adr-session.json` exists in the working directory. @@ -181,13 +147,15 @@ If verdict is PARTIAL or FAIL, display uncovered decision points and ask whether ### Step 4b: Review-Fix Loop (personal repos only) -Iteratively review and fix issues before creating the PR. Up to 3 iterations of: `/pr-review` → fix → amend → push. - **Skip if**: `REPO_TYPE == "protected-org"` (protected-org repos use their own review gates). +Iteratively review and fix issues before creating the PR. Up to 3 iterations of: `/pr-review` -> fix -> amend -> push. 
+ +Never modify commits already on the remote except the tip commit just pushed by this workflow. The review-fix loop amends only the tip commit it just created, before any external review, and uses `--force-with-lease` (not `--force`) because lease-checking confirms no one else has pushed to the branch in the meantime. + **Loop (max 3 iterations):** 1. Run `/pr-review` comprehensive review -2. If clean → exit loop, proceed to Step 5 +2. If clean -> exit loop, proceed to Step 5 3. Fix all reported issues 4. `git add [fixes] && git commit --amend --no-edit && CLAUDE_GATE_BYPASS=1 git push --force-with-lease` 5. Report iteration: `REVIEW-FIX [N/3]: X found, Y fixed, Z remaining` @@ -196,6 +164,8 @@ After 3 iterations, proceed to Step 5 with any remaining issues documented in th ### Step 5: Create or Update PR +Generate the PR title from the branch name or first commit when not provided by the user. Never create a PR with an empty description, because reviewers need context to understand the changes and a missing test plan signals incomplete work. + ```bash # Check if PR already exists for this branch EXISTING_PR=$(gh pr list --head "$CURRENT_BRANCH" --json number --jq '.[0].number' 2>/dev/null) @@ -218,7 +188,13 @@ else fi ``` -### Step 6: Post-Merge ADR Status Update (conditional — ADR-095) +If the user requested a draft PR, add `--draft` to `gh pr create`. If auto-assign reviewers was requested, assign based on CODEOWNERS. + +Always show the PR URL after creation for easy access. + +**Protected-org repos**: Before executing PR creation, present the PR title, body, and target branch. Wait for explicit approval. Do not run the review-fix loop -- protected-org repos rely on their own CI gates and human reviewers. + +### Step 6: Post-Merge ADR Status Update (conditional -- ADR-095) **Skip if**: No `.adr-session.json` exists, or the PR was only created (not merged). 
@@ -238,7 +214,7 @@ mv "$ADR_PATH" adr/completed/ rm .adr-session.json ``` -Report: `ADR updated: {name} → Accepted, moved to completed/` +Report: `ADR updated: {name} -> Accepted, moved to completed/` This is local-only (ADR files are gitignored). No branch or PR needed. @@ -248,7 +224,7 @@ This is local-only (ADR files are gitignored). No branch or PR needed. /pr-sync invoked | v - Classify repo (Step 0) + Read CLAUDE.md + Classify repo (Step 0) | v Has changes? @@ -392,50 +368,13 @@ Solution: 1. Always push the current branch: `git push -u origin $(git branch --show-current)` 2. Never hardcode branch names in push commands 3. Verify after push: `git log origin/$(git branch --show-current)..HEAD` should show 0 commits -*Graduated from learning.db — multi-agent-coordination/worktree-push-from-wrong-branch* - ---- - -## Anti-Patterns - -### Anti-Pattern 1: Committing Everything with git add -A -**What it looks like**: Running `git add -A` without checking what will be staged. -**Why wrong**: Catches unintended files -- build artifacts, `.env` files, editor configs, large binaries. These pollute the repository and may leak secrets. -**Do instead**: Review `git status`, stage specific files by name, verify with `git diff --cached` before committing. - -### Anti-Pattern 2: Force Pushing to Resolve Conflicts -**What it looks like**: Push fails, so running `git push --force` to override. -**Why wrong**: Destroys remote history. Teammates lose work. PRs become inconsistent. -**Do instead**: Pull with rebase, resolve conflicts, push normally. If the situation is complex, inform the user rather than forcing. - -### Anti-Pattern 3: Creating PR with Empty Description -**What it looks like**: `gh pr create --title "Updates" --body ""` -**Why wrong**: Reviewers have no context. PR purpose is unclear. Test plan is missing. -**Do instead**: Generate a summary from the diff and commit messages. Include a test plan section. Even a brief description is better than none. 
-
-### Anti-Pattern 4: Skipping State Detection
-**What it looks like**: Immediately creating a branch and committing without checking current state.
-**Why wrong**: May create nested branches, commit to wrong branch, or duplicate work already done.
-**Do instead**: Always run Step 1 first. Detect branch, changes, and PR state before taking any action.
-
-### Anti-Pattern 5: Committing Directly to Main
-**What it looks like**: Staging and committing on main/master instead of creating a feature branch.
-**Why wrong**: Violates branch protection conventions. Makes rollback harder. Skips code review.
-**Do instead**: Always create a feature branch from main before committing. The only exception is if the user explicitly authorizes a direct commit.
+*Graduated from learning.db -- multi-agent-coordination/worktree-push-from-wrong-branch*

 ---

 ## References

-This skill uses these shared patterns:
-- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations
-- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks
-
-### Domain-Specific Anti-Rationalization
-
-| Rationalization | Why It's Wrong | Required Action |
-|-----------------|----------------|-----------------|
-| "Quick push to main, I'll branch later" | Main commits skip review and break CI | Create branch first, always |
-| "git add -A is fine, nothing sensitive" | Assumption without checking | Review status, stage selectively |
-| "PR description can wait" | Reviewers need context now, not later | Write summary before creating PR |
-| "Force push fixes the conflict" | Destroys teammate work on remote | Pull, rebase, resolve, push normally |
+- `/pr-review` -- Comprehensive PR review (used in the review-fix loop)
+- `/pr-cleanup` -- Post-merge branch cleanup
+- `scripts/classify-repo.py` -- Repo classification for workflow gating
+- `scripts/adr-decision-coverage.py` -- ADR decision coverage checker
diff --git
a/skills/pre-planning-discussion/SKILL.md b/skills/pre-planning-discussion/SKILL.md index 737ee090..460b7356 100644 --- a/skills/pre-planning-discussion/SKILL.md +++ b/skills/pre-planning-discussion/SKILL.md @@ -39,72 +39,45 @@ routing: # Pre-Planning Discussion Skill -## Purpose - Resolve specific implementation ambiguities before a plan is created. This is an optional step between design and planning in the feature lifecycle (design → **ambiguity resolution** → plan → implement → validate → release), or a standalone step before any planning workflow. The cost of a wrong assumption compounds: a silently wrong assumption at the planning stage produces a valid-looking plan that executes cleanly but delivers the wrong thing. The rework cost is the entire plan-execute cycle, not just a single task. This skill makes assumptions explicit so they can be corrected cheaply — before any code is written. -## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before starting -- **Scope Guardrail**: This skill clarifies HOW to implement scoped work. It NEVER expands scope to add new capabilities or question WHETHER the work should be done. If an ambiguity implies scope expansion, classify it as OUT and move on. WHY: scope creep during ambiguity resolution defeats the purpose — the user already decided what to build; we're resolving how. -- **Prior Decision Carryforward**: Check for existing context from earlier phases (feature-design output, prior ambiguity resolution, ADR decisions). Never re-ask a settled question. WHY: re-asking erodes trust and wastes interaction budget. -- **Structured Output Required**: Both modes MUST produce the identical context document format (Resolved Decisions + Carried Forward + Scope Boundary + Canonical References). WHY: downstream consumers (feature-plan, workflow-orchestrator) depend on a predictable format regardless of which mode produced it. 
-- **Canonical References Accumulation**: Track every spec, ADR, config file, or interface definition referenced during the discussion. These become the "canonical refs" list in the output document. WHY: downstream agents need to know which files are authoritative for the decisions made.
-
-### Default Behaviors (ON unless disabled)
-- **Mode Auto-Detection**: If the working directory has substantial existing code (10+ source files in the relevant domain), default to Assumptions mode. Otherwise, default to Discussion mode. User can override.
-- **Confidence Calibration**: In Assumptions mode, be honest about confidence. "Confident" means strong evidence from multiple files. "Likely" means one file or consistent pattern. "Unclear" means conflicting signals or insufficient data. Never inflate confidence.
-- **Interaction Budget**: Assumptions mode targets 2-4 user interactions (corrections). Discussion mode targets 3-6 decisions. If either mode drifts beyond these bounds, something is wrong — the scope is too broad, or questions are too granular.
-
-### Optional Behaviors (OFF unless enabled)
-- **Auto-approve assumptions**: Accept all Confident/Likely assumptions without presenting them (for fast iteration)
-
-## What This Skill CAN Do
-- Identify implementation ambiguities (gray areas) and present concrete options
-- Read codebase files to form evidence-based assumptions
-- Produce a structured context document consumed by feature-plan or workflow-orchestrator
-- Carry forward decisions from prior phases
-- Accumulate canonical references for downstream agents
-
-## What This Skill CANNOT Do
-- Expand scope beyond the current task boundary (scope guardrail)
-- Replace feature-design for broad design exploration
-- Produce implementation plans (that's feature-plan)
-- Write code or modify existing files (beyond the context document)
-
 ## Instructions

 ### Phase 0: PRIME

 **Goal**: Load context, detect mode, identify the work to be clarified.

-1.
**Load prior context**: Check for existing artifacts that contain decisions: +1. **Read repository CLAUDE.md** and follow its requirements throughout execution. This happens before any other work. + +2. **Load prior context**: Check for existing artifacts that contain decisions: - `.feature/state/design/` — design documents from feature-design - `adr/` — architecture decision records relevant to this work - `task_plan.md` — any existing plan context - Prior ambiguity resolution output (if re-running) -2. **Extract carried-forward decisions**: From any prior artifacts, extract decisions already made. These go directly into the "Carried Forward" section and are NOT re-asked. +3. **Extract carried-forward decisions**: From any prior artifacts, extract decisions already made. These go directly into the "Carried Forward" section and are NOT re-asked. Re-asking a settled question erodes trust and wastes interaction budget — if a prior phase decided it, carry it forward silently. -3. **Detect mode**: Assess the codebase: +4. **Detect mode**: Assess the codebase: - If the user explicitly requested a mode, use that mode - - If the working directory has substantial existing code in the relevant domain, use **Assumptions mode** + - If the working directory has substantial existing code in the relevant domain (10+ source files), use **Assumptions mode** - Otherwise, use **Discussion mode** + - Optional: if "auto-approve assumptions" is enabled, accept all Confident/Likely assumptions without presenting them (for fast iteration) -4. **Identify scope boundary**: From the user's request (and any design document), establish: +5. **Establish scope boundary**: From the user's request (and any design document), define: - IN: What this task covers - OUT: What this task explicitly does not cover -**GATE**: Prior context loaded. Mode selected. Scope boundary established. Proceed to Execute. + This skill clarifies HOW to implement scoped work. 
It never expands scope to add new capabilities or question WHETHER the work should be done. If an ambiguity implies scope expansion, classify it as OUT and move on. Scope creep during ambiguity resolution defeats the purpose — the user already decided what to build; we are resolving how. + +**GATE**: Prior context loaded (carried-forward decisions extracted). Mode selected. Scope boundary established. Proceed to Execute. ### Phase 1: EXECUTE #### Discussion Mode (greenfield / unclear requirements) -**Goal**: Surface gray areas — decisions that could go multiple ways — and present them with domain-aware options. +**Goal**: Surface gray areas — decisions that could go multiple ways — and present them with domain-aware options. Target 3-6 decisions total. If you find yourself drifting beyond that, the scope is too broad or the questions are too granular. **Step 1: Identify Gray Areas** @@ -119,11 +92,15 @@ Analyze the task requirements and classify ambiguities by domain: | Config/infrastructure | File format, environment variable naming, secret management, deployment strategy | | Content/documentation | Audience level, tone, structure, depth, cross-reference strategy | -For each gray area, prepare 2-3 concrete options with tradeoffs. NOT open-ended questions. +For each gray area, prepare 2-3 concrete options with tradeoffs. Never ask open-ended questions like "How should we handle errors?" — that forces the user to design the solution. Present specific options so the user can pick, not invent. + +Only present genuine gray areas where multiple approaches are valid. If there is only one reasonable option, it is not a gray area — just proceed with it. "Should we use JSON?" is not a gray area if JSON is the only sensible choice. + +Before declaring "no gray areas found," spend at least 2 minutes actively looking. Almost every task has ambiguities — if you found none, you are not looking hard enough. 
**Step 2: Present Gray Areas** -Present all identified gray areas grouped by domain. For each: +Present all identified gray areas grouped by domain, all at once (not one at a time) so the user can batch their decisions. For each: ```markdown ### Gray Area: [descriptive name] @@ -138,7 +115,7 @@ Present all identified gray areas grouped by domain. For each: **Default recommendation**: [which option and why, if you have a preference] ``` -Present all gray areas at once (not one at a time) so the user can batch their decisions. If a gray area has an obvious best choice, state the recommendation — the user can accept or override. +If a gray area has an obvious best choice, state the recommendation — the user can accept or override. **Step 3: Capture Decisions** @@ -149,9 +126,11 @@ For each gray area the user responds to, record: If the user defers a decision ("I don't care" or "whatever you think"), accept your recommendation and record it as "defaulted to recommendation." +Throughout this phase, track every spec, ADR, config file, or interface definition you reference. These become the "canonical refs" list in the output document — downstream agents need to know which files are authoritative for the decisions made. + #### Assumptions Mode (brownfield / existing codebase) -**Goal**: Read the codebase, form evidence-based opinions, and present them for the user to correct. The user only intervenes where the agent is actually wrong. +**Goal**: Read the codebase, form evidence-based opinions, and present them for the user to correct. The user only intervenes where the agent is actually wrong. Target 2-4 user interactions (corrections). If you are drifting beyond that, the scope is too broad or you are asking about things you should decide yourself. 
**Step 1: Codebase Survey** @@ -175,10 +154,14 @@ For each relevant aspect of the implementation, form an assumption: - **If wrong**: [concrete consequence of proceeding with this assumption] ``` +Every assumption MUST cite a file path and specific pattern. Assumptions without evidence are untestable claims that the user cannot evaluate. If you cannot point to evidence, the assumption is "Unclear" — say so honestly. + Confidence calibration: - **Confident**: Multiple files confirm the pattern, or explicit configuration states it -- **Likely**: One file shows the pattern, or it's consistent with the overall codebase style -- **Unclear**: Conflicting signals, insufficient data, or the codebase doesn't address this +- **Likely**: One file shows the pattern, or it is consistent with the overall codebase style +- **Unclear**: Conflicting signals, insufficient data, or the codebase does not address this + +Never inflate confidence. Do not rationalize "confident enough, no need to show evidence" — always cite the file path and pattern regardless of confidence level. **Step 3: Present Assumptions** @@ -186,6 +169,8 @@ Present all assumptions organized by confidence level (Confident first, Unclear > Here's what I believe about this codebase based on reading [N] files. Please correct anything that's wrong. You only need to respond to items that are incorrect — silence means agreement. +Make assumptions explicit so the user CAN correct you. The anti-pattern is assuming "the user will correct me if I'm wrong" while keeping assumptions silent — users cannot correct what they do not see. + **Step 4: Process Corrections** For each correction the user provides: @@ -193,11 +178,15 @@ For each correction the user provides: - Note the original (wrong) assumption and why it was wrong - Adjust confidence on related assumptions if the correction reveals a pattern +Throughout this phase, track every file you reference as a canonical reference for the output document. 
+ ### Phase 2: COMPILE OUTPUT **Goal**: Produce the structured context document that downstream skills consume. -Both modes converge here. Produce a single document with this exact structure: +Both modes converge here and MUST produce the identical format. Downstream consumers (feature-plan, workflow-orchestrator) depend on a predictable structure regardless of which mode produced it. + +Produce a single document with this exact structure: ```markdown # Pre-Planning Context: [Task Name] @@ -259,33 +248,6 @@ Save this document: | Codebase too small for Assumptions mode | <5 relevant files | Switch to Discussion mode | | Conflicting prior decisions | Design doc and ADR disagree | Flag the conflict to the user. Do not resolve silently | -## Anti-Patterns - -| Anti-Pattern | Why Wrong | Do Instead | -|--------------|-----------|------------| -| Asking 15-20 individual questions | Exhausts user patience, most answers are predictable | Use Assumptions mode: form opinions, ask for corrections | -| Open-ended questions ("How should we handle errors?") | Forces user to design the solution | Present 2-3 concrete options with tradeoffs | -| Re-asking settled decisions | Wastes interaction budget, erodes trust | Check prior context first; carry forward existing decisions | -| Expanding scope during discussion | Defeats the purpose of scoped ambiguity resolution | Apply scope guardrail: clarify HOW, never WHETHER | -| Presenting one option as a question | "Should we use JSON?" is not a gray area if there's only one option | Only present genuine gray areas with multiple valid approaches | -| Skipping evidence in Assumptions mode | Assumptions without evidence can't be evaluated | Every assumption MUST cite a file path and specific pattern | - -## Anti-Rationalization - -See [core patterns](../shared-patterns/anti-rationalization-core.md). 
-
-Domain-specific for pre-planning discussion:
-
-| Rationalization | Why Wrong | Action |
-|-----------------|-----------|--------|
-| "The requirements are clear, no ambiguities" | Almost every task has ambiguities; you're not looking hard enough | Spend 2 minutes actively looking for gray areas before declaring none exist |
-| "I'll figure it out during implementation" | That's exactly what this skill prevents — silent wrong assumptions | Surface the ambiguity now; it's cheaper to resolve |
-| "This expands scope but it's important" | Scope guardrail is non-negotiable | Mark as OUT, note it for future work |
-| "User will correct me if I'm wrong" | Users don't know what you assumed silently | Make assumptions explicit so they CAN correct you |
-| "Confident enough, no need to show evidence" | Assumptions without evidence are untestable claims | Always cite file path and pattern |

 ## References

-- [Gate Enforcement](../shared-patterns/gate-enforcement.md)
-- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md)
 - [ADR-072: Pre-Planning Ambiguity Resolution](../../adr/072-pre-planning-ambiguity-resolution.md)
diff --git a/skills/pre-publish-checker/SKILL.md b/skills/pre-publish-checker/SKILL.md
index a915e4e4..982def22 100644
--- a/skills/pre-publish-checker/SKILL.md
+++ b/skills/pre-publish-checker/SKILL.md
@@ -28,45 +28,11 @@ routing:
 # Pre-Publish Checker Skill

-## Operator Context
-
-This skill operates as an operator for Hugo blog post validation, configuring Claude's behavior for rigorous pre-publication quality assurance. It implements a **Sequential Validation** architectural pattern — assess structure, validate fields, check assets, report — with **Domain Intelligence** embedded in Hugo-specific rules and SEO best practices.
- -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before validation -- **Non-Destructive**: Never modify post files without explicit user request -- **Complete Output**: Show all validation results; never summarize or skip categories -- **Blocker Classification**: Clearly distinguish BLOCKER from SUGGESTION severity -- **Reproduce Findings**: Every reported issue must reference the exact line or field - -### Default Behaviors (ON unless disabled) -- **Full Validation**: Run all check categories (front matter, SEO, content, links, images, draft) -- **Taxonomy Suggestions**: Suggest tags/categories based on existing site taxonomy -- **Reading Time Calculation**: Calculate at 200 WPM for prose, 50% weight for code blocks -- **Skip External Links**: Do not validate external URLs (use `--check-external` to enable) -- **Severity Separation**: Count blockers and suggestions independently in final report - -### Optional Behaviors (OFF unless enabled) -- **External Link Validation**: Check that external URLs are reachable (`--check-external`) -- **Auto-Fix**: Offer to fix suggestion-level issues automatically (`--auto-fix`) -- **Strict Mode**: Treat all suggestions as blockers (`--strict`) - -## What This Skill CAN Do -- Parse Hugo TOML (+++) and YAML (---) front matter and validate required fields -- Calculate word count and reading time with code block weighting -- Analyze header structure (H2/H3 hierarchy) for logical organization -- Detect draft status, TODO/FIXME comments, and placeholder text -- Verify internal links point to existing content in content/ or static/ -- Verify image files exist and have alt text -- Suggest tags/categories from existing site taxonomy -- Generate structured validation report with clear pass/fail/skip status - -## What This Skill CANNOT Do -- Validate external URLs by default (network latency, rate limiting concerns) -- Judge content quality beyond structural and metadata checks -- 
Automatically publish posts or change draft status -- Modify files without explicit user consent -- Replace editorial review for prose quality or accuracy +## Overview + +This skill performs rigorous pre-publication validation for Hugo blog posts using a **Sequential Validation** workflow: assess structure, validate fields, check assets, and report results. It embeds Hugo-specific rules and SEO best practices to catch publication blockers before they reach production. + +The skill is **non-destructive** (never modifies files without explicit user request), **complete** (shows all validation results—never summarizes), and **severity-aware** (distinguishes BLOCKER from SUGGESTION throughout the workflow). --- @@ -114,60 +80,62 @@ Everything after front matter closing delimiter. Inventory: **Step 1: Front matter validation** -| Field | Requirement | Severity | -|-------|-------------|----------| -| title | Present, non-empty | BLOCKER | -| date | Present, valid format | BLOCKER | -| draft | Must be `false` | BLOCKER | -| description | Present, 150-160 chars | SUGGESTION | -| tags | Present, 3-5 items | SUGGESTION | -| categories | Present, 1-2 items | SUGGESTION | +| Field | Requirement | Severity | Reasoning | +|-------|-------------|----------|-----------| +| title | Present, non-empty | BLOCKER | Hugo build fails without title | +| date | Present, valid format | BLOCKER | Publishing requires valid timestamp | +| draft | Must be `false` | BLOCKER | Hugo excludes draft posts; this is a non-negotiable check | +| description | Present, 150-160 chars | SUGGESTION | SEO optimization; user may omit intentionally | +| tags | Present, 3-5 items | SUGGESTION | Recommendation for taxonomy consistency; not required | +| categories | Present, 1-2 items | SUGGESTION | Recommendation for site organization; not required | + +Always validate draft field first. Treat as highest-priority blocker—draft posts are excluded from production builds entirely. 
**Step 2: SEO validation** -| Check | Optimal Range | Severity | -|-------|---------------|----------| -| Title length | 50-60 characters | SUGGESTION | -| Description length | 150-160 characters | SUGGESTION | -| Slug format | URL-friendly, no special chars | SUGGESTION | +| Check | Optimal Range | Severity | Reasoning | +|-------|---------------|----------|-----------| +| Title length | 50-60 characters | SUGGESTION | Search engine display optimization | +| Description length | 150-160 characters | SUGGESTION | Meta description window sizing | +| Slug format | URL-friendly, no special chars | SUGGESTION | Internal naming convention | Derive slug from filename: `2025-01-my-post.md` becomes `my-post`. **Step 3: Content quality validation** -| Check | Requirement | Severity | -|-------|-------------|----------| -| Word count | Minimum 500 words | SUGGESTION | -| Reading time | Calculate at 200 WPM | INFO | -| Header structure | H2/H3 present, logical hierarchy | SUGGESTION | -| Opening paragraph | No preamble phrases | SUGGESTION | +| Check | Requirement | Severity | Reasoning | +|-------|-------------|----------|-----------| +| Word count | Minimum 500 words | SUGGESTION | Depth indicator; user retains discretion | +| Reading time | Calculate at 200 WPM | INFO | Report only, not a blocker | +| Header structure | H2/H3 present, logical hierarchy | SUGGESTION | Readability and scannability | +| Opening paragraph | No preamble phrases | SUGGESTION | Content quality; user may override | Preamble detection phrases: "In this post, I will...", "Today I'm going to...", "Let me explain...", "Welcome to...", "First of all...", "Before we begin..." **Step 4: Link validation** -- **Internal links**: Pattern `](/posts/...)` or `](/images/...)`. Verify target exists. Severity: BLOCKER if missing. -- **External links**: Pattern `](https://...)`. Skip by default. Severity: WARNING if unreachable (when enabled). -- **Image links**: Pattern `![alt](path)` or Hugo shortcodes. 
Verify file exists in static/. Severity: BLOCKER if missing. +- **Internal links**: Pattern `](/posts/...)` or `](/images/...)`. Verify target exists. Severity: BLOCKER if missing (broken links prevent navigation). +- **External links**: Pattern `](https://...)`. Skip by default. Severity: WARNING if unreachable (when enabled). Rationale: External validation adds network latency and rate-limiting concerns; skip unless user opts in with `--check-external`. +- **Image links**: Pattern `![alt](path)` or Hugo shortcodes. Verify file exists in static/. Severity: BLOCKER if missing (reader sees broken image). **Step 5: Image validation** -| Check | Requirement | Severity | -|-------|-------------|----------| -| Alt text | All images must have non-empty alt | SUGGESTION | -| File existence | All referenced images exist in static/ | BLOCKER | -| Path format | Correct Hugo static path convention | SUGGESTION | +| Check | Requirement | Severity | Reasoning | +|-------|-------------|----------|-----------| +| Alt text | All images must have non-empty alt | SUGGESTION | Accessibility best practice | +| File existence | All referenced images exist in static/ | BLOCKER | Missing images break page rendering | +| Path format | Correct Hugo static path convention | SUGGESTION | Consistency with site standards | Hugo image path patterns: `/images/filename.png` (absolute from static/), `images/filename.png` (relative), `{{< figure src="..." >}}` (shortcode). 
**Step 6: Draft status validation** -| Check | Requirement | Severity | -|-------|-------------|----------| -| draft field | Must be `false` | BLOCKER | -| TODO comments | None present | WARNING | -| FIXME comments | None present | WARNING | -| Placeholder text | None present | BLOCKER | +| Check | Requirement | Severity | Reasoning | +|-------|-------------|----------|-----------| +| draft field | Must be `false` | BLOCKER | Hugo build filter; non-negotiable | +| TODO comments | None present | WARNING | Development artifact—likely unintentional in published post | +| FIXME comments | None present | WARNING | Development artifact—likely unintentional in published post | +| Placeholder text | None present | BLOCKER | Content is incomplete if placeholders remain | Placeholder patterns: `[insert X here]`, `[TBD]`, `[TODO]`, `XXX`, `PLACEHOLDER`, `Lorem ipsum`. @@ -179,7 +147,7 @@ Placeholder patterns: `[insert X here]`, `[TBD]`, `[TODO]`, `XXX`, `PLACEHOLDER` **Step 1: Build taxonomy index** -Read existing posts to collect all tags and categories currently in use. +Read existing posts to collect all tags and categories currently in use. Rationale: Always read existing posts before suggesting. Inventing generic tags like "programming" or "tech" without checking the site creates inconsistent taxonomy and fragments content organization. **Step 2: Analyze content** @@ -187,7 +155,7 @@ Match current post content against existing taxonomy terms. Prefer established t **Step 3: Generate suggestions** -Suggest 3-5 tags and 1-2 categories. Avoid over-suggesting popular tags; distribute evenly across the taxonomy. +Suggest 3-5 tags and 1-2 categories. Avoid over-suggesting popular tags; distribute evenly across the taxonomy. Report suggestions even if tags/categories are already present—they validate against site conventions. **Gate**: Taxonomy suggestions generated from existing site data (not invented). Proceed only when gate passes. 
@@ -231,6 +199,8 @@ Format the report as: - READY FOR PUBLISH: Zero blockers (suggestions and warnings are acceptable) - NOT READY: One or more blockers present; list all blockers after result +Ensure accurate blocker count. Count blockers and suggestions independently in the final result—never mix them. + **Gate**: Report generated with accurate blocker count. Result matches blocker tally. --- @@ -285,51 +255,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Blocking on Optional Fields -**What it looks like**: Marking post as NOT READY because tags are missing -**Why wrong**: Tags are recommendations, not requirements. Users may have valid reasons to omit them. -**Do instead**: Classify as SUGGESTION, count separately from blockers in the final result. - -### Anti-Pattern 2: Inventing Taxonomy Instead of Reading -**What it looks like**: Suggesting generic tags like "programming" or "tech" without checking the site -**Why wrong**: Creates inconsistent taxonomy, fragments content organization, ignores established patterns. -**Do instead**: Always read existing posts first. Suggest from established taxonomy when possible. - -### Anti-Pattern 3: Skipping Draft Status Check -**What it looks like**: Reporting READY FOR PUBLISH when `draft: true` -**Why wrong**: Hugo excludes draft posts from production builds. This is the highest-priority blocker. -**Do instead**: Check draft field first. Treat as a non-negotiable blocker. - -### Anti-Pattern 4: Modifying Files Without Consent -**What it looks like**: Automatically adding missing description or fixing tags during validation -**Why wrong**: Validation is read-only. User must approve all changes to their content. -**Do instead**: Report findings, suggest fixes, wait for explicit user request before modifying. 
- -### Anti-Pattern 5: Silent Skip Without Reporting -**What it looks like**: Not reporting when a check could not be performed (directory not found, parse error) -**Why wrong**: User assumes check passed when it was actually skipped. False confidence. -**Do instead**: Always report `[SKIP]` with a reason when any check cannot complete. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Front matter looks fine" | Looking ≠ validating each field | Parse and check every required field | -| "Images are probably there" | Probably ≠ file system check | Verify each path exists on disk | -| "No need to check taxonomy" | Missing context creates bad suggestions | Read existing posts before suggesting | -| "Draft is obvious, skip it" | Obvious checks prevent obvious mistakes | Always validate draft field explicitly | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/seo-guidelines.md`: SEO length requirements and best practices - `${CLAUDE_SKILL_DIR}/references/hugo-frontmatter.md`: Hugo front matter fields and formats - `${CLAUDE_SKILL_DIR}/references/checklist.md`: Complete validation checklist reference diff --git a/skills/professional-communication/SKILL.md b/skills/professional-communication/SKILL.md index 2bbf65d8..a7caf406 100644 --- a/skills/professional-communication/SKILL.md +++ b/skills/professional-communication/SKILL.md @@ -29,44 +29,11 @@ routing: # Professional Communication Skill -## Operator Context - -This skill operates as an operator for technical communication transformation, configuring Claude's behavior for structured information extraction and business formatting. 
It implements the **Deterministic Template** architectural pattern -- extract all propositions, categorize by type, apply structured template -- with **Domain Intelligence** embedded in the status classification and tone transformation methodology. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before transformation -- **Over-Engineering Prevention**: Only transform what's directly requested. No speculative sections, no "nice to have" formatting not asked for -- **Complete Proposition Extraction**: NEVER lose technical details during transformation -- **Status Indicators**: Always use GREEN/YELLOW/RED in output (non-negotiable for executive readability) -- **Template Structure**: Always follow "Status -> Key Point -> Summary -> Technical Details -> Next Steps" format -- **Four-Phase Process**: PARSE -> STRUCTURE -> TRANSFORM -> VERIFY (mandatory workflow) - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report transformation results factually without self-congratulation -- **Executive Summary**: Include executive-friendly overview by default -- **Action Item Extraction**: Extract and list next steps with ownership implications -- **Context Bridging**: Make implicit knowledge explicit for non-technical audiences -- **Professional Tone**: Transform casual/defensive language into neutral business language -- **Temporary File Cleanup**: Remove intermediate analysis files at completion - -### Optional Behaviors (OFF unless enabled) -- **Technical Deep Dive**: Include extended technical details beyond 2-3 sentence summary -- **Comparative Analysis**: Add historical comparison sections -- **Multi-Stakeholder Variants**: Generate multiple versions for different audiences - -## What This Skill CAN Do -- Extract all propositions from dense technical communication without information loss -- Apply deterministic templates that produce consistent business-formatted output -- Classify status 
indicators (GREEN/YELLOW/RED) based on objective severity criteria -- Transform defensive or emotional language into neutral business language -- Generate specific, actionable next steps with ownership and timeline markers - -## What This Skill CANNOT Do -- Write new content from scratch (use appropriate writing skill instead) -- Generate documentation that doesn't transform an existing input -- Skip proposition extraction and jump straight to formatting -- Create multi-stakeholder variants by default (enable optional behavior first) -- Transform without verifying information completeness in Phase 4 +## Overview + +This skill transforms dense technical communication into clear, structured business formats using **proposition extraction** (identify all facts and relationships) and **deterministic templates** (apply consistent structure). It extracts every detail without loss, categorizes by business relevance, applies a standard template with professional tone, and verifies completeness before delivery. + +**Core principle**: Transformation ≠ creation. Never write new content; always extract from existing input and restructure it for executive clarity with preserved technical accuracy. --- @@ -74,11 +41,11 @@ This skill operates as an operator for technical communication transformation, c ### Phase 1: PARSE -**Goal**: Extract every proposition from the input before structuring anything. +**Goal**: Extract every proposition from the input before structuring anything. This prevents information loss and ensures technical accuracy is preserved. 
**Step 1: Classify input type** -Identify the communication type: +Identify the communication type (this determines categorization strategy in Phase 2): - Technical update (progress report with embedded facts) - Debugging narrative (stream-of-consciousness problem-solving) - Status report (project state with blockers/dependencies) @@ -86,17 +53,18 @@ Identify the communication type: **Step 2: Extract all propositions** -Parse each sentence systematically: +Parse each sentence systematically. Never summarize before extracting — summarizing skips propositions and loses facts: + 1. **Facts**: All distinct statements of truth 2. **Implications**: Cause-effect relationships 3. **Temporal markers**: Past/present/future actions 4. **System references**: All mentioned components 5. **Blockers**: Hidden dependencies and constraints -6. **Emotional context**: Frustration/satisfaction/urgency indicators +6. **Emotional context**: Frustration/satisfaction/urgency indicators (needed to transform defensive language) **Step 3: Document implicit context** -Surface assumptions the author takes for granted but the audience needs stated: +Surface assumptions the author takes for granted but the audience needs stated. Non-technical audiences cannot act without this: - Technical acronyms or project names the audience may not know - Timeline context (when things happened relative to milestones) - Organizational context (team relationships, reporting structures) @@ -123,10 +91,12 @@ Implicit Context: ### Phase 2: STRUCTURE -**Goal**: Categorize and prioritize all extracted propositions. +**Goal**: Categorize and prioritize all extracted propositions by business relevance. This prevents unsolicited sections and keeps output focused on what matters most. 
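As a rough illustration of the prioritization this phase performs, the sketch below sorts extracted propositions by the five priority categories listed in Step 2 of this phase. The `Proposition` record, the category identifiers, and the `prioritize` function are hypothetical names introduced for this example; the skill itself describes the ordering in prose and does not prescribe a data structure.

```python
from dataclasses import dataclass

# Priority order taken from the phase description; identifiers are assumed spellings.
PRIORITY_ORDER = [
    "business_impact",
    "technical_functionality",
    "project_timeline",
    "resource_requirements",
    "risk_management",
]


@dataclass
class Proposition:
    text: str      # the extracted statement, kept verbatim
    category: str  # one of PRIORITY_ORDER (hypothetical labels)


def prioritize(propositions):
    """Return propositions ordered by priority; unknown categories sort last.

    sorted() is stable, so items in the same category keep their input order.
    """
    rank = {name: position for position, name in enumerate(PRIORITY_ORDER)}
    return sorted(propositions, key=lambda p: rank.get(p.category, len(PRIORITY_ORDER)))
```

Nothing is dropped by the sort, which matches the requirement that lower-priority items are preserved rather than deleted.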
**Step 1: Categorize propositions** +Organize by type (categorization determines template section placement): + ```markdown Status: [items with current state] Actions: [completed, in-progress, planned] @@ -137,27 +107,33 @@ Next: [required actions] **Step 2: Priority order** +Rank by impact to executive decision-making, not completeness: + 1. Business Impact (revenue, customer, strategic) 2. Technical Functionality (core operation) 3. Project Timeline (schedule implications) 4. Resource Requirements (personnel, infrastructure) 5. Risk Management (potential issues) +Only the highest-priority categories go into the output. Lower-priority items are preserved in Technical Details but not emphasized. + **Step 3: Identify information gaps** -Flag any propositions that need clarification before transformation: -- Ambiguous severity (could be GREEN or YELLOW) -- Missing ownership for action items -- Undefined technical terms critical to business impact +Flag any propositions that need clarification before transformation. Ask for specifics only when severity classification is ambiguous: +- Ambiguous severity (could be GREEN or YELLOW — default to YELLOW if unclear) +- Missing ownership for action items (block on clarity, don't infer) +- Undefined technical terms critical to business impact (ask for definition) **Gate**: All propositions categorized and prioritized. Proceed only when gate passes. ### Phase 3: TRANSFORM -**Goal**: Apply output template with professional tone. +**Goal**: Apply standard template with professional tone. This ensures consistent, executive-ready formatting without speculative sections. **Step 1: Apply standard template** +Never add unsolicited sections (Risk Assessment, Historical Context, Mitigation Strategies). 
Use ONLY this structure: + ```markdown **STATUS**: [GREEN|YELLOW|RED] **KEY POINT**: [Single most important business takeaway] @@ -178,43 +154,48 @@ Flag any propositions that need clarification before transformation: **Step 2: Tone adjustment** -- Strip hedging, defensive language, and filler -- Preserve urgency markers and severity indicators -- Keep technical terms intact (do NOT oversimplify) -- Maintain causal chains and specific metrics +The transformation rules are deterministic (apply all): +- Strip hedging language: "I think we might need to..." → "Deploy X to address Y" +- Transform defensive tone: "We had to rollback because..." → "Rolled back to [previous version] due to [root cause]" +- Preserve urgency markers and severity indicators (needed for status classification) +- Keep technical terms intact (oversimplification loses information; non-technical audiences still need accuracy) +- Maintain causal chains and specific metrics (specific > generic) **Step 3: Status classification** +Apply criteria consistently (inconsistency confuses stakeholders and erodes trust): + - **GREEN**: Fully complete with no follow-up, all verification done - **YELLOW**: Resolved with follow-up needed, blocked on dependencies, partial completion - **RED**: Active critical issues, production impact, urgent intervention needed -- Document reasoning: "Status: YELLOW (deployment successful but monitoring pending)" + +Always document reasoning: "Status: YELLOW (deployment successful but monitoring pending)" not just "Status: YELLOW" **Step 4: Action item specificity** -Every next step MUST include: -- Specific action verb (investigate, deploy, coordinate, document) -- Clear scope (what exactly needs doing) -- Ownership implication (who or which team) -- Timeline marker when available (IMMEDIATE, by EOW, this sprint) +Vague action items cannot be executed. 
Every next step MUST include:
+- Specific action verb (investigate, deploy, coordinate, document) — "fix" is too vague
+- Clear scope (what exactly needs doing) — define the boundary
+- Ownership implication (who or which team) — someone must be accountable
+- Timeline marker when available (IMMEDIATE, by EOW, this sprint) — explicit > implied

-**Gate**: Output follows template structure with professional tone. Proceed only when gate passes.
+**Gate**: Output follows template structure with professional tone and all specificity rules applied. Proceed only when gate passes.

 ### Phase 4: VERIFY

-**Goal**: Confirm transformation quality before delivery.
+**Goal**: Confirm transformation quality before delivery. All gates must pass; proceed only when complete.

-**Step 1**: Compare output against extracted propositions -- NO information loss
+**Step 1**: Compare output against extracted propositions — NO information loss allowed. If a fact from Phase 1 doesn't appear in output, it belongs in Technical Details.

-**Step 2**: Verify technical accuracy -- terms, metrics, causal chains preserved
+**Step 2**: Verify technical accuracy — terms, metrics, causal chains preserved exactly. Never substitute synonyms ("database issues" for "Redis cluster failover") — specificity is required.

-**Step 3**: Confirm status indicator matches actual severity
+**Step 3**: Confirm status indicator matches actual severity. Check the documented reasoning against the classification criteria, especially at the GREEN/YELLOW and YELLOW/RED boundaries.

-**Step 4**: Validate action items are specific (who, what, when) not vague
+**Step 4**: Validate action items are specific — check each next step for (verb, scope, owner, timeline). "Fix the issue" fails; "Complete Redis failover testing in staging (DevOps, by EOW)" passes.

-**Step 5**: Check appropriate detail level for target audience
+**Step 5**: Check appropriate detail level for target audience.
If audience is non-technical, Technical Details should bridge jargon with plain explanations without losing precision. -**Step 6**: Document transformation summary +**Step 6**: Document transformation summary to prove gate passage: ```markdown ## Transformation Summary @@ -222,10 +203,10 @@ Input type: [type] Propositions extracted: [N] Status assigned: [GREEN|YELLOW|RED] ([reasoning]) Information loss: None -Template applied: [standard | executive | technical manager] +Template applied: standard ``` -**Gate**: All verification checks pass. Transformation is complete. +**Gate**: All verification checks pass. Transformation is complete. Do not proceed to delivery without all 6 steps passing. --- @@ -263,74 +244,45 @@ Result: RED status report with tiered emergency response actions ## Error Handling ### Error: "Missing Context in Input" -Cause: Technical terms or acronyms critical to business impact are undefined -Solution: -1. Ask user for clarification on terms critical to status classification -2. Make reasonable inferences for minor details -3. Note all assumptions explicitly in Technical Details section +**Cause**: Technical terms or acronyms critical to business impact are undefined. -### Error: "Ambiguous Status Classification" -Cause: Input contains mixed signals (e.g., issue resolved but monitoring incomplete) -Solution: -1. Default to YELLOW when unclear between GREEN/YELLOW -2. Default to RED only with clear critical indicators (production impact, data loss) -3. Document reasoning in parenthetical after status indicator +**Solution**: +1. Ask user for clarification on terms critical to status classification — speculation causes wrong status assignments +2. Make reasonable inferences only for minor details; flag all assumptions explicitly in Technical Details section +3. 
Don't skip transformation while waiting — provide output with a note: "Status classification assumed X because Y was undefined" -### Error: "Multi-Thread Update Contamination" -Cause: Input contains multiple unrelated topics that could cross-contaminate -Solution: -1. Process each thread as separate proposition set -2. Apply template independently to each thread -3. Combine with clear thread identification in final output -4. Ensure status indicators are thread-specific - ---- +### Error: "Ambiguous Status Classification" +**Cause**: Input contains mixed signals (e.g., issue resolved but monitoring incomplete). -## Anti-Patterns +**Solution**: +1. Default to YELLOW when unclear between GREEN/YELLOW — YELLOW preserves urgency for follow-up without false reassurance +2. Default to RED only with clear critical indicators: production impact (users affected), data loss (unrecoverable), or ongoing crisis (not yet mitigated) +3. Document reasoning in parenthetical: "Status: YELLOW (deployment successful but monitoring pending)" — transparency prevents misinterpretation -### Anti-Pattern 1: Skipping Proposition Extraction -**What it looks like**: Jumping straight to template formatting without parsing all embedded facts -**Why wrong**: Loses technical details, misses causal relationships, assigns wrong status -**Do instead**: Complete Phase 1 fully. Extract ALL propositions before any formatting. +### Error: "Multi-Thread Update Contamination" +**Cause**: Input contains multiple unrelated topics that could cross-contaminate status classifications. -### Anti-Pattern 2: Adding Unsolicited Sections -**What it looks like**: Adding "Risk Assessment", "Mitigation Strategies", "Historical Context" not in the original -**Why wrong**: Over-engineering that changes meaning, increases length, violates hardcoded behavior -**Do instead**: Apply ONLY the standard template. Let user request additional analysis. +**Solution**: +1. 
Process each thread as separate proposition set (Phase 1 extraction per thread) +2. Apply template independently to each thread (Phase 2-3 per thread) +3. Combine with clear thread identification in final output (use headers: "Thread A: Deployment", "Thread B: Data Recovery") +4. Ensure status indicators are thread-specific (Thread A may be GREEN while Thread B is RED) — separate outcomes, separate classifications -### Anti-Pattern 3: Losing Technical Accuracy for Simplicity -**What it looks like**: "Redis cluster failover with 15-second timeout" becomes "database problems caused delays" -**Why wrong**: Strips specificity needed for technical teams to act. Violates complete proposition extraction. -**Do instead**: Keep technical terms, preserve metrics, maintain causal chains in Technical Details section. +--- -### Anti-Pattern 4: Vague Action Items -**What it looks like**: "Fix the issue", "Follow up on dependencies", "Improve the system" -**Why wrong**: No ownership, no timeline, no scope. Stakeholders cannot act on vague items. -**Do instead**: "Complete Redis failover testing in staging (DevOps team, by EOW)" +## Error Handling Principles -### Anti-Pattern 5: Inconsistent Status Indicators -**What it looks like**: 2-hour outage marked GREEN, successful deploy with incomplete monitoring marked GREEN -**Why wrong**: Inconsistency confuses stakeholders and erodes trust in the status system -**Do instead**: Apply status criteria consistently. Document reasoning in parenthetical. +**Constraint distribution in error handling**: +- **Summarizing before extracting** = loses facts. Complete Phase 1 fully before proceeding. +- **Status is "obvious"** = assumption. Apply classification criteria consistently, document reasoning. +- **Technical details not needed for non-technical audience** = false. Always include Technical Details; bridge jargon with explanation. +- **Action items are implied** = stakeholders cannot execute implied work. 
Write explicit (verb, scope, owner, timeline) for every next step. +- **Professional tone is "close enough"** = defensive language still embedded. Apply ALL transformation rules: hedging → direct, emotional → neutral, vague → specific. --- ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I can summarize without extracting first" | Summarizing skips propositions, loses facts | Complete Phase 1 extraction | -| "Status is obviously GREEN" | Obvious ≠ verified against criteria | Apply status classification rules | -| "Technical details aren't needed here" | Non-technical audience ≠ no technical section | Include Technical Details section always | -| "Action items are implied" | Implied ≠ actionable for stakeholders | Write explicit next steps with ownership | -| "Close enough tone" | Professional tone requires specific transformations | Apply defensive->neutral, casual->professional rules | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/templates.md`: Status-specific templates, section formats, phrase transformations - `${CLAUDE_SKILL_DIR}/references/examples.md`: Complete transformation examples with proposition extraction diff --git a/skills/python-quality-gate/SKILL.md b/skills/python-quality-gate/SKILL.md index 0804915f..2b5eb215 100644 --- a/skills/python-quality-gate/SKILL.md +++ b/skills/python-quality-gate/SKILL.md @@ -36,95 +36,30 @@ routing: # Python Quality Gate Skill -## Operator Context - -This skill operates as an operator for Python code quality validation workflows, configuring Claude's behavior for automated quality assurance. 
It runs four tools in deterministic order: ruff, pytest, mypy, bandit. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Over-Engineering Prevention**: Only validate code. Never add tools, features, or flexibility not requested -- Run all available quality tools in fixed order: ruff check, ruff format, mypy, pytest, bandit -- Show complete command output with exact file paths and line numbers (never summarize) -- Exit with non-zero status if any critical check fails -- Categorize issues by severity: critical, high, medium, low - -### Default Behaviors (ON unless disabled) -- **Report Facts**: Show command output rather than describing it. No self-congratulation. -- **Temporary File Cleanup**: Remove intermediate outputs at completion. Keep final report only if requested. -- Group errors by type and file for readability -- Show suggested auto-fix commands when available -- Calculate and display quality metrics (error counts, coverage percentages) -- Check formatting compliance with ruff format - -### Optional Behaviors (OFF unless enabled) -- Auto-fix issues with `--fix` flag (requires explicit request) -- Install missing quality tools automatically -- Modify pyproject.toml or configuration files -- Save report to file with `--output {file}` flag -- Skip specific tools with `--skip-mypy`, `--skip-bandit`, `--skip-tests` flags - -## What This Skill CAN Do -- Run all four quality tools in deterministic order with structured output -- Categorize issues by severity and provide auto-fix commands -- Generate structured markdown reports with actionable suggestions -- Calculate pass/fail status based on configurable thresholds -- Detect project configuration from pyproject.toml - -## What This Skill CANNOT Do -- Debug runtime bugs (use systematic-debugging) -- Refactor code or make architectural changes (use systematic-refactoring) -- Replace running a single tool when that is all the user needs 
-- Auto-fix without explicit user confirmation
-
----
-
-## Prerequisites
-
-### Required Tools
-```bash
-pip install ruff pytest pytest-cov
-```
-
-### Optional Tools (recommended)
-```bash
-pip install mypy bandit
-```
-
-### Expected Project Structure
-```
-project/
-├── pyproject.toml # Ruff, pytest, mypy config
-├── src/ or app/ # Source code
-│ └── *.py
-├── tests/ or test/ # Test files
-│ └── test_*.py
-└── .python-version # Optional Python version
-```
-
-See `references/pyproject-template.toml` for complete configuration template.
-
----
+Run four quality tools in deterministic order -- ruff, mypy, pytest, bandit -- and produce a structured pass/fail report with severity-categorized issues and auto-fix commands.

 ## Instructions

 ### Phase 1: Detection and Setup

-**Step 1: Detect project configuration**
+**Step 1: Read CLAUDE.md and detect project configuration.**
+
+Read and follow the repository's CLAUDE.md before any execution. Then detect project configuration:

 ```bash
 ls -la pyproject.toml setup.py setup.cfg mypy.ini .python-version 2>/dev/null
 ```

-Identify Python version target, ruff config, pytest config, mypy config from pyproject.toml.
+Identify Python version target, ruff config, pytest config, mypy config from pyproject.toml. Only validate code -- never add tools, features, or flexibility not requested.

-**Step 2: Detect source and test directories**
+**Step 2: Detect source and test directories.**

 ```bash
 ls -d src/ app/ lib/ 2>/dev/null || echo "Source: current directory"
 ls -d tests/ test/ 2>/dev/null || echo "Tests: not found"
 ```

-**Step 3: Verify tool availability**
+**Step 3: Verify tool availability.**

 ```bash
 ruff --version
@@ -139,53 +74,55 @@ ERROR: Required tool not found: {tool_name}
 Install with: pip install ruff pytest pytest-cov
 ```

+Do not install missing tools automatically unless the user explicitly requests it. Do not modify pyproject.toml or configuration files unless explicitly asked.
+
 **Gate**: ruff and pytest available. Project structure identified. Proceed only when gate passes.

 ### Phase 2: Execute Quality Checks

-Run all checks in order, capturing full output for each.
+Run all checks in fixed order, capturing full output for each. Show complete command output with exact file paths and line numbers -- never summarize or paraphrase tool output, because summarization hides the details engineers need to locate and fix issues.

-**Step 1: Ruff linting**
+**Step 1: Ruff linting.**

 ```bash
 ruff check . --output-format=grouped
 ```

-**Step 2: Ruff formatting check**
+**Step 2: Ruff formatting check.**

 ```bash
 ruff format --check .
 ```

-**Step 3: Type checking with mypy** (if installed)
+**Step 3: Type checking with mypy** (if installed).

 ```bash
 mypy . --ignore-missing-imports --show-error-codes
 ```

-Skip and note in report if mypy is not installed.
+Skip and note in report if mypy is not installed. Even if tests pass, still run mypy when available -- tests check behavior while types check contracts, and passing one does not make the other redundant.

-**Step 4: Run test suite**
+**Step 4: Run test suite.**

 ```bash
 pytest -v --tb=short --cov=src --cov-report=term-missing
 ```

-If no tests directory exists, skip and note in report.
+If no tests directory exists, skip and note in report. Never skip tests to make the gate pass -- tests verify functionality, and skipping them hides broken code. Only skip optional tools (mypy, bandit) if genuinely unavailable, not to manufacture a passing status.

-**Step 5: Security scanning with bandit** (if installed)
+**Step 5: Security scanning with bandit** (if installed).

 ```bash
 bandit -r src/ -ll --format=screen
 ```

-Skip and note in report if bandit is not installed.
+Skip and note in report if bandit is not installed. Linting passing does not mean code is correct -- linting finds style issues, not logic or security bugs. Run every available tool.

 **Gate**: All available tools have been run. Full output captured. Proceed to analysis.
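The fixed order and the skip rules for optional tools can be expressed as a small planning function. This is a sketch, not part of the skill: `CHECKS`, `plan_checks`, and the `is_available` parameter are hypothetical names, the command lines are copied from the Phase 2 steps, and treating ruff and pytest as required follows the Phase 1 gate. It does not cover the "no tests directory" case, which the skill handles by skipping pytest and noting it in the report.

```python
import shutil

# (tool, command, required) in the order the Phase 2 steps run them.
CHECKS = [
    ("ruff", ["ruff", "check", ".", "--output-format=grouped"], True),
    ("ruff", ["ruff", "format", "--check", "."], True),
    ("mypy", ["mypy", ".", "--ignore-missing-imports", "--show-error-codes"], False),
    ("pytest", ["pytest", "-v", "--tb=short", "--cov=src", "--cov-report=term-missing"], True),
    ("bandit", ["bandit", "-r", "src/", "-ll", "--format=screen"], False),
]


def plan_checks(is_available=lambda tool: shutil.which(tool) is not None):
    """Return (commands_to_run, skipped_optional_tools) in fixed order.

    Raises RuntimeError when a required tool is missing, mirroring the Phase 1 gate.
    """
    to_run, skipped = [], []
    for tool, command, required in CHECKS:
        if is_available(tool):
            to_run.append(command)
        elif required:
            raise RuntimeError(f"Required tool not found: {tool}")
        else:
            skipped.append(tool)  # noted in the report, never silently dropped
    return to_run, skipped
```

Each planned command could then be passed to `subprocess.run` without capturing or trimming its output, so the full tool output reaches the report as the phase requires.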
### Phase 3: Categorize and Analyze -**Step 1: Categorize issues by severity** +**Step 1: Categorize issues by severity.** See `references/tool-commands.md` for complete severity classification tables. @@ -195,15 +132,17 @@ Summary of severity levels: - **Medium**: W warnings, C4xx, no-untyped-def mypy errors - **Low**: SIM suggestions, UP upgrade suggestions -**Step 2: Count auto-fixable issues** +Always prioritize critical issues over style fixes -- critical issues (F errors, test failures) break functionality while style issues do not. Fix critical first, high second; use auto-fix for bulk style cleanup only after critical issues are resolved. + +**Step 2: Count auto-fixable issues.** ```bash ruff check . --statistics ``` -Issues marked with `[*]` are auto-fixable. +Issues marked with `[*]` are auto-fixable. Show suggested auto-fix commands for these issues so users know what can be fixed automatically. -**Step 3: Determine overall status** +**Step 3: Determine overall status.** FAIL if: - Any ruff F errors or test failures @@ -211,7 +150,7 @@ FAIL if: - Mypy errors exceed 10 (configurable) - Test coverage below 80% (if coverage enabled) -PASS otherwise. +PASS otherwise. Exit with non-zero status if any critical check fails. **Gate**: All issues categorized. Pass/fail determined. Proceed to report. @@ -223,19 +162,20 @@ The report MUST include: 1. Overall PASS/FAIL status 2. Summary table (each tool's status and issue count) 3. Total issues and auto-fixable count -4. Detailed results per tool (issues grouped by severity) +4. Detailed results per tool (issues grouped by severity, then grouped by type and file for readability) 5. Critical issues requiring attention with file:line references 6. Auto-fix commands section +7. Quality metrics: error counts and coverage percentages -Print the complete report to stdout. Never summarize or truncate. +Report facts -- show raw command output rather than describing it. No self-congratulation ("great job", "looking good"). 
Generate the full report even when only style issues are found, because style issues can hide real problems in noise and a full severity-prioritized report surfaces them. -If `--output {file}` flag provided, also write report to file. +Print the complete report to stdout. Never summarize or truncate. If `--output {file}` flag was provided, also write report to file. Remove any intermediate temporary files at completion -- keep the final report only if the user requested file output. **Gate**: Report generated and displayed. Task complete. ---- +### Auto-Fix Mode (only when explicitly requested) -## Auto-Fix Mode +Auto-fix modifies files in place -- never run it without explicit user confirmation. Running `ruff --fix` blindly can change code semantics (import removal, reformatting), so always run check-only first, review issues, confirm auto-fix intent, then verify changes. When user explicitly requests auto-fix: @@ -244,15 +184,11 @@ ruff check . --fix ruff format . ``` -After auto-fix, show diff and re-run quality gate to verify: +After auto-fix, show the diff so changes can be reviewed, then re-run the quality gate to verify: ```bash git diff ``` -ALWAYS warn that auto-fix modifies files in place. - ---- - ## Examples ### Example 1: Pre-Merge Quality Check @@ -273,8 +209,6 @@ Actions: 4. Generate FAILED report with critical issues listed first (Phase 4) Result: FAILED status with prioritized fix list, auto-fix commands for 5 medium issues ---- - ## Error Handling ### Error: "ruff: command not found" @@ -293,47 +227,8 @@ Result: FAILED status with prioritized fix list, auto-fix commands for 5 medium **Cause**: Stale or corrupted .mypy_cache directory **Solution**: Clear cache with `rm -rf .mypy_cache` and retry. If mypy continues to fail, skip type checking and note in report. 
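Returning to the Phase 3 status rules, the pass/fail decision can be sketched as a pure function. This covers only the FAIL conditions visible in this diff (the hunk elides some lines of that list, so the real skill may apply more rules). The function and parameter names are hypothetical; the thresholds of 10 mypy errors and 80% coverage come from the text, and `coverage=None` is an assumed way to represent "coverage not enabled".

```python
def overall_status(ruff_f_errors, test_failures, mypy_errors,
                   coverage=None, max_mypy_errors=10, min_coverage=80.0):
    """Return "FAIL" if any visible Phase 3 rule is violated, otherwise "PASS"."""
    if ruff_f_errors > 0 or test_failures > 0:
        return "FAIL"  # critical: broken names/imports or failing tests
    if mypy_errors > max_mypy_errors:
        return "FAIL"  # the threshold is described as configurable
    if coverage is not None and coverage < min_coverage:
        return "FAIL"  # only checked when coverage is enabled
    return "PASS"


def exit_code(status):
    """Map the status to a process exit code: non-zero on failure."""
    return 0 if status == "PASS" else 1
```

A caller would typically finish with `sys.exit(exit_code(status))` after printing the full report, so that CI sees the non-zero status the skill requires.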
---- - -## Anti-Patterns - -### Anti-Pattern 1: Auto-Fix Without Review -**What it looks like**: Running `ruff --fix` blindly without reviewing changes -**Why wrong**: Auto-fix can change code semantics (import removal, reformatting) -**Do instead**: Run check-only first, review issues, confirm auto-fix, then `git diff` - -### Anti-Pattern 2: Ignoring Critical Issues for Style Fixes -**What it looks like**: Fixing line lengths while undefined names exist -**Why wrong**: Critical issues (F errors, test failures) break functionality; style issues do not -**Do instead**: Fix critical first, high second, use auto-fix for bulk style issues - -### Anti-Pattern 3: Skipping Tests to "Make Gate Pass" -**What it looks like**: Using `--skip-tests` to get a passing status -**Why wrong**: Tests verify functionality. Skipping them hides broken code. -**Do instead**: Fix failing tests. Only skip optional tools (mypy, bandit) if genuinely unneeded. - -### Anti-Pattern 4: Wrong Tool for the Job -**What it looks like**: Running quality gate to debug a runtime bug -**Why wrong**: Linting finds style issues, not logical bugs -**Do instead**: Use systematic-debugging for runtime bugs, quality gate for pre-commit validation - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Linting passed, code is correct" | Linting finds style issues, not logic bugs | Run tests too | -| "Tests pass, no need for type checking" | Tests check behavior, types check contracts | Run mypy if available | -| "Auto-fix is safe, just run it" | Auto-fix can remove used imports, change semantics | Review changes with git diff | -| "Only 
style issues, skip the report" | Style issues hide real problems in noise | Generate full report, prioritize by severity | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/tool-commands.md`: Severity classifications, expected output formats, CLI flags - `${CLAUDE_SKILL_DIR}/references/report-template.md`: Full structured report template diff --git a/skills/quick/SKILL.md b/skills/quick/SKILL.md index 75958960..4d27a3dd 100644 --- a/skills/quick/SKILL.md +++ b/skills/quick/SKILL.md @@ -38,53 +38,35 @@ routing: # /quick - Tracked Lightweight Execution -## Operator Context - -This skill implements the Quick tier from the five-tier task hierarchy (Fast > Quick > Simple > Medium > Complex). It fills the gap between zero-ceremony `/fast` (1-3 edits, no plan) and full-ceremony Simple+ (task_plan.md, agent routing, quality gates). Quick tasks get a lightweight plan and tracking without the overhead of the full pipeline. - -The key design principle is **composable rigor**: the base mode is minimal (plan + execute), and users add process incrementally via flags rather than getting all-or-nothing ceremony. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution. -- **Task ID Assignment**: Every quick task gets a unique ID in YYMMDD-xxx format (Base36 sequence). This enables tracking and cross-referencing. -- **Inline Plan**: Create a brief inline plan (not a full task_plan.md) before executing. The plan is 3-5 lines: what changes, which files, why. This is the minimum viable plan — enough to catch misunderstandings before editing. -- **STATE.md Logging**: Log task ID, description, status, and commit hash to STATE.md. -- **Branch Safety**: Create a feature branch if on main/master. -- **Commit After Execute**: Every quick task ends with a commit. -- **No Parallel Execution**: Quick tasks are single-threaded. If parallelism is needed, upgrade to Simple+. 
- -### Default Behaviors (ON unless disabled) -- **Feature Branch Per Task**: Create `quick/-` branch for each task. This keeps quick work isolated and reviewable. -- **Conventional Commits**: Use conventional commit format. -- **Edit Tracking**: Count edits for scope awareness (warn at 10+, suggest upgrade at 15+). - -### Optional Behaviors (OFF unless enabled) -- **`--discuss`**: Add a pre-planning discussion phase to resolve ambiguities before committing to a plan. Use when requirements are unclear or the user says "I'm not sure exactly what I want." -- **`--research`**: Add a research phase before planning to understand existing code, read related files, and build context. Use when the change touches unfamiliar code. -- **`--full`**: Add plan verification + full quality gates after execution. Use when the change is small but high-risk (auth, payments, data migration). -- **`--no-branch`**: Skip feature branch creation, work on current branch. Use when contributing to an existing feature branch. -- **`--no-commit`**: Skip the commit step. Use when batching multiple quick tasks into one commit. - -## What This Skill CAN Do -- Plan and execute targeted code changes (4-15 file edits) -- Track tasks with unique IDs for auditability -- Compose rigor levels via flags (--discuss, --research, --full) -- Create isolated feature branches per task -- Escalate from /fast when scope is exceeded - -## What This Skill CANNOT Do -- Spawn subagents or parallel workers (upgrade to Simple+) -- Manage multi-component features (use feature lifecycle skills) -- Run wave-based parallel execution (use dispatching-parallel-agents) -- Replace full task_plan.md planning (that is Simple+ tier) +Quick fills the gap between zero-ceremony `/fast` (1-3 edits, no plan) and full-ceremony Simple+ (task_plan.md, agent routing, quality gates). 
The key design principle is **composable rigor**: the base mode is minimal (plan + execute), and users add process incrementally via flags rather than getting all-or-nothing ceremony. ---- +**Flags** (all OFF by default): + +| Flag | Effect | +|------|--------| +| `--discuss` | Add a pre-planning discussion phase to resolve ambiguities | +| `--research` | Add a research phase before planning to build context on unfamiliar code | +| `--full` | Add plan verification + full quality gates (tests, lint, diff review) | +| `--no-branch` | Skip feature branch creation, work on current branch | +| `--no-commit` | Skip the commit step (for batching multiple quick tasks) | ## Instructions -### Phase 0: DISCUSS (only with --discuss flag) +### Phase 0: SETUP + +**Step 1: Read CLAUDE.md** + +Read and follow the repository's CLAUDE.md before doing anything else, because repo-specific conventions override defaults and skipping this causes style/tooling mismatches. + +**Step 2: Parse flags** -**Goal**: Resolve ambiguities before planning. +Extract `--discuss`, `--research`, `--full`, `--no-branch`, and `--no-commit` from the invocation. Everything remaining after flag extraction is the task description. + +**Step 3: Scope check** + +If the task involves multiple components, architectural changes, or needs parallel execution, redirect to `/do` instead because quick tasks are single-threaded by design -- parallelism means the task has outgrown this tier. + +### Phase 1: DISCUSS (only with --discuss flag) This phase activates when the user passes `--discuss` or the request contains signals of uncertainty ("not sure", "maybe", "could be", "what do you think"). @@ -112,13 +94,11 @@ Read the request and list specific questions: Wait for user response. Do not proceed until ambiguities are resolved. -**GATE**: All ambiguities resolved. Proceed to Phase 0.5 or Phase 1. +**GATE**: All ambiguities resolved. Proceed to Phase 2 or Phase 3. 
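
The flag extraction described in Phase 0, Step 2 can be sketched as below. This is a minimal illustration, not the skill's actual parser: the function name `parse_quick_invocation` is hypothetical, and it assumes the input is the text that follows `/quick`, split on whitespace.

```python
# Hypothetical helper: the skill text only says "extract the flags; the rest
# is the task description". Names here are illustrative assumptions.
KNOWN_FLAGS = ("--discuss", "--research", "--full", "--no-branch", "--no-commit")


def parse_quick_invocation(invocation: str) -> tuple[dict[str, bool], str]:
    """Split the text after `/quick` into known flags and a task description."""
    flags = {flag: False for flag in KNOWN_FLAGS}
    remaining = []
    for token in invocation.split():
        if token in flags:
            flags[token] = True
        else:
            # Unknown tokens, including things like `--verbose`, stay in the
            # description because they belong to the task, not to /quick.
            remaining.append(token)
    return flags, " ".join(remaining)
```

Only exact matches against the five documented flags are consumed, so a request such as `add --verbose flag to the CLI` keeps `--verbose` in the task description.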
-### Phase 0.5: RESEARCH (only with --research flag) +### Phase 2: RESEARCH (only with --research flag) -**Goal**: Build understanding of the relevant code before planning. - -This phase activates when the user passes `--research` or the task touches code that needs investigation. +This phase activates when the user passes `--research` or the task touches code that needs investigation. Use `--research` when touching unfamiliar code because confidence about code behavior is not the same as correctness -- `/fast` exists for when you truly know. **Step 1: Identify scope** @@ -135,27 +115,28 @@ Read relevant source files, tests, and configuration. Build a mental model of: Present a brief (3-5 line) summary of what you learned and how it affects the plan. -**GATE**: Sufficient understanding to plan the change. Proceed to Phase 1. - -### Phase 1: PLAN +**GATE**: Sufficient understanding to plan the change. Proceed to Phase 3. -**Goal**: Create a lightweight inline plan. +### Phase 3: PLAN **Step 1: Generate task ID** -Format: `YYMMDD-xxx` where xxx is Base36 sequential. +Assign the task ID now, not later, because untracked tasks become invisible and "later" never comes. + +Format: `YYMMDD-xxx` where xxx is Base36 sequential (0-9, a-z). -To determine the next sequence number: ```bash # Check STATE.md for today's tasks to determine next sequence date_prefix=$(date +%y%m%d) ``` -If STATE.md exists in the repo root, find the highest sequence number for today's date prefix and increment. If no tasks today, start at `001`. Use Base36 (0-9, a-z) for the sequence: 001, 002, ... 009, 00a, 00b, ... 00z, 010, ... +If STATE.md exists in the repo root, find the highest sequence number for today's date prefix and increment. If no tasks today, start at `001`. Use Base36 for the sequence: 001, 002, ... 009, 00a, 00b, ... 00z, 010, ... + +If STATE.md is corrupted, scan git log for `Quick task YYMMDD-` patterns to find the true next ID. 
If a branch name collision occurs, increment the sequence number and try again. **Step 2: Create inline plan** -Display the plan — do NOT write a task_plan.md file: +Always display the inline plan, even for obvious tasks, because the plan catches misunderstandings before they become wrong edits and confirms alignment in 10 seconds that saves minutes. Do NOT write a task_plan.md file -- that is Simple+ tier, and using an inline plan here is the minimum viable ceremony. ``` =================================================================== @@ -173,33 +154,37 @@ Display the plan — do NOT write a task_plan.md file: =================================================================== ``` -If estimated edits exceed 15, suggest upgrading: +If estimated edits exceed 15, suggest upgrading because edit count is a scope signal regardless of difficulty: ``` This task estimates 15+ edits. Consider using /do for full planning and agent routing. Proceed with /quick anyway? [Y/n] ``` +If the task involves security, payments, or data migration, recommend `--full` because a one-line auth change can be catastrophic and risk is about impact, not size. + **Step 3: Create feature branch** (unless --no-branch) +Create a feature branch because small changes on main break the same as big ones: + ```bash git checkout -b quick/- ``` -**GATE**: Task ID assigned, plan displayed, branch created. Proceed to Phase 2. +If already on a non-main feature branch and `--no-branch` is set, stay on the current branch. -### Phase 2: EXECUTE +**GATE**: Task ID assigned, plan displayed, branch created. Proceed to Phase 4. -**Goal**: Implement the plan. +### Phase 4: EXECUTE **Step 1: Make edits** -Execute the changes described in the plan. Track edit count. +Execute the changes described in the plan. Track edit count throughout. **Step 2: Scope monitoring** -- At 10 edits: display a warning — "10 edits reached. Quick tasks typically stay under 15." -- At 15 edits: suggest upgrade — "15 edits reached. 
This may benefit from /do with full planning. Continue? [Y/n]" -- No hard cap — the user decides. Quick's scope is advisory, not enforced like Fast's 3-edit gate. +- At 10 edits: display a warning -- "10 edits reached. Quick tasks typically stay under 15." +- At 15 edits: suggest upgrade -- "15 edits reached. This may benefit from /do with full planning. Continue? [Y/n]" +- No hard cap -- the user decides. Quick's scope is advisory, not enforced like Fast's 3-edit gate. **Step 3: Verify changes** (base mode) @@ -209,13 +194,11 @@ Run a quick sanity check: # e.g., python3 -m py_compile file.py, go build ./..., tsc --noEmit ``` -If `--full` flag is set, run the full quality gate instead (see Phase 2.5). +If `--full` flag is set, run the full quality gate instead (see Phase 5). **GATE**: All planned edits complete. Sanity check passes. -### Phase 2.5: VERIFY (only with --full flag) - -**Goal**: Run full quality gates on the changes. +### Phase 5: VERIFY (only with --full flag) **Step 1: Run tests** @@ -239,11 +222,9 @@ Review the diff for: - Missing error handling - Broken imports -**GATE**: Tests pass, lint clean, diff reviewed. Proceed to Phase 3. +**GATE**: Tests pass, lint clean, diff reviewed. Proceed to Phase 6. -### Phase 3: COMMIT - -**Goal**: Commit with a clean message. +### Phase 6: COMMIT (skip with --no-commit) **Step 1: Stage changes** @@ -251,8 +232,12 @@ Review the diff for: git add ``` +Stage specific files, not `git add .`, to avoid accidental inclusions. + **Step 2: Commit** +Use conventional commit format because it enables automated changelogs and consistent history: + ```bash git commit -m "$(cat <<'EOF' : @@ -266,12 +251,12 @@ Include the task ID in the commit body for traceability. **GATE**: Commit succeeded. Verify with `git log -1 --oneline`. -### Phase 4: LOG - -**Goal**: Record the task in STATE.md. +### Phase 7: LOG **Step 1: Update STATE.md** +Log the task to STATE.md because this is how tasks stay visible and cross-referenceable. 
+ If STATE.md does not exist in the repo root, create it: ```markdown @@ -312,37 +297,45 @@ If the task was escalated from `/fast`, note the tier as `fast->quick`. =================================================================== ``` ---- +## Reference Material + +### Examples -## Examples +**Example 1: Base Mode** -### Example 1: Base Mode User says: `/quick add --verbose flag to the CLI` 1. Generate ID: 260322-001 2. Plan: add flag definition, wire to handler, update help text (3 edits) 3. Create branch: `quick/260322-001-add-verbose-flag` 4. Execute edits, commit, log to STATE.md -### Example 2: With Research +**Example 2: With Research** + User says: `/quick --research fix the timeout bug in auth middleware` 1. RESEARCH: Read auth middleware, identify timeout source, trace call path 2. PLAN: change timeout value in config, update middleware to use it (2 edits) 3. EXECUTE, COMMIT, LOG -### Example 3: Escalated from Fast +**Example 3: Escalated from Fast** + `/fast` hit 3-edit limit while fixing a bug across 5 files. -1. Quick picks up with context: "Continuing from /fast — 3 files already edited" +1. Quick picks up with context: "Continuing from /fast -- 3 files already edited" 2. PLAN: remaining 2 files to edit 3. EXECUTE remaining edits, COMMIT all changes, LOG as tier `fast->quick` -### Example 4: Full Rigor +**Example 4: Full Rigor** + User says: `/quick --full update payment amount rounding logic` 1. PLAN: identify rounding function, change to banker's rounding 2. EXECUTE the edit 3. VERIFY: run payment tests, lint, review diff 4. COMMIT, LOG ---- +### Task ID Format + +Base36 sequence: `001, 002, ... 009, 00a, 00b, ... 00z, 010, ...` + +Full ID: `YYMMDD-xxx` (e.g., `260322-001`, `260322-00a`) ## Error Handling @@ -361,40 +354,3 @@ User says: `/quick --full update payment amount rounding logic` ### Error: Branch Conflict **Cause**: Branch `quick/-...` already exists **Solution**: Increment the task ID sequence number and try again. 
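
The `YYMMDD-xxx` sequencing and the "increment and try again" recovery can be sketched as follows. This is an illustration under assumptions: `to_base36` and `next_task_id` are hypothetical names, and the sketch assumes task IDs appear verbatim somewhere in the text it is given (for example the contents of STATE.md).

```python
import re
from datetime import date

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(value: int, width: int = 3) -> str:
    """Encode a non-negative integer as zero-padded lowercase Base36."""
    if value < 0:
        raise ValueError("sequence must be non-negative")
    digits = ""
    while value:
        value, remainder = divmod(value, 36)
        digits = BASE36_DIGITS[remainder] + digits
    return digits.rjust(width, "0")


def next_task_id(state_text: str, today: date) -> str:
    """Return the next YYMMDD-xxx ID given text that lists existing IDs."""
    prefix = today.strftime("%y%m%d")
    # Collect every sequence already used with today's date prefix.
    pattern = rf"\b{prefix}-([0-9a-z]{{3}})\b"
    used = [int(seq, 36) for seq in re.findall(pattern, state_text)]
    # With no tasks today, max(...) falls back to 0, so the first ID is 001.
    return f"{prefix}-{to_base36(max(used, default=0) + 1)}"
```

Feeding the same function a list of existing `quick/` branch names instead of STATE.md contents would be one way to handle the branch-conflict case, since the pattern also matches IDs embedded in branch names.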
- ---- - -## Anti-Patterns - -### Anti-Pattern 1: Skipping the Plan -**What it looks like**: Jumping straight to edits without displaying the inline plan -**Why wrong**: The plan catches misunderstandings before they become wrong edits. It takes 10 seconds and saves minutes. -**Do instead**: Always display the inline plan. Even for obvious tasks — it confirms alignment. - -### Anti-Pattern 2: Using Quick for Features -**What it looks like**: Building a multi-component feature as a series of `/quick` tasks -**Why wrong**: Features need design docs, coordinated implementation, and integration testing. Quick tasks are isolated units. -**Do instead**: Use the feature lifecycle (`/feature-design` -> `/feature-plan` -> `/feature-implement`). - -### Anti-Pattern 3: Never Using Flags -**What it looks like**: Always running base `/quick` even when research or verification is clearly needed -**Why wrong**: Base mode assumes you know exactly what to change. When you don't, you make wrong changes faster. -**Do instead**: Use `--research` when touching unfamiliar code, `--discuss` when requirements are unclear, `--full` when the change is high-risk. - -### Anti-Pattern 4: Using Quick to Avoid Planning -**What it looks like**: Classifying a Simple+ task as "quick" to skip task_plan.md -**Why wrong**: The inline plan is not a substitute for full planning. Complex tasks need full plans. -**Do instead**: If the task genuinely needs a full plan, use `/do` and let the router classify properly. 
- ---- - -## Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "This is quick, no need for a plan" | Quick's inline plan IS the minimum — skipping it means no plan at all | Display the inline plan | -| "15 edits but it's all simple stuff" | Edit count is a scope signal, not a difficulty signal | Show the upgrade suggestion at 15 | -| "I'll add the task ID later" | Later never comes; untracked tasks are invisible | Assign ID in Phase 1 | -| "No need for a branch, it's small" | Small changes on main break the same as big ones | Create feature branch (or use --no-branch explicitly) | -| "Skip --research, I know this code" | Confidence != correctness; /fast exists for when you truly know | Use --research when touching unfamiliar code | -| "Don't need --full for this" | Risk is about impact, not size; a one-line auth change can be catastrophic | Use --full for any security/payment/data change | diff --git a/skills/read-only-ops/SKILL.md b/skills/read-only-ops/SKILL.md index cb19ed33..c0c0029b 100644 --- a/skills/read-only-ops/SKILL.md +++ b/skills/read-only-ops/SKILL.md @@ -22,43 +22,11 @@ routing: # Read-Only Operations Skill -## Operator Context - -This skill operates as an operator for safe exploration and reporting, configuring Claude's behavior to NEVER modify files or system state during investigation. It implements the **Observation Only** architectural pattern -- gather evidence, report facts, never alter state. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before exploration -- **Over-Engineering Prevention**: Only explore what is directly requested. 
No speculative investigations or comprehensive audits unless explicitly asked -- **NEVER use Write or Edit tools**: Under no circumstances modify files -- **NEVER run destructive Bash commands**: No rm, mv, cp, mkdir, kill, touch, or write redirects (>, >>) -- **NEVER modify databases**: Only SELECT queries; never INSERT, UPDATE, DELETE, or DROP -- **NEVER modify git state**: No add, commit, push, checkout, or reset commands -- **Show complete output**: Display full command results; never summarize away details the user needs to verify - -### Default Behaviors (ON unless disabled) -- **Structured reporting**: Lead with key findings summary, details below -- **List files examined**: Document which files were read for transparency -- **Include timestamps**: Show when status was captured for time-sensitive checks -- **Scope confirmation**: Confirm scope before broad searches to avoid wasting tokens -- **Temporary file cleanup**: Remove any temp files created during exploration at task end - -### Optional Behaviors (OFF unless enabled) -- **Deep exploration**: Recursively examine nested directories and dependencies -- **Performance metrics**: Include timing information for operations -- **Diff comparison**: Compare current state against known baselines - -## What This Skill CAN Do -- Read files, search codebases, and report findings -- Run read-only Bash commands (ls, ps, git status, git log, du, df, curl GET) -- Execute SELECT queries against databases -- Produce structured status reports with evidence - -## What This Skill CANNOT Do -- Modify, create, or delete any files -- Run destructive or state-changing Bash commands -- Execute write operations against databases -- Install, remove, or update packages -- Alter git state in any way +## Overview + +This skill operates as a safe exploration and reporting mechanism without ever modifying files or system state. Use it when you need to gather evidence, verify facts, or show current state to the user. 
+ +The core principle: **Observation Only**. Gather evidence. Report facts. Never alter state. --- @@ -69,23 +37,27 @@ This skill operates as an operator for safe exploration and reporting, configuri **Goal**: Understand exactly what the user wants to know before exploring. **Step 1: Parse the request** + +Determine: - What specific information is the user asking for? - What is the target scope (specific file, directory, service, system-wide)? - Are there implicit constraints (time range, file type, component)? **Step 2: Confirm scope if ambiguous** -If the request could match dozens of results or span the entire filesystem, clarify before proceeding. If the scope is clear, proceed directly. +If the request could match dozens of results or span the entire filesystem, clarify before proceeding. If the scope is clear, proceed directly. This prevents wasting tokens on over-broad searches. **Gate**: Scope is understood. Target locations are identified. Proceed only when gate passes. +--- + ### Phase 2: GATHER -**Goal**: Collect evidence using read-only tools. +**Goal**: Collect evidence using read-only tools. Tools must never modify state. **Step 1: Execute read-only operations** -Allowed commands: +**Allowed commands** (safe for read-only use): ``` ls, find, wc, du, df, file, stat ps, top -bn1, uptime, free, pgrep @@ -95,7 +67,7 @@ curl -s (GET only) date, timedatectl, env ``` -Forbidden commands: +**Forbidden commands** (violate read-only constraint absolutely): ``` mkdir, rm, mv, cp, touch, chmod, chown git add, git commit, git push, git checkout, git reset @@ -105,26 +77,32 @@ npm install, pip install, apt install pkill, kill, systemctl restart/stop ``` +Rationale: Even "harmless" state changes violate the read-only boundary. Use the read-only equivalent instead (e.g., `ls -la` instead of `mkdir -p`, `git status` instead of `git add`, `SELECT` instead of `INSERT`). + **Step 2: Record raw output** -Show complete command output. 
Do not paraphrase or truncate unless output exceeds reasonable display length, in which case show representative samples with counts. +Show complete command output. Do not paraphrase or truncate unless output exceeds reasonable display length, in which case show representative samples with counts. The user must be able to verify your claims from the evidence shown. **Gate**: All requested data has been gathered with read-only commands. No state was modified. Proceed only when gate passes. +--- + ### Phase 3: REPORT **Goal**: Present findings in a structured, verifiable format. **Step 1: Summarize key findings at the top** -Lead with what the user asked about. Answer the question first, then provide supporting details. +Lead with what the user asked about. Answer the question first, then provide supporting details. This prevents burying the answer in verbose output. **Step 2: Show evidence** -Include command output, file contents, or search results that support the summary. The user must be able to verify claims from the evidence shown. +Include command output, file contents, or search results that support the summary. The user must be able to verify claims from the evidence shown. Never summarize away details — show the raw data. **Step 3: List files examined** +Document which files were read for transparency: + ```markdown ### Files Examined - `/path/to/file1` - why it was read @@ -138,54 +116,37 @@ Include command output, file contents, or search results that support the summar ## Error Handling ### Error: "Attempted to use Write or Edit tool" -Cause: Skill boundary violation -- tried to modify a file -Solution: This skill only permits Read, Grep, Glob, and read-only Bash. Report findings verbally; do not write them to files unless the user explicitly grants permission. +**Cause**: Skill boundary violation — tried to modify a file. +**Solution**: This skill only permits Read, Grep, Glob, and read-only Bash. 
Report findings verbally; do not write them to files unless the user explicitly grants permission. Violating the read-only boundary defeats the purpose of the skill. ### Error: "Bash command would modify state" -Cause: Attempted destructive or state-changing command -Solution: Use the read-only equivalent (e.g., `ls -la` instead of `mkdir -p`, `git status` instead of `git add`, `SELECT` instead of `INSERT`). +**Cause**: Attempted destructive or state-changing command. +**Solution**: Use the read-only equivalent. For example: +- `ls -la` instead of `mkdir -p` +- `git status` instead of `git add` +- `SELECT` instead of `INSERT` +- `stat` or `[ -d /path ] && echo exists` instead of `mkdir -p /tmp/test` ### Error: "Scope too broad, results overwhelming" -Cause: Search returned hundreds of matches without filtering -Solution: Return to Phase 1. Narrow scope by file type, directory, or pattern before re-executing. - ---- - -## Anti-Patterns +**Cause**: Search returned hundreds of matches without filtering. +**Solution**: Return to Phase 1. Narrow scope by file type, directory, or pattern before re-executing. For example, instead of searching the entire filesystem for "config", search `~/.config/` or `./etc/` with a specific file extension. -### Anti-Pattern 1: Investigating Everything -**What it looks like**: User asks about API server status; Claude audits all services, configs, logs, and dependencies -**Why wrong**: Wastes tokens, buries the answer, scope was never that broad -**Do instead**: Answer the specific question. Offer to investigate further if needed. +### Common Patterns to Avoid -### Anti-Pattern 2: Summarizing Away Evidence -**What it looks like**: "The repository has 3 modified files and is clean" instead of showing `git status` output -**Why wrong**: User cannot verify the claim. Missing details (which files? staged or unstaged?) -**Do instead**: Show complete command output. Let the user draw conclusions. 
+**Investigating Everything**: User asks about API server status; you audit all services, configs, logs, and dependencies. Why wrong: Wastes tokens, buries the answer. The scope was never that broad. Do instead: Answer the specific question. Offer to investigate further if needed. -### Anti-Pattern 3: Modifying State "Just to Check" -**What it looks like**: Running `mkdir -p /tmp/test` to check if a path is writable -**Why wrong**: Creates state change. Violates read-only constraint absolutely. -**Do instead**: Use `ls -la`, `stat`, or `[ -d /path ] && echo exists` for read-only checks. +**Summarizing Away Evidence**: "The repository has 3 modified files and is clean" instead of showing `git status` output. Why wrong: User cannot verify the claim. Missing details (which files? staged or unstaged?) prevent verification. Do instead: Show complete command output. Let the user draw conclusions. -### Anti-Pattern 4: Exploring Before Scoping -**What it looks like**: User says "find config files"; Claude immediately searches entire filesystem -**Why wrong**: May return hundreds of irrelevant results. Wastes time without direction. -**Do instead**: Confirm scope (which config? where? what format?) then search targeted locations. +**Exploring Before Scoping**: User says "find config files"; you immediately search entire filesystem. Why wrong: May return hundreds of irrelevant results. Wastes time without direction. Do instead: Confirm scope (which config? where? what format?) then search targeted locations. --- ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks +### Skill Design Philosophy + +This skill enforces the **Observation Only** architectural pattern to enable safe, passive exploration without side effects. 
The constraint is absolute: tools must never modify state, even to "verify" something. Verification that requires modification (e.g., "is this directory writable?") should use read-only checks (`stat`, `ls -la`, test operators). -### Domain-Specific Anti-Rationalization +### CLAUDE.md Compliance -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I'll just quickly create a temp file to store results" | Any file creation violates read-only constraint | Report findings in response text only | -| "This git command is harmless" | Only explicitly allowed git commands are safe | Check against allowed list before running | -| "The user probably wants me to fix this too" | Read-only means observe and report, never act | Report findings, let user decide next steps | -| "I'll summarize to save space" | Summaries hide details the user needs to verify | Show complete output, summarize at top | +This skill follows the CLAUDE.md principle of verification over assumption and artifacts over memory. All claims are backed by shown evidence. No paraphrasing. No hidden state changes. diff --git a/skills/repo-value-analysis/SKILL.md b/skills/repo-value-analysis/SKILL.md index 9a5061c4..0ed24503 100644 --- a/skills/repo-value-analysis/SKILL.md +++ b/skills/repo-value-analysis/SKILL.md @@ -35,44 +35,11 @@ routing: # Repo Competitive Analysis Pipeline -## Operator Context - -This skill operates as an operator for systematic repo value analysis of external repositories against our toolkit. It implements a **6-phase Pipeline Architecture** — clone, parallel deep-read, self-inventory, synthesis, targeted audit, reality-grounded report — with parallel subagents dispatched via the Agent tool. 
- -### Hardcoded Behaviors (Always Apply) -- **Full File Reading**: Agents MUST read every file in their assigned zone, not sample or skim -- **Artifacts at Every Phase**: Save findings to files; context is ephemeral -- **Reality-Grounding**: Every recommendation MUST be audited against our actual codebase before inclusion in the final report -- **Read-Only on External Repo**: Never modify the cloned repository -- **Comparison Focus**: All analysis is relative — "what do they have that we lack?" not "what do they have?" -- **Structured Output**: Final report follows the prescribed table format - -### Default Behaviors (ON unless disabled) -- **Parallel Deep-Read**: Dispatch 1 agent per analysis zone (up to 8 zones) -- **Self-Inventory**: 1 agent catalogs our own system in parallel with deep-read -- **Zone Capping**: Cap each analysis zone at ~100 files; split larger zones -- **Draft-Then-Final**: Phase 4 saves a draft; Phase 6 overwrites with the audited final report -- **ADR Suggestion**: If HIGH-value items found, suggest creating an adoption ADR - -### Optional Behaviors (OFF unless enabled) -- **Skip Clone**: Use `--local [path]` if the repo is already cloned or is a local directory -- **Focus Zone**: Use `--zone [name]` to analyze only a specific zone (e.g., skills, hooks) -- **Quick Mode**: Use `--quick` to skip Phase 5 audit (produces unverified recommendations) - -## What This Skill CAN Do -- Clone and systematically analyze an external repository using parallel subagents -- Read every file across categorized analysis zones -- Inventory our own toolkit for accurate comparison -- Produce a reality-grounded comparison report with effort estimates -- Identify genuine gaps (things they have, we lack) vs superficial differences -- Suggest ADR creation for high-value adoption candidates - -## What This Skill CANNOT Do -- Modify files in either repository (read-only analysis) -- Implement recommended changes (use feature-implement or systematic-refactoring) -- 
Analyze private repos without proper authentication configured -- Replace domain-expert judgment on adoption decisions -- Guarantee completeness for repos with 10,000+ files (zone capping applies) +## Overview + +This skill conducts systematic 6-phase analysis of external repositories to assess their value for adoption. You dispatch parallel subagents to read and catalog every file in an external repo, inventory your own toolkit in parallel, identify genuine capability gaps, audit those gaps against your actual codebase, and produce a reality-grounded comparison report with adoption recommendations. + +The pipeline enforces **full file reading** (not sampling), **parallel execution** (up to 8 agent zones simultaneously), and **mandatory audit** (every recommendation verified before reporting). Optional flags allow local analysis (`--local`), zone focus (`--zone`), and quick comparison (`--quick` skips audit). --- @@ -89,7 +56,7 @@ Set `REPO_NAME` and `REPO_PATH` variables for use throughout the pipeline. ### Phase 1: CLONE -**Goal**: Obtain the repository and categorize its contents into analysis zones. +**Goal**: Obtain the repository and categorize its contents into zones for parallel deep-read. **Step 1: Clone the repository** @@ -97,7 +64,7 @@ Set `REPO_NAME` and `REPO_PATH` variables for use throughout the pipeline. git clone --depth 1 /tmp/ ``` -If `--local` flag was provided, skip cloning and use the provided path. +If `--local` flag was provided, skip cloning and use the provided path instead. This allows re-analysis of already-cloned repos without redundant network calls. **Step 2: Count and categorize files** @@ -105,9 +72,11 @@ Survey the repository structure: - Count total files (excluding `.git/`) - List top-level directories with file counts +This gives you a baseline for zone complexity and helps identify sub-repo patterns. 
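
The survey in Step 2 can be sketched as below. This is a minimal sketch, assuming a local checkout; the function name `count_files_by_top_level` is hypothetical and is not part of the skill.

```python
import os
from collections import Counter
from pathlib import Path


def count_files_by_top_level(repo_path: str) -> Counter:
    """Count files under each top-level directory, skipping .git/."""
    root = Path(repo_path)
    counts: Counter = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk never descends into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        relative = Path(dirpath).relative_to(root)
        # Files directly in the repo root are grouped under ".".
        top = relative.parts[0] if relative.parts else "."
        counts[top] += len(filenames)
    return counts
```

Summing the counter's values gives the total file count, and any top-level directory with a count well above ~100 is a likely candidate for splitting into sub-zones.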
+ **Step 3: Define analysis zones** -Categorize files into zones based on directory names and file patterns: +Categorize files into zones based on directory names and file patterns. Zones organize the repo into digestible chunks: | Zone | Typical directories/patterns | Purpose | |------|------------------------------|---------| @@ -120,23 +89,25 @@ Categorize files into zones based on directory names and file patterns: | code | `scripts/`, `src/`, `lib/`, `pkg/`, `*.py`, `*.go`, `*.ts` | Source code | | other | Everything else | Uncategorized files | -**Step 4: Cap zones** +**Step 4: Cap zones for parallel feasibility** -If any zone exceeds ~100 files: -1. Split it into sub-zones by subdirectory -2. Each sub-zone gets its own agent in Phase 2 -3. Log the split in the analysis notes +If any zone exceeds ~100 files, split it into sub-zones by subdirectory. Each sub-zone gets its own agent in Phase 2. Cap at ~100 files per agent because: +- Agents MUST read **every file** in their zone, not sample or skim (sampling introduces bias and misses distinguishing components) +- ~100 files is feasible for a single agent within budget and timeout +- Larger zones are split, so no single agent is overwhelmed + +Log the split decisions in the analysis notes for transparency. **Gate**: Repository cloned (or local path validated). All files categorized into zones. Zone file counts recorded. No zone exceeds ~100 files (split if needed). Proceed only when gate passes. ### Phase 2: DEEP-READ (Parallel) -**Goal**: Read every file in every zone of the external repository. +**Goal**: Read every file in every zone of the external repository to extract techniques, patterns, and potential capability gaps. Dispatch 1 Agent per analysis zone (background). 
Each agent receives: - The zone name and file list -- Instructions to read EVERY file (not sample, not skim) -- A structured output template +- Instructions to read EVERY file (not sample, not skim) to avoid sampling bias +- A structured output template that captures what they have, not just what they are **Agent instructions template** (replace ALL bracketed placeholders with actual values before dispatching): @@ -172,15 +143,15 @@ After reading ALL files, produce a structured summary: Save your findings to /tmp/[REPO_NAME]-zone-[zone].md ``` -Dispatch up to 8 agents in parallel. If more than 8 zones exist, batch them (first 8, wait, then remaining). +Dispatch up to 8 agents in parallel for speed. If more than 8 zones exist, batch them: dispatch the first 8, wait until they complete or hit the 5-minute timeout, then dispatch the remaining zones. -**Gate**: All zone agents have completed (or timed out after 5 minutes each). At least 75% of agents returned results. Zone finding files exist in `/tmp/`. Proceed only when gate passes. +**Gate**: All zone agents have completed (or timed out after 5 minutes each). At least 75% of agents returned results (tolerance for individual agent failure). Zone finding files exist in `/tmp/`. Proceed only when gate passes. ### Phase 3: INVENTORY (Parallel with Phase 2) -**Goal**: Catalog our own toolkit for accurate comparison. +**Goal**: Catalog our own toolkit simultaneously with Phase 2 deep-read for faster wall-clock time. -Dispatch 1 Agent (in background, concurrent with Phase 2) to inventory our system: +Dispatch 1 Agent (in background, concurrent with Phase 2 zone agents) to inventory our system. Running this in parallel is safe because inventory is a read-only catalog of our codebase: ``` You are cataloging the claude-code-toolkit repository for comparison purposes. 
@@ -199,33 +170,39 @@ For each category, note: Save your inventory to /tmp/self-inventory.md ``` +Running this in parallel (not waiting for Phase 2 to finish) reduces total pipeline time from `Phase1 + Phase2 + Phase3` to roughly `Phase1 + max(Phase2, Phase3)`. + **Gate**: Self-inventory agent completed (or timed out after 5 minutes). `/tmp/self-inventory.md` exists and contains counts for all 4 component types. Proceed only when gate passes. ### Phase 4: SYNTHESIZE -**Goal**: Merge findings from Phase 2 and Phase 3 into a comparison with candidate recommendations. +**Goal**: Merge Phase 2 and Phase 3 findings into a draft comparison with candidate adoption recommendations. -**Step 1: Read all zone findings** +**Step 1: Read all zone findings and inventory** -Read every `/tmp/[REPO_NAME]-zone-*.md` file and `/tmp/self-inventory.md`. +Read every `/tmp/[REPO_NAME]-zone-*.md` file and `/tmp/self-inventory.md` to build a unified picture. **Step 2: Build comparison table** -For each capability area discovered in the external repo: +For each capability area discovered in the external repo, document what we have vs what they have: | Capability | Their Approach | Our Approach | Gap? | |------------|---------------|--------------|------| | ... | ... | ... | Yes/No/Partial | +This table is relative: "what do they have that we lack?" not "what do they have?" 
+ **Step 3: Identify candidate recommendations** For each genuine gap (not just a different approach to the same thing): - Describe what they have - Describe what we lack -- Rate value: HIGH / MEDIUM / LOW -- HIGH = addresses a real pain point or enables new capability -- MEDIUM = nice to have, improves existing workflow -- LOW = marginal improvement, different but not better +- Rate value honestly: HIGH / MEDIUM / LOW + - HIGH = addresses a real pain point or enables new capability + - MEDIUM = nice to have, improves existing workflow + - LOW = marginal improvement, different but not better + +Resist the temptation to over-count differences as gaps. A different naming convention is not a gap worth addressing. **Step 4: Save draft report** @@ -233,15 +210,17 @@ Save to `research-[REPO_NAME]-comparison.md` with: - Executive summary - Comparison table - Candidate recommendations with ratings -- Clear "DRAFT — pending audit" watermark +- Clear "DRAFT — pending Phase 5 audit" watermark + +This draft is intentionally unaudited so you can bail out early if findings look weak. **Gate**: Draft report saved. At least 1 candidate recommendation identified (or explicit "no gaps found" conclusion). All recommendations have value ratings. Proceed only when gate passes. ### Phase 5: AUDIT (Parallel) -**Goal**: Reality-check each HIGH and MEDIUM recommendation against our actual codebase. +**Goal**: Reality-check each HIGH and MEDIUM recommendation against our actual codebase to catch "we already have this" false positives. -For each HIGH or MEDIUM recommendation, dispatch 1 Agent (in background): +For each HIGH or MEDIUM recommendation, dispatch 1 Agent (in background). 
Audit is what separates superficial analysis from rigorous analysis — skipping it produces unverified recommendations that erode trust: ``` You are auditing whether recommendation "[recommendation]" is already @@ -267,23 +246,27 @@ Save findings to /tmp/audit-[recommendation-slug].md with: [1-2 sentence conclusion] ``` +Dispatch audit agents in parallel for speed. If `--quick` flag was used in the initial call, skip Phase 5 entirely and proceed directly to Phase 6 with unaudited recommendations (noted in final report as unverified). + **Gate**: All audit agents completed (or timed out after 5 minutes). At least 75% returned results. Audit files exist in `/tmp/`. Proceed only when gate passes. ### Phase 6: REPORT -**Goal**: Produce the final reality-grounded report. +**Goal**: Produce the final, reality-grounded report with recommendations verified by Phase 5 audit. -**Step 1: Read all audit findings** +**Step 1: Read all audit findings (unless --quick was used)** -Read every `/tmp/audit-*.md` file. +Read every `/tmp/audit-*.md` file. If `--quick` flag was used, skip this step and note in the report that recommendations are unaudited. -**Step 2: Adjust recommendations** +**Step 2: Adjust recommendations based on audit coverage** For each recommendation: -- If audit found ALREADY EXISTS: remove from recommendations, note in "already covered" section -- If audit found PARTIAL: adjust description to focus on what's actually missing +- If audit found ALREADY EXISTS: remove from recommendations, note in "Already Covered" section with the exact files +- If audit found PARTIAL: adjust description to focus on what's actually missing, cite the partial files - If audit found MISSING: keep as-is, add the affected files from audit +This adjustment step catches the false positive anti-pattern: "we should adopt X" when we already have X. 
+ **Step 3: Build final report** Overwrite `research-[REPO_NAME]-comparison.md` with the final report: @@ -329,9 +312,9 @@ Overwrite `research-[REPO_NAME]-comparison.md` with the final report: **Step 4: Cleanup** -Remove temporary zone and audit files from `/tmp/` (keep the cloned repo for reference). +Remove temporary zone and audit files from `/tmp/` (keep the cloned repo for reference if further investigation is needed). -**Gate**: Final report saved to `research-[REPO_NAME]-comparison.md`. Report contains comparison table, adjusted recommendations, and verdict. No "DRAFT" watermark remains. All recommendations have been reality-checked against audit findings. Proceed only when gate passes. +**Gate**: Final report saved to `research-[REPO_NAME]-comparison.md`. Report contains comparison table, adjusted recommendations based on audit findings, and verdict. No "DRAFT" watermark remains. All recommendations have been reality-checked against Phase 5 audit findings (or marked as unaudited if --quick was used). Proceed only when gate passes. --- @@ -377,55 +360,6 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Shallow Reading (Skimming Instead of Reading Every File) -**What it looks like**: Agent reads 10 of 50 files in a zone, claims to understand the zone -**Why wrong**: Misses the components that distinguish the repo; surface-level analysis produces surface-level recommendations -**Do instead**: Each agent MUST read every file in its zone. The zone capping in Phase 1 ensures this is feasible. - -### Anti-Pattern 2: Recommending Things We Already Have -**What it looks like**: "They have a debugging skill; we should add one" (when we already have systematic-debugging) -**Why wrong**: Wastes effort on false gaps; undermines report credibility -**Do instead**: Phase 5 audit exists specifically to catch this. Never skip it. Every recommendation must survive audit. 
- -### Anti-Pattern 3: Over-Counting Differences as Gaps -**What it looks like**: Listing every difference as a recommendation regardless of value -**Why wrong**: Different is not better. A different naming convention is not a gap worth addressing. -**Do instead**: Only flag genuine capability gaps — things they can do that we cannot. Rate honestly: most differences are LOW or not gaps at all. - -### Anti-Pattern 4: Skipping the Audit Phase -**What it looks like**: Producing the report directly from Phase 4 synthesis without verifying -**Why wrong**: Unverified recommendations erode trust. The whole point of this pipeline is reality-grounding. -**Do instead**: Always run Phase 5 unless `--quick` was explicitly requested. Audit is what separates this from a superficial comparison. - -### Anti-Pattern 5: Anchoring on Repository Size or Star Count -**What it looks like**: "This repo has 5,000 stars so it must have good ideas" -**Why wrong**: Popularity does not equal relevance to our specific toolkit -**Do instead**: Evaluate every component on its merits relative to our needs. A 10-star repo with one brilliant pattern is more valuable than a 10,000-star repo that duplicates what we have. - -### Anti-Pattern 6: Generating Adoption Recommendations Without Effort Estimates -**What it looks like**: "We should adopt X" without saying how much work it would take -**Why wrong**: A HIGH-value recommendation that takes 3 weeks may be lower priority than a MEDIUM-value one that takes 30 minutes -**Do instead**: Every recommendation in the final table MUST include an effort estimate (S/M/L). 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) - Phase transition rules -- [Pipeline Architecture](../shared-patterns/pipeline-architecture.md) - Pipeline design principles - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I read enough files to get the picture" | Sampling bias misses distinguishing components | Read every file in the zone | -| "Our system obviously has this" | Obvious to whom? Prove it with file paths. | Run audit agent, cite exact files | -| "This difference is clearly valuable" | Clearly to whom? Different is not better. | Rate honestly, audit against reality | -| "Audit would just confirm what I know" | Confidence is not correctness | Run audit; let evidence decide | -| "The repo is too big to read fully" | Zone capping exists for this reason | Split zones, read all files in each | -| "Quick comparison is good enough" | Quick comparisons miss nuance and produce false positives | Complete all 6 phases | +None. This skill is self-contained and does not reference shared patterns or external documentation. diff --git a/skills/resume-work/SKILL.md b/skills/resume-work/SKILL.md index dd1b3fbc..e6a670a4 100644 --- a/skills/resume-work/SKILL.md +++ b/skills/resume-work/SKILL.md @@ -35,7 +35,7 @@ routing: # /resume - Session State Restoration -## Operator Context +## Overview This skill reconstructs session state from handoff artifacts so work can continue without wasting time re-reading files and re-discovering decisions. It is the consumer half of the pause/resume pair — `/pause` creates the artifacts, `/resume` consumes them. 
@@ -47,44 +47,13 @@ The priority cascade exists because handoff quality varies: Each level down the cascade loses more context, so the skill always starts from the top. -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Priority Cascade**: Always check sources in order: HANDOFF.json > .continue-here.md > task_plan.md > git log. Never skip to a lower source if a higher one exists, because higher sources contain richer context that would be lost. -- **Stale Handoff Warning**: If HANDOFF.json exists and `created_at` is older than 24 hours, warn the user before proceeding. Stale handoffs may describe work that has been superseded by manual changes. -- **One-Shot Consumption**: Delete HANDOFF.json and .continue-here.md after successful state reconstruction and user confirmation. These are ephemeral session artifacts, not persistent state. Keeping them risks future `/resume` calls loading outdated context. -- **Status Dashboard**: Always present a status summary before taking action. The user needs to confirm the reconstructed state matches reality before the session proceeds — self-assessment of "what I was doing" is inherently fallible. -- **No Destructive Operations**: This skill reads and deletes handoff files. It never modifies code, resets git state, or discards uncommitted changes. - -### Default Behaviors (ON unless disabled) -- **Quick Resume Mode**: If the user says just "continue" or "resume" with no qualifiers, skip the options menu and immediately execute the `next_action` from the handoff. This optimizes for the common case where the user trusts the handoff and wants to get back to work fast. -- **Uncommitted Work Alert**: If handoff reports uncommitted files, verify they still exist and alert if any have been lost (worktree cleanup, manual revert). -- **Route via /do**: When executing `next_action`, route through the `/do` system if the action matches a known skill trigger. 
This ensures proper agent/skill selection for the resumed work. - -### Optional Behaviors (OFF unless enabled) -- **Dry Run** (`--dry-run`): Show reconstructed state and planned next action without executing. For reviewing what `/resume` would do. -- **Keep Handoff** (`--keep`): Do not delete handoff files after consumption. For debugging or when the user wants to re-resume. - -## What This Skill CAN Do -- Read and parse HANDOFF.json for structured state reconstruction -- Read .continue-here.md for prose-based state reconstruction -- Fall back to task_plan.md and git log when no handoff files exist -- Present a status dashboard with completed work, remaining tasks, and next action -- Route to the next action via /do or direct execution -- Delete consumed handoff files to prevent stale state - -## What This Skill CANNOT Do -- Create handoff artifacts — that is the `pause-work` skill -- Recover uncommitted changes that were lost — it can detect the loss but cannot undo it -- Guarantee state accuracy — reconstructed state is only as good as the handoff that created it -- Replace task_plan.md — the plan is the persistent record; handoff is the ephemeral session context - --- ## Instructions ### Phase 1: DISCOVER -**Goal**: Find the best available state source using the priority cascade. +**Goal**: Find the best available state source using the priority cascade. Always check sources in order to ensure no high-quality context is lost (higher sources contain richer information than lower ones). **Step 1: Identify project root** @@ -96,7 +65,7 @@ git rev-parse --show-toplevel Look for `{project_root}/HANDOFF.json`. 
If found: - Read and parse the JSON -- Check `created_at` timestamp — if older than 24 hours, set `stale_warning = true` +- Check `created_at` timestamp — if older than 24 hours, set `stale_warning = true` and plan to alert user before proceeding - This is the richest source: proceed to Phase 2 with full structured data - Also read `.continue-here.md` if it exists (for supplementary prose context) @@ -132,7 +101,7 @@ git status --short git diff --stat ``` -Synthesize a best-effort summary from git state. This is the least precise reconstruction — it shows WHAT happened but not WHY. +Synthesize a best-effort summary from git state. This is the least precise reconstruction — it shows WHAT happened but not WHY. Never skip to this source if higher-priority sources exist, as that would lose context. **Step 6: Set reconstruction quality indicator** @@ -147,11 +116,11 @@ Synthesize a best-effort summary from git state. This is the least precise recon ### Phase 2: RECONSTRUCT -**Goal**: Build a coherent state picture from available sources. +**Goal**: Build a coherent state picture from available sources. Always present a status dashboard before taking action — the user needs to confirm the reconstructed state matches reality, since self-assessment of "what I was doing" is inherently fallible. **Step 1: Verify uncommitted files** -If the handoff reports `uncommitted_files`, check whether they still exist: +If the handoff reports `uncommitted_files`, check whether they still exist (work may have been lost during worktree cleanup or manual revert): ```bash # For each file in uncommitted_files list @@ -166,7 +135,7 @@ If the handoff reports `false_completions`, verify those placeholder markers sti **Step 3: Synthesize status dashboard** -Build the dashboard from available data: +Build the dashboard from available data. 
This dashboard is your knowledge checkpoint — it prevents false assumptions from carrying into the resumed session: ``` =================================================================== @@ -205,7 +174,7 @@ Build the dashboard from available data: ### Phase 3: PRESENT -**Goal**: Show the user what was reconstructed and confirm before proceeding. +**Goal**: Show the user what was reconstructed and confirm before proceeding. Never skip the dashboard — it is the validation checkpoint that catches assumptions made by the new session before they cause problems. **Step 1: Display status dashboard** @@ -213,7 +182,7 @@ Show the dashboard from Phase 2. **Step 2: Handle stale handoff warning** -If `stale_warning` is true: +If `stale_warning` is true, warn the user before proceeding (stale handoffs may describe work superseded by manual changes between sessions): ``` WARNING: Handoff is from (>24 hours ago). State may not reflect manual changes made since then. @@ -224,43 +193,43 @@ If user discards, fall to next priority level in cascade. **Step 3: Determine action mode** -- **Quick resume** (user said "continue", "resume", or similar with no qualifiers): Skip options, proceed directly to Phase 4 to execute `next_action` +- **Quick resume** (user said "continue", "resume", or similar with no qualifiers): Skip options menu and proceed directly to Phase 4 to execute `next_action`. This optimizes for the common case where the user trusts the handoff and wants to get back to work fast. - **Review mode** (user said "what was I doing", "where did I leave off", or asked a question): Display dashboard only, wait for user to choose what to do next **GATE**: User has seen the dashboard. In quick resume mode, proceed to Phase 4. In review mode, wait for user direction. ### Phase 4: EXECUTE -**Goal**: Route to the next action and clean up handoff files. 
+**Goal**: Route to the next action and clean up handoff files (they are ephemeral session artifacts, not persistent state, so keeping them risks future `/resume` calls loading outdated context). **Step 1: Execute next action** Take the `next_action` from the handoff and execute it: -- If it matches a skill trigger (e.g., "run tests" -> /vitest-runner), invoke that skill +- If it matches a skill trigger (e.g., "run tests" -> /vitest-runner), invoke that skill via the `/do` routing system for proper agent/skill selection - If it describes a code change, proceed with the implementation directly - If it requires user input (e.g., "need clarification on threshold"), present the question -Carry forward all context from the handoff: decisions made, approaches rejected, gotchas to avoid. This context should inform the execution, not just the dashboard. +Carry forward all context from the handoff: decisions made, approaches rejected, gotchas to avoid. Honor previous session's findings unless circumstances have changed — reexploring rejected approaches burns context on work already done. **Step 2: Clean up handoff files** -After the next action has been initiated (not necessarily completed — just successfully started): +After the next action has been initiated (not necessarily completed — just successfully started), delete the one-shot handoff artifacts to prevent stale state: ```bash # Remove one-shot handoff artifacts rm -f HANDOFF.json .continue-here.md ``` -If `--keep` flag was provided, skip deletion. +If `--keep` flag was provided, skip deletion (use only for debugging). **Step 3: Update task_plan.md if present** -If `task_plan.md` exists, update its status line to reflect that the session has resumed: +If `task_plan.md` exists, update its status line to reflect that the session has resumed. The plan is the persistent record; handoff is the ephemeral session context: ``` **Status**: Resumed from handoff — executing: ``` -**GATE**: Next action initiated. 
Handoff files cleaned up. Session is now in active work mode. +**GATE**: Next action initiated. Handoff files cleaned up. Session is now in active work mode. Note: This skill never modifies code, resets git state, or discards uncommitted changes — it only reads and deletes handoff files. --- @@ -284,47 +253,8 @@ If `task_plan.md` exists, update its status line to reflect that the session has --- -## Anti-Patterns - -### Anti-Pattern 1: Skipping the Dashboard -**What it looks like**: Reading HANDOFF.json and immediately executing next_action without showing the user what was reconstructed -**Why wrong**: Self-assessment of session state is inherently fallible. The user needs to confirm the reconstruction matches reality before the session acts on potentially stale or incorrect assumptions. -**Do instead**: Always show the dashboard. In quick resume mode, show it briefly then proceed — but never skip it entirely. - -### Anti-Pattern 2: Keeping Handoff Files Forever -**What it looks like**: Never deleting HANDOFF.json after consumption, or using `--keep` by default -**Why wrong**: Stale handoff files cause future `/resume` calls to load outdated context, potentially overriding the user's actual current state with a previous session's snapshot. -**Do instead**: Delete after successful reconstruction. Use `--keep` only for debugging. - -### Anti-Pattern 3: Ignoring Lower-Priority Sources -**What it looks like**: Finding HANDOFF.json and not checking task_plan.md at all -**Why wrong**: task_plan.md may have been updated manually between sessions. It provides the task structure that supplements the handoff's session reasoning. -**Do instead**: Use HANDOFF.json as primary source but also read task_plan.md if it exists for supplementary context. 
- -### Anti-Pattern 4: Re-Exploring Rejected Approaches -**What it looks like**: The handoff says "tried X, rejected because Y" but the new session tries X again -**Why wrong**: This is the exact waste handoff was designed to prevent. Re-exploring dead ends burns context on work already done. -**Do instead**: Read `decisions` and `context_notes` carefully. Honor previous session's findings unless circumstances have changed. - ---- - -## Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I'll just start fresh, it's faster" | Reconstruction from handoff takes seconds; re-discovery takes minutes to hours | Check handoff files first | -| "The handoff is stale, I'll ignore it" | Stale handoff still contains decisions and rejected approaches worth knowing | Read it, warn about staleness, let user decide | -| "I know what to do, don't need the dashboard" | You're a new session — you know nothing yet. The dashboard IS your knowledge. 
| Always display the dashboard | -| "I'll keep the handoff files just in case" | Stale files cause worse problems than missing files | Delete after successful consumption | -| "The git log tells me everything" | Git log shows WHAT changed but not WHY, what was rejected, or what's next | Use handoff files when available; git log is last resort | - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations - ### Related Skills - `pause-work` — Creates the handoff artifacts this skill consumes - `do` — Routes next_action to appropriate agent/skill for execution diff --git a/skills/retro/SKILL.md b/skills/retro/SKILL.md index 8aa54427..6ba6c07e 100644 --- a/skills/retro/SKILL.md +++ b/skills/retro/SKILL.md @@ -25,33 +25,9 @@ routing: # Retro Knowledge Skill -## Operator Context +## Overview -This skill wraps `scripts/learning-db.py` into a user-friendly interface for the learning system. The learning database is the single source of truth — there are no L1/L2 markdown files. 
- -### Hardcoded Behaviors (Always Apply) -- **DB is the source of truth**: All queries go through `python3 ~/.claude/scripts/learning-db.py`; never maintain a parallel file store -- **Graduate requires confirmation**: Always present proposals and wait for user approval before editing agent/skill files -- **Graduate requires specificity**: Only graduate entries that encode non-obvious, actionable knowledge — never generic advice - -### Default Behaviors (ON unless disabled) -- **Formatted Output**: Present results in readable tables/sections, not raw JSON -- **Actionable Suggestions**: When showing stats, suggest next actions (search, graduate) - -### Optional Behaviors (OFF unless enabled) -- **Auto-Fix**: Apply graduation without confirmation (only if user passes `--auto`) - -## What This Skill CAN Do -- Show learning system stats (entry counts, categories, confidence distribution) -- List all entries with filtering by category/confidence -- Full-text search across all learnings (FTS5) -- Graduate mature entries into specific agents/skills (LLM-driven) -- Mark entries as graduated after embedding - -## What This Skill CANNOT Do -- Record new learnings (use `learning-db.py record` directly or let hooks capture) -- Auto-graduate without human approval -- Generate L1/L2 markdown files (legacy system, removed in ADR-006) +This skill wraps `scripts/learning-db.py` into a user-friendly interface for the learning system. The learning database is the single source of truth—all queries go through the Python CLI, never maintaining a parallel file store. --- @@ -68,6 +44,8 @@ Parse the user's argument to determine the subcommand. Default to `status` if no ### Subcommand: status +**Key constraint**: Always present results in readable tables/sections, not raw JSON. When showing stats, suggest next actions (search, graduate). + Show learning system health summary. **Step 1**: Get stats. @@ -100,6 +78,8 @@ Next actions: Display all accumulated knowledge. 
+**Key constraint**: Output must use the Python CLI as the single source of truth. Do not maintain parallel markdown files. Present results in readable grouped format, not raw JSON. + **Step 1**: Query all entries. ```bash @@ -148,8 +128,12 @@ SEARCH: "TERM" ### Subcommand: graduate Evaluate learning.db entries and embed mature ones into agents/skills. -This is LLM work — graduation requires judgment about target identification, -edit placement, and phrasing as prescriptive instruction. + +**Key constraints:** +- Only graduate entries that encode non-obvious, actionable knowledge—never generic advice. +- Always present proposals and wait for user approval before editing agent/skill files. +- Do not auto-graduate without explicit user approval (even with `--auto` flag, confirm intent). +- Skip categories `error` and `effectiveness`—those are injection-only (useful in context but not suitable as permanent agent instructions). **Step 1**: Get graduation candidates from the DB. @@ -157,17 +141,13 @@ edit placement, and phrasing as prescriptive instruction. python3 ~/.claude/scripts/learning-db.py query --category design --category gotcha ``` -Skip categories `error` and `effectiveness` — those are injection-only (useful -in context but not suitable as permanent agent instructions). - **Step 2**: For each entry, evaluate graduation readiness. 
For each candidate, the LLM: - Reads the learning value - Searches the repo for the target file (grep for related keywords) -- Determines edit type: add anti-pattern, add to operator context, - add warning, or "not ready / keep injecting" -- Checks if the target already contains equivalent guidance +- Determines edit type: add anti-pattern, add to operator context, add warning, or "not ready / keep injecting" +- Checks if the target already contains equivalent guidance (use Grep to verify before proposing) | Question | Pass | Fail | |----------|------|------| @@ -203,8 +183,7 @@ After embedding, mark the entry as graduated: python3 ~/.claude/scripts/learning-db.py graduate TOPIC KEY "target:file/path" ``` -Graduated entries stop being injected (the injector filters -`graduated_to IS NULL`). +Graduated entries stop being injected (the injector filters `graduated_to IS NULL`). **Step 5**: Report. @@ -248,21 +227,15 @@ Solution: Report that no learnings exist yet. Hooks auto-populate during normal Cause: No design/gotcha entries, or all already graduated Solution: Report the stats and suggest recording more learnings via normal work. ---- - -## Anti-Patterns +### Common Mistakes During Graduation +- **Graduating generic advice** (e.g., "use proper error handling"): Creates noise. Agents already know general patterns. Only graduate specific, actionable findings that encode something non-obvious. +- **Proposing without target verification**: Always grep the target file for equivalent guidance before proposing. Duplication creates maintenance burden. +- **Proceeding without explicit user approval**: Graduation permanently changes agent behavior. Always present proposals in Step 3 and wait for explicit approval before applying changes in Step 4. -### Anti-Pattern 1: Graduating Generic Advice -**What it looks like**: Graduating "use proper error handling" into the Go agent -**Why wrong**: Generic advice adds noise. Agents already know general patterns. 
-**Do instead**: Only graduate specific, actionable findings that encode something non-obvious. +--- -### Anti-Pattern 2: Graduating Without Checking Target -**What it looks like**: Proposing to add knowledge that's already in the target file -**Why wrong**: Creates duplication and maintenance burden -**Do instead**: Always grep the target file for equivalent guidance before proposing. +## References -### Anti-Pattern 3: Auto-Graduating Without User Approval -**What it looks like**: Embedding knowledge into agents without showing proposals first -**Why wrong**: Graduation permanently changes agent behavior. Human judgment required. -**Do instead**: Always present proposals and wait for explicit approval. +- `~/.claude/scripts/learning-db.py` — Python CLI for all database operations +- `hooks/retro-knowledge-injector.py` — Hook that injects learnings that have not yet graduated into prompt context +- `scripts/learning.db` — SQLite database with FTS5 search index diff --git a/skills/roast/SKILL.md b/skills/roast/SKILL.md index 4aa78ad5..c22fa8d0 100644 --- a/skills/roast/SKILL.md +++ b/skills/roast/SKILL.md @@ -32,45 +32,19 @@ routing: # Roast: Devil's Advocate Analysis -## Operator Context - -This skill operates as an operator for critical analysis workflows, configuring Claude's behavior for systematic, evidence-based critique through 5 specialized personas. It implements a **Parallel Analysis + Validation** pattern -- spawn multiple critical perspectives, then validate every claim against actual evidence. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before analysis -- **Over-Engineering Prevention**: Analysis must be direct and focused. No elaborate frameworks beyond the 5-persona + validation workflow -- **Read-Only Enforcement**: NEVER use Write, Edit, or destructive Bash commands.
Only Read, Glob, Grep, and read-only Bash operations allowed -- **Evidence-Based Claims**: Every critique must reference specific files, lines, or concrete artifacts. No vague criticisms -- **Validation Required**: All claims must be validated against actual evidence before appearing in final report - -### Default Behaviors (ON unless disabled) -- **Five Persona Coverage**: All 5 personas analyze the target for comprehensive perspective coverage -- **Claim Validation**: Coordinator validates all claims and categorizes as VALID, PARTIAL, UNFOUNDED, or SUBJECTIVE -- **Prioritized Reporting**: Final report prioritizes VALID and PARTIAL findings, shows dismissed claims for transparency -- **Strength Inclusion**: Report includes validated strengths, not just problems -- **Constructive Tone**: Agent outputs are synthesized into improvement-oriented language - -### Optional Behaviors (OFF unless enabled) -- **Focused Persona Analysis**: User can request specific personas only (e.g., "Just the Senior Engineer perspective") -- **Shallow Review**: Quick critique without full validation for rapid feedback -- **Critique-Only Focus**: Skip strengths section, focus exclusively on issues - -## What This Skill CAN Do -- Critique documentation, code, architecture, or ideas through 5 distinct critical perspectives -- Generate specific, evidence-based claims referencing actual files and lines -- Validate all claims against repository contents to separate valid from unfounded critiques -- Produce prioritized, actionable findings backed by concrete evidence -- Identify both weaknesses and validated strengths -- Operate in strict read-only mode without modifying any files -- Surface assumptions, edge cases, operational concerns, and accessibility issues - -## What This Skill CANNOT Do -- Make modifications -- strictly read-only analysis, cannot fix issues found -- Execute code or run tests to validate runtime behavior -- Access external resources, APIs, or documentation outside the 
repository -- Resolve subjective disputes -- can identify style differences but not declare winners -- Replace domain expertise like security auditing or performance profiling -- Skip validation phase -- all claims must be checked against evidence +## Overview + +This skill produces evidence-based constructive critique through 5 specialized HackerNews commenter personas: Skeptical Senior, Well-Actually Pedant, Enthusiastic Newcomer, Contrarian Provocateur, and Pragmatic Builder. The workflow spawns these personas in parallel, validates all claims against actual files and lines, and synthesizes findings into an improvement-focused report. + +**Key constraints baked into the workflow:** +- CLAUDE.md must be read and followed before analysis begins +- Read-only mode (no Write, Edit, destructive Bash) is mandatory — enforced via `read-only-ops` skill invocation +- Every claim must reference specific file:line locations and be validated against actual evidence before appearing in the final report +- All 5 personas must complete before validation begins — no partial analysis +- Final report must include both validated strengths and problems, prioritized by impact +- Unvalidated claims are dismissed; unfounded critiques are shown with evidence explaining why +- Analysis must be direct and focused — no elaborate frameworks beyond the 5-persona + validation pattern +- Sarcasm and mockery are stripped during synthesis; technical accuracy and file references are preserved --- @@ -99,6 +73,8 @@ This ensures no modifications can occur during the analysis workflow. - `Edit` tool -- no file modification - Bash: `rm`, `mv`, `cp`, `mkdir`, `touch`, `git add`, `git commit`, `git push` +If read-only mode cannot be activated, stop immediately. Never proceed with unguarded analysis. + **Gate**: Read-only mode active. Proceed only when gate passes. 
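The Bash blocklist in Phase 1 can be pictured as a small pre-flight check. The sketch below is illustrative only and is not the `read-only-ops` skill's actual implementation; the helper name `is_blocked` and the first-token matching strategy are assumptions made for this example.

```python
import shlex

# Commands named in the Phase 1 blocklist.
BLOCKED_COMMANDS = {"rm", "mv", "cp", "mkdir", "touch"}
# git is only blocked for these subcommands (git add / commit / push).
BLOCKED_GIT_SUBCOMMANDS = {"add", "commit", "push"}


def is_blocked(command: str) -> bool:
    """Return True if a Bash command line starts with a blocked operation.

    Hypothetical helper: it inspects only the first token (plus the git
    subcommand), so pipelines and `sh -c` wrappers are not covered.
    """
    tokens = shlex.split(command)
    if not tokens:
        return False
    if tokens[0] in BLOCKED_COMMANDS:
        return True
    if tokens[0] == "git" and len(tokens) > 1:
        return tokens[1] in BLOCKED_GIT_SUBCOMMANDS
    return False
```

A check this shallow is a guardrail rather than a sandbox, which is one reason the workflow also stops outright when read-only mode cannot be activated.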
### Phase 2: GATHER CONTEXT @@ -130,13 +106,15 @@ Use Grep to find: specific claims to verify, usage patterns, dependency referenc **Step 5: Ground verbal descriptions** -If user describes a concept rather than pointing to a file, search the repo for existing implementation. Critique grounded in actual code beats critique of a strawman every time. +If user describes a concept rather than pointing to a file, search the repo for existing implementation. Critique grounded in actual code beats critique of a strawman every time. Never analyze a verbal description without confirming the code exists. **Gate**: Target identified and sufficient context gathered. Proceed only when gate passes. ### Phase 3: SPAWN ROASTER AGENTS (Parallel) -Launch 5 general-purpose agents in parallel via Task tool, each embodying a roaster persona. Load the full persona specification from the corresponding agent file into each prompt. +**Goal**: Launch 5 agents in parallel, each embodying a roaster persona, analyzing the target with full evidence-gathering discipline. + +Launch 5 general-purpose agents in parallel via Task tool. Load the full persona specification from the corresponding agent file into each prompt. **The 5 parallel tasks:** @@ -155,20 +133,23 @@ Launch 5 general-purpose agents in parallel via Task tool, each embodying a roas 5. 
**Pragmatic Builder** (`agents/reviewer-pragmatic-builder.md`) Focus: Production readiness, operational concerns -Each agent must: -- Invoke `read-only-ops` skill first +**Each agent must:** +- Invoke `read-only-ops` skill first to enforce no-modification guardrails - Follow their systematic 5-step review process -- Tag ALL claims as `[CLAIM-N]` with `file:line` references -- Provide specific evidence for every claim +- Tag ALL claims as `[CLAIM-N]` with specific `file:line` references +- Provide concrete evidence for every claim — vague critiques are worthless and must be rejected during validation +- Search for actual implementation details rather than analyzing verbal descriptions See `references/personas.md` for full prompt template and claim format. -**CRITICAL**: Wait for all 5 agents to complete before proceeding to Phase 4. Do not begin validation on partial results -- all perspectives must be collected first. +**CRITICAL**: Wait for all 5 agents to complete before proceeding to Phase 4. Do not begin validation on partial results. Every persona must contribute before synthesis can happen. **Gate**: All 5 agents complete with tagged claims. Proceed only when gate passes. ### Phase 4: COORDINATE (Validate Claims) +**Goal**: Verify every `[CLAIM-N]` against actual evidence before including in the report. + Collect and validate every `[CLAIM-N]` from all 5 agents. **Step 1: Collect all claims** @@ -189,9 +170,11 @@ For each `[CLAIM-N]`, read the referenced file/line using Read tool and assign a | UNFOUNDED | Not supported | Evidence contradicts or doesn't exist | | SUBJECTIVE | Opinion, can't verify | Matter of preference/style | +**Critical: You must read the file and check the line.** Visual inspection misses nuance. "Obviously valid" is a rationalization word. Do not accept a claim because it sounds right or all personas agree on it — consensus is not the same as correctness. 
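As a rough sketch of the collection step that feeds this validation, the snippet below extracts `[CLAIM-N]` tags and their `file:line` references from agent output. The claim shape it assumes (`[CLAIM-N] text (path:start-end)`) is a guess for illustration; the authoritative format is in `references/personas.md`, and `parse_claims` and `Claim` are hypothetical names. It only locates references. Assigning a verdict still requires reading the referenced file.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed claim shape: "[CLAIM-3] description (path/to/file.py:45-67)".
CLAIM_RE = re.compile(r"\[CLAIM-(\d+)\]\s*(.+)")
REF_RE = re.compile(r"([\w./-]+\.\w+):(\d+)(?:-(\d+))?")


@dataclass
class Claim:
    number: int
    text: str
    path: Optional[str]        # None when the claim cites no file
    start_line: Optional[int]
    end_line: Optional[int]


def parse_claims(agent_output: str) -> List[Claim]:
    """Collect tagged claims; claims without a file:line get path=None."""
    claims: List[Claim] = []
    for line in agent_output.splitlines():
        match = CLAIM_RE.search(line)
        if not match:
            continue  # not a tagged claim, ignore
        number, text = int(match.group(1)), match.group(2).strip()
        ref = REF_RE.search(text)
        if ref:
            start = int(ref.group(2))
            end = int(ref.group(3)) if ref.group(3) else start
            claims.append(Claim(number, text, ref.group(1), start, end))
        else:
            claims.append(Claim(number, text, None, None, None))
    return claims
```

Claims that come back with `path=None` have nothing to read, so they cannot be checked against evidence and would not qualify as VALID or PARTIAL.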
+ **Step 3: Cross-reference** -Note claims found independently by multiple agents -- these carry higher confidence. If 3+ personas independently identify the same issue, escalate to HIGH priority regardless of individual severity. +Note claims found independently by multiple agents. If 3+ personas independently identify the same issue, escalate to HIGH priority regardless of individual severity. **Step 4: Prioritize** @@ -209,12 +192,14 @@ Sort VALID and PARTIAL findings by impact: Follow the full template in `references/report-template.md`. Key synthesis rules: 1. **Filter by verdict**: Only VALID and PARTIAL claims appear in improvement opportunities -2. **Dismissed section**: UNFOUNDED claims go in dismissed section with evidence showing why -3. **Subjective section**: SUBJECTIVE claims noted as opinion-based, user decides -4. **Strengths required**: Coordinator validates what works well -- not just problems +2. **Dismissed section**: UNFOUNDED claims go in dismissed section with evidence showing why. Transparency matters — users need to understand why certain critiques don't hold up. +3. **Subjective section**: SUBJECTIVE claims noted as opinion-based. User decides. +4. **Strengths required**: Coordinator validates what works well. Not just problems. Include "Validated Strengths" section. 5. **Constructive tone**: Strip sarcasm, mockery, dismissive language from agent outputs. Preserve technical accuracy and file references. 6. **Implementation roadmap**: Group actions by immediacy (immediate / short-term / long-term) +**Validation Summary Table** (include in report): + ```markdown ## Claim Validation Summary @@ -276,7 +261,7 @@ Result: Critique anchored in actual implementation, not a strawman ### Error: "Agent Returns Claims Without File References" Cause: Persona agent skipped evidence-gathering or analyzed verbally Solution: -1. Dismiss ungrounded claims as UNFOUNDED +1. Dismiss ungrounded claims as UNFOUNDED — they cannot be validated 2. 
If majority of claims lack references, re-run that specific agent with explicit instruction to cite file:line 3. Never promote ungrounded claims to the validated findings section @@ -303,50 +288,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Vague, Unsupported Claims -**What it looks like**: `[CLAIM-1] The error handling seems insufficient` -**Why wrong**: No file/line reference, cannot be validated, not actionable -**Do instead**: `[CLAIM-1] No error handling in process_request() (server.py:45-67)` - -### Anti-Pattern 2: Skipping Validation Phase -**What it looks like**: Generating 5 persona critiques then jumping straight to action items -**Why wrong**: Persona critiques may be incorrect or overstated. Unfounded claims pollute findings. -**Do instead**: Validate every claim against actual evidence before including in report - -### Anti-Pattern 3: All-Negative Critique -**What it looks like**: 5 personas list problems, report ends with 15 prioritized issues, no strengths -**Why wrong**: Demotivating, ignores what works, unbalanced perspective -**Do instead**: Coordinator validates strengths too. Include "Validated Strengths" section. - -### Anti-Pattern 4: Fixing Instead of Reporting -**What it looks like**: Builder agent finds missing error handling, uses Edit tool to add it -**Why wrong**: Violates read-only constraint. User didn't ask for changes. -**Do instead**: Report the finding with evidence and suggested action. User decides. - -### Anti-Pattern 5: Analyzing Without Context -**What it looks like**: User says "roast this approach", agent critiques verbal description without checking repo -**Why wrong**: Misses existing implementation, may critique a strawman -**Do instead**: Search repo for related code first. Ground critique in actual evidence. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I can see the issue, no need to validate" | Visual inspection misses nuance | Validate every claim against evidence | -| "All 5 agents agree, must be true" | Consensus doesn't mean correct | Still verify against actual files | -| "User just wants a quick roast" | Quick doesn't mean unvalidated | Run validation, skip only if shallow mode | -| "This claim is obviously valid" | Obviously is a rationalization word | Read the file, check the line | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/report-template.md`: Full report output template with tone transformation rules - `${CLAUDE_SKILL_DIR}/references/personas.md`: Persona specifications, prompt template, and claim format diff --git a/skills/routing-table-updater/SKILL.md b/skills/routing-table-updater/SKILL.md index 0a51a76f..4d2acee7 100644 --- a/skills/routing-table-updater/SKILL.md +++ b/skills/routing-table-updater/SKILL.md @@ -27,63 +27,11 @@ routing: # Routing Table Updater Skill -## Operator Context +## Overview -This skill operates as an operator for routing table maintenance workflows, configuring Claude's behavior for automated /do command routing configuration. It implements a **Phase-Gated Pipeline** -- scan, extract, generate, update, verify -- with deterministic script execution at each phase. +This skill maintains /do routing tables and command references when skills or agents are added, modified, or removed. It implements a **Phase-Gated Pipeline** -- scan, extract, generate, update, verify -- with deterministic script execution at each phase. 
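The Phase-Gated Pipeline idea can be sketched in a few lines: phases run in a fixed order and the run stops at the first gate that fails. This is a minimal illustration under assumed names (`run_pipeline`, `GateFailed`), not the skill's real driver, which invokes separate scripts for each phase.

```python
from typing import Callable, List, Tuple


class GateFailed(RuntimeError):
    """Raised when a phase's gate does not pass."""


# Each phase is (name, callable returning True when its gate passes).
Phase = Tuple[str, Callable[[], bool]]


def run_pipeline(phases: List[Phase]) -> List[str]:
    """Run phases in order and stop at the first gate that fails."""
    completed: List[str] = []
    for name, run_phase in phases:
        if not run_phase():
            # Later phases never run on incomplete data.
            raise GateFailed(f"{name} gate failed after: {completed}")
        completed.append(name)
    return completed
```

Raising instead of returning a partial result makes it hard to continue past a failed gate by accident, which is the property the scan, extract, generate, update, verify sequence relies on.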
-### Hardcoded Behaviors (Always Apply) - -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution. Project instructions override default skill behaviors. -- **Over-Engineering Prevention**: Only update routing tables that need changes. Keep routing entries concise and pattern-focused. No speculative patterns or flexibility that was not requested. Generate exactly what /do needs -- no additional features. -- **Preserve Manual Entries**: Detect and skip hand-written routing entries in do.md. Never overwrite entries without `[AUTO-GENERATED]` markers. -- **Backup Before Modification**: Always create timestamped backup of commands/do.md before any changes. Rollback on validation failure. -- **Markdown Syntax Validation**: Verify table syntax after updates. Ensure pipe alignment and proper header separation. -- **Deterministic Generation**: Same skill/agent metadata always produces same routing entry. No randomness in pattern extraction or table formatting. - -### Default Behaviors (ON unless disabled) - -- **Communication Style**: Report facts without self-congratulation. Show diff of routing changes rather than describing them. -- **Temporary File Cleanup**: Remove temporary metadata JSON files and diff files at completion. Keep only backups and updated do.md. -- **Interactive Confirmation**: Show diff and ask for confirmation before updating do.md (unless --auto-commit flag provided). -- **Progress Reporting**: Stream updates as scanning, extracting, generating, and updating phases execute. -- **Conflict Detection**: Warn when multiple routes match the same pattern with suggestions for resolution. -- **Alphabetical Ordering**: Maintain alphabetical order within routing tables. - -### Optional Behaviors (OFF unless enabled) - -- **Auto-Commit Mode**: `--auto-commit` flag to skip confirmation and automatically commit changes. -- **Dry-Run Mode**: `--dry-run` to show changes without modifying do.md. 
-- **Verbose Debug**: `--verbose` for detailed parsing and generation logs. - -## What This Skill CAN Do - -- Scan all skills (skills/*/SKILL.md) and agents (agents/*.md) for metadata -- Extract YAML frontmatter (name, description, version) and trigger patterns -- Generate routing table entries following /do format specification -- Detect routing conflicts (same trigger mapped to multiple routes) -- Safely update commands/do.md with atomic backup/restore -- Update command files (commands/*.md) with skill/agent references -- Preserve all hand-written manual routing entries -- Validate markdown table syntax after updates -- **Batch mode**: Process N skills at once from a Pipeline Spec or component list (used by pipeline-scaffolder Phase 4) -- **INDEX.json updates**: Add/update agent entries in `agents/INDEX.json` alongside routing tables - -## What This Skill CANNOT Do - -- Infer trigger patterns from vague descriptions (requires explicit phrases) -- Create new routing table sections (only updates existing tables) -- Resolve high-severity conflicts automatically (requires manual priority decisions) -- Modify skill/agent metadata (read-only access to capabilities) -- Handle non-standard markdown table formats - ---- - -## Prerequisites - -- Must be in agents repository root (has commands/do.md) -- Skills must have valid YAML frontmatter in SKILL.md files -- Agents must have valid YAML frontmatter in agents/*.md files -- commands/do.md must have routing table sections with standard headers +The skill reads metadata from all skills and agents (never modifies them) and safely updates `skills/do/SKILL.md`, `skills/do/references/routing-tables.md`, `agents/INDEX.json`, and `commands/*.md` files. All changes are backed up before modification, and markdown syntax is validated before commit. --- @@ -93,6 +41,12 @@ This skill operates as an operator for routing table maintenance workflows, conf **Goal**: Find every skill and agent file in the repository. 
+**Constraints applied in this phase**: +- Repository must be at agents toolkit root (requires `commands/do.md`) +- Only scan skills/ directories matching `skills/*/SKILL.md` format +- Only scan agent files matching `agents/*.md` format +- File permissions must allow reading all discovered files + **Step 1: Run scan script** ```bash @@ -121,10 +75,19 @@ If gate fails: - "No skills found": Check skills/ directory exists and has subdirectories - "Permission denied": Verify file read permissions +--- + ### Phase 2: EXTRACT -- Parse Metadata **Goal**: Extract YAML frontmatter, trigger patterns, complexity, and routing table targets from every discovered file. +**Constraints applied in this phase**: +- YAML frontmatter must be valid (no syntax errors; malformed YAML blocks extraction) +- Required fields (`name`, `description`, `version`) must be present +- Trigger patterns for skills extracted from description text (specify patterns, don't infer from vague text) +- Domain keywords for agents extracted from description text (explicit phrases required) +- Complexity inference must follow established rules (`references/extraction-patterns.md`) + **Step 1: Run extraction script** ```bash @@ -144,10 +107,10 @@ For each capability, confirm these fields were extracted: **Step 3: Validate trigger pattern quality** -Review extracted patterns against `references/extraction-patterns.md`. Patterns should be: -- Specific enough to avoid false matches -- Broad enough to catch common phrasings -- Free of generic terms that match too many routes +Review extracted patterns against `references/extraction-patterns.md`. 
Patterns must be: +- Specific enough to avoid false matches (too broad = user confusion) +- Broad enough to catch common phrasings (too narrow = missed activations) +- Free of generic terms that match too many routes (prevents routing ambiguity) **Gate**: All YAML parsed successfully, required fields present (name, description, version), trigger patterns extracted for skills, domain keywords extracted for agents. Do NOT proceed to Phase 3 until gate passes. @@ -156,31 +119,56 @@ If gate fails: - "Missing description field": Add description to YAML frontmatter - "No trigger patterns found": Update description to include clear trigger phrases +--- + ### Phase 3: GENERATE -- Create Routing Table Entries **Goal**: Map extracted metadata to routing entries and detect conflicts. +**Constraints applied in this phase**: +- Same skill/agent metadata always produces the same routing entry (deterministic generation, no randomness) +- Entries follow exact /do format specification (`references/routing-format.md`) +- Pattern conflicts detected immediately (same trigger maps to multiple incompatible routes) +- Entries sorted alphabetically within tables +- Duplicate entries within same table prevent gate passage + +**Step 1: Run generation script** + ```bash python3 ~/.claude/skills/routing-table-updater/scripts/generate_routes.py --input metadata.json --output routing_entries.json ``` -Generation process: +**Step 2: Understand the generation process** + 1. Load routing format specification from `references/routing-format.md` 2. Map each capability to appropriate routing table 3. Format entries according to /do table structure 4. Detect pattern conflicts (see `references/conflict-resolution.md`) 5. Sort entries alphabetically within tables +**Step 3: Review conflict detection output** + +The script logs all conflicts with severity levels. For low-severity conflicts (both routes reasonable), the script applies specificity rules automatically. 
For high-severity conflicts (incompatible routes), the script blocks gate passage and requires manual resolution. + **Gate**: All capabilities mapped to entries, entries follow /do format, conflicts detected and documented, no duplicates within same table. Do NOT proceed to Phase 4 until gate passes. If gate fails: - "Unknown routing table target": Update routing table mapping logic - "High-severity conflict": Review conflicting patterns manually before proceeding +--- + ### Phase 4A: UPDATE -- Safely Modify commands/do.md **Goal**: Apply generated routing entries to do.md with backup and validation. +**Constraints applied in this phase**: +- Always create timestamped backup before any modification (mandatory backup gate) +- Detect and preserve all hand-written entries (entries without `[AUTO-GENERATED]` marker are never overwritten) +- Manual entries are intentional curation — overwriting them causes data loss +- Markdown table syntax must validate after updates (pipe alignment, header rows, column consistency) +- Atomic backup/restore: if validation fails, automatic restore from backup + **Step 1: Run update script with backup** ```bash @@ -218,15 +206,26 @@ On validation failure: automatic restore from backup. Report error details. **Gate**: Backup created, all manual entries preserved, markdown validated, diff confirmed. If gate fails, RESTORE from backup. +--- + ### Phase 4B: UPDATE -- Update Command Files **Goal**: Update command files with current skill/agent references. 
+**Constraints applied in this phase**: +- Command files updated only if they reference outdated or invalid skills +- Backups created for all modified files before any changes +- All referenced skills must exist (missing skills cause gate failure) +- Markdown syntax validated after updates (prevents publishing broken tables) + +**Step 1: Run update script with backup** + ```bash python3 ~/.claude/skills/routing-table-updater/scripts/update_commands.py --commands-dir $HOME/claude-code-toolkit/commands --metadata metadata.json --backup ``` -Update process: +**Step 2: Understand the update process** + 1. Scan command files for skill invocations and references 2. Identify outdated or invalid references (renamed/removed skills) 3. Update references to match current metadata @@ -235,15 +234,27 @@ Update process: **Gate**: Backups created for all modified files, all referenced skills exist, markdown validated. +--- + ### Phase 5: VERIFY -- Validate Routing Correctness **Goal**: Final validation of all routing tables. +**Constraints applied in this phase**: +- All auto-generated entries must have `[AUTO-GENERATED]` markers (validation gate checks this) +- No duplicate patterns within the same routing table +- All referenced skills/agents must exist as actual files +- Complexity values must match defined levels (Simple, Medium, Complex) +- Overlapping patterns documented with priority rules applied + +**Step 1: Run validation script** + ```bash python3 ~/.claude/skills/routing-table-updater/scripts/validate.py --target $HOME/claude-code-toolkit/commands/do.md ``` -Verification checks: +**Step 2: Understand verification checks** + 1. **Structural**: All routing tables present, headers formatted, pipes aligned 2. **Content**: All auto-generated entries marked, no duplicates, all referenced skills/agents exist 3. 
**Conflicts**: Overlapping patterns documented, priority rules applied @@ -286,6 +297,8 @@ Generated routing entry: Result: New skill is discoverable via /do command +--- + ### Example 2: Agent Description Updated User updates golang-general-engineer description to add "concurrency" keyword. @@ -305,6 +318,8 @@ Updated routing entry: Result: Domain routing expanded to cover new keyword +--- + ### Example 3: Conflict Detection Two skills both match "test API" pattern. @@ -323,6 +338,8 @@ Resolution applied: Result: Unambiguous routing with longest-match precedence +--- + ### Example 4: Manual Entry Preserved Existing do.md has a hand-curated combination entry (no AUTO-GENERATED marker): @@ -418,46 +435,8 @@ Solution: Restore from backup, fix table generation logic, re-run. Do not commit --- -## Anti-Patterns - -### Anti-Pattern 1: Fixing Without Backup -**What it looks like**: Running update_routing.py with --no-backup -**Why wrong**: No recovery path if manual entries are lost or markdown is corrupted -**Do instead**: Always use --backup flag. Verify backup exists before proceeding. - -### Anti-Pattern 2: Skipping Phase Gates -**What it looks like**: Running UPDATE before EXTRACT completes -**Why wrong**: Missing metadata produces empty or incorrect routing tables. Phase gates prevent incomplete data from corrupting do.md. -**Do instead**: Verify each gate passes before proceeding. Follow SCAN -> EXTRACT -> GENERATE -> UPDATE -> VERIFY sequence. - -### Anti-Pattern 3: Ignoring Conflict Warnings -**What it looks like**: Proceeding with high-severity conflicts unresolved -**Why wrong**: Ambiguous routing confuses /do command. Users get wrong tool for their context. -**Do instead**: Review severity. High-severity conflicts MUST be resolved. Add domain context to make patterns specific. 
- -### Anti-Pattern 4: Overwriting Manual Entries -**What it looks like**: Replacing all matching rows without checking for AUTO-GENERATED marker -**Why wrong**: Manual entries contain curated routing decisions and hand-tuned combinations -**Do instead**: Only update rows with `[AUTO-GENERATED]` marker. Preserve everything else. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) - Phase transition rules - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Routes look correct, no need to validate" | Visual inspection misses duplicate patterns and conflicts | Run validate.py against updated do.md | -| "Small routing change, skip backup" | One corrupt table makes /do unusable | Always create backup before modification | -| "Manual entries are outdated, replace them" | Manual entries contain intentional curation | Preserve all non-AUTO-GENERATED entries | -| "Conflict is low severity, ignore it" | Low severity today becomes user confusion tomorrow | Document all conflicts with resolution strategy | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/routing-format.md`: /do routing table format specification (table structure, entry formats, ordering rules) diff --git a/skills/sapcc-audit/SKILL.md b/skills/sapcc-audit/SKILL.md index f0153f2b..d65de54b 100644 --- a/skills/sapcc-audit/SKILL.md +++ b/skills/sapcc-audit/SKILL.md @@ -43,77 +43,6 @@ Review every package against established review standards. 
Not checklist complia --- -## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **Review Against Project Standards**: The primary question for every function is: "Would this pass review?" Not "does it follow a checklist." -- **Code-Level Findings**: Every finding MUST include the actual code snippet and a concrete fix showing what it should become. Abstract suggestions like "consider using X" are forbidden. -- **Segment by Package**: Dispatch agents by package groups, NOT by concern area. Each agent reviews its packages holistically (errors + architecture + patterns + tests together), exactly like a real PR review. -- **Read the Actual Code**: Agents MUST use the Read tool to read every .go file in their assigned packages. Do not guess based on file names or grep output. -- **Audit Only**: READS and REPORTS. Does NOT modify code unless explicitly asked with `--fix`. -- **Skip Generic Findings**: Do NOT report `DisallowUnknownFields`, `t.Parallel()`, or other generic Go best practices unless they are genuinely wrong in context. Focus on sapcc-specific patterns. - -### Default Behaviors (ON unless disabled) -- **gopls MCP Integration**: Agents MUST use gopls MCP tools when available — `go_workspace` to detect workspace structure, `go_file_context` after reading each .go file for intra-package dependency understanding, `go_symbol_references` to verify type usage across packages (critical for export decisions), `go_package_api` to inspect package APIs, `go_diagnostics` to verify any fixes. This replaces grep-based analysis with type-aware analysis. 
-- **Save Report**: Write findings to `sapcc-audit-report.md` in repo root -- **Before/After Code**: Include "CURRENT" and "SHOULD BE" code blocks for every finding -- **Directive Review Voice**: Frame findings in a direct, principled review voice (references "irrelevant contrivance" or "overengineered" where applicable) - -### Optional Behaviors (OFF unless enabled) -- **--fix**: Auto-fix violations via code changes on a branch -- **--focus [package]**: Audit only one package (e.g., `--focus internal/api`) -- **--verbose**: Show passing checks too - ---- - -## What the Lead Reviewer Actually Cares About (Priority Order) - -These are the things that get PRs rejected, in the order the lead reviewer prioritizes them: - -### 1. Over-Engineering (Lead Reviewer's #1 Concern) -- Interfaces with only one implementation -- Wrapper types/functions that just forward calls -- Struct types created for one-off JSON payloads -- Abstractions "for future extensibility" that add complexity now -- Config systems when env vars suffice - -### 2. Dead Code -- Exported functions with no callers -- Interface methods never called -- Fields set but never read -- Entire packages imported but barely used -- "TODO: remove" comments on code that should already be gone - -### 3. Error Message Quality (Secondary Reviewer's #1 Concern) -- `http.Error(w, "internal error", 500)` — useless to the caller -- Error wrapping that loses the original context -- Generic messages when the actual cause is knowable -- Missing error returns (silently swallowing failures) - -### 4. Interface Contract Violations -- Implementations that don't match the interface's documented behavior -- Methods that return default values instead of errors for missing items -- Inconsistent error types across implementations of the same interface - -### 5. 
Copy-Paste Structs -- Two structs with the same fields (one for internal, one for API response) -- Handler functions that are 90% identical (extract the common pattern) -- Duplicated validation logic - -### 6. Pattern Consistency with Keppel -- HTTP error response format (should use respondwith consistently) -- Startup pattern (osext.MustGetenv, logg.Fatal for startup errors) -- Route registration (httpapi.Compose) -- Database access patterns (gorp, easypg) -- Configuration (env vars via osext, not config files) - -### 7. Mixed Approaches -- Some handlers return JSON errors, others return text/plain -- Some constructors panic on nil args, others return errors -- Some packages use logg, others use log - ---- - ## Instructions ### Phase 1: DISCOVER @@ -161,7 +90,15 @@ Adjust based on actual package sizes. Aim for 5-8 agents. **Goal**: Launch parallel agents that review packages against project standards. -**Each agent gets this prompt:** +**Principle: Read the actual code.** Agents MUST use the Read tool to read every .go file in their assigned packages. Do not guess based on file names or grep output. Use gopls MCP tools when available: `go_workspace` to detect workspace structure, `go_file_context` after reading each .go file for intra-package dependency understanding, `go_symbol_references` to verify type usage across packages (critical for export decisions), `go_package_api` to inspect package APIs, `go_diagnostics` to verify any fixes. + +**Principle: Real review, not checklists.** The primary question for every function is: "Would this pass review?" not "does it follow a checklist." A real reviewer reads code holistically and reacts to architectural issues, not just mechanical patterns. + +**Principle: Segment by package, not by concern.** Dispatch agents by package groups, NOT by concern area. Each agent reviews its packages holistically (errors + architecture + patterns + tests together), exactly like a real PR review. 
Real code review reads a file holistically — an error handling issue might actually be an architecture issue. Segmenting by concern produces shallow findings. + +**Code-level findings only.** Every finding MUST include the actual code snippet and a concrete fix showing what it should become. Abstract suggestions like "consider using X" are forbidden. Show current code and what it should become. + +**Each agent gets this dispatch prompt:** ``` You are reviewing code in an SAP Converged Cloud Go project against established @@ -172,40 +109,46 @@ PACKAGES TO REVIEW: [list of packages with full paths] Read EVERY .go file in these packages using the Read tool. For each file: -1. **Over-engineering**: Is there an abstraction that isn't justified? - - Interface with one implementation? → "Just use the concrete type." Project convention: only create interfaces when there are 2+ real implementations. +1. **Over-engineering** (Lead Reviewer's #1 Concern) + - Interfaces with only one implementation? → "Just use the concrete type." Project convention: only create interfaces when there are 2+ real implementations. - Wrapper function that adds nothing? → "Delete this, call the real function" - Struct for one-time JSON? → "Use fmt.Sprintf + json.Marshal" (per project convention) - Option struct for constructor? → "Just use positional params." Project convention uses 7-8 positional params, never option structs. - Config file/viper? → "Use osext.MustGetenv." Project convention never uses config files. Pure env vars only. -2. **Dead code**: Are there exported functions with no callers outside the package? - Use Grep to check for callers: `grep -r "FunctionName" --include="*.go"` - If no callers exist, flag it. +2. **Dead code** + - Exported functions with no callers outside the package? Use Grep to check: `grep -r "FunctionName" --include="*.go"`. If no callers exist, flag it. 
+ - Interface methods never called + - Fields set but never read + - Entire packages imported but barely used + - "TODO: remove" comments on code that should already be gone -3. **Error messages**: Read every fmt.Errorf and http.Error call. - - Does the message tell you WHAT failed and WHERE? +3. **Error messages** (Secondary Reviewer's #1 Concern) + - `http.Error(w, "internal error", 500)` — useless to the caller - Error wrapping: uses %w when caller needs errors.Is/As, %s with .Error() to intentionally break chain - Message format: "cannot : %w" or "while : %w" with relevant identifiers - Would a user/operator reading this know what to do? - "internal error" with no context = CRITICAL - Never log AND return the same error. Primary error returned, secondary/cleanup errors logged. -4. **Constructor patterns**: +4. **Constructor patterns** - Constructor should be `NewX(deps...) *X` — never returns error (construction is infallible) - Uses positional struct literal init: `&API{cfg, ad, fd, sd, ...}` (no field names) - Injects default functions for test doubles: `time.Now`, etc. - Override pattern for test doubles: fluent `OverrideTimeNow(fn) *T` methods -5. **Interface contracts**: If the package implements an interface from another package: - - Read the interface definition +5. **Interface contracts** + - If the package implements an interface from another package: Read the interface definition - Check if the implementation actually satisfies the contract - Does it return correct error types? Default values where errors are expected? - Interfaces should be defined in the consumer package, not the implementation package -6. **Copy-paste**: Are there two structs, functions, or code blocks that are >70% similar? +6. **Copy-paste structs** + - Two structs with the same fields (one for internal, one for API response)? + - Handler functions that are 90% identical (extract the common pattern) + - Duplicated validation logic -7. 
**HTTP handler patterns**: Does this code match keppel patterns? +7. **HTTP handler patterns** (Must match keppel patterns) - Handlers: methods on *API with `handleVerbResource(w, r)` signature - Auth: called inline at top of handler, NOT middleware - JSON decode: `json.NewDecoder` + `DisallowUnknownFields()` @@ -213,26 +156,31 @@ Read EVERY .go file in these packages using the Read tool. For each file: - Internal errors: `respondwith.ObfuscatedErrorText(w, err)` — hides 500s from clients - Route registration: one `AddTo(*mux.Router)` per API domain, composed via `httpapi.Compose` -8. **Database patterns**: +8. **Database patterns** - SQL queries as package-level `var` with `sqlext.SimplifyWhitespace()` - PostgreSQL `$1, $2` params (never `?`) - gorp for simple CRUD, raw SQL for complex queries - Transactions: `db.Begin()` + `defer sqlext.RollbackUnlessCommitted(tx)` - NULL: `Option[T]` (from majewsky/gg/option), not `*T` pointers -9. **Type patterns**: +9. **Type patterns** - Named string types for domain concepts: `type AccountName string` - String enums with typed constants (NOT iota): `const CleanSeverity VulnerabilityStatus = "Clean"` - Model types use `db:"column"` tags; API types use `json:"field"` tags — separate types - Pointer receivers for all struct methods (value receivers only for tiny data-only types) -10. **Logging patterns**: +10. **Logging patterns** - `logg.Fatal` ONLY in cmd/ packages for startup failures - `logg.Error` for secondary/cleanup errors (never for primary errors) - `logg.Info` for operational events - Never log.Printf or fmt.Printf for logging - Panics only for impossible states, annotated with "why was this not caught by Validate!?" +11. 
**Mixed approaches** (Pattern consistency) + - Some handlers return JSON errors, others return text/plain + - Some constructors panic on nil args, others return errors + - Some packages use logg, others use log + For EACH finding, output: ### [MUST-FIX / SHOULD-FIX / NIT]: [One-line summary] @@ -240,14 +188,14 @@ For EACH finding, output: **Convention**: "[What a lead reviewer would actually write in a PR comment]" **Current code**: -```go +\`\`\`go [actual code from the file, 3-10 lines] -``` +\`\`\` **Should be**: -```go +\`\`\`go [what the code should look like after fixing] -``` +\`\`\` **Why**: [One sentence explaining the principle] @@ -331,73 +279,29 @@ Show the verdict, must-fix count, and top 5 findings inline. Point to the full r --- -## Anti-Patterns +## Error Handling -### AP-1: Checklist Compliance Instead of Code Review -**What it looks like**: "Error handling: 7 warnings" without showing actual code -**Why wrong**: A real reviewer doesn't review with checklists. They read code and react. -**Do instead**: Show the actual problematic code and what it should become. +### Failure modes and recovery -### AP-2: Generic Go Suggestions -**What it looks like**: "Consider using t.Parallel()" / "Add DisallowUnknownFields()" -**Why wrong**: These are generic best practices, not sapcc-specific patterns. They don't reflect what the project's reviewers actually care about. -**Do instead**: Focus on over-engineering, dead code, error quality, interface violations — the things that actually get PRs rejected. +| Scenario | Response | +|----------|----------| +| Not an sapcc project | Stop immediately. Print message: "This does not appear to be an SAP CC Go project (no sapcc imports in go.mod)." | +| Agents cannot read a file | Log and continue. File may be binary or inaccessible. Flag in the report under "Warnings." | +| gopls MCP tools unavailable | Fall back to manual grep-based analysis. Note in the report that type-aware analysis was unavailable. 
| +| Too many packages (>30) | Split into >8 agents. Ensure each still gets 5-15 files for depth. | +| Agent finds no violations | Report is valid. Not every package has violations. Output empty sections for unused severity levels. | -### AP-3: Segment by Concern Instead of Package -**What it looks like**: "Error Handling Agent" checks all files for error patterns only -**Why wrong**: Real code review reads a file holistically. An error handling issue might actually be an architecture issue. Segmenting by concern produces shallow findings. -**Do instead**: Segment by package. Each agent reads all files in its packages and reviews everything. +### Principles -### AP-4: Suggesting Complexity Additions -**What it looks like**: "Add response envelopes" / "Create a config validation layer" -**Why wrong**: The project's strongest convention is AGAINST adding complexity. If something works simply, don't make it complex. -**Do instead**: Frame suggestions as simplifications. "Remove this abstraction" > "Add this abstraction." +- **Audit only**: READS and REPORTS. Does NOT modify code unless explicitly asked with `--fix`. +- **Skip generic findings**: Do NOT report `DisallowUnknownFields`, `t.Parallel()`, or other generic Go best practices unless they are genuinely wrong in context. Focus on sapcc-specific patterns. +- **Rationalization guard**: Avoid "could theoretically be better" findings. Focus on things that would actually be commented on in a real PR review. --- -## Calibration Examples - -### Good Finding (Would Appear in a Project Review) -``` -### SHOULD-FIX: configResponse is a maintenance-hazard copy of DestConfig -**File**: `internal/api/config_handler.go:42` -**Convention**: "This is just DestConfig with JSON tags. Add the tags to DestConfig and delete this." 
+## References -**Current code**: -type configResponse struct { - TenantID string `json:"tenant_id"` - BucketName string `json:"bucket_name"` - Region string `json:"region"` -} - -**Should be**: -// Add JSON tags to the existing DestConfig type -type DestConfig struct { - TenantID string `json:"tenant_id"` - BucketName string `json:"bucket_name"` - Region string `json:"region"` -} - -**Why**: Duplicate structs drift apart over time. One source of truth. -``` - -### Bad Finding (Would NOT Appear in a Project Review) -``` -### WARNING: Consider adding DisallowUnknownFields -**File**: `internal/config/loader.go:15` -This JSON decoder could benefit from DisallowUnknownFields() to catch typos. -``` -This is too generic. A project reviewer wouldn't comment on this in most cases. - ---- - -## Reference Loading Strategy - -The essential review rules are already inline in Section "What the Lead Reviewer Actually Cares About" above and in each agent's dispatch prompt. Reference files provide supplementary depth when an agent needs it. - -Reference files live at `skills/go-sapcc-conventions/references/` (or `~/.claude/skills/go-sapcc-conventions/references/` globally). 
- -**Per-agent reference loading** (included in each agent's dispatch prompt based on assigned packages): +### Per-agent reference loading (included in each agent's dispatch prompt based on assigned packages) | Package Type | Reference to Load | |-------------|-------------------| @@ -408,15 +312,14 @@ Reference files live at `skills/go-sapcc-conventions/references/` (or `~/.claude | Build/CI config | `build-ci-detailed.md` | | Import-heavy files | `library-reference.md` | -**Always available for calibration** (load only when needed): +### Always available for calibration (load only when needed) + - `anti-patterns.md` — Quick-check findings against known anti-patterns - `review-standards-lead.md` — Calibrate review tone and severity -**Anti-pattern**: Do NOT tell every agent to read the full sapcc-code-patterns.md. The rules are already inline. Load reference files only for domain-specific depth. - ---- +**Note**: Do NOT tell every agent to read the full sapcc-code-patterns.md. The rules are already inline in the dispatch prompt. Load reference files only for domain-specific depth. -## Integration +### Integration - **Router**: `/do` routes via "sapcc audit", "sapcc compliance", "sapcc lead review" - **Pairs with**: `go-sapcc-conventions` (the rules), `golang-general-engineer` (the executor) diff --git a/skills/sapcc-review/SKILL.md b/skills/sapcc-review/SKILL.md index b718a929..15ffe339 100644 --- a/skills/sapcc-review/SKILL.md +++ b/skills/sapcc-review/SKILL.md @@ -51,57 +51,11 @@ routing: --- -## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **Domain-Scoped References**: Each agent loads ONLY its domain-specific reference file (see Reference Loading Strategy). The essential rules are inline in each agent's dispatch prompt. -- **Domain Specialist Model**: Agents are assigned by rule domain (testing, errors, types, etc.), NOT by package. Each agent scans ALL packages for their domain's violations. 
-- **Code-Level Findings Only**: Every finding must include actual code from the repo and a concrete CORRECT version. Abstract suggestions are forbidden. -- **Cite Rule Source**: Every finding must cite the section number from sapcc-code-patterns.md (e.g., "§30.1: httptest.Handler migration"). -- **Directive Review Voice**: Frame findings as the lead reviewer would state them in a PR comment. Use the project's established review phrases where applicable. -- **Audit Only By Default**: READ and REPORT. Do NOT modify code unless `--fix` is specified. -- **Skip Generic Go**: Do NOT report generic Go best practices. Only report patterns that are specifically sapcc/project-convention divergences. - -### Default Behaviors (ON unless disabled) -- **gopls MCP Integration**: All review agents MUST use gopls MCP tools when available — `go_workspace` at start, `go_file_context` for dependency analysis, `go_symbol_references` for cross-package impact tracing, `go_diagnostics` for build verification. This gives type-aware analysis instead of text-only grep. -- **Save Report**: Write findings to `sapcc-review-report.md` in repo root -- **Quick Wins Section**: Identify the 5 easiest fixes with highest impact -- **Cross-Repository Reinforcement**: Weight findings higher when the violated rule appears in 4+ repos per §35 - -### Optional Behaviors (OFF unless enabled) -- **--fix**: Create worktree, apply fixes, run tests, create branch -- **--focus [package]**: Audit only one package (runs 3 agents instead of 10) -- **--severity [critical|high|medium|all]**: Only report findings at or above severity +## Overview ---- - -## Reference Loading Strategy - -Reference files live at `skills/go-sapcc-conventions/references/` (or `~/.claude/skills/go-sapcc-conventions/references/` globally). - -**Key change from v0**: Agents load ONLY their domain-specific reference, NOT the full patterns file. The essential rules are already embedded in each agent's dispatch prompt below. 
Reference files provide supplementary depth. - -**Per-agent reference loading** (included in each agent's dispatch prompt): - -| Agent | Domain Reference to Load | -|-------|--------------------------| -| 1 (Signatures/Config) | `review-standards-lead.md` | -| 2 (Types/Option[T]) | `architecture-patterns.md` | -| 3 (HTTP/API) | `api-design-detailed.md` | -| 4 (Error Handling) | `error-handling-detailed.md` | -| 5 (Database/SQL) | (none — rules inline) | -| 6 (Testing) | `testing-patterns-detailed.md` | -| 7 (Pkg Org/Imports) | `architecture-patterns.md` | -| 8 (Modern Go/Stdlib) | (none — rules inline) | -| 9 (Observability/Jobs) | (none — rules inline) | -| 10 (Anti-Patterns/LLM) | `anti-patterns.md` | - -**Optional deep-dive** (load only when findings need calibration): -- `sapcc-code-patterns.md` — Comprehensive 36-section reference. Only load specific sections, not the entire file. -- `pr-mining-insights.md` — Review severity calibration -- `library-reference.md` — Approved/forbidden dependency table +This skill executes a gold-standard code review against SAP Converged Cloud Go repository standards through parallel domain specialists. Rather than one generalist reviewing one package, ten specialists review all packages for their specific domain (error handling, testing, types, HTTP APIs, etc.). This catches systemic patterns that package-level reviews miss. -**Anti-pattern**: Loading ALL reference files into EVERY agent wastes context and dilutes focus. Each agent should load only what it needs. +Each specialist loads only its domain-specific reference file to keep context tight and focus deep. Findings are code-level (actual rejected/correct examples, never abstract suggestions) and cite specific sections from sapcc-code-patterns.md. --- @@ -176,247 +130,16 @@ Comprehensive code review of [repo] against project standards, dispatching 10 do **Goal**: Launch 10 domain-specialist agents in a SINGLE message for true parallel execution. 
-**CRITICAL**: All 10 agents must be dispatched in ONE message using the Agent tool. Do NOT serialize them. +**CRITICAL**: All 10 agents must be dispatched in ONE message using the Agent tool. Do NOT serialize them. Serializing agents wastes time since domain specialists operate independently on disjoint concerns. Each agent receives: 1. The path to sapcc-code-patterns.md to read 2. Their assigned sections to focus on -3. Their domain-specific reference file(s) to read +3. Their domain-specific reference file (loaded to avoid context dilution; each agent reads ONLY what it needs because loading all references into every agent wastes context and dilutes focus) 4. Instructions to scan ALL .go files in the repo 5. The exact output format for findings -**All agents share this preamble** (include in each prompt): - -``` -REFERENCE FILES TO READ FIRST (mandatory): -1. Read ~/.claude/skills/go-sapcc-conventions/references/sapcc-code-patterns.md - (Focus on sections listed below, but skim all for context) -2. 
Read [domain-specific reference file] - -REPO TO REVIEW: [current working directory] - -SCAN METHOD: -- Use Glob to find all .go files: **/*.go (excluding vendor/) -- Use Read to examine each file -- Use Grep to search for specific patterns across all files - -OUTPUT FORMAT for each finding: -### [CRITICAL|HIGH|MEDIUM|LOW]: [One-line summary] -**File**: `path/to/file.go:LINE` -**Rule**: §[section].[subsection]: [rule name] -**Convention**: "[What the lead reviewer would write in a PR comment]" - -REJECTED (current code): -```go -[actual code, 3-10 lines] -``` - -CORRECT (what it should be): -```go -[fixed code] -``` - -**Why**: [One sentence] ---- - -Write ALL findings to: [output file path] -``` - ---- - -#### Agent 1: Function Signatures, Constructors, Configuration - -**Sections**: §1 (Function Signatures), §2 (Configuration), §3 (Constructor Patterns) -**Extra Reference**: `review-standards-lead.md` - -**What to check across ALL packages:** -- Constructor taking option struct or functional options instead of positional params -- Functions with >8 params (should they be split?) -- context.Context in wrong position (should be first only for external calls) -- Config loaded from files/viper instead of env vars via osext.MustGetenv -- Constructors that return errors (should be infallible) -- Missing Override methods for test doubles -- Missing time.Now / ID-generator injection in constructors - ---- - -#### Agent 2: Interfaces, Types, Option[T] - -**Sections**: §4 (Interface Patterns), §8 (Type Definitions), §32 (Option[T] Complete Guide), §36 (Contract Cohesion) -**Extra Reference**: `architecture-patterns.md` - -**What to check across ALL packages:** -- Interfaces with only one implementation (should be concrete type) -- Interfaces defined in implementation package instead of consumer package -- `*T` used for optional fields instead of `Option[T]` from majewsky/gg/option -- Missing dot-import for option package (`import . 
"github.com/majewsky/gg/option"`) -- `iota` used for enums instead of typed string constants -- Named types for domain concepts missing (raw `string` where `AccountName` type should exist) -- Pointer receivers where value receiver is appropriate (or vice versa) -- Type exported when only constructor needs to be public -- **Contract cohesion (§36)**: Constants, error sentinels, or validation functions in a different file from the interface/type they belong to. If `ErrFoo` is returned by `FooDriver` methods, both must live in `foo_driver.go`. MEDIUM for new violations, LOW for pre-existing. -- **Interface consumer audit**: When a sentinel value or special parameter is introduced on an interface method, grep for ALL implementations AND all callers of that interface method across the entire repo. Use gopls `go_symbol_references` when available. Verify every caller validates the sentinel before passing it. Do not rely on the PR description's claim about authorization — verify the call chain independently. 
- ---- - -#### Agent 3: HTTP/API Design - -**Sections**: §5 (HTTP/API Patterns), §34 (Architectural Opinions Feb 2026) -**Extra Reference**: `api-design-detailed.md` - -**What to check across ALL packages:** -- Auth done as middleware instead of inline at top of handler -- JSON responses not using respondwith.JSON -- Error responses using JSON instead of text/plain (http.Error for 4xx, respondwith.ErrorText for 500) -- Missing DisallowUnknownFields on json.NewDecoder -- Route registration not using httpapi.Compose pattern -- Handler not a method on *API struct -- Handler signature not `handleVerbResource(w, r)` -- Request structs not parsed into purpose-specific types -- API docs in Markdown instead of Go types on pkg.go.dev - ---- - -#### Agent 4: Error Handling - -**Sections**: §6 (Error Handling), §26.13 (Error message naming) -**Extra Reference**: `error-handling-detailed.md` - -**What to check across ALL packages:** -- Error messages not following "cannot : %w" format -- "failed to" instead of "cannot" in error messages -- Logging AND returning the same error (must choose one) -- Missing error wrapping (bare `return err` without context) -- Generic error messages ("internal error", "something went wrong") -- logg.Fatal used outside cmd/ packages -- Swallowed errors (err checked but not returned or logged) -- fmt.Errorf with %s when %w is needed (or vice versa) -- Validation functions returning (bool, string) instead of error - ---- - -#### Agent 5: Database and SQL - -**Sections**: §7 (Database Patterns), §27 (Database Deep Dive) -**Extra Reference**: (use sapcc-code-patterns.md §7 + §27) - -**What to check across ALL packages:** -- SQL queries not declared as package-level `var` with `sqlext.SimplifyWhitespace()` -- Using `?` placeholders instead of `$1, $2` (PostgreSQL style) -- Missing `defer sqlext.RollbackUnlessCommitted(tx)` after `db.Begin()` -- TIMESTAMP used instead of TIMESTAMPTZ -- NULL columns that should be NOT NULL -- Migrations that modify 
existing migrations (immutable rule) -- Missing down migrations -- App-level validation that should be DB constraints -- Using explicit transaction for single statements -- Not using `sqlext.ForeachRow` for row iteration - ---- - -#### Agent 6: Testing Patterns - -**Sections**: §9 (Testing Patterns), §30 (go-bits Testing API Evolution) -**Extra Reference**: `testing-patterns-detailed.md` - -**What to check across ALL *_test.go files AND test helpers:** -- Using deprecated `assert.HTTPRequest` instead of `httptest.Handler.RespondTo()` -- Using `assert.DeepEqual` where generic `assert.Equal` works -- Table-driven tests (project convention prefers sequential scenario-driven narrative) -- Missing `t.Helper()` in test helper functions -- Using reflect.DeepEqual instead of assert.DeepEqual -- Test fixtures as large JSON files instead of programmatic builders -- Duplicated test setup across test functions (should extract) -- Using `require` package instead of `must` from go-bits -- Not using `must.SucceedT` / `must.ReturnT` for error-checked returns -- Not using `assert.ErrEqual` for flexible error matching -- **Assertion depth check**: For security-sensitive code (auth, filtering, tenant isolation), presence-only assertions (`NotEmpty`, `NotNil`, `assert.True(t, ok)`) are INSUFFICIENT. 
Tests must verify the actual VALUE matches the expected input (e.g., `assert.Equal(t, expectedID, filters[0]["term"]["tenant_ids"])`) - ---- - -#### Agent 7: Package Organization, Imports, Comments - -**Sections**: §10 (Package Org), §11 (Import Org), §13 (Comment Style), §28 (CLI Patterns), §36 (Contract Cohesion) -**Extra Reference**: `architecture-patterns.md` - -**What to check across ALL packages:** -- Import groups not in stdlib / external / internal order (3 groups) -- Dot-import used for anything other than `majewsky/gg/option` -- Missing SPDX license header -- Comments using `/* */` instead of `//` for doc comments -- Missing 80-slash separator comments (`////////////////...`) between type groups -- `//NOTE:` markers missing for non-obvious logic -- Exported symbols without godoc comments -- cmd/ packages using wrong CLI patterns (if CLI repo) -- Package names not reading as English ("package utils" instead of meaningful name) -- **Contract cohesion (§36)**: Files named generically (`interface.go`, `types.go`, `constants.go`) when they should be named for the domain concept (`storage_driver.go`, `rbac_policy.go`). Constants/sentinels in `util.go` that belong to a specific interface's file. The test: if you can name the owning interface, the artifact must live in that interface's file. - ---- - -#### Agent 8: Modern Go, Standard Library, Concurrency - -**Sections**: §14 (Concurrency), §15 (Startup/Shutdown), §29 (Modern Go Stdlib) -**Extra Reference**: (use sapcc-code-patterns.md §14, §15, §29) - -**What to check across ALL packages:** -- Using `sort.Slice` instead of `slices.SortFunc` (Go 1.21+) -- Using manual `keys := make([]K, 0, len(m)); for k := range m { ... 
}` instead of `slices.Sorted(maps.Keys(m))` -- Using `strings.HasPrefix + strings.TrimPrefix` instead of `strings.CutPrefix` (Go 1.20+) -- Using manual `if a < b { return a }` instead of `min(a, b)` (Go 1.21+) -- Loop variable capture workaround (`v := v`) in Go 1.22+ code -- Goroutines without proper context cancellation -- Missing SIGINT context handling in main() -- `os.Exit` used instead of proper shutdown sequence -- `sync.Mutex` on struct value instead of per-resource -- Missing `for range N` syntax where applicable (Go 1.22+) - ---- - -#### Agent 9: Observability, Metrics, Background Jobs - -**Sections**: §16 (Background Jobs), §17 (HTTP Client), §18 (String Formatting), §20 (Observability) -**Extra Reference**: (use sapcc-code-patterns.md §16-18, §20) - -**What to check across ALL packages:** -- Prometheus metrics missing application prefix (e.g., `keppel_` or `logrouter_`) -- Counter metrics not initialized to zero -- Counter metric names not plural -- Gauge used where Counter is appropriate (or vice versa) -- Background jobs not using `jobloop.ProducerConsumerJob` pattern -- HTTP client creating new `http.Client` per request instead of using `http.DefaultClient` -- Custom HTTP transport instead of `http.DefaultTransport` -- Missing jitter in polling/retry loops -- `fmt.Sprintf` for simple string concatenation (use `+`) -- `+` for complex multi-part string building (use `fmt.Sprintf`) - ---- - -#### Agent 10: Anti-Patterns, LLM Tells, Community Divergences - -**Sections**: §22 (Divergences), §24 (Anti-Patterns), §25 (LLM Code Feedback), §33 (Portunus Architecture), §35 (Reinforcement Table) -**Extra Reference**: `anti-patterns.md` - -**This is the highest-value agent.** It checks for patterns that LLMs generate by default but the project explicitly rejects: - -- Functional options pattern (project convention: positional params) -- Table-driven tests (project convention: sequential scenario narrative) -- Interface segregation / many small interfaces 
(project convention: 1-2 interfaces max per domain) -- Middleware-based auth (project convention: inline at handler top) -- Config validation layer (project convention: no separate validation) -- `*T` for optional fields (project convention: `Option[T]`) -- Config files / viper (project convention: pure env vars) -- Error messages starting with capital letter -- Error messages using "failed to" (project convention: "cannot") -- Helper functions extracted for cyclomatic complexity (project convention: "contrived edit to satisfy silly metrics") -- Exported types when only constructor is public -- Plugin creating its own DB connection (project convention: receive dependencies) -- `errors.New` + `fmt.Sprintf` instead of `fmt.Errorf` -- Manual row scanning instead of `sqlext.ForeachRow` -- Test setup in `TestMain` instead of per-test -- Verbose error checking instead of `assert.ErrEqual` / `must.SucceedT` -- **Extraction without guard transfer**: When inline code is extracted into a named helper, ALL defensive checks that relied on "the caller handles it" must be re-evaluated. A missing guard rated LOW as inline code becomes MEDIUM as a reusable function. Flag extracted helpers that lack self-contained validation. - ---- +See `references/agent-dispatch-prompts.md` for the shared preamble and all 10 agent specifications (Agents 1-10). **Gate**: All 10 agents dispatched in single message. Wait for all to complete. Proceed to Phase 3. @@ -440,13 +163,7 @@ If two agents flagged the same file:line, keep the higher-severity finding with **Step 3: Prioritize** -Apply cross-repository reinforcement from §35: - -| Pattern Strength | Severity Boost | -|-----------------|----------------| -| NON-NEGOTIABLE (4+ repos) | +1 severity level (MEDIUM->HIGH) | -| Strong Signal (2-3 repos) | No change | -| Context-Specific (1 repo) | -1 severity level (HIGH->MEDIUM) | +Apply cross-repository reinforcement from §35. See `references/report-template.md` for the severity boost table. 
**Step 4: Identify Quick Wins** @@ -459,71 +176,7 @@ These go in a "Quick Wins" section at the top of the report. **Step 5: Write report** -Create `sapcc-review-report.md`: - -```markdown -# SAPCC Code Review: [repo name] - -**Module**: [go module path] -**Date**: [date] -**Packages reviewed**: [N] packages, [M] Go files, [T] test files -**Agents dispatched**: 10 domain specialists -**Reference version**: sapcc-code-patterns.md (comprehensive patterns reference, 36 sections) - ---- - -## Verdict - -[2-3 sentences: Would this codebase pass lead review? What are the systemic issues? -Not just "there are problems" — identify the PATTERN of problems.] - -## Score Card - -| Domain | Agent | Findings | Critical | High | Medium | Low | -|--------|-------|----------|----------|------|--------|-----| -| Signatures/Config | 1 | N | ... | ... | ... | ... | -| Types/Option[T] | 2 | N | ... | ... | ... | ... | -| HTTP/API | 3 | N | ... | ... | ... | ... | -| Error Handling | 4 | N | ... | ... | ... | ... | -| Database/SQL | 5 | N | ... | ... | ... | ... | -| Testing | 6 | N | ... | ... | ... | ... | -| Pkg Org/Imports | 7 | N | ... | ... | ... | ... | -| Modern Go/Stdlib | 8 | N | ... | ... | ... | ... | -| Observability/Jobs | 9 | N | ... | ... | ... | ... | -| Anti-Patterns/LLM | 10 | N | ... | ... | ... | ... | -| **TOTAL** | | **N** | **X** | **Y** | **Z** | **W** | - -## Quick Wins (Easy Fixes, High Impact) - -[5-10 findings that can be fixed with minimal effort] - -## Critical Findings - -[Each finding with full REJECTED/CORRECT code] - -## High Findings - -[Each finding with full REJECTED/CORRECT code] - -## Medium Findings - -[Each finding] - -## Low Findings - -[Brief list] - -## What's Done Well - -[Genuine positives the lead reviewer would note approvingly. This is important for morale -and to show the review isn't blindly negative.] - -## Systemic Recommendations - -[2-3 big-picture recommendations based on patterns across findings. 
-E.g., "This repo consistently uses *T for optionals — a bulk migration -to Option[T] would address 15 findings at once."] -``` +Create `sapcc-review-report.md` using the full template in `references/report-template.md`. **Gate**: Report written. Display summary to user. Proceed to Phase 4 if `--fix` specified. @@ -567,59 +220,50 @@ Update `sapcc-review-report.md` with: --- -## Calibration: What Makes This Gold Standard +## Error Handling -### Why 10 Domain Specialists > N Package Generalists +**When an agent fails or produces empty findings**: +1. Verify the repo has Go files (some repos may be non-Go or already pass review completely) +2. Check agent logs for permission errors or gopls MCP connection failures +3. If agent timed out, increase timeout or split the 10 agents into two waves of 5 +4. If agent reports "no findings", it has completed successfully — that domain is clean -| Approach | Strength | Weakness | -|----------|----------|----------| -| **Package generalist** (sapcc-audit) | Understands file-level context | Must remember ALL rules for every file | -| **Domain specialist** (sapcc-review) | Deep expertise in one rule domain | May miss cross-concern interactions | +**When a finding looks wrong** (e.g., false positive): +- Cross-check with sapcc-code-patterns.md section cited in the finding +- If it contradicts the reference, note the discrepancy and file in toolkit issue +- If it applies to pre-existing code older than the rule's introduction, mark as LOW and note in systemic recommendations -**Combination is ideal**: Run `/sapcc-review` for comprehensive rule coverage, then `/sapcc-audit` for holistic package-level review. - -### Why Read sapcc-code-patterns.md First - -The comprehensive reference is the single source of truth. Without reading it, agents default to community Go conventions — which are WRONG for 12+ patterns the project explicitly diverges from (§22). The reference file IS the competitive advantage. 
- -### What "Gold Standard" Means - -1. **Complete coverage**: Every section in sapcc-code-patterns.md has a specialist agent checking for it -2. **Low false positives**: Agents skip generic Go advice and only report sapcc-specific divergences -3. **Actionable findings**: Every finding has REJECTED code and CORRECT code -4. **Prioritized output**: Cross-repo reinforcement weights findings by importance -5. **Reproducible**: Same repo + same reference = same findings -6. **Quick Wins first**: Operator can fix 10 easy things immediately for rapid improvement - ---- - -## Anti-Patterns - -### AP-1: Not Reading the References -**What it looks like**: Agent starts reviewing without reading sapcc-code-patterns.md -**Why wrong**: Without the reference, agents generate findings based on community Go advice, which diverges from the project's preferences in 12+ areas -**Do instead**: ALWAYS read sapcc-code-patterns.md FIRST. This is hardcoded behavior. - -### AP-2: Reporting Generic Go Issues -**What it looks like**: "Add t.Parallel()", "Use context.Context as first param" -**Why wrong**: The lead reviewer doesn't care about these. The focus is on over-engineering, dead code, and error quality. -**Do instead**: Only report patterns that match rules in sapcc-code-patterns.md - -### AP-3: Suggesting More Complexity -**What it looks like**: "Add a config validation layer", "Create an error registry" -**Why wrong**: The #1 concern in lead review is over-engineering. Never suggest adding abstraction. -**Do instead**: Suggest REMOVING complexity. "Delete this wrapper" > "Add this wrapper" - -### AP-4: Abstract Findings Without Code -**What it looks like**: "Error handling could be improved in package X" -**Why wrong**: Not actionable. Lead reviews always show exact code. -**Do instead**: Show the actual line, the actual code, and the actual fix. +**When `--fix` breaks tests**: +1. Revert the failed fix +2. Note in report that this finding needs manual review +3. 
Document the test failure reason so the maintainer understands the blocker +4. Continue with next fix rather than stopping the whole process --- -## Integration - -- **Router**: `/do` routes via "sapcc review", "sapcc lead review", "comprehensive sapcc audit" -- **Complements**: `/sapcc-audit` (package-level generalist) — use both for maximum coverage -- **Prerequisite**: go-sapcc-conventions skill must be installed at `~/.claude/skills/go-sapcc-conventions/` -- **Sync**: After creating, run `cp -r skills/sapcc-review ~/.claude/skills/sapcc-review` for global access +## References + +- **sapcc-code-patterns.md** — Comprehensive 36-section reference (single source of truth for all review rules) +- **Per-agent reference files** (loaded during dispatch): + - Agent 1: `review-standards-lead.md` + - Agent 2: `architecture-patterns.md` + - Agent 3: `api-design-detailed.md` + - Agent 4: `error-handling-detailed.md` + - Agent 5: (none — rules inline in §7 + §27) + - Agent 6: `testing-patterns-detailed.md` + - Agent 7: `architecture-patterns.md` + - Agent 8: (none — rules inline in §14, §15, §29) + - Agent 9: (none — rules inline in §16-18, §20) + - Agent 10: `anti-patterns.md` +- **Optional deep-dive references** (load only when findings need calibration): + - `pr-mining-insights.md` — Review severity calibration across projects + - `library-reference.md` — Approved/forbidden dependency table +- **Progressive disclosure references** (loaded on demand): + - `references/agent-dispatch-prompts.md` — Shared preamble + all 10 agent specifications + - `references/report-template.md` — Full report template + severity boost table + +**Integration notes**: +- Complements `/sapcc-audit` (package-level generalist) — use both for maximum coverage +- Prerequisite: go-sapcc-conventions skill must be installed at `~/.claude/skills/go-sapcc-conventions/` +- Sync: After creating, run `cp -r skills/sapcc-review ~/.claude/skills/sapcc-review` for global access +- Router: `/do` routes via 
"sapcc review", "sapcc lead review", "comprehensive sapcc audit" diff --git a/skills/sapcc-review/references/agent-dispatch-prompts.md b/skills/sapcc-review/references/agent-dispatch-prompts.md new file mode 100644 index 00000000..95300fbc --- /dev/null +++ b/skills/sapcc-review/references/agent-dispatch-prompts.md @@ -0,0 +1,238 @@ +# Agent Dispatch Prompts + +This file contains the shared preamble and all 10 domain-specialist agent specifications used during Phase 2 (DISPATCH). + +--- + +## Shared Preamble + +Include this block in every agent prompt: + +``` +REFERENCE FILES TO READ FIRST (mandatory): +1. Read ~/.claude/skills/go-sapcc-conventions/references/sapcc-code-patterns.md + (Focus on sections listed below, but skim all for context) +2. Read [domain-specific reference file] + +REPO TO REVIEW: [current working directory] + +SCAN METHOD: +- Use Glob to find all .go files: **/*.go (excluding vendor/) +- Use Read to examine each file +- Use Grep to search for specific patterns across all files + +OUTPUT FORMAT for each finding: +### [CRITICAL|HIGH|MEDIUM|LOW]: [One-line summary] +**File**: `path/to/file.go:LINE` +**Rule**: §[section].[subsection]: [rule name] +**Convention**: "[What the lead reviewer would write in a PR comment]" + +REJECTED (current code): +```go +[actual code, 3-10 lines] +``` + +CORRECT (what it should be): +```go +[fixed code] +``` + +**Why**: [One sentence] +--- + +Write ALL findings to: [output file path] +``` + +--- + +## Agent 1: Function Signatures, Constructors, Configuration + +**Sections**: §1 (Function Signatures), §2 (Configuration), §3 (Constructor Patterns) +**Extra Reference**: `review-standards-lead.md` + +**What to check across ALL packages:** +- Constructor taking option struct or functional options instead of positional params +- Functions with >8 params (should they be split?) 
+- context.Context in wrong position (should be first only for external calls) +- Config loaded from files/viper instead of env vars via osext.MustGetenv +- Constructors that return errors (should be infallible) +- Missing Override methods for test doubles +- Missing time.Now / ID-generator injection in constructors + +--- + +## Agent 2: Interfaces, Types, Option[T] + +**Sections**: §4 (Interface Patterns), §8 (Type Definitions), §32 (Option[T] Complete Guide), §36 (Contract Cohesion) +**Extra Reference**: `architecture-patterns.md` + +**What to check across ALL packages:** +- Interfaces with only one implementation (should be concrete type) +- Interfaces defined in implementation package instead of consumer package +- `*T` used for optional fields instead of `Option[T]` from majewsky/gg/option +- Missing dot-import for option package (`import . "github.com/majewsky/gg/option"`) +- `iota` used for enums instead of typed string constants +- Named types for domain concepts missing (raw `string` where `AccountName` type should exist) +- Pointer receivers where value receiver is appropriate (or vice versa) +- Type exported when only constructor needs to be public +- **Contract cohesion (§36)**: Constants, error sentinels, or validation functions in a different file from the interface/type they belong to. If `ErrFoo` is returned by `FooDriver` methods, both must live in `foo_driver.go`. MEDIUM for new violations, LOW for pre-existing. +- **Interface consumer audit**: When a sentinel value or special parameter is introduced on an interface method, grep for ALL implementations AND all callers of that interface method across the entire repo. Use gopls `go_symbol_references` when available. Verify every caller validates the sentinel before passing it. Do not rely on the PR description's claim about authorization — verify the call chain independently. 
+ +--- + +## Agent 3: HTTP/API Design + +**Sections**: §5 (HTTP/API Patterns), §34 (Architectural Opinions Feb 2026) +**Extra Reference**: `api-design-detailed.md` + +**What to check across ALL packages:** +- Auth done as middleware instead of inline at top of handler +- JSON responses not using respondwith.JSON +- Error responses using JSON instead of text/plain (http.Error for 4xx, respondwith.ErrorText for 500) +- Missing DisallowUnknownFields on json.NewDecoder +- Route registration not using httpapi.Compose pattern +- Handler not a method on *API struct +- Handler signature not `handleVerbResource(w, r)` +- Request structs not parsed into purpose-specific types +- API docs in Markdown instead of Go types on pkg.go.dev + +--- + +## Agent 4: Error Handling + +**Sections**: §6 (Error Handling), §26.13 (Error message naming) +**Extra Reference**: `error-handling-detailed.md` + +**What to check across ALL packages:** +- Error messages not following "cannot : %w" format +- "failed to" instead of "cannot" in error messages +- Logging AND returning the same error (must choose one) +- Missing error wrapping (bare `return err` without context) +- Generic error messages ("internal error", "something went wrong") +- logg.Fatal used outside cmd/ packages +- Swallowed errors (err checked but not returned or logged) +- fmt.Errorf with %s when %w is needed (or vice versa) +- Validation functions returning (bool, string) instead of error + +--- + +## Agent 5: Database and SQL + +**Sections**: §7 (Database Patterns), §27 (Database Deep Dive) +**Extra Reference**: (use sapcc-code-patterns.md §7 + §27) + +**What to check across ALL packages:** +- SQL queries not declared as package-level `var` with `sqlext.SimplifyWhitespace()` +- Using `?` placeholders instead of `$1, $2` (PostgreSQL style) +- Missing `defer sqlext.RollbackUnlessCommitted(tx)` after `db.Begin()` +- TIMESTAMP used instead of TIMESTAMPTZ +- NULL columns that should be NOT NULL +- Migrations that modify existing 
migrations (immutable rule) +- Missing down migrations +- App-level validation that should be DB constraints +- Using explicit transaction for single statements +- Not using `sqlext.ForeachRow` for row iteration + +--- + +## Agent 6: Testing Patterns + +**Sections**: §9 (Testing Patterns), §30 (go-bits Testing API Evolution) +**Extra Reference**: `testing-patterns-detailed.md` + +**What to check across ALL *_test.go files AND test helpers:** +- Using deprecated `assert.HTTPRequest` instead of `httptest.Handler.RespondTo()` +- Using `assert.DeepEqual` where generic `assert.Equal` works +- Table-driven tests (project convention prefers sequential scenario-driven narrative) +- Missing `t.Helper()` in test helper functions +- Using reflect.DeepEqual instead of assert.DeepEqual +- Test fixtures as large JSON files instead of programmatic builders +- Duplicated test setup across test functions (should extract) +- Using `require` package instead of `must` from go-bits +- Not using `must.SucceedT` / `must.ReturnT` for error-checked returns +- Not using `assert.ErrEqual` for flexible error matching +- **Assertion depth check**: For security-sensitive code (auth, filtering, tenant isolation), presence-only assertions (`NotEmpty`, `NotNil`, `assert.True(t, ok)`) are INSUFFICIENT. 
Tests must verify the actual VALUE matches the expected input (e.g., `assert.Equal(t, expectedID, filters[0]["term"]["tenant_ids"])`) + +--- + +## Agent 7: Package Organization, Imports, Comments + +**Sections**: §10 (Package Org), §11 (Import Org), §13 (Comment Style), §28 (CLI Patterns), §36 (Contract Cohesion) +**Extra Reference**: `architecture-patterns.md` + +**What to check across ALL packages:** +- Import groups not in stdlib / external / internal order (3 groups) +- Dot-import used for anything other than `majewsky/gg/option` +- Missing SPDX license header +- Comments using `/* */` instead of `//` for doc comments +- Missing 80-slash separator comments (`////////////////...`) between type groups +- `//NOTE:` markers missing for non-obvious logic +- Exported symbols without godoc comments +- cmd/ packages using wrong CLI patterns (if CLI repo) +- Package names not reading as English ("package utils" instead of meaningful name) +- **Contract cohesion (§36)**: Files named generically (`interface.go`, `types.go`, `constants.go`) when they should be named for the domain concept (`storage_driver.go`, `rbac_policy.go`). Constants/sentinels in `util.go` that belong to a specific interface's file. The test: if you can name the owning interface, the artifact must live in that interface's file. + +--- + +## Agent 8: Modern Go, Standard Library, Concurrency + +**Sections**: §14 (Concurrency), §15 (Startup/Shutdown), §29 (Modern Go Stdlib) +**Extra Reference**: (use sapcc-code-patterns.md §14, §15, §29) + +**What to check across ALL packages:** +- Using `sort.Slice` instead of `slices.SortFunc` (Go 1.21+) +- Using manual `keys := make([]K, 0, len(m)); for k := range m { ... 
}` instead of `slices.Sorted(maps.Keys(m))` +- Using `strings.HasPrefix + strings.TrimPrefix` instead of `strings.CutPrefix` (Go 1.20+) +- Using manual `if a < b { return a }` instead of `min(a, b)` (Go 1.21+) +- Loop variable capture workaround (`v := v`) in Go 1.22+ code +- Goroutines without proper context cancellation +- Missing SIGINT context handling in main() +- `os.Exit` used instead of proper shutdown sequence +- `sync.Mutex` on struct value instead of per-resource +- Missing `for range N` syntax where applicable (Go 1.22+) + +--- + +## Agent 9: Observability, Metrics, Background Jobs + +**Sections**: §16 (Background Jobs), §17 (HTTP Client), §18 (String Formatting), §20 (Observability) +**Extra Reference**: (use sapcc-code-patterns.md §16-18, §20) + +**What to check across ALL packages:** +- Prometheus metrics missing application prefix (e.g., `keppel_` or `logrouter_`) +- Counter metrics not initialized to zero +- Counter metric names not plural +- Gauge used where Counter is appropriate (or vice versa) +- Background jobs not using `jobloop.ProducerConsumerJob` pattern +- HTTP client creating new `http.Client` per request instead of using `http.DefaultClient` +- Custom HTTP transport instead of `http.DefaultTransport` +- Missing jitter in polling/retry loops +- `fmt.Sprintf` for simple string concatenation (use `+`) +- `+` for complex multi-part string building (use `fmt.Sprintf`) + +--- + +## Agent 10: Anti-Patterns, LLM Tells, Community Divergences + +**Sections**: §22 (Divergences), §24 (Anti-Patterns), §25 (LLM Code Feedback), §33 (Portunus Architecture), §35 (Reinforcement Table) +**Extra Reference**: `anti-patterns.md` + +**This is the highest-value agent.** It checks for patterns that LLMs generate by default but the project explicitly rejects: + +- Functional options pattern (project convention: positional params) +- Table-driven tests (project convention: sequential scenario narrative) +- Interface segregation / many small interfaces (project 
convention: 1-2 interfaces max per domain) +- Middleware-based auth (project convention: inline at handler top) +- Config validation layer (project convention: no separate validation) +- `*T` for optional fields (project convention: `Option[T]`) +- Config files / viper (project convention: pure env vars) +- Error messages starting with capital letter +- Error messages using "failed to" (project convention: "cannot") +- Helper functions extracted for cyclomatic complexity (project convention: "contrived edit to satisfy silly metrics") +- Exported types when only constructor is public +- Plugin creating its own DB connection (project convention: receive dependencies) +- `errors.New` + `fmt.Sprintf` instead of `fmt.Errorf` +- Manual row scanning instead of `sqlext.ForeachRow` +- Test setup in `TestMain` instead of per-test +- Verbose error checking instead of `assert.ErrEqual` / `must.SucceedT` +- **Extraction without guard transfer**: When inline code is extracted into a named helper, ALL defensive checks that relied on "the caller handles it" must be re-evaluated. A missing guard rated LOW as inline code becomes MEDIUM as a reusable function. Flag extracted helpers that lack self-contained validation. diff --git a/skills/sapcc-review/references/report-template.md b/skills/sapcc-review/references/report-template.md new file mode 100644 index 00000000..7a151446 --- /dev/null +++ b/skills/sapcc-review/references/report-template.md @@ -0,0 +1,85 @@ +# Report Template + +This file contains the full report template used in Phase 3 (AGGREGATE) and the severity boost table for cross-repository reinforcement. 
+ +--- + +## Severity Boost Table + +Apply cross-repository reinforcement from §35 when prioritizing findings: + +| Pattern Strength | Severity Boost | +|-----------------|----------------| +| NON-NEGOTIABLE (4+ repos) | +1 severity level (MEDIUM->HIGH) | +| Strong Signal (2-3 repos) | No change | +| Context-Specific (1 repo) | -1 severity level (HIGH->MEDIUM) | + +--- + +## Report Template + +Create `sapcc-review-report.md` using this structure: + +```markdown +# SAPCC Code Review: [repo name] + +**Module**: [go module path] +**Date**: [date] +**Packages reviewed**: [N] packages, [M] Go files, [T] test files +**Agents dispatched**: 10 domain specialists +**Reference version**: sapcc-code-patterns.md (comprehensive patterns reference, 36 sections) + +--- + +## Verdict + +[2-3 sentences: Would this codebase pass lead review? What are the systemic issues? +Not just "there are problems" — identify the PATTERN of problems.] + +## Score Card + +| Domain | Agent | Findings | Critical | High | Medium | Low | +|--------|-------|----------|----------|------|--------|-----| +| Signatures/Config | 1 | N | ... | ... | ... | ... | +| Types/Option[T] | 2 | N | ... | ... | ... | ... | +| HTTP/API | 3 | N | ... | ... | ... | ... | +| Error Handling | 4 | N | ... | ... | ... | ... | +| Database/SQL | 5 | N | ... | ... | ... | ... | +| Testing | 6 | N | ... | ... | ... | ... | +| Pkg Org/Imports | 7 | N | ... | ... | ... | ... | +| Modern Go/Stdlib | 8 | N | ... | ... | ... | ... | +| Observability/Jobs | 9 | N | ... | ... | ... | ... | +| Anti-Patterns/LLM | 10 | N | ... | ... | ... | ... 
| +| **TOTAL** | | **N** | **X** | **Y** | **Z** | **W** | + +## Quick Wins (Easy Fixes, High Impact) + +[5-10 findings that can be fixed with minimal effort] + +## Critical Findings + +[Each finding with full REJECTED/CORRECT code] + +## High Findings + +[Each finding with full REJECTED/CORRECT code] + +## Medium Findings + +[Each finding] + +## Low Findings + +[Brief list] + +## What's Done Well + +[Genuine positives the lead reviewer would note approvingly. This is important for morale +and to show the review isn't blindly negative.] + +## Systemic Recommendations + +[2-3 big-picture recommendations based on patterns across findings. +E.g., "This repo consistently uses *T for optionals — a bulk migration +to Option[T] would address 15 findings at once."] +``` diff --git a/skills/security-threat-model/SKILL.md b/skills/security-threat-model/SKILL.md index a9fe1671..2290ffe0 100644 --- a/skills/security-threat-model/SKILL.md +++ b/skills/security-threat-model/SKILL.md @@ -41,58 +41,16 @@ routing: # Security Threat Model Skill -## Operator Context +## Overview -This skill configures Claude as a security-focused analyst executing a structured, -phase-gated threat model workflow. It follows the toolkit's four-layer architecture: -**deterministic scripts execute checks; LLM interprets findings and generates mitigations**. -Phase 5 (synthesis) is the only LLM step. All earlier phases are deterministic Python -scripts that produce JSON artifacts. +This skill executes a structured, phase-gated security threat model workflow that scans +the toolkit installation for attack surface exposure, supply-chain injection patterns, +and learning DB contamination. It follows the toolkit's four-layer architecture: +deterministic Python scripts perform all checks and produce JSON artifacts; Phase 5 +(synthesis) is the only LLM step. Each phase gates on artifact validation before proceeding.
-### Hardcoded Behaviors (Always Apply) - -- **Artifacts over memory**: Every phase writes output to `security/` before proceeding -- **Gate enforcement**: Each phase gate must pass before the next phase starts -- **Dry-run by default**: No mutations to the learning DB without explicit `--purge` flag -- **No automated settings mutation**: deny-list is produced for human review; never merged automatically -- **Human approval required**: Phase 2 gate blocks until deny-list is reviewed and approved -- **CRITICAL findings block Phase 4**: Any CRITICAL supply-chain finding halts forward progress - -### Default Behaviors (ON unless disabled) - -- **security/ directory creation**: Created at Phase 1 start if not present -- **JSON validation**: All artifact JSON is validated before phase gate passes -- **Timestamped run ID**: All artifacts include a shared `run_id` for correlation -- **Summary to stderr**: Each script prints a one-line summary to stderr on completion - -### Optional Behaviors (OFF unless enabled) - -- **--purge**: Actually delete flagged learning DB rows (dry-run by default) -- **--include-user-claude**: Extend scans to `~/.claude/` paths in addition to local repo -- **--ci-mode**: Suppress interactive approval gates; CRITICAL findings exit non-zero - -## Available Scripts - -- **`scripts/scan-threat-surface.py`** -- Enumerate hooks, MCP servers, skills, and env vars. Output: `security/surface-report.json` -- **`scripts/generate-deny-list.py`** -- Produce deny-list config from surface scan findings. Output: `security/deny-list.json` -- **`scripts/scan-supply-chain.py`** -- Scan hooks/skills/agents for injection patterns, hidden chars, outbound commands. Output: `security/supply-chain-findings.json` -- **`scripts/sanitize-learning-db.py`** -- Inspect learning DB for poisoned entries. 
Output: `security/learning-db-report.json` -- **`scripts/validate-threat-model.py`** -- Verify required sections exist in `security/threat-model.md` - -## What This Skill CAN Do - -- Enumerate the active attack surface of the current toolkit installation -- Generate a ready-to-merge deny-list config for `settings.json` -- Detect zero-width Unicode, instruction-override phrases, and outbound commands in installed artifacts -- Flag learning DB entries with external origins or injection content -- Synthesize all findings into a structured threat model document with ranked mitigations - -## What This Skill CANNOT Do - -- Provide real-time hook monitoring (point-in-time audit only) -- Automatically purge learning DB rows (requires `--purge` flag and human review) -- Automatically merge deny-list into `settings.json` (human review required) -- Replace network-level egress controls (Docker no-egress is out of scope) +Outputs are saved to `security/` with a shared `run_id` for correlation across phases. +Phase 5 produces an actionable threat model document. --- @@ -102,15 +60,10 @@ scripts that produce JSON artifacts. **Goal**: Enumerate the active attack surface of the current installation. 
-**Step 1: Ensure output directory exists** +Create the `security/` output directory and run the surface scan script: ```bash mkdir -p security -``` - -**Step 2: Run the surface scan script** - -```bash python3 scripts/scan-threat-surface.py --output security/surface-report.json ``` @@ -120,13 +73,12 @@ This script enumerates: - Installed skills (from `skills/`) with `allowed-tools` entries - Any file in `hooks/`, `skills/`, or `agents/` containing `ANTHROPIC_BASE_URL` -**Step 3: Verify output** - +**Validate output**: ```bash python3 -c "import json; d=json.load(open('security/surface-report.json')); print('hooks:', len(d.get('hooks',[])), '| skills:', len(d.get('skills',[])), '| mcp_servers:', len(d.get('mcp_servers',[])))" ``` -**Gate**: `security/surface-report.json` must exist and parse as valid JSON with `hooks`, `skills`, and `mcp_servers` keys. Do not proceed to Phase 2 until this passes. +**Gate (ARTIFACT VALIDATION)**: `security/surface-report.json` must exist, parse as valid JSON, and contain `hooks`, `skills`, and `mcp_servers` keys. A missing directory is handled gracefully with empty arrays. All artifacts are written to `security/` before gating. Do not proceed to Phase 2 until this gate passes. --- @@ -134,7 +86,7 @@ python3 -c "import json; d=json.load(open('security/surface-report.json')); prin **Goal**: Produce a concrete deny-list config derived from Phase 1 findings. 
-**Step 1: Generate deny-list** +Generate the deny-list from the surface report: ```bash python3 scripts/generate-deny-list.py \ @@ -143,12 +95,12 @@ python3 scripts/generate-deny-list.py \ ``` The script applies these mappings from surface findings to deny rules: -- Hook uses `curl` or `wget` -> append `"Bash(curl *)"` and `"Bash(wget *)"` -- Hook uses `ssh` or `scp` -> append `"Bash(ssh *)"` and `"Bash(scp *)"` -- Skill `allowed-tools` contains unscoped `Read(*)` or `Write(*)` -> add path-scoped deny entries -- Any file contains `ANTHROPIC_BASE_URL` override -> append `"Bash(* ANTHROPIC_BASE_URL=*)"` +- Hook uses `curl` or `wget` → append `"Bash(curl *)"` and `"Bash(wget *)"` +- Hook uses `ssh` or `scp` → append `"Bash(ssh *)"` and `"Bash(scp *)"` +- Skill `allowed-tools` contains unscoped `Read(*)` or `Write(*)` → add path-scoped deny entries +- Any file contains `ANTHROPIC_BASE_URL` override → append `"Bash(* ANTHROPIC_BASE_URL=*)"` -Always includes the static baseline: +Always includes static baseline deny rules for credentials and privileged operations: ```json ["Read(~/.ssh/**)", "Read(~/.aws/**)", "Read(**/.env*)", "Write(~/.ssh/**)", "Write(~/.aws/**)", @@ -156,8 +108,7 @@ Always includes the static baseline: "Bash(* ANTHROPIC_BASE_URL=*)"] ``` -**Step 2: Display diff for human review** - +**Display deny-list for human review**: ```bash python3 -c " import json @@ -170,7 +121,7 @@ print('Review security/deny-list.json before merging.') " ``` -**Gate (HUMAN APPROVAL REQUIRED)**: Display the deny-list diff and block until the operator confirms they have reviewed it. In `--ci-mode`, skip approval and proceed. Do not proceed to Phase 3 without this gate passing. +**Gate (HUMAN APPROVAL REQUIRED)**: The deny-list is produced for human review only — it is never merged automatically. Display the diff and block until the operator confirms review. This gate is the highest-ROI control in the workflow. In `--ci-mode`, skip this gate and proceed to Phase 3. 
Do not proceed without explicit acknowledgment. --- @@ -178,7 +129,7 @@ print('Review security/deny-list.json before merging.') **Goal**: Scan all installed hooks, skills, and agents for injection patterns and hidden characters. -**Step 1: Run supply-chain audit** +Run the supply-chain audit: ```bash python3 scripts/scan-supply-chain.py \ @@ -186,19 +137,18 @@ python3 scripts/scan-supply-chain.py \ --output security/supply-chain-findings.json ``` -Detection patterns (see `scripts/scan-supply-chain.py` source for full regex details): -| Pattern Category | Severity | -|-----------------|----------| +Detection patterns (full regex details in `scripts/scan-supply-chain.py` source): +| Pattern | Severity | +|---------|----------| | Zero-width + bidi Unicode characters | CRITICAL | | HTML comments and hidden payload blocks | CRITICAL | | `ANTHROPIC_BASE_URL` override in any file | CRITICAL | +| Instruction-override and role-hijacking phrases | CRITICAL | | Outbound network commands in hooks/skills | WARNING | | `enableAllProjectMcpServers` setting | WARNING | | Broad permission grants without path scoping | WARNING | -| Instruction-override and role-hijacking phrases | CRITICAL | - -**Step 2: Check for CRITICAL findings** +**Check for CRITICAL findings**: ```bash python3 -c " import json, sys @@ -213,7 +163,7 @@ if crits: " ``` -**Gate**: Any CRITICAL finding blocks Phase 4. All CRITICAL findings must be remediated or explicitly acknowledged before continuing. WARNING findings are logged but do not block. +**Gate (BLOCKING CRITICAL FINDINGS)**: Any CRITICAL finding halts forward progress. All CRITICAL findings must be remediated or explicitly acknowledged before Phase 4 can start. This includes zero-width Unicode, ANTHROPIC_BASE_URL overrides, hidden payloads, and instruction-override phrases. WARNING findings are logged but do not block. Log warnings in the threat model under "Gaps and Recommended Next Controls" with acceptance rationale. 
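As an illustration of the first detection category in the table above, here is a minimal sketch of scanning text for zero-width and bidirectional control characters. The set of code points and the name `find_hidden_chars` are assumptions chosen for this example; the actual patterns live in `scripts/scan-supply-chain.py` and may differ.

```python
import unicodedata

# Assumed set of suspicious code points; the real script's patterns may differ.
HIDDEN_CHARS = (
    set("\u200b\u200c\u200d\u2060\ufeff")        # zero-width characters
    | {chr(c) for c in range(0x202A, 0x202F)}    # bidi embeddings and overrides
    | {chr(c) for c in range(0x2066, 0x206A)}    # bidi isolates
)


def find_hidden_chars(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, Unicode name) for each hidden character, both 1-based."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in HIDDEN_CHARS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits
```

Reporting line and column alongside the character name is a design choice for this sketch: it lets a reviewer open the flagged file at the exact position when deciding whether a hit is a false positive.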
--- @@ -221,23 +171,20 @@ if crits: **Goal**: Inspect the learning DB for entries that may contain injected content from external sources. -**Step 1: Run sanitization check (dry-run)** +Run the sanitization check in dry-run mode (never mutates without explicit `--purge`): ```bash python3 scripts/sanitize-learning-db.py \ --output security/learning-db-report.json ``` -Add `--purge` only if the operator explicitly requests it after reviewing the report. - Flags entries where: -- `key` or `value` fields contain known instruction-override or role-hijacking phrases -- `source` is `pr_review`, `url`, or `external` (high-risk origin) +- `key` or `value` contain instruction-override or role-hijacking phrases +- `source` is `pr_review`, `url`, or `external` (high-risk origins) - `value` contains zero-width Unicode or base64 blobs - `first_seen` is older than 90 days and `source` indicates external origin -**Step 2: Review flagged entries** - +**Review flagged entries**: ```bash python3 -c " import json @@ -251,27 +198,23 @@ if len(flagged) > 10: " ``` -**Gate**: Report must be presented to the operator. If purge is desired, re-run with `--purge`. Proceed to Phase 5 when operator acknowledges the report (or when no entries are flagged). +**Gate (DRY-RUN BY DEFAULT)**: The script operates in dry-run mode by default. No rows are deleted without an explicit operator request and the `--purge` flag. Present the report to the operator. If purge is desired after review, re-run with `--purge`. A missing learning DB is handled gracefully: the script produces an empty report (`total_entries: 0`, `flagged_entries: []`). Proceed to Phase 5 when the operator acknowledges the report or when no entries are flagged. --- ### Phase 5: THREAT MODEL SYNTHESIS -**Goal**: Synthesize Phases 1-4 into an actionable threat model document. +**Goal**: Synthesize Phases 1-4 findings into an actionable threat model document. This is the only LLM-driven phase.
-**Step 1: Load all phase artifacts** - -Read and internalize: +Load all phase artifacts: - `security/surface-report.json` - `security/deny-list.json` - `security/supply-chain-findings.json` - `security/learning-db-report.json` -**Step 2: Write threat-model.md** - -Write `security/threat-model.md` containing these required sections (the validator checks for these exact headings): +Write `security/threat-model.md` with these required sections (validator checks for exact headings): -``` +```markdown # Threat Model ## Run Metadata @@ -291,8 +234,6 @@ Write `security/threat-model.md` containing these required sections (the validat ## Learning DB Sanitization Summary ``` -**Step 3: Write audit-badge.json** - Write `security/audit-badge.json`: ```json { @@ -305,73 +246,46 @@ Write `security/audit-badge.json`: } ``` -Status is `fail` if `critical_count > 0` or if any phase gate did not pass. - -**Step 4: Validate outputs** +Status is `fail` if any CRITICAL finding was not remediated or if any phase gate did not pass. +**Validate outputs**: ```bash python3 scripts/validate-threat-model.py \ --threat-model security/threat-model.md \ --badge security/audit-badge.json ``` -**Gate**: `validate-threat-model.py` must exit 0. If it reports missing sections, add them and re-run. Maximum 3 fix iterations before escalating to operator. +**Gate (ARTIFACT VALIDATION WITH RETRY LIMIT)**: `validate-threat-model.py` must exit 0. If validation fails, add the missing sections and re-run. Maximum 3 fix iterations before escalating to operator for review. --- ## Error Handling -### Error: `surface-report.json` missing required keys -**Cause**: Script ran against a directory without Claude Code config. -**Solution**: `~/.claude/settings.json` absence is handled gracefully -- surface-report produces empty arrays. Re-run with `--verbose` for detail. 
- -### Error: Supply-chain audit CRITICAL finding blocks progress -**Cause**: A hook, skill, or agent contains zero-width Unicode, an ANTHROPIC_BASE_URL override, or a known injection phrase. -**Solution**: -1. Open the flagged file at the reported line -2. Determine if it is a legitimate false positive (e.g., a security doc discussing injection patterns) -3. If false positive: add the file to `--exclude` and re-run -4. If genuine: remediate the file before continuing - -### Error: Learning DB not found -**Cause**: No sessions have run or the DB path has been moved. -**Solution**: Script handles this gracefully -- produces a report with `total_entries: 0` and `flagged_entries: []`. Not an error condition. - -### Error: `validate-threat-model.py` reports missing sections -**Cause**: Phase 5 synthesis omitted a required section. -**Solution**: Read the validator output for the exact missing section name. Add the section to `security/threat-model.md` and re-run validation. +### Supply-chain audit CRITICAL finding blocks progress +**Cause**: A hook, skill, or agent contains zero-width Unicode, an ANTHROPIC_BASE_URL override, or a known injection phrase. ---- - -## Anti-Patterns +**Resolution**: +1. Open the flagged file at the reported line number +2. Determine if it is a legitimate false positive (e.g., documentation discussing injection patterns) +3. If false positive: add the file to the `--exclude` list and re-run `scan-supply-chain.py` +4. If genuine: remediate the file (remove hidden payloads, instruction-override phrases) before continuing to Phase 4 -### Anti-Pattern 1: Skipping Phase Gates -**What it looks like**: Moving to Phase 3 before deny-list is reviewed. -**Why wrong**: The gate exists because deny-list review is the highest-ROI control. Skip it and the entire audit is advisory-only. -**Do instead**: Always surface the diff and wait for acknowledgment before Phase 3.
+### Validation fails with missing sections +**Cause**: Phase 5 synthesis omitted a required section heading. -### Anti-Pattern 2: Running `--purge` without reviewing the report -**What it looks like**: `python3 scripts/sanitize-learning-db.py --purge` as a one-liner. -**Why wrong**: Bulk-deleting learning DB rows without review can erase legitimate session context. -**Do instead**: Dry-run first, review `learning-db-report.json`, then re-run with `--purge` only for rows you intend to delete. +**Resolution**: Read the validator output for the exact missing section name. Add the section to `security/threat-model.md` with content synthesized from the phase artifacts and re-run `validate-threat-model.py`. Maximum 3 fix iterations before escalating to operator. -### Anti-Pattern 3: Treating this skill as a real-time monitor -**What it looks like**: Running the skill on every commit or session start. -**Why wrong**: This is a point-in-time installation audit. Use `pretool-prompt-injection-scanner.py` for session-time scanning. -**Do instead**: Run on onboarding, after batch PR merges to hooks/ or skills/, or on a scheduled cadence. +### Missing configuration or databases +**Cause**: `~/.claude/settings.json` or learning DB doesn't exist. -### Anti-Pattern 4: Dismissing WARNING findings -**What it looks like**: Noting warnings but not logging them anywhere. -**Why wrong**: Warnings accumulate. An outbound network call in a hook may be legitimate today and exploitable after a patch. -**Do instead**: Log all WARNING findings in `security/threat-model.md` under "Gaps and Recommended Next Controls" with acceptance rationale. - -### Anti-Pattern 5: Auto-committing security/ artifacts -**What it looks like**: Adding `security/` to the commit in the same step as running the skill. -**Why wrong**: `security/learning-db-report.json` may contain sensitive entry values from the learning DB. -**Do instead**: Review all `security/` artifacts before committing. 
Only `security/threat-model.md` and `security/audit-badge.json` are commit candidates. +**Resolution**: These are handled gracefully by the scripts: +- Missing `settings.json` → surface-report produces empty arrays for hooks +- Missing learning DB → sanitization report returns `total_entries: 0` and `flagged_entries: []` +These are not error conditions. Re-run with `--verbose` for detail on missing paths. --- + ## References - [ADR-102: Security Threat Model Skill](../../adr/ADR-102-security-threat-model.md) diff --git a/skills/seo-optimizer/SKILL.md b/skills/seo-optimizer/SKILL.md index 3b15bdcd..2829b75a 100644 --- a/skills/seo-optimizer/SKILL.md +++ b/skills/seo-optimizer/SKILL.md @@ -26,44 +26,13 @@ routing: # SEO Optimizer Skill -## Operator Context - -This skill operates as an operator for SEO optimization workflows, configuring Claude's behavior for search visibility improvements without compromising content quality. It implements a **4-phase ASSESS-DECIDE-APPLY-VERIFY** workflow with evidence-based analysis at each step. 
- -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Voice Preservation**: Never suggest changes that compromise the author's authentic tone -- **No Keyword Stuffing**: Never recommend keyword density above 2.5% -- **Honest Descriptions**: Meta descriptions must accurately reflect content — no clickbait -- **Preview Before Modify**: Always show current vs suggested changes before applying modifications -- **Over-Engineering Prevention**: Focus on high-impact SEO changes only; skip marginal optimizations - -### Default Behaviors (ON unless disabled) -- **Complete Analysis**: Show all findings with data, never summarize without evidence -- **Alternative Generation**: Provide 3 title alternatives using different title patterns -- **Internal Link Discovery**: Scan related posts for linking opportunities -- **Confirmation Required**: Require explicit user confirmation before modifying any file - -### Optional Behaviors (OFF unless enabled) -- **Auto-Apply**: Apply changes without confirmation (requires explicit --apply flag) -- **Batch Mode**: Analyze multiple posts at once (requires explicit --batch flag) -- **Generate Missing**: Create meta descriptions for posts that lack them - -## What This Skill CAN Do -- Analyze keyword placement across title, headers, first paragraph, and body -- Calculate keyword density and flag over-optimization -- Evaluate title length, specificity, and click potential -- Generate or improve meta descriptions within 150-160 character targets -- Audit header hierarchy (H1/H2/H3 structure and nesting) -- Identify internal linking opportunities to related posts -- Show before/after comparisons for all proposed changes - -## What This Skill CANNOT Do -- Modify post body content beyond front matter and headers — voice belongs to the author -- Guarantee search rankings — SEO is one factor among many -- Replace human judgment on content quality or brand fit -- Auto-publish 
changes without explicit user review and confirmation -- Handle site-wide technical SEO (robots.txt, sitemap, redirects) +This skill operates as an SEO analysis and optimization workflow for blog posts. It implements a **4-phase ASSESS-DECIDE-APPLY-VERIFY** cycle that balances search visibility improvements with content quality and authentic author voice. + +**Key Principles**: +- Voice preservation is hardcoded—never suggest changes that compromise the author's authentic tone +- Complete analysis with data before proposing changes—no optimization without baseline metrics +- Confirm changes before applying—always show user exactly what will change +- Focus on high-impact, low-effort improvements only; skip marginal optimizations --- @@ -107,12 +76,14 @@ Check keyword presence in each priority location: | Body text | Medium | Natural usage throughout | | URL slug | Medium | Keyword in filename | -Calculate keyword density: +Calculate keyword density with this formula: ``` Density = (keyword occurrences / total words) * 100 Target: 1-2% | Warning: > 2.5% | Critical: > 3% ``` +**Constraint**: Never recommend keyword density above 2.5%. Over-optimization hurts readability and search rankings (search engines penalize stuffing). Aim for 1-2% density with natural placement in title, first paragraph, and occasional headers. + **Step 4: Evaluate title** | Criteria | Target | @@ -122,12 +93,16 @@ Target: 1-2% | Warning: > 2.5% | Critical: > 3% | Specificity | Specific problem/outcome over vague topic | | Click potential | Conveys clear value to searcher | +**Constraint**: Never suggest clickbait titles that misrepresent content (e.g., "You Won't BELIEVE These Hugo Debugging Secrets!"). Violates technical, authentic tone and misleads readers. Suggest specific, descriptive titles that accurately convey content value. + **Step 5: Check meta description** If description exists: verify 150-160 characters, contains primary keyword, accurately reflects content, compels click. 
If missing: flag for generation in Phase 3. +**Constraint**: Meta descriptions must accurately reflect content. No clickbait. Description is the SERP sales pitch—always analyze and optimize. Vague descriptions do not differentiate content or compel clicks. Include specific outcomes, techniques, or problems addressed. Reference the primary keyword naturally. + **Step 6: Audit header structure** Verify: exactly one H1 (the title), 3-7 H2s for main sections, H3s for subsections, no skipped levels (no H1 to H3 without H2). @@ -163,7 +138,7 @@ List all related posts. For each candidate: 3. Add internal links 4. Adjust headers only if clearly beneficial -Drop any suggestion where the existing content is already good. Do not force changes for the sake of completeness. +**Constraint**: Drop any suggestion where the existing content is already good. Do not force changes for the sake of completeness. If existing structure is logical and readable, forcing keywords into every header damages content quality for marginal SEO gain. Only suggest header changes where keywords fit naturally AND improve clarity. **Gate**: Prioritized list of changes with rationale for each. Skip items that would not materially improve search visibility. @@ -226,6 +201,8 @@ Drop any suggestion where the existing content is already good. Do not force cha - **apply**: Make changes to front matter fields and insert internal links - **skip**: Exit without changes +**Constraint**: Always show current vs suggested changes before applying modifications. Confirmation required before modifying any file. + **Step 3: Apply confirmed changes** Only modify: @@ -246,6 +223,8 @@ Only modify: **Step 3**: Check that keyword density did not exceed 2.5% after changes +**Constraint**: SEO is one factor among many. Never let search optimization override content quality or readability. If changes reduce natural language flow, revert them. 
+ **Step 4**: If Hugo is available, run a build to confirm no breakage: ```bash hugo --quiet @@ -286,51 +265,9 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Keyword Stuffing -**What it looks like**: "Add 'hugo debugging' 15 more times to reach optimal density" -**Why wrong**: Over-optimization hurts readability and search rankings. Search engines penalize stuffing. -**Do instead**: Aim for 1-2% density with natural placement in title, first paragraph, and occasional headers. - -### Anti-Pattern 2: Clickbait Titles -**What it looks like**: "You Won't BELIEVE These Hugo Debugging Secrets!" -**Why wrong**: Violates technical, authentic tone. Misleads readers and damages trust. -**Do instead**: Suggest specific, descriptive titles that accurately convey content value using patterns from `references/title-patterns.md`. - -### Anti-Pattern 3: Generic Meta Descriptions -**What it looks like**: "Learn about Hugo debugging in this comprehensive guide." -**Why wrong**: Vague descriptions do not differentiate content or compel clicks from search results. -**Do instead**: Include specific outcomes, techniques, or problems addressed. Reference the primary keyword naturally. - -### Anti-Pattern 4: Forcing Changes on Good Structure -**What it looks like**: "Restructure all headers to include the primary keyword" -**Why wrong**: If existing structure is logical and readable, forcing keywords into every header damages content quality for marginal SEO gain. -**Do instead**: Only suggest header changes where keywords fit naturally AND improve clarity. - -### Anti-Pattern 5: Optimizing Without Data -**What it looks like**: Suggesting title changes without measuring current length or keyword presence -**Why wrong**: Changes without baseline data cannot be evaluated for improvement. May make things worse. -**Do instead**: Complete Phase 1 analysis fully before proposing any changes. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) — Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) — Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "This title is fine, no need to measure" | Fine is subjective; measure character count | Run length and keyword checks | -| "Description isn't that important" | Description is the SERP sales pitch | Always analyze and optimize | -| "Just add keywords everywhere" | Stuffing hurts rankings | Calculate density, stay under 2.5% | -| "SEO is more important than readability" | Readers bounce from unnatural content | Voice preservation is hardcoded | - -### Reference Files +### Domain-Specific Reference Files - `${CLAUDE_SKILL_DIR}/references/seo-guidelines.md`: Length requirements, density targets, and best practices - `${CLAUDE_SKILL_DIR}/references/keyword-placement.md`: Priority locations and placement techniques - `${CLAUDE_SKILL_DIR}/references/title-patterns.md`: Effective title structures for technical blogs diff --git a/skills/series-planner/SKILL.md b/skills/series-planner/SKILL.md index bccd1608..1e6852ce 100644 --- a/skills/series-planner/SKILL.md +++ b/skills/series-planner/SKILL.md @@ -28,45 +28,9 @@ routing: # Series Planner Skill -## Operator Context - -This skill operates as an operator for multi-part content planning, configuring Claude's behavior for creating cohesive blog post series with proper cross-linking, standalone value, and publishing cadence. It implements the **Structured Analysis** pattern -- assess viability, decide structure, generate plan -- with **Domain Intelligence** embedded in series type selection and standalone value enforcement. 
- -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before planning -- **Over-Engineering Prevention**: Plan the series requested. No bonus parts, no scope creep -- **Standalone Value**: Every part MUST deliver value without requiring other parts -- **Cross-Linking Structure**: Every series gets navigation plan (Part X of Y, prev/next) -- **Part Count Bounds**: Minimum 3 parts, maximum 7 parts. No exceptions -- **No Filler Parts**: Each part earns its place with substantial, unique content - -### Default Behaviors (ON unless disabled) -- **Full Plan Display**: Show complete plan with all details, never summarize -- **Word Count Estimates**: Every part includes word count range (800-1500 typical) -- **Publishing Cadence**: Always recommend publication schedule with reasoning -- **Series Type Auto-Detection**: Select best series type from topic signals -- **Hugo Frontmatter**: Include frontmatter template for each part -- **Standalone Value Check**: Verify each part passes standalone test before output - -### Optional Behaviors (OFF unless enabled) -- **Minimal Mode**: Just part titles and scope, no detailed breakdown -- **Landing Page**: Generate series index page plan (enable with "with landing page") -- **Parallel Planning**: Plan multiple series at once for content calendar - -## What This Skill CAN Do -- Plan 3-7 part blog post series with logical progression -- Select appropriate series type (Progressive Depth, Chronological Build, Problem Exploration) -- Ensure each part has standalone value while enhancing series context -- Design cross-linking structure with navigation patterns -- Recommend publishing cadence based on content complexity -- Generate Hugo frontmatter templates for series posts - -## What This Skill CANNOT Do -- Write actual post content (use blog-post-writer for writing) -- Plan series fewer than 3 parts (use single post or post-outliner instead) -- Plan series more than 7 parts (break 
into multiple series or narrow scope) -- Skip standalone value check (each part must be complete on its own) -- Create vague, padded series (no filler parts to hit a number) +## Overview + +This skill plans multi-part content series with proper structure, cross-linking, and publishing cadence. It implements a three-phase workflow: **ASSESS** (determine viability), **DECIDE** (select structure), and **GENERATE** (produce plan). Each phase has gates to prevent scope creep, ensure standalone value, and maintain quality constraints. --- @@ -98,10 +62,11 @@ Audience progression: [beginner to expert? single level?] **Step 2: Check viability** -- [ ] Topic has natural divisions (3+ distinct subtopics) -- [ ] Each division can stand alone -- [ ] Logical progression exists between parts -- [ ] Not artificially padded (each part earns its place) +Verify these constraints before proceeding: +- Topic has natural divisions (minimum 3 distinct subtopics required — this is non-negotiable) +- Each division can stand alone as complete content (not dependent on reading other parts) +- Logical progression exists between parts (reader can follow from one to the next) +- Not artificially padded (each part must earn its place with substantial unique content, no filler) **Step 3: Detect series type** @@ -125,36 +90,43 @@ Match topic signals to type. See `references/series-types.md` for full templates ## Series Decision Type: [Progressive Depth / Chronological Build / Problem Exploration] Justification: [why this type fits] -Part Count: [3-7] +Part Count: [3-7, enforced strictly] Total Estimated Words: [X,XXX - X,XXX] ``` +Enforce part count bounds strictly: minimum 3 parts, maximum 7 parts. No exceptions. The 3-7 constraint prevents both over-engineering (splitting one idea across 8+ parts) and under-engineering (calling 2 loosely related posts a "series"). 
+ **Step 2: Draft part breakdown** For each part, define: -- Title and scope (1 sentence) -- Standalone value (what reader gets from this part alone) -- Forward/backward links to adjacent parts +- Title and scope (1 sentence describing what this part covers) +- Standalone value (what reader learns from this part alone, without reading others) +- Forward/backward links (references to adjacent parts, for context only) **Step 3: Validate standalone value** -For EACH part, verify: -- Reader learns something complete (not half a concept) -- Working code/config/output is possible from this part alone -- No critical information deferred to other parts -- Someone landing on just this part gets something useful +For EVERY part, verify it passes the standalone test: +- Reader learns something complete and actionable (not a half-concept requiring other parts) +- Working code/config/output is possible from this part alone (readers aren't blocked waiting for next part) +- No critical information deferred to other parts (concepts explained fully in their own context) +- Someone landing on just this part via search gets something useful (SEO and UX principle) -Red flags that fail standalone test: -- "To understand this, read Part 1 first" as mandatory -- Part ends mid-implementation +Red flags that fail standalone test — reject any part showing these: +- "To understand this, read Part 1 first" as mandatory dependency +- Part ends mid-implementation with "Part 2 will continue" - Core concepts explained only in earlier parts -- "Part 2 will explain why this works" +- "Part 2 will explain why this works" — Part 1 reader is stranded + +This is the anti-pattern prevention layer. Standalone value is non-negotiable because: +1. Search traffic lands on any part randomly, not always on Part 1 +2. Readers expect complete value from the part they're reading +3. 
Multi-part cliff-hangers frustrate readers and hurt SEO **Step 4: Select publishing cadence** See `references/cadence-guidelines.md` for detailed criteria. Default to weekly unless topic complexity or content depth suggests otherwise. -**Gate**: All parts pass standalone value check. Part count is 3-7. Type selection justified. Proceed only when gate passes. +**Gate**: All parts pass standalone value check. Part count is strictly 3-7. Type selection justified. Proceed only when gate passes. ### Phase 3: GENERATE @@ -171,12 +143,13 @@ Output the complete plan including: **Step 2: Final validation** -- [ ] Every part has standalone value described -- [ ] Word counts are realistic (800-1500 per part) -- [ ] Cross-linking is complete (prev/next for all parts) -- [ ] No cliff-hangers that frustrate readers -- [ ] No filler parts -- [ ] Part count within 3-7 bounds +Before outputting, verify all constraints one final time: +- [ ] Every part has standalone value described (not deferred to other parts) +- [ ] Word counts are realistic (800-1500 per part, within 20% variance across parts to avoid reader whiplash) +- [ ] Cross-linking is complete (prev/next navigation for all parts) +- [ ] No cliff-hangers that frustrate readers (each part delivers closure, even if it references others) +- [ ] No filler parts (each part has substantial, non-redundant content) +- [ ] Part count within 3-7 bounds (enforced strictly) **Step 3: Output plan** @@ -191,13 +164,13 @@ Use the series plan format from `references/output-format.md`. Three primary types. Full templates and examples in `references/series-types.md`. ### Progressive Depth -Shallow-to-deep mastery. Each level is complete; beginners stop at Part 1, advanced readers skip ahead. +Shallow-to-deep mastery. Each level is complete; beginners stop at Part 1, advanced readers skip ahead. Enables flexible audience engagement. ### Chronological Build -Step-by-step creation. 
Each part produces working output; reader can stop at any milestone. +Step-by-step creation. Each part produces working output; reader can stop at any milestone and have a working artifact. ### Problem Exploration -Journey from problem to solution. Even failed approaches are instructive; each part teaches something. +Journey from problem to solution. Even failed approaches are instructive; each part teaches something about the journey, not just the destination. --- @@ -205,34 +178,42 @@ Journey from problem to solution. Even failed approaches are instructive; each p ### Example 1: Standard Technical Series User says: "/series Go error handling" + Actions: -1. Assess: Topic has clear depth levels (basics, wrapping, custom types, patterns) (ASSESS) -2. Decide: Progressive Depth, 4 parts, weekly cadence (DECIDE) -3. Generate: Full plan with standalone value per part (GENERATE) -Result: 4-part series where each part teaches complete error handling at its level +1. ASSESS: Topic has clear depth levels (basics, wrapping, custom types, patterns) +2. DECIDE: Progressive Depth, 4 parts, weekly cadence +3. GENERATE: Full plan with standalone value per part + +Result: 4-part series where each part teaches complete error handling at its level (beginner can stop at Part 1 and be satisfied; advanced reader skips to patterns). ### Example 2: Project Tutorial Series User says: "/series building a CLI tool in Rust" + Actions: -1. Assess: Topic has build milestones (scaffold, commands, config, distribution) (ASSESS) -2. Decide: Chronological Build, 4 parts, weekly cadence (DECIDE) -3. Generate: Full plan with working output per milestone (GENERATE) -Result: 4-part series where each part produces a functional artifact +1. ASSESS: Topic has build milestones (scaffold, commands, config, distribution) +2. DECIDE: Chronological Build, 4 parts, weekly cadence +3. 
GENERATE: Full plan with working output per milestone + +Result: 4-part series where each part produces a functional artifact (Part 1: runs basic command; Part 2: parses flags; Part 3: config file support; Part 4: distributable binary). ### Example 3: Problem Exploration Series User says: "/series why we migrated from MongoDB to PostgreSQL" + Actions: -1. Assess: Topic has journey arc (problem, attempt, failure, solution) (ASSESS) -2. Decide: Problem Exploration, 4 parts, bi-weekly cadence (DECIDE) -3. Generate: Full plan where each part teaches standalone lessons (GENERATE) -Result: 4-part series where even failed approaches deliver instructive value +1. ASSESS: Topic has journey arc (problem, attempt, failure, solution) +2. DECIDE: Problem Exploration, 4 parts, bi-weekly cadence +3. GENERATE: Full plan where each part teaches standalone lessons + +Result: 4-part series where even failed approaches deliver instructive value (Part 1: why we needed to move; Part 2: why MongoDB stopped working for us; Part 3: why PostgreSQL migration was hard; Part 4: what we learned). ### Example 4: Topic Too Narrow User says: "/series Go defer statement" + Actions: -1. Assess: Topic has 1-2 natural divisions, not 3+ (ASSESS) +1. ASSESS: Topic has 1-2 natural divisions, not 3+ 2. Gate fails: Recommend single post or expanding scope to "Go resource management" -Result: Redirect to post-outliner or expanded topic suggestion + +Result: Redirect to post-outliner or expanded topic suggestion (post-outliner is better for focused single topics). --- @@ -240,80 +221,58 @@ Result: Redirect to post-outliner or expanded topic suggestion ### Error: "Topic Too Narrow for Series" Cause: Topic doesn't naturally divide into 3+ parts + Solution: -1. Suggest post-outliner for single comprehensive post -2. Propose scope expansion: "Consider covering [related aspect]" -3. List what would need to be true for series to work +1. 
Suggest post-outliner for single comprehensive post (single-post tool is more appropriate) +2. Propose scope expansion: "Consider covering [related aspect] to reach 3+ parts" +3. List what would need to be true for series to work: "A series works when you can answer: Part 1 [X], Part 2 [Y], Part 3 [Z]" ### Error: "Topic Too Broad for Series" -Cause: Would require 8+ parts or scope is unmanageable +Cause: Would require 8+ parts or scope is unmanageable (violates part count constraint) + Solution: -1. Identify natural breakpoints for multiple series -2. Recommend first series to tackle -3. Suggest narrowing to specific aspect +1. Identify natural breakpoints for multiple series (e.g., "Kubernetes basics" series + "Kubernetes advanced" series) +2. Recommend first series to tackle (smallest, highest value) +3. Suggest narrowing to specific aspect (e.g., "Instead of 'Cloud Architecture', try 'Cloud Cost Optimization'") ### Error: "No Logical Progression" Cause: Parts don't build on each other meaningfully; just loosely related topics + Solution: -1. Determine if these are better as standalone posts -2. Find the connecting thread that creates progression -3. Consider if forcing series structure adds value vs. individual posts +1. Determine if these are better as standalone posts (not a series at all) +2. Find the connecting thread that creates progression (what makes this a 3-part story instead of 3 separate posts?) +3. Consider if forcing series structure adds value vs. individual posts (sometimes the answer is "these should be separate") ### Error: "Standalone Value Missing" -Cause: One or more parts don't stand alone +Cause: One or more parts don't stand alone (reader needs previous parts to understand this one) + Solution: -1. Identify which parts fail the standalone test -2. Suggest content to add for completeness -3. Or merge dependent parts into one +1. Identify which parts fail the standalone test (list specific examples: "Part 2 assumes knowledge from Part 1") +2.
Suggest content to add for completeness (add summary section, explain prerequisite inline, restructure) +3. Or merge dependent parts into one (e.g., "Part 1 and 2 should be one part; move non-essential details to Part 3") --- -## Anti-Patterns +## References -### Anti-Pattern 1: The Cliff-Hanger Series -**What it looks like**: "...but the real solution is in Part 2!" -- content gates behind future parts -**Why wrong**: Frustrates readers, SEO penalty for thin content, Part 2 visitors get nothing -**Do instead**: Each part delivers complete value. Reference other parts for context, not content. +### Reference Files +- `${CLAUDE_SKILL_DIR}/references/series-types.md`: Complete type templates with examples and selection criteria +- `${CLAUDE_SKILL_DIR}/references/cross-linking.md`: Navigation patterns and Hugo implementation +- `${CLAUDE_SKILL_DIR}/references/cadence-guidelines.md`: Publishing frequency recommendations and schedules +- `${CLAUDE_SKILL_DIR}/references/output-format.md`: Series plan output format template -### Anti-Pattern 2: The Padded Series -**What it looks like**: "Part 1: Introduction", "Part 2: Getting Started", "Part 3: Basics" -- three parts that should be one -**Why wrong**: Disrespects reader time, each part lacks substance -**Do instead**: Combine until each part has substantial, unique value. 3 meaty parts beats 6 thin ones. +### Key Constraints Summary -### Anti-Pattern 3: The Prerequisite Spiral -**What it looks like**: "Before we start, read these 5 posts..." -- hard dependency chains -**Why wrong**: Creates barrier to entry, search traffic to later parts bounces immediately -**Do instead**: Brief context inline. Link to prerequisites as optional. Make each part accessible. 
+These constraints are non-negotiable and enforced at every phase: -### Anti-Pattern 4: Inconsistent Scope -**What it looks like**: Part 1 is 500 words, Part 2 is 3000 words, Part 3 is 400 words -**Why wrong**: Reader expectations whiplash, suggests poor planning -**Do instead**: Keep parts roughly similar in depth and length. 800-1200 words each is the target. +1. **Part Count (3-7)**: Series must have minimum 3 parts, maximum 7 parts. This prevents both scope creep (forcing 8+ parts for one idea) and false series (2 loosely related posts). -### Anti-Pattern 5: Scope Creep Mid-Series -**What it looks like**: Starting with 3 planned parts, ending with 7 because "one more thing" -**Why wrong**: Breaks publishing cadence, dilutes series focus, reader fatigue -**Do instead**: Plan the full series before publishing Part 1. If scope grows, split into a second series. +2. **Standalone Value**: Every part MUST deliver complete value to readers who land on it via search or reference. Red flags: cliff-hangers, deferred core concepts, mid-implementation endings. ---- +3. **No Filler**: Each part must earn its place with substantial unique content. No padding to hit a part count target. -## References +4. **Logical Progression**: Parts build meaningfully from one to the next. If they're just loosely related topics, they shouldn't be a series. -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks +5. **Over-Engineering Prevention**: Plan only what the user requests. No bonus parts, scope creep, or "one more thing" unless user asks. 
-### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "This needs 8 parts to be complete" | Scope creep; split into two series | Enforce 3-7 part limit | -| "Part 1 is just setup, real value starts Part 2" | Part 1 fails standalone test | Ensure Part 1 delivers value | -| "Readers will read them in order" | Search traffic lands anywhere | Each part must stand alone | -| "One more part won't hurt" | Padding dilutes series quality | Every part must earn its place | - -### Reference Files -- `${CLAUDE_SKILL_DIR}/references/series-types.md`: Complete type templates with examples and selection criteria -- `${CLAUDE_SKILL_DIR}/references/cross-linking.md`: Navigation patterns and Hugo implementation -- `${CLAUDE_SKILL_DIR}/references/cadence-guidelines.md`: Publishing frequency recommendations and schedules -- `${CLAUDE_SKILL_DIR}/references/output-format.md`: Series plan output format template +These are gates at each phase. If any constraint fails, the workflow stops and recommends alternative approaches (single post, expanded scope, reduced scope, etc.). diff --git a/skills/service-health-check/SKILL.md b/skills/service-health-check/SKILL.md index f154dfc9..bed4746b 100644 --- a/skills/service-health-check/SKILL.md +++ b/skills/service-health-check/SKILL.md @@ -26,45 +26,11 @@ routing: # Service Health Check Skill -## Operator Context - -This skill operates as an operator for service health monitoring workflows, configuring Claude's behavior for structured, read-only health assessment. It implements the **Discover-Check-Report** pattern — find services, gather health signals, produce actionable output — with deterministic process and health file evaluation. 
- -### Hardcoded Behaviors (Always Apply) -- **Read-Only**: NEVER restart, stop, or modify services — report only -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before checking -- **No Side Effects**: Only read process tables, health files, and ports — no writes -- **Structured Output**: Always produce machine-parseable health report -- **Evidence-Based Status**: Every status determination requires at least one concrete signal (process check, health file, or port probe) - -### Default Behaviors (ON unless disabled) -- **Process Verification**: Check process existence via pgrep/ps before anything else -- **Staleness Detection**: Flag health files older than configured threshold (default 300s) -- **Port Listening Check**: Verify expected ports are bound when port is configured -- **Actionable Recommendations**: Provide specific commands to resolve issues -- **Staleness Threshold Enforcement**: Default 300s, configurable per service - -### Optional Behaviors (OFF unless enabled) -- **Auto-Restart Execution**: Run restart commands (requires explicit user flag) -- **Metrics Collection**: Gather detailed performance metrics from health files -- **Alert Integration**: Format output for monitoring system ingestion -- **Historical Comparison**: Compare against previous health snapshots - -## What This Skill CAN Do -- Check if processes are running via pgrep/ps -- Parse JSON health files for status, connection state, and metrics -- Detect stale health data based on configurable thresholds -- Verify ports are listening with ss/netstat -- Produce structured health reports with actionable restart recommendations -- Evaluate service degradation (disconnected, reconnecting states) - -## What This Skill CANNOT Do -- Restart, stop, or modify services (report-only by design) -- Perform deep log analysis (use systematic-debugging instead) -- Probe remote health endpoints over HTTP (use endpoint-validator instead) -- Inspect container internals (basic host-level 
process checks only) -- Authenticate against secured health endpoints -- Skip the Discover phase — services must be identified before checking +## Overview + +This skill provides deterministic service health monitoring using the **Discover-Check-Report** pattern. It finds services, gathers health signals from multiple sources (process table, health files, port binding), and produces actionable reports identifying degraded or failed services. + +**Core principle**: Health assessment is evidence-based. Never report a service healthy without verifying process status independently of health file content. Never assume a running process is functional — always cross-check against health files and port binding. --- @@ -95,16 +61,21 @@ For each service, establish: | cache | redis-server | - | 6379 | - | ``` +**Validation constraints**: +- Each process pattern must be specific enough to avoid false matches (e.g., "python" matches all Python processes—use full paths or arguments instead) +- Health file paths must be absolute +- Port numbers must be valid (1-65535) +- Pattern specificity matters: narrow patterns with full command paths, distinguishing arguments, or specific binary names + **Step 3: Validate manifest** -- Confirm each process pattern is specific enough to avoid false matches -- Verify health file paths are absolute -- Ensure port numbers are within valid range (1-65535) + +Confirm each entry passes the constraints above. If a pattern is too broad, use `ps aux | grep` to identify distinguishing arguments, then update the pattern. **Gate**: Service manifest complete with at least one service. Proceed only when gate passes. ### Phase 2: CHECK -**Goal**: Gather health signals for every service in the manifest. +**Goal**: Gather health signals for every service in the manifest. Always check process status independently of health file content—a running process and a healthy health file are separate signals. 
**Step 1: Check process status** @@ -114,40 +85,50 @@ pgrep -f "" ``` Record: running (true/false), PIDs, process count. +**Rationale**: Process existence is the primary signal. A missing process always means the service is DOWN. A running process alone is insufficient—the service may have crashed or failed to bind to its port. + **Step 2: Parse health files (if configured)** Read and parse JSON health files. Evaluate: - Does the file exist? - Does it parse as valid JSON? -- How old is the timestamp (staleness)? +- How old is the timestamp (staleness)? Default stale threshold is 300 seconds. - What status does the service self-report? - What is the connection state? +**Critical constraint**: Never trust health file content alone. The file could be stale from before a process crash. Always verify: +1. Process is still running +2. Health file timestamp is fresh (within configured threshold) +3. Status field matches evidence (e.g., "error" requires restart) + **Step 3: Probe ports (if configured)** Check if expected ports are listening: ```bash ss -tlnp "sport = :" ``` -Flag processes that are running but not listening on expected ports. + +**Rationale**: Verify ports are actually bound. A process can start but fail to bind to its configured port—that is effectively a DOWN state, not HEALTHY. **Step 4: Evaluate health per service** -Apply this decision tree: -1. Process not running → **DOWN** -2. Process running + health file missing → **WARNING** -3. Process running + health file stale → **WARNING** (restart recommended) -4. Process running + status=error → **ERROR** (restart recommended) -5. Process running + disconnected > 30min → **WARNING** (restart recommended) -6. Process running + disconnected < 30min → **DEGRADED** (allow reconnection) -7. Process running + healthy → **HEALTHY** -8. Process running + no health file configured → **RUNNING** (limited visibility) +Apply this decision tree (constraints embedded in logic): + +1. 
**Process not running** → **DOWN** (definitive) +2. **Process running + health file missing** → **WARNING** (limited visibility, but process is alive) +3. **Process running + health file stale** (> threshold) → **WARNING** (file hasn't updated in configured time, suggests no activity or crash recovery in progress) +4. **Process running + status=error** → **ERROR** (restart recommended immediately) +5. **Process running + disconnected > 30 minutes** → **WARNING** (long disconnect suggests stuck state, restart recommended) +6. **Process running + disconnected < 30 minutes** → **DEGRADED** (allow reconnection window, monitor) +7. **Process running + port not listening** (when port is configured) → **ERROR** (process running but failed to bind port) +8. **Process running + healthy** → **HEALTHY** (all checks pass) +9. **Process running + no health file configured** → **RUNNING** (limited visibility, process verified only) -**Gate**: All services evaluated with evidence-based status. Proceed only when gate passes. +**Gate**: All services evaluated with evidence-based status. No status is determined without concrete signal (process check, health file, or port probe). Proceed only when gate passes. ### Phase 3: REPORT -**Goal**: Produce structured, actionable health report. +**Goal**: Produce structured, actionable health report with specific remediation commands. **Step 1: Generate summary** @@ -179,7 +160,7 @@ SUGGESTED ACTIONS: - Lead with the summary line (X/N healthy) - Highlight any services needing action - Provide copy-pasteable commands for remediation -- If user has auto-restart enabled, confirm before executing +- Never auto-restart without explicit user flag. Always report findings first, let user decide. **Gate**: Report delivered with actionable recommendations for all non-healthy services. @@ -199,7 +180,7 @@ Result: Clean report, no action needed User says: "The background worker seems stuck" Actions: 1. Identify worker service from config (DISCOVER) -2. 
Find process running but health file 20 minutes stale (CHECK) +2. Find process running but health file 20 minutes stale (CHECK) — triggers WARNING decision in tree 3. Report WARNING with restart recommendation (REPORT) Result: Specific diagnosis with actionable command @@ -220,6 +201,7 @@ Solution: 1. Narrow pattern with full command path or arguments 2. Use `ps aux | grep` to identify distinguishing arguments 3. Update manifest with more specific pattern +4. Rationale: False positives hide real failures. Specificity is required to avoid misdiagnosis. ### Error: "Health File Exists But Cannot Parse" Cause: Malformed JSON, permissions issue, or file being written during read @@ -231,45 +213,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Restarting Without Diagnosing -**What it looks like**: Service shows WARNING, immediately run `systemctl restart` -**Why wrong**: Masks root cause. Service may crash again immediately. -**Do instead**: Report finding, let user decide. Never auto-restart without explicit flag. - -### Anti-Pattern 2: Trusting Health File Alone -**What it looks like**: Health file says "healthy" so skip process check -**Why wrong**: Process could be zombie, health file could be stale from before crash. -**Do instead**: Always check process status independently of health file content. - -### Anti-Pattern 3: Ignoring Port Mismatch -**What it looks like**: Process running, skip port check, report HEALTHY -**Why wrong**: Process may have started but failed to bind port — effectively down. -**Do instead**: When port is configured, always verify it is listening. - -### Anti-Pattern 4: Broad Process Patterns -**What it looks like**: Using "python" as process pattern for a Flask app -**Why wrong**: Matches every Python process on the system, giving false positives. -**Do instead**: Use specific patterns like `gunicorn.*myapp:app` or full command paths. 
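The decision tree in Phase 2, Step 4 of the service health check can be sketched as a pure function. This is a minimal illustration under stated assumptions, not code from the skill: the `Signals` dataclass, its field names, and `evaluate` are hypothetical, and the handling of an unknown file age and of a disconnect of exactly 30 minutes is a guess. The 300-second staleness default and the 30-minute reconnection window come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """Hypothetical container for the signals gathered in the CHECK phase."""
    process_running: bool
    health_file_configured: bool = False
    health_file_exists: bool = False
    health_age_s: Optional[float] = None    # seconds since the health file timestamp
    status: Optional[str] = None            # self-reported status, e.g. "healthy" or "error"
    disconnected_s: Optional[float] = None  # None means currently connected
    port_configured: bool = False
    port_listening: bool = False


def evaluate(s: Signals, stale_after_s: float = 300.0) -> str:
    """Return the first matching status, following the order of the decision tree."""
    if not s.process_running:
        return "DOWN"
    if s.health_file_configured:
        if not s.health_file_exists:
            return "WARNING"
        # Assumption: an unknown file age is treated as stale.
        if s.health_age_s is None or s.health_age_s > stale_after_s:
            return "WARNING"
        if s.status == "error":
            return "ERROR"
        if s.disconnected_s is not None:
            # Assumption: exactly 30 minutes still counts as inside the reconnection window.
            return "WARNING" if s.disconnected_s > 30 * 60 else "DEGRADED"
    if s.port_configured and not s.port_listening:
        return "ERROR"
    # Assumption: any non-error status that passed the checks above is reported as healthy.
    return "HEALTHY" if s.health_file_configured else "RUNNING"
```

Keeping the evaluation free of I/O means the process, health file, and port probes can be gathered separately and the status logic tested without a live service.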
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Process is running, must be healthy" | Running ≠ functional | Check health file and port | -| "Health file looks fine" | File could be stale from before crash | Verify timestamp freshness | -| "Just restart it" | Restart masks root cause | Report first, restart only if flagged | -| "No config, skip the check" | User still needs an answer | Ask user for service details | - ### Health File Format Reference Services should write health files as: @@ -284,3 +229,14 @@ Services should write health files as: "metrics": {} } ``` + +### Key Constraints Summary + +| Constraint | Rationale | Application | +|-----------|-----------|-------------| +| Process status verified independently of health file | Running process ≠ functional service | Always check process before trusting health file | +| Health file staleness detected by timestamp freshness | File could be stale from before crash | Check timestamp against 300s (configurable) threshold | +| Port binding verified when configured | Process running doesn't mean port is bound | Always verify expected port listening when port specified | +| No auto-restart without explicit flag | Restart masks root cause | Report findings first; only execute restart if user flags it | +| Narrow process patterns required | "python" matches all processes, giving false matches | Use full paths or specific args; validate with `ps aux \| grep` | +| Evidence-based status only | Status must have supporting signal | No status without concrete evidence (process, health file, or port) | diff --git 
a/skills/skill-composer/SKILL.md b/skills/skill-composer/SKILL.md index 867b17e9..03e4dd81 100644 --- a/skills/skill-composer/SKILL.md +++ b/skills/skill-composer/SKILL.md @@ -30,52 +30,11 @@ routing: # Skill Composer -## Purpose - -Orchestrate complex workflows by chaining multiple skills into validated execution DAGs. Discovers applicable skills, resolves dependencies, validates compatibility, presents execution plans, and manages skill-to-skill context passing. - -## Operator Context - -This skill operates as an operator for multi-skill orchestration, configuring Claude's behavior for DAG-based workflow composition with dependency resolution and context passing between skills. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before composing any workflow -- **Over-Engineering Prevention**: Only compose skills that are directly requested. Prefer simple 2-3 skill chains over complex orchestrations. Do not add speculative skills or "nice to have" additions without explicit user request -- **Dry Run First**: ALWAYS show execution plan and get user confirmation before running skills -- **DAG Validation**: ALWAYS validate execution graph is acyclic before execution -- **Context Validation**: ALWAYS verify output/input compatibility between chained skills -- **Error Isolation**: ALWAYS catch skill failures and determine if remaining chain can continue -- **Skill Discovery**: Scan skills/*/SKILL.md for available skills before building any DAG - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show command output rather than describing it. Be concise but informative. -- **Temporary File Cleanup**: Remove temporary files (skill index, DAG files, intermediate outputs) at task completion. Keep only files explicitly needed for final output. 
-- **Parallel Optimization**: Execute independent skills concurrently when no shared resources or dependencies exist -- **Verbose Logging**: Show skill selection reasoning and execution progress for each phase -- **Compatibility Checks**: Validate skill input/output formats match before execution using `references/compatibility-matrix.md` -- **Pattern Recognition**: Suggest known composition patterns from `references/composition-patterns.md` when applicable - -### Optional Behaviors (OFF unless enabled) -- **Auto-retry Failed Skills**: Retry failed skills with adjusted parameters (max 2 retries) -- **Adaptive Composition**: Modify execution plan based on intermediate results -- **Skill Suggestion**: Proactively suggest additional skills that might help - -## What This Skill CAN Do -- Discover available skills and build execution DAGs with dependency resolution -- Chain skills sequentially, in parallel, or with conditional branching -- Validate composition compatibility (acyclic, type-safe, ordered) -- Pass context between skills with output/input transformation -- Handle partial failures with isolation and recovery options -- Present dry-run execution plans before committing to execution - -## What This Skill CANNOT Do -- Execute skills without showing the plan first (dry run is mandatory) -- Compose workflows with circular dependencies -- Chain skills with incompatible input/output types without transformation -- Replace single-skill invocation (if one skill suffices, use it directly) -- Skip DAG validation to save time +## Overview ---- +Orchestrate complex workflows by chaining multiple skills into validated execution DAGs. This skill discovers applicable skills, resolves dependencies, validates compatibility, presents execution plans, and manages skill-to-skill context passing. Use when a task requires 2+ skills chained together, parallel skill execution, or conditional branching between skills. 
Do NOT use when a single skill can handle the request alone, or for simple sequential invocation that needs no dependency management. + +**Core principle**: Minimize composition overhead. Prefer simple 2-3 skill chains. Do not add speculative skills or "nice to have" additions without explicit user request. ## Instructions @@ -93,6 +52,8 @@ Identify: **Step 2: Discover available skills** +Before building any DAG, scan skills/*/SKILL.md for available skills: + ```bash # TODO: scripts/discover_skills.py not yet implemented # Manual alternative: scan skills directory for SKILL.md files @@ -101,9 +62,9 @@ find ./skills -name "SKILL.md" -exec grep -l "^name:" {} \; | sort Review the discovered skills. Categorize by type (workflow, testing, quality, documentation, code-analysis, debugging) with dependency metadata. -**Step 3: Select skills** +**Step 3: Select skills (Apply minimum-skills principle)** -Choose only skills directly needed for the stated goals. Apply the minimum-skills principle: +Choose only skills directly needed for the stated goals. This prevents over-composition and unnecessary failure points: - Can a single skill handle this? If yes, do NOT compose. Invoke it directly. - Can 2 skills handle this? Prefer that over 3+. 
@@ -119,24 +80,19 @@ Cross-reference selections against `references/compatibility-matrix.md` to confi **Step 1: Build the DAG** +Construct the execution DAG as a JSON structure with nodes (skills) and edges (dependencies) based on the task analysis: + ```bash # TODO: scripts/build_dag.py not yet implemented -# Manual alternative: construct the execution DAG as a JSON structure -# with nodes (skills) and edges (dependencies) based on the task analysis +# Manual alternative: structure the DAG in your reasoning before presenting it ``` -**Step 2: Validate the DAG** - -Validate the execution DAG manually by checking: -- No circular dependencies exist between skills -- Output types from each skill match input requirements of downstream skills -- All referenced skills exist in the skill index -- Dependencies satisfy topological ordering +**Step 2: Validate the DAG (MANDATORY before execution)** -Validation checks: -- **Acyclic**: No circular dependencies -- **Compatibility**: Output types match input requirements (consult `references/compatibility-matrix.md`) -- **Availability**: All referenced skills exist in the index +ALWAYS validate the execution graph is acyclic before moving to execution. Validation checks: +- **Acyclic**: No circular dependencies exist between skills +- **Compatibility**: Output types from each skill match input requirements of downstream skills (consult `references/compatibility-matrix.md`) +- **Availability**: All referenced skills exist in the skill index - **Ordering**: Dependencies satisfy topological ordering If validation fails, fix the issue and re-validate. Common fixes: @@ -145,7 +101,9 @@ If validation fails, fix the issue and re-validate. 
Common fixes: - Missing skill: Check spelling, re-run discovery - Ordering violation: Reorder phases to satisfy dependencies -**Step 3: Present the execution plan** +**Step 3: Present the execution plan (Dry run is MANDATORY)** + +ALWAYS show the execution plan and get user confirmation before running skills. This prevents wasting time on composition errors: ``` === Execution Plan === @@ -184,19 +142,20 @@ Proceed? [Y/n] For sequential phases: 1. Invoke skill with context from previous phases 2. Capture output -3. Verify output matches expected type +3. Verify output/input compatibility between chained skills 4. Proceed to next phase For parallel phases: -1. Launch all independent skills using Task tool +1. Launch all independent skills using Task tool (execute independent skills concurrently when no shared resources or dependencies exist) 2. Wait for all to complete 3. Aggregate results for next phase **Step 2: Pass context between skills** -For each skill transition: +ALWAYS verify output/input compatibility between chained skills before passing context: + 1. Capture output from completed skill -2. Transform to format expected by next skill +2. Transform to format expected by next skill (validate using `references/compatibility-matrix.md`) 3. Inject as context when invoking next skill 4. Verify transformation succeeded @@ -207,9 +166,11 @@ After each phase completes, report: - Output summary - Overall progress (e.g., "Phase 2/3 complete") +Show command output rather than describing it. Be concise but informative. + **Step 4: Handle failures during execution** -If a skill fails mid-chain: +ALWAYS catch skill failures and determine if remaining chain can continue. If a skill fails mid-chain: 1. **Assess impact**: Does this block downstream skills? 
- Critical (blocks all downstream): Stop chain, report what completed @@ -262,43 +223,15 @@ Final Output: **Step 2: Clean up temporary files** -Remove: `/tmp/skill-index.json`, `/tmp/execution-dag.json`, and any intermediate output files created during composition. +Remove temporary files at task completion. Keep only files explicitly needed for final output: +- `/tmp/skill-index.json` +- `/tmp/execution-dag.json` +- Any intermediate output files created during composition **Gate**: Results reported. Temporary files cleaned up. Composition complete. --- -## Examples - -### Example 1: Feature with Tests -User says: "Add rate limiting middleware with comprehensive tests" -Actions: -1. DISCOVER: Identify implementation + testing goals. Select workflow-orchestrator, test-driven-development, verification-before-completion -2. PLAN: Build 3-phase sequential DAG. Validate compatibility. Show plan. -3. EXECUTE: Phase 1 creates subtasks, Phase 2 implements with TDD, Phase 3 verifies -4. REPORT: All phases complete, 24 tests pass, 94% coverage -Result: 3-skill chain, 32 minutes, no failures - -### Example 2: Parallel Quality Checks -User says: "Check code quality and documentation before PR" -Actions: -1. DISCOVER: Identify quality + documentation goals. Select code-linting, comment-quality, verification-before-completion -2. PLAN: Phase 1 runs code-linting and comment-quality in parallel (no shared resources). Phase 2 runs verification sequentially. -3. EXECUTE: Parallel phase completes in ~6 seconds (vs 10 sequential). Verification merges results. -4. REPORT: 33% time savings from parallelization, all checks pass -Result: 3-skill chain with 1 parallel phase, 8 minutes - -### Example 3: Research Before Implementation -User says: "Implement pagination following existing patterns" -Actions: -1. DISCOVER: Identify research + implementation goals. Select pr-miner, codebase-analyzer, workflow-orchestrator, test-driven-development -2. 
PLAN: Phase 1 runs pr-miner and codebase-analyzer in parallel. Phase 2 plans with orchestrator. Phase 3 implements with TDD. -3. EXECUTE: Research discovers cursor-based pagination convention. Plan follows it. Implementation matches. -4. REPORT: Pattern compliance 100%, all tests pass -Result: 4-skill chain with 1 parallel phase, 42 minutes - ---- - ## Error Handling ### Error: "Circular dependency detected" @@ -333,54 +266,8 @@ Solution: 3. Verify the skill directory exists under skills/ 4. Use the suggested alternative from the discovery output if the name was close ---- - -## Anti-Patterns - -### Anti-Pattern 1: Over-Composition -**What it looks like**: User asks "Add a login feature" and response chains 6 skills: workflow-orchestrator, TDD, code-linting, comment-quality, code-review, verification -**Why wrong**: Adds unnecessary overhead. Most skills don't add value for a simple feature request. 2-3 skills would suffice. -**Do instead**: Use test-driven-development directly. Only compose multiple skills when the task explicitly requires orchestration. - -### Anti-Pattern 2: Skipping the Dry Run -**What it looks like**: Immediately executing skills without showing the plan -**Why wrong**: User cannot catch composition errors early. No opportunity to adjust before wasting time. Violates the "Dry Run First" hardcoded behavior. -**Do instead**: Always present the execution plan and wait for confirmation before proceeding. - -### Anti-Pattern 3: Sequential When Parallel Is Safe -**What it looks like**: Running code-linting, then comment-quality, then go-pr-quality-gate in sequence when all three are independent -**Why wrong**: Forces ~15 minutes of sequential execution when ~5 minutes parallel would work. No dependencies exist between them. -**Do instead**: Analyze dependencies carefully. Independent skills with no shared resources run in parallel. 
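The acyclic, availability, and ordering checks from the skill-composer PLAN phase can be sketched with Python's standard `graphlib` module (Python 3.9+). This is an illustrative sketch only: the diff notes that `scripts/build_dag.py` is not yet implemented, and `validate_dag` and its argument shapes are hypothetical. The output/input compatibility check is not covered here, since it depends on `references/compatibility-matrix.md`.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dag(dependencies: dict[str, set[str]], available: set[str]) -> list[str]:
    """Check availability and acyclicity; return one valid execution order.

    `dependencies` maps each skill to the set of skills it depends on.
    Raises ValueError when a skill is unknown or the graph contains a cycle.
    """
    referenced = set(dependencies) | {d for deps in dependencies.values() for d in deps}
    missing = referenced - available
    if missing:
        raise ValueError(f"unknown skills: {sorted(missing)}")
    try:
        # static_order() yields every dependency before the skills that need it.
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency detected: {exc.args[1]}") from exc


# A three-skill sequential chain, using skill names that appear in the diff.
plan = {
    "workflow-orchestrator": set(),
    "test-driven-development": {"workflow-orchestrator"},
    "verification-before-completion": {"test-driven-development"},
}
order = validate_dag(plan, available=set(plan))
```

Raising on the first failed check mirrors the rule that validation must pass before the execution plan is presented.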
- -### Anti-Pattern 4: Incompatible Skill Chaining -**What it looks like**: Chaining test-driven-development (outputs: Go source files) into pr-miner (expects: Git repository URL) -**Why wrong**: Output type does not match input type. Will fail at runtime with a compatibility error. -**Do instead**: Consult `references/compatibility-matrix.md` during planning. Only chain skills with compatible interfaces. - -### Anti-Pattern 5: Forgetting Cleanup -**What it looks like**: After composition completes, /tmp/ contains skill-index.json, execution-dag.json, and multiple intermediate output files -**Why wrong**: Temporary files accumulate across sessions, may contain sensitive data, and clutter the filesystem. -**Do instead**: Always execute cleanup in the REPORT phase. Remove all temporary files created during composition. Keep only files explicitly needed for the final output. - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "One skill can probably handle all of this" | Complex tasks need specialized skills | Discover applicable skills, compose what's needed | -| "No need to validate the DAG, it's simple" | Simple DAGs can still have type mismatches | Run validation script every time | -| "User doesn't need to see the plan" | Skipping dry run violates hardcoded behavior | Present plan, wait for confirmation | -| "I'll add a few extra skills for quality" | Over-composition wastes time and adds failure points | Only compose skills explicitly needed | - -### Reference Files - `${CLAUDE_SKILL_DIR}/references/composition-patterns.md`: Proven multi-skill composition patterns with duration estimates 
- `${CLAUDE_SKILL_DIR}/references/compatibility-matrix.md`: Skill input/output compatibility and valid chains - `${CLAUDE_SKILL_DIR}/references/skill-patterns.md`: Common skill patterns with sequential/parallel decision trees -- `${CLAUDE_SKILL_DIR}/references/examples.md`: Real-world composition examples with execution output diff --git a/skills/skill-eval/SKILL.md b/skills/skill-eval/SKILL.md index 26cbfcf8..99eb2118 100644 --- a/skills/skill-eval/SKILL.md +++ b/skills/skill-eval/SKILL.md @@ -39,40 +39,6 @@ routing: Measure and improve skill quality through empirical testing — because structure doesn't guarantee behavior, and measurement beats assumption. -## Operator Context - -This skill operates as the eval-driven improvement pipeline for Claude Code skills. It provides four capabilities: trigger evaluation, description optimization, output benchmarking, and structural validation. - -### Hardcoded Behaviors (Always Apply) -- **Measure before changing**: Always run baseline eval before making improvements -- **Train/test split**: Use 60/40 holdout to prevent overfitting descriptions -- **Generalize, don't overfit**: Improvements should help across many prompts, not just test cases -- **Report results**: Always show before/after metrics - -### Default Behaviors (ON unless disabled) -- **HTML reports**: Generate visual reports for description optimization -- **Verbose output**: Show per-query pass/fail during eval runs -- **3 runs per query**: Run each trigger test 3 times for reliability - -### Optional Behaviors (OFF unless enabled) -- **Blind A/B comparison**: Use comparator agent for unbiased output comparison -- **Full benchmark suite**: Run aggregate benchmarks with timing and token metrics - -## What This Skill CAN Do -- Test whether a skill's description triggers correctly for a set of queries -- Optimize descriptions via automated eval+improve loop (train/test split) -- Benchmark skill output quality (with-skill vs without-skill) -- Validate skill 
structure (frontmatter, naming, description length) -- Generate HTML reports for visual review - -## What This Skill CANNOT Do -- Create new skills from scratch (use skill-creator) -- Modify skill instructions automatically (human reviews changes) -- Test skills that require specific MCP servers or external services -- Run evals without the `claude` CLI available - ---- - ## Instructions ### Phase 1: ASSESS — Determine what to evaluate @@ -84,10 +50,12 @@ This skill operates as the eval-driven improvement pipeline for Claude Code skil python3 -m scripts.skill_eval.quick_validate ``` -**Step 2: Choose evaluation mode** +This checks: SKILL.md exists, valid frontmatter, required fields (name, description), kebab-case naming, description under 1024 chars, no angle brackets. -| User Intent | Mode | Script | -|------------|------|--------| +**Step 2: Choose evaluation mode based on user intent** + +| Intent | Mode | Script | +|--------|------|--------| | "Test if description triggers correctly" | Trigger eval | `run_eval.py` | | "Optimize/improve the description" | Description optimization | `run_loop.py` | | "Compare skill vs no-skill output" | Output benchmark | Manual + `aggregate_benchmark.py` | @@ -103,15 +71,17 @@ Test whether a skill's description causes Claude to invoke it for the right quer **Step 1: Create eval set** (or use existing) -Create a JSON file with 8-20 test queries: +Create a JSON file with 8-20 test queries. **Eval set quality matters** — use realistic prompts with detail (file paths, context, casual phrasing), not abstract one-liners. Focus on edge cases where the skill competes with adjacent skills. 
+ +Example of good eval queries: ```json [ - {"query": "realistic user prompt that should trigger", "should_trigger": true}, - {"query": "similar but different domain prompt", "should_trigger": false} + {"query": "ok so my boss sent me this xlsx file (Q4 sales final FINAL v2.xlsx) and she wants profit margin as a percentage", "should_trigger": true}, + {"query": "Format this data", "should_trigger": false} ] ``` -**Eval set quality matters** — use realistic prompts with detail (file paths, context, casual phrasing), not abstract one-liners. Focus on edge cases where the skill competes with adjacent skills. +**Why**: Real users write detailed, specific prompts. Abstract queries don't test real triggering behavior. Overfitting descriptions to abstract test cases bloats the description and fails on real usage. **Step 2: Run evaluation** @@ -123,13 +93,18 @@ python3 -m scripts.skill_eval.run_eval \ --verbose ``` -This spawns `claude -p` for each query, checking whether it invokes the skill. Output includes pass/fail per query with trigger rates. +This spawns `claude -p` for each query, checking whether it invokes the skill. Runs each query 3 times for reliability. Output includes pass/fail per query with trigger rates. Default 30s timeout; increase with `--timeout 60` if needed for complex queries. + +**Constraints applied**: +- Always run baseline eval before making improvements +- 3 runs per query ensures statistical reliability +- Verbose output shows per-query pass/fail during eval runs **GATE**: Eval results available. Proceed to improvement if failures found. #### Mode B: Description Optimization -Automated loop that tests, improves, and re-tests descriptions. +Automated loop that tests, improves, and re-tests descriptions using Claude with extended thinking. ```bash python3 -m scripts.skill_eval.run_loop \ @@ -141,14 +116,18 @@ python3 -m scripts.skill_eval.run_loop \ ``` This will: -1. Split eval set 60/40 train/test (stratified by should_trigger) -2. 
Evaluate current description on all queries (3 runs each) -3. Use Claude with extended thinking to propose improvements based on failures +1. Split eval set 60/40 train/test (stratified by should_trigger) — prevents overfitting to test cases +2. Evaluate current description on all queries (3 runs each for reliability) +3. Use Claude with extended thinking to propose improvements based on training failures 4. Re-evaluate the new description 5. Repeat until all pass or max iterations reached -6. Select best description by **test** score (prevents overfitting) +6. Select best description by **test** score (not train score — prevents overfitting) 7. Open an HTML report in the browser +**Why 60/40 split**: Improvements should help across many prompts, not just test cases. Training on failures, validating on holdout ensures generalization. + +**Why report HTML**: Visual reports enable quick review of which queries improved, which regressed, and what the new description looks like. + **GATE**: Loop complete. Best description identified. #### Mode C: Output Benchmark @@ -163,9 +142,11 @@ For each test prompt, spawn two agents: - **With skill**: Load the skill, run the prompt, save outputs - **Without skill** (baseline): Same prompt, no skill, save outputs +**Why baseline matters**: Can't prove the skill adds value without a baseline. Maybe Claude handles it fine without the skill. The delta is what matters. + **Step 3: Grade outputs** -Spawn a grader subagent using the prompt in `agents/grader.md`. It evaluates assertions against the outputs. +Spawn a grader subagent using `agents/grader.md`. It evaluates assertions against the outputs. **Step 4: Aggregate** @@ -211,6 +192,8 @@ If description optimization found a better description: 3. Update the skill's SKILL.md frontmatter 4. Re-run quick_validate to confirm the update is valid +**Constraint**: Always show results before/after with metrics. This enables informed decisions. 
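The stratified 60/40 split and test-score selection described under Mode B can be sketched as follows. This is a minimal illustration only: the actual logic in `run_loop.py` is not shown in this diff, and the names `stratified_split`, `select_best`, `train_frac`, `seed`, and the `test_score` key are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Query = Dict[str, object]


def stratified_split(
    queries: List[Query], train_frac: float = 0.6, seed: int = 0
) -> Tuple[List[Query], List[Query]]:
    """Split queries into train/test, keeping the should_trigger ratio in both.

    Hypothetical helper: splits each label group separately so that positive
    and negative queries are represented in both halves.
    """
    rng = random.Random(seed)
    train: List[Query] = []
    test: List[Query] = []
    for label in (True, False):
        group = [q for q in queries if bool(q["should_trigger"]) == label]
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


def select_best(candidates: List[Dict[str, object]]) -> Dict[str, object]:
    """Pick the candidate description with the highest held-out test score."""
    return max(candidates, key=lambda c: c["test_score"])
```

Selecting on the held-out score rather than the train score is what keeps a description from being tuned to the exact queries that drove the improvements.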
+ **GATE**: Changes applied and validated, or user chose to keep original. --- @@ -239,25 +222,6 @@ If description optimization found a better description: --- -## Anti-Patterns - -### Anti-Pattern 1: Abstract Eval Queries -**What it looks like**: `"Format this data"`, `"Create a chart"` -**Why wrong**: Real users write detailed, specific prompts. Abstract queries don't test real triggering behavior. -**Do instead**: `"ok so my boss sent me this xlsx file (Q4 sales final FINAL v2.xlsx) and she wants profit margin as a percentage"` - -### Anti-Pattern 2: Overfitting to Test Cases -**What it looks like**: Adding specific query text to the description to force triggers -**Why wrong**: Works for test set, fails on real usage. Bloats the description. -**Do instead**: Generalize from failures to broader categories of user intent. - -### Anti-Pattern 3: Skipping Baseline -**What it looks like**: Running with-skill only, no without-skill comparison -**Why wrong**: Can't prove the skill adds value without a baseline. Maybe Claude handles it fine without the skill. -**Do instead**: Always run both configurations. The delta is what matters. 
- ---- - ## References ### Scripts (in `scripts/skill_eval/`) @@ -275,7 +239,3 @@ If description optimization found a better description: ### Reference Files - `${CLAUDE_SKILL_DIR}/references/schemas.md` — JSON schemas for evals.json, grading.json, benchmark.json - -### Shared Patterns -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) -- [Verification Checklist](../shared-patterns/verification-checklist.md) diff --git a/skills/socratic-debugging/SKILL.md b/skills/socratic-debugging/SKILL.md index b55e7bbd..45a47b11 100644 --- a/skills/socratic-debugging/SKILL.md +++ b/skills/socratic-debugging/SKILL.md @@ -29,33 +29,25 @@ routing: # Socratic Debugging Skill -## Operator Context +## Overview -This skill operates as an operator for question-guided debugging, configuring Claude's behavior to teach debugging through structured inquiry rather than providing answers. It implements the **Socratic Method** pattern -- ask questions that lead the user to discover the root cause themselves, building lasting investigative skills. +This skill teaches debugging through structured inquiry rather than providing answers, implementing the **Socratic Method** pattern. You ask questions that lead the user to discover root causes themselves, building lasting investigative skills rather than offering direct solutions. -### Hardcoded Behaviors (Always Apply) -- **Never State the Answer**: Do not reveal the root cause, fix, or solution directly. The user must arrive at it themselves. This is the entire point of the skill -- giving answers defeats the learning objective. -- **Read Code First**: Always use Read/Grep/Glob to understand the relevant code before formulating questions. Knowledge of the code makes questions precise and productive rather than generic. -- **Follow the 9-Phase Progression**: Do not jump to hypothesis questions (Phase 9) without establishing symptoms (Phase 1) and state (Phase 6). 
Skipping phases leads to guesswork instead of systematic discovery. -- **Offer Escalation After 12 Questions**: If the user has not made progress after 12 questions, offer to switch to `systematic-debugging`. Escalation is a clean handoff, not a failure. +--- -### Default Behaviors (ON unless disabled) -- **Start at Phase 1**: Begin with Symptoms regardless of how specific the user's description is. Even detailed reports often contain unstated assumptions. -- **One Question at a Time**: Ask a single question, then wait. Multiple questions overwhelm and dilute focus. -- **Use the User's Terminology**: Mirror their variable names, function names, and domain terms in questions. This reduces friction and shows you are engaged with their specific problem. -- **Acknowledge Progress**: When the user discovers something, acknowledge it before asking the next question. Silent progression feels like interrogation. +## Instructions -### Optional Behaviors (OFF unless enabled) -- **Phase Skipping**: Jump to later phases if the user has clearly and thoroughly covered earlier phases in their initial description. -- **Code Reading Suggestions**: Suggest specific diagnostic actions ("try adding a log statement at line X") when the user is stuck on state inspection. +### Core Constraints ---- +Never state the answer directly. The user must arrive at the root cause themselves -- giving answers defeats the learning objective. Always read relevant code first using Read/Grep/Glob before formulating questions. Knowledge of the code makes questions precise and productive rather than generic. Follow the 9-phase progression without skipping: jumping to hypothesis questions without establishing symptoms and state leads to guesswork instead of systematic discovery. -## Instructions +### Default Workflow Behaviors + +Begin with symptoms regardless of how specific the user's description is. Even detailed reports contain unstated assumptions. 
Ask one question at a time and wait for the response. Multiple questions overwhelm and dilute focus. Mirror the user's terminology (variable names, function names, domain terms) in your questions to reduce friction and show engagement. When the user discovers something, acknowledge it before asking the next question -- silent progression feels like interrogation. After 12 questions without progress toward root cause, trigger escalation to systematic-debugging as a clean handoff. -### Question Progression +### Question Progression: 9 Phases -Follow these 9 phases in order. Each phase has a purpose -- do not skip phases without evidence the user has already covered that ground. +Follow these phases in order. Each phase builds evidence for the next. | Phase | Purpose | Example Questions | |-------|---------|-------------------| @@ -77,18 +69,13 @@ Follow these 9 phases in order. Each phase has a purpose -- do not skip phases w 4. **Track question count.** After 12 questions with no progress toward root cause, trigger escalation offer. 5. **When user identifies root cause**, confirm their finding and ask what fix they would apply. Do not suggest the fix yourself. -### Hints Are Allowed +### Hints vs. Leading Questions -Questions may contain subtle directional hints. The goal is discovery, not suffering. - -- **Good hint**: "What happens if you log the value of `request.userId` right before line 42?" (directs attention to the right variable and location) -- **Bad hint**: "Don't you think `request.userId` is null at line 42?" (this IS giving the answer disguised as a question) - -The line: open-ended questions that narrow focus are hints. Leading questions that contain the answer are violations. +Questions may contain subtle directional hints. The goal is discovery, not suffering. A **good hint** directs attention without revealing the answer: "What happens if you log the value of `request.userId` right before line 42?" 
A **bad hint** is a leading question that contains the answer: "Don't you think `request.userId` is null at line 42?" The line: open-ended questions that narrow focus are hints. Leading questions that contain the answer are violations. ### Escalation Protocol -After 12 questions without progress: +After 12 questions without progress, offer cleanly: > "We have been exploring this for a while. Would you like to switch to direct debugging mode? I can investigate and solve this systematically instead of through questions." @@ -100,28 +87,6 @@ If user accepts, hand off to `systematic-debugging` with a summary of what has b --- -## Examples - -### Example 1: Guided Discovery -User says: "My API endpoint returns 500 but I don't know why" -Actions: -1. Read the endpoint handler code silently -2. Ask: "What response did you expect from this endpoint, and what does the 500 response body contain?" -3. User shares error details -4. Acknowledge, then: "Can you reproduce this with every request, or only certain inputs?" -5. Continue through phases until user identifies root cause - -### Example 2: Escalation -User says: "Help me think through why my tests fail intermittently" -Actions: -1. Read test files silently -2. Work through Phases 1-6 over 12 questions -3. User remains stuck despite good-faith effort -4. Offer: "Would you like to switch to direct debugging mode?" -5. User accepts -- hand off to systematic-debugging with session summary - ---- - ## Error Handling ### User Says "Just Tell Me the Answer" @@ -130,43 +95,14 @@ Solution: Offer to switch modes cleanly. Say: "Would you like to switch to direc ### User Is Frustrated Cause: Too many questions without visible progress, or questions feel generic -Solution: Acknowledge the frustration. Offer escalation. If they want to continue, read more code and ask sharper, more targeted questions. +Solution: Acknowledge the frustration. Offer escalation. 
If they want to continue, read more code and ask sharper, more targeted questions. Generic questions indicate you haven't read the code deeply enough. ### Bug Is Trivially Obvious From Code Cause: A typo, missing import, or simple syntax error visible in the source -Solution: Still ask Phase 1, but make the question very pointed -- narrow enough that the user will see the answer immediately. Example: "What do you expect `reponse.data` to contain?" (the typo in the variable name is the bug). - ---- - -## Anti-Patterns - -### Anti-Pattern 1: Answering in Disguise -**What it looks like**: "Don't you think the problem is that X is null?" -**Why wrong**: A leading question that contains the answer violates the core principle. The user learns nothing about how to find bugs. -**Do instead**: Ask open-ended questions that direct attention: "What is the value of X at that point?" - -### Anti-Pattern 2: Interrogation Mode -**What it looks like**: Rapid-fire questions without acknowledging the user's responses -**Why wrong**: The user feels unheard and disengaged. Discovery requires dialogue, not interrogation. -**Do instead**: Acknowledge what the user said, then ask one follow-up question. - -### Anti-Pattern 3: Skipping to Hypothesis -**What it looks like**: Jumping to "Where do you think the problem is?" without establishing symptoms, state, or prior attempts -**Why wrong**: Without Phase 1-6 data, hypothesis is pure guesswork. The user learns to guess, not investigate. -**Do instead**: Follow the 9-phase progression. Earlier phases build the evidence base for meaningful hypotheses. +Solution: Still ask Phase 1, but make the question very pointed -- narrow enough that the user will see the answer immediately. Example: "What do you expect `reponse.data` to contain?" (the typo in the variable name is the bug). Avoid skipping phases; pointed questions stay within the Socratic framework. 
--- ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "The bug is obvious, I should just tell them" | Telling defeats the learning objective | Ask a pointed Phase 1 question instead | -| "They seem frustrated, I'll give a hint with the answer" | Leading questions are answers in disguise | Offer escalation to systematic-debugging | -| "We've covered symptoms already, skip to Phase 7" | User's description may have gaps or assumptions | Verify Phase 1-2 explicitly before advancing | -| "One more question won't hurt past 12" | Diminishing returns cause frustration | Offer escalation at the 12-question mark | +This skill teaches debugging through structured inquiry within these constraints: Never violate the Socratic method by stating answers directly; always read code before questioning (generic questions signal incomplete code understanding); follow phase progression to build evidence rather than guessing; escalate cleanly at 12 questions without progress rather than continuing to frustrate the user; use the user's terminology to maintain engagement; acknowledge discoveries to keep the dialogue feeling collaborative rather than like interrogation. diff --git a/skills/spec-writer/SKILL.md b/skills/spec-writer/SKILL.md index 4de6902b..825ef1d0 100644 --- a/skills/spec-writer/SKILL.md +++ b/skills/spec-writer/SKILL.md @@ -33,24 +33,6 @@ routing: Produce a structured SPEC.md before any design or implementation begins. This is Phase 0 of the feature lifecycle pipeline (spec --> design --> plan --> implement --> validate --> release). The spec defines WHAT to build and WHERE the boundaries are. It says nothing about HOW. 
-## Operator Context - -### Hardcoded Behaviors (Always Apply) -- **Phase 3 (Out of Scope) is MANDATORY** -- never skip it. Without explicit exclusions, scope creep is invisible until it's too late. Every "while we're at it" addition started as an unwritten assumption. -- **Minimum 3 out-of-scope items** -- if you cannot list 3 things this feature does NOT do, the scope is not yet defined. This forces real boundary thinking rather than a rubber-stamp section. -- **Max 7 user stories** -- more than 7 means the feature is too broad. Decompose into multiple specs first. This constraint forces prioritization: which stories are essential vs. nice-to-have? -- **Acceptance criteria must be testable** -- no subjective language ("should feel fast", "user-friendly", "intuitive"). Every criterion must have a verifiable assertion. WHY: untestable criteria become opinion debates during review. -- **Spec says WHAT, not HOW** -- no code, no architecture, no database schemas, no implementation details. Those belong in feature-design. - -### Default Behaviors (ON unless disabled) -- **Save output as SPEC.md** in project root, or `.feature/SPEC.md` if worktree is active -- **Include all 5 sections** in order (stories, criteria, out-of-scope, risks, estimation) -- **Use GIVEN/WHEN/THEN format** for acceptance criteria - -### Optional Behaviors (OFF unless enabled) -- **Skip estimation** -- omit Section 5 for exploratory or research-phase work -- **Test stub generation** -- render acceptance criteria as pseudocode test stubs - ## Instructions ### Step 1: Gather Context @@ -64,7 +46,15 @@ Produce a structured SPEC.md before any design or implementation begins. This is ### Step 2: Write SPEC.md -Produce the spec with all 5 sections. Use this structure: +Produce the spec with all 5 sections in order. Out-of-scope is MANDATORY and must contain minimum 3 items—without explicit exclusions, scope creep is invisible until too late. 
Every "while we're at it" addition started as an unwritten assumption. + +**Max 7 user stories**. More than 7 means the feature is too broad. Decompose into multiple specs first. This constraint forces prioritization: which stories are essential vs. nice-to-have? + +**Acceptance criteria must be testable**—no subjective language ("should feel fast", "user-friendly", "intuitive"). Every criterion must have a verifiable assertion. WHY: untestable criteria become opinion debates during review. + +**Spec says WHAT, not HOW**—no code, no architecture, no database schemas, no implementation details. Those belong in feature-design. + +Use this structure: ```markdown # Spec: [Feature Name] @@ -116,9 +106,9 @@ This feature does NOT: ### Step 3: Validate and Save -1. Review each acceptance criterion -- flag any that use subjective language -2. Review out-of-scope -- flag if fewer than 3 items -3. Review story count -- flag if more than 7 +1. Review each acceptance criterion—flag any that use subjective language. Replace with measurable assertion: latency threshold, click count, error rate. +2. Review out-of-scope—flag if fewer than 3 items. Brainstorm adjacent features, related capabilities, and future enhancements that are NOT part of this work to reach minimum 3. +3. Review story count—flag if more than 7. If scope is too broad to list 3 things this feature does NOT do, the scope is not yet defined. Decompose into multiple specs, one per coherent capability. 4. Save the spec: - If `.feature/` directory exists: save to `.feature/SPEC.md` - Otherwise: save to `SPEC.md` in project root @@ -127,6 +117,12 @@ This feature does NOT: Spec saved to [path]. Run /feature-design to begin design exploration. 
``` +### Step 4: Optional Behaviors + +- **Skip estimation** (Section 5)—omit for exploratory or research-phase work +- **Test stub generation**—render acceptance criteria as pseudocode test stubs +- **Save output location**—defaults to `SPEC.md` in project root, or `.feature/SPEC.md` if worktree is active + ## Error Handling | Error | Cause | Solution | @@ -136,29 +132,14 @@ This feature does NOT: | Cannot list 3 out-of-scope items | Scope not yet defined | Brainstorm adjacent features, related capabilities, and future enhancements that are NOT part of this work | | Acceptance criteria use subjective language | "fast", "easy", "intuitive" | Replace with measurable assertion: latency threshold, click count, error rate | -## Anti-Patterns - -| Anti-Pattern | Why Wrong | Do Instead | -|--------------|-----------|------------| -| Kitchen sink spec (>7 stories) | Scope is too broad to implement or validate coherently | Decompose into multiple specs, prioritize | -| Empty out-of-scope | If nothing is excluded, everything is included -- that is a scope problem | Force 3+ exclusions with reasoning | -| Untestable criteria | "Should be user-friendly" becomes an opinion debate during review | Use measurable assertions: "<3 clicks", "<200ms", "zero data loss" | -| Spec as implementation | Spec says WHAT, not HOW. 
Code and schemas belong in feature-design | Remove implementation details, keep behavioral descriptions | - -## Integration - -This skill produces the input artifact for the feature lifecycle: - -``` -spec-writer (SPEC.md) - --> feature-design (reads stories + scope boundaries) - --> feature-plan (reads acceptance criteria for test requirements) - --> feature-implement - --> feature-validate (checks acceptance criteria as quality gates) - --> feature-release -``` - ## References -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) +- Spec Writer Integration: This skill produces the input artifact for the feature lifecycle pipeline: + ``` + spec-writer (SPEC.md) + --> feature-design (reads stories + scope boundaries) + --> feature-plan (reads acceptance criteria for test requirements) + --> feature-implement + --> feature-validate (checks acceptance criteria as quality gates) + --> feature-release + ``` diff --git a/skills/subagent-driven-development/SKILL.md b/skills/subagent-driven-development/SKILL.md index ca455280..5f7350a7 100644 --- a/skills/subagent-driven-development/SKILL.md +++ b/skills/subagent-driven-development/SKILL.md @@ -27,49 +27,6 @@ routing: # Subagent-Driven Development Skill -## Operator Context - -This skill operates as an operator for plan execution workflows, configuring Claude's behavior for disciplined task dispatch with mandatory quality gates. It implements the **Controller-Worker** architectural pattern -- controller extracts context, workers execute in isolation, reviewers verify -- with **Two-Stage Review** ensuring both ADR compliance and code quality. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before executing any task -- **Over-Engineering Prevention**: Implement only what the ADR requires. 
No speculative improvements, no "while I'm here" changes -- **ADR Compliance Before Quality**: NEVER run code quality review before ADR compliance passes -- **Full Context Injection**: NEVER make a subagent read the plan file; provide full task text in the dispatch -- **Review Gates Mandatory**: NEVER mark a task complete until both reviews pass -- **Sequential Tasks Only**: NEVER dispatch multiple implementation subagents in parallel (causes file conflicts) -- **BASE_SHA Capture**: Run `git rev-parse HEAD` BEFORE dispatching the first implementer - -### Default Behaviors (ON unless disabled) -- **Fresh Subagent Per Task**: Each task gets a clean subagent with no cross-task context pollution -- **Scene-Setting Context**: Gather branch status, code patterns, and conventions before dispatch -- **Q&A Before Implementation**: Answer subagent questions before they start coding -- **Review Fix Loops**: When reviewer finds issues, implementer fixes and reviewer re-reviews -- **Max 3 Review Retries**: Escalate to user after 3 failed review cycles per stage -- **Final Integration Review**: After all tasks, dispatch a holistic reviewer for the full changeset -- **TodoWrite Tracking**: Create TodoWrite with all tasks upfront for progress visibility - -### Optional Behaviors (OFF unless enabled) -- **Parallel Research**: Pre-gather codebase context for all tasks in parallel before execution -- **Auto PR Creation**: Run pr-sync automatically after final review passes -- **Custom Reviewer Prompts**: Override default reviewer prompt templates - -## What This Skill CAN Do -- Execute multi-task plans with quality gates between each task -- Dispatch fresh subagents that implement, test, and commit independently -- Enforce ADR compliance before allowing code quality review -- Escalate to user when review loops exceed retry limits -- Track progress across tasks with TodoWrite - -## What This Skill CANNOT Do -- Execute without an existing implementation plan (use workflow-orchestrator 
first) -- Run tasks in parallel (file conflicts make this unsafe) -- Skip either stage of the two-stage review -- Fix review issues itself (the implementer subagent must fix) -- Replace user judgment on ADR ambiguity (escalates instead) - ---- - ## Instructions ### Phase 1: SETUP @@ -92,6 +49,8 @@ Verification: [How to verify this task] ... ``` +**Why**: Providing complete task text inline keeps subagents from burning tokens on file reads and from polluting their context when they need to refer back to the plan. This isolation is critical for clean review cycles. + **Step 2: Create TodoWrite** Create TodoWrite with ALL tasks: @@ -101,14 +60,20 @@ Create TodoWrite with ALL tasks: 3. [pending] Task 3: [Title] ``` +**Why**: TodoWrite gives the operator visibility and prevents task slip. + **Step 3: Gather scene-setting context** + +Before dispatching any implementer, capture: - Current branch status (`git status`) -- Capture BASE_SHA: `git rev-parse HEAD` +- Capture BASE_SHA: `git rev-parse HEAD` -- required for final integration review -- Relevant existing code patterns +- Relevant existing code patterns (naming conventions, error handling style) - Project conventions from CLAUDE.md - Dependencies and setup requirements -This context gets passed to EVERY subagent. +This context gets passed to EVERY subagent to prevent repeated discovery and question loops. + +**Why**: Early context capture answers 80% of subagent questions before they ask, unblocks implementation immediately, and only needs to be collected once (not rediscovered per task). BASE_SHA must be captured BEFORE the first implementer runs because subsequent edits will move HEAD forward. **Gate**: All tasks extracted with full text. BASE_SHA captured. Scene-setting context gathered. Proceed only when gate passes. @@ -123,20 +88,23 @@ Update TodoWrite status for the current task. **Step 2: Dispatch implementer subagent** Use the Task tool with the prompt template from `./implementer-prompt.md`.
Include: -- Full task text (NEVER say "see plan") +- Full task text (NEVER say "see plan" -- subagents must have complete context) - Scene-setting context - Clear deliverables - Permission to ask questions -If the implementer asks questions: answer clearly and completely. Provide additional context. Re-dispatch with answers. Do NOT rush them into implementation. +**Implementation constraints** (enforced inline): +- Implementer must understand task fully before coding begins. If they ask questions: answer clearly and completely, provide additional context, re-dispatch with answers. Do NOT rush them into implementation. +- Tasks must run sequentially. NEVER dispatch multiple implementers in parallel because overlapping file edits cause conflicts that are expensive to resolve. +- Implementer MUST follow these steps in order: + 1. Understand the task fully + 2. Ask questions if unclear (BEFORE implementing) + 3. Implement following TDD where appropriate + 4. Run tests + 5. Self-review code + 6. Commit changes -The implementer MUST: -1. Understand the task fully -2. Ask questions if unclear (BEFORE implementing) -3. Implement following TDD where appropriate -4. Run tests -5. Self-review code -6. Commit changes +**Why sequential execution**: Each task's output becomes the next task's input. Parallel execution breaks file locking semantics and requires complex merge handling. Sequential is simpler, safer, and conflicts are rare when each subagent gets full context. **Step 3: Dispatch ADR compliance reviewer subagent** @@ -145,11 +113,15 @@ Use the prompt template from `./adr-reviewer-prompt.md`. The ADR compliance revi - Is anything MISSING from requirements? - Is anything EXTRA that was not requested? +**Two-stage review constraint** (enforced inline): NEVER run code quality review before ADR compliance passes. ADR compliance gates code quality because code that doesn't match requirements is wrong, regardless of how well-written. 
Reviewing code quality on functionally wrong code wastes the quality reviewer's effort. + If ADR compliance reviewer finds issues: dispatch new implementer subagent with fix instructions. ADR compliance reviewer reviews again. Repeat until ADR compliance passes. **Max retries: 3** -- After 3 failed ADR compliance reviews, STOP and escalate: > "ADR compliance failing after 3 attempts. Issues: [list]. Need human decision." +**Why escalation after 3 retries**: 3 retries = ~15-20 min of subagent time. If unresolved by then, the problem is structural (ADR is ambiguous, requirements conflict, or subagent fundamentally misunderstood something). Continuing the loop wastes tokens. A human needs to decide: clarify ADR, adjust requirements, or accept the implementation as-is. + **Step 4: Dispatch code quality reviewer subagent** Use the prompt template from `./code-quality-reviewer-prompt.md`. The code quality reviewer checks: @@ -158,11 +130,15 @@ Use the prompt template from `./code-quality-reviewer-prompt.md`. The code quali - Error handling is appropriate - No obvious bugs +**Quality review sequencing** (enforced inline): Only dispatch quality reviewer AFTER ADR compliance passes. Code quality review focuses on how well requirements are met, not whether wrong things were built. + If quality reviewer finds issues: implementer fixes Critical and Important issues (Minor issues are optional). Quality reviewer reviews again. **Max retries: 3** -- After 3 failed quality reviews, STOP and escalate: > "Quality review failing after 3 attempts. Issues: [list]. Need human decision." +**Why the same retry limit for both stages**: Both stages can get stuck. Both deserve a fair number of attempts (3 each = up to 60 min total per task). Both hitting the limit means something is wrong with the process or the task definition itself.
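The two-stage review gate above (ADR compliance first, then code quality, each capped at 3 failed reviews before escalation) can be sketched roughly as below. This is an illustration under assumptions, not the skill's implementation: `run_stage`, `review_task`, `EscalationRequired`, and the callback signatures are hypothetical stand-ins for the subagent dispatches.

```python
from typing import Callable, List

MAX_RETRIES = 3  # per review stage, as stated in the skill text


class EscalationRequired(Exception):
    """Raised when a review stage still fails after MAX_RETRIES reviews."""


def run_stage(
    name: str,
    review: Callable[[], List[str]],
    fix: Callable[[List[str]], None],
    max_retries: int = MAX_RETRIES,
) -> int:
    """Run one review stage; return the attempt number on which it passed.

    `review` returns a list of issues (empty means pass). `fix` stands in for
    dispatching an implementer subagent with fix instructions.
    """
    issues: List[str] = []
    for attempt in range(1, max_retries + 1):
        issues = review()
        if not issues:
            return attempt
        if attempt < max_retries:
            fix(issues)
    raise EscalationRequired(
        f"{name} failing after {max_retries} attempts. "
        f"Issues: {issues}. Need human decision."
    )


def review_task(
    adr_review: Callable[[], List[str]],
    quality_review: Callable[[], List[str]],
    fix: Callable[[List[str]], None],
) -> None:
    """Quality review is only reached once ADR compliance has passed."""
    run_stage("ADR compliance", adr_review, fix)
    run_stage("Quality review", quality_review, fix)
```

Because `review_task` calls the stages in sequence and the first raises on exhaustion, the quality reviewer is never dispatched against work that has not passed ADR compliance.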
+ **Step 5: Mark task complete** Only when BOTH reviews pass: @@ -187,6 +163,8 @@ Dispatch a reviewer subagent for the entire changeset (diff from BASE_SHA to HEA - No integration issues between tasks - No conflicting patterns or redundant code +**Why final integration review after all tasks**: Per-task reviews ensure each task is correct in isolation. Final integration review catches cross-task problems: Task 1 and Task 3 both define the same utility, tests pass individually but conflict when run together, or Task 2 introduced a breaking change that Task 4 didn't account for. This catch-all review is why BASE_SHA was captured upfront. + **Step 2: Complete development workflow** Use the appropriate completion path: @@ -207,6 +185,8 @@ Solution: 2. Add the missing context to the scene-setting for future tasks 3. Re-dispatch implementer with answers included +**Prevention**: The answer-questions-first constraint prevents this by design. If a subagent still asks questions after full context, they're asking for clarification on the ADR itself, which is valuable signal that requirements are ambiguous. + ### Error: "Review Loop Exceeds 3 Retries" Cause: ADR ambiguity, fundamental misunderstanding, or unreasonable review criteria Solution: @@ -215,6 +195,8 @@ Solution: 3. Ask user to clarify ADR or adjust requirements 4. Resume only after user provides direction +**Why hard limit**: Review loops that don't converge are expensive and signal a deeper problem. Continuing them burns tokens without progress. Human judgment is needed to decide whether to clarify, change, or accept. + ### Error: "Subagent File Conflicts" Cause: Multiple subagents modifying overlapping files (usually from parallel dispatch) Solution: @@ -222,48 +204,12 @@ Solution: 2. Re-run the affected review stage 3. 
Enforce sequential dispatch going forward -- NEVER parallelize implementers ---- - -## Anti-Patterns - -### Anti-Pattern 1: Skipping ADR Compliance Review for "Obvious" Tasks -**What it looks like**: "This task is straightforward, code quality review is enough" -**Why wrong**: Even simple tasks can miss requirements or add unrequested scope -**Do instead**: Run both review stages for every task. No exceptions. - -### Anti-Pattern 2: Controller Fixing Issues Directly -**What it looks like**: Controller edits code instead of dispatching a fix subagent -**Why wrong**: Pollutes controller context. Controller should orchestrate, not implement. -**Do instead**: Dispatch a new implementer subagent with specific fix instructions. - -### Anti-Pattern 3: Bundling Multiple Tasks Into One Subagent -**What it looks like**: "Tasks 2 and 3 are related, I'll combine them" -**Why wrong**: Defeats isolation. Review becomes unclear. Rollback is impossible. -**Do instead**: One task per subagent. Always. - -### Anti-Pattern 4: Proceeding With Unresolved Questions -**What it looks like**: "The subagent will figure it out from the code" -**Why wrong**: Subagent builds the wrong thing. Review catches it late. Wasted cycles. -**Do instead**: Answer every question before implementation begins. +**Why this happens**: The sequential constraint exists to prevent this. If it occurs anyway, it means the constraint was violated. Reassert it. 
--- ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "This task is simple, skip ADR review" | Simple tasks still miss requirements | Run both review stages | -| "Subagent can read the plan itself" | File reading wastes tokens, context pollution | Provide full task text in dispatch | -| "Reviews passed, skip final integration check" | Per-task reviews miss cross-task issues | Run final integration review | -| "I'll fix this small issue myself instead of dispatching" | Controller context pollution breaks orchestration | Dispatch fix subagent | -| "Tasks 2 and 3 are related, combine them" | Combined tasks break isolation and review clarity | One task per subagent, always | - ### Prompt Templates - `implementer-prompt.md`: Dispatch template for implementation subagents - `adr-reviewer-prompt.md`: Dispatch template for ADR compliance review diff --git a/skills/systematic-code-review/SKILL.md b/skills/systematic-code-review/SKILL.md index 439dc115..eb0ffe7e 100644 --- a/skills/systematic-code-review/SKILL.md +++ b/skills/systematic-code-review/SKILL.md @@ -23,48 +23,7 @@ routing: # Systematic Code Review Skill -## Operator Context - -This skill operates as an operator for systematic code review, configuring Claude's behavior for thorough, verifiable review processes. It implements the **Iterative Refinement** architectural pattern — understand context, verify claims, assess risks, document findings — with **Domain Intelligence** embedded in severity classification and Go-specific review patterns. 
- -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before executing review -- **Over-Engineering Prevention**: Don't suggest features outside PR scope, but DO flag all issues IN the changed code even if fixing them requires touching other files. No speculative improvements. -- **Strict Severity Classification**: Use the Severity Classification Rules below. When in doubt, classify UP not down. -- **NEVER approve without running tests**: Review must include test execution -- **NEVER trust comments without verification**: All claims must be verified against code -- **NEVER skip security assessment**: Security implications must be explicitly evaluated -- **Complete phase gates**: Each phase must complete before proceeding - -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show command output rather than describing it. Be concise but informative -- **Temporary File Cleanup**: Remove any temporary analysis files, notes, or debug outputs created during review at task completion. 
Keep only the final review document -- **Read all changed files**: Don't review summaries, read actual code -- **Check affected dependencies**: Identify ripple effects -- **Verify test coverage**: Confirm tests exist for changes -- **Document all findings**: Create structured review output - -### Optional Behaviors (OFF unless enabled) -- **Performance profiling**: Benchmark affected code paths -- **Historical analysis**: Check for similar past issues -- **Extended security audit**: Deep security analysis beyond standard checks - -## What This Skill CAN Do -- Systematic phase-gated reviews with explicit gates between phases -- Severity classification (BLOCKING / SHOULD FIX / SUGGESTION) with decision tree -- Security, performance, and architecture risk assessment -- Go-specific pattern detection (concurrency, resource management, type exports) -- Go ecosystem pattern validation -- Receiving review feedback patterns (when acting as code author) - -## What This Skill CANNOT Do -- Write implementation code or implement features -- Skip phases or bypass phase gates -- Approve without running tests -- Review without reading all changed files -- Make speculative improvements outside PR scope - ---- +Systematic 4-phase code review: UNDERSTAND changes, VERIFY claims against actual behavior, ASSESS security/performance/architecture risks, DOCUMENT findings with severity classification. Each phase has an explicit gate that must pass before proceeding because skipping phases causes missed context, incorrect conclusions, and incomplete risk assessment. ## Instructions @@ -72,15 +31,19 @@ This skill operates as an operator for systematic code review, configuring Claud **Goal**: Map all changes and their relationships before forming any opinions. 
-**Step 1: Read every changed file** -- Use Read tool on EVERY changed file completely -- Map what each file does and how changes affect it +**Step 1: Read CLAUDE.md** +- Read and follow repository CLAUDE.md files first because project conventions override default review criteria and may define custom severity rules, approved patterns, or scope constraints. -**Step 2: Identify dependencies** -- Use Grep to find all callers/consumers of changed code -- Note any comments that make claims about behavior +**Step 2: Read every changed file** +- Use Read tool on EVERY changed file completely because reviewing summaries or reading partial files misses dependencies between changes and leads to incorrect conclusions. +- Map what each file does and how changes affect it. +- Check affected dependencies and identify ripple effects because changes in one file can break consumers that aren't in the diff. -**Step 2a: Caller Tracing** (mandatory when diff modifies function signatures, parameter semantics, or introduces sentinel/special values) +**Step 3: Identify dependencies** +- Use Grep to find all callers/consumers of changed code. +- Note any comments that make claims about behavior (these are claims to verify in Phase 2, not facts to trust). + +**Step 3a: Caller Tracing** (mandatory when diff modifies function signatures, parameter semantics, or introduces sentinel/special values) When the change modifies how a function/method is called or what parameters mean: @@ -99,7 +62,7 @@ This step catches: - Callers the PR author forgot about or didn't mention - Interface implementations that don't enforce the same preconditions -**Step 3: Document scope** +**Step 4: Document scope** ``` PHASE 1: UNDERSTAND @@ -125,20 +88,21 @@ Questions for Author: - [Any unclear aspects that need clarification] ``` -**Gate**: All changed files read, scope fully mapped, callers traced (if applicable). Proceed only when gate passes. 
+**Gate**: All changed files read (not just some — reading 2 of 5 files and saying "I get the gist" fails this gate), scope fully mapped, callers traced (if applicable). Proceed only when gate passes. ### Phase 2: VERIFY **Goal**: Validate all assertions in code, comments, and PR description against actual behavior. **Step 1: Run tests** -- Execute existing tests for changed files -- Capture complete test output +- Execute existing tests for changed files because review cannot approve without test execution — visual inspection misses runtime issues that tests catch. +- Capture complete test output. Show the output rather than describing it because facts outweigh narrative. +- Verify test coverage: confirm tests exist for the changed code paths because untested code paths are a SHOULD FIX finding. **Step 2: Verify claims** -- Check every comment claim against code behavior -- Verify edge cases mentioned are actually handled -- Trace through critical code paths manually +- Check every comment claim against code behavior because comments frequently become outdated and developers may not understand what "thread-safe" actually means — never accept a comment as truth without inspecting the code that backs it. +- Verify edge cases mentioned are actually handled. +- Trace through critical code paths manually. **Step 3: Document verification** @@ -160,30 +124,33 @@ Behavior Verification: - Match: YES | NO | PARTIAL ``` -**Gate**: All assertions in code/comments verified against actual behavior. Proceed only when gate passes. +**Gate**: All assertions in code/comments verified against actual behavior. Tests executed with output captured. Proceed only when gate passes. ### Phase 3: ASSESS **Goal**: Evaluate security, performance, and architectural risks specific to these changes. 
**Step 1: Security assessment** -- Evaluate OWASP top 10 against changes -- Explain HOW each vulnerability was ruled out (not just checkboxes) +- Never skip this step because security implications must be explicitly evaluated for every review, even when changes appear benign. +- Evaluate OWASP top 10 against changes. +- Explain HOW each vulnerability was ruled out (not just checkboxes) because a checkbox approach misses context-specific vulnerabilities and gives false confidence. +- If optionally enabled: perform extended deep security analysis beyond standard checks. **Step 2: Performance assessment** -- Identify performance-critical paths and evaluate impact -- Check for N+1 queries, unbounded loops, unnecessary allocations +- Identify performance-critical paths and evaluate impact. +- Check for N+1 queries, unbounded loops, unnecessary allocations. +- If optionally enabled: benchmark affected code paths with profiling. **Step 3: Architectural assessment** -- Compare patterns to existing codebase conventions -- Assess breaking change potential +- Compare patterns to existing codebase conventions. +- Assess breaking change potential. +- If optionally enabled: check for similar past issues via historical analysis. **Step 4: Extraction severity escalation** -- If the diff extracts inline code into named helper functions, re-evaluate all defensive guards -- A missing check rated LOW as inline code (1 caller, "upstream validates") becomes MEDIUM as a reusable function (N potential callers) -- See `skills/shared-patterns/severity-classification.md` for the full rule +- If the diff extracts inline code into named helper functions, re-evaluate all defensive guards. +- A missing check rated LOW as inline code (1 caller, "upstream validates") becomes MEDIUM as a reusable function (N potential callers). 
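The extraction escalation rule above is easier to see with a concrete case. The sketch below is illustrative only: `User`, `handleSignup`, and `greeting` are hypothetical names, not code from any reviewed project. It shows why a guard that was optional inline becomes necessary once the code is a reusable helper.

```go
package main

import (
	"errors"
	"fmt"
)

// User is a hypothetical type used only for this illustration.
type User struct{ Email string }

// handleSignup is the original single caller. It validates u before use, so
// a missing nil check in the inline formatting code was rated LOW.
func handleSignup(u *User) (string, error) {
	if u == nil {
		return "", errors.New("handleSignup: nil user")
	}
	return greeting(u)
}

// greeting is the extracted helper. It now has N potential callers and can
// no longer rely on "upstream validates", so it carries its own guard.
// Without this check, the missing guard would be a MEDIUM finding.
func greeting(u *User) (string, error) {
	if u == nil {
		return "", errors.New("greeting: nil user")
	}
	return "welcome " + u.Email, nil
}

func main() {
	// A new caller that skips upstream validation is still safe.
	if _, err := greeting(nil); err != nil {
		fmt.Println("guarded:", err)
	}
	msg, _ := handleSignup(&User{Email: "a@example.com"})
	fmt.Println(msg)
}
```

During review, the question to ask of an extracted helper is whether it would still be safe if called from somewhere other than its original call site.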
-**Step 4: Document assessment** +**Step 5: Document assessment** ``` PHASE 3: ASSESS @@ -205,12 +172,18 @@ Architectural Assessment: Risk Level: LOW | MEDIUM | HIGH | CRITICAL ``` -**Gate**: Security, performance, and architectural risks explicitly evaluated. Proceed only when gate passes. +**Gate**: Security, performance, and architectural risks explicitly evaluated (not skipped or hand-waved). Proceed only when gate passes. ### Phase 4: DOCUMENT **Goal**: Produce structured review output with clear verdict and rationale. +Report facts without self-congratulation. Show command output rather than describing it. Be concise but informative because the review consumer needs actionable findings, not commentary. + +Only flag issues within the scope of the changed code because suggesting features outside PR scope is over-engineering — but DO flag all issues IN the changed code even if fixing them requires touching other files. No speculative improvements. + +When classifying severity, use the Severity Classification Rules below and classify UP when in doubt because it is better to require a fix and have the author push back than to let a real issue slip through as "optional." + ``` PHASE 4: DOCUMENT @@ -239,11 +212,15 @@ Verdict: APPROVE | REQUEST-CHANGES | NEEDS-DISCUSSION Rationale: [1-2 sentences explaining verdict] ``` +After producing the review, remove any temporary analysis files, notes, or debug outputs created during review because only the final review document should persist. + **Gate**: Structured review output with clear verdict. Review is complete. --- -## Trust Hierarchy +## Reference Material + +### Trust Hierarchy When conflicting information exists, trust in this order: @@ -254,93 +231,25 @@ When conflicting information exists, trust in this order: 5. **Comments/docs** - Claims that need verification 6. **PR description** (lowest) - May be outdated or incomplete -## Severity Classification Rules - -**Guiding principle**: When in doubt, classify UP. 
It's better to require a fix and have the author push back than to let a real issue slip through as "optional." - -### BLOCKING (cannot merge without fixing) - -These issues MUST be fixed. Never mark these as "needs discussion" or "optional": - -| Category | Examples | -|----------|----------| -| **Security vulnerabilities** | Authentication bypass, injection (SQL/XSS/command), data exposure, secrets in code, missing authorization checks | -| **Test failures** | Any failing test, including pre-existing failures touched by the change | -| **Breaking changes** | API breaking without migration, backward incompatible changes without versioning | -| **Missing error handling** | Unhandled errors on network/filesystem/database operations, panics in production paths | -| **Race conditions** | Concurrent access without synchronization, data races | -| **Resource leaks** | Unclosed file handles, database connections, memory leaks in hot paths | -| **Logic errors** | Off-by-one errors, incorrect conditionals, wrong return values | - -### SHOULD FIX (merge only if urgent, otherwise fix) +### Severity Classification Rules -These issues should be fixed unless there's time pressure. Never mark as "suggestion": +Three tiers: BLOCKING (cannot merge — security, correctness, reliability), SHOULD FIX (fix unless urgent — patterns, tests, debugging), SUGGESTIONS (genuinely optional — style, naming, micro-optimizations). When in doubt, classify UP. 
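The three tiers and the "classify UP" rule can be sketched as a small function that follows the classification decision tree from this skill's severity rules. The `Finding` struct and its field names are hypothetical; a real review applies judgment rather than three booleans.

```go
package main

import "fmt"

// Severity is one of the three tiers named in the severity rules.
type Severity string

const (
	Blocking   Severity = "BLOCKING"
	ShouldFix  Severity = "SHOULD FIX"
	Suggestion Severity = "SUGGESTION"
)

// Finding is a hypothetical summary of the answers a reviewer gives to the
// three questions in the decision tree.
type Finding struct {
	RiskToCorrectness bool // security, correctness, or reliability risk
	ViolatesPatterns  bool // breaks established patterns or adds maintenance burden
	PurelyStylistic   bool // stylistic or preferential only
}

// Classify walks the decision tree top to bottom. The final fallthrough
// returns ShouldFix, which implements "when in doubt, classify UP".
func Classify(f Finding) Severity {
	switch {
	case f.RiskToCorrectness:
		return Blocking
	case f.ViolatesPatterns:
		return ShouldFix
	case f.PurelyStylistic:
		return Suggestion
	default:
		return ShouldFix
	}
}

func main() {
	fmt.Println(Classify(Finding{RiskToCorrectness: true}))
	fmt.Println(Classify(Finding{PurelyStylistic: true}))
	fmt.Println(Classify(Finding{}))
}
```

The order of the cases matters: a finding that is both risky and stylistic is still BLOCKING, because the risk question is asked first.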
-| Category | Examples | -|----------|----------| -| **Missing tests** | New code paths without test coverage, untested error conditions | -| **Unhelpful error messages** | Errors that don't include context for debugging (missing IDs, states, inputs) | -| **Pattern violations** | Inconsistent with established codebase patterns (but still functional) | -| **Performance in hot paths** | N+1 queries, unnecessary allocations in loops, missing indexes for frequent queries | -| **Deprecated API usage** | Using APIs marked for removal, outdated patterns with better alternatives | -| **Poor encapsulation** | Exposing internal state unnecessarily, breaking abstraction boundaries | +See `references/severity-classification.md` for full classification tables, the decision tree, and common misclassification examples. -### SUGGESTIONS (author's choice) +### Go-Specific Review Patterns -These are genuinely optional - author can reasonably decline: +Watch for patterns that linters miss: type export design, concurrency patterns (batch+callback, loop variable capture, commit callbacks), resource management (defer placement, connection pool reuse), metrics pre-initialization, testing deduplication, and unnecessary function extraction. -| Category | Examples | -|----------|----------| -| **Naming preferences** | Variable/function names that are adequate but could be clearer | -| **Comment additions** | Places where a comment would help but code is understandable | -| **Alternative approaches** | Different implementation that isn't clearly better | -| **Style not in CLAUDE.md** | Formatting preferences not codified in project standards | -| **Micro-optimizations** | Performance improvements in cold paths with no measurable impact | +For projects using shared organization libraries: check for manual SQL row iteration instead of helpers, incorrect assertion depth, raw `sql.Open()` in tests, dead migration files, and database-specific naming violations. 
-### Classification Decision Tree +See `references/go-review-patterns.md` for full checklists and red flags. -``` -Is there a security, correctness, or reliability risk? -|- YES -> BLOCKING -|- NO -> Does it violate established patterns or create maintenance burden? - |- YES -> SHOULD FIX - |- NO -> Is this purely stylistic or preferential? - |- YES -> SUGGESTION (or don't mention) - |- NO -> Re-evaluate: probably SHOULD FIX -``` +### Receiving Review Feedback -### Common Misclassifications to Avoid +When receiving feedback: read completely, restate requirement to confirm understanding, verify against codebase, evaluate technical soundness, respond with reasoning or just fix it. Never performative agreement. Apply YAGNI check before implementing "proper" features. Stop and clarify before implementing anything unclear — items may be related. -| Issue | Wrong | Correct | Why | -|-------|-------|---------|-----| -| Missing error check on `os.Open()` | SUGGESTION | BLOCKING | Resource leak + potential panic | -| No test for new endpoint | SUGGESTION | SHOULD FIX | Untested code is liability | -| Race condition in cache | NEEDS DISCUSSION | BLOCKING | Data corruption risk | -| Inconsistent naming | BLOCKING | SUGGESTION | No functional impact | -| Missing context in error | SUGGESTION | SHOULD FIX | Debugging nightmare | -| Unused import | BLOCKING | SHOULD FIX | Linter will catch, low impact | - ---- - -## Examples - -### Example 1: Pull Request Review -User says: "Review this PR" -Actions: -1. Read all changed files, map scope and dependencies (UNDERSTAND) -2. Run tests, verify claims in comments and PR description (VERIFY) -3. Evaluate security/performance/architecture risks (ASSESS) -4. Produce structured findings with severity and verdict (DOCUMENT) -Result: Structured review with clear verdict and rationale - -### Example 2: Pre-Merge Verification -User says: "Check this before we merge" -Actions: -1. 
Read all changes, identify breaking change potential (UNDERSTAND) -2. Run full test suite, verify backward compatibility claims (VERIFY) -3. Assess risk level for production deployment (ASSESS) -4. Document findings with APPROVE/REQUEST-CHANGES verdict (DOCUMENT) -Result: Go/no-go decision with evidence +See `references/receiving-feedback.md` for the full reception pattern, pushback examples, implementation order, and external vs internal reviewer handling. --- @@ -369,238 +278,24 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Reviewing Without Running Tests - -**What it looks like**: "I reviewed the code and it looks good. The logic seems correct." -**Why wrong**: Comments and code may not match actual behavior. Tests reveal edge cases not visible in code reading. Cannot verify claims without execution. -**Do instead**: Run tests in Phase 2. Show complete test output. Verify behavior matches claims. - -### Anti-Pattern 2: Accepting Comments as Truth - -**What it looks like**: Marking claims as verified just because a comment says so. -**Why wrong**: Comments frequently become outdated. Developers may not understand what "thread-safe" actually means. Claims need verification against actual code. -**Do instead**: Inspect code for every claim. Verify "thread-safe" means mutexes exist. Mark INVALID when comments lie. - -### Anti-Pattern 3: Skipping Phase Gates - -**What it looks like**: Reading 2 of 5 changed files and saying "I get the gist, moving to VERIFY..." -**Why wrong**: Missing context leads to incorrect conclusions. Dependencies between files may be missed. Cannot assess full impact without complete understanding. -**Do instead**: Read ALL changed files. Complete every gate before proceeding. - -### Anti-Pattern 4: Generic Security Checklist Without Context - -**What it looks like**: Checking all security boxes without explaining how vulnerabilities were ruled out. -**Why wrong**: Checkbox approach misses context-specific vulnerabilities. 
Gives false confidence without actual analysis. -**Do instead**: Explain HOW each vulnerability was ruled out. Mark N/A for irrelevant checks. Show evidence for findings. - ---- - -## Go-Specific Review Patterns - -When reviewing Go code, watch for these patterns that linters miss: - -### Type Export Design -- [ ] Are implementation types unnecessarily exported? -- [ ] Should types be unexported with only constructors exported? -- **Red flag**: `type FooStore struct{}` exported but only implements an interface - -### Concurrency Patterns -- [ ] Does batch+callback pattern protect against concurrent writes? -- [ ] Does `commit()` only remove specific items, not clear all? -- [ ] Are loop variables using outdated patterns? (Go 1.22+ doesn't need cloning) - - [ ] No `i := i` reassignment inside loops - - [ ] No closure arguments for loop variables: `go func(id int) { }(i)` -- **Red flag**: `s.events = nil` in commit callback -- **Red flag**: `go func(x int) { ... }(loopVar)` - closure argument unnecessary since Go 1.22 - -### Resource Management -- [ ] Is `defer f.Close()` placed AFTER error check? -- [ ] Are database connection pools shared, not duplicated? -- [ ] Is file traversal done once, not repeated for size calculation? -- **Red flag**: `defer f.Close()` immediately after `os.OpenFile()` - -### Metrics & Observability -- [ ] Are Prometheus counter metrics pre-initialized with `.Add(0)`? -- [ ] Are all known label combinations initialized at startup? -- **Red flag**: CounterVec registered but not initialized - -### Testing Patterns -- [ ] Are interface implementation tests deduplicated? -- [ ] Do tests use `assert.Equal` (no reflection) for comparable types? -- [ ] Does test setup use `prometheus.NewPedanticRegistry()`? -- **Red flag**: Copy-pasted tests for FileStore, MemoryStore, SQLStore - -### Code Organization -- [ ] Is function extraction justified (reuse or complexity hiding)? -- [ ] Are unnecessary helper functions wrapping stdlib calls? 
-- **Red flag**: Helper that just calls through to another function - -## Organization Library Ecosystem Patterns - -When reviewing projects that use shared organization libraries, apply these additional checks: - -### Library Usage -- [ ] Are optional fields using the organization's preferred option type? -- [ ] Is SQL iteration using helper functions instead of manual `rows.Next()` loops? -- [ ] Are tests using the organization's assertion helpers? -- **Red flag**: Manual SQL row iteration with defer/Next/Scan/Err pattern when helpers exist - -### Test Assertions -- [ ] Is the correct assertion function used for the type being compared? -- [ ] Is deep comparison only used for non-comparable types (slices, maps, structs)? -- **Red flag**: Deep comparison used for simple types like int, string, bool - -### Test Infrastructure -- [ ] Are DB tests using the organization's test database helpers? -- [ ] Are Prometheus tests using `NewPedanticRegistry()`? -- **Red flag**: Raw `sql.Open()` in test setup instead of test helpers - -### Dead Code -- [ ] Are there leftover `*_migration.sql` files without usage? -- [ ] Are there helper functions that just wrap single stdlib calls? -- [ ] Are there redundant checks (e.g., empty string check before regex)? -- **Red flag**: Wrapper functions that add no value over the underlying call - -### Database Naming -- [ ] Do functions using database-specific syntax indicate this in names? -- **Red flag**: Generic `SQLStoreFactory` that uses database-specific syntax - ---- - -## Receiving Review Feedback - -When YOU are the one receiving code review feedback (not giving it), apply these patterns: - -### The Reception Pattern - -``` -WHEN receiving code review feedback: - -1. READ: Complete feedback without reacting -2. UNDERSTAND: Restate requirement in own words (or ask) -3. VERIFY: Check against codebase reality -4. EVALUATE: Technically sound for THIS codebase? -5. RESPOND: Technical acknowledgment or reasoned pushback -6. 
IMPLEMENT: One item at a time, test each -``` - -### No Performative Agreement - -**NEVER:** -- "You're absolutely right!" -- "Great point!" / "Excellent feedback!" -- "Thanks for catching that!" - -**INSTEAD:** -- Restate the technical requirement -- Ask clarifying questions -- Push back with technical reasoning if wrong -- Just start working (actions > words) - -**When feedback IS correct:** -``` -"Fixed. [Brief description of what changed]" -"Good catch - [specific issue]. Fixed in [location]." -[Just fix it and show in the code] -``` - -### YAGNI Check for "Professional" Features - -``` -IF reviewer suggests "implementing properly": - grep codebase for actual usage - - IF unused: "This endpoint isn't called. Remove it (YAGNI)?" - IF used: Then implement properly -``` - -### Handling Unclear Feedback - -``` -IF any item is unclear: - STOP - do not implement anything yet - ASK for clarification on unclear items - -WHY: Items may be related. Partial understanding = wrong implementation. -``` - -**Example:** -``` -Reviewer: "Fix items 1-6" -You understand 1,2,3,6. Unclear on 4,5. - -WRONG: Implement 1,2,3,6 now, ask about 4,5 later -RIGHT: "I understand items 1,2,3,6. Need clarification on 4 and 5 before proceeding." -``` - -### When to Push Back - -Push back when: -- Suggestion breaks existing functionality -- Reviewer lacks full context -- Violates YAGNI (unused feature) -- Technically incorrect for this stack -- Legacy/compatibility reasons exist - -**How to push back:** -- Use technical reasoning, not defensiveness -- Ask specific questions -- Reference working tests/code - -**Example:** -``` -Reviewer: "Remove legacy code" -WRONG: "You're absolutely right! Let me remove that..." -RIGHT: "Checking... build target is 10.15+, this API needs 13+. Need legacy for backward compat. Fix bundle ID or drop pre-13 support?" -``` - -### Implementation Order - -``` -FOR multi-item feedback: - 1. Clarify anything unclear FIRST - 2. 
Then implement in this order: - - Blocking issues (breaks, security) - - Simple fixes (typos, imports) - - Complex fixes (refactoring, logic) - 3. Test each fix individually - 4. Verify no regressions -``` - -### External vs Internal Reviewers - -**From external reviewers:** -``` -BEFORE implementing: - 1. Check: Technically correct for THIS codebase? - 2. Check: Breaks existing functionality? - 3. Check: Reason for current implementation? - 4. Check: Does reviewer understand full context? - -IF suggestion seems wrong: - Push back with technical reasoning - -IF can't easily verify: - Say so: "I can't verify this without [X]. Should I investigate/proceed?" -``` - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Anti-Rationalization (Review)](../shared-patterns/anti-rationalization-review.md) - Review-specific rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Severity Classification](../shared-patterns/severity-classification.md) - Issue severity definitions +### Examples -### Domain-Specific Anti-Rationalization +#### Example 1: Pull Request Review +User says: "Review this PR" +Actions: +1. Read CLAUDE.md, then read all changed files, map scope and dependencies (UNDERSTAND) +2. Run tests, verify claims in comments and PR description (VERIFY) +3. Evaluate security/performance/architecture risks (ASSESS) +4. 
Produce structured findings with severity and verdict (DOCUMENT) +Result: Structured review with clear verdict and rationale -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Code looks correct to me" | Visual inspection misses runtime issues | Run tests in Phase 2 | -| "Comment says it's thread-safe" | Comments can be wrong or outdated | Verify claims against code | -| "Only a small change" | Small changes cause large regressions | Complete all 4 phases | -| "Tests pass, ship it" | Tests may not cover the changed paths | Verify coverage of changes | +#### Example 2: Pre-Merge Verification +User says: "Check this before we merge" +Actions: +1. Read CLAUDE.md, then read all changes, identify breaking change potential (UNDERSTAND) +2. Run full test suite, verify backward compatibility claims (VERIFY) +3. Assess risk level for production deployment (ASSESS) +4. Document findings with APPROVE/REQUEST-CHANGES verdict (DOCUMENT) +Result: Go/no-go decision with evidence diff --git a/skills/systematic-code-review/references/go-review-patterns.md b/skills/systematic-code-review/references/go-review-patterns.md new file mode 100644 index 00000000..7ae1cc5d --- /dev/null +++ b/skills/systematic-code-review/references/go-review-patterns.md @@ -0,0 +1,71 @@ +# Go-Specific Review Patterns + +When reviewing Go code, watch for these patterns that linters miss: + +## Type Export Design +- [ ] Are implementation types unnecessarily exported? +- [ ] Should types be unexported with only constructors exported? +- **Red flag**: `type FooStore struct{}` exported but only implements an interface + +## Concurrency Patterns +- [ ] Does batch+callback pattern protect against concurrent writes? +- [ ] Does `commit()` only remove specific items, not clear all? +- [ ] Are loop variables using outdated patterns? 
(Go 1.22+ doesn't need cloning) + - [ ] No `i := i` reassignment inside loops + - [ ] No closure arguments for loop variables: `go func(id int) { }(i)` +- **Red flag**: `s.events = nil` in commit callback +- **Red flag**: `go func(x int) { ... }(loopVar)` - closure argument unnecessary since Go 1.22 + +## Resource Management +- [ ] Is `defer f.Close()` placed AFTER error check? +- [ ] Are database connection pools shared, not duplicated? +- [ ] Is file traversal done once, not repeated for size calculation? +- **Red flag**: `defer f.Close()` immediately after `os.OpenFile()` + +## Metrics & Observability +- [ ] Are Prometheus counter metrics pre-initialized with `.Add(0)`? +- [ ] Are all known label combinations initialized at startup? +- **Red flag**: CounterVec registered but not initialized + +## Testing Patterns +- [ ] Are interface implementation tests deduplicated? +- [ ] Do tests use `assert.Equal` (no reflection) for comparable types? +- [ ] Does test setup use `prometheus.NewPedanticRegistry()`? +- **Red flag**: Copy-pasted tests for FileStore, MemoryStore, SQLStore + +## Code Organization +- [ ] Is function extraction justified (reuse or complexity hiding)? +- [ ] Are unnecessary helper functions wrapping stdlib calls? +- **Red flag**: Helper that just calls through to another function + +--- + +# Organization Library Ecosystem Patterns + +When reviewing projects that use shared organization libraries, apply these additional checks: + +## Library Usage +- [ ] Are optional fields using the organization's preferred option type? +- [ ] Is SQL iteration using helper functions instead of manual `rows.Next()` loops? +- [ ] Are tests using the organization's assertion helpers? +- **Red flag**: Manual SQL row iteration with defer/Next/Scan/Err pattern when helpers exist + +## Test Assertions +- [ ] Is the correct assertion function used for the type being compared? +- [ ] Is deep comparison only used for non-comparable types (slices, maps, structs)? 
+- **Red flag**: Deep comparison used for simple types like int, string, bool + +## Test Infrastructure +- [ ] Are DB tests using the organization's test database helpers? +- [ ] Are Prometheus tests using `NewPedanticRegistry()`? +- **Red flag**: Raw `sql.Open()` in test setup instead of test helpers + +## Dead Code +- [ ] Are there leftover `*_migration.sql` files without usage? +- [ ] Are there helper functions that just wrap single stdlib calls? +- [ ] Are there redundant checks (e.g., empty string check before regex)? +- **Red flag**: Wrapper functions that add no value over the underlying call + +## Database Naming +- [ ] Do functions using database-specific syntax indicate this in names? +- **Red flag**: Generic `SQLStoreFactory` that uses database-specific syntax diff --git a/skills/systematic-code-review/references/receiving-feedback.md b/skills/systematic-code-review/references/receiving-feedback.md new file mode 100644 index 00000000..001528bc --- /dev/null +++ b/skills/systematic-code-review/references/receiving-feedback.md @@ -0,0 +1,116 @@ +# Receiving Review Feedback + +When YOU are the one receiving code review feedback (not giving it), apply these patterns: + +## The Reception Pattern + +``` +WHEN receiving code review feedback: + +1. READ: Complete feedback without reacting +2. UNDERSTAND: Restate requirement in own words (or ask) +3. VERIFY: Check against codebase reality +4. EVALUATE: Technically sound for THIS codebase? +5. RESPOND: Technical acknowledgment or reasoned pushback +6. IMPLEMENT: One item at a time, test each +``` + +## No Performative Agreement + +**NEVER:** +- "You're absolutely right!" +- "Great point!" / "Excellent feedback!" +- "Thanks for catching that!" + +**INSTEAD:** +- Restate the technical requirement +- Ask clarifying questions +- Push back with technical reasoning if wrong +- Just start working (actions > words) + +**When feedback IS correct:** +``` +"Fixed. 
[Brief description of what changed]" +"Good catch - [specific issue]. Fixed in [location]." +[Just fix it and show in the code] +``` + +## YAGNI Check for "Professional" Features + +``` +IF reviewer suggests "implementing properly": + grep codebase for actual usage + + IF unused: "This endpoint isn't called. Remove it (YAGNI)?" + IF used: Then implement properly +``` + +## Handling Unclear Feedback + +``` +IF any item is unclear: + STOP - do not implement anything yet + ASK for clarification on unclear items + +WHY: Items may be related. Partial understanding = wrong implementation. +``` + +**Example:** +``` +Reviewer: "Fix items 1-6" +You understand 1,2,3,6. Unclear on 4,5. + +WRONG: Implement 1,2,3,6 now, ask about 4,5 later +RIGHT: "I understand items 1,2,3,6. Need clarification on 4 and 5 before proceeding." +``` + +## When to Push Back + +Push back when: +- Suggestion breaks existing functionality +- Reviewer lacks full context +- Violates YAGNI (unused feature) +- Technically incorrect for this stack +- Legacy/compatibility reasons exist + +**How to push back:** +- Use technical reasoning, not defensiveness +- Ask specific questions +- Reference working tests/code + +**Example:** +``` +Reviewer: "Remove legacy code" +WRONG: "You're absolutely right! Let me remove that..." +RIGHT: "Checking... build target is 10.15+, this API needs 13+. Need legacy for backward compat. Fix bundle ID or drop pre-13 support?" +``` + +## Implementation Order + +``` +FOR multi-item feedback: + 1. Clarify anything unclear FIRST + 2. Then implement in this order: + - Blocking issues (breaks, security) + - Simple fixes (typos, imports) + - Complex fixes (refactoring, logic) + 3. Test each fix individually + 4. Verify no regressions +``` + +## External vs Internal Reviewers + +**From external reviewers:** +``` +BEFORE implementing: + 1. Check: Technically correct for THIS codebase? + 2. Check: Breaks existing functionality? + 3. Check: Reason for current implementation? + 4. 
Check: Does reviewer understand full context? + +IF suggestion seems wrong: + Push back with technical reasoning + +IF can't easily verify: + Say so: "I can't verify this without [X]. Should I investigate/proceed?" +``` diff --git a/skills/systematic-code-review/references/severity-classification.md b/skills/systematic-code-review/references/severity-classification.md new file mode 100644 index 00000000..0d085186 --- /dev/null +++ b/skills/systematic-code-review/references/severity-classification.md @@ -0,0 +1,65 @@ +# Severity Classification Rules + +**Guiding principle**: When in doubt, classify UP. It's better to require a fix and have the author push back than to let a real issue slip through as "optional." + +## BLOCKING (cannot merge without fixing) + +These issues MUST be fixed. Never mark these as "needs discussion" or "optional": + +| Category | Examples | +|----------|----------| +| **Security vulnerabilities** | Authentication bypass, injection (SQL/XSS/command), data exposure, secrets in code, missing authorization checks | +| **Test failures** | Any failing test, including pre-existing failures touched by the change | +| **Breaking changes** | API breaking without migration, backward incompatible changes without versioning | +| **Missing error handling** | Unhandled errors on network/filesystem/database operations, panics in production paths | +| **Race conditions** | Concurrent access without synchronization, data races | +| **Resource leaks** | Unclosed file handles, database connections, memory leaks in hot paths | +| **Logic errors** | Off-by-one errors, incorrect conditionals, wrong return values | + +## SHOULD FIX (merge only if urgent, otherwise fix) + +These issues should be fixed unless there's time pressure. 
Never mark as "suggestion": + +| Category | Examples | +|----------|----------| +| **Missing tests** | New code paths without test coverage, untested error conditions | +| **Unhelpful error messages** | Errors that don't include context for debugging (missing IDs, states, inputs) | +| **Pattern violations** | Inconsistent with established codebase patterns (but still functional) | +| **Performance in hot paths** | N+1 queries, unnecessary allocations in loops, missing indexes for frequent queries | +| **Deprecated API usage** | Using APIs marked for removal, outdated patterns with better alternatives | +| **Poor encapsulation** | Exposing internal state unnecessarily, breaking abstraction boundaries | + +## SUGGESTIONS (author's choice) + +These are genuinely optional - author can reasonably decline: + +| Category | Examples | +|----------|----------| +| **Naming preferences** | Variable/function names that are adequate but could be clearer | +| **Comment additions** | Places where a comment would help but code is understandable | +| **Alternative approaches** | Different implementation that isn't clearly better | +| **Style not in CLAUDE.md** | Formatting preferences not codified in project standards | +| **Micro-optimizations** | Performance improvements in cold paths with no measurable impact | + +## Classification Decision Tree + +``` +Is there a security, correctness, or reliability risk? +|- YES -> BLOCKING +|- NO -> Does it violate established patterns or create maintenance burden? + |- YES -> SHOULD FIX + |- NO -> Is this purely stylistic or preferential? 
+ |- YES -> SUGGESTION (or don't mention) + |- NO -> Re-evaluate: probably SHOULD FIX +``` + +## Common Misclassifications to Avoid + +| Issue | Wrong | Correct | Why | +|-------|-------|---------|-----| +| Missing error check on `os.Open()` | SUGGESTION | BLOCKING | Resource leak + potential panic | +| No test for new endpoint | SUGGESTION | SHOULD FIX | Untested code is liability | +| Race condition in cache | NEEDS DISCUSSION | BLOCKING | Data corruption risk | +| Inconsistent naming | BLOCKING | SUGGESTION | No functional impact | +| Missing context in error | SUGGESTION | SHOULD FIX | Debugging nightmare | +| Unused import | BLOCKING | SHOULD FIX | Linter will catch, low impact | diff --git a/skills/systematic-debugging/SKILL.md b/skills/systematic-debugging/SKILL.md index 172be374..5f6df029 100644 --- a/skills/systematic-debugging/SKILL.md +++ b/skills/systematic-debugging/SKILL.md @@ -35,48 +35,7 @@ routing: # Systematic Debugging Skill -## Operator Context - -This skill operates as an operator for systematic debugging workflows, configuring Claude's behavior for rigorous, evidence-based root cause analysis. It implements the **Iterative Refinement** architectural pattern — form hypothesis, test, refine, verify — with **Domain Intelligence** embedded in the debugging methodology. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before debugging -- **Over-Engineering Prevention**: Fix only the bug. 
No speculative improvements, no "while I'm here" changes -- **Reproduce First**: NEVER attempt fixes before creating reliable reproduction -- **No Random Changes**: Every modification must be based on evidence from isolation -- **Evidence Required**: Every hypothesis must be tested with concrete evidence -- **Verify Fixes**: Confirm fix works AND doesn't introduce regressions - -### Default Behaviors (ON unless disabled) -- **Minimal Reproduction**: Create smallest possible test case that shows bug -- **Bisection Strategy**: Use binary search to narrow down failure point -- **One Change at a Time**: Never make multiple changes simultaneously -- **Document Findings**: Log all observations, hypotheses, and test results -- **Related Issues Check**: Search for similar bugs in codebase and git history -- **Temporary File Cleanup**: Remove debug logs and profiling output at completion - -### Optional Behaviors (OFF unless enabled) -- **Regression Test Creation**: Write automated test for this specific bug -- **Git Bisect**: Use `git bisect` to find breaking commit -- **Performance Profiling**: Run profiler to identify bottlenecks -- **Database Query Analysis**: Use EXPLAIN for slow query debugging -- **Network Tracing**: Capture traffic for API debugging - -## What This Skill CAN Do -- Systematically find root causes through evidence-based investigation -- Create minimal reproductions that isolate the exact failure -- Distinguish between symptoms and root causes -- Verify fixes don't introduce regressions -- Document findings for future reference - -## What This Skill CANNOT Do -- Fix bugs without first reproducing them -- Make speculative changes without evidence -- Optimize performance (use performance-optimization-engineer instead) -- Refactor code (use systematic-refactoring instead) -- Skip any of the 4 phases - ---- +Evidence-based root cause analysis through four phases: Reproduce, Isolate, Identify, Verify. 
Every phase produces testable evidence before advancing to the next, because skipping phases leads to whack-a-mole debugging where symptoms get masked instead of causes getting fixed. ## Instructions @@ -84,7 +43,51 @@ This skill operates as an operator for systematic debugging workflows, configuri **Goal**: Establish consistent reproduction before attempting any fix. -**Step 1: Document the bug** +Read and follow the repository's CLAUDE.md before starting, because project-specific constraints (test commands, build steps, environment setup) affect how you reproduce and fix bugs. + +**Step 1: Check for prior knowledge** + +Search for `.debug-knowledge-base.md` in the project root. If it exists, search for keyword matches against the current symptoms (error messages, component names, behavioral descriptions). Matches are hypothesis candidates, not confirmed diagnoses — list them in Evidence with the note: "Prior resolution found — verify applicability before assuming same root cause." Also search for similar bugs in the codebase and git history, because the same bug pattern often recurs in related code paths. + +**Step 2: Create the debug session file** + +Create `.debug-session.md` at the start of every investigation, because debugging sessions lose all state on context reset and this file lets a new session resume without re-investigating eliminated causes. + +```markdown +# Debug Session: [Brief Description] + + +## Symptoms (IMMUTABLE — do not edit after initial capture) +- [Exact error message or behavior] +- [Environment: OS, language version, dependencies] +- [How discovered: test failure, user report, monitoring alert] + +## Reproduction Steps (IMMUTABLE — do not edit after initial capture) +1. [Step 1] +2. [Step 2] +3. 
[Expected vs actual result] + +## Evidence (APPEND-ONLY — add new entries, never remove or edit) +- [timestamp] [observation]: [what was found and where] + +## Eliminated Hypotheses (APPEND-ONLY — add new entries, never remove or edit) +- [timestamp] [hypothesis]: [evidence that refuted it] + +## Current Hypothesis (OVERWRITE — replace each iteration) +**Hypothesis**: [specific, testable statement] +**Supporting evidence**: [what points to this] +**Test plan**: [how to confirm or refute] + +## Next Action (OVERWRITE — replace each iteration) +[Exactly what to do next — specific enough that a new session can execute it cold] +``` + +The mutation rules matter: +- **IMMUTABLE** sections (Symptoms, Reproduction Steps): Write once at session start, never modify. These are ground truth. If they change, it's a different bug. +- **APPEND-ONLY** sections (Evidence, Eliminated Hypotheses): Add new entries, never remove or edit existing ones. Removing entries lets future sessions re-investigate dead ends. +- **OVERWRITE** sections (Current Hypothesis, Next Action): Replace on each iteration. Old values get captured in Evidence/Eliminated when tested. + +**Step 3: Document the bug** ```markdown ## Bug: [Brief Description] @@ -93,12 +96,13 @@ Actual: [What actually happens] Environment: [OS, language version, dependencies] ``` -**Step 2: Create minimal reproduction** -- Strip to essentials — remove unrelated code -- Use smallest dataset that shows the bug -- Isolate from external services where possible +**Step 4: Create minimal reproduction** + +Strip to the smallest possible test case that shows the bug — remove unrelated code, use the smallest dataset that triggers the failure, and isolate from external services where possible. Minimal reproduction matters because large reproductions hide the actual trigger behind noise, making isolation harder. 
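One way to find the smallest dataset that still triggers the failure is to remove inputs one at a time and keep each removal only if the bug still reproduces. The sketch below is illustrative only: `shrink_failing_input` and the `fails` callback are hypothetical names, not part of this skill. It assumes the reproduction can be wrapped in a function that returns `True` when the bug appears, and that the reproduction is deterministic.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def shrink_failing_input(
    items: Sequence[T], fails: Callable[[List[T]], bool]
) -> List[T]:
    """Greedily drop items while the failure still reproduces.

    `fails` is a hypothetical callback wrapping the reproduction:
    it returns True when the given input still triggers the bug.
    """
    current = list(items)
    if not fails(current):
        raise ValueError("input does not reproduce the failure")

    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if fails(candidate):
            # Item i was noise: the bug reproduces without it.
            current = candidate
        else:
            # Item i is needed to trigger the bug; keep it and move on.
            i += 1
    return current
```

For example, with ten records where the bug needs both record 3 and record 7, `shrink_failing_input(range(10), lambda xs: 3 in xs and 7 in xs)` returns `[3, 7]`. A single greedy pass like this is not guaranteed to find the global minimum when inputs interact in complicated ways, so treat the result as a starting point for isolation.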
+ +Do not skip reproduction even if you think you can see the bug in the code, because visual inspection misses edge cases, timing dependencies, and interaction effects that only surface when the code actually runs. -**Step 3: Verify consistency** +**Step 5: Verify consistency** Run reproduction **3 times**. If inconsistent, identify variables (timing, randomness, concurrency) and add controls to make it deterministic. @@ -119,13 +123,15 @@ Run reproduction **3 times**. If inconsistent, identify variables (timing, rando **Step 2: Binary search** -Test components in combinations to find minimal failing set: -- A alone → PASS/FAIL? -- A + B → PASS/FAIL? -- A + B + C → PASS/FAIL? +Use bisection to narrow down the failure point, because linear scanning wastes time on large codebases. Test components in combinations to find the minimal failing set: +- A alone -> PASS/FAIL? +- A + B -> PASS/FAIL? +- A + B + C -> PASS/FAIL? When adding a component causes failure, that component (or its interaction) contains the bug. +When needed, use `git bisect` to find the breaking commit — this is especially effective when the bug is a recent regression and the commit history is clean. + **Step 3: Trace execution path** Add targeted logging at decision points in the suspect component. Run and analyze: @@ -133,12 +139,26 @@ Add targeted logging at decision points in the suspect component. Run and analyz - What values are unexpected at critical points? - Are exceptions being caught silently? +When needed for domain-specific isolation: +- **Performance bugs**: Run a profiler to identify bottlenecks in the suspect code path. +- **Slow queries**: Use EXPLAIN to analyze database query execution plans. +- **API issues**: Capture network traffic to see what's actually being sent and received. + **Gate**: Identified smallest code path and input that reproduces the bug. Proceed only when gate passes. ### Phase 3: IDENTIFY **Goal**: Determine exact root cause through hypothesis testing. 
+Update `.debug-session.md` BEFORE taking any debugging action, not after — because if context resets mid-action, the file shows what was about to happen and what has already been ruled out. + +The workflow for each iteration: +1. Write your hypothesis and next action to the file +2. Execute the action +3. Append the result to Evidence (or Eliminated Hypotheses if refuted) +4. Update Current Hypothesis and Next Action for the next iteration +5. Repeat + **Step 1: Form hypothesis** ```markdown @@ -147,12 +167,16 @@ Evidence: [What observations support this] Test: [How to confirm or refute] ``` +Every hypothesis must be tested with concrete evidence, because "probably" is not "proven" — untested assumptions lead to fixes that mask symptoms while leaving root causes intact. + **Step 2: Test hypothesis** -Design a single, targeted experiment. Run it. Document result as CONFIRMED or REFUTED. +Design a single, targeted experiment. Make exactly one change per test, because multiple simultaneous changes make it impossible to determine which one had the effect. Run it. Document result as CONFIRMED or REFUTED. If REFUTED: Form new hypothesis based on what you learned. Return to Step 1. +Every modification must be based on evidence from isolation, because random changes — even educated guesses — can mask symptoms while leaving the root cause in place, and you cannot explain why a random fix works. + **Step 3: Inspect suspect code** Code inspection checklist: @@ -165,23 +189,23 @@ Code inspection checklist: **Step 4: Verify root cause with targeted fix** -Make the smallest possible change that addresses the identified cause. Test against reproduction. +Make the smallest possible change that addresses the identified cause. Fix only the bug — no speculative improvements, no "while I'm here" refactoring, because unrelated changes obscure what actually fixed the bug and introduce untested modifications. Test against reproduction. -**Gate**: Root cause identified with evidence. 
Targeted fix resolves the issue. Can explain WHY bug occurred. +**Gate**: Root cause identified with evidence. Targeted fix resolves the issue. Can explain WHY bug occurred. Proceed only when gate passes. ### Phase 4: VERIFY **Goal**: Confirm fix works and doesn't introduce regressions. -**Step 1**: Run original reproduction steps → all pass +**Step 1**: Run original reproduction steps -> all pass **Step 2**: Test edge cases (empty input, boundary values, null, maximum) -**Step 3**: Run full test suite → no regressions +**Step 3**: Run full test suite -> no regressions. Run the full suite, not just the specific test, because fixes that pass the target test but break related functionality are not fixes — "tests pass" for one test is not the same as "no regressions." **Step 4**: Test related functionality using similar patterns -**Step 5**: Create regression test (if optional behavior enabled) +**Step 5**: When warranted, create a regression test for this specific bug to prevent it from recurring: ```python def test_regression_[issue](): @@ -190,7 +214,9 @@ def test_regression_[issue](): assert result == expected ``` -**Step 6**: Document fix summary +**Step 6**: Clean up temporary artifacts — remove debug logs, profiling output, and any temporary instrumentation added during investigation, because leftover debug code in production is a maintenance hazard. + +**Step 7**: Document fix summary ```markdown ## Fix Summary @@ -201,9 +227,36 @@ Files: [modified files] Testing: reproduction passes, edge cases pass, full suite passes ``` +Document the root cause and fix clearly, because an unexplained `git commit -m "Fixed bug"` means the bug will reappear and no institutional knowledge is preserved. 
+ +**Step 8**: Record to knowledge base + +After Phase 4 passes, append an entry to `.debug-knowledge-base.md`: + +```markdown +## [Date] [Brief Description] +**Keywords**: [comma-separated terms: error messages, component names, symptom descriptions] +**Symptom**: [What was observed] +**Root Cause**: [What was actually wrong] +**Resolution**: [What fixed it] +**Files**: [Which files were involved] +``` + +This knowledge base is append-only and match-based (not a lookup table) because bugs are contextual — the same error message in different modules may have completely different root causes. The knowledge base accelerates hypothesis formation but does not replace the 4-phase process. + **Gate**: All verification steps pass. Fix is complete. ---- +### Resuming From a Context Reset + +When starting a debug session, check for an existing `.debug-session.md`: +1. Read the file completely +2. Do NOT re-investigate anything listed in Eliminated Hypotheses +3. Resume from the Current Hypothesis and Next Action +4. If Next Action was partially completed, verify its state before continuing + +### Analysis Paralysis Guard + +If 5+ consecutive Read/Grep/Glob calls occur without an Edit/Write/Bash action, STOP and explain what you are looking for and why before proceeding. Record the justification in `.debug-session.md` under Current Hypothesis — not just stated verbally — because this creates an audit trail of investigation decisions that survives context resets. ## Examples @@ -225,8 +278,6 @@ Actions: 4. Fix, test edge cases, verify no regressions (VERIFY) Result: Production fix with regression test ---- - ## Error Handling ### Error: "Cannot Reproduce Bug" @@ -240,9 +291,9 @@ Solution: ### Error: "Fix Breaks Other Tests" Cause: Tests relied on buggy behavior, or fix changed API contract Solution: -1. If tests expected buggy behavior → update tests -2. If fix exposed other bugs → apply 4-phase process to each -3. 
If API changed → restore compatibility or update all callers +1. If tests expected buggy behavior -> update tests +2. If fix exposed other bugs -> apply 4-phase process to each +3. If API changed -> restore compatibility or update all callers ### Error: "Root Cause Still Unclear After Isolation" Cause: Isolation not narrow enough, or multiple contributing factors @@ -252,160 +303,9 @@ Solution: 3. Use debugger to step through execution 4. Consult `references/debugging-patterns.md` for common patterns ---- - -## Anti-Patterns - -### Anti-Pattern 1: Fixing Without Reproducing -**What it looks like**: "Let me add better error handling" before seeing the actual error -**Why wrong**: Can't verify fix works, may fix wrong issue -**Do instead**: Complete Phase 1 first. Always. - -### Anti-Pattern 2: Random Changes Without Evidence -**What it looks like**: "Maybe if I change this timeout..." without data -**Why wrong**: May mask symptom while leaving root cause. Can't explain why it works. -**Do instead**: Form hypothesis → test → confirm/refute → iterate - -### Anti-Pattern 3: Multiple Changes at Once -**What it looks like**: Adding null check + fixing loop + wrapping in try/catch simultaneously -**Why wrong**: Can't determine which change fixed it. Introduces unnecessary code. -**Do instead**: One change, one test. Repeat until fixed. - -### Anti-Pattern 4: Insufficient Verification -**What it looks like**: "Specific test passes, ship it!" without running full suite -**Why wrong**: May have introduced regressions or missed edge cases -**Do instead**: Complete all Phase 4 steps before declaring done. - -### Anti-Pattern 5: Undocumented Root Cause -**What it looks like**: `git commit -m "Fixed bug"` with no explanation -**Why wrong**: Bug will reappear. No institutional knowledge preserved. -**Do instead**: Document root cause, fix, and create regression test. 
- ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I can see the bug, no need to reproduce" | Visual inspection misses edge cases | Run reproduction 3 times | -| "This is probably the fix" | Probably ≠ proven | Form hypothesis, test with evidence | -| "Tests pass, must be fixed" | Specific test ≠ full suite | Run full test suite | -| "Simple change, no need to verify" | Simple changes cause complex regressions | Complete Phase 4 | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/debugging-patterns.md`: Common bug patterns by category - `${CLAUDE_SKILL_DIR}/references/tools.md`: Language-specific debugging tools - `${CLAUDE_SKILL_DIR}/references/isolation-techniques.md`: Advanced isolation strategies - ---- - -## Persistent Debug File Protocol - -Debugging sessions lose all state on context reset — hypotheses, eliminated causes, evidence, and next actions vanish. This protocol creates a structured file that survives resets and lets a new session resume without re-investigating eliminated causes. - -### File: `.debug-session.md` - -Create this file at the start of every debug investigation. It has three section types with strict mutation rules: - -| Section Type | Sections | Mutation Rule | WHY | -|-------------|----------|---------------|-----| -| IMMUTABLE | Symptoms, Reproduction Steps | Write once at session start, never modify | These are the ground truth. If they change, it's a different bug. Editing them mid-investigation causes you to lose track of the original problem. 
| -| APPEND-ONLY | Evidence, Eliminated Hypotheses | Add new entries, never remove or edit existing ones | Removing entries lets future sessions re-investigate dead ends. The whole point is to accumulate knowledge, not revise it. | -| OVERWRITE | Current Hypothesis, Next Action | Replace on each iteration | These represent the live state of the investigation. Old values are captured in Evidence/Eliminated when they're tested. | - -### Template - -```markdown -# Debug Session: [Brief Description] - - -## Symptoms (IMMUTABLE — do not edit after initial capture) -- [Exact error message or behavior] -- [Environment: OS, language version, dependencies] -- [How discovered: test failure, user report, monitoring alert] - -## Reproduction Steps (IMMUTABLE — do not edit after initial capture) -1. [Step 1] -2. [Step 2] -3. [Expected vs actual result] - -## Evidence (APPEND-ONLY — add new entries, never remove or edit) -- [timestamp] [observation]: [what was found and where] - -## Eliminated Hypotheses (APPEND-ONLY — add new entries, never remove or edit) -- [timestamp] [hypothesis]: [evidence that refuted it] - -## Current Hypothesis (OVERWRITE — replace each iteration) -**Hypothesis**: [specific, testable statement] -**Supporting evidence**: [what points to this] -**Test plan**: [how to confirm or refute] - -## Next Action (OVERWRITE — replace each iteration) -[Exactly what to do next — specific enough that a new session can execute it cold] -``` - -### Critical Rule: Update BEFORE Acting - -Update `.debug-session.md` BEFORE taking any debugging action, not after. WHY: If context resets mid-action, the file shows what was about to happen and what has already been ruled out. A post-action update means a reset loses the most recent work. - -The workflow is: -1. Write your hypothesis and next action to the file -2. Execute the action -3. Append the result to Evidence (or Eliminated Hypotheses if refuted) -4. Update Current Hypothesis and Next Action for the next iteration -5. 
Repeat - -### Resuming From a Reset - -When starting a debug session, check for an existing `.debug-session.md`: -1. Read the file completely -2. Do NOT re-investigate anything listed in Eliminated Hypotheses -3. Resume from the Current Hypothesis and Next Action -4. If Next Action was partially completed, verify its state before continuing - ---- - -## Debug Knowledge Base - -Resolved debug sessions create compounding value when their findings are recorded for future investigations. This protocol maintains an append-only knowledge base of resolved bugs. - -### File: `.debug-knowledge-base.md` - -After resolving a bug (Phase 4 VERIFY passes), append an entry: - -```markdown -## [Date] [Brief Description] -**Keywords**: [comma-separated terms for matching: error messages, component names, symptom descriptions] -**Symptom**: [What was observed] -**Root Cause**: [What was actually wrong] -**Resolution**: [What fixed it] -**Files**: [Which files were involved] -``` - -### Lookup Protocol - -At the start of every new debug investigation (Phase 1: REPRODUCE), before forming any hypotheses: - -1. Check if `.debug-knowledge-base.md` exists in the project root -2. If it exists, search for keyword matches against the current symptom signature (error messages, component names, behavioral descriptions) -3. Matches are **hypothesis candidates**, not confirmed diagnoses — the same symptom can have different root causes in different contexts -4. List any matches in the Evidence section of `.debug-session.md` with the note: "Prior resolution found — verify applicability before assuming same root cause" - -WHY this is append-only and match-based (not a lookup table): Bugs are contextual. An "undefined is not a function" error in module A may have a completely different root cause than the same error in module B. The knowledge base accelerates hypothesis formation — it does not replace the 4-phase process. 
- ---- - -## Analysis Paralysis Guard - -This skill uses the [Analysis Paralysis Guard](../shared-patterns/analysis-paralysis-guard.md). -If 5+ consecutive Read/Grep/Glob calls occur without an Edit/Write/Bash action, -STOP and explain what you are looking for and why before proceeding. - -### Debugging-Specific Addition -- After explaining, justification for continued reading MUST be recorded in `.debug-session.md` under the Current Hypothesis section — not just stated verbally. This creates an audit trail of investigation decisions that survives context resets. diff --git a/skills/systematic-refactoring/SKILL.md b/skills/systematic-refactoring/SKILL.md index 94bfc3b4..7827f4cd 100644 --- a/skills/systematic-refactoring/SKILL.md +++ b/skills/systematic-refactoring/SKILL.md @@ -42,54 +42,21 @@ routing: # Systematic Refactoring Skill -## Purpose - Perform safe, verifiable refactoring through explicit phases. Each phase has gates that prevent common refactoring mistakes: breaking behavior, incomplete migrations, or orphaned code. -## Operator Context - -This skill operates as an operator for safe code refactoring, configuring Claude's behavior for incremental, verifiable changes. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution -- **Over-Engineering Prevention**: Only refactor what's directly requested. Keep changes minimal and focused. No speculative improvements or "while we're here" changes without explicit request. -- **NEVER change behavior without tests**: Characterization tests required before changes -- **NEVER make multiple changes at once**: One atomic change per commit -- **NEVER skip validation**: Tests must pass after every change -- **ALWAYS preserve external API**: Unless explicitly requested +## Instructions -### Default Behaviors (ON unless disabled) -- **Communication Style**: Report facts without self-congratulation. Show command output rather than describing it. 
Be concise but informative. -- **Temporary File Cleanup**: Remove temporary test files, debug outputs, or backup files created during refactoring at task completion. Keep only files explicitly needed. -- **Write characterization tests first**: Capture current behavior before changing -- **Incremental commits**: Commit at each stable point -- **Update all callers**: Find and update every reference -- **Remove dead code**: Clean up orphaned code after migration +Before starting any refactoring work, read and follow the repository's CLAUDE.md because it may contain project-specific conventions that affect how refactoring should be done (e.g., import ordering, naming conventions, test commands). -### Optional Behaviors (OFF unless enabled) -- **Performance benchmarks**: Compare before/after performance -- **Documentation updates**: Auto-update docs for API changes -- **Type migration**: Update type definitions across codebase +Only refactor what is directly requested. Do not add speculative improvements or "while we're here" changes because scope creep during refactoring makes failures harder to diagnose and rollbacks harder to execute. If a bug is discovered during refactoring, finish the refactoring first and address the bug in a separate commit because mixing structural changes with behavioral changes makes it impossible to tell which caused a test failure. 
-## What This Skill CAN Do -- Safely rename functions, variables, files across a codebase -- Extract code into new modules with caller migration -- Change function signatures with incremental migration -- Restructure directory layouts preserving all behavior -- Consolidate duplicate code with test verification +### Phase 1: CHARACTERIZE -## What This Skill CANNOT Do -- Fix bugs (use systematic-debugging instead) -- Add new features (use workflow-orchestrator instead) -- Make multiple changes simultaneously without testing between each -- Skip characterization tests before modifying code -- Leave incomplete migrations (old code alongside new) +**Goal**: Establish a test safety net that proves current behavior before any code is touched. -## Systematic Phases +**Gate**: Tests exist that verify current behavior. Do not proceed to Phase 2 until all characterization tests pass. -### Phase 1: CHARACTERIZE (Do NOT proceed without test coverage) - -**Gate**: Tests exist that verify current behavior. +Write characterization tests before making any changes because refactoring without tests is flying blind -- you have no proof that behavior was preserved. Even for a "small rename," grep for all references including string literals, config files, and dynamic lookups because small renames break string refs and configs that static analysis misses. ``` ═══════════════════════════════════════════════════════════════ @@ -127,14 +94,18 @@ This skill operates as an operator for safe code refactoring, configuring Claude **Actions in this phase:** 1. Read the code to be refactored completely -2. Find all callers with Grep +2. Find all callers with Grep -- check string literals, config files, and reflection-based references in addition to direct code references 3. Run existing tests, note coverage gaps 4. Write characterization tests for uncovered behavior 5. 
Verify all tests pass -### Phase 2: PLAN (Do NOT proceed without incremental steps defined) +### Phase 2: PLAN + +**Goal**: Define a sequence of atomic changes, each independently testable, with rollback points. -**Gate**: Clear sequence of atomic changes with rollback points. +**Gate**: Clear sequence of atomic changes with rollback points defined. Do not proceed to Phase 3 until every step is small enough to be a single commit. + +Break the work into the smallest possible atomic changes because one large commit touching many files makes it impossible to bisect which change caused an issue and guarantees merge conflicts in active codebases. Every step must preserve the external API unless the user explicitly requested an API change, because callers outside the codebase may depend on the current interface. ``` ═══════════════════════════════════════════════════════════════ @@ -179,9 +150,15 @@ This skill operates as an operator for safe code refactoring, configuring Claude 4. Define rollback procedure for each step 5. Estimate risk level for each step -### Phase 3: EXECUTE (One step at a time, tests between each) +### Phase 3: EXECUTE + +**Goal**: Apply each planned change one at a time, running the full test suite after every step. + +**Gate**: Tests pass after each atomic change. If any test fails, stop, rollback, and investigate before proceeding. -**Gate**: Tests pass after each atomic change. +Make exactly one atomic change per step because multiple simultaneous changes make it impossible to isolate which change broke a test. Run the full test suite after each step, not just the tests you think are relevant, because refactoring should never change behavior and the full suite is your proof. Commit at each stable point so that every commit represents a working state and you can bisect or rollback to any intermediate point. 
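The one-change-per-step loop described above can be sketched in a few lines. This is a minimal illustration, not the skill's actual tooling: `execute_steps` and its four callables are hypothetical names, and how each callable is wired to a test runner or to version control is left to the caller.

```python
from typing import Callable, Sequence


def execute_steps(
    steps: Sequence[Callable[[], None]],
    run_full_suite: Callable[[], bool],
    commit: Callable[[int], None],
    rollback: Callable[[int], None],
) -> int:
    """Apply steps one at a time and return how many were committed.

    Stops at the first step whose full test run fails, after rolling it back.
    All four callables are hypothetical hooks supplied by the caller.
    """
    for index, apply_step in enumerate(steps):
        apply_step()                 # exactly one atomic change
        if not run_full_suite():     # full suite, not a hand-picked subset
            rollback(index)          # restore the last committed state
            return index             # stop and investigate before continuing
        commit(index)                # every commit represents a working state
    return len(steps)
```

Because a failing step is rolled back before the function returns, the return value doubles as the index of the step that needs investigation.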
+ +When updating callers, migrate every single reference before removing old code because leaving old code alongside new "for backward compatibility" means code exists in two places indefinitely, future changes require double updates, and the old code becomes stale and buggy. Use Grep exhaustively to confirm zero remaining references at each migration step. ``` ═══════════════════════════════════════════════════════════════ @@ -218,9 +195,15 @@ This skill operates as an operator for safe code refactoring, configuring Claude 4. If fail: rollback, investigate, fix, retry 5. Repeat until all steps complete -### Phase 4: VALIDATE (Do NOT mark complete until verified) +### Phase 4: VALIDATE + +**Goal**: Confirm the entire refactoring preserved behavior, left no dead code, and updated all references. -**Gate**: All original tests pass, no dead code, all callers updated. +**Gate**: All original tests pass, no dead code remains, all callers are updated. Do not mark complete until every check passes. + +Show command output directly rather than describing results because evidence is more trustworthy than summary. Grep for old names across the entire codebase including strings, comments, and config files to confirm zero remaining references. Check for dynamic references such as reflection and string-based lookups that static analysis may miss. + +Remove any temporary test files, debug outputs, or backup files created during refactoring because they are noise that obscures the actual changes. 
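As a rough sketch of the "zero remaining references" check, the helper below does a plain substring search over every readable text file, so it also reports hits in string literals, comments, and config files. The function name `find_remaining_references` is an assumption made for illustration. It cannot see names that are assembled at runtime, so dynamic lookups still need a manual review.

```python
from pathlib import Path


def find_remaining_references(root: str, old_name: str) -> list[tuple[str, int, str]]:
    """Return (path, line_number, line) for every occurrence of old_name under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and anything inside version-control metadata
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if old_name in line:
                hits.append((str(path), number, line.strip()))
    return hits
```

An empty result is the evidence to show at this gate. Any hit means the migration is not finished.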
``` ═══════════════════════════════════════════════════════════════ @@ -241,6 +224,7 @@ This skill operates as an operator for safe code refactoring, configuring Claude - [ ] No unused imports - [ ] No dead references - [ ] Old names completely removed + - [ ] Temporary files removed Caller Verification: $ grep -r "[old_name]" --include="*.ext" @@ -262,9 +246,9 @@ This skill operates as an operator for safe code refactoring, configuring Claude ═══════════════════════════════════════════════════════════════ ``` -## Refactoring Patterns +## Reference Material -### Pattern 1: Rename (Function, Variable, File) +### Pattern: Rename (Function, Variable, File) ``` Phase 1: Find all usages with Grep @@ -273,7 +257,7 @@ Phase 3: Execute with replace_all where safe Phase 4: Verify no old name references remain ``` -### Pattern 2: Extract (Function, Module, Class) +### Pattern: Extract (Function, Module, Class) ``` Phase 1: Identify code to extract, write tests @@ -285,7 +269,7 @@ Phase 3: Phase 4: Verify all callers use new location ``` -### Pattern 3: Inline (Remove Abstraction) +### Pattern: Inline (Remove Abstraction) ``` Phase 1: Find all usages, understand all variations @@ -297,7 +281,7 @@ Phase 3: Phase 4: Verify no remaining references ``` -### Pattern 4: Change Signature +### Pattern: Change Signature ``` Phase 1: Find all callers, understand usage patterns @@ -327,102 +311,6 @@ Phase 4: Verify all callers use new signature - Add new dependencies to plan - May need to add intermediate steps -## Common Anti-Patterns - -### Anti-Pattern 1: Big Bang Refactoring - -**What it looks like:** -``` -User: "Rename getUserData to fetchUserProfile across the entire codebase" -Claude: *Changes 47 files in one commit, updating function name, all callers, tests, and docs* -``` - -**Why it's wrong:** -- One test failure breaks everything -- Impossible to bisect which change caused issues -- No rollback points if problems discovered later -- Merge conflicts guaranteed in active 
codebases - -**Do this instead:** -1. CHARACTERIZE: Write tests for current getUserData behavior -2. PLAN: Break into steps (add new function, migrate callers gradually, remove old) -3. EXECUTE: Commit after each atomic change (5-10 callers at a time) -4. VALIDATE: Tests pass after every step - -### Anti-Pattern 2: Refactoring Without Tests First - -**What it looks like:** -``` -User: "Extract this logic into a new function" -Claude: *Immediately creates new function and updates callers without writing tests* -``` - -**Why it's wrong:** -- No verification that behavior is preserved -- Silent bugs introduced during extraction -- Can't prove refactoring was safe -- No baseline to compare against - -**Do this instead:** -1. CHARACTERIZE: Write tests for current behavior BEFORE touching code -2. Run tests: Verify they pass with current implementation -3. Make change: Extract function -4. Run tests again: Verify same results -5. Tests are your proof of correctness - -### Anti-Pattern 3: Incomplete Migration - -**What it looks like:** -``` -User: "Move getUser from utils.js to user-service.js" -Claude: *Creates new location, updates 80% of callers, leaves old function "for backward compatibility"* -``` - -**Why it's wrong:** -- Code exists in two places indefinitely -- Future changes need double updates -- Confusion about which to use -- Old code becomes stale and buggy - -**Do this instead:** -1. PLAN: Identify ALL callers upfront (use Grep exhaustively) -2. EXECUTE: Update every single caller before removing old code -3. VALIDATE: Grep confirms ZERO references to old location -4. Clean up: Remove old code completely -5. 
No half-migrated state allowed - -### Anti-Pattern 4: Mixing Refactoring with Feature Work - -**What it looks like:** -``` -User: "Rename calculateTotal and also fix the tax calculation bug" -Claude: *Renames function AND changes logic in same refactoring* -``` - -**Why it's wrong:** -- Can't tell if tests fail due to rename or logic change -- Violates "preserve behavior" principle -- Impossible to review as pure refactoring -- Rollback becomes unclear - -**Do this instead:** -1. Separate concerns: "This is ONLY refactoring, behavior unchanged" -2. Complete refactoring first: Rename with tests proving no behavior change -3. Then fix bug: In separate phase with new tests for fixed behavior -4. Two commits: One refactor (safe), one fix (behavior change) - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) - Phase transition rules - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "This refactoring is safe, no tests needed" | Refactoring without tests is flying blind | Write characterization tests first | -| "I'll update the remaining callers later" | Incomplete migrations rot forever | Migrate ALL callers before removing old code | -| "Small rename, no need for full process" | Small renames break string refs and configs | Grep for all references including strings | -| "I can fix this bug while refactoring" | Mixed concerns make failures undiagnosable | Separate commits: refactor then fix | +- `pipelines/systematic-refactoring/SKILL.md` - Phase-gated pipeline version diff --git a/skills/taxonomy-manager/SKILL.md b/skills/taxonomy-manager/SKILL.md index c1675ca7..7b1c460a 100644 --- 
a/skills/taxonomy-manager/SKILL.md +++ b/skills/taxonomy-manager/SKILL.md @@ -26,56 +26,17 @@ routing: # Taxonomy Manager Skill -## Operator Context - -This skill operates as an operator for taxonomy management workflows on Hugo-based blogs, configuring Claude's behavior for consistent, SEO-friendly categorization. It implements the **Scan-Analyze-Report-Act** architectural pattern with preview-first safety and build verification gates. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution -- **Over-Engineering Prevention**: Fix actionable taxonomy issues only; no elaborate classification redesigns without explicit request -- **Preview-First Workflow**: Always show current state and proposed changes before modifying any files -- **Case Normalization**: Hugo taxonomies are case-sensitive; standardize all terms to `lowercase-with-hyphens` -- **Non-Destructive Operations**: Never delete or rename taxonomy terms without explicit user confirmation -- **Build Verification**: Run `hugo --quiet` after every batch of changes to confirm site still builds - -### Default Behaviors (ON unless disabled) -- **Complete Output**: Show full taxonomy audit with visual charts, never summarize -- **Similarity Detection**: Flag potentially duplicate tags (case variations, plurals, synonyms) -- **Orphan Detection**: Identify tags used only once or categories with no posts -- **Confirmation Required**: Require explicit confirmation before modifying any content files -- **One Operation at a Time**: Apply merge/rename/add/remove operations individually, verify between each - -### Optional Behaviors (OFF unless enabled) -- **Auto-Apply**: Apply suggested changes without per-file confirmation -- **Batch Mode**: Process all detected issues in a single pass -- **Aggressive Merge**: Merge all similar tags automatically (requires explicit opt-in) - -## What This Skill CAN Do -- Audit all categories and tags across a Hugo 
site's content directory -- Count posts per taxonomy term and generate visual usage statistics -- Detect orphan tags (single-use), empty categories, case variations, and synonym pairs -- Suggest and execute merge, rename, add, and remove operations on taxonomy terms -- Show before/after diffs for every proposed change -- Verify Hugo build integrity after modifications - -## What This Skill CANNOT Do -- Create new categories without explicit instruction (taxonomy design is intentional) -- Auto-merge tags without review (similar tags may have distinct meanings) -- Modify post content beyond front matter (preserves author's voice) -- Guarantee SEO improvements (taxonomy is one factor among many) -- Skip confirmation or build verification gates - ---- - ## Instructions +This skill audits and maintains blog taxonomy on Hugo-based sites using the **Scan-Analyze-Report-Act** workflow. Apply this skill when users ask to audit tags, fix inconsistencies, or consolidate taxonomy terms. + ### Phase 1: SCAN - Collect Taxonomy Data **Goal**: Build a complete index of all taxonomy terms and their usage. **Step 1: Identify all content files** -Locate every Markdown file in the Hugo content directory. +Locate every Markdown file in the Hugo content directory (because Hugo requires explicit file discovery to avoid missing nested structures). ```bash find /path/to/content -name "*.md" -type f | sort @@ -90,7 +51,7 @@ For each file, parse the YAML front matter and extract: **Step 3: Build taxonomy index** -Construct an in-memory mapping of every taxonomy term to its list of posts: +Construct an in-memory mapping of every taxonomy term to its list of posts (because this index is the foundation for all downstream analysis): ``` CATEGORIES: @@ -104,7 +65,7 @@ TAGS: **Step 4: Check Hugo taxonomy configuration** -Read `hugo.toml` (or `config.toml`) for any custom taxonomy definitions or overrides. 
+Read `hugo.toml` (or `config.toml`) for any custom taxonomy definitions or overrides (because Hugo may define non-standard taxonomies beyond categories/tags). **Gate**: Taxonomy index is complete with all terms mapped to their posts. Proceed only when gate passes. @@ -160,11 +121,12 @@ Present the report to the user. If no issues are found, state the taxonomy is he ### Phase 4: ACT - Apply Changes -**Goal**: Execute approved taxonomy modifications safely. +**Goal**: Execute approved taxonomy modifications safely and verify correctness. **Step 1: Preview every change** -Before any file modification, show: +Before any file modification, show the change in diff format (because previewing prevents accidental mass edits that break navigation): + ``` File: content/posts/example.md Current tags: ["Hugo", "debugging", "templates"] @@ -174,11 +136,12 @@ File: content/posts/example.md **Step 2: Get confirmation** -Wait for explicit user approval before proceeding. +Wait for explicit user approval before proceeding (because taxonomy modifications affect site navigation and SEO). **Step 3: Apply operations** -Execute the approved operation (merge, rename, add, or remove). See `references/consolidation-rules.md` for operation semantics: +Execute the approved operation (merge, rename, add, or remove). See `references/consolidation-rules.md` for operation semantics. Apply operations individually, not in batches, to isolate errors: + - **Merge**: Replace source tag(s) with target in all posts; skip if post already has target - **Rename**: Replace old name with new in all posts - **Add**: Add tag to matching posts (skip if already present) @@ -186,6 +149,8 @@ Execute the approved operation (merge, rename, add, or remove). 
See `references/ **Step 4: Verify build** +After each operation, run Hugo to confirm the site still builds (because taxonomy modifications can break Hugo's site generation): + ```bash hugo --quiet ``` @@ -203,6 +168,17 @@ git diff content/ **Gate**: All changes applied, build verified, diff reviewed. Operation complete. +### Taxonomy Formatting Standards + +Apply these constraints throughout all phases because Hugo is case-sensitive and requires consistent formatting: + +- **Case Normalization**: Standardize all terms to `lowercase-with-hyphens` (not PascalCase, UPPERCASE, or spaces). Hugo treats `Hugo`, `hugo`, and `HUGO` as three separate tags, each with its own page. +- **Non-Destructive Operations**: Never delete or rename taxonomy terms without explicit user confirmation (because accidental deletions create orphan link targets in content). +- **Complete Output**: Show full taxonomy audit with visual charts, never summarize (because partial audits miss subtle issues like orphan tags or synonyms). +- **Similarity Detection**: Flag potentially duplicate tags including case variations, plurals, and synonyms (because similar tags fragment navigation and SEO value). +- **Orphan Detection**: Identify tags used only once or categories with no posts (because orphans accumulate into taxonomy debt). +- **Confirmation Required**: Require explicit confirmation before modifying any content files (because bulk edits without review create unintended side effects). 
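The case-normalization and similarity rules above could be approximated as follows. This is a sketch under stated assumptions: `normalize_term` and `find_variants` are hypothetical helpers, not part of the skill's scripts. Normalization here only lowercases and hyphenates, so it does not split PascalCase words, and it catches case and separator variants only. Plurals and synonyms would need separate handling.

```python
import re
from collections import defaultdict


def normalize_term(term: str) -> str:
    """Convert a taxonomy term to lowercase-with-hyphens."""
    # Collapse runs of whitespace, underscores, and hyphens into single hyphens
    return re.sub(r"[\s_\-]+", "-", term.strip()).lower().strip("-")


def find_variants(tags):
    """Group tags that normalize to the same term; keep groups with 2+ spellings."""
    groups = defaultdict(set)
    for tag in tags:
        groups[normalize_term(tag)].add(tag)
    return {norm: sorted(variants) for norm, variants in groups.items() if len(variants) > 1}
```

Each returned group is a candidate for a merge or rename, to be previewed and confirmed by the user rather than applied automatically.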
+ --- ## Examples @@ -270,45 +246,8 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Tag Explosion -**What it looks like**: Creating `hugo-themes`, `hugo-templates`, `hugo-debugging`, `hugo-config` instead of using `hugo` + `themes` -**Why wrong**: Fragments the taxonomy; users browsing "hugo" miss related posts scattered across subtags -**Do instead**: Use broader tags in combination (`hugo` + `debugging`) and let post content provide specificity - -### Anti-Pattern 2: Case Inconsistency -**What it looks like**: `Hugo` in one post, `hugo` in another, `HUGO` in a third -**Why wrong**: Hugo treats these as separate tags, creating three different tag pages each showing a fraction of the posts -**Do instead**: Standardize all terms to `lowercase-with-hyphens`; run this skill regularly to catch drift - -### Anti-Pattern 3: Single-Use Tags as Description -**What it looks like**: `fixing-hugo-template-rendering-issues` as a tag on a post titled "Fixing Hugo Template Rendering Issues" -**Why wrong**: The tag duplicates the title, will never be reused, and provides zero navigation value -**Do instead**: Use generic, reusable tags: `hugo`, `templates`, `debugging` - -### Anti-Pattern 4: Merging Without Understanding -**What it looks like**: Automatically merging `debugging` and `troubleshooting` because they seem similar -**Why wrong**: Near-synonyms may carry distinct connotations in context; aggressive merging can lose meaningful distinctions -**Do instead**: Present similar terms to the user with usage context; let them decide which to keep - ---- - ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | 
-|-----------------|----------------|-----------------| -| "These tags are obviously the same" | Similar does not mean identical; context matters | Show usage examples, let user decide | -| "Just a quick rename, no need to preview" | Renames can break Hugo taxonomy pages | Always preview, always verify build | -| "One orphan tag doesn't matter" | Orphans accumulate into taxonomy debt | Address during audit or document exception | -| "The build passed, taxonomy must be fine" | Build success does not validate semantic correctness | Review the diff, confirm navigation works | - ### Reference Files - `${CLAUDE_SKILL_DIR}/references/taxonomy-guidelines.md`: Naming conventions, category/tag best practices, maintenance cadence - `${CLAUDE_SKILL_DIR}/references/consolidation-rules.md`: When and how to merge, rename, add, or remove terms with priority matrix diff --git a/skills/test-driven-development/SKILL.md b/skills/test-driven-development/SKILL.md index 18ef341b..4f057732 100644 --- a/skills/test-driven-development/SKILL.md +++ b/skills/test-driven-development/SKILL.md @@ -34,145 +34,150 @@ routing: # Test-Driven Development (TDD) Skill -## Purpose -Enforce the RED-GREEN-REFACTOR cycle for all code changes. This skill ensures tests are written BEFORE implementation code, verifies tests fail for the right reasons, and maintains test coverage through disciplined development cycles. +Enforce the RED-GREEN-REFACTOR cycle for all code changes. Tests are written before implementation code, verified to fail for the right reasons, and maintained through disciplined development cycles. -## Operator Context +## Instructions -This skill operates as an operator for test-driven development workflows, configuring Claude's behavior for disciplined test-first coding practices. +Before starting any TDD cycle, read and follow repository CLAUDE.md files. 
Project instructions override default TDD behaviors because local conventions (test frameworks, directory layout, naming) vary across codebases. -### Hardcoded Behaviors (Always Apply) +### Phase 1: Write a Failing Test (RED) -These behaviors are non-negotiable for correct TDD practice: +The test MUST exist and fail before any implementation code is written, because seeing the test fail first proves it can actually detect the bug or missing feature. A test that has never been seen failing provides no evidence that it tests anything meaningful. -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before execution. Project instructions override default TDD behaviors. -- **Over-Engineering Prevention**: Only implement what's directly tested. Keep code simple and focused. No speculative features or flexibility that wasn't asked for. First make it work, then make it right. -- **RED phase is mandatory**: ALWAYS write the test BEFORE any implementation code -- **Verify test failure**: MUST run test and show failure output before implementing -- **Failure reason validation**: MUST confirm test fails for the CORRECT reason (not syntax errors) -- **Show complete output**: NEVER summarize test results - show full test runner output -- **Minimum implementation**: Write ONLY enough code to make the test pass (no gold-plating) -- **Commit discipline**: Tests and implementation committed together in atomic units +**Steps:** -### Default Behaviors (ON unless disabled) +1. **Understand the requirement** -- clarify what behavior needs to be implemented +2. **Write the test first** -- create a test that describes the desired behavior +3. **Use descriptive test names** -- the test name should read as a specification of behavior (e.g., `TestCalculateTotal_WithEmptyCart_ReturnsZero`), because vague names like `TestCalc` make failures impossible to diagnose without reading the test body +4. 
**Write minimal test setup** -- only create fixtures/mocks needed for THIS test +5. **Assert expected behavior** -- use specific assertions (not just "no error"), because weak assertions like `assert result != nil` pass for wrong reasons and provide false confidence -Active by default to maintain quality: +Use specific assertions: +- `assert result == 42` (specific value) +- `assert error.message.contains("invalid")` (specific content) +- NOT `assert result != nil` (too weak -- passes even when result is garbage) +- NOT `assert len(result) > 0` (not specific enough -- passes with wrong data) -- **Communication Style**: Report facts without self-congratulation. Show command output rather than describing it. Be concise but informative. -- **Temporary File Cleanup**: Remove temporary test files, coverage reports, or debug outputs created during TDD cycles at task completion. Keep only files explicitly needed for the project. -- **Run tests after each change**: Execute test suite after every code modification -- **Test improvement suggestions**: Recommend better assertions, edge cases, test organization -- **Coverage awareness**: Track which code paths are tested, suggest missing coverage -- **Refactoring validation**: Ensure tests remain green during refactoring steps -- **Test naming conventions**: Enforce descriptive test names that explain behavior +Test one concept per test. If the test name needs "and", split into multiple tests, because multi-assertion tests produce ambiguous failures. 
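To make the difference between weak and specific assertions concrete, here is a small sketch. `calculate_total` and `buggy_total` are hypothetical functions invented for this illustration. The two test functions each check one concept, and the last two lines show a deliberately broken implementation slipping past a weak assertion.

```python
def calculate_total(prices):
    """Hypothetical function under test: sum a list of item prices."""
    return sum(prices)


def buggy_total(prices):
    """Deliberately wrong variant: silently drops the last item."""
    return sum(prices[:-1])


# One concept per test, with names that read as specifications
def test_calculate_total_with_empty_cart_returns_zero():
    assert calculate_total([]) == 0


def test_calculate_total_with_two_items_returns_their_sum():
    assert calculate_total([10, 32]) == 42


# The weak check accepts the buggy result; the specific check rejects it
weak_check_passes = buggy_total([10, 32]) is not None
specific_check_passes = buggy_total([10, 32]) == 42
```

Here `weak_check_passes` is true even though the total is wrong, while `specific_check_passes` is false, which is the failure a useful test should produce.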
-### Optional Behaviors (OFF unless enabled) +Follow the Arrange-Act-Assert pattern: -Advanced testing capabilities available on request: +```python +def test_feature(): + # Arrange: Set up test data + input_data = create_test_data() -- **Property-based testing**: Generate tests with random/fuzzed inputs (Go: testing/quick, Python: hypothesis) -- **Mutation testing**: Verify test quality by introducing bugs -- **Benchmark tests**: Performance regression testing -- **Table-driven tests**: Convert multiple similar tests to data-driven approach -- **Test parallelization**: Run independent tests concurrently for speed + # Act: Execute the code under test + result = function_under_test(input_data) -## What This Skill CAN Do -- Guide RED-GREEN-REFACTOR cycles for any language (Go, Python, JavaScript) -- Enforce phase gates: test must fail before implementation -- Validate test failure reasons (syntax errors vs missing implementation) -- Guide refactoring while maintaining green tests -- Provide language-specific testing commands and patterns + # Assert: Verify expected behavior + assert result.status == "success" +``` -## What This Skill CANNOT Do -- Write implementation before tests (violates TDD principle) -- Skip the RED phase or proceed without verified test failure -- Implement features not covered by a test -- Approve passing tests without checking failure reason -- Skip running tests after each change +**Optional techniques** (use when explicitly requested): +- **Property-based testing**: Generate tests with random/fuzzed inputs (Go: `testing/quick`, Python: `hypothesis`) +- **Table-driven tests**: Convert multiple similar tests to data-driven approach when 3+ tests share the same structure -## Instructions +**Run the test:** +```bash +go test ./... 
-v -run TestNewFeature # Go +pytest tests/test_feature.py::test_name -v # Python +npm test -- --testNamePattern="new feature" # JavaScript +``` -### TDD Workflow: RED-GREEN-REFACTOR Cycle +Show the full test runner output -- never summarize test results, because summarization hides warnings, partial failures, and unexpected output that reveal problems early. -#### Step 1: Write a Failing Test (RED Phase) +#### RED Phase Gate -**PHASE GATE: Do NOT proceed to GREEN phase until:** +Do NOT proceed to the GREEN phase until all of these are true: - [ ] Test file is created and saved - [ ] Test has been executed - [ ] Test output shows FAILURE (not syntax/import error) - [ ] Failure message indicates missing implementation -**BEFORE writing any implementation code:** +### Phase 2: Verify Failure Reason (RED Verification) -1. **Understand the requirement**: Clarify what behavior needs to be implemented -2. **Write the test first**: Create test that describes the desired behavior -3. **Use descriptive test names**: Test name should explain what is being tested -4. **Write minimal test setup**: Only create fixtures/mocks needed for THIS test -5. **Assert expected behavior**: Use specific assertions (not just "no error") +The test must fail because the feature is not implemented, NOT because of syntax errors, import errors, wrong test setup, or unrelated failures. A test that fails for the wrong reason proves nothing about the missing feature and will pass for the wrong reason after implementation. -**Run the test:** -```bash -go test ./... -v -run TestNewFeature # Go -pytest tests/test_feature.py::test_name -v # Python -npm test -- --testNamePattern="new feature" # JavaScript -``` +1. **Execute test command** and show the complete output +2. 
**Verify failure reason** -- confirm the error matches expected missing-implementation patterns: + - Go: `--- FAIL: TestFeatureName` with expected vs actual mismatch + - Python: `AssertionError` or `AttributeError: module has no attribute` + - JavaScript: `Expected X but received undefined` -#### Step 2: Verify Test Fails for the RIGHT Reason (RED Verification) +**If the test fails for the WRONG reason:** +- Fix the test setup/syntax +- Re-run until it fails for the RIGHT reason (missing implementation) +- Do NOT proceed until the failure clearly indicates "this feature does not exist yet" -**CRITICAL: Run the test and confirm it fails:** +### Phase 3: Implement Minimum Code (GREEN) -1. **Execute test command** (show full output) -2. **Verify failure reason**: Test should fail because feature not implemented, NOT: - - Syntax errors - - Import errors - - Wrong test setup - - Unrelated failures +Write ONLY enough code to make the failing test pass. Implement nothing beyond what the test demands, because untested code paths are invisible liabilities -- they cannot be verified, they rot silently, and they complicate future refactoring. -**Expected RED output indicators:** -- Go: `--- FAIL: TestFeatureName` with expected vs actual mismatch -- Python: `AssertionError` or `AttributeError: module has no attribute` -- JavaScript: `Expected X but received undefined` +1. **Minimal implementation** -- the simplest code that satisfies the test +2. **No extra features** -- do not implement behavior not covered by tests. First make it work, then make it right +3. 
**Hardcoded values are OK initially** -- a hardcoded return that passes the test is better than a generic algorithm that also handles untested cases -**If test fails for WRONG reason:** -- Fix the test setup/syntax -- Re-run until it fails for the RIGHT reason (missing implementation) +Wrong (over-engineering in GREEN phase): +```go +// Test only requires simple addition +func TestCalculator_AddTwoNumbers(t *testing.T) { + calc := NewCalculator() + result := calc.Add(2, 3) + assert.Equal(t, 5, result) +} -#### Step 3: Write MINIMUM Code to Pass (GREEN Phase) +// But implementation adds unnecessary complexity +type Calculator struct { + operations map[string]func(float64, float64) float64 + precision int + history []Operation +} +``` -**PHASE GATE: Do NOT proceed to REFACTOR phase until:** -- [ ] Implementation code is written -- [ ] Test has been executed again -- [ ] Test output shows PASS -- [ ] No other tests have been broken +Correct (implement only what is tested): +```go +type Calculator struct{} -**Implement ONLY enough code to make THIS test pass:** +func (c *Calculator) Add(a, b int) int { + return a + b +} +// Add complexity ONLY when a test requires it +``` -1. **Minimal implementation**: Simplest code that satisfies the test -2. **No extra features**: Don't implement behavior not covered by tests -3. **Hardcoded values are OK initially**: First make it work, then make it right +### Phase 4: Verify Test Passes (GREEN Verification) -#### Step 4: Verify Test Passes (GREEN Verification) +Run the test and show the complete output. Never summarize -- the full output reveals warnings, deprecation notices, and timing issues that summaries hide. -**Run test and confirm it passes:** +1. **Execute test command** and display all output +2. **Verify PASS status** +3. **Run the full test suite** -- not just the new test, because a change that makes one test pass while breaking another is not progress. 
Run tests after every code modification to catch regressions immediately -1. **Execute test command** (show full output) -2. **Verify PASS status**: Test should now succeed -3. **Check for warnings**: Note any deprecation warnings or issues +```bash +go test ./... -v # Go - all tests +pytest -v # Python - all tests +npm test # JavaScript - all tests +``` -**If test still fails:** +**If the test still fails:** - Review implementation logic - Check test assertions are correct -- Debug until test passes +- Debug until the test passes -#### Step 5: Refactor While Keeping Tests Green (REFACTOR Phase) +#### GREEN Phase Gate -**PHASE GATE: Do NOT mark task complete until:** -- [ ] All refactoring changes are saved +Do NOT proceed to the REFACTOR phase until all of these are true: +- [ ] Implementation code is written +- [ ] New test has been executed and shows PASS - [ ] Full test suite has been executed -- [ ] ALL tests pass (not just the new one) -- [ ] Code quality has been evaluated against checklist below +- [ ] No other tests have been broken + +### Phase 5: Refactor (REFACTOR) + +Improve code quality without changing behavior. Run the full test suite before refactoring to establish a green baseline, because you need proof that any future failure was caused by your refactoring, not by a pre-existing issue. + +**Refactoring decision criteria** (evaluate each): -**REFACTORING DECISION CRITERIA** (evaluate each): | Criterion | Check | Action if YES | |-----------|-------|---------------| | Duplication | Same logic in 2+ places? | Extract to shared function | @@ -181,170 +186,23 @@ npm test -- --testNamePattern="new feature" # JavaScript | Complexity | Nested conditionals >2 deep? | Simplify or extract | | Reusability | Could other code use this? | Extract to module | -**Improve code quality without changing behavior:** - -1. **Run full test suite BEFORE refactoring**: Establish green baseline -2. 
**Refactor incrementally**: Extract functions, rename for clarity, remove duplication -3. **Run tests after EACH refactoring step**: Ensure tests stay green -4. **Refactor tests too**: Improve test readability and maintainability - -#### Step 6: Commit Atomic Changes - -**Commit test and implementation together:** - -1. **Review changes**: Verify test + implementation are complete -2. **Run full test suite**: Ensure nothing broke -3. **Commit with descriptive message** - -## Error Handling - -### Common TDD Mistakes and Solutions - -#### Error: "Test passes before implementation" -**Symptom**: Test shows PASS in RED phase - -**Causes:** -- Test is testing the wrong thing -- Implementation already exists elsewhere -- Test assertions are too weak (always true) - -**Solution:** -1. Review test assertions - are they specific enough? -2. Verify test is actually calling the code under test -3. Check for existing implementation of the feature -4. Strengthen assertions to actually verify behavior - -#### Error: "Test fails for wrong reason" -**Symptom**: Syntax errors, import errors, setup failures in RED phase - -**Causes:** -- Test setup incomplete -- Missing dependencies -- Incorrect import paths - -**Solution:** -1. Fix syntax/import errors first -2. Set up necessary fixtures/mocks -3. Verify test file structure matches project conventions -4. Re-run until test fails for RIGHT reason (missing feature) - -#### Error: "Tests pass but feature doesn't work" -**Symptom**: Tests green but manual testing shows bugs - -**Causes:** -- Tests don't cover actual usage -- Test mocks don't match real behavior -- Edge cases not tested - -**Solution:** -1. Review test coverage - what's missing? -2. Add integration tests alongside unit tests -3. Test with real data, not just mocks -4. 
Add edge case tests (empty input, null, extremes) - -#### Error: "Refactoring breaks tests" -**Symptom**: Tests fail after refactoring - -**Causes:** -- Tests coupled to implementation details -- Brittle assertions (checking internals not behavior) -- Large refactoring without incremental steps - -**Solution:** -1. Test behavior, not implementation details -2. Refactor in smaller steps -3. Run tests after each micro-refactoring -4. Update tests if API contract legitimately changed - -## Language-Specific Testing Commands - -| Language | Run One Test | Run All | With Coverage | -|----------|-------------|---------|---------------| -| Go | `go test -v -run TestName ./pkg` | `go test ./...` | `go test -cover ./...` | -| Python | `pytest tests/test_file.py::test_fn -v` | `pytest` | `pytest --cov=src` | -| JavaScript | `npm test -- --testNamePattern="name"` | `npm test` | `npm test -- --coverage` | - -## Testing Best Practices - -### Assertion Guidelines - -**Use specific assertions:** -- `assert result == 42` (specific value) -- `assert error.message.contains("invalid")` (specific content) -- NOT `assert result != nil` (too weak) -- NOT `assert len(result) > 0` (not specific enough) - -**Test one concept per test:** -- Each test should verify ONE behavior -- If test name needs "and", split into multiple tests -- Makes failures easier to diagnose +1. **Run full test suite BEFORE refactoring** -- establish green baseline +2. **Refactor incrementally** -- extract functions, rename for clarity, remove duplication +3. **Run tests after EACH refactoring step** -- ensure tests stay green after every individual change, because large refactoring batches make it impossible to identify which change broke the test +4. **Refactor tests too** -- improve test readability and maintainability. 
Suggest better assertions, edge cases, and test organization where they would strengthen coverage -### Arrange-Act-Assert Pattern - -```python -def test_feature(): - # Arrange: Set up test data - input_data = create_test_data() - - # Act: Execute the code under test - result = function_under_test(input_data) +Test behavior, not implementation details. Tests coupled to internals break on refactoring and defeat its purpose: - # Assert: Verify expected behavior - assert result.status == "success" -``` - -## Common Anti-Patterns - -### Anti-Pattern 1: Skipping the RED Phase - -**Wrong -- writing implementation first:** -```python -# Writing implementation first -def calculate_total(items): - return sum(item.price for item in items) - -# Then writing test after -def test_calculate_total(): - items = [Item(price=10), Item(price=20)] - assert calculate_total(items) == 30 -``` - -**Why it's wrong:** -- Can't verify test actually catches bugs (never saw it fail) -- Test might be passing for wrong reasons -- Risk of writing tests that match buggy implementation - -**Correct -- RED then GREEN:** -```python -# 1. Write test FIRST (RED phase) -def test_calculate_total(): - items = [Item(price=10), Item(price=20)] - assert calculate_total(items) == 30 -# Run test -> fails with "NameError: name 'calculate_total' is not defined" - -# 2. 
Implement minimum code (GREEN phase) -def calculate_total(items): - return sum(item.price for item in items) -# Run test -> passes -``` - -### Anti-Pattern 2: Testing Implementation Details - -**Wrong -- testing internals:** +Wrong (testing internals): ```go func TestParser_UsesCorrectRegex(t *testing.T) { parser := NewParser() - // Testing internal regex pattern - breaks on refactor + // Testing internal regex pattern -- breaks on refactor assert.Equal(t, `\d{3}-\d{3}-\d{4}`, parser.phoneRegex) } ``` -**Why it's wrong:** -- Test breaks when refactoring internal implementation -- Doesn't verify actual behavior users care about -- Makes refactoring painful (tests should enable it) - -**Correct -- testing behavior:** +Correct (testing behavior): ```go func TestParser_ValidPhoneNumber_ParsesCorrectly(t *testing.T) { parser := NewParser() @@ -361,9 +219,38 @@ func TestParser_InvalidPhoneNumber_ReturnsError(t *testing.T) { } ``` -### Anti-Pattern 3: Writing Multiple Features Without Tests +Track which code paths are tested and suggest missing coverage, because untested paths are invisible to the refactoring safety net. 
-**Wrong -- implementing everything at once:** +**Optional techniques** (use when explicitly requested): +- **Mutation testing**: Verify test quality by introducing bugs -- if mutating code does not break a test, that test is too weak +- **Benchmark tests**: Performance regression testing to ensure refactoring does not degrade speed +- **Test parallelization**: Run independent tests concurrently for speed + +#### REFACTOR Phase Gate + +Do NOT mark the task complete until all of these are true: +- [ ] All refactoring changes are saved +- [ ] Full test suite has been executed +- [ ] ALL tests pass (not just the new one) +- [ ] Code quality has been evaluated against the criteria table above + +### Phase 6: Commit + +Commit the test and implementation together as an atomic unit, because separating them creates a window where the repository is in an inconsistent state -- either tests exist for unimplemented code, or code exists without its test coverage. + +1. **Review changes** -- verify test + implementation are complete +2. **Run full test suite** -- ensure nothing broke +3. **Commit with descriptive message** + +After committing, clean up any temporary test files, coverage reports, or debug outputs created during the TDD cycle. Keep only files explicitly needed for the project. + +Report facts without self-congratulation. Show command output rather than describing it. + +### Cycle Discipline + +Each feature gets its own RED-GREEN-REFACTOR cycle. 
Do not batch multiple features into one cycle: + +Wrong (implementing everything at once): ```javascript // Implementing many features at once without tests class UserManager { @@ -375,12 +262,7 @@ class UserManager { // Then one giant test for everything ``` -**Why it's wrong:** -- Lost the TDD cycle discipline completely -- Can't verify each feature worked incrementally -- No design feedback from tests - -**Correct -- one cycle per feature:** +Correct (one cycle per feature): ```javascript // Cycle 1: Create user (RED -> GREEN -> REFACTOR) it('should create user with valid data', () => { @@ -400,57 +282,78 @@ it('should reject user with invalid email', () => { // Add validation to make test pass ``` -### Anti-Pattern 4: Over-Engineering in GREEN Phase +## Reference Material -**Wrong -- test requires simple addition but implementation over-engineers:** -```go -// Test only requires simple addition -func TestCalculator_AddTwoNumbers(t *testing.T) { - calc := NewCalculator() - result := calc.Add(2, 3) - assert.Equal(t, 5, result) -} +### Language-Specific Testing Commands -// But implementation adds unnecessary complexity -type Calculator struct { - operations map[string]func(float64, float64) float64 - precision int - history []Operation -} -``` +| Language | Run One Test | Run All | With Coverage | +|----------|-------------|---------|---------------| +| Go | `go test -v -run TestName ./pkg` | `go test ./...` | `go test -cover ./...` | +| Python | `pytest tests/test_file.py::test_fn -v` | `pytest` | `pytest --cov=src` | +| JavaScript | `npm test -- --testNamePattern="name"` | `npm test` | `npm test -- --coverage` | -**Why it's wrong:** -- Implementing features not covered by tests -- Violates "minimum code to pass" principle -- Hard to maintain untested code paths +### Reference Files -**Correct -- implement only what's tested:** -```go -// Implement ONLY what's needed to pass -type Calculator struct{} +- `${CLAUDE_SKILL_DIR}/references/examples.md`: 
Language-specific TDD examples (Go, Python, JavaScript) -func (c *Calculator) Add(a, b int) int { - return a + b -} -// Add complexity ONLY when a test requires it -``` +## Error Handling -## Reference Files -- `${CLAUDE_SKILL_DIR}/references/examples.md`: Language-specific TDD examples (Go, Python, JavaScript) +### Test passes before implementation (RED phase) -## References +**Symptom**: Test shows PASS in RED phase -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) - Prevents shortcut rationalizations -- [Anti-Rationalization (Testing)](../shared-patterns/anti-rationalization-testing.md) - Testing-specific rationalizations -- [Gate Enforcement](../shared-patterns/gate-enforcement.md) - Phase transition rules -- [Verification Checklist](../shared-patterns/verification-checklist.md) - Pre-completion checks +**Causes:** +- Test is testing the wrong thing +- Implementation already exists elsewhere +- Test assertions are too weak (always true) + +**Solution:** +1. Review test assertions -- are they specific enough? +2. Verify test is actually calling the code under test +3. Check for existing implementation of the feature +4. 
Strengthen assertions to actually verify behavior -### Domain-Specific Anti-Rationalization +### Test fails for wrong reason (RED phase) + +**Symptom**: Syntax errors, import errors, setup failures in RED phase + +**Causes:** +- Test setup incomplete +- Missing dependencies +- Incorrect import paths -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "I know what the test should be, let me just code it" | Skipping RED means test may not catch bugs | Write test, run it, see it fail first | -| "Test passes, implementation is correct" | Passing test may be too weak | Check assertions are specific enough | -| "Simple feature, no need for TDD cycle" | Simple features have edge cases too | One RED-GREEN-REFACTOR per feature | -| "I'll add more tests after the feature works" | Retro-fitted tests miss design feedback | Write tests BEFORE implementation | +**Solution:** +1. Fix syntax/import errors first +2. Set up necessary fixtures/mocks +3. Verify test file structure matches project conventions +4. Re-run until test fails for RIGHT reason (missing feature) + +### Tests pass but feature does not work + +**Symptom**: Tests green but manual testing shows bugs + +**Causes:** +- Tests do not cover actual usage +- Test mocks do not match real behavior +- Edge cases not tested + +**Solution:** +1. Review test coverage -- what is missing? +2. Add integration tests alongside unit tests +3. Test with real data, not just mocks +4. Add edge case tests (empty input, null, extremes) + +### Refactoring breaks tests + +**Symptom**: Tests fail after refactoring + +**Causes:** +- Tests coupled to implementation details +- Brittle assertions (checking internals not behavior) +- Large refactoring without incremental steps + +**Solution:** +1. Test behavior, not implementation details +2. Refactor in smaller steps +3. Run tests after each micro-refactoring +4. 
Update tests if API contract legitimately changed diff --git a/skills/testing-agents-with-subagents/SKILL.md b/skills/testing-agents-with-subagents/SKILL.md index 6ea0bada..05168a00 100644 --- a/skills/testing-agents-with-subagents/SKILL.md +++ b/skills/testing-agents-with-subagents/SKILL.md @@ -27,45 +27,15 @@ routing: # Testing Agents With Subagents -## Operator Context - -This skill operates as an operator for agent testing workflows, configuring Claude's behavior for systematic agent validation. It applies **TDD methodology to agent development** — RED (observe failures), GREEN (fix agent definition), REFACTOR (edge cases and robustness) — with subagent dispatch as the execution mechanism. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files before testing -- **Over-Engineering Prevention**: Only test what's directly needed. No elaborate test harnesses or infrastructure. Keep test cases focused and minimal. -- **Verbatim Output Capture**: Document exact agent outputs. NEVER summarize or paraphrase. -- **Isolated Execution**: Each test runs in a fresh subagent to avoid context pollution -- **Evidence-Based Claims**: Every claim about agent behavior MUST be backed by actual test execution -- **No Self-Exemption**: You cannot decide an agent doesn't need testing. Human partner must confirm exemptions. - -### Default Behaviors (ON unless disabled) -- **Multi-Case Testing**: Run at least 3 test cases per agent (success, failure, edge case) -- **Output Schema Validation**: Verify agent output matches expected structure and required sections -- **Consistency Testing**: Run same input 2+ times to verify deterministic behavior -- **Regression Testing**: After fixes, re-run ALL previous test cases before declaring green -- **Temporary File Cleanup**: Remove test files and artifacts at completion. Keep only files needed for documentation. 
-- **Document Findings**: Log all observations, hypotheses, and test results in structured format - -### Optional Behaviors (OFF unless enabled) -- **A/B Testing**: Compare agent variants using agent-comparison skill -- **Performance Benchmarking**: Measure response time and token usage -- **Stress Testing**: Test with large inputs, many iterations, concurrent requests -- **Eval Harness Integration**: Use `evals/harness.py skill-test` for YAML-based automated testing - -## What This Skill CAN Do -- Systematically validate agents through RED-GREEN-REFACTOR test cycles -- Dispatch subagents with controlled inputs and capture verbatim outputs -- Distinguish between output structure issues and behavioral correctness issues -- Verify fixes don't introduce regressions across the full test suite -- Test routing logic, skill invocation, and multi-agent workflows - -## What This Skill CANNOT Do -- Deploy agents without completing all three test phases -- Substitute reading agent prompts for executing actual test runs -- Make claims about agent behavior without evidence from subagent dispatch -- Evaluate agent quality structurally (use agent-evaluation instead) -- Skip the RED phase even when "the fix is obvious" +## Overview + +This skill applies **TDD methodology to agent development** — RED (observe failures), GREEN (fix agent definition), REFACTOR (edge cases and robustness) — with subagent dispatch as the execution mechanism. + +Test what the agent DOES, not what the prompt SAYS. Evidence-based verification only: capture exact outputs from subagent dispatch, never assume a prompt change will work without testing. Always test via the Task tool, never substitute reading a prompt for running the agent. 
+ +Minimum test counts vary by agent type: Reviewer agents need 6 cases (2 real issues, 2 clean, 1 edge, 1 ambiguous), Implementation agents 5 cases (2 typical, 1 complex, 1 minimal, 1 error), Analysis agents 4 cases (2 standard, 1 edge, 1 malformed), Routing/orchestration 4 cases (2 correct route, 1 ambiguous, 1 invalid). No agent is simple enough to skip testing — get human confirmation before exempting any agent. + +Each test runs in a fresh subagent to avoid context pollution. After any fix, re-run ALL test cases to catch regressions. One fix at a time — you cannot determine what changed the outcome with multiple simultaneous fixes. --- @@ -146,8 +116,12 @@ Task( ) ``` +Each test runs in a fresh subagent — this prevents context pollution from earlier tests affecting later ones. + **Step 3: Capture results verbatim** +Document exact agent outputs. NEVER summarize or paraphrase: + ```markdown ## Test T1: Happy Path @@ -201,7 +175,7 @@ Triage failures by severity: Change one thing in the agent definition. Re-run ALL test cases. Document which tests now pass/fail. -Never make multiple fixes simultaneously — you cannot determine which change was effective. This is the same principle as debugging: one variable at a time. +Never make multiple fixes simultaneously — you cannot determine which change was effective. Same debugging principle: one variable at a time. **Step 4: Iterate until green** @@ -266,37 +240,6 @@ READY FOR DEPLOYMENT / NEEDS FIXES / REQUIRES REVIEW --- -## Examples - -### Example 1: Testing a New Reviewer Agent -User says: "Test the new reviewer-security agent" -Actions: -1. Define 6 test cases: 2 real issues, 2 clean code, 1 edge case, 1 ambiguous (RED) -2. Dispatch subagent for each, capture verbatim outputs (RED) -3. Fix agent definition for any failures, re-run all tests (GREEN) -4. 
Add edge cases (empty input, malformed code), verify consistency (REFACTOR) -Result: Agent passes all tests, report documents pass rate and verdict - -### Example 2: Testing After Agent Modification -User says: "I updated the golang-general-engineer, make sure it still works" -Actions: -1. Run existing test cases against modified agent (RED) -2. Compare outputs to previous baseline (RED) -3. Fix any regressions introduced by the modification (GREEN) -4. Test edge cases to verify robustness not degraded (REFACTOR) -Result: Agent modification validated, no regressions confirmed - -### Example 3: Testing Routing Logic -User says: "Verify the /do router sends Go requests to the right agent" -Actions: -1. Define test cases: "Review this Go code", "Fix this .go file", "Write a goroutine" (RED) -2. Dispatch each through router, verify correct agent handles it (RED) -3. Fix routing triggers if wrong agent selected (GREEN) -4. Test ambiguous inputs like "Review this code" with mixed-language context (REFACTOR) -Result: Routing validated for all trigger phrases, ambiguous cases documented - ---- - ## Error Handling ### Error: "Agent type not found" @@ -331,52 +274,39 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Testing Without Capturing Exact Output -**What it looks like**: "Tested the agent, it looks good." -**Why wrong**: No evidence of what was tested. Cannot reproduce or verify results. Subjective assessment instead of objective evidence. -**Do instead**: Capture verbatim output for every test case. Document input, expected, actual, and result. - -### Anti-Pattern 2: Testing Only Happy Path -**What it looks like**: "Tested with one example, it worked." -**Why wrong**: Agents fail on edge cases most often. One test proves almost nothing. False confidence in agent quality. -**Do instead**: Minimum 3-6 test cases per agent covering success, failure, edge, and ambiguous inputs. 
+## Examples -### Anti-Pattern 3: Skipping Re-test After Fixes -**What it looks like**: "Fixed the issue, should work now." -**Why wrong**: Fix might have broken other tests. No verification fix actually works. Regression bugs slip through. -**Do instead**: Re-run ALL test cases after any change. Only mark green when full suite passes. +### Example 1: Testing a New Reviewer Agent +User says: "Test the new reviewer-security agent" +Actions: +1. Define 6 test cases: 2 real issues, 2 clean code, 1 edge case, 1 ambiguous (RED) +2. Dispatch subagent for each, capture verbatim outputs (RED) +3. Fix agent definition for any failures, re-run all tests (GREEN) +4. Add edge cases (empty input, malformed code), verify consistency (REFACTOR) +Result: Agent passes all tests, report documents pass rate and verdict -### Anti-Pattern 4: Reading Prompts Instead of Running Agents -**What it looks like**: "Checked that agent prompt has the right sections." -**Why wrong**: Reading a prompt is not executing an agent. Prompt structure does not guarantee behavior. Must verify actual output. -**Do instead**: Test what the agent DOES, not what the prompt SAYS. Execute with real inputs via Task tool. +### Example 2: Testing After Agent Modification +User says: "I updated the golang-general-engineer, make sure it still works" +Actions: +1. Run existing test cases against modified agent (RED) +2. Compare outputs to previous baseline (RED) +3. Fix any regressions introduced by the modification (GREEN) +4. Test edge cases to verify robustness not degraded (REFACTOR) +Result: Agent modification validated, no regressions confirmed -### Anti-Pattern 5: Self-Exempting from Testing -**What it looks like**: "This agent is simple, doesn't need testing." or "Simple change, no need to re-test." -**Why wrong**: Simple agents can still fail. Small changes can break behavior. You cannot self-determine exemptions from testing. -**Do instead**: Get human partner confirmation for exemptions. 
When in doubt, test. Document why testing was skipped if approved. +### Example 3: Testing Routing Logic +User says: "Verify the /do router sends Go requests to the right agent" +Actions: +1. Define test cases: "Review this Go code", "Fix this .go file", "Write a goroutine" (RED) +2. Dispatch each through router, verify correct agent handles it (RED) +3. Fix routing triggers if wrong agent selected (GREEN) +4. Test ambiguous inputs like "Review this code" with mixed-language context (REFACTOR) +Result: Routing validated for all trigger phrases, ambiguous cases documented --- ## References -This skill uses these shared patterns: -- [Anti-Rationalization](../shared-patterns/anti-rationalization-core.md) — Prevents shortcut rationalizations -- [Anti-Rationalization: Testing](../shared-patterns/anti-rationalization-testing.md) — Testing-specific rationalization blocks -- [Verification Checklist](../shared-patterns/verification-checklist.md) — Pre-completion checks - -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "Agent prompt looks correct" | Reading prompt ≠ executing agent | Dispatch subagent and capture output | -| "Tested manually in conversation" | Not reproducible, no baseline | Use Task tool for formal dispatch | -| "Only a small change" | Small changes can break agent behavior | Run full test suite | -| "Will monitor in production" | Production monitoring ≠ pre-deployment testing | Complete RED-GREEN-REFACTOR first | -| "Based on working template" | Template correctness ≠ instance correctness | Test this specific agent | - ### Integration - `agent-comparison`: A/B test agent variants - `agent-evaluation`: Structural quality checks diff --git a/skills/testing-anti-patterns/SKILL.md b/skills/testing-anti-patterns/SKILL.md index 7a4ef3aa..868c1e44 100644 --- a/skills/testing-anti-patterns/SKILL.md +++ b/skills/testing-anti-patterns/SKILL.md @@ -39,43 
+39,13 @@ routing: # Testing Anti-Patterns Skill -## Operator Context - -This skill operates as an operator for test quality assessment, configuring Claude's behavior for identifying and fixing common testing mistakes. It provides "negative knowledge" -- patterns to AVOID. It complements `test-driven-development` by focusing on what goes wrong, not just what to do right. - -**Core principle:** Tests should verify behavior, be reliable, run fast, and fail for the right reasons. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md files -- **Over-Engineering Prevention**: Fix the specific anti-pattern; do not rewrite the entire test suite -- **Preserve Test Intent**: When fixing anti-patterns, maintain what the test was trying to verify -- **Show Real Examples**: Point to actual code when identifying anti-patterns, not abstract descriptions -- **Behavior Over Implementation**: Always guide toward testing observable behavior, not internals - -### Default Behaviors (ON unless disabled) -- **Communication**: Report anti-patterns with specific file:line references and concrete fixes -- **Severity Classification**: Distinguish critical (flaky, order-dependent) from minor (naming) issues -- **Quick Wins First**: Suggest fixes that improve reliability immediately -- **One Pattern at a Time**: Address each anti-pattern individually with before/after - -### Optional Behaviors (OFF unless enabled) -- **Full Suite Audit**: Scan entire test suite for anti-patterns (can be slow) -- **Refactoring Mode**: Apply fixes automatically rather than just identifying them -- **Metrics Collection**: Count anti-patterns by category for reporting - -## What This Skill CAN Do -- Identify specific anti-patterns in test code with file:line references -- Provide concrete before/after examples for fixes -- Prioritize fixes by impact (flaky > order-dependent > slow > naming) -- Explain WHY a pattern is problematic -- Suggest incremental 
improvements without rewriting suites - -## What This Skill CANNOT Do -- Fix fundamental architectural issues (use `systematic-refactoring`) -- Write new tests from scratch (use `test-driven-development`) -- Profile test performance (use actual profilers) -- Guarantee test correctness (anti-patterns can exist in "working" tests) -- Skip identification and jump straight to rewriting +## Overview + +This skill identifies and fixes common testing mistakes across unit, integration, and E2E test suites. Tests should verify behavior, be reliable, run fast, and fail for the right reasons. + +**Scope:** This skill focuses on improving test quality and reliability. It complements `test-driven-development` by addressing what goes wrong with tests, not just how to write them correctly from scratch. + +**Not in scope:** Writing new tests from scratch (use `test-driven-development`), fixing fundamental architectural issues (use `systematic-refactoring`), or profiling test performance with external tools. --- @@ -94,7 +64,7 @@ Use Grep/Glob to find test files in the relevant area. If user pointed to specif **Step 2: Read CLAUDE.md** -Check for project-specific testing conventions before flagging anti-patterns. Some projects intentionally deviate from general best practices. +Check for project-specific testing conventions before flagging anti-patterns. Some projects intentionally deviate from general best practices. This prevents false positives based on organizational standards. **Step 3: Classify anti-patterns** @@ -135,6 +105,12 @@ For each test file, scan for these 10 categories (detailed examples in `referenc 2. **MEDIUM** - Over-mocking, incomplete assertions, missing edge cases (false confidence) 3. **LOW** - Poor naming, over-specification, slow suites (maintenance burden) +**Constraint: Fix one pattern at a time.** Mechanical bulk fixes (applying the same pattern to 50 tests without running them) miss context-specific nuances and cause regressions. 
Fix one, verify it works, then move to the next. + +**Constraint: Preserve test intent.** When fixing anti-patterns, maintain what the test was originally trying to verify. Do not silently change test coverage. + +**Constraint: Prevent over-engineering.** Fix the specific anti-pattern identified; do not rewrite the entire test suite or delete tests and write new ones from scratch. Institutional knowledge lives in the existing tests. + **Gate**: Findings ranked. User agrees on scope of fixes. Proceed only when gate passes. ### Phase 3: FIX @@ -159,15 +135,21 @@ Priority: [HIGH/MEDIUM/LOW] ``` **Step 2: Apply fix** -- Change only what is needed to fix the anti-pattern -- Preserve the original test's intent and coverage -- One anti-pattern fix at a time -- Consult `references/fix-strategies.md` for language-specific patterns + +**Constraint: Show real examples.** Point to actual code when identifying anti-patterns, not abstract descriptions. Avoid rationalization: if a test breaks during refactoring, that test was likely relying on buggy behavior. Investigate and fix the root cause; do not just adjust the assertion. + +**Constraint: Guide toward behavior testing.** Always recommend testing observable behavior, not implementation internals. For example: +- ANTI-PATTERN: Test asserts on private fields → FIX: Test the public behavior that those fields enable +- ANTI-PATTERN: Test spies on `_getUser()` → FIX: Test what happens when a user exists or doesn't exist +- ANTI-PATTERN: Test checks exact regex → FIX: Test that validation succeeds/fails for representative inputs + +Change only what is needed to fix the anti-pattern. Consult `references/fix-strategies.md` for language-specific patterns.
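
As a rough illustration of the "private fields" and "exact regex" rewrites listed above, here is a minimal Python sketch. The `PhoneValidator` class, its `_pattern` field, and the `is_valid` method are hypothetical names introduced only for this example; they are not part of the skill or of any library. The tests exercise only the public `is_valid` behavior with representative inputs, so the internal regex could be replaced without breaking them.

```python
import re


class PhoneValidator:
    """Hypothetical validator, used only to illustrate the pattern."""

    def __init__(self) -> None:
        # Internal detail: tests should not assert on this field.
        self._pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

    def is_valid(self, value: str) -> bool:
        # Public behavior: True only when the whole string is a phone number.
        return self._pattern.fullmatch(value) is not None


def test_accepts_well_formed_number() -> None:
    # Behavior test: a representative valid input is accepted.
    assert PhoneValidator().is_valid("555-123-4567") is True


def test_rejects_malformed_numbers() -> None:
    # Behavior test: representative invalid inputs are rejected.
    validator = PhoneValidator()
    for bad in ["", "5551234567", "555-123-456", "abc-def-ghij"]:
        assert validator.is_valid(bad) is False, bad
```

Neither test mentions `_pattern`, which is the point: swapping the regex for a hand-written parser would leave both tests unchanged as long as the accepted and rejected inputs stay the same.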
**Step 3: Run tests after each fix** + - Run the specific fixed test first to confirm it passes - Run the full file or package to check for interactions -- If a fix makes a previously-passing test fail, the test was likely depending on buggy behavior -- investigate before proceeding +- If a fix makes a previously-passing test fail, the test was likely depending on buggy behavior — investigate before proceeding **Gate**: Each fix verified individually. Tests pass after each change. @@ -175,7 +157,7 @@ Priority: [HIGH/MEDIUM/LOW] **Goal**: Confirm all fixes work together and suite is healthier. -**Step 1**: Run full test suite -- all pass +**Step 1**: Run full test suite — all pass **Step 2**: Verify previously-flaky tests are now deterministic (run 3x if applicable) - Go: `go test -count=3 -run TestFixed ./...` @@ -201,59 +183,172 @@ Remaining issues: [any deferred items] --- -## Examples - -### Example 1: Flaky Test Investigation -User says: "This test passes locally but fails randomly in CI" -Actions: -1. Scan test for timing dependencies -- find `sleep(500)` (SCAN) -2. Classify as Anti-Pattern 10: Flaky Test, severity HIGH (PRIORITIZE) -3. Replace `sleep()` with `waitFor()` or inject fake clock (FIX) -4. Run test 10x to confirm determinism, run full suite (VERIFY) -Result: Flaky test replaced with deterministic wait - -### Example 2: Over-Mocked Test Suite -User says: "Every small refactor breaks dozens of tests" -Actions: -1. Scan for mock density -- find tests with 5+ mocks each (SCAN) -2. Classify as Anti-Pattern 2: Over-mocking, severity MEDIUM (PRIORITIZE) -3. Replace mocks with real implementations at I/O boundaries (FIX) -4. Verify suite passes, confirm refactoring no longer breaks tests (VERIFY) -Result: Tests verify behavior instead of mock wiring - -### Example 3: False Confidence -User says: "Tests all pass but we keep finding bugs in production" -Actions: -1. Scan for incomplete assertions (`!= nil`, `toBeTruthy`) and missing edge cases (SCAN) -2. 
Classify as Anti-Patterns 4+8, severity MEDIUM (PRIORITIZE) -3. Add specific value assertions, add edge case tests (FIX) -4. Verify new assertions catch known production bugs (VERIFY) -Result: Tests now catch real issues before deployment - -### Example 4: Order-Dependent Suite -User says: "Tests pass in sequence but fail when run in parallel or random order" -Actions: -1. Scan for shared mutable state, class-level variables, global DB mutations (SCAN) -2. Classify as Anti-Pattern 3: Order Dependence, severity HIGH (PRIORITIZE) -3. Give each test its own setup/teardown, remove shared state (FIX) -4. Run suite with randomized order 3x, confirm all pass (VERIFY) -Result: Tests are self-contained and parallelizable - -### Example 5: Skipped Test Audit -User says: "We have 40 skipped tests, are any still relevant?" -Actions: -1. Grep for `@skip`, `.skip`, `xit`, `@pytest.mark.skip` across suite (SCAN) -2. Classify each: outdated (delete), still relevant (fix), environment-specific (document) (PRIORITIZE) -3. Delete dead tests, unskip and fix relevant ones, add reason annotations (FIX) -4. Verify suite passes with formerly-skipped tests re-enabled (VERIFY) -Result: No unexplained skips remain; suite coverage restored +## Anti-Pattern Catalog + +This section documents the domain-specific anti-patterns this skill detects and fixes. + +### Anti-Pattern 1: Testing Implementation Details + +**What it looks like:** Tests assert on private fields, internal regex patterns, or spy on private methods. + +**Why it's problematic:** Tests coupled to implementation details break whenever the implementation changes, even if public behavior is identical. This creates brittle tests that don't reflect real-world usage. + +**Example signals:** +- Test accesses `obj._privateField` +- Test mocks or spies on `_internalMethod()` +- Test asserts the exact regex used internally + +**Fix:** Test the public behavior that those implementation details enable. 
If private fields matter, they matter because they affect what users see or experience. + +### Anti-Pattern 2: Over-Mocking / Brittle Selectors + +**What it looks like:** Mock setup spans more than 50% of the test code. CSS selectors use nth-child or rely on brittle DOM structure. + +**Why it's problematic:** Over-mocked tests verify mock wiring, not actual behavior. They don't catch real integration issues and break whenever the mocking structure changes. + +**Example signals:** +- Test has 15 lines of setup and 5 lines of assertion +- Test uses `.querySelector('div:nth-child(3) > span')` +- Test mocks every dependency instead of using real implementations at I/O boundaries + +**Fix:** Mock only at architectural boundaries (HTTP, DB, external services). Use real implementations for internal logic. For UI tests, select by semantic attributes (data-testid, role) instead of DOM structure. + +### Anti-Pattern 3: Order-Dependent Tests + +**What it looks like:** Tests share mutable state, use class-level variables, or have numbered test names (test1, test2) suggesting sequence dependency. + +**Why it's problematic:** Tests that pass in sequence but fail in parallel or random order hide bugs. The suite becomes unreliable — developers can't trust "all tests pass" locally if they fail in CI. + +**Example signals:** +- Multiple tests modify a shared class-level variable +- Database is populated by test1, test2 depends on that state +- Test names: `test1_setup`, `test2_verify`, `test3_cleanup` + +**Fix:** Each test owns its data. Use setup/teardown or test fixtures to isolate state. Run suite with `--shuffle` or `-random-order` to catch dependencies. + +### Anti-Pattern 4: Incomplete Assertions + +**What it looks like:** Tests use assertions like `!= nil`, `> 0`, `toBeTruthy()` without checking specific values. + +**Why it's problematic:** Incomplete assertions pass for many wrong reasons. A function that returns 999 (wrong) passes an `> 0` assertion. 
This gives false confidence — tests pass but don't catch bugs. + +**Example signals:** +- `assert result != nil` (passes for any non-nil value) +- `assert response.status > 0` (passes for 404, 500, etc.) +- `expect(user).toBeTruthy()` (passes for any truthy user, even with wrong name) + +**Fix:** Assert specific expected values: +- `assert.equal(result.name, "Alice")` +- `assert.equal(response.status, 200)` +- `expect(user.name).toBe("Alice")` + +### Anti-Pattern 5: Over-Specification + +**What it looks like:** Tests assert on default values, exact timestamps, hardcoded IDs, or every field in a response. + +**Why it's problematic:** Over-specified tests are fragile. When a default changes (legitimately), dozens of tests break even though behavior didn't change. Tests should specify only what matters for this test case. + +**Example signals:** +- `assert.equal(user.createdAt, "2024-01-15T10:30:00Z")` (timestamp brittle to test time) +- `assert.equal(post.id, "uuid-1234-5678")` (hardcoded ID specific to this test) +- Test asserts `status`, `message`, `timestamp`, `userId`, `metadata` when only `status` matters + +**Fix:** Assert only what matters. Use flexible matchers for timestamps and IDs: +- `expect(user.createdAt).toBeDefined()` or `toBeWithin(now, 1000ms)` +- `assert.truthy(post.id)` (just verify it exists) + +### Anti-Pattern 6: Ignored Failures + +**What it looks like:** Tests use `@skip`, `.skip`, `xit`, empty catch blocks, or `_ = err` (ignore error). + +**Why it's problematic:** Skipped tests become permanent blind spots. Nobody remembers why they were skipped. Empty catch blocks hide real errors. + +**Example signals:** +- `@skip` or `.skip()` with no expiration date +- `try { ...test code... } catch (e) {}` (silently ignore errors) +- `err := doSomething(); _ = err` (acknowledge but ignore) + +**Fix:** Delete the test if no longer relevant, or unskip and fix it. 
Add a reason annotation with a date if skipping is truly necessary: +```go +t.Skip("TODO: fix timing issue (2024-01-15)") +``` + +### Anti-Pattern 7: Poor Naming + +**What it looks like:** Test names use sequential numbers (`test1`, `test2`), vague names (`testFunc`, `test_new`), or generic descriptions (`it('works')`, `it('handles case')`). + +**Why it's problematic:** Poor names hide intent. Developers reading test output see `test1 failed` but have no idea what behavior broke. Good test names document expected behavior. + +**Example signals:** +- `TestCreateUser1`, `TestCreateUser2` +- `test_new`, `testFunc`, `test_handle` +- `it('works')`, `it('handles case')`, `it('does something')` + +**Fix:** Use descriptive names that describe the scenario and expected outcome: +- Go: `Test_CreateUser_WithValidEmail_ReturnsNewUser` +- Python: `test_create_user_with_valid_email_returns_new_user` +- JS: `it('creates a user when given a valid email')` + +### Anti-Pattern 8: Missing Edge Cases + +**What it looks like:** Test suite covers only the happy path. No tests for empty inputs, null values, boundary conditions, errors, or large datasets. + +**Why it's problematic:** Missing edge cases cause production bugs. The happy path works, but the code crashes on empty input, null reference, or boundary values. + +**Example signals:** +- Only tests with valid input; no tests with empty/null +- No tests for negative numbers, zero, or max values +- No tests for error conditions (timeout, connection failure) + +**Fix:** Add tests for: +- **Empty**: empty string, empty array, empty object +- **Null**: null input, missing required field +- **Boundary**: zero, max value, min value, off-by-one +- **Error**: timeout, network failure, permission denied +- **Large**: very large arrays, deep nesting + +### Anti-Pattern 9: Slow Test Suites + +**What it looks like:** Full database reset between every test. No parallelization. Fixture data shared instead of created per-test. 
Tests wait on actual time. + +**Why it's problematic:** Slow tests discourage running locally. Developers skip tests before committing, bugs slip through. CI builds take hours, slowing iteration. + +**Example signals:** +- Each test: `DROP TABLE users; INSERT INTO users ...` (30s per test) +- Sequential execution with no parallelization +- Tests use `time.Sleep(1000)` to wait for something + +**Fix:** +- Use transactions that rollback instead of dropping tables +- Run tests in parallel: `go test -parallel 8`, `pytest -n auto` +- Create fixtures once, reference per-test: fixture factories, test-specific data builders +- Replace waits with condition checks: `waitFor(() => element.textContent)` instead of `sleep(1000)` + +### Anti-Pattern 10: Flaky Tests + +**What it looks like:** Tests use `sleep()`, `time.Sleep()`, `setTimeout()` or unsynchronized goroutines. Tests pass locally but fail randomly in CI. + +**Why it's problematic:** Flaky tests erode trust in the test suite. Developers don't know if a failure is real or just timing. Teams start ignoring test failures — the worst outcome. + +**Example signals:** +- `time.Sleep(100 * time.Millisecond)` to wait for goroutine +- `setTimeout(() => { ...assert... }, 500)` hoping it's ready +- Tests pass locally but fail in CI (slower machines, resource contention) + +**Fix:** +- Replace `sleep()` with explicit waits: `waitFor()`, `sync.WaitGroup`, channels +- Inject fake clocks or time control: `time.Now()` should be mockable +- Synchronize goroutines with channels or `sync.WaitGroup`, not timing +- Tests must be deterministic: same input → same output, regardless of machine speed --- ## Error Handling ### Error: "Cannot Determine if Pattern is Anti-Pattern" -Cause: Context-dependent -- pattern may be valid in specific situations + +Cause: Context-dependent — pattern may be valid in specific situations + Solution: 1. Check if the test has a comment explaining the unusual approach 2. 
Consider the testing layer (unit vs integration vs E2E) @@ -261,7 +356,9 @@ Solution: 4. When in doubt, flag as MEDIUM and explain trade-offs ### Error: "Fix Changes Test Behavior" + Cause: Anti-pattern was masking an actual test gap or testing wrong thing + Solution: 1. Identify what the test was originally trying to verify 2. Write the correct assertion for that behavior @@ -269,7 +366,9 @@ Solution: 4. Do not silently change what a test covers ### Error: "Suite Has Hundreds of Anti-Patterns" + Cause: Systemic test quality issues, not individual mistakes + Solution: 1. Do NOT attempt to fix everything at once 2. Focus on HIGH severity items only (flaky, order-dependent) @@ -278,30 +377,6 @@ Solution: --- -## Anti-Patterns (Meta) - -### Anti-Pattern 1: Rewriting Instead of Fixing -**What it looks like**: Deleting the entire test and writing a new one from scratch -**Why wrong**: Loses institutional knowledge of what was being tested; may reduce coverage -**Do instead**: Preserve intent, fix the specific anti-pattern, keep test focused - -### Anti-Pattern 2: Fixing Style Without Fixing Substance -**What it looks like**: Renaming `test1` to `test_creates_user` but not fixing the incomplete assertion inside -**Why wrong**: Cosmetic improvement without reliability gain -**Do instead**: Fix reliability issues first (assertions, flakiness), then naming - -### Anti-Pattern 3: Adding Tests Without Removing Anti-Patterns -**What it looks like**: Writing new good tests alongside existing bad ones -**Why wrong**: Bad tests still produce false confidence and maintenance burden -**Do instead**: Fix or delete the anti-pattern test, then add proper coverage if needed - -### Anti-Pattern 4: Bulk Fixing Without Verification -**What it looks like**: Applying the same fix pattern to 50 tests without running them -**Why wrong**: Mechanical fixes miss context-specific nuances; may break tests -**Do instead**: Fix one, verify, fix next. Batch only after pattern is proven safe. 
- ---- - ## References ### Quick Reference Table @@ -320,6 +395,7 @@ Solution: | Flaky tests | Random failures | Control time, synchronize, no sleep | ### Red Flags During Review + - `@skip`, `@ignore`, `xit`, `.skip` without expiration date - `time.sleep()`, `setTimeout()` in test code - Test names with sequential numbers (`test1`, `test2`) @@ -329,6 +405,7 @@ Solution: - Assertions like `!= nil`, `> 0`, `toBeTruthy()` without value checks ### TDD Relationship + Strict TDD prevents most anti-patterns: 1. **RED phase** catches incomplete assertions (test must fail first) 2. **GREEN phase minimum** prevents over-specification @@ -338,17 +415,8 @@ Strict TDD prevents most anti-patterns: If you find anti-patterns in a codebase, check if TDD discipline slipped. -### Domain-Specific Anti-Rationalization - -| Rationalization | Why It's Wrong | Required Action | -|-----------------|----------------|-----------------| -| "The test passes, so it's fine" | Passing with anti-patterns gives false confidence | Evaluate assertion quality, not just pass/fail | -| "We can fix test quality later" | Anti-patterns compound; flaky tests erode trust daily | Fix HIGH severity items now, defer LOW | -| "Just skip the flaky test for now" | Skipped tests become permanent blind spots | Diagnose root cause, fix or delete | -| "Mocking everything is faster" | Over-mocking tests mock wiring, not behavior | Mock only at architectural boundaries | -| "One big test covers everything" | Monolithic tests are fragile and hard to debug | Split into focused, independent tests | - ### Reference Files + - `${CLAUDE_SKILL_DIR}/references/anti-pattern-catalog.md`: Detailed code examples for all 10 anti-patterns (Go, Python, JavaScript) - `${CLAUDE_SKILL_DIR}/references/fix-strategies.md`: Language-specific fix patterns and tooling - `${CLAUDE_SKILL_DIR}/references/blind-spot-taxonomy.md`: 6-category taxonomy of what high-coverage test suites commonly miss (concurrency, state, boundaries, security, 
integration, resilience) diff --git a/skills/threejs-builder/SKILL.md b/skills/threejs-builder/SKILL.md index fa680dc8..5fef4880 100644 --- a/skills/threejs-builder/SKILL.md +++ b/skills/threejs-builder/SKILL.md @@ -34,48 +34,11 @@ routing: # Three.js Builder Skill -## Operator Context - -This skill operates as an operator for Three.js web application creation, configuring Claude's behavior for structured, scene-graph-driven 3D development. It implements the **Phased Construction** architectural pattern -- Design, Build, Animate, Polish -- with **Domain Intelligence** embedded in modern Three.js (r150+) ES module patterns. - -### Hardcoded Behaviors (Always Apply) -- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before building -- **Over-Engineering Prevention**: Build only what the user asked for. No speculative features, no "while I'm here" additions -- **ES Modules Only**: Always use modern ES module imports from CDN or npm. Never use legacy global `THREE` variable or CommonJS -- **Scene Graph First**: Structure all objects through the scene graph hierarchy. 
Use `Group` for logical groupings -- **Responsive by Default**: Every app must handle window resize and cap `devicePixelRatio` at 2 -- **Single HTML File**: Default output is one self-contained HTML file unless user specifies otherwise - -### Default Behaviors (ON unless disabled) -- **Three-Point Lighting**: Set up ambient + directional + fill light for standard scenes -- **OrbitControls**: Include orbit camera controls for interactive scenes -- **Animation Loop via setAnimationLoop**: Use `renderer.setAnimationLoop()` over manual `requestAnimationFrame` -- **Configuration Object**: Define visual constants (colors, speeds, sizes) in a top-level `CONFIG` object -- **Modular Setup Functions**: Separate scene creation into `createScene()`, `createLights()`, `createMeshes()` functions - -### Optional Behaviors (OFF unless enabled) -- **Post-Processing**: Bloom, depth of field via EffectComposer -- **Model Loading**: GLTF/GLB loading with auto-center and scale -- **Custom Shaders**: ShaderMaterial with GLSL vertex/fragment shaders -- **Shadow Mapping**: PCFSoft shadows with configurable map resolution -- **Physics Integration**: Cannon.js gravity and collision simulation -- **Raycasting**: Mouse/touch picking of 3D objects - -## What This Skill CAN Do -- Create complete Three.js web applications from a user description -- Set up scenes with proper lighting, camera, renderer, and resize handling -- Use built-in geometries (Box, Sphere, Cylinder, Torus, Plane, Cone, Icosahedron) -- Apply PBR materials (Standard, Physical) and basic materials (Basic, Phong, Normal) -- Implement animations: rotation, oscillation, wave motion, mouse tracking -- Add OrbitControls, GLTF model loading, post-processing, raycasting -- Vary visual style to match context (portfolio, game, data viz, background effect) - -## What This Skill CANNOT Do -- Create complex game engines (use Unity, Unreal instead) -- Generate or edit 3D model files (modeling is done in Blender, etc.) 
-- Implement VR/AR experiences (specialized WebXR knowledge needed) -- Replace dedicated CAD software for engineering drawings -- Optimize scenes with 1M+ polygons (requires specialized LOD/culling strategies) +## Overview + +This skill builds complete Three.js web applications using a **Phased Construction** pattern with four phases: Design, Build, Animate, Polish. It is designed for modern Three.js (r150+) ES module patterns and produces scene-graph-driven 3D visualizations. + +**Scope**: Use for 3D web apps, interactive scenes, WebGL visualizations, and product viewers. Do NOT use for game engines, 3D model creation, VR/AR experiences, or CAD workflows. --- @@ -85,6 +48,13 @@ This skill operates as an operator for Three.js web application creation, config **Goal**: Understand what the user wants and select appropriate Three.js components. +**Core Constraints**: +- **Build only what the user asked for** — no speculative features or "while I'm here" additions +- **Use modern Three.js (r150+) ES modules only** — always import from CDN (`https://unpkg.com/three@0.160.0/build/three.module.js`) or npm, never use legacy global `THREE` variable or CommonJS +- **Structure through the scene graph** — use `Group` for logical groupings and maintain proper hierarchy +- **Vary style by context** — portfolio/showcase use elegant muted palettes; games use bright colors; data viz uses clean lines; backgrounds use subtle slow movement; product viewers use realistic PBR lighting +- **Read repository CLAUDE.md before building** — ensure compliance with local development standards + **Step 1: Identify the core visual element** Determine from the user request: @@ -105,15 +75,9 @@ Determine from the user request: - Extras: [post-processing / raycasting / particles] ``` -**Step 3: Choose visual style based on context** +**Step 3: Document visual style** -| Context | Style Guidance | -|---------|---------------| -| Portfolio/showcase | Elegant, smooth animations, muted palette | -| 
Game/interactive | Bright colors, snappy controls, particle effects | -| Data visualization | Clean lines, grid helpers, clear labels | -| Background effect | Subtle, slow movement, dark gradients | -| Product viewer | Realistic PBR lighting, smooth orbit, neutral backdrop | +Record the visual direction for this scene (e.g., "elegant minimal portfolio style", "vibrant interactive game", "clean data visualization"). Use this to guide material colors, lighting warmth, and animation pacing. **Gate**: Scene plan documented with geometry, material, lighting, animation, and controls selected. Proceed only when gate passes. @@ -121,6 +85,14 @@ Determine from the user request: **Goal**: Construct the scene with proper structure and modern patterns. +**Core Constraints**: +- **Single HTML file output by default** unless user specifies otherwise +- **Include resize handling** that caps `devicePixelRatio` at 2 and updates camera aspect on window change +- **Use a top-level `CONFIG` object** for all visual constants (colors, speeds, sizes) — no magic numbers scattered through code +- **Separate concerns into modular setup functions**: `createScene()`, `createLights()`, `createMeshes()` — this enables testing and reuse +- **Include three-point lighting by default**: ambient light + directional light + fill light, unless user specifies a different lighting strategy +- **Use `renderer.setAnimationLoop()` instead of manual `requestAnimationFrame()`** for cleaner animation setup + **Step 1: Create HTML boilerplate** Every app starts with this structure: @@ -149,6 +121,13 @@ Every app starts with this structure: **Step 2: Build scene infrastructure** ```javascript +// CONFIG object at top level +const CONFIG = { + colors: { /* color hex values */ }, + speeds: { /* animation speeds */ }, + sizes: { /* geometric dimensions */ } +}; + // Scene, camera, renderer const scene = new THREE.Scene(); const camera = new THREE.PerspectiveCamera( @@ -169,7 +148,7 @@ 
window.addEventListener('resize', () => { **Step 3: Add lighting, geometry, and materials per scene plan** -Build each component from the Phase 1 plan. Create geometry once, reuse where possible. Use `Group` for hierarchical transforms. +Build each component from the Phase 1 plan. Create geometry once and reuse where possible (avoid allocating new geometries in animation loops). Use `Group` for hierarchical transforms and logical scene organization. **Gate**: Scene renders without errors. All planned geometry, materials, and lights are present. Proceed only when gate passes. @@ -177,6 +156,12 @@ Build each component from the Phase 1 plan. Create geometry once, reuse where po **Goal**: Add motion, interaction, and life to the scene. +**Core Constraints**: +- **Never allocate geometry or materials inside the animation loop** — this causes garbage collection pauses and frame rate collapse +- **Use the `time` parameter (in milliseconds) for time-based animation** — multiply by small factors (0.001, 0.0005) for smooth motion +- **Include OrbitControls by default** for interactive scenes (unless user requests a specific control scheme) +- **Transform only position, rotation, and scale per frame** — all geometry and materials are static + **Step 1: Set up animation loop** ```javascript @@ -189,7 +174,10 @@ renderer.setAnimationLoop((time) => { **Step 2: Implement planned animations** -Apply transforms per frame. Use `time` parameter (milliseconds) for time-based animation. Multiply by small factors (0.001, 0.0005) for smooth motion. +Apply transforms per frame. Time-based animation should follow the pattern: +```javascript +mesh.rotation.x = CONFIG.speeds.rotation * (time * 0.001); +``` **Step 3: Add interaction handlers** @@ -201,9 +189,17 @@ Wire up mouse/touch events, orbit controls, or raycasting per the scene plan. **Goal**: Ensure quality, performance, and completeness.
+**Core Constraints**: +- **Remove all debug helpers** (AxesHelper, GridHelper, Stats) unless user explicitly requested them +- **Remove all commented-out code and TODO markers** +- **Every scene must handle window resize** and render correctly at all viewport sizes +- **Lighting must produce visible surfaces** — no black screens from missing lights +- **Colors and visual style must match the intended context** — this is non-negotiable quality bar + **Step 1: Verify responsive behavior** -- Resize browser window -- canvas fills viewport without distortion +- Resize browser window — canvas fills viewport without distortion - `devicePixelRatio` capped at 2 +- Test at common mobile/tablet/desktop breakpoints **Step 2: Verify visual quality** - Lighting produces visible surfaces (no black screen from missing lights) @@ -223,28 +219,6 @@ Wire up mouse/touch events, orbit controls, or raycasting per the scene plan. --- -## Examples - -### Example 1: Simple Animated Scene -User says: "Create a threejs scene with a rotating icosahedron" -Actions: -1. Design: low-poly icosahedron, standard material, three-point lighting, continuous rotation (DESIGN) -2. Build: HTML boilerplate, scene setup, IcosahedronGeometry with flatShading, lighting (BUILD) -3. Animate: rotation on x and y axes using time parameter (ANIMATE) -4. Polish: verify resize, test in browser, remove debug helpers (POLISH) -Result: Single HTML file with responsive, animated 3D scene - -### Example 2: Interactive Product Viewer -User says: "Build a 3D product viewer that loads a GLB model" -Actions: -1. Design: GLTF loader, PBR material, realistic lighting, OrbitControls, neutral backdrop (DESIGN) -2. Build: HTML with GLTFLoader import, auto-center/scale model, environment lighting (BUILD) -3. Animate: orbit controls with damping, optional auto-rotate (ANIMATE) -4. 
Polish: loading progress indicator, responsive, verify model renders (POLISH) -Result: Interactive model viewer with orbit controls and proper lighting - ---- - ## Error Handling ### Error: "Black Screen / Nothing Renders" @@ -271,49 +245,6 @@ Solution: --- -## Anti-Patterns - -### Anti-Pattern 1: Creating Geometry Inside the Animation Loop -**What it looks like**: `new THREE.BoxGeometry(1,1,1)` called every frame -**Why wrong**: Allocates memory every frame, causes GC pauses and frame rate collapse -**Do instead**: Create all geometries and materials once during setup. Transform only position, rotation, and scale in the loop. - -### Anti-Pattern 2: Using Legacy Global THREE Patterns -**What it looks like**: `