feat: add skill-creator skill with eval-driven validation and blind A/B testing by notque · Pull Request #178 · notque/claude-code-toolkit

notque · 2026-03-27T03:11:50Z

Summary

Replace the skill-creator-engineer agent with a self-contained skill that follows an eval-driven model: draft a skill, test it against real prompts, grade the outputs with blind agent review, and iterate based on measured results.

- skill-creator skill: workflow-first SKILL.md with bundled agents (grader, comparator, analyzer), scripts (run_eval, aggregate_benchmark, optimize_description, eval_compare, package_results), assets (blind A/B/C comparison viewer with syntax highlighting), and references (artifact schemas, skill template, complexity tiers, workflow patterns, error catalog)
- Philosophy updates: added the "Skills Are Self-Contained Packages" and "Workflow First, Constraints Inline" principles, both backed by evidence from blind A/B testing
- Old agent removed: skill-creator-engineer deleted and all 25 files that referenced it updated
- Hook matcher groups: performance improvement for PreToolUse/PostToolUse hooks
- Eval workspace gitignore: *-workspace/ and .feature/ directories excluded
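The blind A/B idea behind the comparator can be illustrated with a minimal sketch. It assumes each variant's output is a plain string and uses a hypothetical `judge` callable as a stand-in for the comparator agent; the real comparator in agents/ is an agent, not this function. The key property is that variant names are shuffled onto anonymous labels before judging and mapped back only after the verdict.

```python
import random
from typing import Callable, Dict


def blind_compare(
    outputs: Dict[str, str],
    judge: Callable[[Dict[str, str]], str],
    rng: random.Random,
) -> str:
    """Return the variant name the judge prefers, without showing it the names.

    `outputs` maps variant names (e.g. "baseline", "candidate") to outputs.
    `judge` is a hypothetical stand-in for a blind comparator: it sees outputs
    keyed only by anonymous labels ("A", "B", ...) and returns a winning label.
    """
    names = list(outputs)
    rng.shuffle(names)  # randomize which variant gets which label
    labels = [chr(ord("A") + i) for i in range(len(names))]
    label_to_name = dict(zip(labels, names))
    blinded = {label: outputs[name] for label, name in label_to_name.items()}
    winner_label = judge(blinded)
    return label_to_name[winner_label]  # un-blind only after the verdict


def longest_output_judge(blinded: Dict[str, str]) -> str:
    """Toy judge for illustration only: prefers the longest output."""
    return max(blinded, key=lambda label: len(blinded[label]))
```

For example, `blind_compare({"baseline": "short", "candidate": "a much longer answer"}, longest_output_judge, random.Random(0))` returns `"candidate"` whichever label the shuffle assigns it, while the judge itself only ever sees "A" and "B".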

Test plan

- Verify skill-creator skill loads correctly (/skill-creator triggers)
- Verify old agent references are fully removed (grep -r skill-creator-engineer)
- Verify eval viewer opens correctly from workspace
- Verify hook matcher groups don't break existing hook behavior
- Verify .gitignore excludes workspace and .feature directories
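The second check above relies on `grep -r`. As a rough equivalent for environments without grep, here is a minimal Python sketch; the function name `find_references` and the choice to skip the `.git` directory are assumptions for illustration, not part of this PR.

```python
from pathlib import Path
from typing import List


def find_references(root: str, needle: str = "skill-creator-engineer") -> List[str]:
    """Return paths (relative to root, POSIX style) of files containing `needle`.

    Hypothetical helper mirroring `grep -r`; it skips anything under .git/.
    """
    base = Path(root)
    hits = []
    for path in sorted(base.rglob("*")):
        rel = path.relative_to(base)
        if not path.is_file() or ".git" in rel.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than fail the whole scan
        if needle in text:
            hits.append(rel.as_posix())
    return hits
```

An empty result from `find_references(".")` at the repository root would correspond to `grep -r skill-creator-engineer` finding nothing.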

…/B testing

Replace the skill-creator-engineer agent with a self-contained skill at skills/skill-creator/ that follows the eval-driven model: draft a skill, test against real prompts, grade with agent blind review, iterate based on measured results.

Includes:
- SKILL.md: workflow-first ordering, constraints inline with reasoning
- agents/: grader (assertion evaluation), comparator (blind A/B), analyzer (post-hoc analysis)
- scripts/: run_eval, aggregate_benchmark, optimize_description, eval_compare, package_results
- assets/: blind A/B/C comparison viewer with syntax highlighting, agent review panels, human feedback, skip-to-results
- references/: artifact schemas, skill template, complexity tiers, workflow patterns, error catalog

Philosophy updates:
- Skills Are Self-Contained Packages: scripts, agents, assets bundled inside the skill directory
- Workflow First, Constraints Inline: instructions before context, measured to produce better downstream output (3-0 in blind eval)

Also: hook matcher groups for PreToolUse/PostToolUse performance, eval workspace gitignore, all references updated from old agent.
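The commit message reports a 3-0 result in blind evaluation. A minimal sketch of how per-comparison verdicts might be tallied into such a score follows. The verdict format (one winning variant name per comparison, or "tie") and the variant names are assumptions; this is not the PR's aggregate_benchmark script.

```python
from collections import Counter
from typing import Iterable


def tally(verdicts: Iterable[str], a: str, b: str) -> str:
    """Summarize blind-comparison verdicts as 'wins_a-wins_b', noting ties.

    Each verdict is assumed to be the winning variant's name or "tie".
    """
    counts = Counter(verdicts)
    unknown = set(counts) - {a, b, "tie"}
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    score = f"{counts[a]}-{counts[b]}"
    ties = counts["tie"]
    if ties:
        score += f" ({ties} tie{'s' if ties != 1 else ''})"
    return score
```

With hypothetical variant names, three comparisons all won by the workflow-first ordering give `tally(["workflow-first"] * 3, "workflow-first", "context-first")`, which returns `"3-0"`. Ties are reported separately rather than split as half-wins, so the headline score stays a plain win count.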

notque force-pushed the feat/skill-creator-eval-pipeline branch 2 times, most recently from 15aac68 to 3fc508d (March 27, 2026 03:21)

notque force-pushed the feat/skill-creator-eval-pipeline branch from 3fc508d to f276a1c (March 27, 2026 03:41)

notque merged commit 46c6d21 into main Mar 27, 2026
4 checks passed

notque deleted the feat/skill-creator-eval-pipeline branch March 27, 2026 13:56

notque merged 1 commit into main from feat/skill-creator-eval-pipeline
