Skip to content

TBSKR/critic-overkill

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

critic-overkill

A Claude Code skill that dispatches 3 expert critic sub-agents, scores your work 0-100, implements only fixes ≥2 critics agree on, commits per round, and loops until 99/100 or graceful plateau.

License: MIT Claude Code Skill

Why this exists

Single-reviewer AI critique tends to self-agree. You ask Claude "is this good?", and after a couple rounds it converges on "yes" — even when it isn't.

Three critics, dispatched in parallel, with one of them deliberately calibrated to score lower, produce real score spread. And only implementing fixes that two or more critics independently flag filters out solo-taste noise — which is what usually makes "AI polish" oscillate forever without converging.

This skill captures that pattern as a structured loop.

Pairs perfectly with /goal

The killer demo, single line:

/goal /critic-overkill the chat composer — stop when avg score ≥99/100 holds 3 rounds OR plateau, OR after 12 rounds. Real parallel critics. Scope: components/chat/**. Print final score line per round.

Then walk away. Claude runs the loop, dispatches three critics in parallel, implements only convergent fixes, commits per round, and the /goal evaluator catches PASS or plateau from the structured score line critic-overkill prints every round.

Why this works (the protocol)

/goal's evaluator is blind — it judges only what appears in the conversation transcript. It cannot run tools to verify "is this good?" So most /goal conditions either rubber-stamp or loop forever.

ClaudeCritic Overkill's per-round output is already evaluator-shaped. Every round, it prints:

ROUND 7 | avg=96.3 | conv-ux=99 brand-visual=97 magpie-trend=93 | verdict=NEAR

That single line is the protocol between the loop body and the outer /goal evaluator. The success condition (avg ≥99 held 3 rounds) is trivially pattern-matchable from the transcript. One line in, structured proof per round, machine-verifiable end-state.

This is the genuinely novel pattern: skill that emits a verifiable artifact per turn + /goal condition that pattern-matches that artifact. Naive /goal users ship vibes-based conditions and get noise. With critic-overkill you get a contract.

Install

git clone https://github.com/TBSKR/critic-overkill.git
mkdir -p ~/.claude/skills
cp -R critic-overkill/skill ~/.claude/skills/critic-overkill

That's it. Next time you ask Claude Code to refine, polish, or perfect something — or set a /goal containing words like "production-grade" or "until 99/100" — the skill activates.

First run

On first invocation in a new project, critic-overkill runs a short auto-detection pass that reads your CLAUDE.md, package.json (or equivalent), README, and recent commits to detect:

  • language (en/nl/de/...)
  • stack (Next/Python/Go/...)
  • product type (UI app / API / library / CLI)
  • audience
  • verification commands (typecheck, lint, test)
  • commit-message style
  • default critic trio

It writes .critic-overkill/config.md and shows the detected config for one-pass confirmation. Subsequent runs skip the init. Delete .critic-overkill/config.md to re-detect.

The three critics

For UI work, critic-overkill defaults to:

Critic Persona Calibrates to
conv-ux Principal Conversational AI Designer 90-99
brand-visual Design Director, Brand & Visual Systems 88-99
magpie-trend UI Magpie & Trend Scout, intentionally strict 78-95

The Magpie is the bias breaker. It starts at 85 by default and only reaches 99 when it would genuinely steal patterns from your work. Without the Magpie, three critics tend to halo-agree and the loop falsely declares PASS.

Other critic pools (auto-selected based on detected product type):

  • API: api-architect + dx-reviewer + magpie-trend
  • Copy: voice-tone + localization + magpie-trend
  • Code module: senior-architect + code-reviewer + magpie-trend
  • Security-sensitive: security-reviewer + senior-architect + magpie-trend
  • Mixed/unclear: senior-generalist + brand-visual + magpie-trend

Override with --critics "a,b,c" per invocation.

Hard rules (what makes it work)

  • Always implement. Every round that finds convergent issues writes code, runs verifications, and commits. A round that produces only critic reports without changes is a failed round.
  • Convergent-fix rule. Only issues flagged by ≥2 of 3 critics get implemented. Solo-critic taste goes to a parked log.
  • Diff-aware critics. From round 2+, each critic receives an explicit diff of changes since their last review. Forces acknowledgment of progress instead of resetting.
  • Plateau detection. If average score improves <2 points over 3 rounds AND all remaining issues are solo-critic, stop gracefully with a "diminishing returns" report.
  • Commits per round. One commit per fix-batch, in your project's commit style.
  • Never edit production code to make the audit work. Mock at the test boundary (Playwright route.fulfill, fixture JSON). The production route handler must never gain if (DEV_AUTH_BYPASS) ... branches added by this skill.
  • Real parallel sub-agent dispatch required. If the harness can't dispatch 3 truly parallel agents, single-reviewer role-play creates halo bias. The skill flags this as SYNTHETIC_SPREAD and tells you to trust scores less.

Audit fixtures

The skill captures current state reproducibly using one of five patterns (see skill/audit-fixtures.md):

  • A. Web UI — Playwright spec that screenshots target states
  • B. Live interactive — Playwright MCP for streaming/dynamic flows
  • C. API — fixture script dumping request/response JSON
  • D. Copy — markdown table of all user-visible strings
  • E. Code module — git diff + file dump

Fixtures live in output/critic-overkill/<target-slug>/round-<N>/ and are overwritten each round so improvements are measurable.

What a stop report looks like

## ClaudeCritic Overkill report — chat composer

**Final scores**: Conv-UX 99/100 · Brand 96/100 · Magpie 91/100
**Rounds run**: 10
**Commits**: 8 (a3f9d2..7c8e4b)

**Verdict**: NEAR — plateau detected (Magpie under threshold)
**Reason for stop**: PLATEAU

**Parked solo-critic suggestions** (not implemented):
- Magpie #1: "would benefit from rotating placeholder hints" (taste-only)
- Magpie #3: "consider mobile-only keyboard hint variations"

**Files changed**: components/composer/*, styles/composer.css, app/chat/page.tsx

Cost model

The first round dominates (~10× the cost of rounds 2-5). Fixture setup + per-state capture is one-time. Don't bail in round 1 because it feels expensive — subsequent rounds are cheap.

A typical 10-round session on a single UI component runs at developer-coffee scale, not enterprise-budget scale.

Anti-patterns

  • ❌ Don't add a fourth critic "for thoroughness" — three is the sweet spot
  • ❌ Don't require unanimous consensus — Magpie's job is to disagree, ≥2/3 is the rule
  • ❌ Don't chase the last point — if Magpie sits at 92 and the others are at 99, that's the plateau, stop
  • ❌ Don't squash multiple rounds into one commit — per-round commits are the audit trail
  • ❌ Don't trust unanimous scores from synthetic role-played critics — that's halo bias

Contributing

New critic roles welcome. To add one, drop a markdown file in skill/critic-roles/<name>.md following the existing template structure (persona, scoring scale, what they look at, prompt template, common blind-spots).

Open a PR with a real example session showing the new role's behavior.

License

MIT — see LICENSE. Use it, fork it, modify it, ship it.

Acknowledgments

Pattern emerged from real long-running polish sessions. The Magpie persona — the intentionally-low-scoring trend-aware critic — is the structural innovation that makes this work. Without it, three critics tend to halo together and the loop falsely declares PASS.

Inspired by the Claude Code Skills format documented at docs.claude.com/en/docs/claude-code/skills.

Troubleshooting

"The skill didn't activate." Restart Claude Code after installing — skills are scanned at session start. Then check ~/.claude/skills/critic-overkill/SKILL.md exists with a YAML frontmatter block at the top.

"I get SYNTHETIC_SPREAD in my report." Your Claude Code version can't dispatch ≥2 truly parallel sub-agents, so all three critics ran sequentially via role-play. Scores are halo-biased — trust them less. Update Claude Code or switch to a version with the Task tool enabled.

"No .critic-overkill/config.md written." The init step ran but couldn't detect a stack. Check your project has at least a package.json / pyproject.toml / go.mod / Cargo.toml / composer.json, OR write .critic-overkill/config.md manually before running the loop.

"My project doesn't have Playwright but I want to audit UI." Pattern B (Playwright MCP) works without project-level Playwright as long as the MCP server is installed in Claude Code. Otherwise, use Pattern E (code-module diff) for the underlying components, or install Playwright locally first.

"Loop keeps re-flagging fixes I already made." Round-2+ critics should receive a diff summary of changes since their last review. Check that the loop is actually building the diff summary (see loop-flow.md step 2). If it isn't, the critic prompt is malformed.

About

3 critic sub-agents score your work 0-100, only fixes ≥2 agree on get implemented, commits per round, loops until 99 or plateau. Pairs with /goal.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors