A Claude Code skill that dispatches 3 expert critic sub-agents, scores your work 0-100, implements only fixes ≥2 critics agree on, commits per round, and loops until 99/100 or graceful plateau.
Single-reviewer AI critique tends to self-agree. You ask Claude "is this good?", and after a couple rounds it converges on "yes" — even when it isn't.
Three critics, dispatched in parallel, with one of them deliberately calibrated to score lower, produce real score spread. And only implementing fixes that two or more critics independently flag filters out solo-taste noise — which is what usually makes "AI polish" oscillate forever without converging.
This skill captures that pattern as a structured loop.
The killer demo, single line:
/goal /critic-overkill the chat composer — stop when avg score ≥99/100 holds 3 rounds OR plateau, OR after 12 rounds. Real parallel critics. Scope: components/chat/**. Print final score line per round.
Then walk away. Claude runs the loop, dispatches three critics in parallel, implements only convergent fixes, commits per round, and the /goal evaluator catches PASS or plateau from the structured score line critic-overkill prints every round.
/goal's evaluator is blind — it judges only what appears in the conversation transcript. It cannot run tools to verify "is this good?" So most /goal conditions either rubber-stamp or loop forever.
ClaudeCritic Overkill's per-round output is already evaluator-shaped. Every round, it prints:
ROUND 7 | avg=96.3 | conv-ux=99 brand-visual=97 magpie-trend=93 | verdict=NEAR
That single line is the protocol between the loop body and the outer /goal evaluator. The success condition (avg ≥99 held 3 rounds) is trivially pattern-matchable from the transcript. One line in, structured proof per round, machine-verifiable end-state.
This is the genuinely novel pattern: skill that emits a verifiable artifact per turn + /goal condition that pattern-matches that artifact. Naive /goal users ship vibes-based conditions and get noise. With critic-overkill you get a contract.
git clone https://github.com/TBSKR/critic-overkill.git
mkdir -p ~/.claude/skills
cp -R critic-overkill/skill ~/.claude/skills/critic-overkillThat's it. Next time you ask Claude Code to refine, polish, or perfect something — or set a /goal containing words like "production-grade" or "until 99/100" — the skill activates.
On first invocation in a new project, critic-overkill runs a short auto-detection pass that reads your CLAUDE.md, package.json (or equivalent), README, and recent commits to detect:
- language (en/nl/de/...)
- stack (Next/Python/Go/...)
- product type (UI app / API / library / CLI)
- audience
- verification commands (typecheck, lint, test)
- commit-message style
- default critic trio
It writes .critic-overkill/config.md and shows the detected config for one-pass confirmation. Subsequent runs skip the init. Delete .critic-overkill/config.md to re-detect.
For UI work, critic-overkill defaults to:
| Critic | Persona | Calibrates to |
|---|---|---|
| conv-ux | Principal Conversational AI Designer | 90-99 |
| brand-visual | Design Director, Brand & Visual Systems | 88-99 |
| magpie-trend | UI Magpie & Trend Scout, intentionally strict | 78-95 |
The Magpie is the bias breaker. It starts at 85 by default and only reaches 99 when it would genuinely steal patterns from your work. Without the Magpie, three critics tend to halo-agree and the loop falsely declares PASS.
Other critic pools (auto-selected based on detected product type):
- API: api-architect + dx-reviewer + magpie-trend
- Copy: voice-tone + localization + magpie-trend
- Code module: senior-architect + code-reviewer + magpie-trend
- Security-sensitive: security-reviewer + senior-architect + magpie-trend
- Mixed/unclear: senior-generalist + brand-visual + magpie-trend
Override with --critics "a,b,c" per invocation.
- Always implement. Every round that finds convergent issues writes code, runs verifications, and commits. A round that produces only critic reports without changes is a failed round.
- Convergent-fix rule. Only issues flagged by ≥2 of 3 critics get implemented. Solo-critic taste goes to a parked log.
- Diff-aware critics. From round 2+, each critic receives an explicit diff of changes since their last review. Forces acknowledgment of progress instead of resetting.
- Plateau detection. If average score improves <2 points over 3 rounds AND all remaining issues are solo-critic, stop gracefully with a "diminishing returns" report.
- Commits per round. One commit per fix-batch, in your project's commit style.
- Never edit production code to make the audit work. Mock at the test boundary (Playwright
route.fulfill, fixture JSON). The production route handler must never gainif (DEV_AUTH_BYPASS) ...branches added by this skill. - Real parallel sub-agent dispatch required. If the harness can't dispatch 3 truly parallel agents, single-reviewer role-play creates halo bias. The skill flags this as
SYNTHETIC_SPREADand tells you to trust scores less.
The skill captures current state reproducibly using one of five patterns (see skill/audit-fixtures.md):
- A. Web UI — Playwright spec that screenshots target states
- B. Live interactive — Playwright MCP for streaming/dynamic flows
- C. API — fixture script dumping request/response JSON
- D. Copy — markdown table of all user-visible strings
- E. Code module — git diff + file dump
Fixtures live in output/critic-overkill/<target-slug>/round-<N>/ and are overwritten each round so improvements are measurable.
## ClaudeCritic Overkill report — chat composer
**Final scores**: Conv-UX 99/100 · Brand 96/100 · Magpie 91/100
**Rounds run**: 10
**Commits**: 8 (a3f9d2..7c8e4b)
**Verdict**: NEAR — plateau detected (Magpie under threshold)
**Reason for stop**: PLATEAU
**Parked solo-critic suggestions** (not implemented):
- Magpie #1: "would benefit from rotating placeholder hints" (taste-only)
- Magpie #3: "consider mobile-only keyboard hint variations"
**Files changed**: components/composer/*, styles/composer.css, app/chat/page.tsx
The first round dominates (~10× the cost of rounds 2-5). Fixture setup + per-state capture is one-time. Don't bail in round 1 because it feels expensive — subsequent rounds are cheap.
A typical 10-round session on a single UI component runs at developer-coffee scale, not enterprise-budget scale.
- ❌ Don't add a fourth critic "for thoroughness" — three is the sweet spot
- ❌ Don't require unanimous consensus — Magpie's job is to disagree, ≥2/3 is the rule
- ❌ Don't chase the last point — if Magpie sits at 92 and the others are at 99, that's the plateau, stop
- ❌ Don't squash multiple rounds into one commit — per-round commits are the audit trail
- ❌ Don't trust unanimous scores from synthetic role-played critics — that's halo bias
New critic roles welcome. To add one, drop a markdown file in skill/critic-roles/<name>.md following the existing template structure (persona, scoring scale, what they look at, prompt template, common blind-spots).
Open a PR with a real example session showing the new role's behavior.
MIT — see LICENSE. Use it, fork it, modify it, ship it.
Pattern emerged from real long-running polish sessions. The Magpie persona — the intentionally-low-scoring trend-aware critic — is the structural innovation that makes this work. Without it, three critics tend to halo together and the loop falsely declares PASS.
Inspired by the Claude Code Skills format documented at docs.claude.com/en/docs/claude-code/skills.
"The skill didn't activate." Restart Claude Code after installing — skills are scanned at session start. Then check ~/.claude/skills/critic-overkill/SKILL.md exists with a YAML frontmatter block at the top.
"I get SYNTHETIC_SPREAD in my report." Your Claude Code version can't dispatch ≥2 truly parallel sub-agents, so all three critics ran sequentially via role-play. Scores are halo-biased — trust them less. Update Claude Code or switch to a version with the Task tool enabled.
"No .critic-overkill/config.md written." The init step ran but couldn't detect a stack. Check your project has at least a package.json / pyproject.toml / go.mod / Cargo.toml / composer.json, OR write .critic-overkill/config.md manually before running the loop.
"My project doesn't have Playwright but I want to audit UI." Pattern B (Playwright MCP) works without project-level Playwright as long as the MCP server is installed in Claude Code. Otherwise, use Pattern E (code-module diff) for the underlying components, or install Playwright locally first.
"Loop keeps re-flagging fixes I already made." Round-2+ critics should receive a diff summary of changes since their last review. Check that the loop is actually building the diff summary (see loop-flow.md step 2). If it isn't, the critic prompt is malformed.