A/B test: code-reviewer dense prompt vs verbose baseline (tune/agent-prompts) #54

## Context

Local branch `tune/agent-prompts` carries a single commit that compresses `agents/code-reviewer.md` from 266 lines to 36 (roughly an 87% reduction) into a density-matched subagent prompt. The commit has been sitting since 2026-04-14 and was deliberately set aside during the ADR-123 firing-dynamics work (PRs #48, #49) so the prompt-compression refactor wouldn't mix signals with the architectural work in flight. Shipping both in the same window would have made it impossible to tell which change was responsible for any behavioral shift.

The rationale per Aaron: a more cohesive scoring system should land before testing compression, so the experiment runs against the new scoring rather than the old.

## Empirical baseline we already have

The verbose 266-line version was invoked against PR #49 late in the session that merged it. It produced:
- 7 legitimate findings (3 correctness, 1 safety, 3 design observations)
- 1 false positive (finding #8: it grepped for the `way_redisclosed` string literal and missed the ternary at show/mod.rs:133)
- Clean file:line citations
- Respected the report structure specified in the invocation prompt
- Reasonable brevity (under the 800-word cap)

That's the bar the dense version needs to match or exceed.

## Proposed A/B

1. Land whatever work the "cohesive scoring system" refers to (separate issue when it's scoped).
2. Invoke the code-reviewer subagent against a non-trivial PR — same input shape as the #49 invocation — using the **current main** (verbose 266-line) prompt. Save the output.
3. Check out `tune/agent-prompts` and re-invoke against the same PR. Save the output.
4. Compare:
   - Finding count and distribution by category (correctness / safety / design / nit)
   - False-positive rate
   - File:line citation precision
   - Adherence to the report structure
   - Total tokens consumed (cheaper is only a win if quality holds)
   - Any regression in edge-case coverage vs the verbose version
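The first two comparison metrics can be tallied mechanically once each finding has been hand-labelled. The following is a minimal sketch, not tooling that exists in the repo: the `Finding` type, the `summarize` helper, and the `"unlabelled"` category are hypothetical names introduced for illustration. The baseline numbers are the ones recorded above for the PR #49 run.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One reviewer finding after manual triage (hypothetical structure)."""
    category: str     # "correctness", "safety", "design", or "nit"
    legitimate: bool  # True if the finding held up on inspection


def summarize(findings):
    """Tally legitimate findings by category and compute the false-positive rate."""
    legit = [f for f in findings if f.legitimate]
    false_positives = len(findings) - len(legit)
    return {
        "total": len(findings),
        "true_positives": len(legit),
        "by_category": dict(Counter(f.category for f in legit)),
        "false_positive_rate": false_positives / len(findings) if findings else 0.0,
    }


# Baseline from the verbose run: 3 correctness, 1 safety, 3 design, 1 false positive.
# The issue does not record a category for finding #8, so it is marked "unlabelled".
baseline = (
    [Finding("correctness", True)] * 3
    + [Finding("safety", True)]
    + [Finding("design", True)] * 3
    + [Finding("unlabelled", False)]
)
baseline_summary = summarize(baseline)
```

Running the same `summarize` over the dense run's labelled findings gives two dictionaries that can be compared side by side. Citation precision, structure adherence, token count, and edge-case coverage still need separate checks.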

## Accept / reject criteria

- **Ship compressed** if it reproduces at least 90% of the baseline's true-positive findings with a false-positive rate no higher than the baseline's, and the compressed prompt is easier to maintain/read.
- **Keep verbose** if the dense version drops findings, misclassifies, or loses citation precision. Token savings are not sufficient justification if quality regresses.
- **Iterate on dense** if it's close but not matching. The 36-line version is likely not the final answer; it's a directional experiment.
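The three outcomes above can be sketched as a small decision function. This is an illustration under stated assumptions rather than an agreed rule: the function name, its parameters, and especially the `near_miss_recall` threshold for "close but not matching" are hypothetical, since the criteria do not define how close is close. It also models only finding recall and false-positive rate; misclassification and citation precision would still be judged by hand.

```python
def decide(baseline_tp, baseline_fp_rate, dense_matched_tp, dense_fp_rate,
           easier_to_maintain, near_miss_recall=0.75):
    """Map A/B results onto ship / iterate / keep.

    dense_matched_tp: baseline true positives the dense run also found.
    near_miss_recall: ASSUMED cutoff for "close"; not specified in the criteria.
    """
    if baseline_tp <= 0:
        raise ValueError("baseline must contain at least one true positive")
    recall = dense_matched_tp / baseline_tp
    if recall >= 0.90 and dense_fp_rate <= baseline_fp_rate and easier_to_maintain:
        return "ship compressed"
    if recall >= near_miss_recall:
        return "iterate on dense"
    return "keep verbose"


# With the PR #49 baseline (7 true positives, 1 false positive out of 8 findings):
all_seven = decide(7, 1 / 8, 7, 1 / 8, easier_to_maintain=True)
six_of_seven = decide(7, 1 / 8, 6, 1 / 8, easier_to_maintain=True)
```

One consequence worth noting: against a 7-finding baseline, 6 of 7 is about 85.7%, so the 90% threshold effectively requires the dense prompt to reproduce all seven legitimate findings.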

## Notes on sequencing

- Do NOT merge `tune/agent-prompts` before running the A/B. That would make the dense version the default and forfeit the comparison.
- Do NOT delete `tune/agent-prompts` before the experiment. The compressed version is meaningful work that represents a directional hypothesis about density-matched prompts.
- The branch has one commit, 7 hours old as of merge, and a single-file diff. It's cheap to carry indefinitely.

## References

- PR #49 code review comment — the empirical baseline run with the verbose prompt: https://github.com/aaronsb/agent-ways/pull/49#issuecomment-4248990005
- Local branch `tune/agent-prompts`: single commit `fb1476a refactor(agents): compress code-reviewer to density-matched subagent prompt`
