
A/B test: code-reviewer dense prompt vs verbose baseline (tune/agent-prompts) #54

@aaronsb


Context

Local branch `tune/agent-prompts` carries a single commit that compresses `agents/code-reviewer.md` from 266 lines to 36, a reduction of roughly 86%, into a density-matched subagent prompt. The commit has been sitting since 2026-04-14 and was deliberately set aside during the ADR-123 firing-dynamics work (PRs #48, #49) so the prompt-compression refactor wouldn't mix signals with the architectural work in flight. Shipping both in the same window would have made it impossible to tell which change was responsible for any behavioral shift.

The rationale per Aaron: a more cohesive scoring system should land before testing compression, so the experiment runs against the new scoring rather than the old.

Empirical baseline we already have

The verbose 266-line version was invoked against PR #49 late in the session that merged it. It produced:

  • 7 legitimate findings (3 correctness, 1 safety, 3 design observations)
  • 1 false positive (finding #8: grepped for the `way_redisclosed` string literal and missed the ternary at show/mod.rs:133)
  • Clean file:line citations
  • Respected the report structure specified in the invocation prompt
  • Reasonable brevity (under the 800-word cap)

That's the bar the dense version needs to match or exceed.

Proposed A/B

  1. Land whatever work the "cohesive scoring system" refers to (separate issue when it's scoped).
  2. Invoke the code-reviewer subagent against a non-trivial PR (same input shape as the PR #49 invocation) using the current main (verbose 266-line) prompt. Save the output.
  3. Check out `tune/agent-prompts` and re-invoke against the same PR. Save the output.
  4. Compare:
    • Finding count and distribution by category (correctness / safety / design / nit)
    • False-positive rate
    • File:line citation precision
    • Adherence to the report structure
    • Total tokens consumed (cheaper is only a win if quality holds)
    • Any regression in edge-case coverage vs the verbose version
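
The comparison dimensions above could be tabulated with a small script once both outputs are saved. What follows is a minimal sketch, assuming each review's findings have been hand-labeled into records first. The `Finding` type, its fields, the citation regex, and the `summarize` helper are all hypothetical and do not exist in the repository; token counts and edge-case coverage would still need to be gathered separately.

```python
import re
from collections import Counter
from dataclasses import dataclass


# Hypothetical record for one hand-labeled finding from a saved review output.
@dataclass(frozen=True)
class Finding:
    category: str        # "correctness", "safety", "design", or "nit"
    citation: str        # e.g. "show/mod.rs:133"
    true_positive: bool  # set by a human after checking the cited code


# Assumed citation shape: a path, a colon, then a line number.
CITATION_RE = re.compile(r"^[\w./-]+:\d+$")


def summarize(findings):
    """Reduce one review's findings to the metrics compared in step 4."""
    total = len(findings)
    false_positives = sum(1 for f in findings if not f.true_positive)
    well_formed = sum(1 for f in findings if CITATION_RE.match(f.citation))
    return {
        "count": total,
        "by_category": dict(Counter(f.category for f in findings)),
        "false_positive_rate": false_positives / total if total else 0.0,
        "well_formed_citations": well_formed / total if total else 0.0,
    }
```

Running `summarize` once per saved output gives two dictionaries that can be compared side by side. The citation check only verifies the `file:line` shape; whether a citation points at the right line still needs a human pass.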

Accept / reject criteria

  • Ship compressed if it produces ≥90% of the baseline's true-positive findings at a false-positive rate no higher than the baseline's, and the compressed prompt is easier to maintain and read.
  • Keep verbose if the dense version drops findings, misclassifies, or loses citation precision. Token savings are not sufficient justification if quality regresses.
  • Iterate on dense if it's close but not matching. The 36-line version is likely not the final answer; it's a directional experiment.
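
The criteria above can be written down as a decision rule. This is a sketch under stated assumptions, not an agreed procedure: the `decide` and `fp_rate` helpers are hypothetical, the false-positive rate is taken as false positives over all findings in a run, "easier to maintain" is passed in as a human judgment, and the `close_floor` threshold for "close but not matching" is an invented placeholder, since the issue does not define it.

```python
def fp_rate(true_pos, false_pos):
    """False positives as a share of all findings in one run (assumed definition)."""
    total = true_pos + false_pos
    return false_pos / total if total else 0.0


def decide(baseline_tp, baseline_fp, matched_tp, dense_tp, dense_fp,
           easier_to_maintain, close_floor=0.75):
    """Return "ship", "iterate", or "keep verbose".

    matched_tp is how many of the baseline's true positives the dense run
    also reported. close_floor is an assumed cutoff for "close".
    """
    if baseline_tp <= 0:
        raise ValueError("baseline needs at least one true-positive finding")
    recall = matched_tp / baseline_tp
    fp_ok = fp_rate(dense_tp, dense_fp) <= fp_rate(baseline_tp, baseline_fp)
    if recall >= 0.90 and fp_ok and easier_to_maintain:
        return "ship"
    if recall >= close_floor:
        return "iterate"
    return "keep verbose"


# Baseline from the PR #49 run: 7 legitimate findings, 1 false positive.
print(decide(7, 1, matched_tp=7, dense_tp=7, dense_fp=1, easier_to_maintain=True))
print(decide(7, 1, matched_tp=6, dense_tp=6, dense_fp=0, easier_to_maintain=True))
```

With a baseline of 7 true positives, 90% works out to 6.3 findings, so the dense prompt effectively has to reproduce all 7 to ship; matching 6 of 7 is about 86% and falls into the iterate case under this sketch.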

Notes on sequencing

  • Do NOT merge `tune/agent-prompts` before running the A/B. That would make the dense version the default and forfeit the comparison.
  • Do NOT delete `tune/agent-prompts` before the experiment. The compressed version is meaningful work that represents a directional hypothesis about density-matched prompts.
  • The branch has one commit, 7 hours old as of merge, and a single-file diff. It's cheap to carry indefinitely.
