Eval findings (300-trial A/B): bootstrap skill activates perfectly, --help injection does nothing, dogfood-tui should teach when NOT to use a session #149

@ThomasK33

Context

We ran a controlled 300-trial A/B experiment (coder/aegis Phase 1) using agent-tty@0.4.0 as the unit under test: 3 arms (help-injection-stripped CLI / CLI as shipped / CLI + bootstrap skill installed in .claude/skills/) × 10 TUI-dogfooding tasks × 10 repeats, claude-haiku-4-5 via Claude Code headless, hermetic Bubble Tea fixture with seeded bugs. Full findings: docs/phase1-matrix-findings.md (S15) + raw per-trial data under docs/evidence/s15-phase1-matrix/.

Three findings are directly actionable for this repo's skill packaging.

1. The --help "MANDATORY FOR CODING AGENTS" injection had a measured effect of exactly zero

Across 100 trials with the CLI on PATH but no skill installed, no agent ever invoked agent-tty, so the help injection never fired (activation Δ vs the injection-stripped control = 0.000, Wilson CI [0, 0.037] in both arms). The injection can only work when something else first causes the agent to run the binary: it is a conversion surface with no acquisition surface. If discovery-without-install matters, the binary needs a vector that appears in what agents actually read unprompted (README/AGENTS.md conventions, or shell-init hints); the help text alone is dead weight.
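
For readers who want to check the interval quoted above, here is a minimal sketch of the Wilson score interval for 0 activations in 100 trials. The function name and the choice of z = 1.96 (a 95% interval) are assumptions for illustration; the issue does not state the confidence level, but this value reproduces the reported upper bound.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 is an assumed 95% level; it is not stated in the findings.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(0, 100)
print(f"[{low:.3f}, {high:.3f}]")  # prints "[0.000, 0.037]"
```

With zero observed activations the upper bound reduces to z² / (n + z²), so 100 trials can only rule out a true activation rate above roughly 3.7%.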

2. The bootstrap skill works essentially perfectly — keep it exactly as designed

With the bootstrap skill installed, activation (read skill → agent-tty skills get) was 70/100 overall and a perfect discriminator by task type: 60/60 on tasks where a PTY genuinely helps, 10/10 ambiguous, 0/30 on tasks where it doesn't (unit tests, static review, non-interactive scripts) — zero PTY sessions opened on those, i.e. the read-then-skip behavior the skill teaches demonstrably works. CLI-discovery waste with the skill: mean 0.63 help/trial-and-error probes (vs the --help spelunking pattern it's designed to prevent).

3. Where the skill needs work: it over-sells runtime evidence

Task success was redistributed, not raised: 74% with the skill vs 76% and 77% in the two no-skill arms, at +46% cost/trial and +50% p50 latency. Per-task results (out of 10 repeats):

| Task | No-skill arms | Skill arm | Reading |
| --- | --- | --- | --- |
| Reproduce a crash + capture evidence | 0–1/10 | 9/10 | PTY constitutive of the task → skill is decisive |
| Review unicode/emoji rendering | 10/10 | 5/10 | Agents trusted rendered snapshots where text-serialization of wide glyphs is exactly the unreliable surface; no-skill agents read the width math in source and nailed it |
| Verify quit-paths leave the terminal clean | 6–9/10 | 4/10 | Same pattern: session evidence misled where code reading sufficed |

The failure mode is specific: the dogfood-tui skill teaches how to gather terminal evidence but not when terminal evidence is the wrong instrument. Suggested additions to dogfood-tui (and/or the core skill):

  • A short "choose your instrument" section: prefer reading the source for layout math, width/alignment, and style-mapping questions; prefer a live session for crashes, input-routing, and screen-lifecycle behavior; when a snapshot and the source disagree, trust the source for wide-glyph/column-width questions (text snapshots serialize double-width glyphs as single chars — we hit this in fixture grading too).
  • An explicit caveat on snapshot output fidelity for CJK/emoji content.
  • Possibly a cost note: a session is worth opening when the bug class needs one, not as the default first move for every visual question.
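
The wide-glyph caveat above can be illustrated with a short sketch that compares a string's character count with an approximate terminal column width. This is not agent-tty code: `display_width` and `snapshot_may_mislead` are hypothetical helper names, and the width rule (East Asian Width classes W and F count as two columns, combining marks as zero) is a simplification that ignores ZWJ emoji sequences and variation selectors.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate terminal columns occupied by text (simplified rule)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks attach to the previous cell
        # Wide (W) and Fullwidth (F) characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def snapshot_may_mislead(text: str) -> bool:
    """Hypothetical check: True when character count and column width differ,
    i.e. when a one-char-per-glyph text snapshot would misreport alignment."""
    return len(text) != display_width(text)


row = "日本語"
print(len(row), display_width(row))  # prints "3 6"
```

A guard like this could flag the rows where a text snapshot is the wrong instrument and the width math in the source should be read instead.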

Happy to share more granular traces (per-trial tool sequences, op counts vs golden minimal sequences) if useful — the whole bundle set is reproducible from the pinned fixture.


Generated with mux • Model: anthropic:claude-fable-5 • Thinking: high
