Context
We ran a controlled 300-trial A/B experiment (coder/aegis Phase 1) using agent-tty@0.4.0 as the unit under test: 3 arms (help-injection-stripped CLI / CLI as shipped / CLI + bootstrap skill installed in .claude/skills/) × 10 TUI-dogfooding tasks × 10 repeats, claude-haiku-4-5 via Claude Code headless, hermetic Bubble Tea fixture with seeded bugs. Full findings: docs/phase1-matrix-findings.md (S15) + raw per-trial data under docs/evidence/s15-phase1-matrix/.
Three findings are directly actionable for this repo's skill packaging.
1. The --help "MANDATORY FOR CODING AGENTS" injection measured exactly zero
Across 100 trials with the CLI on PATH but no skill installed, no agent ever invoked agent-tty, so the help injection never fired (activation Δ vs the injection-stripped control = 0.000, Wilson CI [0, 0.037] both arms). The injection can only work when something else first causes the agent to run the binary — it's a conversion surface with no acquisition surface. If discovery-without-install matters, the binary needs a vector that appears in what agents actually read unprompted (README/AGENTS.md conventions, or shell-init hints); the help text alone is dead weight.
2. The bootstrap skill works essentially perfectly — keep it exactly as designed
With the bootstrap skill installed, activation (read skill → agent-tty skills get) was 70/100 overall and a perfect discriminator by task type: 60/60 on tasks where a PTY genuinely helps, 10/10 ambiguous, 0/30 on tasks where it doesn't (unit tests, static review, non-interactive scripts) — zero PTY sessions opened on those, i.e. the read-then-skip behavior the skill teaches demonstrably works. CLI-discovery waste with the skill: mean 0.63 help/trial-and-error probes (vs the --help spelunking pattern it's designed to prevent).
3. Where the skill needs work: it over-sells runtime evidence
Task success was redistributed, not raised (74% with skill vs 76/77% without, at +46% cost/trial and +50% p50 latency). Per-task (out of 10 repeats):
| Task |
no-skill arms |
skill arm |
Reading |
| Reproduce a crash + capture evidence |
0–1/10 |
9/10 |
PTY constitutive of the task → skill is decisive |
| Review unicode/emoji rendering |
10/10 |
5/10 |
Agents trusted rendered snapshots where text-serialization of wide glyphs is exactly the unreliable surface; no-skill agents read the width math in source and nailed it |
| Verify quit-paths leave the terminal clean |
6–9/10 |
4/10 |
Same pattern: session evidence misled where code reading sufficed |
The failure mode is specific: the dogfood-tui skill teaches how to gather terminal evidence but not when terminal evidence is the wrong instrument. Suggested additions to dogfood-tui (and/or the core skill):
- A short "choose your instrument" section: prefer reading the source for layout math, width/alignment, and style-mapping questions; prefer a live session for crashes, input-routing, and screen-lifecycle behavior; when a snapshot and the source disagree, trust the source for wide-glyph/column-width questions (text snapshots serialize double-width glyphs as single chars — we hit this in fixture grading too).
- An explicit caveat on
snapshot output fidelity for CJK/emoji content.
- Possibly a cost note: a session is worth opening when the bug class needs one, not as the default first move for every visual question.
Happy to share more granular traces (per-trial tool sequences, op counts vs golden minimal sequences) if useful — the whole bundle set is reproducible from the pinned fixture.
Generated with mux • Model: anthropic:claude-fable-5 • Thinking: high
Context
We ran a controlled 300-trial A/B experiment (coder/aegis Phase 1) using agent-tty@0.4.0 as the unit under test: 3 arms (help-injection-stripped CLI / CLI as shipped / CLI + bootstrap skill installed in
.claude/skills/) × 10 TUI-dogfooding tasks × 10 repeats,claude-haiku-4-5via Claude Code headless, hermetic Bubble Tea fixture with seeded bugs. Full findings:docs/phase1-matrix-findings.md(S15) + raw per-trial data underdocs/evidence/s15-phase1-matrix/.Three findings are directly actionable for this repo's skill packaging.
1. The
--help"MANDATORY FOR CODING AGENTS" injection measured exactly zeroAcross 100 trials with the CLI on PATH but no skill installed, no agent ever invoked agent-tty, so the help injection never fired (activation Δ vs the injection-stripped control = 0.000, Wilson CI [0, 0.037] both arms). The injection can only work when something else first causes the agent to run the binary — it's a conversion surface with no acquisition surface. If discovery-without-install matters, the binary needs a vector that appears in what agents actually read unprompted (README/AGENTS.md conventions, or shell-init hints); the help text alone is dead weight.
2. The bootstrap skill works essentially perfectly — keep it exactly as designed
With the bootstrap skill installed, activation (read skill →
agent-tty skills get) was 70/100 overall and a perfect discriminator by task type: 60/60 on tasks where a PTY genuinely helps, 10/10 ambiguous, 0/30 on tasks where it doesn't (unit tests, static review, non-interactive scripts) — zero PTY sessions opened on those, i.e. the read-then-skip behavior the skill teaches demonstrably works. CLI-discovery waste with the skill: mean 0.63 help/trial-and-error probes (vs the--helpspelunking pattern it's designed to prevent).3. Where the skill needs work: it over-sells runtime evidence
Task success was redistributed, not raised (74% with skill vs 76/77% without, at +46% cost/trial and +50% p50 latency). Per-task (out of 10 repeats):
The failure mode is specific: the
dogfood-tuiskill teaches how to gather terminal evidence but not when terminal evidence is the wrong instrument. Suggested additions todogfood-tui(and/or the core skill):snapshotoutput fidelity for CJK/emoji content.Happy to share more granular traces (per-trial tool sequences, op counts vs golden minimal sequences) if useful — the whole bundle set is reproducible from the pinned fixture.
Generated with
mux• Model:anthropic:claude-fable-5• Thinking:high