Eval findings (300-trial A/B): bootstrap skill activates perfectly, --help injection does nothing, dogfood-tui should teach when NOT to use a session

## Context

We ran a controlled 300-trial A/B experiment ([coder/aegis](https://github.com/coder/aegis) Phase 1) using agent-tty@0.4.0 as the unit under test: 3 arms (help-injection-stripped CLI / CLI as shipped / CLI + bootstrap skill installed in `.claude/skills/`) × 10 TUI-dogfooding tasks × 10 repeats, `claude-haiku-4-5` via Claude Code headless, hermetic Bubble Tea fixture with seeded bugs. Full findings: [`docs/phase1-matrix-findings.md`](https://github.com/coder/aegis/blob/main/docs/phase1-matrix-findings.md) (S15) + raw per-trial data under [`docs/evidence/s15-phase1-matrix/`](https://github.com/coder/aegis/tree/main/docs/evidence/s15-phase1-matrix).

Three findings are directly actionable for this repo's skill packaging.

## 1. The `--help` "MANDATORY FOR CODING AGENTS" injection measured exactly zero

Across 100 trials with the CLI on PATH but no skill installed, **no agent ever invoked agent-tty**, so the help injection never fired (activation Δ vs the injection-stripped control = 0.000, Wilson CI [0, 0.037] both arms). The injection can only work when something *else* first causes the agent to run the binary — it's a conversion surface with no acquisition surface. If discovery-without-install matters, the binary needs a vector that appears in what agents actually read unprompted (README/AGENTS.md conventions, or shell-init hints); the help text alone is dead weight.

## 2. The bootstrap skill works essentially perfectly — keep it exactly as designed

With the bootstrap skill installed, activation (read skill → `agent-tty skills get`) was 70/100 overall and a **perfect discriminator by task type**: 60/60 on tasks where a PTY genuinely helps, 10/10 ambiguous, **0/30 on tasks where it doesn't** (unit tests, static review, non-interactive scripts) — zero PTY sessions opened on those, i.e. the read-then-skip behavior the skill teaches demonstrably works. CLI-discovery waste with the skill: mean 0.63 help/trial-and-error probes (vs the `--help` spelunking pattern it's designed to prevent).

## 3. Where the skill needs work: it over-sells runtime evidence

Task success was **redistributed, not raised** (74% with skill vs 76/77% without, at +46% cost/trial and +50% p50 latency). Per-task (out of 10 repeats):

| Task | no-skill arms | skill arm | Reading |
|---|---|---|---|
| Reproduce a crash + capture evidence | 0–1/10 | **9/10** | PTY constitutive of the task → skill is decisive |
| Review unicode/emoji rendering | **10/10** | 5/10 | Agents trusted rendered snapshots where text-serialization of wide glyphs is exactly the unreliable surface; no-skill agents read the width math in source and nailed it |
| Verify quit-paths leave the terminal clean | 6–9/10 | 4/10 | Same pattern: session evidence misled where code reading sufficed |

The failure mode is specific: **the `dogfood-tui` skill teaches how to gather terminal evidence but not when terminal evidence is the wrong instrument.** Suggested additions to `dogfood-tui` (and/or the core skill):

- A short "choose your instrument" section: *prefer reading the source for layout math, width/alignment, and style-mapping questions; prefer a live session for crashes, input-routing, and screen-lifecycle behavior; when a snapshot and the source disagree, trust the source for wide-glyph/column-width questions* (text snapshots serialize double-width glyphs as single chars — we hit this in fixture grading too).
- An explicit caveat on `snapshot` output fidelity for CJK/emoji content.
- Possibly a cost note: a session is worth opening when the bug class needs one, not as the default first move for every visual question.

Happy to share more granular traces (per-trial tool sequences, op counts vs golden minimal sequences) if useful — the whole bundle set is reproducible from the pinned fixture.

---
_Generated with [`mux`](https://github.com/coder/mux) • Model: `anthropic:claude-fable-5` • Thinking: `high`_


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Eval findings (300-trial A/B): bootstrap skill activates perfectly, --help injection does nothing, dogfood-tui should teach when NOT to use a session #149

Context

1. The `--help` "MANDATORY FOR CODING AGENTS" injection measured exactly zero

2. The bootstrap skill works essentially perfectly — keep it exactly as designed

3. Where the skill needs work: it over-sells runtime evidence

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Task	no-skill arms	skill arm	Reading
Reproduce a crash + capture evidence	0–1/10	9/10	PTY constitutive of the task → skill is decisive
Review unicode/emoji rendering	10/10	5/10	Agents trusted rendered snapshots where text-serialization of wide glyphs is exactly the unreliable surface; no-skill agents read the width math in source and nailed it
Verify quit-paths leave the terminal clean	6–9/10	4/10	Same pattern: session evidence misled where code reading sufficed

Eval findings (300-trial A/B): bootstrap skill activates perfectly, --help injection does nothing, dogfood-tui should teach when NOT to use a session #149

Description

Context

1. The --help "MANDATORY FOR CODING AGENTS" injection measured exactly zero

2. The bootstrap skill works essentially perfectly — keep it exactly as designed

3. Where the skill needs work: it over-sells runtime evidence

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. The `--help` "MANDATORY FOR CODING AGENTS" injection measured exactly zero