Source
Source repo: https://github.com/DietrichGebert/ponytail
Primary source files read:
- README.md from DietrichGebert/ponytail@main.
- AGENTS.md from DietrichGebert/ponytail@main.
- benchmarks/results/2026-06-18-agentic.md from DietrichGebert/ponytail@main.
Relevant source claims/ideas:
- Ponytail is not just a prose rule. It claims a measured agentic effect: same coding agent, real repo, baseline vs plugin arms, measuring git diff added LOC, token/cost/time, and adversarial safety checks.
- The benchmark explicitly corrected a false result where the plugin hook contaminated the baseline. That is directly relevant to MAP: prompt text and closed issues are insufficient; we need an isolated active-path eval that proves minimality is both enabled and effective.
- Ponytail's rule ladder is already conceptually integrated in MAP, but the source's strongest reusable idea is the benchmark design: prove smaller diffs without losing safety.
Relevant source takeaways
- Benchmark unit should be a real agent run against a real fixture repo, not a single prompt completion.
- Baseline and treatment must be isolated so the minimality guidance cannot leak into the baseline.
- Metrics must include LOC/tokens/cost/time where available, but safety/correctness must be measured separately and must not regress.
- Tasks should include both over-build traps (where native/stdlib/reuse should win) and surgical safety tasks (where one-liner pressure can drop guards).
- The expected outcome is not universal LOC reduction: on irreducible tasks the arms may converge. The eval should accept "no bloat to cut" instead of forcing artificial line savings.
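The takeaways above imply a per-task comparison in which correctness and safety are judged before size, and in which "no bloat to cut" is a legitimate result. A minimal sketch of that decision follows; the `ArmResult` schema, the `classify` name, the verdict labels, and the 5% tolerance are all illustrative assumptions, not part of Ponytail or MAP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmResult:
    """Outcome of one arm on one task (hypothetical schema)."""
    added_loc: int
    tests_passed: bool
    safety_passed: bool


def classify(baseline: ArmResult, treatment: ArmResult, tolerance: float = 0.05) -> str:
    """Compare two arms on one task; safety and correctness gate the size verdict."""
    # A regression is anything the baseline passed that the treatment fails.
    regressed = (baseline.tests_passed and not treatment.tests_passed) or (
        baseline.safety_passed and not treatment.safety_passed
    )
    if regressed:
        return "regression"
    # If the treatment is still failing, a size comparison is meaningless.
    if not (treatment.tests_passed and treatment.safety_passed):
        return "inconclusive"
    # Size is compared only after the gates above, with a small tolerance band
    # so irreducible tasks are not forced to show artificial savings.
    slack = tolerance * max(baseline.added_loc, 1)
    delta = treatment.added_loc - baseline.added_loc
    if delta < -slack:
        return "smaller"
    if delta > slack:
        return "larger"
    return "no_bloat_to_cut"
```

Under this sketch, a treatment arm that saves 40 lines but fails an adversarial check is reported as a regression, never as a win.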
Repo evidence
The local implementation is not prose-only:
- src/mapify_cli/config/project_config.py sets MapConfig.minimality: str = "lite" by default and documents the Phase 3 flip.
- src/mapify_cli/templates_src/map/scripts/map_step_runner.py.jinja has _load_minimality_level(..., default="lite"), so standalone generated runner behavior matches the config default.
- build_context_block() calls _minimality_doctrine_block(minimality) and injects <MAP_Minimality_Doctrine> when minimality is not off.
- _minimality_doctrine_block() contains the runtime Actor ladder and map:simplification: marker guidance.
- build_review_prompts() adds a complexity_lens prompt when minimality != "off".
- validate_blueprint_contract() rejects non-empty deferred_yagni when minimality is not full/ultra, so pruning is gated rather than silently active under lite.
- Tests cover the active plumbing: tests/test_map_step_runner.py checks default minimality == "lite", context-block doctrine injection, invalid minimality fallback, review complexity lens insertion, and deferred_yagni gating; tests/test_decomposition.py checks config defaults/valid values/YAML off; tests/test_minimality_report.py checks telemetry report decisions.
What is missing:
- No Ponytail-style end-to-end A/B benchmark proves that MAP minimality actually reduces generated diff size/tokens while preserving safety/correctness.
- Existing minimality-report telemetry compares completed local runs, but it is not a reproducible isolated benchmark and does not run baseline/treatment arms on the same task corpus.
Existing issue search
Commands/searches used:
gh issue list --state all --limit 120 --search "Ponytail OR minimality OR YAGNI OR stdlib OR native OR one-liner OR pruneable OR deferred_yagni OR reuse"
gh issue list --state all --limit 120 --search "minimality benchmark OR Ponytail benchmark OR agentic benchmark OR safety rate OR LOC tokens minimality eval"
gh issue list --state all --limit 120 --search "minimality telemetry OR minimality-report OR field telemetry OR default flip"
Related issues checked:
- /map-review what-to-delete lens.
- off -> lite after telemetry.
- deferred_yagni.
Why this is not a duplicate:
Those issues implement and gate the minimality doctrine. None adds a reproducible A/B harness equivalent to Ponytail's benchmark that isolates baseline vs minimality and measures LOC/tokens/safety on a task corpus.
Why this is not already covered
The code path is active, but activation is not impact evidence. A prompt can be injected and still produce no measurable behavioral change, or worse, reduce lines by dropping guards. Ponytail's own benchmark history shows why this matters: they found and fixed contamination where the baseline secretly ran the plugin. MAP should have an equivalent active-path proof before treating minimality claims as settled.
Problem
MAP currently has implementation evidence and local telemetry surfaces, but not a deterministic or reproducible eval that answers: "Does MAP minimality actually make the agent produce smaller sufficient diffs without losing safety?" Without that, future changes to prompts, hooks, or config can leave the feature apparently enabled but behaviorally inert.
Proposed slice
Add a minimality-eval / benchmark harness that runs isolated baseline and treatment arms on a small MAP-style task corpus.
Suggested first slice:
- Use fixture repos/tasks that do not require external services.
- Arms: minimality: off vs minimality: lite at minimum; optionally full as an opt-in treatment.
- Metrics: added LOC from git diff, token usage if transcript/meter data is available, duration as advisory, pass/fail of task-specific tests, and safety checks for trust-boundary tasks.
- Corpus split:
- Over-build traps: native/stdlib/reuse should avoid extra code.
- Irreducible tasks: expected near-zero LOC delta, to prevent benchmark gaming.
- Safety tasks: smaller code must still pass adversarial checks.
- Ensure each arm gets a fresh workspace and no shared session/plugin contamination.
- Produce a persisted report under .map/eval-runs/minimality/ or a similar existing eval-artifact namespace.
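Two of the bullets above, a fresh workspace per arm and added LOC from git diff, can be sketched with the standard library and the git CLI. The function names are hypothetical, and the sketch assumes each fixture is a git repository with at least one commit; it is one possible shape, not the harness design.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def fresh_workspace(fixture: Path, arm: str) -> Path:
    """Copy the fixture repo into a new temp dir so arms share no files."""
    root = Path(tempfile.mkdtemp(prefix=f"minimality-{arm}-"))
    workspace = root / "repo"
    shutil.copytree(fixture, workspace)
    return workspace


def parse_added_loc(numstat: str) -> int:
    """Sum the 'added' column of `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        # Binary files are reported as "-<TAB>-<TAB>path"; skip them.
        if len(parts) >= 3 and parts[0].isdigit():
            total += int(parts[0])
    return total


def added_loc(workspace: Path) -> int:
    """Added lines relative to HEAD, including files the agent created."""
    # Stage everything first so untracked files appear in the diff.
    subprocess.run(["git", "add", "-A"], cwd=workspace, check=True)
    out = subprocess.run(
        ["git", "diff", "--cached", "--numstat", "HEAD"],
        cwd=workspace,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_added_loc(out)
```

Staging before diffing matters because a plain `git diff` ignores untracked files, which would under-count an arm that over-builds by adding new modules.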
Acceptance criteria
- A maintainer can run one command locally to compare minimality: off vs lite on a fixture corpus.
- The report separates code-size/cost metrics from correctness/safety metrics.
- The eval fails or warns when lite reduces LOC by dropping required safety/correctness behavior.
- The eval detects or prevents baseline contamination by explicitly asserting the generated context/config for each arm.
- Docs explain that this benchmark validates behavioral effect; it does not replace normal workflow gates.
- Implementation reuses existing eval/report patterns where practical and keeps generated templates single-source.
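The contamination criterion above could be checked by asserting on the context each arm actually received. The sketch below relies on the `<MAP_Minimality_Doctrine>` marker that build_context_block() is described as injecting when minimality is not off; `assert_arm_isolated` and `ContaminationError` are hypothetical names, and how the harness captures the generated context string is left open.

```python
DOCTRINE_MARKER = "<MAP_Minimality_Doctrine>"


class ContaminationError(AssertionError):
    """Raised when an arm's context does not match its configured minimality."""


def assert_arm_isolated(arm: str, context: str) -> None:
    """Check that the doctrine block is present exactly when the arm is not 'off'."""
    has_doctrine = DOCTRINE_MARKER in context
    should_have = arm != "off"
    if has_doctrine != should_have:
        found = "present" if has_doctrine else "absent"
        wanted = "present" if should_have else "absent"
        raise ContaminationError(
            f"arm {arm!r}: doctrine block is {found} but should be {wanted}"
        )
```

Running this before the agent starts, rather than after, lets the eval abort a contaminated arm instead of reporting a misleading delta.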
Guardrails
- Do not claim Ponytail-like percentages for MAP until MAP has its own benchmark data.
- Do not optimize for LOC alone; required behavior, security, accessibility, and data integrity are non-negotiable.
- Do not use shadow mode for rollout. Run explicit isolated eval arms.
- Do not require live production deploy or external services.
- Do not let minimality: lite silently prune explicit requirements; pruning remains full/ultra plus visible approval.