feat(narrative): bundle phases 2.5 + 3.1 + 3.2 + 4 (ADRs 011-014) by ditvor · Pull Request #55 · ditvor/trailstory

ditvor · 2026-05-25T17:56:49Z

Summary

Four phases of the narrative-faithfulness initiative bundled into one PR because each builds on Phase 4's sentence-level provenance schema. Phase 5 (family memory) is deliberately NOT included — it touches the privacy-as-wedge moat and needs a separate strategy conversation per the original brief.

ADR-011 (Phase 2.5) — Verifier loop. After writer pass, if the INFERRED-sentence ratio exceeds Settings.max_inferred_ratio (default 0.5), regenerate once. "Improvement only" admission policy. Free signal (no extra LLM call to detect).
ADR-012 (Phase 3.1) — On-disk vision cache keyed by (photo bytes SHA-256, vision_model). Mirrors the existing narrative cache.
ADR-013 (Phase 3.2) — ThreadPoolExecutor parallelises vision calls (vision_concurrency=4 default). 6-photo hike drops from ~6s serial to ~1.5s.
ADR-014 (Phase 4) — NarrativeOutput.paragraphs becomes list[Paragraph] = list[list[Sentence]]. Each sentence has tri-lingual text + one Provenance (source: SEED / PHOTO / GPX / INFERRED). Editorial template wraps each sentence in <span class="sent" data-prov="..."> with hover tooltip + subtle CSS tint on INFERRED. Log + Encyclopedia use the new paragraphs_as_localized() flat fallback until Phase 4.1. schema_version=3.
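A minimal sketch of the ADR-014 shape, for readers who have not opened trailstory/models.py. The names ProvenanceSource, Provenance, Sentence, Paragraph and paragraphs_as_localized() come from this PR; the field names, the use of dataclasses, the language codes and the flattening logic are assumptions for illustration, not the actual implementation (in the PR the flat fallback is a method on NarrativeOutput, here it is a free function).

```python
from dataclasses import dataclass
from enum import Enum


class ProvenanceSource(str, Enum):
    SEED = "SEED"
    PHOTO = "PHOTO"
    GPX = "GPX"
    INFERRED = "INFERRED"


@dataclass(frozen=True)
class Provenance:
    source: ProvenanceSource
    ref: str = ""  # assumed: free-form pointer to the evidence, e.g. a photo id


@dataclass(frozen=True)
class Sentence:
    text: dict[str, str]  # assumed shape: language code -> sentence text
    provenance: Provenance


Paragraph = list[Sentence]


def paragraphs_as_localized(paragraphs: list[Paragraph]) -> dict[str, list[str]]:
    """Flatten sentence-level paragraphs into one plain string per paragraph, per language."""
    languages: list[str] = []
    for paragraph in paragraphs:
        for sentence in paragraph:
            for lang in sentence.text:
                if lang not in languages:
                    languages.append(lang)  # keep first-seen order
    return {
        lang: [" ".join(s.text[lang] for s in p if lang in s.text) for p in paragraphs]
        for lang in languages
    }


if __name__ == "__main__":
    # "en" appears in the PR; "de" and "fr" are placeholder codes.
    first = Sentence(
        {"en": "We left at dawn.", "de": "Wir gingen los.", "fr": "Nous sommes partis."},
        Provenance(ProvenanceSource.GPX, "track-start"),
    )
    second = Sentence(
        {"en": "It felt endless.", "de": "Es war lang.", "fr": "C'etait long."},
        Provenance(ProvenanceSource.INFERRED),
    )
    print(paragraphs_as_localized([[first, second]])["en"])
```

Templates that do not yet render per-sentence spans can consume the flattened dict unchanged, which is presumably why Log and Encyclopedia can defer their port to Phase 4.1.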

The full trajectory across all 5 phase-bundles

case	P0	P1	P2	P3	bundle	Δ vs P3
01-fixture-baseline	0.48	1.09	1.18	1.56	1.61	+0.05
02-joyful-summit	1.39	2.36	2.50	2.79	3.68 ⭐	+0.89
03-exhausted-foggy	2.08	1.74	2.63	3.12	2.50	-0.62
04-bad-tolz-family	—	—	3.10	3.95	3.41	-0.54
avg	1.32	1.73	2.10	2.86	2.80	-0.06

The honest read

This is the most mixed eval result of the whole initiative. The average moved -0.06, well below the per-phase gains of +0.41 / +0.37 / +0.76 that Phases 1 / 2 / 3 delivered. Why ship anyway:

Case 02 is the architecture win the whole project was built for. Zero unsupported claims (down from 7 in Phase 2). Faithfulness 3.68 / 5. The verifier regenerated the draft because the writer's first attempt was too inferred; the regen was structurally cleaner. The system worked exactly as designed.
But case 02 also took -1.00 on warmth and -1.00 on narrative_arc — the regen produced more constrained but flatter prose. That's the trade-off ADR-011 anticipated ("improvement only" admission policy guards faithfulness, not warmth). The regression triggered the gate on this case.
Cases 03 + 04 dropped on faithfulness too (-0.62 and -0.54). Likely a mix of the added cognitive load that sentence-level tagging puts on the writer + judge sampling noise; both drops are within the 1.00 regression threshold for those axes.
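To make the gate outcome above concrete, here is a small sketch of a per-axis regression check. The deltas and the 1.00 threshold are taken from this PR; the function name, the dict layout and the inclusive comparison (a drop of exactly 1.00 triggers) are assumptions chosen to match the reported behaviour, not the actual code in tests/eval.

```python
REGRESSION_THRESHOLD = 1.00  # per-axis drop that trips the gate (value from the PR text)


def regressed_axes(deltas: dict[str, float], threshold: float = REGRESSION_THRESHOLD) -> list[str]:
    """Return the axes whose score dropped by at least `threshold`, sorted by name."""
    # Assumption: the comparison is inclusive, so a drop of exactly 1.00 counts.
    return sorted(axis for axis, delta in deltas.items() if -delta >= threshold)


# Deltas vs Phase 3 as reported in this PR.
case_02 = {"faithfulness": 0.89, "warmth": -1.00, "narrative_arc": -1.00}
case_03 = {"faithfulness": -0.62}
case_04 = {"faithfulness": -0.54}

if __name__ == "__main__":
    for name, deltas in [("02", case_02), ("03", case_03), ("04", case_04)]:
        axes = regressed_axes(deltas)
        print(name, "gate triggered on" if axes else "within threshold", axes)
```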

The architecture is in place; tuning is Phase 4.1. The user (you) can decide:

Raise max_inferred_ratio to 0.7 to make the verifier fire less often, or to 1.0 to disable it by default (the ratio can never exceed 1.0, so the regen never triggers)
Tune the writer prompt to be specific WITH provenance tags (the regression suggests the model is trading specificity for tag-correctness)
Accept the trade-off as worth it for the structural fabrication guard
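For context on what the max_inferred_ratio knob controls, the verifier loop can be sketched roughly as follows. This is an illustration under assumptions: a draft is reduced to the list of per-sentence provenance sources, write is a hypothetical callable standing in for the writer pass, and the feedback wording is invented. Only the threshold check, the single regen and the "improvement only" admission rule come from the PR.

```python
from typing import Callable, Optional, Sequence

INFERRED = "INFERRED"


def inferred_ratio(sources: Sequence[str]) -> float:
    """Share of sentences whose self-reported provenance is INFERRED."""
    if not sources:
        return 0.0
    return sum(1 for s in sources if s == INFERRED) / len(sources)


def write_with_verifier(
    write: Callable[[Optional[str]], list[str]],  # hypothetical writer: feedback -> draft
    max_inferred_ratio: float = 0.5,
) -> list[str]:
    draft = write(None)
    ratio = inferred_ratio(draft)
    if ratio <= max_inferred_ratio:
        return draft  # no extra call: the signal comes from the draft itself

    # Regenerate exactly once, telling the writer what was wrong.
    feedback = f"{ratio:.0%} of sentences were INFERRED; ground more of them in SEED, PHOTO or GPX."
    regen = write(feedback)

    # "Improvement only": keep the regen only if the ratio actually dropped.
    return regen if inferred_ratio(regen) < ratio else draft
```

With max_inferred_ratio=1.0 the first branch always returns, so the writer is called once and the verifier is effectively off. Note that nothing in this loop looks at warmth or narrative arc, which is consistent with the case 02 trade-off described above.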

What's intentionally NOT in this PR

Phase 5 (family memory) — the original brief explicitly deferred this: "DO NOT START. Touches the privacy-as-wedge moat and needs a separate strategy conversation." That conversation hasn't happened. Architectural choices (storage, encryption, what "remember" means privacy-wise) are decisions only the project owner can make.
Phase 4.1 (builder edit mode UI) — the user-facing "click an INFERRED sentence, edit / approve / delete" workflow. Separate UI scope, deserves its own PR.
Log + Encyclopedia provenance rendering — they use the flat fallback. Phase 4.1 ports them to per-sentence rendering in their respective style idioms.
Verifier regen against warmth signal — ADR-011 Option B (paid judge in production). Adds cost; defer until eval data shows the verifier is a sustained net win.

What's in this PR

43 files changed (2,403 insertions, 467 deletions):

trailstory/models.py — ProvenanceSource enum, Provenance, Sentence, Paragraph type alias; NarrativeOutput.paragraphs shape change; schema_version=3; new paragraphs_as_localized() method
trailstory/llm/prompts.py — writer prompt rewrites for new shape + provenance rubric
trailstory/llm/narrative.py — verifier loop, _inferred_ratio(), _verifier_feedback()
trailstory/llm/vision_cache.py (NEW) — per-photo cache mirroring the narrative cache
trailstory/photos.py — describe_photos adds concurrency + use_cache kwargs; ThreadPoolExecutor parallel path; _describe_one_with_cache helper
trailstory/config.py — max_inferred_ratio, vision_concurrency
trailstory/cli.py — passes new settings through
tests/eval/run.py — pins concurrency=4, use_cache=False, max_inferred_ratio=0.5
tests/eval/rubric.py — every paragraph-aware check uses paragraphs_as_localized()
templates/styles/editorial.html.j2 — sentence spans + provenance CSS
templates/styles/log.html.j2, templates/styles/encyclopedia.html.j2 — flat fallback via flat_paragraphs
tests/conftest.py — paragraphs_from_strings, paragraphs_dict_from_strings helpers
All test fixtures across tests/test_*.py updated for the new shape
4 refreshed narrative goldens + 4 refreshed judge goldens
4 new ADRs
CHANGELOG.md + CLAUDE.md decision register — entries 11-14
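As an illustration of the ADR-012 cache key, here is a hedged sketch of a per-photo cache keyed by (photo bytes SHA-256, vision_model). The key components come from this PR; the function names, the JSON-file-per-entry layout and the way the two components are combined into a filename are assumptions, and the real trailstory/llm/vision_cache.py may differ.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def vision_cache_key(photo_bytes: bytes, vision_model: str) -> str:
    """Derive a filename-safe key from the photo content and the model name."""
    photo_digest = hashlib.sha256(photo_bytes).hexdigest()
    # Assumption: hash the pair again so model names with odd characters are safe on disk.
    return hashlib.sha256(f"{photo_digest}:{vision_model}".encode("utf-8")).hexdigest()


def load_description(cache_dir: Path, key: str) -> Optional[dict]:
    """Return the cached description, or None on a cache miss."""
    path = cache_dir / f"{key}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def store_description(cache_dir: Path, key: str, description: dict) -> None:
    """Persist one description; creates the cache directory on first use."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / f"{key}.json").write_text(json.dumps(description), encoding="utf-8")
```

Keying on the bytes rather than the file path means a renamed or moved photo still hits the cache, while switching vision_model produces a different key and forces a fresh call.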

Test plan

make ci — 361 passing (was 362, -1 because the impossible-EN-too-few rubric test collapsed into one test now that paragraphs is shared across languages), 92% coverage, ruff/mypy clean
make eval-update-golden — paid run with all 4 cases, vision enabled, verifier enabled; regression gate triggered on case 02 (documented above)
_inferred_ratio semantics exercised implicitly through the new test rubric helpers
HTML goldens regenerated via make golden-update for all 3 styles
Manual smoke: a 6-photo hike on the CLI shows "Describing photos…" finishing in ~1.5s (parallel); the narrative regenerates if the first draft is too inferred
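The parallel path exercised by the manual smoke test can be sketched like this. It is a simplified stand-in, not the real describe_photos: describe_one is a hypothetical callable replacing the vision call, and caching is left out. What it does mirror from the PR is the ThreadPoolExecutor with a configurable worker count (default 4), order preservation via map, and a fast path that skips the pool for a single photo.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def describe_all(
    photos: Sequence[bytes],
    describe_one: Callable[[bytes], str],  # hypothetical: one blocking vision call
    concurrency: int = 4,
) -> list[str]:
    """Describe photos in parallel, returning results in input order."""
    # Fast path: nothing to parallelise, so avoid spinning up a pool.
    if len(photos) <= 1 or concurrency <= 1:
        return [describe_one(photo) for photo in photos]

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(describe_one, photos))
```

Threads suit this workload because each call spends most of its time waiting on HTTP, during which the GIL is released. An exception raised by any describe_one call propagates when its result is consumed from map, so callers that want partial results would need their own error handling.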

🤖 Generated with Claude Code

Four phases shipped together because each builds on Phase 4's sentence-level provenance schema:

PHASE 4 (ADR-014): sentence-level provenance + HTML hover
NarrativeOutput.paragraphs becomes list[Paragraph] where Paragraph = list[Sentence]. Each sentence carries tri-lingual text + one Provenance (source: SEED / PHOTO / GPX / INFERRED + ref). Writer prompt rewritten to produce sentence-aligned tri-lingual output with per-sentence provenance tags. Editorial template wraps each sentence in <span class="sent" data-prov="..."> with hover tooltip + subtle CSS tint on INFERRED. Log + Encyclopedia templates use the new paragraphs_as_localized() flat fallback; Phase 4.1 follow-up will port them. schema_version bumps to 3, invalidating all prior cache entries and goldens.

PHASE 2.5 (ADR-011): verifier loop using self-reported provenance
After writer pass returns, count the INFERRED-sentence ratio. If >Settings.max_inferred_ratio (default 0.5), regenerate once with feedback. "Improvement only" admission policy: keep regen only if ratio actually dropped. Free signal (no extra LLM call to detect), only the conditional regen call. Streaming bypassed.

PHASE 3.1 (ADR-012): per-photo vision description cache
On-disk cache for PhotoDescription keyed by (photo bytes SHA-256, vision_model). Mirrors the narrative cache pattern; lives in ~/.cache/trailstory/vision/. CLI honours the existing --no-cache flag; eval pins use_cache=False so goldens reflect live calls.

PHASE 3.2 (ADR-013): parallel vision describer calls
describe_photos uses ThreadPoolExecutor with Settings.vision_concurrency workers (default 4). 6-photo hike drops from ~6s serial to ~1.5s parallel. Order preserved via .map; single-photo fast path skips the pool. AnthropicClient SDK is thread-safe; the GIL releases on HTTP wait.

Eval results (refreshed goldens, paid run):

case	P0	P1	P2	P3	bundle	Δ vs P3
01-fixture-baseline	0.48	1.09	1.18	1.56	1.61	+0.05
02-joyful-summit	1.39	2.36	2.50	2.79	3.68	+0.89 ⭐
03-exhausted-foggy	2.08	1.74	2.63	3.12	2.50	-0.62
04-bad-tolz-family	n/a	n/a	3.10	3.95	3.41	-0.54
average	1.32	1.73	2.10	2.86	2.80	-0.06

Case 02 is the architecture win: verifier regenerated, hit ZERO unsupported claims (down from 7 in P2), faithfulness jumped 0.89. But warmth + narrative_arc each dropped 1.00 on this case; the regen produced more constrained but flatter prose. That's exactly the trade-off ADR-011 anticipated; the "improvement only" admission policy guards faithfulness, not prose quality. Phase 4.1 should either tighten admission (require ratio drop AND no warmth signal drop, which needs a different signal at runtime) or relax max_inferred_ratio default.

Cases 03 + 04 regressed on faithfulness too, likely a mix of sentence-leveling cognitive load on the writer + judge sampling noise. All within the 1.00 regression threshold for non-case-02 checks.

The architecture is in place; tuning is Phase 4.1.

Phase 5 (family memory) deliberately NOT included: it touches the privacy-as-wedge moat and needs a separate strategy conversation per the original brief. Worth a dedicated discussion before any code lands.

Tests: 361 passing (was 362; -1 because the impossible-EN-too-few test collapsed into a single tri-lingual paragraph-count test now that paragraphs is one list). 92% coverage. ruff/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ditvor merged commit 96ed8c3 into develop May 25, 2026
5 checks passed

ditvor deleted the feat/phases-2-5-3-1-3-2-4-bundle branch May 27, 2026 06:55
