feat(narrative): bundle phases 2.5 + 3.1 + 3.2 + 4 (ADRs 011-014)#55

Merged
ditvor merged 1 commit into develop from feat/phases-2-5-3-1-3-2-4-bundle
May 25, 2026


ditvor (Owner) commented May 25, 2026

Summary

Four phases of the narrative-faithfulness initiative bundled into one PR because each builds on Phase 4's sentence-level provenance schema. Phase 5 (family memory) is deliberately NOT included — it touches the privacy-as-wedge moat and needs a separate strategy conversation per the original brief.

  • ADR-011 (Phase 2.5) — Verifier loop. After the writer pass, if the INFERRED-sentence ratio exceeds Settings.max_inferred_ratio (default 0.5), regenerate once. "Improvement only" admission policy: the regenerated draft is kept only if its ratio actually dropped. The signal is free (detection needs no extra LLM call; only the conditional regen costs one).
  • ADR-012 (Phase 3.1) — On-disk vision cache keyed by (photo bytes SHA-256, vision_model). Mirrors the existing narrative cache.
  • ADR-013 (Phase 3.2) — ThreadPoolExecutor parallelises vision calls (vision_concurrency=4 default). A 6-photo hike drops from ~6s serial to ~1.5s.
  • ADR-014 (Phase 4) — NarrativeOutput.paragraphs becomes list[Paragraph], where Paragraph = list[Sentence]. Each sentence has tri-lingual text + one Provenance (source: SEED / PHOTO / GPX / INFERRED). The Editorial template wraps each sentence in <span class="sent" data-prov="..."> with a hover tooltip + subtle CSS tint on INFERRED. Log + Encyclopedia use the new paragraphs_as_localized() flat fallback until Phase 4.1. schema_version=3.

The full trajectory across all 5 phase-bundles

Faithfulness score per case (out of 5):

case                  P0    P1    P2    P3    bundle  Δ vs P3
01-fixture-baseline   0.48  1.09  1.18  1.56  1.61    +0.05
02-joyful-summit      1.39  2.36  2.50  2.79  3.68    +0.89
03-exhausted-foggy    2.08  1.74  2.63  3.12  2.50    -0.62
04-bad-tolz-family    n/a   n/a   3.10  3.95  3.41    -0.54
avg                   1.32  1.73  2.10  2.86  2.80    -0.06
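
The Δ column and the last two averages can be re-derived from the P3 and bundle scores. A minimal sketch using the numbers in the table; the half-up rounding mode is an assumption, not something the eval harness is known to use:

```python
from decimal import ROUND_HALF_UP, Decimal

# (P3, bundle) scores copied from the table above.
scores = {
    "01-fixture-baseline": ("1.56", "1.61"),
    "02-joyful-summit": ("2.79", "3.68"),
    "03-exhausted-foggy": ("3.12", "2.50"),
    "04-bad-tolz-family": ("3.95", "3.41"),
}


def mean_2dp(values):
    """Mean rounded to two decimals (half-up is an assumed rounding mode)."""
    total = sum(values, Decimal("0"))
    return (total / len(values)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Per-case delta: bundle score minus P3 score.
deltas = {case: Decimal(b) - Decimal(p3) for case, (p3, b) in scores.items()}
p3_avg = mean_2dp([Decimal(p3) for p3, _ in scores.values()])
bundle_avg = mean_2dp([Decimal(b) for _, b in scores.values()])
avg_delta = bundle_avg - p3_avg
```

Decimal is used instead of float so that two-decimal scores subtract and round exactly.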

The honest read

This is the most mixed eval result of the whole initiative: the average moved -0.06 versus Phase 3, well below the per-phase gains of +0.41 / +0.37 / +0.76 that Phases 1 / 2 / 3 delivered. Why ship anyway:

  1. Case 02 is the architecture win the whole project was built for. Zero unsupported claims (down from 7 in Phase 2). Faithfulness 3.68 / 5. The verifier regenerated the draft because the writer's first attempt exceeded max_inferred_ratio; the regen was structurally cleaner. The system worked exactly as designed.
  2. But case 02 also took -1.00 on warmth and -1.00 on narrative_arc — the regen produced more constrained but flatter prose. That's the trade-off ADR-011 anticipated ("improvement only" admission policy guards faithfulness, not warmth). The regression triggered the gate on this case.
  3. Cases 03 + 04 regressed as well, this time on faithfulness (-0.62 and -0.54). Likely a mix of the extra sentence-level tagging load on the writer + judge sampling noise; both drops stay within the 1.00 regression threshold, so the gate did not fire for them.
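
The gate behaviour described in the three points above can be sketched as a per-axis check. This is an illustration only: `gate_failures` is a hypothetical name, and the inclusive comparison is an assumption inferred from case 02 triggering the gate at exactly -1.00.

```python
REGRESSION_THRESHOLD = 1.00  # threshold quoted in the text above


def gate_failures(deltas: dict[str, float], threshold: float = REGRESSION_THRESHOLD) -> list[str]:
    """Return the axes whose score dropped by at least `threshold` (assumed inclusive)."""
    return sorted(axis for axis, delta in deltas.items() if delta <= -threshold)


# Deltas reported in this PR; axes not mentioned in the text are omitted.
case_02 = {"faithfulness": 0.89, "warmth": -1.00, "narrative_arc": -1.00}
case_03 = {"faithfulness": -0.62}
case_04 = {"faithfulness": -0.54}
```

Under these assumptions case 02 fails on warmth and narrative_arc, while cases 03 and 04 pass.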

The architecture is in place; tuning is Phase 4.1. Options for the project owner to decide between:

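  • Raise max_inferred_ratio to 0.7 so the verifier fires less often, or to 1.0 to disable it by default (the check is "ratio exceeds the setting", and a ratio can never exceed 1.0)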
  • Tune the writer prompt to be specific WITH provenance tags (the regression suggests the model is trading specificity for tag-correctness)
  • Accept the trade-off as worth it for the structural fabrication guard

What's intentionally NOT in this PR

  • Phase 5 (family memory) — the original brief explicitly deferred this: "DO NOT START. Touches the privacy-as-wedge moat and needs a separate strategy conversation." That conversation hasn't happened. Architectural choices (storage, encryption, what "remember" means privacy-wise) are decisions only the project owner can make.
  • Phase 4.1 (builder edit mode UI) — the user-facing "click an INFERRED sentence, edit / approve / delete" workflow. Separate UI scope, deserves its own PR.
  • Log + Encyclopedia provenance rendering — they use the flat fallback. Phase 4.1 ports them to per-sentence rendering in their respective style idioms.
  • Verifier regen against warmth signal — ADR-011 Option B (paid judge in production). Adds cost; defer until eval data shows the verifier is a sustained net win.

What's in this PR

43 files changed (2,403 insertions, 467 deletions):

  • trailstory/models.py — ProvenanceSource enum, Provenance, Sentence, Paragraph type alias; NarrativeOutput.paragraphs shape change; schema_version=3; new paragraphs_as_localized() method
  • trailstory/llm/prompts.py — writer prompt rewrites for new shape + provenance rubric
  • trailstory/llm/narrative.py — verifier loop, _inferred_ratio(), _verifier_feedback()
  • trailstory/llm/vision_cache.py (NEW) — per-photo cache mirroring the narrative cache
  • trailstory/photos.py — describe_photos adds concurrency + use_cache kwargs; ThreadPoolExecutor parallel path; _describe_one_with_cache helper
  • trailstory/config.py — max_inferred_ratio, vision_concurrency
  • trailstory/cli.py — passes new settings through
  • tests/eval/run.py — pins concurrency=4, use_cache=False, max_inferred_ratio=0.5
  • tests/eval/rubric.py — every paragraph-aware check uses paragraphs_as_localized()
  • templates/styles/editorial.html.j2 — sentence spans + provenance CSS
  • templates/styles/log.html.j2, templates/styles/encyclopedia.html.j2 — flat fallback via flat_paragraphs
  • tests/conftest.py — paragraphs_from_strings, paragraphs_dict_from_strings helpers
  • All test fixtures across tests/test_*.py updated for the new shape
  • 4 refreshed narrative goldens + 4 refreshed judge goldens
  • 4 new ADRs
  • CHANGELOG.md + CLAUDE.md decision register — entries 11-14

Test plan

  • make ci — 361 passing (was 362, -1 because the impossible-EN-too-few rubric test collapsed into one test now that paragraphs is shared across languages), 92% coverage, ruff/mypy clean
  • make eval-update-golden — paid run with all 4 cases, vision enabled, verifier enabled; regression gate triggered on case 02 (documented above)
  • New unit tests cover _inferred_ratio semantics indirectly, via the new test rubric helpers
  • HTML goldens regenerated via make golden-update for all 3 styles
  • Manual smoke: a 6-photo hike on the CLI shows "Describing photos…" finishing in ~1.5s (parallel), and the narrative regenerates when the first draft exceeds max_inferred_ratio

🤖 Generated with Claude Code

Four phases shipped together because each builds on Phase 4's
sentence-level provenance schema:

PHASE 4 (ADR-014) — sentence-level provenance + HTML hover

  NarrativeOutput.paragraphs becomes list[Paragraph] where
  Paragraph = list[Sentence]. Each sentence carries tri-lingual text
  + one Provenance (source: SEED / PHOTO / GPX / INFERRED + ref).
  Writer prompt rewritten to produce sentence-aligned tri-lingual
  output with per-sentence provenance tags. Editorial template wraps
  each sentence in <span class="sent" data-prov="..."> with hover
  tooltip + subtle CSS tint on INFERRED. Log + Encyclopedia
  templates use the new paragraphs_as_localized() flat fallback;
  Phase 4.1 follow-up will port them. schema_version bumps to 3,
  invalidating all prior cache entries and goldens.
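
A minimal sketch of the shape described for Phase 4, using stdlib dataclasses. The field names, the language-code keys, and the free-function form of paragraphs_as_localized are assumptions for illustration; the real models in trailstory/models.py may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProvenanceSource(str, Enum):
    SEED = "SEED"
    PHOTO = "PHOTO"
    GPX = "GPX"
    INFERRED = "INFERRED"


@dataclass(frozen=True)
class Provenance:
    source: ProvenanceSource
    ref: Optional[str] = None  # assumed: free-form pointer back to the source item


@dataclass(frozen=True)
class Sentence:
    text: dict[str, str]  # assumed shape: language code -> sentence text
    provenance: Provenance


Paragraph = list[Sentence]


def paragraphs_as_localized(paragraphs: list[Paragraph]) -> dict[str, list[str]]:
    """Flatten sentences into one plain string per paragraph, per language."""
    languages = sorted({lang for p in paragraphs for s in p for lang in s.text})
    return {
        lang: [" ".join(s.text[lang] for s in p if lang in s.text) for p in paragraphs]
        for lang in languages
    }
```

Templates that do not render per-sentence spans can consume the flattened dict and ignore provenance entirely, which is the role the flat fallback plays for Log + Encyclopedia.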

PHASE 2.5 (ADR-011) — verifier loop using self-reported provenance

  After writer pass returns, count the INFERRED-sentence ratio. If
  >Settings.max_inferred_ratio (default 0.5), regenerate once with
  feedback. "Improvement only" admission policy: keep regen only if
  ratio actually dropped. Free signal (no extra LLM call to detect),
  only the conditional regen call. Streaming bypassed.
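
The loop above can be sketched as follows. This is an illustration under stated assumptions, not the code in trailstory/llm/narrative.py: a draft is reduced to its list of provenance tags, `write` is a hypothetical callable standing in for the writer pass, and the feedback string is invented.

```python
from typing import Callable, Optional, Sequence


def inferred_ratio(tags: Sequence[str]) -> float:
    """Share of sentences tagged INFERRED; 0.0 for an empty draft."""
    if not tags:
        return 0.0
    return sum(1 for tag in tags if tag == "INFERRED") / len(tags)


def write_with_verifier(
    write: Callable[[Optional[str]], Sequence[str]],
    max_inferred_ratio: float = 0.5,
) -> Sequence[str]:
    """Run the writer once, regenerate at most once, keep the regen only if it improved."""
    first = write(None)
    first_ratio = inferred_ratio(first)
    if first_ratio <= max_inferred_ratio:
        return first  # under the threshold: no second call
    feedback = f"{first_ratio:.0%} of sentences were INFERRED; ground more of them."
    second = write(feedback)
    # "Improvement only": admit the regen only when the ratio actually dropped.
    return second if inferred_ratio(second) < first_ratio else first
```

Because the ratio is read from tags the writer already emits, detection costs nothing; the only extra LLM call is the conditional regen.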

PHASE 3.1 (ADR-012) — per-photo vision description cache

  On-disk cache for PhotoDescription keyed by (photo bytes SHA-256,
  vision_model). Mirrors the narrative cache pattern; lives in
  ~/.cache/trailstory/vision/. CLI honours the existing --no-cache
  flag; eval pins use_cache=False so goldens reflect live calls.
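
A rough sketch of a cache keyed this way. The helper names, the one-JSON-file-per-key layout, and the dict payload are assumptions; only the (photo bytes SHA-256, vision_model) key comes from the description above.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def vision_cache_key(photo_bytes: bytes, vision_model: str) -> str:
    """Combine the photo's SHA-256 with the model name into one filename-safe key."""
    photo_digest = hashlib.sha256(photo_bytes).hexdigest()
    return hashlib.sha256(f"{photo_digest}:{vision_model}".encode("utf-8")).hexdigest()


def load_description(cache_dir: Path, key: str) -> Optional[dict]:
    """Return the cached description, or None on a miss."""
    path = cache_dir / f"{key}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def store_description(cache_dir: Path, key: str, description: dict) -> None:
    """Write the description to disk, creating the cache directory if needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / f"{key}.json").write_text(json.dumps(description), encoding="utf-8")
```

Keying on the bytes rather than the filename means a renamed photo still hits the cache, while switching vision_model always misses.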

PHASE 3.2 (ADR-013) — parallel vision describer calls

  describe_photos uses ThreadPoolExecutor with
  Settings.vision_concurrency workers (default 4). 6-photo hike
  drops from ~6s serial to ~1.5s parallel. Order preserved via
  .map; single-photo fast path skips the pool. AnthropicClient SDK
  is thread-safe; the GIL releases on HTTP wait.
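
The parallel path can be sketched with the standard library alone. `describe_all` and `describe_one` are hypothetical stand-ins for describe_photos and its per-photo helper, and the fast-path condition is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def describe_all(
    photos: Sequence[bytes],
    describe_one: Callable[[bytes], str],
    concurrency: int = 4,
) -> list[str]:
    """Describe photos in parallel, returning results in input order."""
    # Fast path: a pool adds nothing for zero or one photo, or one worker.
    if len(photos) <= 1 or concurrency <= 1:
        return [describe_one(photo) for photo in photos]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Executor.map yields results in submission order, not completion order.
        return list(pool.map(describe_one, photos))
```

Threads are enough here because each call spends its time waiting on the network, so no process pool is needed.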

Eval results (refreshed goldens, paid run):

  case                  P0    P1    P2    P3    bundle  Δ vs P3
  01-fixture-baseline   0.48  1.09  1.18  1.56  1.61    +0.05
  02-joyful-summit      1.39  2.36  2.50  2.79  3.68    +0.89  ⭐
  03-exhausted-foggy    2.08  1.74  2.63  3.12  2.50    -0.62
  04-bad-tolz-family       —     —  3.10  3.95  3.41    -0.54
  average               1.32  1.73  2.10  2.86  2.80    -0.06

Case 02 is the architecture win — verifier regenerated, hit ZERO
unsupported claims (down from 7 in P2), faithfulness jumped 0.89.
But warmth + narrative_arc each dropped 1.00 on this case — the
regen produced more constrained but flatter prose. That's exactly
the trade-off ADR-011 anticipated; the "improvement only" admission
policy guards faithfulness, not prose quality. Phase 4.1 should
either tighten admission (require ratio drop AND no warmth signal
drop, which needs a different signal at runtime) or relax
max_inferred_ratio default.

Cases 03 + 04 regressed on faithfulness too, likely a mix of the
extra sentence-level tagging load on the writer + judge sampling
noise. Both drops are within the 1.00 regression threshold, so only
case 02 tripped the gate. The architecture is in place; tuning is
Phase 4.1.

Phase 5 (family memory) deliberately NOT included — touches the
privacy-as-wedge moat and needs a separate strategy conversation
per the original brief. Worth a dedicated discussion before any
code lands.

Tests: 361 passing (was 362; -1 because the impossible-EN-too-few
test collapsed into a single tri-lingual paragraph-count test now
that paragraphs is one list). 92% coverage. ruff/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ditvor merged commit 96ed8c3 into develop on May 25, 2026
5 checks passed
ditvor deleted the feat/phases-2-5-3-1-3-2-4-bundle branch on May 27, 2026 06:55