feat(narrative): bundle phases 2.5 + 3.1 + 3.2 + 4 (ADRs 011-014) #55
Merged
Four phases shipped together because each builds on Phase 4's sentence-level provenance schema:

PHASE 4 (ADR-014): sentence-level provenance + HTML hover
NarrativeOutput.paragraphs becomes list[Paragraph] where Paragraph = list[Sentence]. Each sentence carries tri-lingual text + one Provenance (source: SEED / PHOTO / GPX / INFERRED + ref). Writer prompt rewritten to produce sentence-aligned tri-lingual output with per-sentence provenance tags. Editorial template wraps each sentence in <span class="sent" data-prov="..."> with hover tooltip + subtle CSS tint on INFERRED. Log + Encyclopedia templates use the new paragraphs_as_localized() flat fallback; Phase 4.1 follow-up will port them. schema_version bumps to 3, invalidating all prior cache entries and goldens.

PHASE 2.5 (ADR-011): verifier loop using self-reported provenance
After the writer pass returns, count the INFERRED-sentence ratio. If > Settings.max_inferred_ratio (default 0.5), regenerate once with feedback. "Improvement only" admission policy: keep the regen only if the ratio actually dropped. Free signal (no extra LLM call to detect), only the conditional regen call. Streaming bypassed.

PHASE 3.1 (ADR-012): per-photo vision description cache
On-disk cache for PhotoDescription keyed by (photo bytes SHA-256, vision_model). Mirrors the narrative cache pattern; lives in ~/.cache/trailstory/vision/. CLI honours the existing --no-cache flag; eval pins use_cache=False so goldens reflect live calls.

PHASE 3.2 (ADR-013): parallel vision describer calls
describe_photos uses ThreadPoolExecutor with Settings.vision_concurrency workers (default 4). 6-photo hike drops from ~6s serial to ~1.5s parallel. Order preserved via .map; single-photo fast path skips the pool. AnthropicClient SDK is thread-safe; the GIL releases on HTTP wait.
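To make the Phase 3.1 and 3.2 descriptions concrete, here is a minimal sketch of a per-photo cache combined with an order-preserving thread pool. It is not the code from this PR: the `describe` callable, the `cache_dir` parameter, the JSON file layout, and the exact signatures are assumptions made for illustration, and whether `--no-cache` also skips cache writes in the real implementation is not stated in the source.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List


def cache_key(photo_bytes: bytes, vision_model: str) -> str:
    """Key on (photo bytes SHA-256, vision_model), as the commit describes."""
    digest = hashlib.sha256(photo_bytes).hexdigest()
    # Assumption: hash the model name too so the key is always filesystem-safe.
    model_tag = hashlib.sha256(vision_model.encode("utf-8")).hexdigest()[:12]
    return f"{digest}-{model_tag}"


def describe_one_with_cache(
    photo_bytes: bytes,
    vision_model: str,
    describe: Callable[[bytes], str],  # hypothetical stand-in for the vision call
    cache_dir: Path,
    use_cache: bool = True,
) -> str:
    path = cache_dir / f"{cache_key(photo_bytes, vision_model)}.json"
    if use_cache and path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["description"]
    description = describe(photo_bytes)
    if use_cache:  # assumption: disabling the cache skips writes as well as reads
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"description": description}), encoding="utf-8")
    return description


def describe_photos(
    photos: List[bytes],
    vision_model: str,
    describe: Callable[[bytes], str],
    cache_dir: Path,
    concurrency: int = 4,
    use_cache: bool = True,
) -> List[str]:
    def one(photo_bytes: bytes) -> str:
        return describe_one_with_cache(
            photo_bytes, vision_model, describe, cache_dir, use_cache
        )

    # Fast path: a single photo (or concurrency of 1) does not need a pool.
    if len(photos) <= 1 or concurrency <= 1:
        return [one(p) for p in photos]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Executor.map yields results in input order, regardless of finish order.
        return list(pool.map(one, photos))
```

Because `Executor.map` returns results in the order of its inputs, descriptions stay aligned with their photos even when later calls finish first. Two identical photos processed in parallel may both miss the cache and write the same file, which is harmless in this sketch because both writes carry the same content.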
Eval results (refreshed goldens, paid run):

| case | P0 | P1 | P2 | P3 | bundle | Δ vs P3 |
|---|---|---|---|---|---|---|
| 01-fixture-baseline | 0.48 | 1.09 | 1.18 | 1.56 | 1.61 | +0.05 |
| 02-joyful-summit | 1.39 | 2.36 | 2.50 | 2.79 | 3.68 | +0.89 ⭐ |
| 03-exhausted-foggy | 2.08 | 1.74 | 2.63 | 3.12 | 2.50 | -0.62 |
| 04-bad-tolz-family | n/a | n/a | 3.10 | 3.95 | 3.41 | -0.54 |
| average | 1.32 | 1.73 | 2.10 | 2.86 | 2.80 | -0.06 |

Case 02 is the architecture win: the verifier regenerated, hit ZERO unsupported claims (down from 7 in P2), and faithfulness jumped 0.89. But warmth + narrative_arc each dropped 1.00 on this case: the regen produced more constrained but flatter prose. That's exactly the trade-off ADR-011 anticipated; the "improvement only" admission policy guards faithfulness, not prose quality. Phase 4.1 should either tighten admission (require ratio drop AND no warmth signal drop, which needs a different signal at runtime) or relax the max_inferred_ratio default.

Cases 03 + 04 regressed on faithfulness too, likely a mix of sentence-leveling cognitive load on the writer + judge sampling noise. All within the 1.00 regression threshold for non-case-02 checks. The architecture is in place; tuning is Phase 4.1.

Phase 5 (family memory) deliberately NOT included: it touches the privacy-as-wedge moat and needs a separate strategy conversation per the original brief. Worth a dedicated discussion before any code lands.

Tests: 361 passing (was 362; -1 because the impossible-EN-too-few test collapsed into a single tri-lingual paragraph-count test now that paragraphs is one list). 92% coverage. ruff/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
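The "improvement only" admission policy from the commit message above can be sketched as follows. This is an illustration under simplifying assumptions, not the PR's `narrative.py`: a draft is reduced to a list of per-sentence provenance tags, and `generate` is a hypothetical callable that takes optional feedback text and returns a new draft.

```python
from typing import Callable, List, Optional

INFERRED = "INFERRED"


def inferred_ratio(tags: List[str]) -> float:
    """Share of sentences whose self-reported provenance is INFERRED."""
    if not tags:
        return 0.0
    return sum(tag == INFERRED for tag in tags) / len(tags)


def write_with_verifier(
    generate: Callable[[Optional[str]], List[str]],  # hypothetical writer call
    max_inferred_ratio: float = 0.5,
) -> List[str]:
    first = generate(None)
    first_ratio = inferred_ratio(first)
    # Within the threshold: accept the first draft, no second LLM call.
    if first_ratio <= max_inferred_ratio:
        return first
    feedback = (
        f"{first_ratio:.0%} of sentences were tagged INFERRED; "
        "ground more sentences in SEED, PHOTO, or GPX evidence."
    )
    second = generate(feedback)  # regenerate exactly once
    # Improvement only: keep the regen only if the ratio actually dropped.
    return second if inferred_ratio(second) < first_ratio else first
```

Note that the admission check compares only the INFERRED ratio, which is why it can admit a regen that is more faithful but flatter, as observed on case 02. Since a ratio can never exceed 1.0, a threshold of 1.0 means the regeneration branch is never taken.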
Summary
Four phases of the narrative-faithfulness initiative bundled into one PR because each builds on Phase 4's sentence-level provenance schema. Phase 5 (family memory) is deliberately NOT included — it touches the privacy-as-wedge moat and needs a separate strategy conversation per the original brief.
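For orientation, the sentence-level provenance schema that the four phases build on might look roughly like this sketch. The type names `ProvenanceSource`, `Provenance`, `Sentence`, and `Paragraph` come from this PR; the field names, the language codes, and the free-function form of `paragraphs_as_localized` and `inferred_ratio` are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class ProvenanceSource(str, Enum):
    SEED = "SEED"
    PHOTO = "PHOTO"
    GPX = "GPX"
    INFERRED = "INFERRED"


@dataclass(frozen=True)
class Provenance:
    source: ProvenanceSource
    ref: Optional[str] = None  # assumed shape, e.g. a photo identifier


@dataclass(frozen=True)
class Sentence:
    # Assumption: text keyed by language code; the codes below are placeholders.
    text: Dict[str, str]
    provenance: Provenance


Paragraph = List[Sentence]


def paragraphs_as_localized(paragraphs: List[Paragraph], lang: str) -> List[str]:
    """Flat fallback: one plain string per paragraph for a single language."""
    return [" ".join(s.text[lang] for s in para) for para in paragraphs]


def inferred_ratio(paragraphs: List[Paragraph]) -> float:
    """Fraction of all sentences tagged INFERRED (0.0 for empty output)."""
    sentences = [s for para in paragraphs for s in para]
    if not sentences:
        return 0.0
    inferred = sum(
        s.provenance.source is ProvenanceSource.INFERRED for s in sentences
    )
    return inferred / len(sentences)


sample: List[Paragraph] = [
    [
        Sentence(
            {"en": "We left at dawn.", "de": "Wir gingen früh los.", "fr": "Départ à l'aube."},
            Provenance(ProvenanceSource.GPX, "track-start"),
        ),
        Sentence(
            {"en": "Everyone felt ready.", "de": "Alle waren bereit.", "fr": "Tout le monde était prêt."},
            Provenance(ProvenanceSource.INFERRED),
        ),
    ]
]
```

With this shape, a template that has not been ported to sentence spans can still render `paragraphs_as_localized(sample, "en")`, while the verifier reads the same structure through `inferred_ratio(sample)`.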
- Phase 2.5 (ADR-011), verifier loop: if the INFERRED-sentence ratio exceeds `Settings.max_inferred_ratio` (default `0.5`), regenerate once. "Improvement only" admission policy. Free signal (no extra LLM call to detect).
- Phase 3.1 (ADR-012), per-photo vision cache: keyed by `(photo bytes SHA-256, vision_model)`. Mirrors the existing narrative cache.
- Phase 3.2 (ADR-013), parallel vision calls: `ThreadPoolExecutor` parallelises vision calls (`vision_concurrency=4` default). 6-photo hike drops from ~6s serial to ~1.5s.
- Phase 4 (ADR-014), sentence-level provenance: `NarrativeOutput.paragraphs` becomes `list[Paragraph] = list[list[Sentence]]`. Each sentence has tri-lingual text + one `Provenance` (source: SEED / PHOTO / GPX / INFERRED). Editorial template wraps each sentence in `<span class="sent" data-prov="...">` with hover tooltip + subtle CSS tint on INFERRED. Log + Encyclopedia use the new `paragraphs_as_localized()` flat fallback until Phase 4.1. `schema_version=3`.

The full trajectory across all 5 phase-bundles
The honest read
This is the most mixed eval result of the whole initiative: the bundle's average moved -0.06 against P3, well below the per-phase wins of 0.41 / 0.37 / 0.76 that Phases 1 / 2 / 3 delivered. Why ship anyway:
The architecture is in place; tuning is Phase 4.1. The user (you) can decide:
- `max_inferred_ratio` to `0.7` or `1.0` to disable the verifier by default

What's intentionally NOT in this PR
What's in this PR
43 files changed (2,403 insertions, 467 deletions):
- `trailstory/models.py`: `ProvenanceSource` enum, `Provenance`, `Sentence`, `Paragraph` type alias; `NarrativeOutput.paragraphs` shape change; `schema_version=3`; new `paragraphs_as_localized()` method
- `trailstory/llm/prompts.py`: writer prompt rewrites for new shape + provenance rubric
- `trailstory/llm/narrative.py`: verifier loop, `_inferred_ratio()`, `_verifier_feedback()`
- `trailstory/llm/vision_cache.py` (NEW): per-photo cache mirroring the narrative cache
- `trailstory/photos.py`: `describe_photos` adds `concurrency` + `use_cache` kwargs; `ThreadPoolExecutor` parallel path; `_describe_one_with_cache` helper
- `trailstory/config.py`: `max_inferred_ratio`, `vision_concurrency`
- `trailstory/cli.py`: passes new settings through
- `tests/eval/run.py`: pins `concurrency=4`, `use_cache=False`, `max_inferred_ratio=0.5`
- `tests/eval/rubric.py`: every paragraph-aware check uses `paragraphs_as_localized()`
- `templates/styles/editorial.html.j2`: sentence spans + provenance CSS
- `templates/styles/log.html.j2`, `templates/styles/encyclopedia.html.j2`: flat fallback via `flat_paragraphs`
- `tests/conftest.py`: `paragraphs_from_strings`, `paragraphs_dict_from_strings` helpers
- `tests/test_*.py` updated for the new shape
- `CHANGELOG.md` + `CLAUDE.md` decision register: entries 11-14

Test plan
- `make ci`: 361 passing (was 362, -1 because the impossible-EN-too-few rubric test collapsed into one test now that paragraphs is shared across languages), 92% coverage, ruff/mypy clean
- `make eval-update-golden`: paid run with all 4 cases, vision enabled, verifier enabled; regression gate triggered on case 02 (documented above)
- `_inferred_ratio` semantics implied via the new test rubric helpers
- `make golden-update` for all 3 styles

🤖 Generated with Claude Code