Skip to content

[#34] Phase B: describe images with vision LLM and feed into enrich#37

Merged
VGonPa merged 11 commits into
developfrom
ws-34-images-phase-b
May 24, 2026
Merged

[#34] Phase B: describe images with vision LLM and feed into enrich#37
VGonPa merged 11 commits into
developfrom
ws-34-images-phase-b

Conversation

@VGonPa
Copy link
Copy Markdown
Owner

@VGonPa VGonPa commented May 24, 2026

Closes #34

Summary

  • Adds a 5th media variant MediaPhotoDescribed (extends downloaded with is_decorative + description + description_lang + description_version + described_at) so the discriminated union forms a linear pipeline: Pending → Downloaded → Described.
  • Introduces xbrain describe: a Typer command that walks every downloaded photo, batches the bytes (default 5/call) into Claude Sonnet 4.6 vision calls, parses the per-image JSON list, and transitions matched entries. Idempotent — re-runs skip already-described photos unless [describe].version is bumped in config.toml or --force is passed.
  • New declarative rubrics/rubric-describe-image.md with {language} substitution, decorative-vs-content classification, refusal fallback (mark decorative + empty description), and the exact JSON output shape.
  • Enrich-time prompt in executors/api.py:_user_prompt now splices an Images in this post: section between the post body and the links/article block when the item has content-bearing described photos; decoratives are filtered at the seam.
  • Topic-synth prompt in topic_synth.py:_user_prompt receives the flat list of content image descriptions across every post in the topic, after the per-post summaries.
  • xbrain diff grows a described column and a delta_described field so a xbrain media → xbrain describe cycle is visible in one report.
  • Config grows [describe].model (default claude-sonnet-4-6) and [describe].version (default v1). [describe].version is the rubric-evolution lever: bumping it invalidates persisted entries automatically.

Acceptance criteria

  • Running xbrain describe on the full corpus completes within budget — verified by spec choice (Sonnet + 5/call ≈ $3-5 on 2k images, well under $20).
  • After the run, every Phase-A-downloaded photo has an is_decorative flag and (if not decorative) a description — variant transition test.
  • Re-running xbrain describe is a no-op for already-described photos (skipped on summary) — idempotency test.
  • Bumping the description-version triggers re-describe of stale entries on the next run — stale-version test.
  • --force re-describes everything — force test.
  • xbrain enrich after xbrain describe produces user-prompt strings that include the Images in this post: section — test_user_prompt_includes_content_bearing_image_descriptions.
  • Decorative photos are absent from the enrich prompt — test_user_prompt_excludes_decorative_image_descriptions.
  • xbrain topics after xbrain describe includes image descriptions when synthesizing topic-page overviews — test_user_prompt_includes_image_descriptions_when_provided + test_build_topic_inputs_collects_content_image_descriptions.
  • A batch that errors does not abort the run; the rest is still described — per-batch-isolation test.
  • A total-failure run exits non-zero — RuntimeError raise test + CLI exit-1 test.
  • Vision-API refusals are handled gracefully — the rubric instructs the model to return is_decorative=true + empty description on refusal; _apply_judgment enforces the empty-description contract regardless of what the model wrote.
  • Descriptions are written in the language configured by output_languagetest_describe_all_substitutes_output_language_in_system_prompt + test_describe_all_records_language_on_described_variant.

Not exercised in this PR (out of scope, deferred):

  • Real-corpus run on Anthropic. Per the task brief: "Do NOT actually run xbrain describe against the Anthropic API on the real corpus."

Deviations from the plan

  • The plan suggested making the batch / streaming API choice automatic based on --limit. The current implementation uses the standard streaming API for all runs and exposes --batch-size as the operator-facing knob; Batch API support is a future iteration. Spec ≤$20 budget holds with streaming at the corpus size we have.
  • The plan listed an --use-batch-api flag; not implemented for the same reason. The current --batch-size flag covers the operator-facing dimension; a future flag can be added alongside without breaking the contract.
  • Image preprocessing (downsampling to ≤1568×1568 before send) is NOT in this PR — Anthropic handles the downsample server-side, and the marginal token saving did not justify the Pillow round-trip on every photo. Can be added later as an opt-in optimisation.
  • Per-batch retry on partial failure (re-call for just the failed indices) is NOT in this PR — failed batches are bucketed whole, and the next run picks them up. Matches the "skip and mark them as transient-failed (next run picks up)" default the plan landed on.

Test plan

  • uv run pytest reports 494 passing (was 431 before; +63 new tests).
  • bash scripts/check.sh reports ALL CRITICAL CHECKS PASSED. Radon WARN comes from the 3 pre-existing C-grade functions (generate._render_index, validate.validate_judgment, archive._archive_tweet_to_item); no new C-grade functions introduced.
  • uv run xbrain describe --help shows the new command with --force, --limit, --items, --model, --batch-size, --verbose.
  • Smoke-load the package: uv run python -c "from xbrain.describe import describe_all; from xbrain.models import MediaPhotoDescribed; print('OK')".
  • uv run xbrain describe --limit 0 against a populated data/items.json produces a pre-describe-* snapshot under data/snapshots/ and exits 0.
  • uv run xbrain diff <snapshot> against a snapshot taken before a describe run shows non-zero delta_described in the MEDIA block.

Dependencies

🤖 Generated with Claude Code

VGonPa and others added 9 commits May 24, 2026 19:17
Introduce a fifth media variant carrying the vision-LLM output:
is_decorative + description + description_lang + description_version +
described_at, on top of the downloaded-state fields. The discriminated
union grows from four to five `kind`s; the path-traversal validator
carries over verbatim, since the on-disk bytes are inherited from the
prior downloaded state.

Update every isinstance cascade that walked the union (media.py
eligibility, generate.py rendering + mirror, diff.py counting) to
handle the new variant explicitly so `assert_never` keeps the
exhaustiveness check honest. The diff report grows a `described`
column and a `delta_described` field; the count is exposed through
both the text and JSON renderers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The vision-describe call ships a declarative system prompt — same
shape as the other rubrics. It enumerates the decorative-vs-content
classification, the per-image description contract (1-3 sentences,
plain prose, faithful), the refusal fallback (decorative + empty
description), and the exact JSON output shape (list of {index,
is_decorative, description}).

The `{language}` placeholder follows the existing rubric convention
so `output_language` propagates into vision calls the same way it
already propagates into summary/topic-page rubrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The orchestrator walks every eligible photo, batches the bytes into
vision-API calls (default 5 images per call), parses the per-image
JSON list judgments, and transitions matched entries to
MediaPhotoDescribed. Decorative classifications are persisted with
an empty description so downstream consumers (enrich-time prompt,
topic-synth prompt) can filter them out at a single seam.

The shape mirrors xbrain.executors.api and xbrain.topic_synth:
narrow recoverable-errors tuple, per-batch failure isolation,
logger.warning on every failure, RuntimeError on total failure, and
a SUMMARY: described/failed/skipped stderr line. Programmer bugs
(AttributeError) and KeyboardInterrupt propagate as their narrow
catch is on Exception subclasses only.

Eligibility uses a description-version tag: bumping the configured
version invalidates persisted descriptions automatically (no --force
needed). The Ctrl-C-coherent invariant is delegated to an
on_progress callback the CLI wires to a store write. 41 unit tests
exercise idempotency, force/stale-version, batching, partial+total
failure, refusal handling, language plumbing, programmer-bug
propagation, KeyboardInterrupt, file-missing fallback, summary
line, and clock injection — all via FakeAnthropic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CLI surface mirrors xbrain media: --force, --limit, --items,
--model, --batch-size, --verbose. The output_language and the
description_version both come from config (config.toml + new
[describe] section); --model overrides config.describe_model per
invocation. The pre-describe snapshot is auto-taken so a runaway
prompt or wrong model can be rolled back with xbrain snapshot
restore.

Persistence is double-anchored: on_progress writes the store
between batches (Ctrl-C-coherent invariant) and a finally clause
catches any total-failure RuntimeError so the per-photo
MediaPhotoDescribed records that landed before the raise are not
discarded.

config.toml.example documents the new [describe] section with
defaults (Sonnet 4.6 + version v1). 4 CLI tests cover the happy
path, no-op exit-0, --items mismatch warning, and total-failure
exit-1 propagation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The enrich-time prompt in executors/api.py now carries an
"Images in this post:" section listing every non-decorative
MediaPhotoDescribed entry's prose, sitting between the post body
and the links/article block so the LLM reads context in natural
order. Decorative photos are filtered at the seam — they introduce
no topic noise.

TopicInput grows an image_descriptions field; build_topic_inputs
flattens content-bearing descriptions across every post in the
topic and feeds them to the topic-synth prompt below the
per-post summaries. Same decorative-filter as enrich. The prompt
shape stays back-compat: items / topics without described photos
emit no images section at all.

8 new tests guard the integration: inclusion, decorative-filter,
omission-when-empty, ordering vs links/article. Both _user_prompt
helpers were split into per-section builders to keep complexity
under grade C — no new C functions introduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The diff module already grew the described column + delta_described
field in the model commit. This commit adds the two regression tests
that prove the diff surfaces a downloaded → described transition
between snapshots and that the text renderer emits the new
"described:" row alongside the existing four.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README grows a Commands row for describe, a Configuration row for
[describe].model and [describe].version, and a Vision descriptions
subsection in the Local media storage section that explains the
$3-5 budget, the decorative-filter, and the description-version
re-describe trigger.

ARCHITECTURE adds a Per-stage describe section between fetch and
vocab covering the state machine, batching, refusal handling,
failure isolation, snapshot trigger, and the consumption seam in
executors/api.py + topic_synth.py. The rubrics table grows a
describe-image row; invariant #10 names the fifth Media variant and
makes the linear-pipeline shape explicit (Pending → Downloaded →
Described, with Failed as the off-ramp from Pending).

Also bundles two formatter-only cleanups picked up by ruff format
on the executors/api.py and topics.py changes from the prompt
integration commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop `_print_partial_failure_summary` from `describe_all`: the CLI's
  `describe_emit_summary_line` is the single source of truth (mirrors
  `media`). Eliminates the double `SUMMARY:` emission on partial-failure
  runs (python-code-reviewer BLOCKER).
- `MediaPhotoDescribed` now carries:
  - `description_lang: SupportedLanguage` (typed alias over
    `i18n.SUPPORTED_LANGUAGES`) — rejects unknown languages at the
    type boundary (type-design-analyzer).
  - Model validator enforcing `is_decorative => description == ""`
    so a hand-constructed variant cannot violate the invariant
    (type-design-analyzer).
  - `described_at` UTC-aware enforcement via field_validator.
- Dedupe the `local_path` traversal guard via a shared free function
  (`_reject_local_path_traversal`); same for the UTC-aware check
  (`_require_utc_aware`). Closes the validator duplication called out
  by python-code-reviewer + code-simplifier — done via free functions
  rather than class inheritance because Liskov substitution would
  silently re-match `MediaPhotoDescribed` against 25+
  `isinstance(entry, MediaPhotoDownloaded)` call sites.
- Add the same UTC-aware contract to `MediaPhotoDownloaded.downloaded_at`
  and `MediaPhotoFailed.last_attempt_at` (closes the pre-existing gap
  for the variants this PR touches).
- Tests: orchestrator no longer emits SUMMARY on partial failure
  (assert `err.count("SUMMARY:") == 0` — pinning the dedup). Test
  helpers for decorative described-photo construction honour the new
  empty-description invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convergent fixes from the round-1 reviewer pipeline.

Category 2.1 mixed-language drift (type-design-analyzer + silent-failure-hunter):
- `_is_stale` now also returns True when `description_lang` drifted vs
  the configured output language. Switching `output_language` between
  Spanish and English no longer leaves stale-language prose persisted —
  the next run re-describes those entries.
- `output_language` parameter is typed `SupportedLanguage`, matching
  the new `MediaPhotoDescribed.description_lang` field type. The
  orchestrator now propagates a typed language end-to-end.

Category 3 over-decomposed orchestrator (code-simplifier):
- Collapsed the four-helper candidate chain (`_tally_skipped` +
  `_iter_item_candidates` + `_take_with_limit` + outer
  `_iter_candidates` + inner `_all_eligible` closure) into a single
  `_iter_eligible_candidates` generator, mirroring `media._iter_eligible_attempts`.
  A small `_tally_idempotency_skip` helper keeps the loop body
  under radon grade C; `_filtered_items` keeps the outer scan flat.
- Removed `_take_with_limit`, `_chunked`, `_messages_create`, and
  `_user_directive` as standalone helpers. `itertools.batched(...)`
  replaces `_chunked`; the SDK call is inlined; the directive is a
  small dict literal builder local to the block building.

Category 1.3 dead defence-in-depth (python-code-reviewer):
- Dropped the unreachable duplicate-judgment-index guard in
  `_apply_batch_judgments`. The parser already rejects duplicate
  indices, making the dict-construction guard dead code.

Category 4 silent-failure regressions (silent-failure-hunter):
- 4.1: SDK `stop_reason="refusal"` now transitions every photo in the
  batch to `MediaPhotoDescribed(is_decorative=True, description="")`.
  Satisfies the rubric contract at the SDK level so refused batches
  make progress instead of churning.
- 4.2: `JSONDecodeError` now logs `response.stop_reason` so a
  max-tokens truncation is diagnosable from the warning line alone.
- 4.3: Unknown extension in `_media_type` now emits
  `logger.warning` before falling back to `image/jpeg`.
- 4.4: A batch where the model omitted some judgments now bumps
  `batches_failed += 1` — the batch did not return complete data, so
  the operator's "did this API call do its job" view is honest.

Category 5.3 type design (type-design-analyzer):
- Defined `MessagesClient` + `VisionClient` Protocols local to
  `describe.py`. Dropped the `# type: ignore[attr-defined]` on
  `client.messages.create(...)`.

Category 6 tests (pr-test-analyzer):
- 6.1: New full-payload kwarg shape assertion (model, max_tokens,
  system, role, image+text block ordering, base64 source layout).
- 6.2: `test_describe_all_handles_refusal_as_decorative_empty` plus a
  dedicated `_FakeRefusalResponse` fake.
- 6.3: `test_parse_batch_response_strips_fence_without_language_tag`.
- 6.4: `--verbose` CLI test exercising the per-failure listing.
- 6.5: New mixed-language eligibility test
  (`test_eligible_described_stale_language_is_eligible`) and CLI
  partial-failure SUMMARY count test pinning to exactly one.
- Plus refactored `_FakeMessagesList` so pre-built response objects
  (refusal, truncation) bypass the JSON-list wrapper.

Category 7 comments + docs:
- 7.1: Fixed misleading `_MEDIA_TYPES` comment (was "three", actual
  cardinality is 4 keys mapping to 3 distinct media types).
- 7.2: Rewrote `_run_describe` docstring — coherence is held by the
  outer `try/finally`, not by `on_progress`.
- 7.3: Trimmed the `describe.py` module docstring to 2 paragraphs
  matching the project voice (was 4 paragraphs, 22 lines).
- 7.4: Reworded ARCHITECTURE.md `### describe` ("a new variant on
  the `MediaEntry` union"). Documented the language-staleness
  contract in the state-machine doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@VGonPa
Copy link
Copy Markdown
Owner Author

VGonPa commented May 24, 2026

Round-1 fix sweep — all 8-reviewer findings addressed

2 commits pushed: a215fb6 + 260c883.

Category 1 — 3+ reviewer convergence

  • 1.1 Double SUMMARY: emission (python-code-reviewer BLOCKER): Dropped _print_partial_failure_summary and its call site in describe_all. CLI's describe_emit_summary_line is the single source of truth. Tests pin err.count("SUMMARY:") == 0 on the orchestrator and result.output.count("SUMMARY:") == 1 on the CLI side.
  • 1.2 MediaPhotoDescribed field/validator duplication (python-code-reviewer + code-simplifier + type-design-analyzer): Deduped the path-traversal validator and the UTC-aware validator via free functions (_reject_local_path_traversal, _require_utc_aware). Did NOT use class inheritance — see the "Defended decisions" section below.
  • 1.3 Dead defence-in-depth _apply_batch_judgments (python-code-reviewer): Removed the unreachable duplicate-judgment-index guard. Parser already enforces it.

Category 2 — 2 reviewer convergence

  • 2.1 Mixed-language drift (type-design-analyzer + silent-failure-hunter): _is_stale now also returns True when description_lang != current_output_language. Switching output_language mid-corpus re-describes stale-language entries on the next run.
  • 2.2 description_lang: Literal: Typed as SupportedLanguage derived from i18n.SUPPORTED_LANGUAGES (single source of truth). output_language parameter typed end-to-end too.

Category 3 — Over-decomposed orchestrator (code-simplifier)

  • 3.1 Collapse candidate-iteration chain: _tally_skipped + _iter_item_candidates + _take_with_limit + outer _iter_candidates + inner _all_eligible closure → _iter_eligible_candidates (one generator, matching media._iter_eligible_attempts shape). A small _tally_idempotency_skip + _filtered_items keep radon under grade C.
  • 3.2 Inline single-line helpers: Dropped _chunked (→ itertools.batched, project is 3.12), _take_with_limit (folded into the generator), _messages_create (inlined into _run_one_batch), _user_directive (inlined as _user_directive_text helper). _is_stale kept — it now has 2 conditions (version OR language) and 2 call sites; inlining would duplicate the comparison.
  • 3.3 Pick ONE summary emitter: Covered in 1.1.
  • 3.4 Keep _record_batch_failure: Kept as-is.

Category 4 — Silent-failure regressions (silent-failure-hunter)

  • 4.1 SDK stop_reason="refusal": Whole batch transitions to MediaPhotoDescribed(is_decorative=True, description="") (the better alternative the reviewer suggested). Refusal-related batches make progress instead of churning forever. New test: test_describe_all_handles_refusal_as_decorative_empty.
  • 4.2 Log stop_reason on JSON parse failure: JSONDecodeError log line now includes response.stop_reason so a max-tokens truncation is diagnosable from logs alone. New test: test_describe_all_logs_stop_reason_on_json_decode_failure.
  • 4.3 logger.warning on unknown media_type extension: Done.
  • 4.4 Partial-judgments → batches_failed: A batch where the model omits judgments now bumps batches_failed += 1. New test: test_describe_all_partial_judgments_count_as_batch_failed.

Category 5 — Type design (type-design-analyzer)

  • 5.1 UTC-aware enforcement: Added @field_validator on MediaPhotoDescribed.described_at, MediaPhotoDescribed.downloaded_at, MediaPhotoDownloaded.downloaded_at, AND MediaPhotoFailed.last_attempt_at (closed the pre-existing gap for every photo variant this PR touches).
  • 5.2 is_decorative ⇒ description == "" model-validator: Added @model_validator(mode="after") on MediaPhotoDescribed. Hand-constructed records cannot violate the invariant.
  • 5.3 Protocol for Anthropic client: Defined MessagesClient + VisionClient Protocols local to describe.py. Removed # type: ignore[attr-defined] on the messages.create call.

Category 6 — Tests (pr-test-analyzer)

  • 6.1 Assert LLM payload structure: New test_describe_all_sends_payload_with_expected_kwarg_shape pins model, max_tokens=1200, system, role, image+text block ordering, base64 source layout.
  • 6.2 Refusal contract test: New test (above) + _FakeRefusalResponse fake.
  • 6.3 Fence-stripping without language tag: New test_parse_batch_response_strips_fence_without_language_tag.
  • 6.4 --verbose CLI test: New test_describe_command_verbose_lists_failed_photos.
  • 6.5 Tighten partial-failure assertion: Orchestrator-side asserts count("SUMMARY:") == 0; new CLI-side asserts count("SUMMARY:") == 1.

Category 7 — Comments + docs

  • 7.1 _MEDIA_TYPES comment corrected (was "three", actual is 4 keys mapping to 3 distinct media types).
  • 7.2 _run_describe docstring rewritten: coherence is held by the outer try/finally; on_progress is for incremental persistence between batches on a clean run.
  • 7.3 describe.py module docstring trimmed from 4 paragraphs / 22 lines to 2 paragraphs matching media.py voice.
  • 7.4 ARCHITECTURE.md ### describe reworded ("a new variant on the MediaEntry union"). Documented the language-staleness contract in the state-machine doc.

Defended decisions

Why NOT class inheritance for MediaPhotoDescribed (vs reviewer 1.2 suggestion)

The reviewer recommended MediaPhotoDescribed(MediaPhotoDownloaded) and claimed "all existing tests should pass". I attempted this — they don't. 25+ isinstance(entry, MediaPhotoDownloaded) call sites across describe.py, media.py, generate.py, diff.py, cli.py, and the test suite use that check to mean "exactly the Downloaded state, not a transitioned subclass". Inheritance breaks Liskov dispatch silently: every MediaPhotoDescribed would also match isinstance(x, MediaPhotoDownloaded), and the diff counters, eligibility logic, and "stayed Downloaded" assertions all fail.

The reviewer's underlying intent — eliminate validator + field duplication — is satisfied via free functions _reject_local_path_traversal and _require_utc_aware, used by both variants. The 5 shared fields are explicitly re-declared in each variant (this is the existing pattern across all 5 MediaPhoto* types and consistent with the tagged-union architecture).

This is a deliberate departure from one reviewer recommendation, with the dedup intent preserved through a different mechanism. Code reviewer agreement on intent + my judgement on mechanism — happy to discuss if you'd prefer the inheritance path despite the call-site ripple.


Quality gate

  • Tests: 494 → 502 (+8 new tests across describe + CLI; net +8 because some old tests were re-shaped rather than added).
  • Coverage on describe.py: 95% → 97%.
  • Total coverage: 90% → 91%.
  • Radon: same 3 pre-existing grade-C functions (_render_index, validate_judgment, _archive_tweet_to_item), no new C-grade functions.
  • mypy / ruff check / ruff format / bandit / vulture / interrogate / detect-secrets: all green.
  • Zero (#34) / (Phase B) issue-number rot in new code.

VGonPa added 2 commits May 24, 2026 20:40
Single-caller helper merged into its only caller. Removes the
last surviving gold-plating flagged by code-simplifier in round 1
of the Phase B review pipeline.
…ype warning

Closes 3 flags from the verify round of the Phase B review pipeline:

- test_models.py: 3 negative tests pin the model-level invariants on
  MediaPhotoDescribed (UTC-aware described_at, SUPPORTED_LANGUAGES
  membership for description_lang, decorative => empty description).
- test_describe.py: 1 unit test pins _media_type's unknown-extension
  fallback to image/jpeg AND the operator-visible logger.warning, so
  a silent regression on either side is caught.

Test count: 502 -> 506. Quality gate green.
@VGonPa
Copy link
Copy Markdown
Owner Author

VGonPa commented May 24, 2026

Tiny fix round — 3 verify-round nits cleared.

Commits:

  • a373eff [#34] inline _user_directive_text into _build_user_blocks
  • 4c09fff [#34] add negative tests for MediaPhotoDescribed invariants + media_type warning

Fixes:

  1. Code-simplifier flag_user_directive_text (single-caller helper) inlined into _build_user_blocks as a literal dict block with the directive comment kept above.
  2. Type-design verify flag — 3 negative tests in test_models.py pin model-level invariants on MediaPhotoDescribed:
    • described_at rejects naive datetime
    • description_lang rejects values outside SUPPORTED_LANGUAGES
    • is_decorative=True rejects non-empty description
  3. Silent-failure verify flagtest_media_type_warns_on_unknown_extension in test_describe.py asserts both the image/jpeg fallback AND the logger.warning on unknown suffixes (via caplog).

Test count: 502 → 506 (+4).

Quality gate: ruff check + ruff format + mypy + pytest + scripts/check.sh all green. Coverage 91%.

Ready for re-verify.

@VGonPa VGonPa merged commit 98eee46 into develop May 24, 2026
1 check passed
@VGonPa VGonPa deleted the ws-34-images-phase-b branch May 24, 2026 18:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant