[#34] Phase B: describe images with vision LLM and feed into enrich by VGonPa · Pull Request #37 · VGonPa/xbrain

VGonPa · 2026-05-24T17:39:56Z

Closes #34

Summary

Adds a 5th media variant MediaPhotoDescribed (extends downloaded with is_decorative + description + description_lang + description_version + described_at) so the discriminated union forms a linear pipeline: Pending → Downloaded → Described.
Introduces xbrain describe: a Typer command that walks every downloaded photo, batches the bytes (default 5/call) into Claude Sonnet 4.6 vision calls, parses the per-image JSON list, and transitions matched entries. Idempotent — re-runs skip already-described photos unless [describe].version is bumped in config.toml or --force is passed.
New declarative rubrics/rubric-describe-image.md with {language} substitution, decorative-vs-content classification, refusal fallback (mark decorative + empty description), and the exact JSON output shape.
Enrich-time prompt in executors/api.py:_user_prompt now splices an Images in this post: section between the post body and the links/article block when the item has content-bearing described photos; decoratives are filtered at the seam.
Topic-synth prompt in topic_synth.py:_user_prompt receives the flat list of content image descriptions across every post in the topic, after the per-post summaries.
xbrain diff grows a described column and a delta_described field so a xbrain media → xbrain describe cycle is visible in one report.
Config grows [describe].model (default claude-sonnet-4-6) and [describe].version (default v1). [describe].version is the rubric-evolution lever: bumping it invalidates persisted entries automatically.

Acceptance criteria

Not exercised in this PR (out of scope, deferred):

Real-corpus run on Anthropic. Per the task brief: "Do NOT actually run xbrain describe against the Anthropic API on the real corpus."

Deviations from the plan

The plan suggested making the batch / streaming API choice automatic based on --limit. The current implementation uses the standard streaming API for all runs and exposes --batch-size as the operator-facing knob; Batch API support is a future iteration. Spec ≤$20 budget holds with streaming at the corpus size we have.
The plan listed an --use-batch-api flag; not implemented for the same reason. The current --batch-size flag covers the operator-facing dimension; a future flag can be added alongside without breaking the contract.
Image preprocessing (downsampling to ≤1568×1568 before send) is NOT in this PR — Anthropic handles the downsample server-side, and the marginal token saving did not justify the Pillow round-trip on every photo. Can be added later as an opt-in optimisation.
Per-batch retry on partial failure (re-call for just the failed indices) is NOT in this PR — failed batches are bucketed whole, and the next run picks them up. Matches the "skip and mark them as transient-failed (next run picks up)" default the plan landed on.

Test plan

uv run pytest reports 494 passing (was 431 before; +63 new tests).
bash scripts/check.sh reports ALL CRITICAL CHECKS PASSED. Radon WARN comes from the 3 pre-existing C-grade functions (generate._render_index, validate.validate_judgment, archive._archive_tweet_to_item); no new C-grade functions introduced.
uv run xbrain describe --help shows the new command with --force, --limit, --items, --model, --batch-size, --verbose.
Smoke-load the package: uv run python -c "from xbrain.describe import describe_all; from xbrain.models import MediaPhotoDescribed; print('OK')".
uv run xbrain describe --limit 0 against a populated data/items.json produces a pre-describe-* snapshot under data/snapshots/ and exits 0.
uv run xbrain diff <snapshot> against a snapshot taken before a describe run shows non-zero delta_described in the MEDIA block.

Dependencies

Built on top of Phase A (PR [#33] Phase A: download X-post photos and render in notes #35) — depends on MediaPhotoDownloaded being the bytes-on-disk state.
Built on top of the radon cleanup (PR [#33] reduce download_all + _download_one complexity below grade C #36) — that PR cleaned media.py from C-grade complexity; this PR maintains that contract (no new C functions in media.py or in the new describe.py).

🤖 Generated with Claude Code

Introduce a fifth media variant carrying the vision-LLM output: is_decorative + description + description_lang + description_version + described_at, on top of the downloaded-state fields. The discriminated union grows from four to five `kind`s; the path-traversal validator carries over verbatim, since the on-disk bytes are inherited from the prior downloaded state. Update every isinstance cascade that walked the union (media.py eligibility, generate.py rendering + mirror, diff.py counting) to handle the new variant explicitly so `assert_never` keeps the exhaustiveness check honest. The diff report grows a `described` column and a `delta_described` field; the count is exposed through both the text and JSON renderers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The vision-describe call ships a declarative system prompt — same shape as the other rubrics. It enumerates the decorative-vs-content classification, the per-image description contract (1-3 sentences, plain prose, faithful), the refusal fallback (decorative + empty description), and the exact JSON output shape (list of {index, is_decorative, description}). The `{language}` placeholder follows the existing rubric convention so `output_language` propagates into vision calls the same way it already propagates into summary/topic-page rubrics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The orchestrator walks every eligible photo, batches the bytes into vision-API calls (default 5 images per call), parses the per-image JSON list judgments, and transitions matched entries to MediaPhotoDescribed. Decorative classifications are persisted with an empty description so downstream consumers (enrich-time prompt, topic-synth prompt) can filter them out at a single seam. The shape mirrors xbrain.executors.api and xbrain.topic_synth: narrow recoverable-errors tuple, per-batch failure isolation, logger.warning on every failure, RuntimeError on total failure, and a SUMMARY: described/failed/skipped stderr line. Programmer bugs (AttributeError) and KeyboardInterrupt propagate as their narrow catch is on Exception subclasses only. Eligibility uses a description-version tag: bumping the configured version invalidates persisted descriptions automatically (no --force needed). The Ctrl-C-coherent invariant is delegated to an on_progress callback the CLI wires to a store write. 41 unit tests exercise idempotency, force/stale-version, batching, partial+total failure, refusal handling, language plumbing, programmer-bug propagation, KeyboardInterrupt, file-missing fallback, summary line, and clock injection — all via FakeAnthropic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The CLI surface mirrors xbrain media: --force, --limit, --items, --model, --batch-size, --verbose. The output_language and the description_version both come from config (config.toml + new [describe] section); --model overrides config.describe_model per invocation. The pre-describe snapshot is auto-taken so a runaway prompt or wrong model can be rolled back with xbrain snapshot restore. Persistence is double-anchored: on_progress writes the store between batches (Ctrl-C-coherent invariant) and a finally clause catches any total-failure RuntimeError so the per-photo MediaPhotoDescribed records that landed before the raise are not discarded. config.toml.example documents the new [describe] section with defaults (Sonnet 4.6 + version v1). 4 CLI tests cover the happy path, no-op exit-0, --items mismatch warning, and total-failure exit-1 propagation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The enrich-time prompt in executors/api.py now carries an "Images in this post:" section listing every non-decorative MediaPhotoDescribed entry's prose, sitting between the post body and the links/article block so the LLM reads context in natural order. Decorative photos are filtered at the seam — they introduce no topic noise. TopicInput grows an image_descriptions field; build_topic_inputs flattens content-bearing descriptions across every post in the topic and feeds them to the topic-synth prompt below the per-post summaries. Same decorative-filter as enrich. The prompt shape stays back-compat: items / topics without described photos emit no images section at all. 8 new tests guard the integration: inclusion, decorative-filter, omission-when-empty, ordering vs links/article. Both _user_prompt helpers were split into per-section builders to keep complexity under grade C — no new C functions introduced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The diff module already grew the described column + delta_described field in the model commit. This commit adds the two regression tests that prove the diff surfaces a downloaded → described transition between snapshots and that the text renderer emits the new "described:" row alongside the existing four. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README grows a Commands row for describe, a Configuration row for [describe].model and [describe].version, and a Vision descriptions subsection in the Local media storage section that explains the $3-5 budget, the decorative-filter, and the description-version re-describe trigger. ARCHITECTURE adds a Per-stage describe section between fetch and vocab covering the state machine, batching, refusal handling, failure isolation, snapshot trigger, and the consumption seam in executors/api.py + topic_synth.py. The rubrics table grows a describe-image row; invariant #10 names the fifth Media variant and makes the linear-pipeline shape explicit (Pending → Downloaded → Described, with Failed as the off-ramp from Pending). Also bundles two formatter-only cleanups picked up by ruff format on the executors/api.py and topics.py changes from the prompt integration commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Drop `_print_partial_failure_summary` from `describe_all`: the CLI's `describe_emit_summary_line` is the single source of truth (mirrors `media`). Eliminates the double `SUMMARY:` emission on partial-failure runs (python-code-reviewer BLOCKER). - `MediaPhotoDescribed` now carries: - `description_lang: SupportedLanguage` (typed alias over `i18n.SUPPORTED_LANGUAGES`) — rejects unknown languages at the type boundary (type-design-analyzer). - Model validator enforcing `is_decorative => description == ""` so a hand-constructed variant cannot violate the invariant (type-design-analyzer). - `described_at` UTC-aware enforcement via field_validator. - Dedupe the `local_path` traversal guard via a shared free function (`_reject_local_path_traversal`); same for the UTC-aware check (`_require_utc_aware`). Closes the validator duplication called out by python-code-reviewer + code-simplifier — done via free functions rather than class inheritance because Liskov substitution would silently re-match `MediaPhotoDescribed` against 25+ `isinstance(entry, MediaPhotoDownloaded)` call sites. - Add the same UTC-aware contract to `MediaPhotoDownloaded.downloaded_at` and `MediaPhotoFailed.last_attempt_at` (closes the pre-existing gap for the variants this PR touches). - Tests: orchestrator no longer emits SUMMARY on partial failure (assert `err.count("SUMMARY:") == 0` — pinning the dedup). Test helpers for decorative described-photo construction honour the new empty-description invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Convergent fixes from the round-1 reviewer pipeline. Category 2.1 mixed-language drift (type-design-analyzer + silent-failure-hunter): - `_is_stale` now also returns True when `description_lang` drifted vs the configured output language. Switching `output_language` between Spanish and English no longer leaves stale-language prose persisted — the next run re-describes those entries. - `output_language` parameter is typed `SupportedLanguage`, matching the new `MediaPhotoDescribed.description_lang` field type. The orchestrator now propagates a typed language end-to-end. Category 3 over-decomposed orchestrator (code-simplifier): - Collapsed the four-helper candidate chain (`_tally_skipped` + `_iter_item_candidates` + `_take_with_limit` + outer `_iter_candidates` + inner `_all_eligible` closure) into a single `_iter_eligible_candidates` generator, mirroring `media._iter_eligible_attempts`. A small `_tally_idempotency_skip` helper keeps the loop body under radon grade C; `_filtered_items` keeps the outer scan flat. - Removed `_take_with_limit`, `_chunked`, `_messages_create`, and `_user_directive` as standalone helpers. `itertools.batched(...)` replaces `_chunked`; the SDK call is inlined; the directive is a small dict literal builder local to the block building. Category 1.3 dead defence-in-depth (python-code-reviewer): - Dropped the unreachable duplicate-judgment-index guard in `_apply_batch_judgments`. The parser already rejects duplicate indices, making the dict-construction guard dead code. Category 4 silent-failure regressions (silent-failure-hunter): - 4.1: SDK `stop_reason="refusal"` now transitions every photo in the batch to `MediaPhotoDescribed(is_decorative=True, description="")`. Satisfies the rubric contract at the SDK level so refused batches make progress instead of churning. - 4.2: `JSONDecodeError` now logs `response.stop_reason` so a max-tokens truncation is diagnosable from the warning line alone. - 4.3: Unknown extension in `_media_type` now emits `logger.warning` before falling back to `image/jpeg`. - 4.4: A batch where the model omitted some judgments now bumps `batches_failed += 1` — the batch did not return complete data, so the operator's "did this API call do its job" view is honest. Category 5.3 type design (type-design-analyzer): - Defined `MessagesClient` + `VisionClient` Protocols local to `describe.py`. Dropped the `# type: ignore[attr-defined]` on `client.messages.create(...)`. Category 6 tests (pr-test-analyzer): - 6.1: New full-payload kwarg shape assertion (model, max_tokens, system, role, image+text block ordering, base64 source layout). - 6.2: `test_describe_all_handles_refusal_as_decorative_empty` plus a dedicated `_FakeRefusalResponse` fake. - 6.3: `test_parse_batch_response_strips_fence_without_language_tag`. - 6.4: `--verbose` CLI test exercising the per-failure listing. - 6.5: New mixed-language eligibility test (`test_eligible_described_stale_language_is_eligible`) and CLI partial-failure SUMMARY count test pinning to exactly one. - Plus refactored `_FakeMessagesList` so pre-built response objects (refusal, truncation) bypass the JSON-list wrapper. Category 7 comments + docs: - 7.1: Fixed misleading `_MEDIA_TYPES` comment (was "three", actual cardinality is 4 keys mapping to 3 distinct media types). - 7.2: Rewrote `_run_describe` docstring — coherence is held by the outer `try/finally`, not by `on_progress`. - 7.3: Trimmed the `describe.py` module docstring to 2 paragraphs matching the project voice (was 4 paragraphs, 22 lines). - 7.4: Reworded ARCHITECTURE.md `### describe` ("a new variant on the `MediaEntry` union"). Documented the language-staleness contract in the state-machine doc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

VGonPa · 2026-05-24T18:34:00Z

Round-1 fix sweep — all 8-reviewer findings addressed

2 commits pushed: a215fb6 + 260c883.

Category 1 — 3+ reviewer convergence

1.1 Double SUMMARY: emission (python-code-reviewer BLOCKER): Dropped _print_partial_failure_summary and its call site in describe_all. CLI's describe_emit_summary_line is the single source of truth. Tests pin err.count("SUMMARY:") == 0 on the orchestrator and result.output.count("SUMMARY:") == 1 on the CLI side.
1.2 MediaPhotoDescribed field/validator duplication (python-code-reviewer + code-simplifier + type-design-analyzer): Deduped the path-traversal validator and the UTC-aware validator via free functions (_reject_local_path_traversal, _require_utc_aware). Did NOT use class inheritance — see the "Defended decisions" section below.
1.3 Dead defence-in-depth _apply_batch_judgments (python-code-reviewer): Removed the unreachable duplicate-judgment-index guard. Parser already enforces it.

Category 2 — 2 reviewer convergence

2.1 Mixed-language drift (type-design-analyzer + silent-failure-hunter): _is_stale now also returns True when description_lang != current_output_language. Switching output_language mid-corpus re-describes stale-language entries on the next run.
2.2 description_lang: Literal: Typed as SupportedLanguage derived from i18n.SUPPORTED_LANGUAGES (single source of truth). output_language parameter typed end-to-end too.

Category 3 — Over-decomposed orchestrator (code-simplifier)

3.1 Collapse candidate-iteration chain: _tally_skipped + _iter_item_candidates + _take_with_limit + outer _iter_candidates + inner _all_eligible closure → _iter_eligible_candidates (one generator, matching media._iter_eligible_attempts shape). A small _tally_idempotency_skip + _filtered_items keep radon under grade C.
3.2 Inline single-line helpers: Dropped _chunked (→ itertools.batched, project is 3.12), _take_with_limit (folded into the generator), _messages_create (inlined into _run_one_batch), _user_directive (inlined as _user_directive_text helper). _is_stale kept — it now has 2 conditions (version OR language) and 2 call sites; inlining would duplicate the comparison.
3.3 Pick ONE summary emitter: Covered in 1.1.
3.4 Keep _record_batch_failure: Kept as-is.

Category 4 — Silent-failure regressions (silent-failure-hunter)

4.1 SDK stop_reason="refusal": Whole batch transitions to MediaPhotoDescribed(is_decorative=True, description="") (the better alternative the reviewer suggested). Refusal-related batches make progress instead of churning forever. New test: test_describe_all_handles_refusal_as_decorative_empty.
4.2 Log stop_reason on JSON parse failure: JSONDecodeError log line now includes response.stop_reason so a max-tokens truncation is diagnosable from logs alone. New test: test_describe_all_logs_stop_reason_on_json_decode_failure.
4.3 logger.warning on unknown media_type extension: Done.
4.4 Partial-judgments → batches_failed: A batch where the model omits judgments now bumps batches_failed += 1. New test: test_describe_all_partial_judgments_count_as_batch_failed.

Category 5 — Type design (type-design-analyzer)

5.1 UTC-aware enforcement: Added @field_validator on MediaPhotoDescribed.described_at, MediaPhotoDescribed.downloaded_at, MediaPhotoDownloaded.downloaded_at, AND MediaPhotoFailed.last_attempt_at (closed the pre-existing gap for every photo variant this PR touches).
5.2 is_decorative ⇒ description == "" model-validator: Added @model_validator(mode="after") on MediaPhotoDescribed. Hand-constructed records cannot violate the invariant.
5.3 Protocol for Anthropic client: Defined MessagesClient + VisionClient Protocols local to describe.py. Removed # type: ignore[attr-defined] on the messages.create call.

Category 6 — Tests (pr-test-analyzer)

6.1 Assert LLM payload structure: New test_describe_all_sends_payload_with_expected_kwarg_shape pins model, max_tokens=1200, system, role, image+text block ordering, base64 source layout.
6.2 Refusal contract test: New test (above) + _FakeRefusalResponse fake.
6.3 Fence-stripping without language tag: New test_parse_batch_response_strips_fence_without_language_tag.
6.4 --verbose CLI test: New test_describe_command_verbose_lists_failed_photos.
6.5 Tighten partial-failure assertion: Orchestrator-side asserts count("SUMMARY:") == 0; new CLI-side asserts count("SUMMARY:") == 1.

Category 7 — Comments + docs

7.1 _MEDIA_TYPES comment corrected (was "three", actual is 4 keys mapping to 3 distinct media types).
7.2 _run_describe docstring rewritten: coherence is held by the outer try/finally; on_progress is for incremental persistence between batches on a clean run.
7.3 describe.py module docstring trimmed from 4 paragraphs / 22 lines to 2 paragraphs matching media.py voice.
7.4 ARCHITECTURE.md ### describe reworded ("a new variant on the MediaEntry union"). Documented the language-staleness contract in the state-machine doc.

Defended decisions

Why NOT class inheritance for `MediaPhotoDescribed` (vs reviewer 1.2 suggestion)

The reviewer recommended MediaPhotoDescribed(MediaPhotoDownloaded) and claimed "all existing tests should pass". I attempted this — they don't. 25+ isinstance(entry, MediaPhotoDownloaded) call sites across describe.py, media.py, generate.py, diff.py, cli.py, and the test suite use that check to mean "exactly the Downloaded state, not a transitioned subclass". Inheritance breaks Liskov dispatch silently: every MediaPhotoDescribed would also match isinstance(x, MediaPhotoDownloaded), and the diff counters, eligibility logic, and "stayed Downloaded" assertions all fail.

The reviewer's underlying intent — eliminate validator + field duplication — is satisfied via free functions _reject_local_path_traversal and _require_utc_aware, used by both variants. The 5 shared fields are explicitly re-declared in each variant (this is the existing pattern across all 5 MediaPhoto* types and consistent with the tagged-union architecture).

This is a deliberate departure from one reviewer recommendation, with the dedup intent preserved through a different mechanism. Code reviewer agreement on intent + my judgement on mechanism — happy to discuss if you'd prefer the inheritance path despite the call-site ripple.

Quality gate

Tests: 494 → 502 (+8 new tests across describe + CLI; net +8 because some old tests were re-shaped rather than added).
Coverage on describe.py: 95% → 97%.
Total coverage: 90% → 91%.
Radon: same 3 pre-existing grade-C functions (_render_index, validate_judgment, _archive_tweet_to_item), no new C-grade functions.
mypy / ruff check / ruff format / bandit / vulture / interrogate / detect-secrets: all green.
Zero (#34) / (Phase B) issue-number rot in new code.

Single-caller helper merged into its only caller. Removes the last surviving gold-plating flagged by code-simplifier in round 1 of the Phase B review pipeline.

…ype warning Closes 3 flags from the verify round of the Phase B review pipeline: - test_models.py: 3 negative tests pin the model-level invariants on MediaPhotoDescribed (UTC-aware described_at, SUPPORTED_LANGUAGES membership for description_lang, decorative => empty description). - test_describe.py: 1 unit test pins _media_type's unknown-extension fallback to image/jpeg AND the operator-visible logger.warning, so a silent regression on either side is caught. Test count: 502 -> 506. Quality gate green.

VGonPa · 2026-05-24T18:41:47Z

Tiny fix round — 3 verify-round nits cleared.

Commits:

a373eff [#34] inline _user_directive_text into _build_user_blocks
4c09fff [#34] add negative tests for MediaPhotoDescribed invariants + media_type warning

Fixes:

Code-simplifier flag — _user_directive_text (single-caller helper) inlined into _build_user_blocks as a literal dict block with the directive comment kept above.
Type-design verify flag — 3 negative tests in test_models.py pin model-level invariants on MediaPhotoDescribed:
- described_at rejects naive datetime
- description_lang rejects values outside SUPPORTED_LANGUAGES
- is_decorative=True rejects non-empty description
Silent-failure verify flag — test_media_type_warns_on_unknown_extension in test_describe.py asserts both the image/jpeg fallback AND the logger.warning on unknown suffixes (via caplog).

Test count: 502 → 506 (+4).

Quality gate: ruff check + ruff format + mypy + pytest + scripts/check.sh all green. Coverage 91%.

Ready for re-verify.

VGonPa and others added 9 commits May 24, 2026 19:17

VGonPa added 2 commits May 24, 2026 20:40

[#34] inline _user_directive_text into _build_user_blocks

a373eff

Single-caller helper merged into its only caller. Removes the last surviving gold-plating flagged by code-simplifier in round 1 of the Phase B review pipeline.

VGonPa merged commit 98eee46 into develop May 24, 2026
1 check passed

VGonPa deleted the ws-34-images-phase-b branch May 24, 2026 18:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[#34] Phase B: describe images with vision LLM and feed into enrich#37

[#34] Phase B: describe images with vision LLM and feed into enrich#37
VGonPa merged 11 commits into
developfrom
ws-34-images-phase-b

VGonPa commented May 24, 2026

Uh oh!

VGonPa commented May 24, 2026

Uh oh!

VGonPa commented May 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

VGonPa commented May 24, 2026

Summary

Acceptance criteria

Deviations from the plan

Test plan

Dependencies

Uh oh!

VGonPa commented May 24, 2026

Round-1 fix sweep — all 8-reviewer findings addressed

Category 1 — 3+ reviewer convergence

Category 2 — 2 reviewer convergence

Category 3 — Over-decomposed orchestrator (code-simplifier)

Category 4 — Silent-failure regressions (silent-failure-hunter)

Category 5 — Type design (type-design-analyzer)

Category 6 — Tests (pr-test-analyzer)

Category 7 — Comments + docs

Defended decisions

Why NOT class inheritance for MediaPhotoDescribed (vs reviewer 1.2 suggestion)

Quality gate

Uh oh!

VGonPa commented May 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Why NOT class inheritance for `MediaPhotoDescribed` (vs reviewer 1.2 suggestion)