[#33] Phase A: download X-post photos and render in notes#35
Conversation
Replaces the flat `Media(type, url)` model with a four-variant tagged union — `MediaPhotoPending` / `MediaPhotoDownloaded` / `MediaPhotoFailed` / `MediaVideoPending` — discriminated by `kind`. Mirrors the `ContentSource` design from #20: illegal states (e.g. "downloaded" without `local_path`) are unrepresentable. A `_normalise_legacy_media` BeforeValidator promotes the pre-Phase-A `{type, url}` shape on read so existing `data/items.json` files load without a manual migration. A `Media(...)` factory function preserves the constructor signature used by `extract/graphql.py` and `archive.py` — those modules are out of scope for #33 and keep working unchanged. `MediaFailureReason` adds the categorised failure bucket (http_4xx / http_5xx / timeout / format_error / unknown_error) the downloader will fill in. Transient vs permanent semantics mirror #19. Tests cover legacy migration (photo + video), pass-through of the new shape, the discriminator rejecting unknown kinds, required-field enforcement on the failure variant, the factory routing, and a full `Item` round-trip through the legacy media shape. Refs #33
`xbrain.media.download_all` walks every photo entry across the store, downloads bytes from pbs.twimg.com with a size cascade (`name=orig` → `name=large` → `name=medium`), validates the bytes with Pillow, and writes the file atomically (tmp + rename) under `data/media/<item-id>/<index>.<ext>`. Failure handling mirrors `xbrain.fetch` (#19/#20): - 4xx → `http_4xx` (permanent) - 5xx → `http_5xx` (transient — auto-retried next run) - `requests.Timeout` → `timeout` (transient) - Connection errors / unknown exceptions → `unknown_error` (transient) - Bytes that Pillow can't decode → `format_error` (permanent) `KeyboardInterrupt` propagates untouched (narrow except clauses). A total failure (all attempts failed) raises `RuntimeError`, mirroring `ApiExecutor.enrich_items` (#24) — the CLI's `_handle_cli_errors` turns this into a clean operator message + exit 1. Idempotency: a re-run is a no-op for `MediaPhotoDownloaded`. A progress callback fires after each photo transition, so the CLI can persist `items.json` between photos — Ctrl-C mid-batch leaves a coherent store. I/O is dependency-injected (`session`, `sleep`, `on_progress`) so tests run offline with a hand-rolled `FakeSession` — no `requests-mock` dependency in the unit path. Tests cover: 200-on-orig success, cascade fallback, idempotency, all five failure buckets, `--force`, `--limit`, `--items` filter, throttling, progress callback, video-pending no-op, Ctrl-C propagation, total-failure raise, partial-failure no-raise, atomic write (no `.part` residue), summary-line emission silence vs. full counters. Adds Pillow as a runtime dep (`uv add pillow`). Refs #33
The `media` command wires the Phase A downloader into the CLI: 1. Auto-snapshots `data/` first via `_auto_snapshot(cfg, "media")`, reusing the #17 destructive-op recovery boundary. 2. Loads `items.json`, runs `xbrain.media.download_all` with a per-photo `on_progress` callback that persists the store atomically — Ctrl-C mid-run leaves a coherent `items.json`. 3. Persists once more in a finally block so the partial state from a `RuntimeError`-on-total-failure is not lost. 4. Emits the SUMMARY line on stderr (mirror of #24) and an operator-friendly recap on stdout. Flags: `--force` (re-download already-downloaded photos), `--limit N` (cap attempts), `--items <a,b,c>` (filter by id). `Config.media_dir` resolves to `data/media/` — under the existing gitignored `data/*` umbrella, no `.gitignore` change needed. Tests cover the snapshot-before-run invariant (mirrors the existing fetch/vocab/topics auto-snapshot tests), the full happy path through a fake `requests.Session` (no real network), the `--items` filter, and the total-failure → exit-1 contract with the partial-state-persisted invariant verified. Refs #33
`xbrain generate` now handles every media variant introduced by #33: - `MediaPhotoDownloaded` → embedded inline as `![[_media/<id>/<n>.<ext>]]` after the tweet body, before the `## Enlaces` section. Files are mirrored from `data/media/` into `<output_dir>/_media/` at render time so the vault is fully self-contained — no Obsidian config or symlinks needed on the user side. - `MediaPhotoFailed` → one-line ⚠ warning carrying the original URL and the failure reason (translated to user-facing Spanish via `_FAILURE_ES_MEDIA`). - `MediaPhotoPending` → silent. Not an error, just "not yet processed". - `MediaVideoPending` → 🎥 placeholder with the URL. Phase A does not download videos; the URL is the only evidence we surface. `_has_note` extended: an item with only media (no link, no enrichment) is now note-worthy. A photo-only tweet was previously invisible in the wiki — now it gets its own page as soon as `xbrain media` advances the variant. Render order in the Tweet section: tweet text → photos → links → external article content Deviation from the implementation plan: the plan called for embeds pointing at `data/media/<id>/<n>.<ext>` and "verifying in the end-to-end test" that Obsidian could resolve the relative path. That only works when the vault root and `data/` align, which is not the common case. Mirroring the photos into `<output_dir>/_media/` makes the embed resolve unconditionally for ~zero extra disk (the canonical copy stays at `data/media/`; the vault copy is a render-time artifact). Documented in README. Tests cover all four render branches, the media-block-before-Enlaces ordering, multi-photo inline rendering, missing-bytes recovery (logs + keeps embed), and backward-compat when `generate(media_root=None)`. Refs #33
`DiffReport` now carries a `media: MediaDiff` block with per-variant counts on both sides (A/B) plus the four deltas (`delta_downloaded`, `delta_pending`, `delta_failed`, `delta_video_pending`). `DiffSummary` surfaces the two highest-signal counters (`media_delta_downloaded`, `media_delta_failed`) at the top level for quick-glance ops review. The text formatter adds a `MEDIA` block with right-aligned A/B columns and `±N` deltas. The JSON formatter is automatic via pydantic — `media` becomes a fifth top-level key. The CLI tests that asserted the top-level key set have been updated. Negative deltas are legal — `xbrain media --force` can move a previously-downloaded photo back to `failed` if the URL turned 404 in the meantime. `MediaStateCounts` carries the absolute counts unsigned; the deltas carry sign. Refs #33
README: - Add `media` row to the Commands table. - New "Local media storage" section covering on-disk layout (`data/media/<id>/<n>.<ext>`), vault rendering convention (mirrored into `<output_subdir>/_media/` for self-contained embeds), disk budget guidance, throttling, and the transient/permanent failure-retry contract. ARCHITECTURE: - New "media" entry under per-stage detail (state machine, Ctrl-C safety, snapshot trigger). - Artifacts table row for `media/<id>/<n>.<ext>`. - Invariant #10: media variants are mutually exclusive states; state transitions only via `xbrain media`; legacy `{type, url}` records migrate on read. Refs #33
No behaviour change. ruff format normalises trailing commas, line breaks and trivia in the files touched by the Phase A series. Refs #33
Split the 19-line orchestrator into three small helpers: - `_build_session` extracts the new-vs-injected session decision so `download_all` does not carry a one-off branch. - `_eligible_items` is a generator that filters by `items_filter` and skips items without media, so the orchestrator reads as a flat "for each eligible item, process it" loop. - `_DownloadContext` is a frozen dataclass that bundles the inner- loop deps (session, timeouts, throttle, callbacks, force) so `_process_item` takes one context argument instead of seven. `download_all` drops from C(19) → B(7); `_process_item` stays at B(10); no behaviour change, all 412 tests still pass. Refs #33
`xbrain.media` imports `requests` directly for the HTTP session. Pulling it through trafilatura's transitive dep was a deptry DEP003 warning; declaring it explicitly is the right hygiene. No behaviour change — the same `requests-2.34.2` was already installed via trafilatura. Refs #33
… wrap write - Narrow `_decode_image` exception set to Pillow-specific (UnidentifiedImageError, DecompressionBombError, SyntaxError). Parent OSError swallowed real I/O bugs (disk/permission/network). - Wire `logger.warning` per failure in `_record_outcome` — the total-failure RuntimeError now actually has the breadcrumbs it references. - Wrap `_write_bytes` call in `_download_one` with `try/except OSError`, bucketing disk-full / permission / read-only-fs errors as `unknown_error` (transient) so a single bad write doesn't abort the whole batch. - Sweep stale `*.part` files at the entry to `download_all` to clean up SIGKILL/OOM orphans from previous runs. - Document persistence-failure semantics in `_run_media` docstring. - Add `AVISO:` stderr line on `--items` filter that matches no item. - Add `--verbose` flag printing failed (item_id, reason, url) at the end.
- Drop SessionProtocol (defined for one use); use `requests.Session | None` directly in `download_all`. - Drop `_DownloadContext` dataclass, `_process_item` helper, and the mutable-list `remaining=[limit]` "box" trick. Inline the per-item loop into `download_all` with a plain `int | None` counter and `enumerate(item.media)` instead of `range(len)`. - Inline `_build_session` (used once), `_eligible_items` (used once), and `_finalise` (one-line, called twice) into `download_all`. - Drop `original` param from `_record_outcome` and `entry` param from `_failed` — both were unused and rationalised by speculative comments. Rewrite the now-truthful docstrings. - Move `_REASON_SEVERITY` to module top alongside `_TRANSIENT_MEDIA_FAILURES` and `_SIZE_CASCADE` (was defined mid-file). Drop the unused `format_error: 5` row — `format_error` is an early-return path, never cascade-compared. - Cap `_format_error` output at 500 chars to keep hostile CDN bodies out of `items.json`. - Add `typing.assert_never` exhaustiveness at the three media-variant isinstance chains (media.py:_is_eligible, generate.py:_render_media_lines, diff.py:_count_media_variants) so Phase B catches missed variants statically. - Trim media.py module docstring from 5 paragraphs to 2.
- `MediaPhotoDownloaded.width / height / bytes_size` now `Field(gt=0)`.
A zero-pixel or zero-byte downloaded photo is semantically illegal;
Pillow validation rules it out at the seam, the type pins it at the
data layer.
- `MediaPhotoDownloaded.local_path` gets a pydantic validator that
rejects absolute paths and `..` components. The downloader never
builds such a path, but `items.json` is on-disk plain JSON the user
can edit — defence in depth at the type boundary is cheap.
- `MediaPhotoFailed.attempts` drops its default and becomes
`Field(ge=1)`. "Failed but never attempted" is nonsense; the
downloader always increments before producing the variant.
- `_MediaPhotoBase` docstring corrected: `type` was never the
discriminator, it's preserved for wire-compat with the legacy
`{type, url}` shape.
- `Media()` factory annotated `# noqa: N802` (the rule is intentional)
with a TODO marker to remove when the extractor/archive migrate to
direct variant construction.
- Tests updated to pass `attempts=1` where they previously relied on
the default.
- Strip `Phase A`, `Phase B`, `(#33)`, `pre-#33`, `pre-Phase-A` markers from new/modified code, docstrings, and tests. Also strip `(#17)`, `(#19)`, `(#20)`, `(#24)` references added in this PR's diff. The PR description carries the issue link; code should describe lasting invariants, not the PR that introduced them. - Rewrite `_TRANSIENT_MEDIA_FAILURES` cross-reference comment: was `_TRANSIENT_FAILURES` (the actual symbol in fetch.py). - Reword `Failed(transient)` "terminal-ish" phrasing in ARCHITECTURE.md `### media` — `Failed(transient)` IS auto-retried; not terminal. - Reconcile snapshot/media docs: snapshots cover only the four JSON artifacts; photo bytes under `data/media/` are NOT snapshotted today. Updated `config.py:media_dir` docstring and ARCHITECTURE.md to make the carve-out explicit; re-downloading is the recovery path. - Trim ARCHITECTURE.md invariant #10 to match the brief style of invariants #1-#9. Retry contract and storage layout moved to the `### media` section above (now with storage layout subsection). - Trim the `media` row in README.md Commands table to one sentence with a link to `Local media storage`. Mark the disk-budget numbers as approximate.
…bose - Split the old `test_download_all_records_http_4xx_failure_when_cascade_exhausted` into two: one asserts the bucket via a partial-success setup; the other asserts the lone-attempt RuntimeError. Mixing both concerns made the original test confusing. - Add `test_download_all_falls_back_through_full_cascade_to_medium`: orig 404 + large 404 + medium 200 — proves the third cascade step works. - Add WebP and JPEG extension tests: `.webp` and `.jpg` URLs end up with the right extension on disk. - Add 3xx (304) handling: bucketed as `unknown_error` (transient). - Add atomic-write rollback test: simulate `Path.replace` raising; the `.part` file is cleaned up and no half-written final file remains. - Add `_sweep_part_orphans` smoke test: a stale `*.part` left by a SIGKILL is removed on next `download_all` entry. - Add partial-failure SUMMARY line test: a run with 0 downloads + N failures is NOT silent — ops needs the count. - Add Pillow truncated-PNG test: `_png_bytes()[:32]` lands in the `format_error` bucket. - Add `test_diff_media_reports_delta_failed`: diff surfaces a `delta_failed` count between snapshots (both on `media.delta_failed` and `summary.media_delta_failed`). - Add `test_media_command_resume_after_interrupt_completes_remaining`: the full Ctrl-C-and-resume integration round. - Add `test_media_command_warns_when_items_filter_matches_nothing`: AVISO line on a no-match `--items` filter. - Add `test_media_command_verbose_lists_failed_urls`: `--verbose` prints `<item_id> <reason> <url>` per failure on stderr. - Tighten `test_media_command_runs_on_empty_store`: assert exit-code 0 AND items.json is byte-identical (no spurious rewrite). - Fix `test_generate_renders_failed_photo_as_warning`: replace `"URL no encontrada" in body or "HTTP 4xx" in body` with the actual rendered string `"URL no encontrada (HTTP 4xx)"`.
- Add `_ = cls` line to `_reject_path_traversal` to silence vulture's unused-variable warning; pydantic's `@field_validator + @classmethod` contract requires the `cls` parameter even when the body does not reference it. - Apply `ruff format` to `tests/test_cli.py`. No semantic changes.
Fix round 1 — all reviewer flags addressed6 commits pushed on top of the original PR. Tests grew from 412 → 425 Category 1 — Silent failure regressions (738390d)
Category 2 — Simplification (e053be5)
Category 3 — Type design hardening (a1a6aa7)
Category 4 — Issue-number rot + lying docs (6399f37)
Category 5 — Tests (c8e3f45)
Category 6 — Operational polish
Final state
|
The placate-vulture stub (`_ = cls`) was the first statement of the function body, so Python parsed the subsequent triple-quoted block as a no-op expression rather than as the function's docstring. Result: inspect.getdoc(MediaPhotoDownloaded._reject_path_traversal) returned None, hiding the rationale for the validator from tooling and IDEs. Move the docstring to the first-statement position (where Python looks for it) and put the vulture stub immediately after.
The new MediaPhotoDownloaded / MediaPhotoFailed validators (path traversal, gt=0 dimensions, ge=1 attempts) were only exercised by the happy paths. Add explicit rejection tests so a future refactor that weakens a constraint trips a red test, not a silent regression at the data boundary. Six new tests, one per rejection path: - absolute local_path - parent-traversal local_path - bytes_size=0 - width=0 - height=0 - attempts=0 on MediaPhotoFailed
|
Verify-round flags resolved (3 commits, fix-round-2): 0165076 — strip stray issue-number references ( e1ec720 — fix path-traversal validator docstring positioning in a892da5 — add 6 validator-rejection tests in
Gates: 431 passed (was 425), ruff clean, mypy clean. Ready for re-verify. |
Closes #33.
Summary
Phase A of the media subsystem:
xbrain mediadownloads X-post photos to disk, the data model tracks per-photo state via a tagged union, andxbrain generaterenders downloaded photos inline in Obsidian notes. Videos remain in avideo_pendingstate for a future phase.xbrain media [--force] [--limit N] [--items a,b,c]walks every photo URL inItem.media, downloads bytes frompbs.twimg.comwith a size cascade (name=orig→large→medium), validates with Pillow, and atomically writes the file underdata/media/<item-id>/<index>.<ext>. Auto-snapshotsdata/first (labelpre-media).Media(type, url):MediaPhotoPending/MediaPhotoDownloaded/MediaPhotoFailed/MediaVideoPending, discriminated bykind. Mirrors theContentSourcedesign from Refactor: FetchResult / ContentSource as tagged unions #20 — illegal states unrepresentable. A_normalise_legacy_mediaBeforeValidator promotes the pre-Phase-A{type, url}shape on read; no manual migration.xbrain.fetch(fetch: retry transiently-failed items without --force #19): transient bucket (http_5xx,timeout,unknown_error) auto-retried on the next run; permanent bucket (http_4xx,format_error) only with--force. Total-failure raisesRuntimeError(mirror of Narrow the broad blocks in ApiExecutor + topic_synth #24).on_progresscallback after every photo transition; CLI persistsitems.jsonatomically between photos.xbrain generatemirrors photos fromdata/media/into<output_dir>/_media/at render time and embeds them inline (![[_media/<id>/<n>.<ext>]]) so the vault is fully self-contained.xbrain diffextension: newMediaDiffblock reports per-variant counts on both sides + the four state-transition deltas (+N/-N).Acceptance criteria
Verified locally (the full-corpus end-to-end run is reserved for the reviewer's machine per the task brief):
xbrain mediaon the corpus completes without unhandled exceptions (covered bytests/test_media.py— happy path, all five failure categories, total-failure raise, partial-failure no-raise)._record_outcomeis exhaustive over the post-transition variants).data/media/<id>/<n>.<ext>).xbrain mediais a no-op for already-downloaded photos (test_download_all_idempotent_for_already_downloaded).--forcere-downloads everything (test_download_all_force_redownloads_already_downloaded).xbrain generateproduces notes with photos visible (test_generate_renders_downloaded_photo_as_obsidian_embed, mirror-into-vault verified).xbrain diffreports new-download counts (test_diff_reports_media_state_counts).items.jsonvalid (test_download_all_propagates_keyboard_interrupt+ the CLI's atomic per-photo persistence).test_media_legacy_photo_shape_migrates_to_pending,test_item_with_legacy_media_loads_into_pending_variant).test_media_creates_pre_snapshot).Deviations from the implementation plan
![[data/media/<id>/<n>.<ext>]]and "verify in the end-to-end test" that Obsidian could resolve the relative path. That only works when the vault root anddata/align, which is not the common case (Víctor's vault is~/Documents/Vault/vault/, the repo is~/devel/xbrain/). Instead,xbrain generatemirrors eachMediaPhotoDownloadedfromdata/media/into<output_dir>/_media/at render time and embeds![[_media/<id>/<n>.<ext>]]. The canonical store stays atdata/media/; the vault copy is a render-time artifact. Documented in README and ARCHITECTURE.Mediaas a factory function: the constraint forbade modifyingextract/graphql.pyandarchive.py"except possibly to import the new media types". Rather than rewriting their call sites,Media(type=..., url=...)is preserved as a factory function inmodels.pythat returnsMediaPhotoPendingorMediaVideoPending. Existing call sites work unchanged. I did add minimal type annotation updates (list[Media]→list[MediaEntry]) in those two files because mypy needs the type alias for the new union; the runtime calls are untouched.requestsdeclared explicitly inpyproject.toml: was previously a transitive dep of trafilatura. Deptry flagged DEP003 for the new direct import inmedia.py; declaring it explicitly is the right hygiene._has_noteextended to include items with media: a photo-only tweet was previously invisible in the wiki. The plan didn't call this out, butxbrain generateneeds to render notes for photo-only items now that photos have a render block. Pinned bytest_generate_media_only_item_gets_a_note.Open questions / known limitations
ThreadPoolExecutor(max_workers=4)parallelism cap. The current implementation is sequential with a 0.5 s throttle between requests. Concurrency was deferred to keep the Ctrl-C-coherency contract simple — adding workers safely requires a lock around the on_progress writes. If the full-corpus run is too slow, a follow-up PR can add bounded parallelism.data/media/:xbrain snapshotstill only copies the JSON artifacts. Asnapshot restoreof a pre-media snapshot will revertitems.jsonbut leave the bytes on disk (the variants still pointing at them, but the timestamp drift is observable). Fine for Phase A: Download X-post photos and render in Obsidian notes #33; a follow-up could decide whetherdata/media/should be part of the snapshot envelope.Test plan
uv run pytest -q— should report 412 passed (358 baseline + 54 new).uv run bash scripts/check.sh— all critical checks must pass; coverage 90%.uv run xbrain media --help— confirm the three flags are present.uv run xbrain mediaon the real corpus (data/items.json, ~1884 items) — reviewer's machine. Verify:MediaPhotoDownloaded.data/media/≤ 350 MB.xbrain generatethen a manual spot-check of 10 random notes — every photo renders inline in Obsidian.xbrain diff <pre-media-snapshot>reports the new-download counts on theMEDIAblock.xbrain mediareports every previously-downloaded photo on the SUMMARY'sskippedcount and creates zero new files.🤖 Generated with Claude Code